Advanced Introduction to C++, Scientific Computing and Machine Learning
Claudius Gros, SS 2024
Institute for Theoretical Physics
Goethe-University Frankfurt a.M.
Boltzmann Machines
statistical physics
- binary neurons $ \qquad y_i\quad\to\quad s_i = -1/1$
: inactive/active
: akin to spins (physics)
- stochastic dynamics: probability to flip a spin
energy functional
$$
E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i
$$
- factor $ \ \ 1/2 \ \ $ for double counting
- magnetic field $\ \ h_i \ \ $ corresponds to $\ \ -b_i \ \ $ (threshold)
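- a minimal C++ sketch of the energy functional (assuming $\ w$, $\ h$, $\ s\ $ are stored in std::vector containers; names are illustrative):

#include <vector>

// energy of a spin configuration s[i] = -1/+1
double energy(const std::vector<std::vector<double>>& w,
              const std::vector<double>& h,
              const std::vector<int>& s)
{
 double E = 0.0;
 for (std::size_t i = 0; i < s.size(); ++i)
  {
   for (std::size_t j = 0; j < s.size(); ++j)
    E -= 0.5*w[i][j]*s[i]*s[j];          // factor 1/2: double counting
   E -= h[i]*s[i];                        // field term
  }
 return E;
}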
Boltzmann factor
- thermal equilibrium, temperature $\ \ T$
- probability to occupy state $\ \ \alpha=\{s_j\}$
$$
\fbox{$\phantom{\big|}\displaystyle
p_\alpha = \frac{\mbox{e}^{-\beta E_\alpha}}{Z}
\phantom{\big|}$}\,,
\qquad\quad
Z= \sum_{\alpha'} \mbox{e}^{-\beta E_{\alpha'}},
\qquad\quad
\beta=\frac{1}{k_B T}\ \to\ 1
$$
- partition function $\ \ Z$
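- a brute-force C++ sketch of the partition function for small systems, reusing the energy() routine above (enumerates all $2^N$ configurations; an illustration, not an efficient method):

#include <cmath>
#include <vector>

// partition function Z = sum_alpha exp(-E_alpha), beta = 1
double partitionFunction(const std::vector<std::vector<double>>& w,
                         const std::vector<double>& h)
{
 const std::size_t N = h.size();
 double Z = 0.0;
 for (unsigned long long conf = 0; conf < (1ULL << N); ++conf)
  {
   std::vector<int> s(N);
   for (std::size_t i = 0; i < N; ++i)            // bit i -> spin s_i = -1/+1
    s[i] = ((conf >> i) & 1ULL) ? 1 : -1;
   Z += std::exp(-energy(w, h, s));               // Boltzmann factor
  }
 return Z;                                         // p_alpha = exp(-E_alpha)/Z
}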
Glauber dynamics
detailed balance
$$
p_\alpha P(\alpha\to\beta) = p_\beta P(\beta\to\alpha),
\qquad\quad
\fbox{$\phantom{\big|}\displaystyle
\frac{P(\alpha\to\beta)}{P(\beta\to\alpha)} =
\frac{p_\beta}{p_\alpha} =
\mbox{e}^{E_\alpha-E_\beta}
\phantom{\big|}$}
$$
- detailed balance is fulfilled by Glauber dynamics
$$
P(\alpha\to\beta)= \frac{\mbox{e}^{-E_\beta}}{
\mbox{e}^{-E_\alpha}+\mbox{e}^{-E_\beta}}
= \frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}}
$$
single spin flips
$$
E_\beta -E_\alpha =
\left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\alpha
-\left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\beta
$$
$\qquad$ for $\ \ \def\arraystretch{1.3} \begin{array}{r|l}
\mbox{in}\ \alpha & \mbox{in}\ \beta\\ \hline
s_k & -s_k \end{array}$
- terms $\ \ w_{ij} \, s_i \, s_j \ \ $ compensate if $\ \ i,j\ne k$
$$
\fbox{$\phantom{\big|}\displaystyle
E_\beta -E_\alpha = \epsilon_ks_k
\phantom{\big|}$}\,,
\qquad\quad
\epsilon_k = 2h_k +\sum_j w_{kj}s_j +\sum_i w_{ik}s_i
$$
- here $\ \ w_{ii}=0$
- often symmetric weights $\ \ w_{ij}=w_{ji}$
- single spin flips often avoided in physics (critical slowing down)
: stochastic series expansion (loop algorithm)
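- a C++ sketch of one Glauber spin-flip update, assuming $\ w_{kk}=0\ $ and the same container layout as above:

#include <cmath>
#include <random>
#include <vector>

// one Glauber update of spin k: flip with probability 1/(1+exp(E_beta-E_alpha)),
// where E_beta - E_alpha = eps_k*s_k (assumes w_kk = 0)
void glauberStep(std::vector<int>& s,
                 const std::vector<std::vector<double>>& w,
                 const std::vector<double>& h,
                 std::size_t k, std::mt19937& gen)
{
 std::uniform_real_distribution<double> uni(0.0, 1.0);
 double eps = 2.0*h[k];
 for (std::size_t j = 0; j < s.size(); ++j)
  eps += (w[k][j] + w[j][k])*s[j];              // local field acting on spin k
 double dE = eps*s[k];                           // E_beta - E_alpha
 if (uni(gen) < 1.0/(1.0 + std::exp(dE)))        // flip probability
  s[k] = -s[k];
}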
training Boltzmann machines
purpose
- $N_d\ $ training data $\ \ \mathbf{s}_\alpha$
- goal: maximize
$$
\prod_{\alpha\in\mathrm{data}} p_\alpha =
\exp\left(\sum_{\alpha\in\mathrm{data}}\log(p_\alpha)\right),
\qquad\quad
\log(p_\alpha)\ : \ \mathrm{log-likelihood}
$$
- equivalence:
the data configurations $\ \mathbf{s}_\alpha \ $
should be attracting fixed points of the Glauber
dynamics
- in terms of the cross-correlation function
$$ \big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
\quad\to\quad
\big\langle s_is_j\big\rangle_{\mathrm{data}}
$$
- learning is unsupervised (no explicit classifier)
Hopfield networks
- similar in spirit, a predecessor of the Boltzmann machine
- weights as covariance matrix of data
$$
w_{ij} = \frac{1}{N_d} \sum_{\alpha\in\mathrm{data}}
\big((\mathbf{s}_\alpha)_i- \langle\mathbf{s}_\alpha\rangle_i \big)
\big((\mathbf{s}_\alpha)_j- \langle\mathbf{s}_\alpha\rangle_j \big)
$$
- storage capacity: $\ \ N_d<0.15\, N$
: number of neurons $ \ \ N$
- what happens with additional data?
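- a C++ sketch of the Hopfield weight construction (container layout and variable names are assumptions):

#include <vector>

// weights as the covariance matrix of N_d data vectors sData[alpha][i] = -1/+1
std::vector<std::vector<double>>
hopfieldWeights(const std::vector<std::vector<int>>& sData)
{
 const std::size_t Nd = sData.size();        // number of patterns
 const std::size_t N  = sData[0].size();     // number of neurons
 std::vector<double> mean(N, 0.0);
 for (const auto& s : sData)
  for (std::size_t i = 0; i < N; ++i)
   mean[i] += s[i]/static_cast<double>(Nd);  // component-wise data average
 std::vector<std::vector<double>> w(N, std::vector<double>(N, 0.0));
 for (const auto& s : sData)
  for (std::size_t i = 0; i < N; ++i)
   for (std::size_t j = 0; j < N; ++j)
    w[i][j] += (s[i]-mean[i])*(s[j]-mean[j])/static_cast<double>(Nd);
 return w;                                    // capacity roughly N_d < 0.15 N
}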
log-likelihood maximization
- Boltzmann factor
$$
p_\alpha = \frac{\mbox{e}^{-E_\alpha}}{Z},
\qquad\quad
Z= \sum_{\alpha} \mbox{e}^{-E_\alpha},
\qquad\quad
E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i
$$
derivative
$$
\frac{\partial \log(p_\alpha)}{\partial w_{ij}} =
\frac{\partial}{\partial w_{ij}}\Big(-E_\alpha-\log(Z)\Big)
= s_is_j -\frac{1}{Z}\sum_{\alpha'} (s_i s_j)_{\alpha'}\, \mbox{e}^{-E_{\alpha'}}
$$
and hence
$$
\frac{\partial \log(p_\alpha)}{\partial w_{ij}} =
s_is_j -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
$$
- average over training data
: number of data points $ \ \ N_d$
$$
\frac{1}{N_d}
\sum_{\beta\in\mathrm{data}} \frac{\partial \log(p_\beta)}{\partial w_{ij}}
= \big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
$$
updating the weight matrix
$$
\fbox{$\phantom{\big|}\displaystyle
\tau_w \frac{d}{dt}w_{ij} =
\big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
\phantom{\big|}$}
$$
- learning rule is
- local (biologically plausible)
- greedy (local and not global optimization)
- equivalently for the magnetic field $\ \ h_i$
$$
\tau_b \frac{d}{dt}h_{i} =
\big\langle s_i\big\rangle_{\mathrm{data}} -\big\langle s_i \big\rangle_{\mathrm{thermal}}
$$
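- a C++ sketch of one (Euler) update step for the weights; the correlation matrices $\ \langle s_is_j\rangle_{\mathrm{data}}\ $ and $\ \langle s_i s_j\rangle_{\mathrm{thermal}}\ $ are assumed to have been estimated beforehand (from the data and from Glauber sampling):

#include <vector>

// one Euler step of the learning rule, eta = dt/tau_w
void updateWeights(std::vector<std::vector<double>>& w,
                   const std::vector<std::vector<double>>& corrData,
                   const std::vector<std::vector<double>>& corrThermal,
                   double eta)
{
 for (std::size_t i = 0; i < w.size(); ++i)
  for (std::size_t j = 0; j < w[i].size(); ++j)
   w[i][j] += eta*(corrData[i][j] - corrThermal[i][j]);   // local, greedy
}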
Kullback-Leibler divergence
- two distribution functions
$$
p_\alpha,\,q_\alpha\ge 0,
\qquad\quad
\sum_\alpha p_\alpha = 1= \sum_\alpha q_\alpha
$$
- positivity $\ \ \to$
the standard scalar product does not provide a suitable metric
similarity measure
$$
\fbox{$\phantom{\big|}\displaystyle
K(p;q) = \sum_\alpha p_\alpha\log\left(\frac{p_\alpha}{q_\alpha}\right)
\phantom{\big|}$}\,,
\qquad\quad
K(p;q) \ge 0
\qquad\quad\mbox{(strict)}
$$
- non-negative because $\ \log \ $ is concave (Jensen's inequality)
- vanishes only if $\ \ p_\alpha=q_\alpha, \qquad \forall\alpha$
- asymmetric
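- a short C++ sketch of the Kullback-Leibler divergence for two normalized distributions stored as vectors:

#include <cmath>
#include <vector>

// K(p;q) = sum_a p_a log(p_a/q_a) >= 0; terms with p_a = 0 contribute zero
double kullbackLeibler(const std::vector<double>& p,
                       const std::vector<double>& q)
{
 double K = 0.0;
 for (std::size_t a = 0; a < p.size(); ++a)
  if (p[a] > 0.0)
   K += p[a]*std::log(p[a]/q[a]);   // note the asymmetry: K(p;q) != K(q;p)
 return K;
}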
KL divergence for Boltzmann machine
$$
p_\alpha\ / \ q_\alpha \qquad\quad \mbox{(Boltzmann machine / training data)}
$$
- Kullback-Leibler divergence with interchanged arguments
$$
K(q;p) = \sum_\alpha q_\alpha\log\left(\frac{q_\alpha}{p_\alpha}\right)
=
\underbrace{\sum_\alpha q_\alpha\log(q_\alpha)}_{-H[q]}
- \sum_\alpha q_\alpha\log(p_\alpha)
$$
- entropy $ \ H[q] \ $ of data independent of $ \ \ w_{ij}$
- minimizing $ \ \ K(q;p)$
corresponds to maximizing
$$
\sum_\alpha q_\alpha\log(p_\alpha) = \frac{1}{N_d}\sum_{\beta\in\mathrm{data}} \log(p_\beta)
\qquad\quad\mbox{(log-likelihood of data)}
$$
with respect to the synaptic weights $\ \ w_{ij}$
- such that $\ \ \big\langle s_i s_j \big\rangle_{\mathrm{thermal}} \to
\big\langle s_is_j\big\rangle_{\mathrm{data}} $
Boltzmann machines encode data statistics
restricted Boltzmann machines (RBM)
- training of the standard Boltzmann machine does not converge
for complex problems
statistical independence
- separate into hidden/visible units
- no intra-layer connections (bipartite graph)
- all units of one layer are statistically
independent if the other layer is frozen
- the probability distribution factorizes for
statistically independent processes
aim
- fixed points of the restricted Boltzmann machine
- visible units $\ \ \to \ \ $ data
- hidden units $\ \ \to \ \ $ contextual information
- e.g. movie ratings
- visible units: ratings for individual movies (data)
- hidden units: classify movies (romance, drama, science-fiction, ..)
joint Boltzmann distribution
- variables $\ \ \mathbf{v}, \mathbf{h}$
- magnetic fields $\ \ \mathbf{b}, \mathbf{c}$
- energy
$\qquad\quad
E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_iw_{ij}h_j - \sum_i b_iv_i - \sum_j c_j h_j
$
- no factor 1/2 (no double counting)
joint PDF
$\qquad\quad
p(\mathbf{v}, \mathbf{h}) =
\frac{1}{Z} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\},
\quad\qquad
Z = \sum_{\mathbf{v},\mathbf{h}} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\}
$
- the joint PDF $p(\mathbf{v}, \mathbf{h})\ $ is the probability
to find the hidden system in
state $ \ \mathbf{h} \ $
and, at the same time, the visible system in state $ \ \mathbf{v} \ $
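- a C++ sketch of the RBM energy for binary units $\ v_i, h_j \in \{0,1\}\ $ (container layout is an assumption):

#include <vector>

// E(v,h) = -sum_ij v_i w_ij h_j - sum_i b_i v_i - sum_j c_j h_j
// no factor 1/2: bipartite graph, no double counting
double rbmEnergy(const std::vector<int>& v, const std::vector<int>& h,
                 const std::vector<std::vector<double>>& w,
                 const std::vector<double>& b, const std::vector<double>& c)
{
 double E = 0.0;
 for (std::size_t i = 0; i < v.size(); ++i)
  {
   for (std::size_t j = 0; j < h.size(); ++j)
    E -= v[i]*w[i][j]*h[j];                     // coupling between the layers
   E -= b[i]*v[i];                              // visible field
  }
 for (std::size_t j = 0; j < h.size(); ++j)
  E -= c[j]*h[j];                               // hidden field
 return E;
}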
statistical independence
- hidden units are statistically independent
for clamped visible units (and vice versa)
- total energy
$\qquad\quad
E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_iw_{ij}h_j - \sum_i b_iv_i - \sum_j c_j h_j
$
- energy difference between $\ \ h_j=1\ \ $ and $\ \ h_j=0$
$\qquad\quad
E(\mathbf{v},1)- E(\mathbf{v},0)
= -\sum_{i} v_iw_{ij}- c_j$
Glauber dynamics
$$
\fbox{$\phantom{\big|}\displaystyle
P(0\to1)=
\frac{1}{1+\mbox{e}^{\,E(\mathbf{v},1)-E(\mathbf{v},0)}}
\phantom{\big|}$}\,,
\qquad\quad
P(\alpha\to\beta)=
\frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}}
$$
- transition rates, sigmoidal $\sigma()$
$$
P(0\to1)=\sigma(\epsilon_j),
\qquad\quad
P(1\to0)=1-\sigma(\epsilon_j),
\qquad\quad
\epsilon_j = \sum_{i} v_iw_{ij} + c_j
$$
$$
p_j(0)P(0\to1)= p_j(1)P(1\to0),
\qquad\quad
\fbox{$\phantom{\big|}\displaystyle
p_j(1) =\sigma(\epsilon_j)
\phantom{\big|}$}
$$
- analytic result
: $\hat{=} \ $ deterministic neuron
: no need for (numerical) statistical sampling
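- a C++ sketch of the parallel hidden-layer update $\ p(h_j=1|\mathbf{v})=\sigma(\epsilon_j)\ $ (variable names are assumptions):

#include <cmath>
#include <random>
#include <vector>

// sample all hidden units in parallel, eps_j = sum_i v_i w_ij + c_j
std::vector<int> sampleHidden(const std::vector<int>& v,
                              const std::vector<std::vector<double>>& w,
                              const std::vector<double>& c, std::mt19937& gen)
{
 std::uniform_real_distribution<double> uni(0.0, 1.0);
 std::vector<int> hNew(c.size(), 0);
 for (std::size_t j = 0; j < c.size(); ++j)
  {
   double eps = c[j];
   for (std::size_t i = 0; i < v.size(); ++i)
    eps += v[i]*w[i][j];                        // input from the visible layer
   double sigma = 1.0/(1.0 + std::exp(-eps));   // sigmoidal activation
   hNew[j] = (uni(gen) < sigma) ? 1 : 0;        // stochastic binary unit
  }
 return hNew;
}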
training RBMs
- maximizing the log-likelihood $\ \ \log\big(p(\mathbf{v})\big)\ \ $ of the visible (data) units
- possible, but somewhat elaborate
contrastive divergence
- $t=0$
: start with a training vector on the visible units
: update all hidden units in parallel
: measure $\ \langle v_i h_j\rangle^0$
- $t=1$
: update visible units in parallel $\ \to \ $ 'reconstruction'
: update hidden units again
: measure $\ \langle v_i h_j\rangle^1$
- stationary if the visible units did not change
$$
\tau_w \frac{d}{dt} w_{ij} = \langle v_i h_j\rangle^0 -
\langle v_i h_j\rangle^1
$$
- repeat for all training vectors
- efficient RBMs use mixtures of methods
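- a C++ sketch of one CD-1 step for a single training vector, building on the sampleHidden() routine above ($\eta\ \hat{=}\ dt/\tau_w$; a simplified illustration, not a full implementation):

#include <cmath>
#include <random>
#include <vector>

// one contrastive-divergence (CD-1) update for training vector v0
void contrastiveDivergence(std::vector<std::vector<double>>& w,
                           const std::vector<int>& v0,
                           const std::vector<double>& b,
                           const std::vector<double>& c,
                           double eta, std::mt19937& gen)
{
 std::uniform_real_distribution<double> uni(0.0, 1.0);
 std::vector<int> h0 = sampleHidden(v0, w, c, gen);      // t=0: hidden update
 std::vector<int> v1(v0.size(), 0);                      // t=1: 'reconstruction'
 for (std::size_t i = 0; i < v0.size(); ++i)
  {
   double eps = b[i];
   for (std::size_t j = 0; j < h0.size(); ++j)
    eps += w[i][j]*h0[j];                                // input from hidden layer
   v1[i] = (uni(gen) < 1.0/(1.0 + std::exp(-eps))) ? 1 : 0;
  }
 std::vector<int> h1 = sampleHidden(v1, w, c, gen);      // t=1: hidden update
 for (std::size_t i = 0; i < v0.size(); ++i)             // <v_i h_j>^0 - <v_i h_j>^1
  for (std::size_t j = 0; j < h0.size(); ++j)
   w[i][j] += eta*(v0[i]*h0[j] - v1[i]*h1[j]);
}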