Advanced Introduction to C++, Scientific Computing and Machine Learning
Claudius Gros, SS 2024
Institute for Theoretical Physics
Goethe-University Frankfurt a.M.
Boltzmann Machines
statistical physics
- binary neurons $ \qquad y_i\quad\to\quad s_i = -1/1$
: inactive/active
: akin to spins (physics)
- stochastic dynamics: probability to flip a spin
energy functional
$$
E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i
$$
- factor $ \ \ 1/2 \ \ $ for double counting
- magnetic field $\ \ h_i \ \ $ corresponds to $\ \ -b_i \ \ $ (threshold)
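- a minimal C++ sketch of the energy functional (assuming $\ w$, $\ h$, $\ s\ $ are stored in std::vector containers; names are illustrative):

#include <vector>

// energy of a spin configuration s[i] = -1/+1
double energy(const std::vector<std::vector<double>>& w,
              const std::vector<double>& h,
              const std::vector<int>& s)
{
 double E = 0.0;
 for (std::size_t i = 0; i < s.size(); ++i)
  {
   for (std::size_t j = 0; j < s.size(); ++j)
    E -= 0.5*w[i][j]*s[i]*s[j];          // factor 1/2: double counting
   E -= h[i]*s[i];                        // field term
  }
 return E;
}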
Boltzmann factor
- thermal equilibrium, temperature $\ \ T$
- probability to occupy state $\ \ \alpha=\{s_j\}$
$$
\fbox{$\phantom{\big|}\displaystyle
p_\alpha = \frac{\mbox{e}^{-\beta E_\alpha}}{Z}
\phantom{\big|}$}\,,
\qquad\quad
Z= \sum_{\alpha'} \mbox{e}^{-\beta E_{\alpha'}},
\qquad\quad
\beta=\frac{1}{k_B T}\ \to\ 1
$$
- partition function $\ \ Z$
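- a brute-force C++ sketch of the partition function for small systems, reusing the energy() routine above (enumerates all $2^N$ configurations; an illustration, not an efficient method):

#include <cmath>
#include <vector>

// partition function Z = sum_alpha exp(-E_alpha), beta = 1
double partitionFunction(const std::vector<std::vector<double>>& w,
                         const std::vector<double>& h)
{
 const std::size_t N = h.size();
 double Z = 0.0;
 for (unsigned long long conf = 0; conf < (1ULL << N); ++conf)
  {
   std::vector<int> s(N);
   for (std::size_t i = 0; i < N; ++i)            // bit i -> spin s_i = -1/+1
    s[i] = ((conf >> i) & 1ULL) ? 1 : -1;
   Z += std::exp(-energy(w, h, s));               // Boltzmann factor
  }
 return Z;                                         // p_alpha = exp(-E_alpha)/Z
}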
Glauber dynamics
detailed balance
$$
p_\alpha P(\alpha\to\beta) = p_\beta P(\beta\to\alpha),
\qquad\quad
\fbox{$\phantom{\big|}\displaystyle
\frac{P(\alpha\to\beta)}{P(\beta\to\alpha)} =
\frac{p_\beta}{p_\alpha} =
\mbox{e}^{E_\alpha-E_\beta}
\phantom{\big|}$}
$$
- detailed balance is fulfilled by Glauber dynamics
$$
P(\alpha\to\beta)= \frac{\mbox{e}^{-E_\beta}}{
\mbox{e}^{-E_\alpha}+\mbox{e}^{-E_\beta}}
= \frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}}
$$
single spin flips
$$
E_\beta -E_\alpha =
\left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\alpha
-\left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\beta
$$
$\qquad$ for $\ \ \def\arraystretch{1.3} \begin{array}{r|l}
\mbox{in}\ \alpha & \mbox{in}\ \beta\\ \hline
s_k & -s_k \end{array}$
- terms $\ \ w_{ij} \, s_i \, s_j \ \ $ compensate if $\ \ i,j\ne k$
$$
\fbox{$\phantom{\big|}\displaystyle
E_\beta -E_\alpha = \epsilon_ks_k
\phantom{\big|}$}\,,
\qquad\quad
\epsilon_k = 2h_k +\sum_j w_{kj}s_j +\sum_i w_{ik}s_i
$$
- here $\ \ w_{ii}=0$
- often symmetric weights $\ \ w_{ij}=w_{ji}$
- single spin flips often avoided in physics (critical slowing down)
: stochastic series expansion (loop algorithm)
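- a C++ sketch of one Glauber spin-flip update, assuming $\ w_{kk}=0\ $ and the same container layout as above:

#include <cmath>
#include <random>
#include <vector>

// one Glauber update of spin k: flip with probability 1/(1+exp(E_beta-E_alpha)),
// where E_beta - E_alpha = eps_k*s_k (assumes w_kk = 0)
void glauberStep(std::vector<int>& s,
                 const std::vector<std::vector<double>>& w,
                 const std::vector<double>& h,
                 std::size_t k, std::mt19937& gen)
{
 std::uniform_real_distribution<double> uni(0.0, 1.0);
 double eps = 2.0*h[k];
 for (std::size_t j = 0; j < s.size(); ++j)
  eps += (w[k][j] + w[j][k])*s[j];              // local field acting on spin k
 double dE = eps*s[k];                           // E_beta - E_alpha
 if (uni(gen) < 1.0/(1.0 + std::exp(dE)))        // flip probability
  s[k] = -s[k];
}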
training Boltzmann machines
purpose
- $N_d\ $ training data $\ \ \mathbf{s}_\alpha$
- goal: maximize
$$
\prod_{\alpha\in\mathrm{data}} p_\alpha =
\exp\left(\sum_{\alpha\in\mathrm{data}}\log(p_\alpha)\right),
\qquad\quad
\log(p_\alpha)\ : \ \mathrm{log-likelihood}
$$
- equivalence:
the data configurations $\ \mathbf{s}_\alpha \ $
should be attracting fixed points of the Glauber
dynamics
- in terms of the cross-correlation function
$$ \big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
\quad\to\quad
\big\langle s_is_j\big\rangle_{\mathrm{data}}
$$
- learning is unsupervised (no explicit classifier)
Hopfield networks
- similar in spirit, a predecessor of the Boltzmann machine
- weights as covariance matrix of data
$$
w_{ij} = \frac{1}{N_d} \sum_{\alpha\in\mathrm{data}}
\big((\mathbf{s}_\alpha)_i- \langle\mathbf{s}_\alpha\rangle_i \big)
\big((\mathbf{s}_\alpha)_j- \langle\mathbf{s}_\alpha\rangle_j \big)
$$
- storage capacity: $\ \ N_d<0.15\, N$
: number of neurons $ \ \ N$
- what happens with additional data?
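- a C++ sketch of the Hopfield weight construction (container layout and variable names are assumptions):

#include <vector>

// weights as the covariance matrix of N_d data vectors sData[alpha][i] = -1/+1
std::vector<std::vector<double>>
hopfieldWeights(const std::vector<std::vector<int>>& sData)
{
 const std::size_t Nd = sData.size();        // number of patterns
 const std::size_t N  = sData[0].size();     // number of neurons
 std::vector<double> mean(N, 0.0);
 for (const auto& s : sData)
  for (std::size_t i = 0; i < N; ++i)
   mean[i] += s[i]/static_cast<double>(Nd);  // component-wise data average
 std::vector<std::vector<double>> w(N, std::vector<double>(N, 0.0));
 for (const auto& s : sData)
  for (std::size_t i = 0; i < N; ++i)
   for (std::size_t j = 0; j < N; ++j)
    w[i][j] += (s[i]-mean[i])*(s[j]-mean[j])/static_cast<double>(Nd);
 return w;                                    // capacity roughly N_d < 0.15 N
}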
log-likelihood maximization
- Boltzmann factor
$$
p_\alpha = \frac{\mbox{e}^{-E_\alpha}}{Z},
\qquad\quad
Z= \sum_{\alpha} \mbox{e}^{-E_\alpha},
\qquad\quad
E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i
$$
derivative
$$
\frac{\partial \log(p_\alpha)}{\partial w_{ij}} =
\frac{\partial}{\partial w_{ij}}\Big(-E_\alpha-\log(Z)\Big)
= s_is_j -\frac{1}{Z}\sum_{\alpha'} (s_i s_j)_{\alpha'}\, \mbox{e}^{-E_{\alpha'}}
$$
and hence
$$
\frac{\partial \log(p_\alpha)}{\partial w_{ij}} =
s_is_j -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
$$
- average over training data
: number of data points $ \ \ N_d$
$$
\frac{1}{N_d}
\sum_{\beta\in\mathrm{data}} \frac{\partial \log(p_\beta)}{\partial w_{ij}}
= \big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
$$
updating the weight matrix
$$
\fbox{$\phantom{\big|}\displaystyle
\tau_w \frac{d}{dt}w_{ij} =
\big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
\phantom{\big|}$}
$$
- learning rule is
- local (biologically plausible)
- greedy (local and not global optimization)
- equivalently for the magnetic field $\ \ h_i$
$$
\tau_b \frac{d}{dt}h_{i} =
\big\langle s_i\big\rangle_{\mathrm{data}} -\big\langle s_i \big\rangle_{\mathrm{thermal}}
$$
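- a C++ sketch of one (Euler) update step for the weights; the correlation matrices $\ \langle s_is_j\rangle_{\mathrm{data}}\ $ and $\ \langle s_i s_j\rangle_{\mathrm{thermal}}\ $ are assumed to have been estimated beforehand (from the data and from Glauber sampling):

#include <vector>

// one Euler step of the learning rule, eta = dt/tau_w
void updateWeights(std::vector<std::vector<double>>& w,
                   const std::vector<std::vector<double>>& corrData,
                   const std::vector<std::vector<double>>& corrThermal,
                   double eta)
{
 for (std::size_t i = 0; i < w.size(); ++i)
  for (std::size_t j = 0; j < w[i].size(); ++j)
   w[i][j] += eta*(corrData[i][j] - corrThermal[i][j]);   // local, greedy
}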
Kullback-Leibler divergence
- two distribution functions
$$
p_\alpha,\,q_\alpha\ge 0,
\qquad\quad
\sum_\alpha p_\alpha = 1= \sum_\alpha q_\alpha
$$
- positivity $\ \ \to$
the standard scalar product does not provide a suitable metric
similarity measure
$$
\fbox{$\phantom{\big|}\displaystyle
K(p;q) = \sum_\alpha p_\alpha\log\left(\frac{p_\alpha}{q_\alpha}\right)
\phantom{\big|}$}\,,
\qquad\quad
K(p;q) \ge 0
\qquad\quad\mbox{(strict)}
$$
- non-negative because $\ \log \ $ is concave (Jensen's inequality)
- vanishes only if $\ \ p_\alpha=q_\alpha, \qquad \forall\alpha$
- asymmetric
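- a short C++ sketch of the Kullback-Leibler divergence for two normalized distributions stored as vectors:

#include <cmath>
#include <vector>

// K(p;q) = sum_a p_a log(p_a/q_a) >= 0; terms with p_a = 0 contribute zero
double kullbackLeibler(const std::vector<double>& p,
                       const std::vector<double>& q)
{
 double K = 0.0;
 for (std::size_t a = 0; a < p.size(); ++a)
  if (p[a] > 0.0)
   K += p[a]*std::log(p[a]/q[a]);   // note the asymmetry: K(p;q) != K(q;p)
 return K;
}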
KL divergence for Boltzmann machine
$$
p_\alpha\ / \ q_\alpha \qquad\quad \mbox{(Boltzmann machine / training data)}
$$
- Kullback-Leibler divergence with interchanged arguments
$$
K(q;p) = \sum_\alpha q_\alpha\log\left(\frac{q_\alpha}{p_\alpha}\right)
=
\underbrace{\sum_\alpha q_\alpha\log(q_\alpha)}_{-H[q]}
- \sum_\alpha q_\alpha\log(p_\alpha)
$$
- entropy $ \ H[q] \ $ of data independent of $ \ \ w_{ij}$
- minimizing $ \ \ K(q;p)$
corresponds to maximizing
$$
\sum_\alpha q_\alpha\log(p_\alpha) = \frac{1}{N_d}\sum_{\beta\in\mathrm{data}} \log(p_\beta)
\qquad\quad\mbox{(log-likelihood of data)}
$$
with respect to the synaptic weights $\ \ w_{ij}$
- such that $\ \ \big\langle s_i s_j \big\rangle_{\mathrm{thermal}} \to
\big\langle s_is_j\big\rangle_{\mathrm{data}} $
Boltzmann machines encode data statistics
restricted Boltzmann machines (RBM)
- training of the standard Boltzmann machine does not converge
for complex problems
statistical independence
- separate into hidden/visible units
- no intra-layer connections (bipartite graph)
- all units of one layer are statistically
independent if the other layer is frozen
- the probability distribution factorizes for
statistically independent processes
aim
- fixed points of the restricted Boltzmann machine
- visible units $\ \ \to \ \ $ data
- hidden units $\ \ \to \ \ $ contextual information
- e.g. movie ratings
- visible units: ratings for individual movies (data)
- hidden units: classify movies (romance, drama, science-fiction, ..)
joint Boltzmann distribution
- variables $\ \ \mathbf{v}, \mathbf{h}$
- magnetic fields $\ \ \mathbf{b}, \mathbf{c}$
- energy
$\qquad\quad
E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_iw_{ij}h_j - \sum_i b_iv_i - \sum_j c_j h_j
$
- no factor 1/2 (no double counting)
joint PDF
$\qquad\quad
p(\mathbf{v}, \mathbf{h}) =
\frac{1}{Z} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\},
\quad\qquad
Z = \sum_{\mathbf{v},\mathbf{h}} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\}
$
- the joint PDF $p(\mathbf{v}, \mathbf{h})\ $ is the probability
to find the hidden system in
state $ \ \mathbf{h} \ $
and, at the same time, the visible system in state $ \ \mathbf{v} \ $
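- a C++ sketch of the RBM energy for binary units $\ v_i, h_j \in \{0,1\}\ $ (container layout is an assumption):

#include <vector>

// E(v,h) = -sum_ij v_i w_ij h_j - sum_i b_i v_i - sum_j c_j h_j
// no factor 1/2: bipartite graph, no double counting
double rbmEnergy(const std::vector<int>& v, const std::vector<int>& h,
                 const std::vector<std::vector<double>>& w,
                 const std::vector<double>& b, const std::vector<double>& c)
{
 double E = 0.0;
 for (std::size_t i = 0; i < v.size(); ++i)
  {
   for (std::size_t j = 0; j < h.size(); ++j)
    E -= v[i]*w[i][j]*h[j];                     // coupling between the layers
   E -= b[i]*v[i];                              // visible field
  }
 for (std::size_t j = 0; j < h.size(); ++j)
  E -= c[j]*h[j];                               // hidden field
 return E;
}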
statistical independence
- hidden units are statistically independent
for clamped visible units (and vice versa)
- total energy
$\qquad\quad
E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_iw_{ij}h_j - \sum_i b_iv_i - \sum_j c_j h_j
$
- energy difference between $\ \ h_j=1\ \ $ and $\ \ h_j=0$
$\qquad\quad
E(\mathbf{v},1)- E(\mathbf{v},0)
= -\sum_{i} v_iw_{ij}- c_j$
Glauber dynamics
$$
\fbox{$\phantom{\big|}\displaystyle
P(0\to1)=
\frac{1}{1+\mbox{e}^{\,E(\mathbf{v},1)-E(\mathbf{v},0)}}
\phantom{\big|}$}\,,
\qquad\quad
P(\alpha\to\beta)=
\frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}}
$$
- transition rates, sigmoidal $\sigma()$
$$
P(0\to1)=\sigma(\epsilon_j),
\qquad\quad
P(1\to0)=1-\sigma(\epsilon_j),
\qquad\quad
\epsilon_j = \sum_{i} v_iw_{ij} + c_j
$$
$$
p_j(0)P(0\to1)= p_j(1)P(1\to0),
\qquad\quad
\fbox{$\phantom{\big|}\displaystyle
p_j(1) =\sigma(\epsilon_j)
\phantom{\big|}$}
$$
- analytic result
: $\hat{=} \ $ deterministic neuron
: no need for (numerical) statistical sampling
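- a C++ sketch of the parallel hidden-layer update $\ p(h_j=1|\mathbf{v})=\sigma(\epsilon_j)\ $ (variable names are assumptions):

#include <cmath>
#include <random>
#include <vector>

// sample all hidden units in parallel, eps_j = sum_i v_i w_ij + c_j
std::vector<int> sampleHidden(const std::vector<int>& v,
                              const std::vector<std::vector<double>>& w,
                              const std::vector<double>& c, std::mt19937& gen)
{
 std::uniform_real_distribution<double> uni(0.0, 1.0);
 std::vector<int> hNew(c.size(), 0);
 for (std::size_t j = 0; j < c.size(); ++j)
  {
   double eps = c[j];
   for (std::size_t i = 0; i < v.size(); ++i)
    eps += v[i]*w[i][j];                        // input from the visible layer
   double sigma = 1.0/(1.0 + std::exp(-eps));   // sigmoidal activation
   hNew[j] = (uni(gen) < sigma) ? 1 : 0;        // stochastic binary unit
  }
 return hNew;
}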
training RBMs
- maximizing the log-likelihood $\ \ \log\big(p(\mathbf{v})\big)\ \ $ of the visible (data) units
- possible, but somewhat elaborate
contrastive divergence
- $t=0$
: start with a training vector on the visible units
: update all hidden units in parallel
: measure $\ \langle v_i h_j\rangle^0$
- $t=1$
: update visible units in parallel $\ \to \ $ 'reconstruction'
: update hidden units again
: measure $\ \langle v_i h_j\rangle^1$
- stationary if the visible units did not change
$$
\tau_w \frac{d}{dt} w_{ij} = \langle v_i h_j\rangle^0 -
\langle v_i h_j\rangle^1
$$
- repeat for all training vectors
- efficient RBMs use mixtures of methods
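- a C++ sketch of one CD-1 step for a single training vector, building on the sampleHidden() routine above ($\eta\ \hat{=}\ dt/\tau_w$; a simplified illustration, not a full implementation):

#include <cmath>
#include <random>
#include <vector>

// one contrastive-divergence (CD-1) update for training vector v0
void contrastiveDivergence(std::vector<std::vector<double>>& w,
                           const std::vector<int>& v0,
                           const std::vector<double>& b,
                           const std::vector<double>& c,
                           double eta, std::mt19937& gen)
{
 std::uniform_real_distribution<double> uni(0.0, 1.0);
 std::vector<int> h0 = sampleHidden(v0, w, c, gen);      // t=0: hidden update
 std::vector<int> v1(v0.size(), 0);                      // t=1: 'reconstruction'
 for (std::size_t i = 0; i < v0.size(); ++i)
  {
   double eps = b[i];
   for (std::size_t j = 0; j < h0.size(); ++j)
    eps += w[i][j]*h0[j];                                // input from hidden layer
   v1[i] = (uni(gen) < 1.0/(1.0 + std::exp(-eps))) ? 1 : 0;
  }
 std::vector<int> h1 = sampleHidden(v1, w, c, gen);      // t=1: hidden update
 for (std::size_t i = 0; i < v0.size(); ++i)             // <v_i h_j>^0 - <v_i h_j>^1
  for (std::size_t j = 0; j < h0.size(); ++j)
   w[i][j] += eta*(v0[i]*h0[j] - v1[i]*h1[j]);
}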