Advanced Introduction to C++, Scientific Computing and Machine Learning




Claudius Gros, SS 2024

Institute for Theoretical Physics
Goethe-University Frankfurt a.M.

Boltzmann Machines

statistical physics

energy functional

$$ E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i $$

Boltzmann factor

$$ \fbox{$\phantom{\big|}\displaystyle p_\alpha = \frac{\mbox{e}^{-\beta E_\alpha}}{Z} \phantom{\big|}$}\,, \qquad\quad Z= \sum_{\alpha'} \mbox{e}^{-\beta E_{\alpha'}}, \qquad\quad \beta=\frac{1}{k_B T}\ \to\ 1 $$
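as an illustration, a minimal C++ sketch evaluating the energy and the Boltzmann probabilities $p_\alpha$ by brute-force enumeration of all $2^N$ spin states, with $\beta=1$; the couplings and fields are made-up example values

```cpp
// minimal sketch: Boltzmann probabilities by full state enumeration
// (feasible only for small N); couplings and fields are illustrative values
#include <cmath>
#include <cstdio>
#include <vector>

const int N = 3;                                   // number of spins
double w[N][N] = {{ 0.0, 1.0,-1.0},                // symmetric couplings, w_ii = 0
                  { 1.0, 0.0, 0.5},
                  {-1.0, 0.5, 0.0}};
double h[N] = {0.2, -0.1, 0.0};                    // local fields

double energy(const std::vector<int>& s) {         // E = -1/2 sum w_ij s_i s_j - sum h_i s_i
  double E = 0.0;
  for (int i = 0; i < N; i++) {
    E -= h[i]*s[i];
    for (int j = 0; j < N; j++) E -= 0.5*w[i][j]*s[i]*s[j];
  }
  return E;
}

int main() {
  std::vector<double> bFactor;                     // Boltzmann factors e^{-E_alpha}
  double Z = 0.0;                                  // partition function
  for (int alpha = 0; alpha < (1 << N); alpha++) { // all 2^N spin configurations
    std::vector<int> s(N);
    for (int i = 0; i < N; i++) s[i] = ((alpha >> i) & 1) ? 1 : -1;  // Ising spins +-1
    bFactor.push_back(std::exp(-energy(s)));
    Z += bFactor.back();
  }
  for (int alpha = 0; alpha < (1 << N); alpha++)   // p_alpha = e^{-E_alpha}/Z
    printf("p_%d = %6.4f\n", alpha, bFactor[alpha]/Z);
  return 0;
}
```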

Glauber dynamics

detailed balance

$$ p_\alpha P(\alpha\to\beta) = p_\beta P(\beta\to\alpha), \qquad\quad \fbox{$\phantom{\big|}\displaystyle \frac{P(\alpha\to\beta)}{P(\beta\to\alpha)} = \frac{p_\beta}{p_\alpha} = \mbox{e}^{E_\alpha-E_\beta} \phantom{\big|}$} $$
$$ P(\alpha\to\beta)= \frac{\mbox{e}^{-E_\beta}}{ \mbox{e}^{-E_\alpha}+\mbox{e}^{-E_\beta}} = \frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}} $$

single spin flips

$$ E_\beta -E_\alpha = \left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\alpha -\left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\beta $$

$\qquad$ for $\ \ \def\arraystretch{1.3} \begin{array}{r|l} \mbox{in}\ \alpha & \mbox{in}\ \beta\\ \hline s_k & -s_k \end{array}$

$$ \fbox{$\phantom{\big|}\displaystyle E_\beta -E_\alpha = \epsilon_k s_k \phantom{\big|}$}\,, \qquad\quad \epsilon_k = 2h_k +\sum_j w_{kj}s_j +\sum_i w_{ik}s_i $$
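a minimal sketch of one Glauber sweep with single spin flips, assuming Ising spins $s_k=\pm1$, vanishing diagonal couplings and $\beta=1$; function and variable names are illustrative

```cpp
// one Glauber sweep: every spin k is flipped with probability
// 1/(1 + e^{dE}), where dE = E_beta - E_alpha = eps_k s_k
#include <cmath>
#include <random>
#include <vector>

void glauberSweep(std::vector<int>& s,                        // spins +-1
                  const std::vector<std::vector<double>>& w,  // couplings, w_kk = 0
                  const std::vector<double>& h,               // local fields
                  std::mt19937& gen) {
  std::uniform_real_distribution<double> uni(0.0, 1.0);
  const int N = s.size();
  for (int k = 0; k < N; k++) {
    double eps = 2.0*h[k];             // eps_k = 2 h_k + sum_j w_kj s_j + sum_i w_ik s_i
    for (int j = 0; j < N; j++) eps += (w[k][j] + w[j][k])*s[j];
    if (uni(gen) < 1.0/(1.0 + std::exp(eps*s[k])))            // accept flip s_k -> -s_k
      s[k] = -s[k];
  }
}
```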

training Boltzmann machines

purpose

$$ \prod_{\alpha\in\mathrm{data}} p_\alpha = \exp\left(\sum_{\alpha\in\mathrm{data}}\log(p_\alpha)\right), \qquad\quad \log(p_\alpha)\ : \ \mbox{log-likelihood} $$

$$ \big\langle s_i s_j \big\rangle_{\mathrm{thermal}} \quad\to\quad \big\langle s_i s_j\big\rangle_{\mathrm{data}} $$

Hopfield networks

$$ w_{ij} = \frac{1}{N_d} \sum_{\alpha\in\mathrm{data}} \big((\mathbf{s}_\alpha)_i- \langle\mathbf{s}_\alpha\rangle_i \big) \big((\mathbf{s}_\alpha)_j- \langle\mathbf{s}_\alpha\rangle_j \big) $$
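a minimal sketch of the covariance rule; `data` is assumed to hold the $N_d$ training patterns, and the diagonal $w_{ii}$ is set to zero, as usual for Hopfield networks

```cpp
// Hopfield weights from data:
// w_ij = (1/N_d) sum_alpha (s_i - <s_i>)(s_j - <s_j>)
#include <vector>

std::vector<std::vector<double>>
hopfieldWeights(const std::vector<std::vector<double>>& data) {  // N_d patterns of N spins
  const int Nd = data.size(), N = data[0].size();
  std::vector<double> mean(N, 0.0);                // data averages <s_i>
  for (const auto& s : data)
    for (int i = 0; i < N; i++) mean[i] += s[i]/Nd;
  std::vector<std::vector<double>> w(N, std::vector<double>(N, 0.0));
  for (const auto& s : data)                       // accumulate the covariances
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        w[i][j] += (s[i] - mean[i])*(s[j] - mean[j])/Nd;
  for (int i = 0; i < N; i++) w[i][i] = 0.0;       // no self-couplings
  return w;
}
```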

log-likelihood maximization

$$ \frac{1}{N_d} \sum_{\beta\in\mathrm{data}} \frac{\partial \log(p_\beta)}{\partial w_{ij}} = \big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}} $$

updating the weight matrix

$$ \fbox{$\phantom{\big|}\displaystyle \tau_w \frac{d}{dt}w_{ij} = \big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}} \phantom{\big|}$} $$

$$ \tau_b \frac{d}{dt}h_{i} = \big\langle s_i\big\rangle_{\mathrm{data}} -\big\langle s_i \big\rangle_{\mathrm{thermal}} $$
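a minimal sketch of one Euler integration step of the update rule; the correlation matrices $\langle s_is_j\rangle_{\mathrm{data}}$ and $\langle s_is_j\rangle_{\mathrm{thermal}}$ are assumed to be estimated elsewhere, e.g. as data averages and via Glauber sampling

```cpp
// Euler step for tau_w dw_ij/dt = <s_i s_j>_data - <s_i s_j>_thermal;
// the bias update tau_b dh_i/dt = <s_i>_data - <s_i>_thermal is analogous
#include <vector>

void updateWeights(std::vector<std::vector<double>>& w,
                   const std::vector<std::vector<double>>& corrData,     // <s_i s_j>_data
                   const std::vector<std::vector<double>>& corrThermal,  // <s_i s_j>_thermal
                   double dtOverTau) {                                   // dt / tau_w
  const int N = w.size();
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      w[i][j] += dtOverTau*(corrData[i][j] - corrThermal[i][j]);
}
```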

Kullback-Leibler divergence

$$ p_\alpha,\,q_\alpha\ge 0, \qquad\quad \sum_\alpha p_\alpha = 1= \sum_\alpha q_\alpha $$

similarity measure

$$ \fbox{$\phantom{\big|}\displaystyle K(p;q) = \sum_\alpha p_\alpha\log\left(\frac{p_\alpha}{q_\alpha}\right) \phantom{\big|}$}\,, \qquad\quad K(p;q) \ge 0 \qquad\quad\mbox{(zero only for $p=q$)} $$
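a minimal sketch, for two normalized discrete distributions; terms with $p_\alpha=0$ contribute zero, whereas $q_\alpha=0$ with $p_\alpha>0$ lets $K$ diverge

```cpp
// Kullback-Leibler divergence K(p;q) = sum_alpha p_alpha log(p_alpha/q_alpha)
#include <cmath>
#include <vector>

double klDivergence(const std::vector<double>& p, const std::vector<double>& q) {
  double K = 0.0;
  for (std::size_t a = 0; a < p.size(); a++)
    if (p[a] > 0.0) K += p[a]*std::log(p[a]/q[a]);  // 0*log(0) = 0 by convention
  return K;                                         // K >= 0, zero only for p = q
}
```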

KL divergence for Boltzmann machine

$$ p_\alpha\ / \ q_\alpha \qquad\quad \mbox{(Boltzmann machine / training data)} $$

$$ K(q;p) = \sum_\alpha q_\alpha\log\left(\frac{q_\alpha}{p_\alpha}\right) = \underbrace{\sum_\alpha q_\alpha\log(q_\alpha)}_{-H[q]} - \sum_\alpha q_\alpha\log(p_\alpha) $$

the data entropy $H[q]$ is constant; minimizing $K(q;p)$ hence maximizes the log-likelihood $\sum_\alpha q_\alpha\log(p_\alpha)$

Boltzmann machines encode data statistics

restricted Boltzmann machines (RBM)



statistical independence

aim: conditional factorization, $\ p(\mathbf{h}\,|\,\mathbf{v})=\prod_j p(h_j\,|\,\mathbf{v})$

joint Boltzmann distribution



$\qquad\quad E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_iw_{ij}h_j - \sum_i b_iv_i - \sum_j c_j h_j $

joint PDF

$\qquad\quad p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\}, \quad\qquad Z = \sum_{\mathbf{v},\mathbf{h}} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\} $

statistical independence



Glauber dynamics

$$ \fbox{$\phantom{\big|}\displaystyle P(0\to1)= \frac{1}{1+\mbox{e}^{\,E(1,\mathbf{v})-E(0,\mathbf{v})}} \phantom{\big|}$}\,, \qquad\quad P(\alpha\to\beta)= \frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}} $$

$$ P(0\to1)=\sigma(\epsilon_j), \qquad\quad P(1\to0)=1-\sigma(\epsilon_j), \qquad\quad \epsilon_j = \sum_{i} v_iw_{ij} + c_j $$

$$ p_j(0)P(0\to1)= p_j(1)P(1\to0), \qquad\quad \fbox{$\phantom{\big|}\displaystyle p_j(1) =\sigma(\epsilon_j) \phantom{\big|}$} $$
the conditional probability $p_j(1)=\sigma(\epsilon_j)$ is given in closed form: no need for (numerical) statistical sampling
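a minimal sketch, assuming binary $0/1$ units; since $p_j(1)=\sigma(\epsilon_j)$ factorizes over the hidden units, all of them are drawn independently and in parallel (names illustrative)

```cpp
// sample the hidden layer of an RBM given the visible layer,
// p(h_j = 1 | v) = sigma(eps_j), unit by unit, no Markov chain needed
#include <cmath>
#include <random>
#include <vector>

double sigma(double x) { return 1.0/(1.0 + std::exp(-x)); }  // logistic function

std::vector<int> sampleHidden(const std::vector<int>& v,                  // visible units, 0/1
                              const std::vector<std::vector<double>>& w,  // w[i][j] = w_ij
                              const std::vector<double>& c,               // hidden biases
                              std::mt19937& gen) {
  std::uniform_real_distribution<double> uni(0.0, 1.0);
  std::vector<int> h(c.size());
  for (std::size_t j = 0; j < c.size(); j++) {
    double eps = c[j];                                   // eps_j = sum_i v_i w_ij + c_j
    for (std::size_t i = 0; i < v.size(); i++) eps += v[i]*w[i][j];
    h[j] = (uni(gen) < sigma(eps)) ? 1 : 0;
  }
  return h;
}
```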

training RBMs



contrastive divergence

$t=0$: start with a training vector on the visible units
update all hidden units in parallel
measure $\ \langle v_i h_j\rangle^0$
$t=1$: update all visible units in parallel $\ \to\ $ the 'reconstruction'
update all hidden units again
measure $\ \langle v_i h_j\rangle^1$
$$ \tau_w \frac{d}{dt} w_{ij} = \langle v_i h_j\rangle^0 - \langle v_i h_j\rangle^1 $$
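a minimal sketch of one CD-1 update for a single training vector, assuming binary $0/1$ units and a learning rate $\eta\sim dt/\tau_w$; the visible-layer sampler mirrors the hidden-layer one

```cpp
// one CD-1 step: v0 -> h0 -> v1 ('reconstruction') -> h1,
// then dw_ij proportional to <v_i h_j>^0 - <v_i h_j>^1
#include <cmath>
#include <random>
#include <vector>

using Mat = std::vector<std::vector<double>>;

double sigma(double x) { return 1.0/(1.0 + std::exp(-x)); }

std::vector<int> sampleHidden(const std::vector<int>& v, const Mat& w,
                              const std::vector<double>& c, std::mt19937& gen) {
  std::uniform_real_distribution<double> uni(0.0, 1.0);
  std::vector<int> h(c.size());
  for (std::size_t j = 0; j < c.size(); j++) {
    double eps = c[j];                                   // eps_j = sum_i v_i w_ij + c_j
    for (std::size_t i = 0; i < v.size(); i++) eps += v[i]*w[i][j];
    h[j] = (uni(gen) < sigma(eps)) ? 1 : 0;
  }
  return h;
}

std::vector<int> sampleVisible(const std::vector<int>& h, const Mat& w,
                               const std::vector<double>& b, std::mt19937& gen) {
  std::uniform_real_distribution<double> uni(0.0, 1.0);
  std::vector<int> v(b.size());
  for (std::size_t i = 0; i < b.size(); i++) {
    double eps = b[i];                                   // eps_i = sum_j w_ij h_j + b_i
    for (std::size_t j = 0; j < h.size(); j++) eps += w[i][j]*h[j];
    v[i] = (uni(gen) < sigma(eps)) ? 1 : 0;
  }
  return v;
}

void cd1Update(Mat& w, const std::vector<int>& v0,       // v0: training vector
               const std::vector<double>& b, const std::vector<double>& c,
               double eta, std::mt19937& gen) {
  std::vector<int> h0 = sampleHidden(v0, w, c, gen);     // t=0: hidden units in parallel
  std::vector<int> v1 = sampleVisible(h0, w, b, gen);    // t=1: the 'reconstruction'
  std::vector<int> h1 = sampleHidden(v1, w, c, gen);     // hidden units again
  for (std::size_t i = 0; i < v0.size(); i++)
    for (std::size_t j = 0; j < c.size(); j++)
      w[i][j] += eta*(v0[i]*h0[j] - v1[i]*h1[j]);        // <v_i h_j>^0 - <v_i h_j>^1
}
```

in practice the correlations are typically averaged over a mini-batch of training vectors before the weights are changed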