Advanced Introduction to C++, Scientific Computing and Machine Learning
 
 
Claudius Gros, SS 2024
Institute for Theoretical Physics
Goethe-University Frankfurt a.M.
Boltzmann Machines
statistical physics
-  binary neurons $ \qquad y_i\quad\to\quad s_i = -1/1$ 
     : inactive/active 
     : akin to spins (physics) 
      
-  spins flip stochastically (probabilistic update rule)
     
 
energy functional
$$
E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i 
$$
-  factor $ \ \ 1/2 \ \ $ compensates for double counting 
 
-  magnetic field $\ \ h_i \ \ $ corresponds to $\ \ -b_i \ \ $ (threshold) 
 
Boltzmann factor
-  thermal equilibrium, temperature $\ \ T$ 
 
-  probability to occupy state $\ \ \alpha=\{s_j\}$
     
 
$$
\fbox{$\phantom{\big|}\displaystyle   
p_\alpha = \frac{\mbox{e}^{-\beta E_\alpha}}{Z}
\phantom{\big|}$}\,,
\qquad\quad
Z= \sum_{\alpha'} \mbox{e}^{-\beta E_{\alpha'}},
\qquad\quad
\beta=\frac{1}{k_B T}\ \to\ 1
$$
-  partition function $\ \ Z$
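
For a handful of spins the Boltzmann distribution can be evaluated by brute force. A minimal C++ sketch (illustrative code, $\beta=1$) that enumerates all $2^N$ configurations, evaluates the energy functional and normalizes with the partition function $Z$:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// energy functional: E = -1/2 sum_ij w_ij s_i s_j - sum_i h_i s_i
double energy(const std::vector<std::vector<double>>& w,
              const std::vector<double>& h, const std::vector<int>& s) {
    double E = 0.0;
    for (size_t i = 0; i < s.size(); ++i) {
        for (size_t j = 0; j < s.size(); ++j)
            E -= 0.5 * w[i][j] * s[i] * s[j];
        E -= h[i] * s[i];
    }
    return E;
}

int main() {
    const int N = 3;                                    // number of spins
    std::vector<std::vector<double>> w(N, std::vector<double>(N, 0.5));
    for (int i = 0; i < N; ++i) w[i][i] = 0.0;          // no self-coupling
    std::vector<double> h(N, 0.1);                      // 'magnetic fields'

    std::vector<double> boltz(1 << N);                  // Boltzmann factors, beta = 1
    double Z = 0.0;                                     // partition function
    for (int conf = 0; conf < (1 << N); ++conf) {       // all 2^N spin configurations
        std::vector<int> s(N);
        for (int i = 0; i < N; ++i) s[i] = ((conf >> i) & 1) ? 1 : -1;
        boltz[conf] = std::exp(-energy(w, h, s));
        Z += boltz[conf];
    }
    for (int conf = 0; conf < (1 << N); ++conf)         // normalized probabilities p_alpha
        std::printf("state %2d : p = %f\n", conf, boltz[conf] / Z);
    return 0;
}
```

The $2^N$ enumeration is only feasible for small $N$; for larger networks one samples from $p_\alpha$ instead, e.g. with the Glauber dynamics discussed next.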
     
 
 
Glauber dynamics
detailed balance
$$
p_\alpha P(\alpha\to\beta) = p_\beta P(\beta\to\alpha),
\qquad\quad
\fbox{$\phantom{\big|}\displaystyle   
\frac{P(\alpha\to\beta)}{P(\beta\to\alpha)} =
\frac{p_\beta}{p_\alpha} =
\mbox{e}^{E_\alpha-E_\beta}
\phantom{\big|}$}
$$
-  detailed balance is fulfilled by Glauber dynamics
     
 
$$
P(\alpha\to\beta)= \frac{\mbox{e}^{-E_\beta}}{
\mbox{e}^{-E_\alpha}+\mbox{e}^{-E_\beta}}
= \frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}}
$$
 single spin flips 
$$
E_\beta -E_\alpha =
 \left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\alpha
-\left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\beta
$$
$\qquad$ for $\ \ \def\arraystretch{1.3} \begin{array}{r|l}
 \mbox{in}\ \alpha & \mbox{in}\ \beta\\ \hline
   s_k         & -s_k \end{array}$
-  terms $\ \ w_{ij} \, s_i \, s_j \ \ $ compensate if $\ \ i,j\ne k$
     
 
$$
\fbox{$\phantom{\big|}\displaystyle   
E_\beta -E_\alpha = \epsilon_ks_k
\phantom{\big|}$}\,,
\qquad\quad
\epsilon_k = 2h_k +\sum_j w_{kj}s_j +\sum_i w_{ik}s_i
$$
-  here $\ \ w_{ii}=0$ 
 
-  often symmetric weights $\ \ w_{ij}=w_{ji}$ 
 
-  single spin flips are not used in physics simulations (critical slowing down) 
     : alternative: stochastic series expansion (loop algorithm)
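
For the illustrative purposes here a plain single-flip sweep suffices. A minimal C++ sketch of one Glauber sweep based on the boxed result $E_\beta-E_\alpha=\epsilon_k s_k$ (identifiers and parameters are illustrative, not from the lecture code):

```cpp
#include <cmath>
#include <random>
#include <vector>

// one Glauber sweep; uses E_beta - E_alpha = eps_k * s_k with w_kk = 0
void glauberSweep(const std::vector<std::vector<double>>& w,
                  const std::vector<double>& h, std::vector<int>& s,
                  std::mt19937& gen) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const int N = static_cast<int>(s.size());
    for (int k = 0; k < N; ++k) {
        double eps = 2.0 * h[k];                        // eps_k = 2 h_k + sum_j (w_kj + w_jk) s_j
        for (int j = 0; j < N; ++j)
            eps += (w[k][j] + w[j][k]) * s[j];
        double dE = eps * s[k];                         // energy change for flipping s_k
        if (uni(gen) < 1.0 / (1.0 + std::exp(dE)))      // P(alpha -> beta)
            s[k] = -s[k];
    }
}

int main() {
    const int N = 8;
    std::mt19937 gen(42);
    std::vector<std::vector<double>> w(N, std::vector<double>(N, 0.1));
    for (int i = 0; i < N; ++i) w[i][i] = 0.0;
    std::vector<double> h(N, 0.0);
    std::vector<int> s(N, 1);                           // start with all spins up
    for (int sweep = 0; sweep < 1000; ++sweep)          // thermalize, then sample
        glauberSweep(w, h, s, gen);
    return 0;
}
```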
      
 
training Boltzmann machines
purpose
-  $N_d\ $ training data $\ \ \mathbf{s}_\alpha$ 
 
-  goal: maximize 
 
 
$$
\prod_{\alpha\in\mathrm{data}} p_\alpha = 
\exp\left(\sum_{\alpha\in\mathrm{data}}\log(p_\alpha)\right),
\qquad\quad
\log(p_\alpha)\ : \ \mathrm{log-likelihood}
$$
-  equivalence: 
     the data configurations $\ \mathbf{s}_\alpha \ $ 
     should be attracting fixed points of the Glauber
     dynamics
      
-  in terms of the cross-correlation function
     
 
$$ \big\langle s_i s_j \big\rangle_{\mathrm{thermal}} 
\quad\to\quad
    \big\langle s_is_j\big\rangle_{\mathrm{data}} 
$$
-  learning is unsupervised (no explicit classifier)
     
 
Hopfield networks
-  similar in spirit, a predecessor of the Boltzmann machine 
 
-  weights as covariance matrix of data 
 
$$
w_{ij} = \frac{1}{N_d} \sum_{\alpha\in\mathrm{data}}
\big((\mathbf{s}_\alpha)_i- \langle\mathbf{s}_\alpha\rangle_i \big)
\big((\mathbf{s}_\alpha)_j- \langle\mathbf{s}_\alpha\rangle_j \big)
$$
-  storage capacity: $\ \ N_d<0.15\, N$ 
 
     : number of neurons $ \ \ N$
      
-  additional data?
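
A minimal C++ sketch of the Hopfield weight construction above (illustrative function, not from the lecture code; the diagonal is set to zero, consistent with $w_{ii}=0$):

```cpp
#include <vector>

// Hopfield weights: w_ij = (1/N_d) sum_alpha (s_i - <s_i>)(s_j - <s_j>)
std::vector<std::vector<double>>
hopfieldWeights(const std::vector<std::vector<int>>& data) {     // data[alpha][i] = +-1
    const int Nd = static_cast<int>(data.size());
    const int N  = static_cast<int>(data[0].size());
    std::vector<double> mean(N, 0.0);                             // <s_i> over the data
    for (const auto& s : data)
        for (int i = 0; i < N; ++i) mean[i] += double(s[i]) / Nd;
    std::vector<std::vector<double>> w(N, std::vector<double>(N, 0.0));
    for (const auto& s : data)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                w[i][j] += (s[i] - mean[i]) * (s[j] - mean[j]) / Nd;
    for (int i = 0; i < N; ++i) w[i][i] = 0.0;                    // no self-coupling
    return w;
}
```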
     
 
 
log-likelihood maximization
-  Boltzmann factor
$$
p_\alpha = \frac{\mbox{e}^{-E_\alpha}}{Z},
\qquad\quad
Z= \sum_{\alpha} \mbox{e}^{-E_\alpha},
\qquad\quad
E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i
$$
derivative
$$
\frac{\partial \log(p_\alpha)}{\partial w_{ij}} =
\frac{\partial}{\partial w_{ij}}\Big(-E_\alpha-\log(Z)\Big)
= s_is_j -\frac{1}{Z}\sum_{\alpha'} \big(s_i s_j\big)_{\alpha'}\, \mbox{e}^{-E_{\alpha'}}
$$
and hence
$$
\frac{\partial \log(p_\alpha)}{\partial w_{ij}} =
s_is_j -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
$$
     
 
-  average over training data
     number of data points $ \ \ N_d$
      
$$
\frac{1}{N_d}
\sum_{\beta\in\mathrm{data}} \frac{\partial \log(p_\beta)}{\partial w_{ij}}
= \big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
$$
updating the weight matrix
$$
\fbox{$\phantom{\big|}\displaystyle 
\tau_w \frac{d}{dt}w_{ij} =
\big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
  \phantom{\big|}$}
$$
-  learning rule is
     
     -  local (biologically plausible) 
 
     -  greedy (local and not global optimization) 
 
     
 
-  equivalently for the magnetic field $\ \ h_i$
 
$$
\tau_b \frac{d}{dt}h_{i} =
\big\langle s_i\big\rangle_{\mathrm{data}} -\big\langle s_i \big\rangle_{\mathrm{thermal}}
$$
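
As an illustration, a minimal C++ sketch of one discretized (Euler) step of both rules, assuming the data and thermal averages have already been estimated, e.g. by Glauber sampling (all identifiers are illustrative):

```cpp
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// one Euler step of
//   tau_w dw_ij/dt = <s_i s_j>_data - <s_i s_j>_thermal
//   tau_b dh_i /dt = <s_i>_data     - <s_i>_thermal
void learningStep(Matrix& w, std::vector<double>& h,
                  const Matrix& corrData, const Matrix& corrThermal,
                  const std::vector<double>& meanData,
                  const std::vector<double>& meanThermal, double dtOverTau) {
    const int N = static_cast<int>(h.size());
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j)
            w[i][j] += dtOverTau * (corrData[i][j] - corrThermal[i][j]);
        h[i] += dtOverTau * (meanData[i] - meanThermal[i]);
    }
}
```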
 
Kullback-Leibler divergence
-  two distribution functions
 
$$
p_\alpha,\,q_\alpha\ge 0,
\qquad\quad
\sum_\alpha p_\alpha = 1= \sum_\alpha q_\alpha
$$
-  positivity $\ \ \to \ \ $ 
     the standard scalar product does not provide a suitable distance measure
      
similarity measure
$$
\fbox{$\phantom{\big|}\displaystyle   
K(p;q) = \sum_\alpha p_\alpha\log\left(\frac{p_\alpha}{q_\alpha}\right)
\phantom{\big|}$}\,,
\qquad\quad
K(p;q) \ge 0
\qquad\quad\mbox{(strict)}
$$
-  positive because $\ \log \ $ is concave
     
 
-  vanishes only if $\ \ p_\alpha=q_\alpha, \qquad \forall\alpha$
     
 
-  asymmetric
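
As a small illustration, a C++ function for $K(p;q)$ over discrete states (assuming normalized distributions with $q_\alpha>0$ wherever $p_\alpha>0$; identifiers are illustrative):

```cpp
#include <cmath>
#include <vector>

// K(p;q) = sum_alpha p_alpha log(p_alpha / q_alpha)
double klDivergence(const std::vector<double>& p, const std::vector<double>& q) {
    double K = 0.0;
    for (size_t a = 0; a < p.size(); ++a)
        if (p[a] > 0.0)                       // terms with p_alpha = 0 contribute nothing
            K += p[a] * std::log(p[a] / q[a]);
    return K;                                 // >= 0, zero only for identical distributions
}
```

Calling it with the arguments interchanged, `klDivergence(q, p)`, generally gives a different value, reflecting the asymmetry noted above.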
     
 
KL divergence for Boltzmann machine
$$ 
p_\alpha\ / \ q_\alpha \qquad\quad \mbox{(Boltzmann machine / training data)}
$$
-  interchanged Kullback-Leibler divergence 
 
$$
K(q;p) = \sum_\alpha q_\alpha\log\left(\frac{q_\alpha}{p_\alpha}\right)
 = 
\underbrace{\sum_\alpha q_\alpha\log(q_\alpha)}_{-H[q]}
- \sum_\alpha q_\alpha\log(p_\alpha)
$$
-  entropy $ \ H[q] \ $ of data independent of $ \ \ w_{ij}$
     
 
-  minimizing $ \ \ K(q;p)$ 
     corresponds to maximizing
$$
\sum_\alpha q_\alpha\log(p_\alpha) = \frac{1}{N_d}\sum_{\beta\in\mathrm{data}} \log(p_\beta)
\qquad\quad\mbox{(log-likelihood of data)}
$$
with respect to the synaptic weights $\ \ w_{ij}$
      
-  such that $\ \ \big\langle s_i s_j \big\rangle_{\mathrm{thermal}} \to
                   \big\langle s_is_j\big\rangle_{\mathrm{data}} $
     
 
     Boltzmann machines encode data statistics
 
restricted Boltzmann machines (RBM)
-  training of the standard Boltzmann machine does not converge
     for complex problems (thermal averages too costly) 
      
statistical independence
-  separate into hidden/visible units 
     
     -  no intra-layer connections (bipartite graph)
          
 
      -  all units of one layer are statistically 
           independent if the other layer is frozen
      
     
      
-  the probability distribution factorizes for 
     statistically independent processes
      
aim
-  fixed points of the restricted Boltzmann machine
     
     -  visible units $\ \ \to \ \ $ data 
 
     -  hidden units $\ \ \to \ \ $ contextual information 
 
     
      
-  e.g. movie ratings
     
     -  visible units: ratings for individual movies (data) 
 
     -  hidden units: classify movies (romance, drama, science-fiction, ..) 
 
     
      
     
 
joint Boltzmann distribution
-  variables $\ \ \mathbf{v}, \mathbf{h}$ 
 
-  magnetic fields $\ \ \mathbf{b}, \mathbf{c}$ 
 
-  energy
 
$\qquad\quad
E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_iw_{ij}h_j - \sum_i b_iv_i - \sum_j c_j h_j
$
-  no factor 1/2 (no double counting) 
 
joint PDF
$\qquad\quad
p(\mathbf{v}, \mathbf{h}) = 
\frac{1}{Z} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\},
\quad\qquad
Z = \sum_{\mathbf{v},\mathbf{h}} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\}
$
-  the joint PDF $p(\mathbf{v}, \mathbf{h})\ $ is the probability 
     to find the hidden system in state $ \ \mathbf{h} \ $ and, 
     at the same time, the visible system in state $ \ \mathbf{v} \ $
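
A minimal C++ sketch of this energy for binary units $v_i,h_j\in\{0,1\}$ (illustrative code, not from the lecture):

```cpp
#include <vector>

// RBM energy: E(v,h) = - sum_ij v_i w_ij h_j - sum_i b_i v_i - sum_j c_j h_j
double rbmEnergy(const std::vector<int>& v, const std::vector<int>& h,
                 const std::vector<std::vector<double>>& w,
                 const std::vector<double>& b, const std::vector<double>& c) {
    double E = 0.0;
    for (size_t i = 0; i < v.size(); ++i) {
        for (size_t j = 0; j < h.size(); ++j)
            E -= v[i] * w[i][j] * h[j];       // coupling term, no factor 1/2
        E -= b[i] * v[i];                     // visible fields
    }
    for (size_t j = 0; j < h.size(); ++j)
        E -= c[j] * h[j];                     // hidden fields
    return E;
}
```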
      
 
statistical independence
-  hidden units are statistically independent
     for clamped visible units (and vice versa)
      
-  total energy
$\qquad\quad
E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_iw_{ij}h_j - \sum_i b_iv_i - \sum_j c_j h_j
$
      
-  energy difference for a single hidden unit $\ \ h_j=0,1$ 
$\qquad\quad
E(\mathbf{v},1)- E(\mathbf{v},0)
 = -\sum_{i} v_iw_{ij}- c_j$
      
Glauber dynamics
$$
\fbox{$\phantom{\big|}\displaystyle   
P(0\to1)= 
 \frac{1}{1+\mbox{e}^{\,E(\mathbf{v},1)-E(\mathbf{v},0)}}
\phantom{\big|}$}\,,
\qquad\quad
P(\alpha\to\beta)= 
\frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}}
$$
-  transition rates, sigmoidal $\sigma()$
     
 
$$
P(0\to1)=\sigma(\epsilon_j),
\qquad\quad
P(1\to0)=1-\sigma(\epsilon_j),
\qquad\quad
\epsilon_j = \sum_{i} v_iw_{ij} + c_j
$$
$$
p_j(0)P(0\to1)= p_j(1)P(1\to0),
\qquad\quad
\fbox{$\phantom{\big|}\displaystyle 
p_j(1) =\sigma(\epsilon_j)
\phantom{\big|}$}
$$
-  analytic result 
     $\hat{=} \ $ deterministic neuron
      
no need for (numerical) statistical sampling
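
Because $p_j(1)=\sigma(\epsilon_j)$ is known in closed form, the entire hidden layer can be drawn in a single pass for clamped visible units. A possible C++ sketch (identifiers are illustrative):

```cpp
#include <cmath>
#include <random>
#include <vector>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// sample all hidden units in one pass: p(h_j = 1 | v) = sigma(sum_i v_i w_ij + c_j)
std::vector<int> sampleHidden(const std::vector<int>& v,
                              const std::vector<std::vector<double>>& w,
                              const std::vector<double>& c, std::mt19937& gen) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<int> h(c.size());
    for (size_t j = 0; j < c.size(); ++j) {
        double eps = c[j];                               // eps_j = sum_i v_i w_ij + c_j
        for (size_t i = 0; i < v.size(); ++i) eps += v[i] * w[i][j];
        h[j] = (uni(gen) < sigmoid(eps)) ? 1 : 0;        // no Markov chain needed
    }
    return h;
}
```

The analogous step for clamped hidden units uses $p(v_i{=}1|\mathbf{h})=\sigma\big(\sum_j w_{ij}h_j + b_i\big)$.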
 
training RBMs
-  maximizing the log-likelihood $\ \ \log\big(p(\mathbf{v})\big)\ \ $ of the training data on the visible units
     
 
-  possible, but somewhat elaborate
     
 
contrastive divergence
| step | operation |
|------|-----------|
| t=0  | start with a training vector on the visible units |
|      | update all hidden units in parallel |
|      | measure $\ \langle v_i h_j\rangle^0$ |
| t=1  | update the visible units in parallel $\ \to\ $ 'reconstruction' |
|      | update the hidden units again |
|      | measure $\ \langle v_i h_j\rangle^1$ |
-  stationary, if visible units did not change
     
 
$$
\tau_w \frac{d}{dt} w_{ij} = \langle v_i h_j\rangle^0 -
                      \langle v_i h_j\rangle^1
$$
-  repeat for all training vectors 
 
-  efficient RBMs use mixtures of training methods
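
Putting the table and the weight update together, a minimal CD-1 sketch for a single training vector (illustrative code; the per-sample products $v_ih_j$ stand in for the averages $\langle v_ih_j\rangle^{0,1}$, and the bias updates for $b_i,c_j$ follow the same data-minus-reconstruction pattern, a common choice not spelled out above):

```cpp
#include <cmath>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// sample one layer given the other: p(unit = 1) = sigma(sum_i in_i w + bias)
std::vector<int> sampleLayer(const std::vector<int>& in, const Matrix& w,
                             const std::vector<double>& bias,
                             bool hiddenGivenVisible, std::mt19937& gen) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<int> out(bias.size());
    for (size_t k = 0; k < bias.size(); ++k) {
        double eps = bias[k];
        for (size_t i = 0; i < in.size(); ++i)
            eps += in[i] * (hiddenGivenVisible ? w[i][k] : w[k][i]);
        out[k] = (uni(gen) < sigmoid(eps)) ? 1 : 0;
    }
    return out;
}

// one CD-1 update for a single training vector v0; w is Nv x Nh
void cd1Step(Matrix& w, std::vector<double>& b, std::vector<double>& c,
             const std::vector<int>& v0, double eta, std::mt19937& gen) {
    std::vector<int> h0 = sampleLayer(v0, w, c, true,  gen);  // t=0: hidden given data
    std::vector<int> v1 = sampleLayer(h0, w, b, false, gen);  // t=1: 'reconstruction'
    std::vector<int> h1 = sampleLayer(v1, w, c, true,  gen);  // hidden given reconstruction
    for (size_t i = 0; i < v0.size(); ++i) {
        for (size_t j = 0; j < c.size(); ++j)                 // v_i h_j (t=0) - v_i h_j (t=1)
            w[i][j] += eta * (v0[i] * h0[j] - v1[i] * h1[j]);
        b[i] += eta * (v0[i] - v1[i]);                        // visible bias
    }
    for (size_t j = 0; j < c.size(); ++j)
        c[j] += eta * (h0[j] - h1[j]);                        // hidden bias
}
```

Repeating `cd1Step` over all training vectors (and over several epochs) corresponds to the per-sample, stochastic version of the update rule above.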