Machine Learning Primer -- Part II: Deep Learning




Claudius Gros, WS 2024/25

Institut für theoretische Physik
Goethe-University Frankfurt a.M.

Boltzmann Machines

statistical physics

energy functional

$$ E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i $$

Boltzmann factor

$$ \fbox{$\phantom{\big|}\displaystyle p_\alpha = \frac{\mbox{e}^{-\beta E_\alpha}}{Z} \phantom{\big|}$}\,, \qquad\quad Z = \sum_{\alpha'} \mbox{e}^{-\beta E_{\alpha'}}, \qquad\quad \beta=\frac{1}{k_B T}\ \to\ 1 $$
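
As a side illustration (not part of the lecture code), a minimal NumPy sketch that enumerates all $2^N$ states of a small spin system and evaluates $p_\alpha=\mbox{e}^{-E_\alpha}/Z$ by brute force; system size and couplings are arbitrary choices.

# exact Boltzmann probabilities for a small spin system (illustrative sketch)
import itertools
import numpy as np

N = 4                                         # number of spins s_i = +/-1
rng = np.random.default_rng(0)
w = rng.normal(size=(N, N)); w = (w + w.T)/2  # symmetric couplings w_ij
np.fill_diagonal(w, 0.0)
h = rng.normal(size=N)                        # local fields h_i

def energy(s):                                # E = -1/2 sum_ij w_ij s_i s_j - sum_i h_i s_i
    return -0.5*(s @ w @ s) - h @ s

states = np.array(list(itertools.product([-1, 1], repeat=N)))
E = np.array([energy(s) for s in states])
p = np.exp(-E)                                # Boltzmann factors, beta = 1
p /= p.sum()                                  # normalize by the partition function Z
print(p.sum(), p.argmax())                    # probabilities sum to one; most likely state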

Metropolis sampling

detailed balance

$$ p_\alpha P(\alpha\to\beta) = p_\beta P(\beta\to\alpha), \qquad\quad \fbox{$\phantom{\big|}\displaystyle \frac{P(\alpha\to\beta)}{P(\beta\to\alpha)} = \frac{p_\beta}{p_\alpha} = \mbox{e}^{E_\alpha-E_\beta} \phantom{\big|}$} $$
$$ P(\alpha\to\beta)= \frac{\mbox{e}^{-E_\beta}}{ \mbox{e}^{-E_\alpha}+\mbox{e}^{-E_\beta}} = \frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}} $$

Glauber dynamics

$$ E_\beta -E_\alpha = \left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\alpha -\left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\beta $$

$\qquad$ for $\ \ \def\arraystretch{1.3} \begin{array}{r|l} \mbox{in}\ \alpha & \mbox{in}\ \beta\\ \hline s_k & -s_k \end{array}$

$$ \fbox{$\phantom{\big|}\displaystyle E_\beta -E_\alpha = \epsilon_k s_k \phantom{\big|}$}\,, \qquad\quad \epsilon_k = 2h_k +\sum_j w_{kj}s_j +\sum_i w_{ik}s_i $$
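
A minimal sketch of Glauber (heat-bath) updating for the same kind of spin system, using the boxed result $E_\beta-E_\alpha=\epsilon_k s_k$; sizes and couplings are again arbitrary.

# one Glauber (heat-bath) sweep over all spins (illustrative sketch)
import numpy as np

rng = np.random.default_rng(1)
N = 10
w = rng.normal(scale=0.3, size=(N, N)); w = (w + w.T)/2
np.fill_diagonal(w, 0.0)
h = np.zeros(N)
s = rng.choice([-1, 1], size=N)               # random initial configuration

def glauber_sweep(s):
    for k in range(N):
        eps_k = 2*h[k] + w[k] @ s + w[:, k] @ s   # epsilon_k as on the slide
        dE = eps_k*s[k]                           # E_beta - E_alpha for flipping s_k
        if rng.random() < 1.0/(1.0 + np.exp(dE)): # heat-bath flip probability
            s[k] = -s[k]
    return s

for _ in range(100):                          # thermalization sweeps
    s = glauber_sweep(s)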

training Boltzmann machines

purpose

$$ \prod_{\alpha\in\mathrm{data}} p_\alpha = \exp\left(\sum_{\alpha\in\mathrm{data}}\log(p_\alpha)\right), \qquad\quad \log(p_\alpha)\ : \ \mbox{log-likelihood} $$

$$ \big\langle s_i s_j \big\rangle_{\mathrm{thermal}} \quad\to\quad \big\langle s_i s_j\big\rangle_{\mathrm{data}} $$

Hopfield networks

$$ w_{ij} = \frac{1}{N_d} \sum_{\alpha\in\mathrm{data}} \big((\mathbf{s}_\alpha)_i- \langle\mathbf{s}_\alpha\rangle_i \big) \big((\mathbf{s}_\alpha)_j- \langle\mathbf{s}_\alpha\rangle_j \big) $$
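
The covariance rule translates directly into NumPy; the sketch below assumes the $N_d$ patterns are stored as rows of an array, all names being illustrative.

# Hopfield-style covariance rule (illustrative sketch)
import numpy as np

patterns = np.sign(np.random.default_rng(2).normal(size=(20, 50)))  # N_d x N patterns of +/-1
mean = patterns.mean(axis=0)                  # <s>_i, averaged over the data
dev = patterns - mean                         # centered patterns
w = dev.T @ dev / patterns.shape[0]           # w_ij = (1/N_d) sum_alpha ds_i ds_j
np.fill_diagonal(w, 0.0)                      # no self-couplings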

log-likelihood maximization

$$ \frac{1}{N_d} \sum_{\beta\in\mathrm{data}} \frac{\partial \log(p_\beta)}{\partial w_{ij}} = \big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}} $$

updating the weight matrix

$$ \fbox{$\phantom{\big|}\displaystyle \tau_w \frac{d}{dt}w_{ij} = \big\langle s_i s_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}} \phantom{\big|}$} $$

$$ \tau_b \frac{d}{dt}h_{i} = \big\langle s_i\big\rangle_{\mathrm{data}} -\big\langle s_i \big\rangle_{\mathrm{thermal}} $$
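
A sketch of the resulting learning step, replacing the continuous-time equations by a small Euler step; it assumes an array data of $\pm 1$ patterns, current w, h, s, and a sampler glauber_sweep(s) as sketched above, with the step size eta an arbitrary choice.

# one Boltzmann-machine learning step (illustrative sketch)
# assumes: data (N_d x N array of +/-1), w, h, s, and glauber_sweep() as above
eta = 0.01                                    # learning rate ~ dt/tau

corr_data = data.T @ data / data.shape[0]     # <s_i s_j>_data
mean_data = data.mean(axis=0)                 # <s_i>_data

corr_model = np.zeros_like(corr_data)         # thermal averages from Glauber samples
mean_model = np.zeros_like(mean_data)
n_samples = 200
for _ in range(n_samples):
    s = glauber_sweep(s)
    corr_model += np.outer(s, s)
    mean_model += s
corr_model /= n_samples
mean_model /= n_samples

w += eta*(corr_data - corr_model)             # tau_w dw/dt = <s_i s_j>_data - <s_i s_j>_thermal
h += eta*(mean_data - mean_model)             # tau_b dh/dt  = <s_i>_data   - <s_i>_thermal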

Kullback-Leibler divergence

$$ p_\alpha,\,q_\alpha\ge 0, \qquad\quad \sum_\alpha p_\alpha = 1= \sum_\alpha q_\alpha $$

similarity measure

$$ \fbox{$\phantom{\big|}\displaystyle K(p;q) = \sum_\alpha p_\alpha\log\left(\frac{p_\alpha}{q_\alpha}\right) \phantom{\big|}$}\,, \qquad\quad K(p;q) \ge 0 \qquad\quad\mbox{(strict)} $$
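
In code the divergence is a one-liner; the sketch below adopts the convention $0\cdot\log 0 = 0$, which is an implementation choice rather than part of the slide.

# Kullback-Leibler divergence K(p;q) (illustrative sketch)
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                              # terms with p_alpha = 0 contribute zero
    return np.sum(p[mask]*np.log(p[mask]/q[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q), kl_divergence(p, p))   # strictly positive, and 0 for q = p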

KL divergence for Boltzmann machine

$$ p_\alpha\ / \ q_\alpha \qquad\quad \mbox{(Boltzmann machine / training data)} $$

$$ K(q;p) = \sum_\alpha q_\alpha\log\left(\frac{q_\alpha}{p_\alpha}\right) = \underbrace{\sum_\alpha q_\alpha\log(q_\alpha)}_{-H[q]} - \sum_\alpha q_\alpha\log(p_\alpha) $$
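
Since the entropy $H[q]$ of the training data does not depend on the machine, minimizing $K(q;p)$ with respect to the weights and biases is equivalent to maximizing the mean log-likelihood used above:

$$ \min_{w,h}\, K(q;p) \qquad\Longleftrightarrow\qquad \max_{w,h}\, \sum_\alpha q_\alpha\log(p_\alpha) $$
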
Boltzmann machines encode data statistics

restricted Boltzmann machines (RBM)



statistical independence

aim

joint Boltzmann distribution



$\qquad\quad E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_iw_{ij}h_j - \sum_i b_iv_i - \sum_j c_j h_j $

joint PDF

$\qquad\quad p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\}, \quad\qquad Z = \sum_{\mathbf{v},\mathbf{h}} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\} $

statistical independence
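
With no couplings within the visible layer and none within the hidden layer, the conditional distributions factorize over the units of a layer (the standard RBM property):

$$ p(\mathbf{h}\,|\,\mathbf{v}) = \prod_j p(h_j\,|\,\mathbf{v}), \qquad\quad p(\mathbf{v}\,|\,\mathbf{h}) = \prod_i p(v_i\,|\,\mathbf{h}) $$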



Glauber dynamics

$$ \fbox{$\phantom{\big|}\displaystyle P(0\to1)= \frac{1}{1+\mbox{e}^{\,E(1,\mathbf{v})-E(0,\mathbf{v})}} \phantom{\big|}$}\,, \qquad\quad P(\alpha\to\beta)= \frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}} $$

$$ P(0\to1)=\sigma(\epsilon_j), \qquad\quad P(1\to0)=1-\sigma(\epsilon_j), \qquad\quad \epsilon_j = \sum_{i} v_iw_{ij} + c_j $$

$$ p_j(0)P(0\to1)= p_j(1)P(1\to0), \qquad\quad \fbox{$\phantom{\big|}\displaystyle p_j(1) =\sigma(\epsilon_j) \phantom{\big|}$} $$
no need for (numerical) statistical sampling
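
A minimal NumPy sketch of this closed-form conditional update, drawing all hidden units in parallel from $p_j(1)=\sigma(\epsilon_j)$; sizes and names are illustrative.

# sampling the hidden layer of an RBM given the visible layer (illustrative sketch)
import numpy as np

rng = np.random.default_rng(3)
n_vis, n_hid = 6, 4
W = rng.normal(scale=0.1, size=(n_vis, n_hid))    # w_ij couples v_i to h_j
b = np.zeros(n_vis)                               # visible biases b_i
c = np.zeros(n_hid)                               # hidden biases c_j

def sigma(x):
    return 1.0/(1.0 + np.exp(-x))

v = rng.integers(0, 2, size=n_vis)                # binary visible configuration
p_h = sigma(v @ W + c)                            # p_j(1) = sigma(epsilon_j), exact
h = (rng.random(n_hid) < p_h).astype(int)         # one Bernoulli draw per hidden unit
print(p_h, h)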

training RBMs



contrastive divergence

t=0 start with a training vector on visible units
update all hidden units in parallel
measure $\ \langle v_i h_j\rangle^0$
t=1 update visible units in parallel
$\to \ $ 'reconstruction'
update hidden units again
measure $\ \langle v_i h_j\rangle^1$
$$ \tau_w \frac{d}{dt} w_{ij} = \langle v_i h_j\rangle^0 - \langle v_i h_j\rangle^1 $$
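
A compact NumPy version of one CD-1 step following the list above; the full PyTorch implementation comes in the next section, so this is only schematic, reusing W, b, c, rng, and sigma() from the previous sketch, with v0 standing for one binary training vector and eta an arbitrary learning rate.

# one step of contrastive divergence, CD-1 (illustrative sketch)
# assumes: W, b, c, rng, sigma() from the previous sketch; v0 a binary training vector
eta = 0.05

p_h0 = sigma(v0 @ W + c)                          # t=0: hidden probabilities
h0 = (rng.random(p_h0.size) < p_h0).astype(int)   # parallel 0/1 update of hidden units
corr0 = np.outer(v0, p_h0)                        # <v_i h_j>^0

p_v1 = sigma(h0 @ W.T + b)                        # t=1: 'reconstruction' of visible units
v1 = (rng.random(p_v1.size) < p_v1).astype(int)
p_h1 = sigma(v1 @ W + c)                          # hidden units again
corr1 = np.outer(v1, p_h1)                        # <v_i h_j>^1

W += eta*(corr0 - corr1)                          # tau_w dW/dt = <v h>^0 - <v h>^1
b += eta*(v0 - v1)                                # analogous bias updates
c += eta*(p_h0 - p_h1)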

RBM code

$$ F = - Tk_B\log(Z),\quad\qquad Z = \sum_{\alpha} \mbox{e}^{-E_\alpha}, \qquad\quad E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i $$
#!/usr/bin/env python3

# restricted Boltzmann machine; source:
# https://blog.paperspace.com/beginners-guide-to-boltzmann-machines-pytorch/

# torchvision datasets
# https://pytorch.org/vision/stable/datasets.html

import numpy as np
import torch
import torch.utils.data
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torchvision.utils import make_grid, save_image
import matplotlib.pyplot as plt

# loading the MNIST dataset
# 'train_loader' is an instance of a DataLoader
#  'test_loader' for testing (currently not used)

batch_size = 64                           # samples per batch
train_loader = torch.utils.data.DataLoader(
  datasets.MNIST(                         # which dataset to load
    './data',                             # store in local directory
    train=True,
    download = True,                      # download into './data' if not present
    transform = transforms.Compose([transforms.ToTensor()])
                ), 
  batch_size=batch_size                   )

test_loader = torch.utils.data.DataLoader(
  datasets.MNIST('./data', train=False,
    transform=transforms.Compose([transforms.ToTensor()])
                ), batch_size=batch_size )

#
# defining the restricted Boltzmann machine
# 'vis' : visible
# 'hin' : hidden
#
class RBM(nn.Module):
   def __init__(self, n_vis=784, n_hin=500, k=5):
        super(RBM, self).__init__()
        self.W = nn.Parameter(torch.randn(n_hin,n_vis)*1e-2)
        self.v_bias = nn.Parameter(torch.zeros(n_vis))
        self.h_bias = nn.Parameter(torch.zeros(n_hin))
        self.k = k                        # iteration depth

   def sample_from_p(self, p):            # p -> 0/1  stochastically
       return F.relu(torch.sign(p-torch.rand(p.size())))

   def v_to_h(self, v):
        p_h = torch.sigmoid(F.linear(v, self.W, self.h_bias))  # update hidden
        sample_h = self.sample_from_p(p_h)                 # 0/1 sample
        return p_h, sample_h

   def h_to_v(self, h):                   # transpose W for other direction
        p_v = torch.sigmoid(F.linear(h, self.W.t(), self.v_bias))
        sample_v = self.sample_from_p(p_v)
        return p_v, sample_v

   def forward(self, v):
        pre_h1, h1 = self.v_to_h(v)
        h_ = h1
        for _ in range(self.k):           # consistency loop
            pre_v_, v_ = self.h_to_v(h_)  # with 0/1 samples
            pre_h_, h_ = self.v_to_h(v_)
        return v, v_                      # return input, 0/1 reconstruction

   def free_energy(self, v):
        """hidden term: sum over hidden units
           1+exp(): s_hidden = 0/1 
           wx_b   : energy for hidden units
           v: visible activity, data / reconstructed
            : only one term -->  -log(exp(-E)) = E (modulo sign)
        """
        vbias_term  = v.mv(self.v_bias)
        wx_b        = F.linear(v, self.W, self.h_bias)
        hidden_term = wx_b.exp().add(1).log().sum(1)
        return (-hidden_term - vbias_term).mean()

#
# define model, register optimizer
# SGD: stochastic gradient descent
#
model = RBM(k=1)
train_op = optim.SGD(model.parameters(), lr=0.1)

#
# training model
#
for epoch in range(2):
    loss_ = []
    for _, (data,target) in enumerate(train_loader):
        data = data.view(-1, 784)
        sample_data = data.bernoulli()

        v, v1 = model(sample_data)
        loss = model.free_energy(v) - model.free_energy(v1)
        loss_.append(loss.data)
        train_op.zero_grad()
        loss.backward()
        train_op.step()

    print("Training loss for {} epoch: {}".format(epoch, np.mean(loss_)))

#
# storing images
#
def show_and_save(file_id,img,fig,position):
    npimg = np.transpose(img.numpy(),(1,2,0))
    fileName = "RBM_out_" + file_id + ".png"
    fig.add_subplot(1, 2, position)
    plt.title(file_id)
    plt.axis('off')
    plt.imshow(npimg)
    plt.imsave(fileName,npimg)

#
# visualising training outputs
#
fig = plt.figure(figsize=(12, 6))
show_and_save("real"     ,make_grid( v.view(32,1,28,28).data),fig,1)
show_and_save("generated",make_grid(v1.view(32,1,28,28).data),fig,2)
plt.show()