Machine Learning Primer -- Python Tutorial
Claudius Gros, WS 2025/26
Institut für theoretische Physik
Goethe-University Frankfurt a.M.
Boltzmann Machines
statistical physics
- binary neurons $ \qquad y_i\quad\to\quad s_i = -1/1$
: inactive/active
: akin to spins (physics)
- stochastic dynamics: spins flip with a certain probability
energy functional
$$
E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j -
\sum_i h_i \, s_i
$$
- factor $ \ \ 1/2 \ \ $ corrects for the double counting of pairs
- magnetic field $\ \ h_i \ \ $ corresponds to $\ \ -b_i \ \ $ (threshold)
Boltzmann factor
- thermal equilibrium, temperature $\ \ T$
- probability to occupy state $\ \ \alpha=\{s_j\}$
$$
\fbox{$\phantom{\big|}\displaystyle
p_\alpha = \frac{\mbox{e}^{-\beta E_\alpha}}{Z}
\phantom{\big|}$}\,,
\qquad\quad
Z = \sum_{\alpha'} \mbox{e}^{-\beta E_{\alpha'}},
\qquad\quad
\beta=\frac{1}{k_B T}\ \to\ 1
$$
- partition function $\ \ Z$
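- for a small system, $Z$ can be evaluated exactly by enumerating all $2^N$
configurations; minimal sketch (own toy parameters, not part of the lecture code)
# toy example: exact Boltzmann probabilities for N=3 spins, beta = 1
import itertools
import numpy as np
w = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0,-0.5],
              [0.5,-0.5, 0.0]])          # symmetric weights, w_ii = 0
h = np.array([0.1, -0.2, 0.0])           # magnetic fields
def energy(s):                           # E = -1/2 sum_ij w_ij s_i s_j - sum_i h_i s_i
    return -0.5*(s @ w @ s) - h @ s
states  = [np.array(c) for c in itertools.product([-1, 1], repeat=3)]
weights = np.array([np.exp(-energy(s)) for s in states])
Z = weights.sum()                        # partition function
p = weights/Z                            # p_alpha = exp(-E_alpha)/Z
print(p.sum())                           # 1.0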
Metropolis sampling
detailed balance
- stochastic sampling of the partition function via a
random walk through configuration space
:: Monte-Carlo / Metropolis sampling
$$
p_\alpha P(\alpha\to\beta) = p_\beta P(\beta\to\alpha),
\qquad\quad
\fbox{$\phantom{\big|}\displaystyle
\frac{P(\alpha\to\beta)}{P(\beta\to\alpha)} =
\frac{p_\beta}{p_\alpha} =
\mbox{e}^{E_\alpha-E_\beta}
\phantom{\big|}$}
$$
- $P(\alpha\to\beta)\ $ probability of
selecting $\ \beta$
when being in state $\ \alpha$
- detailed balance is fulfilled by
$$
P(\alpha\to\beta)= \frac{\mbox{e}^{-E_\beta}}{
\mbox{e}^{-E_\alpha}+\mbox{e}^{-E_\beta}}
= \frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}}
$$
Glauber dynamics
$$
E_\beta -E_\alpha =
\left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\alpha
-\left(\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j + \sum_i h_i \, s_i \right)_\beta
$$
$\qquad$ for $\ \ \def\arraystretch{1.3} \begin{array}{r|l}
\mbox{in}\ \alpha & \mbox{in}\ \beta\\ \hline
s_k & -s_k \end{array}$
- terms $\ \ w_{ij} \, s_i \, s_j \ \ $ with $\ \ i,j\ne k \ \ $ cancel in the difference
$$
\fbox{$\phantom{\big|}\displaystyle
E_\beta -E_\alpha = \epsilon_ks_k
\phantom{\big|}$}\,,
\qquad\quad
\epsilon_k = 2h_k +\sum_j w_{kj}s_j +\sum_i w_{ik}s_i
$$
- here $\ \ w_{ii}=0$
- often symmetric weights $\ \ w_{ij}=w_{ji}$
- single spin flips not used in physics (critical slowing down)
: stochastic series expansion (loop algorithm)
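- single-flip Glauber updates are nevertheless exactly what Boltzmann machines
use; minimal sketch (own toy parameters), $\epsilon_k$ simplified for symmetric weights
# toy sketch: single spin-flip Glauber dynamics, beta = 1
import numpy as np
def glauber_step(s, w, h, rng):
    k   = rng.integers(len(s))              # pick a random spin
    eps = 2.0*h[k] + 2.0*(w[k] @ s)         # eps_k for symmetric w, w_kk = 0
    dE  = eps*s[k]                          # E_beta - E_alpha for flipping s_k
    if rng.random() < 1.0/(1.0 + np.exp(dE)):
        s[k] = -s[k]                        # accept with Glauber probability
    return s
rng = np.random.default_rng(42)
N   = 10
w   = rng.normal(0.0, 0.3, (N, N))
w   = 0.5*(w + w.T)                         # symmetrize
np.fill_diagonal(w, 0.0)
h   = np.zeros(N)
s   = rng.choice([-1, 1], N)
for _ in range(10000):                      # relax towards thermal equilibrium
    s = glauber_step(s, w, h, rng)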
Boltzmann machines
2024 physics Nobel prize; Geoff Hinton
purpose
- $N_d\ $ training data $\ \ \mathbf{s}_\alpha$
- goal
maximize the probability to visit the training data
$$
\prod_{\alpha\in\mathrm{data}} p_\alpha =
\exp\left(\sum_{\alpha\in\mathrm{data}}\log(p_\alpha)\right),
\qquad\quad
\log(p_\alpha)\ : \ \mathrm{log-likelihood}
$$
- equivalence
data configurations $\ \mathbf{s}_\alpha \ $
should be
attracting fixed points of the Glauber
dynamics
- in terms of the cross-correlation function
$$ \big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
\quad\to\quad
\big\langle s_is_j\big\rangle_{\mathrm{data}}
$$
- learning is unsupervised (no explicit classifier)
Hopfield network
2024 physics Nobel prize; John Hopfield
data ↔ memories
- similar to Boltzmann machines, predecessor
- weights set by hand, not learned,
as covariance matrix of data
$$
w_{ij} = \frac{1}{N_d} \sum_{\alpha\in\mathrm{data}}
\big((\mathbf{s}_\alpha)_i- \langle s_i\rangle\big)
\big((\mathbf{s}_\alpha)_j- \langle s_j\rangle\big)
$$
- case: a single datapoint, $\ N_d=1$, without mean subtraction
$\ \mathbf{s}_\alpha = (t_0,t_1,\dots), \qquad w_{ij} = t_i t_j$
$$
E = -\frac{1}{2}\sum_{i,j} t_i t_j \, s_i \, s_j
= -\frac{1}{2}\sum_{i,j} (t_i s_i)\, (t_j s_j)
$$
- minimal for $\ s_i\equiv t_i$
modulo a global sign flip (black ↔ white)
storage capacity
- (Glauber) dynamics towards
attractor → memory
- storage capacity: $\ \ N_d<0.15\, N$
:: number of neurons $ \ \ N$
- catastrophic forgetting
performance decreases rapidly above storage capacity
- problem: how to add additional data
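- minimal sketch of storage and recall (own toy example); for random $\pm1$
patterns the means $\langle s_i\rangle$ vanish, so the covariance rule reduces
to the plain Hebb rule, and recall uses deterministic ($T\to0$) updates
# toy sketch: Hopfield storage and recall of random patterns
import numpy as np
rng    = np.random.default_rng(0)
N, N_d = 100, 5                            # N_d < 0.15 N
T      = rng.choice([-1, 1], (N_d, N))     # patterns t_alpha
w      = (T.T @ T)/N_d                     # Hebb rule, means ~ 0
np.fill_diagonal(w, 0.0)
s = T[0].copy()
s[:20] *= -1                               # corrupt 20% of the first pattern
for _ in range(10):                        # deterministic (T -> 0) updates
    s = np.where(w @ s >= 0, 1, -1)
print((s == T[0]).mean())                  # close to 1.0: memory recalled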
log-likelihood maximization
- Boltzmann factor
$$
p_\alpha = \frac{\mbox{e}^{-E_\alpha}}{Z},
\qquad\quad
Z = \sum_{\alpha} \mbox{e}^{-E_\alpha},
\qquad\quad
E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i
$$
derivative
$$
\frac{\partial \log(p_\alpha)}{\partial w_{ij}} =
\frac{\partial}{\partial w_{ij}}\Big(-E_\alpha-\log(Z)\Big)
= (s_is_j)_\alpha -\frac{1}{Z}\sum_{\alpha'} (s_i s_j)_{\alpha'}\, \mbox{e}^{-E_{\alpha'}}
$$
and hence
$$
\frac{\partial \log(p_\alpha)}{\partial w_{ij}} =
(s_is_j)_\alpha -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
$$
- average over training data
number of data points $ \ \ N_d$
$$
\frac{1}{N_d}
\sum_{\beta\in\mathrm{data}} \frac{\partial \log(p_\beta)}{\partial w_{ij}}
= \big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
$$
updating the weight matrix
$$
\fbox{$\phantom{\big|}\displaystyle
\tau_w \frac{d}{dt}w_{ij} =
\big\langle s_is_j\big\rangle_{\mathrm{data}} -\big\langle s_i s_j \big\rangle_{\mathrm{thermal}}
\phantom{\big|}$}
$$
- learning rule is
- local (biologically plausible)
- greedy (local and not global optimization)
- equivalently for the magnetic field $\ \ h_i$
$$
\tau_b \frac{d}{dt}h_{i} =
\big\langle s_i\big\rangle_{\mathrm{data}} -\big\langle s_i \big\rangle_{\mathrm{thermal}}
$$
Kullback-Leibler divergence
- two distribution functions
$$
p_\alpha,\,q_\alpha\ge 0,
\qquad\quad
\sum_\alpha p_\alpha = 1= \sum_\alpha q_\alpha
$$
- positivity and normalization $\ \ \to$
the standard scalar product is not a suitable metric
similarity measure
$$
\fbox{$\phantom{\big|}\displaystyle
K(p;q) = \sum_\alpha p_\alpha\log\left(\frac{p_\alpha}{q_\alpha}\right)
\phantom{\big|}$}
\qquad\quad
K(p;q) \ge 0
\qquad\quad\mbox{(strict)}
$$
- positive because $\ \log \ $ is concave
- vanishes only if $\ \ p_\alpha=q_\alpha, \qquad \forall\alpha$
- asymmetric
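- small numerical example (own numbers) for positivity and asymmetry
import numpy as np
p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])
K_pq = np.sum(p*np.log(p/q))               # K(p;q) ~ 0.37
K_qp = np.sum(q*np.log(q/p))               # K(q;p) ~ 0.51, different value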
KL divergence for Boltzmann machine
$$
p_\alpha\ / \ q_\alpha \qquad\quad \mbox{(Boltzmann machine / training data)}
$$
- interchanged Kullback-Leibler divergence
$$
K(q;p) = \sum_\alpha
q_\alpha\log\left(\frac{q_\alpha}{p_\alpha}\right) =
\underbrace{\sum_\alpha q_\alpha\log(q_\alpha)}_{-H[q]}
- \sum_\alpha q_\alpha\log(p_\alpha)
$$
- $\ H[q] \ $ entropy of data, independent of $ \ \ w_{ij}$
- $-\sum_\alpha q_\alpha\log(p_\alpha)\ $ cross entropy
- minimizing $ \ \ K(q;p)$
corresponds to maximizing
$$
\sum_\alpha q_\alpha\log(p_\alpha) = \frac{1}{N_d}\sum_{\beta\in\mathrm{data}} \log(p_\beta)
\qquad\quad\mbox{(log-likelihood of the data; $q_\alpha$: empirical data distribution)}
$$
with respect to the synaptic weights $\ \ w_{ij}$
- such that $\ \ \big\langle s_i s_j \big\rangle_{\mathrm{thermal}} \to
\big\langle s_is_j\big\rangle_{\mathrm{data}} $
Boltzmann machines encode data statistics
restricted Boltzmann machines (RBM)
- training of the standard Boltzmann machine does not converge
for complex problems
statistical independence
- separate the units into hidden and visible layers
- no intra-layer connections (bipartite graph)
- all units of one layer are statistically
independent if the other layer is frozen
- the probability distribution factorizes for
statistically independent processes
aim
- fixed points of the restricted Boltzmann machine
- visible units $\ \ \to \ \ $ data
- hidden units $\ \ \to \ \ $ contextual information
- e.g. movie ratings
- visible units: ratings for individual movies (data)
- hidden units: unsupervised movie classification
(romance, drama, science-fiction, ..)
joint Boltzmann distribution
- variables $\ \ \mathbf{v}, \mathbf{h}$
- magnetic fields $\ \ \mathbf{b}, \mathbf{c}$
- energy
$\qquad\quad
E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_iw_{ij}h_j - \sum_i b_iv_i - \sum_j c_j h_j
$
- no factor 1/2 (no double counting)
joint PDF
$\qquad\quad
p(\mathbf{v}, \mathbf{h}) =
\frac{1}{Z} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\},
\quad\qquad
Z = \sum_{\mathbf{v},\mathbf{h}} \exp\big\{- E(\mathbf{v}, \mathbf{h})\big\}
$
- the joint PDF $p(\mathbf{v}, \mathbf{h})\ $ is the probability
to find the hidden system in
state $ \ \mathbf{h} \ $
and, at the same time, the visible system in state $ \ \mathbf{v} \ $
statistical independence
- hidden units are statistically independent
for clamped visible units (and vice versa)
- total energy
$\qquad\quad
E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_iw_{ij}h_j - \sum_i b_iv_i - \sum_j c_j h_j
$
- energy difference between $\ \ h_j=1 \ \ $ and $\ \ h_j=0$
$\qquad\quad
E(\mathbf{v},1)- E(\mathbf{v},0)
= -\sum_{i} v_iw_{ij}- c_j$
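- brute-force check (own toy parameters) that $p(\mathbf{h}|\mathbf{v})$
factorizes into single-unit terms, anticipating the boxed result
$p_j(1)=\sigma(\epsilon_j)$ below
# toy check: the conditional hidden distribution factorizes
import itertools
import numpy as np
rng = np.random.default_rng(1)
n_vis, n_hid = 3, 2
w = rng.normal(0.0, 1.0, (n_vis, n_hid))
b = rng.normal(0.0, 1.0, n_vis)                  # visible bias
c = rng.normal(0.0, 1.0, n_hid)                  # hidden bias
def E(v, h):                                     # E(v,h) = -v.w.h - b.v - c.h
    return -v @ w @ h - b @ v - c @ h
v  = np.array([1.0, 0.0, 1.0])                   # clamped visible units
hs = [np.array(hh, float) for hh in itertools.product([0, 1], repeat=n_hid)]
p_joint = np.array([np.exp(-E(v, hh)) for hh in hs])
p_cond  = p_joint/p_joint.sum()                  # exact p(h|v)
sig = 1.0/(1.0 + np.exp(-(v @ w + c)))           # p(h_j=1|v) = sigma(eps_j)
p_factor = np.array([np.prod(sig**hh*(1.0-sig)**(1.0-hh)) for hh in hs])
print(np.allclose(p_cond, p_factor))             # True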
Glauber dynamics
$$
\fbox{$\phantom{\big|}\displaystyle
P(0\to1)=
\frac{1}{1+\mbox{e}^{\,E(\mathbf{v},1)-E(\mathbf{v},0)}}
\phantom{\big|}$}\,,
\qquad\quad
P(\alpha\to\beta)=
\frac{1}{1+\mbox{e}^{\,E_\beta-E_\alpha}}
$$
- transition rates, sigmoidal $\sigma()$
$$
P(0\to1)=\sigma(\epsilon_j),
\qquad\quad
P(1\to0)=1-\sigma(\epsilon_j),
\qquad\quad
\epsilon_j = \sum_{i} v_iw_{ij} + c_j
$$
$$
p_j(0)P(0\to1)= p_j(1)P(1\to0),
\qquad\quad
\fbox{$\phantom{\big|}\displaystyle
p_j(1) =\sigma(\epsilon_j)
\phantom{\big|}$}
$$
- analytic result
$\hat{=} \ $ deterministic neuron
no need for (numerical) statistical sampling
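- in practice one evaluates $\ p_j(1)=\sigma(\epsilon_j)\ $ directly and, when
needed, draws 0/1 samples from it; minimal sketch (assumed shapes, own helper names)
# sketch: conditional sampling of one layer given the other
import numpy as np
def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))
def sample_hidden(v, w, c, rng):           # p(h_j=1|v) = sigma(v.w + c)_j
    p_h = sigmoid(v @ w + c)               # w has shape (n_vis, n_hid)
    return p_h, (rng.random(p_h.shape) < p_h)*1.0
def sample_visible(h, w, b, rng):          # p(v_i=1|h) = sigma(w.h + b)_i
    p_v = sigmoid(w @ h + b)
    return p_v, (rng.random(p_v.shape) < p_v)*1.0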
training RBMs
- maximizing the log-likelihood $\ \ \log\big(p(\mathbf{v})\big) \ \ $ of the training data
- possible, somewhat elaborate
contrastive divergence
- $t=0$ : start with a training vector on the visible units
- update all hidden units in parallel
- measure $\ \langle v_i h_j\rangle^0$
- $t=1$ : update the visible units in parallel
$\to \ $ 'reconstruction'
- update the hidden units again
- measure $\ \langle v_i h_j\rangle^1$
- stationary, if visible units did not change
$$
\tau_w \frac{d}{dt} w_{ij} = \langle v_i h_j\rangle^0 -
\langle v_i h_j\rangle^1
$$
- repeat for all training vectors
- efficient RBMs use mixtures of methods
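- minimal sketch (own helper functions, 0/1 units, assumed shapes) of one CD-1
update for a single training vector, following the steps above
# sketch: one contrastive-divergence (CD-1) update
import numpy as np
def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))
def cd1_update(v0, w, b, c, rng, eta=0.05):
    """v0: 0/1 training vector, w: (n_vis, n_hid), b/c: visible/hidden bias"""
    p_h0 = sigmoid(v0 @ w + c)                              # t=0: hidden given data
    h0   = (rng.random(p_h0.shape) < p_h0)*1.0              # 0/1 sample
    v1   = (rng.random(b.shape) < sigmoid(w @ h0 + b))*1.0  # 'reconstruction'
    p_h1 = sigmoid(v1 @ w + c)                              # hidden units again
    w   += eta*(np.outer(v0, p_h0) - np.outer(v1, p_h1))    # <v h>^0 - <v h>^1
    b   += eta*(v0 - v1)
    c   += eta*(p_h0 - p_h1)
    return w, b, c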
RBM code
- TorchVision common standard datasets
:: contains utilities for loading datasets, and
for transforming images to PyTorch tensors
- MNIST database for handwritten digits
- -batch size- is the number of training samples per gradient update;
an -epoch- is one pass through the full training set
-
x = Relu(Sign(p-rand)) tensor generalization of
x = 1 if (p>rand) else 0
-
F.linear(x,W,b) Wx+b, with weight matrix W
tensor.mv(vec)
product of matrix (tensor) and vector (vec)
tensor.mean() nomen est omen
- free energy $\ F$
temperature $\ k_B T\to1$
$$
F = - k_B T\,\log(Z),\quad\qquad
Z = \sum_{\alpha} \mbox{e}^{-E_\alpha},
\qquad\quad
E = -\frac{1}{2}\sum_{i,j} w_{ij} \, s_i \, s_j - \sum_i h_i \, s_i
$$
- Boltzmann factor: visible bias is multiplicative
free energy: visible bias is additive
- minimize
freeEnergy(data) - freeEnergy(reconstructed)
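- for the RBM, the sum over the 0/1 hidden units can be carried out
analytically; the resulting free energy of a clamped visible configuration
is what free_energy() evaluates in the code below
$$
F(\mathbf{v}) = -\log \sum_{\mathbf{h}} \mbox{e}^{-E(\mathbf{v},\mathbf{h})}
= -\sum_i b_iv_i - \sum_j \log\Big(1+\exp\Big(\sum_i v_iw_{ij} + c_j\Big)\Big)
$$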
#!/usr/bin/env python3
# restricted Boltzmann machine; source:
# https://blog.paperspace.com/beginners-guide-to-boltzmann-machines-pytorch/
# torchvision datasets
# https://pytorch.org/vision/stable/datasets.html
import numpy as np
import torch
import torch.utils.data
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torchvision.utils import make_grid , save_image
import matplotlib.pyplot as plt
# loading the MNIST dataset
# 'train_loader' is an instance of a DataLoader
# 'test_loader' for testing (currently not used)
batch_size = 64 # samples per mini-batch (gradient update)
train_loader = torch.utils.data.DataLoader(
datasets.MNIST( # which dataset to load
'./data', # store in local directory
train=True,
download = True, # do download in dir
transform = transforms.Compose([transforms.ToTensor()])
),
batch_size=batch_size )
test_loader = torch.utils.data.DataLoader(
datasets.MNIST('./data', train=False,
transform=transforms.Compose([transforms.ToTensor()])
), batch_size=batch_size )
#
# defining the restricted Boltzmann machine
# 'vis' : visible
# 'hin' : hidden
#
class RBM(nn.Module):
    def __init__(self, n_vis=784, n_hin=500, k=5):
        super(RBM, self).__init__()
        self.W = nn.Parameter(torch.randn(n_hin,n_vis)*1e-2)
        self.v_bias = nn.Parameter(torch.zeros(n_vis))
        self.h_bias = nn.Parameter(torch.zeros(n_hin))
        self.k = k # iteration depth
    def sample_from_p(self, p): # p -> 0/1 stochastically
        return F.relu(torch.sign(p-torch.rand(p.size())))
    def v_to_h(self, v):
        p_h = torch.sigmoid(F.linear(v, self.W, self.h_bias)) # update hidden
        sample_h = self.sample_from_p(p_h) # 0/1 sample
        return p_h, sample_h
    def h_to_v(self, h): # transpose W for other direction
        p_v = torch.sigmoid(F.linear(h, self.W.t(), self.v_bias))
        sample_v = self.sample_from_p(p_v)
        return p_v, sample_v
    def forward(self, v):
        pre_h1, h1 = self.v_to_h(v)
        h_ = h1
        for _ in range(self.k): # consistency loop
            pre_v_, v_ = self.h_to_v(h_) # with 0/1 samples
            pre_h_, h_ = self.v_to_h(v_)
        return v, v_ # return input, 0/1 reconstruction
    def free_energy(self, v):
        """free energy of a visible configuration v
        F(v) = -sum_i b_i v_i - sum_j log(1+exp(sum_i v_i w_ij + c_j))
        wx_b        : inputs to the hidden units
        hidden_term : 0/1 hidden units summed out analytically, 1+exp()
        v           : visible activity, data / reconstructed
        """
        vbias_term = v.mv(self.v_bias)
        wx_b = F.linear(v, self.W, self.h_bias)
        hidden_term = wx_b.exp().add(1).log().sum(1)
        return (-hidden_term - vbias_term).mean()
#
# define model, register optimizer
# SGD: stochastic gradient descent
#
model = RBM(k=1)
train_op = optim.SGD(model.parameters(),0.1)
#
# training model
#
for epoch in range(2):
    loss_ = []
    for _, (data,target) in enumerate(train_loader):
        data = data.view(-1, 784)
        sample_data = data.bernoulli()   # stochastic 0/1 pixels from grey values
        v, v1 = model(sample_data)
        loss = model.free_energy(v) - model.free_energy(v1)
        loss_.append(loss.item())        # store the scalar loss value
        train_op.zero_grad()
        loss.backward()
        train_op.step()
    print("Training loss for {} epoch: {}".format(epoch, np.mean(loss_)))
#
# storing images
#
def show_and_save(file_id,img,fig,position):
    npimg = np.transpose(img.numpy(),(1,2,0))
    fileName = "RBM_out_" + file_id + ".png"
    fig.add_subplot(1, 2, position)
    plt.title(file_id)
    plt.axis('off')
    plt.imshow(npimg)
    plt.imsave(fileName,npimg)
#
# visualising training outputs
#
fig = plt.figure(figsize=(12, 6))
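# note: the last MNIST training batch (60000 = 937*64 + 32) holds 32 samples,
# hence 'v' and 'v1' from the final loop iteration match view(32,1,28,28)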
show_and_save("real" ,make_grid( v.view(32,1,28,28).data),fig,1)
show_and_save("generated",make_grid(v1.view(32,1,28,28).data),fig,2)
plt.show()