Machine Learning Primer -- Basics




Claudius Gros, WS 2024/25

Institut für theoretische Physik
Goethe-University Frankfurt a.M.

Information Theory

probability distributions


examples

power laws

normalized probability distribution functions
may have a diverging mean and/or variance
#!/usr/bin/env python3
# plotting a few distribution functions

import math                      # math
import numpy as np               # numerics
import matplotlib.pyplot as plt  # plotting

max_X   = 5.0
nPoints = 100

xData   = [ii*max_X/nPoints for ii in range(nPoints)]

yNormal  = [math.exp(-0.5*xx*xx) for xx in xData]
yExp     = [math.exp(-xx)        for xx in xData]
yTsallis = [1.0/(1.0+xx)**2      for xx in xData]

plt.plot(xData, yNormal,  "ob", markersize=4, label="normal")
plt.plot(xData, yExp,     "og", markersize=4, label="exponential")
plt.plot(xData, yTsallis, "or", markersize=4, label="Tsallis-Pareto")
#
plt.xlim(0, max_X)
plt.ylim(0, 1.05)                 # axis limits
plt.legend(loc="upper right")     # legend location
plt.savefig('foo.svg')            # export figure
plt.show()
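The diverging mean of the Tsallis-Pareto distribution can be checked numerically; a minimal sketch using inverse-transform sampling (function name, seed, and sample sizes are illustrative, not from the lecture):

```python
#!/usr/bin/env python3
# sampling the Tsallis-Pareto distribution p(x) = 1/(1+x)^2
# via inverse transform: CDF F(x) = x/(1+x)  =>  x = u/(1-u)

import numpy as np

rng = np.random.default_rng(42)

def tsallis_pareto_sample(n):
    """n samples from p(x) = 1/(1+x)^2 on [0, infinity)"""
    u = rng.uniform(size=n)
    return u/(1.0-u)

# the running sample mean does not converge (diverging first moment),
# while the median stays close to its true value of 1
for n in [10**2, 10**4, 10**6]:
    s = tsallis_pareto_sample(n)
    print(f'{n:8d}  mean = {np.mean(s):10.2f}  median = {np.median(s):5.2f}')
```

The sample mean typically keeps growing with the sample size while the median is stable, the hallmark of a power-law tail with a diverging mean.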

central limit theorem



the probability distribution of a
sum of statistically independent
stochastic variables
converges to a Gaussian


law of large numbers

the variance of a
sum of statistically independent
stochastic variables
is additive


#!/usr/bin/env python3

# sum of uniform distributions, binning with 
# np.histogram()

import numpy as np               # numerics
import matplotlib.pyplot as plt  # plotting

nData = 10000
nBins = 20

# plotting at bin midpoints
xData = [(iBin+0.5)/nBins for iBin in range(nBins)]

# (adding) uniform distributions in [0,1]
data_1 = [ np.random.uniform()      for _ in range(nData)]
data_2 = [(np.random.uniform()+\
           np.random.uniform())/2.0 for _ in range(nData)]
data_3 = [(np.random.uniform()+\
           np.random.uniform()+\
           np.random.uniform())/3.0 for _ in range(nData)]

hist_1, bins = np.histogram(data_1, bins=nBins, range=(0.0,1.0))
hist_2, bins = np.histogram(data_2, bins=nBins, range=(0.0,1.0))
hist_3, bins = np.histogram(data_3, bins=nBins, range=(0.0,1.0))
#print("Bin Edges", bins)
#print("Histogram Counts", hist_1)

y_one   = hist_1*nBins*1.0/nData      # normalize
y_two   = hist_2*nBins*1.0/nData    
y_three = hist_3*nBins*1.0/nData    

plt.plot(xData, y_one,   "ob", markersize=4, label="flat-one")
plt.plot(xData, y_two,   "og", markersize=4, label="flat-two")
plt.plot(xData, y_three, "or", markersize=4, label="flat-three")
plt.plot(xData, y_three,  "r")
#
plt.xlim(0, 1)
plt.ylim(0, 2.5)
plt.legend(loc="upper right")    
plt.savefig('foo.svg')          
plt.show()
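The additivity of the variance stated above can be verified directly: the variance of a sum of $k$ independent uniform $[0,1]$ variables should equal $k$ times the single-variable variance $1/12$. A minimal sketch (sample size and seed are illustrative):

```python
#!/usr/bin/env python3
# additivity of the variance for sums of k independent
# uniform [0,1] variables: Var = k/12

import numpy as np

rng = np.random.default_rng(123)
nSamples = 200000

for k in [1, 2, 3, 4]:
    sums = rng.uniform(size=(nSamples, k)).sum(axis=1)
    print(f'k = {k}   Var = {np.var(sums):6.4f}   k/12 = {k/12.0:6.4f}')
```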

conditional probability

Bayes theorem

$$ p(x,y) = p(x|y) p(y) = p(y|x) p(x) \quad\qquad \fbox{$\phantom{\big|} p(y|x) = \frac{p(x|y) p(y)}{p(x)} \phantom{\big|}$} $$
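A minimal numerical check of Bayes theorem for binary variables (the probability values are illustrative, not from the lecture):

```python
#!/usr/bin/env python3
# Bayes theorem: p(y|x) = p(x|y) p(y) / p(x), for binary x and y

p_y1          = 0.3      # prior p(y=1)
p_x1_given_y1 = 0.9      # likelihood p(x=1|y=1)
p_x1_given_y0 = 0.2      # likelihood p(x=1|y=0)

# marginal p(x=1) via the sum rule
p_x1 = p_x1_given_y1*p_y1 + p_x1_given_y0*(1.0-p_y1)

# posterior via Bayes theorem
p_y1_given_x1 = p_x1_given_y1*p_y1/p_x1

print(f'p(y=1|x=1) = {p_y1_given_x1:.4f}')
```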

Bayesian inference

$$ \fbox{$\phantom{\big|} p(y|x) = \frac{p(x|y) p(y)}{p(x)} \phantom{\big|}$} $$
$$ \left(\begin{array}{c} \mathbf{prior} \\ p(y) \end{array}\right) \ \longrightarrow\ \left(\begin{array}{c} \mathbf{data} \\ x \end{array}\right) \ \longrightarrow\ \left|\begin{array}{c} \mathrm{Bayes} \\ \mathrm{theorem} \end{array}\right| \ \longrightarrow\ \left(\begin{array}{c} \mathbf{posterior} \\ p(y|x) \end{array}\right) \ \longrightarrow\ \left|\begin{array}{c} \mathrm{new} \\ \mathrm{prior} \end{array}\right| $$

Bayesian coin flips

#!/usr/bin/env python3

import numpy as np

# ***
# *** global variables
# ***
trueBias = 0.3                             # true coin bias
nX       = 11                              # numerical discretization
xx = [i*1.0/(nX-1.0) for i in range(nX)]   
pp = [1.0/nX  for _ in range(nX)]          # starting prior

# ***
# *** Bayesian updating
# ***
def updatePrior():
  """Bayesian inference for coin flips;
     marginal correspond to normalization of posterior"""
#
  evidence = 0.0 if (np.random.uniform()>trueBias) else 1.0   # coin flip
  for iX in range(nX):     
    prior = pp[iX]
    likelihood = xx[iX] if (evidence==1) else 1.0-xx[iX]
    pp[iX] = likelihood*prior              # Bayes theorem
#
  norm = sum(pp)
  for iX in range(nX):                     # normalization of posterior
    pp[iX] = pp[iX]/norm
#
  return evidence
    
# ***
# *** main
# ***

ee = 0.0
for _ in range(200):                      # iteration over observations
  for iX in range(nX):     
    print(f'{pp[iX]:6.3f}', end="")
  print(f' | {ee:3.1f}')
  ee = updatePrior()                     
#
for iX in range(nX):                      # support points
  print(f'{xx[iX]:4.1f}  ', end="")
print()

Bayesian statistics

case study


maximum likelihood

maximize probability to
observe a given event
$\qquad\qquad\fbox{$\displaystyle \ \ \frac{\partial}{\partial\vartheta}\,\mathcal{L}(\vartheta|x) = 0\ \ $}$

log likelihood

batch / online processing


entropy

$$ \mathrm{high\ entropy}\ \hat{=}\ \left\{ \begin{array}{rcl} \mathrm{large\ disorder} & & \mathrm{physics}\\[0.1ex] \mathrm{high\ information\ content} &\phantom{s}& \mathrm{information\ theory}\\ \end{array} \right. $$

extensive information

Shannon entropy

$$ \fbox{$\displaystyle\ \ H[p] = -\sum\nolimits_{x_i} p(x_i)\,\log_b(p(x_i)) = -\langle\,\log_b(p)\,\rangle \ \ $} \qquad\quad H[p] \ge 0 $$
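Evaluating the Shannon entropy numerically, with base $b=2$ (bits); a minimal sketch:

```python
#!/usr/bin/env python3
# Shannon entropy H[p] = -sum_i p_i log_b(p_i)

import math

def entropy(p, b=2.0):
    """entropy in base b; terms with p_i = 0 contribute zero"""
    return -sum(pi*math.log(pi, b) for pi in p if pi > 0.0)

print(entropy([0.5, 0.5]))     # fair coin: 1 bit
print(entropy([1.0, 0.0]))     # deterministic outcome: 0 bits
print(entropy([0.25]*4))       # uniform over 4 states: 2 bits
```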

Shannon source coding theorem

Consider a random variable $x$ with a PDF $p(x)$ and entropy $H[p]$.

For $N\to\infty$, the cumulative entropy $N H[p]$ is a lower bound
for the number of bits necessary for the lossless compression
of $N$ independent samples drawn from $p(x)$.

example
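A numerical illustration (the bias value is chosen for illustration): for a coin with $p = (0.9, 0.1)$ the entropy is $H \approx 0.469$ bits per flip, so $N$ flips can in principle be compressed losslessly to about $0.469\,N$ bits, well below the naive $N$ bits.

```python
#!/usr/bin/env python3
# Shannon source coding bound N*H[p] for a biased coin

import math

p = [0.9, 0.1]                             # biased coin
H = -sum(pi*math.log2(pi) for pi in p)     # entropy in bits

N = 1000
print(f'entropy per flip : {H:.4f} bits')
print(f'compression bound: {N*H:.1f} bits   (naive: {N} bits)')
```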


maximal entropy distributions


$$ H[p] = -\sum_{\alpha=1}^M p_\alpha\log(p_\alpha) = -M\,\frac{1}{M}\log(1/M) = \log(M),\qquad\quad p_\alpha = \frac{1}{M} $$
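A quick numerical check that the uniform distribution attains $H = \log(M)$, and that a non-uniform distribution falls below it (the skewed distribution is an arbitrary example):

```python
#!/usr/bin/env python3
# the uniform distribution maximizes the entropy, H = log(M)

import math

def H(p):
    return -sum(pi*math.log(pi) for pi in p if pi > 0.0)

M = 8
uniform = [1.0/M]*M
print(H(uniform), math.log(M))             # both equal log(M)

skewed = [0.5] + [0.5/(M-1)]*(M-1)         # non-uniform comparison
print(H(skewed) < H(uniform))              # uniform is maximal
```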

entropy in physics

statistical physics

$$ H[p]=-k_B\big\langle \ln(p)\big\rangle, \qquad\quad \beta=\frac{1}{k_BT} $$ $$ \begin{array}{rcl} \mbox{micro-canonical ensemble} && \mbox{maximal entropy} \\[1ex] \mbox{canonical ensemble} && \mbox{maximal entropy, given an energy E} \\ && \mbox{Boltzmann distribution} \ \ \fbox{$\phantom{\big|} p_\alpha\sim\exp(-\beta E_\alpha) \phantom{\big|}$} \\[1ex] \mbox{grand-canonical ensemble} && \mbox{maximal entropy, given energy E, particle number N} \\ && \mbox{Boltzmann distribution} \ \ \fbox{$\phantom{\big|} p_\alpha\sim\exp(-\beta (E_\alpha-\mu N)) \phantom{\big|}$} \\ \end{array} $$

entropy as a measure of disorder

$$ H(T)\quad\to\quad \left\{\begin{array}{rcl} 0 &\mbox{for}& T\to0 \\[1ex] Nk_B\ln(f) &\mbox{for}& T\to\infty \end{array}\right. $$

Kullback-Leibler divergence

$\displaystyle K[p;q] = \int p(x)\log\left(\frac{p(x)}{q(x)}\right)dx$

example

$$ p(0) = 1/2 = p(1) \quad\quad q(0) = \alpha\quad\quad q(1) = 1-\alpha $$
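Evaluating the Kullback-Leibler divergence for this example; $K$ vanishes at $\alpha = 1/2$, where $q$ equals $p$, and grows as $q$ deviates from $p$:

```python
#!/usr/bin/env python3
# K[p;q] for p = (1/2, 1/2) and q = (alpha, 1-alpha)

import math

def K(alpha):
    return 0.5*math.log(0.5/alpha) + 0.5*math.log(0.5/(1.0-alpha))

for alpha in [0.5, 0.3, 0.1]:
    print(f'alpha = {alpha:3.1f}   K = {K(alpha):.4f}')
```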

cross-entropy loss

$$ K[\mathbf{P}_t;\mathbf{P}] = \sum_i P_{t,i}\log\left( {P_{t,i}\over P_i}\right) = \sum_i P_{t,i}\log\left( P_{t,i}\right) - \sum_i P_{t,i}\log\left(P_i\right) $$

cross-entropy loss

$\displaystyle L = -\sum\nolimits_i P_{t,i}\log\left(P_i\right) $
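For a one-hot target the cross-entropy loss reduces to $-\log(P_i)$ of the probability the model assigns to the true class; a minimal sketch (the probability values are illustrative):

```python
#!/usr/bin/env python3
# cross-entropy loss L = -sum_i P_t,i log(P_i)

import math

P_target = [0.0, 1.0, 0.0]       # one-hot target, true class: index 1
P_model  = [0.1, 0.7, 0.2]       # model probabilities (softmax output)

L = -sum(pt*math.log(p) for pt, p in zip(P_target, P_model) if pt > 0.0)
print(f'cross-entropy loss: {L:.4f}')     # equals -log(0.7)
```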