Advanced Introduction to C++, Scientific Computing and Machine Learning
Claudius Gros, SS 2024
Institute for Theoretical Physics
Goethe-University Frankfurt a.M.
Neural Networks
neurons in the brain
a brain full of neurons
- 80-100 billion
- 1000-10000 connections (synapses) per neuron
binary information transmission
- spike $\ \hat{=} \ $ electrical solitary wave
: action potential
synapses are chemical
- pre-synaptic spike $\ \ \to \ \ $ vesicles pop
- neurotransmitter $\ \to \ $ receptors
: glutamate (excitatory)
: GABA (inhibitory)
all constituent proteins recycled
(days, weeks), functional stationarity
artificial neurons
rate encoding
- neural activity $\ \ y_i\in[-1,1]$
$\hat{=} \ $ normalized number of spikes per unit time
$\qquad\quad
y_i = \sigma(x_i-b_i),\qquad\quad x_i=\sum_j w_{ij}y_j
$
- membrane potential $\ \ x_i$
- threshold $\ \ b_i$
- transfer function
$\qquad\qquad\qquad\quad \fbox{$\phantom{\big|}
y(z)=\tanh(z)\phantom{\big|}$}$
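A minimal C++ sketch of this rate-encoding neuron (function and variable names are illustrative, not from the lecture code):

#include <cmath>
#include <vector>

// rate-encoding neuron: y_i = tanh(x_i - b_i), with x_i = sum_j w_ij y_j
double neuronOutput(const std::vector<double>& w,  // incoming weights w_ij
                    const std::vector<double>& y,  // pre-synaptic activities y_j
                    double b)                      // threshold b_i
{
    double x = 0.0;                                // membrane potential
    for (std::size_t j = 0; j < w.size(); ++j)
        x += w[j] * y[j];
    return std::tanh(x - b);                       // transfer function
}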
synaptic weights
- weighted connectivity matrix $\ \ w_{ij}$
: post-synaptic: $ \ \ i\ \leftarrow j \ \ $ pre-synaptic
- synaptic plasticity
: find optimal $\ \ w_{ij}$
: supervised / unsupervised
linear classifier
- hyperplane
$$
\fbox{$\phantom{\big|}
\sum_j w_{ij} y_j = b_i
\phantom{\big|}$}
\qquad\quad
\mathbf{w}\cdot\mathbf{y}=b
$$
in the space of pre-synaptic activities $ \ \ y_j$
: every neuron $\ \ i$
- soft classification $ \ \ y_i=\sigma(x_i-b_i)$
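The hyperplane condition can be checked directly; a hedged C++ sketch with an arbitrary example weight vector and threshold:

#include <iostream>
#include <vector>

// on which side of the hyperplane  w . y = b  does a point y lie?
int main()
{
    std::vector<double> w = {1.0, 1.0};            // weights (normal vector of the hyperplane)
    double b = 0.5;                                // threshold
    std::vector<std::vector<double>> points = {{0.0, 0.0}, {1.0, 1.0}};
    for (const auto& y : points)
    {
        double x = 0.0;
        for (std::size_t j = 0; j < y.size(); ++j)
            x += w[j] * y[j];
        std::cout << "(" << y[0] << ", " << y[1] << ") -> "
                  << (x > b ? "above" : "below") << " the hyperplane\n";
    }
    return 0;
}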
unsupervised learning
what fires together,
wires together
Hebbian learning
- unsupervised - no teacher
- extraction of statistical correlations
from the input data stream
- causality
$$
\frac{d}{dt} w_{ij} \sim y_i y_j
$$
linear model
$\qquad\quad
\fbox{$\phantom{\big|}
\left\langle \frac{d}{dt}
w_{ij}\right\rangle \sim
\sum_k w_{ik}\big\langle y_k y_j\big\rangle
\phantom{\big|}$}\,,
\qquad\quad
y_i\sim x_i = \sum_k w_{ik}y_k
$
- covariance matrix of inputs
$$
S_{kj} = \big\langle (y_k-\overline{y}_k) (y_j-\overline{y}_j)\big\rangle =
\big\langle y_k y_j\big\rangle,
\qquad\quad
\overline{y}_k\to0
$$
- generalizable to a finite mean $ \ \ \overline{y}_k$
- limit growth via overall normalization $\ \ \sum_j w_{ij}^2\to\mbox{const.}$
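A minimal C++ sketch of one Hebbian update for a single linear neuron, with the overall weight normalization that limits growth (the learning rate eps and all names are illustrative):

#include <cmath>
#include <vector>

// Hebbian step  dw_j ~ y * y_j  with  y = sum_k w_k y_k  (linear model),
// followed by the normalization  sum_j w_j^2 -> 1  that limits growth
void hebbStep(std::vector<double>& w, const std::vector<double>& yIn, double eps)
{
    double y = 0.0;                                // post-synaptic activity
    for (std::size_t j = 0; j < w.size(); ++j)
        y += w[j] * yIn[j];

    for (std::size_t j = 0; j < w.size(); ++j)     // Hebbian growth ~ y_i y_j
        w[j] += eps * y * yIn[j];

    double norm = 0.0;                             // renormalize |w| = 1
    for (double wj : w)
        norm += wj * wj;
    norm = std::sqrt(norm);
    for (double& wj : w)
        wj /= norm;
}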
principal component analysis
- Hebbian learning
$$
\left\langle \frac{d}{dt}
w_{ij}\right\rangle \sim
\sum_k w_{ik}S_{kj},
\qquad\quad
\fbox{$\phantom{\big|}\displaystyle
\tau_w\frac{d}{dt} \hat{w} = \hat{w}\cdot\hat{S}
\phantom{\big|}$}\,,
\qquad\quad
\big(\hat{w}\big)_{ij} = w_{ij}
$$
with
$$
\big(\hat{S}\big)_{kj} = \big\langle (y_k-\overline{y}_k) (y_j-\overline{y}_j)\big\rangle,
\qquad\quad
\sum_j w_{ij}^2\to\mbox{const.}
$$
competitive growth of components
- diagonalize input-covariance matrix $\ \ \hat{S}$
$$
\hat{S} = \sum_{\{\lambda\}} \lambda \,
\mathbf{e}_\lambda^{\phantom{T}} \mathbf{e}_\lambda^T,
\qquad\quad
\hat{S}\, \mathbf{e}_\gamma= \gamma\, \mathbf{e}_\gamma,
\qquad\quad
\mathbf{e}_\lambda\cdot\mathbf{e}_\gamma=
\mathbf{e}_\lambda^T\mathbf{e}_\gamma^{\phantom{T}} = \delta_{\lambda,\gamma}
\qquad\quad
$$
- growth of weight components
$$
\tau_w\frac{d}{dt}\big( \hat{w}\cdot\mathbf{e}_\gamma^{\phantom{T}}\big)
= \hat{w}\cdot\hat{S}\cdot\mathbf{e}_\gamma^{\phantom{T}}
= \sum_{\{\lambda\}}\lambda\, \hat{w}\cdot \mathbf{e}_\lambda^{\phantom{T}}
\underbrace{
\mathbf{e}_\lambda^T \cdot\mathbf{e}_\gamma^{\phantom{T}}
}_{\delta_{\lambda,\gamma}},
\qquad\quad
\fbox{$\phantom{\big|}\displaystyle
\tau_w\frac{d}{dt}\big( \hat{w}\cdot\mathbf{e}_\gamma^{\phantom{T}}\big)
= \gamma\,\big( \hat{w}\cdot\mathbf{e}_\gamma^{\phantom{T}}\big)
\phantom{\big|}$}
$$
- normalization $\ \ \to \ \ $ competitive growth
- Hebbian learning finds the eigendirection of the
covariance matrix with the largest eigenvalue $\ \gamma\ $, the
principal component,
viz the direction of input activities with the largest variance
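The averaged dynamics $\tau_w\frac{d}{dt}\hat{w}=\hat{w}\cdot\hat{S}$ together with the normalization amounts to power iteration; a small C++ sketch with an illustrative $2\times2$ covariance matrix (not lecture data) converges to its principal component:

#include <cmath>
#include <iostream>
#include <vector>

// repeated  w <- w S / |w S|  converges to the eigenvector of S with the
// largest eigenvalue, i.e. the principal component
int main()
{
    std::vector<std::vector<double>> S = {{2.0, 0.5}, {0.5, 1.0}};
    std::vector<double> w = {1.0, 0.0};            // arbitrary initial weights

    for (int t = 0; t < 100; ++t)
    {
        std::vector<double> wS(2, 0.0);
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                wS[j] += w[k] * S[k][j];           // (w S)_j = sum_k w_k S_kj
        double norm = std::sqrt(wS[0]*wS[0] + wS[1]*wS[1]);
        w[0] = wS[0] / norm;                       // competitive growth via normalization
        w[1] = wS[1] / norm;
    }
    std::cout << "principal component: (" << w[0] << ", " << w[1] << ")\n";
    return 0;
}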
neural networks
feed-forward
- organized in layers
$
\fbox{$\phantom{\big|}$ input $\phantom{\big|}$}
\quad\to\quad
\fbox{$\phantom{\big|}$ hidden $\phantom{\big|}$}
\quad\to\quad\dots\quad\to\quad
\fbox{$\phantom{\big|}$ output $\phantom{\big|}$}
$
$\hat{=}\ \ $ cortical layers (brain)
- directed flow of information
'hidden' means internal
- widespread use
recurrent
- closed dynamical system
- fixpoint attractor
- limit cycles
- chaotic attractors
if deterministic
- attractors may correspond to
- memories (fixpoints)
- motor primitives (limit cycles)
- ...
- behavior may be
- self-organized (emergent)
- learned
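A toy C++ sketch of a recurrent network as a closed dynamical system; with the arbitrary symmetric weights chosen here the synchronous update converges to a fixpoint attractor:

#include <cmath>
#include <iostream>
#include <vector>

// iterate  y <- tanh(W y);  the final state is a fixpoint attractor ('memory')
int main()
{
    std::vector<std::vector<double>> W = {{0.0, 1.2}, {1.2, 0.0}};
    std::vector<double> y = {0.3, 0.1};            // initial activities

    for (int t = 0; t < 50; ++t)
    {
        std::vector<double> x(2, 0.0);
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                x[i] += W[i][j] * y[j];
        for (int i = 0; i < 2; ++i)
            y[i] = std::tanh(x[i]);
    }
    std::cout << "state after 50 updates: (" << y[0] << ", " << y[1] << ")\n";
    return 0;
}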
perceptron
- single layer
$
\fbox{$\phantom{\big|}$ input $\phantom{\big|}$}
\quad\to\quad
\fbox{$\phantom{\big|}$ output $\phantom{\big|}$}
$
input units (not 'input neurons')
supervised learning
- encode the mapping from input to output
$
\fbox{$\phantom{\big|}\displaystyle
\mathbf{I}_\alpha\ \to\ \mathbf{y}_\alpha
\phantom{\big|}$}\,,
\qquad\quad
\mathbf{y} = \mathbf{y}(\mathbf{x}_\alpha),
\qquad\quad
\mathbf{x}_\alpha = \hat{w}\cdot\mathbf{I}_\alpha
$
for training pairs $\ (\mathbf{I}_\alpha,\mathbf{y}_\alpha)$, where
$
\mathbf{I}=\big(I_1,\,..,\,I_{N_0}\big),
\qquad\quad
\mathbf{y}=\big(y_1,\,..,\,y_{N_1}\big),
\qquad\quad
\big(\hat{w}\big)_{ij} = w_{ij}
$
- minimize
$$
\fbox{$\phantom{\big|}\displaystyle
E = \frac{1}{2}
\big|\mathbf{y}_\alpha-\mathbf{y}(\mathbf{x}_\alpha)\big|^2
\phantom{\big|}$}\,,
\qquad\quad
(\mathbf{y}(\mathbf{x}_\alpha))_i=\sigma\big((\mathbf{x}_\alpha)_i-b_i\big)
=\sigma\left(\sum_j w_{ij}(\mathbf{I}_\alpha)_j-b_i\right)
$$
transfer function $\ \ \sigma(z)$
steepest descent
- gradient for a single input-output pair
$$
\frac{d}{dt}w_{ij} \sim
-\frac{\partial E}{\partial w_{ij}} =
\left[
(\mathbf{y}_\alpha)_i-(\mathbf{y})_i
\right]\,\sigma'(.)
\,(\mathbf{I}_\alpha)_j
$$
- learning stops when $ \ \ \mathbf{y}\to \mathbf{y}_\alpha$
- monotonic transfer function $\ \ \sigma'(.)>0$
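A hedged C++ sketch of one steepest-descent step for a single output neuron with the tanh transfer function (learning rate eps and all names are illustrative):

#include <cmath>
#include <vector>

// one gradient step  dw_j ~ (yTarget - y) sigma'(.) I_j  for sigma(z) = tanh(z)
void trainStep(std::vector<double>& w, double& b,
               const std::vector<double>& I,       // input pattern I_alpha
               double yTarget,                     // target output y_alpha
               double eps)                         // learning rate
{
    double x = 0.0;
    for (std::size_t j = 0; j < w.size(); ++j)
        x += w[j] * I[j];

    double y      = std::tanh(x - b);              // actual output
    double sPrime = 1.0 - y * y;                   // tanh'(z) = 1 - tanh^2(z) > 0
    double delta  = (yTarget - y) * sPrime;

    for (std::size_t j = 0; j < w.size(); ++j)
        w[j] += eps * delta * I[j];                // weight update
    b -= eps * delta;                              // threshold update: dE/db = +delta
}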
the XOR problem
- single neurons are linear classifiers
- XOR not solvable via linear classification
the neural-network winter
- 1970s and 1980s
- universal computation needs XOR
$\Rightarrow\ \ $ (single-layer) neural networks not universal
- deep learning did not work
- problems too simple
- limited training data
- engineering
universality of multilayer perceptrons
superpositions of linear functions
are still linear
- parametrized family of non-linear functions
$\qquad\quad
\begin{array}{rcl}
y_5 &=& \sigma(w_{5,3}y_3+w_{5,4}y_4) \\
&=& \sigma\Big(
w_{5,3}\sigma(w_{3,1}I_1+w_{3,2}I_2) \\
& & \phantom{\sigma}+ w_{5,4}\sigma(w_{4,1}I_1+w_{4,2}I_2)
\Big)
\end{array}
$
linear neurons
- case $\ \ y(x) = a x$
- additional layers do not improve performance
non-linear neurons
- output may be a ridge if
$
w_{3,2}=-w_{3,1}, \qquad\quad w_{4,2}=-w_{4,1},
\qquad\quad b_3\ne b_4$
- 4 hidden-layer neurons representing
two orthogonal ridges make a bump
- bumps are universal
given enough hidden layer neurons, non-linear
neurons can represent any smooth function
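As a concrete illustration, the two-hidden-neuron network $y_5$ from above, written in C++ with the logistic transfer function, explicit thresholds and hand-picked weights (not from the lecture), already represents the XOR function that defeats any single linear classifier:

#include <cmath>
#include <iostream>

double sigma(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// y5 = sigma(w53 y3 + w54 y4 - b5), with y3, y4 as in the text
double y5(double I1, double I2)
{
    double y3 = sigma(20.0*I1 + 20.0*I2 - 10.0);   // roughly I1 OR I2
    double y4 = sigma(20.0*I1 + 20.0*I2 - 30.0);   // roughly I1 AND I2
    return sigma(20.0*y3 - 20.0*y4 - 10.0);        // OR but not AND = XOR
}

int main()
{
    for (int I1 = 0; I1 <= 1; ++I1)
        for (int I2 = 0; I2 <= 1; ++I2)
            std::cout << I1 << " XOR " << I2 << " ~ " << y5(I1, I2) << "\n";
    return 0;
}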
derivative of sigmoidal
- here $\ \ y_i\in[0,1]$
$\qquad\quad
y_i = \sigma(x_i-b_i),
\qquad\quad
\sigma(z) =\frac{1}{1+\mathrm{e}^{-z}}
$
- derivative of sigmoidal
$\qquad\quad
\frac{d\sigma}{dz} =
\frac{\mathrm{e}^{-z}}{(1+\mathrm{e}^{-z})^2}
=\sigma^2(1/\sigma-1)
$
viz
$$\fbox{$\phantom{\big|}\displaystyle
\sigma' = \sigma(1-\sigma) \phantom{\big|}$}\,,
\qquad\quad
\frac{d}{dx} y_i= y_i(1-y_i)
$$
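A quick numerical sanity check of the identity $\sigma'=\sigma(1-\sigma)$ (the test point is chosen arbitrarily):

#include <cmath>
#include <iostream>

double sigma(double z) { return 1.0 / (1.0 + std::exp(-z)); }

int main()
{
    double z = 0.7, h = 1.0e-6;
    double finiteDiff = (sigma(z + h) - sigma(z - h)) / (2.0 * h);  // numerical derivative
    double identity   = sigma(z) * (1.0 - sigma(z));                // sigma (1 - sigma)
    std::cout << finiteDiff << "  vs  " << identity << "\n";        // should agree
    return 0;
}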
backpropagation
- layers $\ \ i,\,j,\, k\,\,\dots$
$
x_i = \sum_j w_{ij} y_j = \sum_j w_{ij} \sigma(x_j-b_j)
=\dots $
training multilayer perceptrons
$\qquad\quad
\frac{\partial y_i}{\partial w_{\alpha\beta}}
= y_i(1-y_i)\sum_j w_{ij} \frac{\partial y_j}{\partial w_{\alpha\beta}}
$
- layer indices $\ \ \alpha,\beta $
- target value $ \ \ \tilde{y}_i$
- gradient of error function
$$
E=\frac{1}{2} \sum_i\big(y_i-\tilde{y}_i\big)^2,
\qquad\quad
\frac{\partial E}{\partial w_{\alpha\beta}} =
\sum_i \big(y_i-\tilde{y}_i\big) \frac{\partial y_i}{\partial w_{\alpha\beta}}
$$
viz
$$
\fbox{$\phantom{\big|}\displaystyle
\frac{\partial E}{\partial w_{\alpha\beta}} =
\sum_i \Delta E_i \frac{\partial y_i}{\partial w_{\alpha\beta}}
\phantom{\big|}$}\,,
\qquad\quad
\Delta E_i=y_i-\tilde{y}_i
$$
with the initial error $\ \ \Delta E_i$
recursive derivatives
$$
\frac{\partial E}{\partial w_{\alpha\beta}}
= \sum_i\big(y_i-\tilde{y}_i\big)\,
y_i(1-y_i)\sum_j w_{ij} \frac{\partial y_j}{\partial w_{\alpha\beta}}
$$
$\quad\quad$ and hence
$$
\frac{\partial E}{\partial w_{\alpha\beta}}
= \sum_j \Delta E_j \frac{\partial y_j}{\partial w_{\alpha\beta}},
\quad\qquad
\fbox{$\phantom{\big|}\displaystyle
\Delta E_j = \sum_i\Delta E_i\, y_i(1-y_i)\, w_{ij}
\phantom{\big|}$}
$$
- back-propagated error $ \ \ \Delta E_j$
- note: $\ \ y_i(1-y_i)>0$
: may be omitted
supervised learning via gradient descent is
equivalent to linear backpropagation of errors
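A hedged C++ sketch of one backpropagation step for a single hidden layer, using the recursion $\Delta E_j=\sum_i\Delta E_i\, y_i(1-y_i)\, w_{ij}$ (thresholds omitted for brevity; network sizes, eps and names are illustrative):

#include <cmath>
#include <vector>

double sigma(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// gradient for a weight w_ab into neuron a:  dE/dw_ab = dE_a y_a (1 - y_a) y_b
void backpropStep(std::vector<std::vector<double>>& wHid,  // hidden <- input
                  std::vector<std::vector<double>>& wOut,  // output <- hidden
                  const std::vector<double>& I,            // input pattern
                  const std::vector<double>& yTilde,       // target values
                  double eps)
{
    std::size_t nHid = wHid.size(), nOut = wOut.size();

    std::vector<double> yHid(nHid), yOut(nOut);            // forward pass
    for (std::size_t j = 0; j < nHid; ++j)
    {
        double x = 0.0;
        for (std::size_t k = 0; k < I.size(); ++k) x += wHid[j][k] * I[k];
        yHid[j] = sigma(x);
    }
    for (std::size_t i = 0; i < nOut; ++i)
    {
        double x = 0.0;
        for (std::size_t j = 0; j < nHid; ++j) x += wOut[i][j] * yHid[j];
        yOut[i] = sigma(x);
    }

    std::vector<double> dEOut(nOut), dEHid(nHid, 0.0);     // backward pass
    for (std::size_t i = 0; i < nOut; ++i)
        dEOut[i] = yOut[i] - yTilde[i];                    // initial error
    for (std::size_t j = 0; j < nHid; ++j)
        for (std::size_t i = 0; i < nOut; ++i)
            dEHid[j] += dEOut[i] * yOut[i] * (1.0 - yOut[i]) * wOut[i][j];

    for (std::size_t i = 0; i < nOut; ++i)                 // gradient steps
        for (std::size_t j = 0; j < nHid; ++j)
            wOut[i][j] -= eps * dEOut[i] * yOut[i] * (1.0 - yOut[i]) * yHid[j];
    for (std::size_t j = 0; j < nHid; ++j)
        for (std::size_t k = 0; k < I.size(); ++k)
            wHid[j][k] -= eps * dEHid[j] * yHid[j] * (1.0 - yHid[j]) * I[k];
}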
long short-term memory
neurons with internal states
- $h_{t-1}\,$:: output from the previous time step (recurrent input)
- $c_{t-1}\,$:: internal state (memory) at the start of time step
- $x_{t\phantom{-1}}\,$:: external input
internal state manipulation
- forget/update $\ \Rightarrow\ $ new internal state $c_t$
- output to next layer: $\ h_t$
:: nonlinear weighted superposition of
$\ \ x_t$, $h_{t-1}$, $c_t$
- all operations learned (weights adapted)
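A minimal C++ sketch of the forward pass of a single LSTM unit with the standard gating scheme (scalar version; weight names are generic, not from the lecture code):

#include <cmath>

struct LSTMCell
{
    // one weight for x_t, one for h_{t-1}, plus a bias, per gate; all learned
    double wfx, wfh, bf;   // forget gate
    double wix, wih, bi;   // input (update) gate
    double wcx, wch, bc;   // candidate internal state
    double wox, woh, bo;   // output gate

    static double sigma(double z) { return 1.0 / (1.0 + std::exp(-z)); }

    // updates the internal state c in place and returns h_t (output to next layer)
    double step(double x, double hPrev, double& c) const
    {
        double f    = sigma(wfx * x + wfh * hPrev + bf);   // how much of c_{t-1} to keep
        double i    = sigma(wix * x + wih * hPrev + bi);   // how much new content to write
        double cNew = std::tanh(wcx * x + wch * hPrev + bc);
        c = f * c + i * cNew;                              // forget / update -> c_t
        double o = sigma(wox * x + woh * hPrev + bo);      // how much of c_t to expose
        return o * std::tanh(c);                           // h_t
    }
};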