# Advanced Introduction to C++, Scientific Computing and Machine Learning

Claudius Gros, SS 2023

Institute for Theoretical Physics
Goethe-University Frankfurt a.M.

# neurons in the brain

### a brain full of neurons

• 80-100 billion neurons
• 1000-10000 connections (synapses) per neuron

### binary information transmission

• spike $\ \hat{=}\ $ electrical solitary wave
: action potential

### synapses are chemical

• pre-synaptic spike $\ \ \to\ \ $ vesicles pop
• neurotransmitter $\ \to\ $ receptors
: glutamate (excitatory)
: GABA (inhibitory)

all constituent proteins are recycled
(over days to weeks), yet function remains stationary

# artificial neurons

### rate encoding

• neural activity $\ \ y_i\in[-1,1]$
$\hat{=}\ $ normalized number of spikes per unit time
$\qquad\quad y_i = \sigma(x_i-b_i),\qquad\quad x_i=\sum_j w_{ij}y_j$
• membrane potential $\ \ x_i$
• threshold $\ \ b_i$
• transfer function
$\qquad\qquad\qquad\quad \fbox{$\phantom{\big|} y(z)=\tanh(z)\phantom{\big|}$}$
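
a minimal C++ sketch of such a rate neuron (the function name `rateNeuron` and the plain `std::vector` representation are our choices, not part of the lecture code):

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// rate neuron:  y_i = tanh(x_i - b_i),   x_i = sum_j w_ij * y_j
double rateNeuron(const std::vector<double>& w_i,   // row i of the weight matrix w_ij
                  const std::vector<double>& y,     // pre-synaptic activities y_j
                  double b_i)                       // threshold b_i
{
   double x_i = 0.0;                                // membrane potential
   for (std::size_t j = 0; j < y.size(); ++j)
      x_i += w_i[j]*y[j];
   return std::tanh(x_i - b_i);                     // transfer function y(z) = tanh(z)
}

int main()
{
   std::vector<double> w = {0.5, -1.0, 0.25};       // illustrative weights (arbitrary)
   std::vector<double> y = {1.0, 0.2, -0.7};        // pre-synaptic activities
   std::cout << rateNeuron(w, y, 0.1) << "\n";      // activity y_i in (-1,1)
}
```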

### synaptic weights

• weighted connectivity matrix $\ \ w_{ij}$
: post-synaptic: $\ \ i\ \leftarrow j \ \ $ pre-synaptic
• synaptic plasticity
: find optimal $\ \ w_{ij}$
: supervised / unsupervised

### linear classifier

• hyperplane $$\fbox{\phantom{\big|} \sum_j w_{ij} y_j = b_i \phantom{\big|}} \qquad\quad \mathbf{w}\cdot\mathbf{y}=b$$ in the space of pre-synaptic activities $\ \ y_j$
: every neuron $\ \ i$
• soft classification $\ \ y_i=\sigma(x_i-b_i)$
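
a sketch of the corresponding hard decision, testing on which side of the hyperplane a pre-synaptic activity vector lies (the helper name `classify` is ours; the soft classification replaces the sign by $\sigma$):

```cpp
#include <iostream>
#include <vector>

// on which side of the hyperplane  w·y = b  does the activity vector y lie?
int classify(const std::vector<double>& w, const std::vector<double>& y, double b)
{
   double x = 0.0;                              // membrane potential x = w·y
   for (std::size_t j = 0; j < w.size(); ++j)
      x += w[j]*y[j];
   return (x > b) ? +1 : -1;                    // soft classification: sigma(x - b)
}

int main()
{
   std::cout << classify({1.0, -1.0}, {0.8, 0.2}, 0.0) << "\n";   // prints 1
}
```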

# unsupervised learning

what fires together,
wires together

### Hebbian learning

• unsupervised - no teacher
• extraction of statistical correlations
from the input data stream
• causality
$$\frac{d}{dt} w_{ij} \sim y_i y_j$$
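
a minimal discrete-time sketch of the rule (Euler step with learning rate `eps`; the function name and data layout are our choices):

```cpp
#include <iostream>
#include <vector>

// Hebbian rule  dw_ij/dt ~ y_i y_j,  discretized as  w_ij += eps * y_i * y_j
void hebbStep(std::vector<std::vector<double>>& w,   // weight matrix w_ij
              const std::vector<double>& yPost,      // post-synaptic activities y_i
              const std::vector<double>& yPre,       // pre-synaptic activities y_j
              double eps)                            // learning rate (time step)
{
   for (std::size_t i = 0; i < yPost.size(); ++i)
      for (std::size_t j = 0; j < yPre.size(); ++j)
         w[i][j] += eps*yPost[i]*yPre[j];
}

int main()
{
   std::vector<std::vector<double>> w = {{0.0, 0.0}};  // one post-, two pre-synaptic neurons
   hebbStep(w, {1.0}, {1.0, -0.5}, 0.1);               // correlated activities grow the weight
   std::cout << w[0][0] << "  " << w[0][1] << "\n";    // 0.1  -0.05
}
```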

### linear model

• average over time
$\qquad\quad \fbox{$\phantom{\big|} \left\langle \frac{d}{dt} w_{ij}\right\rangle \sim \sum_k w_{ik}\big\langle y_k y_j\big\rangle \phantom{\big|}$}\,, \qquad\quad y_i\sim x_i = \sum_k w_{ik}y_k$
• covariance matrix of inputs
$$S_{kj} = \big\langle (y_k-\overline{y}_k) (y_j-\overline{y}_j)\big\rangle = \big\langle y_k y_j\big\rangle, \qquad\quad \overline{y}_k\to0$$
• generalizable for finite mean $\ \ \overline{y}_k$
• limit growth via overall normalization $\ \ \sum_j w_{ij}^2\to\mbox{const.}$
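
a short sketch of the time average above, estimating $S_{kj}$ from a set of sampled activity vectors (zero mean assumed, as in the text; names are ours):

```cpp
#include <iostream>
#include <vector>

// S_kj = < y_k y_j >, averaged over samples of the (zero-mean) input activities
std::vector<std::vector<double>>
covariance(const std::vector<std::vector<double>>& samples)
{
   const std::size_t N = samples[0].size();
   std::vector<std::vector<double>> S(N, std::vector<double>(N, 0.0));
   for (const auto& y : samples)                 // sum over the input stream
      for (std::size_t k = 0; k < N; ++k)
         for (std::size_t j = 0; j < N; ++j)
            S[k][j] += y[k]*y[j];
   for (auto& row : S)
      for (auto& s : row)
         s /= samples.size();                    // time average < ... >
   return S;
}

int main()
{
   auto S = covariance({{1.0, 1.0}, {-1.0, -1.0}, {1.0, -1.0}, {-1.0, 1.0}});
   std::cout << S[0][0] << "  " << S[0][1] << "\n";   // 1  0  (uncorrelated toy data)
}
```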

# principal component analysis

• Hebbian learning $$\left\langle \frac{d}{dt} w_{ij}\right\rangle \sim \sum_k w_{ik}S_{kj}, \qquad\quad \fbox{\phantom{\big|}\displaystyle \tau_w\frac{d}{dt} \hat{w} = \hat{w}\cdot\hat{S} \phantom{\big|}}\,, \qquad\quad \big(\hat{w}\big)_{ij} = w_{ij}$$ with $$\big(\hat{S}\big)_{kj} = \big\langle (y_k-\overline{y}_k) (y_j-\overline{y}_j)\big\rangle, \qquad\quad \sum_j w_{ij}^2\to\mbox{const.}$$

### competitive growth of components

• diagonalize input-covariance matrix $\ \ \hat{S}$
$$\hat{S} = \sum_{\{\lambda\}} \lambda \, \mathbf{e}_\lambda^{\phantom{T}} \mathbf{e}_\lambda^T, \qquad\quad \hat{S}\, \mathbf{e}_\gamma= \gamma\, \mathbf{e}_\gamma, \qquad\quad \mathbf{e}_\lambda\cdot\mathbf{e}_\gamma= \mathbf{e}_\lambda^T\mathbf{e}_\gamma^{\phantom{T}} = \delta_{\lambda,\gamma} \qquad\quad$$
• growth of weight components
$$\tau_w\frac{d}{dt}\big( \hat{w}\cdot\mathbf{e}_\gamma^{\phantom{T}}\big) = \hat{w}\cdot\hat{S}\cdot\mathbf{e}_\gamma^{\phantom{T}} = \sum_{\{\lambda\}}\lambda\, \hat{w}\cdot \mathbf{e}_\lambda^{\phantom{T}} \underbrace{ \mathbf{e}_\lambda^T \cdot\mathbf{e}_\gamma^{\phantom{T}} }_{\delta_{\lambda,\gamma}}, \qquad\quad \fbox{\phantom{\big|}\displaystyle \tau_w\frac{d}{dt}\big( \hat{w}\cdot\mathbf{e}_\gamma^{\phantom{T}}\big) = \gamma\,\big( \hat{w}\cdot\mathbf{e}_\gamma^{\phantom{T}}\big) \phantom{\big|}}$$
• normalization $\ \ \to\ \ $ competitive growth
• Hebbian learning finds the eigenvector of the
covariance matrix with the largest eigenvalue $\ \gamma$, the principal component,
viz the direction in the space of input activities with the largest variance
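
a numerical sketch of this result for a single linear neuron (the toy two-dimensional input stream, the learning rate, and the explicit normalization step are our choices): Hebbian updates with weight normalization drive $\mathbf{w}$ toward the principal component:

```cpp
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main()
{
   // toy zero-mean input stream: large variance along (1,1)/sqrt(2), small along (1,-1)/sqrt(2)
   std::mt19937 gen(42);
   std::normal_distribution<double> strong(0.0, 1.0), weak(0.0, 0.1);

   std::vector<double> w = {1.0, 0.0};                  // initial weight vector
   const double eps = 0.01;                             // learning rate

   for (int t = 0; t < 100000; ++t) {
      double a = strong(gen), b = weak(gen);
      std::vector<double> y = {(a + b)/std::sqrt(2.0),  // input activities y_j
                               (a - b)/std::sqrt(2.0)};
      double x = w[0]*y[0] + w[1]*y[1];                 // linear output  y_i ~ x_i
      w[0] += eps*x*y[0];                               // Hebbian step  dw_j ~ y_i y_j
      w[1] += eps*x*y[1];
      double norm = std::sqrt(w[0]*w[0] + w[1]*w[1]);   // enforce  sum_j w_j^2 = const.
      w[0] /= norm;
      w[1] /= norm;
   }
   // w approaches (1,1)/sqrt(2) ~ (0.71, 0.71) up to sign: the principal component
   std::cout << "w = (" << w[0] << ", " << w[1] << ")\n";
}
```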

# neural networks

### feed-forward

• organized in layers

$\fbox{$\phantom{\big|}$input$\phantom{\big|}$} \quad\to\quad \fbox{$\phantom{\big|}$hidden$\phantom{\big|}$} \quad\to\quad\dots\quad\to\quad \fbox{$\phantom{\big|}$output$\phantom{\big|}$}$

$\hat{=}\ $ cortical layers (brain)
• directed flow of information
: 'hidden' means internal
• deep learning
• ...

### recurrent

• closed dynamical system
• fixpoint attractors
• limit cycles
• chaotic attractors
if deterministic
• attractors may correspond to
• memories (fixpoints)
• motor primitives (limit cycles)
• ...
• behavior may be
• self-organized (emergent)
• learned

# perceptron

• single layer

$\fbox{$\phantom{\big|}$input$\phantom{\big|}$} \quad\to\quad \fbox{$\phantom{\big|}$output$\phantom{\big|}$}$

input units (not 'input neurons')

### supervised learning

• encode the mapping from input to output

$\fbox{$\phantom{\big|}\displaystyle \mathbf{I}_\alpha\ \to\ \mathbf{y}_\alpha \phantom{\big|}$}\,, \qquad\quad \mathbf{y} = \mathbf{y}(\mathbf{x}_\alpha), \qquad\quad \mathbf{x}_\alpha = \hat{w}\cdot\mathbf{I}_\alpha$

for training pairs $\ (\mathbf{I}_\alpha,\mathbf{y}_\alpha)$, where

$\mathbf{I}=\big(I_1,\,..,\,I_{N_0}\big), \qquad\quad \mathbf{y}=\big(y_1,\,..,\,y_{N_1}\big), \qquad\quad \big(\hat{w}\big)_{ij} = w_{ij}$
• minimize $$\fbox{\phantom{\big|}\displaystyle E = \frac{1}{2} \big|\mathbf{y}_\alpha-\mathbf{y}(\mathbf{x}_\alpha)\big|^2 \phantom{\big|}}\,, \qquad\quad (\mathbf{y}(\mathbf{x}_\alpha))_i=\sigma\big((\mathbf{x}_\alpha)_i-b_i\big) =\sigma\left(\sum_j w_{ij}(\mathbf{I}_\alpha)_j-b_i\right)$$ with transfer function $\ \ \sigma(z)$

### steepest descent

• gradient for a single input-output pair
$$\frac{d}{dt}w_{ij} \sim -\frac{\partial E}{\partial w_{ij}} = \left[ (\mathbf{y}_\alpha)_i-(\mathbf{y})_i \right]\,\sigma'(.) \,(\mathbf{I}_\alpha)_j$$
• learning stops when $\ \ \mathbf{y}\to \mathbf{y}_\alpha$
• monotonic transfer function $\ \ \sigma'(.)>0$
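
a sketch of one steepest-descent update for a single-layer perceptron, using the logistic transfer function $\sigma(z)=1/(1+\mathrm{e}^{-z})$ introduced further below (the learning rate, the threshold update, and all names are our choices):

```cpp
#include <cmath>
#include <iostream>
#include <vector>

double sigma(double z) { return 1.0/(1.0 + std::exp(-z)); }

// one gradient step on  E = 1/2 |y_alpha - y(x_alpha)|^2  for a single training pair
void trainStep(std::vector<std::vector<double>>& w,   // weights w_ij
               std::vector<double>& b,                // thresholds b_i
               const std::vector<double>& I,          // input pattern I_alpha
               const std::vector<double>& yTarget,    // target output y_alpha
               double eps)                            // learning rate
{
   for (std::size_t i = 0; i < w.size(); ++i) {
      double x = -b[i];                               // x_i - b_i
      for (std::size_t j = 0; j < I.size(); ++j)
         x += w[i][j]*I[j];
      double y = sigma(x);                            // output y_i
      double delta = (yTarget[i] - y)*y*(1.0 - y);    // [(y_alpha)_i - y_i] * sigma'(.)
      for (std::size_t j = 0; j < I.size(); ++j)
         w[i][j] += eps*delta*I[j];                   // dw_ij/dt ~ -dE/dw_ij
      b[i]      -= eps*delta;                         // threshold trained the same way
   }
}

int main()
{
   std::vector<std::vector<double>> w = {{0.0, 0.0}}; // one output neuron, two inputs
   std::vector<double> b = {0.0};
   for (int t = 0; t < 2000; ++t)
      trainStep(w, b, {1.0, 0.0}, {1.0}, 0.5);        // a single training pair
   std::cout << sigma(w[0][0] - b[0]) << "\n";        // output approaches the target 1
}
```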

# the XOR problem

• single neurons are linear classifiers
• XOR not solvable via linear classification

### the neural-network winter

• 1970s and 1980s
• universal computation needs XOR
$\Rightarrow\ $ (single-layer) neural networks not universal
• deep learning did not work
• problems too simple
• limited training data
• engineering

# universality of multilayer perceptrons

superpositions of linear functions
are still linear
• parametrized family of non-linear functions
$\qquad\quad \begin{array}{rcl} y_5 &=& \sigma(w_{5,3}y_3+w_{5,4}y_4) \\ &=& \sigma\Big( w_{5,3}\sigma(w_{3,1}I_1+w_{3,2}I_2) \\ & & \phantom{\sigma}+ w_{5,4}\sigma(w_{4,1}I_1+w_{4,2}I_2) \Big) \end{array}$

### linear neurons

• case $\ \ y(x) = a x$
• additional layers do not improve performance

### non-linear neurons

• output may be a ridge if

$w_{3,2}=-w_{3,1}, \qquad\quad w_{4,2}=-w_{4,1}, \qquad\quad b_3\ne b_4$

• 4 hidden-layer neurons representing
two orthogonal ridges make a bump
• bumps are universal
: given enough hidden-layer neurons, non-linear
neurons can represent any smooth function
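
a numerical sketch of this construction (the slopes, ridge widths, and the output threshold are our illustrative choices): two hidden neurons with opposite-sign input weights and distinct thresholds form a ridge, and the output neuron combines two orthogonal ridges into a bump:

```cpp
#include <cmath>
#include <iostream>

double sigma(double z) { return 1.0/(1.0 + std::exp(-z)); }

// two hidden neurons with identical slope but thresholds -8 and +8, subtracted:
// a ridge along the combination u (large only for roughly -1 < u < 1)
double ridge(double u) { return sigma(8.0*u + 8.0) - sigma(8.0*u - 8.0); }

// four hidden neurons: two orthogonal ridges, combined by the output neuron -> a bump
double bump(double I1, double I2)
{
   double r1 = ridge(I1 - I2);              // ridge along  I1 - I2   (w_{3,2} = -w_{3,1})
   double r2 = ridge(I1 + I2);              // orthogonal ridge along  I1 + I2
   return sigma(8.0*(r1 + r2 - 1.5));       // large only where the two ridges overlap
}

int main()
{
   std::cout << bump(0.0, 0.0) << "  "      // center of the bump: close to 1
             << bump(3.0, 0.0) << "  "      // away from the center: close to 0
             << bump(0.0, 3.0) << "\n";
}
```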

# derivative of sigmoidal

• here $\ \ y_i\in[0,1]$

$\qquad\quad y_i = \sigma(x_i-b_i), \qquad\quad \sigma(z) =\frac{1}{1+\mathrm{e}^{-z}}$
• derivative of sigmoidal

$\qquad\quad \frac{d\sigma}{dz} = \frac{\mathrm{e}^{-z}}{(1+\mathrm{e}^{-z})^2} =\sigma^2(1/\sigma-1)$

viz
$$\fbox{\phantom{\big|}\displaystyle \sigma' = \sigma(1-\sigma) \phantom{\big|}}\,, \qquad\quad \frac{d}{dx} y_i= y_i(1-y_i)$$
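
a quick numerical check of the identity against a central finite difference (step size and sample point are our choices):

```cpp
#include <cmath>
#include <iostream>

double sigma(double z) { return 1.0/(1.0 + std::exp(-z)); }

int main()
{
   const double z = 0.7, h = 1.0e-6;
   double numeric  = (sigma(z + h) - sigma(z - h))/(2.0*h);   // central finite difference
   double analytic = sigma(z)*(1.0 - sigma(z));               // sigma' = sigma(1 - sigma)
   std::cout << numeric << "  " << analytic << "\n";          // the two values agree
}
```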

# backpropagation

• layers $\ \ i,\,j,\, k\,\,\dots$

$x_i = \sum_j w_{ij} y_j = \sum_j w_{ij} \sigma(x_j-b_j) =\dots$

### training multilayer perceptrons

$\qquad\quad \frac{\partial y_i}{\partial w_{\alpha\beta}} = y_i(1-y_i)\sum_j w_{ij} \frac{\partial y_j}{\partial w_{\alpha\beta}}$
• layer indices $\ \ \alpha,\beta$
• target value $\ \ \tilde{y}_i$
• gradient of error function $$E=\frac{1}{2} \sum_i\big(y_i-\tilde{y}_i\big)^2, \qquad\quad \frac{\partial E}{\partial w_{\alpha\beta}} = \sum_i \big(y_i-\tilde{y}_i\big) \frac{\partial y_i}{\partial w_{\alpha\beta}}$$ viz $$\fbox{\phantom{\big|}\displaystyle \frac{\partial E}{\partial w_{\alpha\beta}} = \sum_i \Delta E_i \frac{\partial y_i}{\partial w_{\alpha\beta}} \phantom{\big|}}\,, \qquad\quad \Delta E_i=y_i-\tilde{y}_i$$ with the initial error $\ \ \Delta E_i$

### recursive derivatives

$$\frac{\partial E}{\partial w_{\alpha\beta}} = \sum_i\big(y_i-\tilde{y}_i\big)\, y_i(1-y_i)\sum_j w_{ij} \frac{\partial y_j}{\partial w_{\alpha\beta}}$$ $\quad\quad$ and hence $$\frac{\partial E}{\partial w_{\alpha\beta}} = \sum_j \Delta E_j \frac{\partial y_j}{\partial w_{\alpha\beta}}, \quad\qquad \fbox{\phantom{\big|}\displaystyle \Delta E_j = \sum_i\Delta E_i\, y_i(1-y_i)\, w_{ij} \phantom{\big|}}$$
• back-propagated error $\ \ \Delta E_j$
• note: $\ \ y_i(1-y_i)>0$
: may be omitted
supervised learning via gradient descent is
equivalent to linear backpropagation of errors
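
a compact sketch of the full procedure for a single hidden layer, following the recursion above (the thresholds $b_i$ are dropped; the network size, starting weights, and learning rate are our choices):

```cpp
#include <cmath>
#include <iostream>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

double sigma(double z) { return 1.0/(1.0 + std::exp(-z)); }

// forward pass through one layer:  y_i = sigma( sum_j w_ij y_j )   (thresholds omitted)
Vec forward(const Mat& w, const Vec& y)
{
   Vec out(w.size());
   for (std::size_t i = 0; i < w.size(); ++i) {
      double x = 0.0;                               // membrane potential x_i
      for (std::size_t j = 0; j < y.size(); ++j) x += w[i][j]*y[j];
      out[i] = sigma(x);
   }
   return out;
}

// one gradient-descent step for a single pattern: input I, target yTilde
void backpropStep(Mat& wHid, Mat& wOut, const Vec& I, const Vec& yTilde, double eps)
{
   Vec yH = forward(wHid, I);                       // hidden-layer activities
   Vec yO = forward(wOut, yH);                      // output-layer activities

   Vec dOut(yO.size());                             // initial error  Delta E_i = y_i - y~_i
   for (std::size_t i = 0; i < yO.size(); ++i) dOut[i] = yO[i] - yTilde[i];

   Vec dHid(yH.size(), 0.0);                        // Delta E_j = sum_i Delta E_i y_i(1-y_i) w_ij
   for (std::size_t j = 0; j < yH.size(); ++j)
      for (std::size_t i = 0; i < yO.size(); ++i)
         dHid[j] += dOut[i]*yO[i]*(1.0 - yO[i])*wOut[i][j];

   for (std::size_t i = 0; i < yO.size(); ++i)      // dE/dw_ij = Delta E_i y_i(1-y_i) y_j
      for (std::size_t j = 0; j < yH.size(); ++j)
         wOut[i][j] -= eps*dOut[i]*yO[i]*(1.0 - yO[i])*yH[j];
   for (std::size_t i = 0; i < yH.size(); ++i)
      for (std::size_t j = 0; j < I.size(); ++j)
         wHid[i][j] -= eps*dHid[i]*yH[i]*(1.0 - yH[i])*I[j];
}

int main()
{
   Mat wHid = {{0.5, -0.3}, {-0.4, 0.8}};           // 2 inputs -> 2 hidden (arbitrary start)
   Mat wOut = {{0.7, 0.2}};                         // 2 hidden -> 1 output
   Vec I = {1.0, 0.0}, yTilde = {1.0};
   for (int t = 0; t < 5000; ++t) backpropStep(wHid, wOut, I, yTilde, 0.5);
   std::cout << forward(wOut, forward(wHid, I))[0] << "\n";   // approaches the target 1.0
}
```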

# long short-term memory

### neurons with internal states

• $h_{t-1}\,$:: input from previous layer
• $c_{t-1}\,$:: internal state (memory) at the start of time step
• $x_{t\phantom{-1}}\,$:: external input

### internal state manipulation

• forget/update $\ \Rightarrow\ $ new internal state $c_t$
• output to next layer: $\ h_t$
:: nonlinear weighted superposition of $\ \ x_t$, $h_{t-1}$, $c_t$
• all operations learned (weights adapted)
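
a minimal sketch of one time step of a single LSTM cell (scalar state $c$ and output $h$; the gate weight names follow the common convention and are our assumptions, not notation from these notes):

```cpp
#include <cmath>
#include <iostream>

double sigma(double z) { return 1.0/(1.0 + std::exp(-z)); }

// a single LSTM cell with scalar state; weight names follow the usual gate convention
struct LSTMCell {
   double wf_x, wf_h, bf;     // forget gate
   double wi_x, wi_h, bi;     // input (update) gate
   double wc_x, wc_h, bc;     // candidate internal state
   double wo_x, wo_h, bo;     // output gate

   // one time step:  (x_t, h_{t-1}, c_{t-1})  ->  (h_t, c_t)
   void step(double x, double& h, double& c) const
   {
      double f    = sigma(wf_x*x + wf_h*h + bf);        // how much of c_{t-1} to keep
      double i    = sigma(wi_x*x + wi_h*h + bi);        // how much new content to write
      double cNew = std::tanh(wc_x*x + wc_h*h + bc);    // candidate content
      double o    = sigma(wo_x*x + wo_h*h + bo);        // how much of c_t to expose
      c = f*c + i*cNew;                                 // forget/update -> new state c_t
      h = o*std::tanh(c);                               // output to the next layer h_t
   }
};

int main()
{
   LSTMCell cell{0.5,0.5,0.0, 0.5,0.5,0.0, 1.0,0.5,0.0, 0.5,0.5,0.0};  // arbitrary weights
   double h = 0.0, c = 0.0;                                            // h_0, c_0
   const double xs[] = {1.0, 0.0, 0.0, -1.0};                          // short input sequence
   for (double x : xs) cell.step(x, h, c);
   std::cout << "h = " << h << ", c = " << c << "\n";
}
```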