Advanced Introduction to C++, Scientific Computing and Machine Learning
Claudius Gros, SS 2024
Institute for Theoretical Physics
Goethe-University Frankfurt a.M.
Generative Architectures
variational inference
conditional probabilities
- $\mathcal{D}\ $ : dataset with elements $(\mathbf{x},\mathbf{y})$,
$\mathbf{x}\ $ : input
$\mathbf{y}\ $ : output
- $\mathbf{w}\ $ : network parameters (weights)
- $P(\mathcal{D}|\mathbf{w})\ $ : conditional probability for
data $\mathcal{D}$, for a given configuration of $\mathbf{w}$
:: both $\mathbf{x}$ and $\mathbf{y}$ are given for supervised learning
accuracy objective
- $L^N(\mathbf{w},\mathcal{D})\ $ : network loss function
$$
L^N(\mathbf{w},\mathcal{D}) = -\log\big(P(\mathcal{D}|\mathbf{w})\big)
= - \sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}}
\log\big(P(\mathbf{y}|\mathbf{x},\mathbf{w})\big)
$$
- is minimal if there is one $\mathbf{y}$ for every $\mathbf{x}$
:: entropy is positive definite, vanishing in the absence of noise
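A minimal sketch of the accuracy objective, assuming a hypothetical array `probOfLabel` holding $P(\mathbf{y}|\mathbf{x},\mathbf{w})$ for every sample of $\mathcal{D}$:
```cpp
#include <cmath>
#include <vector>

// sketch: probOfLabel[n] is assumed to hold P(y|x,w) for the
// n-th pair (x,y) of the dataset D, as produced by the network
double networkLoss(const std::vector<double>& probOfLabel)
{
    double loss = 0.0;
    for (double p : probOfLabel)   // L^N = - sum_{(x,y)} log P(y|x,w)
        loss -= std::log(p);
    return loss;
}
```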
variational inference
- $P(\mathbf{w}|\alpha)\ $ : prior distribution of weights
(before training), depending
on parameter $\alpha$
- $P(\mathbf{w}|\mathcal{D},\alpha)\ $ : posterior weight distribution
(given training data $\mathcal{D}$)
:: not tractable
- $P(\mathbf{w}|\mathcal{D},\alpha)$ is approximated by
$Q(\mathbf{w}|\beta)$, determined variationally, viz by
minimizing a given objective function
- $\mathcal{F}\ $ : variational free energy
$$
\mathcal{F} = - \left\langle \log\left[
\frac{P(\mathcal{D}|\mathbf{w})P(\mathbf{w}|\alpha)}{Q(\mathbf{w}|\beta)}
\right]\right\rangle_{\mathbf{w}\sim Q(\beta)}
$$
Kullback-Leibler divergence
$$
\mathcal{F} = \left\langle L^N(\mathbf{w},\mathcal{D})
\right\rangle_{\mathbf{w}\sim Q(\beta)}
+ D_{\rm KL}\big(Q(\beta)\parallel P(\alpha)\big)
$$
- $\langle L^N(\mathbf{w},\mathcal{D})
\rangle_{\mathbf{w}\sim Q(\beta)}\ $ : parametrized
by $Q(\beta)$, instead of $\mathbf{w}$
- $D_{\rm KL}(Q(\beta)\parallel P(\alpha)) \ $ :
Kullback-Leibler divergence ('distance')
:: positive definite
:: $D_{\rm KL}\equiv 0\ $ if and only if
$Q(\beta)$ and $P(\alpha)$ are identical
:: converged if posterior $\equiv$ prior
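The decomposition above follows from splitting the logarithm in $\mathcal{F}$:
$$
\mathcal{F}
= - \big\langle \log P(\mathcal{D}|\mathbf{w}) \big\rangle_{\mathbf{w}\sim Q(\beta)}
+ \left\langle \log\frac{Q(\mathbf{w}|\beta)}{P(\mathbf{w}|\alpha)}
\right\rangle_{\mathbf{w}\sim Q(\beta)}
= \left\langle L^N(\mathbf{w},\mathcal{D})\right\rangle_{\mathbf{w}\sim Q(\beta)}
+ D_{\rm KL}\big(Q(\beta)\parallel P(\alpha)\big)
$$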
variational distributions
- minimization procedure possible for a given
parametrization of $Q(\mathbf{w}|\beta)$
- Gaussian: mean $\mu_i$ and standard deviation $\sigma_i$ for
parameter $w_i$
- same parametrization for prior $P(\mathbf{w}|\alpha)$
inference
- result is a distribution, parametrized by $(\mu_i,\sigma_i)$,
for network parameters: not a single value
- set of $(\mu_i,\sigma_i)$ updated with every new observation
:: viz with every new data point $(\mathbf{x},\mathbf{y})$
:: inference: updating of a distribution with new observations
- next step: start posterior as new prior
:: $(\mu_i,\sigma_i)$ are now the parameters of the prior
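A minimal sketch, assuming independent Gaussians for both $Q(\beta)$ and the prior $P(\alpha)$: weights are drawn via the reparametrization $w_i=\mu_i+\sigma_i\varepsilon$, and the Kullback-Leibler term is evaluated with the standard closed form for Gaussians:
```cpp
#include <cmath>
#include <random>
#include <vector>

struct Gauss { double mu, sigma; };   // (mu_i, sigma_i) of one weight w_i

// reparameterized sample  w_i = mu_i + sigma_i * eps,  eps ~ N(0,1)
std::vector<double> sampleWeights(const std::vector<Gauss>& q, std::mt19937& gen)
{
    std::normal_distribution<double> eps(0.0, 1.0);
    std::vector<double> w;
    for (const auto& g : q) w.push_back(g.mu + g.sigma * eps(gen));
    return w;
}

// D_KL(Q(beta) || P(alpha)) for factorized Gaussians, summed over all weights
double klGauss(const std::vector<Gauss>& q, const std::vector<Gauss>& p)
{
    double kl = 0.0;
    for (std::size_t i = 0; i < q.size(); ++i) {
        double dmu = q[i].mu - p[i].mu;
        kl += std::log(p[i].sigma / q[i].sigma)
            + (q[i].sigma * q[i].sigma + dmu * dmu)
              / (2.0 * p[i].sigma * p[i].sigma)
            - 0.5;
    }
    return kl;
}
```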
variational autoencoder
- latent space ('encoded space')
- for normal autoencoder
:: not structured, just a bottleneck
- for variational autoencoder
:: maximal information
- maximize information when encoding
- minimize reconstruction error when decoding
- $\mathbf{x}\ $ : input
$\mathbf{z}\ $ : latent state
$\mathbf{y}\ $ : output
probabilistic network representations
- $Q(\mathbf{w}|\beta)\ $ : encoder
$Q^\prime(\mathbf{w}^\prime|\beta^\prime)\ $ : decoder
:: probabilities to generate network parameters
$\mathbf{w}\,/\,\mathbf{w}^\prime$
:: encoded by variational parameters
$\beta\,/\,\beta^\prime$ (Gaussians)
accuracy target
- $P(\mathbf{y}|\mathbf{z},\mathbf{w}^\prime)\ $ :
probability to find $\mathbf{y}$ for given $\mathbf{z}$
and $\mathbf{w}^\prime$
$$
\mathcal{F}_{\mathrm{accuracy}} =
- \sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}}
\log\big(P(\mathbf{y}|\mathbf{z},\mathbf{w}^\prime)\big)
$$
- $\mathbf{y}\hat{=}\mathbf{x}\ $ for autoencoder
- minimal if there is a $\mathbf{y}$ for every $\mathbf{z}$
information maximization
- variational inference between input and latent space
- posterior and prior (both variational)
$$
\mathcal{F}_{\mathrm{information}} = D_{\mathrm{KL}}
\big(P_{\mathrm{posterior}} \parallel P_{\mathrm{prior}}
\big)
$$
$$
P_{\mathrm{prior/posterior}} =
P(\mathbf{z}|\mathbf{x},\mathbf{w})\Big|_{\mathrm{before/after\ updating}}
$$
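For reference: with the common (here assumed) choice of a diagonal-Gaussian posterior $\mathcal{N}(\boldsymbol\mu,\boldsymbol\sigma^2)$ for $\mathbf{z}$ and a standard-normal prior, the information term has the closed form
$$
D_{\mathrm{KL}}\big(\mathcal{N}(\boldsymbol\mu,\boldsymbol\sigma^2)
\parallel\mathcal{N}(\mathbf{0},\mathbf{1})\big)
= \frac{1}{2}\sum_i\big(\mu_i^2+\sigma_i^2-1-\log\sigma_i^2\big)
$$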
natural language processing (NLP)
encoder-decoder architectures for NLP
- encoder
every layer receives an input
$
\mathbf{h}_{t} = \mathbf{f}(\mathbf{h}_{t-1}, \mathbf{x}_t)
$
$\mathbf{h}_{t}\ $ : hidden layer state
$\mathbf{x}_{t}\ $ : input (embedded words)
- decoder
every layer makes a prediction,
receiving previous predictions as input
$
\mathbf{s}_{t} = \mathbf{g}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1})
$
$\mathbf{s}_{t}\ $ : hidden layer state
$\mathbf{y}_{t}\ $ : predicted word (linear classifier, softmax)
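A minimal sketch of a single encoder step $\mathbf{h}_{t}=\mathbf{f}(\mathbf{h}_{t-1},\mathbf{x}_t)$, assuming a plain $\tanh$ recurrence with hypothetical weight matrices `Wh` and `Wx`; LSTM units would add gating on top of this:
```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// h_t = tanh(Wh * h_{t-1} + Wx * x_t)   (plain recurrent unit, no gates)
Vec encoderStep(const Mat& Wh, const Mat& Wx, const Vec& hPrev, const Vec& x)
{
    Vec h(Wh.size(), 0.0);
    for (std::size_t i = 0; i < Wh.size(); ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < hPrev.size(); ++j) sum += Wh[i][j] * hPrev[j];
        for (std::size_t j = 0; j < x.size();     ++j) sum += Wx[i][j] * x[j];
        h[i] = std::tanh(sum);
    }
    return h;
}
```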
word embedding
- preprocessing: word $\to$ vector
- consider inter-word correlations
:: related words are closer in vector space
- example: Google Word2vec
variants
- apply to
- variational encoders, decoders, and/or
- recurrent neural networks (RNN)
- normal units / long short-term memory (LSTM) units
- state-of-the-art until 2014-2017
semantic correlations
If the universe has an explanation of its existence,
that explanation is God.
- possibly long-distance semantic correlations
- classical approach: memory (fading)
- state-of-the-art: attention
query, key & value
- token $\ \hat{=}\ \sim 0.75$ words
$\sim 5\cdot10^4\ $ : vocabulary size
- activity $\ \mathbf{x}$: current state
- three matrices (entries adapted via backpropagation)
$\hat{Q}\ $: query
$\hat{K}\ $: key
$\hat{V}\ $: value
- $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ vectors
$$
\mathbf{Q} = \hat{Q}\cdot\mathbf{x},\qquad
\mathbf{K} = \hat{K}\cdot\mathbf{x},\qquad
\mathbf{V} = \hat{V}\cdot\mathbf{x}
$$
Alice: should I pay attention to Bob?
- databank query formalism
- Alice sends her query $\ \mathbf{Q}_A\ $ to Bob
- Bob compares $\ \mathbf{Q}_A\ $ with his key $\ \mathbf{K}_B$
- if there is a match, Bob sends back his value $\ \mathbf{V}_B$
- content-based attention via similarity
- self attention: between words of a sentence
dot-product attention
- scanning keys for similarity with query
$\mathbf{K}_j \ $ : key of object $j$
$\mathbf{V}_j \ $ : value of object $j$
$\mathbf{Q}_{\phantom{j}} \ $ : query
$$
\mathbf{a} = \sum_j \alpha_j\mathbf{V}_j,
\quad\qquad
\alpha_{j} = \frac{\mathrm{e}^{e_{j}}}{\sum_i\mathrm{e}^{e_{i}}}
\quad\qquad
e_{j} = \mathbf{Q}\cdot\mathbf{K}_{j}
$$
- $\mathbf{a}\ $ : attention vector
:: activity of query token
:: output of attention layer
- softmax (Boltzmann distribution) superposition of values
- similarity: scalar product
[Bahdanau, Cho, Bengio 2014]
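A minimal sketch of dot-product attention as defined above: one query is compared with all keys, the softmax weights form the superposition of values (plain dot product, no scaling, no numerical stabilization):
```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// a = sum_j alpha_j V_j,   alpha_j = softmax_j(e_j),   e_j = Q . K_j
Vec attention(const Vec& Q, const std::vector<Vec>& K, const std::vector<Vec>& V)
{
    std::vector<double> e;
    double norm = 0.0;
    for (const auto& Kj : K) {              // similarity scores e_j
        e.push_back(std::exp(dot(Q, Kj)));
        norm += e.back();
    }
    Vec a(V[0].size(), 0.0);                // attention vector
    for (std::size_t j = 0; j < V.size(); ++j)
        for (std::size_t i = 0; i < a.size(); ++i)
            a[i] += (e[j] / norm) * V[j][i];
    return a;
}
```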
transformer block
$$
\begin{array}{ccccccc}
\fbox{$\phantom{\big|}\mathrm{input}\phantom{\big|}$}
&\to&
\fbox{$\phantom{\big|}\mathrm{attention}\phantom{\big|}$}
&\to&
\fbox{$\phantom{\big|}\mathrm{normalization}\phantom{\big|}$}
& &
\\[0.5ex]
&\to&
\fbox{$\phantom{\big|}\mathrm{feed forward}\phantom{\big|}$}
&\to&
\fbox{$\phantom{\big|}\mathrm{normalization}\phantom{\big|}$}
&\to&
\fbox{$\phantom{\big|}\mathrm{output}\phantom{\big|}$}
\end{array}
$$
- input sentence
:: embedded sequence of words (tokens)
- residual connections around attention layer are essential
- MLP: linear feedforward layer
- output of block $\ \equiv\ $ input of next block
multi-headed attention
- $h\ $ attention blocks in parallel ($h=8$)
:: every token generates $\ h \ $ sets of Q/K/V vectors
- allows for context specific attention
skip connections
skip/residual connections
- add identity in parallel to hidden layer
$\mathbf{Layer}(\mathbf{x}) \ \to \
\mathbf{x}+\mathbf{Layer}(\mathbf{x})$
- additional normalization (transformer model)
$\mathbf{Norm}\Big(\mathbf{x}+\mathbf{Layer}(\mathbf{x})\Big)$
advantages
- helps with vanishing gradient problem
when backpropagating learning signals
during the training of deep networks
- if layer is superfluous, training can ignore it
- allows adding (self-)attention layers (transformer model)
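A minimal sketch of the $\mathbf{Norm}\big(\mathbf{x}+\mathbf{Layer}(\mathbf{x})\big)$ wrapper, with a generic layer and a bare layer normalization (zero mean, unit variance; the learned scale/shift parameters of the transformer are omitted):
```cpp
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Norm(x + Layer(x)): skip connection followed by layer normalization
Vec residualNormBlock(const Vec& x, const std::function<Vec(const Vec&)>& layer)
{
    Vec y = layer(x);
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += x[i];  // add identity

    double mean = 0.0, var = 0.0;
    for (double v : y) mean += v;
    mean /= y.size();
    for (double v : y) var += (v - mean) * (v - mean);
    var /= y.size();

    for (double& v : y) v = (v - mean) / std::sqrt(var + 1e-12);
    return y;
}
```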
positional encoding
- NLP before transformer (2017)
:: read word by word
:: encoder / decoder with RNN and LSTM
- NLP with transformer
:: reading sentence by sentence
:: no RNN, no LSTM; good for parallelization, GPUs
:: uses self-attention with positional encoding
position of words in a sentence
- system needs position information when
reading entire sentences at once
- for every word:
$\mathbf{P}_e\ $ : positional encoding vector
$\mathbf{W}_e\ $ : word embedding vector
- transformer architecture
$\mathbf{P}_e$ and $\mathbf{W}_e$ have identical dimension
$d_{\mathrm{model}}=512 \ $ : model dimension
- encoder input:
$\mathbf{P}_e + \mathbf{W}_e\ $ : vector sum (times number of words)
discrete Fourier analysis
- many types of positional encodings possible;
the transformer uses Fourier components
- $L_s\ $ : length of input sentence
$k\ $ : word position
$d_{\rm model}\ $ : embedding dimension
$i\in[0,d_{\rm model}/2]\ $ : encoding the embedding index
$$
P_e(k,2i ) = \sin\left(\frac{k}{K^{2i}}\right),
\quad\qquad
P_e(k,2i+1) = \cos\left(\frac{k}{K^{2i}}\right),
\quad\qquad
K = n^{1/d_{\mathrm{model}}}
$$
- for $d_{\rm model}=512$ (transformer), one has
a geometric series of
$d_{\rm model}/2=256\ $ wavevectors $\ K^{2i}=1,\dots,n$
- $n=10^4\ $ (transformer)
- for the $k$th word, $P_e(k,j)$ is the $j$th entry
of the positional embedding vector
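A minimal sketch of the sinusoidal positional encoding above, using the transformer values $n=10^4$ and $d_{\rm model}=512$:
```cpp
#include <cmath>
#include <vector>

// P_e(k,2i) = sin(k / K^{2i}),  P_e(k,2i+1) = cos(k / K^{2i}),  K = n^{1/d_model}
std::vector<double> positionalEncoding(int k, int dModel = 512, double n = 1.0e4)
{
    std::vector<double> Pe(dModel);
    for (int i = 0; 2 * i + 1 < dModel; ++i) {
        double K2i = std::pow(n, 2.0 * i / dModel);   // K^{2i}
        Pe[2 * i]     = std::sin(k / K2i);
        Pe[2 * i + 1] = std::cos(k / K2i);
    }
    return Pe;
}
```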
recursive prediction
hello $\ \to\ $ hello
- own prediction $\ \equiv\ $ input (shifted by one token)
beam search
not the most likely individual words/characters,
but the most likely sentences
- run for the N most likely first words
- run for the N most likely two-word combinations
- ...
beam search
- large language models (LLMs) generate
probabilities for output tokens
:: words, syllables, letters
- greedy
take token with highest probability
- chatGPT uses beam search
beam search
- classical search algorithm for probabilistic trees
- N=2: width of search beam
- first position:
select the N tokens with highest probabilities:
A and C
- rerun system N times, with first position fixed
:: generate new probabilities for second position
- evaluate combined probabilities $\ \ p_{AX}$, $\ \ p_{CX}$
for all second-position tokens $\ X$
- select N highest 2-token combinations: AB and CE
- rerun N times with fixed first/second positions
- ... repeat
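A minimal sketch of beam search with beam width N, assuming a hypothetical hook `nextProbs(prefix)` that returns the model's probabilities for the next token given the tokens generated so far:
```cpp
#include <algorithm>
#include <vector>

// hypothetical interface to the language model:
// probability of every vocabulary token, given the prefix generated so far
std::vector<double> nextProbs(const std::vector<int>& prefix);

struct Beam { std::vector<int> tokens; double prob = 1.0; };

std::vector<Beam> beamSearch(int N, int length)
{
    std::vector<Beam> beams(1);                      // start: empty prefix
    for (int step = 0; step < length; ++step) {
        std::vector<Beam> candidates;
        for (const auto& beam : beams) {             // rerun model per beam
            std::vector<double> p = nextProbs(beam.tokens);
            for (int tok = 0; tok < (int)p.size(); ++tok) {
                Beam c = beam;
                c.tokens.push_back(tok);
                c.prob *= p[tok];                    // combined probability
                candidates.push_back(c);
            }
        }
        std::sort(candidates.begin(), candidates.end(),
                  [](const Beam& a, const Beam& b) { return a.prob > b.prob; });
        if ((int)candidates.size() > N) candidates.resize(N);
        beams = candidates;                          // keep N best prefixes
    }
    return beams;
}
```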
combined probabilities
- position of output token: time step
- left (greedy)
$0.5\cdot0.4\cdot0.4\cdot0.6=0.048$
- right (non-greedy second position)
$0.5\cdot0.3\cdot0.6\cdot0.6=0.054$
- changed probabilities due to rerun
masked self attention
ordered input tokens
- attention to the left only (causality)
- value, key, query
$$
\mathbf{V}_m = \hat{V}_m \cdot \mathbf{x}_m,
\qquad
\mathbf{K}_m = \hat{K}_m \cdot \mathbf{x}_m,
\qquad
\mathbf{Q}_m = \hat{Q}_m \cdot \mathbf{x}_m
$$
$$
\begin{array}{rcl}
\mathbf{s}_q^{\mathrm{(head)}} &= &
\sum_k \mathbf{V}_k\,
\mathrm{softmax}\Big(F_a(\mathbf{K}_k,\mathbf{Q}_{q})\Big)
\\[0.5ex] &\to&
\frac{\sum_k\mathbf{V}_k \big(\mathbf{K}_k\,\cdot\,\mathbf{Q}_{q}\big)}
{\sum_k \mathbf{K}_k\,\cdot\,\mathbf{Q}_{q}}
=
\frac{\big(\sum_k \mathbf{V}_k \otimes\mathbf{K}_k\big)\,\cdot\,\mathbf{Q}_{q}}
{\big(\sum_k \mathbf{K}_k\big)\,\cdot\,\mathbf{Q}_{q}}
\end{array}
$$
- outer product: $\ \otimes$
- additionally: positivity
associative soft weight computing
- enters linearized attention (mean-field approximation)
:: $\ W_{\rm tot}=\sum_k \mathbf{V}_k \otimes\mathbf{K}_k$
$$
W_m = W_{m-1} + \mathbf{V}_m \otimes\mathbf{K}_m
$$
- akin to time decomposition of an autoassociative recurrent net
:: internal state: $\ W_m$
- key-value associations
$$
\mathbf{V}_m \otimes\mathbf{K}_m =
\hat{V}_m \cdot\big(\mathbf{x}_m \otimes\mathbf{x}_m
\big)\cdot\hat{K}_m^T
$$
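A minimal sketch of the associative update $W_m=W_{m-1}+\mathbf{V}_m\otimes\mathbf{K}_m$ and of the corresponding linearized readout (softmax replaced by plain normalization, as in the mean-field expression above):
```cpp
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// W_m = W_{m-1} + V_m (outer) K_m : accumulate key-value associations
void accumulateAssociation(Mat& W, const Vec& V, const Vec& K)
{
    for (std::size_t i = 0; i < V.size(); ++i)
        for (std::size_t j = 0; j < K.size(); ++j)
            W[i][j] += V[i] * K[j];                  // outer product
}

// s_q = (W_tot . Q_q) / ((sum_k K_k) . Q_q)
Vec linearAttention(const Mat& W, const Vec& Ksum, const Vec& Q)
{
    double denom = 0.0;
    for (std::size_t j = 0; j < Q.size(); ++j) denom += Ksum[j] * Q[j];
    Vec out(W.size(), 0.0);
    for (std::size_t i = 0; i < W.size(); ++i)
        for (std::size_t j = 0; j < Q.size(); ++j)
            out[i] += W[i][j] * Q[j] / denom;
    return out;
}
```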
why does it work?
attention in neuroscience / psychology
- external (saliency, 'something fast')
- top-down attention commands
:: 'looking for a red object'
$\color{brown}\Rightarrow \ $ homunculus problem
- “No one knows what attention is” [Hommel et al., 2019]
ML attention
active during training
$\ \color{brown}\Rightarrow\ $ self-consistency loop
- Q, K, V matrices adapt to Q-K matching condition
:: dot product
- attention $\ \color{brown}\equiv\ $ information routing
:: self-optimized information bottleneck
self-optimized information routing
solves the fitting / generalization dilemma
- statistical learning of causal relations
:: not (only) of information
foundation models
GPT - generative pretrained transformer
- number of adaptable parameters (matrix elements)
:: $10^8$ / $10^9$ / $10^{11}$ / $10^{12}$
:: GPT 1 / 2 / 3 / 4
- $10^3-10^4\ $ context tokens, $\ \sim 80\ $ layers
automatic pretraining
by predicting next word
$
\begin{array}{rl}
{\color{red}\equiv} & \mathrm{base\ model\ (GPT\!-\!3)} \\[0.0ex]
{\color{red}+} & \mathrm{human\ supervised\
finetuning\ (SFT\ model)} \\[0.0ex]
{\color{red}+} & \mathrm{human\ supervised\ reinforcement\
learning} \\[0.0ex]
{\color{red}\Rightarrow} & \mathrm{chat\ assistant\ (foundation\ model)}
\end{array}
$
- SFT: $\ 10^4\!-\!10^5\ $ hand-crafted (prompt, response) pairs
:: learns to answer; ¬(text completion)
applications / tasks on top of
foundation model
value alignment
reinforcement learning from
human feedback (RLHF)
- reinforcement learning
:: learning from feedback
- humans train the reward model
the reward model trains GPT
:: circumvents behavioral cloning
$\color{brown}\Rightarrow\ $ value alignment
$\color{brown}\Rightarrow\ $ ChatGPT
open-source explosion
LLaMA (Meta AI)
- public release Feb 2023
:: 7B, 13B, 33B, and 65B parameters
:: parameters leaked within a week
- 'We have no moat'
- leaked Google memo, May 2023
Open-source models are faster, more customizable,
more private, and pound-for-pound more capable.
They are doing things with $\$$100 and 13B params that
we struggle with at $\$$10M and 540B. And they are doing
so in weeks, not months.
The barrier to entry for training and experimentation
has dropped from the total output of a major research
organization to one person, an evening, and a beefy laptop.
AI psychology
[Yao et al., 2023]
Tree of Thoughts:
Deliberate Problem Solving with Large Language Models
- helping large language models to 'think'
:: substantial performance boost
potential problems
value alignment
- democratic / liberal for most Western chatbots
:: no opinion given
- possible alternative value alignments
:: socialist / suppressive (China)
:: religious, e.g. islamic
:: ...
copyright
- training corpus
- ¬(generated text)
social media pollution
- mass generation of social media posts
:: value marketing
$\to\ \ $ training data pollution
regulation
- data privacy / age verification / risk of wrong answers (Italy)
- EU AI act (dangerous applications)
- however, open-source explosion
:: regulation not possible?