Advanced Introduction to C++, Scientific Computing and Machine Learning




Claudius Gros, SS 2024

Institute for Theoretical Physics
Goethe-University Frankfurt a.M.

Generative Architectures

variational inference



Graves (2011)
Practical variational inference for neural networks

conditional probabilities


accuracy objective

$$ L^N(\mathbf{w},\mathcal{D}) = -\log\big(P(\mathcal{D}|\mathbf{w})\big) = - \sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}} \log\big(P(\mathbf{y}|\mathbf{x},\mathbf{w})\big) $$

variational inference

$$ \mathcal{F} = - \left\langle \log\left[ \frac{P(\mathcal{D}|\mathbf{w})P(\mathbf{w}|\alpha)}{Q(\mathbf{w}|\beta)} \right]\right\rangle_{\mathbf{w}\sim Q(\beta)} $$

Kullback-Leibler divergence

$$ \mathcal{F} = \left\langle L^N(\mathbf{w},\mathcal{D}) \right\rangle_{\mathbf{w}\sim Q(\beta)} + D_{\rm KL}\big(Q(\beta)\parallel P(\alpha)\big) $$
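The decomposition follows in one step by splitting the logarithm inside the expectation:

$$ \mathcal{F} = -\big\langle \log P(\mathcal{D}|\mathbf{w})\big\rangle_{\mathbf{w}\sim Q(\beta)} + \left\langle \log\frac{Q(\mathbf{w}|\beta)}{P(\mathbf{w}|\alpha)}\right\rangle_{\mathbf{w}\sim Q(\beta)} = \left\langle L^N(\mathbf{w},\mathcal{D}) \right\rangle_{\mathbf{w}\sim Q(\beta)} + D_{\rm KL}\big(Q(\beta)\parallel P(\alpha)\big) $$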

variational distributions


inference

variational autoencoder


probabilistic network representations


accuracy target

$$ \mathcal{F}_{\mathrm{accuracy}} = - \sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}} \log\big(P(\mathbf{y}|\mathbf{z},\mathbf{w}^\prime)\big) $$

information maximization

$$ \mathcal{F}_{\mathrm{information}} = D_{\mathrm{KL}} \big(P_{\mathrm{posterior}} \parallel P_{\mathrm{prior}} \big) $$ $$ P_{\mathrm{prior/posterior}} = P(\mathbf{z}|\mathbf{x},\mathbf{w})\Big|_{\mathrm{before/after\ updating}} $$
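For illustration only: in a standard variational autoencoder the latent posterior is commonly taken as a diagonal Gaussian $N(\boldsymbol\mu,\boldsymbol\sigma^2)$ and the prior as a standard normal, in which case the information (KL) term has the closed form sketched below. This is a frequent concrete choice, not identical to the before/after-updating notation above; all names and values in the sketch are illustrative.

#include <cmath>
#include <vector>
#include <iostream>

// closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian posterior,
// summed over the latent dimensions; a common instance of the information term in VAEs
double klGaussianToStandardNormal(const std::vector<double>& mu,
                                  const std::vector<double>& logVar)
{
  double kl = 0.0;
  for (std::size_t i = 0; i < mu.size(); ++i)
    kl += 0.5*(std::exp(logVar[i]) + mu[i]*mu[i] - 1.0 - logVar[i]);
  return kl;
}

int main()
{
  std::vector<double> mu     = {0.3, -0.1};   // encoder means       (toy values)
  std::vector<double> logVar = {-0.5,  0.2};  // encoder log-variances (toy values)
  std::cout << "KL term: " << klGaussianToStandardNormal(mu, logVar) << "\n";
  return 0;
}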

natural language processing (NLP)



[Smerity]

encoder-decoder architectures for NLP

word embedding

variants

semantic correlations

If the universe has an explanation of its existence,
that explanation is God.
[Mohebbi et al. 2022]





query, key & value

$$ \mathbf{Q} = \hat{Q}\cdot\mathbf{x},\qquad \mathbf{K} = \hat{K}\cdot\mathbf{x},\qquad \mathbf{V} = \hat{V}\cdot\mathbf{x} $$
[Bahdanau, Cho, Bengio 2014]


Alice: should I pay attention to Bob?



dot-product attention

[Attention Is All You Need (2017)]
[uvadlc-notebooks]
$$ \mathbf{a} = \sum_j \alpha_j\mathbf{V}_j, \quad\qquad \alpha_{j} = \frac{\mathrm{e}^{e_{j}}}{\sum_i\mathrm{e}^{e_{i}}} \quad\qquad e_{j} = \mathbf{Q}\cdot\mathbf{K}_{j} $$ [Bahdanau, Cho, Bengio 2014]
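A minimal sketch of dot-product attention for a single query, using toy projection matrices $\hat{Q},\hat{K},\hat{V}$ and token embeddings; all names, dimensions and values are illustrative.

#include <cmath>
#include <vector>
#include <iostream>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;            // row-major matrix

Vec matVec(const Mat& M, const Vec& x)   // M * x
{
  Vec y(M.size(), 0.0);
  for (std::size_t i = 0; i < M.size(); ++i)
    for (std::size_t j = 0; j < x.size(); ++j)
      y[i] += M[i][j]*x[j];
  return y;
}

double dot(const Vec& a, const Vec& b)
{
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i]*b[i];
  return s;
}

int main()
{
  // toy projection matrices  \hat{Q}, \hat{K}, \hat{V}  (2x2, arbitrary values)
  Mat Qhat = {{1.0, 0.0}, {0.0, 1.0}};
  Mat Khat = {{0.5, 0.2}, {0.1, 0.7}};
  Mat Vhat = {{0.3, 0.9}, {0.8, 0.4}};

  // token embeddings x_j; the query is taken from the first token
  std::vector<Vec> tokens = {{1.0, 0.5}, {0.2, 1.2}, {0.7, 0.1}};
  Vec Q = matVec(Qhat, tokens[0]);

  // attention scores e_j = Q . K_j and softmax weights alpha_j
  std::vector<double> e, alpha;
  for (const Vec& x : tokens) e.push_back(dot(Q, matVec(Khat, x)));
  double norm = 0.0;
  for (double ej : e) { alpha.push_back(std::exp(ej)); norm += alpha.back(); }
  for (double& a : alpha) a /= norm;

  // attended value  a = sum_j alpha_j V_j
  Vec a(2, 0.0);
  for (std::size_t j = 0; j < tokens.size(); ++j) {
    Vec Vj = matVec(Vhat, tokens[j]);
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += alpha[j]*Vj[i];
  }
  for (double ai : a) std::cout << ai << " ";
  std::cout << "\n";
  return 0;
}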

transformer block

[Attention Is All You Need (2017)]
[Peter Bloem]
$$ \begin{array}{ccccccc} \fbox{$\phantom{\big|}\mathrm{input}\phantom{\big|}$} &\to& \fbox{$\phantom{\big|}\mathrm{attention}\phantom{\big|}$} &\to& \fbox{$\phantom{\big|}\mathrm{normalization}\phantom{\big|}$} & & \\[0.5ex] &\to& \fbox{$\phantom{\big|}\mathrm{feed\ forward}\phantom{\big|}$} &\to& \fbox{$\phantom{\big|}\mathrm{normalization}\phantom{\big|}$} &\to& \fbox{$\phantom{\big|}\mathrm{output}\phantom{\big|}$} \end{array} $$
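Schematically, in code: a post-norm block as in the diagram, with toy stand-ins for the sub-layers; the residual additions anticipate the skip connections discussed below. All functions here are illustrative placeholders, not trained layers.

#include <algorithm>
#include <cmath>
#include <vector>
#include <iostream>

using Vec = std::vector<double>;

Vec selfAttention(const Vec& x) { return x; }   // placeholder: identity
Vec feedForward(const Vec& x)                   // placeholder: elementwise ReLU
{
  Vec y(x);
  for (double& v : y) v = std::max(0.0, v);
  return y;
}
Vec layerNorm(const Vec& x)                     // shift to zero mean, unit variance
{
  double mean = 0.0, var = 0.0;
  for (double v : x) mean += v;
  mean /= x.size();
  for (double v : x) var += (v - mean)*(v - mean);
  var /= x.size();
  Vec y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    y[i] = (x[i] - mean)/std::sqrt(var + 1e-8);
  return y;
}
Vec add(const Vec& a, const Vec& b)             // residual (skip) addition
{
  Vec y(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) y[i] = a[i] + b[i];
  return y;
}

// one transformer block:
// input -> attention -> (skip + normalization) -> feed forward -> (skip + normalization)
Vec transformerBlock(const Vec& input)
{
  Vec a   = layerNorm(add(input, selfAttention(input)));
  Vec out = layerNorm(add(a, feedForward(a)));
  return out;
}

int main()
{
  Vec x = {0.2, -1.0, 0.7, 0.1};
  for (double v : transformerBlock(x)) std::cout << v << " ";
  std::cout << "\n";
  return 0;
}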

multi-headed attention



skip connections



[Smerity]

skip/residual connections

advantages

positional encoding

[Attention Is All You Need (2017)]

position of words in a sentence

discrete Fourier analysis

$$ P_e(k,2i ) = \sin\left(\frac{k}{K^{2i}}\right), \quad\qquad P_e(k,2i+1) = \cos\left(\frac{k}{K^{2i}}\right), \quad\qquad K = n^{1/d_{\mathrm{model}}} $$
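A sketch of the encoding above in code, assuming $n = 10000$ as in the original transformer paper; function name and matrix layout are illustrative.

#include <cmath>
#include <vector>
#include <iostream>

// sinusoidal positional encoding P_e(k, 2i) = sin(k/K^(2i)), P_e(k, 2i+1) = cos(k/K^(2i)),
// with K = n^(1/d_model) and n = 10000
std::vector<std::vector<double>> positionalEncoding(int nPositions, int dModel)
{
  const double K = std::pow(10000.0, 1.0/dModel);
  std::vector<std::vector<double>> Pe(nPositions, std::vector<double>(dModel, 0.0));
  for (int k = 0; k < nPositions; ++k)
    for (int i = 0; 2*i < dModel; ++i) {
      double angle = k/std::pow(K, 2.0*i);
      Pe[k][2*i] = std::sin(angle);
      if (2*i + 1 < dModel) Pe[k][2*i + 1] = std::cos(angle);
    }
  return Pe;
}

int main()
{
  auto Pe = positionalEncoding(4, 6);     // 4 positions, model dimension 6
  for (const auto& row : Pe) {
    for (double v : row) std::cout << v << " ";
    std::cout << "\n";
  }
  return 0;
}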

recursive prediction

[Peter Bloem]

hello $\ \to\ $ hello

beam search

not the most likely individual words/characters,
but the most likely sentences


beam search




[Dive Into Deep Learning]

beam search



combined probabilities
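A minimal beam-search sketch (C++17), assuming a toy next-token distribution nextLogProbs(); it keeps the beamWidth sequences with the highest combined log-probability instead of choosing greedily at each step.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

// toy next-token distribution: log P(token | prefix); here independent of the
// prefix, which is enough to illustrate how beams accumulate log-probabilities
std::vector<std::pair<char,double>> nextLogProbs(const std::string& /*prefix*/)
{
  return { {'a', std::log(0.5)}, {'b', std::log(0.3)}, {'c', std::log(0.2)} };
}

int main()
{
  const int beamWidth = 2, length = 3;
  // each beam: (generated string, accumulated log-probability)
  std::vector<std::pair<std::string,double>> beams = { {"", 0.0} };

  for (int step = 0; step < length; ++step) {
    std::vector<std::pair<std::string,double>> candidates;
    for (const auto& [prefix, logP] : beams)
      for (const auto& [token, logPTok] : nextLogProbs(prefix))
        candidates.push_back({prefix + token, logP + logPTok});  // combined probability

    // keep only the beamWidth most probable candidate sequences
    std::sort(candidates.begin(), candidates.end(),
              [](const auto& a, const auto& b){ return a.second > b.second; });
    candidates.resize(beamWidth);
    beams = candidates;
  }

  for (const auto& [seq, logP] : beams)
    std::cout << seq << "  log P = " << logP << "\n";
  return 0;
}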

masked self attention




[Jay Alammar]

ordered input tokens

$$ \begin{array}{rcl} \mathbf{s}_q^{\mathrm{(head)}} &= & \sum_k \mathbf{V}_k\, \mathrm{softmax}\Big(F_a(\mathbf{K}_k,\mathbf{Q}_{q})\Big) \\[0.5ex] &\to& \frac{\sum_k\mathbf{V}_k \big(\mathbf{K}_k\,\cdot\,\mathbf{Q}_{q}\big)} {\sum_k \mathbf{K}_k\,\cdot\,\mathbf{Q}_{q}} = \frac{\big(\sum_k \mathbf{V}_k \otimes\mathbf{K}_k\big)\,\cdot\,\mathbf{Q}_{q}} {\big(\sum_k \mathbf{K}_k\big)\,\cdot\,\mathbf{Q}_{q}} \end{array} $$

associative soft weight computing

$$ W_m = W_{m-1} + \mathbf{V}_m \otimes\mathbf{K}_m $$ $$ \mathbf{V}_m \otimes\mathbf{K}_m = \mathbf{x}_m\,\cdot\,\big(\hat{V} \otimes\hat{K} \big)\,\cdot\,\mathbf{x}_m $$
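A sketch of this associative update in code: the outer products $\mathbf{V}_m\otimes\mathbf{K}_m$ are accumulated into a running matrix $W$ together with the key sum, and a query is answered with the linearized readout from the formula above; the struct, its names and the toy values are illustrative only.

#include <vector>
#include <iostream>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// running associative memory for linearized attention:
// W_m = W_{m-1} + V_m (outer) K_m,   z_m = z_{m-1} + K_m
struct LinearAttentionState {
  Mat W;
  Vec z;
  LinearAttentionState(int d) : W(d, Vec(d, 0.0)), z(d, 0.0) {}

  void update(const Vec& V, const Vec& K)
  {
    for (std::size_t i = 0; i < W.size(); ++i) {
      for (std::size_t j = 0; j < W.size(); ++j) W[i][j] += V[i]*K[j];
      z[i] += K[i];
    }
  }

  // readout  s_q = (W . Q) / (z . Q), the linearized form given above
  Vec read(const Vec& Q) const
  {
    Vec num(W.size(), 0.0);
    double den = 0.0;
    for (std::size_t i = 0; i < W.size(); ++i) {
      for (std::size_t j = 0; j < W.size(); ++j) num[i] += W[i][j]*Q[j];
      den += z[i]*Q[i];
    }
    for (double& v : num) v /= den;
    return num;
  }
};

int main()
{
  LinearAttentionState state(2);
  state.update({0.4, 0.7}, {1.0, 0.2});   // (V_1, K_1), toy values
  state.update({0.9, 0.1}, {0.3, 0.8});   // (V_2, K_2)
  Vec s = state.read({0.6, 0.5});         // query Q
  std::cout << s[0] << " " << s[1] << "\n";
  return 0;
}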

why does it work?







attention in neuroscience / psychology


ML attention

active during training $\ \color{brown}\Rightarrow\ $ self-consistency loop


self-optimized information routing
solves the fitting / generalization dilemma


foundation models

GPT - generative pretrained transformer

automatic pretraining by predicting the next word
$ \begin{array}{rl} {\color{red}\equiv} & \mathrm{base\ model\ (GPT\!-\!3)} \\[0.0ex] {\color{red}+} & \mathrm{human\ supervised\ finetuning\ (SFT\ model)} \\[0.0ex] {\color{red}+} & \mathrm{human\ supervised\ reinforcement\ learning} \\[0.0ex] {\color{red}\Rightarrow} & \mathrm{chat\ assistant\ (foundation\ model)} \end{array} $

[Foundation Models for Decision Making]


applications / tasks on top of foundation model






value alignment

[TechTalks]

reinforcement learning from human feedback (RLHF)





open-source explosion

LLaMA (Meta AI)

'We have no moat'


Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $\$$100 and 13B params that we struggle with at $\$$10M and 540B. And they are doing so in weeks, not months.

The barrier to entry for training and experimentation has dropped from the total output of a major research organization to one person, an evening, and a beefy laptop.

AI psychology

[Yao et al., 2023]

Tree of Thoughts:
Deliberate Problem Solving with Large Language Models



potential problems

value alignment

copyright

social media pollution

regulation