Machine Learning Primer -- Python Tutorial




Claudius Gros, WS 2025/26

Institut für theoretische Physik
Goethe-University Frankfurt a.M.

Generative Architectures

natural language processing (NLP)



[Smerity]

encoder-decoder architectures for NLP

word embedding

variants

semantic correlations

If the universe has an explanation of its existence,
that explanation is God.
[Mohebbi et al 2022]


query, key & value

[Bahdanau, Cho, Bengio 2014]

attention: token-based information routing

$$ \mathbf{Q}_i = \hat{Q}\cdot\mathbf{x}_i,\qquad \mathbf{K}_i = \hat{K}\cdot\mathbf{x}_i,\qquad \mathbf{V}_i = \hat{V}\cdot\mathbf{x}_i $$

Alice: should I pay attention to Bob?

dot-product attention

[Attention Is All You Need (2017)]
[uvadlc-notebooks]
$$ \mathbf{a} = \sum_j \alpha_j\mathbf{V}_j, \quad\qquad \alpha_{j} = \frac{\mathrm{e}^{e_{j}}}{\sum_i\mathrm{e}^{e_{i}}} \quad\qquad e_{j} = \mathbf{Q}\cdot\mathbf{K}_{j} $$ [Bahdanau, Cho, Bengio 2014]
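The equations above can be sketched in a few lines of PyTorch; the random matrices here are stand-ins for trained projections $\hat{Q},\hat{K},\hat{V}$:

```python
import torch

torch.manual_seed(0)
nC, dim = 5, 4                       # context length, embedding dimension
x = torch.randn(nC, dim)             # token embeddings x_i

Q_hat = torch.randn(dim, dim)        # projection matrices (random stand-ins
K_hat = torch.randn(dim, dim)        #  for trained weights)
V_hat = torch.randn(dim, dim)

Q = x @ Q_hat.T                      # Q_i = Q_hat . x_i, one row per token
K = x @ K_hat.T
V = x @ V_hat.T

e     = Q @ K.T                      # e_{ij} = Q_i . K_j
alpha = torch.softmax(e, dim=-1)     # softmax over j, every row sums to one
a     = alpha @ V                    # a_i = sum_j alpha_{ij} V_j

print(alpha)
print(a)
```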

transformer block

[Attention Is All You Need (2017)]
[Peter Bloem]
$$ \begin{array}{ccccccc} \fbox{$\phantom{\big|}\mathrm{input}\phantom{\big|}$} &\to& \fbox{$\phantom{\big|}\mathrm{attention}\phantom{\big|}$} &\to& \fbox{$\phantom{\big|}\mathrm{normalization}\phantom{\big|}$} & & \\[0.5ex] &\to& \fbox{$\phantom{\big|}\mathrm{feed\ forward}\phantom{\big|}$} &\to& \fbox{$\phantom{\big|}\mathrm{normalization}\phantom{\big|}$} &\to& \fbox{$\phantom{\big|}\mathrm{output}\phantom{\big|}$} \end{array} $$

multi-headed attention

skip connections



skip/residual connections

advantages

transformer layer norm

positional encoding

[Attention Is All You Need (2017)]

position of words

rotary positional encoding (RoPE)

[Su et al (2021/24)]
$$ \mathbf{x} = \big(x_1, x_2, \dots\big) \quad\to\quad \big(\vec{x}_1, \vec{x}_2,\dots\big), \quad\qquad \vec{x}_i = {x_i\choose y_i} $$ $$ \mathbf{Q}\quad \to\quad \big( \vec{Q}_1, \vec{Q}_2, \dots \big), \quad\qquad \vec{Q}_i = {Q_i^{(x)}\choose Q_i^{(y)}} $$ $$ \mathbf{Q}(l)\cdot \mathbf{K}(m) \ =\ \sum_i Q_i(l) K_i(m) \quad \to\quad \sum_i \vec{Q}_i(l)\cdot\vec{K}_i(m) $$

rotations in two dimensions

$$ \vec{K}_i(m) \quad\to\quad \left(\begin{array}{cc} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{array}\right) \vec{K}_i(m) $$ $$ \vec{Q}_i(l) \cdot \vec{K}_i(m) \quad\to\quad \vec{Q}_i(l) \left(\begin{array}{cc} \cos((m-l)\theta_i) & -\sin((m-l)\theta_i) \\ \sin((m-l)\theta_i) & \cos((m-l)\theta_i) \end{array}\right) \vec{K}_i(m) $$
"attention (dot-product) knows about distances between words"
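This relative-position property can be checked numerically: rotating query and key by angles proportional to their positions leaves the dot product invariant under a common position shift (a minimal sketch with a single 2D component and an assumed $\theta$):

```python
import torch

def rotate(v, angle):
    """rotate a 2D vector by the given angle"""
    c, s = torch.cos(angle), torch.sin(angle)
    R = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return R @ v

theta = torch.tensor(0.3)                # assumed rotation frequency
q = torch.randn(2)                       # single 2D query component
k = torch.randn(2)                       # single 2D key component

l, m = 3, 7                              # absolute token positions
q_rot = rotate(q, l * theta)
k_rot = rotate(k, m * theta)

# shifting both positions by the same amount keeps m - l fixed
q_shift = rotate(q, (l + 5) * theta)
k_shift = rotate(k, (m + 5) * theta)

print(torch.dot(q_rot, k_rot).item())
print(torch.dot(q_shift, k_shift).item())   # same value: only m - l matters
```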

ALiBi positional encoding

$$ \mathbf{Q}_j\cdot \mathbf{K}_i \ \to\ \mathbf{Q}_j\cdot \mathbf{K}_i + m(i-j) \quad\qquad j\ge i \ \ \mathrm{(causal)} $$ $$ \mathrm{e}^{\mathbf{Q}_j\cdot \mathbf{K}_i} \ \to\ P(i-j) \mathrm{e}^{\mathbf{Q}_j\cdot \mathbf{K}_i} \quad\qquad P(i-j) = \mathrm{e}^{-m|i-j|} $$ $$ \fbox{$\phantom{\big|}\displaystyle m = \frac{1}{A^{h/N_{\rm head}}} \phantom{\big|}$} \quad\qquad h=1,\cdots,N_{\rm head} \quad\qquad A=2^8 $$
#!/usr/bin/env python3

# ALiBi positional embedding via broadcasting

import torch
import math

nC = 6       # context length
nHead = 2    # number of attention heads

rel_dist = torch.arange(0, nC).view(1, 1, nC) -\
           torch.arange(0, nC).view(1, nC, 1)
slopes = torch.tensor([1.0/(2.0**(h*1.0/nHead)) for h in range(nHead)])
# note: simplified demo with A=2 and h=0,...,nHead-1
# (the text uses A=2^8 and h=1,...,nHead)
biases = -slopes.view(nHead, 1, 1) * rel_dist.abs() 
ALiBi_tensor = biases.exp()

print(rel_dist)
print(slopes)
print(biases)
print(ALiBi_tensor)
print()
print("# === testing ===")
print()
test = torch.ones(nHead,nC,nC)
print(test)
print(ALiBi_tensor*test)
print()

expressive attention





dot-product attention

expressive attention

masked self attention




[Jay Alammar]

ordered input tokens

$$ \begin{array}{rcl} \mathbf{s}_q^{\mathrm{(head)}} &= & \sum_k \mathbf{V}_k\, \mathrm{softmax}\Big(F_a(\mathbf{K}_k,\mathbf{Q}_{q})\Big) \\[0.5ex] &\to& \frac{\sum_k\mathbf{V}_k \big(\mathbf{K}_k\,\cdot\,\mathbf{Q}_{q}\big)} {\sum_k \mathbf{K}_k\,\cdot\,\mathbf{Q}_{q}} = \frac{\big(\sum_k \mathbf{V}_k \otimes\mathbf{K}_k\big)\,\cdot\,\mathbf{Q}_{q}} {\big(\sum_k \mathbf{K}_k\big)\,\cdot\,\mathbf{Q}_{q}} \end{array} $$

associative soft weight computing

$$ W_m = W_{m-1} + \mathbf{V}_m \otimes\mathbf{K}_m $$ $$ \mathbf{V}_m \otimes\mathbf{K}_m = \mathbf{x}_m\,\cdot\,\big(\hat{V}_m \otimes\hat{K}_m \big)\,\cdot\,\mathbf{x}_m $$
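The recursion can be checked numerically, with random stand-ins for the projected queries, keys, and values: accumulating the outer products $\mathbf{V}_m\otimes\mathbf{K}_m$ reproduces the batch expression for softmax-free (linear) attention.

```python
import torch

torch.manual_seed(1)
nC, dim = 6, 3                       # context length, embedding dimension
Q = torch.randn(nC, dim)             # random stand-ins for projected tokens
K = torch.randn(nC, dim)
V = torch.randn(nC, dim)

# batch form: (sum_k V_k (x) K_k) . Q  /  (sum_k K_k) . Q
W       = torch.einsum('kd,ke->de', V, K)    # sum of outer products
K_sum   = K.sum(dim=0)
a_batch = (W @ Q[-1]) / (K_sum @ Q[-1])

# recursive form: W_m = W_{m-1} + V_m (x) K_m
W_rec = torch.zeros(dim, dim)
K_rec = torch.zeros(dim)
for m in range(nC):
    W_rec = W_rec + torch.outer(V[m], K[m])
    K_rec = K_rec + K[m]
a_rec = (W_rec @ Q[-1]) / (K_rec @ Q[-1])

print(a_batch)
print(a_rec)                         # identical up to rounding
```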

recursive prediction / character decoding

[Peter Bloem]

hello $\ \to\ $ hello

output  →  token probability

$$ p(T_j|\mathbf{y}) = \frac{1}{Z}\,\mathrm{e}^{\langle T_j|\mathbf{y}\rangle/T}, \quad\qquad Z = \sum_i \mathrm{e}^{\langle T_i|\mathbf{y}\rangle/T}, \quad\qquad \sum_j p(T_j|\mathbf{y}) = 1 $$
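The decoding step can be sketched directly (random stand-ins for the token embeddings $T_j$ and the output vector $\mathbf{y}$; $T$ is the temperature):

```python
import torch

torch.manual_seed(2)
nToken, dim = 5, 4
tokens = torch.randn(nToken, dim)    # token embeddings T_j (stand-ins)
y      = torch.randn(dim)            # network output vector

def tokenProbabilities(y, tokens, T=1.0):
    """p(T_j|y) = exp(<T_j|y>/T)/Z, with Z normalizing to one"""
    return torch.softmax(tokens @ y / T, dim=0)

for T in (0.5, 1.0, 2.0):
    p = tokenProbabilities(y, tokens, T)
    print(f'T={T:3.1f}  max p = {p.max():6.3f}  sum = {p.sum():6.3f}')
# lower temperature T -> sharper (more deterministic) distribution
```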

beam search / top-k / top-p

text generation

not the most likely individual word/token,
but the most likely sentence



[Dive Into Deep Learning]

[How to generate text]

beam search variants

combined probabilities
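A minimal beam-search sketch over a hand-made next-token table (all probabilities here are invented for illustration) shows why ranking by combined probability matters:

```python
import math

# toy next-token model: conditional probabilities made up for illustration
probs = {
    '':  {'a': 0.6, 'b': 0.4},       # first token
    'a': {'a': 0.5, 'b': 0.5},
    'b': {'a': 0.1, 'b': 0.9},
}

def nextProbs(seq):
    return probs[seq[-1:]]           # condition on the last token only

def beamSearch(nSteps, beamWidth):
    beams = [('', 0.0)]              # (sequence, log probability)
    for _ in range(nSteps):
        candidates = []
        for seq, logp in beams:
            for tok, p in nextProbs(seq).items():
                candidates.append((seq + tok, logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beamWidth]      # keep the best sequences
    return beams

for seq, logp in beamSearch(2, 2):
    print(seq, f'{math.exp(logp):.3f}')
# greedy decoding starts with 'a' (p=0.6) and reaches at best p=0.30;
# beam search keeps 'b' alive and finds 'bb' with p=0.36
```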


why does it work?







attention in neuroscience / psychology


ML attention

active during training $\ \color{brown}\Rightarrow\ $ self-consistency loop


self-optimized information routing
solves the fitting / generalization dilemma

hands-on transformer bilayer code



#!/usr/bin/env python3

# basic attention layer
# no batch processing, no layer norm
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

nLayer   = 24              # number of layers
nContext = 24              # context length
dim      =  2              # embedding dimension

nIter    = 4000            # training iterations
learning_rate = 1.0e-3

# ==========================
# transformer bilayer module
# attention plus FFN
# ==========================
class transformerBilayer(torch.nn.Module):
  def __init__(self, dim, nContext):
    """Q/K/V matrices will be broadcasted along the
       context dimension
    """
    super().__init__()
    self.alpha    = torch.zeros(nContext,nContext)
    self.nContext = nContext     # context length
    self.dim      = dim          # embedding dimension
    self.hidden   = 4            # hidden layer expanded size
#
# Q/K/V matrices: requires_grad after scaling
#
    mySigma = 0.1/(dim*dim)
    self.Q_mat = mySigma*torch.randn(1,dim,dim)
    self.K_mat = mySigma*torch.randn(1,dim,dim)
    self.V_mat = mySigma*torch.randn(1,dim,dim)
    self.Q_mat.requires_grad_(True)
    self.K_mat.requires_grad_(True)
    self.V_mat.requires_grad_(True)
#
# FFN: feed forward network  
#      FFN_1 --> hidden --> FFN_2
#
    self.FFN_1  = torch.randn(1,self.hidden*dim,dim)*mySigma
    self.FFN_2  = torch.randn(1,dim,self.hidden*dim)*mySigma
    self.FFN_b  = torch.randn(1,self.hidden*dim,requires_grad=True)
    self.FFN_1.requires_grad_(True)
    self.FFN_2.requires_grad_(True)
#
# padding mask for causal attention
#
    self.paddingMask = torch.zeros(nContext,nContext)   # for masking
    for ii in range(nContext):
      for jj in range(ii+1,nContext):
        self.paddingMask[ii][jj] = -1.0e9               # exp -> 0

  def FFN(self, x):
    """FFN sublayer
    """ 
# broadcasting  (1,.. ,.. ) --> (context,.. ,.. )
    shape_1 = (self.nContext,self.hidden*self.dim,self.dim)
    shape_2 = (self.nContext,self.dim,self.hidden*self.dim)
    shape_b = (self.nContext,self.hidden*self.dim)
    FFN_1_all = self.FFN_1.expand(shape_1)
    FFN_2_all = self.FFN_2.expand(shape_2)
    FFN_b_all = self.FFN_b.expand(shape_b)
#
    xx = torch.einsum("chd,cd->ch",FFN_1_all,x)
    hh = torch.tanh(xx+FFN_b_all)
    yy = torch.einsum("cdh,ch->cd",FFN_2_all,hh)
    return yy + x


  def attention(self, x):
    """attention sublayer
    """ 
# broadcasting  (1,dim,dim) --> (context,dim,dim)
    expanded_shape = (self.nContext,self.dim,self.dim)
    Q_all = self.Q_mat.expand(expanded_shape)
    K_all = self.K_mat.expand(expanded_shape)
    V_all = self.V_mat.expand(expanded_shape)
# c and C: context (input length)
# d and D: dim (embedding dimension)
    QQ = torch.einsum("cdD,cD->cd",Q_all,x)
    KK = torch.einsum("cdD,cD->cd",K_all,x)
    VV = torch.einsum("cdD,cD->cd",V_all,x)
# normalized attention matrix
    logits  = torch.einsum("cd,Cd->cC",QQ,KK)
    alpha   = torch.exp(logits+self.paddingMask)
    row_sum = alpha.sum(dim=-1, keepdim=True) 
    alpha   = alpha / row_sum
#  store detached attention matrix
    self.alpha = alpha.detach()
# return with skip connections
    return torch.matmul(alpha,VV) + x   

  def forward(self, x):
    """indexing always from the back
    """ 
    x = self.attention(x)
    x = self.FFN(x)
    return x

  def update(self, eps):                # updating parameters
    with torch.no_grad():
      self.Q_mat -= eps*self.Q_mat.grad
      self.K_mat -= eps*self.K_mat.grad
      self.V_mat -= eps*self.V_mat.grad
      self.Q_mat.grad = None
      self.K_mat.grad = None
      self.V_mat.grad = None
      self.FFN_1 -= eps*self.FFN_1.grad
      self.FFN_2 -= eps*self.FFN_2.grad
      self.FFN_b -= eps*self.FFN_b.grad
      self.FFN_1.grad = None
      self.FFN_2.grad = None
      self.FFN_b.grad = None

# ======================
# n-identical layer model
# ======================
allLayers = [transformerBilayer(dim,nContext) for iL in range(nLayer)]
def model(x):
  for iLayer in range(nLayer):
    x = allLayers[iLayer](x)
  return x

# ====================================
# console printing of attention matrix
# ====================================
def printAttenionMatrix():
  for iLayer in range(nLayer):
    print()
    print("# attention matrix for layer ", iLayer)
    for ss in range(nContext):
      for tt in range(nContext):
         alpha = allLayers[iLayer].alpha[ss][tt]
         print(f'{alpha:9.4f}', end="")
      print()

# ======================
# standard loss function
# ======================
def lossFunction(outputActivity, targetActivity):
  return torch.square(outputActivity - targetActivity).sum()

# ================================================
# training data, token at position i: x[i]
# x[i+1] = x[i] - x[i-1]  (modulo normalization)
# settles into a limit cycle of period six 
# initial warm-up is discarded
# ================================================
def trainingSequence(seqLength, myDim):
  """generates a random vector difference sequence"""
  data    = torch.zeros(2*seqLength,myDim)
  data[0] = F.normalize(torch.randn(myDim), p=2, dim=0)
  data[1] = F.normalize(torch.randn(myDim), p=2, dim=0)
  for ss in range(2,2*seqLength):
    vector_sum = data[ss-1]-data[ss-2]
    data[ss] = F.normalize(vector_sum, p=2, dim=0)
  return data[seqLength:]        # discard warm-up phase

if (1==2):
  testLength = 20
  training_data = trainingSequence(testLength,dim)
  print("# --- training_data ---")
  for ll in range(testLength):
    for dd in range(dim): 
      print(f'{training_data[ll][dd]:8.3f}',end="")
    print()

# ===========================================
# training with random sequences
# token prediction == shifting prompt by (-1)
# ===========================================
print("# --- training ---")
for iIter in range(nIter):
  training_all   = trainingSequence(nContext+1,dim)
  training_data  = training_all[:-1]
  training_value = training_all[1:]
  loss = lossFunction(model(training_data),training_value)
  if (loss<0.001):
    break
  loss.backward()
#
  for iLayer in range(nLayer):
    allLayers[iLayer].update(learning_rate)
  if (iIter%200==0):                # print progress
    print(f'{iIter:4d} {loss.item():9.4f}')
#
# visual testing
#
test_all   = trainingSequence(nContext+1,dim)
test_data  = test_all[:-1]
test_value = test_all[1:]
yy = model(test_data)
print("# --- value vs. output ---")
for ll in range(nContext):
  for dd in range(dim): 
    print(f'{test_value[ll][dd]:8.3f}',end="")
  print(" |",end="")
  for dd in range(dim): 
    print(f'{        yy[ll][dd]:8.3f}',end="")
  print()
#
# print attention matrix
#
if (1==2):
  print()
  printAttenionMatrix()

functional transformer block

#!/usr/bin/env python3

# basic attention layer
# no batch processing, no layer norm
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

nLayer   = 24              # number of layers
nContext = 24              # context length
dim      =  2              # embedding dimension

nIter    = 4000            # training iterations
learning_rate = 1.0e-3

# ==========================
# transformer bilayer module
# using torch.nn modules
# ==========================
class transformerBlock(nn.Module):
    def __init__(self, dim, nContext, hidden_mult=4):
        super().__init__()
        self.nContext = nContext

# LayerNorms for stability
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

# single-head attention
        self.attn = nn.MultiheadAttention(embed_dim=dim, 
                    num_heads=1, batch_first=True)

# feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim),
            nn.Tanh(),
            nn.Linear(hidden_mult * dim, dim)
        )

# causal mask
        mask = torch.triu(torch.ones(nContext, nContext)\
             * float('-inf'), diagonal=1)
        self.register_buffer("causal_mask", mask)

# store last attention weights for inspection
        self.alpha = torch.zeros(nContext, nContext)

    def forward(self, x):
        # (nContext, dim) → (1, nContext, dim)
        x = x.unsqueeze(0)

# attention with layer norm and skip connections
        x_norm = self.norm1(x)
        attn_out, attn_weights = self.attn(
            x_norm, x_norm, x_norm,
            attn_mask=self.causal_mask
        )
        self.alpha = attn_weights[0].detach()
        x = x + attn_out     

# feed-forward with layer normalization
        ff_out = self.ffn(self.norm2(x))
        x = x + ff_out 

        return x.squeeze(0)  # (nContext, dim)

    def update(self, eps):
        """manual SGD update with gradient clipping."""
        with torch.no_grad():
            torch.nn.utils.clip_grad_norm_(self.parameters(),
                                           max_norm=1.0)
            for p in self.parameters():
                if p.grad is not None:
                    p -= eps * p.grad
                    p.grad = None

# ======================
# n-identical layer model
# ======================
allLayers = [transformerBlock(dim,nContext) for iL in range(nLayer)]
def model(x):
  for iLayer in range(nLayer):
    x = allLayers[iLayer](x)
  return x

# ====================================
# console printing of attention matrix
# ====================================
def printAttenionMatrix():
  for iLayer in range(nLayer):
    print()
    print("# attention matrix for layer ", iLayer)
    for ss in range(nContext):
      for tt in range(nContext):
         alpha = allLayers[iLayer].alpha[ss][tt]
         print(f'{alpha:9.4f}', end="")
      print()

# ======================
# standard loss function
# ======================
def lossFunction(outputActivity, targetActivity):
  return torch.square(outputActivity - targetActivity).sum()

# ================================================
# training data, token at position i: x[i]
# x[i+1] = x[i] - x[i-1]  (modulo normalization)
# settles into a limit cycle of period six 
# initial warm-up is discarded
# ================================================
def trainingSequence(seqLength, myDim):
  """generates a random vector difference sequence"""
  data    = torch.zeros(2*seqLength,myDim)
  data[0] = F.normalize(torch.randn(myDim), p=2, dim=0)
  data[1] = F.normalize(torch.randn(myDim), p=2, dim=0)
  for ss in range(2,2*seqLength):
    vector_sum = data[ss-1]-data[ss-2]
    data[ss] = F.normalize(vector_sum, p=2, dim=0)
  return data[seqLength:]        # discard warm-up phase

if (1==2):
  testLength = 20
  training_data = trainingSequence(testLength,dim)
  print("# --- training_data ---")
  for ll in range(testLength):
    for dd in range(dim): 
      print(f'{training_data[ll][dd]:8.3f}',end="")
    print()

# ===========================================
# training with random sequences
# token prediction == shifting prompt by (-1)
# ===========================================
print("# --- training ---")
for iIter in range(nIter):
  training_all   = trainingSequence(nContext+1,dim)
  training_data  = training_all[:-1]
  training_value = training_all[1:]
  loss = lossFunction(model(training_data),training_value)
  if (loss<0.001):
    break
  loss.backward()
#
  for iLayer in range(nLayer):
    allLayers[iLayer].update(learning_rate)
  if (iIter%200==0):                # print progress
    print(f'{iIter:4d} {loss.item():9.4f}')
#
# visual testing
#
test_all   = trainingSequence(nContext+1,dim)
test_data  = test_all[:-1]
test_value = test_all[1:]
yy = model(test_data)
print("# --- value vs. output ---")
for ll in range(nContext):
  for dd in range(dim): 
    print(f'{test_value[ll][dd]:8.3f}',end="")
  print(" |",end="")
  for dd in range(dim): 
    print(f'{        yy[ll][dd]:8.3f}',end="")
  print()
#
# print attention matrix
#
if (1==2):
  print()
  printAttenionMatrix()

full micro code

#!/usr/bin/env python3
# basic transformer with batch processing and PyTorch optimizer
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

nLayer   = 24              # number of layers
nContext = 24              # context length
dim      = 2               # embedding dimension
batch_size = 32            # batch size for training

nIter    = 4000            # training iterations
learning_rate = 1.0e-3

# ==========================
# transformer block module
# enhanced for batch processing
# ==========================
class transformerBlock(nn.Module):
    def __init__(self, dim, nContext, hidden_mult=4):
        super().__init__()
        self.nContext = nContext

# layer norm
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

# attention with num_heads heads
        self.attn = nn.MultiheadAttention(embed_dim=dim, 
                    num_heads=1, batch_first=True)

# FFN
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim),
            nn.Tanh(),
            nn.Linear(hidden_mult * dim, dim)
        )

# causal mask 
        mask = torch.triu(torch.ones(nContext, nContext) * float('-inf'),
                                     diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, x):
# x shape: (batch_size, nContext, dim)
#       batch_size = x.size(0)
        x_norm = self.norm1(x)
        attn_out, attn_weights = self.attn(
            x_norm, x_norm, x_norm,
            attn_mask=self.causal_mask
        )
        
        x = x + attn_out     
        ff_out = self.ffn(self.norm2(x))
        x = x + ff_out 

        return x 

# ======================
# complete model class
# ======================
class TransformerModel(nn.Module):
    def __init__(self, dim, nContext, nLayer):
        super().__init__()
        self.layers = nn.ModuleList([
            transformerBlock(dim, nContext) for _ in range(nLayer)
        ])
    
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# initialize model
model = TransformerModel(dim, nContext, nLayer)

# initialize optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# ======================
# loss function
# ======================
def lossFunction(outputActivity, targetActivity):
    return torch.square(outputActivity - targetActivity).sum()

# ================================================
# training data generation
# x[i+1] = x[i] - x[i-1]  (modulo normalization)
# settles into a limit cycle of period six 
# initial warm-up is discarded
# ================================================
def trainingSequence(seqLength, myDim):
    """generates a random vector difference sequence"""
    data = torch.zeros(2*seqLength, myDim)
    data[0] = F.normalize(torch.randn(myDim), p=2, dim=0)
    data[1] = F.normalize(torch.randn(myDim), p=2, dim=0)
    for ss in range(2, 2*seqLength):
        vector_sum = data[ss-1] - data[ss-2]
        data[ss] = F.normalize(vector_sum, p=2, dim=0)
    return data[seqLength:]        # discard warm-up phase

def generateBatch(batch_size, nContext, dim):
    """generate a batch of training sequences"""
    batch_data = torch.zeros(batch_size, nContext, dim)
    batch_targets = torch.zeros(batch_size, nContext, dim)
    
    for b in range(batch_size):
        training_all = trainingSequence(nContext + 1, dim)
        batch_data[b] = training_all[:-1]
        batch_targets[b] = training_all[1:]
    
    return batch_data, batch_targets

# ===========================================
# training with batched random sequences
# ===========================================
print("# --- training with batches ---")
for iIter in range(nIter):

# generate batch / forward pass / loss
    training_data, training_targets = generateBatch(batch_size, nContext, dim)
    outputs = model(training_data)
    loss = lossFunction(outputs, training_targets) / batch_size
    if loss < 0.001:
        break
    
# backward pass, gradient clipping and update
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    
# progress printing
    if iIter % 200 == 0:      
        print(f'{iIter:4d} {loss.item():9.4f}')

# ===========================================
# visual testing with single sequence
# ===========================================
print("# --- testing ---")
test_all = trainingSequence(nContext + 1, dim)
test_data = test_all[:-1].unsqueeze(0)  # Add batch dimension
test_value = test_all[1:]

with torch.no_grad():
    yy = model(test_data).squeeze(0)  # Remove batch dimension

print("# --- value vs. output ---")
for ll in range(nContext):
    for dd in range(dim): 
        print(f'{test_value[ll][dd]:8.3f}', end="")
    print(" |", end="")
    for dd in range(dim): 
        print(f'{yy[ll][dd]:8.3f}', end="")
    print()