Machine Learning Primer -- Part II: Deep Learning




Claudius Gros, WS 2024/25

Institut für theoretische Physik
Goethe-University Frankfurt a.M.

Reinforcement Learning

interacting with the environment



no use for isolated brains

Markovian dynamics


stochastic policies

$$ \pi(S_t,A_t) = P(A_t|S_t), \qquad\qquad 1 = \sum_{A_t\,\in\,\tilde{A}_t} P(A_t|S_t), \qquad\qquad P(A_t|S_t)\ge0 $$
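
As an illustration (not part of the lecture code, the numbers are made up): a stochastic policy is just a normalized probability table $P(A_t|S_t)$ from which actions are sampled.

import numpy as np

rng = np.random.default_rng(42)

# hypothetical policy table pi[S][A] = P(A|S); every row is normalized
pi = np.array([[0.7, 0.3],     # state 0
               [0.5, 0.5],     # state 1
               [0.2, 0.8]])    # state 2
assert np.allclose(pi.sum(axis=1), 1.0) and (pi >= 0).all()

def sample_action(state):
    """draw A_t ~ P(A_t|S_t)"""
    return int(rng.choice(pi.shape[1], p=pi[state]))

print([sample_action(0) for _ in range(10)])   # mostly action 0 in state 0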

blind walker, a game

$$ \fbox{$\phantom{\big|}0\phantom{\big|}$}\to \fbox{$\phantom{\big|}1\phantom{\big|}$}\to \fbox{$\phantom{\big|}\color{red}L\phantom{\big|}$}\to \fbox{$\phantom{\big|}2\phantom{\big|}$}\to \fbox{$\phantom{\big|}\color{green}W\phantom{\big|}$}\to \fbox{$\phantom{\big|}0\phantom{\big|}$} $$
$\fbox{$\phantom{\big|}0\phantom{\big|}$}$   starting position
$\fbox{$\phantom{\big|}\color{green}W\phantom{\big|}$}$   game ends with a win; reward   1
$\fbox{$\phantom{\big|}\color{red}L\phantom{\big|}$}$   game ends with a loss; reward   0
$$ \pi(1|S_t) = \alpha, \quad\qquad \pi(2|S_t) = \beta \quad\qquad \beta = 1-\alpha, $$

time evolution

time state probability reward
$t=0$ $\fbox{$\phantom{\big|}0\phantom{\big|}$}$ $p=1$
$t=1$ $\fbox{$\phantom{\big|}1\phantom{\big|}$}$ $p=\alpha$
$t=2$ $\fbox{$\phantom{\big|}2\phantom{\big|}$}$ $p=\alpha\beta$
$t=3$ $\fbox{$\phantom{\big|}0\phantom{\big|}$}$ $p=\alpha\beta^2$ $R=1$, with probability $\ \alpha^2\beta$
$$ R \to \alpha^2\beta \qquad\qquad \alpha_{\rm opt} \to \frac{2}{3} \approx 0.667 \qquad\qquad R_{\rm opt} \to \frac{4}{27} \approx 0.148 $$ $$ R_{\rm tot} = \alpha^2\beta\, \sum_{n=0}^\infty (\alpha\beta^2)^n = \frac{\alpha^2\beta}{1-\alpha\beta^2}, \qquad\qquad \alpha_{\rm opt} \approx 0.639, \qquad\qquad R_{\rm opt} \approx 0.161 $$
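
Both optima can be cross-checked numerically; the sketch below simply scans $\alpha$ on a grid, using the closed-form expressions given above.

import numpy as np

alpha = np.linspace(0.0, 1.0, 100001)
beta  = 1.0 - alpha

R_single = alpha**2 * beta                            # win probability of a single loop
R_total  = alpha**2 * beta / (1.0 - alpha * beta**2)  # with restarts at cell 0

for name, R in [("single loop  ", R_single), ("with restarts", R_total)]:
    i = np.argmax(R)
    print(f"{name}  alpha_opt = {alpha[i]:.3f}   R_opt = {R[i]:.3f}")

# single loop    alpha_opt = 0.667   R_opt = 0.148
# with restarts  alpha_opt = 0.639   R_opt = 0.161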

value & future rewards

$$ V^\pi(S) = \sum_{t=0}^\infty \gamma^t \langle R_{t+1}\rangle $$ $$ V^\pi(S) = \sum_{A\in \tilde{A}(S)} \pi(S,A)\, Q^\pi(S,A), \quad\qquad S'=S(A), $$

Bellman equation

$$ Q^\pi(S,A) = R(S,A) + \gamma\, V^\pi(S'), \quad\qquad S' = S(A) $$
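
Combining the two relations gives a simple policy-evaluation loop. The sketch below evaluates the blind-walker policy on the 5-cell ring of the diagram; the cell indexing, $\gamma=1$, and absorbing W/L cells of value zero are assumptions of the sketch. For $\gamma=1$, $V(0)$ reproduces $R_{\rm tot}$ from above.

import numpy as np

# blind walker on the 5-cell ring  0 -> 1 -> L -> 2 -> W -> (back to 0)
labels    = ['0', '1', 'L', '2', 'W']
LOSS, WIN = 2, 4                         # ring indices of the terminal cells
alpha     = 0.639                        # pi(move 1|S); beta = 1 - alpha
pi        = np.array([alpha, 1.0 - alpha])
gamma     = 1.0                          # undiscounted: V(0) = total win probability

def step(s, a):                          # deterministic transition S' = S(A)
    s2 = (s + 1 + a) % 5
    return s2, (1.0 if s2 == WIN else 0.0)

V = np.zeros(5)                          # terminal cells keep V = 0
for sweep in range(200):                 # iterate the Bellman equation
    V_new = np.zeros(5)
    for s in range(5):
        if s in (LOSS, WIN):
            continue
        for a in (0, 1):                 # Q(S,A) = R(S,A) + gamma V(S')
            s2, r = step(s, a)
            V_new[s] += pi[a] * (r + gamma * V[s2])
    V = V_new

for s, lab in enumerate(labels):
    print(f"V({lab}) = {V[s]:.3f}")      # V(0) ~ 0.161 for alpha = 0.639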

temporal difference learning

$$ V(S)\ \to\ (1-\alpha)V(S) + \alpha \big[ R(S')+\gamma V(S') \big] $$
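
A tabular TD(0) sketch for the same blind walker; the episode loop, the ring encoding and the learning rate are assumptions of the sketch (note that the learning rate is written lr below, to avoid a clash with the policy parameter $\alpha$).

import numpy as np

rng = np.random.default_rng(0)
LOSS, WIN, N = 2, 4, 5                 # 5-cell ring  0 -> 1 -> L -> 2 -> W
p_move1      = 0.639                   # blind policy: probability of a single step
gamma, lr    = 1.0, 0.05               # discount and learning rate

V = np.zeros(N)                        # terminal cells are never updated, V = 0
for episode in range(20000):
    s = 0
    while s not in (LOSS, WIN):
        a  = 0 if rng.random() < p_move1 else 1
        s2 = (s + 1 + a) % N
        r  = 1.0 if s2 == WIN else 0.0
        # V(S) -> (1 - lr) V(S) + lr [ R + gamma V(S') ]
        V[s] = (1.0 - lr) * V[s] + lr * (r + gamma * V[s2])
        s = s2

print(np.round(V, 3))                  # V[0] fluctuates around ~0.16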

off/on policy learning

state action reward state action (SARSA)

$$ Q^{\rm new}(S_t, A_t) \ =\ (1 - \alpha) Q(S_t,A_t) + \alpha \, \left[R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1})\right] $$

Q-learning

$$ Q^{\rm new}(S_t, A_t) \ =\ (1 - \alpha) Q(S_t,A_t) + \alpha \, \left[R_{t+1} + \gamma \, Q^{\rm best}(S_{t+1})\right] $$ $$ Q^{\rm best}(S_{t+1}) =\max_{A'} Q(S_{t+1},A') $$
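
The two rules differ only in the bootstrap target: SARSA evaluates the action actually taken by the behaviour policy (on-policy), while Q-learning takes the greedy maximum (off-policy). A minimal, self-contained sketch of a single update step, with made-up numbers; the full walker implementation follows below.

import numpy as np

rng = np.random.default_rng(1)
gamma, lr, eps = 0.95, 0.1, 0.2
Q = rng.random((6, 2))                       # hypothetical 6-state, 2-action table

def eps_greedy(s):                           # behaviour policy used by both methods
    return int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))

s, a, r, s2 = 0, 1, 0.0, 2                   # one transition (S_t, A_t, R_{t+1}, S_{t+1})
a2 = eps_greedy(s2)                          # A_{t+1}, needed only by SARSA

target_sarsa  = r + gamma * Q[s2, a2]        # on-policy bootstrap
target_qlearn = r + gamma * Q[s2].max()      # off-policy (greedy) bootstrap
print(f"SARSA target {target_sarsa:.3f}   Q-learning target {target_qlearn:.3f}")

Q[s, a] = (1.0 - lr) * Q[s, a] + lr * target_sarsa   # identical interpolation step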

Q-learning code

#!/usr/bin/env python3

#
# Q learning for walker (not blind)
#
# based on
# https://www.geeksforgeeks.org/q-learning-in-python/

import numpy as np

# environment
n_states   = 6   # number of states in a linear world
n_actions  = 2   # number of possible moves (distances)
goal_state = 5   # game ends with a win
loss_state = 3   # game ends with a loss

# initialize Q-table with zeros
Q_table = np.zeros((n_states, n_actions))

# parameters
learning_rate    = 0.8
discount_factor  = 0.95
exploration_prob = 0.2
nEpochs = 5000

# allowed starting positions
start_pos = set(range(n_states))    # range to set
start_pos.remove(loss_state)        # remove loss/win states
start_pos.remove(goal_state)

#
# Q-learning from random starting state
#
for epoch in range(nEpochs):
  current_state = np.random.choice(list(start_pos))
  game_length = 0                                 # currently not used
  while (current_state!=goal_state) and \
        (current_state!=loss_state):
    game_length += 1
#
    if np.random.rand() < exploration_prob:       # epsilon-greedy strategy
       action = np.random.randint(0, n_actions)   # exploring
    else:
       action = np.argmax(Q_table[current_state]) # exploiting

# determine next state, periodic boundaries
    next_state = (current_state + 1 + action)%n_states

# reward is \pm 1 for goal/loss state, 0 otherwise
    reward = 0.0
    if (next_state==goal_state): 
      reward =  1.0
    if (next_state==loss_state): 
      reward = -1.0

# updating Q-table 
    Q_table[current_state, action] += learning_rate * \
           ( reward + discount_factor*np.max(Q_table[next_state]) - 
             Q_table[current_state, action] )

    current_state = next_state                    # move to next state

#
# printing
# 
print("expected dicounted rewards Q(S,A)")
print("state,  Q-table")
for iS in range(n_states):
  print(f'{iS:5d} ', end = "")
  for iA in range(n_actions):
    QQ = Q_table[iS][iA]
    print(f'{QQ:7.3f} ', end = "")
#
  if (iS==loss_state):
    print(" loss", end = "")
  if (iS==goal_state):
    print(" win", end = "")
  print()

AlphaZero



game of Go

most explored configuration

AlphaZero cheat sheet

intermezzo: neural scaling laws

scarce resources

 

Chinchilla scaling

(Hoffmann et al., 2022)
  compute $\ \sim\ $ (size)$^2$  
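
A quick way to see the quadratic relation; the $6ND$ estimate for training FLOPs and the roughly constant tokens-per-parameter ratio are standard Chinchilla-style approximations:

$$ C \;\approx\; 6\,N\,D, \qquad\qquad D_{\rm opt} \;\propto\; N \qquad\Longrightarrow\qquad C_{\rm opt} \;\propto\; N^2 $$

with $N$ the number of parameters (model size) and $D$ the number of training tokens.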

AlphaZero Elo scaling

Elo ranking

 $\displaystyle P(A\!\to\!B) = \frac{1}{1+10^{-(E_A-E_B)/400}} $ 
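
Two standard reference points of the Elo formula, as a sanity check:

$$ E_A-E_B = 0 \;\Rightarrow\; P(A\!\to\!B) = \frac{1}{2}, \qquad\qquad E_A-E_B = 400 \;\Rightarrow\; P(A\!\to\!B) = \frac{1}{1+10^{-1}} = \frac{10}{11} \approx 0.91 $$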

societies of game-playing agents

$$ P(A\!\to\!B) \approx \frac{N_A}{N_A+N_B} $$ $$ \mathrm{Elo} \ \approx \ \log(N)\cdot\sigma\left( \beta\log\left(\frac{\sqrt{C}}{N}\right)\right), \quad\qquad \mathrm{time} \propto (\mathrm{resources})^2 $$