# Advanced Introduction to C++, Scientific Computing and Machine Learning

Claudius Gros, SS 23

Institut für theoretische Physik
Goethe-University Frankfurt a.M.

# decision trees (DT) are everywhere

• daily life
• medicine
symptoms $\ \ \to \ \$ diagnosis
• playing Go, Chess, ..
• ..

### optimal decision trees

• data:
$\displaystyle\quad \begin{array}{rcl} \mbox{training} & \ \Leftrightarrow \ & \mbox{testing}\\ \mbox{(over)fitting}& \ \Leftrightarrow \ & \mbox{generalization}\\ & \vdots &\\ \mbox{complex}& \ \Leftrightarrow \ & \mbox{simple} \end{array}$

# whether to play tennis

• classical decision-tree example
• training data

| Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis |
|-----|----------|-------------|----------|--------|------------|
| D1  | Sunny    | Hot         | High     | Weak   | No         |
| D2  | Sunny    | Hot         | High     | Strong | No         |
| D3  | Overcast | Hot         | High     | Weak   | Yes        |
| D4  | Rain     | Mild        | High     | Weak   | Yes        |
| D5  | Rain     | Cool        | Normal   | Weak   | Yes        |
| D6  | Rain     | Cool        | Normal   | Strong | No         |
| D7  | Overcast | Cool        | Normal   | Strong | Yes        |
| D8  | Sunny    | Mild        | High     | Weak   | No         |
| D9  | Sunny    | Cool        | Normal   | Weak   | Yes        |
| D10 | Rain     | Mild        | Normal   | Weak   | Yes        |
| D11 | Sunny    | Mild        | Normal   | Strong | Yes        |
| D12 | Overcast | Mild        | High     | Strong | Yes        |
| D13 | Overcast | Hot         | Normal   | Weak   | Yes        |
| D14 | Rain     | Mild        | High     | Strong | No         |
• attributes: Outlook, Temperature, ..
• class: PlayTennis
• prediction / testing

| Day      | Outlook  | Temperature | Humidity | Wind | PlayTennis |
|----------|----------|-------------|----------|------|------------|
| today    | Sunny    | Hot         | Normal   | Weak | ?          |
| tomorrow | Overcast | Mild        | Normal   | Weak | ?          |

# decision tree complexities

• how to find out that
• 'temperature' is irrelevant
• 'outlook' $\ \to\ $ 'overcast' alone is decisive (always 'Yes')
• 'outlook' $\ \to\ $ 'humidity' (sunny branch) does not need 'wind'
• 'outlook' $\ \to\ $ 'wind' (rain branch) does not need 'humidity'
• generic problem
: elimination of irrelevant attributes
: taking relevant decisions first
• e.g. symptoms $\ \to\ $ diagnosis

| Day | Outlook  | Temperature | Humidity | Wind   | Play? |
|-----|----------|-------------|----------|--------|-------|
| D9  | Sunny    | Cool        | Normal   | Weak   | Yes   |
| D11 | Sunny    | Mild        | Normal   | Strong | Yes   |
| D1  | Sunny    | Hot         | High     | Weak   | No    |
| D2  | Sunny    | Hot         | High     | Strong | No    |
| D8  | Sunny    | Mild        | High     | Weak   | No    |
| D3  | Overcast | Hot         | High     | Weak   | Yes   |
| D7  | Overcast | Cool        | Normal   | Strong | Yes   |
| D12 | Overcast | Mild        | High     | Strong | Yes   |
| D13 | Overcast | Hot         | Normal   | Weak   | Yes   |
| D4  | Rain     | Mild        | High     | Weak   | Yes   |
| D5  | Rain     | Cool        | Normal   | Weak   | Yes   |
| D10 | Rain     | Mild        | Normal   | Weak   | Yes   |
| D6  | Rain     | Cool        | Normal   | Strong | No    |
| D14 | Rain     | Mild        | High     | Strong | No    |

# simple rule - complicated tree

• starting with 'wind'
: overfitting?

| Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis |
|-----|----------|-------------|----------|--------|------------|
| D9  | Sunny    | Cool        | Normal   | Weak   | Yes        |
| D5  | Rain     | Cool        | Normal   | Weak   | Yes        |
| D10 | Rain     | Mild        | Normal   | Weak   | Yes        |
| D13 | Overcast | Hot         | Normal   | Weak   | Yes        |
| D3  | Overcast | Hot         | High     | Weak   | Yes        |
| D4  | Rain     | Mild        | High     | Weak   | Yes        |
| D1  | Sunny    | Hot         | High     | Weak   | No         |
| D8  | Sunny    | Mild        | High     | Weak   | No         |
| D7  | Overcast | Cool        | Normal   | Strong | Yes        |
| D11 | Sunny    | Mild        | Normal   | Strong | Yes        |
| D6  | Rain     | Cool        | Normal   | Strong | No         |
| D12 | Overcast | Mild        | High     | Strong | Yes        |
| D2  | Sunny    | Hot         | High     | Strong | No         |
| D14 | Rain     | Mild        | High     | Strong | No         |

# constructing decision trees

• CART (Classification and Regression Tree); binary
• ID3 (Iterative Dichotomiser)
• select attributes with maximal information gain
• minimizing entropy (uncertainty)

### entropy

• probability density function (PDF)
 $$\ \ p(x)\ge 0, \qquad\int p(x)dx = 1 \ \$$ $\qquad \mbox{or}\qquad$ $$\ \ p_\alpha\ge 0, \qquad \sum_\alpha p_\alpha = 1\ \$$
• continuous or discrete (flipping a coin)
• entropy $\ \ H[p]=-\big\langle \ln(p)\big\rangle$
 $$\ \ H[p] =-\int p(x)\ln\big[p(x)\big]dx \ \$$ $\qquad \mbox{or}\qquad$ $$\ \ H[p] = -\sum_\alpha p_\alpha\ln\big[p_\alpha\big]\ \$$
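• a minimal C++ sketch of the discrete formula, with the convention $0\ln(0)=0$; the function name is our own:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// H[p] = -sum_a p_a ln(p_a), terms with p_a = 0 contribute nothing
double entropy(const std::vector<double>& p) {
    double H = 0.0;
    for (double pa : p)
        if (pa > 0.0) H -= pa*std::log(pa);
    return H;
}

int main() {
    std::printf("fair coin : %.3f\n", entropy({0.5, 0.5}));  // ln(2) = 0.693
    std::printf("certainty : %.3f\n", entropy({1.0, 0.0}));  // 0
}
```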

# entropy

### statistical physics

$$H[p]=-k_B\big\langle \ln(p)\big\rangle, \qquad\quad \beta=\frac{1}{k_BT}$$
• Boltzmann constant $\ \ k_B$
• temperature $\ \ T$
$$\begin{array}{rcl} \mbox{micro-canonical ensemble} && \mbox{maximal entropy at fixed energy} \\[1ex] \mbox{canonical ensemble} && \mbox{maximal entropy, given a mean energy E} \\ && \mbox{Boltzmann distribution} \ \ \fbox{\displaystyle\phantom{\big|} p_\alpha\sim\exp(-\beta E_\alpha) \phantom{\big|}} \\[1ex] \mbox{grand-canonical ensemble} && \mbox{maximal entropy, given mean energy E and particle number N} \\ && \mbox{Boltzmann distribution} \ \ \fbox{\displaystyle\phantom{\big|} p_\alpha\sim\exp(-\beta (E_\alpha-\mu N)) \phantom{\big|}} \\ \end{array}$$
• chemical potential $\ \ \mu$

### entropy as a measure of disorder

$$H(T)\quad\to\quad \left\{\begin{array}{rcl} 0 &\mbox{for}& T\to0 \\[1ex] Nk_B\ln(f) &\mbox{for}& T\to\infty \end{array}\right.$$
• degrees of freedom per particle $\ \ f$

# information theory

• entropy is positive, since $\ \ 0\le p_\alpha \le 1, \qquad \lim_{x\to0^+} x\log(x)=0$
• the logarithm is concave

### maximal entropy distributions

• no constraint: uniform distribution
: variational calculus (sketched after this list)
$$H[p] = -\sum_{\alpha=1}^M p_\alpha\log(p_\alpha) = -M\,\frac{1}{M}\log(1/M) = \log(M),\qquad\quad p_\alpha = \frac{1}{M}$$
• number of events $\ \ M$
• with constraint (e.g. a given mean): exponential distribution
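• the variational argument, with a Lagrange multiplier $\lambda$ enforcing normalization:
$$\frac{\partial}{\partial p_\alpha}\left[-\sum_\beta p_\beta\log(p_\beta) -\lambda\Big(\sum_\beta p_\beta-1\Big)\right] = -\log(p_\alpha)-1-\lambda = 0 \qquad\Rightarrow\qquad p_\alpha = \mathrm{e}^{-1-\lambda} = \frac{1}{M}$$
• all $\ p_\alpha\ $ come out equal; fixing in addition a mean value, e.g. $\ \sum_\alpha p_\alpha E_\alpha$, adds a second multiplier and yields the exponential form $\ p_\alpha\sim\exp(-\beta E_\alpha)$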

# information gain

reduction in entropy, given that an event has occurred
• entropy with respect to classification (yes/no)
• tennis example $\ \ N=14, \qquad N_{\mbox{yes}}=9, \qquad N_{\mbox{no}}=5$
$$H_{\mbox{initial}} = \frac{9}{14}\log_2(14/9)+\frac{5}{14}\log_2(14/5) =0.940
% (log(14.0/9.0)*9.0/14.0+log(14.0/5.0)*5.0/14.0)/log(2.0)
$$
• maximum $\ \ \log_2(M)=1 \ $ for $\ M=2\ $ classes

### conditional entropy

• attribute 'outlook' has three possible events
$$\begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline
\mbox{sunny} & 5 & 2 & 3 & 0.971\\
\mbox{overcast} & 4 & 4 & 0 & 0\\
\mbox{rain} & 5 & 3 & 2 & 0.971
\end{array}
% (log(5.0/2.0)*2.0/5.0+log(5.0/3.0)*3.0/5.0)/log(2.0)
$$
• average (over all possible events) entropy
$$H_{\mbox{outlook}} = \frac{5}{14} H_{\mbox{sunny}} + \frac{4}{14} H_{\mbox{overcast}} + \frac{5}{14} H_{\mbox{rain}} = \frac{10}{14}\cdot 0.971 = 0.694$$
• information gain for attribute 'outlook'
$$G_{\mbox{outlook}} = H_{\mbox{initial}} - H_{\mbox{outlook}} = 0.940-0.694 = 0.246$$
• a gain of nearly a quarter of a bit; a short numerical check follows
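A short C++ check of these numbers; the counts are hard-coded from the table, and the helper name `H2` is our own:

```cpp
#include <cmath>
#include <cstdio>

// entropy (in bits) of a yes/no split with counts nYes, nNo
double H2(double nYes, double nNo) {
    double N = nYes + nNo, H = 0.0;
    if (nYes > 0) H -= (nYes/N)*std::log2(nYes/N);
    if (nNo  > 0) H -= (nNo /N)*std::log2(nNo /N);
    return H;
}

int main() {
    double Hinit = H2(9, 5);                               // 0.940
    double Hout  = (5*H2(2,3) + 4*H2(4,0) + 5*H2(3,2))/14; // 0.694
    std::printf("H_initial = %.3f\n", Hinit);
    std::printf("H_outlook = %.3f\n", Hout);
    std::printf("G_outlook = %.3f\n", Hinit - Hout);       // 0.246
}
```

The same helper reproduces the conditional entropies of the remaining attributes on the next slides.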

# determining optimal attribute

### 'outlook'

$$\begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{sunny} & 5 & 2 & 3 & 0.971\\ \mbox{overcast} & 4 & 4 & 0 & 0\\ \mbox{rain} & 5 & 3 & 2 & 0.971 \end{array}\qquad\Rightarrow \qquad \fbox{\displaystyle\phantom{\big|} H_{\mbox{outlook}} = 0.694 \phantom{\big|}}$$

### 'wind'

$$\begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline
\mbox{weak} & 8 & 6 & 2 & 0.811\\
\mbox{strong} & 6 & 3 & 3 & 1.0\\
\end{array}\qquad\Rightarrow
% (log(8.0/6.0)*6.0/8.0+log(8.0/2.0)*2.0/8.0)/log(2.0)
% (log(6.0/3.0)*3.0/6.0+log(6.0/3.0)*3.0/6.0)/log(2.0)
\qquad \fbox{\displaystyle\phantom{\big|} H_{\mbox{wind}} = 0.892 \phantom{\big|}}
% 0.811*8.0/14.0 + 1.0*6.0/14.0
$$

### 'humidity'

$$\begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline
\mbox{normal} & 7 & 6 & 1 & 0.592\\
\mbox{high} & 7 & 3 & 4 & 0.985\\
\end{array}\qquad\Rightarrow
% (log(7.0/6.0)*6.0/7.0+log(7.0/1.0)*1.0/7.0)/log(2.0)
% (log(7.0/3.0)*3.0/7.0+log(7.0/4.0)*4.0/7.0)/log(2.0)
\qquad \fbox{\displaystyle\phantom{\big|} H_{\mbox{humidity}} = 0.789 \phantom{\big|}}
% 0.985*7.0/14.0 + 0.592*7.0/14.0
$$

### 'temperature'

$$\begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline
\mbox{cool} & 4 & 3 & 1 & 0.811\\
\mbox{mild} & 6 & 4 & 2 & 0.918\\
\mbox{hot} & 4 & 2 & 2 & 1.0\\
\end{array}\qquad\Rightarrow
% (log(4.0/3.0)*3.0/4.0+log(4.0/1.0)*1.0/4.0)/log(2.0)
% (log(6.0/4.0)*4.0/6.0+log(6.0/2.0)*2.0/6.0)/log(2.0)
\qquad \fbox{\displaystyle\phantom{\big|} H_{\mbox{temperature}} = 0.911 \phantom{\big|}}
% 0.811*4.0/14.0 + 0.918*6.0/14.0 + 4.0/14.0
$$

### information gain

• constant $\ \ H_{\mbox{initial}}=0.940$
$$H_{\mbox{initial}}- \big\langle H_{\mbox{after}}\big\rangle = \left\{\begin{array}{rcl} 0.246 && \mbox{'outlook'} \\ 0.048 && \mbox{'wind'} \\ 0.151 && \mbox{'humidity'} \\ 0.029 && \mbox{'temperature'} \end{array}\right.$$
• 'outlook' best, 'temperature' miserable

# recursive entropy weighting

• attribute with highest information gain
$\to \ \$ root of decision tree
• repeat recursively on every branch; a compact sketch follows
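A compact C++ sketch of this recursion (ID3 with information gain); the data layout, with the class label in the last column, and all function names are our own choices, not from the lecture:

```cpp
#include <cmath>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Row  = std::vector<std::string>;   // attribute values, class label last
using Data = std::vector<Row>;

double entropy(const Data& d) {          // entropy of the class label, in bits
    std::map<std::string,int> count;
    for (const Row& r : d) count[r.back()]++;
    double H = 0.0;
    for (const auto& [label,n] : count) {
        double p = double(n)/d.size();
        H -= p*std::log2(p);
    }
    return H;
}

double gain(const Data& d, int attr) {   // information gain of one attribute
    std::map<std::string,Data> split;
    for (const Row& r : d) split[r[attr]].push_back(r);
    double Hafter = 0.0;                 // weighted average entropy after split
    for (const auto& [value,subset] : split)
        Hafter += entropy(subset)*subset.size()/d.size();
    return entropy(d) - Hafter;
}

void id3(const Data& d, std::vector<int> attrs, std::string indent = "") {
    if (entropy(d) == 0.0 || attrs.empty()) {   // pure node, or no attributes left
        std::cout << indent << "=> " << d[0].back() << "\n";
        return;                                 // (full version: majority vote)
    }
    int best = attrs[0];                        // attribute with maximal gain
    for (int a : attrs) if (gain(d,a) > gain(d,best)) best = a;

    std::vector<int> rest;                      // remaining attributes
    for (int a : attrs) if (a != best) rest.push_back(a);

    std::map<std::string,Data> split;           // one branch per attribute value
    for (const Row& r : d) split[r[best]].push_back(r);
    for (const auto& [value,subset] : split) {
        std::cout << indent << "attr#" << best << " = " << value << "\n";
        id3(subset, rest, indent + "   ");
    }
}

int main() {                                    // first four tennis days only
    Data d = { {"Sunny","Hot","High","Weak","No"},
               {"Sunny","Hot","High","Strong","No"},
               {"Overcast","Hot","High","Weak","Yes"},
               {"Rain","Mild","High","Weak","Yes"} };
    id3(d, {0,1,2,3});                          // columns: outlook .. wind
}
```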

### Gini index

• one (of many) alternatives to information gain
$\displaystyle\qquad\quad \mbox{Gini}[p] = 1-\sum_\alpha p_\alpha^2$
• to be minimized, in analogy to the entropy
• expanding the entropy for large probabilities

$\displaystyle\qquad\quad \begin{array}{rcl} H[p] &=&-\sum_\alpha p_\alpha\log(p_\alpha) \\ &=&-\sum_\alpha p_\alpha\log(1-(1-p_\alpha)) \\ &\approx& \sum_\alpha p_\alpha(1-p_\alpha) = 1-\sum_\alpha p_\alpha^2 \end{array}$
• valid for $\ \ p_\alpha\to 1$, using $\ \log(1-x)\approx -x\ $ and the normalization $\ \sum_\alpha p_\alpha=1$

### entropy vs. Gini

• case $\ \ M=2$
$$1-\sum_{\alpha=1}^2 p_\alpha^2 = 1 - p_1^2 - (1-p_1)^2 =2p_1-2p_1^2 = 2p_1(1-p_1)$$
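• for $M=2$ both measures vanish at $p_1\in\{0,1\}$ and are maximal at $p_1=1/2$; a quick numerical comparison (illustrative sketch):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // binary entropy H(p) vs. Gini 2p(1-p): both zero at p=0,1, maximal at p=1/2
    for (double p = 0.0; p <= 1.001; p += 0.25) {
        double H = (p > 0 && p < 1) ? -p*std::log2(p)-(1-p)*std::log2(1-p) : 0.0;
        double G = 2*p*(1-p);
        std::printf("p=%.2f  H=%.3f  Gini=%.3f\n", p, H, G);
    }
}
```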

# overfitting - pruning

given enough parameters,
one can fit an elephant
• given enough nodes,
one can fit every decision process

### avoiding overfitting

• pre-pruning
: stop when new branches become statistically insignificant
• post-pruning
: grow the full tree, then remove branches
$\rightarrow\ \$ replacing a subtree by a leaf
• optimal tree: trade-off
performance $\ \ \leftrightarrow \ \$ overfitting
• performance measured on a separate validation data set
• complexity penalty; one standard form is given below
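One standard concrete choice, added here for definiteness (CART's cost-complexity pruning, not part of the lecture slides): among pruned subtrees $T$, select the one minimizing
$$R_\alpha(T) = R(T) + \alpha\,|T|$$
• error rate $\ R(T)$, number of leaves $\ |T|$, penalty strength $\ \alpha\ge 0$, commonly tuned on validation data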

### range of algorithms

• CART (binary)
• ID3 (used here)
• C4.5 (includes pruning)
• permits real-valued attributes
• allows missing values
• robust to noise

# pruning example

### evaluation of job offer

• here: only the working principle
• pruning may increase the training error rate
while increasing generalization performance
: statistical relevance, reduced complexity, ...

# drawing graphs

• a range of open-source software:
• quick and dirty: GraphViz
• nodes arranged automatically
• fine tuning difficult
• for Linux: dot
and neato / circo / twopi
for distinct node-placement algorithms (all reading the dot language)
```dot
graph G {
    one -- two -- three -- four -- one
    /* run with
     * circo -Tpdf test.dot -o test.pdf
     * neato -Tpdf test.dot -o test.pdf
     * dot   -Tpdf test.dot -o test.pdf
     */
}
```


### job evaluation graph

```dot
digraph G {                     // digraph: directed graph

    rankdir=TB;                 // top-to-bottom (try LR)
    node   [penwidth=2.0]       // for all nodes

    salary      [shape="oval",color="#000099",fontcolor="#000099"]
    increase    [shape="oval",color="#000099",fontcolor="#000099"]
    hours       [shape="oval",color="#000099",fontcolor="#000099"]
    holidays    [shape="oval",color="#000099",fontcolor="#000099"]
    retirement  [shape="oval",color="#000099",fontcolor="#000099"]

    salary      [label="salary"]
    increase    [label="prospective wage increase"]
    hours       [label="# working hours"]
    holidays    [label="# holidays"]
    retirement  [label="early retirement possible?"]

    good1     [shape="box",color="#009900",fontcolor="#009900"]
    good2     [shape="box",color="#009900",fontcolor="#009900"]
    good4     [shape="box",color="#009900",fontcolor="#009900"]

    good1       [label="good"]
    good2       [label="good"]
    good4       [label="good"]

    // bad1 is referenced below; declaration assumed by analogy with good1
    bad1        [shape="box",color="#990000",fontcolor="#990000",label="bad"]

    {rank=same; holidays; hours}                     // on same rank level
    {rank=same; bad1; retirement; good1; increase }

    salary     -> holidays    [label=" high"]
    holidays   -> retirement  [label=" <20"]
    holidays   -> good1       [label=" >20"]
    retirement -> good2       [label=" yes"]
    retirement -> bad1        [label=" no"]   // complement branch, assumed
}   // remaining branches (hours, increase, good4) not shown in the source
```