Machine Learning Primer -- Basics

Claudius Gros, WS 2024/25

Institut für theoretische Physik
Goethe-University Frankfurt a.M.

Decision Trees

decision trees (DT) are everywhere

daily life
medicine
symptoms $\ \ \to \ \ $ diagnosis
playing Go, Chess, ..
..

optimal decicion trees

data:
$\quad \begin{array}{rcl} \mbox{training} & \ \Leftrightarrow \ & \mbox{testing}\\ \mbox{(over)fitting}& \ \Leftrightarrow \ & \mbox{generalization}\\ & \vdots &\\ \mbox{complex}& \ \Leftrightarrow \ & \mbox{simple} \end{array} $

whether to play tennis

classical decision-tree example
training data

Day	Outlook	Temperature	Humidity	Wind	PlayTennis
D1	Sunny	Hot	High	Weak	No
D2	Sunny	Hot	High	Strong	No
D3	Overcast	Hot	High	Weak	Yes
D4	Rain	Mild	High	Weak	Yes
D5	Rain	Cool	Normal	Weak	Yes
D6	Rain	Cool	Normal	Strong	No
D7	Overcast	Cool	Normal	Strong	Yes
D8	Sunny	Mild	High	Weak	No
D9	Sunny	Cool	Normal	Weak	Yes
D10	Rain	Mild	Normal	Weak	Yes
D11	Sunny	Mild	Normal	Strong	Yes
D12	Overcast	Mild	High	Strong	Yes
D13	Overcast	Hot	Normal	Weak	Yes
D14	Rain	Mild	High	Strong	No

attributes: Outlook, Temperature, ..
class: PlayTennis
prediction / testing

Day	Outlook	Temperature	Humidity	Wind	PlayTennis
today	Sunny	Hot	Normal	Weak	?
tomorrow	Overcast	Mild	Normal	Weak	?

decision tree complexities

how to find out that
- 'temperature' irrelevant
- 'outlook' $\ \to\ $ 'overcast' enough
- 'outlook' $\ \to\ $ 'humidity' does not need 'wind'
- 'outlook' $\ \to\ $ 'wind' does not need 'humiditiy'
generic problem
: elimination of irrelevant attributes
: taking relevant decisions first
e.g. symptons $ \ \to\ $ diagnosis

Day	Outlook	Temperature	Humidity	Wind	Play?
D9	Sunny	Cool	Normal	Weak	Yes
D11	Sunny	Mild	Normal	Strong	Yes
D1	Sunny	Hot	High	Weak	No
D2	Sunny	Hot	High	Strong	No
D8	Sunny	Mild	High	Weak	No

D3	Overcast	Hot	High	Weak	Yes
D7	Overcast	Cool	Normal	Strong	Yes
D12	Overcast	Mild	High	Strong	Yes
D13	Overcast	Hot	Normal	Weak	Yes

D4	Rain	Mild	High	Weak	Yes
D5	Rain	Cool	Normal	Weak	Yes
D10	Rain	Mild	Normal	Weak	Yes
D6	Rain	Cool	Normal	Strong	No
D14	Rain	Mild	High	Strong	No

simple rule - complicated tree

starting with 'wind'
: overfitting?

Day	Outlook	Temperature	Humidity	Wind	PlayTennis
D9	Sunny	Cool	Normal	Weak	Yes
D5	Rain	Cool	Normal	Weak	Yes
D10	Rain	Mild	Normal	Weak	Yes
D13	Overcast	Hot	Normal	Weak	Yes

D3	Overcast	Hot	High	Weak	Yes
D4	Rain	Mild	High	Weak	Yes
D1	Sunny	Hot	High	Weak	No
D8	Sunny	Mild	High	Weak	No

D7	Overcast	Cool	Normal	Strong	Yes
D11	Sunny	Mild	Normal	Strong	Yes
D6	Rain	Cool	Normal	Strong	No

D12	Overcast	Mild	High	Strong	Yes
D2	Sunny	Hot	High	Strong	No
D14	Rain	Mild	High	Strong	No

constructing decision trees

CART (Classification and Regression Tree); binary
ID3 (Iterative Dichotomiser)
- select attributes with maximal information gain
- minimizing entropy (uncertainty)

entropy

probability density function (PDF)

$$ \ \ p(x)\ge 0, \qquad\int p(x)dx = 1 \ \ $$

$\qquad \mbox{or}\qquad $

$$ \ \ p_\alpha\ge 0, \qquad \sum_\alpha p_\alpha = 1\ \ $$

continuous or discrete (flipping a coin)
entropy $\ \ H[p]=-\big\langle \ln(p)\big\rangle$

$$ \ \ H[p] =-\int p(x)\ln\big[p(x)\big]dx \ \ $$

$\qquad \mbox{or}\qquad $

$$ \ \ H[p] = -\sum_\alpha p_\alpha\ln\big[p_\alpha\big]\ \ $$

information gain

reduction in entropy given an event occured

entropy with respect to classification (yes/no)
tennis example $\ \ N=14, \qquad N_{\mbox{yes}}=9, \qquad N_{\mbox{no}}=5$

$$ H_{\mbox{initial}} = \frac{9}{14}\log_2(14/9)+\frac{5}{14}\log_2(14/5) =0.940 % (log(14.0/9.0)*9.0/14.0+log(14.0/5.0)*5.0/14.0)/log(2.0) $$

maximum $\ \ \log_2(M)=1, \qquad\quad M=2$

conditional entropy

attribute 'outlook' has three possible events

$$ \begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{sunny} & 5 & 2 & 3 & 0.971\\ \mbox{overcast} & 4 & 4 & 0 & 0\\ \mbox{rain} & 5 & 3 & 2 & 0.971 \end{array} % (log(5.0/2.0)*2.0/5.0+log(5.0/3.0)*3.0/5.0)/log(2.0) $$

average (over all possible events) entropy

$$ H_{\mbox{outlook}} = \frac{5}{14} H_{\mbox{sunny}} + \frac{4}{14} H_{\mbox{overcast}} + \frac{5}{14} H_{\mbox{rain}} = \frac{10}{14} 0.971 = 0.694 $$

information gain for attribute 'outlook'

$$ G_{\mbox{outlook}} = H_{\mbox{initial}} - H_{\mbox{outlook}} = 0.940-0.694 = 0.246 $$

gains nearly a quarter

determining optimal attribute

'outlook'

$$ \begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{sunny} & 5 & 2 & 3 & 0.971\\ \mbox{overcast} & 4 & 4 & 0 & 0\\ \mbox{rain} & 5 & 3 & 2 & 0.971 \end{array}\qquad\Rightarrow \qquad \fbox{$\phantom{\big|} H_{\mbox{outlook}} = 0.694 \phantom{\big|}$} $$

'wind'

$$ \begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{weak} & 8 & 6 & 2 & 0.811\\ \mbox{strong} & 6 & 3 & 3 & 1.0\\ \end{array}\qquad\Rightarrow % (log(8.0/6.0)*6.0/8.0+log(8.0/2.0)*2.0/8.0)/log(2.0) % (log(6.0/3.0)*3.0/6.0+log(6.0/3.0)*3.0/6.0)/log(2.0) \qquad \fbox{$\phantom{\big|} H_{\mbox{wind}} = 0.892 \phantom{\big|}$} % 0.811*8.0/14.0 + 1.0*6.0/14.0 $$

'humidity'

$$ \begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{normal} & 7 & 6 & 1 & 0.592\\ \mbox{high} & 7 & 3 & 4 & 0.985\\ \end{array}\qquad\Rightarrow % (log(7.0/6.0)*6.0/7.0+log(7.0/1.0)*1.0/7.0)/log(2.0) % (log(7.0/3.0)*3.0/7.0+log(7.0/4.0)*4.0/7.0)/log(2.0) \qquad \fbox{$\phantom{\big|} H_{\mbox{humidity}} = 0.789 \phantom{\big|}$} % 0.985*7.0/14.0 + 0.592*7.0/14.0 $$

'temperature'

$$ \begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{coool} & 4 & 3 & 1 & 0.811\\ \mbox{mild} & 6 & 4 & 2 & 0.918\\ \mbox{hot} & 4 & 2 & 2 & 1.0\\ \end{array}\qquad\Rightarrow % (log(4.0/3.0)*3.0/4.0+log(4.0/1.0)*1.0/4.0)/log(2.0) % (log(6.0/4.0)*4.0/6.0+log(6.0/2.0)*2.0/6.0)/log(2.0) \qquad \fbox{$\phantom{\big|} H_{\mbox{temperature}} = 0.911 \phantom{\big|}$} % 0.811*4.0/14.0 + 0.918*6.0/14.0 + 4.0/14.0 $$

information gain

constant $\ \ H_{\mbox{initial}}=0.940$

$$ H_{\mbox{initial}}- \big\langle H_{\mbox{after}}\big\rangle = \left\{\begin{array}{rcl} 0.246 && \mbox{'outlook'} \\ 0.048 && \mbox{'wind'} \\ 0.151 && \mbox{'humidity'} \\ 0.029 && \mbox{'temperature'} \end{array}\right. $$

'outlook' best, 'temperature' miserable

recursive entropy weighting

attribute with highet information gain
$\to \ \ $ root of decision tree
repeat recursively

Gini index

one (of many) alternatives to information gain

$\qquad\quad \mbox{Gini}[p] = 1-\sum_\alpha p_\alpha^2 $

minimizing
expanding the entropy for large probabilities

$\qquad\quad \begin{array}{rcl} H[p] &=&-\sum_\alpha p_\alpha\log(p_\alpha) \\ &=&-\sum_\alpha p_\alpha\log(1-(1-p_\alpha)) \\ &\approx& \sum_\alpha p_\alpha(1-p_\alpha) = 1-\sum_\alpha p_\alpha^2 \end{array} $

where $ \ \ p_\alpha\to 1, \qquad\quad \sum_\alpha p_\alpha=1$

entropy vs. Gini

case $\ \ M=2$

$$ 1-\sum_{\alpha=1}^2 p_\alpha^2 = 1 - p_1^2 - (1-p_1)^2 =2p_1-2p_1^2 = 2p_1(1-p_1) $$

overfitting - pruning

given enough parameters,
one can fit an elephant

given enough nodes,
one can fit every decision process

avoiding overfitting

pre-pruning
: stop when new branches become statistically insignificant
post pruneing
: grow full tree, then remove branches
$\rightarrow\ \ $ replacing subtree by leaf
optimal tree
performance $\ \ \leftrightarrow \ \ $ overfitting
- performance with respect to validation data set
- complexity penalty

range of algorithms

CART (binary)
ID3 (here + ...)
C4.5 (includes pruning)
- permits real-valued attributes
- allow missing values
- robust to noise

pruning example

evaluation of job offer

here: only working principle
increasing error rate
increasing performance
: statistical relevance, reduced complexity, ...

pruned graph

drawing graphs

a range of open-source oftware:
quick and dirty: GraphViz
- nodes arranged automatically
- fine tuning sometimes difficult
for Linux: dot
and neato / circo / twopi
for distinct node location algorithms (all based on dot)

graph G { 
 one -- two -- three -- four -- one
/* run with
 * circo -Tpdf test.dot -o test.pdf
 * neato -Tpdf test.dot -o test.pdf
 * dot   -Tpdf test.dot -o test.pdf
 */
}

job evaluation graph

digraph G {                          // digraph: directed graph

         rankdir=TD;                 // top-down, try LR
         node   [penwidth=2.0]       // for all nodes

         salary      [shape="oval",color="#000099",fontcolor="#000099"]    
         increase    [shape="oval",color="#000099",fontcolor="#000099"]  
         hours       [shape="oval",color="#000099",fontcolor="#000099"]
         holidays    [shape="oval",color="#000099",fontcolor="#000099"]
         retirement  [shape="oval",color="#000099",fontcolor="#000099"]

         salary      [label="salary"]    
         increase    [label="prospective wage increase"]    
         hours       [label="# working hours"]    
         holidays    [label="# holidays"]    
         retirement  [label="early retirement possible?"]    

         good1     [shape="box",color="#009900",fontcolor="#009900"]
         good2     [shape="box",color="#009900",fontcolor="#009900"]
         good4     [shape="box",color="#009900",fontcolor="#009900"]
         bad1      [shape="box",color="#990000",fontcolor="#990000"]
         bad2      [shape="box",color="#990000",fontcolor="#990000"]
         bad4      [shape="box",color="#990000",fontcolor="#990000"]

         good1       [label="good"]    
         good2       [label="good"]    
         good4       [label="good"]    
         bad1        [label="bad"]    
         bad2        [label="bad"]    
         bad4        [label="bad"]    

       {rank=same; holidays; hours}                     // on same rank level
       {rank=same; bad1; retirement; good1; increase } 


          salary  -> holidays                         [label=" high"]
                     holidays -> retirement           [label=" <20"]
                     holidays -> good1                [label=" >20"]
                                 retirement -> good2          [label=" yes"]
                                 retirement -> bad2           [label=" no"]
          salary  -> hours                            [label=" low"]
                     hours    -> increase             [label=" <35"]
                     hours    -> bad1                 [label=" >35"]
                                 increase   -> bad4           [label=" <4%"]
                                 increase   -> good4          [label=" >4%"]

/* some info 
 *
 * http://www.graphviz.org/Documentation.php
 * dot -Tpdf test.dot -o test.pdf
*/
}

digraph G {                          // digraph: directed graph

rankdir=TD;                 // top-down, try LR
         node   [penwidth=2.0]       // for all nodes

salary      [shape="oval",color="#000099",fontcolor="#000099"]    
         increase    [shape="oval",color="#000099",fontcolor="#000099"]  
         hours       [shape="oval",color="#000099",fontcolor="#000099"]
         holidays    [shape="oval",color="#000099",fontcolor="#000099"]
         retirement  [shape="oval",color="#000099",fontcolor="#000099"]

salary      [label="salary"]    
         increase    [label="prospective wage increase"]    
         hours       [label="# working hours"]    
         holidays    [label="# holidays"]    
         retirement  [label="early retirement possible?"]

good1     [shape="box",color="#009900",fontcolor="#009900"]
         good2     [shape="box",color="#009900",fontcolor="#009900"]
         good4     [shape="box",color="#009900",fontcolor="#009900"]
         bad1      [shape="box",color="#990000",fontcolor="#990000"]
         bad2      [shape="box",color="#990000",fontcolor="#990000"]
         bad4      [shape="box",color="#990000",fontcolor="#990000"]

good1       [label="good"]    
         good2       [label="good"]    
         good4       [label="good"]    
         bad1        [label="bad"]    
         bad2        [label="bad"]    
         bad4        [label="bad"]

{rank=same; holidays; hours}                     // on same rank level
       {rank=same; bad1; retirement; good1; increase }

salary  -> holidays                         [label=" high"]
                     holidays -> retirement           [label=" <20"]
                     holidays -> good1                [label=" >20"]
                                 retirement -> good2          [label=" yes"]
                                 retirement -> bad2           [label=" no"]
          salary  -> hours                            [label=" low"]
                     hours    -> increase             [label=" <35"]
                     hours    -> bad1                 [label=" >35"]
                                 increase   -> bad4           [label=" <4%"]
                                 increase   -> good4          [label=" >4%"]

/* some info 
 *
 * http://www.graphviz.org/Documentation.php
 * dot -Tpdf test.dot -o test.pdf
*/
}