Advanced Introduction to C++, Scientific Computing and Machine Learning




Claudius Gros, WS 2019/20

Institut für theoretische Physik
Goethe-University Frankfurt a.M.

Decision Trees

decision trees (DT) are everywhere



optimal decicion trees

whether to play tennis

DayOutlookTemperatureHumidity WindPlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
DayOutlookTemperatureHumidity WindPlayTennis
today Sunny Hot Normal Weak ?
tomorrow Overcast Mild Normal Weak ?

decision tree complexities




DayOutlookTemperatureHumidity WindPlay?
D9 Sunny Cool Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D8 Sunny Mild High Weak No
D3 Overcast Hot High Weak Yes
D7 Overcast Cool Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D6 Rain Cool Normal Strong No
D14 Rain Mild High Strong No

simple rule - complicated tree


DayOutlookTemperatureHumidity WindPlayTennis
D9 Sunny Cool Normal Weak Yes
D5 Rain Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D13 Overcast Hot Normal Weak Yes
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D1 Sunny Hot High Weak No
D8 Sunny Mild High Weak No
D7 Overcast Cool Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D6 Rain Cool Normal Strong No
D12 Overcast Mild High Strong Yes
D2 Sunny Hot High Strong No
D14 Rain Mild High Strong No

constructing decision trees

entropy

$$ \ \ p(x)\ge 0, \qquad\int p(x)dx = 1 \ \ $$
$\qquad \mbox{or}\qquad $
$$ \ \ p_\alpha\ge 0, \qquad \sum_\alpha p_\alpha = 1\ \ $$
$$ \ \ H[p] =-\int p(x)\ln\big[p(x)\big]dx \ \ $$
$\qquad \mbox{or}\qquad $
$$ \ \ H[p] = -\sum_\alpha p_\alpha\ln\big[p_\alpha\big]\ \ $$

entropy

statistical physics

$$ H[p]=-k_BT\big\langle \ln(p)\big\rangle, \qquad\quad \beta=\frac{1}{k_bT} $$ $$ \begin{array}{rcl} \mbox{micro-canonical ensemble} && \mbox{minimal entropy} \\[1ex] \mbox{canonical ensemble} && \mbox{minimal entropy, given an energy E} \\ && \mbox{Boltzman distribution} \ \ \fbox{$\displaystyle\phantom{\big|} p_\alpha\sim\exp(-\beta E_\alpha) \phantom{\big|}$} \\[1ex] \mbox{grand-canonical ensemble} && \mbox{minimal entropy, given energy E, particle number N} \\ && \mbox{Boltzman distribution} \ \ \fbox{$\displaystyle\phantom{\big|} p_\alpha\sim\exp(-\beta (E_\alpha-\mu N)) \phantom{\big|}$} \\ \end{array} $$

entropy as a measure of disorder

$$ H(T)\quad\to\quad \left\{\begin{array}{rcl} 0 &\mbox{for}& T\to0 \\[1ex] Nk_B\ln(f) &\mbox{for}& T\to\infty \end{array}\right. $$

information theory

maximal entropy distributions

$$ H[p] = -\sum_{\alpha=1}^M p_\alpha\log(p_\alpha) = -\frac{M}{M}\log(1/M) = \log(M),\qquad\quad p_\alpha = \frac{1}{M} $$

information gain

reduction in entropy given an event occured
$$ H_{\mbox{initial}} = \frac{9}{14}\log_2(14/9)+\frac{5}{14}\log_2(14/5) =0.940 % (log(14.0/9.0)*9.0/14.0+log(14.0/5.0)*5.0/14.0)/log(2.0) $$

conditional entropy

$$ \begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{sunny} & 5 & 2 & 3 & 0.971\\ \mbox{overcast} & 4 & 4 & 0 & 0\\ \mbox{rain} & 5 & 3 & 2 & 0.971 \end{array} % (log(5.0/2.0)*2.0/5.0+log(5.0/3.0)*3.0/5.0)/log(2.0) $$ $$ H_{\mbox{outlook}} = \frac{5}{14} H_{\mbox{sunny}} + \frac{4}{14} H_{\mbox{overcast}} + \frac{5}{14} H_{\mbox{rain}} = \frac{10}{14} 0.971 = 0.694 $$ $$ G_{\mbox{outlook}} = H_{\mbox{initial}} - H_{\mbox{outlook}} = 0.940-0.694 = 0.246 $$

determining optimal attribute

'outlook'

$$ \begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{sunny} & 5 & 2 & 3 & 0.971\\ \mbox{overcast} & 4 & 4 & 0 & 0\\ \mbox{rain} & 5 & 3 & 2 & 0.971 \end{array}\qquad\Rightarrow \qquad \fbox{$\displaystyle\phantom{\big|} H_{\mbox{outlook}} = 0.694 \phantom{\big|}$} $$

'wind'

$$ \begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{weak} & 8 & 6 & 2 & 0.811\\ \mbox{strong} & 6 & 3 & 3 & 1.0\\ \end{array}\qquad\Rightarrow % (log(8.0/6.0)*6.0/8.0+log(8.0/2.0)*2.0/8.0)/log(2.0) % (log(6.0/3.0)*3.0/6.0+log(6.0/3.0)*3.0/6.0)/log(2.0) \qquad \fbox{$\displaystyle\phantom{\big|} H_{\mbox{wind}} = 0.892 \phantom{\big|}$} % 0.811*8.0/14.0 + 1.0*6.0/14.0 $$

'humidity'

$$ \begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{normal} & 7 & 6 & 1 & 0.592\\ \mbox{high} & 7 & 3 & 4 & 0.985\\ \end{array}\qquad\Rightarrow % (log(7.0/6.0)*6.0/7.0+log(7.0/1.0)*1.0/7.0)/log(2.0) % (log(7.0/3.0)*3.0/7.0+log(7.0/4.0)*4.0/7.0)/log(2.0) \qquad \fbox{$\displaystyle\phantom{\big|} H_{\mbox{humidity}} = 0.789 \phantom{\big|}$} % 0.985*7.0/14.0 + 0.592*7.0/14.0 $$

'temperature'

$$ \begin{array}{r|ccc|c} & N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline \mbox{coool} & 4 & 3 & 1 & 0.811\\ \mbox{mild} & 6 & 4 & 2 & 0.918\\ \mbox{hot} & 4 & 2 & 2 & 1.0\\ \end{array}\qquad\Rightarrow % (log(4.0/3.0)*3.0/4.0+log(4.0/1.0)*1.0/4.0)/log(2.0) % (log(6.0/4.0)*4.0/6.0+log(6.0/2.0)*2.0/6.0)/log(2.0) \qquad \fbox{$\displaystyle\phantom{\big|} H_{\mbox{temperature}} = 0.911 \phantom{\big|}$} % 0.811*4.0/14.0 + 0.918*6.0/14.0 + 4.0/14.0 $$

information gain

$$ H_{\mbox{initial}}- \big\langle H_{\mbox{after}}\big\rangle = \left\{\begin{array}{rcl} 0.246 && \mbox{'outlook'} \\ 0.048 && \mbox{'wind'} \\ 0.151 && \mbox{'humidity'} \\ 0.029 && \mbox{'temperature'} \end{array}\right. $$

recursive entropy weighting






Gini index

$\displaystyle\qquad\quad \mbox{Gini}[p] = 1-\sum_\alpha p_\alpha^2 $
$\displaystyle\qquad\quad \begin{array}{rcl} H[p] &=&-\sum_\alpha p_\alpha\log(p_\alpha) \\ &=&-\sum_\alpha p_\alpha\log(1-(1-p_\alpha)) \\ &\approx& \sum_\alpha p_\alpha(1-p_\alpha) = 1-\sum_\alpha p_\alpha^2 \end{array} $

entropy vs. Gini

$$ 1-\sum_{\alpha=1}^2 p_\alpha^2 = 1 - p_1^2 - (1-p_1)^2 =2p_1-2p_1^2 = 2p_1(1-p_1) $$

overfitting - pruning





given enough parameters,
one can fit an elephant

avoiding overfitting

range of algorithms

pruning example

evaluation of job offer


pruned graph

drawing graphs

Copy Copy to clipboad
Downlaod Download
graph G { 
 one -- two -- three -- four -- one
/* run with
 * circo -Tpdf test.dot -o test.pdf
 * neato -Tpdf test.dot -o test.pdf
 * dot   -Tpdf test.dot -o test.pdf
 */
}

job evaluation graph

Copy Copy to clipboad
Downlaod Download
digraph G {                          // digraph: directed graph

         rankdir=TD;                 // top-down, try LR
         node   [penwidth=2.0]       // for all nodes

         salary      [shape="oval",color="#000099",fontcolor="#000099"]    
         increase    [shape="oval",color="#000099",fontcolor="#000099"]  
         hours       [shape="oval",color="#000099",fontcolor="#000099"]
         holidays    [shape="oval",color="#000099",fontcolor="#000099"]
         retirement  [shape="oval",color="#000099",fontcolor="#000099"]

         salary      [label="salary"]    
         increase    [label="prospective wage increase"]    
         hours       [label="# working hours"]    
         holidays    [label="# holidays"]    
         retirement  [label="early retirement possible?"]    

         good1     [shape="box",color="#009900",fontcolor="#009900"]
         good2     [shape="box",color="#009900",fontcolor="#009900"]
         good4     [shape="box",color="#009900",fontcolor="#009900"]
         bad1      [shape="box",color="#990000",fontcolor="#990000"]
         bad2      [shape="box",color="#990000",fontcolor="#990000"]
         bad4      [shape="box",color="#990000",fontcolor="#990000"]

         good1       [label="good"]    
         good2       [label="good"]    
         good4       [label="good"]    
         bad1        [label="bad"]    
         bad2        [label="bad"]    
         bad4        [label="bad"]    

       {rank=same; holidays; hours}                     // on same rank level
       {rank=same; bad1; retirement; good1; increase } 


          salary  -> holidays                         [label=" high"]
                     holidays -> retirement           [label=" <20"]
                     holidays -> good1                [label=" >20"]
                                 retirement -> good2          [label=" yes"]
                                 retirement -> bad2           [label=" no"]
          salary  -> hours                            [label=" low"]
                     hours    -> increase             [label=" <35"]
                     hours    -> bad1                 [label=" >35"]
                                 increase   -> bad4           [label=" <4%"]
                                 increase   -> good4          [label=" >4%"]

/* some info 
 *
 * http://www.graphviz.org/Documentation.php
 * dot -Tpdf test.dot -o test.pdf
*/
}