Advanced Introduction to C++, Scientific Computing and Machine Learning
Claudius Gros, SS 24
Institut für theoretische Physik
Goethe-University Frankfurt a.M.
Decision Trees
decision trees (DT) are everywhere
- daily life
- medicine
symptoms $\ \ \to \ \ $ diagnosis
- playing Go, Chess, ..
- ..
optimal decision trees
- data:
$\displaystyle\quad
\begin{array}{rcl}
\mbox{training} & \ \Leftrightarrow \ & \mbox{testing}\\
\mbox{(over)fitting}& \ \Leftrightarrow \ & \mbox{generalization}\\
& \vdots &\\
\mbox{complex}& \ \Leftrightarrow \ & \mbox{simple}
\end{array}
$
whether to play tennis
- classic decision-tree example
- training data
Day | Outlook | Temperature | Humidity | Wind | PlayTennis
---|---|---|---|---|---
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
- attributes: Outlook, Temperature, ..
- class: PlayTennis
- prediction / testing
Day | Outlook | Temperature | Humidity | Wind | PlayTennis
---|---|---|---|---|---
today | Sunny | Hot | Normal | Weak | ?
tomorrow | Overcast | Mild | Normal | Weak | ?
decision tree complexities
- how to find out that
- 'temperature' irrelevant
- 'outlook' $\ \to\ $ 'overcast' enough
- 'outlook' $\ \to\ $ 'humidity' does not need 'wind'
- 'outlook' $\ \to\ $ 'wind' does not need 'humidity'
- generic problem
: elimination of irrelevant attributes
: taking relevant decisions first
- e.g. symptoms $ \ \to\ $ diagnosis
Day | Outlook | Temperature | Humidity | Wind | Play?
---|---|---|---|---|---
D9 | Sunny | Cool | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D8 | Sunny | Mild | High | Weak | No
 | | | | |
D3 | Overcast | Hot | High | Weak | Yes
D7 | Overcast | Cool | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
 | | | | |
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D14 | Rain | Mild | High | Strong | No
simple rule - complicated tree
- starting with 'wind'
: overfitting?
Day | Outlook | Temperature | Humidity | Wind | PlayTennis
---|---|---|---|---|---
D9 | Sunny | Cool | Normal | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
 | | | | |
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D1 | Sunny | Hot | High | Weak | No
D8 | Sunny | Mild | High | Weak | No
 | | | | |
D7 | Overcast | Cool | Normal | Strong | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D6 | Rain | Cool | Normal | Strong | No
 | | | | |
D12 | Overcast | Mild | High | Strong | Yes
D2 | Sunny | Hot | High | Strong | No
D14 | Rain | Mild | High | Strong | No
constructing decision trees
- CART (Classification and Regression Tree); binary
- ID3 (Iterative Dichotomiser)
- select attributes with maximal information gain
- minimizing entropy (uncertainty)
entropy
- probability density function (PDF)
$$
p(x)\ge 0, \qquad \int p(x)\,dx = 1
\qquad\mbox{or}\qquad
p_\alpha\ge 0, \qquad \sum_\alpha p_\alpha = 1
$$
- continuous or discrete (flipping a coin)
- entropy $\ \ H[p]=-\big\langle \ln(p)\big\rangle$
$$
H[p] = -\int p(x)\ln\big[p(x)\big]\,dx
\qquad\mbox{or}\qquad
H[p] = -\sum_\alpha p_\alpha\ln\big[p_\alpha\big]
$$
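A minimal C++ sketch of the discrete formula (probabilities assumed normalized; zero-probability terms are skipped, using $\lim_{x\to0} x\ln(x)=0$):

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// discrete entropy H[p] = -sum_a p_a ln(p_a)
double entropy(const std::vector<double>& p) {
  double H = 0.0;
  for (double pa : p)
    if (pa > 0.0) H -= pa*std::log(pa);  // skip p_a = 0 terms
  return H;
}

int main() {
  std::cout << entropy({0.5, 0.5}) << "\n";  // fair coin: ln(2) = 0.693...
  std::cout << entropy({1.0, 0.0}) << "\n";  // certain outcome: 0
  return 0;
}
```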
entropy
statistical physics
$$ H[p]=-k_B\big\langle \ln(p)\big\rangle,
\qquad\quad \beta=\frac{1}{k_BT}
$$
- Boltzmann constant $\ \ k_B$
- temperature $\ \ T$
$$
\begin{array}{rcl}
\mbox{micro-canonical ensemble} && \mbox{maximal entropy} \\[1ex]
\mbox{canonical ensemble} && \mbox{maximal entropy, given an energy E} \\
&& \mbox{Boltzmann distribution} \ \
\fbox{$\displaystyle\phantom{\big|} p_\alpha\sim\exp(-\beta E_\alpha) \phantom{\big|}$} \\[1ex]
\mbox{grand-canonical ensemble} && \mbox{maximal entropy, given energy E, particle number N} \\
&& \mbox{Boltzmann distribution} \ \
\fbox{$\displaystyle\phantom{\big|} p_\alpha\sim\exp(-\beta (E_\alpha-\mu N)) \phantom{\big|}$} \\
\end{array}
$$
$$
- chemical potential $\ \ \mu$
entropy as a measure of disorder
$$
H(T)\quad\to\quad
\left\{\begin{array}{rcl}
0 &\mbox{for}& T\to0 \\[1ex]
Nk_B\ln(f) &\mbox{for}& T\to\infty
\end{array}\right.
$$
- degrees of freedom per particle $\ \ f$
information theory
- entropy non-negative, since $\ \ 0\le p_\alpha \le 1, \qquad \lim_{x\to0^+} x\log(x)=0$
- logarithm concave
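The quoted limit ensures that outcomes with vanishing probability do not contribute to $H[p]$; it follows, e.g., from l'Hôpital's rule:
$$
\lim_{x\to 0^+} x\ln(x)
= \lim_{x\to 0^+} \frac{\ln(x)}{1/x}
= \lim_{x\to 0^+} \frac{1/x}{-1/x^2}
= \lim_{x\to 0^+} (-x) = 0
$$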
maximal entropy distributions
- no constraint: uniform distribution
: variational calculus
$$
H[p] = -\sum_{\alpha=1}^M p_\alpha\log(p_\alpha)
= -M\,\frac{1}{M}\log(1/M) = \log(M),\qquad\quad
p_\alpha = \frac{1}{M}
$$
- number of events $\ \ M$
- with constraint: exponential distribution (see the variational sketch below)
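A sketch of the variational argument: maximize $H[p]$ with a Lagrange multiplier $\lambda$ enforcing normalization,
$$
\frac{\partial}{\partial p_\beta}\left[
-\sum_\alpha p_\alpha\ln(p_\alpha)
+\lambda\Big(\sum_\alpha p_\alpha-1\Big)\right]
= -\ln(p_\beta)-1+\lambda = 0,
\qquad\quad
p_\beta = \mathrm{e}^{\lambda-1} = \frac{1}{M}
$$
where $p_\beta$ does not depend on $\beta$, and normalization fixes $\lambda$. Adding a constraint on the mean, $\sum_\alpha p_\alpha E_\alpha=E$, introduces a second multiplier and the same calculation yields the exponential (Boltzmann) distribution $p_\alpha\sim\exp(-\beta E_\alpha)$.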
information gain
reduction in entropy, given that the outcome of an event is known
- entropy with respect to classification (yes/no)
- tennis example $\ \ N=14, \qquad N_{\mbox{yes}}=9, \qquad N_{\mbox{no}}=5$
$$
H_{\mbox{initial}} = \frac{9}{14}\log_2(14/9)+\frac{5}{14}\log_2(14/5)
=0.940
% (log(14.0/9.0)*9.0/14.0+log(14.0/5.0)*5.0/14.0)/log(2.0)
$$
- maximum $\ \ \log_2(M)=1, \qquad\quad M=2$
conditional entropy
- attribute 'outlook' has three possible events
$$
\begin{array}{r|ccc|c}
& N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline
\mbox{sunny} & 5 & 2 & 3 & 0.971\\
\mbox{overcast} & 4 & 4 & 0 & 0\\
\mbox{rain} & 5 & 3 & 2 & 0.971
\end{array}
% (log(5.0/2.0)*2.0/5.0+log(5.0/3.0)*3.0/5.0)/log(2.0)
$$
- average (over all possible events) entropy
$$
H_{\mbox{outlook}} = \frac{5}{14} H_{\mbox{sunny}}
+ \frac{4}{14} H_{\mbox{overcast}}
+ \frac{5}{14} H_{\mbox{rain}}
= \frac{10}{14} 0.971 = 0.694
$$
- information gain for attribute 'outlook'
$$
G_{\mbox{outlook}} = H_{\mbox{initial}} -
H_{\mbox{outlook}} = 0.940-0.694 = 0.246
$$
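These numbers are easy to check; a minimal C++ sketch, with the counts hard-coded from the table above:

```cpp
#include <cmath>
#include <iostream>

// entropy in bits of a yes/no sample: nYes 'yes' among n
double H(double nYes, double n) {
  double h = 0.0;
  for (double p : {nYes/n, (n-nYes)/n})
    if (p > 0.0) h -= p*std::log2(p);
  return h;
}

int main() {
  double Hinitial = H(9,14);          // 0.940
  double Houtlook = 5.0/14*H(2,5)     // sunny
                  + 4.0/14*H(4,4)     // overcast
                  + 5.0/14*H(3,5);    // rain
  std::cout << "H_outlook = " << Houtlook          << "\n"   // 0.694
            << "G_outlook = " << Hinitial-Houtlook << "\n";  // 0.246
  return 0;
}
```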
determining optimal attribute
'outlook'
$$
\begin{array}{r|ccc|c}
& N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline
\mbox{sunny} & 5 & 2 & 3 & 0.971\\
\mbox{overcast} & 4 & 4 & 0 & 0\\
\mbox{rain} & 5 & 3 & 2 & 0.971
\end{array}\qquad\Rightarrow
\qquad
\fbox{$\displaystyle\phantom{\big|}
H_{\mbox{outlook}} = 0.694 \phantom{\big|}$}
$$
'wind'
$$
\begin{array}{r|ccc|c}
& N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline
\mbox{weak} & 8 & 6 & 2 & 0.811\\
\mbox{strong} & 6 & 3 & 3 & 1.0\\
\end{array}\qquad\Rightarrow
% (log(8.0/6.0)*6.0/8.0+log(8.0/2.0)*2.0/8.0)/log(2.0)
% (log(6.0/3.0)*3.0/6.0+log(6.0/3.0)*3.0/6.0)/log(2.0)
\qquad
\fbox{$\displaystyle\phantom{\big|}
H_{\mbox{wind}} = 0.892 \phantom{\big|}$}
% 0.811*8.0/14.0 + 1.0*6.0/14.0
$$
'humidity'
$$
\begin{array}{r|ccc|c}
& N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline
\mbox{normal} & 7 & 6 & 1 & 0.592\\
\mbox{high} & 7 & 3 & 4 & 0.985\\
\end{array}\qquad\Rightarrow
% (log(7.0/6.0)*6.0/7.0+log(7.0/1.0)*1.0/7.0)/log(2.0)
% (log(7.0/3.0)*3.0/7.0+log(7.0/4.0)*4.0/7.0)/log(2.0)
\qquad
\fbox{$\displaystyle\phantom{\big|}
H_{\mbox{humidity}} = 0.789 \phantom{\big|}$}
% 0.985*7.0/14.0 + 0.592*7.0/14.0
$$
'temperature'
$$
\begin{array}{r|ccc|c}
& N & N_{\mbox{yes}} & N_{\mbox{no}} & H\\ \hline
\mbox{cool} & 4 & 3 & 1 & 0.811\\
\mbox{mild} & 6 & 4 & 2 & 0.918\\
\mbox{hot} & 4 & 2 & 2 & 1.0\\
\end{array}\qquad\Rightarrow
% (log(4.0/3.0)*3.0/4.0+log(4.0/1.0)*1.0/4.0)/log(2.0)
% (log(6.0/4.0)*4.0/6.0+log(6.0/2.0)*2.0/6.0)/log(2.0)
\qquad
\fbox{$\displaystyle\phantom{\big|}
H_{\mbox{temperature}} = 0.911 \phantom{\big|}$}
% 0.811*4.0/14.0 + 0.918*6.0/14.0 + 4.0/14.0
$$
information gain
- constant $\ \ H_{\mbox{initial}}=0.940$
$$
H_{\mbox{initial}}- \big\langle H_{\mbox{after}}\big\rangle =
\left\{\begin{array}{rcl}
0.246 && \mbox{'outlook'} \\
0.048 && \mbox{'wind'} \\
0.151 && \mbox{'humidity'} \\
0.029 && \mbox{'temperature'}
\end{array}\right.
$$
- 'outlook' is best, 'temperature' worst
recursive entropy weighting
- attribute with highest information gain
$\to \ \ $ root of decision tree
- repeat recursively for every branch (code sketch below)
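A compact recursive sketch of this procedure (an illustration, not the course reference code; categorical attributes stored as strings, class label in the last column; compile with -std=c++17):

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Row = std::vector<std::string>;   // attribute values; last entry = class label

double entropy(const std::vector<Row>& data) {   // class entropy in bits
  std::map<std::string,int> count;
  for (const auto& r : data) ++count[r.back()];
  double H = 0.0;
  for (const auto& [label,n] : count) {
    double p = double(n)/data.size();
    H -= p*std::log2(p);
  }
  return H;
}

// partition the data according to the value of attribute column a
std::map<std::string,std::vector<Row>> split(const std::vector<Row>& data, int a) {
  std::map<std::string,std::vector<Row>> parts;
  for (const auto& r : data) parts[r[a]].push_back(r);
  return parts;
}

double gain(const std::vector<Row>& data, int a) {   // G_a = H - <H_after>
  double Hafter = 0.0;
  for (const auto& [value,part] : split(data,a))
    Hafter += double(part.size())/data.size()*entropy(part);
  return entropy(data) - Hafter;
}

// recursive ID3, printing the tree as indented text
void id3(const std::vector<Row>& data, std::vector<int> attrs,
         const std::vector<std::string>& names, int depth = 0) {
  std::string pad(2*depth,' ');
  if (entropy(data) == 0.0 || attrs.empty()) {      // pure node, or no attributes left
    std::map<std::string,int> count;                // -> majority class label
    for (const auto& r : data) ++count[r.back()];
    auto top = std::max_element(count.begin(), count.end(),
        [](const auto& x, const auto& y){ return x.second < y.second; });
    std::cout << pad << "-> " << top->first << "\n";
    return;
  }
  int a = *std::max_element(attrs.begin(), attrs.end(),   // maximal information gain
      [&](int x, int y){ return gain(data,x) < gain(data,y); });
  std::vector<int> rest;
  for (int x : attrs) if (x != a) rest.push_back(x);
  for (const auto& [value,part] : split(data,a)) {
    std::cout << pad << names[a] << " = " << value << "\n";
    id3(part, rest, names, depth+1);
  }
}

int main() {
  std::vector<std::string> names = {"Outlook","Temperature","Humidity","Wind"};
  std::vector<Row> data = {
    {"Sunny","Hot","High","Weak","No"},         {"Sunny","Hot","High","Strong","No"},
    {"Overcast","Hot","High","Weak","Yes"},     {"Rain","Mild","High","Weak","Yes"},
    {"Rain","Cool","Normal","Weak","Yes"},      {"Rain","Cool","Normal","Strong","No"},
    {"Overcast","Cool","Normal","Strong","Yes"},{"Sunny","Mild","High","Weak","No"},
    {"Sunny","Cool","Normal","Weak","Yes"},     {"Rain","Mild","Normal","Weak","Yes"},
    {"Sunny","Mild","Normal","Strong","Yes"},   {"Overcast","Mild","High","Strong","Yes"},
    {"Overcast","Hot","Normal","Weak","Yes"},   {"Rain","Mild","High","Strong","No"}};
  id3(data, {0,1,2,3}, names);   // roots the tree at 'Outlook' (largest gain)
  return 0;
}
```

On the tennis data this reproduces the tree discussed above: 'outlook' at the root, 'humidity' below 'sunny', 'wind' below 'rain', and a pure 'yes' leaf for 'overcast'.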
Gini index
- one (of many) alternatives to information gain
$\displaystyle\qquad\quad
\mbox{Gini}[p] = 1-\sum_\alpha p_\alpha^2
$
- to be minimized (an impurity measure, like the entropy)
- expanding the entropy for large probabilities
$\displaystyle\qquad\quad
\begin{array}{rcl}
H[p] &=&-\sum_\alpha p_\alpha\log(p_\alpha) \\
&=&-\sum_\alpha p_\alpha\log(1-(1-p_\alpha)) \\
&\approx& \sum_\alpha p_\alpha(1-p_\alpha)
= 1-\sum_\alpha p_\alpha^2
\end{array}
$
- expansion valid for $ \ \ p_\alpha\to 1$; the last step uses $\ \ \sum_\alpha p_\alpha=1$
entropy vs. Gini
$$
1-\sum_{\alpha=1}^2 p_\alpha^2 = 1 - p_1^2 - (1-p_1)^2
=2p_1-2p_1^2 = 2p_1(1-p_1)
$$
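Both measures vanish for pure nodes and peak at $p_1=1/2$; a quick numerical comparison (a sketch):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // binary entropy (in bits) vs. Gini index 2 p (1-p)
  for (double p = 0.1; p < 0.95; p += 0.2) {
    double H    = -p*std::log2(p) - (1.0-p)*std::log2(1.0-p);
    double gini = 2.0*p*(1.0-p);
    std::printf("p = %.1f   H = %.3f   Gini = %.3f\n", p, H, gini);
  }
  return 0;
}
```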
overfitting - pruning
given enough parameters,
one can fit an elephant
- given enough nodes,
one can fit every decision process
avoiding overfitting
- pre-pruning
: stop when new branches become statistically insignificant
- post-pruning
: grow full tree, then remove branches
$\rightarrow\ \ $ replacing subtree by leaf
- optimal tree
performance $\ \ \leftrightarrow \ \ $ overfitting
- performance with respect to validation data set
- complexity penalty
range of algorithms
- CART (binary)
- ID3 (used here)
- C4.5 (extends ID3)
    - includes pruning
    - permits real-valued attributes (see the sketch below)
    - allows missing values
    - robust to noise
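C4.5-style handling of a real-valued attribute tests binary splits $x\le t$, with candidate thresholds $t$ at midpoints between sorted sample values; a minimal sketch on hypothetical toy data:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// entropy in bits of a yes/no sample: nYes 'yes' among n
double H(int nYes, int n) {
  double h = 0.0;
  for (double p : {double(nYes)/n, double(n-nYes)/n})
    if (p > 0.0) h -= p*std::log2(p);
  return h;
}

int main() {
  // hypothetical toy data: (temperature, play tennis?)
  std::vector<std::pair<double,bool>> d =
    {{12,false},{15,false},{18,true},{21,true},{24,true},{30,false}};
  std::sort(d.begin(), d.end());
  int n = int(d.size()), nYes = 0;
  for (const auto& s : d) nYes += s.second;

  double bestGain = -1.0, bestT = 0.0;
  int left = 0;                            // 'yes' count in the x <= t part
  for (int i = 0; i+1 < n; ++i) {          // thresholds between consecutive samples
    left += d[i].second;
    double t = 0.5*(d[i].first + d[i+1].first);
    int nL = i+1, nR = n-nL;
    double g = H(nYes,n) - (double(nL)/n)*H(left,nL)
                         - (double(nR)/n)*H(nYes-left,nR);
    if (g > bestGain) { bestGain = g; bestT = t; }
  }
  std::printf("best split: x <= %.1f, gain = %.3f bits\n", bestT, bestGain);
  return 0;
}
```

For the toy data the best cut is $x\le 16.5$, separating the two cold 'no' days from the rest.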
pruning example
evaluation of job offer
- here: only the working principle
- pruning may increase the training error rate
while improving generalization performance
: statistical relevance, reduced complexity, ...
pruned graph
drawing graphs
- a range of open-source software:
    - quick and dirty: GraphViz
    - nodes arranged automatically
    - fine tuning difficult
- for Linux: dot, together with neato / circo / twopi
for distinct node-placement algorithms (all reading the dot language)
graph G {
one -- two -- three -- four -- one
/* run with
* circo -Tpdf test.dot -o test.pdf
* neato -Tpdf test.dot -o test.pdf
* dot -Tpdf test.dot -o test.pdf
*/
}
job evaluation graph
digraph G { // digraph: directed graph
rankdir=TB; // top to bottom, try LR
node [penwidth=2.0] // for all nodes
salary [shape="oval",color="#000099",fontcolor="#000099"]
increase [shape="oval",color="#000099",fontcolor="#000099"]
hours [shape="oval",color="#000099",fontcolor="#000099"]
holidays [shape="oval",color="#000099",fontcolor="#000099"]
retirement [shape="oval",color="#000099",fontcolor="#000099"]
salary [label="salary"]
increase [label="prospective wage increase"]
hours [label="# working hours"]
holidays [label="# holidays"]
retirement [label="early retirement possible?"]
good1 [shape="box",color="#009900",fontcolor="#009900"]
good2 [shape="box",color="#009900",fontcolor="#009900"]
good4 [shape="box",color="#009900",fontcolor="#009900"]
bad1 [shape="box",color="#990000",fontcolor="#990000"]
bad2 [shape="box",color="#990000",fontcolor="#990000"]
bad4 [shape="box",color="#990000",fontcolor="#990000"]
good1 [label="good"]
good2 [label="good"]
good4 [label="good"]
bad1 [label="bad"]
bad2 [label="bad"]
bad4 [label="bad"]
{rank=same; holidays; hours} // on same rank level
{rank=same; bad1; retirement; good1; increase }
salary -> holidays [label=" high"]
holidays -> retirement [label=" <20"]
holidays -> good1 [label=" >20"]
retirement -> good2 [label=" yes"]
retirement -> bad2 [label=" no"]
salary -> hours [label=" low"]
hours -> increase [label=" <35"]
hours -> bad1 [label=" >35"]
increase -> bad4 [label=" <4%"]
increase -> good4 [label=" >4%"]
/* some info
*
* http://www.graphviz.org/Documentation.php
* dot -Tpdf test.dot -o test.pdf
*/
}