This post is a summary of what I studied in the 'Statistical Data Mining' course I took at university in the first semester of 2021. Corrections are always welcome :)

๐Ÿ’ฅ ์ด๋ฒˆ ํฌ์ŠคํŠธ๋Š” <Decision Tree>๋ฅผ ์ˆ˜ํ•™์ ์œผ๋กœ ์ •์˜ํ•˜๋Š” ๋‚ด์šฉ์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค. DT์— ๋Œ€ํ•œ ์ •์˜์™€ ๊ฐœ๋…์€ ๋‹ค๋ฅธ ํฌ์ŠคํŠธ๋ฅผ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š” ๐Ÿ˜‰

DT Estimation

DT๋ฅผ ์ด๋ฃจ๋Š” ๊ฐ terminal node ๋˜๋Š” leaf node๋Š” DT๊ฐ€ ๋ถ„ํ• ํ•˜๋Š” region $R_i$์— ๋Œ€์‘๋œ๋‹ค.

A DT can be applied to both regression and classification problems, and the DT estimation function $f$ for each case is defined as follows.

1. DT estimation in a regression problem

\[\hat{f}(x) = \sum^M_{m=1} \hat{c}_m \cdot I(x\in R_m)\]

where

\[\hat{c}_m = \frac{1}{n_m} \sum_{x_i \in R_m} y_i\]

$n_m$ is the # of training data in $R_m$.

In other words, the estimate is simply the mean of the training responses belonging to the terminal node.
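
Below is a minimal NumPy sketch of this idea (the helper names are my own, not from the course): each leaf stores the mean response of its training points, and prediction just looks that value up.

```python
# A minimal sketch of the regression estimate: within each region R_m,
# the prediction c_hat_m is the mean of the training responses in that leaf.
import numpy as np

def regression_leaf_estimates(y, leaf_ids):
    """leaf_ids[i] = index m of the region R_m containing x_i."""
    return {m: y[leaf_ids == m].mean() for m in np.unique(leaf_ids)}

def predict(leaf_ids_new, c_hat):
    """f_hat(x) = sum_m c_hat_m * I(x in R_m)."""
    return np.array([c_hat[m] for m in leaf_ids_new])

y = np.array([1.0, 2.0, 3.0, 10.0, 12.0])
leaf_ids = np.array([0, 0, 0, 1, 1])            # leaf index of each training point
c_hat = regression_leaf_estimates(y, leaf_ids)  # {0: 2.0, 1: 11.0}
print(predict(np.array([0, 1]), c_hat))         # [ 2. 11.]
```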


2. DT estimation in a classification problem

\[\hat{P}(Y=k \mid X=x) = \hat{p}_{mk} = \frac{1}{n_m} \sum_{x_i \in R_m} I(y_i = k) \quad \text{for} \quad x \in R_m\]

Then, for given $X = x$, the estimate $\hat{Y}$ is

\[\hat{Y} = \underset{k}{\text{argmax}} \; \hat{P}(Y=k \mid X=x)\]

In other words, the prediction is simply made by majority vote within the terminal node.
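
Again a tiny sketch (my own helper names): estimate the class proportions $\hat{p}_{mk}$ in a leaf and predict by argmax, i.e. majority vote.

```python
# A minimal sketch: class-proportion estimates p_hat_{mk} within a leaf
# and prediction by majority vote (argmax over k).
import numpy as np

def leaf_class_probs(y_leaf, n_classes):
    """p_hat_{mk} = (1/n_m) * sum I(y_i = k) over training points in the leaf."""
    counts = np.bincount(y_leaf, minlength=n_classes)
    return counts / len(y_leaf)

y_leaf = np.array([0, 1, 1, 1, 2])             # labels of training points in R_m
p_hat = leaf_class_probs(y_leaf, n_classes=3)  # [0.2, 0.6, 0.2]
y_hat = np.argmax(p_hat)                       # majority vote -> class 1
print(p_hat, y_hat)
```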



DT์˜ ์žฅ์ ์€ ์ข‹์€ interpretation power๋ฅผ ์ œ๊ณตํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค! ๋•Œ๋กœ๋Š” ๊ฒฐ์ •์— ๋Œ€ํ•œ ์ด์œ ๋ฅผ ์„ค๋ช…ํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ๋Š”๋ฐ, DT๋Š” ์ด๋Ÿฐ ์ƒํ™ฉ์—์„œ ์ข‹์€ interpretation insight๋ฅผ ์ œ๊ณตํ•œ๋‹ค.

e.g., a bank must explain the reasons for rejection to a loan applicant.



DT Construction

A DT is built through the following four steps.

1. Growing

  • Find an optimal splitting rule for each node and grow the tree.
  • Stop growing if stopping rule is satisfied.

2. Pruning

  • Remove nodes which increase prediction error or which have inappropriate inference rules.
    (This reduces complexity! ๐Ÿ˜†)

โ€ป The number of terminal nodes affects the complexity of the DT model.

3. Validation

  • Validate to decide how much we should prune the tree. (This is related to overfitting.)

4. Interpretation & Prediction

  • Interpret the constructed tree & Predict!

Splitting Rule

At each step of growing the DT, the following two things must be decided.

  1. splitting variable $X_j$
  2. splitting criterion

Here, how the criterion is constructed differs depending on whether $X_j$ is continuous or categorical.

Splitting Rule: Regression

Assume that all input variables are continuous and

\[R_1 (j, s) = \left\{ X : X_j \le s \right\} \quad \text{and} \quad R_2(j, s) = \left\{ X : X_j > s \right\}\]

splitting crieterion์„ ์–ป๊ธฐ ์œ„ํ•ด splitting variable $X_j$์™€ split point $s$๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๋ฌธ์ œ๋Š” ์•„๋ž˜์˜ ์‹์„ ํ‘ธ๋Š” ๊ฒƒ๊ณผ ๋™์น˜๋‹ค.

\[\min_{j, s} \left[ \min_{c_1} \left( \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2\right) + \min_{c_2} \left( \sum_{x_i \in R_2 (j, s)} (y_i - c_2)^2 \right) \right]\]

Here, the inner minimizations are attained at $c_1$ and $c_2$, the mean responses of the data belonging to $R_1(j, s)$ and $R_2(j, s)$.

์œ„์˜ ์‹์„ ์ž˜ ์‚ดํŽด๋ณด๋ฉด, ๊ฒฐ๊ตญ โ€œtraining errorโ€๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” $X_j$์™€ $s$๋ฅผ ์ฐพ๋Š” ๋ฌธ์ œ์ž„์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค!

Splitting Rule: Classification

Classification๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ โ€œtraining errorโ€๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ์•„๋ž˜์˜ ์‹์„ ํ’€์–ด splitting criterion์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

\[\min_{j, s} \left[ \min_{k_1} \left( \sum_{x_i \in R_1(j, s)} I(y_i \ne k_1) \right) + \min_{k_2} \left( \sum_{x_i \in R_2(j, s)} I(y_i \ne k_2) \right) \right]\]

๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์ด๋•Œ, $k_1$๊ณผ $k_2$๋Š” $R_1(j, s)$, $R_2(j, s)$์— ์†ํ•˜๋Š” data์—์„œ ๋‹ค์ˆ˜๊ฒฐ์— ์˜ํ•ด ๊ตฌํ•œ label์ด๋‹ค.

Classification DT์—์„œ๋Š” Splitting์ด ์ข‹์€์ง€ ์•ˆ ์ข‹์€์ง€ <Impurity>๋ฅผ ํ†ตํ•ด ์ธก์ •ํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ž˜์„œ ์œ„์˜ ๊ฒฝ์šฐ๋Š” zero-one loss๋ฅผ ํ†ตํ•ด criterion์„ ๊ฒฐ์ •ํ–ˆ๋Š”๋ฐ, ์ด ๋ฐฉ์‹์€ ๊ณง <misclassification error>๋ผ๋Š” impurtiy measure๋ฅผ ์‚ฌ์šฉํ•œ ๊ฒƒ๊ณผ ๊ฐ™๋‹ค. ๋‹ค๋ฅธ impurity measure๋ฅผ ์ฃผ์–ด ๋™์ผํ•œ training data๋ผ๋„ ๋‹ค๋ฅธ DT๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค!


Impurity Measures

1. Misclassification Error

\[\frac{1}{n_m} \sum_{x_i \in R_m} I(y_i \ne k_m) = 1 - \hat{p}_{mk_m}\]

2. Gini Index

\[\sum^K_{k = 1} \hat{p}_{mk} (1 - \hat{p}_{mk})\]

3. Cross-Entropy or Deviance

\[- \sum^K_{k=1} \hat{p}_{mk} \log \hat{p}_{mk}\]
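
A small sketch computing the three measures from a node's estimated class proportions (function names are my own):

```python
# Impurity measures of a node m, computed from its class proportions p_hat_{mk}.
import numpy as np

def misclassification_error(p):
    return 1.0 - p.max()

def gini_index(p):
    return np.sum(p * (1.0 - p))

def cross_entropy(p):
    p = p[p > 0]                       # terms with p = 0 contribute 0; avoids log(0)
    return -np.sum(p * np.log(p))

p_hat = np.array([0.1, 0.7, 0.2])      # class proportions in node m
print(misclassification_error(p_hat))  # 0.3
print(gini_index(p_hat))               # 0.46
print(cross_entropy(p_hat))            # ~0.802
```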

Pruning

DT๋Š” splitting์˜ ๊ณ„์†ํ•œ๋‹ค๋ฉด, tratining error๋Š” 0์œผ๋กœ ์ˆ˜๋ ดํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๊ฒƒ์€ overfitting์„ ์•ผ๊ธฐํ•˜๊ณ , model์˜ generalization์„ ๋–จ์–ด๋œจ๋ฆฐ๋‹ค. ๊ทธ๋ž˜์„œ node์ด ๊ฐฏ์ˆ˜๋ฅผ ์ค„์ด๋Š” pruning ๊ณผ์ •์ด ํ•„์š”ํ•˜๋‹ค.

Let $T_0$ be a large tree obtained by splitting.

A subtree $T \subset T_0$ is any tree that can be obtained by pruning.

Let $n_m = \left| \{ i : x_i \in R_m \} \right|$ and \(\displaystyle\hat{c}_m = \frac{1}{n_m} \sum_{x_i \in R_m} y_i\).

1. In regression,

\[Q_m (T) = \frac{1}{n_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2\]

2. In classification (using the Gini index),

\[Q_m(T) = \sum^K_{k=1} \hat{p}_{mk} (1-\hat{p}_{mk})\]

Then, cost complexity pruning is defined by

\[C_{\alpha}(T) = \left(\sum^{|T|}_{m=1} n_m Q_m(T)\right) + \alpha |T|\]

where $\alpha > 0$ is a tuning parameter.

$C_\alpha(T)$์— ๋Œ€ํ•œ ์‹์„ ์ž˜ ์‚ดํŽด๋ณด๋ฉด, โ€œtraining errโ€์™€ โ€œcomplexity paneltyโ€์— ๋Œ€ํ•œ ํ…€์ด ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

์ด์ œ, ์œ„์˜ $C_\alpha(T)$๋ฅผ ๊ธฐ์ค€์œผ๋กœ Best Pruning Subtree $T_\alpha$๋ฅผ ๊ฒฐ์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

\[T_\alpha = \underset{T}{\text{argmin}} \; C_{\alpha} (T) \quad \text{where} \quad T \subset T_0\]

์œ„์™€ ๊ฐ™์€ ์ ‘๊ทผ๋ฒ•์„ <weakest link pruning>๋ผ๊ณ ๋„ ํ•œ๋‹ค.

The tuning parameter $\alpha$ is chosen to minimize the generalization error (typically via cross-validation).
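
For example, with scikit-learn this could look roughly like the following (the dataset and settings are arbitrary choices for illustration): `cost_complexity_pruning_path` gives candidate $\alpha$ values, and cross-validation picks among them.

```python
# A hedged sketch of choosing alpha for cost-complexity pruning via
# cross-validation, using scikit-learn's ccp_alpha parameter.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# candidate alphas from the pruning path of a fully grown tree T_0
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]          # drop the last alpha (it prunes down to the root)

# pick the alpha with the best cross-validated accuracy
cv_scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                             X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(cv_scores))]
print(best_alpha, max(cv_scores))
```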



Sensitivity & Specificity

Medical Classification Problem์„ ๋‹ค๋ฃจ๋Š” ์ƒํ™ฉ์—์„œ $Y=1$์ด disease state๋ผ๊ณ  ๊ฐ€์ •ํ•˜์ž. ์ด๋•Œ, <sensitivity>์™€ <specificity>๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ์ •์˜๋œ๋‹ค.

Definition. Sensitivity

โ€œProbability of predicting disease given true disease.โ€

This is the True Positive Rate, also called <Recall>.

\[P(\hat{Y} = 1 \mid Y = 1)\]

Definition. Specificity

โ€œProbability of predicting non-disease given true non-disease.โ€

This is the True Negative Rate.

\[P(\hat{Y} = 0 \mid Y = 0)\]
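
A tiny sketch of computing both quantities from binary labels and predictions (helper names are my own):

```python
# Sensitivity and specificity from binary labels and predictions.
import numpy as np

def sensitivity(y_true, y_pred):
    """P(Y_hat = 1 | Y = 1): the true positive rate (recall)."""
    return np.mean(y_pred[y_true == 1] == 1)

def specificity(y_true, y_pred):
    """P(Y_hat = 0 | Y = 0): the true negative rate."""
    return np.mean(y_pred[y_true == 0] == 0)

y_true = np.array([1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0])
print(sensitivity(y_true, y_pred))   # 2/3
print(specificity(y_true, y_pred))   # 3/4
```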

ROC curve

์•ž์—์„œ ์‚ดํŽด๋ณธ DT๋Š” ๋‹ค์ˆ˜๊ฒฐ(majority vote)์— ์˜ํ•ด prediction์„ ๊ฒฐ์ •ํ–ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋‹ค์ˆ˜๊ฒฐ ๋ฐฉ์‹ ์™ธ์—๋„ ์–ด๋–ค threshold ๊ฐ’์„ ์ •ํ•ด, spam/normal์„ ๋ถ„๋ฅ˜ํ•ด๋ณผ ์ˆ˜๋„ ์žˆ๋‹ค.

\[\hat{Y} = I\left( \hat{P}(Y=1 \mid X=x) > c \right)\]

for some threshold $c \in (0, 1)$; majority vote corresponds to $c = 0.5$ in the binary case.

By varying the value of the classification threshold $c$, we can measure the "sensitivity-specificity" pair; plotting these pairs (conventionally, sensitivity against $1-$specificity) gives the <ROC curve; Receiver Operating Characteristic curve>.

ps) The ROC curve can only be derived for classification problems!

์ด๋•Œ, ROC curve์˜ <AUC; Area Under the Curve>๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๋น„๊ต ํ‰๊ฐ€ํ•  ์ˆ˜๋„ ์žˆ๋‹ค.


DT: Advantages & Disadvantages

Advantages.

  • Robust to input outliers
  • Non-parametric model

Disadvantages.

  • Poor prediction accuracy for continuous outputs compared to a linear regression model.
  • When the depth is too large, not only accuracy but also interpretability suffers ๐Ÿ˜ฅ
  • Heavy computation cost
  • Unstable
  • Absence of linearity
  • Absence of main effects: all nodes are high-order interactions.
  • Discontinuity

์ด์–ด์ง€๋Š” ์ฑ•ํ„ฐ์—์„œ๋Š” ํ†ต๊ณ„์  ๋ชจ๋ธ์˜ ๊ฝƒ๐ŸŒน์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋Š” <Boosting>์— ๋Œ€ํ•ด ๋‹ค๋ฃฌ๋‹ค. ๋‚ด์šฉ์ด ์‰ฝ์ง„ ์•Š์ง€๋งŒ, ํ˜„๋Œ€ ํ†ต๊ณ„์˜ ํ•ต์‹ฌ์ด๊ธฐ ๋•Œ๋ฌธ์— ์ž˜ ์ตํ˜€๋‘ฌ์•ผ ํ•˜๋Š” ํ…Œํฌ๋‹‰์ด๋‹ค.