2021-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜ํ†ต๊ณ„์  ๋ฐ์ดํ„ฐ๋งˆ์ด๋‹โ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)

5 minute read

2021-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜ํ†ต๊ณ„์  ๋ฐ์ดํ„ฐ๋งˆ์ด๋‹โ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)

์šฐ๋ฆฌ๋Š” Feature์˜ ์ฐจ์›์ด ๋Š˜์–ด๋‚จ์— ๋”ฐ๋ผ <Curse of Dimensionality>๋ผ๋Š” ๋ฌธ์ œ๋ฅผ ๊ฐ–๊ฒŒ ๋œ๋‹ค. ์ด๋Ÿฐ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋งŽ์€ ๊ธฐ๋ฒ•๋“ค์ด ์ œ์‹œ๋˜์—ˆ๊ณ , ์ด๋ฒˆ ํฌ์ŠคํŠธ์—์„œ๋Š” ๊ทธ ๊ธฐ๋ฒ• ์ค‘ ์ „์ฒด feature์—์„œ ๋ช‡๊ฐœ๋ฅผ ์„ ํƒํ•ด ์‚ฌ์šฉํ•˜๋Š” <Feature Selection>์˜ ๊ธฐ๋ฒ•๋“ค์— ๋Œ€ํ•ด ์†Œ๊ฐœํ•  ์˜ˆ์ •์ด๋‹ค ๐Ÿ˜

Best Subset Selection

For a given $k \le p$, choose $k$ input variables. This gives $\binom{p}{k}$ candidate models. Find the model that minimizes the residual sum of squares among all models having $k$ input variables, and denote this model as $M_k$.

Select the optimal model among $M_0, M_1, \dots, M_p$. Here $M_p$ is the model that uses all the input variables.

๐Ÿ’ฅ ์ด๋•Œ, ์–ด๋–ค ๋ชจ๋ธ์ด ์ข‹์€์ง€๋Š” Trainin Err๊ฐ€ ์•„๋‹ˆ๋ผ Test Err๋ฅผ ๋ด์•ผ ํ•œ๋‹ค!

Prostate cancer dataset

Looking at the results of <Best Subset Selection> on this dataset, the estimated weights came out similar despite using fewer features, and the prediction performance was also on par with the model that uses all of the features.

๋‹น์—ฐํ•˜๊ฒŒ๋„ ๋” ์ ์€ feature๋ฅผ ์ผ์œผ๋‹ˆ ๊ณ„์‚ฐ ์ธก๋ฉด์—์„œ๋„ ์ด๋“! ๐Ÿ˜


However, if $p$ is large, say $\ge 40$, this becomes computationally infeasible: there are $2^p$ candidate subsets in total, and even for a single size $k$ the number of models to fit grows polynomially fast.

\[\binom{p}{k} = O(p^k)\]

์ด๋Ÿฐ ๊ณ„์‚ฐ์ƒ์˜ ์ด์Šˆ ๋•Œ๋ฌธ์— <Best Subset Selection> ๋Œ€์‹  <Forward & Backward Selection> ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•œ๋‹ค.


Forward Stepwise Selection

Start with the model $M_0$ containing the intercept only.

Construct a sequence of models $M_1, \dots, M_p$ by sequentially adding the predictor that most improves the fit.

Choose the best model among $M_0, \dots, M_p$.

์ฆ‰, input variable์„ ํ•˜๋‚˜ ์ถ”๊ฐ€ํ•  ๋•Œ, Test Err๊ฐ€ ๊ฐ€์žฅ ํฌ๊ฒŒ ๋‚ฎ์•„์ง€๋Š” ๋…€์„์„ ํ•˜๋‚˜ ๋„ฃ๊ฒ ๋‹ค๋Š” ๋ง์ด๋‹ค!


Backward Stepwise Selection

Start with the full model $M_p$.

Construct a sequence of models $M_{p-1}, \dots, M_0$ by sequentially deleting the predictor that has the least impact on the fit.

Choose the best model among $M_p, \dots, M_0$.

์ฆ‰, input variable์„ ํ•˜๋‚˜ ์ œ๊ฑฐํ•  ๋•Œ, Test Err๊ฐ€ ์ œ์ผ ์ ๊ฒŒ ๋‚˜์˜ค๋Š” ๋…€์„์„ ๋บ€๋‹ค๋Š” ๋ง์ด๋‹ค!


Mallowโ€™s $C_p$

<Mallowโ€™s $C_p$> ์ดํ•˜ $C_p$๋Š” ์–ด๋–ค ํ†ต๊ณ„ ๋ชจ๋ธ $M$๋ฅผ ํ‰๊ฐ€ํ•˜๋Š” ์ง€ํ‘œ ์ค‘ ํ•˜๋‚˜๋‹ค.

\[C_p (M) = \frac{1}{n} \cdot \left( \sum^n_{i=1} (y_i - \hat{y}_i)^2 + 2d \hat{\sigma}^2 \right)\]

์ฆ‰, $C_p$๋Š” Test Err์™€ ํ•จ๊ป˜ ๋ชจ๋ธ ํ”ผ์ฒ˜ ์ˆ˜ $d$๋ฅผ ๊ณ ๋ คํ•œ๋‹ค๋Š” ๋ง์ด๋‹ค!

๊ทธ๋ž˜์„œ ๋ชจ๋ธ์„ ์„ ํƒํ•  ๋•Œ, $C_p$๊ฐ€ ๊ฐ€์žฅ ๋‚ฎ์€ ๋ชจ๋ธ์„ ์„ ํƒํ•˜๋ฉด ๋œ๋‹ค.


AIC & BIC

<AIC; Akaike Information Criterion> and <BIC; Bayesian Information Criterion> are more general model selection criteria.

<AIC> & <BIC>๋Š” <MLE> ๊ธฐ๋ฒ•๊ณผ๋„ ๊ด€๋ จ๋˜์–ด ์žˆ๋‹ค.

\[\text{AIC}(M) = - \frac{2}{n} \cdot \text{loglik} + \frac{2d}{n}\]

์ด๋•Œ, $\text{loglik}$๋Š” โ€œlog-likelihoodโ€์˜ ์•ฝ์ž๋กœ,<AIC>๊ฐ€ MLE์— ์˜ํ•ด ์ตœ๋Œ€ํ™”๋œ โ€œlog-likelihoodโ€ ํ…€์— ๋ณ€์ˆ˜์˜ ๊ฐฏ์ˆ˜ $d$์— ๋”ฐ๋ฅธ ํŒจ๋„ํ‹ฐ๋ฅผ ์ถ”๊ฐ€ํ•œ ์ง€ํ‘œ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

ํ”ผ์ฒ˜๋ฅผ ๋งŽ์ด ์“ฐ๋Š” ๋ชจ๋ธ์ด๋ผ๋ฉด($d$๊ฐ€ ํฐ) RSS๋Š” ์ž‘์•„์ง€๊ฒŒ ๋œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ AIC์™€ BIC๋Š” ํ”ผ์ฒ˜ ์ˆ˜ $d$๋ฅผ ํฌํ•จํ•˜๊ธฐ ๋•Œ๋ฌธ์— AIC/BIC๊ฐ€ ๊ฐ€์žฅ ์ž‘์€ ๋ชจ๋ธ์„ ์“ด๋‹ค๋Š” ๊ฒƒ์€ ์šฐ๋„(likelihood)๋ฅผ ๊ฐ€์žฅ ํฌ๊ฒŒ ํ•˜๋Š”(explainable) ๋™์‹œ์— ํ”ผ์ฒ˜ ๊ฐฏ์ˆ˜๋Š” ๊ฐ€์žฅ ์ ์€(parsimonious) ์ตœ์ ์˜ ๋ชจ๋ธ์„ ์“ด๋‹ค๋Š” ์˜๋ฏธ์ด๋‹ค.

\[\text{BIC}(M) = - \frac{2}{n} \cdot \text{loglik} + \frac{d \cdot \log(n)}{n}\]

<AIC>์˜ ๊ฒฝ์šฐ <Mallowโ€™s $C_p$>์™€ ๋™์น˜๋ผ๊ณ  ํ•˜๋ฉฐ, <BIC>๊ฐ€ <AIC>๋ณด๋‹ค ๋” ์‹ฌํ”Œํ•œ ๋ชจ๋ธ์„ ์„ ํ˜ธํ•œ๋‹ค๊ณ  ํ•œ๋‹ค.


Instability of Variable Selection

โ€œVariable selection methods are known to be unstable.โ€

- Breiman, L. (1996)

โ€œUnstableโ€ means that small change of data results in large change of the estimator.

This is because variable selection uses a hard decision rule (a hard “survive or die” rule).
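
To see this concretely, here is a toy experiment (entirely my own construction, not from the lecture): with two comparably useful features, which one survives the first hard decision can flip between bootstrap resamples of the same data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # only 0 and 1 matter

def first_selected(X, y):
    """Index of the single predictor with the smallest one-variable
    RSS, i.e., the first hard decision a stepwise method makes."""
    rss = []
    for j in range(X.shape[1]):
        model = LinearRegression().fit(X[:, [j]], y)
        rss.append(np.sum((y - model.predict(X[:, [j]])) ** 2))
    return int(np.argmin(rss))

for b in range(5):
    idx = rng.integers(0, n, size=n)  # a small perturbation of the data
    print("resample", b, "-> first selected:", first_selected(X[idx], y[idx]))
```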

<Variable Selection>์˜ โ€˜instabilityโ€™ ๋ฌธ์ œ์— ๋Œ€ํ•œ ๋Œ€์•ˆ์œผ๋กœ ๋‹ค์Œ ํฌ์ŠคํŠธ์—์„œ ์†Œ๊ฐœํ•  <Shrinkage method>๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.


๐Ÿ‘‰ Shrinkage Method