This post summarizes what I studied in the โ€˜Statistical Data Miningโ€™ course I took at university in the spring semester of 2021. Corrections are always welcome :)


The Curse of Dimensionality

(In preparation)


Bias-Variance Decomposition

In the NN method we looked at earlier, the test error changed depending on which value of $k$ was chosen. In this part, we explain how $k$ should be chosen using the concepts of <bias> and <variance>.

First, we can write the prediction error $\text{EPE}(\hat{f}_k)$ for a given value of $k$ as follows.

\[\text{EPE}(\hat{f}_k) = E_{X, Y} \left[ \left( Y - \hat{f}_k(X) \right)^2 \right]\]

where $\hat{f}_k$ is the $k$-NN estimator.

We cannot use the training error below as the criterion for reducing $\text{EPE}$.

\[\sum^n_{i=1} \left(y_i - \hat{f}(x_i) \right)^2\]

์ผ๋ฐ˜์ ์œผ๋กœ complexity๊ฐ€ ๋†’์ผ ์ˆ˜๋ก Test Error๋Š” ๊ฐ์†Œํ•œ๋‹ค. ํ•˜์ง€๋งŒ, complexity๊ฐ€ ์ผ์ • ์ˆ˜์ค€์˜ ์ด์ƒ์œผ๋กœ ๋†’์•„์ง„๋‹ค๋ฉด, Test Error๋Š” ๋‹ค์‹œ ์ฆ๊ฐ€ํ•˜๊ธฐ ์‹œ์ž‘ํ•œ๋‹ค: NN ๊ธฐ๋ฒ•์—์„œ๋Š” $k$๊ฐ€ ์ž‘์•„์งˆ์ˆ˜๋ก complexity๊ฐ€ ์ฆ๊ฐ€ํ•œ๋‹ค. ์ด๊ฒƒ์€ ๋ชจ๋“  ๋ชจ๋ธ์—์„œ ์ผ๋ฐ˜์ ์œผ๋กœ ๋ฐœ๊ฒฌ๋˜๋Š” ํ˜„์ƒ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋Ÿฐ ํ˜„์ƒ์€ <bias-variance decomposition>์ด๋ผ๋Š” ์ด๋ก ์œผ๋กœ ์„ค๋ช… ๊ฐ€๋Šฅํ•˜๋‹ค!

Property. Bias-Variance trade-off (for regression)

์•„๋ž˜์™€ ๊ฐ™์€ regression ๋ฌธ์ œ๋ฅผ ์ƒ์ •ํ•ด๋ณด์ž.

\[Y = f(X) + \epsilon, \quad \text{where } E(\epsilon) = 0, \quad \text{Var}(\epsilon) = \sigma^2\]

Let $\hat{f}$ be an estimator of $f$. Then, at a new sample point $x_0$, $\hat{f}$ has the test error $\text{Err}(x_0)$ defined below.

\[\begin{aligned} \text{Err}(x_0) \overset{\text{def}}{=} E\left[ (Y - \hat{f}(X))^2 \mid X = x_0 \right] \end{aligned}\]

Carefully expanding the expression for $\text{Err}(x_0)$ gives the following.

\[\begin{aligned} \text{Err}(x_0) &= E\left[ (Y - \hat{f}(X))^2 \mid X = x_0 \right] \\ &= E \left[ \left( Y - \textcolor{red}{f(X) + f(X) - E[\hat{f}(X)] + E[\hat{f}(X)]} - \hat{f}(X)\right)^2 \mid X = x_0 \right] \\ &= E \left[ (Y - f(X))^2 + (f(X) - E[\hat{f}(X)])^2 + (E[\hat{f}(X)] - \hat{f}(X))^2 + \textcolor{red}{\cancel{2 \cdots}} \mid X = x_0 \right] \end{aligned}\]

๋งˆ์ง€๋ง‰ ์ค„์˜ $\textcolor{red}{\cancel{2 \cdots}}$ ๋ถ€๋ถ„์€ ์ž˜ ์ „๊ฐœํ•ด๋ณด๋ฉด ๋ชจ๋‘ 0์ด ๋œ๋‹ค. ์ด๊ฒƒ์€ ์•„๋ž˜์˜ ํŽผ์ณ๋ณด๊ธฐ ํ†ตํ•ด ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

ํŽผ์ณ๋ณด๊ธฐ

1. $E\left[ (Y - f(X)) \cdot (f(X) - E [\hat{f}(X)]) \mid X = x_0 \right]$

์œ„์˜ ์‹์—์„œ $(Y-f(X))$์™€ $(f(x) - E [\hat{f}(X)])$ ์„œ๋กœ ๋…๋ฆฝ์ด๋‹ค. ๋”ฐ๋ผ์„œ $(Y-f(X))$์— ๋Œ€ํ•œ ํ…€์„ ๋ถ„๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋•Œ, $(Y-f(X))$์— ๋Œ€ํ•ด ํ‰๊ท ์„ ์ทจํ•˜๋ฉด, ๊ทธ ๊ฐ’์€ 0์ด๋‹ค. ๋”ฐ๋ผ์„œ ์ฒซ๋ฒˆ์งธ ํ…€์˜ ๊ฐ’์€ 0์ด๋‹ค.

2. $E\left[ (Y - f(X)) \cdot (E [\hat{f}(X)] - \hat{f}(X)) \mid X = x_0 \right]$

As in 1, independence lets us factor out $(Y - f(X))$; its mean is 0, so the second cross term is also 0.

3. $E\left[ (f(X) - E [\hat{f}(X)]) \cdot (E [\hat{f}(X)] - \hat{f}(X)) \mid X = x_0 \right]$

์œ„์˜ ์‹์„ ์ „๊ฐœํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค. ์ด๋•Œ, $E[\hat{f}(X)]$๊ฐ€ ์ƒ์ˆ˜์ž„์„ ๊ธฐ์–ตํ•˜๋ผ.

\[\begin{aligned} & E\left[ (f(X) - \mu_{\hat{f}(X)}) \cdot (\mu_{\hat{f}(X)} - \hat{f}(X)) \mid X = x_0 \right] \\ &= E\left[ f(X) \mu_{\hat{f}(X)} - \left(\mu_{\hat{f}(X)}\right)^2 - f(X) \hat{f}(X) + \mu_{\hat{f}(X)} \hat{f}(X) \mid X = x_0 \right] \\ &= \mu_{\hat{f}(X)} E[f(X)] - \left(\mu_{\hat{f}(X)}\right)^2 - E[f(X)\hat{f}(X)] + \mu_{\hat{f}(X)} E[\hat{f}(X)] \\ &= \cancel{\left( \mu_{\hat{f}(X)} E[f(X)] - E[f(X)\hat{f}(X)] \right)} + \cancel{\left( - \left(\mu_{\hat{f}(X)}\right)^2 + \mu_{\hat{f}(X)} E[\hat{f}(X)]\right)} = 0 \end{aligned}\]

์œ„์˜ ์‹์˜ ๊ฐ๊ฐ์˜ ํ…€๋“ค์„ ํ™•์ธํ•ด๋ณด์ž.

1. $E[(Y - f(X))^2]$

Using the relation $Y = f(X) + \epsilon$, the first term $E[(Y - f(X))^2]$ is the same as computing $E[\epsilon^2]$. Since $E(\epsilon) = 0$ and $\text{Var}(\epsilon) = \sigma^2$,

\[E[\epsilon^2] = \sigma^2\]

2. $E[(f(X) - E[\hat{f}(X)])^2]$

This term $E[(f(X) - E[\hat{f}(X)])^2]$ is the squared <bias>. If the mean $E[\hat{f}(x_0)]$ of the estimator $\hat{f}(x_0)$ equals $f(x_0)$, that is, if

\[f(x_0) = E[\hat{f}(x_0)]\]

then we say that the estimator $\hat{f}$ is unbiased. Put simply, the <bias> is the difference between the true value and the mean of the estimator. Therefore, the second term is

\[E \left[ \left(f(x_0) - E[\hat{f}(x_0)] \right)^2 \right] = \left\{ \text{Bias}(\hat{f}(x_0)) \right\}^2\]

3. $E[(E[\hat{f}(X)] - \hat{f}(X))^2]$

์ด๊ฒƒ์€ ๋น„๊ต์  ์ต์ˆ™ํ•˜๋‹ค. ๋ฐ”๋กœ $\hat{f}(x_0)$์˜ <variance>์— ๋Œ€ํ•œ ์‹์œผ๋กœ ์„ธ๋ฒˆ์งธ ํ…€์€ ๊ณง

\[E \left[ \left( E[\hat{f}(x_0)] - \hat{f}(x_0) \right)^2 \right] = \text{Var}(\hat{f}(x_0))\]

์ด๊ฒƒ์„ ์ข…ํ•ฉํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\begin{aligned} \text{Err}(x_0) &= E \left[ (Y - f(X))^2 + (f(X) - E[\hat{f}(X)])^2 + (E[\hat{f}(X)] - \hat{f}(X))^2 + \textcolor{red}{\cancel{2 \cdots}} \mid X = x_0 \right] \\ &= \sigma^2 + \left\{ \text{Bias}(\hat{f}(x_0)) \right\}^2 + \text{Var}(\hat{f}(x_0)) \end{aligned}\]
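As a sanity check, here is a small Monte Carlo sketch of this identity (again my own toy setup, not part of the lecture notes) for the 1-D $k$-NN regressor; `f`, `sigma`, `k`, and `x0` are illustrative choices. Over repeated training sets, the average squared error at $x_0$ should roughly match $\sigma^2 + \{\text{Bias}\}^2 + \text{Var}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, k, x0, n_rep = 100, 0.3, 10, 0.25, 2000   # toy settings for the check

def f(x):
    return np.sin(2 * np.pi * x)   # toy "true" regression function

def knn_at_x0(x_train, y_train, x0, k):
    """k-NN prediction at the single point x0."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

fhat = np.empty(n_rep)   # \hat{f}(x0) over independent training sets
err = np.empty(n_rep)    # squared error against a fresh observation Y at X = x0
for r in range(n_rep):
    x_tr = rng.uniform(0, 1, n)
    y_tr = f(x_tr) + rng.normal(0, sigma, n)
    fhat[r] = knn_at_x0(x_tr, y_tr, x0, k)
    y_new = f(x0) + rng.normal(0, sigma)   # new response at X = x0
    err[r] = (y_new - fhat[r]) ** 2

bias_sq = (fhat.mean() - f(x0)) ** 2
var = fhat.var()
print(f"Monte Carlo Err(x0)    : {err.mean():.4f}")
print(f"sigma^2 + Bias^2 + Var : {sigma**2 + bias_sq + var:.4f}")
```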

์œ„์˜ ์‹์—์„œ $\sigma^2$์€ ์ž๋ฃŒ๋กœ๋ถ€ํ„ฐ ์˜ค๋Š” uncertainty์ด๋‹ค. ๊ทธ๋ž˜์„œ ์ž๋ฃŒ๋ฅผ ์•„๋ฌด๋ฆฌ ๋งŽ์ด ์ค€๋น„ํ•œ๋‹ค๊ณ  ํ•˜๋”๋ผ๊ณ  ์ด $\sigma^2$์˜ ๊ฐ’์€ ์ค„์ผ ์ˆ˜๋„, ์ปจํŠธ๋กค ํ•  ์ˆ˜๋„ ์—†๋‹ค.

The bias term and the variance term, however, can be controlled by the person building the model. The knob is model complexity: as a general tendency, the higher the complexity, the smaller the bias and the larger the variance.

NN์—์„œ์˜ ๊ฒฝ์šฐ๋ฅผ ๋– ์˜ฌ๋ ค๋ณด์ž. $k$๊ฐ€ ์ปค์ง€๋ฉด ์ปค์งˆ์ˆ˜๋ก ๋ชจ๋ธ์€ local neighbor๋ฅผ ๋ณด๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ global neighbor, ์ฆ‰ global mean์„ ๋”ฐ๋ฅด๊ฒŒ ๋œ๋‹ค. ์ด๊ฒƒ์€ sample point์—์„œ ๊ต‰์žฅํžˆ ๋ฉ€๋ฆฌ์žˆ๋Š” ๋…€์„๋“ค๊นŒ์ง€ ๊ฐ€์ ธ์™€์„œ ๋ณธ๋‹ค๋Š” ๋ง์ด๋ฏ€๋กœ, ๊ณง Bais๊ฐ€ ์ฆ๊ฐ€ํ•จ์„ ์˜๋ฏธํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ต‰์žฅํžˆ ๋ฉ€๋ฆฌ ์žˆ๋Š” ๋…€์„๋“ค๋„ ์‚ดํŽด๋ณด๊ธฐ ๋•Œ๋ฌธ์— Variance๋Š” ์ค„์–ด๋“ค ๊ฒƒ์ด๋‹ค.

This tendency of bias and variance with model complexity is a general phenomenon that appears across statistical models!