2021-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜ํ†ต๊ณ„์  ๋ฐ์ดํ„ฐ๋งˆ์ด๋‹โ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)


What <Bagging>, <Boosting>, and <Random Forest> have in common is that they are all <ensemble model>s. An <ensemble model> is an approach that combines many weak learners to design a strong learner. In many cases, an <ensemble model> performs better than a single model 😁

Bagging

Bootstrap Sampling

<Bagging>์„ ์‚ดํŽด๋ณด๋ ค๋ฉด, ๋จผ์ € <bootstrap sampling>์— ๋Œ€ํ•ด ์ดํ•ดํ•ด์•ผ ํ•œ๋‹ค.

๊ฐ„๋‹จํ•˜๊ฒŒ ๋งํ•˜๋ฉด, ๊ธฐ์กด์˜ ์ƒ˜ํ”Œ $X = \{ (x_i, y_i) \}_i$์—์„œ โ€œsampling with replacementโ€๋กœ ๋ฝ‘์€ ์ƒ˜ํ”Œ์„ ๋งํ•œ๋‹ค. <Bagging>์—์„œ๋Š” ์ด <bootstrap sample>์„ $Z$๋กœ ํ‘œ๊ธฐํ•œ๋‹ค.

$Z^{(b)}$๋ฅผ $b$๋ฒˆ์งธ <bootstrap sample>์ด๋ผ๊ณ  ํ•œ๋‹ค๋ฉด, <Bagging>์€ ๊ฐ bootstrap sample์— ๋Œ€ํ•ด estiamtor $\hat{f}^{(b)}(\, \cdot \, ; \, Z^{(b)})$๋ฅผ ๊ตฌํ•œ๋‹ค.

Bagging

<Bagging>์€ โ€œBootstrap Aggregationโ€์˜ ํ•ฉ์„ฑ์–ด์ด๋‹ค. ๋Œ€์ถฉ <bootstrap>์„ ์ข…ํ•ฉ(aggregate) ํ–ˆ๋‹ค๋Š” ๋ง์ด๋‹ค.

Bagging estimator๋Š” ์•„๋ž˜์˜ ์‹์œผ๋กœ prediction์„ ์ง„ํ–‰ํ•œ๋‹ค.

\[\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum^B_{b=1} \hat{f}^{(b)}(x)\]

์ฆ‰, bootstrap sample์—์„œ ๋””์ž์ธํ•œ estiamtor์˜ ํ‰๊ท ์œผ๋กœ prediction ํ•œ๋‹ค๋Š” ๋ง์ด๋‹ค. ์‹ฌํ”Œํ•˜๋‹ค!


Model. Bagging for classification

<Bagging>์œผ๋กœ classification์„ ์ˆ˜ํ–‰ํ•  ๋•Œ๋Š” ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ๋ฒ•์ด ์žˆ๋‹ค: โ€œconsensus votesโ€ & โ€œaveraging the probabilityโ€

โ€œconsensus votesโ€๋Š” $\hat{f}^{(b)}(x)$์—์„œ โ€œ# of 1โ€๊ณผ โ€œ# of 0โ€์˜ ์ˆ˜๋ฅผ ๋น„๊ตํ•ด ๋งŽ์€ ๊ฒƒ์„ ์ทจํ•˜๋Š” ์ ‘๊ทผ์ด๋‹ค.

โ€œaveraging the probabilityโ€๋Š” ๊ฐœ๋ณ„ estimator๊ฐ€ ํ™•๋ฅ ์„ predictํ•˜๋ฉฐ, ์ „์ฒด bagging estimator์˜ ๊ฒฐ๊ณผ์— ํ‰๊ท ์„ ๋‚ด๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

Variance Reduction for Bagging

Let $f_A (x) = E_{\mathbf{Z}}\hat{f}(x)$ be the aggregated estimator. Note that $f_A (x)$ is not an actual estimator; it is a hypothetical one whose existence we assume for convenience of discussion!

์šฐ๋ฆฌ๊ฐ€ ํ˜„์žฌ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” bagging estiamtor์ธ $\hat{f}_{\text{bag}}$์™€ $f_A$์˜ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด error๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ์ •์˜ํ•˜์ž.

\[e = E_\mathbf{Z} \left[ E_{Y, X} \left[ Y - \hat{f}(X) \right]^2 \right]\] \[e_A = E_{Y, X} \left[ Y - f_A (X) \right]^2\]

Theorem.

With the squared error loss, $f_A$ always has a smaller risk than $\hat{f}$.

\[e \ge e_A\]

If each single classifier is unstable โ€“ that is, it has high variance, the aggregated classifier $f_A$ has a smaller variance than a single original classifier.

์ฆ‰, ์šฐ๋ฆฌ๊ฐ€ ์ œ๋Œ€๋กœ aggregate ํ–ˆ๋‹ค๋ฉด, ํ•ญ์ƒ ๊ทธ error $e_A$๋Š” ํ•ญ์ƒ ๋‹จ์ผ estiatmor์˜ error $e$ ๋ณด๋‹ค ์ž‘๋‹ค๋Š” ์ •๋ฆฌ์ด๋‹ค.

proof.

\[\begin{aligned} e &= E_\mathbf{Z} \left[ E_{Y, X} \left[ Y - \hat{f}(X) \right]^2 \right] \\ &= E_\mathbf{Z} \left[ E_{Y, X} \left[ Y - f_A(X) + f_A(X) - \hat{f}(X) \right]^2 \right] \\ &= E_\mathbf{Z} \left[ E_{Y, X} \left[ Y - f_A(X) \right]^2 + E_{Y, X} \left[ f_A(X) - \hat{f}(X) \right]^2 + 2 E_{Y, X} \left[ (Y - f_A(X)) (f_A(X) - \hat{f}(X)) \right] \right] \end{aligned}\]

์ด๋•Œ, $2 E_\mathbf{Z} \left[ E_{Y, X} \left[ (Y - f_A(X)) (f_A(X) - \hat{f}(X)) \right] \right]$๋Š”

\[\begin{aligned} & 2 E_\mathbf{Z} \left[ E_{Y, X} \left[ (Y - f_A(X)) (f_A(X) - \hat{f}(X)) \right] \right] \\ &= 2 E_{Y, X} \left[ (Y - f_A(X)) \cdot E_\mathbf{Z} \left[ (f_A(X) - \hat{f}(X)) \right] \right] \\ &= 2 E_{Y, X} \left[ (Y - f_A(X)) \cdot (f_A(X) - E_\mathbf{Z} \hat{f}(X)) \right] \\ &= 2 E_{Y, X} \left[ (Y - f_A(X)) \cdot \cancel{(f_A(X) - f_A(X))} \right] \\ &= 0 \end{aligned}\]

๋”ฐ๋ผ์„œ,

\[\begin{aligned} e &= E_\mathbf{Z} \left[ E_{Y, X} \left[ Y - \hat{f}(X) \right]^2 \right] \\ &= E_\mathbf{Z} \left[ E_{Y, X} \left[ Y - f_A(X) \right]^2 + E_{Y, X} \left[ f_A(X) - \hat{f}(X) \right]^2 \right] \\ &= E_{Y, X} \left[ Y - f_A(X) \right]^2 + E_\mathbf{Z} \left[ E_{X} \left[ f_A(X) - \hat{f}(X) \right]^2 \right] \\ &\ge E_{Y, X} \left[ Y - f_A(X) \right]^2 = e_A \end{aligned}\]

์ด๋•Œ, <Bagging>์— ์˜ํ•ด ๊ฐœ์„ ๋œ ์ •๋„๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[E_\mathbf{Z} \left[ E_{X} \left[ f_A(X) - \hat{f}(X) \right]^2 \right] = \text{Var}\left( \hat{f}(X) - f_A(X) \right)\]
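A small Monte Carlo sketch of this theorem (my own illustration, not from the lecture): it approximates $f_A$ by averaging trees fit on many independent training sets and checks that the error of $f_A$ is no larger than the average error of a single tree. For simplicity it compares against the noise-free target, which drops the same irreducible noise term from both $e$ and $e_A$, so the inequality is unaffected.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
f_true = np.sin(X_test).ravel()              # noise-free target used in place of Y

preds, single_err = [], []
for _ in range(200):                         # 200 independent training sets approximate E_Z
    X = rng.uniform(-3, 3, size=(50, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.3, size=50)
    p = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y).predict(X_test)
    preds.append(p)
    single_err.append(np.mean((f_true - p) ** 2))

f_A = np.mean(preds, axis=0)                 # approximates f_A(x) = E_Z f_hat(x)
e   = np.mean(single_err)                    # ~ e   (up to the common noise term)
e_A = np.mean((f_true - f_A) ** 2)           # ~ e_A (up to the common noise term)
print(e, e_A)                                # e >= e_A
```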


Remark.

  • Increasing $B$ does not cause overfitting
    • As $B$ grows, the estimator only gets closer to the average.
    • In contrast, <Boosting> tends to overfit as $B$ grows.
  • If the individual trees are large, each tree can still overfit.
  • <Bagging> works well for high-variance & low-bias models.
  • In practice, <Boosting> is better than <Bagging> in many examples.
    • Using <Random Forest>, a variant of <Bagging>, one can reportedly achieve performance comparable to <Boosting>!

Random Forest

<Bagging>์—์„œ ๊ฐœ๋ณ„ DT๋Š” ๋ชจ๋‘ ๋™์ผํ•œ ๋ถ„ํฌ๋ฅผ ๊ฐ–๋Š”๋‹ค. ๊ทธ๋ž˜์„œ โ€œbagged estimatorโ€๋Š” ๊ฐœ๋ณ„ bootstrap tree์™€ ๋™์ผํ•œ bias๋ฅผ ๊ฐ–๋Š”๋‹ค.

์ด๋•Œ, <Bagging> ๋ชจ๋ธ์˜ โ€œvarianceโ€๋ฅผ ์ค„์ผ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์€ ๊ฐœ๋ณ„ bootstrap tree ์‚ฌ์ด์˜ correlation์„ ์ค„์—ฌ ๋ชจ๋ธ์„ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ๋‹ค!!! ๐Ÿ˜

The โ€˜averageโ€™ of $B$ iid random variables with variance $\sigma^2$ has variance $\sigma^2/B$. (๋‹น์—ฐ!)

์ด๋ฒˆ์—๋Š” iid sample์ด ์•„๋‹Œ ๊ฒฝ์šฐ๋ฅผ ์‚ดํŽด๋ณด์ž! $\text{Var}(X_i) = \sigma^2$์ด๊ณ , pairwise correlation์€ $\text{Corr}(X_i, X_j) = \rho$์ธ ๊ฒฝ์šฐ๋‹ค!

The โ€˜averageโ€™ of $B$ identically distributed random variables with variance $\sigma^2$ and pairwise correlation $\rho$ has variance

\[\rho \sigma^2 + \frac{1-\rho}{B} \sigma^2\]
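For reference, here is a quick derivation of this formula (my own filling-in of the standard argument):

\[\begin{aligned} \text{Var}\left( \frac{1}{B} \sum^B_{i=1} X_i \right) &= \frac{1}{B^2} \left( \sum^B_{i=1} \text{Var}(X_i) + \sum_{i \ne j} \text{Cov}(X_i, X_j) \right) \\ &= \frac{1}{B^2} \left( B \sigma^2 + B(B-1) \, \rho \sigma^2 \right) \\ &= \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2 \end{aligned}\]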

์ด๋•Œ, ์ด๊ฒƒ๊ณผ iid์ธ ๊ฒฝ์šฐ์—์„œ์˜ variance $\sigma^2/B$๋ฅผ ๋น„๊ตํ•ด๋ณด์ž.

\[\text{Var} \left( \frac{X_1 + \cdots + X_B }{B} \right) = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2 \quad \text{vs.} \quad \frac{\sigma^2}{B}\]

๋งŒ์•ฝ correlation $\rho$๊ฐ€ 0์ด๋ผ๋ฉด, iid์ธ ๊ฒฝ์šฐ์™€ ๊ฐ™๋‹ค.

\[0 \sigma^2 + \frac{1-0}{B} \sigma^2 = \frac{\sigma^2}{B}\]

๋งŒ์•ฝ correlation $\rho$๊ฐ€ 1์ธ ๊ฒฝ์šฐ, ์ฆ‰ $X = X_i$๋ผ๋ฉด,

\[\rho \sigma^2 + \frac{1-1}{B} \sigma^2 = \sigma^2 > \frac{\sigma^2}{B}\]

๋”ฐ๋ผ์„œ, sample์ด correlate ๋˜์–ด ์žˆ๋‹ค๋ฉด, variance๋Š” ๋” ์ปค์ง„๋‹ค.

\[\rho \sigma^2 + \frac{1-\rho}{B} \sigma^2 \ge \frac{\sigma^2}{B}\]
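A quick numerical sanity check of this variance formula (the values of $B$, $\rho$, $\sigma$ and the multivariate-normal setup are arbitrary choices of mine):

```python
import numpy as np

# Empirical check of Var((X_1 + ... + X_B)/B) for identically distributed,
# pairwise-correlated variables; B, rho, sigma are arbitrary illustrative values.
B, rho, sigma = 10, 0.6, 2.0
cov = np.full((B, B), rho * sigma**2)
np.fill_diagonal(cov, sigma**2)

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(B), cov, size=200_000)
empirical   = samples.mean(axis=1).var()
theoretical = rho * sigma**2 + (1 - rho) / B * sigma**2
print(empirical, theoretical, sigma**2 / B)   # ~2.56, 2.56, vs. 0.4 in the iid case
```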

<Bagging>์€ bootstrap sample๋กœ DT๋ฅผ ๊ตฌ์„ฑํ•œ๋‹ค. ์ด๋•Œ, bootstrap sample์€ sampling with replacement์ด๊ธฐ ๋•Œ๋ฌธ์— ๊ฐœ๋ณ„ DT๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•œ input์˜ ์ผ๋ถ€๋Š” ํ•ญ์ƒ ์–ด๋Š ์ •๋„ ๊ฒน์น  ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ, <Bagging>์˜ ๊ฐœ๋ณ„ DT๋Š” ์„œ๋กœ positively correlated ๋˜์–ด ์žˆ๋‹ค! RF๋Š” ์ด๋Ÿฐ correlation์„ ์ค„์—ฌ Bagging์„ ๊ฐœ์„ ํ•œ๋‹ค!!

Before each split in a bagged DT, RF selects $m \le p$ of the input variables at random as candidates for splitting. That is, since only a subset of the variables is used, the chance that individual DTs end up splitting on the same variables decreases!

RF์˜ ๊ฒฝ์šฐ, ์ผ๋ถ€์˜ ๋ณ€์ˆ˜๋งŒ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— <bagging>๋ณด๋‹ค ๋” ๋น ๋ฅธ ์†๋„๋ฅผ ๋ณด์ธ๋‹ค!


Out of Bag Error

์ด๋ฒˆ์—๋Š” <Bagging>๊ณผ <Random Forest>์—์„œ cross-validation์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์ธ <OOB error; Out of Bag error>์— ๋Œ€ํ•ด ์†Œ๊ฐœํ•œ๋‹ค.

<Bootstrap sample>์˜ ๊ฒฝ์šฐ, sampling with replacement์ด๊ธฐ ๋•Œ๋ฌธ์— 1,000๊ฐœ์˜ ์ „์ฒด ์ƒ˜ํ”Œ ์ค‘์— 800๋งŒ ์“ฐ์ด๊ณ , ๋‚จ์€ 200๊ฐœ๊ฐ€ ์•ˆ ์“ฐ์ด๊ฒŒ ๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ž˜์„œ Bagging, RF์—์„œ๋Š” ์ด๋Ÿฐ โ€œout of bagโ€ํ•œ ์ƒ˜ํ”Œ๋“ค์„ ๊ฐ€์ง€๊ณ  validation error๋ฅผ ์ธก์ •ํ•  ์ˆ˜ ์žˆ๋‹ค!!

๊ทธ๋ž˜์„œ Bagging, RF๋Š” <OOB error>๊ฐ€ stabilizeํ•˜๋Š” ์ˆœ๊ฐ„์— training์„ ์ข…๋ฃŒํ•œ๋‹ค!