This post is a write-up of what I studied in the β€˜Statistical Data Mining’ course I took at university in Spring 2021. Corrections and comments are always welcome :)



An overview of <Boosting> is available in the post below 😁

πŸ‘‰ Boosting

AdaBoost

<AdaBoost> stands for β€œAdaptive Boosting”. It is the first Boosting model, developed in 1996 by Schapire & Freund, who first devised the <Boosting> algorithm.


μ•Œκ³ λ¦¬μ¦˜μ„ μ‚΄νŽ΄λ³΄κΈ° 전에 λ¨Όμ € 셋업을 ν™•μΈν•˜μž.

[Data]

$\{ (x_1, y_1), \dots, (x_n, y_n) \}$

  • p-dimensional input: $x_i \in \mathcal{X}$
  • Binary output: $y_i \in \{ -1, 1 \}$ // encoded as +1 / -1

[Weak Classifier]

$m$-th Weak Classifier $G_m : \mathcal{X} \rightarrow \{-1, 1 \}$

[Prediction]

Predictions are based on a weighted majority vote of the weak classifiers!

\[G(x) = \text{sign} \left( \sum^M_{m=1} \alpha_m G_m (x) \right)\]

where $\alpha_m$ is the weight of the $m$-th weak classifier.

Algorithm. AdaBoost

Initialize the weights for data points
$w_i = 1 / n$ for $i=1, \dots, n$.

// training!
For $m=1$ to $M$
   Obtain $G_m(x)$ based on the training data using weights $w_i$
   Compute the error of the $m$-th classifier:

\[\text{err}_m = \frac{\displaystyle\sum^n_{i=1} w_i \cdot I(y_i \ne G_m(x_i))}{\displaystyle\sum^n_{i=1} w_i}\]

   Compute the weight of the $m$-th classifier:

\[\alpha_m = \log \left( \frac{1 - \text{err}_m}{\text{err}_m}\right)\]

   // correctly classified $w_i$ stay the same while misclassified $w_i$ grow β†’ after normalization β†’ correctly classified $w_i$ β–Ό, misclassified $w_i$ β–²

   Update weights for data points

\[w_i \leftarrow w_i \cdot \exp \left[ \alpha_m \cdot I(y_i \ne G_m(x_i)) \right] \quad \text{for} \quad i=1, \dots, n\]

Output $\displaystyle G(x) = \text{sign} \left[ \sum^M_{m=1} \alpha_m G_m(x) \right]$
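
Below is a minimal NumPy/scikit-learn sketch of the algorithm above (my own illustration, not code from the course); depth-1 decision trees play the role of the weak classifiers $G_m$, and the data weights are renormalized every round, which is equivalent to dividing by $\sum_i w_i$ in $\text{err}_m$.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 stumps as weak learners G_m

def adaboost_fit(X, y, M=50):
    """AdaBoost with decision stumps; y must be encoded as {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # initialize w_i = 1/n
    stumps, alphas = [], []
    for m in range(M):
        # Obtain G_m from the training data using the current weights w_i
        G_m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (G_m.predict(X) != y)
        # err_m: weighted misclassification rate
        err = np.sum(w * miss) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)       # guard against log(0)
        alpha = np.log((1 - err) / err)            # alpha_m = log((1 - err_m) / err_m)
        # misclassified points get larger weights; then renormalize
        w = w * np.exp(alpha * miss)
        w /= np.sum(w)
        stumps.append(G_m)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """G(x) = sign( sum_m alpha_m * G_m(x) )."""
    return np.sign(sum(a * g.predict(X) for a, g in zip(alphas, stumps)))
```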


Boosting as Additive Expansion

<AdaBoost> μ•Œκ³ λ¦¬μ¦˜μ˜ 결과인 $G(x)$

\[G(x) = \text{sign} \left[ \sum^M_{m=1} \alpha_m G_m(x) \right]\]

can also be understood as an additive/basis expansion model.

Algorithm. general additive function expansion

\[f(x) = \sum^M_{m=1} \beta_m \cdot h(x; \theta_m)\]

That is, we interpret $\alpha_m$ as the basis coefficient $\beta_m$ and the weak classifier $G_m(x)$ as the basis function $h(x; \theta_m)$!

Fitting an additive model directly by ERM (Empirical Risk Minimization) is inherently a hard problem.

\[\min_{\left\{\beta_m, \theta_m\right\}^M_{m=1}} \; \sum^n_{i=1} L \left( y_i, \sum^M_{m=1} \beta_m \cdot h(x_i; \theta_m) \right)\]

λŒ€λΆ€λΆ„μ˜ κ²½μš°μ—μ„œλŠ” computationally infeasible ν•˜λ‹€κ³  보면 λœλ‹€.

To work around this computational problem, one can

  • Update the expansion sequentially
  • Consider only a single basis function at each step

κ°€ μžˆλŠ”λ°, 이런 접근을 ν”νžˆ <Forward Stagewise Update>라고 ν•œλ‹€!

To sketch the idea briefly:

  1. Fit $\beta_1$ (and $\theta_1$)
  2. Keep $\beta_1$ fixed and fit $\beta_2$
  3. Keep $\beta_1, \beta_2$ fixed and fit $\beta_3$
  4. …

Algorithm. Forward Stagewise Update πŸ”₯

Suppose that the current expansion is

\[f_{m-1} (x) = \sum^{m-1}_{j=1} \beta_j \cdot h(x; \theta_j)\]

then, the next update can be achieved by solving

\[(\beta_m, \theta_m) = \underset{(\beta, \theta)}{\text{argmin}} \sum^n_{i=1} L \left( y_i, f_{m-1}(x_i) + \beta \cdot h(x_i; \theta) \right)\]

In other words, the <forward stagewise update> adds a new basis function without modifying the previously fitted terms.

Example. Forward Stagewise Update with Squared Error Loss

Consider the squared error loss:

\[L(y, f(x)) = (y - f(x))^2\]

then, the loss in forward stagewise is

\[L(y_i, f_{m-1} (x_i) + \beta \cdot h(x_i; \theta)) = \left( y_i - f_{m-1} (x_i) - \beta \cdot h(x_i; \theta) \right)^2\]

μ΄λ•Œ, current residual인 $(y_i - f_{m-1}(x_i))$λŠ” μƒμˆ˜κ°’μœΌλ‘œ $\beta, \theta$λ₯Ό μ°ΎλŠ”λ°μ— 영ν–₯을 주지 μ•ŠλŠ”λ‹€. λ”°λΌμ„œ μš°λ¦¬λŠ” $\beta, \theta$에 μ˜¨μ „νžˆ μ§‘μ€‘ν•˜λ©΄ λœλ‹€!

Example. Forward Stagewise Update with Exponential Loss πŸ”₯

Consider the exponential loss:

\[L(y, f(x)) = \exp (-y \cdot f(x))\]

μ΄λ•Œ! λ†€λžκ²Œλ„ <AdaBoost>λŠ” Forward Stagewise Updates for the ERM with Exponential Loss와 λ™μΉ˜λ‹€!! 😲

This result offers an entirely new way of looking at <Boosting>!

Note that the $m$-th step of the forward stagewise updates solves

\[\begin{aligned} (\beta_m , G_m) &= \underset{(\beta, G)}{\text{argmin}} \sum^n_{i=1} \exp \left[ -y_i \cdot \left( f_{m-1}(x_i) + \beta \cdot G(x_i) \right)\right] \\ &= \underset{(\beta, G)}{\text{argmin}} \sum^n_{i=1} w_i^{(m)} \cdot \exp \left[ - \, y_i \cdot \beta \cdot G(x_i) \right] \\ & \quad \text{where} \quad w_i^{(m)} = \exp \left[ -y_i \cdot f_{m-1}(x_i) \right] \end{aligned}\]

It can be shown that

\[G_m = \underset{G}{\text{argmin}} \sum^n_{i=1} w_i^{(m)} \cdot I(y_i \ne G(x_i))\] \[\beta_m = \frac{1}{2} \log \left( \frac{1-\text{err}_m}{\text{err}_m}\right)\]

where

\[\text{err}_m = \frac{\displaystyle\sum^n_{i=1} w_i^{(m)} \cdot I(y_i \ne G_m(x_i))}{\displaystyle\sum^n_{i=1} w_i^{(m)}}\]
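
To fill in the step behind β€œIt can be shown that”, split the weighted objective over correctly and incorrectly classified points:

\[\begin{aligned} \sum^n_{i=1} w_i^{(m)} \cdot \exp \left[ - \, y_i \cdot \beta \cdot G(x_i) \right] &= e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{y_i \ne G(x_i)} w_i^{(m)} \\ &= \left( e^{\beta} - e^{-\beta} \right) \sum^n_{i=1} w_i^{(m)} \cdot I(y_i \ne G(x_i)) + e^{-\beta} \sum^n_{i=1} w_i^{(m)} \end{aligned}\]

For any $\beta > 0$ only the first term depends on $G$, so the minimizing $G$ is the weighted-error minimizer $G_m$; plugging $G_m$ back in and setting the derivative with respect to $\beta$ to zero yields $\beta_m$ above.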

Then, the update is

\[f_m(x) = f_{m-1}(x) + \beta_m \cdot G_m (x)\]

and leads the weights for the new iterations to be

\[\begin{aligned} w_i^{(m+1)} &= w_i^{(m)} \cdot \exp \left[ - \, y_i \cdot \beta_m \cdot G_m (x_i) \right] \\ &= w_i^{(m)} \cdot \exp \left[ \alpha_m \cdot I(y_i \ne G_m(x_i)) \right] \cdot \exp \left[ - \beta_m \right] \end{aligned}\]

where $\alpha_m = 2 \beta_m$ is exactly the classifier weight that appears in the AdaBoost algorithm; the common factor $\exp(-\beta_m)$ scales every weight equally and so has no effect once the weights are normalized.

Q. Why Exponential Loss?

Population minimizer of the exponential loss:

\[\begin{aligned} f^{*} (x) &= \underset{f(x)}{\text{argmin}} \; E_{Y \mid x} \left[ L(Y, f(x)) \right] \\ &= \underset{f(x)}{\text{argmin}} \; E_{Y \mid x} \left[ \exp \left( -Y \cdot f(x) \right) \right] \\ &= \frac{1}{2} \log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} \end{aligned}\]

equivalently,

\[P(Y = 1 \mid x) = \frac{\exp \left( f^* (x) \right)}{\exp \left( -f^* (x) \right) + \exp \left( f^* (x) \right)}\]
Derivation

It is true that $P(Y=1 \mid x) = 1 - P(Y=-1 \mid x)$.

\[\begin{aligned} f^*(x) &= \frac{1}{2} \log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} \\ \exp \left( 2 \cdot f^* (x) \right) &= \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} = \frac{P(Y = 1 \mid x)}{1 - P(Y=1 \mid x)} \\ \exp \left( 2 \cdot f^* (x) \right) \cdot (1 - P(Y = 1 \mid x)) &= P(Y = 1 \mid x) \\ \exp \left( 2 \cdot f^* (x) \right) &= P(Y = 1 \mid x) + \exp \left( 2 \cdot f^* (x) \right) \cdot P(Y = 1 \mid x) \\ &= P(Y = 1 \mid x) \cdot \left( 1 + \exp \left( 2 \cdot f^* (x) \right) \right) \\ P(Y = 1 \mid x) &= \frac{\exp \left( 2 \cdot f^* (x) \right)}{1 + \exp \left( 2 \cdot f^* (x) \right)} = \frac{\exp \left( f^* (x) \right)}{\exp \left( -f^* (x) \right) + \exp \left( f^* (x) \right)} \end{aligned}\]

$\blacksquare$
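
As a small aside (my own sketch, not from the notes): this relation means a fitted boosting score $f(x)$ can be turned into a probability estimate through a sigmoid in $2 f(x)$, assuming $f(x)$ approximates $f^*(x)$.

```python
import numpy as np

def score_to_probability(f_x):
    """P(Y = 1 | x) = exp(f) / (exp(-f) + exp(f)) = 1 / (1 + exp(-2 f))."""
    return 1.0 / (1.0 + np.exp(-2.0 * np.asarray(f_x)))
```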


The one thing to remember about <AdaBoost> is that fitting <AdaBoost> is in fact β€œForward Stagewise Updates for the ERM with Exponential Loss”!! 😀

The GBM family that came later performs better, so AdaBoost itself is rarely used these days, I'm told πŸ˜₯