This post summarizes what I studied in the 'Statistical Data Mining' course at university in the spring semester of 2021. Corrections and comments are always welcome :)


An overview of <Boosting> can be found in the post below 😁

👉 Boosting

AdaBoost

<AdaBoost>, short for "Adaptive Boosting," is the first Boosting model, developed in 1996 by Schapire & Freund, who originally devised the <Boosting> algorithm.


Before looking at the algorithm, let's first go over the setup.

[Data]

$\{ (x_1, y_1), \dots, (x_n, y_n) \}$

  • p-dimensional input: $x_i \in \mathcal{X}$
  • Binary output: $y_i \in \{ -1, 1 \}$ // encoded as 1 and -1

[Weak Classifier]

$m$-th Weak Classifier $G_m : \mathcal{X} \rightarrow \{-1, 1 \}$

[Prediction]

Predictions are based on a weighted majority vote!

\[G(x) = \text{sign} \left( \sum^M_{m=1} \alpha_m G_m (x) \right)\]

where $\alpha_m$ is the weight of the $m$-th weak classifier.

Algorithm. AdaBoost

Initialize the weights for data points
$w_i = 1 / n$ for $i=1, \dots, n$.

// training!
For $m=1$ to $M$
   Obtain $G_m(x)$ based on the training data using weights $w_i$
   Compute the weighted error of the $m$-th classifier.

\[\text{err}_m = \frac{\displaystyle\sum^n_{i=1} w_i \cdot I(y_i \ne G_m(x_i))}{\displaystyle\sum^n_{i=1} w_i}\]

   Compute the weight $\alpha_m$ of the $m$-th classifier

\[\alpha_m = \log \left( \frac{1 - \text{err}_m}{\text{err}_m}\right)\]

   // correctly classified: $w_i$ unchanged, misclassified: $w_i$ grows → normalize → correctly classified $w_i$ ▼, misclassified $w_i$ ▲

   Update weights for data points

\[w_i \leftarrow w_i \cdot \exp \left[ \alpha_m \cdot I(y_i \ne G_m(x_i)) \right] \quad \text{for} \quad i=1, \dots, n\]

Output $\displaystyle G(x) = \text{sign} \left[ \sum^M_{m=1} \alpha_m G_m(x) \right]$
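
The algorithm above maps almost line by line onto code. Below is a minimal NumPy/scikit-learn sketch, assuming decision stumps (`DecisionTreeClassifier(max_depth=1)`) as the weak learners; the function names `adaboost_fit` and `adaboost_predict` are mine, not from the lecture notes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """AdaBoost with decision stumps; y must be encoded as {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # initialize w_i = 1/n
    stumps, alphas = [], []

    for _ in range(M):
        # Obtain G_m based on the training data using weights w_i
        G_m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (G_m.predict(X) != y).astype(float)

        # Weighted error of the m-th classifier
        err_m = np.dot(w, miss) / w.sum()
        err_m = np.clip(err_m, 1e-10, 1 - 1e-10)    # guard against log(0)

        # Weight of the m-th classifier
        alpha_m = np.log((1 - err_m) / err_m)

        # Re-weight data points: misclassified points get larger weight
        w *= np.exp(alpha_m * miss)
        w /= w.sum()                                # normalize

        stumps.append(G_m)
        alphas.append(alpha_m)

    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Weighted majority vote: sign(sum_m alpha_m * G_m(x))."""
    votes = sum(a * g.predict(X) for a, g in zip(alphas, stumps))
    return np.sign(votes)
```

With $y$ encoded in $\{-1, 1\}$, `adaboost_predict(X, *adaboost_fit(X, y))` reproduces the weighted majority vote $G(x)$ above.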


Boosting as Additive Expansion

The output of the <AdaBoost> algorithm, $G(x)$,

\[G(x) = \text{sign} \left[ \sum^M_{m=1} \alpha_m G_m(x) \right]\]

can also be understood as an additive/basis expansion model.

Algorithm. General additive function expansion

\[f(x) = \sum^M_{m=1} \beta_m \cdot h(x; \theta_m)\]

That is, we interpret $\alpha_m$ as the basis coefficient $\beta_m$ and the weak classifier $G_m(x)$ as the basis function $h(x; \theta_m)$!

In general, fitting an additive model by ERM (Empirical Risk Minimization) is a hard problem.

\[\min_{\left\{\beta_m, \theta_m\right\}^M_{m=1}} \; \sum^n_{i=1} L \left( y_i, \sum^M_{m=1} \beta_m \cdot h(x_i; \theta_m) \right)\]

In most cases it is computationally infeasible.

Two ways to get around this computational problem are

  • Update the expansion sequentially
  • Consider only a single basis function at each step

and this approach is commonly called the <Forward Stagewise Update>!

To put it simply,

  1. Fit $\beta_0$.
  2. Keep $\beta_0$ fixed and fit $\beta_1$.
  3. Keep $\beta_0, \beta_1$ fixed and fit $\beta_2$, and so on.

Algorithm. Forward Stagewise Update 🔥

Suppose the current expansion is

\[f_{m-1} (x) = \sum^{m-1}_{j=1} \beta_j \cdot h(x; \theta_j)\]

then, the next update can be achieved by solving

\[(\beta_m, \theta_m) = \underset{(\beta, \theta)}{\text{argmin}} \sum^n_{i=1} L \left( y_i, f_{m-1}(x_i) + \beta \cdot h(x_i; \theta) \right)\]

In other words, the <forward stagewise update> adds a new basis function without modifying the existing solution.

Example. Forward Stagewise Update with Squared Error Loss

Consider the squared error loss:

\[L(y, f(x)) = (y - f(x))^2\]

then the loss in the forward stagewise update is

\[L(y_i, f_{m-1} (x_i) + \beta \cdot h(x_i; \theta)) = \left( y_i - f_{m-1} (x_i) - \beta \cdot h(x_i; \theta) \right)^2\]

Here, the current residual $(y_i - f_{m-1}(x_i))$ is a fixed quantity that does not depend on $\beta, \theta$, so we can focus entirely on $\beta, \theta$: the update simply fits the new basis function to the current residuals!
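
As a concrete illustration of this residual-fitting view, here is a small sketch of forward stagewise updates under squared error loss (my own example, not from the post), using regression stumps as the basis functions $h(x; \theta)$; the stump's leaf values absorb the coefficient $\beta$.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise_l2(X, y, M=100):
    """Forward stagewise updates under squared error loss.

    At each step the previously fitted terms are left untouched and the
    new basis function is simply fit to the current residuals.
    """
    f = np.zeros(len(y))              # current expansion f_{m-1}(x), starts at 0
    basis = []
    for _ in range(M):
        r = y - f                                     # current residuals
        h = DecisionTreeRegressor(max_depth=1).fit(X, r)
        f += h.predict(X)                             # add the new term, keep old ones fixed
        basis.append(h)
    return basis

def expansion_predict(X, basis):
    """Evaluate f(x) = sum_m h_m(x)."""
    return sum(h.predict(X) for h in basis)
```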

Example. Forward Stagewise Update with Exponential Loss 🔥

Consider the exponential loss:

\[L(y, f(x)) = \exp (-y \cdot f(x))\]

And here is the surprising part: <AdaBoost> is equivalent to Forward Stagewise Updates for the ERM with Exponential Loss!! 😲

This result gives a completely new perspective on <Boosting>!

Note that the $m$-th step of the forward stagewise updates solves

\[\begin{aligned} (\beta_m , G_m) &= \underset{(\beta, G)}{\text{argmin}} \sum^n_{i=1} \exp \left[ -y_i \cdot \left( f_{m-1}(x_i) + \beta \cdot G(x_i) \right)\right] \\ &= \underset{(\beta, G)}{\text{argmin}} \sum^n_{i=1} w_i^{(m)} \cdot \exp \left[ - \, y_i \cdot \beta \cdot G(x_i) \right] \\ & \quad \text{where} \quad w_i^{(m)} = \exp \left[ -y_i \cdot f_{m-1}(x_i) \right] \end{aligned}\]

It can be shown that

\[G_m = \underset{G}{\text{argmin}} \sum^n_{i=1} w_i^{(m)} \cdot I(y_i \ne G(x_i))\] \[\beta_m = \frac{1}{2} \log \left( \frac{1-\text{err}_m}{\text{err}_m}\right)\]

where

\[\text{err}_m = \frac{\displaystyle\sum^n_{i=1} w_i^{(m)} \cdot I(y_i \ne G_m(x_i))}{\displaystyle\sum^n_{i=1} w_i^{(m)}}\]
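
A sketch of why this holds (the standard argument, filled in here for completeness): split the weighted sum according to whether $G$ classifies each point correctly,

\[\begin{aligned} \sum^n_{i=1} w_i^{(m)} \cdot \exp \left[ - \, y_i \cdot \beta \cdot G(x_i) \right] &= e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{y_i \ne G(x_i)} w_i^{(m)} \\ &= \left( e^{\beta} - e^{-\beta} \right) \sum^n_{i=1} w_i^{(m)} \cdot I(y_i \ne G(x_i)) + e^{-\beta} \sum^n_{i=1} w_i^{(m)} \end{aligned}\]

For any fixed $\beta > 0$ this is minimized over $G$ by the classifier with the smallest weighted misclassification, which gives $G_m$; plugging $G_m$ back in and setting the derivative with respect to $\beta$ to zero gives $\beta_m = \frac{1}{2} \log \left( \frac{1-\text{err}_m}{\text{err}_m}\right)$.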

Then, the update is

\[f_m(x) = f_{m-1}(x) + \beta_m \cdot G_m (x)\]

and leads the weights for the next iteration to be

\[\begin{aligned} w_i^{(m+1)} &= w_i^{(m)} \cdot \exp \left[ - \, y_i \cdot \beta_m \cdot G_m (x_i) \right] \\ &= w_i^{(m)} \cdot \exp \left[ \alpha_m \cdot I(y_i \ne G_m(x_i)) \right] \cdot \exp \left[ - \beta_m \right] \end{aligned}\]

where $\alpha_m = 2 \beta_m$ is exactly the weight that appears in the AdaBoost algorithm; the factor $\exp \left[ -\beta_m \right]$ multiplies every weight equally, so it disappears after normalization.
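
The second equality in the weight update uses the fact that $-y_i \cdot G_m(x_i) = 2 \cdot I(y_i \ne G_m(x_i)) - 1$ when both take values in $\{-1, 1\}$. A quick numerical check (toy numbers of my own, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=10)      # true labels y_i
g = rng.choice([-1, 1], size=10)      # weak classifier outputs G_m(x_i)
beta = 0.7
alpha = 2 * beta                      # alpha_m = 2 * beta_m

lhs = np.exp(-y * beta * g)                        # exp[-y_i * beta_m * G_m(x_i)]
rhs = np.exp(alpha * (y != g)) * np.exp(-beta)     # exp[alpha_m * I(y_i != G_m(x_i))] * exp(-beta_m)
print(np.allclose(lhs, rhs))                       # True
```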

Q. Why Exponential Loss?

Population minimizer of the exponential loss:

\[\begin{aligned} f^{*} (x) &= \underset{f(x)}{\text{argmin}} \; E_{Y \mid x} \left[ L(Y, f(x)) \right] \\ &= \underset{f(x)}{\text{argmin}} \; E_{Y \mid x} \left[ \exp \left( -Y \cdot f(x) \right) \right] \\ &= \frac{1}{2} \log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} \end{aligned}\]

equivalently,

\[P(Y = 1 \mid x) = \frac{\exp \left( f^* (x) \right)}{\exp \left( -f^* (x) \right) + \exp \left( f^* (x) \right)}\]
Derivation

It is true that $P(Y=1 \mid x) = 1 - P(Y=-1 \mid x)$.

\[\begin{aligned} f^*(x) &= \frac{1}{2} \log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} \\ \exp \left( 2 \cdot f^* (x) \right) &= \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} = \frac{P(Y = 1 \mid x)}{1 - P(Y=1 \mid x)} \\ \exp \left( 2 \cdot f^* (x) \right) \cdot (1 - P(Y = 1 \mid x)) &= P(Y = 1 \mid x) \\ \exp \left( 2 \cdot f^* (x) \right) &= P(Y = 1 \mid x) + \exp \left( 2 \cdot f^* (x) \right) \cdot P(Y = 1 \mid x) \\ &= P(Y = 1 \mid x) \cdot \left( 1 + \exp \left( 2 \cdot f^* (x) \right) \right) \\ P(Y = 1 \mid x) &= \frac{\exp \left( 2 \cdot f^* (x) \right)}{1 + \exp \left( 2 \cdot f^* (x) \right)} = \frac{\exp \left( f^* (x) \right)}{\exp \left( -f^* (x) \right) + \exp \left( f^* (x) \right)} \end{aligned}\]

$\blacksquare$
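
For completeness, the population minimizer itself can be obtained by writing out the conditional risk and setting its derivative with respect to $f(x)$ to zero (a short sketch, not spelled out in the original post):

\[\begin{aligned} E_{Y \mid x} \left[ e^{-Y f(x)} \right] &= e^{-f(x)} \cdot P(Y=1 \mid x) + e^{f(x)} \cdot P(Y=-1 \mid x) \\ \frac{\partial}{\partial f(x)} E_{Y \mid x} \left[ e^{-Y f(x)} \right] &= -e^{-f(x)} \cdot P(Y=1 \mid x) + e^{f(x)} \cdot P(Y=-1 \mid x) = 0 \\ \Rightarrow \quad e^{2 f(x)} &= \frac{P(Y=1 \mid x)}{P(Y=-1 \mid x)} \quad \Rightarrow \quad f^*(x) = \frac{1}{2} \log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} \end{aligned}\]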


The key thing to remember about <AdaBoost> is that fitting <AdaBoost> is in fact performing "Forward Stagewise Updates for the ERM with Exponential Loss"!! 😤

Apparently it is rarely used these days, since the GBM family of models that came later performs better 😥