
These are notes from the β€˜ν†΅κ³„μ  λ°μ΄ν„°λ§ˆμ΄λ‹β€™ (Statistical Data Mining) course I took in Spring 2021. Corrections are always welcome :)

An overview of <Boosting> can be found in the post below 😁

πŸ‘‰ Boosting

AdaBoost

<AdaBoost> is short for β€œAdaptive Boosting” and is the first Boosting model, developed in 1996 by Schapire & Freund, who originally devised the <Boosting> algorithm.


μ•Œκ³ λ¦¬μ¦˜μ„ μ‚΄νŽ΄λ³΄κΈ° 전에 λ¨Όμ € 셋업을 ν™•μΈν•˜μž.

[Data]

$$\{(x_1, y_1), \dots, (x_n, y_n)\}$$

  • p-dimensional input: $x_i \in \mathcal{X}$
  • Binary output: $y_i \in \{-1, 1\}$ // encoded as 1, -1

[Weak Classifier]

m-th Weak Classifier $G_m: \mathcal{X} \rightarrow \{-1, 1\}$

[Prediction]

predictions are based on a weighted majority vote!

$$G(x) = \text{sign} \left( \sum_{m=1}^{M} \alpha_m G_m(x) \right)$$

where $\alpha_m$ is the weight of the m-th weak classifier.
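As a tiny illustration (a minimal sketch of my own with made-up numbers, not code from the lecture), the final prediction is just the sign of the $\alpha$-weighted sum of the weak classifiers' Β±1 votes:

```python
import numpy as np

# hypothetical votes of M = 3 weak classifiers on 4 data points (values in {-1, +1})
votes = np.array([[+1, -1, +1, +1],
                  [+1, +1, -1, +1],
                  [-1, +1, +1, +1]])
alpha = np.array([0.9, 0.5, 0.3])   # weights of the weak classifiers

G = np.sign(alpha @ votes)          # weighted majority vote
print(G)                            # [ 1. -1.  1.  1.]
```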

Algorithm. AdaBoost

Initialize the weights for the data points:
$w_i = 1/n$ for $i = 1, \dots, n$.

// training!
For $m = 1$ to $M$
   Obtain $G_m(x)$ based on the training data using the weights $w_i$
   Compute the error of the m-th classifier:

$$\text{err}_m = \frac{\sum_{i=1}^{n} w_i \cdot I(y_i \neq G_m(x_i))}{\sum_{i=1}^{n} w_i}$$

   Compute the weight of the m-th classifier:

$$\alpha_m = \log \left( \frac{1 - \text{err}_m}{\text{err}_m} \right)$$

   // correctly classified $w_i$ stay the same, misclassified $w_i$ grow β†’ normalize β†’ correct $w_i$ β–Ό, misclassified $w_i$ β–²

   Update the weights of the data points:

$$w_i \leftarrow w_i \cdot \exp \left[ \alpha_m \cdot I(y_i \neq G_m(x_i)) \right] \quad \text{for } i = 1, \dots, n$$

Output $G(x) = \text{sign} \left[ \sum_{m=1}^{M} \alpha_m G_m(x) \right]$
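The algorithm above translates almost line by line into code. Below is a minimal sketch of my own (using sklearn decision stumps as the weak classifiers $G_m$; an illustration under those assumptions, not a reference implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """AdaBoost with decision stumps; y must be encoded as {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initialize w_i = 1/n
    stumps, alphas = [], []

    for m in range(M):
        # obtain G_m(x) from the training data using the weights w_i
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)

        # weighted error of the m-th classifier
        miss = (pred != y).astype(float)
        err = np.sum(w * miss) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid log(0)

        # weight of the m-th classifier
        alpha = np.log((1 - err) / err)

        # re-weight the data points: misclassified points get larger w_i
        w = w * np.exp(alpha * miss)
        w = w / np.sum(w)                      # normalize

        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # weighted majority vote: G(x) = sign( sum_m alpha_m * G_m(x) )
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)
```

Calling `adaboost_fit(X, y)` on training data labeled Β±1 and then `adaboost_predict(X, stumps, alphas)` reproduces the weighted majority vote $G(x)$ above.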


Boosting as Additive Expansion

<AdaBoost> μ•Œκ³ λ¦¬μ¦˜μ˜ 결과인 G(x)

G(x)=sign[βˆ‘m=1MΞ±mGm(x)]

λ₯Ό additive/basis expansion λͺ¨λΈλ‘œλ„ 이해할 수 μžˆλ‹€.

Algorithm. general additive function expansion

$$f(x) = \sum_{m=1}^{M} \beta_m \cdot h(x; \theta_m)$$

That is, we interpret $\alpha_m$ as the basis coefficient $\beta_m$ and the weak classifier $G_m(x)$ as the basis function $h(x; \theta_m)$!

Fitting an additive model directly by ERM (Empirical Risk Minimization) is a hard problem:

$$\min_{\{\beta_m, \theta_m\}_{m=1}^{M}} \sum_{i=1}^{n} L \left( y_i, \sum_{m=1}^{M} \beta_m \cdot h(x_i; \theta_m) \right)$$

λŒ€λΆ€λΆ„μ˜ κ²½μš°μ—μ„œλŠ” computationally infeasible ν•˜λ‹€κ³  보면 λœλ‹€.

To get around this computational problem, we can

  • Update the expansion sequentially
  • Consider only a single basis function at each step

κ°€ μžˆλŠ”λ°, 이런 접근을 ν”νžˆ <Forward Stagewise Update>라고 ν•œλ‹€!

To describe it briefly:

  1. Fit $\beta_0$
  2. Keeping $\beta_0$ fixed, fit $\beta_1$
  3. Keeping $\beta_0, \beta_1$ fixed, fit $\beta_2$
  4. …

Algorithm. Forward Stagewise Update πŸ”₯

Suppose that the current expansion is

$$f_{m-1}(x) = \sum_{j=1}^{m-1} \beta_j \cdot h(x; \theta_j)$$

then, the next update can be achieved by solving

$$(\beta_m, \theta_m) = \underset{(\beta, \theta)}{\operatorname{argmin}} \sum_{i=1}^{n} L \left( y_i, f_{m-1}(x_i) + \beta \cdot h(x_i; \theta) \right)$$

In other words, the <forward stagewise update> adds a new basis function without modifying the solution obtained so far.
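A rough sketch of this greedy scheme (my own illustration; the stump-shaped basis function $h(x; \theta)$ and the grid search are just assumptions to keep it short), where each step freezes $f_{m-1}$ and optimizes only one new pair $(\beta, \theta)$:

```python
import numpy as np

def stump(x, theta):
    """A simple basis function h(x; theta): +1 if x > theta, else -1."""
    return np.where(x > theta, 1.0, -1.0)

def forward_stagewise(x, y, loss, M=20):
    """Greedy forward stagewise fitting: add one basis function at a time."""
    f = np.zeros_like(y, dtype=float)          # f_0(x) = 0
    expansion = []

    for m in range(M):
        best = None
        # search over candidate thetas and a grid of betas,
        # keeping the current expansion f_{m-1} fixed
        for theta in np.unique(x):
            h = stump(x, theta)
            for beta in np.linspace(-2, 2, 81):
                risk = np.sum(loss(y, f + beta * h))
                if best is None or risk < best[0]:
                    best = (risk, beta, theta)
        _, beta_m, theta_m = best
        f = f + beta_m * stump(x, theta_m)     # update the expansion
        expansion.append((beta_m, theta_m))
    return expansion, f

# example with squared error loss on a toy 1-D regression problem
squared = lambda y, fx: (y - fx) ** 2
x = np.linspace(0, 1, 50); y = np.sin(4 * x)
expansion, f_hat = forward_stagewise(x, y, squared)
```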

Example. Forward Stagewise Update with Squared Error Loss

Consider the squared error loss:

$$L(y, f(x)) = (y - f(x))^2$$

then, the loss in forward stagewise is

$$L(y_i, f_{m-1}(x_i) + \beta \cdot h(x_i; \theta)) = \left( y_i - f_{m-1}(x_i) - \beta \cdot h(x_i; \theta) \right)^2$$

μ΄λ•Œ, current residual인 (yiβˆ’fmβˆ’1(xi))λŠ” μƒμˆ˜κ°’μœΌλ‘œ Ξ²,ΞΈλ₯Ό μ°ΎλŠ”λ°μ— 영ν–₯을 μ£Όμ§€ μ•ŠλŠ”λ‹€. λ”°λΌμ„œ μš°λ¦¬λŠ” Ξ²,θ에 μ˜¨μ „νžˆ μ§‘μ€‘ν•˜λ©΄ λœλ‹€!

Example. Forward Stagewise Update with Exponential Loss πŸ”₯

Consider the exponential error loss:

$$L(y, f(x)) = \exp(-y \cdot f(x))$$

μ΄λ•Œ! λ†€λžκ²Œλ„ <AdaBoost>λŠ” Forward Stagewise Updates for the ERM with Exponential Loss와 λ™μΉ˜λ‹€!! 😲

This statement gives a completely new way of looking at <Boosting>!

Note that the m-th step of the forward stagewise updates solves

$$\begin{aligned} (\beta_m, G_m) &= \underset{(\beta, G)}{\operatorname{argmin}} \sum_{i=1}^{n} \exp \left[ -y_i \cdot (f_{m-1}(x_i) + \beta \cdot G(x_i)) \right] \\ &= \underset{(\beta, G)}{\operatorname{argmin}} \sum_{i=1}^{n} w_i^{(m)} \cdot \exp \left[ -y_i \cdot \beta \cdot G(x_i) \right] \quad \text{where } w_i^{(m)} = \exp \left[ -y_i \cdot f_{m-1}(x_i) \right] \end{aligned}$$

It can be shown that

$$G_m = \underset{G}{\operatorname{argmin}} \sum_{i=1}^{n} w_i^{(m)} \cdot I(y_i \neq G(x_i)), \qquad \beta_m = \frac{1}{2} \log \left( \frac{1 - \text{err}_m}{\text{err}_m} \right)$$

where

$$\text{err}_m = \frac{\sum_{i=1}^{n} w_i^{(m)} \cdot I(y_i \neq G_m(x_i))}{\sum_{i=1}^{n} w_i^{(m)}}$$

Then, the update is

$$f_m(x) = f_{m-1}(x) + \beta_m \cdot G_m(x)$$

and leads the weights for the new iterations to be

$$\begin{aligned} w_i^{(m+1)} &= w_i^{(m)} \cdot \exp \left[ -y_i \cdot \beta_m \cdot G_m(x_i) \right] \\ &= w_i^{(m)} \cdot \exp \left[ \alpha_m \cdot I(y_i \neq G_m(x_i)) \right] \cdot \exp \left[ -\beta_m \right] \end{aligned}$$

where $\alpha_m = 2\beta_m$ is exactly the quantity that appears in the AdaBoost algorithm. (The factor $\exp[-\beta_m]$ multiplies every weight equally, so it disappears after normalization.)
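For reference, a sketch of the omitted step: split the objective into correctly and incorrectly classified points,

$$\sum_{i=1}^{n} w_i^{(m)} \cdot \exp \left[ -y_i \cdot \beta \cdot G(x_i) \right] = e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{y_i \neq G(x_i)} w_i^{(m)}$$

so for any fixed $\beta > 0$, the best $G$ is the one minimizing the weighted misclassification error; plugging in $G = G_m$ and setting the derivative with respect to $\beta$ to zero yields $\beta_m = \frac{1}{2} \log \left( \frac{1 - \text{err}_m}{\text{err}_m} \right)$.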

Q. Why Exponential Loss?

Population minimizer of the exponential loss:

$$\begin{aligned} f^*(x) &= \underset{f(x)}{\operatorname{argmin}} \; E_{Y \mid x} \left[ L(Y, f(x)) \right] = \underset{f(x)}{\operatorname{argmin}} \; E_{Y \mid x} \left[ \exp(-Y \cdot f(x)) \right] \\ &= \frac{1}{2} \log \frac{P(Y=1 \mid x)}{P(Y=-1 \mid x)} \end{aligned}$$

equivalently,

$$P(Y=1 \mid x) = \frac{\exp(f^*(x))}{\exp(-f^*(x)) + \exp(f^*(x))}$$
Derivation

It is true that $P(Y=1 \mid x) = 1 - P(Y=-1 \mid x)$.

$$\begin{aligned} f^*(x) &= \frac{1}{2} \log \frac{P(Y=1 \mid x)}{P(Y=-1 \mid x)} \\ \exp(2 f^*(x)) &= \frac{P(Y=1 \mid x)}{P(Y=-1 \mid x)} = \frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} \\ \exp(2 f^*(x)) \cdot (1 - P(Y=1 \mid x)) &= P(Y=1 \mid x) \\ \exp(2 f^*(x)) &= P(Y=1 \mid x) + \exp(2 f^*(x)) \cdot P(Y=1 \mid x) \\ &= P(Y=1 \mid x) \cdot (1 + \exp(2 f^*(x))) \\ \frac{\exp(2 f^*(x))}{1 + \exp(2 f^*(x))} &= P(Y=1 \mid x) \\ \frac{\exp(f^*(x))}{\exp(-f^*(x)) + \exp(f^*(x))} &= P(Y=1 \mid x) \end{aligned}$$

β—Ό
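We can numerically sanity-check this population minimizer (a quick sketch of my own, not from the lecture): for a fixed $p = P(Y=1 \mid x)$, the conditional risk is $E_{Y \mid x}[\exp(-Y f)] = p \, e^{-f} + (1-p) \, e^{f}$, and its minimizer should match $\frac{1}{2} \log \frac{p}{1-p}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

p = 0.8                                                  # assume P(Y = 1 | x) = 0.8
risk = lambda f: p * np.exp(-f) + (1 - p) * np.exp(f)    # E[ exp(-Y f) | x ]

f_star = minimize_scalar(risk).x
print(f_star, 0.5 * np.log(p / (1 - p)))                 # both β‰ˆ 0.6931
```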


The one thing you absolutely must remember about <AdaBoost> is that fitting <AdaBoost> is in fact β€œForward Stagewise Updates for the ERM with Exponential Loss”!! 😀

That said, the later GBM family tends to perform better, so these days AdaBoost itself is said to be rarely used πŸ˜₯