This post summarizes the content of Professor Andrew Ng's Machine Learning (CS229) course at Stanford Univ., Fall 2018. Corrections are always welcome :)



– lecture 3
– lecture 4


Perceptron Algorithm

This time, let's look at the Perceptron Algorithm, one of the historical roots of learning algorithms!

The Perceptron Algorithm uses a threshold function, a slight modification of the sigmoid function $g(z)$ used in Logistic Regression.

$$ g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \\ \end{cases} $$

Writing the hypothesis as $h_{\theta}(x)=g(\theta^{T}x)$, the learning rule is

$$\theta_{j} := \theta_{j} + \alpha (y^{(i)}-h_\theta(x^{(i)}))x^{(i)}_j$$

Notice that Logistic Regression and the Perceptron update $\theta$ with exactly the same rule as before. However, because their $h_{\theta}(x)$ differ, they are entirely different algorithms!
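To make the update rule concrete, here is a minimal NumPy sketch of a single perceptron step; the function names, learning rate, and example data are illustrative assumptions, not something from the lecture.

```python
import numpy as np

def g(z):
    """Threshold function: 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def perceptron_update(theta, x_i, y_i, alpha=0.1):
    """One perceptron step: theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij."""
    h = g(theta @ x_i)               # hypothesis h_theta(x) = g(theta^T x)
    return theta + alpha * (y_i - h) * x_i

# Illustrative usage: x_i includes the intercept term x_0 = 1, y_i is a label in {0, 1}.
theta = np.zeros(3)
x_i = np.array([1.0, 2.0, -1.0])
y_i = 1
theta = perceptron_update(theta, x_i, y_i)
```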

The Perceptron is a model from the 1960s inspired by the human neuron. However, unlike Logistic Regression and Linear Regression, it has no clear statistical interpretation and no connection to Maximum Likelihood Estimation.


Newton's Method

Let's look at another iterative method for optimizing $\theta$!

Newton's Method is a way to find, for a real-valued function $f$, the $\theta$ at which $f(\theta)=0$.

Our goal is to find the $\theta$ that maximizes the log likelihood $l(\theta)$. In other words, we need the point where $l'(\theta)=0$, and that is exactly what Newton's Method can find!!

And the update rule is as follows.

$$\theta := \theta - \frac{f(\theta)}{f'(\theta)}$$


Newton's Method is the most elementary approach: it uses only a linear (tangent-line) approximation to find the point where the function value becomes 0.

Compared with the Gradient Descent approach we saw earlier, Newton's Method reaches the optimal $\theta$ in far fewer steps.
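As a quick sanity check of the rule, here is a minimal 1D sketch; the example function and starting point are made up purely for illustration.

```python
def newtons_method(f, f_prime, theta0, n_steps=10):
    """Iterate theta := theta - f(theta) / f'(theta) to find a root of f."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - f(theta) / f_prime(theta)
    return theta

# Illustrative example: the positive root of f(theta) = theta^2 - 2 is sqrt(2) ~ 1.41421.
root = newtons_method(lambda t: t**2 - 2, lambda t: 2 * t, theta0=1.0)
print(root)
```

Near the root the iteration converges quadratically, which is why it typically needs far fewer steps than Gradient Descent.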

The $\theta$ we want is not a 1D real value but an N-dimensional real-valued vector. So let's generalize Newton's Method!

$$ \theta^{(1)} := \theta^{(0)} - \Delta \\ f'(\theta^{(0)}) = \frac{f(\theta^{(0)})}{\Delta} \\ \therefore \Delta = \frac{f(\theta^{(0)})}{f'(\theta^{(0)})} $$

Generalizing this to iteration $t$, we get

$$\theta^{(t+1)} := \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}$$

And generalizing this to the vector-valued case, applying Newton's Method to $\nabla_{\theta}l(\theta)$, gives the following.

$$\theta^{(t+1)} := \theta^{(t)} - H^{-1} \nabla_{\theta}l(\theta)$$

Here, $H$ is the Hessian, the matrix of second derivatives of $l(\theta)$: $H_{ij} = \frac{\partial^{2} l(\theta)}{\partial \theta_{i} \partial \theta_{j}}$.

Newton's Method clearly finds the optimal $\theta$ in few steps. However, the formula involves computing $H^{-1}$, the inverse of an n-by-n matrix! If n is small, computing the inverse is cheap, but as n grows, the cost becomes very large... So Newton's Method has the constraint that it is only practical when $\theta$ has a small number of features.
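For reference, below is a minimal sketch (my own illustration in NumPy, not anything prescribed by the lecture) of one Newton step for the logistic regression log likelihood $l(\theta)$; the data shapes and the use of np.linalg.solve instead of explicitly forming $H^{-1}$ are also assumptions on my part.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(theta, X, y):
    """One Newton update theta := theta - H^{-1} grad for logistic regression.

    X: (m, n) design matrix, y: (m,) labels in {0, 1}, theta: (n,) parameters.
    """
    h = sigmoid(X @ theta)               # (m,) predicted probabilities
    grad = X.T @ (y - h)                 # gradient of l(theta), shape (n,)
    W = h * (1.0 - h)                    # diagonal weights of the Hessian
    H = -(X * W[:, None]).T @ X          # Hessian of l(theta), shape (n, n)
    # Solving the n-by-n linear system costs O(n^3), which is why Newton's Method
    # only pays off when the number of features n is small.
    return theta - np.linalg.solve(H, grad)
```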


Closing Remarks

  • The Perceptron Algorithm is one of the earliest machine learning techniques.
    • However, it has no statistical interpretation and no connection to MLE.
  • Newton's Method is the most elementary approach to finding a point where a function value becomes 0.
  • Newton's Method reaches the optimum in fewer steps.
    • However, it cannot be used when computing the inverse of the n-by-n Hessian $H$ is too expensive.