๋ณธ ๊ธ€์€ 2018-2ํ•™๊ธฐ Stanford Univ.์˜ Andrew Ng ๊ต์ˆ˜๋‹˜์˜ Machine Learning(CS229) ์ˆ˜์—…์˜ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)


- lecture video


Classification

This part covers the Classification Problem. One thing to note is that Classification is also a kind of Regression; the only difference is that the predicted value is discrete rather than continuous. Among Classification Problems, this part deals with the Binary Classification Problem, where the labels are $\{ 0, 1 \}$! 1


Failure of Linear Regression

์•ž์„  ๊ฐ•์˜์—์„œ๋Š” Linear Regression์˜ hypothesis $h_{\theta}(x) = w^{T} \cdot x + b$๋กœ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ , prediction ํ•˜์˜€๋‹ค. ํ•˜์ง€๋งŒ, Classification์—์„œ๋Š” 2๊ฐ€์ง€ ๋ฌธ์ œ์  ๋•Œ๋ฌธ์— ๊ทธ๋Ÿด ์ˆ˜ ์—†๋‹ค.

P1. Linear Regression aims to minimize the sum of the losses over all inputs. In a classification setting, a few examples that lie far from the boundary (even though they are clearly labeled) can therefore pull the fitted line and shift the resulting decision boundary.

P2. Classification wants outputs in $\{ 0, 1 \}$. Linear Regression, however, can output values smaller than 0 or larger than 1.

๊ธฐ์กด ๋ชจ๋ธ์ธ Linear Regression์€ Classification์—์„œ ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค๋Š” ๊ฒƒ์ด ๋ฐํ˜€์กŒ๋‹ค. ์šฐ๋ฆฌ๋Š” Classification์„ ์œ„ํ•œ ์ƒˆ๋กœ์šด ๋ชจ๋ธ์ด ํ•„์š”ํ•˜๋‹ค!

Logistic Regression (a.k.a. sigmoid function)

์•ž์—์„œ๋„ ์–ธ๊ธ‰ ๋˜์—ˆ๋“ฏ์ด ์šฐ๋ฆฌ๋Š” $\{ 0, 1 \}$์˜ ์ถœ๋ ฅ ๊ฒฐ๊ณผ๋ฅผ ์›ํ•œ๋‹ค. ๊ทธ๋ž˜์„œ hypothesis $h_{\theta}(x)$๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ค๊ณ„ํ•  ๊ฒƒ์ด๋‹ค.

$$h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{ 1+e^{-\theta^{T}x} } \in [0, 1]$$

A function of the form $g(z)= \frac{1}{ 1+e^{-z} }$ is called the Logistic function, or the sigmoid function.

์ƒˆ๋กญ๊ฒŒ ๋ชจ๋ธ๋งํ•œ ํ•จ์ˆ˜ $h_{\theta}(x)$๋Š” ์ถœ๋ ฅ ๊ฐ’์ด $[0, 1]$ ์‚ฌ์ด๋กœ ๊ฐ•์ œ๋œ๋‹ค!

Why we choose 'sigmoid' function?

Why do we use the sigmoid function in particular? If all we need is to squash the output into $[0, 1]$, other smooth functions would do the job as well.

๊ทธ ์ด์œ ๋Š” lecture 4์˜ GLMGeneralized Linear Model์—์„œ ๋‹ค๋ฃจ๊ฒŒ ๋œ๋‹ค!


Logistic Regression & Probabilistic Approach

Just as we made a few assumptions to show $\textrm{MLE} \equiv \textrm{LMS}$, we will introduce a few assumptions for Logistic Regression as well.

$$ \begin{split} P(y=1 \vert x; \theta) &= h_{\theta}(x) \\ P(y=0 \vert x; \theta) &= 1 - h_{\theta}(x) \\ \end{split} $$

๊ทธ๋ฆฌ๊ณ  ์ด๊ฒƒ์„ ํ•˜๋‚˜์˜ equation์œผ๋กœ ๋‹ค์‹œ ์จ๋ณด๋ฉด

$$P(y \vert x; \theta) = (h_{\theta}(x))^y (1-h_{\theta}(x))^{(1-y)}$$

Now let's redo the MLE procedure based on this equation!

$$ \begin{split} L(\theta) &= P(\vec{y} \vert X; \theta) \\ &= \prod_{i=1}^{m} { p(y^{(i)} \vert x^{(i)}; \theta) } \\ &= \prod_{i=1}^{m} { \left( h_{\theta}(x^{(i)})\right)^{y^{(i)}} \left( 1- h_{\theta}(x^{(i)})\right) ^ {(1-y^{(i)})}} \end{split} $$

๊ทธ๋ฆฌ๊ณ  ์ค€์‹์— $\log$๋ฅผ ์ทจํ•˜๋ฉด

$$ \begin{split} l(\theta) &= \log {L(\theta)} \\ &= \sum_{i=1}^{m} { y^{(i)}\log{h(x^{(i)})} + (1-y^{(i)})\log{\left( 1-h(x^{(i)}) \right)} } \end{split} $$
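
Written as code, the log likelihood is a direct translation of the sum above. This is a minimal sketch assuming the examples are stacked into a design matrix `X` of shape (m, n) and `y` is a vector of 0/1 labels (both names are mine).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    p = sigmoid(X @ theta)                       # h_theta(x^(i)) for every example
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```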

์—ฌ์ „ํžˆ ์šฐ๋ฆฌ์˜ ๋ชฉํ‘œ๋Š” Likelihood๋ฅผ Maximizeํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋ฅผ ์œ„ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ GD ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๋‹ค.

$$\theta := \theta + \alpha \nabla_{\theta}(l(\theta))$$

Notice that, unlike the GD updates we used before, the gradient term has changed: we now add ($+$) the gradient of the log likelihood.

๊ทธ ์ด์œ ๋Š” Likelihood์˜ ๊ฒฝ์šฐ Maximize ํ•˜๋Š” Gradient Ascent ๊ณผ์ •์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค!

Now let's compute the gradient term.

$$ \begin{split} \frac{\partial}{\partial \theta_{j}} l(\theta) &= \left( y \frac{1}{g(\theta^{T}x)} - (1-y)\frac{1}{1 - g(\theta^{T}x)} \right) \frac{\partial}{\partial \theta_{j}} g(\theta^{T}x) \\ &= \left( y \frac{1}{g(\theta^{T}x)} - (1-y)\frac{1}{1 - g(\theta^{T}x)} \right) g(\theta^{T}x)(1-g(\theta^{T}x))\frac{\partial}{\partial \theta_{j}} \theta^{T}x \\ &= \left( y (1 - g(\theta^{T}x)) - (1 - y) g(\theta^{T}x) \right) x_j \\ &= (y-h_\theta(x))x_j \end{split} $$
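
The second line of the derivation uses the convenient derivative of the sigmoid, $g'(z) = g(z)(1-g(z))$, which follows directly from the chain rule:

$$ g'(z) = \frac{d}{dz}\frac{1}{1+e^{-z}} = \frac{e^{-z}}{\left(1+e^{-z}\right)^{2}} = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right) = g(z)\left(1-g(z)\right) $$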

Putting everything together, we finally get

$$\theta_j := \theta_j + \alpha (y-h_\theta(x))x_j$$
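
Here is a minimal stochastic gradient ascent sketch of this update rule (the function name, learning rate, and epoch count are my own choices, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, epochs=100):
    """Stochastic gradient ascent on the log likelihood.

    Per-example update: theta_j <- theta_j + alpha * (y_i - h_theta(x_i)) * x_ij.
    X: (m, n) design matrix (include a bias column), y: (m,) vector of 0/1 labels.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):
            error = y[i] - sigmoid(X[i] @ theta)   # y - h_theta(x)
            theta += alpha * error * X[i]          # ascend the log likelihood
    return theta
```

A batch version would simply sum the per-example gradients before each update; the stochastic form above matches the single-example rule written in the equation.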

Gradient Ascent์˜ ์ตœ์ข…์ ์ธ ํ˜•ํƒœ๋ฅผ ๋ณด๋ฉด, LMS์—์„œ์˜ GD์™€ ์ƒ๋‹นํžˆ ๋‹ฎ์•„์žˆ๋‹ค!

$$ (\textrm{Gradient Ascent, Logistic Regression}) \\ \theta_j := \theta_j + \alpha (y-h_\theta(x))x_j $$ $$ (\textrm{Gradient Descent, LMS}) \\ \theta_j := \theta_j + \alpha (y-h_\theta(x))x_j $$

ํ•˜์ง€๋งŒ, ์ด ๋‘˜์€ ๋ช…๋ฐฑํžˆ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค! ์™œ๋ƒํ•˜๋ฉด, hypothesis $h_{\theta}(x^{(i)})$๊ฐ€ non-linear function์ธ sigmoid์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค!


It is quite interesting that the Learning Algorithms for LMS and Classification take such a similar form. Is this similarity just a coincidence? The answer to this question is in the next lecture!



๋งบ์Œ๋ง

We have now walked through the Learning Algorithm for Binary Classification. Let's wrap up by briefly summarizing the key ideas.

  • In Classification, to match the $\{ 0, 1 \}$ labels, we used a new model with the sigmoid as $h_{\theta}(x)$.
  • In that model, we Maximized the Likelihood instead of minimizing a Loss.
    • To do so, we optimized $\theta$ via Gradient Ascent.
    • Remarkably, the resulting update has the same form as the GD update for LMS!!

  1. Generalizing Binary Classification to the Multi-class case gives a Generalized Linear Model (GLM)! (Covered in Lecture 4.)