๋ณธ ๊ธ€์€ 2018-2ํ•™๊ธฐ Stanford Univ.์˜ Andrew Ng ๊ต์ˆ˜๋‹˜์˜ Machine Learning(CS229) ์ˆ˜์—…์˜ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)


โ€“ lecture video


Locally Weighted Linear Regression (LWR)

์ง€๋‚œ ๊ฐ•์˜์—์„œ๋Š” Linear Regression์— ๋Œ€ํ•œ ๋‹ค๋ค˜๋‹ค๋ฉด, ์ด๋ฒˆ์—๋Š” Weight ํ•จ์ˆ˜ $w^{(i)}$๊ฐ€ ํฌํ•จ๋œ Locally Weighted Linear RegressionLWR์— ๋Œ€ํ•ด ๋‹ค๋ฃฌ๋‹ค.

LWR์€ ํ•จ์ˆ˜๋ฅผ ๊ทผ์‚ฌํ•  ๋•Œ, neighborhood์˜ ์˜ํ–ฅ์„ ๋” ๊ณ ๋ คํ•˜์ž๋Š” ํŒจ๋Ÿฌ๋‹ค์ž„์ด๋‹ค. ๊ทธ๋ž˜์„œ LWR์€ ๋‹ค์Œ์˜ ์ˆ˜์ •๋œ Cost function์„ ์‚ฌ์šฉํ•œ๋‹ค.

$$\sum_{ i } { w^{(i)} {\left( y^{(i)} - {\theta}^{T} x^{(i)} \right)}^{2} }$$

Here, the weight function $w^{(i)}$ is defined as follows.

$$w^{(i)} = \textrm{ exp } \left( - \cfrac{1}{2} {\left( x^{(i)} - x \right)}^{2} \right)$$

$w^{(i)}$์˜ ์˜๋ฏธ๋ฅผ ํ•ด์„ํ•ด๋ณด๋ฉด,

  • if $\left| x^{(i)} - x \right|$ is small, $w^{(i)} \approx 1$
  • else if $\left| x^{(i)} - x \right|$ is large, $w^{(i)} \approx 0$

So when fitting $\theta$, training points close to the query point $x$ receive larger weights and therefore dominate the fit.

$w^{(i)}$ has a bell shape similar to a Gaussian, but $w^{(i)}$ is not a Gaussian function (it is not normalized and is not a probability density)!

$w^{(i)}$๋Š” neighborhood์˜ ๋ฒ”์œ„๋ฅผ neighborhood parameter $\tau$๋ฅผ ํ†ตํ•ด ์ง€์ •ํ•œ๋‹ค.

$$w^{(i)}(x; \tau) = \textrm{ exp } \left( - \cfrac{1}{2 \tau^{2}} {\left( x^{(i)} - x \right)}^{2} \right)$$
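
To make this concrete, below is a minimal NumPy sketch of an LWR prediction (the function name `lwr_predict` and its interface are my own choices for illustration, not something prescribed by the lecture). For a query point it computes the weights $w^{(i)}$ with bandwidth $\tau$, using the squared Euclidean distance as the natural multivariate generalization of ${(x^{(i)} - x)}^2$, and then solves the weighted least-squares problem in closed form via the weighted normal equation.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at a single query point.

    X       : (m, n) design matrix (include a column of ones for the intercept)
    y       : (m,) target vector
    x_query : (n,) query point (same layout as a row of X)
    tau     : bandwidth parameter controlling how fast the weights decay
    """
    # w^{(i)} = exp(-||x^{(i)} - x||^2 / (2 * tau^2))
    diff = X - x_query
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * tau ** 2))

    # Weighted normal equation: (X^T W X) theta = X^T W y
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```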

Parametric / Non-parametric Learning

$w^{(i)}$ is a function of the query input $x$, so we can write it as $w^{(i)}(x)$.

โ€œweights depend on the particular point $x$ at which weโ€™re trying to evaluate $x$.โ€

LWR๋ฅผ ํ†ตํ•ด Learning์„ ํ•  ๊ฒฝ์šฐ weight $\theta$๋Š” ํŠน์ • ์ž…๋ ฅ๊ฐ’ $x$์— ๋”ฐ๋ผ ๋ฐ”๋€๋‹ค. ๊ทธ๋ž˜์„œ ์ž…๋ ฅ๊ฐ’์ด ๋ฐ”๋€” ๋•Œ๋งˆ๋‹ค $\theta$๋ฅผ ๋งค๋ฒˆ ๋‹ค์‹œ ์ตœ์ ํ™”ํ•ด์•ผ ํ•œ๋‹ค.

LWR์€ ๋Œ€ํ‘œ์ ์ธ non-parametric Learning์ด๋‹ค. ์•ž์—์„œ ์‚ดํŽด๋ณธ (unweighted) Linear Regression์€ parametric Learning์— ํ•ด๋‹นํ•œ๋‹ค. ๊ทธ ์ด์œ ๋Š” Linear Regression์—์„œ๋Š” dataset์ด ๊ณ ์ •๋˜์–ด ์žˆ๊ณ  Learning์„ ํ†ตํ•ด ์ตœ์ ํ™”ํ•œ $\theta$ ๊ฐ’์„ ์•Œ๊ณ  ์žˆ์œผ๋ฉด, ๊ทธ ์ดํ›„์˜ prediction์€ Learning ์—†์ด ํ•ด๋‹น $\theta$ ๊ฐ’๋งŒ์„ ์‚ฌ์šฉํ•˜๋ฉด ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

โ€œOnce weโ€™ve fit the $\theta_{i}$โ€™s and stored them away, we no longer need to keep the training data around to make future predictions.โ€

In non-parametric Learning, by contrast, we must keep the training set around and perform learning anew for every prediction, depending on the query point $x$. The computational cost is much higher, but LWR can achieve lower prediction error than plain Linear Regression when the true relationship is not linear.
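
As a rough usage sketch of this contrast (reusing the hypothetical `lwr_predict` above on made-up toy data): the parametric model is fit once and afterwards only $\theta$ is needed, while the non-parametric model re-solves a weighted fit for every single query point.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
X = np.column_stack([np.ones_like(x), x])           # intercept + one feature
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)   # non-linear toy data

# Parametric: fit theta once; afterwards the training set can be discarded.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat_linear = X @ theta

# Non-parametric: every query re-runs a weighted fit over the full training set.
y_hat_lwr = np.array([lwr_predict(X, y, x_q, tau=0.5) for x_q in X])
```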


Probabilistic Interpretation

So far we have learned $\theta$ by defining the cost in the LMS (Least Mean Squares) sense. Why use LMS in the first place? Let's see why the LMS approach is reasonable from a probabilistic point of view!

Maximum Likelihood Estimation

First, assume that the target variable $y^{(i)}$ and the input variable $x^{(i)}$ are related as follows.

$$y^{(i)} = \theta^{T} x^{(i)} + \epsilon^{(i)}$$

$\epsilon^{(i)}$์€ error term์œผ๋กœ modeling ๊ณผ์ •์—์„œ ํฌํ•จํ•˜์ง€ ๋ชปํ•œ feature๋‚˜ random noise๋ฅผ ํฌํ•จํ•œ๋‹ค.

๋˜, $\epsilon^{(i)}$์— ๋Œ€ํ•ด์„œ๋Š” Gassian distribution์„ ๋”ฐ๋ฅด๋Š” IIDIndependently and Identically distributed๋ผ๊ณ  ๊ฐ€์ •ํ•œ๋‹ค. (โ€œNormal distributionโ€์ด๋ผ๊ณ ๋„ ํ•œ๋‹ค.)

$$\epsilon^{(i)} \sim N\left( 0, \sigma^{2} \right)$$

โ€$\epsilon^{(i)}$๊ฐ€ IIDโ€๋ผ๊ณ  ํ•จ์€ house1์˜ error term์ด house2์˜ error term์— ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š๋Š”๋‹ค๋Š” ๋ง์ด๋‹ค.

$\epsilon^{(i)}$์— ๋Œ€ํ•œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์ˆ˜์‹์œผ๋กœ ๊ธฐ์ˆ ํ•˜๋ฉด,

$$p( \epsilon^{(i)} ) = \frac{ 1 }{ \sqrt{2\pi} \sigma } \textrm{ exp } \left( - \cfrac{ {\left( \epsilon^{(i)} \right)}^{2} }{2 \sigma^{2}} \right)$$

์ด๋•Œ, ์‹ $y^{(i)} = \theta^{T} x^{(i)} + \epsilon^{(i)}$์— ์˜ํ•ด์„œ $\epsilon^{(i)} = y^{(i)} - \theta^{T} x^{(i)}$์ด๋‹ค. ๋”ฐ๋ผ์„œ $p\left( \epsilon^{(i)} \right)$๋Š” ๋‹ค์Œ์„ ์˜๋ฏธํ•œ๋‹ค.

$$p ( y^{(i)} \vert x^{(i)}; \theta ) = \frac{ 1 }{ \sqrt{2\pi} \sigma } \textrm{ exp } \left( - \cfrac{ {\left( y^{(i)} - \theta^{T} x^{(i)} \right)}^{2} }{2 \sigma^{2}} \right)$$
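
As a small sanity check, the sketch below generates synthetic data under exactly this assumption and evaluates $p ( y^{(i)} \vert x^{(i)}; \theta )$ with the formula above (the values of `theta_true` and `sigma` are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 100, 0.5
theta_true = np.array([1.0, 2.0])                         # illustrative parameters

X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])  # intercept + one feature
eps = rng.normal(0.0, sigma, m)                           # IID Gaussian error terms
y = X @ theta_true + eps                                  # y^(i) = theta^T x^(i) + eps^(i)

def density(y_i, x_i, theta, sigma):
    """p(y^(i) | x^(i); theta) under the Gaussian noise assumption."""
    resid = y_i - x_i @ theta
    return np.exp(-resid ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

print(density(y[0], X[0], theta_true, sigma))
```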


The probability function $p ( y^{(i)} \vert x^{(i)}; \theta )$ can also be viewed as the likelihood function $L(\theta)$, a function of $\theta$.

$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \vert X; \theta)$$

$\theta$๊ฐ€ ๋ณ€์ˆ˜์ธ์ง€ ๊ณ ์ •๊ฐ’์ธ์ง€์— ๋”ฐ๋ผ $L(\theta)$์™€ $p(\vec{y} \vert X; \theta)$๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

$\theta$๊ฐ€ ๋ณ€์ˆ˜๋ผ๋ฉด, Likelihood $L(\theta)$๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ๋ฐ˜๋ฉด, $\theta$๊ฐ€ ๊ณ ์ •๋˜์–ด ์žˆ๊ณ , $(y, x)$ ๊ฐ™์€ training pair๊ฐ€ ๋ณ€ํ•œ๋‹ค๋ฉด, $p(\vec{y} \vert X; \theta)$๋ฅผ ํ†ตํ•ด โ€œprobability of dataโ€์˜ ์˜๋ฏธ๋กœ ์‚ฌ์šฉํ•œ๋‹ค.

By the IID assumption, $p(\vec{y} \vert X; \theta)$ can be written as follows.

$$p(\vec{y} \vert X; \theta) = \prod _{ i } { p(y^{(i)} \vert x^{(i)}; \theta) }$$

The Principle of Maximum Likelihood

์•ž์—์„œ ๊ตฌ์ถ•ํ•œ ํ™•๋ฅ  ๋ชจ๋ธ $p(\vec{y} \vert X; \theta)$์—์„œ ์–ด๋–ค $\theta$ ๊ฐ’์ด ๊ฐ€์žฅ ํ•ฉ๋ฆฌ์ ์ผ๊นŒ? ๊ทธ๊ฒƒ์€ $p(\vec{y} \vert X; \theta)$์˜ ๊ฐ’์„ Maximizeํ•˜๋Š” $\theta$์ด๋‹ค!

"The principle of maximum likelihood says we should choose $\theta$ so as to make the data as high probability as possible. I.e., we should choose $\theta$ to maximize $L(\theta)$"

Log Likelihood $l(\theta)$

However, maximizing $L(\theta)$ directly is not easy: $L(\theta)$ is a product of exponentials, so deriving and working with its derivative is quite messy.

So instead we will work with $l(\theta)$, the log of $L(\theta)$! Since $\log$ is strictly increasing, maximizing $l(\theta)$ is the same as maximizing $L(\theta)$.

$$ \begin{split} l(\theta) &= \log {L(\theta)} = \log { p(\vec{y} \vert X; \theta) }\\ &= \log {\prod _{ i } { p(y^{(i)} \vert x^{(i)}; \theta) }} \\ &= \sum_{i} { \log { p(y^{(i)} \vert x^{(i)}; \theta) }} \\ &= \sum_{i} { \log { \frac{ 1 }{ \sqrt{2\pi} \sigma } \textrm{ exp } \left( - \cfrac{ {\left( y^{(i)} - \theta^{T} x^{(i)} \right)}^{2} }{2 \sigma^{2}} \right) } } \\ &= m \log {\frac{ 1 }{ \sqrt{2\pi} \sigma}} - \cfrac{1}{\sigma^{2}} \cdot \cfrac{1}{2} \sum_{i} { {\left( y^{(i)} - \theta^{T} x^{(i)} \right)}^2 } \end{split} $$

To maximize $l(\theta)$, we only need to maximize the term $- \cfrac{1}{\sigma^{2}} \cdot \cfrac{1}{2} \sum_{i} { {\left( y^{(i)} - \theta^{T} x^{(i)} \right)}^2 }$, since the first term $m \log {\frac{ 1 }{ \sqrt{2\pi} \sigma}}$ does not depend on $\theta$.

์ด๋•Œ, ์Œ์ˆ˜ ๋ถ€ํ˜ธ $-$๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ๋ฐ˜์ „ํ•˜๋ฉด,

$\cfrac{1}{\sigma^{2}} \cdot \cfrac{1}{2} \sum_{i} { {\left( y^{(i)} - \theta^{T} x^{(i)} \right)}^2 }$๋ฅผ minimizeํ•˜๋Š” ๊ฒƒ์ด๊ณ  ์ด๊ฒƒ์€ $J(\theta)$๋ฅผ ์˜๋ฏธํ•œ๋‹ค!!


Let's summarize the chain of equivalences (a quick numerical check follows the list).

  1. $\max{L(\theta)} \equiv \max{l(\theta)}$
  2. $\max{l(\theta)} \equiv \max{\sum_{i}{- \frac{\left( y^{(i)} - \theta^{T} x^{(i)} \right)^2}{2{\sigma}^2}}}$
  3. $\max{\sum_{i}{- \frac{\left( y^{(i)} - \theta^{T} x^{(i)} \right)^2}{2{\sigma}^2}}} \equiv \min{\frac{1}{2}\sum_{i}{\left( y^{(i)} - \theta^{T} x^{(i)} \right)^2}} = \min{J(\theta)}$
  4. $\max{L(\theta)} \equiv \min{J(\theta)}$
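
Here is a quick numerical check of this equivalence on synthetic data generated under the IID Gaussian-noise assumption (all names and constants below are illustrative): sweeping one component of $\theta$ over a grid, the value that maximizes $l(\theta)$ is the same value that minimizes $J(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 100, 0.5
x = rng.uniform(-1, 1, m)
X = np.column_stack([np.ones(m), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, sigma, m)

def log_likelihood(theta):
    """l(theta) from the derivation above, under Gaussian noise with known sigma."""
    resid = y - X @ theta
    return m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - np.sum(resid ** 2) / (2 * sigma ** 2)

def J(theta):
    """Least-squares cost J(theta) = (1/2) * sum of squared residuals."""
    resid = y - X @ theta
    return 0.5 * np.sum(resid ** 2)

# Sweep theta_1 over a grid (theta_0 held at 1.0) and compare the optima.
grid = np.linspace(0.0, 4.0, 401)
ll = np.array([log_likelihood(np.array([1.0, t])) for t in grid])
cost = np.array([J(np.array([1.0, t])) for t in grid])
print(grid[np.argmax(ll)], grid[np.argmin(cost)])  # same theta_1 for both criteria
```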

Note that Likelihood Maximization applies not only to Regression problems but also to Classification problems!!

๋งบ์Œ๋ง

๊ฒฐ๋ก ์ ์œผ๋กœ ์šฐ๋ฆฌ๋Š” LMS ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ Likelihood $L(\theta)$๋ฅผ Maximizingํ•˜๊ณ  ์žˆ์—ˆ๋‹ค. ๋ฌผ๋ก  $\textrm{LMS} \equiv \textrm{MLE}$์—๋Š” IID๋ผ๋Š” ๊ฐ€์ •์ด ํ•„์š”ํ•˜์ง€๋งŒ, ์ด ๋™์น˜ ๊ด€๊ณ„๋Š” LMS ๋ฐฉ์‹์„ ์ง€์ง€ํ•˜๋Š” ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ฐ˜์„ ์ œ๊ณตํ•œ๋‹ค.


  1. $p (y^{(i)} \vert x^{(i)}; \theta )$ and $p ( y^{(i)} \vert x^{(i)}, \theta )$ have clearly different meanings. With '$;$', the meaning is "parameterized by"; with '$,$', it means "conditioned on $\theta$". Here $p ( y^{(i)} \vert x^{(i)}, \theta )$ is the wrong notation, because $\theta$ has the status of a parameter, not a random variable.