๋ณธ ๊ธ€์€ 2018-2ํ•™๊ธฐ Stanford Univ.์˜ Andrew Ng ๊ต์ˆ˜๋‹˜์˜ Machine Learning(CS229) ์ˆ˜์—…์˜ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)

โ€“ lecture video


Part I: Linear Regression

Q. ๋ฌด์—‡์ด Regression์ธ๊ฐ€?
A. ์˜ˆ์ธกpredictํ•˜๊ณ ์ž ํ•˜๋Š” target variable์ด continuous ํ•˜๋‹ค๋ฉด, ๊ทธ prediction์„ Regression์ด๋ผ๊ณ  ํ•œ๋‹ค. (๋งŒ์•ฝ target variable์ด discrete ํ•˜๋‹ค๋ฉด, ๊ทธ prediction์„ classification์ด๋ผ๊ณ  ํ•œ๋‹ค.)

โ€œ๋”ฅ๋Ÿฌ๋‹์€ prediction goal์„ ๋ช…ํ™•ํ•˜๊ฒŒ ์ œ์‹œํ•ด์•ผ prediction์„ ํ•  ์ˆ˜ ์žˆ๋‹ค.โ€

โ€œregression์€ prediction ํ•˜๋Š” ์—ฌ๋Ÿฌ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜์ผ ๋ฟ์ด๋‹ค.โ€

In practice, multi-var. regression is used far more often than single-var. regression.

Q. regression์€ ์–ด๋–ป๊ฒŒ predictionํ•˜๋Š”๊ฐ€?
A. ์ฃผ์–ด์ง„ parameter $\theta$์— ๋Œ€ํ•œ hypothesis function $h(x; \theta)$์˜ ๊ฐ’์ด prediction ๊ฐ’์ด๋‹ค.
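
For linear regression in particular (the subject of Part I), the hypothesis is linear in the parameters; with the usual CS229 convention of an intercept feature $x_0 = 1$, it can be written as

$$h_{\theta}(x) = \sum_{j=0}^{n}{\theta_j x_j} = \theta^T x$$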

Q. How do we optimize the parameters $\theta$?
A. Define a cost function $J(\theta)$ and learn by moving in the direction that minimizes the cost.

1. LMS algorithm

$$J(\theta) = \cfrac { 1 }{ 2 } \sum_{ i=1 }^{ m }{ { \left( { h }_{ \theta }\left( { x }^{ \left( i \right) } \right) - { y }^{ \left( i \right) } \right) }^{ 2 } }$$

LMS (Least Mean Squares) is a method for finding a suitable $\theta$ that minimizes the cost $J(\theta)$.

LMS algorithm์€ GDGradient Descent๋ฅผ ํ†ตํ•ด ๋งค iteration๋งˆ๋‹ค $\theta$๋ฅผ ์—…๋ฐ์ดํŠธํ•œ๋‹ค.

LMS is named after the form of the cost $J(\theta)$1; GD merely serves as the iterative update algorithm within LMS.

$${\theta}_j := {\theta}_j - \alpha \cfrac {\partial}{\partial {\theta}_j}J(\theta)$$

Here $:=$ denotes assignment2, and $\alpha$ is the learning rate.

The term $\cfrac {\partial}{\partial {\theta}_j}J(\theta)$ is obtained by simply taking the partial derivative.
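
For a single training example $\left( x^{(i)}, y^{(i)} \right)$ (so the sum over $i$ reduces to one term), carrying out that derivative yields the per-example LMS update rule:

$$\cfrac {\partial}{\partial {\theta}_j}J(\theta) = \left( { h }_{ \theta }\left( { x }^{ \left( i \right) } \right) - { y }^{ \left( i \right) } \right) x_j^{(i)} \quad \Longrightarrow \quad {\theta}_j := {\theta}_j + \alpha \left( { y }^{ \left( i \right) } - { h }_{ \theta }\left( { x }^{ \left( i \right) } \right) \right) x_j^{(i)}$$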

batch GD vs. stochastic GD

Batch GD computes the cost over the entire training set in order to update $\theta$ once.

Stochastic GD (= SGD) looks at a single training example and updates $\theta$ once.
→ Each SGD update is far cheaper, so SGD usually makes progress toward the minimum much faster than batch GD, especially on large training sets!

In SGD, we update the parameters according to the gradient of the error with respect to that single training example only. Note, however, that it may never “converge” to the minimum, and the parameters $\theta$ will keep oscillating around the minimum of $J(\theta)$; in practice, though, most of the values near the minimum will be reasonably good approximations to the true minimum.
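
To make the contrast concrete, here is a minimal NumPy sketch of both update schemes (the function names, learning rate, and epoch count are illustrative choices, not taken from the lecture):

```python
import numpy as np

def batch_gd(X, y, lr=0.01, epochs=100):
    """Batch GD: one update of theta per full pass over the training set."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y)      # gradient of J over all m examples
        theta -= lr * grad
    return theta

def sgd(X, y, lr=0.01, epochs=100):
    """Stochastic GD: one update of theta per single training example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(m):
            grad_i = (X[i] @ theta - y[i]) * X[i]   # gradient from example i only
            theta -= lr * grad_i
    return theta
```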

2. Normal Equation

:warning: only works for ‘Linear’ regression :warning:

We minimize $J(\theta)$ by explicitly taking its derivatives with respect to the ${\theta}_j$’s and setting them to zero.

Matrix Derivatives

Consider a matrix function $f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$, which maps a matrix to a real number.

The cost function $J(\theta)$ is a representative example of such a function.

$$\nabla_{\theta} J(\theta) = { \left[ \cfrac{\partial J}{\partial \theta_0} \cfrac{\partial J}{\partial \theta_1} ... \cfrac{\partial J}{\partial \theta_n}\right] }^{T}$$

์ด๋ฒˆ์—” ๋˜๋‹ค๋ฅธ matrix function์ธ $\textrm{tr} A$๋ฅผ ์‚ดํŽด๋ณด์ž.

$$\textrm{tr} A = \sum_{i=1}^{n}{A_{ii}}$$

While the cost function $J(\theta)$ takes a column vector in $\mathbb{R}^{n \times 1}$ as input, $\textrm{tr} A$ takes a square matrix in $\mathbb{R}^{n \times n}$ as input. The derivative of such a matrix function $f$ with respect to $A$ has the following form.

$$\nabla _{ A }f\left( A \right) = \begin{pmatrix} \cfrac { \partial f }{ \partial A_{ 11 } } & \cdots & \cfrac { \partial f }{ \partial A_{ 1n } } \\ \vdots & \ddots & \vdots \\ \cfrac { \partial f }{ \partial A_{ n1 } } & \cdots & \cfrac { \partial f }{ \partial A_{ nn } } \end{pmatrix}$$

Let’s look at some derivative identities involving $\textrm{tr} A$.

(1) $\nabla _{ A } \textrm{tr} \left( AB \right) = B^T$
(2) $\nabla _{ {A}^{T} } f\left( A \right) = {\left( \nabla _{ A }f\left( A \right) \right)}^{T}$
(3) $\nabla _{ A } \textrm{tr} \left( AB{A^T}C \right) = CAB + {C^T}A{B^T}$
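
Identity (3) is easy to sanity-check numerically by comparing the analytic gradient with a finite-difference approximation; the NumPy sketch below does exactly that (the matrix size, random seed, and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
C = rng.normal(size=(n, n))

# f(A) = tr(A B A^T C), viewed as a function of A only
f = lambda M: np.trace(M @ B @ M.T @ C)

# Central finite-difference approximation of df/dA_ij
eps = 1e-6
num_grad = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        num_grad[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

analytic = C @ A @ B + C.T @ A @ B.T    # identity (3)
print(np.allclose(num_grad, analytic, atol=1e-5))   # True
```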

Least Squares Method

์šฐ๋ฆฌ์˜ ๋ชฉํ‘œ๋Š” global minimum์„ ์ฐพ๋Š” ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ

$$\nabla_{\theta} f(\theta) = 0$$

์ด ๋˜๋Š” $\theta$๋ฅผ ์ฐพ์•„์•ผ ํ•œ๋‹ค.

global minimum์„ ์ฐพ๊ธฐ ์œ„ํ•ด training set์„ ๋ชจ๋‘ ๋ชจ์€ ํ–‰๋ ฌ $X$๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜ํ•˜์ž.

$$X = \begin{bmatrix} { \left( { x }^{ (1) } \right) }^{ T } \\ { \left( { x }^{ (2) } \right) }^{ T } \\ \vdots \\ { \left( { x }^{ (m) } \right) }^{ T } \end{bmatrix}$$

ํ–‰๋ ฌ $X$์˜ ํ–‰ ํ•˜๋‚˜ํ•˜๋‚˜๊ฐ€ ํ•˜๋‚˜์˜ training vector๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

$$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

The column vector $\vec{y}$ collects the target values of the entire training set.

Now, rewriting the cost function $J(\theta)$ in matrix form gives the following.

$$J(\theta) = \cfrac{1}{2} {\left( X\theta - \vec{y} \right)}^{T} \left( X\theta - \vec{y} \right)$$
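
As a quick check that this matrix form agrees with the sum form of $J(\theta)$ given earlier, the two can be compared on random data (a throwaway NumPy sketch; the shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n))       # design matrix: one training example per row
y = rng.normal(size=m)            # target values
theta = rng.normal(size=n)

# Sum form: (1/2) * sum_i (h_theta(x^(i)) - y^(i))^2
J_sum = 0.5 * np.sum((X @ theta - y) ** 2)

# Matrix form: (1/2) * (X theta - y)^T (X theta - y)
r = X @ theta - y
J_mat = 0.5 * (r @ r)

print(np.isclose(J_sum, J_mat))   # True
```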

์ด๊ฒƒ์„ ํŽธ๋ฏธ๋ถ„ ํ•˜๋ฉด,

$$ \begin{split} \nabla_{\theta} J(\theta) &= \nabla_{\theta} \cfrac{1}{2} {\left( X\theta - \vec{y} \right)}^{T} \left( X\theta - \vec{y} \right) \\ &= \cfrac{1}{2} \nabla_{\theta} \left( {\theta}^{T} X^T - \vec{y}^T \right) \left( X\theta - \vec{y} \right) \end{split} $$

and expanding by the distributive law,

$$ \begin{split} &= \cfrac{1}{2} \nabla_{\theta} \left( {\theta}^{T} {X^T} X \theta - {\theta}^{T} {X^T} \vec{y} - {\vec{y}^T} X \theta + {\vec{y}^T} \vec{y} \right) \end{split} $$

cost function $J(\theta)$์˜ ๊ฐ’์ด scalar์ด๊ธฐ ๋•Œ๋ฌธ์—, $\textrm{tr}$์„ ์ทจํ•ด๋„ ๊ทธ๋Œ€๋กœ scalar์ด๋‹ค.

$$ \begin{split} &= \cfrac{1}{2} \nabla_{\theta} \textrm{tr} \left( {\theta}^{T} {X^T} X \theta - {\theta}^{T} {X^T} \vec{y} - {\vec{y}^T} X \theta + {\vec{y}^T} \vec{y} \right) \end{split} $$

By the linearity of the trace, $\textrm{tr}$ can be pushed inside the parentheses:

$$ \begin{split} &= \cfrac{1}{2} \nabla_{\theta} \left( \textrm{tr} \left( {\theta}^{T} {X^T} X \theta \right) - \textrm{tr} \left( {\theta}^{T} {X^T} \vec{y} \right) - \textrm{tr} \left( {\vec{y}^T} X \theta \right) + \textrm{tr} \left( {\vec{y}^T} \vec{y} \right) \right) \end{split} $$

Here $\textrm{tr} \left( {\theta}^{T} {X^T} \vec{y} \right) = \textrm{tr} \left( {\vec{y}^T} X \theta \right)$ since $\textrm{tr} A = \textrm{tr} A^T$, and ${\vec{y}^T} \vec{y}$ does not depend on $\theta$, so its gradient vanishes. This simplifies the expression to

$$ \begin{split} &= \cfrac{1}{2} \nabla_{\theta} \left( \textrm{tr} \left( {\theta}^{T} {X^T} X \theta \right) - 2 \textrm{tr} \left( {\vec{y}^T} X \theta \right) \right) \end{split} $$

์ด๊ฒƒ์„ ๋ฏธ๋ถ„ํ•˜๋ฉด,

$$ \begin{split} &= \cfrac{1}{2} \left( {X^T} X \theta + {X^T} X \theta - 2 {X^T} \vec{y} \right) \\ & = {X^T} X \theta - {X^T} \vec{y} \end{split} $$

์ด ๊ณผ์ •์—์„œ (3) $\nabla _{ A } \textrm{tr} \left( AB{A^T}C \right) = CAB + {C^T}A{B^T}$ ๊ณต์‹์ด ์‚ฌ์šฉ๋˜์—ˆ๋‹ค.

$$ \begin{split} \nabla _{ A } \textrm{tr} \left( AB{A^T}C \right) &= CAB + {C^T}A{B^T} \\ &= AB + A{B^T} \\ & \left( C = I, A = \theta, B = {X^T}X \right) \end{split} $$

We have finally obtained $\nabla_{\theta} J(\theta)$ in matrix form. Setting it equal to zero,

$$\nabla_{\theta} J(\theta) = {X^T} X \theta - {X^T} \vec{y} = 0$$

์šฐ๋ณ€์„ ์ •๋ฆฌํ•˜๋ฉด,

$${X^T} X \theta = {X^T} \vec{y}$$

This equation is called the Normal Equation.

Solving for $\theta$ (assuming ${X^T} X$ is invertible),

$$\theta = {\left( {X^T} X \right)}^{-1} {X^T} \vec{y}$$

๋”ฐ๋ผ์„œ, cost function $J(\theta)$๋ฅผ minimize ํ•˜๋Š” $\theta$๋Š” ${\left( {X^T} X \right)}^{-1} {X^T} \vec{y}$ ์ด๋‹ค. โ– 


  1. LMS algorithm์€ Widrow-Hoff leraning rule์ด๋ผ๊ณ ๋„ ๋ถˆ๋ฆผ. ํ˜•ํƒœ๋Š” ์™„์ „ ๋™์ผํ•จ.ย 

  2. $:=$ is used in the sense of assignment.