This is a post I wrote for personal use while studying "Machine Learning". Corrections are always welcome :)

๋ณธ ๊ธ€์„ ์ฝ๊ธฐ ์ „์— โ€œDistribution over functions & Gaussian Processโ€œ์— ๋Œ€ํ•œ ๊ธ€์„ ๋จผ์ € ์ฝ๊ณ  ์˜ฌ ๊ฒƒ์„ ๊ถŒํ•ฉ๋‹ˆ๋‹ค ๐Ÿ˜‰


In "Distribution over functions & Gaussian Process", we looked at how a Gaussian Process models a probability distribution over functions. In this post, we look at how a distribution over functions is used under the paradigm of Bayesian Regression 🙌


Gaussian Process Regression

First, let's set things up for <Gaussian Process Regression; hereafter GP Regression>.

The train set $S = \{ (x_i, y_i)\}^m_{i=1} = (X, y)$ is a collection of i.i.d. samples drawn from an unknown distribution. GP Regression builds the regression model as follows.

\[y_i = h(x_i) + \epsilon_i \quad (i = 1, \dots, m)\]

Here, $\epsilon_i$ is i.i.d. noise with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

Note the difference from <Bayesian Regression>, which modeled $y_i = \theta^T x_i + \epsilon_i$.
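
To make this setup concrete, here is a minimal sketch that draws a train set from the model above. Everything in it is a hypothetical choice for illustration; in particular the true function $h$ (here $\sin$) is exactly what we would not know in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: pretend the unknown function h is sin(x).
m = 20         # number of training samples
sigma = 0.1    # noise standard deviation
X = np.sort(rng.uniform(-3.0, 3.0, size=m))  # inputs x_1, ..., x_m
h = np.sin(X)                                # h(x_i), unknown in practice
y = h + sigma * rng.normal(size=m)           # y_i = h(x_i) + eps_i, eps_i ~ N(0, sigma^2)
```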


Now we introduce an assumption on $h(\cdot)$: a prior distribution over functions.1 If you noticed the word 'prior', you will also have guessed that we will update it into a 'posterior' 🙌 First, assume that $h(\cdot)$ is a zero-mean GP.

\[h(\cdot) \sim \mathcal{GP}(0, \; k(\cdot, \cdot))\]

※ NOTE: $k(x, x')$ is a valid covariance function.
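
To see what this prior means in practice, here is a minimal numpy sketch (my own illustration): on any finite grid of inputs, $h \sim \mathcal{GP}(0, k)$ restricts to a zero-mean multivariate Gaussian whose covariance is the kernel matrix. It uses the squared exponential kernel that reappears in the Insights section below; the jitter term is an assumption added for numerical stability.

```python
import numpy as np

def k_se(x1, x2, tau=1.0):
    """Squared exponential kernel k(x, x') = exp(-(x - x')^2 / (2 tau^2))."""
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / tau) ** 2)

# On a finite grid, h ~ GP(0, k) becomes h_grid ~ N(0, K(grid, grid)).
grid = np.linspace(-3.0, 3.0, 100)
K = k_se(grid, grid) + 1e-8 * np.eye(grid.size)  # jitter for numerical stability
L = np.linalg.cholesky(K)                        # K = L L^T
samples = L @ np.random.default_rng(1).normal(size=(grid.size, 3))  # 3 prior draws
```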


์ด๋ฒˆ์—๋Š” $S$์™€ ๋™์ผํ•œ unknown distribution์—์„œ ์ถ”์ถœํ•œ i.i.d. sample์˜ ๋ชจ์ž„์ธ test set \(T = \left\{ x^{*}_i, y^{*}_i\right\}^{m_{*}}_{i=1} = (X^{*}, y^{*})\)๋ฅผ ์‚ดํŽด๋ณด์ž. ์ด์ „์˜ <Bayesian Regression>์—์„œ๋Š” Bayesโ€™ rule์„ ์ด์šฉํ•ด <parameter posterior> $p(\theta \mid S)$๋ฅผ ์œ ๋„ํ•˜๊ณ , ์ด๊ฒƒ์„ ํ†ตํ•ด <posterior predictive distribution> $p(y^{*} \mid x^{*}, S)$๋ฅผ ์œ ๋„ํ–ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ GP Regression์—์„œ๋Š” ํ›จ์”ฌ ์‰ฌ์šด ๋ฐฉ๋ฒ•์œผ๋กœ posterior predictive distribution์„ ์œ ๋„ํ•  ์ˆ˜ ์žˆ๋‹ค!! ๐Ÿ˜ฒ


Prediction

We defined the prior distribution over functions $h(\cdot) \sim \mathcal{GP}(0, \; k(\cdot, \cdot))$. By the properties of a GP, for the subsets $X, X^{*} \subset \mathcal{X}$, the joint distribution $p(\vec{h}, \vec{h^{*}} \mid X, X^{*})$ is as follows.

\[\begin{bmatrix} \vec{h} \\ \vec{h^{*}} \end{bmatrix} \mid X, X^{*} \sim \mathcal{N} \left( \vec{0}, \; \begin{bmatrix} K(X, X) & K(X, X^{*}) \\ K(X^{*}, X) & K(X^{*}, X^{*}) \end{bmatrix}\right)\]

A lot of matrix-form notation just appeared: $K(X, X')$ denotes the kernel matrix with entries $[K(X, X')]_{ij} = k(x_i, x'_j)$, and $\vec{h}$, $\vec{h^{*}}$ stack the function values at the train and test inputs 🙏

Also, the following holds for the i.i.d. noise.

\[\begin{bmatrix} \vec{\epsilon} \\ \vec{\epsilon^{*}} \end{bmatrix} \sim \mathcal{N} \left( \vec{0}, \; \begin{bmatrix} \sigma^2 I & O \\ O & \sigma^2 I \end{bmatrix}\right)\]

์ด์ œ ์ด๊ฑธ ์ข…ํ•ฉํ•˜๋ฉด,

\[\begin{bmatrix} \vec{y} \\ \vec{y^{*}} \end{bmatrix} \mid X, X^{*} = \begin{bmatrix} \vec{h} \\ \vec{h^{*}} \end{bmatrix} \mid X, X^{*} + \begin{bmatrix} \vec{\epsilon} \\ \vec{\epsilon^{*}} \end{bmatrix}\]

๊ฐ€ ๋˜๋Š”๋ฐ, independent Gaussian random variable์˜ ํ•ฉ์€ ์—ญ์‹œ Gaussian์ด๋ฏ€๋กœ

\[\begin{bmatrix} \vec{y} \\ \vec{y^{*}} \end{bmatrix} \mid X, X^{*} \sim \mathcal{N} \left(\vec{0} , \; \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X^{*}) \\ K(X^{*}, X) & K(X^{*}, X^{*}) + \sigma^2 I \end{bmatrix}\right)\]

์œ„์˜ ์‹์€ $p(\vec{y}, \vec{y^{*}} \mid X, X^{*})$์— ๋Œ€ํ•œ ์‹์œผ๋กœ โ€œjoint distribution of the observed values and testing pointsโ€์ด๋‹ค. regression์€ testing points์— ๋Œ€ํ•œ ๋ถ„ํฌ๋ฅผ ์›ํ•˜๋ฏ€๋กœ conditional distribution $p(\vec{y^{*}}, \mid \vec{y}, X, X^{*})$์„ ๊ตฌํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\vec{y^{*}} \mid \vec{y}, X, X^{*} \sim \mathcal{N} \left( \mu^{*}, \; \Sigma^{*} \right)\]

where $K = K(X, X)$, $K^{*} = K(X, X^{*})$, and $K^{**} = K(X^{*}, X^{*})$:

\[\begin{aligned} \mu^{*} &= {K^{*}}^T \left( K + \sigma^2 I \right)^{-1}\vec{y} \\ \Sigma^{*} &= K^{**} + \sigma^2 I - {K^{*}}^T \left( K + \sigma^2 I \right)^{-1} K^{*} \end{aligned}\]

์œ ๋„ ๊ณผ์ •์€ conditional distribution of multi-variate Gaussiaion distribution์— ๋Œ€ํ•œ ์‹์„ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•˜๋ฉด ๋œ๋‹ค. ๐Ÿ™Œ

Boom! With this, we have obtained the posterior predictive distribution!! 🤩 Compared with the result for Bayesian Linear Regression, GP Regression turns out to have a really simple form, computationally as well 👍

Supplement

์•ž์—์„œ $h(\cdot)$๊ฐ€ โ€˜priorโ€™ distribution over functions ๋ผ๊ณ  ํ–ˆ๋‹ค. ๊ทธ๋Ÿผ โ€˜posteriorโ€™ distribution over function์„ ์œ ๋„ํ•˜๋ฉด, ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ joint distribution $p(\vec{h}, \vec{h^{*}} \mid X, X^{*})$์—์„œ conditional distribution์„ ๊ตฌํ•˜๋ฉด ๋œ๋‹ค.

\[\begin{bmatrix} \vec{h} \\ \vec{h^{*}} \end{bmatrix} \mid X, X^{*} \sim \mathcal{N} \left( \vec{0}, \; \begin{bmatrix} K(X, X) & K(X, X^{*}) \\ K(X^{*}, X) & K(X^{*}, X^{*}) \end{bmatrix}\right)\]

Then the conditional distribution is

\[\vec{h^{*}} \mid \vec{h}, X, X^{*} \sim \mathcal{N} \left( {K^{*}}^T K^{-1} \vec{h}, \; K^{**} - {K^{*}}^TK^{-1}K^{*}\right)\]

๊ฐ€ ๋œ๋‹ค. ์ด๊ฒƒ์ด posterior distribution over function $h(\cdot \mid X)$์ด๋‹ค!


Insights

In this section we look at some insights into GP Regression. Like locally-weighted linear regression, GP Regression is a non-parametric regression model. This means we need no linearity or polynomial assumption about the function of the input data, and we can handle arbitrary functions! 🤩 The "Statistical Data Mining (IMEN472)" course I took in university did cover non-parametric models, but not <GP Regression>.


GP์—์„œ ์‚ฌ์šฉํ–ˆ๋˜ <squared exponential kernel> $k_{SE}(x, xโ€™)$์— ๋Œ€ํ•ด ์‚ดํŽด๋ณด์ž.

\[k_{SE}(x, x') = \exp \left( - \frac{1}{2\tau^2} (x - x')^2 \right) \quad (\tau > 0)\]

The hyper-parameter $\tau$ controls smoothness: the smaller $\tau$ is, the more the model attends mainly to nearby samples, so the model fluctuates more. Conversely, as $\tau$ grows, faraway samples are also reflected, so the model becomes smoother, as the quick illustration below shows.
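
A short numeric illustration (the distances and $\tau$ values are arbitrary choices): the kernel value is the prior correlation between $h(x)$ and $h(x')$, and a small $\tau$ makes it die off quickly with distance.

```python
import numpy as np

# Prior correlation k_SE(x, x') as a function of the distance |x - x'|.
d = np.array([0.1, 0.5, 1.0, 2.0])  # distances |x - x'|
for tau in (0.3, 1.0, 3.0):
    # small tau: correlation vanishes fast  -> wiggly functions
    # large tau: distant points stay linked -> smoother functions
    print(f"tau={tau}:", np.round(np.exp(-0.5 * (d / tau) ** 2), 3))
```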


๋‹ค์Œ์œผ๋กœ regression noise์ธ $\sigma$(๊ทธ๋ฆผ์—์„œ๋Š” $\sigma_y$)๊ฐ€ ์žˆ๋‹ค. ์ด ๋…€์„์€ uncertainty์˜ ์ •๋„๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ $\sigma$ ๊ฐ’์ด ํด์ˆ˜๋ก ๋ฐ์ดํ„ฐ์˜ noise๊ฐ€ ํฌ๋‹ค๊ณ  ํŒ๋‹จํ•œ๋‹ค.


๋งบ์Œ๋ง

So far we have looked at <GP Regression>: a Bayesian regression model that is, at the same time, a non-parametric model. What's more, if you apply <GP Regression> to anomaly detection, you can do anomaly detection as unsupervised learning, without labeling an anomaly set 😁 Personally, GP Regression gave great performance and plausible results in practice, so I was quite satisfied 👍 Wikipedia calls GP Regression "kriging", and reading that article, it covers GP Regression in more depth and breadth. If you are curious to learn more about GP Regression, give that article a read! 🙌

๐Ÿ‘‰ Kriging(GP Regression)


๋‹ค์Œ ์‹œ๋ฆฌ์ฆˆ๋กœ๋Š” MCMC(Markov Chain Monte Carlo)๋ฅผ ์ƒ๊ฐํ•˜๊ณ  ์žˆ๋‹ค. ๋Œ€ํ•™์—์„œ โ€˜์ธ๊ณต์ง€๋Šฅโ€™ ๊ณผ๋ชฉ ๋“ค์„ ๋•Œ ๋ณด๊ธด ํ–ˆ๋Š”๋ฐ ๊ทธ๋•Œ๋Š” ์ œ๋Œ€๋กœ ์ดํ•ด๋ฅผ ๋ชป ํ–ˆ์—ˆ๋‹ค ๐Ÿ˜ฅ




  1. Bayesian Linear Regression also used a prior distribution, but there it was a <parameter prior> on the parameter $\theta$! Unlike Bayesian Regression, however, GP Regression is a non-parametric model that uses no parameter $\theta$!! We compare the two once more in a later section 😉