This is a post I wrote for personal use while studying "Machine Learning". Corrections are always welcome :)



๋ณธ ๊ธ€์„ ์ฝ๊ธฐ ์ „์— โ€œRandom Processโ€œ์— ๋Œ€ํ•œ ๊ธ€์„ ๋จผ์ € ์ฝ๊ณ  ์˜ฌ ๊ฒƒ์„ ๊ถŒํ•ฉ๋‹ˆ๋‹ค ๐Ÿ˜‰

Introduction to Gaussian Process

First, let's review the <Gaussian distribution>.

1. 1D Gaussian Distribution

\[f(x) = \frac{1}{\sqrt{2\pi} \sigma} \exp \left( - \frac{(x-\mu)^2}{2\sigma^2}\right)\]

2. 2D Gaussian Distribution

\[f(\mathbf{x}) = \frac{1}{2\pi \left| \Sigma \right|^{1/2}} \exp \left( - \frac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right)\]

3. Multi-variate Gaussian Distribution

Distribution over random vectors!

\[f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} \left| \Sigma \right|^{1/2}} \exp \left( - \frac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right)\]
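The density above is straightforward to evaluate in code. Below is a minimal NumPy sketch (the function name `mvn_pdf` is my own choice, not from any library); as a sanity check, the $n = 1$ case reduces to the 1D formula:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of an n-dimensional Gaussian N(mu, Sigma) at point x."""
    n = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) via a linear solve
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# With n = 1, mu = 0, sigma = 1 this reduces to 1/sqrt(2*pi) ≈ 0.3989
p1 = mvn_pdf(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
print(p1)
```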

Now let's look at the definition of a <Gaussian Process>.

Definition. Gaussian Process

A sequence of Gaussian distributions! The <Gaussian Process> is a generalization of the multi-variate Gaussian distribution. It is a distribution over random functions!

\[\mathcal{GP} (m(x), k(x, x'))\]
  • distribution over random functions
    • mean function $m(x)$1
    • covariance function $k(x, x')$
  • Every finite subset of RVs in GP is a multi-variate Gaussian distribution!2

์œ„์˜ ์ •์˜๋งŒ ๋ด์„œ๋Š” <Gaussian Process>๊ฐ€ ๋ญ”์ง€ ์ž˜ ์ดํ•ด๊ฐ€ ์•ˆ ๋œ๋‹ค ๐Ÿ˜ฅ ๋จผ์ € โ€œdistribution over random functionsโ€๋ผ๋Š” ํ‘œํ˜„๋ถ€ํ„ฐ ์ดํ•ดํ•ด๋ณด์ž. <random function>์ด๋ผ๋Š” ๋‚ฏ์„  ๊ฐœ๋…์ด ๋“ฑ์žฅํ–ˆ๋Š”๋ฐ <random variable>๊ณผ๋Š” ๋‹ค๋ฅธ ๊ฒƒ์ผ๊นŒ?

Definition. random function

Let $\mathcal{H}$ be a class of functions mapping $\mathcal{X} \rightarrow \mathcal{Y}$. A random function $h(\cdot)$ from $\mathcal{H}$ is a function which is randomly drawn from $\mathcal{H}$, according to some probability distribution over $\mathcal{H}$.

Once a random function $h(\cdot)$ is selected from $\mathcal{H}$ probabilistically, it implies a deterministic mapping from inputs in $\mathcal{X}$ to outputs in $\mathcal{Y}$.

์œ„์—์„œ ์ •์˜ํ•œ <random function>์€ ๋‹จ์ˆœํžˆ random number๋ฅผ ์ถœ๋ ฅํ•˜๋Š” ํ•จ์ˆ˜๊ฐ€ ์•„๋‹ˆ๋‹ค! ๐Ÿ‘Š


Probability distribution over functions with finite domains

To see how a probability distribution can be defined over functions, let's start with the simple case where $\mathcal{X}$ is a finite set.

Let $\mathcal{X} = \{x_1, \dots, x_m\}$ be any finite set of elements. Now consider the set $\mathcal{H}$ of all possible functions mapping from $\mathcal{X}$ to $\mathbb{R}$.

Since the domain of any $h(\cdot) \in \mathcal{H}$ has only $m$ elements, we can represent $h(\cdot)$ as an $m$-dimensional vector, $\vec{h} = [h(x_1), \dots, h(x_m)]^T$.

In order to specify a probability distribution over functions $h(\cdot)$, we must associate some 'probability density' with each function in $\mathcal{H}$. Note that we have represented the function $h(\cdot)$ as a vector $\vec{h}$. We can then place a probability distribution on it, e.g. a Gaussian, as follows

\[\vec{h} \sim \mathcal{N} \left( \vec{\mu}, \; \sigma^2 I \right)\]

Boom! This implies a probability distribution over functions $h(\cdot)$, whose probability density function is given by

\[p(h) = \prod^m_{i=1} \frac{1}{\sqrt{2\pi} \sigma} \exp \left( - \frac{1}{2\sigma^2} (h(x_i) - \mu_i)^2 \right)\]
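The construction above is easy to state in code. A hypothetical sketch in which a "function" on a finite domain is literally an $m$-vector drawn from $\mathcal{N}(\vec{\mu}, \sigma^2 I)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite domain X = {x_1, ..., x_m}: a function h is just an m-vector.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
mu = np.zeros(len(X))  # mean of each h(x_i)
sigma = 1.0

# Drawing a vector from N(mu, sigma^2 I) *is* drawing a random function h.
h_vec = rng.normal(mu, sigma)

# Once drawn, h is a deterministic mapping: look up h(x_i) by its input.
h = dict(zip(X, h_vec))
print(h[2.0])  # the (fixed) value h(x_3)
```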

์œ„์˜ finite domain์˜ ์˜ˆ์‹œ๋ฅผ ํ†ตํ•ด ์šฐ๋ฆฌ๋Š” prob. distribution over functions with finite domains๊ฐ€ finite-dimensional multi-variate Gaussian์œผ๋กœ ํ‘œํ˜„๋จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค! ๐Ÿ˜ฒ ์—ฌ๊ธฐ์„œ function domain $\mathcal{X}$๋ฅผ infinite dimension์œผ๋กœ ํ™•์žฅํ•˜๋ฉด, ์šฐ๋ฆฌ๋Š” <Gaussian Process>๋ฅผ ์–ป๊ฒŒ ๋œ๋‹ค! ๐Ÿ’ช


Probability distribution over functions with infinite domains

This time, let's take a collection of points from $\mathcal{X}$ and define the set of random variables $\{ h(x) : x \in \mathcal{X}\}_m$. Since $h(\cdot)$ is a random function, chosen probabilistically, each $h(x)$ is itself a random variable 😉 Note that we have not restricted the domain set $\mathcal{X}$ here: it may be a finite domain as before, or an infinite set such as $\mathbb{R}$.

With the finite collection of random variables $\{ h(x) : x \in \mathcal{X}\}_m$ we can define a multi-variate Gaussian distribution. The mean function $m(x)$ and covariance function $k(x, x')$, both defined on the domain $\mathcal{X}$, can be written as follows.

\[\begin{aligned} m(x) &= E \left[h(x)\right] \\ k(x, x') &= E \left[ (h(x) - m(x)) (h(x') - m(x'))\right] \end{aligned}\]

๋”ฐ๋ผ์„œ collection of random variable $\{ h(x) : x \in \mathcal{X}\}_m$ ์œ„์—์„œ์˜ multi-variate Gaussian distribution์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\vec{h}_m = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_m) \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_m) \end{bmatrix} ,\; \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \dots & k(x_m, x_m) \end{bmatrix} \right)\]

<Gaussian Process>์˜ ์ •์˜๋ฅผ ๋‹ค์‹œ ๋– ์˜ฌ๋ ค๋ณด์ž.

"A Gaussian process is a stochastic process s.t. any finite subcollection of random variables has a multivariate Gaussian distribution."

Boom! The procedure above, building a suitable multi-variate Gaussian distribution from a collection of random variables drawn from the domain $\mathcal{X}$, was in fact reproducing the definition of a <Gaussian Process>! Generalizing the expression derived from the finite collection, we can write a <Gaussian Process> as follows.

\[h(\cdot) \sim \mathcal{GP} (m(\cdot), \; k(\cdot, \cdot))\]

Just as we understood $h(x)$ on a finite domain as a finite random vector, $h(x)$ on an infinite domain can be understood as an infinite random vector! 🙌


mean & covariance function for GP

We now understand that a GP is a distribution over random functions, and equivalently a distribution over infinite random vectors. Our next interest is the mean function $m(x)$ and the covariance function $k(x, x')$ of a GP 🙌 The previous section did write down definitions for $m(x)$ and $k(x, x')$, but since $h(\cdot)$ is a random function, those definitions alone give little intuition about what kind of functions $m(x)$ and $k(x, x')$ actually are.

In general, the mean function $m(x)$ can be any real-valued function. The covariance function $k(x, x')$, however, must be such that the covariance matrix obtained by marginalizing the GP satisfies the properties of a covariance matrix, such as being positive semi-definite.

For a covariance function $k(x, x')$ and any set of elements $x_1, \dots, x_m \in \mathcal{X}$, the resulting matrix must satisfy the properties of a covariance matrix.

\[K = \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \dots & k(x_m, x_m) \end{bmatrix}\]

For example, $K$ must be symmetric, i.e. $k(x, x') = k(x', x)$, and positive semi-definite.

์œ„์˜ ์กฐ๊ฑด์„ ๋ณด๋ฉด ์œ ํšจํ•œ $k(x, xโ€™)$๋ฅผ ์ฐพ๋Š” ๊ฒƒ์€ ๊นŒ๋งˆ๋“ํ•ด ๋ณด์ธ๋‹ค ๐Ÿ˜ฅ ๊ทธ.๋Ÿฌ.๋‚˜. Chuong B. Do์˜ ์•„ํ‹ฐํด์— ๋”ฐ๋ฅด๋ฉด valid convariance function์— ๋Œ€ํ•œ ์กฐ๊ฑด์ด ๊ณง <Mercerโ€™s theorem; ๋จธ์„œ์˜ ์ •๋ฆฌ>์—์„œ ์š”๊ตฌํ•˜๋Š” kernel์˜ ์กฐ๊ฑด๊ณผ ๋™์ผํ•˜๋‹ค๊ณ  ๋งํ•œ๋‹ค! ๐Ÿ˜ฒ ๊ทธ๋ž˜์„œ <Mercerโ€™s theorem>์ด ๋ณด์žฅํ•˜๋Š” valid kernel function $k(x, xโ€™)$๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด convariance์˜ ์„ฑ์งˆ์„ ๊ณ ๋ฏผํ•˜์ง€ ์•Š๊ณ ๋„ convariance function $k(x, xโ€™)$๋ฅผ ์ •์˜ํ•  ์ˆ˜ ์žˆ๋‹ค!! ๐Ÿคฉ ์•ž์œผ๋กœ๋Š” convariance function ๋Œ€์‹  โ€œkernel functionโ€์ด๋ผ๋Š” ํ‘œํ˜„์„ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๋‹ค.


zero-mean GP

To get more comfortable with GPs, let's look at a simple example: the zero-mean Gaussian process, with mean function $m(x) = 0$.

\[h(\cdot) \sim \mathcal{GP}(0, \; k(\cdot, \cdot))\]

Here, $h$ is a function $h: \mathbb{R} \rightarrow \mathbb{R}$, and for the kernel function $k(\cdot, \cdot)$ we use the <squared exponential kernel function>3.

\[k_{SE}(x, x') = \exp \left( - \frac{1}{2\tau^2} (x - x')^2 \right) \quad (\tau > 0)\]

So what do functions $h(x)$ sampled from this GP look like? First, since the mean of the function values is 0, we can expect the values to be distributed around 0. Also, for two elements $x, x' \in \mathcal{X}$:

  • If $x$ and $x'$ are nearby, then $k_{SE}(x, x') \approx 1$, so $h(x)$ and $h(x')$ have high covariance.
  • Conversely, if $x$ and $x'$ are far apart, then $k_{SE}(x, x') \approx 0$, so $h(x)$ and $h(x')$ have low covariance.
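These two regimes are easy to verify numerically (a small sketch with $\tau = 1$):

```python
import numpy as np

def k_se(x, xp, tau=1.0):
    """Squared-exponential kernel k_SE(x, x')."""
    return np.exp(-0.5 * (x - xp) ** 2 / tau ** 2)

near = k_se(1.0, 1.1)  # nearby inputs  -> ≈ 0.995, close to 1
far = k_se(1.0, 9.0)   # distant inputs -> ≈ 1.3e-14, close to 0
print(near, far)
```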

Based on this idea, actually sampling from this GP produces smooth functions that fluctuate around 0.
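A minimal sketch of that sampling procedure, assuming NumPy: evaluate the GP on a grid (any finite marginal is a multivariate Gaussian with kernel matrix $K$) and draw sample functions from $\mathcal{N}(0, K)$:

```python
import numpy as np

rng = np.random.default_rng(42)

def k_se(x, xp, tau=1.0):
    """Squared-exponential kernel."""
    return np.exp(-0.5 * (x - xp) ** 2 / tau ** 2)

# A finite grid of inputs: the GP marginal on this grid is N(0, K).
xs = np.linspace(0.0, 5.0, 100)
K = k_se(xs[:, None], xs[None, :])

# Draw three sample functions; the tiny jitter on the diagonal keeps
# the covariance numerically positive definite.
samples = rng.multivariate_normal(
    np.zeros(len(xs)), K + 1e-10 * np.eye(len(xs)), size=3
)
print(samples.shape)  # (3, 100): three smooth curves fluctuating around 0
```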


์ž! ์—ฌ๊ธฐ๊นŒ์ง€ <Gaussian Process>์— ๋Œ€ํ•ด ์‚ดํŽด๋ณด์•˜๋‹ค. distribution over random vector์˜ ๊ฐœ๋…์„ ํ™•์žฅํ•œ distribution over random function ๊ทธ๋ฆฌ๊ณ  ๊ทธ๊ฒƒ์„ infinite dimension๊นŒ์ง€ ํ™•์žฅํ•œ Gaussian Process๊นŒ์ง€!! ์ด๋ฒˆ ํฌ์ŠคํŠธ์—์„œ ๋‹ค๋ฃฌ ๋‚ด์šฉ์ด ๊ฒฐ์ฝ” ์‰ฝ์ง€๋Š” ์•Š์ง€๋งŒ, ๊ณต๋ถ€ํ•  ๊ฐ€์น˜๋Š” ์ถฉ๋ถ„ํ•œ ์ฃผ์ œ์˜€๋‹ค ๐Ÿ’ช

๋‹ค์Œ ํฌ์ŠคํŠธ์—์„  GP๋ฅผ ์ด์šฉํ•ด Regression model์„ ๋งŒ๋“œ๋Š” <Gaussian Process Regression>์— ๋Œ€ํ•ด ์‚ดํŽด๋ณธ๋‹ค!!

👉 Gaussian Process Regression


references


  1. ์ด์ „์˜ <Bernoulli Process>์˜ ๊ฒฝ์šฐ, ๊ฐ trial์—์„œ ๋ชจ๋‘ ๋™์ผํ•œ <Bernoulli distribution>์„ ๊ฐ€์ •ํ–ˆ๋Š”๋ฐ, <Gaussian Process>์˜ ๊ฒฝ์šฐ $x$ ๊ฐ’์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ํ‰๊ท /๋ถ„์‚ฐ์„ ๊ฐ€์ง„ Gaussian distribution์œผ๋กœ ์ด๋ค„์งˆ ์ˆ˜ ์žˆ์Œ์— ์ฃผ๋ชฉํ•˜์ž!ย 

  2. The precise definition of a <Gaussian Process> is usually stated by this sentence: "A Gaussian process is a stochastic process s.t. any finite subcollection of random variables has a multivariate Gaussian distribution."

  3. The SE kernel is in fact a kind of Gaussian kernel. However, the name squared-exponential is used here because "Gaussian" would clash with the name Gaussian Process.