This is a post I wrote for personal use while studying "Machine Learning". Corrections and feedback are always welcome :)

5 minute read


Series: Bayesian Regression

  1. MLE vs. MAP
  2. Predictive Distribution
  3. Bayesian Regression 👀

Bayesian Linear Regression

In this post, we apply the <parameter posterior> and <posterior predictive distribution> covered earlier to the regression problem. In fact, <Bayesian Linear Regression> is simply the <posterior predictive distribution under the regression problem>! 🙌

Suppose we have observed data $S = (X, y)$ and want to update the <parameter prior> with it. By Bayes' rule, the <parameter posterior> can be derived as follows.

\[\begin{aligned} p(\theta \mid S) &= \frac{p(S \mid \theta) p(\theta)}{p(S)} = \frac{p(S \mid \theta) p(\theta)}{\int_{\theta'} p(S \mid \theta') p(\theta') d\theta'} \\ &= \frac{p(\theta) \prod^m_{i=1} p(y^{(i)} \mid x^{(i)}, \theta)}{\int_{\theta'} p(\theta') \prod^m_{i=1} p(y^{(i)} \mid x^{(i)}, \theta') d\theta'} \end{aligned}\]
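As a quick numerical illustration of the update above, here is a minimal sketch (the synthetic 1-D setup and all variable names are assumptions for illustration) that evaluates the numerator $p(\theta)\prod_i p(y^{(i)} \mid x^{(i)}, \theta)$ on a grid of parameter values and normalizes it, in place of the intractable integral:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: y = theta_true * x + eps, eps ~ N(0, sigma^2)
theta_true, sigma, tau = 2.0, 0.5, 1.0
x = rng.normal(size=20)
y = theta_true * x + sigma * rng.normal(size=20)

# Candidate parameter values (the grid replaces the integral over theta')
grid = np.linspace(-1.0, 5.0, 1001)
dt = grid[1] - grid[0]

# log prior N(0, tau^2) and log-likelihood: sum of Gaussian log-densities
log_prior = -0.5 * grid**2 / tau**2
log_lik = np.array([-0.5 * np.sum((y - t * x) ** 2) / sigma**2 for t in grid])

# Bayes' rule: posterior ∝ prior * likelihood, normalized over the grid
post = np.exp(log_prior + log_lik - (log_prior + log_lik).max())
post /= post.sum() * dt

post_mean = (grid * post).sum() * dt
print(post_mean)  # concentrates near theta_true
```

With only 20 observations the posterior already concentrates around the true parameter; the prior term matters most when data is scarce.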

Here, the likelihood term $p(y^{(i)} \mid x^{(i)}, \theta)$ can be written as follows.

\[p(y^{(i)} \mid x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi} \sigma} \exp \left( - \frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)\]

This follows from the regression model assumption below, which treats $y$ as a random variable:

\[y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)\]

In the previous posts, the likelihood function could take various forms, such as the binomial or Gaussian distribution; in the regression setting, however, the Gaussian noise assumption forces the likelihood to be a Gaussian distribution! 🙌


Now let's look at the <Predictive Distribution> for the regression problem. Given observed data $S = (X, y)$ (the train set) and unobserved data $S^{*} = (X^{*}, y^{*})$ (the test set), it is the distribution derived in the process of making a prediction for an unobserved point $x^{*} \in X^{*}$.

Definition. Prior Predictive Distribution (Regression)

Let $S = \{ (X, y) \}$ be a set of observed data, $X^{*}$ be a set of unobserved data, and $x^{*} \in X^{*}$.

Then, the <prior predictive distribution> is

\[p(y^{*} \mid x^{*}) = \int p(y^{*}, \theta \mid x^{*}) d\theta = \int p(y^{*} \mid x^{*}, \theta) p(\theta) d\theta\]

However, the <prior predictive distribution> does not use the observed data $S$ at all. To properly exploit the observed data, we should use the <posterior predictive distribution>, derived from the parameter posterior $p(\theta \mid S)$!

Definition. Posterior Predictive Distribution (Regression)

\[p(y^{*} \mid x^{*}, S) = \int p(y^{*}, \theta \mid x^{*}, S) d\theta = \int p(y^{*} \mid x^{*}, S, \theta) p(\theta \mid S) d\theta\]

Since $x^{*}$ and $S$ are usually assumed to be independent (i.e., the data points are iid), the conditioning on $S$ inside the likelihood drops out:

\[p(y^{*} \mid x^{*}, S) = \int p(y^{*} \mid x^{*}, S, \theta) p(\theta \mid S) d\theta = \int p(y^{*} \mid x^{*}, \theta) p(\theta \mid S) d\theta\]
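The integral above can be read as an expectation over the parameter posterior, which suggests a Monte Carlo sketch (the 1-D grid posterior and all names below are assumptions for illustration): draw $\theta \sim p(\theta \mid S)$, then $y^{*} \sim p(y^{*} \mid x^{*}, \theta)$, and the resulting samples follow the posterior predictive distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# Training data under y = theta * x + eps, eps ~ N(0, sigma^2)
theta_true, sigma, tau = 2.0, 0.5, 1.0
x = rng.normal(size=30)
y = theta_true * x + sigma * rng.normal(size=30)

# Discretized parameter posterior p(theta | S) ∝ prior * likelihood
grid = np.linspace(-1.0, 5.0, 2001)
log_post = (-0.5 * grid**2 / tau**2
            - 0.5 * ((y[:, None] - grid * x[:, None]) ** 2).sum(axis=0) / sigma**2)
w = np.exp(log_post - log_post.max())
w /= w.sum()

# Monte Carlo version of p(y* | x*, S) = ∫ p(y* | x*, theta) p(theta | S) dtheta
x_star = 1.5
thetas = rng.choice(grid, size=5000, p=w)                  # theta ~ p(theta | S)
y_star = thetas * x_star + sigma * rng.normal(size=5000)   # y* ~ p(y* | x*, theta)

print(y_star.mean(), y_star.std())  # centered near theta_true * x_star
```

The spread of `y_star` is wider than the noise level $\sigma$ alone, because parameter uncertainty is also propagated into the prediction.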


In general, the parameter posterior $p(\theta \mid S)$ and the posterior predictive distribution $p(y^{*} \mid x^{*}, S)$ defined for the regression problem involve integrals that are very hard to compute. Approximations are therefore used to make the problem tractable; MAP (Maximum a Posteriori), covered at the very beginning of this series, is one such approximation.
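As a side note on the MAP approximation (a sketch under the Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$; the synthetic data and names are assumptions): for this model, maximizing the log-posterior is equivalent to minimizing squared error plus an $L_2$ penalty, so the MAP estimate is exactly the ridge-regression solution with $\lambda = \sigma^2 / \tau^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
sigma, tau = 0.5, 1.0

X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + sigma * rng.normal(size=n)

# MAP under Gaussian prior + Gaussian likelihood == ridge with lam = sigma^2 / tau^2
lam = sigma**2 / tau**2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(theta_map)  # close to theta_true
```

For this particular model the MAP point estimate coincides with the posterior mean, since the posterior turns out to be Gaussian; what MAP discards is the uncertainty around that point.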

Fortunately, for <Bayesian Linear Regression> the distributions of $p(\theta \mid S)$ and $p(y^{*} \mid x^{*}, S)$ are known in closed form, as follows.

\[\begin{aligned} \theta \mid S &\sim \mathcal{N} \left( \frac{1}{\sigma^2} A^{-1}X^T\vec{y}, \; A^{-1}\right) \\ y^{*} \mid x^{*}, S &\sim \mathcal{N} \left( \frac{1}{\sigma^2} {x^{*}}^T A^{-1} X^{T} \vec{y}, \; {x^{*}}^T A^{-1} x^{*} + \sigma^2 \right) \end{aligned}\]

where $A = \frac{1}{\sigma^2}X^TX + \frac{1}{\tau^2}I$.
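The closed-form expressions above translate directly into code. A minimal sketch (synthetic data; $\sigma$ and $\tau$ assumed known, all names chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 2
sigma, tau = 0.5, 1.0

X = rng.normal(size=(n, d))
theta_true = np.array([1.5, -0.5])
y = X @ theta_true + sigma * rng.normal(size=n)

# A = (1/sigma^2) X^T X + (1/tau^2) I
A = X.T @ X / sigma**2 + np.eye(d) / tau**2
A_inv = np.linalg.inv(A)

# Parameter posterior: theta | S ~ N((1/sigma^2) A^{-1} X^T y, A^{-1})
post_mean = A_inv @ X.T @ y / sigma**2

# Posterior predictive at x*:
#   N((1/sigma^2) x*^T A^{-1} X^T y,  x*^T A^{-1} x* + sigma^2)
x_star = np.array([1.0, 1.0])
pred_mean = x_star @ post_mean
pred_var = x_star @ A_inv @ x_star + sigma**2

print(pred_mean, pred_var)
```

Note how `pred_var` decomposes into ${x^{*}}^T A^{-1} x^{*}$ (parameter uncertainty, which shrinks as more data is observed) plus $\sigma^2$ (irreducible observation noise).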

์œ„์˜ ์‹์ด ์–ด๋–ป๊ฒŒ ์œ ๋„ ๋˜๋Š”์ง€๋Š” ์•„์ง ๋ณธ์ธ๋„ ์ œ๋Œ€๋กœ ์ดํ•ดํ•˜์ง€ ๋ชปํ•ด์„œ ์ถ”ํ›„์— ๋ณ„๋„์˜ ํฌ์ŠคํŠธ๋กœ ์œ ๋„ ๊ณผ์ •์„ ๊ธฐ์ˆ ํ•˜๋„๋ก ํ•˜๊ฒ ๋‹ค ๐Ÿ™Œ

๊ทธ๋ž˜๋„ ์œ„์˜ ์‹์„ ํ†ตํ•ด parameter posterior์™€ posterior predictive distribution์ด Gaussian distribution์„ ๋”ฐ๋ฅธ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ํŠนํžˆ prediction $y^{*}$, $y^{*} = \theta^T x^{*} + \epsilon^{*}$์— ๋Œ€ํ•œ uncertainty์™€ parameter $\theta$์˜ ์„ ํƒ์— ๋Œ€ํ•œ uncertainty๋„ ๋‘ ์‹์˜ variance ๊ฐ’์„ ํ†ตํ•ด ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค! ๐Ÿ‘


๋งบ์Œ๋ง

This post concludes the Bayesian Approach series. Anything with 'Bayesian' in its name used to feel intimidating, but through this series I feel I have gotten over Bayesian theory at least a little 🙌

If <Bayesian Regression> is Bayesian parametric regression, there is also <Gaussian Process Regression>: still Bayesian regression, but a non-parametric model. If you are curious, check out that post 👍

