This post is a personal write-up from my study of “Machine Learning”. Corrections are always welcome :)

Series: Bayesian Regression

  1. MLE vs. MAP
  2. Predictive Distribution 👀
  3. Bayesian Regression

Bayesian Approach

Starting with this post, we explore the Bayesian Approach in earnest. From the Bayesian point of view, probability is understood as “the degree of belief in a hypothesis”. We start with a prior belief about the hypothesis, and once data are observed, we use them to update that belief and obtain a posterior belief. This collect-and-update cycle repeats every time new data arrive.

The point to remember is that the Bayesian Approach always talks about ‘uncertainty’. Classical statistics performs estimation either by Point Estimation, which finds an unbiased estimator or the most efficient estimator of $\theta$ [1], or by Interval Estimation, which finds a confidence interval. The Bayesian Approach instead aims to find a ‘probability distribution’ over the parameter $\theta$. So rather than pinning the parameter to a single value $\theta = \theta_0$ as Point Estimation does, it says something like “the parameter follows a normal distribution with $\mu = 4$, $\sigma^2 = 1$”.

In the Bayesian Approach, the distribution of the parameter keeps being updated as observations are added. In other words, the prior distribution of the parameter is updated with the newly observed data to give the posterior distribution. This article describes it as “data are like a magnet that pulls the probability distribution”, which is a fitting image 😲 For details, I recommend briefly reading the relevant part of that article. It shows how data update the posterior distribution and why choosing a good prior distribution matters 👍

The classical methods derive a Point Estimator or a confidence interval. The fully Bayesian Approach, by contrast, does not commit to any of those 👋 and simply uses the posterior distribution of the parameter to predict new data $x^{*}$. What appears in this process is the <Predictive Distribution>!


Parameter Posterior

์•ž์˜ ๋ฌธ๋‹จ์—์„œ Bayesian Approach๊ฐ€ ๊ด€์ธก ๋ฐ์ดํ„ฐ๋กœ parameter์˜ distribution์„ ๊ฐฑ์‹ ํ•œ๋‹ค๊ณ  ๊ธฐ์ˆ ํ–ˆ๋‹ค. ์ด๊ฒƒ์„ ์ข€๋” ์‚ดํŽด๋ณด์ž! ์ œ์ผ ๋จผ์ € parameter $\theta$์— ๋Œ€ํ•œ prior distribution์„ ๊ฐ€์ •ํ•œ๋‹ค. ์ด๊ฒƒ์„ <prior distribution of parameter> ๋˜๋Š” <parameter prior>๋ผ๊ณ  ํ•˜๋ฉฐ, ์—ฌ๊ธฐ์„œ๋Š” ์•„๋ž˜์™€ ๊ฐ™์€ ์ •๊ทœ ๋ถ„ํฌ๋ฅผ ๊ฐ€์ •ํ•˜๊ฒ ๋‹ค.

\[\theta \sim N(0, \tau^2 I)\]

Now let's update the <parameter prior> $p(\theta)$ using the observed data $X = \{ x^{(1)}, \dots, x^{(m)} \}$. By Bayes Rule, the <parameter posterior> $p(\theta \mid X)$ is derived as follows, where the second line assumes that the observations are i.i.d. given $\theta$.

\[\begin{aligned} p(\theta \mid X) &= \frac{p(X \mid \theta) p(\theta)}{p(X)} = \frac{p(X \mid \theta) p(\theta)}{\int_{\theta'} p(X \mid \theta') p(\theta') d\theta'} \\ &= \frac{p(\theta) \prod^m_{i=1} p(x^{(i)} \mid \theta)}{\int_{\theta'} p(\theta') \prod^m_{i=1} p(x^{(i)} \mid \theta') d\theta'} \end{aligned}\]

Here, the likelihood term $p(x^{(i)} \mid \theta)$ is the probability distribution of an observation $x^{(i)}$ parameterized by $\theta$; it can be a binomial distribution, a normal distribution, a Poisson distribution, and so on. The likelihood is a modeling assumption about how the data are parameterized by $\theta$, so it is not something that gets updated! 🙌

Binomial distribution ($n$ trials)

\[p(x^{(i)} \mid \theta) = \frac{n!}{x^{(i)}! \, (n - x^{(i)})!} \theta^{x^{(i)}} (1 - \theta)^{n - x^{(i)}}\]

1D normal distribution

\[p(x^{(i)} \mid \theta) = \frac{1}{\sqrt{2\pi} \sigma} \exp \left( - \frac{(x^{(i)} - \theta)^2}{2\sigma^2}\right)\]

2D normal distribution ($x^{(i)} \in \mathbb{R}^2$, also $\theta \in \mathbb{R}^2$; covariance assumed to be $\sigma^2 I$)

\[p(x^{(i)} \mid \theta) = \frac{1}{2\pi \sigma^2} \exp \left( - \frac{\lVert x^{(i)} - \theta \rVert^2}{2\sigma^2}\right)\]

And so on!!
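As a rough illustration of how Bayes Rule combines a prior with one of these likelihoods, the sketch below approximates the parameter posterior on a grid for the 1D normal likelihood with the prior $\theta \sim N(0, \tau^2)$. The observations, the values of `sigma` and `tau`, the grid range, and the helper name `grid_posterior` are all assumptions made up for this example, not part of the original derivation.

```python
import math

def normal_pdf(x, mean, std):
    """Density of the 1D normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)

def grid_posterior(data, sigma, tau, lo=-10.0, hi=10.0, n=4001):
    """Approximate p(theta | X) on an evenly spaced grid.

    Assumes the prior theta ~ N(0, tau^2) and the likelihood
    x ~ N(theta, sigma^2) with known sigma (illustrative choices).
    """
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # Numerator of Bayes Rule: p(theta) * prod_i p(x_i | theta)
    unnorm = []
    for theta in grid:
        value = normal_pdf(theta, 0.0, tau)
        for x in data:
            value *= normal_pdf(x, theta, sigma)
        unnorm.append(value)
    # Denominator: Riemann-sum approximation of the integral over theta'
    evidence = sum(unnorm) * step
    density = [u / evidence for u in unnorm]
    return grid, density, step

data = [1.8, 2.4, 2.1, 1.6]   # made-up observations
sigma, tau = 1.0, 2.0         # assumed noise std and prior std
grid, density, step = grid_posterior(data, sigma, tau)

grid_mean = sum(t * d for t, d in zip(grid, density)) * step
# Closed-form posterior mean of the conjugate normal-normal model
exact_mean = (sum(data) / sigma**2) / (1.0 / tau**2 + len(data) / sigma**2)
print(grid_mean, exact_mean)
```

The grid version is only a numerical stand-in for the integral in the denominator; for this particular prior and likelihood the posterior is again normal, which is why a closed-form mean is available for comparison.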

Example. Parameter Posterior

Suppose the probability $\theta$ that a coin lands heads follows a uniform distribution. The coin is tossed once and lands heads. Find the parameter posterior.

Solution

“The probability of heads follows a uniform distribution.”
→ $p(\theta) = I_{(0 \le \theta \le 1)}$ (parameter prior)

“Toss the coin.”
→ $p(x \mid \theta) = \theta^x (1 - \theta)^{(1-x)}$ (likelihood)

“The coin was tossed and landed heads.”
→ $x_1 = 1$

\[\begin{aligned} p(\theta \mid x_1 = 1) &= \frac{p(x_1 = 1 \mid \theta) p(\theta)}{p(x_1 = 1)} \\ &= \frac{\theta \cdot p(\theta)}{1/2} \\ &= 2 \theta \cdot p(\theta) = 2\theta \end{aligned}\]
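The denominator $1/2$ in the second line is the evidence $p(x_1 = 1)$, obtained by integrating the likelihood against the uniform prior. Spelled out as a short worked step, using only the prior and likelihood defined above:

\[p(x_1 = 1) = \int_0^1 p(x_1 = 1 \mid \theta) \, p(\theta) \, d\theta = \int_0^1 \theta \, d\theta = \frac{1}{2}\]

As a sanity check, $\int_0^1 2\theta \, d\theta = 1$, so the result is a properly normalized density on $[0, 1]$.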

The updated parameter posterior puts more mass on large $\theta$, reflecting the probability (= belief) that heads is the more likely outcome.

$\blacksquare$
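The same update can be repeated each time a new flip arrives, which is the collect-and-update cycle described at the start of the post. The sketch below does this numerically on a grid over $[0, 1]$; the flip sequence and the helper name `update` are assumptions chosen for illustration.

```python
def update(grid, density, step, x):
    """One Bayesian update of a gridded density for a single flip x (1 = heads, 0 = tails)."""
    # Bernoulli likelihood p(x | theta) = theta^x (1 - theta)^(1 - x)
    unnorm = [d * (t if x == 1 else 1.0 - t) for t, d in zip(grid, density)]
    evidence = sum(unnorm) * step   # midpoint-rule approximation of p(x)
    return [u / evidence for u in unnorm]

n = 1000
step = 1.0 / n
grid = [(i + 0.5) * step for i in range(n)]   # midpoints of [0, 1]
density = [1.0] * n                           # uniform prior p(theta) = 1

flips = [1, 1, 0, 1]                          # made-up sequence of flips
for x in flips:
    # The posterior after this flip becomes the prior for the next one.
    density = update(grid, density, step, x)
    mean = sum(t * d for t, d in zip(grid, density)) * step
    print(x, round(mean, 4))
```

After the first head the gridded density agrees with $2\theta$ from the example. More generally, with a uniform prior the exact posterior after $h$ heads and $t$ tails is a Beta distribution with parameters $(1 + h, 1 + t)$, so the printed means should track $(1 + h) / (2 + h + t)$.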


Predictive Distribution

The <Predictive Distribution> is the distribution derived in the process of making a prediction about unobserved data $x^{*} \in X^{*}$. That is presumably why it carries the name “predictive” (this is only my own guess). The <Predictive Distribution> comes in two kinds, depending on whether it is derived from the parameter prior or from the parameter posterior, which reflects the observed data $X$.

Definition. Prior Predictive Distribution

Let $X = \{ x^{(1)}, \dots, x^{(m)} \}$ be a set of observed data, $X^{*}$ be a set of unobserved data, and $x^{*} \in X^{*}$.

Then, the <prior predictive distribution> is

\[p(x^{*}) = \int p(x^{*}, \theta) d\theta = \int p(x^{*} \mid \theta) p(\theta) d\theta\]

That is, the <prior predictive distribution> is the likelihood $p(x^{*} \mid \theta)$ integrated over $\theta$, weighted by the parameter prior $p(\theta)$.
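One way to read this integral is as a two-step sampling recipe: draw $\theta$ from the prior, then draw $x^{*}$ from the likelihood given that $\theta$. The sketch below applies this to the coin example (uniform prior, Bernoulli likelihood); the function name `sample_prior_predictive`, the sample size, and the seed are assumptions for illustration.

```python
import random

def sample_prior_predictive(num_samples, seed=0):
    """Draw x* from the prior predictive p(x*) by sampling theta first, then x*."""
    rng = random.Random(seed)
    draws = []
    for _ in range(num_samples):
        theta = rng.random()                        # theta ~ Uniform(0, 1), the prior
        x_star = 1 if rng.random() < theta else 0   # x* ~ Bernoulli(theta), the likelihood
        draws.append(x_star)
    return draws

draws = sample_prior_predictive(100_000)
p_heads = sum(draws) / len(draws)
print(p_heads)   # Monte Carlo estimate of p(x* = 1); the exact value is 1/2
```

Averaging over many draws of $\theta$ is what "integrating out" the parameter means here: the estimate approaches $\int_0^1 \theta \, d\theta = 1/2$ without any single value of $\theta$ being chosen.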

However, the <prior predictive distribution> makes no use of the observed data $X$ at all. To make proper use of the observed data, we need the <posterior predictive distribution>, which is derived from the parameter posterior $p(\theta \mid X)$!

Definition. Posterior Predictive Distribution

\[p(x^{*} \mid X) = \int p(x^{*}, \theta \mid X) d\theta = \int p(x^{*} \mid \theta, X) p(\theta \mid X) d\theta\]

Since $x^{*}$ and $X$ are usually assumed to be conditionally independent given $\theta$ (for example, i.i.d. draws from $p(x \mid \theta)$),

\[p(x^{*} \mid X) = \int p(x^{*} \mid \theta, X) p(\theta \mid X) d\theta = \int p(x^{*} \mid \theta) p(\theta \mid X) d\theta\]

Compared with the <prior predictive distribution>, the difference is that the function inside the integral changes from the parameter prior $p(\theta)$ to the parameter posterior $p(\theta \mid X)$! Because the <posterior predictive distribution> uses the <parameter posterior>, which has been updated with the observed data, the prediction is based on a distribution that is expected to concentrate near the true parameter.
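To make the comparison concrete, the sketch below evaluates both integrals numerically for the coin example: the prior is uniform on $[0, 1]$ and the posterior after one observed head is $2\theta$. The midpoint-rule helper `integrate` and the other function names are assumptions introduced for this example.

```python
def integrate(f, lo=0.0, hi=1.0, n=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * step) for i in range(n)) * step

def likelihood_heads(theta):
    """p(x* = 1 | theta) for a Bernoulli coin."""
    return theta

def prior(theta):
    """Uniform parameter prior on [0, 1]."""
    return 1.0

def posterior(theta):
    """Parameter posterior after observing one head: p(theta | x_1 = 1) = 2 * theta."""
    return 2.0 * theta

# Same integrand shape in both cases; only the weight on theta changes.
prior_pred = integrate(lambda t: likelihood_heads(t) * prior(t))
post_pred = integrate(lambda t: likelihood_heads(t) * posterior(t))
print(prior_pred, post_pred)   # approximately 1/2 and 2/3
```

Observing a single head raises the predicted probability of heads on the next toss from $1/2$ to $\int_0^1 \theta \cdot 2\theta \, d\theta = 2/3$, which is the effect of swapping $p(\theta)$ for $p(\theta \mid X)$ inside the integral.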

The corresponding part of this article has simple exercises on deriving the predictive distribution. They are good problems, so it is worth working through them once 👀 For reference, the step in the first exercise that integrates using the Gamma function $\Gamma$ is the integral of a Beta Distribution.


๋‹ค์Œ ํฌ์ŠคํŠธ์—์„œ๋Š” <Predictive Distribution>์„ ์ด์šฉํ•ด Regression Problem์„ ๋‹ค๋ฃฌ๋‹ค. ์ด๊ฒƒ์„ <Bayesian Linear Regression>์ด๋ผ๊ณ  ํ•˜๋ฉฐ ์ด๋ฒˆ ํฌ์ŠคํŠธ๋ฅผ ์ž˜ ์ดํ•ดํ–ˆ๋‹ค๋ฉด ๋‹ค์Œ ํฌ์ŠคํŠธ๋ฅผ ์‰ฝ๊ฒŒ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค ๐Ÿ˜

👉 Bayesian Regression


reference


  1. unbiased estimator with the smallest variance