This is a post I put together for personal use while studying "Machine Learning". Corrections are always welcome :)

Series: Bayesian Regression

  1. MLE vs. MAP 👀
  2. Predictive Distribution
  3. Bayesian Regression

MLE vs. MAP

  • MLE = Maximum Likelihood Estimation
  • MAP = Maximum A Posteriori

MLE and MAP are both methods of <statistical inference>, and both aim to find an optimal value of $\theta$.

1. MLE

For an introduction to MLE, see this post. Simply put, MLE computes the following value:

\[\theta_{\text{MLE}} = \underset{\theta}{\text{argmax}} \; p(X \mid \theta) = \underset{\theta}{\text{argmax}} \prod_{i} p(x_i \mid \theta)\]

Here, $P(X \mid \theta)$ is called the "likelihood", and yes, it is exactly the likelihood you know from Bayes' Rule! Because the MLE objective contains a product term $\prod$, MLE problems are usually worked out by maximizing the log-likelihood instead.

\[\theta_{\text{MLE}} = \underset{\theta}{\text{argmax}} \sum_{i} \log \left( p(x_i \mid \theta) \right)\]

To find $\theta_{\text{MLE}}$, we differentiate the log-likelihood above, set the derivative to zero, and solve for $\theta$!
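To make this concrete, below is a minimal sketch (my own toy example, not from the original post) for the mean of a Gaussian with known variance: the derivative condition gives the sample mean, and a numerical optimizer agrees.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)  # observed data X

def neg_log_likelihood(theta):
    # -(sum_i log p(x_i | theta)), with the variance fixed at 1
    return -np.sum(norm.logpdf(x, loc=theta, scale=1.0))

theta_mle = minimize_scalar(neg_log_likelihood).x
print(theta_mle, x.mean())  # both give the sample mean, up to numerical error
```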


2. MAP

MAP starts from Bayes' Rule.

\[P(\theta \mid X) = \frac{P(\theta)P(X \mid \theta)}{P(X)} \propto P(\theta)P(X\mid\theta)\]

As its name suggests, MAP estimates $\theta$ using the posterior. From the equation above, the posterior is proportional to the product of the prior and the likelihood. So, writing it out, MAP is obtained from the MLE derivation simply by swapping the likelihood for the posterior!

\[\begin{aligned} \theta_{\text{MAP}} &= \underset{\theta}{\text{argmax}} P(X \mid \theta) P(\theta) \\ &= \underset{\theta}{\text{argmax}} \left( \log P(X \mid \theta) + \log P(\theta) \right) \\ &= \underset{\theta}{\text{argmax}} \left( \log \prod_{i} P(x_i \mid \theta) + \log P(\theta) \right) \\ &= \underset{\theta}{\text{argmax}} \left( \sum_{i} \log P(x_i \mid \theta) + \log P(\theta) \right) \end{aligned}\]


Comparing the MLE and MAP objectives, exactly one term differs: MAP includes the prior $P(\theta)$! This means the optimization also takes the prior over $\theta$ into account. Since the value of the target equation changes with $p(\theta)$, the prior $p(\theta)$ can be understood as weighting the objective. And unlike MLE, which treated $\theta$ as a deterministic value, MAP treats $\theta$ as a random variable with prior $p(\theta)$, which is a notable shift in perspective.
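To see what this extra term looks like in code, here is the same toy Gaussian model as before, with an assumed Gaussian prior $\mathcal{N}(0, \tau^2)$ added to the objective; the choice of prior and the value of `tau` are my illustrative assumptions, not anything prescribed by MAP itself.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)
tau = 0.5  # prior standard deviation; a hyperparameter chosen for illustration

def neg_log_posterior(theta):
    nll = -np.sum(norm.logpdf(x, loc=theta, scale=1.0))      # -log P(X | theta)
    neg_log_prior = -norm.logpdf(theta, loc=0.0, scale=tau)  # -log P(theta)
    return nll + neg_log_prior  # the constant log P(X) is dropped

theta_map = minimize_scalar(neg_log_posterior).x
print(theta_map)  # pulled away from the MLE, toward the prior mean 0
```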

To examine MAP a bit further, assume the simplest possible prior: a uniform prior $p(\theta)$. This amounts to giving every likelihood the same constant weight. Then

\[\begin{aligned} \theta_{\text{MAP}} &= \underset{\theta}{\text{argmax}} \left( \sum_{i} \log P(x_i \mid \theta) + \log P(\theta) \right) \\ &= \underset{\theta}{\text{argmax}} \left( \sum_{i} \log P(x_i \mid \theta) + \text{const} \right) \\ &= \underset{\theta}{\text{argmax}} \sum_{i} \log P(x_i \mid \theta) \\ &= \theta_{\text{MLE}} \end{aligned}\]

Boom! Under a uniform prior we get $\theta_{\text{MLE}} = \theta_{\text{MAP}}$! Of course, assuming a Gaussian or some other distribution as the prior gives a completely different result. In practice, if you are willing to assume a prior, you solve the problem with MAP; otherwise, with MLE.
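As a quick numerical check of this result, in my toy Gaussian-prior setting (where the MAP estimate has a closed form), widening the prior flattens it toward uniform, and the MAP estimate converges to the MLE:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)
n, sigma = len(x), 1.0

def theta_map(tau):
    # posterior mean under a N(0, tau^2) prior on the mean of N(theta, sigma^2)
    return (n / sigma**2 * x.mean()) / (n / sigma**2 + 1.0 / tau**2)

for tau in (0.1, 1.0, 10.0, 1e6):
    print(tau, theta_map(tau))  # approaches x.mean() == theta_MLE as tau grows
```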

In fact, MLE and MAP aim at different things, which is usually described as follows.

Formally, MLE produces the choice that is most likely to have generated the observed data.

A MAP estimate is the choice that is most likely given the observed data. In contrast to MLE, MAP estimation applies Bayes's Rule, so that our estimate can take into account prior knowledge about what we expect our parameters to be in the form of a prior probability distribution.


If you feel comfortable with MLE and MAP, I recommend reading the article below, which nicely works through Linear Regression from both the Frequentist and Bayesian perspectives.


The next post in the series covers the <predictive distribution>.

👉 Predictive Distribution

