This post organizes what I studied in the ‘Data Mining’ course I took at university in Spring 2021. Corrections are always welcome :)


Bayesโ€™ Theorem

\[\frac{P(\text{data} \mid \text{parameter}) \cdot P(\text{parameter})}{P(\text{data})} = P(\text{parameter} \mid \text{data})\] \[\frac{\text{Likelihood} \cdot \text{Prior probability}}{\text{Evidence}} = \text{Posterior probability}\]
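To make the theorem concrete, here is a tiny numeric check in Python; all three probabilities are made-up numbers, purely for illustration.

```python
# A tiny numeric check of Bayes' theorem (all numbers are made up for illustration).
likelihood = 0.8   # P(data | parameter)
prior      = 0.3   # P(parameter)
evidence   = 0.4   # P(data)

posterior = likelihood * prior / evidence   # P(parameter | data)
print(posterior)                            # 0.6 (up to floating-point rounding)
```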

“One assumption taken is the strong independence assumption between the features.”

Naive Assumption

“In a supervised learning situation, Naive Bayes classifiers are trained very efficiently. Naive Bayes classifiers need only a small amount of training data to estimate the parameters needed for classification. Naive Bayes classifiers have a simple design and implementation, and they can be applied to many real-life situations.”

The <Naive Bayes Classifier> is a supervised learning model. It performs classification using conditional probability, selecting the class with the maximum posterior probability!


Naive Bayes Classifier

์šฐ๋ฆฌ์˜ ๋ชฉํ‘œ๋Š” ๋ฐ์ดํ„ฐ์˜ feature๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ ๊ฐ label์— ๋Œ€ํ•œ ํ™•๋ฅ ์ธ Posterior Probability $p(c_k \mid \mathbf{x})$๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

์ด๋•Œ, ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋ฅผ ํ†ตํ•ด Posterior Probability๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด Conditional Probability๋กœ ๋ถ„ํ•ดํ•  ์ˆ˜ ์žˆ๋‹ค.

\[p(c_k \mid \mathbf{x}) = \frac{p(c_k) \cdot p(\mathbf{x} \mid c_k)}{p(\mathbf{x})}\]

Here, the Evidence $p(\mathbf{x})$ does not depend on the output $y$. We can therefore ignore the evidence and estimate $y$ as follows.

\[\begin{aligned} y &= \underset{c_k}{\text{argmax}} \; p(c_k \mid \mathbf{x}) \\ &= \underset{c_k}{\text{argmax}} \; p(c_k) \cdot p(\mathbf{x} \mid c_k) \end{aligned}\]

If the priors $p(c_k)$ are all equal, the prior term can be dropped as well 😁
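A minimal sketch of this argmax decision rule in Python; the prior table and likelihood function below are hypothetical stand-ins for quantities that would be estimated from data.

```python
# Minimal MAP decision rule: pick the class that maximizes prior * likelihood.
# `priors` and `likelihood` are hypothetical stand-ins for estimated quantities.
priors = {"spam": 0.4, "ham": 0.6}                       # p(c_k)

def likelihood(x, c):                                    # p(x | c_k), assumed known
    table = {("offer", "spam"): 0.7, ("offer", "ham"): 0.1}
    return table[(x, c)]

def predict(x):
    # argmax_{c_k} p(c_k) * p(x | c_k); the evidence p(x) is a shared factor, so it is dropped
    return max(priors, key=lambda c: priors[c] * likelihood(x, c))

print(predict("offer"))   # "spam", since 0.4 * 0.7 = 0.28 > 0.6 * 0.1 = 0.06
```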

Assume the data $\mathbf{x}$ consists of features $a_1, a_2, \dots, a_n$, and let us rewrite the equations above.

\[\begin{aligned} p(c_k \mid a_1, \dots, a_n) &= \frac{p(c_k) \cdot p(a_1, \dots, a_n \mid c_k)}{p(a_1, \dots, a_n)} \\ &\propto p(c_k) \cdot p(a_1, \dots, a_n \mid c_k) \end{aligned}\]

Here, the <NB Classifier> assumes that the features are mutually independent. Therefore,

\[\begin{aligned} p(a_1, \dots, a_n) &= p(a_1) \cdots p(a_n) \\ p(a_1, \dots, a_n \mid c_k) &= p(a_1 \mid c_k) \cdots p(a_n \mid c_k) \end{aligned}\]

Rewriting the expression with this assumption,

\[\begin{aligned} p(c_k \mid a_1, \dots, a_n) &\propto p(c_k) \cdot p(a_1, \dots, a_n \mid c_k) \\ &= p(c_k) \cdot p(a_1 \mid c_k) \cdots p(a_n \mid c_k) \\ &= p(c_k) \cdot \prod^n_{i=1} p(a_i \mid c_k) \end{aligned}\]

๋งŒ์•ฝ ์œ„์˜ ์‹์—์„œ $\propto$๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  ๋“ฑ์‹์œผ๋กœ ๋ฐ”๊พธ๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[p(c_k \mid a_1, \dots, a_n) = \frac{\displaystyle p(c_k) \cdot \prod^n_{i=1} p(a_i \mid c_k)}{\displaystyle\sum_j p(c_j) \, p(\mathbf{x} \mid c_j)}\]
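Putting the pieces together, here is a small sketch that computes this normalized posterior for a categorical model; every probability table below is invented for illustration.

```python
# Sketch: normalized posterior for a categorical Naive Bayes model.
# All probability tables are made-up numbers for illustration.
priors = {"pos": 0.5, "neg": 0.5}                         # p(c_k)
cond = {                                                  # p(a_i = value | c_k)
    "pos": [{"yes": 0.8, "no": 0.2}, {"yes": 0.3, "no": 0.7}],
    "neg": [{"yes": 0.1, "no": 0.9}, {"yes": 0.6, "no": 0.4}],
}

def joint(c, features):
    # p(c_k) * prod_i p(a_i | c_k)
    p = priors[c]
    for i, a in enumerate(features):
        p *= cond[c][i][a]
    return p

x = ["yes", "no"]                                         # (a_1, a_2)
joints = {c: joint(c, x) for c in priors}
evidence = sum(joints.values())                           # sum_j p(c_j) * p(x | c_j)
posterior = {c: p / evidence for c, p in joints.items()}
print(posterior)                                          # {'pos': ~0.93, 'neg': ~0.07}
```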

Gaussian Naive Bayes

The <Gaussian NB> model assumes that the likelihood $p(\mathbf{x} \mid c_k)$ follows a Gaussian distribution. Since every feature of $\mathbf{x}$ is assumed independent, we define and use a Gaussian likelihood for each individual feature as below.

\[p(a_i \mid c_k) = \frac{1}{\sqrt{2\pi \sigma_{i, c_k}^2}} \cdot \exp \left( - \frac{(a_i - \mu_{i, c_k})^2}{2 \sigma_{i, c_k}^2} \right)\]

์œ„์˜ Likelihood function์„ ํ™œ์šฉํ•ด ์‹์„ ๋‹ค์‹œ ๊ธฐ์ˆ ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\begin{aligned} p(c_k \mid a_1, \dots, a_n) &\propto p(c_k) \cdot p(a_1, \dots, a_n \mid c_k) \\ &= p(c_k) \cdot \prod^n_{i=1} p(a_i \mid c_k) \\ &= p(c_k) \cdot \prod^n_{i=1} \frac{1}{\sqrt{2\pi \sigma_{i, c_k}^2}} \cdot \exp \left( - \frac{(a_i - \mu_{i, c_k})^2}{2 \sigma_{i, c_k}^2} \right) \end{aligned}\]
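As a from-scratch sketch of Gaussian NB under these assumptions, the code below estimates a per-class, per-feature mean and variance, then scores each class by its log prior plus the summed log Gaussian likelihoods; the toy dataset is made up.

```python
import numpy as np

# From-scratch sketch of Gaussian Naive Bayes on a made-up toy dataset.
# Per class c_k and feature i we estimate mu_{i,c_k} and sigma^2_{i,c_k},
# then score each class with log p(c_k) + sum_i log N(a_i; mu, sigma^2).

def fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (
            len(Xc) / len(X),       # prior p(c_k), estimated by class frequency
            Xc.mean(axis=0),        # mu_{i, c_k}
            Xc.var(axis=0) + 1e-9,  # sigma^2_{i, c_k} (small floor for stability)
        )
    return params

def predict(params, x):
    def log_posterior(c):
        prior, mu, var = params[c]
        # Log Gaussian likelihood, summed over the (conditionally independent) features
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + log_lik
    return max(params, key=log_posterior)

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 0.9], [4.1, 1.1]])
y = np.array([0, 0, 1, 1])
print(predict(fit(X, y), np.array([1.1, 2.0])))  # 0, the nearby class
```

Working in log space here avoids the numerical underflow that comes from multiplying many small per-feature probabilities.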

์ฐธ๊ณ ์ž๋ฃŒ