2021-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜ํ†ต๊ณ„์  ๋ฐ์ดํ„ฐ๋งˆ์ด๋‹โ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)


์•ž์—์„œ ์‚ดํŽด๋ณธ <K-means>์™€ <hierarchical> clustering์€ ํœด๋ฆฌ์Šคํ‹ฑ ๊ธฐ๋ฒ•์— ์†ํ•œ๋‹ค. ์–ด๋–ค ๋ชจ๋ธ์ด ์žˆ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์—, statistical inference ์—ญ์‹œ ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค. ๋˜ํ•œ, ์‹ค์ „์—์„œ๋Š” categorical variable ๋•Œ๋ฌธ์— โ€œdissimilarity measureโ€๋ฅผ ์ •์˜ํ•˜๋Š” ๊ฒƒ๋„ ์‰ฝ์ง€ ์•Š๋‹ค.

<Model-based clustering> is a technique that estimates a model of the dataset, namely the PDF $\text{Pr}(x)$!

2์ฐจ์› ๋ฐ์ดํ„ฐ์—์„œ non-parameteric method๋กœ $\text{Pr}(x)$๋ฅผ ์ถ”์ •ํ•˜๋Š” ๊ฒƒ์€ ์–ด๋ ต์ง€ ์•Š๋‹ค!


Mixture Model

2์ฐจ์› ๋ฐ์ดํ„ฐ์—์„œ PDF $\text{Pr}(x)$๋ฅผ ์ถ”์ •ํ•˜๋Š” ๊ฒƒ์€ ์–ด๋ ต์ง€ ์•Š๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ 3์ฐจ์›๋ถ€ํ„ฐ๋Š” PDF๋ฅผ ์ถ”์ •ํ•˜๊ธฐ ์‰ฝ์ง€ ์•Š์œผ๋ฉฐ, ์ด๊ฒƒ์„ visualize ํ•˜๋Š” ๊ฒƒ๋„ clustering ํ•˜๋Š” ๊ฒƒ๋„ ํž˜๋“  ์ผ์ด๋‹ค.

์ด๋Ÿฐ ๊ณ ์ฐจ์›์—์„œ์˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์“ฐ๋Š” ๊ธฐ๋ฒ•์ด ๋ฐ”๋กœ <Mixture Model>์ด๋‹ค! ์ฐธ๊ณ ๋กœ <Mixture Model>์€ โ€œparametericโ€ ๊ธฐ๋ฒ•์ด๋‹ค.


Let $P = \{ f_{\phi}(\cdot) : \phi \in \mathcal{P} \}$ be a parametric model. Here, $f_{\phi}(x)$ is some pdf, for example of this form:

\[f_{\phi}(x) = N(x \mid \mu, \sigma^2)\]

$f_{\phi}(x)$์—์„œ $\phi$๋Š” parameter์ด๋ฉฐ, ์ง‘ํ•ฉ $P$๋Š” ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  $\phi$์— ๋Œ€ํ•œ PDF์˜ ๋ชจ์Œ์ธ ๊ฒƒ์ด๋‹ค.


์ด๋ฒˆ์—๋Š” $K$-cluster๋ฅผ ํ‘œํ˜„ํ•˜๋Š” parameteric PDF๋ฅผ $f_{\phi_k}(x)$๋ผ๊ณ  ํ•  ๋•Œ, ์•„๋ž˜์™€ ๊ฐ™์€ PDF์˜ ๋ชจ์Œ; collection of densities๋ฅผ ์ •์˜ํ•ด๋ณด์ž. ๋ณ„๊ฑด ์•„๋‹ˆ๊ณ , ๊ทธ๋ƒฅ cluster PDF์˜ ๊ฐ€์ค‘ํ•ฉ์˜ ์กฐํ•ฉ์ด๋‹ค.

\[\left\{ \sum_{k=1}^K \pi_k f_{\phi_k}(\cdot) \; : \; \phi_k \in \mathcal{P}, \pi_k > 0, \sum_{k=1}^K \pi_k = 1\right\}\]

์ด๊ฒƒ์„ <K-component mixture model>์ด๋ผ๊ณ  ํ•œ๋‹ค! ์ด๋•Œ, $\pi_k$๋ฅผ โ€œmixing proportionโ€์ด๋ผ๊ณ  ํ•œ๋‹ค.

๐Ÿ’ฅ NOTE: ์œ„์—์„œ ์ •์˜ํ•œ <K-component mixture model> ์—ญ์‹œ PDF๊ฐ€ ๋œ๋‹ค. ์ฆ๋ช…์€ ํ™•๋ฅ ๅˆ์ด 1์ด ๋˜๋Š”์ง€ ํ™•์ธํ•ด๋ณด๋ฉด ๋˜๋Š”๋ฐ, ์‰ฝ๋‹ค.

component๋ฅผ ๋ช‡๊ฐœ ์“ฐ๋Š”์ง€์— ๋”ฐ๋ผ ๋ถ„ํฌ์—์„œ ๋‚˜ํƒ€๋‚˜๋Š” mode์˜ ์ตœ๋Œ€ ๊ฐฏ์ˆ˜๊ฐ€ ๋‹ฌ๋ผ์ง„๋‹ค.


Let $X_1, \dots, X_n$ be iid from a <K-component mixture>; that is, assume each data point $X_i$ was sampled from the mixture model.

์šฐ๋ฆฌ๋Š” Data points $\{ X_1, \dots, X_n \}$๋กœ๋ถ€ํ„ฐ parameter $\theta$๋ฅผ ์ถ”์ •ํ•ด์•ผ ํ•œ๋‹ค. ํ˜„์žฌ ์šฐ๋ฆฌ๋Š” $\theta$๋ฅผ ๋ชจ๋ฅด๋Š”(unknown) ์ƒํƒœ๋ฉฐ, ๊ทธ ํ˜•ํƒœ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\theta = (\pi_1, \dots, \pi_K, \; \phi_1, \dots, \phi_K)^T\]

์šฐ๋ฆฌ๋Š” ์ด unknown parameter $\phi$๋ฅผ ์ถ”์ •ํ•˜๊ธฐ ์œ„ํ•ด MLE๋ฅผ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๋‹ค.

Algorithm. MLE

Find $\theta$ that maximizes $\ell(\theta)$.

\[\ell (\theta) = \sum_{i=1}^n \log \left( \sum_{k=1}^K \pi_k f_{\phi_k} (X_i) \right)\]

์•„๋งˆ ๋ณดํ†ต์˜ MLE ๋ฌธ์ œ๋Š” ์œ„์˜ log-likelihood function์— ํŽธ๋ฏธ๋ถ„ ํ•ด์„œ, optimal solution์„ ์ฐพ๋Š”๋‹ค. ํ•˜์ง€๋งŒ, ์ด ๋ฌธ์ œ์˜ ๊ฒฝ์šฐ, ์ฐพ์•„์•ผ ํ•˜๋Š” ๋ณ€์ˆ˜๊ฐ€ $2K$ ์ •๋„ ๋˜๊ธฐ ๋•Œ๋ฌธ์—, ํŽธ๋ฏธ๋ถ„์œผ๋กœ๋Š” ์‰ฝ๊ฒŒ ํ’€ ์ˆ˜ ์—†๋‹ค. ์‹ฌ์ง€์–ด Gradient Descent ๊ฐ™์€ numerical method๋ฅผ ์“ฐ๋Š” ๊ฒƒ ์—ญ์‹œ ์–ด๋ ต๋‹ค๊ณ  ํ•œ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ๋ฐฉ๋ฒ•์€ ์žˆ๋‹ค!! ์ด๋Ÿฐ ๋ฌธ์ œ๋ฅผ ํ’€๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” <EMAlgorithm; Expectation-Maximization>์„ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๋‹ค!! ๐Ÿ˜Ž


EM Algorithm

๋จผ์ € <EM Algorithm>์ด ๋ญ”์ง€ ๊ทธ ์ •์˜๋ถ€ํ„ฐ ์‚ดํŽด๋ณด์ž.

EM is a general algorithm to find the MLE when a part of data is missing.

EM์—์„œ๋Š” ์ด๋Ÿฐ missing data๋ฅผ ์ž ์žฌ๋ณ€์ˆ˜, โ€œlatent variableโ€ $Z$๋ผ๊ณ  ํ•œ๋‹ค. ๊ทธ๋ž˜์„œ EM์—์„œ ์“ฐ๋Š” ๋ฐ์ดํ„ฐ $Y$๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ํ‘œํ˜„๋œ๋‹ค.

Let $Y = (X, Z)$ be a random vector following a density $p_{\theta}(\cdot)$ where $X$ is the "observed data" and $Z$ is the "missing data".

missing data $Z$๊ฐ€ ๋„์ž…๋˜์—ˆ์ง€๋งŒ, ์—ฌ์ „ํžˆ ์šฐ๋ฆฌ์˜ ๋ชฉํ‘œ๋Š” observed data $X$๋กœ likelihood๋ฅผ ์ตœ๋Œ€๊ฐ’์œผ๋กœ ๋งŒ๋“œ๋Š” $\theta$๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด๋‹ค!


Let $\ell(\theta; X, Z) = \log p_{\theta}(Y)$ be the "complete log-likelihood"; that is, we first define a log-likelihood function that uses both $X$ and $Z$.

์—ฌ๊ธฐ์„œ ์ž ๊น Insight๋ฅผ ์–ป๊ณ  ๊ฐ€์ž. โ€˜Lee-jaejoonโ€™๋‹˜์˜ ํฌ์ŠคํŠธ์˜ ํฌ์ŠคํŠธ๋ฅผ ๋งŽ์ด ์ฐธ๊ณ ํ–ˆ๋‹ค.

1. ๋งŒ์•ฝ ๋ชจ๋“  ํ™•๋ฅ ๋ณ€์ˆ˜๊ฐ€ ๊ด€์ธก ๊ฐ€๋Šฅํ–ˆ๋‹ค๋ฉด, ์ฆ‰, $Z = \emptyset$.

์šฐ๋ฆฌ๋Š” $\theta$์˜ ๊ฐ’์„ ์ถ”์ •ํ•˜๋Š” $\hat{\theta}$๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ์„ ๊ฒƒ์ด๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ์šฐ๋ฆฌ๋Š” ๊ด€์ธกํ•˜์ง€ ๋ชปํ•œ $Z$๊ฐ€ ์กด์žฌํ•˜๋ฏ€๋กœ, $\theta$์˜ ๊ฐ’์„ ์ถ”์ •ํ•  ์ˆ˜ ์—†๋‹ค.


2. ๋งŒ์•ฝ $\theta$์˜ ๊ฐ’์„ ์•Œ๊ณ  ์žˆ์—ˆ๋‹ค๋ฉด,

์šฐ๋ฆฌ๋Š” ์ด๊ฑธ๋กœ ํ™•๋ฅ ๋ถ„ํฌ๋ฅผ ์™„์ „ํžˆ ํŠน์ •ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ๊ด€์ธก๋˜์ง€ ์•Š์€ $Z$์˜ ๋ถ„ํฌ๋„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ parameter $\theta$๋Š” unknown์ด๊ธฐ ๋•Œ๋ฌธ์— $Z$์˜ ๋ถ„ํฌ ์—ญ์‹œ ์•Œ ์ˆ˜ ์—†๋‹ค.

์ฆ‰, ํ˜„์žฌ์˜ ์ƒํ™ฉ์€ $Z$๋ฅผ ๋ชจ๋ฅด๋‹ˆ $\theta$๋ฅผ ๋ชจ๋ฅด๊ณ , ๋ฐ˜๋Œ€๋กœ $\theta$๋ฅผ ๋ชจ๋ฅด๋‹ˆ $Z$๋ฅผ ๋ชจ๋ฅด๋Š” ์ƒํ™ฉ์ด๋‹ค.

<EM Algorithm>์€ โ€œ๋งŒ์•ฝ $\theta$๋ฅผ ์•ˆ๋‹ค๋ฉด?โ€๋ผ๋Š” ์•„์ด๋””์–ด๋กœ ์ด ์ƒํ™ฉ์„ ํ•ด๊ฒฐํ•œ๋‹ค.

Algorithm. EM Algorithm

1. Initialize Parameter

\[\theta \leftarrow \theta^{(0)}\]

2. Expectation Step; E-Step

๋งŒ์•ฝ ์šฐ๋ฆฌ๊ฐ€ $\theta$์˜ ๊ฐ’์ด $\theta^{(0)}$๋ผ๋Š” ๊ฒƒ์„ ์•Œ๊ณ  ์žˆ๋‹ค๋ฉด, ๋˜ ๊ด€์ธกํ•œ ๋ฐ์ดํ„ฐ $X$๊ฐ€ ์žˆ๋‹ค๋ฉด, ์šฐ๋ฆฌ๋Š” ์ด๋ฅผ ํ†ตํ•ด ๊ด€์ธกํ•˜์ง€ ๋ชปํ•œ missing data $Z$์˜ โ€˜์กฐ๊ฑด๋ถ€ ๋ถ„ํฌโ€˜์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

\[\text{Pr}(Z \mid X = x, \; \theta = \theta^{(0)})\]

is a conditional distribution of $Z$ given $X=x$ and $\theta = \theta^{(0)}$.
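For our mixture model, the missing data $Z_i$ is the cluster label of $X_i$, and Bayes' rule gives this conditional distribution in closed form:

\[\text{Pr}(Z_i = k \mid X_i, \; \theta = \theta^{(0)}) = \frac{\pi_k^{(0)} \, f_{\phi_k^{(0)}}(X_i)}{\sum_{j=1}^K \pi_j^{(0)} \, f_{\phi_j^{(0)}}(X_i)}\]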

์šฐ๋ฆฌ๋Š” ์ด $Z$์˜ ์กฐ๊ฑด๋ถ€ ๋ถ„ํฌ๋ฅผ ํ™œ์šฉํ•ด log-likelihood์—์„œ $Z$๋ฅผ marginalize ์‹œํ‚ฌ ๊ฒƒ์ด๋‹ค. ์ด๊ฒƒ์€ ์šฐ๋ฆฌ๊ฐ€ ๋ชจ๋ฅด๋Š” $Z$๋ฅผ ์—†์• ์ฃผ๊ธฐ ๋•Œ๋ฌธ์— ์ถฉ๋ถ„ํžˆ ๊ทธ๋Ÿด๋“ฏํ•œ ์ ‘๊ทผ์ด๋‹ค! ์ด๊ฒƒ์„ ํ‘œํ˜„ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[E_{Z \mid X, \theta^{(0)}} \; \ell(\theta; X, Z) = \sum_{z} \ell(\theta; X=x, z) \cdot p_{\theta^{(0)}} (z \mid X=x)\]

์ด๊ฒƒ์€ missing data $Z$์— ๊ฐ€๋Šฅํ•œ ๊ฐ’๋“ค์— ๋Œ€ํ•œ ํ‰๊ท ์ ์ธ log-likelihood์ด๋‹ค. ์ฆ‰, $Z$๊ฐ€ ์ •ํ™•ํžˆ ์–ด๋–ค ๊ฐ’์„ ๊ฐ–๋Š”์ง€๋Š” ๋ชจ๋ฅด์ง€๋งŒ, $Z$๊ฐ€ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” ๊ฐ’๋“ค์— ๋Œ€ํ•ด log-likelihood๊ฐ€ ํ‰๊ท ์ ์œผ๋กœ ์–ด๋–ค ์–‘์ƒ์ธ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด์ฃผ๋Š” โ€˜ํ•จ์ˆ˜โ€™๋‹ค.

์šฐ๋ฆฌ๋Š” ์ด๊ฒƒ์„ โ€˜log-likelihood์˜ ๋Œ€ํƒ€โ€™๋ผ๋Š” ์˜๋ฏธ์—์„œ โ€œsurrogate functionโ€ $Q(\theta \mid \theta^{(0)}$๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.

\[Q(\theta \mid \theta^{(0)}) = E_{Z \mid X, \theta^{(0)}} \; \ell(\theta; X, Z)\]
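In the Gaussian-mixture case, the E-step boils down to computing the conditional probabilities above for every point, often called "responsibilities". A minimal sketch (again with made-up parameter values):

```python
import numpy as np
from scipy.stats import norm

def e_step(X, pi, mu, sigma):
    """Responsibilities gamma[i, k] = Pr(Z_i = k | X_i, current theta)."""
    # unnormalized posterior: pi_k * N(X_i | mu_k, sigma_k^2)
    weighted = pi * norm.pdf(X[:, None], loc=mu, scale=sigma)  # shape (n, K)
    return weighted / weighted.sum(axis=1, keepdims=True)      # Bayes' rule

# Made-up current parameter values theta^(0)
X = np.array([-2.1, -1.8, 0.9, 1.2, 1.4])
gamma = e_step(X, np.array([0.5, 0.5]), np.array([-2.0, 1.0]), np.array([1.0, 1.0]))
print(gamma)  # each row sums to 1
```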


3. Maximization Step; M-Step

์ด์ œ ์šฐ๋ฆฌ๋Š” $Z$๋ฅผ marginalize outํ•œ, missing data์—์„œ ์ž์œ ๋กœ์šด surrogate function $Q(\theta \mid \theta^{(0)})$๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ์šฐ๋ฆฌ๋ฅผ ์ด $Q(\theta \mid \theta^{(0)})$๋ฅผ ๋งˆ์น˜ ์šฐ๋ฆฌ๊ฐ€ ๋ณธ๋ž˜ ๋ชฉํ‘œ๋กœ ํ–ˆ๋˜ likelihood ํ•จ์ˆ˜๋กœ ์ทจ๊ธ‰์— ์ด๊ฒƒ์— ๋Œ€ํ•ด maximize ํ•˜๋Š” $\theta$๋ฅผ ์ฐพ์•„์ค„ ๊ฒƒ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ทธ๋Ÿฐ $\theta$๋ฅผ ์ฐพ์•˜๋‹ค๋ฉด, ๊ทธ๊ฒƒ์„ ์ƒˆ๋กœ์šด $\theta$ ๊ฐ’์œผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค!!

\[\theta^{(1)} = \underset{\theta}{\text{argmax}} \; Q(\theta \mid \theta^{(0)})\]
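For a Gaussian mixture, this argmax has a closed form: the well-known responsibility-weighted updates, sketched below without derivation (the toy `gamma` matrix is hand-made for illustration).

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form argmax of Q for a 1-D Gaussian mixture,
    given E-step responsibilities gamma[i, k]."""
    Nk = gamma.sum(axis=0)                        # effective sample size per component
    pi = Nk / len(X)                              # updated mixing proportions
    mu = (gamma * X[:, None]).sum(axis=0) / Nk    # responsibility-weighted means
    var = (gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, np.sqrt(var)

# Toy usage with a hand-made responsibility matrix (rows sum to 1)
X = np.array([-2.1, -1.8, 0.9, 1.2])
gamma = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.05, 0.95]])
print(m_step(X, gamma))
```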

4. Repeat Until Convergence!
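Putting the pieces together, a minimal self-contained EM loop for a 1-D, 2-component Gaussian mixture might look like the following sketch (synthetic data and initial values are made up; the stopping rule monitors the observed-data log-likelihood):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic data drawn from two made-up Gaussian clusters
X = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(1.0, 1.5, 200)])

# 1. Initialize theta^(0)
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 0.5]), np.array([1.0, 1.0])

prev_ll = -np.inf
for it in range(200):
    # 2. E-step: responsibilities Pr(Z_i = k | X_i, theta^(t))
    log_w = np.log(pi) + norm.logpdf(X[:, None], loc=mu, scale=sigma)
    gamma = np.exp(log_w - logsumexp(log_w, axis=1, keepdims=True))

    # 3. M-step: closed-form argmax of Q(theta | theta^(t))
    Nk = gamma.sum(axis=0)
    pi = Nk / len(X)
    mu = (gamma * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)

    # 4. Repeat until l(theta^(t)) stops improving
    ll = logsumexp(log_w, axis=1).sum()
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

print(pi, mu, sigma)
```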

์ง€๊ธˆ์€ ๋ฐ”์˜๋‹ˆ, ์ถ”ํ›„์— ๋งˆ๋ฌด๋ฆฌ ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค!


References