๋ณธ ๊ธ€์€ 2018-2ํ•™๊ธฐ Stanford Univ.์˜ Andrew Ng ๊ต์ˆ˜๋‹˜์˜ Machine Learning(CS229) ์ˆ˜์—…์˜ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)


๋ณธ ๊ธ€์€ 2018-2ํ•™๊ธฐ Stanford Univ.์˜ Andrew Ng ๊ต์ˆ˜๋‹˜์˜ Machine Learning(CS229) ์ˆ˜์—…์˜ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)

โ€“ lecture 4


Multi-class Classification

์ง€๊ธˆ๊นŒ์ง€์˜ Classification Problem์€ $y \in \{0, 1\}$์˜ Binary Classification์ด์—ˆ๋‹ค. ์ด๋ฒˆ์—๋Š” $y \in \{1, 2, โ€ฆ, k\}$์˜ Multi-Class Classification์— ๋Œ€ํ•ด ์‚ดํŽด๋ณผ ๊ฒƒ์ด๋‹ค.


(์‚ฌ์ „์ง€์‹) Multinomial Distribution

์ด๋ฒˆ ๊ธ€์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด์„  Multinomial Distribution๋ฅผ ๋จผ์ € ์ดํ•ดํ•  ํ•„์š”๊ฐ€ ์žˆ๋‹ค.

์šฐ๋ฆฌ๋Š” ์ด๋ฏธ โ€˜-nomialโ€™์ด ๋ถ™์€ ๋‹จ์–ด๋ฅผ ํ•˜๋‚˜ ์•Œ๊ณ  ์žˆ๋‹ค. ๋ฐ”๋กœ Bi-nomial์ด๋‹ค. Binomial Distribution์€ ์ดํ•ญ๋ถ„ํฌ๋กœ, $N$๋ฒˆ์˜ ๋™์ „ ๋˜์ง€๊ธฐ์—์„œ ์•ž/๋’ท๋ฉด์ด ๋ช‡๋ฒˆ ๋‚˜์˜ฌ์ง€์— ๋Œ€ํ•œ ๋ถ„ํฌ๋ฅผ ๋– ์˜ฌ๋ฆฌ๋ฉด ๋œ๋‹ค. ์ดํ•ญ๋ถ„ํฌ๋Š” $B(n, p)$๋กœ ํ‘œํ˜„ํ•˜๋ฉฐ $n$๋Š” ์‹œํ–‰ํšŸ์ˆ˜, $p$๋Š” ๊ธฐ์ค€์ด ๋˜๋Š” event์˜ ํ™•๋ฅ ์ด๋‹ค.

์ดํ•ญ๋ถ„ํฌ์—์„œ $n$๋ฒˆ ์‹œํ–‰ ์ค‘ $k$๋ฒˆ ์„ฑ๊ณตํ•  ํ™•๋ฅ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$$p(k; n, p) = \binom{n}{k} p^{k} (1-p)^{(n-k)}$$

๊ทธ๋ฆฌ๊ณ  ์ดํ•ญ๊ณ„์ˆ˜ $\binom{n}{k}$๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ํ‘œํ˜„๋œ๋‹ค.

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$
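As a quick sanity check, the binomial pmf above can be computed directly with the standard library (a minimal sketch; the function name `binomial_pmf` is just for illustration):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g. the probability of exactly 5 heads in 10 fair coin flips:
# comb(10, 5) * 0.5**10 = 252 / 1024
prob = binomial_pmf(5, 10, 0.5)  # 0.24609375
```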


์ดํ•ญ๋ถ„ํฌ์˜ ์ƒํ™ฉ์„ Multi-class๋กœ ํ™•์žฅํ•˜๋ฉด, ๋‹คํ•ญ๋ถ„ํฌ, Multinomial์ด ๋œ๋‹ค.

$k$๊ฐœ ์นดํ…Œ๊ณ ๋ฆฌ์— ๊ทธ ๊ฐ’๋“ค์ด ๋‚˜ํƒ€๋‚  ํ™•๋ฅ ์„ ๊ฐ๊ฐ $p_1$, $p_2$, โ€ฆ, $p_k$๋ผ๊ณ  ํ•˜์ž. $n$๋ฒˆ์˜ ์‹œํ–‰์—์„œ $i$๋ฒˆ์งธ ๊ฐ’์ด $x_i$ํšŒ ๋ฐœ์ƒํ•  ํ™•๋ฅ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$$p(\vec{x}; n, \vec{p}) = \frac{n!}{x_1! x_2! \ldots x_k!} \prod_{i=1}^{k} {p_i^{x_i}}$$

Multinomial Distribution์—์„œ๋Š” ํ‘œ๋ณธ๊ฐ’์ด ๋ฒกํ„ฐ $\vec{x}=(x_1, \ldots, x_k)$๊ฐ€ ๋œ๋‹ค. ์ฆ‰, ์ฃผ์‚ฌ์œ„๋ฅผ 10๋ฒˆ ๋˜์ ธ ์ฃผ์‚ฌ์œ„ ๋ˆˆ์˜ ์ถœํ˜„ ํšŸ์ˆ˜๊ฐ€ $(3, 2, 3, 1, 1, 0)$์ผ ํ™•๋ฅ ์„ Multinomial Distribution์„ ํ†ตํ•ด ์–ป์„ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด๋‹ค.

๋‹คํ•ญ๋ถ„ํฌ์—์„œ์˜ ๊ณ„์ˆ˜๋„ ์ดํ•ญ๋ถ„ํฌ์˜ ์ดํ•ญ๊ณ„์ˆ˜ $\binom{n}{k}$์™€ ๊ฐ™์ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

$$\binom{n}{\vec{x}} = \binom{n}{x_1, \ldots, x_k} = \frac{n!}{x_1! x_2! \ldots x_k!}$$
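The die example above can be checked numerically (a minimal sketch; `multinomial_pmf` is a hypothetical helper built from the formula):

```python
from math import factorial, prod

def multinomial_pmf(x, p):
    """Probability of the count vector x under category probabilities p."""
    n = sum(x)
    coef = factorial(n)               # multinomial coefficient n! / (x_1! ... x_k!)
    for xi in x:
        coef //= factorial(xi)        # exact at every step since the result is an integer
    return coef * prod(pi**xi for pi, xi in zip(p, x))

# 10 rolls of a fair die landing with face counts (3, 2, 3, 1, 1, 0):
# coefficient is 10! / (3! 2! 3! 1! 1! 0!) = 50400
prob = multinomial_pmf([3, 2, 3, 1, 1, 0], [1/6] * 6)
```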

Multi-Class Classification with GLM

Multi-Class Classification Problem์„ GLM์˜ ๊ผด๋กœ ๊ธฐ์ˆ ํ•ด๋ณด์ž.

๋จผ์ € $T(y)$๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜ํ•  ๊ฒƒ์ด๋‹ค.

$$T(1) = \begin{bmatrix}1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, T(2) = \begin{bmatrix}0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \ldots, T(k) = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$$

๊ทธ๋ฆฌ๊ณ  $(T(y))_i$๋Š” ๋ฒกํ„ฐ $T(y)$์˜ i๋ฒˆ์งธ ์›์†Œ๋ฅผ ๊ฐ€๋ฆฌํ‚จ๋‹ค.

์ด๋•Œ ํŽธ์˜๋ฅผ ์œ„ํ•ด ๋งˆ์ง€๋ง‰ ํด๋ž˜์Šค์ธ $k$๋ฅผ ๋‹ค๋ฅธ $k-1$์˜ ํด๋ž˜์Šค๋กœ ์œ ๋„๋˜๋Š” ํด๋ž˜์Šค๋กœ ์ •์˜ํ•˜์ž. ๊ทธ ๋ง์€ ๊ณง $T(y)$๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜ํ•˜์ž๋Š” ๊ฒƒ์ด๋‹ค.

$$T(1) = \begin{bmatrix}1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, T(2) = \begin{bmatrix}0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \ldots, T(k) = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

์ด๊ฒƒ์„ ํ†ตํ•ด ๋ฒกํ„ฐ $T(y)$์˜ ์ฐจ์›์„ $\mathbb{R^{k-1}}$๋กœ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  $k$์˜ ํ™•๋ฅ  $p(y=k;\phi)$๋Š” $1 - \sum_{i=1}^{k-1} {\phi_i}$๋กœ ์ •์˜ํ•œ๋‹ค.


์ด๋ฒˆ์—” ๊ฐ class ๋ณ„๋กœ parameter $\theta_{i} \in \mathbb{R}^n$๋ฅผ ์ •์˜ํ•  ๊ฒƒ์ด๋‹ค.

๊ทธ๋ž˜์„œ ์ „์ฒด class์˜ parameter๋ฅผ ๋ชจ์€ $\theta$๋Š” $\mathbb{R}^{n \times k}$์˜ ํ–‰๋ ฌ์ด ๋œ๋‹ค.

$$\theta = \begin{bmatrix} - \theta_1 - \\ - \theta_2 - \\ \vdots \\ - \theta_k - \end{bmatrix}$$

์šฐ๋ฆฌ๋Š” ๋˜ ํ•˜๋‚˜์˜ ์ƒˆ๋กœ์šด ํ‘œ๊ธฐ๋ฒ•์„ ๋„์ž…ํ•œ๋‹ค. indicator function $1 \{ \cdot \}$์€ ๊ด„ํ˜ธ ์•ˆ์˜ ๋ช…์ œ๊ฐ€ ์ฐธ์ด๋ผ๋ฉด 1์„, ๊ฑฐ์ง“์ด๋ผ๋ฉด 0์„ ๋ฐ˜ํ™˜ํ•˜๋Š” ํ•จ์ˆ˜์ด๋‹ค. ๊ทธ๋ž˜์„œ $1\{\textrm{True}\}$๋Š” $1$์ด๊ณ , $1\{\textrm{False}\}$๋Š” $0$์ด๋‹ค.

์ด๊ฒƒ์„ ํ™œ์šฉํ•ด ํ™•๋ฅ  $p(y; \phi)$๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‘œํ˜„ํ•œ๋‹ค.

$$ \begin{split} p(y; \phi) &= \phi_1^{1 \{ y=1 \}} \phi_2^{1 \{ y=2 \}} \ldots \phi_k^{1 \{ y=k \}} \\ &= \phi_1^{1 \{ y=1 \}} \phi_2^{1 \{ y=2 \}} \ldots \phi_k^{1 - \sum_{i=1}^{k-1} 1 \{ y=i \}} \\ &= \phi_1^{(T(y))_1} \phi_2^{(T(y))_2} \ldots \phi_k^{1 - \sum_{i=1}^{k-1}(T(y))_i} \end{split} $$

์ด์ œ $p(y; \phi)$์— Algebraic Massage๋ฅผ ํ•ด๋ณด๋ฉด,

$$ \begin{split} p(y; \phi) &= \phi_1^{(T(y))_1} \phi_2^{(T(y))_2} \ldots \phi_k^{1 - \sum_{i=1}^{k-1}(T(y))_i} \\ &= \exp{\left[ (T(y))_1 \log{(\phi_1)} + (T(y))_2 \log{(\phi_2)} + \ldots + \left(1 - \sum_{i=1}^{k-1}(T(y))_i\right) \log{(\phi_k)} \right]} \\ &= \exp{\left[ (T(y))_1 \log{(\phi_1/\phi_k)} + (T(y))_2 \log{(\phi_2/\phi_k)} + \ldots + \log{(\phi_k)} \right]} \\ &= b(y) \exp {(\eta^{T} T(y) -a(\eta))} \end{split} $$

์ด์ œ GLM์˜ ๊ฐ ์š”์†Œ๋“ค์„ ํ™•์ธํ•ด๋ณด๋ฉด,

  • $\eta$: $\left[ {\log{(\phi_1/\phi_k)}, \log{(\phi_2/\phi_k)}, \cdots, \log{(\phi_{k-1}/\phi_k)}} \right]^{T}$
  • $a(\eta)$: $-\log{(\phi_k)}$
  • $b(y)$: $1$

๊ฐ€ ๋œ๋‹ค.

Since $\eta = \left[ {\log{(\phi_1/\phi_k)}, \log{(\phi_2/\phi_k)}, \cdots, \log{(\phi_{k-1}/\phi_k)}} \right]^{T}$, the link function connecting the natural parameter $\eta$ with the canonical parameter $\phi$ is defined as:

$$ \eta_i = \log{\frac{\phi_i}{\phi_k}} $$

์ด๋ฒˆ์—” link function์˜ ์—ญํ•จ์ˆ˜๋ฅผ ์ทจํ•ด response function์„ ์‚ดํŽด๋ณด์ž.

$$ \begin{split} e^{\eta_i} &= \frac{\phi_i}{\phi_k} \\ \phi_k e^{\eta_i} &= \phi_i \\ \phi_k \sum_{i=1}^{k} {e^{\eta_i}} &= \sum_{i=1}^{k} \phi_i = 1 \end{split} $$

๋”ฐ๋ผ์„œ $\phi_k = 1/{\sum_{i=1}^{k}{e^{\eta_i}}}$๋ฅผ ์˜๋ฏธํ•˜๊ณ , ์ด๊ฒƒ์„ ์ด์šฉํ•ด $e^{\eta_i} = \frac{\phi_i}{\phi_k}$๋ฅผ ๋‹ค์‹œ์“ฐ๋ฉด

$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k}{e^{\eta_j}}}$$

๊ฐ€ ๋œ๋‹ค. $\eta$๋ฅผ $\phi$๋กœ ๋งคํ•‘ํ•˜๋Š” ์ด ํ•จ์ˆ˜๋ฅผ ์šฐ๋ฆฌ๋Š” softmax function์ด๋ผ๊ณ  ํ•œ๋‹ค!

์ด์ œ ์ด softmax function์„ ์ด์šฉํ•ด ํ™•๋ฅ  $p(y=i \vert x; \theta)$๋ฅผ ๋‹ค์‹œ ์ •์˜ํ•ด๋ณด์ž.

$$ \begin{split} p(y=i \vert x; \theta) &= \phi_i \\ &= \frac{e^{\eta_i}}{\sum_{j=1}^{k}{e^{\eta_j}}} \\ &= \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k}{e^{\theta_j^T x}}} \end{split} $$

์ด ๊ณผ์ •์—์„œ GLM์˜ ๊ฐ€์ •์ธ โ€œnatural parameter $\eta$ and model parameter $\theta$ are linearly relatedโ€๋ฅผ ์ ์šฉํ•˜์˜€๋‹ค.

์ด๋ ‡๊ฒŒ softmax function์„ response function์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” regression์„ softmax regression์ด๋ผ๊ณ  ํ•œ๋‹ค. softmax regression์€ logistic regression์˜ general model์ด๋‹ค.


์ด์ œ ์šฐ๋ฆฌ์˜ ์ตœ์ข…์ ์ธ ์ถœ๋ ฅ๊ฐ’์ธ hypothesis $h_{\theta}(x)$๋ฅผ ์‚ดํŽด๋ณด์ž. GLM์—์„œ $h_{\theta}(x)$๋Š” ๊ฐ€์ •์— ์˜ํ•ด $\textrm{E}[T(y) \vert x; \theta]$์ด๋‹ค.

$$ \begin{split} h_{\theta}(x) &= \textrm{E}[T(y) \vert x; \theta] \\ &= \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{k-1} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\exp({\theta_1^T x})}{\sum_{j=1}^{k}{\exp({\theta_j^T x})}} \\ \frac{\exp({\theta_2^T x})}{\sum_{j=1}^{k}{\exp({\theta_j^T x})}} \\ \vdots \\ \frac{\exp({\theta_{k-1}^T x})}{\sum_{j=1}^{k}{\exp({\theta_j^T x})}} \end{bmatrix} \end{split} $$

์œ„์˜ ์‹์—์„œ๋Š” $i=1, \ldots, k-1$์—์„œ์˜ $p(y=i \vert x; \theta)$๋งŒ์„ ๋‹ค๋ฃจ๊ณ  ์žˆ๋‹ค. $p(y=k \vert x; \theta)$์˜ ๊ฒฝ์šฐ๋Š” $1-\sum_{i=1}^{k-1} {\phi_i}$๋กœ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค.


Cross Entropy

์•ž์—์„œ ๋‹ค๋ฃฌ Softmax Regression์„ ๊ทธ๋ฆผ์„ ํ†ตํ•ด ๋ณต์Šตํ•˜๋ฉด์„œ Softmax Regression์˜ Loss function์ธ Cross Entropy์— ๋Œ€ํ•ด ์‚ดํŽด๋ณด์ž.


์šฐ๋ฆฌ๋Š” $\theta$์™€ linearly related ๋œ $\eta$์— exponential๊ณผ normalize๋ฅผ ์ทจํ•˜์—ฌ predicted probability์ธ $\hat{p}(y)$์„ ์œ ๋„ํ•˜์˜€๋‹ค.


ํ•˜์ง€๋งŒ $\hat{p}(y)$์€ ์—„์—ฐํžˆ predicted ๊ฐ’์ผ ๋ฟ! ์šฐ๋ฆฌ๋Š” $\hat{p}(y)$๊ณผ ์‹ค์ œ ๊ฐ’์ธ $p(y)$๋ฅผ ๋น„๊ตํ•˜์—ฌ ๋‘˜ ์‚ฌ์ด์˜ ์˜ค์ฐจ๋ฅผ ์ตœ์†Œํ™” ํ•ด์•ผ ํ•œ๋‹ค. ์ด๋•Œ ์ •๋‹ต ๋ ˆ์ด๋ธ”์— ๋Œ€ํ•ด $p(y)$๋Š” $1$์˜ ๊ฐ’์„ ๊ฐ€์ง„๋‹ค.

์ด์ œ $\hat{p}(y)$์™€ $p(y)$ ์‚ฌ์ด์˜ ์˜ค์ฐจ๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ์ง€ํ‘œ์ธ Cross Entropy๊ฐ€ ๋“ฑ์žฅํ•œ๋‹ค. Cross Entropy๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋œ๋‹ค.

$$\textrm{CrossEnt}(p, \hat{p}) = - \sum_{i \in \{1, \ldots, k\}} {\left(p(y_i) \log {\hat{p}(y_i)} \right)}$$

์ด๋•Œ $p(y)$์˜ ๊ฐ’์€ ์ •๋‹ต ๋ ˆ์ด๋ธ”์— ๋Œ€ํ•ด์„œ๋งŒ $1$์˜ ๊ฐ’์„ ๊ฐ–๊ธฐ ๋•Œ๋ฌธ์—, $\textrm{CrossEnt}(p, \hat{p})$์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ธฐ์ˆ ๋œ๋‹ค.

$$ \begin{split} \textrm{CrossEnt}(p, \hat{p}) &= - \sum_{i \in \{1, \ldots, k\}} {\left(p(y_i) \log {\hat{p}(y_i)} \right)} \\ &= - p(y_z) \log {\hat{p}(y_z)} \end{split} $$

๊ทธ๋ฆฌ๊ณ  ์šฐ๋ฆฌ๊ฐ€ ์•ž์„  ๊ณผ์ •์—์„œ ๊ตฌํ•œ $\hat{p}(y_i)$์˜ ์‹์„ ๋Œ€์ž…ํ•˜๋ฉด,

$$ \begin{split} \textrm{CrossEnt}(p, \hat{p}) &= - \sum_{i \in \{1, \ldots, k\}} {\left(p(y_i) \log {\hat{p}(y_i)} \right)} \\ &= - p(y_z) \log {\hat{p}(y_z)} \\ &= - \log { \frac{\exp({\theta_{z}^T x})}{\sum_{i=1}^{k}{\exp({\theta_i^T x})}} } \end{split} $$
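The final expression can be sketched in NumPy (the example $\theta$, $x$, and label are made up for illustration; the log-sum-exp rearrangement is the standard numerically stable way to evaluate $-\log$ of a softmax):

```python
import numpy as np

def cross_entropy(theta, x, z):
    """Cross-entropy loss for true class z (1-indexed): -log p(y=z | x; theta)."""
    eta = theta @ x
    eta = eta - np.max(eta)                      # stabilize before exponentiating
    log_phi = eta - np.log(np.exp(eta).sum())    # log softmax
    return -log_phi[z - 1]

theta = np.array([[0.5, -0.2],                   # illustrative parameters, k = 3, n = 2
                  [0.1,  0.3],
                  [0.0,  0.0]])
x = np.array([1.0, 2.0])
loss = cross_entropy(theta, x, 2)                # loss when the true label is y = 2
```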

์ด์ œ parameter $\theta$๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๊ณ ์ž ํ•œ๋‹ค๋ฉด, ์œ„์˜ $\textrm{CrossEnt}(p, \hat{p})$์— Gradient Descent ๋ฐฉ๋ฒ•์„ ์ทจํ•จ์œผ๋กœ์จ Softmax Regression Model์„ ์ตœ์ ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค!


๋งบ์Œ๋ง

๋ณธ ๊ธ€์—์„œ๋Š” Multi-class Classification์„ GLM์˜ ๊ด€์ ์—์„œ ์‚ดํŽด๋ณด์•˜๋‹ค. ๋‚ด์šฉ์„ ์š”์•ฝํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • Multi-class Classification์€ Multinomial์—์„œ ์ถœ๋ฐœํ•œ๋‹ค.
  • softmax function ํ•จ์ˆ˜๋Š” $\eta$๋ฅผ $\phi$๋กœ ๋งคํ•‘ํ•˜๋Š” response function์ด๋‹ค.
  • Cross Entropy๋Š” ์ •๋‹ต ๋ ˆ์ด๋ธ” $p(y)$๊ณผ softmax function์œผ๋กœ ์–ป์€ predicted probability $\hat{p}(y)$ ์‚ฌ์ด์˜ ์˜ค์ฐจ๋ฅผ ์ •์˜ํ•˜๋Š” ํ•จ์ˆ˜์ด๋‹ค.