Notes from the 'Statistical Data Mining' course I took in Spring 2021. Corrections are always welcome :)


Classification by Linear Regression

Binary Classification

Assume that $\mathcal{Y}= \{ 0, 1 \}$, and consider the <linear regression> model

\[Y = X^T \beta + \epsilon\]

Given an estimator $\hat{\beta}$ and an input $X=x$, predict $Y$ by thresholding:

\[\hat{Y} = \begin{cases} \quad 1 & \text{if } x^{T}\hat{\beta} > 0.5 \\ \quad 0 & \text{if } x^{T}\hat{\beta} \le 0.5 \end{cases}\]

Here, $x^T \hat{\beta}$ can be thought of as an estimate of a function that outputs the probability of $Y=1$:

\[\hat{f}(x) = \text{Pr}(Y = 1 \mid X = x) = E(Y \mid X = x)\]
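As a sanity check of the thresholding rule above, here is a minimal NumPy sketch (the toy data and variable names are illustrative, not from the course): fit $\hat{\beta}$ by least squares on $0/1$ labels, then predict $1$ whenever $x^T\hat{\beta} > 0.5$.

```python
import numpy as np

# Toy data: two 1-D Gaussian clusters labeled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 1)),
               rng.normal(+1.0, 1.0, (50, 1))])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Add an intercept column and fit beta_hat by least squares.
Xb = np.hstack([np.ones((100, 1)), X])
beta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Predict 1 when x^T beta_hat > 0.5, else 0.
y_hat = (Xb @ beta_hat > 0.5).astype(int)
print((y_hat == y).mean())  # training accuracy
```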

์ด๋ จ ํ˜•ํƒœ์˜ ๊ฐ Class์— ๋Œ€ํ•œ posterior probability๋ฅผ ๊ณ„์‚ฐํ•ด Classify๋ฅผ ์ง„ํ–‰ํ•˜๋Š” ๋ชจ๋ธ์„ <Bayes Classifier>๋ผ๊ณ  ํ†ต์นญํ•œ๋‹ค.

Multi-class Classification

Assume that $\mathcal{Y}=\{1, \dots, K\}$, and let $\mathbf{Y}$ be the $n \times K$ binary matrix where the $(i, k)$ element is $1$ if $y_i = k$ and $0$ otherwise.

Let

\[\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}\]

(derived from the RSS (least squares) criterion!)

\[\hat{\mathbf{Y}} = \mathbf{X} \hat{\mathbf{B}}\]

With this setup, at inference time, for $X=x$ one may predict $Y$ as

\[\hat{Y} = \underset{k\in\mathcal{Y}}{\text{argmax}} \; {\hat{f}_k (x)}\]

where $(\hat{f}_1 (x), \dots, \hat{f}_K(x)) = x^T \hat{\mathbf{B}}$.

์ด๋•Œ, ๊ฐ $\hat{f}_i (x)$๋Š” binary classification์˜ $\hat{f}(x)$์™€ ๋น„์Šทํ•˜๊ฒŒ ์•„๋ž˜์˜ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ  ๋˜๋Š” poster-priority๋ฅผ ์˜ˆ์ธกํ•˜๋Š” estimator์˜ ์—ญํ• ์„ ํ•œ๋‹ค.

\[\hat{f}_k (x) = \text{Pr}(Y=k \mid X=x)\]
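The indicator-matrix recipe above can be sketched in a few lines of NumPy (the three toy clusters and variable names are illustrative): build $\mathbf{Y}$, solve for $\hat{\mathbf{B}}$, and classify by the argmax of $x^T\hat{\mathbf{B}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_per = 3, 40
# Three 2-D Gaussian clusters, labeled 1..K.
means = np.array([[0., 0.], [3., 0.], [0., 3.]])
X = np.vstack([rng.normal(means[k], 1.0, (n_per, 2)) for k in range(K)])
y = np.repeat(np.arange(1, K + 1), n_per)

# Indicator response matrix Y: Y[i, k-1] = 1 iff y_i = k.
Y = (y[:, None] == np.arange(1, K + 1)[None, :]).astype(float)

# B_hat = (X^T X)^{-1} X^T Y, with an intercept column included in X.
Xb = np.hstack([np.ones((len(X), 1)), X])
B_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ Y)

# Predict argmax_k of (f_hat_1(x), ..., f_hat_K(x)) = x^T B_hat.
y_hat = np.argmax(Xb @ B_hat, axis=1) + 1
print((y_hat == y).mean())  # training accuracy
```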


Problem.

(์ถ”ํ›„์— ๊ธฐ์ˆ )


Probabilistic Model

์œ„์˜ ๋‘ ๊ฒฝ์šฐ์—์„œ ์‚ดํŽด๋ดค๋“ฏ, Linear Regression์— ์˜ํ•œ ์ ‘๊ทผ์€ ๊ฒฐ๊ตญ ์•„๋ž˜์˜ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ estimate ํ•˜๋Š” ์ ‘๊ทผ์ด์—ˆ๋‹ค.

\[\text{Pr}(Y=k \mid X=x)\]

์ด๋•Œ, ์ด ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์€ posterior probability๋กœ <๋ฒ ์ด์ฆˆ ์ •๋ฆฌ>์— ์˜ํ•ด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\text{Pr}(Y=k \mid X=x) = \frac{\pi_k \cdot f_k(x)}{\displaystyle \sum^K_{i=1} \pi_i \cdot f_i (x)}\]

์ด๋•Œ, ์šฐ๋ณ€์˜ ๊ฐ ํ•ญ๋ชฉ๋“ค์„ ์‚ดํŽด๋ณด๋ฉด

  • $\pi_k$: the <class probability> of class $k$; prior probability
\[\pi_k = P(y=k)\]
  • $f_k (x)$: the <class-conditional density> of $X$ in class $Y=k$
\[f_k (x) = P(X=x \mid Y=k)\]

If we knew both $\pi_k$ and $f_k (x)$, then given $X=x$ we could estimate which class $Y=k$ should be chosen. However, $\pi_k$ and $f_k(x)$ are both quantities we must estimate, and the course covers how to approach this estimation and which assumptions to use.
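To make Bayes' theorem above concrete, here is a tiny sketch with hypothetical priors $\pi_k$ and 1-D normal class-conditional densities $f_k$ (all numbers are made up for illustration): the posterior is just the normalized product $\pi_k f_k(x)$.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma2):
    # Density of N(mu, sigma2) at x.
    return exp(-(x - mu) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

# Hypothetical two-class setup: priors pi_k and parameters of f_k.
priors = [0.7, 0.3]
params = [(-1.0, 1.0), (1.0, 1.0)]  # (mu_k, sigma^2_k)

x = 0.5
num = [p * normal_pdf(x, mu, s2) for p, (mu, s2) in zip(priors, params)]
posterior = [v / sum(num) for v in num]  # Pr(Y=k | X=x), k = 1, 2
print(posterior)
```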

Linear Discriminant Analysis

LDA assumes that $f_k$ follows the normal distribution $N(\mu_k, \Sigma_k)$ below. 1

\[f_k(x) = \frac{1}{\left| 2\pi \Sigma_k \right|^{1/2}} \exp \left[ - \frac{(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)}{2}\right]\]

Now, if for every $\Sigma_k$ we add the very special assumption that

\[\Sigma_k = \Sigma \quad \text{for all} \; k\]

then we get <LDA; Linear Discriminant Analysis>! 😁


์œ„์™€ ๊ฐ™์ด ์„ค์ •ํ–ˆ์„ ๋•Œ์˜ <decision boundary>๋ฅผ ์ƒ๊ฐํ•ด๋ณด์ž.

\[P(Y=k \mid X=x) = P(Y=l \mid X=x)\]

boundary์— ๋Œ€ํ•œ ์œ„์˜ ์‹์„ ์ž˜ ์ •๋ฆฌํ•ด๋ณด๋ฉด,

\[\begin{aligned} \log \frac{P(Y=k \mid X=x)}{P(Y=l \mid X=x)} &= \log \frac{\pi_k \cdot f_k(x)}{\pi_l \cdot f_l(x)} = \log \frac{\pi_k}{\pi_l} + \log \frac{f_k(x)}{f_l(x)} \\ &= \log \frac{\pi_k}{\pi_l} + \log \left( \frac{\exp\left( -\frac{(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}{2}\right)}{\exp\left( -\frac{(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)}{2}\right)}\right) \\ &= \log \frac{\pi_k}{\pi_l} + \log \left( \exp \left( \frac{-(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)}{2}\right) \right) \\ &= \log \frac{\pi_k}{\pi_l} + \left( \frac{-(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)}{2}\right) \\ &= \left(\log \frac{\pi_k}{\pi_l} -\frac{1}{2} (\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l)\right) + x^T \Sigma^{-1} (\mu_k - \mu_l) \\ &= a + b^T x = 0 \end{aligned}\]

์ผ๋ถ€ ์Šคํ…์€ ๊ณผ์ •์„ ์ƒ๋žตํ•˜๊ณ  ๊ฒฐ๊ณผ๋งŒ ๋ฐ”๋กœ ์ ์—ˆ๋Š”๋ฐ, ์ž์„ธํ•œ ๊ณผ์ •์€ ์•„๋ž˜์˜ ํŽผ์ณ๋ณด๊ธฐ์— ๊ธฐ์ˆ ํ•ด๋‘๊ฒ ๋‹ค.

ํŽผ์ณ๋ณด๊ธฐ

(to be written)

์œ„์˜ ์Šคํ…์˜ ๋งˆ์ง€๋ง‰์— $a+b^Tx = 0$๋ผ๋Š” ํ‰๋ฉด์‹์ด ์œ ๋„ ๋˜์—ˆ๋Š”๋ฐ, ๊ฒฐ๊ตญ ์ด $a+b^T x = 0$์ด๋ผ๋Š” hyper-plain์ด <decision boundary>๊ฐ€ ๋จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค!!


Classification.

์ด์ œ LDA๋ฅผ ์ด์šฉํ•ด classification์„ ์–ด๋–ป๊ฒŒ ์ˆ˜ํ–‰ํ•˜๋Š”์ง€ ๋…ผํ•ด๋ณด์ž.

๋จผ์ € $P(Y = k \mid X=x)$์—์„œ ๋ถ„๋ชจ๊ฐ€ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฐ’์„ ๊ฐ€์งˆ ํ…Œ๋‹ˆ, ๋ถ„์ž์ธ $\pi_k \cdot f_k(x)$๋ฅผ maximizeํ•˜๋ฉด ๋œ๋‹ค! $f_k(x)$์— ์ง€์ˆ˜์‹์ด ์žˆ์œผ๋‹ˆ ๊ณ„์‚ฐ์˜ ํŽธ์˜๋ฅผ ์œ„ํ•ด $\log$๋ฅผ ์”Œ์›Œ์ฃผ์ž.

\[\delta_k (x) = \log \pi_k + x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k\]

Here, $\delta_k (x)$, which equals $\log \left( \pi_k \cdot f_k(x) \right)$ up to additive terms that do not depend on $k$, is called the <linear discriminant function>. Taking the $\text{argmax}$ over $\{\delta_k(x)\}_{k \in \mathcal{Y}}$,

\[\hat{y} = \underset{k \in \mathcal{Y}}{\text{argmax}} \; \delta_k (x)\]

๊ฐ€์žฅ ํฐ discriminant function $\delta_k(x)$ ํ•จ์ˆ˜๊ฐ’์„ ๊ฐ–๋Š” ํด๋ž˜์Šค๋กœ classification์„ ์ง„ํ–‰ํ•œ๋‹ค!


Parameter Estimation.

์•„์ง ์šฐ๋ฆฌ๋Š” $\pi_k$์™€ $f_k(x)$๋ฅผ ์ •ํ™•ํžˆ ์ •์˜ํ•˜์ง€๋Š” ์•Š์•˜๋‹ค. $f_k(x)$๊ฐ€ normal dist. $N(\mu_k, \Sigma)$๋ฅผ ๋”ฐ๋ฅธ๋‹ค๊ณ  ๊ฐ€์ •์€ ํ–ˆ์ง€๋งŒ, $\mu_k$, $\Sigma$์— ๋Œ€ํ•œ ๋ช…ํ™•ํžˆ ์ •์˜ํ•˜์ง€๋Š” ์•Š์•˜์—ˆ๋‹ค. ์ด๋ฒˆ ํŒŒํŠธ์—์„œ๋Š” ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์„ ์–ด๋–ป๊ฒŒ ์ถ”์ •ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ์‚ดํŽด๋ณธ๋‹ค.

Let $\displaystyle n_k = \sum^n_{i=1} I(y_i=k)$, then

\[\hat{\pi}_k = \frac{n_k}{n} \quad \text{(sample proportion)}\] \[\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i \quad \text{(sample mean)}\] \[\hat{\Sigma} = \frac{1}{n-K} \sum^K_{k=1} \sum_{i: y_i=k} (x_i - \hat{\mu}_k) (x_i - \hat{\mu}_k)^T \quad \text{(pooled sample cov-var matrix)}\]
ํŽผ์ณ๋ณด๊ธฐ

1. When $\mu_k$ is known

\[\hat{\Sigma} = \frac{1}{n} \sum^K_{k=1} \sum_{i: y_i=k} (x_i - \mu_k) (x_i - \mu_k)^T\]

2. When $\mu_k$ is unknown

\[\hat{\Sigma} = \frac{1}{n-K} \sum^K_{k=1} \sum_{i: y_i=k} (x_i - \hat{\mu}_k) (x_i - \hat{\mu}_k)^T\]
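The three estimators above translate almost line-for-line into NumPy. A minimal sketch on toy data (the clusters and names are illustrative): sample proportions, class means, and the pooled covariance with divisor $n-K$.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 2
# Toy data: two 2-D Gaussian clusters with a shared covariance.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (30, 2)),
               rng.normal([3.0, 1.0], 1.0, (40, 2))])
y = np.repeat([0, 1], [30, 40])
n = len(y)

# pi_hat_k = n_k / n  (sample proportion)
pi_hat = np.array([(y == k).mean() for k in range(K)])

# mu_hat_k = class-wise sample mean
mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])

# Pooled covariance: within-class scatter summed over classes, over (n - K).
S = sum((X[y == k] - mu_hat[k]).T @ (X[y == k] - mu_hat[k])
        for k in range(K))
Sigma_hat = S / (n - K)
print(pi_hat)
print(Sigma_hat)
```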

Quadratic Discriminant Analysis

LDA์—์„œ๋Š” ๊ฐ class $k$์˜ Cov-Var matrix $\Sigma_k$๊ฐ€ ๋ชจ๋‘ ๊ฐ™์€ ๊ฐ’์„ ๊ฐ€์ง„๋‹ค๊ณ  ๊ฐ€์ •ํ–ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ๋งŒ์•ฝ ์ด ๊ฐ€์ •์„ ์ œ๊ฑฐํ•œ๋‹ค๋ฉด, ์šฐ๋ฆฌ๋Š” <QDA; Quadratic Discriminant Analysis>๋ฅผ ํ•˜๊ฒŒ ๋œ๋‹ค.

\[\delta_k (x) = \log \pi_k - \frac{1}{2} \log \left| \Sigma_k \right| - \frac{1}{2} (x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\]

LDA์—์„œ์˜ $\delta_k(x)$์™€ ๋น„๊ตํ•ด๋ฉด, QDA์˜ ๊ฒฝ์šฐ, cancel out์ด ๋œ ๋˜๊ธฐ ๋•Œ๋ฌธ์— โ€œ2์ฐจ์‹โ€์ด ๋‚จ๊ฒŒ ๋œ๋‹ค!

(One of the two is QDA; the other is the result of running LDA with quadratic terms in $X_i$ ($X_i^2, X_iX_j$) added as features.)
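The QDA discriminant above can be sketched the same way as the LDA one, now with per-class covariances (all parameter values below are hypothetical):

```python
import numpy as np

# Hypothetical per-class parameters; the Sigma_k now differ across classes.
priors = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.0],
                               [0.0, 0.5]])]

def delta_qda(x, k):
    # delta_k(x) = log pi_k - (1/2) log|Sigma_k|
    #              - (1/2) (x - mu_k)^T Sigma_k^{-1} (x - mu_k)
    d = x - mus[k]
    return (np.log(priors[k])
            - 0.5 * np.log(np.linalg.det(Sigmas[k]))
            - 0.5 * d @ np.linalg.inv(Sigmas[k]) @ d)

x = np.array([1.0, 1.0])
y_hat = max(range(2), key=lambda k: delta_qda(x, k))
print(y_hat)
```

Because each class carries its own $\Sigma_k$, the resulting decision boundary is quadratic in $x$ rather than a hyperplane.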


์ด์–ด์ง€๋Š” ํฌ์ŠคํŠธ์—์„œ๋Š” <Logistic Regression>์— ๋Œ€ํ•ด ๋‹ค๋ฃฌ๋‹ค. ์ข€๋” ์ต์ˆ™ํ•œ ์šฉ์–ด๋ฅผ ์“ฐ์ž๋ฉด, <MLE; Maximum Likelihood Estimation>์— ๋Œ€ํ•ด ๋‹ค๋ฃฌ๋‹ค๋Š” ๋ง์ด๋‹ค!

๐Ÿ‘‰ Linear Classification - 2


  1. To emphasize: it is not the case that we must model it this way. We merely 'assumed' that the distribution of $f_k(X)$ follows $N(\mu_k, \Sigma_k)$! The true distribution of $f_k(X)$ may well be different!