These are my notes from the 'Statistical Data Mining' course I took in Spring 2021. Corrections are always welcome :)


Binary Logistic Regression

LDA์™€ QDA์—์„œ๋Š” $f_k(x)$, ์ฆ‰ โ€œthe conditional density of $X$ given $Y=k$โ€๋ฅผ ๋ชจ๋ธ๋งํ•˜์˜€๋‹ค. ํ•˜์ง€๋งŒ, <Logistic Regression> ๋ชจ๋ธ์€ Regression output $x^T \beta$๋ฅผ <Logistic Function> $\dfrac{e^x}{1 + e^x}$์— ์ ์šฉํ•œ ๊ฒฐ๊ณผ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค!! ํ•œ๋ฒˆ ์‚ดํŽด๋ณด์ž. ๐Ÿคฉ

Definition

Assume that $\mathcal{Y} = \{0, 1\}$. Then the <logistic regression model> assumes

\[P(Y=1 \mid X=x) = \frac{\exp (x^T \beta)}{1 + \exp (x^T \beta)}\]


Q1. Why Logistic "Regression"?

In a plain Linear Regression model, $P(Y = 1 \mid X = x)$ is modeled directly as

\[P(Y = 1 \mid X=x) = x^T \beta\]

However, modeling it this way fails to satisfy the defining requirement that a probability lie in $[0, 1]$: "The linear regression model violates $x^T \beta \in [0, 1]$."

So, to force the output into $[0, 1]$, we apply a transformation to $x^T \beta$. One such transformation is the <logistic function> used here, also known as the <sigmoid function>.

\[\text{sigmoid}(x) = \frac{e^x}{1+e^x}\]

In fact, besides the <sigmoid function>, a <Gaussian cdf> or a <Gompertz function> can also be used; these are said to perform well in more specialized settings, for example in insurance.
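A quick sketch of the logistic function itself (a minimal illustration; this naive form can overflow for very large inputs, so it is not a production implementation):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1).

    Naive form e^z / (1 + e^z); for large |z| a numerically
    stable variant such as scipy.special.expit is preferable.
    """
    return np.exp(z) / (1.0 + np.exp(z))

print(sigmoid(0.0))   # exactly 0.5
print(sigmoid(5.0))   # close to 1
print(sigmoid(-5.0))  # close to 0
```

Because the output always lies strictly between 0 and 1, it can serve as a probability.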


Q2. Why "Logistic" Regression?

Linear Regression and LDA/QDA all aim to find an appropriate decision boundary in order to perform classification:

\[\left\{ x : \log \frac{P(Y=1\mid X=x)}{P(Y=0 \mid X = x)} = 0 \right\}\]

In the special case where the decision boundary is "linear", it satisfies the equation below, which defines the hyperplane used for classification.

\[\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X = x)} = x^T \beta\]

Unwinding the log in the equation above and using basic properties of probability, we can derive the following:

\[P(Y=1 \mid X= x) = \frac{\exp (x^T \beta)}{1 + \exp (x^T \beta)}\]

๋†€๋ž๊ฒŒ๋„ ์ด๋•Œ ์œ ๋„๋œ ์‹์ด ๋ฐ”๋กœ <Logistic Function>์ธ ๊ฒƒ์ด๋‹ค!! ๐Ÿคฉ


MLE; Maximum Likelihood Estimation

The <Logistic Regression> model, like any regression model, ultimately requires estimating $\beta$. This is done via <MLE; Maximum Likelihood Estimation>. Let's walk through the process.

First, define the <Likelihood function> $L(\beta)$.

\[L(\beta) = \prod^n_{i=1} P(Y = y_i \mid X = x_i)\]

Let's see why $L(\beta)$ is defined this way.

First, $L(\beta)$ multiplies together the probabilities at each $X = x_i$. We are allowed to multiply because of the MLE assumption that "the $x_i$ are all i.i.d.": since independent events occur jointly, the product rule of probability gives the $\prod$ above.

์œ„์˜ $L(\beta)$์˜ ์‹์— <logistic function>์„ ๋„ฃ์–ด์„œ ์กฐ๊ธˆ ํ’€์–ด์„œ ์จ๋ณด์ž.

\[\begin{aligned} L(\beta) &= \prod^n_{i=1} \left( \frac{\exp (x_i^T \beta)}{1 + \exp (x_i^T \beta)}\right)^{y_i} \left( \frac{1}{1 + \exp (x_i^T \beta)}\right)^{1-y_i} \\ &= \prod^n_{i=1} \left( \frac{\exp (y_i \cdot x_i^T \beta)}{1 + \exp( x_i^T \beta)}\right) \end{aligned}\]
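The first line of the display above splits each factor into the $y_i = 1$ and $y_i = 0$ cases; the second line merges them into one expression. A quick numerical check of that identity (function names are mine, for illustration):

```python
import numpy as np

def likelihood_term_cases(z, y):
    """Per-sample likelihood with separate y / (1 - y) exponents: p^y (1-p)^(1-y)."""
    p = np.exp(z) / (1.0 + np.exp(z))
    return p**y * (1.0 - p)**(1 - y)

def likelihood_term_compact(z, y):
    """The merged form: exp(y * z) / (1 + exp(z))."""
    return np.exp(y * z) / (1.0 + np.exp(z))

# For y in {0, 1} the two forms agree at any z:
for z in [-2.0, 0.0, 3.0]:
    for y in [0, 1]:
        assert np.isclose(likelihood_term_cases(z, y),
                          likelihood_term_compact(z, y))
```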

๋”ฑ ๋ด๋„ ์‹์ด ์ข€ ๋ณต์žกํ•˜๋‹ค. ๊ทธ๋ž˜์„œ ๊ณ„์‚ฐ์˜ ํŽธ์˜๋ฅผ ์œ„ํ•ด $L(\beta)$์— $\log$๋ฅผ ์ทจํ•ด <Log-Likelihood> $\ell(\beta)$๋ฅผ ์‚ฌ์šฉํ•˜์ž!

\[\begin{aligned} \ell (\beta) &= \log \left( \prod^n_{i=1} \left( \frac{\exp (y_i \cdot x_i^T \beta)}{1 + \exp( x_i^T \beta)}\right) \right) \\ &= \sum^n_{i=1} \; \log \left( \frac{\exp (y_i \cdot x_i^T \beta)}{1 + \exp( x_i^T \beta)}\right) \\ &= \sum^n_{i=1} \; \left( y_i \cdot x_i^T \beta - \log \left( 1 + \exp(x_i^T \beta)\right) \right) \end{aligned}\]

Converting the original product into a summation makes the expression much easier to analyze than before, but maximizing $\ell(\beta)$ is still not straightforward.
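The final summation form can be computed directly. A minimal sketch (the data here are made up; `np.logaddexp(0, z)` evaluates $\log(1 + e^z)$ in a numerically stable way):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Log-likelihood: sum_i [ y_i * x_i^T beta - log(1 + exp(x_i^T beta)) ]."""
    z = X @ beta
    return np.sum(y * z - np.logaddexp(0.0, z))

# Tiny illustrative example: intercept column plus one feature.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(log_likelihood(np.zeros(2), X, y))  # at beta = 0 each term is -log(2)
```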

However, the fact that $\ell (\beta)$ is a <concave function>¹ means that we can carry out the maximization with numerical methods²!! 🤩 So MLE is performed using numerical approximation methods such as the <Newton-Raphson method>.
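A minimal Newton-Raphson sketch for this maximization, using the standard update $\beta \leftarrow \beta + (X^T W X)^{-1} X^T (y - p)$ with $p = \text{sigmoid}(X\beta)$ and $W = \mathrm{diag}\!\left(p_i(1 - p_i)\right)$; the function name and toy data are my own:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Maximize the logistic log-likelihood by Newton-Raphson (a.k.a. IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # predicted P(Y=1 | x_i)
        W = p * (1.0 - p)                      # diagonal of the weight matrix
        grad = X.T @ (y - p)                   # gradient of the log-likelihood
        hess = X.T @ (X * W[:, None])          # X^T W X (= negative Hessian)
        beta += np.linalg.solve(hess, grad)    # Newton step
    return beta

# Toy, non-separable data (made up): intercept column plus one feature.
X = np.column_stack([np.ones(5), np.array([-2.0, -1.0, 0.0, 1.0, 2.0])])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
beta_hat = fit_logistic_newton(X, y)
```

Note that if the two classes are perfectly separable, the MLE does not exist and the iterates diverge, which is one of the special cases footnote 2 alludes to.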


Regularization

(This part hasn't fully clicked for me yet; I'll re-listen to the lecture, read this part of the textbook, and fill it in.)


LDA vs. Logistic Regression

| LDA | Logistic Regression |
| --- | --- |
| linear decision boundary | linear decision boundary |
| assumes Normal distributions | no Normal distribution assumption |
| needs the full joint distribution $P(Y, X)$ | only needs $P(Y=1 \mid X = x)$ |
| makes more assumptions than Logistic Regression, so its applicability is narrower | easier to use with categorical inputs than LDA |

Multi-class Logistic Regression

Let $\mathcal{Y} = \{ 1, \dots, K \}$, and assume that

\[P(Y = k \mid X = x) \propto \exp (x^T \beta_k)\]

์ด๊ฒƒ์€ ๊ณง,

\[P(Y = k \mid X = x) = \frac{\exp(x^T \beta_k)}{\displaystyle \sum^K_{i=1} \exp (x^T \beta_i)}\]
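Normalizing $\exp(x^T \beta_k)$ over the $K$ classes is exactly the softmax function. A small sketch (function name and coefficients are mine, for illustration):

```python
import numpy as np

def softmax_probs(X, B):
    """P(Y = k | x) proportional to exp(x^T beta_k), normalized over K classes.

    X: (n, p) inputs; B: (p, K) matrix whose k-th column is beta_k.
    """
    scores = X @ B                               # (n, K) matrix of x_i^T beta_k
    scores -= scores.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # each row sums to 1

# Tiny example: 2 inputs, 3 classes, made-up coefficients.
X = np.array([[1.0, 0.5], [1.0, -1.0]])
B = np.array([[0.2, -0.1, 0.0], [1.0, 0.5, -0.5]])
P = softmax_probs(X, B)
print(P.sum(axis=1))  # rows sum to 1
```

In practice one class's $\beta_k$ is often fixed to $0$ for identifiability, since the probabilities are unchanged if the same vector is added to every $\beta_k$.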

References


  1. I'll write up why $\ell (\beta)$ is a concave function after I've looked into it further.

  2. However, in some special cases the MLE may fail to exist. I'll also write this up after further investigation.