2021-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜ํ†ต๊ณ„์  ๋ฐ์ดํ„ฐ๋งˆ์ด๋‹โ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)


Goal.

Regression์˜ ๋ชฉํ‘œ๋Š” ์•„๋ž˜์™€ ๊ฐ™์€ <regression function>์„ ์ถ”์ •ํ•˜๋Š” ๊ฒƒ์— ์žˆ๋‹ค.

\[f(x) = E[Y \mid X = x]\]

์œ„์˜ ๊ด€๊ณ„์‹์€ ์•„๋ž˜์˜ ์‹๊ณผ ๋™์น˜๋‹ค. ์ฆ‰, ์œ„์˜ ํ•จ์ˆ˜ $f(x)$๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด๋‚˜ ์•„๋ž˜์˜ $f(x)$๋ฅผ ์ž˜ ์ฐพ์œผ๋ฉด <regression>์˜ ๋ชฉํ‘œ๋ฅผ ์„ฑ์ทจํ•œ ๊ฒƒ์œผ๋กœ ๋ณธ๋‹ค.

\[Y = f(x) + \epsilon, \quad E[\epsilon \mid X] = 0\]

For <linear regression>, we model the relationship between $X$ and $Y$ as follows in order to find the <regression function> $f(x)$.

\[\hat{Y} = \hat{\beta_0} + \sum^p_{j=1} \hat{\beta}_j X_j\]

ํ‘œ๊ธฐ์˜ ํŽธ์˜๋ฅผ ์œ„ํ•ด <intercept> ๋˜๋Š” <bias> ํ…€์„ ํฌํ•จํ•ด ์•„๋ž˜์™€ ๊ฐ™์ด ๊ธฐ์ˆ ํ•˜๊ธฐ๋กœ ํ•œ๋‹ค.

\[\hat{Y} = \sum^p_{j=0} \hat{\beta}_j X_j = X^T \hat{\beta}\]
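As a minimal numerical sketch of this convention (the data here are made-up toy values, not from the course), absorbing the intercept just means prepending a constant column $X_0 = 1$ to the data matrix:

```python
import numpy as np

# Toy data: n = 5 observations, p = 2 predictors (values are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# Absorb the intercept: prepend a constant column X_0 = 1, so that
# beta_0 is treated like any other coefficient in the sum over j = 0..p.
X_design = np.column_stack([np.ones(len(X)), X])

print(X_design.shape)  # (5, 3): one column per coefficient, intercept included
```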

Least Squares Estimator

<Linear regression>์˜ ํ•ด๋ฅผ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด RSS๋ฅผ ์‚ฌ์šฉํ•ด ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋‹ค.

\[\begin{aligned} \text{RSS}(\beta) &= \sum^n_{i=1} \left( y_i - x_i^T \beta\right)^2 \\ &= (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta) \end{aligned}\]

where $\mathbf{y} = (y_1, \dots, y_n)^T$ (response vector) and $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_n)^T$ (the $n \times p$ design matrix whose $i$-th row is $\mathbf{x}_i^T$)

RSS์— ๋Œ€ํ•œ ์‹์„ $\beta$์— ๋Œ€ํ•ด ๋ฏธ๋ถ„ํ•˜๋ฉด solution์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค. ์ •๋ง ๋ฏธ๋ถ„๋งŒ ์ž˜ ํ•˜๋ฉด ๋˜๊ธฐ ๋•Œ๋ฌธ์— ์‹ค์ œ ์œ ๋„ ๊ณผ์ •์€ ์—ฌ๊ธฐ์„œ๋Š” ์ƒ๋žตํ•œ๋‹ค.

\[\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \; \text{RSS}(\beta) = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\]

์ด๊ฒƒ์„ ์•ž์—์„œ ์–ธ๊ธ‰ํ•œ $\hat{Y} = X^T \hat{\beta}$์— ๋Œ€์ž…ํ•ด์ฃผ๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\hat{Y} = X^T \hat{\beta} = \mathbf{X}^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} = \left( \mathbf{X}^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \right) \mathbf{y} = \mathbf{H} \mathbf{y}\]

์ด๋•Œ์˜ $\mathbf{H}$๋ฅผ <hat matrix>๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.


Design Matrix

There are two ways to treat the <design matrix> $\mathbf{X}$.

(1) <Random Design>: the $x_i$'s are regarded as i.i.d. realizations

(2) <Fixed Design>: the $x_i$'s are fixed (non-random)

๋‘ ๊ฐœ๋…์ด <regression estimation>์—๋Š” ํฐ ์ฐจ์ด๊ฐ€ ์—†๋‹ค๊ณ  ํ•œ๋‹ค. ์šฐ๋ฆฌ๋Š” ์•ž์œผ๋กœ๋„ ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ์—์„œ $\mathbf{X}$๋ฅผ <fixed design>์œผ๋กœ ์ทจ๊ธ‰ํ•  ๊ฒƒ์ด๋‹ค.


์•ž์—์„œ RSS ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•ด $\hat{\beta}$๋ฅผ ๊ตฌํ–ˆ๋‹ค. ์ด๋•Œ, ์ด ๋ชจ๋ธ์ด ์–ผ๋งˆ๋‚˜ ์ข‹์€์ง€๋ฅผ ๋…ผํ•˜๊ธฐ ์œ„ํ•ด <prediction error>๋ฅผ ๊ตฌํ•ด์•ผ ํ•œ๋‹ค. ์ด๋•Œ ํ•„์š”ํ•œ ๊ฐœ๋…์ด <bias>์™€ <variance>์ด๋‹ค. ์ด ๋‘ ๊ฐœ๋…์— ๋ฌด์—‡์ธ์ง€๋Š” ๋ณ„๋„์˜ ํฌ์ŠคํŠธ์— ์ •๋ฆฌํ•ด๋‘์—ˆ๋‹ค. ๋งŒ์•ฝ bias๋„ ์ž‘๊ณ  variance๋„ ์ž‘๋‹ค๋ฉด, ์šฐ๋ฆฌ๋Š” ๊ทธ ๋ชจ๋ธ์ด ์ข‹๋‹ค๊ณ  ํ‰๊ฐ€ํ•œ๋‹ค.
๐Ÿ‘‰ bias & variance

\[\text{Err}(x_0) = \sigma^2 + \left\{ \text{Bias}(\hat{f}(x_0)) \right\}^2 + \text{Var}(\hat{f}(x_0))\]

Assume $Y = X^T \beta + \epsilon$.

If $\text{Var}(Y) = \text{Var}(\epsilon) = \sigma^2$, then

\[\begin{aligned} \text{Var}(\hat{\beta}) &= \text{Var}\left( (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \right) \\ &= \left((\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \right) \text{Var}(\mathbf{y}) \left((\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \right)^T \quad (\because \text{Var}(A\mathbf{x}) = A \text{Var}(\mathbf{x})A^T) \\ &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \cdot \text{Var}(\mathbf{y}) \cdot \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1} \\ &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \cdot \textcolor{red}{\sigma^2 I_n} \cdot \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \end{aligned}\]

์œ„์˜ ์‹์—์„œ $X^TX$๋ฅผ <gram matrix>๋ผ๊ณ  ํ•œ๋‹ค.

์ด๋ฒˆ์—๋Š” bias๋ฅผ ์‚ดํŽด๋ณด์ž. $\hat{\beta}$์˜ ํ‰๊ท ์ธ $E[\hat{\beta}]$๋ฅผ ๊ตฌํ•ด๋ณด์ž.

๋งŒ์•ฝ, $E[Y] = X^T \beta$๋ผ๋ฉด,

\[\begin{aligned} E[\hat{\beta}] &= E\left[ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{y} \right] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E [\mathbf{y}] \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T (X \beta) = \beta \end{aligned}\]
$E[\mathbf{y}]$ ์œ ๋„

For $\mathbf{y} = (y_1, \dots, y_n)^T$, $E[\mathbf{y}]$ is

\[E[\mathbf{y}] = \begin{pmatrix} E[y_1] \\ \vdots \\ E[y_n] \end{pmatrix} = \begin{pmatrix} x_1^T \beta \\ \vdots \\ x_n^T \beta \end{pmatrix} = \mathbf{X} \beta\]

$E[\hat{\beta}] = \beta$์ด๊ธฐ ๋•Œ๋ฌธ์— unbiased estimator๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๊ฒƒ์˜ ์˜๋ฏธ๋Š” ์ด estimator์˜ ์„ฑ๋Šฅ์ด ํ‰๊ท ์ ์ธ ๊ด€์ ์—์„œ๋Š” ์ •๋ง ์ž˜ ์ถ”์ •ํ•œ๋‹ค๋Š” ๋ง์ด๋‹ค.

์ข…ํ•ฉํ•˜๋ฉด, LS estimator๋Š” bias์˜ ๊ฒฝ์šฐ unbiased์˜€๋‹ค. ํ•˜์ง€๋งŒ, variance์˜ ๊ฒฝ์šฐ ํ–‰๋ ฌ์˜ ํ˜•ํƒœ๋กœ ๋‚˜์™”๋‹ค. ์ „์ฒด์˜ ๊ด€์ ์—์„œ ๋ดค์„ ๋•Œ, LS estimator๋Š” ๋ถ„์‚ฐ์ด ํฐ ํŽธ์ด๊ธฐ ๋•Œ๋ฌธ์— ์•„์ฃผ ์ข‹์€ estimator๋Š” ์•„๋‹ˆ๋ผ๊ณ  ํ•œ๋‹ค.

์ด๋ฒˆ์—๋Š” estimator์—์„œ ์˜ค์ฐจ์— ๋Œ€ํ•œ variance์ธ $\sigma^2$๋„ ์ถ”์ •ํ•ด๋ณด์ž.

\[\hat{\sigma} = \frac{1}{n} \sum^n_{i=1} (y_i - \hat{y_i})^2 = \frac{1}{n} \sum^n_{i=1} (y_i - x_i \hat{\beta})^2\]

๊ทธ๋Ÿฐ๋ฐ ์—ฌ๊ธฐ์—์„œ $n$์ด ์•„๋‹ˆ๋ผ $n-p$๋กœ ๋‚˜๋‘๋„๋ก ํ•œ๋‹ค.

\[\hat{\sigma} = \frac{1}{n-p} \sum^n_{i=1} (y_i - x_i \hat{\beta})^2 = \frac{1}{n-p} \| \mathbf{y} - \hat{\mathbf{y}} \|\]

์ด๋•Œ, $(n-p)$๋Š” <์ž์œ ๋„>๋ฅผ ์˜๋ฏธํ•˜๋Š”๋ฐ, ์ด ๋ถ€๋ถ„์€ ์†”์งํžˆ ์•„์ง ์ž˜ ๋ชจ๋ฅด๋Š” ๋ถ€๋ถ„์ด๋ผ ์ž์„ธํ•œ ์„ค๋ช…์€ ์ƒ๋žตํ•œ๋‹ค.

์ผ๋‹จ ๋งŒ์•ฝ ์ €๋ ‡๊ฒŒ $\sigma^2$๋ฅผ ์ถ”์ •ํ•œ๋‹ค๋ฉด, ์ด๊ฒƒ์ด unbiased estimaor ์ž„์„ ์œ ๋„ํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค.

\[E[\hat{\sigma^2}] = \sigma^2\]
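One way to see why $n - p$ is the right divisor is a quick simulation (toy values of my own choosing, not from the course): since $E[\text{RSS}] = (n-p)\sigma^2$, dividing the RSS by $n$ systematically underestimates $\sigma^2$, while dividing by $n - p$ is correct on average:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
sigma2 = 2.0  # true error variance

# Compare the divisor n (biased) with n - p (unbiased) over many replications.
est_n, est_np = [], []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    rss = resid @ resid
    est_n.append(rss / n)
    est_np.append(rss / (n - p))

# E[RSS] = (n - p) * sigma^2, so dividing by n - p recovers sigma^2 on average.
print(np.mean(est_np))  # approx 2.0
print(np.mean(est_n))   # approx 2.0 * (n - p) / n = 1.6, systematically too small
```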