2021-1ν•™κΈ°, λŒ€ν•™μ—μ„œ β€˜ν†΅κ³„μ  λ°μ΄ν„°λ§ˆμ΄λ‹β€™ μˆ˜μ—…μ„ λ“£κ³  κ³΅λΆ€ν•œ λ°”λ₯Ό μ •λ¦¬ν•œ κΈ€μž…λ‹ˆλ‹€. 지적은 μ–Έμ œλ‚˜ ν™˜μ˜μž…λ‹ˆλ‹€ :)


Goal.

The goal of regression is to estimate the <regression function> below.

\[f(x) = E[Y \mid X = x]\]

μœ„μ˜ 관계식은 μ•„λž˜μ˜ 식과 λ™μΉ˜λ‹€. 즉, μœ„μ˜ ν•¨μˆ˜ $f(x)$λ₯Ό μ°ΎλŠ” κ²ƒμ΄λ‚˜ μ•„λž˜μ˜ $f(x)$λ₯Ό 잘 찾으면 <regression>의 λͺ©ν‘œλ₯Ό μ„±μ·¨ν•œ κ²ƒμœΌλ‘œ λ³Έλ‹€.

\[Y = f(X) + \epsilon, \quad E[\epsilon \mid X] = 0\]
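
As a quick check of this equivalence: taking the conditional expectation of the model given $X = x$ recovers the regression function, since the error term vanishes by assumption.

\[E[Y \mid X = x] = E[f(X) + \epsilon \mid X = x] = f(x) + E[\epsilon \mid X = x] = f(x)\]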

To do <linear regression>, we model the relationship between $X$ and $Y$ as below in order to find the <regression function> $f(x)$.

\[\hat{Y} = \hat{\beta}_0 + \sum^p_{j=1} \hat{\beta}_j X_j\]

ν‘œκΈ°μ˜ 편의λ₯Ό μœ„ν•΄ <intercept> λ˜λŠ” <bias> 텀을 포함해 μ•„λž˜μ™€ 같이 κΈ°μˆ ν•˜κΈ°λ‘œ ν•œλ‹€.

\[\hat{Y} = \sum^p_{j=0} \hat{\beta}_j X_j = X^T \hat{\beta}\]
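
Concretely, absorbing the intercept just means fixing $X_0 = 1$, i.e. prepending a column of ones to the data matrix. A minimal NumPy sketch of this step (the toy numbers and variable names are mine, not from the lecture):

```python
import numpy as np

# toy data: n = 5 observations, 2 predictors (made-up numbers)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])

# absorb the intercept: X_0 = 1 for every observation,
# so beta_0 is treated like any other coefficient
X = np.column_stack([np.ones(len(X_raw)), X_raw])
print(X.shape)  # (5, 3): one intercept column + 2 predictor columns
```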

Least Squares Estimator

To find the solution of <linear regression>, we can work with the RSS (residual sum of squares).

\[\begin{aligned} \text{RSS}(\beta) &= \sum^n_{i=1} \left( y_i - x_i^T \beta\right)^2 \\ &= (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta) \end{aligned}\]

where $\mathbf{y} = (y_1, \dots, y_n)^T$ is the response vector and $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_n)^T$ is the design matrix whose $i$-th row is $\mathbf{x}_i^T$.

Differentiating the RSS with respect to $\beta$ yields the solution. Since only routine differentiation is required, the derivation is omitted here.

\[\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \; \text{RSS}(\beta) = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\]

이것을 μ•žμ—μ„œ μ–ΈκΈ‰ν•œ $\hat{Y} = X^T \hat{\beta}$에 λŒ€μž…ν•΄μ£Όλ©΄ μ•„λž˜μ™€ κ°™λ‹€.

\[\hat{\mathbf{y}} = \mathbf{X} \hat{\beta} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} = \left( \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \right) \mathbf{y} = \mathbf{H} \mathbf{y}\]

μ΄λ•Œμ˜ $\mathbf{H}$λ₯Ό <hat matrix>라고 λΆ€λ₯Έλ‹€.


Design Matrix

There are two ways to view the <design matrix> $\mathbf{X}$.

(1) <Random Design>: the $x_i$’s are regarded as i.i.d. realizations

(2) <Fixed Design>: the $x_i$’s are fixed (non-random)

두 κ°œλ…μ΄ <regression estimation>μ—λŠ” 큰 차이가 μ—†λ‹€κ³  ν•œλ‹€. μš°λ¦¬λŠ” μ•žμœΌλ‘œλ„ λŒ€λΆ€λΆ„μ˜ κ²½μš°μ—μ„œ $\mathbf{X}$λ₯Ό <fixed design>으둜 μ·¨κΈ‰ν•  것이닀.


μ•žμ—μ„œ RSS 방식을 μ‚¬μš©ν•΄ $\hat{\beta}$λ₯Ό κ΅¬ν–ˆλ‹€. μ΄λ•Œ, 이 λͺ¨λΈμ΄ μ–Όλ§ˆλ‚˜ 쒋은지λ₯Ό λ…Όν•˜κΈ° μœ„ν•΄ <prediction error>λ₯Ό ꡬ해야 ν•œλ‹€. μ΄λ•Œ ν•„μš”ν•œ κ°œλ…μ΄ <bias>와 <variance>이닀. 이 두 κ°œλ…μ— λ¬΄μ—‡μΈμ§€λŠ” λ³„λ„μ˜ ν¬μŠ€νŠΈμ— μ •λ¦¬ν•΄λ‘μ—ˆλ‹€. λ§Œμ•½ bias도 μž‘κ³  variance도 μž‘λ‹€λ©΄, μš°λ¦¬λŠ” κ·Έ λͺ¨λΈμ΄ μ’‹λ‹€κ³  ν‰κ°€ν•œλ‹€.
πŸ‘‰ bias & variance

\[\text{Err}(x_0) = \sigma^2 + \left\{ \text{Bias}(\hat{f}(x_0)) \right\}^2 + \text{Var}(\hat{f}(x_0))\]
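
For reference, a compact derivation of this decomposition (assuming the test response is $Y_0 = f(x_0) + \epsilon$ with $E[\epsilon] = 0$, $\text{Var}(\epsilon) = \sigma^2$, and $\epsilon$ independent of $\hat{f}$, so the cross terms vanish):

\[\begin{aligned} \text{Err}(x_0) &= E\left[ \left( Y_0 - \hat{f}(x_0) \right)^2 \right] = E\left[ \left( \epsilon + f(x_0) - \hat{f}(x_0) \right)^2 \right] \\ &= \sigma^2 + E\left[ \left( f(x_0) - \hat{f}(x_0) \right)^2 \right] \\ &= \sigma^2 + \left( f(x_0) - E[\hat{f}(x_0)] \right)^2 + E\left[ \left( \hat{f}(x_0) - E[\hat{f}(x_0)] \right)^2 \right] \\ &= \sigma^2 + \left\{ \text{Bias}(\hat{f}(x_0)) \right\}^2 + \text{Var}(\hat{f}(x_0)) \end{aligned}\]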

Assume $Y = X^T \beta + \epsilon$.

If $\text{Var}(Y) = \text{Var}(\epsilon) = \sigma^2$, so that (with uncorrelated observations) $\text{Var}(\mathbf{y}) = \sigma^2 I_n$, then

\[\begin{aligned} \text{Var}(\hat{\beta}) &= \text{Var}\left( (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \right) \\ &= \left((\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \right) \text{Var}(\mathbf{y}) \left((\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \right)^T \quad (\because \text{Var}(A\mathbf{x}) = A \, \text{Var}(\mathbf{x}) A^T) \\ &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \cdot \text{Var}(\mathbf{y}) \cdot \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \\ &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \cdot \textcolor{red}{\sigma^2 I_n} \cdot \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \end{aligned}\]

μœ„μ˜ μ‹μ—μ„œ $X^TX$λ₯Ό <gram matrix>라고 ν•œλ‹€.

μ΄λ²ˆμ—λŠ” biasλ₯Ό μ‚΄νŽ΄λ³΄μž. $\hat{\beta}$의 평균인 $E[\hat{\beta}]$λ₯Ό κ΅¬ν•΄λ³΄μž.

If $E[Y] = X^T \beta$, then

\[\begin{aligned} E[\hat{\beta}] &= E\left[ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{y} \right] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E [\mathbf{y}] \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T (\mathbf{X} \beta) = \beta \end{aligned}\]
$E[\mathbf{y}]$ μœ λ„

For $\mathbf{y} = (y_1, \dots, y_n)^T$, we have

\[E[\mathbf{y}] = \begin{pmatrix} E[y_1] \\ \vdots \\ E[y_n] \end{pmatrix} = \begin{pmatrix} x_1^T \beta \\ \vdots \\ x_n^T \beta \end{pmatrix} = \mathbf{X} \beta\]

Since $E[\hat{\beta}] = \beta$, the LS estimator is an unbiased estimator: on average, it recovers the true $\beta$ exactly.

To summarize, the LS estimator is unbiased, but its variance is the matrix $\sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}$. Viewed as a whole, the LS estimator tends to have large variance, so it is not necessarily a very good estimator.
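
A small Monte Carlo sketch of these two facts under a fixed design (all numbers are made up): across repeated noise draws, the average of $\hat{\beta}$ should approach $\beta$, and the empirical covariance of $\hat{\beta}$ should approach $\sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed design
beta = np.array([1.0, 2.0, -0.5])

betas = []
for _ in range(5000):                      # redraw only the noise epsilon
    y = X @ beta + rng.normal(scale=sigma, size=n)
    betas.append(np.linalg.solve(X.T @ X, X.T @ y))
betas = np.array(betas)

print(betas.mean(axis=0))                  # ~ beta          (unbiasedness)
print(np.cov(betas.T))                     # ~ empirical covariance of beta_hat
print(sigma**2 * np.linalg.inv(X.T @ X))   # theoretical sigma^2 (X^T X)^{-1}
```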

μ΄λ²ˆμ—λŠ” estimatorμ—μ„œ μ˜€μ°¨μ— λŒ€ν•œ variance인 $\sigma^2$도 μΆ”μ •ν•΄λ³΄μž.

\[\hat{\sigma}^2 = \frac{1}{n} \sum^n_{i=1} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum^n_{i=1} (y_i - x_i^T \hat{\beta})^2\]

그런데 μ—¬κΈ°μ—μ„œ $n$이 μ•„λ‹ˆλΌ $n-p$둜 λ‚˜λ‘λ„λ‘ ν•œλ‹€.

\[\hat{\sigma}^2 = \frac{1}{n-p} \sum^n_{i=1} (y_i - x_i^T \hat{\beta})^2 = \frac{1}{n-p} \| \mathbf{y} - \hat{\mathbf{y}} \|^2\]

μ΄λ•Œ, $(n-p)$λŠ” <μžμœ λ„>λ₯Ό μ˜λ―Έν•˜λŠ”λ°, 이 뢀뢄은 μ†”μ§νžˆ 아직 잘 λͺ¨λ₯΄λŠ” 뢀뢄이라 μžμ„Έν•œ μ„€λͺ…은 μƒλž΅ν•œλ‹€.

If we estimate $\sigma^2$ this way, one can show that it is an unbiased estimator.

\[E[\hat{\sigma}^2] = \sigma^2\]
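
A quick simulation of this claim (again a sketch with made-up numbers, where $p$ counts the intercept column): dividing the RSS by $n - p$ gives an estimate whose average is close to $\sigma^2$, while dividing by $n$ is biased downward by a factor of roughly $(n - p)/n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 30, 5, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)

est_np, est_n = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    rss = resid @ resid
    est_np.append(rss / (n - p))   # estimate with n - p in the denominator
    est_n.append(rss / n)          # naive estimate with n in the denominator

print(np.mean(est_np))  # ~ 1.0 (= sigma^2): unbiased
print(np.mean(est_n))   # ~ (n - p) / n = 25/30: noticeably below sigma^2
```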