2021-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜ํ†ต๊ณ„์  ๋ฐ์ดํ„ฐ๋งˆ์ด๋‹โ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)

5 minute read

2021-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜ํ†ต๊ณ„์  ๋ฐ์ดํ„ฐ๋งˆ์ด๋‹โ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)

Set-up

  • Input Variables: $X = (X_1, \dots, X_p)^T$
    • a.k.a. Covariates, features, $p$-dim random vector, independent variables
  • Output Variables: $Y$
    • a.k.a. Responses, dependent variables; a random variable whose realizations are the $y$-values
  • Data: $\{(y_1, x_1), \dots, (y_n, x_n)\}$
    • Realizations of $(X, Y)$ (often assumed to be i.i.d.: independent and identically distributed)

Variable types

  • Quantitative Variables
    • Continuous variables
  • Qualitative Variables
    • Discrete variables or Categorical Variables
  • Ordinal Variables
    • ex: small < medium < large

  • Two Supervised Learning Tasks
    • Regression
      • Output $Y$ is continuous
      • Often modeled as $Y = f(X) + \epsilon$
      • Goal: construct a good estimate $\hat{f}$ of $f$ with small error
    • Classification
      • Output $Y$ is categorical
      • Often, we model $P(Y=k \mid X)$ (= probability estimation)
      • Goal: construct a good model that determines the output value for a given $X$

<Supervised Learning>์—์„  ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, ์•„๋ž˜์˜ ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ์„ ์ทจํ•œ๋‹ค.

  • Least Squares Estimator
  • Nearest Neighbor

์ด ๋‘ ์ ‘๊ทผ๋ฒ•์€ <Supervised Learning> ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ฐ€์žฅ ๊ธฐ๋ณธ์ด ๋œ๋‹ค.

Regression


Definition. Linear Model for regression

Given input $X = (X_1, \dots, X_p)^T$, we predict $Y$ as

\[\hat{Y} = \hat{\beta_0} + \sum^{p}_{i=1} {\hat{\beta_i} X_i}\]
  • ์ผ๋ฐ˜์ ์œผ๋กœ hat($\hat{x}$)์ด ์žˆ์œผ๋ฉด, predicted value๋กœ ์ทจ๊ธ‰ํ•œ๋‹ค.
  • $\beta_i$๋Š” โ€˜regression coefficientโ€™๋ผ๊ณ  ํ•œ๋‹ค. ํŠนํžˆ, $\beta_0$๋ฅผ <intercept> ๋˜๋Š” <bias>๋ผ๊ณ  ํ•œ๋‹ค.
    • ํŽธ์˜๋ฅผ ์œ„ํ•ด $X_0=1$๋ผ๊ณ  ์„ค์ •ํ•  ์ˆ˜๋„ ์žˆ๋‹ค.
    • ์ด ๊ฒฝ์šฐ, ์‹์€ $\hat{Y} = X^T \hat{\beta}$๊ฐ€ ๋œ๋‹ค.
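
As a quick sanity check of this prediction rule, here is a minimal NumPy sketch; the coefficient values and the input below are made-up assumptions, not fitted from any data.

```python
import numpy as np

# Made-up fitted coefficients, an assumption for illustration:
# beta_hat[0] is the intercept (bias) beta_0.
beta_hat = np.array([1.0, 0.5, -2.0])

x = np.array([3.0, 4.0])            # one p = 2 dimensional input
x_aug = np.concatenate(([1.0], x))  # prepend X_0 = 1 for the intercept

y_hat = x_aug @ beta_hat            # Y_hat = x^T beta_hat
print(y_hat)                        # 1.0 + 0.5*3.0 + (-2.0)*4.0 = -5.5
```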

Least Squares Estimator

To estimate a plausible $\hat{\beta}$, we can take the LSE approach, which minimizes the "residual sum of squares":

\[\begin{aligned} \mbox{RSS}(\beta) &= \sum^n_{i=1} (y_i - {x_i}^T \beta)^2 \\ &= (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) \end{aligned}\]

์œ„ ์‹์—์„œ $\mathbf{y}$, $\mathbf{X}$๋Š” ๊ฐ๊ฐ ์•„๋ž˜์˜ ์˜๋ฏธ๋ฅผ ๊ฐ€์ง„๋‹ค.

  • $\mathbf{y}$ is the vector collecting the answer labels, called the <response vector>.
\[\mathbf{y} = \left( y_1, \dots, y_n \right)^T\]
  • $\mathbf{X}$ is the matrix collecting the input feature vectors, called the <design matrix>.
\[\mathbf{X} = \left( x_1, \dots, x_n \right)^T\]

The <design matrix> can also be written as follows.

\[\mathbf{X} = \left( x_1, \dots, x_n \right)^T = \left( \mathbf{x}_1, \dots, \mathbf{x}_p \right)\]

The first notation stacks the $n$ feature vectors of dimension $p$ row by row; the second collects, for each single feature, its values across all $n$ samples into the vector $\mathbf{x}_i$.
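
The sketch below makes the two views concrete with a toy example (all numbers are assumptions for illustration): the rows of $\mathbf{X}$ are the sample vectors $x_i^T$, the columns are the per-feature vectors, and $\text{RSS}(\beta)$ is evaluated in both the sum form and the matrix form.

```python
import numpy as np

# Toy data, assumed for illustration: n = 3 samples, p = 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # row i is x_i^T; column j collects feature j across samples
y = np.array([1.0, 2.0, 3.0])  # response vector
beta = np.array([0.5, 0.1])    # an arbitrary candidate coefficient vector

# RSS(beta) in the sum form and in the matrix form: the two agree.
rss_sum = np.sum((y - X @ beta) ** 2)
resid = y - X @ beta
rss_mat = resid @ resid        # (y - X beta)^T (y - X beta)
print(rss_sum, rss_mat)        # both ~0.11
```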


LS ์ ‘๊ทผ์—์„œ๋Š” $\hat{\beta}$๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ์ถ”์ •ํ•œ๋‹ค.

\[\begin{aligned} \hat{\beta} &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \; \mbox{RSS}(\beta) \\ &= \left( \mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \cdot \mathbf{y} \end{aligned}\]

์ด๊ฒƒ์€ $\text{RSS}(\beta)$์— ๋Œ€ํ•œ ๋ฏธ๋ถ„์œผ๋กœ ์‰ฝ๊ฒŒ ์œ ๋„ํ•  ์ˆ˜ ์žˆ๋‹ค.

\[\begin{aligned} \text{RSS}(\beta) &= \sum^n_{i=1} (y_i - x^T_i \beta)^2 \\ &= (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta) \\ &= (\mathbf{y}^T - \beta^T \mathbf{X}^T) (\mathbf{y} - \mathbf{X}\beta) \\ &= \mathbf{y}^T \mathbf{y} - \beta^T \mathbf{X}^T \mathbf{y} - \mathbf{y}^T \mathbf{X} \beta + \beta^T \mathbf{X}^T \mathbf{X} \beta \\ &= \mathbf{y}^T \mathbf{y} - 2 \mathbf{y}^T \mathbf{X} \beta + \beta^T \mathbf{X}^T \mathbf{X} \beta \end{aligned}\]

Now let us differentiate $\text{RSS}(\beta)$ with respect to $\beta$.

\[\begin{aligned} \frac{\partial}{\partial\beta} \text{RSS}(\beta) &= \frac{\partial}{\partial\beta} \left(\mathbf{y}^T \mathbf{y} - 2 \mathbf{y}^T \mathbf{X} \beta + \beta^T \mathbf{X}^T \mathbf{X} \beta \right) \\ &= 0 - 2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \beta \end{aligned}\]

The minimum occurs where $\displaystyle \frac{\partial}{\partial\beta} \text{RSS}(\beta)$ vanishes; since $\text{RSS}(\beta)$ is convex in $\beta$, this stationary point is the global minimum. Therefore, assuming $\mathbf{X}^T \mathbf{X}$ is invertible,

\[\begin{aligned} \frac{\partial}{\partial\beta} \text{RSS}(\beta) &= - 2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \beta = 0 \\ &\Updownarrow \\ 2 \mathbf{X}^T \mathbf{X} \beta &= 2 \mathbf{X}^T \mathbf{y} \\ &\Updownarrow \\ \hat{\beta} &= \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y} \end{aligned}\]

$\blacksquare$

์œ„์˜ ๋ฐฉ๋ฒ•์œผ๋กœ ๊ตฌํ•œ $\hat{\beta}$๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ Linear Regressor $\hat{f}(x)$๋ฅผ ๊ธฐ์ˆ ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\hat{f}(x) = x^T \hat{\beta}\]
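
The following sketch puts the closed form and the fitted regressor together on synthetic data (the data-generating numbers are assumptions). It solves the normal equations $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$ with np.linalg.solve instead of explicitly inverting $\mathbf{X}^T \mathbf{X}$, which is the numerically safer route to the same $\hat{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, assumed for illustration: n = 100 samples, p = 2 features.
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with X_0 = 1
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)           # Y = f(X) + eps

# Closed-form LSE: solve X^T X beta = X^T y for beta_hat.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted regressor: f_hat(x) = x^T beta_hat.
x_new = np.array([1.0, 0.5, -0.5])  # note the leading 1 for the intercept
print(beta_hat)                      # close to beta_true
print(x_new @ beta_hat)              # prediction at x_new
```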

Nearest-Neighbor Methods

We can also approach the Regression problem with the <Nearest-Neighbor Method>.

Definition. Nearest-Neighbor Methods for regression

Let $N_k(x)$ be the set of the $k$ training points closest to $x$.

\[\hat{f}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i\]
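
A minimal sketch of this estimator follows; the Euclidean distance and the toy data below are assumptions, since the definition itself does not fix a metric.

```python
import numpy as np

def knn_regress(x, X_train, y_train, k=3):
    """Predict f_hat(x) as the average of y over the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances to x
    nearest = np.argsort(dists)[:k]              # indices of N_k(x)
    return y_train[nearest].mean()               # (1/k) * sum of y_i over N_k(x)

# Toy 1-D training data, assumed for illustration.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
print(knn_regress(np.array([1.6]), X_train, y_train, k=3))  # mean of y at x = 1, 2, 3
```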

<NN>์œผ๋กœ์˜ ์ ‘๊ทผ๋„ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ธ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ Estimation์˜ ๊ฒฝ๊ณ„ ๋ถ€๊ทผ์„ ๋ณด๋ฉด, <NN>์˜ ๊ฒฝ์šฐ Boundary์—์„œ Estimation ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง€๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.


Classification

The <Least Squares Method> and the <Nearest Neighbor Method> can be applied to <Classification> problems as well. Since <Classification> is covered in detail later in the course, this post only briefly presents the models.