This post collects my notes from the 'Statistical Data Mining' course I took in the spring semester of 2021. Feedback is always welcome :)


์ด ํฌ์ŠคํŠธ๋Š” Regression Spline๊ณผ Spline Method (2)์ด์–ด์ง€๋Š” ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค ๐Ÿ˜Š

KNN; K-Nearest Neighbor method

One of the most natural approaches to modeling the regression function $f(x) = E(Y \mid X=x)$ is <KNN>: average the responses of the $k$ points nearest to $x$.

\[\hat{f}(x) = \text{Avg} \, (y_i \, \mid \, x_i \in N_k(x))\]

where $N_k(x)$ is the set of $k$ points nearest to $x$.

The implicit assumption behind KNN is that the target function $f(x)$ is smooth.
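The averaging rule above is short enough to write out directly. Below is a minimal sketch of a 1-D KNN regressor on synthetic data; the sine-curve data, seed, and choice of $k=10$ are illustrative assumptions, not from the course.

```python
import numpy as np

def knn_regress(x0, x_train, y_train, k):
    """Average the responses of the k training points nearest to x0."""
    dist = np.abs(x_train - x0)       # distances from each training point to the query
    nearest = np.argsort(dist)[:k]    # indices of the k nearest neighbors
    return y_train[nearest].mean()    # Avg(y_i | x_i in N_k(x0))

# Toy data: noisy sine curve (hypothetical example data)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(0, 0.2, 100)

f_hat = knn_regress(np.pi / 2, x, y, k=10)  # estimate f(pi/2); true value is sin(pi/2) = 1
```

Plotting `knn_regress` over a grid of query points makes the problem discussed below visible: the fitted curve is piecewise constant, jumping whenever the neighbor set changes.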

KNN์˜ parameter์ธ $k$๋Š” model complexity๋ฅผ ์ปจํŠธ๋กค ํ•œ๋‹ค.

  • small $k$: model complexity ▲, bias ▼, variance ▲ // makes a wiggly model
  • Large $k$: model complexity ▼, bias ▲, variance ▼ // makes a smooth model

[Problem 😱] The KNN estimate $\hat{f}(x)$ is neither smooth nor continuous!!


KNN์˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋„์ž…ํ•˜๋Š” ๊ฒƒ์ด ๋ฐ”๋กœ <kernel method>๋‹ค!


Kernel Method

KNN์˜ ์‹์„ ๋‹ค์‹œ ๊ธฐ์ˆ ํ•ด๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\hat{f}(x) = \frac{\displaystyle \sum^n_{i=1} \, K_k (x, x_i) \cdot y_i}{\displaystyle \sum^n_{i=1} \, K_k (x, x_i) }\]

where $K_k (x, x_i) = I(x_i \in N_k (x))$.

<kernel method>์˜ ๋ฉ”์ธ ์•„์ด๋””์–ด๋Š” KNN์—์„œ $I(x_i \in N_k(x))$๋ฅผ ๋‹ค๋ฅธ smooth function $K_{\lambda}(x, xโ€™)$๋กœ ๋Œ€์ฒดํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค!! ์ด๋•Œ์˜ ๊ทธ smooth function $K_{\lambda}(x, xโ€™)$๋ฅผ <kernel function>์ด๋ผ๊ณ  ํ•œ๋‹ค!!

์œ„์˜ ๊ทธ๋ฆผ์€ <kernel function> ์ค‘ ํ•˜๋‚˜์ธ <Epanechnikov kernel>์„ ์‹œ๊ฐํ™”ํ•œ ๊ฒƒ์ด๋‹ค.

In other words, unlike KNN, which weights every neighbor equally, the <kernel method> assigns larger weights to the points closer to $x$, as determined by the kernel.

Q. In the figure with the <kernel method> applied, the estimated curve drifts slightly upward near the boundary. Why does this happen, and is it an inherent problem of the <kernel method> that occurs regardless of the dataset?

A. Even the <kernel method> is not free from the boundary problem. It is an inherent limitation of the method.

// In addition, the estimate is not accurate in the middle of the function either 😥


Nadaraya-Watson Estimator

For a given kernel $K_\lambda$, the <Nadaraya-Watson Estimator> is

\[\hat{f}(x) = \frac{\displaystyle \sum^n_{i=1} K_{\lambda} (x, x_i) \cdot y_i}{\displaystyle\sum^n_{i=1} K_{\lambda} (x, x_i)}\]

Most of the well-known kernels used in the <NW Estimator> take the following form.

\[K_{\lambda} (x, x') = D \left( \frac{\left| x - x' \right|}{\lambda} \right)\]

where $D$ is a non-negative function decreasing on $[0, \infty)$.

Think of $D$ as, roughly, a bell-shaped even function. The <Epanechnikov kernel> is a representative example.


Examples.

  • Epanechnikov kernel
\[D(t) = \begin{cases} \dfrac{3}{4} (1-t^2) & \text{if } \left| t \right| \le 1 \\ \qquad 0 & \text{else} \end{cases}\]
  • Tri-cube kernel
\[D(t) = \begin{cases} \left( 1 - \left| t \right|^3 \right)^3 & \text{if } \left| t \right| \le 1 \\ \qquad 0 & \text{else} \end{cases}\]
  • Gaussian kernel
\[D(t) = \dfrac{1}{\sqrt{2\pi}} \exp \left( - t^2 / 2\right)\]
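The kernels above and the NW estimator itself translate into a few lines of code. Below is a minimal sketch, assuming synthetic sine-curve data and a bandwidth $\lambda = 0.5$ chosen just for illustration.

```python
import numpy as np

def epanechnikov(t):
    """D(t) = (3/4)(1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def gaussian(t):
    """D(t) = exp(-t^2/2) / sqrt(2*pi)."""
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x0, x_train, y_train, lam, D=epanechnikov):
    """Kernel-weighted average: sum_i K(x0, x_i) y_i / sum_i K(x0, x_i)."""
    w = D(np.abs(x_train - x0) / lam)   # K_lambda(x0, x_i) = D(|x0 - x_i| / lambda)
    return np.sum(w * y_train) / np.sum(w)

# Hypothetical example data
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(0, 0.2, 200)

f_hat = nadaraya_watson(np.pi / 2, x, y, lam=0.5)  # true f(pi/2) = 1
```

Swapping `D=gaussian` for `D=epanechnikov` changes only how fast the weights decay; the compact-support kernels simply give exactly zero weight beyond $|t| > 1$.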


$\lambda$ is a tuning parameter that adjusts the scale; it is called the bandwidth (smoothing) parameter.

  • small $\lambda$: model complexity ▲, bias ▼, variance ▲ // makes a wiggly model
  • Large $\lambda$: model complexity ▼, bias ▲, variance ▼ // makes a smooth model

// KNN์˜ $k$์ฒ˜๋Ÿผ model complexity๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ์—ญํ• ์„ ํ•œ๋‹ค!!

๐Ÿ’ฅ Problem: <Nadaraya-Watson Estimator> has large bias around the boundary ๐Ÿ˜ฅ


Local Linear Regression with kernel method

The <Nadaraya-Watson Estimator> built on the kernel method had problems: it fails to capture curvature properly, and its bias is large near the boundary. Let's look at a method that addresses these problems!

The <kernel method> so far is in fact weighted local constant fitting: we fit a local constant, but weight the points with a kernel function. So this time, we fit a local line instead of a local constant!

👀 <local linear regression> shows much smaller bias near the boundary than the <NW estimator>!! 👀

Find a local linear estimator $\hat{f}(x) = \hat{\alpha}(x) + \hat{\beta}(x)\cdot x$ where

\[\underset{\alpha(x), \, \beta(x)}{\text{argmin}} \; \sum^n_{i=1} \; K_{\lambda} (x, x_i) \, \left[ y_i - \alpha(x) - \beta(x) \cdot x_i \right]^2\]

์œ„์˜ ์‹์€ local ๊ตฌ๊ฐ„ ์ค‘ ํ•˜๋‚˜์˜ ๊ตฌ๊ฐ„์— ๋Œ€ํ•œ estimated function $\hat{f}(x)$์„ ๊ธฐ์ˆ ํ•œ ๊ฒƒ์ด๋‹ค. ์ด๊ฒƒ์„ ์ „์ฒด ๊ตฌ๊ฐ„์œผ๋กœ ํ™•์žฅํ•ด์„œ ํ–‰๋ ฌ์˜ ํ˜•ํƒœ๋กœ ๊ธฐ์ˆ ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

Let $\mathbf{W}(x) = \text{diag} (K_\lambda (x, x_i))$, and $\mathbf{B}$ be $n\times 2$ with $i$-th row $b(x_i)^T$, where $b(x) = (1, x)^T$.

Then,

\[\hat{f}(x) = b(x)^T \cdot \left( \mathbf{B}^T \mathbf{W}(x) \mathbf{B} \right)^{-1} \cdot \mathbf{B}^T \mathbf{W}(x) \mathbf{y}\]

์‹์ด ๋งŽ์ด ๋ณต์žกํ•˜๋‹ค. ์กฐ๊ธˆ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋ณด๋ ค๋ฉด, weight ๋ถ€๋ถ„์ธ $\mathbf{W}(x)$๋ฅผ ๋นผ๊ณ  ์‹์„ ๋จผ์ € ์ดํ•ดํ•˜๋Š”๊ฒŒ ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค.


The idea of <local linear regression> can also be extended to <local polynomial regression>!!

👀 <local quadratic regression> fits the regions with curvature better than <local linear regression>. 👀

Q. ์ง€๊ธˆ๊นŒ์ง€์˜ ์‚ดํŽด๋ณธ ๊ฒƒ์„ ์ข…ํ•ฉํ•ด๋ณด๋ฉด, <local constant fit> ๋ณด๋‹จ <local linear fit>์ด ์ข‹์•˜๋‹ค. ๋˜, <local quadratic fit>์ด ๋” ์ข‹์•˜๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด, polynomial fitting์˜ ์ฐจ์ˆ˜๊ฐ€ ๋†’์„ ์ˆ˜๋ก ์ข‹์€ ๊ฒƒ ์ผ๊นŒ?

A. NO!!!

์ƒํ™ฉ์— ๋”ฐ๋ผ ๋‹ค๋ฅด๋‹ค!!! ์•„๋ž˜์˜ ๊ทธ๋ฆผ์„ ์‚ดํŽด๋ณด์ž.

์ด ๊ฒฝ์šฐ, <constant fit>์ด ๊ฐ€์žฅ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์˜€๋‹ค. ๋˜, boundary์—์„œ <quadratic fit>์€ ํฐ variance์„ ๋ณด์ด๋Š” ๊ฑธ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. // ์œ„ ๊ทธ๋ฆผ์—์„  bias๋Š” ๋˜‘๊ฐ™์€๋ฐ variance๋Š” quadratic์ด ๊ฐ€์žฅ ์ปธ๋‹ค.

์ด๊ฒƒ์€ ๊ฒฐ๊ตญ <bias-variance tradeoff>๋กœ higher polynomial์€ bias๋ฅผ ์ž˜ ๋งž์ถœ ์ˆ˜ ์žˆ์–ด๋„ variance๊ฐ€ ๋†’์•„์ง€๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค.

๋”ฐ๋ผ์„œ, higher polynomial๋กœ fitting ํ•˜๋Š” ์ „๋žต์ด ํ•ญ์ƒ ์šฐ์„ธํ•˜๋‹ค๊ณ  ๋งํ•  ์ˆ˜๋Š” ์—†๋Š” ๊ฒƒ์ด๋‹ค.

๐Ÿ’ฅ ๋‹น์—ฐํ•˜๊ฒŒ๋„ dataset์˜ true relation์˜ curvurture๊ฐ€ ์ž‘์„ ์ˆ˜๋ก linear fit์ด ์ผ๋ฐ˜์ ์œผ๋กœ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ธ๋‹ค.


Local Likelihood Approach for logistic regression

์ด๋ฒˆ์—๋Š” <spline method>์—์„œ ์‚ดํŽด๋ดค๋˜ <non-parametric (binary) logistic regression> ๋ชจ๋ธ์— <kernel method>๋ฅผ ์ ์šฉํ•ด๋ณด์ž.

In <logistic regression> we estimated $f(x) = x^T \beta(x)$. Adding the <kernel method>, the task becomes solving the optimization problem below for $\hat{\beta}(x)$.

\[\hat{\beta}(x) = \underset{\beta}{\text{argmax}} \; \sum^n_{i=1} \; K_{\lambda}(x, x_i) \cdot \ell (y_i, x_i^T\beta)\]

where $\ell(y, f(x)) = y \cdot f(x) - \log (1 + e^{f(x)})$.
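This local likelihood has no closed form, but it can be maximized numerically. Below is a minimal sketch using plain gradient ascent on the kernel-weighted log-likelihood; the simulated data, bandwidth, learning rate, and iteration count are all illustrative assumptions (a real implementation would typically use Newton / IRLS steps instead).

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def local_logistic(x0, x_train, y_train, lam, n_iter=100, lr=0.5):
    """Maximize sum_i K(x0, x_i) * l(y_i, b(x_i)^T beta) by gradient ascent."""
    w = epanechnikov(np.abs(x_train - x0) / lam)
    B = np.column_stack([np.ones_like(x_train), x_train])   # b(x_i)^T = (1, x_i)
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-B @ beta))        # sigmoid of the linear predictor
        grad = B.T @ (w * (y_train - p))       # gradient of the weighted log-likelihood
        beta += lr * grad / len(x_train)
    return 1 / (1 + np.exp(-np.array([1.0, x0]) @ beta))    # local estimate of P(Y=1 | X=x0)

# Hypothetical example: binary responses with true P(Y=1|x) = sigmoid(2x)
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 300)
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-2 * x))).astype(float)

p_hat = local_logistic(0.0, x, y, lam=1.5)  # true probability at x = 0 is 0.5
```

The only change from ordinary logistic regression is the kernel weight `w` multiplying each observation's contribution to the score.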


Kernel method with $p > 1$

The <kernel method> we have seen so far assumed an input dimension of $p = 1$. Now let's look at its formulation when $p$ is greater than 1.

์ผ๋ฐ˜์ ์œผ๋กœ $p$๊ฐ€ ์ปค์ง€๋ฉด, โ€œneighborhoodโ€์˜ ๊ฐœ๋…์ด ์ž˜ ๋™์ž‘ํ•˜์ง€ ์•Š๊ฒŒ ๋œ๋‹ค. (curse of dimensionality) ๊ทธ๋ž˜์„œ $p > 1$์—์„œ๋Š” โ€œneighborโ€๋ฅผ ๋‹จ์ˆœํžˆ ๊ตฌํ•˜๋Š” ๊ฒƒ๋ณด๋‹จ ์–ด๋–ค structure๋ฅผ ๊ฐ€์ •ํ•˜๊ณ  ์ ‘๊ทผํ•œ๋‹ค.

\[K_\lambda (x, x') = D \left( \frac{(x-x')^T A (x - x')}{\lambda} \right)\]
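The quadratic form $(x-x')^T A (x-x')$ generalizes the 1-D distance $|x - x'|$; choosing $A$ encodes structure, e.g. downweighting coordinates that matter less. A minimal sketch with an Epanechnikov profile (the diagonal $A$ and test points are illustrative assumptions):

```python
import numpy as np

def structured_kernel(x, xp, A, lam):
    """K(x, x') = D((x - x')^T A (x - x') / lambda), with an Epanechnikov-style D."""
    d = x - xp
    t = d @ A @ d / lam          # structured squared distance, scaled by the bandwidth
    return 0.75 * (1 - t**2) if abs(t) <= 1 else 0.0

A = np.diag([1.0, 0.2])          # structure: distances in the 2nd coordinate count less
k_near = structured_kernel(np.array([0.0, 0.0]), np.array([0.1, 0.1]), A, lam=1.0)
k_far  = structured_kernel(np.array([0.0, 0.0]), np.array([1.0, 1.0]), A, lam=1.0)
```

With $A = I$ this reduces to an ordinary radial kernel on Euclidean distance; a non-identity $A$ is one simple way of imposing the "structured assumptions" listed below.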

Examples of such "structured assumptions" include:

  • $f(X) = \alpha + \beta^T X$; linear structure
  • $f$ is a super-smooth function
  • $f$ can be represented by few basis functions
  • Additive model: $f(X_1, \dots, X_p) = \alpha + f_1(X_1) + \cdots + f_p(X_p)$
  • โ€ฆ

์ด ์ค‘์—์„œ ๋งˆ์ง€๋ง‰ ์ ‘๊ทผ์ธ <Additive Model>์ด ๋ฐ”๋กœ ๋‹ค์Œ ํฌ์ŠคํŠธ์—์„œ ๋‹ค๋ฃฐ ๋‚ด์šฉ์ด๋ฉฐ, $p$ ์ฐจ์› ํ•จ์ˆ˜๋ฅผ ์ถ”์ •ํ•˜๋Š” ์–ด๋ ค์šด ๋ฌธ์ œ๋ฅผ 1์ฐจ์› ํ•จ์ˆ˜ $p$๊ฐœ๋ฅผ ์ถ”์ •ํ•˜๋Š” ๋ฌธ์ œ๋กœ ๊ฐ€์ •ํ•ด ํ•ด๊ฒฐํ•˜๋Š” ์ ‘๊ทผ๋ฒ•์ด๋‹ค.

๐Ÿ‘‰ Additive Model ๐Ÿ”ฅ