These are my study notes from the 'Statistical Data Mining' course I took at university in the first semester of 2021. Corrections are always welcome :)


Goal.

The goal of regression is to estimate the <regression function> below.

$$
f(x) = E[Y \mid X = x]
$$

This relation is equivalent to the model below: defining $\epsilon = Y - f(X)$ turns one statement into the other. In other words, finding a good $f$ in either form accomplishes the goal of <regression>.

$$
Y = f(X) + \epsilon, \quad E[\epsilon \mid X] = 0
$$

For <linear regression>, we look for the <regression function> $f(x)$ by modeling the relationship between $X$ and $Y$ as follows.

$$
\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j X_j
$$

For notational convenience, we absorb the <intercept> (or <bias>) term into the sum by setting $X_0 = 1$, and write the model as follows.

$$
\hat{Y} = \sum_{j=0}^{p} \hat{\beta}_j X_j = X^T \hat{\beta}
$$
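As a small illustration of the $X_0 = 1$ convention, the sketch below prepends a column of ones to a data matrix so that the intercept becomes just another coefficient. The helper names `add_intercept` and `predict` and the numbers are hypothetical, chosen only for this example.

```python
import numpy as np


def add_intercept(X: np.ndarray) -> np.ndarray:
    """Prepend a column of ones, i.e. set X_0 = 1 for every observation."""
    return np.column_stack([np.ones(X.shape[0]), X])


def predict(X: np.ndarray, beta_hat: np.ndarray) -> np.ndarray:
    """Compute beta_0 + sum_j beta_j * X_j for each row of X."""
    return add_intercept(X) @ beta_hat


# Illustrative values only (not taken from the course notes).
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
beta_hat = np.array([0.5, 2.0, -1.0])  # (beta_0, beta_1, beta_2)

y_hat = predict(X, beta_hat)

# The same prediction written with the intercept kept separate.
y_hat_explicit = beta_hat[0] + X @ beta_hat[1:]
```

Both forms give the same predictions; the augmented form is what allows the compact notation $X^T\hat{\beta}$.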

Least Squares Estimator

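The solution of <linear regression> can be found by minimizing the residual sum of squares (RSS).

$$
RSS(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)
$$

where $y = (y_1, \dots, y_n)^T$ (response vector) and $X = (x_1, \dots, x_n)^T$ (design matrix, whose $i$-th row is $x_i^T$). From here on, $p$ denotes the number of columns of $X$, counting the intercept column if there is one.

Differentiating the RSS with respect to $\beta$ and setting the result to zero gives the solution. It really only takes careful differentiation, so the derivation is omitted here.

$$
\hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \; RSS(\beta) = (X^T X)^{-1} X^T y
$$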
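For reference, the differentiation can be sketched as follows, assuming $X$ has full column rank so that $X^TX$ is invertible.

$$
\begin{aligned}
RSS(\beta) &= y^T y - 2\beta^T X^T y + \beta^T X^T X \beta \\
\frac{\partial RSS}{\partial \beta} &= -2X^T y + 2X^T X \beta = -2X^T (y - X\beta) \\
\frac{\partial RSS}{\partial \beta} = 0 \;&\Longrightarrow\; X^T X \hat{\beta} = X^T y \;\Longrightarrow\; \hat{\beta} = (X^T X)^{-1} X^T y
\end{aligned}
$$

The second derivative is $2X^TX$, which is positive definite under the full-rank assumption, so this stationary point is the unique minimizer.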

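Applying the prediction rule $\hat{Y} = X^T\hat{\beta}$ from above to every training point, $\hat{y}_i = x_i^T \hat{\beta}$, and stacking the results gives the vector of fitted values.

$$
\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y = Hy, \quad H = X (X^T X)^{-1} X^T
$$

The matrix $H$ is called the <hat matrix>.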
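The closed-form solution and the hat matrix can be checked numerically. The following is a minimal NumPy sketch on synthetic data; the sample size, coefficients, and noise level are arbitrary assumptions, and the hat matrix is taken to be $H = X(X^TX)^{-1}X^T$, the matrix that maps $y$ to the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed values): n observations, p columns incl. intercept.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares via the normal equations (X^T X) beta = X^T y.
# Solving the linear system avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hat matrix H = X (X^T X)^{-1} X^T and fitted values y_hat = H y.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print("beta_hat:", beta_hat)
print("trace(H):", np.trace(H))
```

In this sketch `H @ y` coincides with `X @ beta_hat`, and `H` is symmetric and idempotent with trace equal to `p`, as expected of a projection onto the column space of `X`.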


Design Matrix

There are two types of <design matrix> $X$.

(1) <Random Design>: the $x_i$'s are regarded as i.i.d. realizations of a random vector

(2) <Fixed Design>: the $x_i$'s are fixed (non-random)

The two settings are said to make little difference for <regression estimation>. From here on, we will treat $X$ as a <fixed design> in most cases.


Above, we obtained $\hat{\beta}$ by minimizing the RSS. To discuss how good the resulting model is, we need its <prediction error>, which is described in terms of <bias> and <variance>. What these two concepts mean is covered in a separate post. If both the bias and the variance are small, we consider the model good.
👉 bias & variance

$$
\text{Err}(x_0) = \sigma^2 + \{\text{Bias}(\hat{f}(x_0))\}^2 + \text{Var}(\hat{f}(x_0))
$$

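Assume $Y = X^T\beta + \epsilon$.

If $\text{Var}(Y) = \text{Var}(\epsilon) = \sigma^2$ and the errors are uncorrelated across observations, so that $\text{Var}(y) = \sigma^2 I_n$ for the response vector $y$, then

$$
\begin{aligned}
\text{Var}(\hat{\beta}) &= \text{Var}\left((X^TX)^{-1}X^T y\right) \\
&= \left((X^TX)^{-1}X^T\right) \text{Var}(y) \left((X^TX)^{-1}X^T\right)^T \quad (\because \text{Var}(Ax) = A\,\text{Var}(x)\,A^T) \\
&= (X^TX)^{-1}X^T \cdot \text{Var}(y) \cdot X(X^TX)^{-1} \\
&= (X^TX)^{-1}X^T \cdot \sigma^2 I_n \cdot X(X^TX)^{-1} \\
&= \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1} \\
&= \sigma^2 (X^TX)^{-1}
\end{aligned}
$$

The matrix $X^TX$ appearing above is called the <gram matrix>.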
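The variance formula can be sanity-checked by simulation. The sketch below holds a design matrix fixed, redraws the noise many times, and compares the empirical covariance of the resulting estimates with $\sigma^2 (X^TX)^{-1}$. All numbers (sample size, coefficients, noise level, number of replicates) are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed design: X is drawn once and then held constant across replicates.
n, p, sigma = 40, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.5])

# Theoretical covariance of the least squares estimator.
theory_cov = sigma**2 * np.linalg.inv(X.T @ X)

# Simulate many response vectors y = X beta + eps with Var(eps) = sigma^2 I.
n_rep = 20000
Y = X @ beta + rng.normal(scale=sigma, size=(n_rep, n))  # shape (n_rep, n)

# One estimate per replicate: column k of B solves (X^T X) b = X^T y_k.
B = np.linalg.solve(X.T @ X, X.T @ Y.T)  # shape (p, n_rep)

# np.cov treats each row as one variable, giving a p x p matrix.
empirical_cov = np.cov(B)
max_abs_diff = np.abs(empirical_cov - theory_cov).max()

print("largest entrywise gap:", max_abs_diff)
```

With this many replicates the two matrices should agree up to Monte Carlo error, which shrinks as `n_rep` grows.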

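Next, let's look at the bias by computing $E[\hat{\beta}]$, the mean of $\hat{\beta}$.

If $E[Y] = X^T\beta$, then

$$
E[\hat{\beta}] = E[(X^TX)^{-1}X^T y] = (X^TX)^{-1}X^T E[y] = (X^TX)^{-1}X^T (X\beta) = \beta
$$

Derivation of $E[y]$

For $y = (y_1, \dots, y_n)^T$, $E[y]$ is

$$
E[y] = \begin{pmatrix} E[y_1] \\ \vdots \\ E[y_n] \end{pmatrix} = \begin{pmatrix} x_1^T \beta \\ \vdots \\ x_n^T \beta \end{pmatrix} = X\beta
$$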

Since $E[\hat{\beta}] = \beta$, the LS estimator is an unbiased estimator. This means that, on average over repeated samples, its estimates are centered on the true $\beta$.
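Unbiasedness is a statement about the average over repeated samples, which a short simulation can illustrate. The sketch below uses a hypothetical helper `ols` and arbitrary synthetic values; a single estimate still carries noise, but the mean over many replicates should sit close to the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed design and true coefficients (assumed values for illustration).
n, sigma = 30, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([0.5, 1.5, -1.0])


def ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least squares estimate obtained from the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)


# Refit on fresh noise many times, keeping X fixed.
estimates = np.array([
    ols(X, X @ beta + rng.normal(scale=sigma, size=n))
    for _ in range(5000)
])

mean_estimate = estimates.mean(axis=0)

print("true beta:       ", beta)
print("mean of beta_hat:", mean_estimate)
```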

To summarize, the LS estimator is unbiased, while its variance is the matrix $\sigma^2 (X^TX)^{-1}$. Judged overall, the LS estimator is said to have a relatively large variance, so it is not considered an especially good estimator. For example, when the columns of $X$ are nearly collinear, $X^TX$ is close to singular and the entries of $(X^TX)^{-1}$ become large.

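Next, let's also estimate $\sigma^2$, the variance of the error term.

$$
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^T \hat{\beta})^2
$$

In practice, however, we divide by $n - p$ rather than $n$.

$$
\hat{\sigma}^2 = \frac{1}{n-p} \sum_{i=1}^{n} (y_i - x_i^T \hat{\beta})^2 = \frac{1}{n-p} \| y - \hat{y} \|^2
$$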

Here $(n - p)$ is the <degrees of freedom>. Honestly, I do not understand this part well yet, so I omit a detailed explanation.
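One standard way to see where $n - p$ comes from is a trace argument. It is sketched here assuming $y = X\beta + \epsilon$ with $E[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2 I_n$, an $n \times p$ matrix $X$ of full column rank, and $H = X(X^TX)^{-1}X^T$.

$$
\begin{aligned}
y - \hat{y} &= (I_n - H)\,y = (I_n - H)(X\beta + \epsilon) = (I_n - H)\,\epsilon \quad (\because (I_n - H)X = 0) \\
E\|y - \hat{y}\|^2 &= E\left[\epsilon^T (I_n - H)\,\epsilon\right] = \sigma^2 \operatorname{tr}(I_n - H) \quad (\because I_n - H \text{ is symmetric and idempotent}) \\
\operatorname{tr}(H) &= \operatorname{tr}\left(X(X^TX)^{-1}X^T\right) = \operatorname{tr}\left((X^TX)^{-1}X^TX\right) = \operatorname{tr}(I_p) = p \\
E\|y - \hat{y}\|^2 &= \sigma^2 (n - p)
\end{aligned}
$$

So the expected residual sum of squares is $\sigma^2 (n - p)$ rather than $\sigma^2 n$: fitting $p$ coefficients uses up $p$ of the $n$ dimensions, and dividing by $n - p$ compensates for that.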

In any case, if $\sigma^2$ is estimated this way, it can reportedly be shown that the result is an unbiased estimator.

$$
E[\hat{\sigma}^2] = \sigma^2
$$
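The effect of the divisor can also be seen in a simulation. The sketch below, with arbitrary assumed values for `n`, `p`, and `sigma`, averages both versions of the estimator over many noise draws: dividing by `n` should come out near $\sigma^2 (n-p)/n$, while dividing by `n - p` should come out near $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Small n relative to p makes the difference between the divisors visible.
n, p, sigma = 20, 5, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)

# Many response vectors for the same fixed design.
n_rep = 20000
Y = X @ beta + rng.normal(scale=sigma, size=(n_rep, n))  # shape (n_rep, n)

# Least squares fit per replicate, then the residual sum of squares.
B = np.linalg.solve(X.T @ X, X.T @ Y.T)  # shape (p, n_rep)
residuals = Y - (X @ B).T                # shape (n_rep, n)
rss = (residuals**2).sum(axis=1)

mean_div_n = (rss / n).mean()                # near sigma^2 * (n - p) / n
mean_div_n_minus_p = (rss / (n - p)).mean()  # near sigma^2

print("sigma^2:           ", sigma**2)
print("mean of RSS/n:     ", mean_div_n)
print("mean of RSS/(n-p): ", mean_div_n_minus_p)
```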