This post summarizes what I learned and studied in the "Probability and Statistics (MATH230)" course. The full list of posts is available at Probability and Statistics 🎲

In the previous post, "Test on Regression", we derived the distributions of the regression coefficients $B_1$ and $B_0$. In this post, we put those results together to estimate the distribution of the response that the regression model produces.

Personally, I understood it like this: "Since $B_1$ and $B_0$ are estimated regression coefficients, the response $y$ obtained from the model is also an estimated response and carries some uncertainty. We estimate that uncertainty using the distributions of $B_1$ and $B_0$, which model the uncertainty of the coefficients themselves!"

We will first estimate the uncertainty of the response the model outputs through the mean response $\mu_{Y \mid x_0}$, and then estimate the uncertainty of a prediction made for new data $X_0 = x_0$.


Estimate on Mean Response

Suppose we have sample points $(x_1, y_1), \dots, (x_n, y_n)$ from $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where the $\epsilon_i$ are iid $N(0, \sigma^2)$. Here, $\beta_0$ and $\beta_1$ are unknown parameters.

Q. Given data $x = x_0$, what is the mean response $\mu_{Y \mid x_0}$?

Here, $x_0$ is not a value taken from the sample points or fixed in advance; it is the value at which we want to predict $y_0$, the value of the variable $Y_0$.

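$$
\mu_{Y \mid x_0} = E[Y_0] = E[\beta_0 + \beta_1 x_0 + \epsilon_0] = \beta_0 + \beta_1 x_0 + \underbrace{E[\epsilon_0]}_{=0} = \beta_0 + \beta_1 x_0
$$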

However, since we do not know the values of $\beta_0$ and $\beta_1$, we define a suitable point estimator $\hat{Y}_0$ from the sample.

$$
\hat{Y}_0 = B_0 + B_1 x_0
$$

Now let us look at the distribution of $\hat{Y}_0$. Since $B_0$ and $B_1$ are normally distributed (both are linear combinations of the normal $Y_i$), $\hat{Y}_0$ also follows a normal distribution.

1. Mean

$$
E[\hat{Y}_0] = E[B_0 + B_1 x_0] = \beta_0 + \beta_1 x_0 = \mu_{Y \mid x_0}
$$

This also tells us that $\hat{Y}_0$ is an unbiased estimator!

2. Variance

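Substituting $B_0 = \bar{y} - B_1 \bar{x}$ gives $\hat{Y}_0 = \bar{y} + B_1 (x_0 - \bar{x})$, so

$$
\text{Var}(\hat{Y}_0) = \text{Var}(\bar{y} + B_1 (x_0 - \bar{x})) = \text{Var}(\bar{y}) + \text{Var}(B_1 (x_0 - \bar{x})) + 2 (x_0 - \bar{x}) \, \text{Cov}(\bar{y}, B_1)
$$

Here $\bar{y} \perp B_1$, so $\text{Cov}(\bar{y}, B_1) = 0$. (Homework 🎈)

Therefore,

$$
\begin{aligned}
\text{Var}(\hat{Y}_0)
&= \text{Var}(\bar{y}) + \text{Var}(B_1 (x_0 - \bar{x})) + 2 (x_0 - \bar{x}) \underbrace{\text{Cov}(\bar{y}, B_1)}_{=0} \\
&= \frac{\sigma^2}{n} + (x_0 - \bar{x})^2 \cdot \text{Var}(B_1) \\
&= \frac{\sigma^2}{n} + (x_0 - \bar{x})^2 \cdot \frac{\sigma^2}{S_{xx}} \\
&= \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)
\end{aligned}
$$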
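For reference, here is one way the homework above might be worked out. It is only a sketch, and it assumes the usual least-squares form $B_1 = \frac{1}{S_{xx}} \sum_i (x_i - \bar{x}) Y_i$ with $S_{xx} = \sum_i (x_i - \bar{x})^2$, which this post does not restate.

$$
\begin{aligned}
\text{Cov}(\bar{y}, B_1)
&= \text{Cov}\left( \frac{1}{n} \sum_{j=1}^{n} Y_j, \; \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x}) Y_i \right) \\
&= \frac{1}{n S_{xx}} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - \bar{x}) \, \text{Cov}(Y_j, Y_i) \\
&= \frac{\sigma^2}{n S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x}) = 0
\end{aligned}
$$

The double sum collapses because the $Y_i$ are independent with variance $\sigma^2$, so $\text{Cov}(Y_j, Y_i)$ is $\sigma^2$ when $i = j$ and $0$ otherwise, and $\sum_i (x_i - \bar{x}) = 0$. Since $\bar{y}$ and $B_1$ are both linear combinations of the same normal $Y_i$, they are jointly normal, and zero covariance then gives independence.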

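Therefore, the distribution of $\hat{Y}_0$ is as follows.

$$
\hat{Y}_0 \sim N\left( \mu_{Y \mid x_0}, \; \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) \right)
$$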
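As a sanity check on the mean and variance of $\hat{Y}_0$, here is a small Monte Carlo sketch that uses only Python's standard library. The parameter values (`beta0`, `beta1`, `sigma`), the design points `xs`, and the function name `simulate_y_hat0` are all made up for illustration and do not come from the course.

```python
import random
import statistics


def simulate_y_hat0(beta0, beta1, sigma, xs, x0, reps, seed=0):
    """Refit the regression on `reps` simulated samples and collect Y_hat_0.

    Returns (empirical mean, empirical variance, theoretical variance).
    """
    rng = random.Random(seed)
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)

    estimates = []
    for _ in range(reps):
        # Draw Y_i = beta0 + beta1 * x_i + eps_i with eps_i ~ N(0, sigma^2).
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        y_bar = sum(ys) / n
        b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
        b0 = y_bar - b1 * x_bar
        estimates.append(b0 + b1 * x0)

    # sigma^2 * (1/n + (x0 - x_bar)^2 / S_xx), the variance derived in the text.
    theoretical_var = sigma ** 2 * (1 / n + (x0 - x_bar) ** 2 / s_xx)
    return statistics.fmean(estimates), statistics.variance(estimates), theoretical_var


# Hypothetical setup: true line y = 1 + 2x, sigma = 1.5, ten design points.
xs = [float(i) for i in range(1, 11)]
emp_mean, emp_var, theo_var = simulate_y_hat0(1.0, 2.0, 1.5, xs, x0=12.0, reps=20000)
print("empirical mean of Y_hat_0:", emp_mean)
print("empirical variance       :", emp_var)
print("theoretical variance     :", theo_var)
```

If the derivation holds, the empirical mean should land near $\beta_0 + \beta_1 x_0$ and the two variances should be close to each other, up to simulation noise.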

Since the value of the error variance $\sigma^2$ is unknown, we use the sample error variance $s^2$ instead, which gives

$$
\frac{\hat{Y}_0 - \mu_{Y \mid x_0}}{s \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim t(n-2)
$$

Using this distribution, we can construct a "confidence interval" for the mean response $\mu_{Y \mid x_0}$ at the data point $x_0$! 😆
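To make the interval concrete, here is a minimal Python sketch of $\hat{Y}_0 \pm t_{\alpha/2}(n-2) \cdot s \sqrt{1/n + (x_0 - \bar{x})^2 / S_{xx}}$. It assumes the usual definition $s^2 = \text{SSE}/(n-2)$, takes the critical value `t_crit` as an input read from a t-table, and runs on made-up data; the function name `mean_response_ci` is hypothetical.

```python
import math


def mean_response_ci(xs, ys, x0, t_crit):
    """Confidence interval for the mean response at x0.

    t_crit is the upper alpha/2 critical value of t(n - 2), supplied by the
    caller (for example from a t-table).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Least-squares estimates B1 and B0.
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    # Sample error standard deviation, assuming s^2 = SSE / (n - 2).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))

    y_hat0 = b0 + b1 * x0
    half_width = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
    return y_hat0 - half_width, y_hat0 + half_width


# Made-up data with n = 10, so the t distribution has 8 degrees of freedom.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 14.9, 17.1, 19.0, 21.2]
T_CRIT_95 = 2.306  # upper 2.5% point of t(8), read from a standard t-table

lower, upper = mean_response_ci(xs, ys, x0=5.5, t_crit=T_CRIT_95)
print(f"95% CI for the mean response at x0 = 5.5: ({lower:.3f}, {upper:.3f})")
```

The interval is narrowest at $x_0 = \bar{x}$ and widens as $x_0$ moves away from it, because of the $(x_0 - \bar{x})^2$ term. If SciPy is available, `scipy.stats.t.ppf(0.975, n - 2)` gives the same critical value without a table.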


Prediction Interval

The "mean response $\mu_{Y \mid x_0}$" derived above was about estimating the uncertainty of the model when the value $x = x_0$ is given. This time, we estimate the uncertainty of the prediction when new data $X_0 = x_0$ is fed to the model. Because the response $Y_0$ of $X_0$ is independent of the existing $Y_i$ (even when $x_0 = x_i$, we still have $Y_0 \perp Y_i$), we need a different approach from the "mean response" case!

$Y_0$ is given by $Y_0 = \beta_0 + \beta_1 x_0 + \epsilon_0$, where $\epsilon_0 \sim N(0, \sigma^2)$ is independent of $\epsilon_1, \dots, \epsilon_n$.

Therefore, the distribution of $Y_0$ is as follows.

$$
Y_0 \sim N(\beta_0 + \beta_1 x_0, \; \sigma^2)
$$

Here, $Y_0 \perp Y_i$, and likewise $Y_0 \perp \hat{Y}_0$, because $\hat{Y}_0$ is a function of $Y_1, \dots, Y_n$ only.

We already derived the distribution of $\hat{Y}_0$ above. Using it as is,

$$
\hat{Y}_0 \sim N\left( \beta_0 + \beta_1 x_0, \; \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) \right)
$$

Since $Y_0$ is independent of $\hat{Y}_0$, the following holds.

$$
Y_0 - \hat{Y}_0 \sim N\left( 0, \; \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) \right)
$$
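The mean and variance of this difference can be unpacked in two lines. This is a sketch that uses only the independence $Y_0 \perp \hat{Y}_0$ and the two distributions already stated:

$$
\begin{aligned}
E[Y_0 - \hat{Y}_0] &= (\beta_0 + \beta_1 x_0) - (\beta_0 + \beta_1 x_0) = 0 \\
\text{Var}(Y_0 - \hat{Y}_0) &= \text{Var}(Y_0) + \text{Var}(\hat{Y}_0) - 2 \underbrace{\text{Cov}(Y_0, \hat{Y}_0)}_{=0}
= \sigma^2 + \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)
\end{aligned}
$$

Normality follows because the difference of two independent normal variables is again normal. The leading $1$ inside the parentheses is the contribution of the new error $\epsilon_0$.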

Since the value of the error variance $\sigma^2$ is unknown, we use the sample error variance $s^2$ instead, which gives

$$
\frac{Y_0 - \hat{Y}_0}{s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim t(n-2)
$$

A point worth noting is that, at the same confidence level, the "response interval" (the confidence interval for the mean response) is narrower than the "prediction interval": the prediction interval has an extra $1$ inside the square root, which comes from the variance $\sigma^2$ of the new error $\epsilon_0$. My personal reading is that this gap arises because the newly added data $X_0$ is independent of the existing data. Besides, the "response interval" and the "prediction interval" target different quantities to begin with! 😁
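The gap between the two intervals can be checked numerically. Below is a rough Python sketch that computes both half-widths at the same $x_0$; it assumes $s^2 = \text{SSE}/(n-2)$, takes the t critical value from a table, and uses made-up data. The function name `interval_half_widths` is hypothetical.

```python
import math


def interval_half_widths(xs, ys, x0, t_crit):
    """Return (y_hat0, s, ci_half_width, pi_half_width) at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    b0 = y_bar - b1 * x_bar

    # Sample error standard deviation, assuming s^2 = SSE / (n - 2).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))

    spread = 1 / n + (x0 - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * math.sqrt(spread)      # interval for the mean response
    pi_half = t_crit * s * math.sqrt(1 + spread)  # interval for one new Y_0
    return b0 + b1 * x0, s, ci_half, pi_half


# Made-up data with n = 10, so the t distribution has 8 degrees of freedom.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 14.9, 17.1, 19.0, 21.2]
T_CRIT_95 = 2.306  # upper 2.5% point of t(8), read from a standard t-table

y_hat0, s, ci_half, pi_half = interval_half_widths(xs, ys, x0=7.0, t_crit=T_CRIT_95)
print(f"mean response interval: {y_hat0:.3f} +/- {ci_half:.3f}")
print(f"prediction interval   : {y_hat0:.3f} +/- {pi_half:.3f}")
```

Both intervals share the same center $\hat{Y}_0$; only the half-width differs, and the squared half-widths differ by exactly $(t \cdot s)^2$, the part due to $\epsilon_0$.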

It seems I am not the only one who finds these two concepts confusing. A Google search turned up quite a few posts comparing them. Below is a one-sentence description of each, taken from one of those posts.

A mean response interval is a confidence interval for the mean of all Y's at a given X value.

A prediction interval is a prediction interval for one single Y at a given X value.

– from a post of 'Carsten Grube'


With this, we have gone through everything covered in the regular lectures of "Probability and Statistics (MATH230)"!! 😁