This post summarizes what I learned and studied in the "Probability and Statistics (MATH230)" course. The full list of posts is available at Probability and Statistics 🎲

In the previous post, "Test on Regression", we derived the distributions of the regression coefficients $B_1$ and $B_0$. In this post, we combine those results to estimate the distribution of the response that we get from the regression model.

Personally, I understood it this way: "Since $B_1$ and $B_0$ are estimated regression coefficients, the response $y$ produced by the model is also an estimated response and carries some uncertainty. We estimate that uncertainty using the distributions of $B_1$ and $B_0$, which model the uncertainty of the coefficients!"

We will use the mean response $\mu_{Y\mid x_0}$ to estimate the uncertainty of the response the model outputs, and we will also estimate the uncertainty of the prediction made for new data $X_0 = x_0$.


Estimate on Mean Response

Suppose we have sample points $(x_1, y_1), \dots, (x_n, y_n)$ from $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where the $\epsilon_i$ are iid $N(0, \sigma^2)$. Here, $\beta_0$ and $\beta_1$ are unknown parameters.

Q. Given $x = x_0$, what is the mean response $\mu_{Y\mid x_0}$?

Here, $x_0$ is not a value taken from the sample points or fixed in advance; it is the value at which we want to predict the value $y_0$ of the variable $Y_0$.

\[\mu_{Y \mid x_0} = E[Y_0] = E[\beta_0 + \beta_1 x_0 + \epsilon_0] = \beta_0 + \beta_1 x_0 + \cancelto{0}{E[\epsilon_0]}\]

However, since we do not know the values of $\beta_0$ and $\beta_1$, we define a suitable point estimator $\hat{Y}_0$ from the sample.

\[\hat{Y}_0 = B_0 + B_1 x_0\]

Now, let's look at the distribution of $\hat{Y}_0$. Since $B_0$ and $B_1$ are normally distributed (both are linear combinations of the independent normal $Y_i$), $\hat{Y}_0$ is normally distributed as well.

1. Mean

\[\begin{aligned} E[\hat{Y}_0] &= E[B_0 + B_1 x_0] \\ &= \beta_0 + \beta_1 x_0 = \mu_{Y \mid x_0} \end{aligned}\]

This also shows that $\hat{Y}_0$ is an unbiased estimator of $\mu_{Y \mid x_0}$!

2. Variance

Using $B_0 = \bar{y} - B_1 \bar{x}$, we can write $\hat{Y}_0 = \bar{y} + B_1 (x_0 - \bar{x})$, so

\[\begin{aligned} \text{Var}(\hat{Y}_0) &= \text{Var}(\bar{y} + B_1 (x_0 - \bar{x})) \\ &= \text{Var}(\bar{y}) + \text{Var}(B_1 (x_0 - \bar{x})) + 2 (x_0 - \bar{x}) \, \text{Cov}(\bar{y}, B_1) \end{aligned}\]

Here, since $\bar{y} \perp B_1$, we have $\text{Cov}(\bar{y}, B_1) = 0$. (Homework 🎈)

Therefore,

\[\begin{aligned} \text{Var}(\hat{Y}_0) &= \text{Var}(\bar{y}) + \text{Var}(B_1 (x_0 - \bar{x})) + \cancelto{0}{2 (x_0 - \bar{x}) \, \text{Cov}(\bar{y}, B_1)} \\ &= \frac{\sigma^2}{n} + (x_0 - \bar{x})^2 \cdot \text{Var}(B_1) \\ &= \frac{\sigma^2}{n} + (x_0 - \bar{x})^2 \cdot \frac{\sigma^2}{S_{xx}} \\ &= \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right) \end{aligned}\]
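
For readers who want to check the claim $\text{Cov}(\bar{y}, B_1) = 0$, here is one possible sketch. It assumes the usual least-squares form $B_1 = \sum_i (x_i - \bar{x}) Y_i / S_{xx}$ and writes $\bar{y} = \frac{1}{n} \sum_i Y_i$; only the independence of the $Y_i$ and $\text{Var}(Y_i) = \sigma^2$ are used.

\[\begin{aligned} \text{Cov}(\bar{y}, B_1) &= \text{Cov}\left( \frac{1}{n} \sum_{i=1}^{n} Y_i, \; \sum_{j=1}^{n} \frac{x_j - \bar{x}}{S_{xx}} Y_j \right) \\ &= \sum_{i=1}^{n} \frac{1}{n} \cdot \frac{x_i - \bar{x}}{S_{xx}} \cdot \text{Var}(Y_i) \\ &= \frac{\sigma^2}{n S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x}) = 0 \end{aligned}\]

The cross terms with $i \ne j$ vanish because the $Y_i$ are independent, and the last sum is zero by the definition of $\bar{x}$. Since $\bar{y}$ and $B_1$ are jointly normal, zero covariance also gives independence.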

๋”ฐ๋ผ์„œ, $\hat{Y}_0$์˜ ๋ถ„ํฌ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\hat{Y}_0 \sim N \left( \mu_{Y \mid x_0}, \; \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right) \right)\]
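
As a sanity check on this distribution, the following is a minimal simulation sketch using only the Python standard library. The parameter values ($\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 1$), the design points, and the function names are all made up for illustration; they do not come from the course material.

```python
import random
import statistics


def simulate_y0_hat(xs, beta0, beta1, sigma, x0, trials=20000, seed=0):
    """Draw `trials` datasets from Y_i = beta0 + beta1*x_i + eps_i
    and return the fitted value B0 + B1*x0 from each one."""
    rng = random.Random(seed)
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    fitted = []
    for _ in range(trials):
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        y_bar = sum(ys) / n
        b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
        b0 = y_bar - b1 * x_bar
        fitted.append(b0 + b1 * x0)
    return fitted


def theoretical_variance(xs, sigma, x0):
    """sigma^2 * (1/n + (x0 - x_bar)^2 / S_xx), the variance derived in the text."""
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return sigma ** 2 * (1 / n + (x0 - x_bar) ** 2 / s_xx)


# Hypothetical setup, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = simulate_y0_hat(xs, beta0=1.0, beta1=2.0, sigma=1.0, x0=4.0)

print(statistics.fmean(fitted))     # should be close to 1 + 2*4 = 9
print(statistics.variance(fitted))  # should be close to the value below
print(theoretical_variance(xs, sigma=1.0, x0=4.0))  # about 0.3
```

With $n = 5$, $\bar{x} = 3$ and $S_{xx} = 10$, the formula gives $1/5 + 1/10 = 0.3$ at $x_0 = 4$, and the simulated mean and variance should land near $9$ and $0.3$ up to Monte Carlo noise.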

Since the value of the error variance $\sigma^2$ is unknown, we use the sample error variance $s^2$ instead, which gives

\[\frac{\hat{Y}_0 - \mu_{Y \mid x_0}}{s \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim t(n-2)\]

์ด์— ์œ„์˜ ๋ถ„ํฌ๋ฅผ ์‚ฌ์šฉํ•ด, data $x_0$์— ๋Œ€ํ•œ mean response $\mu_{Y \mid x_0}$์˜ โ€œconfidence intervalโ€์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค! ๐Ÿ˜†


Prediction Interval

The "mean response $\mu_{Y \mid x_0}$" treated above was about estimating the uncertainty of the model when we are given the value $x = x_0$. This time, we estimate the uncertainty of the prediction made when new data $X_0 = x_0$ is given to the model. Because the response $Y_0$ at $X_0$ is independent of the existing $Y_i$ (even if $x_0 = x_i$, we still have $Y_0 \perp Y_i$), we have to approach this differently from the "mean response" case!

$Y_0$ is given by $Y_0 = \beta_0 + \beta_1 x_0 + \epsilon_0$, where $\epsilon_0 \sim N(0, \sigma^2)$ is independent of $\epsilon_1, \dots, \epsilon_n$.

๋”ฐ๋ผ์„œ, $Y_0$์˜ ๋ถ„ํฌ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[Y_0 \sim N(\beta_0 + \beta_1 x_0, \; \sigma^2)\]

Here $Y_0 \perp Y_i$, and likewise $Y_0 \perp \hat{Y}_0$, since $\hat{Y}_0$ depends only on $Y_1, \dots, Y_n$.

We already derived the distribution of $\hat{Y}_0$ above. Using it as is,

\[\hat{Y}_0 \sim N \left( \beta_0 + \beta_1 x_0, \; \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right) \right)\]

Since $Y_0$ is independent of $\hat{Y}_0$, their variances add, and the following holds.

\[Y_0 - \hat{Y}_0 \sim N \left( 0, \; \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right) \right)\]

Since the value of the error variance $\sigma^2$ is unknown, we use the sample error variance $s^2$ instead, which gives

\[\frac{Y_0 - \hat{Y}_0}{s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim t(n-2)\]

One thing to note is that, at the same confidence level, the "response interval" is narrower than the "prediction interval": the prediction interval has an extra $1$ under the square root, which comes from the variance $\sigma^2$ of the new error $\epsilon_0$. My personal interpretation is that this gap arises because the newly added data point $X_0$ is independent of the existing data. Also, the "response interval" and the "prediction interval" differ in what they estimate in the first place!
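
To make the comparison concrete, here is a small sketch that computes both half-widths at the same $x_0$. The data, the function name, and the critical value `t_crit` (taken from a t-table, $t_{0.025}(3) \approx 3.182$) are assumptions for illustration, and $s^2$ is assumed to be $SSE/(n-2)$.

```python
import math


def interval_half_widths(xs, ys, x0, t_crit):
    """Return (y0_hat, ci_half, pi_half) at x0:
    ci_half is the half-width of the mean-response interval,
    pi_half is the half-width of the prediction interval."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # sample error standard deviation

    y0_hat = b0 + b1 * x0
    h0 = 1 / n + (x0 - x_bar) ** 2 / s_xx   # shared term under the root
    ci_half = t_crit * s * math.sqrt(h0)        # mean response
    pi_half = t_crit * s * math.sqrt(1 + h0)    # single new observation
    return y0_hat, ci_half, pi_half


# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

y0_hat, ci_half, pi_half = interval_half_widths(xs, ys, x0=3.0, t_crit=3.182)
print(f"fitted value: {y0_hat:.3f}")
print(f"mean response interval: +/- {ci_half:.3f}")
print(f"prediction interval:    +/- {pi_half:.3f}")
```

Because $\sqrt{1 + h_0} > \sqrt{h_0}$ for any data, the prediction half-width is always the larger of the two in this sketch, and it does not shrink to zero even as $n$ grows.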

๋ณธ์ธ ๋ง๊ณ ๋„ ๋‘ ๊ฐœ๋…์ด ํ—ท๊ฐˆ๋ฆฌ๋Š” ์‚ฌ๋žŒ์ด ๋งŽ์€ ๊ฒƒ ๊ฐ™์•„. ๊ตฌ๊ธ€์— ๊ฒ€์ƒ‰ํ•ด๋ณด๋‹ˆ ๋‘˜์„ ๋น„๊ตํ•˜๋Š” ํฌ์ŠคํŠธ๊ฐ€ ๊ฝค ์žˆ์—ˆ๋‹ค. ์•„๋ž˜๋Š” ๊ทธ ์ค‘์—์„œ ๋‘˜์„ ํ•œ ๋ฌธ์žฅ์„ ๋น„๊ตํ•œ ๋ฌธ๊ตฌ๋ฅผ ๊ฐ€์ ธ์˜จ ๊ฒƒ์ด๋‹ค.

A mean response interval is a confidence interval for the mean of all Y's at a given X value.

A prediction interval is a prediction interval for one single Y at a given X value.

- from a post by 'Carsten Grube'


With this, we have covered everything taught in the regular lectures of "Probability and Statistics (MATH230)"!!