This post summarizes what I learned and studied in the "Probability and Statistics (MATH230)" course. The full list of posts is available at Probability and Statistics 🎲


This post covers how the <Interval Estimation> introduced in the Interval Estimation post can be applied to specific situations.

Two Samples Estimation

Suppose there are two populations, and assume that they follow some distributions with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively.

Take random samples $X_1, \dots, X_{n_1}$ and $Y_1, \dots, Y_{n_2}$, and assume that $X_i$s and $Y_j$s are independent.

Suppose that their observed sample means are $\bar{x}$ and $\bar{y}$, and their sample variances are $s_1^2$ and $s_2^2$.

✨ Goal: Find a $100(1-\alpha)\%$ confidence interval for $\mu_1 - \mu_2$.


$\sigma_1^2$ and $\sigma_2^2$ are known

By the CLT, for large $n_1$ and $n_2$, $\bar{X} \overset{D}{\approx} N(\mu_1, \sigma_1^2 / n_1)$ and $\bar{Y} \overset{D}{\approx} N(\mu_2, \sigma_2^2 / n_2)$; in addition, $\bar{X} \perp \bar{Y}$.

Then,

\[\bar{X} - \bar{Y} \overset{D}{\approx} N(\mu_1 - \mu_2, \; \sigma_1^2 / n_1 + \sigma_2^2 / n_2)\]

so the $100(1-\alpha)\%$ confidence interval is

\[\left( \bar{x} - \bar{y} - z_{\alpha/2} \cdot \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} , \; \bar{x} - \bar{y} + z_{\alpha/2} \cdot \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \right)\]

๐Ÿ’ฅ ์ด๋•Œ, ์ฃผ์˜ํ•  ์ ์€ ์ด๊ฒƒ์€ true interval์ด ์•„๋‹ˆ๋ผ approximate interval์ด๋ผ๋Š” ์ ์ด๋‹ค; by CLT

๐Ÿ’ฅ ๋˜, ์ด ๊ทผ์‚ฌ๋Š” $X_i$, $Y_j$๊ฐ€ ๋ชจ๋‘ iid normal์ด์—ฌ์•ผ ๊ฐ€๋Šฅํ•˜๋‹ค!


$\sigma_1^2$ and $\sigma_2^2$ are unknown, but known that $\sigma_1^2 = \sigma_2^2$

์•ž์—์„œ ์šฐ๋ฆฌ๋Š” CLT๋ฅผ ์‚ฌ์šฉํ•ด $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$๋ฅผ ์‚ฌ์šฉํ–ˆ์—ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์ด๋ฒˆ์—๋Š” ์ •ํ™•ํ•œ $\sigma^2$์˜ ๊ฐ’์„ ์•Œ์ง€ ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์— $\sigma^2$ ๋Œ€์‹  sample variance $s^2$์„ ์‚ฌ์šฉํ•œ๋‹ค!!

[Previous]

\[\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\color{red}{\sigma_1^2}}{n_1} + \dfrac{\color{red}{\sigma_2^2}}{n_2}}} \;\overset{D}{\sim} \; N(0, 1)\]

To replace $\sigma^2$, we will use the <pooled sample variance>, which combines the sample variances of the two samples.

Recall. sample variance and chi-square

\[\frac{(n - 1) \cdot S^2}{\sigma^2} = \sum_{i=1}^{n} \left(\frac{X_i - \bar{X}}{\sigma}\right)^2 \; \sim \; \chi^2 (n-1)\]

Definition. pooled sample variance

\[\begin{aligned} S_p^2 &= \frac{\displaystyle \sum_{i=1}^{n_1} (X_i - \bar{X})^2 + \sum_{j=1}^{n_2} (Y_j - \bar{Y})^2}{n_1 - 1 + n_2 - 1} \end{aligned}\]

์œ„์˜ ์‹์„ ์ž˜ ๋ณ€ํ˜•ํ•ด๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[(n_1 - 1 + n_2 - 1) \cdot S_p^2 = (n_1 - 1) \cdot S_1^2 + (n_2 - 1) \cdot S_2^2\]

\[\frac{(n_1 - 1 + n_2 - 1) \cdot S_p^2}{\sigma^2} = \frac{(n_1 - 1) \cdot S_1^2 + (n_2 - 1) \cdot S_2^2}{\sigma^2} \; \overset{D}{\sim} \; \chi^2(n_1 + n_2 - 2)\]

๋”ฐ๋ผ์„œ! pooled sample variance $S_p^2$๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์‹์œผ๋กœ ๋‹ค์‹œ ์“ฐ๋ฉด,

\[\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \; \overset{D}{\sim} t (n_1 + n_2 - 2)\]
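One way to see where this t-distribution comes from is sketched here, under the added assumption that both samples are normal with a common variance $\sigma^2$ (so that $Z$ and $V$ below are independent). The symbols $Z$, $V$, and $T$ are introduced only for this sketch.

\[Z = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \; \sim \; N(0, 1), \qquad V = \frac{(n_1 + n_2 - 2) \cdot S_p^2}{\sigma^2} \; \sim \; \chi^2(n_1 + n_2 - 2)\]

\[T = \frac{Z}{\sqrt{V / (n_1 + n_2 - 2)}} = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \; \sim \; t(n_1 + n_2 - 2)\]

The unknown $\sigma$ cancels in the ratio, which is why the final statistic depends only on quantities computable from the data.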

๊ทธ ๋‹ค์Œ์€ ์ง€๊ธˆ๊นŒ์ง€ ํ•ด์˜จ ๊ฒƒ์ฒ˜๋Ÿผ t-distribution์„ ๋ฐ”ํƒ•์œผ๋กœ interval estimation์„ ์ง„ํ–‰ํ•˜๋ฉด ๋œ๋‹ค!


$\sigma_1^2$ and $\sigma_2^2$ are unknown and not equal

First, let us write the statistic in terms of the population parameters.

\[\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \; \sim \; N(0, 1)\]

Since we do not know $\sigma_1^2$ and $\sigma_2^2$, we replace them with the sample variances.

\[\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \; \overset{D}{\sim} \; ??\]

The right-hand side is ?? because the exact distribution for this case is not known!! So we approximate it by a t-distribution with some degrees of freedom (DOF) $d$ and carry out the estimation!

\[\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \; \overset{D}{\sim} \; t(d)\]

where

\[d = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2 / (n_1-1) + (s_2^2/n_2)^2 / (n_2-1)}\]

This approximation is known as <Welch's t-test>, and the formula used to obtain the DOF is called the <Welch-Satterthwaite equation>.
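The degrees-of-freedom formula is easy to get wrong by hand, so here is a small sketch of it in Python. The function name `welch_satterthwaite_df` and the data are hypothetical, chosen only to illustrate the computation.

```python
from statistics import variance


def welch_satterthwaite_df(x, y):
    """Approximate degrees of freedom d from the Welch-Satterthwaite equation."""
    n1, n2 = len(x), len(y)
    a = variance(x) / n1  # s_1^2 / n_1
    b = variance(y) / n2  # s_2^2 / n_2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


# Hypothetical data, for illustration only
x = [5.1, 4.9, 5.6, 5.8, 5.0]
y = [4.2, 4.8, 4.4, 4.6, 4.5]
d = welch_satterthwaite_df(x, y)
print(f"approximate degrees of freedom d = {d:.3f}")
```

In general $d$ is not an integer and lies between $\min(n_1 - 1, n_2 - 1)$ and $n_1 + n_2 - 2$. When reading critical values from a printed t-table, a common convention is to round $d$ down to the nearest integer.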


๋งบ์Œ๋ง

The next post looks at another kind of Two Samples Estimation, the case of <Paired Observation>! 😁

👉 Two Samples Estimation: Paired Observations


So far our interest has been in estimating the population mean $\mu$: we estimated a single population mean $\mu$, or the difference between two population means. The post after that estimates the population variance $\sigma^2$: either $\sigma^2$ itself, or the ratio $\sigma_1^2 / \sigma_2^2$ of the two population variances.

👉 Variance Estimation