Two Samples Estimation: Difference Between Two Means
This post summarizes what I learned and studied in the "Probability and Statistics (MATH230)" course. The full list of posts is available at Probability and Statistics 🎲
This post covers how the <Interval Estimation> introduced in the Interval Estimation post can be applied to a specific situation.
Two Samples Estimation
Suppose there are two populations, each following some distribution, with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$ respectively.
Take random samples $X_1, \dots, X_{n_1}$ and $Y_1, \dots, Y_{n_2}$, and assume that the $X_i$s and $Y_j$s are independent.
Suppose that their observed sample means are $\bar{x}$ and $\bar{y}$, and their sample variances are $s_1^2$ and $s_2^2$.
✨ Goal: Find a $100(1-\alpha)\%$ confidence interval for $\mu_1 - \mu_2$.
$\sigma_1^2$ and $\sigma_2^2$ are known
By the CLT, $\bar{X} \overset{D}{\approx} N(\mu_1, \sigma_1^2 / n_1)$ and $\bar{Y} \overset{D}{\approx} N(\mu_2, \sigma_2^2 / n_2)$; in addition, $\bar{X} \perp \bar{Y}$.
Then,
\[\bar{X} - \bar{Y} \overset{D}{\approx} N(\mu_1 - \mu_2, \; \sigma_1^2 / n_1 + \sigma_2^2 / n_2)\]
so the confidence interval is
\[\left( \bar{x} - \bar{y} - z_{\alpha/2} \cdot \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} , \; \bar{x} - \bar{y} + z_{\alpha/2} \cdot \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \right)\]
💥 Note that this is an approximate interval, not an exact one, because it relies on the CLT.
💥 The interval is exact only when the $X_i$ and $Y_j$ are all iid normal; otherwise the approximation requires $n_1$ and $n_2$ to be large.
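The known-variance interval above can be computed directly. The following is a minimal sketch, not part of the original notes: the function name `two_sample_z_interval` and the input numbers are made up for illustration, and $z_{\alpha/2}$ is taken from the standard library's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_interval(x_bar, y_bar, var1, var2, n1, n2, alpha=0.05):
    """Approximate 100(1 - alpha)% CI for mu_1 - mu_2 when both variances are known.

    Hypothetical helper for illustration only.
    """
    # z_{alpha/2}: the upper alpha/2 quantile of the standard normal
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of X-bar minus Y-bar: sqrt(sigma_1^2/n_1 + sigma_2^2/n_2)
    se = sqrt(var1 / n1 + var2 / n2)
    diff = x_bar - y_bar
    return diff - z * se, diff + z * se


# Made-up summary statistics, purely for illustration
low, high = two_sample_z_interval(
    x_bar=10.0, y_bar=8.0, var1=4.0, var2=9.0, n1=100, n2=100
)
print(low, high)
```

With these inputs the interval is centered at $\bar{x} - \bar{y} = 2$ and comes out to roughly $(1.29, 2.71)$.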
$\sigma_1^2$ and $\sigma_2^2$ are unknown, but known to be equal ($\sigma_1^2 = \sigma_2^2$)
Above, we used the CLT with the statistic $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$. This time, however, we do not know the exact value of $\sigma^2$, so we use the sample variance $s^2$ in place of $\sigma^2$.
[Previous]
\[\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\color{red}{\sigma_1^2}}{n_1} + \dfrac{\color{red}{\sigma_2^2}}{n_2}}} \;\overset{D}{\sim} \; N(0, 1)\]
To replace $\sigma^2$, we will use the <pooled sample variance>, which combines the sample variances of the two samples.
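Recall. sample variance and chi-square
For a random sample of size $n$ from a normal population with variance $\sigma^2$,
\[\frac{(n - 1) \cdot S^2}{\sigma^2} \; \overset{D}{\sim} \; \chi^2(n - 1)\]
Definition. pooled sample variance
\[S_p^2 = \frac{(n_1 - 1) \cdot S_1^2 + (n_2 - 1) \cdot S_2^2}{n_1 + n_2 - 2}\]
Rearranging this definition slightly gives the following, where $\sigma^2$ denotes the common variance $\sigma_1^2 = \sigma_2^2$.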
\[(n_1 - 1 + n_2 - 1) \cdot S_p^2 = (n_1 - 1) \cdot S_1^2 + (n_2 - 1) \cdot S_2^2\]
\[\frac{(n_1 - 1 + n_2 - 1) \cdot S_p^2}{\sigma^2} = \frac{(n_1 - 1) \cdot S_1^2 + (n_2 - 1) \cdot S_2^2}{\sigma^2} \; \overset{D}{\sim} \; \chi^2(n_1 + n_2 - 2)\]
Therefore, rewriting the statistic in terms of the pooled sample variance $S_p^2$ gives
\[\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \; \overset{D}{\sim} \; t(n_1 + n_2 - 2)\]
From here, interval estimation proceeds with the t-distribution, just as we have done so far.
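To spell out the step behind the pooled statistic, here is a short sketch under the assumptions used in this case (normal populations with a common variance $\sigma^2$). The symbols $Z$, $V$, $T$ and the quantile notation $t_{\alpha/2}(\nu)$ are introduced here only for illustration. Write
\[Z = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \; \overset{D}{\sim} \; N(0, 1), \qquad V = \frac{(n_1 + n_2 - 2) \cdot S_p^2}{\sigma^2} \; \overset{D}{\sim} \; \chi^2(n_1 + n_2 - 2)\]
Since $Z \perp V$, the unknown $\sigma$ cancels in the ratio:
\[T = \frac{Z}{\sqrt{V / (n_1 + n_2 - 2)}} = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \; \overset{D}{\sim} \; t(n_1 + n_2 - 2)\]
Solving $-t_{\alpha/2}(n_1 + n_2 - 2) < T < t_{\alpha/2}(n_1 + n_2 - 2)$ for $\mu_1 - \mu_2$ and plugging in the observed values gives the $100(1-\alpha)\%$ confidence interval
\[\left( \bar{x} - \bar{y} - t_{\alpha/2}(n_1 + n_2 - 2) \cdot s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} , \; \bar{x} - \bar{y} + t_{\alpha/2}(n_1 + n_2 - 2) \cdot s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \right)\]
where $s_p$ is the observed value of $S_p$.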
$\sigma_1^2$ and $\sigma_2^2$ are unknown and not equal
First, let us write down the statistic in terms of the population parameters.
\[\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \; \sim \; N(0, 1)\]
Since we do not know $\sigma_1^2$ and $\sigma_2^2$, we replace them with the sample variances.
\[\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \; \overset{D}{\sim} \; ??\]
The right-hand side is written as ?? because the exact distribution is not known in this case. Instead, we approximate it by a t-distribution with some degrees of freedom (DOF) $d$ and carry out the estimation with that.
\[\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \; \overset{D}{\approx} \; t(d)\]
where
\[d = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2 / (n_1-1) + (s_2^2/n_2)^2 / (n_2-1)}\]
This approximation is the one used in <Welch's t-test>, and the formula used to obtain the DOF is called the <Welch-Satterthwaite equation>.
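The degrees of freedom $d$ rarely come out as an integer, so it helps to see the formula evaluated. The sketch below is an illustration, not code from the original notes: the function names and the input numbers are hypothetical, and the t critical value is passed in by the caller (for example from a t table, or from `scipy.stats.t.ppf` if SciPy is available) rather than computed here.

```python
from math import sqrt


def welch_satterthwaite_df(s1_sq, n1, s2_sq, n2):
    """Approximate degrees of freedom d from the Welch-Satterthwaite equation.

    Hypothetical helper; s1_sq and s2_sq are the observed sample variances.
    """
    a = s1_sq / n1  # s_1^2 / n_1
    b = s2_sq / n2  # s_2^2 / n_2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


def welch_interval(x_bar, y_bar, s1_sq, n1, s2_sq, n2, t_crit):
    """CI for mu_1 - mu_2 given t_crit = t_{alpha/2}(d), supplied by the caller."""
    # Standard error with each population's own sample variance (no pooling)
    se = sqrt(s1_sq / n1 + s2_sq / n2)
    diff = x_bar - y_bar
    return diff - t_crit * se, diff + t_crit * se


# Made-up inputs: equal variances and equal sample sizes
d_equal = welch_satterthwaite_df(s1_sq=4.0, n1=10, s2_sq=4.0, n2=10)
# Made-up inputs: very different variances
d_unequal = welch_satterthwaite_df(s1_sq=1.0, n1=5, s2_sq=16.0, n2=5)
print(d_equal, d_unequal)
```

When the sample variances and sample sizes are equal, $d$ reduces to $n_1 + n_2 - 2$, the DOF of the pooled case (here 18, up to floating-point rounding). As the variances diverge, $d$ shrinks toward $\min(n_1 - 1, n_2 - 1)$, which widens the interval.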
Closing remarks
The next post looks at another case of Two Samples Estimation: <Paired Observation>.
👉 Two Samples Estimation: Paired Observations
So far, our interest has been in estimating the population mean $\mu$: we estimated either a single population mean $\mu$ or the difference between two population means. The post after that estimates the population variance $\sigma^2$: either $\sigma^2$ itself or the ratio $\sigma_1^2 / \sigma_2^2$ of the two populations' variances.
👉 Variance Estimation