This post summarizes what I learned and studied in the "Probability and Statistics (MATH230)" course. The full list of posts is available under Probability and Statistics 🎲



This post shows how the <Interval Estimation> covered in the Interval Estimation post applies to a specific situation. Every estimation so far was carried out on samples drawn from a <Normal Distribution>. This time we estimate from samples of a <Bernoulli Distribution>. In other words, the target of the estimation is the probability $p$, the parameter of the <Bernoulli Distribution>!


Single Sample Estimation: Proportion Estimation

Suppose we have a $p$-coin, that is, a coin that lands heads with probability $p$. We want to verify that the coin really is a $p$-coin.

✨ Goal: Estimate $p$ based on the sample points!

Q1. Which point estimator can we use for $p$?

A1. Here, we can use $\hat{p} = \dfrac{\text{# of heads}}{\text{# of tosses}} = \dfrac{X_1 + \cdots + X_n}{n} = \bar{X}$.

Q2. What about a 95% confidence interval for $p$?

A2. We can use the CLT! Since $X_i \sim \text{Ber}(p)$, we have $\mu = p$ and $\sigma^2 = p(1-p)$, so

\[\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{\hat{p} - p}{\sqrt{p(1-p)} / \sqrt{n}} \approx N(0, 1)\]

Then the $100(1-\alpha)\%$ confidence interval (with $\alpha = 0.05$ for the 95% level) is

\[\hat{p} - z_{\alpha/2} \cdot \sqrt{\frac{p(1-p)}{n}} \; \le \; p \; \le \; \hat{p} + z_{\alpha/2} \cdot \sqrt{\frac{p(1-p)}{n}}\]

💥 However, the expression above has a problem: we built the interval in order to estimate $p$, yet $p$ itself appears again on both the left and right ends of the interval! 😲

[Solution 1] Solve the inequality for $p$.

[Solution 2] Replace $p$ by $\hat{p}$. If $n$ is large, $\hat{p} \rightarrow p$ by the LLN.

Therefore, when $n$ is large,

\[\hat{p} - z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \; \le \; p \; \le \; \hat{p} + z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
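As a rough illustration of the two solutions, here is a minimal Python sketch using only the standard library. The function names (`z_critical`, `wald_interval`, `wilson_interval`) are my own and not from the course. The plug-in interval of [Solution 2] is commonly called the Wald interval, and solving the inequality for $p$ as in [Solution 1] leads to what is usually called the Wilson score interval; the closed form below is the standard one, not something derived in this post.

```python
import math
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Upper alpha/2 quantile of N(0, 1), e.g. about 1.96 for 95%."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def wald_interval(heads: int, n: int, confidence: float = 0.95):
    """[Solution 2]: replace p by p_hat inside the standard error."""
    p_hat = heads / n
    half = z_critical(confidence) * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(heads: int, n: int, confidence: float = 0.95):
    """[Solution 1]: solve |p_hat - p| <= z * sqrt(p(1-p)/n) for p."""
    z = z_critical(confidence)
    p_hat = heads / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    )
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical data: 60 heads in 100 tosses.
    print("Wald  :", wald_interval(60, 100))
    print("Wilson:", wilson_interval(60, 100))
```

For moderately large $n$ the two intervals are close, which is consistent with the claim that the plug-in version is acceptable when $n$ is large.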

์ด๋ฒˆ์—๋Š” <Proportion Estimation>์˜ Error์— ๋Œ€ํ•ด ์‚ดํŽด๋ณด์ž. Error๋Š” ์ด์ „์˜ Estimation๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์•„๋ž˜์™€ ๊ฐ™์ด ์ œ์‹œ๋œ๋‹ค.

The error $\left| \hat{p} - p \right|$ will not exceed $z_{\alpha/2} \cdot \sqrt{\dfrac{\hat{p}\hat{q}}{n}}$

Q. How large should the sample size be so that the error is at most $\epsilon$?

์ด๊ฒƒ์€ Error์— ๋Œ€ํ•œ ์‹์„ $n$์œผ๋กœ ๋‹ค์‹œ ํ’€์–ด์„œ ์‰ฝ๊ฒŒ ์œ ๋„ํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

\[\begin{aligned} z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}\hat{q}}{n}} &\le \epsilon \\ z_{\alpha/2}^2 \cdot \frac{\hat{p}\hat{q}}{n} &\le \epsilon^2 \\ \frac{(z_{\alpha/2})^2 \cdot \hat{p}\hat{q}}{\epsilon^2} &\le n \end{aligned}\]

However, this expression also has a problem: the sample proportion $\hat{p}$ is not known until we have fixed $n$ and actually drawn the sample!

[Solution 1] Guess $\hat{p}$, or draw a small pilot sample to get a preliminary $\hat{p}$, and then plug that value in.

[Solution 2] Consider the worst case, where the error bound is largest. Since $\hat{p}(1-\hat{p}) \le \frac{1}{4}$, with equality at $\hat{p} = \frac{1}{2}$,

\[z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le z_{\alpha/2} \cdot \sqrt{\frac{1}{4}\frac{1}{n}} \le \epsilon\]

\[n \ge \left( \frac{z_{\alpha/2}}{2\epsilon} \right)^2\]
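The two sample-size rules can be sketched in a few lines of Python. The helper name `sample_size` and its parameters are assumptions made for this illustration. Passing `p_guess` corresponds to [Solution 1], and leaving it out falls back to the worst case $\hat{p}\hat{q} = 1/4$ of [Solution 2].

```python
import math
from statistics import NormalDist


def sample_size(epsilon: float, confidence: float = 0.95, p_guess=None) -> int:
    """Smallest n with z_{alpha/2} * sqrt(p*q/n) <= epsilon.

    p_guess: a guessed or pilot-sample value of p_hat ([Solution 1]).
             If None, use the worst case p*q = 1/4 ([Solution 2]).
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    pq = 0.25 if p_guess is None else p_guess * (1.0 - p_guess)
    return math.ceil(z * z * pq / (epsilon * epsilon))


if __name__ == "__main__":
    # Hypothetical target: error at most 0.03 with 95% confidence.
    print(sample_size(0.03))               # worst case
    print(sample_size(0.03, p_guess=0.2))  # with a guessed p_hat
```

The worst-case rule never needs a pilot sample, but it can ask for noticeably more observations than necessary when $p$ is far from $1/2$.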

Two Samples Estimation: Diff Btw Two Proportions

We met a similar situation in the Two Samples Estimation: Diff Btw Two Means post. There, the samples came from a Normal Distribution, and because we had to use the sample variance $s^2$, we ended up using either the pooled sample variance $S_p^2$ or <Welch's t-test>. Let's go through the proportion case while comparing how <Proportion Estimation> differs from that setting!

Independently select $n_1$ math majors and $n_2$ physics majors.

Assume $X_1, \dots, X_{n_1}$ and $Y_1, \dots, Y_{n_2}$ are two independent random samples, with $X_i \sim \text{Ber}(p_1)$ and $Y_j \sim \text{Ber}(p_2)$.

From the samples, we get $\hat{p}_1 = \bar{x}$ and $\hat{p}_2 = \bar{y}$. Then $\hat{p}_1 - \hat{p}_2$ can serve as an estimator for $p_1 - p_2$.

By the CLT, writing $q_i = 1 - p_i$,

\[\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{p_1q_1/n_1 + p_2q_2/n_2}} \; \approx \; N(0, 1)\]

Replacing the population proportions $p_1$, $p_2$ in the denominator with the sample proportions $\hat{p}_1$, $\hat{p}_2$ gives

\[\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}_1\hat{q}_1/n_1 + \hat{p}_2\hat{q}_2/n_2}} \; \approx \; N(0, 1)\]

Then, the $100(1-\alpha)\%$ confidence interval for $p_1 - p_2$ is

\[\left( (\hat{p}_1 - \hat{p}_2) - z_{\alpha/2} \cdot \sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}, \; (\hat{p}_1 - \hat{p}_2) + z_{\alpha/2} \cdot \sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}} \right)\]
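The interval above translates almost directly into code. The following is a small sketch under the same large-sample assumption; the function name `two_proportion_interval` and the counts in the usage example are made up for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_interval(x1: int, n1: int, x2: int, n2: int,
                            confidence: float = 0.95):
    """Large-sample CI for p1 - p2 using plug-in sample proportions.

    x1, x2: number of successes in each sample; n1, n2: sample sizes.
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Standard error with p_i replaced by p_i_hat.
    se = math.sqrt(p1_hat * (1.0 - p1_hat) / n1
                   + p2_hat * (1.0 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical counts: 60 of 100 math majors, 45 of 100 physics majors.
    print(two_proportion_interval(60, 100, 45, 100))
```

If the resulting interval excludes 0, the data are at least suggestive of a difference between the two proportions at the chosen confidence level.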

Proportion Estimation and t-distribution

Recall from <Mean Estimation> that, because the population variance $\sigma^2$ is unknown, we used the sample variance $s^2$ and worked with the t-distribution $t(n-1)$ instead of the normal distribution $N(0, 1)$.

\[\frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t(n-1)\]

In <Proportion Estimation> as well, the population proportion $p$ is unknown, so we used the sample proportion $\hat{p}$ in its place.

\[\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{\hat{p} - p}{\sqrt{pq} / \sqrt{n}} \approx \frac{\hat{p} - p}{\sqrt{\hat{p}\hat{q}} / \sqrt{n}} \approx N(0, 1)\]

This time, however, we kept the normal approximation as it was rather than switching to a t-distribution. Why is there no t-distribution in this case?


In <Mean Estimation>, we assume that each individual sample $X_i$ is an RV following a normal distribution.

\[X_i \sim N(\mu, \sigma^2)\]

In <Proportion Estimation>, however, each individual outcome $X_i$ is a categorical variable, like the heads or tails of a coin, and it follows a Bernoulli distribution.

\[X_i \sim \text{Ber}(p)\]

In other words, my understanding is that because the sample variable $X_i$ is not a sample from a normal distribution, we do not switch to the <t-distribution> even though we plug in the sample proportion $\hat{p}$. The premise needed to bring in the t-distribution simply does not hold!
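One way to sanity-check that the normal approximation is adequate for Bernoulli samples is a small simulation. The sketch below estimates how often the plug-in interval actually covers the true $p$. The function name `wald_coverage`, the seed, and the values $p = 0.3$, $n = 200$ are arbitrary choices for illustration, and the result is only an empirical check, not a proof.

```python
import math
import random
from statistics import NormalDist


def wald_coverage(p: float, n: int, trials: int = 2000,
                  confidence: float = 0.95, seed: int = 0) -> float:
    """Fraction of simulated plug-in intervals that contain the true p."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    hits = 0
    for _ in range(trials):
        # Toss a p-coin n times and count heads.
        heads = sum(rng.random() < p for _ in range(n))
        p_hat = heads / n
        half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
        if p_hat - half <= p <= p_hat + half:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(wald_coverage(p=0.3, n=200))
```

With a reasonably large $n$, the empirical coverage should land near the nominal level; for small $n$ or $p$ close to 0 or 1 the approximation is known to be less reliable.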


๋งบ์Œ๋ง

So far, we have looked at estimation for

  • population mean $\mu$
  • population proportion $p$

In the next post, we will look at how to estimate the population variance $\sigma^2$ from the sample variance $S^2$.

👉 Variance Estimation
