Sampling Distribution of Mean, and CLT
This post summarizes what I learned and studied in the "Probability and Statistics (MATH230)" course. The full list of posts is available under Probability and Statistics 🎲
Series: Sampling Distributions
Sampling Distribution of Mean
Let $X_1, \dots, X_n$ be a random sample with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2$.
Then,
- $E[\overline{X}] = \mu$
- $\text{Var}(\overline{X}) = E\left[\left(\overline{X} - E[\overline{X}]\right)^2 \right] = \dfrac{\sigma^2}{n}$
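The two identities above can be checked numerically. The sketch below is only an illustration, assuming a Uniform(0, 1) population (so $\mu = 1/2$ and $\sigma^2 = 1/12$); the function name `sample_mean_moments`, the trial count, and the seed are hypothetical choices, not part of the course material.

```python
import random
import statistics

def sample_mean_moments(n, trials=20000, seed=0):
    """Empirical mean and variance of the sample mean of n Uniform(0, 1) draws."""
    rng = random.Random(seed)
    # Each entry is one realization of the sample mean X-bar.
    means = [statistics.fmean(rng.random() for _ in range(n)) for _ in range(trials)]
    return statistics.fmean(means), statistics.pvariance(means)

mu, sigma2 = 0.5, 1 / 12  # mean and variance of the assumed Uniform(0, 1) population
for n in (1, 4, 16, 64):
    m, v = sample_mean_moments(n)
    print(f"n={n:3d}  mean={m:.4f} (theory {mu})  var={v:.5f} (theory {sigma2 / n:.5f})")
```

The empirical mean should stay near $\mu$ for every $n$, while the empirical variance should shrink roughly like $\sigma^2/n$.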
As $n$ goes to infinity, the variance $\text{Var}(\overline{X}) = \sigma^2/n$ converges to 0, so $\overline{X}$ concentrates around $\mu$, i.e. $\overline{X} \rightarrow \mu$. This is the idea behind the <LLN; Law of Large Numbers>!
Weak Law of Large Numbers
Theorem. WLLN
Let $X_1, \dots, X_n$ be a random sample with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2$.
Let $\overline{X}$ be the sample mean.
For any $\epsilon > 0$, we have
\[\lim_{n\rightarrow\infty} P\left(\left| \overline{X} - \mu \right| > \epsilon\right) = 0\]
Proof.
This is very easy to prove using <Chebyshev's Inequality>!
\[\begin{aligned} P\left(\left| \overline{X} - \mu \right| > \epsilon\right) &\le \frac{\text{Var}(\overline{X})}{\epsilon^2} \\ &= \frac{1}{\epsilon^2} \cdot \frac{\sigma^2}{n} \rightarrow 0 \quad \text{as} \quad n \rightarrow \infty \end{aligned}\]
$\blacksquare$
"WLLN says that as the sample size $n$ gets larger, the sample mean gets close to the true mean in probability!"
The type of convergence that appears in the WLLN is called "convergence in probability".
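To make the Chebyshev argument concrete, the sketch below estimates $P(|\overline{X} - \mu| > \epsilon)$ by simulation and compares it with the bound $\sigma^2/(n\epsilon^2)$. It assumes fair coin flips ($\mu = 0.5$, $\sigma^2 = 0.25$); the helper names, $\epsilon = 0.05$, and the seed are illustrative assumptions.

```python
import random

def tail_probability(n, eps, trials=2000, seed=1):
    """Estimate P(|Xbar - mu| > eps) for n fair coin flips (mu = 0.5)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.random() < 0.5 for _ in range(n)) / n
        hits += abs(xbar - 0.5) > eps
    return hits / trials

def chebyshev_bound(n, eps, sigma2=0.25):
    """Chebyshev upper bound sigma^2 / (n * eps^2), capped at 1."""
    return min(1.0, sigma2 / (n * eps ** 2))

eps = 0.05
for n in (10, 100, 1000):
    print(n, tail_probability(n, eps), chebyshev_bound(n, eps))
```

The bound is usually far from tight, but it is enough for the proof: it goes to 0 as $n$ grows, and the estimated tail probability sits below it.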
cf) For reference, there is also a <Strong Law of Large Numbers>. Proving it requires the notion of a measure, so I only state it here and move on.
\[P\left(\lim_{n\rightarrow\infty} \overline{X} = \mu \right) = 1\]
CLT; Central Limit Theorem
Example.
Q. What is $P\left( \overline{X} > 7\right)$?
Note that if $X_1, \dots, X_n \sim \text{Ber}(p)$ are iid, then $(X_1 + \cdots + X_n) \sim \text{BIN}(n, p)$.
When $n$ is large, $(X_1 + \cdots + X_n)$ is approximately $N(np, npq)$, where $q = 1 - p$.
Let's standardize it:
\[P\left( \frac{(X_1 + \cdots + X_n) - np}{\sqrt{npq}} \le z \right) \approx P(Z \le z)\]
Dividing both the numerator and the denominator on the left-hand side by $n$ gives
\[\begin{aligned} P\left( \frac{((X_1 + \cdots + X_n) - np)/n}{(\sqrt{npq})/n} \le z \right) &= P\left( \frac{(X_1 + \cdots + X_n)/n \; - \; p}{\sqrt{pq/n}} \le z \right) \\ &= P\left( \frac{\overline{X} - E[\overline{X}]}{\sqrt{\text{Var}(\overline{X})}} \le z \right) \\ &\approx P(Z \le z) \end{aligned}\]
In conclusion, the probability $P\left(\overline{X} > 7\right)$ from the original problem can be computed by standardizing $\overline{X}$ and approximating with the normal distribution.
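As a rough check of the normal approximation, the sketch below compares the exact binomial CDF with $P(Z \le z)$ after standardization. The values $n = 100$, $p = 0.5$ and the function names are assumptions made for illustration, and no continuity correction is applied, which matches the formula above.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X_1 + ... + X_n <= k) for iid Ber(p) variables."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Normal approximation: standardize with mean np and sd sqrt(npq)."""
    mu, sd = n * p, sqrt(n * p * (1 - p))
    return NormalDist().cdf((k - mu) / sd)

n, p = 100, 0.5
for k in (40, 50, 60):
    print(k, binom_cdf(k, n, p), normal_approx_cdf(k, n, p))
```

The two columns agree to within a few percentage points here; a continuity correction (using $k + 0.5$) would typically tighten the match.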
In this case, however, the result followed naturally from the <Normal Approximation>, because the sum $X_1 + \cdots + X_n$ has a BIN distribution. Can we do the same when $X_i$ (and hence $\overline{X}$) follows some other distribution? The answer to this question is exactly the <CLT; Central Limit Theorem> 🤩
Theorem. CLT; Central Limit Theorem
Let $X_1, \dots, X_n$ be a random sample with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2$.
Let $\overline{X}_n := (X_1 + \cdots + X_n)/n$ be the sample mean.
Let $Z_n := \dfrac{\overline{X}_n - E[\overline{X}_n]}{\sqrt{\text{Var}(\overline{X}_n)}} = \dfrac{\overline{X}_n - \mu}{\sigma/\sqrt{n}}$.
Then, for any $z \in \mathbb{R}$, we have
\[P(Z_n \le z) \rightarrow P(Z \le z) \quad \text{as} \quad n \rightarrow \infty\]
where $Z \sim N(0, 1)$.
That is, if the size $n$ of the sample drawn from the population is large enough, the distribution of the "sample mean" $\bar{X}$ is approximately normal!
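The statement can be illustrated with a population that is far from normal. The sketch below is a simulation under the assumption that $X_i \sim \text{Exp}(1)$ (so $\mu = \sigma = 1$); the function name, the evaluation point $z = 1$, the trial count, and the seed are hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def standardized_mean_cdf(n, z, trials=20000, seed=2):
    """Estimate P(Z_n <= z), where Z_n standardizes the mean of n Exp(1) draws."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exp(1)
    count = 0
    for _ in range(trials):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        count += (xbar - mu) / (sigma / sqrt(n)) <= z
    return count / trials

z = 1.0
print("Phi(z) =", NormalDist().cdf(z))
for n in (1, 5, 30, 100):
    print(n, standardized_mean_cdf(n, z))
```

As $n$ grows, the estimated $P(Z_n \le z)$ should move toward $P(Z \le z)$, up to simulation noise.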
Remark.
1. As long as the iid RVs have a finite second moment[^1], the CLT holds.
What this means is very powerful: the CLT can be applied no matter which distribution the $X_i$ follow! This is why the CLT is called a "universal result"!
2. We call $z := \dfrac{\overline{x} - \mu}{\sigma / \sqrt{n}}$ the <$z$-value> or <$z$-score; standard score>. We also define $z_\alpha$ as the number $z$ s.t. $P(Z \ge z) = \alpha$ when $Z \sim N(0, 1)$.
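Both quantities in the second remark are easy to compute with the Python standard library. This is a minimal sketch: `z_alpha` and `z_score` are hypothetical helper names, and the numbers passed to `z_score` are made up for illustration.

```python
from statistics import NormalDist

def z_alpha(alpha):
    """Return z such that P(Z >= z) = alpha for Z ~ N(0, 1)."""
    return NormalDist().inv_cdf(1 - alpha)

def z_score(xbar, mu, sigma, n):
    """Standardize an observed sample mean xbar."""
    return (xbar - mu) / (sigma / n ** 0.5)

print(z_alpha(0.05), z_alpha(0.025))
print(z_score(xbar=52.0, mu=50.0, sigma=10.0, n=25))
```

Since $P(Z \ge z) = \alpha$ is the same as $P(Z \le z) = 1 - \alpha$, $z_\alpha$ is just the $(1 - \alpha)$ quantile of $N(0, 1)$.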
Proof of CLT
Proof. CLT
To prove the <CLT>, we use the theorem below.
Theorem. Uniqueness of MGF
If the mgf of $X_n$ converges to the mgf of $X$ for $t \in (-\delta, \delta)$ for some $\delta > 0$,
i.e.
\[M_{X_n} (t) \rightarrow M_{X} (t) \quad \text{for} \quad t \in (-\delta, \delta)\]
and $X$ is continuous, then the CDF $F_{X_n}(x)$ converges to $F_{X}(x)$ for all $x \in \mathbb{R}$.
\[F_{X_n}(x) \rightarrow F_{X}(x)\]
Goal: Show that the MGF of $Z_n = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ converges to the MGF of $N(0, 1)$ as $n \rightarrow \infty$.
Let $W = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$, and multiply both the numerator and the denominator by $n$.
\[W = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{\displaystyle \sum_{i=1}^n X_i - n \mu}{\sqrt{n} \cdot \sigma}\]
The mgf of $W$ is
\[\begin{aligned} M_W (t) &= E [e^{tW}] = E\left[\exp \left( \frac{t}{\sqrt{n} \sigma} \left( \sum^n_{i=1} X_i - n\mu \right) \right) \right] \\ &= E \left[ \exp \left( \frac{t}{\sqrt{n}} \cdot \frac{X_1 - \mu}{\sigma}\right) \cdot \exp \left( \frac{t}{\sqrt{n}} \cdot \frac{X_2 - \mu}{\sigma}\right) \cdots \exp \left( \frac{t}{\sqrt{n}} \cdot \frac{X_n - \mu}{\sigma}\right) \right] \\ &= E \left[ \exp \left( \frac{t}{\sqrt{n}} \cdot \frac{X_1 - \mu}{\sigma}\right) \right] \cdots E \left[ \exp \left( \frac{t}{\sqrt{n}} \cdot \frac{X_n - \mu}{\sigma}\right) \right] \quad (\text{independence}) \\ &= M_{Z_1} (t / \sqrt{n}) \cdots M_{Z_n} (t / \sqrt{n}) \\ &= \left[ M_{Z} (t / \sqrt{n}) \right]^n \quad (\text{identically distributed}) \end{aligned}\]
where $Z_i := \dfrac{X_i - \mu}{\sigma}$ has mean 0 and variance 1, and $Z$ denotes a generic copy of the $Z_i$.
Now take the $\log$ of $M_W(t)$ and let $n\rightarrow \infty$:
\[\begin{aligned} \lim_{n\rightarrow \infty} \log M_W(t) &= \lim_{n\rightarrow \infty} \log \left[ M_{Z} (t / \sqrt{n}) \right]^n \\ &= \lim_{n\rightarrow \infty} n \cdot \log M_Z (t / \sqrt{n}) \end{aligned}\]
Substituting $y = 1 / \sqrt{n}$ (so that $n = 1/y^2$ and $y \rightarrow 0$ as $n \rightarrow \infty$), the limit above becomes
\[\lim_{n\rightarrow \infty} n \cdot \log M_Z (t / \sqrt{n}) = \lim_{y \rightarrow 0} \frac{\log M_Z (yt)}{y^2}\]
Here $\displaystyle \lim_{y\rightarrow 0} M_Z(yt) = M_Z(0) = 1$, so $\log M_Z(yt) \rightarrow 0$ and the limit is an indeterminate form $\dfrac{0}{0}$. To resolve this, we use L'Hôpital's rule.
\[\begin{aligned} \lim_{y \rightarrow 0} \frac{\log M_Z (yt)}{y^2} &= \lim_{y \rightarrow 0} \frac{t \cdot \dfrac{M_Z'(yt)}{M_Z(yt)}}{2y} \\ &= \frac{t}{2} \cdot \lim_{y \rightarrow 0} \frac{M_Z'(yt)}{y \cdot M_Z (yt)} \\ &= \frac{t}{2} \cdot \lim_{y \rightarrow 0} \frac{M_Z'(yt)}{y} \quad \left(\because \; \lim_{y\rightarrow 0} M_Z(yt) = 1\right) \end{aligned}\]
Again, $\displaystyle \lim_{y\rightarrow 0} M_Z'(yt) = M_Z'(0) = E[Z] = 0$, so this is still of the form $\dfrac{0}{0}$. Applying L'Hôpital's rule once more,
\[\begin{aligned} \frac{t}{2} \cdot \lim_{y \rightarrow 0} \frac{M_Z'(yt)}{y} &= \frac{t}{2} \cdot \lim_{y \rightarrow 0} \frac{t M_Z''(yt)}{1} \\ &= \frac{t^2}{2} \cdot \lim_{y \rightarrow 0} M_Z''(yt) \\ &= \frac{t^2}{2} \quad \left(\because \; \lim_{y \rightarrow 0} M_Z''(yt) = M_Z''(0) = E[Z^2] = 1\right) \end{aligned}\]
Therefore,
\[\lim_{n\rightarrow \infty} \log M_W(t) = \frac{t^2}{2}\]
Removing the $\log$ gives the limit of the mgf $M_W(t)$ of $W$, the standardization of $\bar{X}$:
\[\lim_{n\rightarrow \infty} M_W(t) = e^{t^2/2}\]
The mgf of the standard normal distribution $N(0, 1)$ is $e^{t^2/2}$, so the mgf of $W$ converges to the mgf of $N(0, 1)$. By "Uniqueness of mgf", the following statement holds.
"When $n$ is sufficiently large, $\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$, the standardization of $\bar{X}$, approximately follows the standard normal distribution $N(0, 1)$!"
$\blacksquare$
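The key limit in the proof, $\left[ M_Z(t/\sqrt{n}) \right]^n \rightarrow e^{t^2/2}$, can be checked numerically for one concrete distribution. The sketch below assumes $Z = \pm 1$ with probability $1/2$ each (mean 0, variance 1), whose mgf is $M_Z(s) = \cosh(s)$; this choice and the function name are illustrative only.

```python
from math import cosh, exp, sqrt

def mgf_standardized_sum(t, n):
    """[M_Z(t / sqrt(n))]^n for Z = +1 or -1 with probability 1/2 each, M_Z(s) = cosh(s)."""
    return cosh(t / sqrt(n)) ** n

t = 1.0
print("limit e^{t^2/2}:", exp(t ** 2 / 2))
for n in (1, 10, 100, 10000):
    print(n, mgf_standardized_sum(t, n))
```

The printed values increase toward $e^{t^2/2}$ as $n$ grows, which is the convergence of mgfs that the proof relies on.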
Sampling Distribution of the Difference Between Two Means
This time, consider two independent samples drawn from two different populations!
Let $X_1, \dots, X_{n_1}$, and $Y_1, \dots, Y_{n_2}$ be two independent random samples with $E[X_1] = \mu_1$, $\text{Var}(X_1) = \sigma_1^2$, and $E[Y_1] = \mu_2$, $\text{Var}(Y_1) = \sigma_2^2$.
We want to make inferences about the difference of the two means, $\mu_1 - \mu_2$. The difference of the two sample means, $\overline{X} - \overline{Y}$, lets us do exactly that, once we know its distribution!
By CLT,
\[\begin{aligned} \frac{\overline{X} - \mu_1}{\sigma_1/\sqrt{n_1}} \sim N(0, 1) \quad & \iff \quad \overline{X} \sim N\left(\mu_1, \sigma_1^2/n_1\right) \\ \frac{\overline{Y} - \mu_2}{\sigma_2/\sqrt{n_2}} \sim N(0, 1) \quad & \iff \quad \overline{Y} \sim N\left(\mu_2, \sigma_2^2/n_2\right) \end{aligned}\]
Therefore (approximately, for large $n_1$ and $n_2$), the distribution of $\overline{X} - \overline{Y}$ is easily derived from the rule for adding two independent normal distributions!
\[\overline{X} - \overline{Y} \sim N\left( \mu_1 - \mu_2, \; \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \right)\]
Using this fact, inference about the difference of the two means can also be carried out easily.
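As a small worked example of the formula for $\overline{X} - \overline{Y}$, the sketch below builds the approximating normal distribution and evaluates a tail probability. All numbers ($\mu_1 = 10$, $\sigma_1 = 3$, $n_1 = 36$, $\mu_2 = 8$, $\sigma_2 = 4$, $n_2 = 64$, threshold 3) and the function name are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def diff_of_means_dist(mu1, sigma1, n1, mu2, sigma2, n2):
    """Approximate distribution of Xbar - Ybar for two independent samples."""
    mean = mu1 - mu2
    sd = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)  # the variances add
    return NormalDist(mean, sd)

dist = diff_of_means_dist(mu1=10.0, sigma1=3.0, n1=36, mu2=8.0, sigma2=4.0, n2=64)
print(dist.mean, dist.stdev)
print(1 - dist.cdf(3.0))  # approximate P(Xbar - Ybar > 3)
```

Note that the variances add even though the means are subtracted, because $\text{Var}(-\overline{Y}) = \text{Var}(\overline{Y})$.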
Closing
In this post, we looked at the "Sampling Distribution of Mean", the distribution of the sample mean $\bar{X}$. We also covered the <WLLN> and the <CLT>, which are needed to understand and use the distribution of $\bar{X}$.
In the next post, we will see how the "Variance", the other parameter that, together with the mean, characterizes a probability distribution, behaves in a random sample.
Next: Sampling Distribution of Variance