This post summarizes what I learned and studied in the “Probability and Statistics (MATH230)” course. The full list of posts is available at Probability and Statistics 🎲


Introduction

Suppose we want to find the proportion of students who like the Probability and Statistics course, among all students taking it. There are too many students to survey everyone, so instead we survey $n$ of them.

If $X$ is the RV “the number of students, among the $n$ surveyed, who answered that they like the course”, then $X$ follows a HyperGeo distribution.

Moreover, if the total number of students is large enough, HyperGeo can be approximated by BIN.
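
The approximation can be checked numerically. The following is a minimal sketch using only the Python standard library; the population size `N`, the number of students who like the course `K`, and the sample size `n` are made-up values chosen for illustration, not figures from the course.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k) when drawing n items without replacement from N, K of which are 'successes'."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical numbers, chosen only for illustration.
N, K, n = 10_000, 6_000, 100
p = K / N

# Largest pointwise gap between the two pmfs over k = 0..n.
max_gap = max(abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, p)) for k in range(n + 1))
print(f"max |HyperGeo - BIN| = {max_gap:.6f}")
```

With a population this much larger than the sample, the two pmfs should be nearly indistinguishable; shrinking `N` toward `n` makes the gap grow.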

Now let us represent the preference of each student $i$ by an RV $X_i$:

\[X_i = \begin{cases} 1 & i\text{-th student likes it!} \\ 0 & \text{else} \end{cases}\]

Combining all of the RVs $X_1, \dots, X_n$, we can derive a new RV $\overline{X}$.

\[\overline{X} := \frac{X_1 + \cdots + X_n}{n}\]

We call this $\overline{X}$ the <sample mean>!


Let us make the example above more concrete.

Suppose $n=100$ and 60 students said they like the lecture. Then $\overline{x} = \frac{60}{100} = 0.6$.

What we want to discuss about the <sample mean> $\overline{x}$ is how to compute a probability such as

\[P(\left| \overline{x} - 0.6 \right| < \epsilon)\]

The reason is that a probability of the form

\[P(\left| \overline{x} - \mu_0 \right| < \epsilon)\]

tells us how far the proposed $\mu_0$ is from the sample mean we obtained, and this can be used to test the hypothesis $\mu = \mu_0$. This is covered in more detail later, in the <Hypothesis Test> part.

To compute $P(\left| \overline{x} - \mu_0 \right| < \epsilon)$, we need to know the distribution of $\overline{x}$, which we call the <sampling distribution>! The definition of the sampling distribution is given at the very end of this article.
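
As an illustration of the kind of probability in question, the sketch below computes $P(\left| \overline{X} - \mu_0 \right| < \epsilon)$ exactly under the assumption that the number of positive answers follows BIN$(n, \mu_0)$. The binomial model and the values `mu0 = 3/5` and `eps = 1/20` are assumptions made for this example only.

```python
from fractions import Fraction
from math import comb

def prob_mean_within(n, mu0, eps):
    """P(|X/n - mu0| < eps) for X ~ BIN(n, mu0), by summing the binomial pmf."""
    p = float(mu0)
    total = 0.0
    for k in range(n + 1):
        # Exact rational comparison avoids floating-point trouble at the boundary.
        if abs(Fraction(k, n) - mu0) < eps:
            total += comb(n, k) * p**k * (1 - p)**(n - k)
    return total

# Hypothetical values: n = 100 students, mu0 = 0.6, eps = 0.05.
n, mu0, eps = 100, Fraction(3, 5), Fraction(1, 20)
prob = prob_mean_within(n, mu0, eps)
print(f"P(|X/n - {float(mu0)}| < {float(eps)}) = {prob:.4f}")
```

Calling `prob_mean_within` with a larger `n` and the same `mu0`, `eps` should give a larger probability, since the sample mean concentrates around $\mu_0$ as the sample grows.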


Definition. population

A <population> is the totality of observations.

Definition. sample

A <sample> is a subset of a population.

Definition. random sample

RVs $X_1, \dots, X_n$ are said to be a <random sample> of size $n$ if they are independent and identically distributed with a common pmf or pdf $f(x)$.

That is,

\[f_{(X_1, \dots, X_n)} (x_1, \dots, x_n) = f_{X_1} (x_1) \cdots f_{X_n} (x_n)\]

The observed values $x_1, \dots, x_n$ of $X_1, \dots, X_n$ are called <sample points> or <observations>.


Definition. Statistic

A <statistic> is a function of a random sample $X_1, \dots, X_n$ that does not depend on unknown parameters.

That is, a function of the form $f(X_1, \dots, X_n)$ is called a <statistic>. A <statistic> serves as a representative value for that collection of RVs.


Example.

Suppose $X_1, \dots, X_n$ is a random sample from $N(\mu, 1)$, where $\mu$ is unknown.

Then,

1. $\dfrac{X_1 + \cdots + X_n}{n}$ is a statistic!

2. $\max \{ X_1, \dots, X_n \}$ is a statistic!

3. $\dfrac{X_1 + \cdots + X_n + \mu}{n}$ is not a statistic, because it depends on the unknown parameter $\mu$!

We can make inferences about the population only through <statistics>.


Location Measures of a Sample

Let $X_1, \dots, X_n$ be a random sample.

Definition. sample mean

$\overline{X} = \dfrac{X_1 + \cdots + X_n}{n}$ is called a <sample mean>.

(1) $\overline{X}$ is also a random variable!

(2) If $E(X_1) = \mu$ and $\text{Var}(X_1) = \sigma^2$, then $E(\overline{X}) = \dfrac{n\mu}{n} = \mu$ and $\text{Var}(\overline{X}) = \dfrac{\sigma^2}{n}$

(3) $\overline{X}$ can be sensitive to outliers.
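
Item (2) follows from linearity of expectation and, for the variance, from the independence of the $X_i$. A short derivation using only the definitions above:

\[\begin{aligned} E(\overline{X}) &= \frac{1}{n} \sum^n_{i=1} E(X_i) = \frac{n\mu}{n} = \mu \\ \text{Var}(\overline{X}) &= \frac{1}{n^2} \text{Var}\left( \sum^n_{i=1} X_i \right) = \frac{1}{n^2} \sum^n_{i=1} \text{Var}(X_i) = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n} \end{aligned}\]

The second equality in the variance line is where independence is used: the variance of a sum of independent RVs is the sum of their variances, because all covariance terms vanish.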


Definition. sample median

Simply the median (middle value) of the sample.

Definition. sample mode

The mode (most frequent value) of the sample.
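
To see how the three location measures behave, and in particular how an outlier pulls the sample mean but not the median or mode, here is a small sketch with Python's standard `statistics` module. The data values are hypothetical.

```python
import statistics

# Hypothetical observations with one extreme value (100).
sample = [1, 2, 2, 3, 100]

mean = statistics.mean(sample)      # pulled far upward by the outlier
median = statistics.median(sample)  # middle value after sorting
mode = statistics.mode(sample)      # most frequent value

print(mean, median, mode)
```

Here the median and mode both stay at 2, while the mean lands well above every observation except the outlier.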


Variability Measures of a Sample

Definition. sample variance

Let $X_1, \dots, X_n$ be a random sample with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2$.

\[S^2 := \frac{1}{n-1} \sum^n_{i=1} \left( X_i - \overline{X}\right)^2\]

Q. Why $(n-1)$ in the denominator?

A. Because dividing by $(n-1)$ is what makes the expectation of the sample variance, $E[S^2]$, equal to $\sigma^2$!

Proof.

w.l.o.g. we can assume that $E[X_i] = 0$. (For convenience we simply center the $X_i$: replacing each $X_i$ by $X_i - \mu$ leaves $S^2$ unchanged.)

\[\begin{aligned} S^2 &= \frac{1}{n-1} \sum^n_{i=1} \left( X_i^2 - 2 X_i \overline{X} + (\overline{X})^2 \right) \\ &= \frac{1}{n-1} \left\{ \sum^n_{i=1} X_i^2 - 2 \overline{X} \sum^n_{i=1} X_i + n (\overline{X})^2 \right\} \\ \end{aligned}\]

Here, $\displaystyle\sum^n_{i=1} X_i$ equals $n\overline{X}$ by definition.

\[\begin{aligned} S^2 &= \frac{1}{n-1} \left\{ \sum^n_{i=1} X_i^2 - 2 \overline{X} \cdot n\overline{X} + n (\overline{X})^2 \right\} \\ &= \frac{1}{n-1} \left\{ \sum^n_{i=1} X_i^2 - n (\overline{X})^2 \right\} \\ \end{aligned}\]

Now take the expectation of both sides of the equation above.

\[\begin{aligned} E[S^2] &= \frac{1}{n-1} \left\{ \sum^n_{i=1} E\left[X_i^2\right] - n E\left[(\overline{X})^2\right] \right\} \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - n \cdot \frac{1}{n^2} \cdot E \left[(X_1 + \cdots + X_n)^2 \right] \right\} \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \frac{1}{n} \cdot \left( n \cdot E[X_1^2] + \sum_{i \ne j} \cancelto{0}{E[X_i X_j]} \right) \right\} \quad (\text{independence: } E[X_i X_j] = E[X_i] E[X_j] = 0) \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \frac{1}{\cancel{n}} \cdot \left( \cancel{n} \cancelto{\sigma^2}{E[X_1^2]} \right) \right\} \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \sigma^2 \right\} \\ &= \sigma^2 \end{aligned}\]

$\blacksquare$
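
The result can also be checked exactly on a small example. The sketch below is an illustration, not part of the proof: it takes a fair die as a hypothetical population, enumerates every sample of size `n = 3`, and compares the average of the sum of squared deviations divided by $n-1$ against the same quantity divided by $n$.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # one roll of a fair die (an illustrative population)
n = 3

# Population mean and variance of a single roll.
mu = Fraction(sum(faces), 6)
sigma2 = sum((Fraction(x) - mu) ** 2 for x in faces) / 6

def expected_scaled_ss(divisor):
    """E[ sum (X_i - Xbar)^2 / divisor ], averaged over all 6**n equally likely samples."""
    total = Fraction(0)
    for xs in product(faces, repeat=n):
        xbar = Fraction(sum(xs), n)
        total += sum((x - xbar) ** 2 for x in xs) / divisor
    return total / 6 ** n

unbiased = expected_scaled_ss(n - 1)  # divide by n - 1
biased = expected_scaled_ss(n)        # divide by n

print(sigma2, unbiased, biased)
```

Because the enumeration uses exact fractions, `unbiased` matches `sigma2` exactly, while `biased` comes out smaller by the factor $(n-1)/n$.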


Definition. sample standard deviation

\[S := \sqrt{S^2} = \sqrt{\frac{1}{n-1} \sum^n_{i=1}\left( X_i - \overline{X} \right)^2}\]

Definition. range

\[R := \max_{1 \le i \le n} X_i - \min_{1 \le i \le n} X_i\]
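
For observed data, these variability measures can be computed with Python's standard `statistics` module, whose `variance` and `stdev` functions use the $n-1$ divisor. The data below are hypothetical.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

s2 = statistics.variance(sample)   # sample variance, divides by n - 1
s = statistics.stdev(sample)       # sample standard deviation, sqrt of s2
r = max(sample) - min(sample)      # range: largest minus smallest observation

print(s2, s, r)
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by $n$ instead, so they are not the $S^2$ and $S$ defined here.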

Definition. sampling distribution

The probability distribution of a sample statistic is called a <sampling distribution>.

ex) distribution of the sample mean, distribution of the sample variance, …

Here, a sample statistic is a representative value that describes a characteristic of the sample, such as the sample mean or the sample variance.
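
For a small discrete population, a sampling distribution can be written out exactly by enumerating every possible sample. The sketch below does this for the sample mean of `n = 2` rolls of a fair die; the die is just an illustrative population, not an example from the course.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)
n = 2

# Count how many of the 6**n equally likely samples give each value of the sample mean.
counts = Counter(Fraction(sum(xs), n) for xs in product(faces, repeat=n))

# Sampling distribution of the sample mean: value -> probability.
sampling_dist = {xbar: Fraction(c, 6 ** n) for xbar, c in sorted(counts.items())}

for xbar, prob in sampling_dist.items():
    print(f"P(Xbar = {float(xbar):.1f}) = {prob}")
```

The resulting table is the sampling distribution of $\overline{X}$ for this population and sample size; the same enumeration works for any other statistic by swapping out the expression inside `Counter`.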

👉 Sampling Distribution of Mean, and CLT

👉 Sampling Distribution of Variance