This post summarizes what I learned and studied in the "Probability and Statistics (MATH230)" course. You can find the whole series at Probability and Statistics 🎲


Introduction

확톡 μˆ˜μ—…μ„ λ“£λŠ” 전체 학생을 λŒ€μƒμœΌλ‘œ, 확톡 μˆ˜μ—…μ„ μ„ ν˜Έν•˜λŠ” ν•™μƒμ˜ λΉ„μœ¨μ„ κ΅¬ν•˜κ³ μž ν•œλ‹€. 그런데, 확톡 μˆ˜μ—…μ„ λ“£λŠ” 학생 μˆ˜κ°€ λ„ˆλ¬΄ λ§Žμ•„μ„œ 전체λ₯Ό 쑰사할 순 μ—†κ³ , 전체 쀑 $n$λͺ… 학생을 λŒ€μƒμœΌλ‘œ 섀문쑰사λ₯Ό μ‹œν–‰ν•œλ‹€κ³  ν•˜μž.

If $X$ is the RV "the number of students, out of the $n$ surveyed, who answered that they like the course," then $X$ follows a hypergeometric distribution.

Moreover, if the total number of students is sufficiently large, the hypergeometric distribution can be approximated by a binomial distribution.

μ΄λ•Œ, 각 학생 $i$의 μ„ ν˜Έλ₯Ό RV $X_i$둜 ν‘œν˜„ν•΄λ³΄μž. 그러면,

\[X_i = \begin{cases} 1 & i\text{-th student likes it!} \\ 0 & \text{else} \end{cases}\]

Combining the RVs $X_1, \dots, X_n$, we can derive a new RV $\overline{X}$:

\[\overline{X} := \frac{X_1 + \cdots + X_n}{n}\]

We call this $\overline{X}$ the <sample mean>!


μœ„μ˜ μ˜ˆμ‹œλ₯Ό 쒀더 ꡬ체화 ν•΄μ„œ μƒκ°ν•΄λ³΄μž.

Suppose $n = 100$, and 60 students said they like the lecture. Then, $\overline{x} = \frac{60}{100} = 0.6$.
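The computation above can be sketched in Python. The data here are hypothetical, made up to match the numbers in the example:

```python
# Hypothetical survey result matching the example above:
# 60 of 100 students answered that they like the lecture.
n = 100
x = [1] * 60 + [0] * 40  # indicator observations x_1, ..., x_n

sample_mean = sum(x) / n
print(sample_mean)  # 0.6
```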

μ΄λ•Œ, μš°λ¦¬κ°€ <sample mean> $\overline{x}$에 λŒ€ν•΄ λ…Όν•˜κ³ μž ν•˜λŠ” μ£Όμ œλŠ” λ°”λ‘œ

\[P(\left| \overline{x} - 0.6 \right| < \epsilon)\]

κ³Ό 같은 ν™•λ₯ μ„ μ–΄λ–»κ²Œ κ΅¬ν•˜λŠ”μ§€μ— λŒ€ν•œ 것이닀. 이것을 κ΅¬ν•˜λŠ” μ΄μœ λŠ”

\[P(\left| \overline{x} - \mu_0 \right| < \epsilon)\]

의 ν™•λ₯ μ„ κ΅¬ν•˜μ—¬, μ œμ‹œν•œ $\mu_0$와 μš°λ¦¬κ°€ 얻은 sample mean이 μ–Όλ§ˆλ‚˜ 차이 λ‚˜λŠ”μ§€λ₯Ό ν™•μΈν•˜κ³ , 이것을 ν™œμš©ν•΄ $\mu = \mu_0$λΌλŠ” κ°€μ„€(Hypothesis)λ₯Ό κ²€μ •(Test)ν•  수 있기 λ•Œλ¬Έμ΄λ‹€. 이 λ‚΄μš©μ€ λ’€μ˜ <κ°€μ„€ κ²€μ •; Hypothesis Test> λΆ€λΆ„μ—μ„œ 쒀더 μžμ„Ένžˆ 닀룬닀.

To compute $P(\left| \overline{X} - \mu_0 \right| < \epsilon)$, we need to know the distribution of $\overline{X}$, and we call this distribution the <sampling distribution; ν‘œλ³Έ 뢄포>! The formal definition of a sampling distribution is given at the very end of this article.
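A probability like this can also be estimated by simulation. Below is a minimal Monte Carlo sketch; all parameters ($p = 0.6$, $n = 100$, $\epsilon = 0.05$) are hypothetical, chosen to match the survey example:

```python
import random

random.seed(0)

# Monte Carlo sketch (hypothetical parameters): estimate
# P(|X_bar - mu_0| < eps) when each X_i ~ Bernoulli(p).
p, n, eps, trials = 0.6, 100, 0.05, 10_000
mu_0 = p

hits = 0
for _ in range(trials):
    # one simulated survey of n students, then its sample mean
    x_bar = sum(random.random() < p for _ in range(n)) / n
    if abs(x_bar - mu_0) < eps:
        hits += 1

print(hits / trials)  # estimate of P(|X_bar - 0.6| < 0.05)
```

Later sections make this precise: by the CLT, $\overline{X}$ is approximately normal, which lets us compute such probabilities without simulation.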


Definition. population

A <population> is the totality of observations.

Definition. sample

A <sample> is a subset of a population.

Definition. random sample

RVs $X_1, \dots, X_n$ are said to be a <random sample> of size $n$ if they are independent and identically distributed with common pmf or pdf $f(x)$.

That is,

\[f_{(X_1, \dots, X_n)} (x_1, \dots, x_n) = f_{X_1} (x_1) \cdots f_{X_n} (x_n)\]

The observed values $x_1, \dots, x_n$ of $X_1, \dots, X_n$ are called <sample points> or <observations>.


Definition. statistic; ν†΅κ³„λŸ‰

A <statistic; ν†΅κ³„λŸ‰> is a function of a random sample $X_1, \dots, X_n$ that does not depend on any unknown parameters.

That is, a function of the form $f(X_1, \dots, X_n)$ is a <statistic>. A statistic serves as a summary value of the random sample.


Example.

Suppose $X_1, \dots, X_n$ is a random sample from $N(\mu, 1)$.

Then,

1. $\dfrac{X_1 + \cdots + X_n}{n}$ is a statistic!

2. $\max \{ X_1, \dots, X_n \}$ is a statistic!

3. $\dfrac{X_1 + \cdots + X_n + \mu}{n}$ is not a statistic, because it depends on the unknown parameter $\mu$!

We can perform inference about the population only through statistics.
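The distinction in the example can be sketched in code. In the snippet below, $\mu$ is used only to simulate the population; the statistics themselves are computed from the sample alone (all parameter values are hypothetical):

```python
import random

random.seed(1)

# mu is the unknown population parameter; here we use it only
# to generate the data, playing the role of "nature".
mu, n = 2.0, 5
sample = [random.gauss(mu, 1) for _ in range(n)]

stat1 = sum(sample) / n  # example 1: sample mean, a function of the sample only
stat2 = max(sample)      # example 2: sample maximum, also a statistic
# Example 3, (sum(sample) + mu) / n, could NOT be evaluated by a
# statistician, since mu is unknown -- it is not a statistic.

print(stat1, stat2)
```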


Location Measures of a Sample

Let $X_1, \dots, X_n$ be a random sample.

Definition. sample mean

$\overline{X} = \dfrac{X_1 + \cdots + X_n}{n}$ is called a <sample mean>.

(1) $\overline{X}$ is also a random variable!

(2) If $E(X_1) = \mu$ and $\text{Var}(X_1) = \sigma^2$, then $E(\overline{X}) = \dfrac{n\mu}{n} = \mu$ and $\text{Var}(\overline{X}) = \dfrac{\sigma^2}{n}$

(3) $\overline{X}$ can be sensitive to outliers.
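Property (2) can be checked empirically. A simulation sketch with made-up parameters ($\mu = 0$, $\sigma = 2$, $n = 25$), estimating $\text{Var}(\overline{X})$ from many repeated samples:

```python
import random

random.seed(42)

# Sketch (hypothetical parameters): empirically check Var(X_bar) = sigma^2 / n.
mu, sigma, n, reps = 0.0, 2.0, 25, 20_000

means = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)  # one realization of X_bar

m = sum(means) / reps
var_of_mean = sum((v - m) ** 2 for v in means) / reps
print(var_of_mean)  # close to sigma^2 / n = 4 / 25 = 0.16
```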


Definition. sample median

Simply the middle value of the ordered sample.

Definition. sample mode

Sampleμ—μ„œμ˜ μ΅œλΉˆκ°’.


Variability Measures of a Sample

Definition. sample variance

Let $X_1, \dots, X_n$ be a random sample with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2$.

\[S^2 := \frac{1}{n-1} \sum^n_{i=1} \left( X_i - \overline{X}\right)^2\]

Q. Why divide by $(n-1)$ instead of $n$?

A. Because dividing by $(n-1)$ makes the expectation of the sample variance equal to the population variance, i.e., $E[S^2] = \sigma^2$; in other words, $S^2$ is an unbiased estimator of $\sigma^2$!

Proof.

W.l.o.g. we can assume that $E[X_i] = 0$. (We have simply centered the $X_i$ for convenience.)

\[\begin{aligned} S^2 &= \frac{1}{n-1} \sum^n_{i=1} \left( X_i^2 - 2 X_i \overline{X} + (\overline{X})^2 \right) \\ &= \frac{1}{n-1} \left\{ \sum^n_{i=1} X_i^2 - 2 \overline{X} \sum^n_{i=1} X_i + n (\overline{X})^2 \right\} \\ \end{aligned}\]

μ΄λ•Œ, $\displaystyle\sum^n_{i=1} X_i$λŠ” κ·Έ μ •μ˜μ— μ˜ν•΄ $n\overline{X}$κ°€ λœλ‹€.

\[\begin{aligned} S^2 &= \frac{1}{n-1} \left\{ \sum^n_{i=1} X_i^2 - 2 \overline{X} \cdot n\overline{X} + n (\overline{X})^2 \right\} \\ &= \frac{1}{n-1} \left\{ \sum^n_{i=1} X_i^2 - n (\overline{X})^2 \right\} \\ \end{aligned}\]

이제 μœ„μ˜ μ‹μ˜ 양변에 평균을 μ·¨ν•΄λ³΄μž.

\[\begin{aligned} E[S^2] &= \frac{1}{n-1} \left\{ \sum^n_{i=1} E[X_i^2] - n E\left[(\overline{X})^2\right] \right\} \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - n \cdot \frac{1}{n^2} \cdot E \left[ (X_1 + \cdots + X_n)^2 \right] \right\} \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \frac{1}{n} \left( \sum^n_{i=1} E[X_i^2] + \sum_{i \ne j} E[X_i X_j] \right) \right\} \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \frac{1}{n} \cdot n \sigma^2 \right\} \quad (E[X_i X_j] = E[X_i] \, E[X_j] = 0 \text{ by independence}) \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \sigma^2 \right\} \\ &= \sigma^2 \end{aligned}\]

(Here $E[X_i^2] = \text{Var}(X_i) = \sigma^2$, because we assumed $E[X_i] = 0$.)

$\blacksquare$
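The unbiasedness result can also be checked by simulation. A sketch with hypothetical parameters ($\mu = 0$, $\sigma = 1$, $n = 5$), comparing the $(n-1)$ divisor against the naive $n$ divisor:

```python
import random

random.seed(7)

# Simulation sketch (hypothetical parameters): the (n-1) divisor makes
# S^2 unbiased for sigma^2, while the n divisor underestimates it.
mu, sigma, n, reps = 0.0, 1.0, 5, 50_000

s2_unbiased, s2_biased = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(x) / n
    ss = sum((xi - x_bar) ** 2 for xi in x)  # sum of squared deviations
    s2_unbiased += ss / (n - 1)
    s2_biased += ss / n

print(s2_unbiased / reps)  # close to sigma^2 = 1.0
print(s2_biased / reps)    # close to (n-1)/n * sigma^2 = 0.8
```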


Definition. sample standard deviation

\[S := \sqrt{S^2} = \sqrt{\frac{1}{n-1} \sum^n_{i=1}\left( X_i - \overline{X} \right)^2}\]

Definition. range

\[R := \max_{1 \le i \le n} X_i - \min_{1 \le i \le n} X_i\]
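The variability measures above can be computed directly from observations; the data below are made up for illustration:

```python
# Made-up sample observations.
sample = [4.0, 7.5, 6.1, 5.2, 9.3]

n = len(sample)
x_bar = sum(sample) / n
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)  # sample variance S^2
s = s2 ** 0.5                                         # sample standard deviation S
r = max(sample) - min(sample)                         # range R = 9.3 - 4.0

print(s2, s, r)
```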

Definition. sampling distribution

The probability distribution of a sample statistic is called a <sampling distribution>.

ex) distribution of sample mean, distribution of sample variance, …

μ΄λ•Œ, ν‘œλ³Έ ν†΅κ³„λŸ‰(sample Statisticss)λŠ” sample mean, sample variance와 같이 ν‘œλ³Έμ˜ νŠΉμ„±μ„ λ‚˜νƒ€λ‚΄λŠ” λŒ€ν‘œκ°’μ΄λ‹€.

πŸ‘‰ Sampling Distribution of Mean, and CLT

πŸ‘‰ Sampling Distribution of Variance