This post summarizes what I learned and studied in the “Probability and Statistics (MATH230)” course. The full list of posts is available at Probability and Statistics 🎲

Introduction

Among all students taking the Probability and Statistics course, we want to find the proportion who like the course. There are too many students to survey everyone, so suppose we survey $n$ of them instead.

If $X$ is the RV “the number of students, among the $n$ surveyed, who answered that they like the course”, then $X$ follows a HyperGeometric distribution. If the total number of students is sufficiently large, the HyperGeometric distribution can be approximated by a Binomial distribution.
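As a rough numerical check of this approximation, the sketch below compares the two pmfs for the same proportion and sample size but two population sizes. The numbers (`N`, `K`, `n`) and the helper names are assumptions made up for illustration; they do not come from the course.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k) when drawing n students without replacement from N, K of whom like the course."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def binom_pmf(k, n, p):
    """P(X = k) for n independent draws, each a success with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def max_pmf_gap(N, K, n):
    """Largest pointwise difference between the two pmfs, with p = K / N."""
    p = K / N
    return max(abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, p)) for k in range(n + 1))


# Same proportion p = 0.6 and sample size n = 10, two hypothetical population sizes.
gap_small = max_pmf_gap(N=50, K=30, n=10)
gap_large = max_pmf_gap(N=50_000, K=30_000, n=10)
print(gap_small, gap_large)
```

The gap shrinks as the population grows relative to the sample, which is the sense in which the Binomial distribution stands in for the HyperGeometric one.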

The RV $X_i$, which represents the preference of student $i$, takes a binary value.

\[X_i = \begin{cases} 1 & i\text{-th student likes it!} \\ 0 & \text{else} \end{cases}\]

Aggregating the RVs $X_1, \dots, X_n$, we can derive a new RV $\overline{X}$.

\[\overline{X} := \frac{X_1 + \cdots + X_n}{n}\]

The $\overline{X}$ derived this way is called the <sample mean>!


Let's make the example above more concrete.

Suppose $n=100$ and 60 students said they like the lecture. Then $\overline{x} = \frac{60}{100} = 0.6$.

What we want to discuss about the <sample mean> $\overline{X}$ is how to compute the probability

\[P(\left| \overline{X} - 0.6 \right| < \epsilon)\]

More generally, we compute

\[P(\left| \overline{X} - \mu_0 \right| < \epsilon)\]

to check how far the sample mean we obtained is from a proposed value $\mu_0$, which in turn lets us test the hypothesis $\mu = \mu_0$. This is covered in detail later, in the <Hypothesis Test> part.

To compute $P(\left| \overline{X} - \mu_0 \right| < \epsilon)$ we need to know the distribution of $\overline{X}$, and we call this the <sampling distribution>! The definition of a sampling distribution is given at the very end of this article.
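To make this probability concrete, here is a minimal sketch under an assumed model: the $X_i$ are treated as independent Bernoulli($p$) responses, so $n\overline{X}$ is Binomial($n$, $p$). The function name `prob_mean_within` and the choice $p = \mu_0 = 0.6$, $\epsilon = 0.1$ are illustrative assumptions. Exact fractions are used so that the strict inequality is not blurred by floating-point rounding at the boundary.

```python
from fractions import Fraction
from math import comb


def prob_mean_within(n, p, mu0, eps):
    """P(|X̄ - mu0| < eps) when n * X̄ ~ Binomial(n, p), computed with exact rationals."""
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - mu0) < eps:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


p = Fraction(3, 5)  # assumed true proportion, taken equal to mu0 here
prob = prob_mean_within(n=100, p=p, mu0=Fraction(3, 5), eps=Fraction(1, 10))
print(float(prob))
```

Changing `p` while keeping `mu0` fixed shows how the probability drops when the proposed value is far from the true proportion, which is the idea a hypothesis test builds on.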

Population and Sample

Definition. population

A <population> is the totality of observations.

Definition. sample

A <sample> is a subset of a population.

Definition. random sample

RVs $X_1, \dots, X_n$ are said to be a <random sample> of size $n$ if they are independent and identically distributed with pmf or pdf $f(x)$.

That is,

\[f_{(X_1, \dots, X_n)} (x_1, \dots, x_n) = f_{X_1} (x_1) \cdots f_{X_n} (x_n)\]

The observed values $x_1, \dots, x_n$ of $X_1, \dots, X_n$ are called <sample points> or <observations>.

Statistics

Definition. statistic

A <statistic> is a function of a random sample $X_1, \dots, X_n$ that does not depend on unknown parameters.

That is, a function of the form $f(X_1, \dots, X_n)$ is called a <statistic>. A <statistic> serves as a representative value of a set of RVs.


Example.

Suppose $X_1, \dots, X_n$ is a random sample from $N(\mu, 1)$, where $\mu$ is unknown.

Then,

1. $\dfrac{X_1 + \cdots + X_n}{n}$ is a statistic!

2. $\max \{ X_1, \dots, X_n \}$ is a statistic!

3. $\dfrac{X_1 + \cdots + X_n + \mu}{n}$ is not a statistic, since it depends on the unknown parameter $\mu$!

We can infer properties of the population only through <statistics>, not from the individual sample values $X_i = x_i$ taken one at a time.
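One way to read the three examples above in code: a statistic is something that can be evaluated from the observations alone. In this small sketch the function names and the observed values are hypothetical; the third function cannot be evaluated without being handed $\mu$, which is exactly why it is not a statistic.

```python
def sample_mean(xs):
    """Statistic: depends only on the observations."""
    return sum(xs) / len(xs)


def sample_max(xs):
    """Statistic: depends only on the observations."""
    return max(xs)


def shifted_mean(xs, mu):
    """Not a statistic: needs the unknown parameter mu as an extra input."""
    return (sum(xs) + mu) / len(xs)


observations = [0.3, -1.2, 0.8, 2.1]  # made-up observed values x_1, ..., x_4
print(sample_mean(observations), sample_max(observations))
```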

Location Measures of a Sample

Let $X_1, \dots, X_n$ be a random sample.

Definition. sample mean

$\overline{X} = \dfrac{X_1 + \cdots + X_n}{n}$ is called a <sample mean>.

(1) $\overline{X}$ is also a random variable!

(2) If $E(X_1) = \mu$ and $\text{Var}(X_1) = \sigma^2$, then $E(\overline{X}) = \dfrac{n\mu}{n} = \mu$ and $\text{Var}(\overline{X}) = \dfrac{\sigma^2}{n}$

(3) $\overline{X}$ can be sensitive to outliers.
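Property (2) follows from linearity of expectation and, for the variance, from the independence of the $X_i$. A short derivation with the intermediate steps written out:

\[\begin{aligned} E(\overline{X}) &= \frac{1}{n} \sum^n_{i=1} E(X_i) = \frac{n\mu}{n} = \mu \\ \text{Var}(\overline{X}) &= \frac{1}{n^2} \text{Var}\left( \sum^n_{i=1} X_i \right) = \frac{1}{n^2} \sum^n_{i=1} \text{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \end{aligned}\]

The second equality in the variance line is where independence is used: the variance of a sum of independent RVs is the sum of their variances.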


Definition. sample median

The middle value of the sample.

Definition. sample mode

The most frequent value in the sample.
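The three location measures can be compared on a small made-up data set using Python's standard `statistics` module. The data values are assumptions chosen for illustration; adding a single outlier shows the sensitivity of the sample mean noted in (3) above.

```python
import statistics

data = [2, 3, 3, 4, 5]          # made-up observations
with_outlier = data + [100]     # same data plus one extreme value

for name, xs in [("original", data), ("with outlier", with_outlier)]:
    print(name, statistics.mean(xs), statistics.median(xs), statistics.mode(xs))
```

The mean moves a long way once the outlier is added, while the median and mode barely change.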

Variability Measures of a Sample

Definition. sample variance

Let $X_1, \dots, X_n$ be a random sample with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2$.

\[S^2 := \frac{1}{n-1} \sum^n_{i=1} \left( X_i - \overline{X}\right)^2\]

Why divide by (n-1)?

Q. Why $(n-1)$ in the denominator?

A. Because dividing by $(n-1)$ is what makes the expectation of the sample variance, $E[S^2]$, equal to $\sigma^2$!

Proof.

w.l.o.g. we can assume that $E[X_i] = 0$. (For convenience we simply center each $X_i$ by subtracting $\mu$; this changes neither $S^2$ nor $\sigma^2$.)

\[\begin{aligned} S^2 &= \frac{1}{n-1} \sum^n_{i=1} \left( X_i^2 - 2 X_i \overline{X} + (\overline{X})^2 \right) \\ &= \frac{1}{n-1} \left\{ \sum^n_{i=1} X_i^2 - 2 \overline{X} \sum^n_{i=1} X_i + n (\overline{X})^2 \right\} \\ \end{aligned}\]

Here, $\displaystyle\sum^n_{i=1} X_i$ equals $n\overline{X}$ by the definition of $\overline{X}$.

\[\begin{aligned} S^2 &= \frac{1}{n-1} \left\{ \sum^n_{i=1} X_i^2 - 2 \overline{X} \cdot n\overline{X} + n (\overline{X})^2 \right\} \\ &= \frac{1}{n-1} \left\{ \sum^n_{i=1} X_i^2 - n (\overline{X})^2 \right\} \\ \end{aligned}\]

Now take the expectation of both sides of the equation above.

\[\begin{aligned} E[S^2] &= \frac{1}{n-1} \left\{ \sum^n_{i=1} E[X_i^2] - n \cdot E\left[(\overline{X})^2\right] \right\} \\ &= \frac{1}{n-1} \left\{ \sum^n_{i=1} (\sigma^2 + \cancelto{0}{E[X_i]^2}) - n \cdot E\left[(\overline{X})^2\right] \right\} \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - n \cdot \frac{1}{n^2} \cdot E \left[(X_1 + \cdots + X_n)^2 \right] \right\} \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \frac{1}{n} \cdot \left( n \cdot E[X_1^2] + E[X_i X_j] + \cdots \right) \right\} \\ \end{aligned}\]

Since the $X_i$ are independent, $E[X_i X_j] = E[X_i] E[X_j] = 0 \cdot 0 = 0$ for $i \ne j$, so every cross term vanishes.

\[\begin{aligned} E[S^2] &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \frac{1}{n} \cdot \left( n \cdot E[X_1^2] + \cancelto{0}{E[X_i X_j]} + \cdots \right) \right\} \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \frac{1}{\cancel{n}} \cdot \cancel{n} \cdot \cancelto{\sigma^2}{E[X_1^2]} \right\} \\ &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \sigma^2 \right\} \\ &= \sigma^2 \end{aligned}\]

$\blacksquare$
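The result can also be checked exactly on a tiny example. The sketch below assumes the $X_i$ are rolls of a fair six-sided die with $n = 3$ (an illustrative choice, not from the course) and enumerates all $6^3$ equally likely samples, comparing the divisor $n-1$ against the divisor $n$.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]  # fair die: each face has probability 1/6
n = 3

mu = Fraction(sum(faces), len(faces))
sigma2 = sum((Fraction(x) - mu) ** 2 for x in faces) / len(faces)


def expected_scaled_ss(divisor):
    """E[ sum (X_i - X̄)^2 / divisor ], averaging over every equally likely sample."""
    samples = list(product(faces, repeat=n))
    total = Fraction(0)
    for sample in samples:
        xbar = Fraction(sum(sample), n)
        total += sum((x - xbar) ** 2 for x in sample) / divisor
    return total / len(samples)


unbiased = expected_scaled_ss(n - 1)  # expectation of S^2 as defined above
biased = expected_scaled_ss(n)        # expectation when dividing by n instead
print(sigma2, unbiased, biased)
```

With the divisor $n-1$ the expectation matches $\sigma^2$ exactly, while the divisor $n$ gives $\frac{n-1}{n}\sigma^2$, an underestimate.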


Definition. sample standard deviation

\[S := \sqrt{S^2} = \sqrt{\frac{1}{n-1} \sum^n_{i=1}\left( X_i - \overline{X} \right)^2}\]

Definition. range

\[R := \max_{1 \le i \le n} X_i - \min_{1 \le i \le n} X_i\]
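For observed data, these variability measures can be computed with Python's standard `statistics` module, whose `variance` and `stdev` use the $(n-1)$ divisor (the population versions are `pvariance` and `pstdev`). The data values below are made up for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations

s2 = statistics.variance(data)   # sample variance, divides by n - 1
s = statistics.stdev(data)       # sample standard deviation, sqrt of s2
r = max(data) - min(data)        # range

print(s2, s, r)
```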

Sampling Distribution

Definition. sampling distribution

The probability distribution of a sample statistic is called a <sampling distribution>.

ex) distribution of sample mean, distribution of sample variance, …

Here, a sample statistic is a representative value that describes a characteristic of the sample, such as the sample mean or the sample variance. For details, see the two posts below!
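As a small illustration of the definition, the sampling distribution of the sample mean can be written out in full when the population is simple. The sketch assumes two rolls of a fair six-sided die as the random sample (an illustrative choice) and tabulates $P(\overline{X} = \overline{x})$ for every possible value.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]  # fair die
n = 2                       # sample size

samples = list(product(faces, repeat=n))                 # all 36 equally likely samples
counts = Counter(Fraction(sum(s), n) for s in samples)   # how often each value of X̄ occurs
sampling_dist = {xbar: Fraction(c, len(samples)) for xbar, c in sorted(counts.items())}

for xbar, prob in sampling_dist.items():
    print(float(xbar), prob)
```

The resulting table is the sampling distribution of $\overline{X}$: it is triangular and peaks at $3.5$, even though each individual roll is uniform.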

Please describe two projects you have worked on in detail, including the technologies used, your role, how the work was carried out, and the results.

[Operating and improving the Events pipeline] I have built and operated a pipeline that ingests 400 million events per day, 500GB of event traffic, without loss. I started out as the person responsible for maintaining the pipeline; I now lead the Data Platform unit and drive the direction and improvement of the data team's AWS and Kubernetes infrastructure as a whole, including the event pipeline.

The event pipeline is an end-to-end structure that logs every event occurring in-game and loads it into the data lake. The Events Producer, which handles events first, is built as a Serverless Application on Istio and Knative. It is designed to scale automatically as traffic changes, so data is collected reliably and without loss even under heavy traffic. Events are written to Kafka topics. Initially all events were stored in a single topic, which made it hard to monitor Sink Connector lag and blocked the setup of an Events ETL. To solve this, I developed logic that splits events into dedicated Kafka topics per event type, spreading out more than 50% of the load on the single topic, and prepared the environment so that a Flink-based Events ETL could be developed.

In addition, I integrated Prometheus so that situation-specific alarms fire when the pipeline fails, and improved the setup so that dummy data is sent automatically every minute to check the pipeline's health continuously. Going further, I defined custom Prometheus metrics so that trends in Success/Fail/DLQ messages per Producer can be monitored in real time, which helped related teams understand and respond to incidents more clearly.

For the deployment process, we kept the existing Pulumi-based IaC approach, but I found that PR reviews required several unnecessary clicks. To fix this, I developed an automated, PR-comment-driven Github Workflow modeled on Atlantis, which simplified the deployment process.

Because raw event data plays an important role in game operations and analysis, stable operation of the pipeline and incident response were essential. To that end I built a fine-grained monitoring system, set up an easy deployment process, and ran sessions for teammates on Istio and the Kubernetes infrastructure, creating an environment where the team can work together to deploy safely and quickly. As a result, we distributed the Kafka load effectively and resolved the Sink Connector lag problem, and the real-time monitoring and automated health-check system greatly improved incident response time.

[Metastore design and migration] I migrated more than 10,000 tables and more than 1,000 workflows from Hive Metastore to Databricks Unity Catalog, optimizing data lakehouse performance and improving the team's productivity along the way. As lead of the migration task force, I drove the project from March to August 2024 and was responsible for the overall process design and debugging.

Before starting the migration, I used assessment tools and custom queries to analyze compatibility between the existing Hive Metastore and Unity Catalog, and designed a new Catalog structure that improved data accessibility and management efficiency. The data team had been using Spark and Databricks since 2018, so the legacy that had accumulated over that time (Notebooks, Workflows, ACLs, and so on) needed to be cleaned up and optimized for the current environment. In the Notebooks managed through GitOps, I replaced legacy APIs that relied on the Spark Context, as well as Databricks Legacy APIs, with current APIs. I also redesigned the Spark UDFs and changed the partitioned metadata configuration, which resolved the Spark Session conflicts that occurred on Shared Clusters.

To optimize query performance, I changed the data format from Parquet to Delta Lake and introduced incremental updates, cutting S3 requests to about 10% of the previous level. Several pipeline incidents occurred during this process, and related departments complained about the inconvenience caused by the migration. In response, I provided before-and-after performance comparisons and communicated actively, emphasizing that the Auditing and Lineage features would increase how much the data could be used.

As a result, all metadata can now be managed in one place in the Unity Catalog environment, and we achieved faster queries and lower resource usage. It also laid the groundwork for taking on new areas of data use, such as LLM Agent development and Audit Log based monitoring. The repetitive work and incident response during the project were a heavy burden, but through them I gained a deep understanding of Spark and built up my skills in performance optimization and incident response.

Please describe a memorable troubleshooting experience in as much detail as possible.

Building the in-house Data Discovery Platform (DDP) with Datahub is the most memorable. When I worked as a Data Scientist at Bagelcode, related departments frequently asked for meta information about tables and columns and for data extracts. To address this, I ran “Query101” sessions for non-developer roles, introducing how to extract data and tools such as Superset, Amundsen, and the in-house schema page, but it was still difficult for non-developer teams to extract data on their own.

The same problem persisted after I moved to a Data Engineer role, so as my first project I took on building a new Data Discovery Platform (DDP) to replace the existing Amundsen. The goal was to use the open-source Datahub to run data search and metadata management more systematically, and I set workload deployment and automated metadata ingestion as the main tasks.

During the initial workload deployment, a security issue occurred: an anonymous user figured out the Admin account on the development cluster. Fortunately, the development cluster was not connected to the production environment, so I was able to respond by changing the password immediately. Debugging showed that the Admin account password was set to a fixed value in Datahub's Helm Chart. I judged this to be a security vulnerability that could affect not only me but every Datahub user, so I immediately opened an Issue on GitHub and then wrote a PR myself to fix it. The improvement was merged into the Datahub project, and it became my first experience contributing to a large open-source project.

In automating metadata ingestion, the list of Databricks tables and their metadata were collected correctly, but the dependencies between tables were not visible. Understanding the relationships between tables matters for data analysis and ETL, and without a fix it was hard to trace data flows. I therefore introduced an approach that parses the Spark Query Plan, stores the dependencies between tables separately, and reflects them in the metadata platform. With this improvement, table-to-table lineage could be checked visually, and when an incident occurred teammates could plan and run backfills based on it. In addition, I implemented an Airflow-Datahub integration that links Databricks tables to Airflow Tasks, extending the platform so that the connections between data and workflows can be seen at a glance.

Through this project, we improved data visibility across the whole company, not only for the data team but for non-developer roles as well. It was also very meaningful that, in the course of solving technical problems, I was able to improve security, contribute to open source, and grow my data engineering skills.