This post organizes what I learned and studied in the "Probability and Statistics (MATH230)" course. The full list of posts is available at Probability and Statistics 🎲

Introduction

Suppose we want to find the proportion of students who like the Probability and Statistics course, among all students taking it. There are too many students to survey everyone, so we survey $n$ of them instead.

If $X$ is the RV "the number of students, among the $n$ surveyed, who answered that they like the course", then $X$ follows a HyperGeometric distribution. If the total number of students is sufficiently large, the HyperGeometric distribution can be approximated by a Binomial distribution.
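As a rough check of the approximation claim, the sketch below compares the two pmfs for a fixed sample size while the population grows. The population sizes, the 60% share of students who like the course, and the helper names (`hypergeom_pmf`, `binom_pmf`, `max_pmf_gap`) are illustrative assumptions, not values from the course.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k successes in n draws without replacement
    from a population of N items, K of which are successes."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def max_pmf_gap(N, K, n):
    """Largest pointwise gap between the two pmfs, with p = K / N."""
    p = K / N
    return max(
        abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, p))
        for k in range(n + 1)
    )

n = 10  # survey size (assumed)
for N in (50, 500, 5000):
    K = 3 * N // 5  # assume 60% of the population likes the course
    print(N, max_pmf_gap(N, K, n))
```

The printed gap should shrink as $N$ grows, which is the sense in which sampling without replacement from a large population behaves like independent draws.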

For each student $i$, the RV $X_i$ representing that student's preference takes a binary value.

$$X_i = \begin{cases} 1 & \text{if the } i\text{-th student likes it} \\ 0 & \text{otherwise} \end{cases}$$

Aggregating the RVs $X_1, \dots, X_n$ yields a new RV $\bar{X}$.

$$\bar{X} := \frac{X_1 + \cdots + X_n}{n}$$

The RV $\bar{X}$ derived this way is called the <sample mean>!


Let's make the example above more concrete.

Let $n = 100$, and suppose 60 students said they like the lecture. Then $\bar{x} = \frac{60}{100} = 0.6$.

The question we want to discuss about the <sample mean> $\bar{x}$ is how to compute the probability

$$P(|\bar{X} - 0.6| < \epsilon)$$

More generally, we compute

$$P(|\bar{X} - \mu_0| < \epsilon)$$

to see how far our sample mean is from a proposed value $\mu_0$, and we can use this to test the hypothesis that $\mu = \mu_0$. This is covered in detail later, in the <Hypothesis Test> part.

To compute $P(|\bar{X} - \mu_0| < \epsilon)$, we need to know the distribution of $\bar{X}$, which we call the <sampling distribution>! The definition of the sampling distribution is given at the very end of this article.
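To make the target probability concrete, here is a minimal sketch that evaluates $P(|\bar{X} - \mu_0| < \epsilon)$ exactly, under the assumption that the population is large enough for $n\bar{X}$ to be treated as Binomial($n$, $\mu_0$). The function name `prob_mean_within` and the two values of $\epsilon$ are illustrative choices.

```python
from fractions import Fraction
from math import comb

def prob_mean_within(n, p, eps):
    """P(|X_bar - p| < eps), assuming n * X_bar ~ Binomial(n, p).

    p and eps are Fractions so that the boundary test
    |k/n - p| < eps is exact rather than subject to float rounding.
    """
    pf = float(p)
    total = 0.0
    for k in range(n + 1):
        if abs(Fraction(k, n) - p) < eps:
            total += comb(n, k) * pf**k * (1 - pf) ** (n - k)
    return total

mu0 = Fraction(3, 5)  # proposed proportion 0.6, as in the example
for eps in (Fraction(1, 20), Fraction(1, 10)):
    print(float(eps), prob_mean_within(100, mu0, eps))
```

A wider tolerance $\epsilon$ can only add outcomes to the event, so the probability is non-decreasing in $\epsilon$.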

Population and Sample

Definition. population

A <population> is the totality of observations.

Definition. sample

A <sample> is a subset of a population.

Definition. random sample

RVs $X_1, \dots, X_n$ are said to be a <random sample> of size $n$ if they are independent and identically distributed with pmf or pdf $f(x)$.

That is,

$$f_{(X_1, \dots, X_n)}(x_1, \dots, x_n) = f_{X_1}(x_1) \cdots f_{X_n}(x_n)$$

The observed values $x_1, \dots, x_n$ of $X_1, \dots, X_n$ are called <sample points> or <observations>.

Statistics

Definition. statistic

A <statistic> is a function of a random sample $X_1, \dots, X_n$ that does not depend on unknown parameters.

That is, a function of the form $f(X_1, \dots, X_n)$ is called a <statistic>. A <statistic> serves as a representative value for the set of RVs.


Example.

Suppose $X_1, \dots, X_n$ is a random sample from $N(\mu, 1)$.

Then,

1. $\frac{X_1 + \cdots + X_n}{n}$ is a statistic!

2. $\max\{X_1, \dots, X_n\}$ is a statistic!

3. $\frac{X_1 + \cdots + X_n + \mu}{n}$ is not a statistic, because it depends on the unknown parameter $\mu$!

We cannot infer properties of the population from an individual sample value $X_i = x_i$; we can infer them only through <statistics>.

Location Measures of a Sample

Let $X_1, \dots, X_n$ be a random sample.

Definition. sample mean

$\bar{X} = \frac{X_1 + \cdots + X_n}{n}$ is called a <sample mean>.

(1) $\bar{X}$ is also a random variable!

(2) If $E(X_1) = \mu$ and $\text{Var}(X_1) = \sigma^2$, then $E(\bar{X}) = \frac{n\mu}{n} = \mu$ and $\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$.

(3) $\bar{X}$ can be sensitive to outliers.
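Property (2) can be checked exactly in the Bernoulli setting of the introduction. The sketch below builds the distribution of $\bar{X}$ for $n$ i.i.d. Bernoulli($p$) variables and computes its mean and variance with exact fractions. The value $p = 3/5$ and the helper name `sample_mean_moments` are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def sample_mean_moments(n, p):
    """Exact mean and variance of X_bar for n i.i.d. Bernoulli(p) RVs.

    X_bar takes the value k/n with the Binomial(n, p) probability of k.
    """
    pmf = {
        Fraction(k, n): comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
    }
    mean = sum(x * w for x, w in pmf.items())
    var = sum((x - mean) ** 2 * w for x, w in pmf.items())
    return mean, var

p = Fraction(3, 5)  # here mu = p and sigma^2 = p * (1 - p)
for n in (1, 4, 25):
    mean, var = sample_mean_moments(n, p)
    # The last two printed values should agree: Var(X_bar) = sigma^2 / n.
    print(n, mean, var, p * (1 - p) / n)
```

Because the arithmetic uses `Fraction`, the agreement with $\mu$ and $\sigma^2 / n$ is exact rather than approximate.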


Definition. sample median

The middle value of the sample.

Definition. sample mode

The most frequent value in the sample.

Variability Measures of a Sample

Definition. sample variance

Let $X_1, \dots, X_n$ be a random sample with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2$.

$$S^2 := \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

Why divide by (n-1)?

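Q. Why $(n-1)$ in the denominator?

A. Because dividing by $(n-1)$ is what makes the expected value of the sample variance, $E[S^2]$, equal to $\sigma^2$!

Proof.

W.l.o.g. we can assume that $E[X_i] = 0$. (This simply centers $X_i$ for convenience; $S^2$ does not change when every $X_i$ is shifted by the same constant.)

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i^2 - 2 X_i \bar{X} + \bar{X}^2 \right) = \frac{1}{n-1} \left\{ \sum_{i=1}^{n} X_i^2 - 2 \bar{X} \sum_{i=1}^{n} X_i + n \bar{X}^2 \right\}$$

Here, $\sum_{i=1}^{n} X_i$ equals $n \bar{X}$ by the definition of $\bar{X}$.

$$S^2 = \frac{1}{n-1} \left\{ \sum_{i=1}^{n} X_i^2 - 2 \bar{X} \cdot n \bar{X} + n \bar{X}^2 \right\} = \frac{1}{n-1} \left\{ \sum_{i=1}^{n} X_i^2 - n \bar{X}^2 \right\}$$

Now take the expectation of both sides.

$$\begin{aligned}
E[S^2] &= \frac{1}{n-1} \left\{ \sum_{i=1}^{n} E[X_i^2] - n \cdot E[\bar{X}^2] \right\} \\
&= \frac{1}{n-1} \left\{ \sum_{i=1}^{n} \left( \sigma^2 + \cancelto{0}{E[X_i]^2} \right) - n \cdot E[\bar{X}^2] \right\} \\
&= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - n \cdot \frac{1}{n^2} \cdot E[(X_1 + \cdots + X_n)^2] \right\} \\
&= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \frac{1}{n} \cdot \left( n \cdot E[X_1^2] + \sum_{i \ne j} E[X_i X_j] \right) \right\}
\end{aligned}$$

Since the $X_i$ are independent, $E[X_i X_j] = E[X_i] E[X_j] = 0 \cdot 0 = 0$ for $i \ne j$.

$$\begin{aligned}
E[S^2] &= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \frac{1}{n} \cdot \left( n \cdot E[X_1^2] + \cancelto{0}{\sum_{i \ne j} E[X_i X_j]} \right) \right\} \\
&= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \frac{1}{n} \cdot n \cdot \underbrace{E[X_1^2]}_{= \sigma^2} \right\} \\
&= \frac{1}{n-1} \left\{ n \cdot \sigma^2 - \sigma^2 \right\} = \sigma^2
\end{aligned}$$

$\blacksquare$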
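The result can also be verified by brute force on a small example. The sketch below enumerates every possible sample of size $n$ from a three-point uniform distribution and averages the variance estimate exactly, once with divisor $n-1$ and once with divisor $n$. The support `(0, 1, 4)`, the sample size, and the helper name `expected_variance_estimate` are arbitrary illustrative choices. The mean of this distribution is not zero, so the check does not rely on the centering assumption used in the proof.

```python
from fractions import Fraction
from itertools import product

def expected_variance_estimate(values, n, divisor):
    """E[ sum_i (X_i - X_bar)^2 / divisor ] for X_i i.i.d. uniform on `values`.

    All len(values)**n samples are equally likely, so the expectation
    is the plain average over the enumerated samples.
    """
    samples = list(product(values, repeat=n))
    total = Fraction(0)
    for sample in samples:
        mean = Fraction(sum(sample), n)
        total += sum((x - mean) ** 2 for x in sample) / divisor
    return total / len(samples)

values = (0, 1, 4)
mu = Fraction(sum(values), len(values))
sigma2 = sum((v - mu) ** 2 for v in values) / len(values)  # population variance

n = 3
print("sigma^2        :", sigma2)
print("divide by n - 1:", expected_variance_estimate(values, n, n - 1))
print("divide by n    :", expected_variance_estimate(values, n, n))
```

Dividing by $n-1$ reproduces $\sigma^2$ exactly, while dividing by $n$ gives $\frac{n-1}{n} \sigma^2$, which underestimates the population variance on average.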


Definition. sample standard deviation

$$S := \sqrt{S^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$

Definition. range

$$R := \max_{1 \le i \le n} X_i - \min_{1 \le i \le n} X_i$$
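For observed data, the location and variability measures defined above can be computed with Python's standard `statistics` module. This is a small sketch on a made-up sample; the data values are purely illustrative.

```python
import statistics

sample = [2, 3, 3, 5, 7, 10]  # made-up observations x_1, ..., x_n

mean = statistics.mean(sample)      # sample mean
median = statistics.median(sample)  # sample median (average of the two middle values here)
mode = statistics.mode(sample)      # sample mode
s2 = statistics.variance(sample)    # sample variance, divisor n - 1
s = statistics.stdev(sample)        # sample standard deviation, sqrt of s2
r = max(sample) - min(sample)       # range

print(mean, median, mode, s2, s, r)
```

Note that `statistics.variance` uses the $n-1$ divisor, matching $S^2$ above, whereas `statistics.pvariance` divides by $n$.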

Sampling Distribution

Definition. sampling distribution

The probability distribution of a sample statistic is called a <sampling distribution>.

ex) distribution of the sample mean, distribution of the sample variance, …

Here, a sample statistic is a representative value that characterizes the sample, such as the sample mean or the sample variance. For details, see the two posts below!

Please describe two projects you have worked on in detail, including the technologies used, your role, how the work was carried out, and the results.

[Operating and improving the Events pipeline] I have built and operated a pipeline that ingests 400 million events (500GB) of traffic per day without loss. I started out as the person responsible for maintaining the pipeline, and I now lead the Data Platform unit, driving the direction and improvement of the data team's overall AWS and Kubernetes infrastructure, including the event pipeline.

The event pipeline is an end-to-end system that logs every event generated in-game and loads it into the data lake. The Events Producer, which handles events first, is built as a Serverless Application on Istio and Knative. It is designed to scale automatically as traffic changes, so data is collected reliably and without loss even under heavy traffic. Events are written to Kafka topics. Initially all events were stored in a single topic, which made it hard to monitor Sink Connector lag and blocked the setup of the Events ETL. To solve this, I developed logic that splits events into dedicated Kafka topics per event type, which moved more than 50% of the load off the single topic and made it possible to develop a Flink-based Events ETL.

I also integrated Prometheus so that situation-specific alerts fire when the pipeline fails, and set up dummy data to be sent automatically every minute so that the pipeline's health is checked continuously. In addition, I defined custom Prometheus metrics to monitor the trend of Success/Fail/DLQ messages per Producer in real time, which helped related teams understand and respond to incidents more clearly.

For deployment, we kept the existing Pulumi-based IaC approach, but I found that PR reviews required many unnecessary clicks. To fix this, I developed an automated, PR-comment-driven Github Workflow modeled on Atlantis, which simplified the deployment process.

Raw event data plays an important role in game operations and analytics, so stable operation of the pipeline and incident response were essential. To that end, I built a fine-grained monitoring system, set up an easy deployment process, and ran sessions on Istio and Kubernetes infrastructure for team members, creating an environment in which the team can deploy safely and quickly together. As a result, Kafka load was distributed effectively, the Sink Connector lag problem was resolved, and real-time monitoring and automated health checks greatly improved incident response speed.

[Metastore design and migration] I migrated more than 10,000 tables and more than 1,000 workflows from Hive Metastore to Databricks Unity Catalog, optimizing data lakehouse performance and improving the team's productivity. As lead of the migration task force (TF), I drove the project from March to August 2024 and was responsible for designing the overall process and for debugging.

Before the migration, I used assessment tools and custom queries to analyze compatibility between the existing Hive Metastore and Unity Catalog, and designed a new Catalog structure that improved data accessibility and management efficiency. The data team had been using Spark and Databricks since 2018, so the legacy Notebooks, Workflows, ACLs, and so on that had accumulated over that time needed to be cleaned up and optimized for the current environment. In Notebooks managed via GitOps, I replaced legacy APIs that relied on the Spark Context, as well as Databricks Legacy APIs, with current APIs. I also redesigned Spark UDFs and changed the Partitioned metadata configuration to resolve Spark Session conflicts that occurred on Shared Clusters.

To optimize query performance, I changed the data format from Parquet to Delta Lake and introduced incremental updates, reducing S3 requests to about 10% of the previous level. Several pipeline failures occurred during this process, and related departments complained about the inconvenience caused by the migration. I responded by communicating actively: I provided before-and-after performance comparisons and emphasized that the Auditing and Lineage features would increase how much value could be drawn from the data.

As a result, all metadata can now be managed in one place in Unity Catalog, and we achieved faster queries and lower resource usage. The migration also laid the groundwork for new uses of data, such as LLM Agent development and Audit Log-based monitoring. The repetitive work and incident response during the project were a heavy burden, but through them I gained a deep understanding of Spark and built up skills in performance optimization and incident response.

Please describe a memorable troubleshooting experience in as much detail as possible.

The most memorable one is building the in-house Data Discovery Platform (DDP) with Datahub. When I worked as a Data Scientist at Bagelcode, related departments frequently asked for meta-information about tables and columns and for data extracts. To address this, I ran "Query101" sessions for non-developer roles, covering how to extract data and introducing tools such as Superset, Amundsen, and the in-house schema page, but it was still difficult for non-developer teams to extract data themselves.

The same problem persisted after I moved to a Data Engineer role, so as my first project I took on building a new Data Discovery Platform (DDP) to replace the existing Amundsen. The goal was to run data search and metadata management more systematically using the open-source Datahub, and I set workload deployment and automated metadata ingestion as the main tasks.

During the initial workload deployment, a security issue occurred in which an anonymous user discovered the Admin account on the development cluster. Fortunately, it was a development cluster not connected to production, so I was able to respond by changing the password immediately. Debugging revealed that the Admin account password was set to a fixed value in Datahub's Helm Chart. I judged this to be a security vulnerability that could affect not only me but every Datahub user, so I immediately opened an Issue on GitHub and then wrote a PR myself to fix it. The fix was merged into the Datahub project, and this became my first experience contributing to a large open-source project.

In automating metadata ingestion, the list of Databricks tables and their metadata were collected correctly, but the dependencies between tables were not visible. Understanding the relationships between tables is important for data analysis and ETL, and without it data flows were hard to trace. I therefore introduced an approach that parses the Spark Query Plan, stores the dependencies between tables separately, and reflects them in the metadata platform. With this improvement, lineage between tables became visible, and when incidents occurred team members could plan and run backfills based on it. I also implemented an Airflow-Datahub integration that links Databricks tables to Airflow Tasks, extending the platform so that the connections between data and workflows can be seen at a glance.

Through this project, data visibility improved across the whole company, not only for the data team but also for non-developer roles. It was also very meaningful that, in solving these technical problems, I was able to improve security, contribute to open source, and grow my data engineering skills.