Cleaning Up Stale Parquet Files to Reclaim Storage 🧹

3 minute read

At work, we run our Spark clusters on Databricks. I should say up front that this post is written with Databricks in mind.

Vacuum

A Delta table is versioned through its _delta_log/ folder, and every DML operation writes the affected .parquet files out again as new files, so data only keeps piling up as more operations run. That accumulated data is exactly what makes Time Travel possible in Delta, but Time Travel is not always needed, and past a certain point, cleaning up stale changes and .parquet files is the way to use storage efficiently. Delta provides the VACUUM command to clean up these stale files.

VACUUM <schema>.<table> RETAIN 120 HOURS

VACUUM does not run automatically; you have to run it yourself to clean up old data. Stale .parquet files have a Deleted File Retention of 7 days by default. If you run VACUUM without specifying a retention period, it cleans up data according to the delta.deletedFileRetentionDuration value set on the table.

ALTER TABLE <schema>.<table>
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 14 days')
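To make the retention windows above concrete, here is a minimal Python sketch of the arithmetic only. It converts RETAIN 120 HOURS and the 7-day default into time spans and computes the cutoff before which stale files fall outside the window. The helper name `vacuum_cutoff` and the sample timestamp are made up for illustration, and this is not how Delta implements VACUUM internally.

```python
from datetime import datetime, timedelta, timezone

# Default Deleted File Retention mentioned above: 7 days.
DEFAULT_RETENTION = timedelta(days=7)


def vacuum_cutoff(now: datetime, retention: timedelta = DEFAULT_RETENTION) -> datetime:
    """Hypothetical helper: files that went stale before this instant are past retention."""
    return now - retention


now = datetime(2024, 1, 15, tzinfo=timezone.utc)  # arbitrary sample time
retain_120_hours = timedelta(hours=120)

print(retain_120_hours == timedelta(days=5))  # True: 120 hours is 5 days
print(retain_120_hours < DEFAULT_RETENTION)   # True: shorter than the 7-day default
print(vacuum_cutoff(now, retain_120_hours))   # 2024-01-10 00:00:00+00:00
print(vacuum_cutoff(now))                     # 2024-01-08 00:00:00+00:00
```

Since 120 hours is shorter than the 7-day default, my understanding is that Delta's retention safety check (spark.databricks.delta.retentionDurationCheck.enabled) would reject such a VACUUM unless the table's delta.deletedFileRetentionDuration has been lowered or the check is turned off, so it is worth verifying against your own table settings.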

(Aside) Spark CalendarInterval

Note that the Retention Duration is written in the form interval ..., and the Delta Lake documentation says it must follow Spark's CalendarInterval format. [docs]

If you look up the corresponding Spark reference ([link]), CalendarInterval has three fields: days, months, and microseconds.
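As a rough illustration of that three-field shape, the Python sketch below models an interval with the same field names. `IntervalSketch`, `from_days`, and `from_hours` are hypothetical names for this post, not Spark APIs, and the sketch assumes that units smaller than a day have no field of their own and are therefore stored as microseconds.

```python
from dataclasses import dataclass

MICROS_PER_SECOND = 1_000_000
MICROS_PER_HOUR = 60 * 60 * MICROS_PER_SECOND


@dataclass(frozen=True)
class IntervalSketch:
    """Illustrative stand-in with the same three fields as CalendarInterval."""
    months: int = 0
    days: int = 0
    microseconds: int = 0


def from_days(n: int) -> IntervalSketch:
    # Whole days map directly onto the days field.
    return IntervalSketch(days=n)


def from_hours(n: int) -> IntervalSketch:
    # Assumption: hours have no dedicated field, so they land in microseconds.
    return IntervalSketch(microseconds=n * MICROS_PER_HOUR)


print(from_days(14))    # IntervalSketch(months=0, days=14, microseconds=0)
print(from_hours(120))  # IntervalSketch(months=0, days=0, microseconds=432000000000)
```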

So it seems the Retention Duration also has to be expressed in terms of these fields. If, say, one hour

Log Retention

Every Delta table records its operations in a special folder called _delta_log/. Both the operation history we check with DESC HISTORY ... and the Time Travel feature exist thanks to the information recorded in _delta_log/.

Each time a checkpoint is written, Delta Lake automatically cleans up log entries older than the retention interval

Delta does not keep its operation history forever; the Log Retention setting determines how much history is kept. Entries older than the retention period are cleaned up each time a checkpoint is written for the Delta table. The Log Retention value can be changed with the command below.

ALTER TABLE <schema>.<table>
SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 14 days')

Note that changing a table property like this is itself recorded in the Delta history as SET TBLPROPERTIES.

As mentioned above, stale logs are cleaned up when a checkpoint is written. I used to think that running VACUUM cleaned up the stale logs as well. (A bit embarrassing.)
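The timing can be pictured with a toy Python model. It is a deliberately simplified sketch, not Delta's actual cleanup logic: `write_checkpoint` is a hypothetical function, the commit versions and timestamps are invented, and real Delta applies additional rules about which log files it can safely drop.

```python
from datetime import datetime, timedelta, timezone


def write_checkpoint(log, now, retention):
    """Toy model: writing a checkpoint also drops log entries older than the retention."""
    cutoff = now - retention
    return {version: ts for version, ts in log.items() if ts >= cutoff}


now = datetime(2024, 1, 31, tzinfo=timezone.utc)  # arbitrary sample time

# Hypothetical commit versions mapped to their commit times.
log = {
    0: now - timedelta(days=20),
    1: now - timedelta(days=15),
    2: now - timedelta(days=3),
}

# Nothing is removed until a checkpoint is written; VACUUM is not involved here.
log_after = write_checkpoint(log, now, retention=timedelta(days=14))
print(sorted(log_after))  # [2]
```

With a 14-day log retention, the entries from 20 and 15 days ago disappear at checkpoint time, while the 3-day-old entry stays.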

References