2020-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜์•Œ๊ณ ๋ฆฌ์ฆ˜โ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)

8 minute read

2020-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜์•Œ๊ณ ๋ฆฌ์ฆ˜โ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)

์•„์‰ฝ๊ฒŒ๋„ <Huffman Encoding>์— ๋Œ€ํ•œ ๋ฐฑ์ค€ ๋ฌธ์ œ๋ฅผ ์ฐพ์ง€ ๋ชป ํ–ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์ž์ฒด์ ์œผ๋กœ ๋ฌธ์ œ์™€ ํ…Œ์ŠคํŠธ ์ผ€์ด์Šค๋ฅผ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋งŒ๋“ค์–ด ์ด๊ฒƒ์„ ๊ธฐ์ค€์œผ๋กœ ๋ฌธ์„œ๋ฅผ ์ž‘์„ฑํ•˜๋„๋ก ํ•˜๊ฒ ๋‹ค ๐Ÿ™


Situation

Consider a string consisting of 130 million characters and the alphabet {A, B, C, D}. CDABDBACCDDBBCCDCC... How to write this long string in binary with most economic way?


* Input

$1 \le N \le 20$ length alphabetical string.

* Output

Most economic binary encoding.


* Testcase 1

# input
AAAABBBCCD

# output
0000101010111111110

* Testcase 2

# input
ABBCCCBBA

# output
10001111110010

โ€ป NOTE: Several optimal encodings may exist.

์‚ฌ์‹ค ์•„๋ž˜์˜ ์‚ฌ์ดํŠธ์—์„œ ๋ช‡๊ฐ€์ง€ ํ…Œ์ŠคํŠธ์ผ€์ด์Šค๋ฅผ ์ง์ ‘ ๋งŒ๋“ค์–ด์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

๐Ÿ‘‰ Huffman Coding


<Huffman Encoding>์€ ๋ฌธ์ž์—ด์„ ์••์ถ•ํ•˜๋Š” ๋Œ€ํ‘œ์ ์ธ ์ธ์ฝ”๋”ฉ ๋ฐฉ์‹์ด๋‹ค. ๊ฐ€๋ณ€ ๊ธธ์ด ์ธ์ฝ”๋”ฉ(variable-length encoding) ๋ฐฉ์‹์ธ๋ฐ ๊ฐ ๋ฌธ์ž์˜ ์ถœํ˜„ ๋นˆ๋„์— ๋”ฐ๋ผ ์ž์ฃผ ์ถœํ˜„ํ•˜๋Š” ๋ฌธ์ž๋ผ๋ฉด ๋” ์งง์€ ์ฝ”๋“œ๋ฅผ ๋ถ€์—ฌํ•œ๋‹ค.

๊ณ ์ • ๊ธธ์ด ์ธ์ฝ”๋”ฉ ๋ฐฉ์‹๊ณผ ๊ฐ€๋ณ€ ๊ธธ์ด ์ธ์ฝ”๋”ฉ ๋ฐฉ์‹์„ ํ•œ๋ฒˆ ๋น„๊ตํ•ด๋ณด์ž.

1. ๊ณ ์ • ๊ธธ์ด ์ธ์ฝ”๋”ฉ

A = 00, B = 01, C = 10, D = 11

AAAABBBCCD -> 0000 0000 0101 0110 1011 (20 bits)

2. ๊ฐ€๋ณ€ ๊ธธ์ด ์ธ์ฝ”๋”ฉ

A = 0, B = 10, C = 111, D = 110

AAAABBBCCD -> 0000 1010 1011 1111 110 (19 bits)

์—โ€ฆ ์‚ฌ์‹ค ์œ„์˜ ์˜ˆ์ œ์—์„œ๋Š” ๊ทธ๋ ‡๊ฒŒ ๋งŽ์ด ์ค„์ง€ ์•Š์•˜์ง€๋งŒ, ๋งŒ์•ฝ ํŠน์ • ๋ฌธ์ž์˜ ์ถœํ˜„ ๋นˆ๋„๊ฐ€ ๋‹ค๋ฅธ ๋ฌธ์ž๋ณด๋‹ค ์••๋„์ ์œผ๋กœ ๋งŽ๋‹ค๋ฉด ๊ฐ€๋ณ€ ๊ธธ์ด ์ธ์ฝ”๋”ฉ์€ ์ข‹์€ ์„ ํƒ์ด ๋  ์ˆ˜ ์žˆ๋‹ค! (์˜ˆ๋ฅผ ๋“ค์–ด ์˜์–ด์—์„œ๋Š” ๋ชจ์Œ์— ํ•ด๋‹นํ•˜๋Š” a, i, u, e, o ๋ฌธ์ž๊ฐ€ ์ž์ฃผ ๋“ฑ์žฅํ•  ๊ฒƒ์ด๋‹ค.)


๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฐ <๊ฐ€๋ณ€ ๊ธธ์ด ์ธ์ฝ”๋”ฉ> ๋ฐฉ์‹์€ encoding-decoding์ด ์œ ์ผ(unique)ํ•ด์•ผ ํ•œ๋‹ค๋Š” ์กฐ๊ฑด์„ ๋งŒ์กฑํ•ด์•ผ ํ•œ๋‹ค! ์šฐ๋ฆฌ๊ฐ€ ์•”ํ˜ธํ™”(encoding) ํ–ˆ์ง€๋งŒ ์ •ํ™•ํžˆ ๋ณตํ˜ธํ™”(decoding)ํ•˜์ง€ ๋ชปํ•œ๋‹ค๋ฉด ๊ทธ๊ฒƒ์€ ์ „ํ˜€ ๋„์›€์ด ๋˜์ง€ ์•Š๋Š”๋‹ค. <Huffman Encoding>์—์„œ๋Š” ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด prefix-free ๋ฐฉ์‹์œผ๋กœ ๋ฌธ์ž๋ฅผ ์ธ์ฝ”๋”ฉ ํ•œ๋‹ค. ์ด๊ฒƒ์€ ์–ด๋–ค ์ธ์ฝ”๋”ฉ ์ฝ”๋“œ๋ผ๋„ ๋‹ค๋ฅธ ์ธ์ฝ”๋“œ์˜ prefix๊ฐ€ ๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๋ง์ด๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด A๋ฅผ 101์œผ๋กœ ์ธ์ฝ”๋”ฉ ํ–ˆ๋‹ค๋ฉด, 1, 10, 101๋กœ๋Š” ์ ˆ๋Œ€ ์ธ์ฝ”๋”ฉ ํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ทœ์น™์ด๋‹ค.

<Huffman Encoding>์€ ์•„๋ž˜์™€ ๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์ตœ์ ์˜ ์ธ์ฝ”๋”ฉ ๋ฐฉ์‹์„ ์ฐพ๋Š”๋‹ค.

1. ๋ฌธ์ž์—ด์—์„œ ๋ฌธ์ž๋“ค์˜ ๋นˆ๋„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค: $\{f_i \mid i \in \Omega \}$

2. ๋นˆ๋„์ˆ˜ $f_i$๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ฌธ์ž ์ง‘ํ•ฉ์„ ์ •๋ ฌํ•œ๋‹ค.

3. ์ •๋ ฌ๋œ ๋ฌธ์ž ๋ฐฐ์—ด์—์„œ ๊ฐ€์žฅ ์ž‘์€ ๋นˆ๋„๋ฅผ ๊ฐ€์ง€๋Š” 2๊ฐœ ๋ฌธ์ž๋ฅผ ๊บผ๋‚ด ๊ฐ๊ฐ์„ left child, right child๋กœ ๊ฐ–๋Š” ํŠธ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋‘ ํŠธ๋ฆฌ์˜ ๋นˆ๋„์ˆ˜๋ฅผ ํ•ฉ์นœ ๋…ธ๋“œ๋ฅผ ๋‹ค์‹œ ๋ฌธ์ž ๋ฐฐ์—ด์— ์‚ฝ์ž…ํ•œ๋‹ค.

4. [2-3] ๊ณผ์ •์„ ๋ฌธ์ž ๋ฐฐ์—ด์— ๋‹จ ํ•˜๋‚˜์˜ ๋…ธ๋“œ๋งŒ ๋‚จ์„ ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณตํ•œ๋‹ค.


์œ„์™€ ๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด <์šฐ์„ ์ˆœ์œ„ ํ>๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ์ด ์ž๋ฃŒ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด [2] ๊ณผ์ •์˜ ์ •๋ ฌ์„ ์ง์ ‘ ์ˆ˜ํ–‰ํ•˜์ง€ ์•Š์•„๋„ ๋œ๋‹ค! ๊ทธ๋ž˜์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹ค์‹œ ์“ฐ๋ฉด,

1. ๋ฌธ์ž์—ด์—์„œ ๋ฌธ์ž๋“ค์˜ ๋นˆ๋„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค: $\{f_i \mid i \in \Omega \}$

2. insert all $f_i$ into a priority queue $H$

3. repeat this until there left one node in $H$

3-1. $u = \texttt{deletemin}(H)$, $v = \texttt{deletemin}(H)$

3-2. create a node $k$ with children $u$ and $v$ and $f_k = f_u + f_v$

3-3. insert new node into $H$

๊ตฌํ˜„

<Huffman Encoding>์„ C++๋กœ ๊ตฌํ˜„ํ•ด๋ณด์ž!

๋ฌธ์ž์—ด์„ ์ž…๋ ฅ๋ฐ›๊ณ , ๋นˆ๋„ ์ˆ˜์— ๋Œ€ํ•œ Map์€ ์•„๋ž˜์™€ ๊ฐ™์ด ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.

string ss;
cin >> ss; // Assumption. only uppercase Alphabet

// generate frequency map
map<char, int> freq_map;
for (int i = 0; i < ss.length(); i++) {
  if (freq_map[ss[i]]) {
    freq_map[ss[i]] += 1;
  } else {
    freq_map[ss[i]] = 1;
  }
}

๋‹ค์Œ์€ Priority Queue $H$์— ๋…ธ๋“œ๋ฅผ ๋„ฃ๋Š” ๊ฒƒ์ธ๋ฐ, ๊ทธ ์ „์— Node ์ž๋ฃŒ๊ตฌ์กฐ๋ฅผ ์ง์ ‘ ์ •์˜ํ•ด์•ผ ํ•œ๋‹ค.

struct Node {
  int freq;
  char character;
  Node *left;
  Node *right;
};

๊ทธ๋ฆฌ๊ณ  PQ $H$์— ์“ฐ๊ธฐ ์œ„ํ•ด ๋น„๊ต์—ฐ์‚ฐ์ž๋„ ์ง์ ‘ ๊ตฌํ˜„ํ•ด์ค€๋‹ค.

struct Node_greater {
  bool operator()(const Node *left, const Node *right) {
    return left->freq > right->freq;
  }
};

๊ทธ๋ฆฌ๊ณ  PQ $H$๋ฅผ ์ƒ์„ฑํ•˜๋ฉด,

// insert all nodes into priority queue
priority_queue<Node *, vector<Node *>, Node_greater> pq;
for (auto &freq_pair: freq_map) {
  char character = freq_pair.first;
  int freq = freq_pair.second;
  Node *node = new Node({freq, character, NULL, NULL});
  pq.push(node);
}

์ฒ˜์Œ์—๋Š” priorirty_queue<Node, ...>๋กœ ๊ตฌํ˜„ํ–ˆ๋Š”๋ฐ ์ฃผ์†Œ ํ• ๋‹น์—์„œ ๋ญ”๊ฐ€ ๊ผฌ์ด๋Š” ๋Š๋‚Œ์ด ์žˆ์–ด์„œ ๋™์ž‘ ํ• ๋‹น๊ณผ priority_queue<Node *, ...>๋ฅผ ์“ฐ๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๊ตฌํ˜„ํ–ˆ๋‹ค.

๋‹ค์Œ์€ PQ $H$์—์„œ ๋…ธ๋“œ๋ฅผ 2๊ฐœ์”ฉ ๊บผ๋‚ด์„œ Huffman Tree๋ฅผ ์ƒ์„ฑํ•˜๋ฉด ๋œ๋‹ค.

// generate Huffman tree
while (pq.size() > 1) {
  Node *node1 = pq.top();
  pq.pop();
  Node *node2 = pq.top();
  pq.pop();
  Node *newNode = new Node({node1->freq + node2->freq, '0', node1, node2});
  pq.push(newNode);
}

๊ทธ๋Ÿฌ๋‚˜ ์‹ค์ œ๋กœ ๋ฌธ์ž๋ฅผ ์ธ์ฝ”๋”ฉ ํ•˜๊ธฐ ์œ„ํ•ด์„  ์œ„์˜ ๊ณผ์ •์œผ๋กœ ์–ป์€ PQ $H$์— ์ €์žฅ๋œ ๋งˆ์ง€๋ง‰ ๋…ธ๋“œ์ธ root ๋…ธ๋“œ๋ฅผ ์‚ฌ์šฉํ•ด Huffman Tree๋ฅผ ์ˆœํšŒํ•ด์•ผ ํ•œ๋‹ค. ์ด๊ฒƒ์€ DFS๋กœ ์•„๋ž˜์™€ ๊ฐ™์ด ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

void dfs(map<char, string> &encode_map, const Node &node, string encode) {
  if (!node.left && !node.right) {
    encode_map[node.character] = encode;
  } else {
    dfs(encode_map, *node.left, encode + "0");
    dfs(encode_map, *node.right, encode + "1");
  }
}
...

// encoding using Huffman tree!
Node *root = pq.top();
map<char, string> encode_map;
dfs(encode_map, *root, "");

for (auto &encode: encode_map) {
  cout << encode.first << ": " << encode.second << "\n";
}

์œ„์™€ ๊ฐ™์ด ์ž˜ ๊ตฌํ˜„ํ•˜๋ฉด ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

# Testcase 1
(A 4) (B 3) (C 2) (D 1)
A: 0
B: 10
C: 111
D: 110

# Testcase 2
ABBCCCBBA
(A 2) (B 4) (C 3)
A: 10
B: 0
C: 11

๋ณธ ํฌ์ŠคํŠธ๋Š” <Greedy Algorithm> ์‹œ๋ฆฌ์ฆˆ์˜ ์ผ๋ถ€ ์ž…๋‹ˆ๋‹ค. ๋” ๋งŽ์€ <Greedy Algorithm>์— ๋Œ€ํ•ด ์‚ดํŽด๋ณด๊ณ  ์‹ถ๋‹ค๋ฉด ์•„๋ž˜์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค ๐Ÿ˜

๐Ÿ‘‰ Greedy Algorithm

Categories:

Updated: