์ด ํฌ์ŠคํŠธ๋Š” ์ œ๊ฐ€ ๊ฐœ์ธ์ ์ธ ์šฉ๋„๋กœ ์ •๋ฆฌํ•œ ๊ธ€ ์ž…๋‹ˆ๋‹ค.




์ด๊ณณ์—์„œ ๋งŽ์€ ๋„์›€์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค! ํฌ์ŠคํŠธ์— ์‚ฌ์šฉ๋œ ์‚ฌ์ง„์€ ๋Œ€๋ถ€๋ถ„ ์ขŒ์ธก์˜ ๋งํฌ์—์„œ ๊ฐ€์ ธ์˜จ ๊ฒƒ ์ž…๋‹ˆ๋‹ค!

Attention Mechanism (2015)

Limitations of the existing seq2seq model

seq2seq ๋ชจ๋ธ์€ ์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ตฌ์กฐ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Œ. ํ•˜์ง€๋งŒ, ์ธ์ฝ”๋”์—์„œ ์ž…๋ ฅ sequence๋ฅผ ํ•˜๋‚˜์˜ context vector๋กœ ์••์ถ•ํ•˜๋Š” ๊ณผ์ •์—์„œ ์ž…๋ ฅ์˜ ์ •๋ณด๊ฐ€ ์ผ๋ถ€ ์†์‹ค๋จ.

๊ทธ๋ฆฌ๊ณ  RNN์˜ ๊ณ ์งˆ์ ์ธ ๋ฌธ์ œ์ธ ์ž…๋ ฅ ์‹œํ€€์Šค๊ฐ€ ๊ธธ์–ด์ง€๋ฉด, ์ถœ๋ ฅ ์‹œํ€€์Šค์˜ ์ •ํ™•๋„๊ฐ€ ๋–จ์–ด์ง„๋‹ค๋Š” ๋ฌธ์ œ๋„ ๋ฐœ์ƒํ•จ.

์ด๊ฒƒ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ œ์‹œ๋œ ๊ฒƒ์ด Attention Mechanism!!

Attention Mechanism

โ€œAttention์˜ ๊ธฐ๋ณธ ์•„์ด๋””์–ด๋Š” ๋””์ฝ”๋”์—์„œ ์ถœ๋ ฅ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋งค ์‹œ์ (time step)๋งˆ๋‹ค, ์ธ์ฝ”๋”์—์„œ์˜ ์ „์ฒด ์ž…๋ ฅ ๋ฌธ์žฅ์„ ๋‹ค์‹œ ํ•œ๋ฒˆ ์ฐธ๊ณ ํ•œ๋‹ค๋Š” ์ ์ด๋‹ค.

๋‹จ, ์ž…๋ ฅ ๋ฌธ์žฅ ์ „์ฒด๋ฅผ ์ „๋ถ€ ๋™์ผํ•œ ๋น„์œจ๋กœ ์ฐธ๊ณ ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ํ•ด๋‹น ์‹œ์ ์—์„œ ์˜ˆ์ธกํ•ด์•ผ ํ•  ๋‹จ์–ด์™€ ์—ฐ๊ด€์ด ์žˆ๋Š” ์ž…๋ ฅ ๋‹จ์–ด ๋ถ€๋ถ„์„ ์ข€๋” ์ง‘์ค‘(atteion)ํ•ด์„œ ๋ณด๊ฒŒ ๋œ๋‹ค!โ€

Attention Function

Attention Function์€ ์ฃผ์–ด์ง„ Query $Q$์— ๋Œ€ํ•ด ๋ชจ๋“  ํ‚ค $K$์™€์˜ โ€œ์œ ์‚ฌ๋„โ€๋ฅผ ๊ฐ๊ฐ ๊ตฌํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด ์œ ์‚ฌ๋„๋ฅผ ํ‚ค์™€ ๋ฌถ์—ฌ์žˆ๋Š” ๊ฐ’ $V$์— ๋ฐ˜์˜ํ•ด์ค€๋‹ค. ์ถœ๋ ฅ์ธ Attention Value๋Š” ์ด ๊ฐ’ $V$์— ์œ ์‚ฌ๋„๋ฅผ ๋ฐ˜์˜ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ชจ๋‘ ๋”ํ•ด์„œ ๋ฆฌํ„ดํ•œ ๊ฒƒ์ด๋‹ค.

์ด์ œ $Q$, $K$, $V$๋ฅผ ์ข€๋” ๊ตฌ์ฒดํ™” ํ•ด๋ณด์ž.

  • Query $Q$: $t$ ์‹œ์ ์˜ ๋””์ฝ”๋” ์…€์—์„œ์˜ ์€๋‹‰ ์ƒํƒœ
  • Keys $K$: ๋ชจ๋“  ์‹œ์ ์˜ ์ธ์ฝ”๋” ์…€์˜ ์€๋‹‰ ์ƒํƒœ
  • Values $V$: ๋ชจ๋“  ์‹œ์ ์˜ ์ธ์ฝ”๋” ์…€์˜ ์€๋‹‰ ์ƒํƒœ
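
To make the function concrete, here is a minimal NumPy sketch of dot-product attention. The shapes and names (query, keys, values) are my own toy choices for illustration, not anything prescribed by the papers.

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention: query (d,), keys (N, d), values (N, d)."""
    scores = keys @ query                    # similarity of the query with every key, shape (N,)
    weights = np.exp(scores - scores.max())  # softmax over the scores ...
    weights /= weights.sum()                 # ... gives the attention distribution
    return weights @ values                  # weighted sum of the values = attention value, shape (d,)

# Toy usage: 4 encoder hidden states of dimension 3, one decoder hidden state as the query.
keys = values = np.random.randn(4, 3)  # in seq2seq attention, keys and values are the same encoder states
query = np.random.randn(3)
print(attention(query, keys, values))
```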

Dot-Product Attention

Attention์„ ์ ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์—๋„ ์—ฌ๋Ÿฌ ์ข…๋ฅ˜๊ฐ€ ์žˆ๋Š”๋ฐ, โ€œDot-Product Attentionโ€์€ ๊ทธ์ค‘์—์„œ๋„ ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ๋ฐฉ์‹์ด๋‹ค.

๋ณธ๋ž˜ ๋””์ฝ”๋”์—์„  $t$ ์‹œ์ ์—์„œ์˜ ์ถœ๋ ฅ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด ๋‘ ๊ฐ€์ง€ ์ž…๋ ฅ๊ฐ’์ด ํ•„์š”ํ–ˆ๋‹ค.

  1. $t-1$ ์‹œ์ ์˜ ์€๋‹‰์ƒํƒœ; $s_{t-1}$
  2. $t-1$ ์‹œ์ ์˜ ์ถœ๋ ฅ๋‹จ์–ด

๊ทธ๋Ÿฐ๋ฐ Attention Mechanism์—์„œ๋Š” ์ถœ๋ ฅ ๋‹จ์–ด ์˜ˆ์ธก์— Attention Value $a_t$๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ ์ถœ๋ ฅ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๊ฒŒ ๋œ๋‹ค!


Step 1: Calculate Attention Scores

์ธ์ฝ”๋”์˜ ์€๋‹‰์ƒํƒœ $h_i$์™€ ๋””์ฝ”๋”์˜ ํ˜„์žฌ ์‹œ์  $t$์—์„œ์˜ ์€๋‹‰์ƒํƒœ๋ฅผ $s_t$๋ผ๊ณ  ํ•˜์ž. ๊ทธ๋ฆฌ๊ณ  ๋…ผ์˜์˜ ํŽธ์˜๋ฅผ ์œ„ํ•ด $h_i$์™€ $s_t$์˜ ์ฐจ์›์€ ๊ฐ™๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค.

Dot-Product๋ฅผ ํ†ตํ•ด ์ธ์ฝ”๋”์˜ ์€๋‹‰ ์ƒํƒœ ๊ฐ๊ฐ์ด ๋””์ฝ”๋”์˜ ํ˜„์žฌ์˜ ์€๋‹‰ ์ƒํƒœ $s_t$์™€ ์–ผ๋งˆ๋‚˜ ์œ ์‚ฌํ•œ์ง€ ํŒ๋‹จํ•œ๋‹ค. ์ด ์œ ์‚ฌ๋„ ๊ฐ’์„ Attention Score๋ผ๊ณ  ํ•œ๋‹ค.

\[\textrm{score}(s_t, h_i) = {s_t}^T \cdot h_i\]

$s_t$์™€ ์ธ์ฝ”๋”์˜ ๋ชจ๋“  ์€๋‹‰ ์ƒํƒœ์™€์˜ Attention score๋ฅผ ๋ชจ์•„์„œ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•œ ๊ฒƒ์„ $e^t$๋ผ๊ณ  ํ‘œํ˜„ํ•˜์ž. ๊ทธ๋Ÿฌ๋ฉด $e^t$๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[e^t = [{s_t}^T h_1\, , \, \dots \, , \, {s_t}^T h_N]\]
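
As a rough sketch (toy shapes of my own choosing, assuming $h_i$ and $s_t$ share dimension $d$), the whole of Step 1 is a single matrix-vector product:

```python
import numpy as np

N, d = 5, 4                # 5 encoder time steps, hidden dimension 4 (toy values)
H = np.random.randn(N, d)  # rows are the encoder hidden states h_1, ..., h_N
s_t = np.random.randn(d)   # decoder hidden state at time step t

e_t = H @ s_t              # e^t = [s_t^T h_1, ..., s_t^T h_N], shape (N,)
print(e_t)
```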


Step 2: Get Attention Distribution by Softmax

$e^t$์— Softmax ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•ด, ํ™•๋ฅ  ๋ถ„ํฌ Attention Distribution์„ ์–ป๋Š”๋‹ค.
์ด๋•Œ, ๊ฐ๊ฐ์˜ ๊ฐ’์€ Attention Weight๊ฐ€ ๋œ๋‹ค.
(๊ฐ„๋‹จํ•˜๊ฒŒ ์ƒ๊ฐํ•˜๋ฉด, ๊ทธ๋ƒฅ $e^t$๋ฅผ normalizeํ•œ ๊ฒƒ์ด๋‹ค.)

\[\alpha^t = \textrm{softmax}(e^t)\]
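
A minimal softmax over a toy $e^t$ (subtracting the maximum is only a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # shift for numerical stability
    return z / z.sum()

e_t = np.array([2.0, 0.5, -1.0, 0.0, 1.5])  # example attention scores
alpha_t = softmax(e_t)                      # attention distribution, sums to 1
print(alpha_t, alpha_t.sum())
```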


Step 3: Get Attention Value

๊ฐ ์ธ์ฝ”๋”์˜ ์€๋‹‰ ์ƒํƒœ $h_i$์™€ ์•ž์—์„œ ์–ป์€ Attention Weight๋ฅผ ๊ณฑํ•œ๋‹ค. ์ดํ›„ ๊ทธ ๋ชจ๋‘๋ฅผ ๋”ํ•ด์ค€๋‹ค.

๊ฐ„๋‹จํ•˜๊ฒŒ ์ƒ๊ฐํ•˜๋ฉด Attention Weight์œผ๋กœ ๊ฐ€์ค‘ํ•ฉ์„ ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

\[a_t = \sum^{N}_{i=1} { {\alpha^t}_i \cdot h_i }\]

This Attention Value $a_t$ is also called the "Context Vector", since it carries the context of the encoder. Note that this is different from seq2seq, where it is the encoder's last hidden state that gets called the context vector!
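
Continuing the toy arrays from the previous steps, the attention value is one line:

```python
import numpy as np

H = np.random.randn(5, 4)                      # encoder hidden states h_1, ..., h_5
alpha_t = np.array([0.5, 0.2, 0.1, 0.1, 0.1])  # example attention weights (sum to 1)
a_t = alpha_t @ H                              # weighted sum of the encoder states, shape (4,)
print(a_t)
```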


Step 4: Concatenate Attention Value and Decoder Hidden State

๋””์ฝ”๋”์˜ ์€๋‹‰ ์ƒํƒœ์ธ $s_t$์— Attention Value $a_t$๋ฅผ Concatenateํ•˜์—ฌ ํ•˜๋‚˜์˜ ๋ฒกํ„ฐ $v_t$๋กœ ๋งŒ๋“ ๋‹ค.

์ด $v_t$๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์‚ผ์•„ $\hat{y}$๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค.

๊ทธ ์ดํ›„๋Š” ์ผ๋ฐ˜์ ์ธ RNN๊ณผ ๋™์ผํ•˜๋‹ค.



Transformer (2017)

RNN์˜ ๋ฌธ์ œ์ ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Attention ๋ชจ๋ธ์ด ์ œ์‹œ๋˜์—ˆ๋‹ค. Transformer์—์„œ๋Š” Attention์„ ์ƒˆ๋กœ์šด ์‹œ๊ฐ์œผ๋กœ ์ ‘๊ทผํ•œ๋‹ค.

"Attention์„ ๋””์ฝ”๋”์˜ hidden state๋ฅผ ๋ณด์ •ํ•˜๋Š” ์šฉ๋„๋กœ๋งŒ ์“ฐ๋Š”๊ฒŒ ์•„๋‹ˆ๋ผ
์•„์˜ˆ Attention์œผ๋กœ ๋™์ž‘ํ•˜๋Š” ์ธ์ฝ”๋”-๋””์ฝ”๋”๋ฅผ ๋งŒ๋“ค์–ด๋ณด์ž!"


Transformer๋Š” RNN์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š”๋‹ค! ๊ทธ๋Ÿฌ๋‹ˆ ๊ธฐ์กด์˜ seq2seq์ฒ˜๋Ÿผ ์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ตฌ์กฐ๋Š” ๊ทธ๋Œ€๋กœ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ๋‹ค๋งŒ, ๊ธฐ์กด ์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ตฌ์กฐ๋ฅผ ์™„์ „ํžˆ ๋”ฐ๋ฅด๊ณ  ์žˆ๋Š” ๊ฑด ์•„๋‹Œ๋ฐ, ์ธ์ฝ”๋”-๋””์ฝ”๋”๊ฐ€ 1:1๋กœ ์กด์žฌํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๋‹ค์ˆ˜:๋‹ค์ˆ˜๋กœ ์กด์žฌํ•˜๋Š” ๊ตฌ์กฐ์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค!!

๊ทธ๋ž˜์„œ ๋‹ค์ˆ˜:๋‹ค์ˆ˜ ๊ตฌ์กฐ๋ฅผ ๊ณ ๋ คํ•˜์—ฌ โ€œEncoder-Decoderโ€๊ฐ€ ์•„๋‹ˆ๋ผ โ€œEncoders-Decodersโ€๋ผ๊ณ  ํ‘œํ˜„ํ•œ๋‹ค.

Positional Encoding

Unlike the existing encoder-decoder models, the Transformer does not take the input sequence as-is; it takes as input values that have been preprocessed with "Positional Encoding".

In an RNN, the sequence is fed in word by word, in order, so the positional information of each word is captured automatically. Even so, an RNN still goes through an encoding step! That step is called "Word Embedding". link


์ด์ œ Transformer์—์„œ ์–ด๋–ค ๋ฐฉ์‹์œผ๋กœ Positional Encoding Vector๋ฅผ ์ƒ์„ฑํ•˜๋Š”์ง€ ์‚ดํŽด๋ณด์ž.

Transformer๋Š” $(\textrm{pos}, i)$๋ฅผ ์ •๋ณด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ Positional Encoding์„ ์ƒ์„ฑํ•œ๋‹ค. ์ด๋•Œ, $\textrm{pos}$๋Š” ์ž…๋ ฅ sequence ๋‚ด์—์„œ ๋‹จ์–ด์˜ ์œ„์น˜๋ฅผ ์˜๋ฏธํ•˜๊ณ , $i$๋Š” embedding vector์˜ ์ฐจ์› ๋‚ด์—์„œ์˜ ์ธ๋ฑ์Šค๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

$\textrm{pos}$์— ๋Œ€ํ•œ Positional Encoding์€ ์•„๋ž˜์˜ ํ•จ์ˆ˜์— ์˜ํ•ด ์–ป์„ ์ˆ˜ ์žˆ๋‹ค.

\[\begin{aligned} \textrm{PE}(\textrm{pos}, 2i) &= \sin \left( \textrm{pos} / 10000^{2i/d_{\textrm{model}}} \right) \\ \textrm{PE}(\textrm{pos}, 2i + 1) &= \cos \left( \textrm{pos} / 10000^{2i/d_{\textrm{model}}} \right) \end{aligned}\]

(์œ„์˜ ๊ทธ๋ฆผ์—์„  $d_{\textrm{model}}=4$๋กœ ์„ค์ •ํ•˜์˜€์ง€๋งŒ, ์‹ค์ œ ๋…ผ๋ฌธ์—์„  512๋กœ ์„ค์ •๋˜์–ด ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค.)

์œ„์™€ ๊ฐ™์€ Positional Encoding ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋ฉด, ์ž…๋ ฅ sequence์— ๋Œ€ํ•œ ์ˆœ์„œ ์ •๋ณด๊ฐ€ ๋ณด์กด๋œ๋‹ค๊ณ  ํ•œ๋‹ค! ์ด๋ฅผ ํ†ตํ•ด ๊ฐ™์€ ๋‹จ์–ด์ผ์ง€๋ผ๋„ ๋ฌธ์žฅ ๋‚ด์˜ ์œ„์น˜์— ๋”ฐ๋ผ์„œ Transformer์˜ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด๊ฐ€๋Š” Embedding Vector์˜ ๊ฐ’์ด ๋‹ฌ๋ผ์ง€๋Š” ๊ฒƒ์ด๋‹ค!



Three Attentions in Transformer

Transformer์—์„  3๊ฐ€์ง€ ๋ฐฉ์‹์œผ๋กœ Attention์„ ์‚ฌ์šฉํ•œ๋‹ค.

๊ทธ๋ฆผ์—์„œ ํšŒ์ƒ‰, ์ดˆ๋ก์ƒ‰, ํŒŒ๋ž€์ƒ‰์˜ ํ™”์‚ดํ‘œ๋Š” ๊ฐ๊ฐ Query, Key, Value๋ฅผ ์˜๋ฏธํ•œ๋‹ค. (์•„๋งˆ๋„)

๋จผ์ € โ€œSelf-Attentionโ€์ด๋ž€ ๋ฌด์—‡์ธ์ง€ ์‚ดํŽด๋ณด์ž. Self-Attention์€ Query์™€ Key, Value ๋™์ผํ•œ ๊ณณ์—์„œ ์œ ๋ž˜ํ•œ๋‹ค๋ฉด, โ€œSelf-Attentionโ€์ด๋ผ๊ณ  ํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด 3๋ฒˆ์งธ Attention Model์˜ ๊ฒฝ์šฐ Query๊ฐ€ ๋””์ฝ”๋”์—์„œ ์œ ๋ž˜ํ•˜๊ธฐ ๋•Œ๋ฌธ์— Self-Attention์ด ์•„๋‹ˆ๋‹ค. Attention Mechanism์—์„œ ์‚ดํŽด๋ณธ ๋ชจ๋ธ๋„ Self-Attention Model์ด ์•„๋‹ˆ๋‹ค.

1๋ฒˆ์งธ ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ, Self-Attention์ด๋ฉฐ ์ธ์ฝ”๋”์—์„œ ์ด๋ฃจ์–ด์ง„๋‹ค. ๋ฐ˜๋ฉด์— 2๋ฒˆ์งธ์™€ 3๋ฒˆ์งธ Attention Model์˜ ๊ฒฝ์šฐ ๋””์ฝ”๋”์—์„œ ์ด๋ฃจ์–ด์ง„๋‹ค.
(2๋ฒˆ์งธ๋Š” Self-Attention์ด๊ธด ํ•œ๋ฐ, Q, K, V๊ฐ€ ๋ชจ๋‘ ๋””์ฝ”๋”์˜ ๋ฒกํ„ฐ๋กœ ์ด๋ฃจ์–ด์ง€๋Š” ๊ฒƒ์ด๋‹ค.)

Transformer๋Š” ์œ„์˜ 3๊ฐ€์ง€ Attention์„ ๋ชจ๋‘ ์‚ฌ์šฉํ•œ๋‹ค! ๊ฐ๊ฐ์ด Transformer ์•„ํ‚คํ…์ฒ˜์˜ ์–ด๋””์—์„œ ์‚ฌ์šฉ๋˜๋Š”์ง€ ํ‘œํ˜„ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.



๋‘๋ฒˆ์งธ ํฌ์ŠคํŠธ์—์„  Transformer ๋ชจ๋ธ์˜ ์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”, ๊ทธ๋ฆฌ๊ณ  3๊ฐ€์ง€ Attention Model์— ๋Œ€ํ•ด ๋” ์ž์„ธํžˆ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค.

Reference