์ด ํฌ์ŠคํŠธ๋Š” ์ œ๊ฐ€ ๊ฐœ์ธ์ ์ธ ์šฉ๋„๋กœ ์ •๋ฆฌํ•œ ๊ธ€ ์ž…๋‹ˆ๋‹ค.

3 minute read

์ด ํฌ์ŠคํŠธ๋Š” ์ œ๊ฐ€ ๊ฐœ์ธ์ ์ธ ์šฉ๋„๋กœ ์ •๋ฆฌํ•œ ๊ธ€ ์ž…๋‹ˆ๋‹ค.

Transformer์— ๋Œ€ํ•œ ์ฒซ๋ฒˆ์งธ ํฌ์ŠคํŠธ๋Š” ์ด๊ณณ์—์„œ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.



Encoder in Transformer

์•ž์—์„œ๋„ ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด Transformer์˜ ์ธ์ฝ”๋” ๋ชจ๋ธ์€ ๋‹ค์ˆ˜์˜ Encoder๊ฐ€ ์Œ“์—ฌ์žˆ๋Š” ๊ตฌ์กฐ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” num_layers=6์œผ๋กœ ์„ค์ •ํ•˜์—ฌ ์ธ์ฝ”๋” ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ•˜์˜€๋‹ค.

ํ•˜๋‚˜์˜ ์ธ์ฝ”๋” ๋‚ด๋ถ€์—๋Š” ๋‘ ๊ฐ€์ง€ Layer๊ฐ€ ๋˜ ์กด์žฌํ•œ๋‹ค.

  • Self-Attention
  • Feed Forward Neural Network; FFNN

Self-Attention in Encoder Module

๊ธฐ์กด Attention์˜ ๊ฒฝ์šฐ Query $Q$๊ฐ€ $t$ ์‹œ์ ์—์„œ์˜ ๋””์ฝ”๋”์˜ ์€๋‹‰ ์ƒํƒœ์˜€๋‹ค.

ํ•˜์ง€๋งŒ, Self-Attention์˜ ๊ฒฝ์šฐ, ์ด Query $Q$๊ฐ€ ๋””์ฝ”๋”๊ฐ€ ์•„๋‹Œ ์ธ์ฝ”๋”์—์„œ ์œ ๋ž˜ํ•œ๋‹ค. ๋˜ํ•œ, $t$ ์‹œ์ ์ด ์•„๋‹Œ, ์ธ์ฝ”๋”์˜ ์ „์ฒด ์‹œ์ ์— ๋Œ€ํ•œ ์€๋‹‰ ์ƒํƒœ๋กœ Query $Q$๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค!

Self-Attention์„ ํ™œ์šฉํ•˜๋ฉด, ์•„๋ž˜์™€ ๊ฐ™์ด ๋ฌธ์žฅ ๋‚ด์—์„œ ์ง€์‹œํ•˜๋Š” ๋Œ€์ƒ์„ ์ฐพ๋„๋ก ์œ ์‚ฌ๋„๋ฅผ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค.

Transformer๋Š” ๋”์ด์ƒ RNN์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š”๋‹ค. ๊ทธ๋ž˜์„œ โ€˜์€๋‹‰ ์ƒํƒœโ€™๋ผ๋Š” ๊ฐœ๋…์ด ์—†๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— $Q$, $K$, $V$๋ฅผ ์€๋‹‰ ์ƒํƒœ๊ฐ€ ์•„๋‹Œ ๋‹ค๋ฅธ ๋ฐฉ์‹์œผ๋กœ ์ •์˜ํ•˜๊ฒŒ ๋œ๋‹ค.

์œ„์˜ ๊ฐ™์ด ๋‹จ์–ด์˜ embedding vector์— $Q$, $K$, $V$์— ๋Œ€์‘ํ•˜๋Š” ๊ฐ€์ค‘์น˜ $W^Q$, $W^K$, $W^V$๋ฅผ ๊ณฑํ•˜์—ฌ $Q$, $K$, $V$๋ฅผ ์–ป๋Š”๋‹ค. (๊ฐ€์ค‘์น˜๋Š” ๊ฐ ๋‹จ์–ด์—์„œ ๊ณต์œ ๋˜๋Š” ๊ฐ’์ธ๊ฐ€? ์•„๋งˆ ๊ทธ๋ ‡๊ฒ ์ง€?)

๊ฐ ๋‹จ์–ด์— ๋Œ€ํ•œ $Q$, $K$, $V$ ๊ฐ’์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด, ๋‹ค์Œ ๊ณผ์ •์€ ์•ž์—์„œ ๋‹ค๋ฃฌ Attention Mechanism๊ณผ ์™„์ „ํžˆ ๋™์ผํ•˜๋‹ค.

A Query vector $Q$ computes an Attention score against every $K$ vector, and softmax turns the scores into an Attention distribution. The $V$ vectors are then weighted-summed with the Attention distribution to produce the Attention Value. This process is repeated for every Query vector.

์ด๋•Œ, ์•ž์—์„œ ๋‹ค๋ฃฌ Attention Mechanism๊ณผ ๋‹ฌ๋ฆฌ Dot-Product Attention์ด ์•„๋‹Œ, Dot-Product ๊ฒฐ๊ณผ์— ํŠน์ •๊ฐ’์„ ๋‚˜๋ˆ„๋Š” $\textrm{score}(q, k) = \frac{q \cdot k}{\sqrt{n}}$๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋Ÿฐ Attention Function์„ โ€œScaled Dot-Product Attentionโ€์ด๋ผ๊ณ  ํ•œ๋‹ค.

After computing the Attention Score with this Attention Function, we apply Softmax as before to get the Attention Distribution, and take a weighted sum with the $V$ vectors to obtain the Attention Value!
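A minimal sketch of this per-query computation (score, softmax, weighted sum), with made-up toy tensors and sizes ($d_k=3$, a 4-word sequence) just for illustration:

```python
import torch
import torch.nn.functional as F

d_k = 3
q = torch.randn(d_k)            # one query vector
K = torch.randn(4, d_k)         # key vectors for a 4-word sequence
V = torch.randn(4, d_k)         # value vectors for the same sequence

# Scaled Dot-Product Attention: dot products divided by sqrt(d_k)
scores = K @ q / d_k ** 0.5     # attention score against every key
dist = F.softmax(scores, dim=0) # attention distribution
attention_value = dist @ V      # weighted sum of the value vectors
print(dist, attention_value)
```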

์ฐธ๊ณ ํ•œ ์ž๋ฃŒ์—์„œ๋Š” ์œ„์˜ ๊ณผ์ •์„ word ๋ฒกํ„ฐ ํ•˜๋‚˜ํ•˜๋‚˜๊ฐ€ ์•„๋‹ˆ๋ผ word ๋ฒกํ„ฐ๋ฅผ ๋ชจ์•„ ํ–‰๋ ฌ๋กœ ์ทจ๊ธ‰ํ•ด sequence ์ „์ฒด์— ๋Œ€ํ•ด ํ•œ๋ฒˆ์— ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค. ์ž์„ธํ•œ ์„ค๋ช…์€ ์—ฌ๊ธฐ์„  ์ƒ-๋žต

๊ทธ๋ž˜์„œ ํ–‰๋ ฌ์˜ ๋ฐฉ์‹์œผ๋กœ ๊ฒฐ๊ณผ๋ฅผ ์ข…ํ•ฉํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[\textrm{Attention}(Q, K, V) = \textrm{softmax} \left( \frac{Q \cdot K^{T}}{\sqrt{d_k}}\right) \cdot V\]
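A sketch of the matrix form above, assuming $Q$, $K$, $V$ each hold one row per word; the function name and sizes here are my own choices:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

# all queries for a 4-word sequence processed in one shot
Q = torch.randn(4, 3)
K = torch.randn(4, 3)
V = torch.randn(4, 3)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([4, 3])
```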

๋’ค์—๋Š” Multi-head์— ๋Œ€ํ•œ ์„ค๋ช…์ด ๋” ์ž์„ธํžˆ ์žˆ๋˜๋ฐ, ํ โ€ฆ ์ดํ•ด๊ฐ€ ์•ˆ ๋˜๋‹ˆ ์ƒ๋žตํ•˜์ž.



FFNN

์ƒ๊ฐ๋ณด๋‹ค ๊ฐ„๋‹จํ•œ ๋ชจ๋ธ์ด๋‹ค.

๊ทธ๋ƒฅ ํ‰๋ฒ”ํ•œ Neural Network์— ๋ถˆ๊ณผ.

์•ž์— โ€œPosition-wiseโ€๋ผ๋Š” ๋ง์ด ๋ถ™์—ˆ๋Š”๋ฐ, ๊ทธ๋ƒฅ Fully-connected๋ผ๋Š” ๊ฒƒ๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋‹ค!

์ด๊ฒƒ ๋ง๊ณ ๋„ FFNN ๋ชจ๋ธ์— residual connection๊ณผ normalization์„ ์—ฐ๊ฒฐํ•ด ๋” ์ •๊ตํ•˜๊ฒŒ ๋งŒ๋“ค์–ด ์ค€๋‹ค.


Once one encoder module has been built from a Self-Attention layer and an FFNN layer like this, the encoder modules are stacked to form the Encoders, as sketched below.
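A hypothetical sketch of stacking the module num_layers=6 times, using PyTorch's built-in nn.MultiheadAttention in place of the self-attention walked through above; class names and sizes are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffnn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # self-attention sub-layer with residual connection + layer norm
        x = self.norm1(x + self.attn(x, x, x)[0])
        # position-wise FFNN sub-layer with residual connection + layer norm
        return self.norm2(x + self.ffnn(x))

# stack num_layers=6 encoder modules; the last layer's output goes to the decoder
encoders = nn.Sequential(*[EncoderLayer() for _ in range(6)])
print(encoders(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```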


์ธ์ฝ”๋”์˜ ๋งˆ์ง€๋ง‰ ์ถœ๋ ฅ์ธต์—์„œ์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋””์ฝ”๋”์˜ ์ž…๋ ฅ์œผ๋กœ ์ „๋‹ฌ๋œ๋‹ค.



Reference