This post is a write-up of what I studied in the "Stanford CS231" study group, run as part of the 2020 Fall "Research Participation (CSED339A)" course. Corrections and feedback are always welcome :)


Introduction to Sequential Model

์ด๋ฒˆ ํฌ์ŠคํŠธ์—์„œ ๋‹ค๋ฃจ๋Š” ๋ชจ๋ธ์€ โ€œ์ˆœ์„œโ€๋ผ๋Š” ์„ฑ์งˆ์„ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ชจ๋ธ๋“ค์ด๋‹ค. ์˜ˆ๋ฅผ ๋“ค๋ฉด, ๋‹จ์–ด(word)๋‚˜ ๋ฌธ์žฅ(sentence), 1๋…„๊ฐ„์˜ ์ฃผ์‹ ๊ฐ€๊ฒฉ ๋“ฑ์ด ์ˆœ์„œ๊ฐ€ ์ค‘์š”ํ•˜๊ฒŒ ์—ฌ๊ฒจ์ง€๋Š” ๋ฐ์ดํ„ฐ๋“ค์ด๋‹ค. ์ด๋Ÿฐ ๋ฐ์ดํ„ฐ๋ฅผ <sequence data>๋ผ๊ณ  ํ•œ๋‹ค.

A conventional <Feed Forward Network> passes values only in the forward direction, so it is not well suited to <sequence data>, where past information also matters. <RNN> and <LSTM>, however, can "remember" the history of past inputs thanks to their recurrent structure, which makes them specialized for processing <sequence data>.


RNN; Recurrent Neural Network

In an <RNN; Recurrent Neural Network>, the value computed by the hidden layer is propagated toward the output layer, but it also loops back into the hidden layer and is stored in its hidden state. That stored value is then used when processing the next input!

<RNN>์„ ์œ„์™€ ๊ฐ™์ด ํ‘œํ˜„ํ•  ์ˆ˜๋„ ์žˆ์ง€๋งŒ, ์•„๋ž˜์™€ ๊ฐ™์ด iteration์„ ํ’€์–ด์„œ ํ‘œํ˜„ํ•  ์ˆ˜๋„ ์žˆ๋‹ค. ์ด๊ฒƒ์„ <Cell>์ด๋ผ๊ณ  ํ•œ๋‹ค.

์€๋‹‰์ธต์€ $t$์‹œ์ ์—์„œ์˜ ์ถœ๋ ฅ $h_t$๋ฅผ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด ๋‘ ๊ฐ€์ง€ ๊ฐ’์„ ํ™œ์šฉํ•œ๋‹ค.

  • ์ด์ „ ์‹œ์ ์˜ hidden state $h_{t-1}$
  • ํ˜„์žฌ ์‹œ์ ์˜ ์ž…๋ ฅ $x_t$

์ด ๋‘ ๊ฐ’์„ ๊ฐ€์ค‘์น˜์™€ ํ•จ๊ป˜ ์ž˜ ์กฐํ•ฉํ•ด $\tanh$ ํ•จ์ˆ˜๋ฅผ ํ™œ์„ฑ ํ•จ์ˆ˜ ์‚ผ์•„ ์ถœ๋ ฅํ•˜๋ฉด, ์€๋‹‰์ธต์˜ ์ถœ๋ ฅ $h_t$๊ฐ€ ๋œ๋‹ค.

\[h_t = \tanh (W_x x_t + W_h h_{t-1})\]

์ถœ๋ ฅ์ธต์€ ์ด ์€๋‹‰์ธต์˜ ๊ฒฐ๊ณผ $h_t$๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ์ถœ๋ ฅ์ธต์˜ ๊ฐ€์ค‘์น˜ $W_y$์™€ ์กฐํ•ฉํ•ด $y_t$๋ฅผ ์ถœ๋ ฅํ•œ๋‹ค.

\[y_t = W_y h_t\]


ํ•˜๋‚˜์˜ <Cell>์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์˜ sequence ํ•˜๋‚˜๋ฅผ ์ฝ์„ ๋•Œ๊นŒ์ง€ ๋ชจ๋‘ ๋™์ผํ•œ weight ๊ฐ’์„ ์‚ฌ์šฉํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด โ€œI love youโ€๋ผ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ž…๋ ฅํ•˜๋ฉด, ๊ฐ ๋ฌธ์ž๋Š” ๋ชจ๋‘ ๋™์ผํ•œ weight $W_x$, $W_h$, $W_y$๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ๊ทธ๋ž˜์„œ โ€œI love youโ€ ๋ฌธ์žฅ ํ•˜๋‚˜๊ฐ€ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ ์ž…๋ ฅ์ด๋ฉฐ, Back-propagation ์—ญ์‹œ ์ด ํ•œ ๋ฌธ์žฅ์„ ๋‹ค ์ฝ์€ ํ›„์— ์ผ์–ด๋‚˜๋Š” ๊ฒƒ์ด๋‹ค.


<RNN>์—์„œ์˜ Back-propagation์€ ๊ธฐ์กด์˜ <Feed Forward Network>์˜ ๋ฐฉ์‹๊ณผ๋Š” ์กฐ๊ธˆ ๋‹ค๋ฅด๋‹ค. <RNN>์—์„œ๋Š” time-step์œผ๋กœ <Cell>์„ ํŽผ์นœ ํ›„์— Back-prop์„ ์ ์šฉํ•œ๋‹ค. ์ด๋ฅผ <Backprop Through Time; BPTT>๋ผ๊ณ  ํ•œ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ์œ„์™€ ๊ฐ™์€ ๋ฐฉ์‹์€ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋๊นŒ์ง€ ์—ญ์ „ํŒŒ๋ฅผ ๋ฌธ์ž์˜ ๊ธธ์ด๊ฐ€ ๊ธธ์–ด์ง„๋‹ค๋ฉด ๊ณ„์‚ฐ๋Ÿ‰์ด ์ฆ๊ฐ€ํ•œ๋‹ค๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ „์ฒด ๋ฌธ์žฅ์„ ์ผ์ • ๊ตฌ๊ฐ„์„ ๋‚˜๋ˆ ์„œ Backprop์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ๋„ ํ•œ๋‹ค. ์ด๊ฒƒ์„ <Truncated BPTT>๋ผ๊ณ  ํ•œ๋‹ค.

๋˜, <RNN>๊ณผ ๊ฐ™์€ <Sequential Model>์€ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์˜ ๋Œ€์‘์— ๋”ฐ๋ผ 1-to-1, 1-to-many, many-to-1, many-to-many ๋“ฑ ๋‹ค์–‘ํ•œ ํ˜•ํƒœ๋กœ ์กด์žฌํ•œ๋‹ค.

RNN structures
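
As a small illustration of the many-to-1 and many-to-many cases, the sketch below (shapes and layer sizes are assumptions) runs a torch `nn.RNN` over a batch of sequences and reads out either only the last hidden state or a prediction at every time step.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)            # batch of 4 sequences, 10 steps each

outputs, h_n = rnn(x)                # outputs: (4, 10, 16), h_n: (1, 4, 16)

to_class = nn.Linear(16, 3)          # many-to-1 readout (e.g. sentence classification)
to_tag = nn.Linear(16, 5)            # many-to-many readout (e.g. per-token tagging)

logits = to_class(h_n[-1])           # use only the final hidden state -> (4, 3)
per_step = to_tag(outputs)           # use every time step's output   -> (4, 10, 5)
```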


LSTM; Long Short Term Memory model

<RNN>์˜ ๊ฒฝ์šฐ hidden state๋ฅผ ํ†ตํ•ด ์ด์ „ ์ž…๋ ฅ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€๋งŒ, ์ž…๋ ฅ ์‹œํ€€์Šค๊ฐ€ ๊ธธ์–ด์งˆ์ˆ˜๋ก ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง„๋‹ค๋Š” ๋‹จ์ ์ด ์กด์žฌํ•œ๋‹ค. ์ด๋ฅผ <The problem of learning long-term dependencies; ์žฅ๊ธฐ ์˜์กด์„ฑ ๋ฌธ์ œ>๋ผ๊ณ  ํ•œ๋‹ค. ์ด์— ๋Œ€ํ•ด์„  ๋™์ผํ•œ ๊ฐ’์˜ $W_h$์˜ ๊ฐ’์„ ์—ฌ๋Ÿฌ๋ฒˆ ์‚ฌ์šฉํ•˜๊ฒŒ ๋˜๋ฉด์„œ ๋ฐœ์ƒํ•˜๋Š” Gradient Exploding ๋˜๋Š” Gradient Vanishing์„ ์›์ธ์œผ๋กœ ๊ผฝ๋Š”๋‹ค.

<LSTM>์€ <์žฅ๊ธฐ ์˜์กด์„ฑ ๋ฌธ์ œ>๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด <RNN>์˜ ๊ตฌ์กฐ์— Cell state $c_t$๋ฅผ ์ถ”๊ฐ€ํ•˜๊ณ  ์—ฌ๋Ÿฌ ๊ฒŒ์ดํŠธ(gate)๋ฅผ ์ถ”๊ฐ€ํ•œ ๋ชจ๋ธ์ด๋‹ค.

RNN (left) and LSTM (right)


<LSTM>์—์„œ cell state $c_t$๋Š” ์žฅ๊ธฐ ๊ธฐ์–ต์„ ๋‹ด๋‹นํ•˜๋ฉฐ, hidden state $h_t$๋Š” ๋‹จ๊ธฐ ๊ธฐ์–ต์„ ๋‹ด๋‹นํ•œ๋‹ค.

\[c_t = f_t \circ c_{t-1} + i_t \circ g_t\]
\[h_t = o_t \circ \tanh(c_t)\]


<LSTM>์—๋Š” 4๊ฐ€์ง€ ๊ฒŒ์ดํŠธ(gate)๊ฐ€ ์กด์žฌํ•˜๋ฉด ๊ฐ๊ฐ์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค. ํŽธ์˜๋ฅผ ์œ„ํ•ด bias $b$ ํ…€์€ ์ƒ๋žตํ•˜์˜€๋‹ค.

1. ์ž…๋ ฅ ๊ฒŒ์ดํŠธ

\[i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1})\]

2. ๊ฒŒ์ดํŠธ ๊ฒŒ์ดํŠธ ๐Ÿ˜ต

\[g_t = \tanh(W_{xg} x_t + W_{hg}h_{t-1})\]

๋‘ ๊ฒŒ์ดํŠธ ๋ชจ๋‘ ํ˜„์žฌ์˜ ์ž…๋ ฅ $x_t$์™€ hidden state $h_{t-1}$๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์œผ๋ฉฐ, ๋‹ค๋ฅธ ์ ์€ ํ™œ์„ฑํ™” ํ•จ์ˆ˜ ๋ฟ์ด๋‹ค. ์ด ๋‘ ๊ฒŒ์ดํŠธ๋ฅผ ํ†ตํ•ด ํ˜„์žฌ์˜ ์ž…๋ ฅ์—์„œ ๊ธฐ์–ตํ•  ์ •๋ณด์˜ ์–‘์„ ์ •ํ•œ๋‹ค.

3. ๋ง๊ฐ ๊ฒŒ์ดํŠธ

\[f_t = \sigma(W_{xf} x_t + W_{hf}h_{t-1})\]

๋ง๊ฐ ๊ฒŒ์ดํŠธ์˜ ๊ฐ’์„ ํ†ตํ•ด cell state $c_{t-1}$์—์„œ ์žŠ์„ ์ •๋ณด์˜ ์–‘์„ ์ •ํ•œ๋‹ค.

Look at the equation for the cell state $c_t$ again: the forget gate $f_t$ decides what to keep from the previous cell state $c_{t-1}$, while the input gate $i_t$ and the gate gate $g_t$ decide what to remember from the current input. ๐Ÿคฉ

\[c_t = f_t \circ c_{t-1} + i_t \circ g_t\]

4. ์ถœ๋ ฅ ๊ฒŒ์ดํŠธ

\[o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1})\]

์ถœ๋ ฅ ๊ฒŒ์ดํŠธ์˜ ๊ฐ’์€ <Cell>์˜ ์ถœ๋ ฅ์ด ๋˜๋ฉฐ, ์ดํ›„ ์ถœ๋ ฅ์ธต์—์„œ $y$์˜ ๊ฐ’์„ ๊ตฌํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋œ๋‹ค. ๋˜, ์ถœ๋ ฅ ๊ฒŒ์ดํŠธ์˜ ๊ฐ’์„ ํ†ตํ•ด cell state $c_t$์—์„œ ๋‹จ๊ธฐ์ ์œผ๋กœ ๊ธฐ์–ตํ•  ์ •๋ณด๋ฅผ ๊ฒฐ์ •ํ•œ๋‹ค.

\[h_t = o_t \circ \tanh(c_t)\]
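
Putting the four gates together, here is a minimal numpy sketch of one <LSTM> <Cell> step that follows the equations above (bias terms omitted; the sizes and random weights are assumptions for illustration).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_dim, hidden_dim = 8, 16
rng = np.random.default_rng(0)

# One weight pair (W_x*, W_h*) per gate: input, gate, forget, output.
W = {g: (rng.normal(scale=0.1, size=(hidden_dim, input_dim)),
         rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)))
     for g in ("i", "g", "f", "o")}

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(W["i"][0] @ x_t + W["i"][1] @ h_prev)   # input gate
    g_t = np.tanh(W["g"][0] @ x_t + W["g"][1] @ h_prev)   # gate gate (candidate)
    f_t = sigmoid(W["f"][0] @ x_t + W["f"][1] @ h_prev)   # forget gate
    o_t = sigmoid(W["o"][0] @ x_t + W["o"][1] @ h_prev)   # output gate

    c_t = f_t * c_prev + i_t * g_t        # long-term memory
    h_t = o_t * np.tanh(c_t)              # short-term memory / cell output
    return h_t, c_t

h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(10, input_dim)):   # a 10-step dummy sequence
    h, c = lstm_step(x_t, h, c)
```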


<LSTM>์˜ ๊ฒฝ์šฐ <RNN>์— ๋น„ํ•ด ๊ตฌ์กฐ๊ฐ€ ์ •๋ง ๋ณต์žกํ•˜์ง€๋งŒ, ์ด์ „๊ณผ ๋‹ฌ๋ฆฌ Gradient Exploding์ด๋‚˜ Gradient Vanishing ํ˜„์ƒ์ด ๋‘๋“œ๋Ÿฌ์ง€์ง€ ์•Š๋Š”๋‹ค! ๐Ÿ˜ ์ž์„ธํ•œ ์ด์œ ๋ฅผ ์•Œ๊ณ  ์‹ถ๋‹ค๋ฉด, ์ด ์•„ํ‹ฐํด์„ ์ฐธ๊ณ ํ•˜๋ผ.


In 2014 the <GRU; Gated Recurrent Unit>, a sequential model that refines the <LSTM>, was proposed. Like the <LSTM> it uses gates, but it keeps only a single state $h_t$, which is said to make it faster to train than the <LSTM>. For more on the <GRU>, see this article.
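
As a quick usage comparison (shapes are assumptions), the sketch below shows that torch's `nn.LSTM` returns both a hidden state and a cell state, while `nn.GRU` keeps only the hidden state.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 10, 8)                      # batch of 4, 10 steps, 8 features

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

out_lstm, (h_lstm, c_lstm) = lstm(x)           # hidden state AND cell state
out_gru, h_gru = gru(x)                        # hidden state only

print(out_lstm.shape, out_gru.shape)           # both (4, 10, 16)
```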

<RNN>, <LSTM>์€ ์ž์—ฐ์–ด(Natural Language)์™€ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฐ€์žฅ ๊ธฐ๋ณธ์ด ๋˜๋Š” ๋ชจ๋ธ์ด๋‹ค. ๊ทธ๋Ÿฌ๋‹ˆ ์ด๋ฒˆ ๋‚ด์šฉ์— ์ต์ˆ™ํ•ด์ง€๋Š” ๊ฒƒ์„ ์ถ”์ฒœํ•œ๋‹ค.

โ–  related post

  • NLP with PyTorch Cheat Sheet

reference