2020-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜์ธ๊ณต์ง€๋Šฅโ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)

3 minute read

2020-1ํ•™๊ธฐ, ๋Œ€ํ•™์—์„œ โ€˜์ธ๊ณต์ง€๋Šฅโ€™ ์ˆ˜์—…์„ ๋“ฃ๊ณ  ๊ณต๋ถ€ํ•œ ๋ฐ”๋ฅผ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ง€์ ์€ ์–ธ์ œ๋‚˜ ํ™˜์˜์ž…๋‹ˆ๋‹ค :)


Backpropagation

<Back-propagation>์€ <GD>๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด Loss์˜ ํŽธ๋ฏธ๋ถ„์„ ์‰ฝ๊ฒŒ ๊ตฌํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค.

\[\nabla_\mathbf{w} \text{TrainLoss}(\mathbf{w})\]

<Back-propagation> ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ <Neural Network>๋ฅผ ์ผ์ข…์˜ โ€œComputational Graphโ€œ๋กœ ๋ชจ๋ธ๋ง ํ•˜์—ฌ, Gradient๋ฅผ ๊ตฌํ•œ๋‹ค. Gradient๋Š” <Chain Rule>์— ์˜ํ•ด ์•ˆ์ชฝ์˜ ์—ฐ์‚ฐ๊นŒ์ง€ ์ฐจ๋ก€๋Œ€๋กœ ์ „ํŒŒ๋œ๋‹ค.

<Back-propagation>์€ ๋ณต์žกํ•œ ํ˜•ํƒœ์˜ Loss ํ•จ์ˆ˜์ผ์ง€๋ผ๋„ ๊ทธ๊ฒƒ์„ ์ž‘์€ ์š”์†Œ๋ถ€ํ„ฐ ์ฐจ๋ก€๋กœ <Chain Rule>์— ๋”ฐ๋ผ Gradient์„ ๊ตฌํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ณต์žกํ•œ ํ•จ์ˆ˜์˜ ๋ฏธ๋ถ„์‹์„ ์ง์ ‘ ๊ตฌํ•˜์ง€ ์•Š๊ณ ๋„ ์‰ฝ๊ฒŒ Gradient๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค! ๐Ÿ˜


Functions as boxes

Let's start with the simplest forms of <Computational Graph>. (A small code sketch of these basic blocks follows right after the list.)

Remark. basic building blocks

1. Addition $f(a, b) = a+b$

\[\frac{\partial f(a, b)}{\partial a} = 1 \quad, \quad \frac{\partial f(a, b)}{\partial b} = 1\]

2. Multiplication $f(a, b) = ab$

\[\frac{\partial f(a, b)}{\partial a} = b \quad , \quad \frac{\partial f(a, b)}{\partial b} = a\]

$a$, $b$์˜ ์ˆœ์„œ๊ฐ€ ๋ฐ”๋€Œ์—ˆ๋‹ค!!

3. max function $f(a, b) = \max(a, b)$

\[\frac{\partial f(a, b)}{\partial a} = I[a > b] \quad , \quad \frac{\partial f(a, b)}{\partial b} = I[a < b]\]

๋‹น์—ฐํžˆ $a > b$์ผ ๋•Œ, $f(a, b)$๊ฐ€ $a$์— ๋Œ€ํ•œ identity func.์ด ๋˜๋ฏ€๋กœ, ์œ„์˜ ๊ฒฐ๊ณผ๋Š” ๋‹น์—ฐํ•˜๋‹ค!

4. sigmoid function $f(a) = \sigma(a)$

\[\frac{\partial f(a)}{\partial a} = \sigma(a) (1 - \sigma(a))\]
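The four blocks above can be sketched as tiny functions that return both their output and the local partial derivatives listed in the text (a toy sketch of my own, not code from the lecture):

```python
# Each "box" returns its value together with its local gradients.
import math

def add(a, b):
    # f = a + b, df/da = 1, df/db = 1
    return a + b, (1.0, 1.0)

def mul(a, b):
    # f = a * b, df/da = b, df/db = a  (note the swap)
    return a * b, (b, a)

def maximum(a, b):
    # f = max(a, b), df/da = 1[a > b], df/db = 1[a < b]
    return max(a, b), (float(a > b), float(a < b))

def sigmoid(a):
    # f = sigma(a), df/da = sigma(a) * (1 - sigma(a))
    s = 1.0 / (1.0 + math.exp(-a))
    return s, (s * (1.0 - s),)

print(mul(2.0, 3.0))      # (6.0, (3.0, 2.0))
print(maximum(2.0, 3.0))  # (3.0, (0.0, 1.0))
```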

Remark. generalized building block

<Computational Graph>์˜ building block์˜ ํ˜•ํƒœ๋ฅผ ์ผ๋ฐ˜ํ™”ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

Remark. composing functions

ํ•ฉ์„ฑ ํ•จ์ˆ˜์˜ ๊ฒฝ์šฐ๋„ <back-propagation>์œผ๋กœ ์‰ฝ๊ฒŒ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค!

์ฒซ block๋ถ€ํ„ฐ ์‚ดํŽด๋ณด์ž๋ฉด, $\text{mid}$๋ฅผ ํ•จ์ˆ˜๊ฐ€ ์•„๋‹ˆ๋ผ ๋ณ€์ˆ˜๋กœ ์ทจ๊ธ‰ํ•˜๋ฉด, ์†์‰ฝ๊ฒŒ gradient๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

\[\frac{\partial \text{out}}{\partial \text{mid}}\]

์—ฌ๊ธฐ์—์„œ ๊ทธ ๋‹ค์Œ block์„ ์‚ดํŽด๋ณด๋ฉด, ์ด๊ฒƒ๋„ ๊ทธ๋ƒฅ ํŽธ๋ฏธ๋ถ„์„ ์ˆ˜ํ–‰ํ•ด์ฃผ๋ฉด ๋˜๋Š” ๊ฒƒ์ด๋‹ค.

\[\frac{\partial \text{mid}}{\partial \text{in}}\]

๋‘ gradient ๊ฐ’์„ ๋‹ค ๊ตฌํ–ˆ๋‹ค๋ฉด, ๊ตฌํ•œ gradient๋ฅผ ๋ชจ๋‘ ๊ณฑํ•ด์„œ ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” gradient๋ฅผ ๊ตฌํ•œ๋‹ค! ๐Ÿ˜

\[\frac{\partial \text{out}}{\partial \text{in}} = \frac{\partial \text{out}}{\partial \text{mid}} \cdot \frac{\partial \text{mid}}{\partial \text{in}}\]
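As a concrete sketch (with a made-up choice of blocks: $\text{mid} = ab$ fed into a sigmoid), the two local gradients are computed and then multiplied exactly as in the formula above:

```python
# Composition: out = sigmoid(mid), mid = a * b.
import math

a, b = 2.0, -1.0

# forward
mid = a * b                          # first block
out = 1.0 / (1.0 + math.exp(-mid))   # second block (sigmoid)

# backward: multiply the local gradients along the path
d_out_d_mid = out * (1.0 - out)      # sigma'(mid)
d_mid_d_a = b                        # d(a*b)/da = b
d_out_d_a = d_out_d_mid * d_mid_d_a

print(d_out_d_a)  # equals sigma'(a*b) * b
```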

Remark. Branching Block

ํ•จ์ˆ˜๊ฐ’์ด ๋‘ ๊ตฐ๋ฐ๋กœ ๋ถ„๊ธฐํ•˜๋Š” ๊ฒฝ์šฐ์˜ gradient๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

\[\frac{\partial\text{out}}{\partial \text{mid}} = \frac{\partial\text{out1}}{\partial \text{mid}} + \frac{\partial\text{out2}}{\partial \text{mid}}\]

In other words, at a branch we simply add the gradients!
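A small sketch of the branching rule, with made-up branch functions $\text{out1} = \text{mid}^2$ and $\text{out2} = 3\,\text{mid}$:

```python
# mid feeds two outputs; the total gradient is the sum of the
# gradients flowing back along each branch.
mid = 2.0

out1 = mid ** 2
out2 = 3.0 * mid
out = out1 + out2

# each branch contributes its own gradient; they are added at the fork
d_out_d_mid = (2.0 * mid) + 3.0   # d(out1)/d(mid) + d(out2)/d(mid)
print(d_out_d_mid)                # 7.0
```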


Example. Binary Classification with Hinge Loss

\[\text{Loss}(x, y, \mathbf{w}) = \max \left\{\, 0, \; 1 - (\mathbf{w} \cdot \phi(x)) y \, \right\}\]

โœจ Goal: $\dfrac{\partial \text{Loss}}{\partial \mathbf{w}}$

์ œ์ผ ๋จผ์ € ํ•  ์ผ์€ <Computational Graph>๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ทธ ๋‹ค์Œ์—๋Š” Graph์˜ edge์— ๊ทœ์น™์— ๋”ฐ๋ผ ๊ณ„์‚ฐํ•œ gradient ๊ฐ’์„ ํ‘œ์‹œํ•œ๋‹ค. ๋งˆ์ง€๋ง‰์—๋Š” gradient๋ฅผ ๊ตฌํ•˜๊ณ ์ž ํ•˜๋Š” ๋Œ€์ƒ, ์—ฌ๊ธฐ์„œ๋Š” $\mathbf{w}$๊นŒ์ง€ Graph๋ฅผ ๋”ฐ๋ผ๊ฐ€๋ฉฐ edge์— ํ‘œ์‹œ๋œ gradient๋ฅผ ๋ชจ๋‘ ๊ณฑํ•ด์ค€๋‹ค! ๐Ÿ˜

๋”ฐ๋ผ์„œ gradient๋Š”

\[\frac{\partial \text{Loss}}{\partial \mathbf{w}} = - 1 \cdot I[\text{margin} < 1] \cdot \phi(x) \cdot y\]
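Here is a small sketch of this result in code (with made-up toy values for $\phi(x)$, $y$, and $\mathbf{w}$), checking the formula against a numerical finite-difference estimate:

```python
# Hinge loss: Loss = max(0, 1 - (w . phi(x)) * y)
import numpy as np

phi_x = np.array([1.0, -2.0, 0.5])   # feature vector phi(x) (toy values)
y = 1.0                              # label in {-1, +1}
w = np.array([0.3, 0.1, -0.4])       # weights (toy values)

def loss(w):
    margin = np.dot(w, phi_x) * y
    return max(0.0, 1.0 - margin)

# analytic gradient from the formula above
margin = np.dot(w, phi_x) * y
grad = -phi_x * y * float(margin < 1)

# numerical check with central finite differences
eps = 1e-6
num_grad = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(len(w))
])

print(grad, num_grad)  # the two should agree up to rounding error
```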