This post organizes what I studied in the 'Statistical Data Mining' course at university in the spring semester of 2021. Corrections are always welcome :)



Classification by Linear Regression

Binary Classification

Assume that $\mathcal{Y} = \{0, 1\}$, and consider the <linear regression> model

$$Y = X^T \beta + \epsilon$$

For some estimator $\hat{\beta}$, and given $X = x$, predict $Y$ as follows:

$$\hat{Y} = \begin{cases} 1 & \text{if } x^T \hat{\beta} > 0.5 \\ 0 & \text{if } x^T \hat{\beta} \le 0.5 \end{cases}$$

Here, $x^T \hat{\beta}$ can be viewed as a kind of function that outputs the probability that $Y = 1$.

$$\hat{f}(x) = \Pr(Y = 1 \mid X = x) = E(Y \mid X = x)$$

A model that classifies by computing the posterior probability of each class in this way is collectively called a <Bayes Classifier>.

Multi-class Classification

Assume that $\mathcal{Y} = \{1, \dots, K\}$, and let $\mathbf{Y}$ be the $n \times K$ binary matrix whose $(i, k)$ element is $1$ if $y_i = k$ and $0$ otherwise.

Let

$$\hat{B} = (X^T X)^{-1} X^T \mathbf{Y}$$

(derived from the RSS estimator!)

$$\hat{\mathbf{Y}} = X \hat{B}$$

With this setup, at inference time, for $X = x$, one may predict $Y$ as

$$\hat{Y} = \operatorname*{argmax}_{k \in \mathcal{Y}} \hat{f}_k(x)$$

where $(\hat{f}_1(x), \dots, \hat{f}_K(x)) = x^T \hat{B}$.

Here, each $\hat{f}_k(x)$, just like $\hat{f}(x)$ in binary classification, acts as an estimator of the conditional (posterior) probability below.

$$\hat{f}_k(x) = \Pr(Y = k \mid X = x)$$
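The indicator-matrix recipe above fits in a few lines of numpy. A sketch with made-up 2-D toy data and three well-separated class centers:

```python
import numpy as np

# Synthetic 2-D data with K = 3 classes at different centers (made up).
rng = np.random.default_rng(1)
K, n_per = 3, 40
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(c, 0.6, size=(n_per, 2)) for c in centers])
y = np.repeat(np.arange(K), n_per)

# Intercept column, plus the n x K indicator matrix Y.
D = np.column_stack([np.ones(len(X)), X])
Y = np.eye(K)[y]

# B_hat = (X^T X)^{-1} X^T Y is one least-squares fit per class;
# predict by taking the argmax of the K fitted values x^T B_hat.
B_hat = np.linalg.solve(D.T @ D, D.T @ Y)
y_hat = np.argmax(D @ B_hat, axis=1)
accuracy = (y_hat == y).mean()
```

Each column of $\hat{B}$ is just a separate regression onto one class indicator, which is why a single solve recovers all $K$ fits at once.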


Problem.

(to be written later)


Probabilistic Model

์œ„์˜ ๋‘ ๊ฒฝ์šฐ์—์„œ ์‚ดํŽด๋ดค๋“ฏ, Linear Regression์— ์˜ํ•œ ์ ‘๊ทผ์€ ๊ฒฐ๊ตญ ์•„๋ž˜์˜ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ estimate ํ•˜๋Š” ์ ‘๊ทผ์ด์—ˆ๋‹ค.

Pr(Y=kโˆฃX=x)

์ด๋•Œ, ์ด ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์€ posterior probability๋กœ <๋ฒ ์ด์ฆˆ ์ •๋ฆฌ>์— ์˜ํ•ด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

Pr(Y=kโˆฃX=x)=ฯ€kโ‹…fk(x)โˆ‘i=1Kฯ€iโ‹…fi(x)

Looking at each term on the right-hand side:

  • $\pi_k$: the <class probability> of class $k$; the prior probability
$$\pi_k = P(Y = k)$$
  • $f_k(x)$: the <class-conditional density> of $X$ in class $Y = k$
$$f_k(x) = P(X = x \mid Y = k)$$

If we knew both $\pi_k$ and $f_k(x)$, then given $X = x$ we could estimate which class $Y = k$ should be chosen. However, $\pi_k$ and $f_k(x)$ are both quantities we must estimate ourselves, and the course covers how to approach this estimation and which assumptions to use.
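As a tiny worked example of the Bayes formula above, suppose we really did know $\pi_k$ and $f_k$. All the numbers below (priors, normal means, standard deviations) are hypothetical:

```python
import math

# Hypothetical 1-D setup: two classes with known priors pi_k and
# known class-conditional normal densities f_k (numbers made up).
def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

pi = [0.7, 0.3]                       # prior probabilities pi_k = P(Y = k)
mus, sigmas = [0.0, 2.0], [1.0, 1.0]  # parameters of each f_k

x = 1.5
numerators = [p * normal_pdf(x, m, s) for p, m, s in zip(pi, mus, sigmas)]

# Bayes' theorem: posterior = pi_k * f_k(x) / sum_i pi_i * f_i(x)
posteriors = [num / sum(numerators) for num in numerators]
```

At $x = 1.5$ the point sits closer to class 1's mean, so even with the smaller prior, class 1 ends up with the larger posterior.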

Linear Discriminant Analysis

LDA assumes that $f_k$ follows the normal distribution $N(\mu_k, \Sigma_k)$ below. 1

$$f_k(x) = \frac{1}{\lvert 2\pi \Sigma_k \rvert^{1/2}} \exp\left[ - \frac{(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}{2} \right]$$

If, on top of this, we add the very special assumption that

$$\Sigma_k = \Sigma \quad \text{for all } k$$

we get <LDA; Linear Discriminant Analysis>! 😆


์œ„์™€ ๊ฐ™์ด ์„ค์ •ํ–ˆ์„ ๋•Œ์˜ <decision boundary>๋ฅผ ์ƒ๊ฐํ•ด๋ณด์ž.

P(Y=kโˆฃX=x)=P(Y=lโˆฃX=x)

boundary์— ๋Œ€ํ•œ ์œ„์˜ ์‹์„ ์ž˜ ์ •๋ฆฌํ•ด๋ณด๋ฉด,

logโกP(Y=kโˆฃX=x)P(Y=lโˆฃX=x)=logโกฯ€kโ‹…fk(x)ฯ€lโ‹…fl(x)=logโกฯ€kฯ€l+logโกfk(x)fl(x)=logโกฯ€kฯ€l+logโก(expโก(โˆ’(xโˆ’ฮผk)Tฮฃโˆ’1(xโˆ’ฮผk)2)expโก(โˆ’(xโˆ’ฮผl)Tฮฃโˆ’1(xโˆ’ฮผl)2))=logโกฯ€kฯ€l+logโก(expโก(โˆ’(xโˆ’ฮผk)Tฮฃโˆ’1(xโˆ’ฮผk)+(xโˆ’ฮผl)Tฮฃโˆ’1(xโˆ’ฮผl)2))=logโกฯ€kฯ€l+(โˆ’(xโˆ’ฮผk)Tฮฃโˆ’1(xโˆ’ฮผk)+(xโˆ’ฮผl)Tฮฃโˆ’1(xโˆ’ฮผl)2)=(logโกฯ€kฯ€lโˆ’12(ฮผk+ฮผl)Tฮฃโˆ’1(ฮผkโˆ’ฮผl))+xTฮฃโˆ’1(ฮผkโˆ’ฮผl)=a+bTx=0

Some steps skip the intermediate work and state only the result; the details are written in the fold below.

Details

(to be written later)

์œ„์˜ ์Šคํ…์˜ ๋งˆ์ง€๋ง‰์— a+bTx=0๋ผ๋Š” ํ‰๋ฉด์‹์ด ์œ ๋„ ๋˜์—ˆ๋Š”๋ฐ, ๊ฒฐ๊ตญ ์ด a+bTx=0์ด๋ผ๋Š” hyper-plain์ด <decision boundary>๊ฐ€ ๋จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค!!


Classification.

Now let's discuss how to perform classification with LDA.

First, since the denominator of $P(Y = k \mid X = x)$ takes the same value for every class, it suffices to maximize the numerator $\pi_k \cdot f_k(x)$! Since $f_k(x)$ contains an exponential, take the log for computational convenience.

$$\delta_k(x) = \log \pi_k + x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k$$

This function $\delta_k(x)$, which equals $\log\left( \pi_k \cdot f_k(x) \right)$ up to terms that do not depend on $k$, is called the <linear discriminant function>. Taking the argmax over $\{\delta_k(x)\}_{k \in \mathcal{Y}}$,

$$\hat{y} = \operatorname*{argmax}_{k \in \mathcal{Y}} \delta_k(x)$$

we classify to the class whose discriminant function $\delta_k(x)$ takes the largest value!


Parameter Estimation.

์•„์ง ์šฐ๋ฆฌ๋Š” ฯ€k์™€ fk(x)๋ฅผ ์ •ํ™•ํžˆ ์ •์˜ํ•˜์ง€๋Š” ์•Š์•˜๋‹ค. fk(x)๊ฐ€ normal dist. N(ฮผk,ฮฃ)๋ฅผ ๋”ฐ๋ฅธ๋‹ค๊ณ  ๊ฐ€์ •์€ ํ–ˆ์ง€๋งŒ, ฮผk, ฮฃ์— ๋Œ€ํ•œ ๋ช…ํ™•ํžˆ ์ •์˜ํ•˜์ง€๋Š” ์•Š์•˜์—ˆ๋‹ค. ์ด๋ฒˆ ํŒŒํŠธ์—์„œ๋Š” ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์„ ์–ด๋–ป๊ฒŒ ์ถ”์ •ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ์‚ดํŽด๋ณธ๋‹ค.

Let $n_k = \sum_{i=1}^n I(y_i = k)$, then

$$\hat{\pi}_k = \frac{n_k}{n} \quad \text{(sample proportion)}$$

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i \quad \text{(sample mean)}$$

$$\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^K \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T \quad \text{(pooled sample covariance matrix)}$$
ํŽผ์ณ๋ณด๊ธฐ

1. ฮผ๋ฅผ ํ™•์‹คํžˆ ์•Œ ๋•Œ

ฮฃ^=1Nโˆ‘k=1Kโˆ‘i:yi=k(xiโˆ’ฮผk)(xiโˆ’ฮผk)T

2. ฮผ๋ฅผ ๋ชจ๋ฅผ ๋•Œ

ฮฃ^=1Nโˆ’kโˆ‘k=1Kโˆ‘i:yi=k(xiโˆ’ฮผ^k)(xiโˆ’ฮผ^k)T
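The three plug-in estimators above translate almost line-for-line into numpy. A sketch on made-up labeled data (the seed and class means are arbitrary):

```python
import numpy as np

# Toy labeled data (X, y) with K = 2 classes in 2-D (made up).
rng = np.random.default_rng(2)
K, n_per, p = 2, 100, 2
mus_true = np.array([[0.0, 0.0], [3.0, 3.0]])
X = np.vstack([rng.normal(m, 1.0, size=(n_per, p)) for m in mus_true])
y = np.repeat(np.arange(K), n_per)
n = len(y)

# pi_hat: sample proportions; mu_hat: per-class sample means.
pi_hat = np.array([(y == k).mean() for k in range(K)])
mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])

# Pooled covariance: sum within-class squared deviations, divide by n - K.
Sigma_hat = sum(
    (X[y == k] - mu_hat[k]).T @ (X[y == k] - mu_hat[k]) for k in range(K)
) / (n - K)
```

Dividing by $n - K$ rather than $n$ accounts for the $K$ estimated class means, mirroring the usual $n - 1$ correction for a single sample mean.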

Quadratic Discriminant Analysis

LDA์—์„œ๋Š” ๊ฐ class k์˜ Cov-Var matrix ฮฃk๊ฐ€ ๋ชจ๋‘ ๊ฐ™์€ ๊ฐ’์„ ๊ฐ€์ง„๋‹ค๊ณ  ๊ฐ€์ •ํ–ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ๋งŒ์•ฝ ์ด ๊ฐ€์ •์„ ์ œ๊ฑฐํ•œ๋‹ค๋ฉด, ์šฐ๋ฆฌ๋Š” <QDA; Quadratic Discriminant Analysis>๋ฅผ ํ•˜๊ฒŒ ๋œ๋‹ค.

ฮดk(x)=logโกฯ€kโˆ’12logโก|ฮฃk|โˆ’12(xโˆ’ฮผk)Tฮฃkโˆ’1(xโˆ’ฮผk)

LDA์—์„œ์˜ ฮดk(x)์™€ ๋น„๊ตํ•ด๋ฉด, QDA์˜ ๊ฒฝ์šฐ, cancel out์ด ๋œ ๋˜๊ธฐ ๋•Œ๋ฌธ์— โ€œ2์ฐจ์‹โ€์ด ๋‚จ๊ฒŒ ๋œ๋‹ค!

(๋‘˜ ์ค‘ ํ•˜๋‚˜๋Š” QDA๋ฅผ, ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” Xi์— ๋Œ€ํ•œ 2์ฐจ์‹(Xi2,XiXj)์„ ๋„ฃ๊ณ  LDA๋ฅผ ๋Œ๋ฆฐ ๊ฒฐ๊ณผ๋‹ค.)
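The QDA discriminant above differs from the LDA one only in the per-class $\Sigma_k$ and the extra $\log\lvert\Sigma_k\rvert$ term. A sketch with hypothetical parameters (all numbers made up):

```python
import numpy as np

# QDA discriminant delta_k(x) from the formula above.
def delta_qda(x, pi_k, mu_k, Sigma_k):
    Sigma_inv = np.linalg.inv(Sigma_k)
    _, logdet = np.linalg.slogdet(Sigma_k)  # stable log|Sigma_k|
    d = x - mu_k
    return np.log(pi_k) - 0.5 * logdet - 0.5 * d @ Sigma_inv @ d

# Hypothetical per-class parameters; the unequal covariances are exactly
# what QDA allows and LDA forbids.
pis = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), 0.25 * np.eye(2)]

x = np.array([1.9, 2.1])
scores = [delta_qda(x, p, m, S) for p, m, S in zip(pis, mus, Sigmas)]
y_hat = int(np.argmax(scores))
```

Because the quadratic term keeps its own $\Sigma_k^{-1}$, the resulting decision boundaries are quadric surfaces rather than hyperplanes.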


์ด์–ด์ง€๋Š” ํฌ์ŠคํŠธ์—์„œ๋Š” <Logistic Regression>์— ๋Œ€ํ•ด ๋‹ค๋ฃฌ๋‹ค. ์ข€๋” ์ต์ˆ™ํ•œ ์šฉ์–ด๋ฅผ ์“ฐ์ž๋ฉด, <MLE; Maximum Likelihood Estimation>์— ๋Œ€ํ•ด ๋‹ค๋ฃฌ๋‹ค๋Š” ๋ง์ด๋‹ค!

๐Ÿ‘‰ Linear Classification - 2


  1. To emphasize: it is not the case that things must be modeled this way. We merely 'assumed' that the distribution of $f_k(X)$ follows $N(\mu_k, \Sigma_k)$! The true distribution of $f_k(X)$ may well be different! ↩