These are notes from the 'Statistical Data Mining' course I took at university in the spring semester of 2021. Corrections and feedback are always welcome :)

Introduction to PCA

<PCA; Principal Component Analysis> is a technique widely used for dimensionality reduction.

โœจ Goal โœจ
To find low-dimensional representation of input variables which contains most information in data

So what does "contains most information" mean here? In <PCA>, it means preserving as much of the "variance" of the data as possible! A direction that preserves the variance of the data is called a <principal component direction>, and these directions are exactly what we need to find in order to perform <PCA>! 😁


First Principal Component

For a design matrix $X = \left[ X_1, X_2, \dots, X_p \right]$, the first principal component $Z_1$ is the normalized linear combination:

\[Z_1 = \phi_{11} X_1 + \cdots + \phi_{1p}X_p\]

that has the largest variance.

By normalization, $\sum^p_{j=1} \phi_{1j}^2 = 1$; because of this normalization, the direction of the principal component is uniquely defined (up to sign).

The coefficient vector $\phi_1 = (\phi_{11}, \dots, \phi_{1p})^T$ is called the โ€œloading vector of the first principal componentโ€.


Q. Given a $n\times p$ design matrix $X$, how can we estimate $\phi_1$??

Let $X = (X_1, \dots, X_p)^T$, and w.o.l.g. letโ€™s assume that $E[X] = 0$ by standardization and $\text{Var}(X) = \Sigma$.

Then, for $Z_1 = \phi_{11}X_1 + \cdots + \phi_{1p}X_p$, we want to maximize $\text{Var}(Z_1)$!

\[\begin{aligned} \text{maximize} \; \text{Var}(Z_1) &= \underset{\phi_{11}, \dots, \phi_{1p}}{\text{maximize}} \; \text{Var}(\phi_{11}X_1 + \cdots + \phi_{1p}X_p) \\ &= \underset{\phi_{11}, \dots, \phi_{1p}}{\text{maximize}} \; \text{Var}(\phi_1^T X) \\ &= \underset{\phi_{11}, \dots, \phi_{1p}}{\text{maximize}} \; \phi_1^T \text{Var}(X) \phi_1 \\ &= \underset{\phi_{11}, \dots, \phi_{1p}}{\text{maximize}} \; \phi_1^T \Sigma \phi_1 \quad \text{s.t.} \quad \phi_{11}^2 + \cdots + \phi_{1p}^2 = 1 \end{aligned}\]

Since the covariance matrix $\Sigma$ is a symmetric matrix, it admits a <spectral decomposition>!

\[\Sigma = UDU^T\]

Also, the covariance matrix $\Sigma$ is positive semi-definite:

\[x^T \Sigma x \ge 0 \quad \text{for all } x\]

Accordingly, every eigenvalue of $\Sigma$ is nonnegative.

\[\Sigma = UDU^T = d_1 u_1 u_1^T + \cdots + d_p u_p u_p^T\]

W.l.o.g., assume $0 \le d_1 \le \cdots \le d_p$.
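The spectral decomposition above can be checked numerically. The sketch below uses a small hypothetical covariance matrix (chosen for illustration only) and `numpy.linalg.eigh`, which returns eigenvalues in ascending order, matching the convention $d_1 \le \cdots \le d_p$:

```python
import numpy as np

# Hypothetical 3x3 covariance matrix (symmetric, positive definite), illustration only.
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 0.7]])

# eigh is intended for symmetric matrices; eigenvalues come back ascending.
d, U = np.linalg.eigh(Sigma)

# Reconstruct Sigma = U D U^T = d_1 u_1 u_1^T + ... + d_p u_p u_p^T
reconstructed = sum(d[i] * np.outer(U[:, i], U[:, i]) for i in range(3))
assert np.allclose(reconstructed, Sigma)
assert np.all(d >= 0)   # positive semi-definite => nonnegative eigenvalues
```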

Now that we have examined the covariance matrix $\Sigma$ thoroughly, let us express $\phi$ in terms of $\Sigma$. Using the orthogonal matrix $U$ of $\Sigma$, we can expand $\phi$ in the $U$-basis as follows.

\[\phi = (u_1^T\phi) u_1 + \cdots + (u_p^T \phi) u_p = U \begin{pmatrix} u_1^T \phi \\ \vdots \\ u_p^T \phi \end{pmatrix}\]

Here, since $\phi^T \phi = 1$, we have $(u_1^T\phi)^2 + \cdots + (u_p^T \phi)^2 = 1$.

\[\begin{aligned} \phi^T \Sigma \phi &= (u_1^T \phi, \, \dots, \, u_p^T \phi) \cancel{U^T} \cdot \cancel{U}D\cancel{U^T} \cdot \cancel{U} \begin{pmatrix} u_1^T \phi \\ \vdots \\ u_p^T \phi \end{pmatrix} \\ &= (u_1^T \phi, \, \dots, \, u_p^T \phi) D \begin{pmatrix} u_1^T \phi \\ \vdots \\ u_p^T \phi \end{pmatrix} \\ &= d_1 (u_1^T \phi)^2 + \cdots + d_p (u_p^T \phi)^2 \end{aligned}\]

That is, to achieve the maximum variance, we need to make $\phi^T\Sigma \phi$ as large as possible!

However, the eigenvalues $d_i$ of the covariance matrix $\Sigma$ are not values we can control! What we can adjust is $(u_p^T \phi)^2$, the weight multiplied by the largest eigenvalue $d_p$!!

And the value of $(u_p^T \phi)^2$ is largest exactly when $\phi$ equals $u_p$!! ($u_p^T u_p = 1$)

That is, the solution of the constrained maximization problem for $\phi_1$ is $u_p$, the eigenvector associated with the largest eigenvalue $d_p$!
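As a sanity check on this claim, the sketch below (using another hypothetical covariance matrix, for illustration only) verifies that no random unit vector $\phi$ achieves a larger value of $\phi^T \Sigma \phi$ than the top eigenvector $u_p$:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical symmetric positive-definite covariance matrix, illustration only.
Sigma = np.array([[3.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
d, U = np.linalg.eigh(Sigma)
u_p = U[:, -1]                       # eigenvector of the largest eigenvalue

# No random unit vector phi should beat u_p in phi^T Sigma phi.
for _ in range(1000):
    phi = rng.normal(size=3)
    phi /= np.linalg.norm(phi)
    assert phi @ Sigma @ phi <= d[-1] + 1e-12

# The maximum value is attained at u_p and equals the largest eigenvalue d_p.
assert np.isclose(u_p @ Sigma @ u_p, d[-1])
```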

์•ž์—์„œ ์‚ดํŽด๋ณธ ๊ณผ์ •์„ ๋‹ค์‹œ formalํ•˜๊ฒŒ ๊ธฐ์ˆ ํ•ด๋ณด์ž!


Definition. First Principal Component

Since we are only considering variance, we may assume that each column of $X$ is centered (each column has mean 0).

Define $z_{i1}$ to be

\[z_{i1} = \phi_{11} x_{i1} + \cdots + \phi_{1p}x_{ip} = (X\phi_1)_i\]

Each $z_{i1}$ is called a "score of the first principal component".

Then, we can find an estimator $\hat{\phi}_1$ of $\phi_1$ by solving the following constrained maximization problem.

\[\underset{\phi_1}{\text{maximize}} \left\{ \frac{1}{n} \sum^n_{i=1} z_{i1}^2 \right\} \quad \text{subject to} \; \sum^p_{j=1} \phi_{1j}^2 = 1\]

Equivalently, $\phi_1$ can be obtained by

\[\underset{\phi_1}{\text{maximize}} \; \phi_1^T \hat{\Sigma} \phi_1 \quad \text{subject to} \; \sum^p_{j=1} \phi_{1j}^2 = 1\]

where $\hat{\Sigma} = X^T X / n$, the sample covariance matrix.

Therefore, $\hat{\phi}_1$ is the eigenvector of $\hat{\Sigma}$ associated with the largest eigenvalue!
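The full estimation recipe above (center the columns, form $\hat{\Sigma} = X^T X / n$, take the top eigenvector) can be sketched on synthetic data. The mixing matrix below is an arbitrary choice just to make the columns correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: n=500 samples, p=3 correlated features (illustration only).
n = 500
X = rng.normal(size=(n, 3)) @ np.array([[2.0, 0.0, 0.0],
                                        [0.8, 1.0, 0.0],
                                        [0.3, 0.2, 0.5]])
X = X - X.mean(axis=0)          # center each column

Sigma_hat = X.T @ X / n         # sample covariance matrix
d, U = np.linalg.eigh(Sigma_hat)
phi1_hat = U[:, -1]             # eigenvector of the largest eigenvalue

z1 = X @ phi1_hat               # scores of the first principal component
# The sample variance of the scores equals the largest eigenvalue.
assert np.isclose(z1 @ z1 / n, d[-1])
```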


Second and more Principal Components

The <second principal component> $Z_2$ is the normalized linear combination of $X_1, \dots, X_p$ that has maximal variance out of all normalized linear combinations that are uncorrelated with $Z_1$.

The <second principal component> can be found just as easily as $Z_1$: its loading vector $\phi_2$ is simply the eigenvector associated with the second largest eigenvalue!! (Eigenvectors of a symmetric matrix are mutually orthogonal, which is what makes $Z_2$ uncorrelated with $Z_1$.)

The same holds for the $k$-th principal component: just use the eigenvector of the $k$-th largest eigenvalue!!


PCA Metric

<PCA>์—์„œ๋Š” <PVE; Proportion of Variance Explained> of the m-th principal component๋ผ๋Š” metric๋ฅผ ์ •์˜ํ•ด ์‚ฌ์šฉํ•œ๋‹ค.

\[\frac{\sum^n_{i=1} z_{im}^2}{\sum^p_{j=1} \sum^n_{i=1} x_{ij}^2} = \frac{\sum^n_{i=1} z_{im}^2 / n}{\sum^p_{j=1} \sum^n_{i=1} x_{ij}^2 / n} = \frac{\text{Var}(Z_m)}{\sum^p_{j=1} \text{Var}(X_j)}\]

์ด๋•Œ, <PVE> ๊ฐ’์„ ๋ณด๊ณ , ๋ช‡๊ฐœ์˜ principal component๋ฅผ ์‚ฌ์šฉํ•  ๊ฒƒ์ธ์ง€ ๊ฒฐ์ •ํ•œ๋‹ค!


PCA: Another Interpretation


Sparse Principal Components

