$$ \newcommand{\RR}{\mathbb{R}} \newcommand{\GG}{\mathbb{G}} \newcommand{\PP}{\mathbb{P}} \newcommand{\PS}{\mathcal{P}} \newcommand{\SS}{\mathbb{S}} \newcommand{\NN}{\mathbb{N}} \newcommand{\ZZ}{\mathbb{Z}} \newcommand{\CC}{\mathbb{C}} \newcommand{\HH}{\mathbb{H}} \newcommand{\ones}{\mathbb{1\hspace{-0.4em}1}} \newcommand{\alg}[1]{\mathfrak{#1}} \newcommand{\mat}[1]{ \begin{pmatrix} #1 \end{pmatrix} } \renewcommand{\bar}{\overline} \renewcommand{\hat}{\widehat} \renewcommand{\tilde}{\widetilde} \newcommand{\inv}[1]{ {#1}^{-1} } \newcommand{\eqdef}{\overset{\text{def}}=} \newcommand{\block}[1]{\left(#1\right)} \newcommand{\set}[1]{\left\{#1\right\}} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\trace}[1]{\mathrm{tr}\block{#1}} \newcommand{\norm}[1]{ \left\| #1 \right\| } \newcommand{\argmin}[1]{ \underset{#1}{\mathrm{argmin}} } \newcommand{\argmax}[1]{ \underset{#1}{\mathrm{argmax}} } \newcommand{\st}{\ \mathrm{s.t.}\ } \newcommand{\sign}[1]{\mathrm{sign}\block{#1}} \newcommand{\half}{\frac{1}{2}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\dd}{\mathrm{d}} \newcommand{\ddd}[2]{\frac{\partial #1}{\partial #2} } \newcommand{\db}{\dd^b} \newcommand{\ds}{\dd^s} \newcommand{\dL}{\dd_L} \newcommand{\dR}{\dd_R} \newcommand{\Ad}{\mathrm{Ad}} \newcommand{\ad}{\mathrm{ad}} \newcommand{\LL}{\mathcal{L}} \newcommand{\Krylov}{\mathcal{K}} \newcommand{\Span}[1]{\mathrm{Span}\block{#1}} \newcommand{\diag}{\mathrm{diag}} \newcommand{\tr}{\mathrm{tr}} \newcommand{\sinc}{\mathrm{sinc}} \newcommand{\cat}[1]{\mathcal{#1}} \newcommand{\Ob}[1]{\mathrm{Ob}\block{\cat{#1}}} \newcommand{\Hom}[1]{\mathrm{Hom}\block{\cat{#1}}} \newcommand{\op}[1]{\cat{#1}^{op}} \newcommand{\hom}[2]{\cat{#1}\block{#2}} \newcommand{\id}{\mathrm{id}} \newcommand{\Set}{\mathbb{Set}} \newcommand{\Cat}{\mathbb{Cat}} \newcommand{\Hask}{\mathbb{Hask}} \newcommand{\lim}{\mathrm{lim}\ } \newcommand{\funcat}[1]{\left[\cat{#1}\right]} \newcommand{\natsq}[6]{ \begin{matrix} & #2\block{#4} & \overset{#2\block{#6}}\longrightarrow & #2\block{#5} & \\ {#1}_{#4} \hspace{-1.5em} &\downarrow & & \downarrow & \hspace{-1.5em} {#1}_{#5}\\ & #3\block{#4} & \underset{#3\block{#6}}\longrightarrow & #3\block{#5} & \\ \end{matrix} } \newcommand{\comtri}[6]{ \begin{matrix} #1 & \overset{#4}\longrightarrow & #2 & \\ #6 \hspace{-1em} & \searrow & \downarrow & \hspace{-1em} #5 \\ & & #3 & \end{matrix} } \newcommand{\natism}[6]{ \begin{matrix} & #2\block{#4} & \overset{#2\block{#6}}\longrightarrow & #2\block{#5} & \\ {#1}_{#4} \hspace{-1.5em} &\downarrow \uparrow & & \downarrow \uparrow & \hspace{-1.5em} {#1}_{#5}\\ & #3\block{#4} & \underset{#3\block{#6}}\longrightarrow & #3\block{#5} & \\ \end{matrix} } \newcommand{\cone}[1]{\mathcal{#1}} $$

Principal Component Analysis

A geometric take on PCA.

  1. First Component
  2. Eigen Decomposition
  3. Induction
  4. Summary
  5. Symmetries
    1. Reflections
    2. Rotations
    3. Non-standard Metric
  6. Pulling back Principal Components
  7. Dual Metric
    1. Partial Least Squares
  8. Notes & References

First Component

We consider an \(n\)-dimensional Euclidean space \(E\) with inner product matrix \(M\), and \(m\) samples \(x_i \in E\), arranged as the rows of a matrix \(X\). The first principal component \(w_1\) maximizes the projected variance of the data:

\[\begin{align} w_1 &= \argmax{\norm{w}_M = 1} \quad \sum_i \block{x_i^TMw}^2 \\ &= \argmax{\norm{w}_M = 1} \quad \norm{XMw}^2 \\ &= \argmax{\norm{w}_M = 1} \quad w^T MX^TXM w \\ \end{align}\]

The first-order stationarity conditions for the Lagrangian \(w^T MX^TXM w - \lambda\block{w^T M w - 1}\) give:

\[\exists \lambda \in \RR,\quad MX^TXMw = \lambda Mw, \quad \norm{w}_M = 1\]

In other words, multiplying by \(\inv{M}\) on the left, \(w\) is an eigenvector of \(X^TXM\); and since the maximized objective is \(w^T MX^TXM w = \lambda w^T M w = \lambda\), \(w\) is the eigenvector corresponding to the largest eigenvalue.
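As a quick sanity check, here is a minimal numerical sketch (assuming NumPy, random placeholder data and an arbitrary positive-definite metric \(M\)) that extracts \(w_1\) as the top eigenvector of \(X^TXM\) and verifies that it maximizes the projected variance:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 4                        # samples, dimensions (arbitrary)
X = rng.normal(size=(m, n))          # samples in rows
A = rng.normal(size=(n, n))
M = A @ A.T + n * np.eye(n)          # some positive-definite inner product matrix

# stationarity condition: w1 is the top eigenvector of X^T X M
vals, vecs = np.linalg.eig(X.T @ X @ M)
vals, vecs = vals.real, vecs.real    # X^T X M is similar to a symmetric PSD matrix
w1 = vecs[:, np.argmax(vals)]
w1 /= np.sqrt(w1 @ M @ w1)           # normalize in the M-norm

def projected_variance(w):
    return np.sum((X @ M @ w) ** 2)

# w1 beats random M-unit directions
for _ in range(1000):
    w = rng.normal(size=n)
    w /= np.sqrt(w @ M @ w)
    assert projected_variance(w) <= projected_variance(w1) + 1e-9
```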

Eigen Decomposition

More precisely, let \(M = L L^T\) be the Cholesky decomposition of \(M\), and \(L^T X^T X L = USU^T\) the eigendecomposition of \(L^T X^TX L\), then we have the following diagonalization:

\[X^TXM = X^T X L L^T = L^{-T}USU^TL^T = P S \inv{P}\]

for an eigenvector basis \(P = L^{-T}U\). Note that the basis \(P\) is \(M\)-orthonormal:

\[P^T M P = I\]

so that its inverse is the \(M\)-orthogonal projection on \(P\):

\[\inv{P} = P^T M\]

Finally, the covariance in basis \(P\) is diagonal:

\[\inv{P} X^T X P^{-T} = U^T L^T X^T X L U = S\]
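The construction above is easy to check numerically; the following sketch (NumPy assumed, random placeholder data) builds \(P = L^{-T}U\) and verifies the identities:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 4
X = rng.normal(size=(m, n))
A = rng.normal(size=(n, n))
M = A @ A.T + n * np.eye(n)                  # positive-definite metric

L = np.linalg.cholesky(M)                    # M = L L^T
S, U = np.linalg.eigh(L.T @ X.T @ X @ L)     # L^T X^T X L = U S U^T (S ascending)
P = np.linalg.solve(L.T, U)                  # P = L^{-T} U

# P is M-orthonormal, hence P^{-1} = P^T M
assert np.allclose(P.T @ M @ P, np.eye(n))
Pinv = P.T @ M
assert np.allclose(Pinv @ P, np.eye(n))

# the covariance is diagonal in basis P: P^{-1} X^T X P^{-T} = S
assert np.allclose(Pinv @ X.T @ X @ Pinv.T, np.diag(S))
```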

Induction

Letting \(X_1 = X\), subsequent components are found by projecting out the current component:

\[X_k = X_{k-1} - X_{k-1}Mw_{k-1} w_{k-1}^T\] \[w_k = \argmax{\norm{w}_M = 1} \quad \quad w^T MX_{k}^TX_{k}M w\]

For \(w_2\), a simple derivation shows that:

\[X_2^T X_2 M = X^T X M - s_1 p_1 p_1^T M\]

where \(p_1 = w_1\) is the eigenvector associated with the largest eigenvalue \(s_1\). One can also show that:

\[s_1 p_1 p_1^T M = P \mat{s_1 & \\ & 0 } \inv{P}\]

so that \(X_2^T X_2 M = P S_2 \inv{P}\), where \(S_2\) replaces the largest eigenvalue in \(S\) by zero. Hence, the second principal component is the eigenvector associated with the second largest eigenvalue.

A similar derivation shows by induction that the principal components are the eigenvectors corresponding to decreasing eigenvalues.
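A small numerical check of the deflation step (NumPy assumed, random placeholder data; the standard metric \(M = I\) keeps the sketch short):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 4
X1 = rng.normal(size=(m, n))
M = np.eye(n)                                 # standard metric for brevity

S, P = np.linalg.eigh(X1.T @ X1 @ M)          # symmetric here since M = I
order = np.argsort(S)[::-1]                   # decreasing eigenvalues
S, P = S[order], P[:, order]

w1 = P[:, 0]
X2 = X1 - X1 @ M @ np.outer(w1, w1)           # project out the first component

# X2^T X2 M = P S2 P^{-1}, with the largest eigenvalue replaced by zero
S2 = S.copy()
S2[0] = 0.0
assert np.allclose(X2.T @ X2 @ M, P @ np.diag(S2) @ np.linalg.inv(P))

# so the top eigenvector of the deflated covariance is the second component
vals, vecs = np.linalg.eigh(X2.T @ X2 @ M)
w2 = vecs[:, np.argmax(vals)]
assert np.isclose(abs(w2 @ P[:, 1]), 1.0)
```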

Summary

PCA computes an \(M\)-orthonormal basis \(P\) in which the samples have a diagonal covariance matrix:

\[\inv{P} X^T X P^{-T} = S\]

It is found by diagonalizing \(X^T X M = P S \inv{P}\), or using \(P = L^{-T}U\), where \(M = L L^T\) and \(L^T X^T X L = USU^T\).
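Packaged as a function, the whole procedure fits in a few lines. This is only a sketch, with `metric_pca` a hypothetical helper name (NumPy assumed):

```python
import numpy as np

def metric_pca(X, M):
    """PCA of the samples in the rows of X, for the inner product matrix M.

    Returns an M-orthonormal basis P (columns) together with the projected
    variances S, sorted in decreasing order, so that P^{-1} X^T X P^{-T} = diag(S).
    """
    L = np.linalg.cholesky(M)                  # M = L L^T
    S, U = np.linalg.eigh(L.T @ X.T @ X @ L)   # L^T X^T X L = U S U^T
    order = np.argsort(S)[::-1]                # decreasing eigenvalues
    return np.linalg.solve(L.T, U[:, order]), S[order]   # P = L^{-T} U
```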

Symmetries

We now examine how symmetries in the data set influence the PCA.

Reflections

Let us assume our data set \(\Omega\) has a plane symmetry, i.e. \(\Omega\) is invariant by a plane reflection:

\[\Omega = G(\Omega)\]

More precisely, \(G\) implements the reflection by some unit normal vector \(n\) by flipping the normal component:

\[G = \underbrace{\block{I - nn^T}}_{\text{tangent}} - \underbrace{n n^T}_{\text{normal}} = \block{I - 2nn^T}\]

Some immediate properties of \(G\): it is symmetric (\(G^T = G\)) and involutive (\(G^2 = I\), hence orthogonal), it flips the normal (\(Gn = -n\)) and leaves tangent vectors unchanged (\(Gt = t\) whenever \(n^T t = 0\)).

Since the data set is invariant by \(G\), the covariance matrix \(C = X^T X\) is left unchanged when applying our reflection \(G\) to the data (we just process the same points in a different order when computing \(C = \sum_i x_i x_i^T\)), so that the transformed covariance satisfies:

\[G C G^T = C\]

We immediately get:

\[Cn = G C G^T n = G C G n = - G C n\]

from which we obtain, replacing \(G\) with its definition:

\[Cn = 2 nn^TCn - Cn\]

or, equivalently:

\[Cn = n (n^T C n)\]

and \(n\) is an eigenvector of the covariance matrix, with eigenvalue equal to the projected variance on \(n\).
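A numerical illustration (NumPy assumed, random placeholder data): symmetrize a random cloud across a random plane, and check that the plane normal is indeed an eigenvector of the covariance:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
v = rng.normal(size=d)
n = v / np.linalg.norm(v)                     # unit normal of the symmetry plane
G = np.eye(d) - 2 * np.outer(n, n)            # reflection across that plane

# a reflection-symmetric data set: each sample together with its mirror image
Y = rng.normal(size=(500, d))
X = np.vstack([Y, Y @ G.T])                   # G(Omega) = Omega

C = X.T @ X
# n is an eigenvector of C, with eigenvalue equal to the projected variance on n
assert np.allclose(C @ n, (n @ C @ n) * n)
```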

Rotations

Now, let us assume \(\Omega\) has a rotation symmetry: \(R(\Omega) = \Omega\) for some rotation matrix \(R\). Then the rotated covariance satisfies:

\[RCR^T = C\]

Let \(u\) be an eigenvector with eigenvalue \(\lambda \geq 0\), i.e. \(Cu = \lambda u\), and let \(v = R^T u\), so that \(Rv = u\) and \(CRv = \lambda Rv\). Using \(RCR^T = C\), this means:

\[RCv = RCR^TRv = CRv = \lambda Rv\]

and multiplying by \(R^T\) on the left, \(Cv = \lambda v\): \(v = R^T u\) is also an eigenvector with eigenvalue \(\lambda\). In other words, the eigenspaces of \(C\) associated with a given eigenvalue are also stable by \(R\). Since a rotation plane of \(R\) (for a rotation angle other than \(0\) or \(\pi\)) contains no proper stable subspace, each such plane must lie inside a single eigenspace of \(C\): the two corresponding eigenvalues coincide and, in three dimensions, the rotation axis is an eigenvector of \(C\).
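Again a quick numerical check (NumPy assumed, random placeholder data): symmetrize a random 3D cloud under a five-fold rotation about the \(z\) axis, and verify that the axis is an eigenvector of \(C\) while the two in-plane eigenvalues coincide:

```python
import numpy as np

rng = np.random.default_rng(4)

theta = 2 * np.pi / 5                          # five-fold rotation about the z axis
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s, 0.0],
              [s,  c, 0.0],
              [0.0, 0.0, 1.0]])

# a rotation-symmetric data set: the orbit of each sample under powers of R
Y = rng.normal(size=(200, 3)) * np.array([1.0, 2.0, 3.0])
X = np.vstack([Y @ np.linalg.matrix_power(R, k).T for k in range(5)])

C = X.T @ X
axis = np.array([0.0, 0.0, 1.0])

# the rotation axis is an eigenvector of C ...
assert np.allclose(C @ axis, (axis @ C @ axis) * axis)

# ... and the two eigenvalues of the rotation plane coincide
vals, vecs = np.linalg.eigh(C)
plane = [v for v, u in zip(vals, vecs.T) if abs(u @ axis) < 1e-6]
assert len(plane) == 2 and np.isclose(plane[0], plane[1])
```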

Non-standard Metric

The above results naturally extend to a non-standard metric, provided we consider \(M\)-orthogonal rotations and reflections. (TODO show it)

Pulling back Principal Components

Let us now suppose that we have a set of samples \(X\) in some output space \(F\), together with a linear mapping \(P: E \to F,\ w \mapsto Pw\), from some input space \(E\). We look for the direction of the input space \(E\) that maximizes the projected variance in the output space \(F\):

\[\argmax{\norm{w}=1} \quad \sum_i \block{x_i^T Pw}^2 = \norm{XPw}^2 = w^T P^T X^T X P w\]

We see that this amounts to diagonalizing the covariance matrix \(P^T X^T X P\), which is the pullback of output covariance \(X^T X\) by \(P\).
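A sketch of the pullback computation (NumPy assumed; the map \(P\) and the data are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(5)
m, dim_E, dim_F = 200, 3, 6

P = rng.normal(size=(dim_F, dim_E))           # linear map E -> F, x = P w
X = rng.normal(size=(m, dim_F))               # samples in the output space F

# pull the covariance back to E and diagonalize it
vals, vecs = np.linalg.eigh(P.T @ X.T @ X @ P)
w1 = vecs[:, np.argmax(vals)]                 # input direction maximizing output variance

# sanity check against random unit directions in E
for _ in range(1000):
    w = rng.normal(size=dim_E)
    w /= np.linalg.norm(w)
    assert np.sum((X @ P @ w) ** 2) <= np.sum((X @ P @ w1) ** 2) + 1e-9
```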

Dual Metric

So far we only considered a metric \(M\) in the feature space, i.e. the metric used to compare features for two given individuals. However, it is also natural to consider the dual problem of comparing individuals for two given features, for instance in order to give more weight to some individuals than to others, or even to give different weights to arbitrary linear subspaces of individuals. Let us recall the projection equation:

\[\argmax{\norm{w}=1} \quad \sum_i \block{x_i^T w}^2 = \norm{Xw}^2\]

Here we see that the canonical Euclidean norm is used in the right-hand side term, but any Euclidean norm could be used instead: the projection of each sample \(x_i\) onto feature \(w\) gives each individual a score (how similar to feature \(w\) the individual is), and we may choose to measure scores using a different metric \(W\):

\[\argmax{\norm{w}=1} \quad \norm{Xw}_W^2\]

Using such a dual metric yields a correlation matrix \(X^TWX\) (See also Generalized Least Squares which exploits the same idea). In particular, choosing \(W=XX^T\) does not change the eigenvector basis (but it does square the eigenvalues), thus producing the same principal components. More precisely, the metric \(W=XX^T\) takes a score as an input, then produces a feature by weighting all individual features by their score, and finally measures the resulting feature using the (here implicit) feature metric.

But let’s say we also have another set of features \(Y\) over the same individuals: we can then use the metric \(YY^T\) to weight scores when looking for principal components. This means that an \(X\)-feature must produce a score that agrees with the \(Y\)-features to be selected as a principal component. In other words, variations of the \(X\)-features must correspond to variations of the \(Y\)-features.
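As a sketch of this dual-metric idea (NumPy assumed, random placeholder data), choosing \(W = YY^T\) amounts to looking for the \(X\)-direction whose scores align best with the \(Y\)-features, since \(\norm{Xw}_W^2 = \norm{Y^TXw}^2\):

```python
import numpy as np

rng = np.random.default_rng(6)
m = 200
X = rng.normal(size=(m, 5))                   # first feature set
Y = rng.normal(size=(m, 3))                   # second feature set, same individuals

W = Y @ Y.T                                   # dual metric over individuals
C = X.T @ W @ X                               # = (Y^T X)^T (Y^T X)
vals, vecs = np.linalg.eigh(C)
w1 = vecs[:, np.argmax(vals)]                 # X-direction whose scores best match Y

# equivalently, w1 maximizes ||Y^T X w||^2 over unit vectors w
for _ in range(1000):
    w = rng.normal(size=5)
    w /= np.linalg.norm(w)
    assert np.sum((Y.T @ X @ w) ** 2) <= np.sum((Y.T @ X @ w1) ** 2) + 1e-9
```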

Partial Least Squares

Some good introductory material can be found in [1] and [2]. The whole field feels like a mess, with at least three main versions competing with each other.

The scikit-learn page on cross decomposition is also a good start.

Notes & References

  1. http://vision.cse.psu.edu/seminars/talks/PLSpresentation.pdf 

  2. http://users.cecs.anu.edu.au/~kee/pls.pdf 

  3. https://onlinelibrary.wiley.com/doi/pdf/10.1002/cem.1067 

  4. https://onlinelibrary.wiley.com/doi/abs/10.1002/cem.1181