L03 Dimensionality Reduction

Materials are adapted from Murphy, Kevin P. Probabilistic Machine Learning: An Introduction. MIT Press, 2022. This handout is only for teaching. DO NOT DISTRIBUTE.

Principal components analysis (PCA)

Examples

Derivation of the algorithm

We seek to minimize the average reconstruction error, where $\mathbf{X} \in \mathbb{R}^{N \times D}$ is the (centered) data matrix, $\mathbf{Z} \in \mathbb{R}^{N \times L}$ stores the low-dimensional codes $\boldsymbol{z}_{n} \in \mathbb{R}^{L}$, and $\mathbf{W} \in \mathbb{R}^{D \times L}$ has orthonormal columns:

$$\mathcal{L}(\mathbf{W}, \mathbf{Z})=\frac{1}{N}\left\|\mathbf{X}-\mathbf{Z} \mathbf{W}^{\top}\right\|_{F}^{2}=\frac{1}{N}\left\|\mathbf{X}^{\top}-\mathbf{W} \mathbf{Z}^{\top}\right\|_{F}^{2}=\frac{1}{N} \sum_{n=1}^{N}\left\|\boldsymbol{x}_{n}-\mathbf{W} \boldsymbol{z}_{n}\right\|^{2}$$

Base case

Consider the one-dimensional case first, with a single basis vector $\boldsymbol{w}_{1} \in \mathbb{R}^{D}$.

Let the coefficients for each of the data points associated with the first basis vector be denoted by $\tilde{\mathbf{z}}_{1}=\left[z_{11}, \ldots, z_{N 1}\right] \in \mathbb{R}^{N}$. The reconstruction error is given by
$$\begin{aligned} \mathcal{L}\left(\boldsymbol{w}_{1}, \tilde{\mathbf{z}}_{1}\right) &=\frac{1}{N} \sum_{n=1}^{N}\left\|\boldsymbol{x}_{n}-z_{n 1} \boldsymbol{w}_{1}\right\|^{2}=\frac{1}{N} \sum_{n=1}^{N}\left(\boldsymbol{x}_{n}-z_{n 1} \boldsymbol{w}_{1}\right)^{\top}\left(\boldsymbol{x}_{n}-z_{n 1} \boldsymbol{w}_{1}\right)\\ &=\frac{1}{N} \sum_{n=1}^{N}\left[\boldsymbol{x}_{n}^{\top} \boldsymbol{x}_{n}-2 z_{n 1} \boldsymbol{w}_{1}^{\top} \boldsymbol{x}_{n}+z_{n 1}^{2} \boldsymbol{w}_{1}^{\top} \boldsymbol{w}_{1}\right] \\ &=\frac{1}{N} \sum_{n=1}^{N}\left[\boldsymbol{x}_{n}^{\top} \boldsymbol{x}_{n}-2 z_{n 1} \boldsymbol{w}_{1}^{\top} \boldsymbol{x}_{n}+z_{n 1}^{2}\right] \end{aligned}$$

where the last step uses the unit-norm assumption $\boldsymbol{w}_{1}^{\top} \boldsymbol{w}_{1}=1$.

Taking derivatives wrt $z_{n 1}$ and equating to zero gives

$$\frac{\partial}{\partial z_{n 1}} \mathcal{L}\left(\boldsymbol{w}_{1}, \tilde{\mathbf{z}}_{1}\right)=\frac{1}{N}\left[-2 \boldsymbol{w}_{1}^{\top} \boldsymbol{x}_{n}+2 z_{n 1}\right]=0 \Rightarrow z_{n 1}=\boldsymbol{w}_{1}^{\top} \boldsymbol{x}_{n}$$

Plugging this back in gives the loss for the weights:

$$\mathcal{L}\left(\boldsymbol{w}_{1}\right)=\mathcal{L}\left(\boldsymbol{w}_{1}, \tilde{\mathbf{z}}_{1}^{*}\left(\boldsymbol{w}_{1}\right)\right)=\frac{1}{N} \sum_{n=1}^{N}\left[\boldsymbol{x}_{n}^{\top} \boldsymbol{x}_{n}-z_{n 1}^{2}\right]=\mathrm{const}-\frac{1}{N} \sum_{n=1}^{N} z_{n 1}^{2}$$
To solve for $\boldsymbol{w}_{1}$, note that (dropping the constant term)
$$\mathcal{L}\left(\boldsymbol{w}_{1}\right)=-\frac{1}{N} \sum_{n=1}^{N} z_{n 1}^{2}=-\frac{1}{N} \sum_{n=1}^{N} \boldsymbol{w}_{1}^{\top} \boldsymbol{x}_{n} \boldsymbol{x}_{n}^{\top} \boldsymbol{w}_{1}=-\boldsymbol{w}_{1}^{\top} \hat{\boldsymbol{\Sigma}} \boldsymbol{w}_{1}$$
where $\hat{\boldsymbol{\Sigma}}$ is the empirical covariance matrix (since we assumed the data is centered). We can trivially optimize this by letting $\left\|\boldsymbol{w}_{1}\right\| \rightarrow \infty$, so we impose the constraint $\left\|\boldsymbol{w}_{1}\right\|=1$ and instead optimize
$$\tilde{\mathcal{L}}\left(\boldsymbol{w}_{1}\right)=\boldsymbol{w}_{1}^{\top} \hat{\boldsymbol{\Sigma}} \boldsymbol{w}_{1}-\lambda_{1}\left(\boldsymbol{w}_{1}^{\top} \boldsymbol{w}_{1}-1\right)$$

where $\lambda_{1}$ is a Lagrange multiplier. Taking derivatives and equating to zero we have

$$\begin{aligned} \frac{\partial}{\partial \boldsymbol{w}_{1}} \tilde{\mathcal{L}}\left(\boldsymbol{w}_{1}\right) &=2 \hat{\boldsymbol{\Sigma}} \boldsymbol{w}_{1}-2 \lambda_{1} \boldsymbol{w}_{1}=0 \\ \hat{\boldsymbol{\Sigma}} \boldsymbol{w}_{1} &=\lambda_{1} \boldsymbol{w}_{1} \end{aligned}$$

Hence the optimal direction onto which we should project the data is an eigenvector of the covariance matrix. Left multiplying by $\boldsymbol{w}_{1}^{\top}$ (and using $\boldsymbol{w}_{1}^{\top} \boldsymbol{w}_{1}=1$) we find

$$\boldsymbol{w}_{1}^{\top} \hat{\boldsymbol{\Sigma}} \boldsymbol{w}_{1}=\lambda_{1}$$
Since we want to maximize this quantity (minimize the loss), we pick the eigenvector which corresponds to the largest eigenvalue.
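To make the base case concrete, here is a minimal NumPy sketch (the toy data and variable names are illustrative) that computes $\boldsymbol{w}_{1}$ as the top eigenvector of $\hat{\boldsymbol{\Sigma}}$ and checks that the projected variance equals $\lambda_{1}$:

```python
import numpy as np

# Toy centered data; the derivation above assumes the data has zero mean.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X = X - X.mean(axis=0)

Sigma_hat = (X.T @ X) / X.shape[0]        # empirical covariance (1/N) sum_n x_n x_n^T
evals, evecs = np.linalg.eigh(Sigma_hat)  # eigh returns eigenvalues in ascending order
w1 = evecs[:, -1]                         # eigenvector with the largest eigenvalue

z1 = X @ w1                               # optimal codes z_{n1} = w1^T x_n
X_hat = np.outer(z1, w1)                  # rank-1 reconstruction z_{n1} w1
recon_err = np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Projected variance (1/N) sum_n z_{n1}^2 equals w1^T Sigma_hat w1 = lambda_1.
print(recon_err, evals[-1], z1 @ z1 / X.shape[0])
```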

Optimal weight vector maximizes the variance of the projected data

Induction step

Now let us find another direction $\boldsymbol{w}_{2}$ to further minimize the reconstruction error, subject to $\boldsymbol{w}_{1}^{\top} \boldsymbol{w}_{2}=0$ and $\boldsymbol{w}_{2}^{\top} \boldsymbol{w}_{2}=1$. The error is

$$\mathcal{L}\left(\boldsymbol{w}_{1}, \tilde{\mathbf{z}}_{1}, \boldsymbol{w}_{2}, \tilde{\mathbf{z}}_{2}\right)=\frac{1}{N} \sum_{n=1}^{N}\left\|\boldsymbol{x}_{n}-z_{n 1} \boldsymbol{w}_{1}-z_{n 2} \boldsymbol{w}_{2}\right\|^{2}$$

Optimizing wrt $\boldsymbol{w}_{1}$ and $\tilde{\mathbf{z}}_{1}$ gives the same solution as before; optimizing wrt $\boldsymbol{w}_{2}$ and $\tilde{\mathbf{z}}_{2}$ shows that $z_{n 2}=\boldsymbol{w}_{2}^{\top} \boldsymbol{x}_{n}$ and that $\boldsymbol{w}_{2}$ is the eigenvector of $\hat{\boldsymbol{\Sigma}}$ with the second largest eigenvalue. The argument proceeds inductively: the optimal $L$-dimensional basis is given by the top $L$ eigenvectors of $\hat{\boldsymbol{\Sigma}}$.
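As a sketch of the resulting procedure (the helper names pca_fit and pca_reconstruct are hypothetical, and centering is handled explicitly via the data mean), the top-$L$ eigenvectors of the empirical covariance form the columns of $\mathbf{W}$:

```python
import numpy as np

def pca_fit(X, L):
    """Return the top-L eigenvectors of the empirical covariance (columns of W) and the mean."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma_hat = (Xc.T @ Xc) / Xc.shape[0]
    evals, evecs = np.linalg.eigh(Sigma_hat)
    order = np.argsort(evals)[::-1]        # eigenvalues in decreasing order
    return evecs[:, order[:L]], mu

def pca_reconstruct(X, W, mu):
    Z = (X - mu) @ W                       # codes z_n = W^T (x_n - mu)
    return Z @ W.T + mu                    # reconstructions x_hat_n = W z_n + mu

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
W, mu = pca_fit(X, L=2)
print(np.round(W.T @ W, 6))                # ~ identity: w_1, w_2 are orthonormal
X_hat = pca_reconstruct(X, W, mu)
```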

Computational issues

Covariance matrix vs correlation matrix

Dealing with high-dimensional data

When $D \gg N$ (more dimensions than examples), it is cheaper to eigendecompose the $N \times N$ Gram matrix $\mathbf{X} \mathbf{X}^{\top}$ than the $D \times D$ covariance matrix:

$$\left(\mathbf{X} \mathbf{X}^{\top}\right) \mathbf{U}=\mathbf{U} \boldsymbol{\Lambda}$$

The principal directions can then be recovered as $\mathbf{V}=\mathbf{X}^{\top} \mathbf{U} \boldsymbol{\Lambda}^{-\frac{1}{2}}$, whose columns are unit-norm eigenvectors of $\mathbf{X}^{\top} \mathbf{X}$.
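A sketch of this trick (toy data and variable names are illustrative): eigendecompose the small Gram matrix, then map its eigenvectors back to feature space.

```python
import numpy as np

# Toy "wide" data: many more dimensions than examples (D >> N).
rng = np.random.default_rng(2)
N, D, L = 50, 2000, 5
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)

# Eigendecompose the N x N Gram matrix instead of the D x D covariance.
lam, U = np.linalg.eigh(X @ X.T)           # (X X^T) U = U Lambda
order = np.argsort(lam)[::-1]              # eigenvalues in decreasing order
lam, U = lam[order], U[:, order]

V = X.T @ U[:, :L] / np.sqrt(lam[:L])      # V = X^T U Lambda^{-1/2}

# Check: the columns of V are unit-norm eigenvectors of X^T X.
print(np.allclose(X.T @ (X @ V), V * lam[:L], atol=1e-6))
print(np.allclose(np.linalg.norm(V, axis=0), 1.0))
```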

Choosing the number of latent dimensions

Reconstruction error

$$\mathcal{L}_{L}=\frac{1}{|\mathcal{D}|} \sum_{n \in \mathcal{D}}\left\|\boldsymbol{x}_{n}-\hat{\boldsymbol{x}}_{n}\right\|^{2}$$

where $\hat{\boldsymbol{x}}_{n}$ is the reconstruction of $\boldsymbol{x}_{n}$ from its $L$-dimensional code.
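A small sketch (the helper recon_error and the toy data are hypothetical) that evaluates this metric on training and held-out data for several values of $L$:

```python
import numpy as np

def recon_error(X, W, mu):
    """(1/|D|) sum_n ||x_n - x_hat_n||^2 with x_hat_n = W W^T (x_n - mu) + mu."""
    Xc = X - mu
    X_hat = (Xc @ W) @ W.T + mu
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 20))              # mixing matrix to create correlated features
X_train = rng.normal(size=(300, 20)) @ A
X_test = rng.normal(size=(100, 20)) @ A

mu = X_train.mean(axis=0)
Sigma_hat = np.cov(X_train, rowvar=False, bias=True)
evals, evecs = np.linalg.eigh(Sigma_hat)
evecs = evecs[:, ::-1]                     # columns sorted by decreasing eigenvalue

for L in (1, 5, 10, 20):
    W = evecs[:, :L]
    print(L, recon_error(X_train, W, mu), recon_error(X_test, W, mu))
```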

Scree plots

For PCA, the reconstruction error on the training set equals the sum of the discarded eigenvalues:

$$\mathcal{L}_{\lambda}=\sum_{j=L+1}^{D} \lambda_{j}$$

A related criterion is the fraction of variance explained,

$$F_{L}=\frac{\sum_{j=1}^{L} \lambda_{j}}{\sum_{j^{\prime}=1}^{L^{\max }} \lambda_{j^{\prime}}}$$
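A short sketch (toy data; the 0.9 threshold is just an example) computing both quantities from the eigenvalue spectrum:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(20, 20))
X = rng.normal(size=(500, 20)) @ A
evals = np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True))[::-1]   # descending

L_max = len(evals)
residual = np.array([evals[L:].sum() for L in range(L_max + 1)])   # L_lambda for L = 0..L_max
frac_explained = np.cumsum(evals) / evals.sum()                    # F_L for L = 1..L_max

# e.g. pick the smallest L whose fraction of variance explained reaches 0.9
L_star = int(np.searchsorted(frac_explained, 0.9)) + 1
print(L_star, frac_explained[:L_star])
```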

Profile likelihood

Partition the eigenvalues $\lambda_{1: L^{\max }}$ into two groups, depending on whether $k \leq L$ or $k>L$, and model each group with a Gaussian that has its own mean and a shared variance:

$$\begin{aligned} \mu_{1}(L) &=\frac{\sum_{k \leq L} \lambda_{k}}{L} \\ \mu_{2}(L) &=\frac{\sum_{k>L} \lambda_{k}}{L^{\max }-L} \\ \sigma^{2}(L) &=\frac{\sum_{k \leq L}\left(\lambda_{k}-\mu_{1}(L)\right)^{2}+\sum_{k>L}\left(\lambda_{k}-\mu_{2}(L)\right)^{2}}{L^{\max }} \end{aligned}$$
We can then evaluate the profile log likelihood
$$\ell(L)=\sum_{k=1}^{L} \log \mathcal{N}\left(\lambda_{k} \mid \mu_{1}(L), \sigma^{2}(L)\right)+\sum_{k=L+1}^{L^{\max }} \log \mathcal{N}\left(\lambda_{k} \mid \mu_{2}(L), \sigma^{2}(L)\right)$$

$$L^{*}=\underset{L}{\operatorname{argmax}}\ \ell(L)$$
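A sketch of this procedure (the helper name profile_loglik is illustrative; scipy.stats.norm supplies the Gaussian log density): for each candidate $L$ it computes the two group means, the pooled variance, and the resulting log likelihood.

```python
import numpy as np
from scipy.stats import norm

def profile_loglik(evals):
    """Profile log likelihood l(L) for L = 1, ..., L_max - 1 (evals sorted descending)."""
    lam = np.asarray(evals, dtype=float)
    L_max = len(lam)
    scores = []
    for L in range(1, L_max):
        mu1 = lam[:L].mean()
        mu2 = lam[L:].mean()
        # pooled variance over both groups, as in the formula above
        sigma2 = (np.sum((lam[:L] - mu1) ** 2) + np.sum((lam[L:] - mu2) ** 2)) / L_max
        sigma = np.sqrt(max(sigma2, 1e-12))          # guard against a degenerate variance
        ll = norm.logpdf(lam[:L], mu1, sigma).sum() + norm.logpdf(lam[L:], mu2, sigma).sum()
        scores.append(ll)
    return np.array(scores)

# Toy spectrum with an obvious gap after the third eigenvalue.
evals = np.array([10.0, 8.0, 6.0, 0.5, 0.4, 0.3, 0.2, 0.1])
L_star = int(np.argmax(profile_loglik(evals))) + 1   # argmax_L l(L)
print(L_star)                                        # -> 3 for this toy spectrum
```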

Factor analysis *

Generative model

Probabilistic PCA

The MLE for the weights is $\mathbf{W}_{\mathrm{mle}}=\mathbf{U}_{L}\left(\mathbf{L}_{L}-\sigma^{2} \mathbf{I}\right)^{\frac{1}{2}} \mathbf{R}$, where $\mathbf{U}_{L}$ is a $D \times L$ matrix whose columns are given by the $L$ eigenvectors of $\mathbf{S}$ (the empirical covariance matrix) with the largest eigenvalues, $\mathbf{L}_{L}$ is the $L \times L$ diagonal matrix of those eigenvalues, and $\mathbf{R}$ is an arbitrary $L \times L$ orthogonal matrix, which (WLOG) we can take to be $\mathbf{R}=\mathbf{I}$. In the noise-free limit, where $\sigma^{2}=0$, we see that $\mathbf{W}_{\mathrm{mle}}=\mathbf{U}_{L} \mathbf{L}_{L}^{\frac{1}{2}}$, which is proportional to the PCA solution.
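A minimal sketch of this closed-form solution (the helper name ppca_mle and the toy data are illustrative), taking $\mathbf{R}=\mathbf{I}$ and using the standard noise-variance MLE $\sigma_{\mathrm{mle}}^{2}=\frac{1}{D-L} \sum_{j=L+1}^{D} \lambda_{j}$:

```python
import numpy as np

def ppca_mle(X, L):
    """Closed-form MLE for probabilistic PCA (Tipping & Bishop), with R = I."""
    S = np.cov(X, rowvar=False, bias=True)           # empirical covariance S
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[L:].mean()                        # average of the discarded eigenvalues
    W = evecs[:, :L] @ np.diag(np.sqrt(evals[:L] - sigma2))   # U_L (L_L - sigma^2 I)^{1/2}
    return W, sigma2

# Toy data from a 3-dimensional linear-Gaussian model with noise std 0.1.
rng = np.random.default_rng(5)
Z = rng.normal(size=(1000, 3))
W_true = rng.normal(size=(10, 3))
X = Z @ W_true.T + 0.1 * rng.normal(size=(1000, 10))

W_mle, sigma2_mle = ppca_mle(X, L=3)
print(sigma2_mle)                                    # close to the true noise variance 0.01
```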

Factor analysis models for paired data

Supervised PCA

Partial least squares

Canonical correlation analysis

Autoencoders