L05 Unsupervised Learning

We will learn

  • Algorithms
    • Dimensionality Reduction: PCA
    • Clustering: K-Means
  • Coding with Python

Challenges

  • Unsupervised learning: a set of features $X_1, X_2, \ldots, X_p$ measured on $n$ observations (without an associated target $Y$)

  • Goal: to discover interesting things about the measurements on $X_1, X_2, \ldots, X_p$

    • Is there an informative way to visualize the data?
    • Can we discover subgroups among the variables or among the observations?
  • The Challenge of Unsupervised Learning

    • the exercise tends to be more subjective
    • there is no simple goal for the analysis, such as prediction of a response

Principal Components Analysis (PCA)

How to visualize high-dimensional data? (the Iris classification example)


  • tabular data with a small number of features: pair plot
  • higher-dimensional data: reduce the dimension first, then visualize the data in 2D or 3D
  • The big idea of PCA: find a low-dimensional representation of a data set that contains as much of the variation as possible

Examples

  • a simple example: projecting from 2D to 1D
  • handwritten digit recognition (28×28 = 784 dimensions to 2D)
  • human face recognition (64×64 = 4096 dimensions to 3D)

A Detailed Example

  • A set of features $X_1, X_2, \ldots, X_p$ ($p$-dimensional)

  • The first principal component (see the NumPy sketch after this list)

    • is the normalized linear combination of the features that has the largest variance:
      $$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p, \qquad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$
    • the loadings of the first principal component: $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$
    • the principal component loading vector: $\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^T$
  • for a specific point $x_i$, the score on the first principal component is
    $$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}$$

  • the most informative direction: the loading vector $\phi_1$ defines the direction in feature space along which the data vary the most

  • the second principal component $Z_2$

    • has maximal variance out of all linear combinations of the features that are uncorrelated with $Z_1$
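
A minimal NumPy sketch of these definitions (an illustration, not part of the original notes; the toy data and the names X, phi1, z1 are mine): center the data, take the first right singular vector as the loading vector $\phi_1$, and compare it with scikit-learn's PCA.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # toy data: n=100, p=3

Xc = X - X.mean(axis=0)                 # PCA assumes mean-centered features
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1 = Vt[0]                            # first loading vector; sum(phi1**2) == 1
z1 = Xc @ phi1                          # scores z_i1: the most variable combination

pca = PCA(n_components=1).fit(X)        # same vector, possibly with flipped sign
print(np.allclose(np.abs(pca.components_[0]), np.abs(phi1)))  # True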

Another Interpretation of Principal Components

Principal components provide low-dimensional linear surfaces that are closest to the observations.

  • the first $M$ principal component score and loading vectors give the best $M$-dimensional approximation (in terms of Euclidean distance) to the $i$th observation:
    $$x_{ij} \approx \sum_{m=1}^{M} z_{im} \phi_{jm}$$

  • the optimization problem (12.6):
    $$\underset{A \in \mathbb{R}^{n \times M},\ B \in \mathbb{R}^{p \times M}}{\text{minimize}} \; \sum_{j=1}^{p} \sum_{i=1}^{n} \left( x_{ij} - \sum_{m=1}^{M} a_{im} b_{jm} \right)^2$$

  • the smallest possible value of the objective in (12.6) is attained at $a_{im} = z_{im}$ and $b_{jm} = \phi_{jm}$

  • the first $M$ principal component loading vectors can give a good approximation to the data when $M$ is sufficiently large (a reconstruction sketch follows this list)
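
A small sketch of this approximation (an illustration, not from the notes): reconstruct the digits data from its first $M = 2$ scores and loadings and evaluate the squared-error objective.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                 # n=1797 observations, p=64 features
Xc = X - X.mean(axis=0)                # work with centered data

M = 2
pca = PCA(n_components=M).fit(X)
Z = pca.transform(X)                   # score vectors z_im, shape (n, M)
Phi = pca.components_                  # loading vectors phi_jm, shape (M, p)

X_approx = Z @ Phi                     # x_ij ≈ sum over m of z_im * phi_jm
sse = np.sum((Xc - X_approx) ** 2)     # value of the (12.6) objective
print(sse)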

The Proportion of Variance Explained (PVE)

  • Assuming the variables have been centered to have mean zero, the total variance present in a data set is defined as
    $$\sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2$$

  • the variance explained by the $m$th principal component is
    $$\frac{1}{n} \sum_{i=1}^{n} z_{im}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2$$

  • the PVE of the $m$th principal component is the ratio
    $$\frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}$$

  • the variance of the data can be decomposed into the variance of the first $M$ principal components plus the mean squared error of this $M$-dimensional approximation, as follows:
    $$\sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 = \sum_{m=1}^{M} \frac{1}{n} \sum_{i=1}^{n} z_{im}^2 + \frac{1}{n} \sum_{j=1}^{p} \sum_{i=1}^{n} \left( x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm} \right)^2$$

  • we can interpret the PVE as the $R^2$ of the approximation for $X$ given by the first $M$ principal components (see the numerical check below)
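
A quick numerical check of these formulas (my sketch; the symbols follow the definitions above): compute the PVE by hand and compare it with scikit-learn's explained_variance_ratio_.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data
n = len(X)
Xc = X - X.mean(axis=0)                          # mean-zero variables

pca = PCA().fit(X)
Z = pca.transform(X)                             # scores for every component

total_var = np.sum(Xc ** 2) / n                  # total variance
pve = (np.sum(Z ** 2, axis=0) / n) / total_var   # PVE of each component

print(np.allclose(pve, pca.explained_variance_ratio_))  # True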

Coding: Visualization


import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# 1797 8x8 digit images, each flattened to 64 features
digits = load_digits()
print(digits.data.shape)  # (1797, 64)

# project from 64 to 2 dimensions
pca = PCA(n_components=2)
projected = pca.fit_transform(digits.data)

# visualization: one point per image, colored by its true digit label
plt.scatter(
  projected[:, 0], projected[:, 1],
  c=digits.target, edgecolor='none', alpha=0.5,
  cmap=mpl.colormaps['Spectral'].resampled(10))  # 10 discrete colors
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();
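
As a quick follow-up (my addition), the fitted PCA object reports how much of the total variance this 2D projection keeps:

print(pca.explained_variance_ratio_)        # PVE of each retained component
print(pca.explained_variance_ratio_.sum())  # total fraction of variance kept in 2D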

Clustering Methods

K-Means Clustering

  • Partitioning a data set into $K$ distinct, non-overlapping clusters.
  • Let $C_1, \ldots, C_K$ denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:
      1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \ldots, n\}$. In other words, each observation belongs to at least one of the $K$ clusters.
      2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.
  • the big idea (see the scikit-learn sketch after this list)
    • within-cluster variation is as small as possible:
      $$\underset{C_1, \ldots, C_K}{\text{minimize}} \; \sum_{k=1}^{K} W(C_k)$$
    • within-cluster variation, measured with squared Euclidean distance:
      $$W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$$
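
A minimal scikit-learn sketch of this idea (an illustration; the blob data and $K = 3$ are arbitrary choices): fit K-Means and inspect the assignments and the inertia, the within-cluster sum of squared distances to the centroids (equivalent to the $W(C_k)$ objective up to a factor of 2).

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data: n=300 observations in p=2 dimensions around 3 true centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])   # cluster index for each of the first 10 observations
print(kmeans.inertia_)       # sum of squared distances to the assigned centroids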