Spectral embedding and clustering

Data $s_1,\ldots,s_N$

To perform our analysis this time, we will use the MNIST handwritten digit dataset.

In [1]:
% matplotlib notebook

from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

#rcParams['font.size'] = 30

# Load MNIST data from https://www.openml.org/d/554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
In [2]:
% matplotlib notebook


num_samples = 2000

# Select num_samples random MNIST digits to analyze.
data = X[np.random.choice(int(X.shape[0]), num_samples), :]/255.

num_clusters = 10

# Inspect a single digit from the dataset.
digit = data[0,:]
digit = np.reshape(digit, (28,28))

plt.imshow(digit, cmap='Greys')


Kernel $K(s_i,s_j)$

Spectral embedding and other "kernel trick" methods use a kernel $K(s_i, s_j)$ that measures similarity between data points. The choice of kernel is part of the procedure and requires some intuition/trial-and-error.

A popular choice of kernel is the Gaussian kernel $$ K(s_i, s_j) = \exp\left(- \frac{||s_i - s_j||^2}{2\sigma^2}\right) $$

In [3]:
% matplotlib notebook

import numpy.linalg as nla

# Gaussian kernel with variance sigma^2.
sigma = 100.0
def kernel(s_i, s_j):
    return np.exp(-nla.norm(s_i - s_j)**2.0 / (2.0*sigma**2.0))

# The kernel matrix of the data.
kernel_matrix = np.zeros((num_samples, num_samples))
for i in range(num_samples):
    for j in range(i, num_samples):
        kernel_matrix[i,j] = kernel(data[i,:], data[j,:])
        kernel_matrix[j,i] = kernel_matrix[i,j]