Auxiliary tutorial 6: K-means clustering

(c) 2016 Porfirio Quintero. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained therein is licensed under an MIT license.

This lesson was generated from a Jupyter notebook. You can download the notebook here. You can also view it here.

In [1]:
import numpy as np
import pandas as pd

import sklearn.cluster
import sklearn.metrics
import sklearn.utils

import skimage.io

import matplotlib.pyplot as plt
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 14, 'axes.titlesize': 14}
sns.set(rc=rc)
sns.set_style('white')
%matplotlib inline

What is clustering?

Dividing data into clusters, or groups, can be helpful in exploratory data analysis and beyond. The goal of clustering is to form groups containing data objects that are similar within groups and different between groups. Because the clusters are derived from the data themselves, i.e. there is no training with labeled data, cluster analysis is often referred to as a set of unsupervised learning methods.

This procedure can provide meaningful insights into the structure of a dataset, or it can be useful for summarizing the data for further analysis. However, it is good to keep in mind that the definition of a cluster is imprecise.

Let's generate some data points and look at how we might choose to cluster them.

In [2]:
# generate data 
np.random.seed(42) 
means = (5, 5, 8, 10) 
coords = [np.random.normal(mean, 1, 10) for mean in means]

# plot 
fig, axes = plt.subplots(2,2, sharey=True, sharex=True)
# no clusters
axes[0,0].plot(coords[0], coords[1], 'ko', coords[0], coords[2], 'ko',
    coords[3], coords[0], 'ko', coords[2], coords[3], 'ko', alpha = 0.5)
plt.margins(0.02)
axes[0,0].set_xticks([])
axes[0,0].set_yticks([])

# two clusters
axes[0,1].plot(coords[0], coords[1], 'o', coords[0], coords[2], 'ko',
    coords[3], coords[0], 'ko', coords[2], coords[3], 'ko', alpha = 0.5)

# three clusters
axes[1,0].plot(coords[0], coords[1], 'o', coords[0], coords[2], 'ko',
    coords[3], coords[0], 'o', coords[2], coords[3], 'ko', alpha = 0.5)

# four clusters
axes[1,1].plot(coords[0], coords[1], 'o', coords[0], coords[2], 'o',
    coords[3], coords[0], 'o', coords[2], coords[3], 'o', alpha = 0.5)
plt.tight_layout();

These are all different ways of clustering the same set of points. There is no universally correct way of clustering; the most appropriate definition will depend on the data and the desired output, and is often found through experimentation. For these reasons, there are several clustering algorithms. Also for these reasons, we should be careful about the conclusions we draw from clustering analysis, as they can vary widely depending on the algorithm and the parameters that we choose.

Two major clustering categories refer to whether the data are grouped hierarchically or partitionally. In the former, each cluster is nested within a larger cluster that contains multiple clusters, and so on, until the whole dataset forms the cluster that contains everything, in a dendrogram fashion. In the latter, the clusters are non-overlapping and each data object belongs to exactly one cluster. To get a taste of clustering, we will take a look at a partitional algorithm, K-means, which is one of the earliest and fastest clustering algorithms.

The K-means clustering algorithm

K-means is a centroid-based clustering algorithm. This means that a cluster is defined as a set of objects in which each object is closer to the center of its own cluster than to the centers of other clusters. It thus needs a distance metric to quantify how close data are to any given point. Euclidean distance, which is simply the length of the line segment connecting two points, is one such metric and is often used. In $n$ dimensions, the squared Euclidean distance between points $p$ and $q$ is

\begin{align} \sum_{i=1}^{n} (p_i - q_i)^2. \end{align}
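As a quick sanity check, this is a one-liner in NumPy. The points p and q below are made up purely for illustration.

# squared Euclidean distance between two points in n dimensions
p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 0.0, 3.0])
sq_dist = np.sum((p - q)**2)   # (1-2)^2 + (2-0)^2 + (3-3)^2 = 5.0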

The basic algorithm for K-Means clustering, also known as Lloyd's algorithm, is as follows:

  1. Select an initial set of $k$ centroids.
  2. Assign each point to the closest centroid.
  3. Change the centroids to be the means of the newly formed clusters.
  4. Repeat steps 2 and 3 until convergence.

Convergence is reached when the centroids and assignments no longer change. Given that each point is iteratively assigned to the closest centroid, these steps minimize the so called inertia, or within-cluster sum of squared criterion,

\begin{align} \text{inertia} = \sum_{i=1}^{k}\sum_{x\in C_i} d(c_i, x)^2, \end{align}

where $k$ is the number of centroids, $C_i$ is the $i^{th}$ cluster with mean $c_i$, $x$ is a point in that cluster, and $d$ is the Euclidean distance between these two points. The mean of the points in the cluster, $c_i$, is the best centroid for minimizing the inertia. This can be shown by solving for the centroid $c_k$ at which the inertia stops changing, i.e. by differentiating with respect to $c_k$ and setting the result equal to zero. In the simplest, one-dimensional case:

\begin{align} \frac{\partial}{\partial c_k} \sum_{i=1}^{k}\sum_{x\in C_i} (c_i - x)^2 = \sum_{x\in C_k} 2 (c_k - x_k) = 0. \end{align}

Thus,

\begin{align} \sum_{x\in C_k} (c_k - x_k) = 0, \end{align}

which yields

\begin{align} \sum_{x\in C_k} c_k &= \sum_{x\in C_k} x_k. \end{align}

Defining $m_k$ as the size of the $k^{th}$ cluster, we have

\begin{align} m_kc_k &= \sum_{x\in C_k} x_k, \end{align}

which finally gives us

\begin{align} c_k = \frac{1}{m_k} \sum_{x\in C_k} x_k. \end{align}
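As a sanity check on this result, here is a small numerical experiment (the toy cluster below is arbitrary): we scan candidate centroid positions for a single 1D cluster and confirm that the sum of squares is smallest at the cluster mean.

# within-cluster sum of squares is minimized at the cluster mean
x = np.array([1.0, 2.0, 4.0, 7.0])
candidate_cs = np.linspace(0, 10, 1001)
ssq = [np.sum((x - c)**2) for c in candidate_cs]
print(candidate_cs[np.argmin(ssq)], np.mean(x))   # both are 3.5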

With this information, we can code up a simple version of the algorithm to cluster one dimensional data with random centroid initialization. We will start by writing functions to do each of the three steps of the algorithm: centroid initialization, assignment and re-computation. We will also need one to compute the euclidean distance.

In [3]:
def select_centroids(data, k):
    """
    Randomly select k initial centroids
    """
    # draw k distinct integers between 0 and the maximum of the data
    return np.random.choice(int(np.ceil(max(data))), size=k, replace=False)


def compute_distance(centroids, x):
    """
    Compute squared euclidean distance between two points
    """
    return np.array([(c - x)**2 for c in centroids])


def assign_centroids(centroids, data):
    """
    Assign each data point to its closest centroid and return the resulting clusters
    """
    # compute distances
    centroid_distances = compute_distance(centroids, data)
    
    # get closest centroid
    ind = np.array([np.argmin(c) for c in zip(*centroid_distances)])
    closest_centroid = centroids[ind]
    
    # assign data and cluster 
    return np.array([data[np.where(closest_centroid==c)] for c in centroids])


def compute_centroids(clusters):
    """
    Compute new centroids for clusters
    """
    new_centroids = np.array([np.mean(c) for c in clusters])

    # check for empty clusters, replace with 0
    if np.isnan(new_centroids).any():
        new_centroids = np.nan_to_num(new_centroids)

    return new_centroids
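Before wiring these together, here is a quick sanity check on a tiny hand-made dataset (the toy array and the hand-picked starting centroids are made up just for illustration).

# toy 1D data with two obvious groups and two hand-picked starting centroids
toy = np.array([1.0, 1.5, 9.0, 10.0])
cs = np.array([2.0, 8.0])
cls = assign_centroids(cs, toy)    # clusters: [1.0, 1.5] and [9.0, 10.0]
print(compute_centroids(cls))      # new centroids: [1.25, 9.5]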

We will now write a function to plot a sample of the iterations so that we can see what the algorithm is doing.

In [4]:
def plot_clustering(clusters, centroids, n_plots, title='iteration'):
    """
    Plot clustering iterations
    """
    to_plot = np.linspace(0, len(centroids)-1, num=n_plots, dtype=int)
    fig, axes = plt.subplots(n_plots, sharex=True, sharey=True)
    
    for i, ax in enumerate(axes):
        # centroids
        curr_cs = centroids[to_plot[i]]
        ax.vlines(curr_cs, 0, 2, 'r', alpha=0.5)

        # clustered data
        curr_cls = clusters[to_plot[i]]
        for cluster in curr_cls:
            ax.plot(cluster, np.full_like(cluster,1), 'o', alpha=0.2)
        ax.set_title("{0} {1}".format(title, to_plot[i]+1))
        
    # remove meaningless y ticks and tighten plot
    plt.yticks([])
    plt.tight_layout()
    
    return None

We have everything we need. Now we can construct our procedure by simply plugging in each step and iterating until convergence.

In [5]:
def kmeans_1d(data, k=2, plot_iter=0, verbose=False):
    """
    Perform K-Means clustering on 1D data
    """
    # get initial centroids and make an array to save each set of centroids
    centroids = select_centroids(data, k)
    curr_centroids = np.zeros_like(centroids)
    
    # count iterations and save initial centroids
    n, c = 0, centroids
    
    # to save iterations if plotting requested
    past_clusters, past_centroids = [], []
    
    # iterate until convergence (i.e. until the centroids stop changing)
    while not np.allclose(curr_centroids, centroids):
        
        # update centroids
        curr_centroids = centroids
        
        # assign each point to closest centroid
        clusters = assign_centroids(centroids, data)
        
        # recompute centroids
        centroids = compute_centroids(clusters)
        
        # update counter
        n+=1
        
        # save all iterations for plot if requested
        if plot_iter > 0:
            past_clusters.append(clusters)
            past_centroids.append(centroids)
    
    if plot_iter > 0:
        plot_clustering(past_clusters, past_centroids, plot_iter)
    
    if verbose:
        print("K = {0}\nInitial centroids were: {1}\nFinal centroids were:{2}\
              \nConverged after {3} iterations".format(k, c, centroids, n))
    
    return centroids, clusters

We should now have our clustering algorithm ready. Let's write a function to generate 1D data by getting samples from multiple normal distributions, so we can try it out.

In [6]:
def generate_data(n, var, means):
    """
    Generate 1D data to cluster from multiple normal distributions
    """
    for m in means:
        dist = np.random.normal(m, var, size=int(n/len(means)))
        try:
            data = np.concatenate([data, dist])
        except NameError: 
            data = dist
            
    np.random.shuffle(data)
    
    return data

We will make our toy data by drawing 100 samples from 4 normal distributions with unit variance and means evenly spaced between 0 and 20.

In [7]:
np.random.seed(12)
# 1D data to cluster from k normal distributions
n_samples, var, means = 100, 1, np.linspace(0, 20, 4)
data = generate_data(n_samples, var, means)

# look at it
fig, ax = plt.subplots(1, 1, figsize=(8, 1))
ax.plot(data, np.full_like(data,1), 'o', alpha=0.2)
ax.set_yticks([]);

Looks like clusterable data. Let's cluster it and look at how the centroids move:

In [8]:
np.random.seed(12)
centroids, clusters = kmeans_1d(data, k=4, plot_iter=5, verbose=True)
K = 4
Initial centroids were: [ 7 10 21 15]
Final centroids were:[  0.05641659   6.17659145  19.69789854  13.49186343]              
Converged after 8 iterations

Data in each cluster are shown in different colors and the centroids are shown as red vertical lines. So, it seems like we got decent clustering, similar to what we would expect by eye. We can see that, although initially our centroids are not ideal, they move relatively quickly to better positions.

Effect of centroid initialization

Since we are initializing our centroids randomly, let's look at what happens when they start in a different state.

In [9]:
# cluster a bunch of times with different initial random states
centroids, clusters = [], []
runs = 5
for i in range(runs):
    np.random.seed(i+5)
    centroids_, clusters_ = kmeans_1d(data, k=4)
    centroids.append(centroids_)
    clusters.append(clusters_)

# take a look
plot_clustering(clusters, centroids, runs, title='result from run')

It seems that we do not get a good result every time. That is because the algorithm converges to a local optimum that largely depends on the initial centroids. It is therefore generally a good idea to run the algorithm multiple times and keep the best result. We can quantify how good a given result is by measuring the inertia.

In [10]:
def inertia(clusters):
    """
    Compute within cluster sum of squares
    """
    return np.sum(np.array([np.sum((c - np.mean(c))**2) for c in clusters]))

inertias = [inertia(c) for c in clusters]
for i, inrt in enumerate(inertias):
    print('result from run {0} has inertia of {1}'.format(i+1, inrt))
result from run 1 has inertia of 542.644351211634
result from run 2 has inertia of 102.48986727940299
result from run 3 has inertia of 102.489867279403
result from run 4 has inertia of 102.489867279403
result from run 5 has inertia of 559.27790840675

This is consistent with what we saw: the first and fifth runs gave the worst results, while the other runs converged to a similar result. In particular, we got bad results mainly because a pair of centroids were initialized too close to each other. So, in addition to running the algorithm multiple times, being smart about the initial positions of the centroids, e.g. making them distant from each other, can be helpful. These tricks and more are incorporated in the K-means implementation of the sklearn.cluster module, which we will later use to cluster real data.
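For instance, a minimal sketch of how those tricks look in scikit-learn, assuming the data array from above: n_init controls the number of random restarts and init='k-means++' spreads the initial centroids apart.

# sklearn expects a 2D array, hence the reshape of our 1D data
km = sklearn.cluster.KMeans(n_clusters=4, init='k-means++', n_init=10,
                            random_state=12)
km.fit(data.reshape(-1, 1))
print(km.cluster_centers_.ravel())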

A note on $k$

Since we generated these data from four normal distributions, it seemed like a good idea to use $k=4$. But what if we didn't have that information? How many clusters should we expect? In other words, what $k$ should we use?

As we said before, there is no universally correct way of clustering. But there are a few methods that can give us some insight as to what $k$ to use. These involve measuring some metric as we change $k$, and we are already familiar with one such metric: the inertia. Let's try that.

In [11]:
def elbow(k_max, data, plot=False):
    """
    Get the inertia from clustering with different k for elbow method
    """    
    inertias, ks = [], []
    for k in range(2, k_max):
        # cluster
        centroids, clusters = kmeans_1d(data, k)
        
        # compute inertia
        inertia_ = inertia(clusters)
        
        # save inertia and k
        inertias.append(inertia_)
        ks.append(k)
        
    if plot:
        plt.plot(ks, inertias, 'o')
        plt.margins(0.02)
        plt.xlabel(r'$k$')
        plt.ylabel(r'inertia')
        
    return ks, inertias

np.random.seed(12)
ks, inertias = elbow(10, data, plot=True);

The inertia initially drops very rapidly with $k$, and then keeps dropping more slowly as $k$ increases. The trick here is to identify when the marginal gain drops, i.e. when increasing $k$ is no longer worth it. In this case, the sharp angle we get at $k=4$ makes it clear that this is probably the point we want. This is called the elbow method. It worked nicely for this particular dataset, but it might not work for others. In such cases, we might want to try other ways of determining a reasonable number of clusters.

We can also notice that the bigger $k$ is, the smaller the inertia. The reason for this becomes evident if we consider the case where $k = N$, where $N$ is the size of the data set. In that extreme case, every point is its own cluster, and the distance between each point and the center of its cluster (that same point) is zero, so the inertia is zero.
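We can verify this extreme case with a tiny made-up dataset, using as many clusters as points.

# with k = N, every point is its own centroid and the inertia is zero
toy = np.array([[0.0], [1.0], [5.0], [9.0], [12.0]])
km = sklearn.cluster.KMeans(n_clusters=len(toy), random_state=12).fit(toy)
print(km.inertia_)   # should be 0.0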

Who is buzzing? or clustering to understand data

Let's now look at a cool, real data set: insect wing beat sound. It was collected by Yanping Chen and is available in the UCR time series classification archive.

In a nutshell, these data were collected by recording the sounds of several flying insects. The insect sound was then extracted, cleaned, and transformed to a frequency spectrum using the discrete Fourier transform. Here is a more complete description if you are interested.
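Just to illustrate the idea of that last transformation (this is not the archive's actual processing pipeline, only a rough sketch with a made-up signal): a recorded snippet can be turned into a power spectrum with the FFT.

# made-up snippet: a 400 Hz tone sampled at 1024 Hz plus noise, and its power spectrum
t = np.linspace(0, 1, 1024, endpoint=False)
snippet = np.sin(2 * np.pi * 400 * t) + 0.1 * np.random.normal(size=t.size)
power = np.abs(np.fft.rfft(snippet))**2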

Let's import the data and take a look.

In [12]:
# Load the data (np.genfromtxt gives an array in a convenient format for sklearn)
insect_data = np.genfromtxt('../data/InsectWingbeatSound_TRAIN', delimiter=',')

# Make plot
plt.plot(insect_data[:,1:].transpose(), alpha=0.1);
plt.xlim(0, insect_data.shape[1]);
plt.xlabel('frequency (a.u.)')
plt.ylabel('power spectral density')

print('There are {0} recordings, this is a {0} by {1} array.'.format(
        *insect_data.shape))
There are 220 recordings, this is a 220 by 257 array.

Looks like many insects...can we figure out how many different kinds of insects were recorded?

Let's try clustering!

If we can figure out how many clusters there are in these data (assuming we do a decent job clustering it), we could get an idea of how many different kinds of insects there are. So let's start by using the elbow method to try to choose a reasonable $k$. We can re-write our function to use the better K-means implementation from the sklearn.cluster module.

In [13]:
def elbow(k_max, data, plot=False):
    """
    Get the inertia from clustering with different k for elbow method
    """    
    inertias, ks = [], []
    for k in range(2, k_max):
        # initialize K-means, cluster and get inertia
        km_cluster = sklearn.cluster.KMeans(n_clusters=k, random_state=12)
        inertia = km_cluster.fit(data).inertia_

        # save inertia and k
        inertias.append(inertia)
        ks.append(k)

    if plot: 
        plt.plot(ks, inertias)
        plt.plot(ks, inertias, 'o')
        plt.margins(0.02)
        plt.xlabel(r'$k$')
        plt.ylabel(r'inertia')

    return ks, inertias

We can now look at how the inertia changes with $k$.

In [14]:
ks, inertias = elbow(30, insect_data[:,1:], plot=True)

Here the number of clusters is not so obvious. Somewhere around 10 looks like a reasonable choice, but we can try another metric to get a better idea.

We will use the silhouette score, which is another measure of how tight the data are within clusters and how far they are from neighboring clusters. It ranges from -1 to 1, with larger values suggesting better clustering.
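Concretely, for each point, if $a$ is the mean distance to the other points in its own cluster and $b$ is the mean distance to the points in the nearest other cluster, the silhouette coefficient of that point is

\begin{align} s = \frac{b - a}{\max(a, b)}, \end{align}

and the silhouette score is the mean of $s$ over all points.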

We can write a similar function to get the silhouette score for different $k$s. We will use the silhouette score function available in sklearn.metrics.

In [15]:
def silhouette(k_max, data, plot=False):
    """
    Get the silhouette score from clustering with different k
    """    
    sil, ks = [], []
    for k in range(2, k_max):
        # initialize K-means, cluster and get labels
        km_cluster = sklearn.cluster.KMeans(n_clusters=k, random_state=12)
        labels = km_cluster.fit_predict(data)

        # compute silhouette score
        sil.append(sklearn.metrics.silhouette_score(data, labels))
        ks.append(k)

    if plot: 
        plt.plot(ks, sil)
        plt.plot(ks, sil, '.')
        plt.xlabel('$k$')
        plt.ylabel('silhouette score')

    return ks, sil

In [16]:
ks, sil = silhouette(30, insect_data[:,1:], plot=True)

This also suggests that 10, and perhaps 7, are reasonable choices. So we'll go with the consensus and guess that there are about 10 different insect kinds in the data set.

Actually, these data are labeled, so we can find out exactly how many different kinds of insects exist. The first column of the dataset contains the labels.

In [17]:
print('There are {0} different kinds of flying insects.'.format(
                np.unique(insect_data[:,0]).shape[0]))
There are 11 different kinds of flying insects.

Our guess was not too bad. It seems like we can get an intuition for the structure of the data through this kind of exploratory cluster analysis, which is easy to perform with the functions available in the sklearn module.

Let's cluster the data and look at what a couple of insect kinds might sound like.

In [18]:
# we first instantiate the KMeans class
kmeans_insect = sklearn.cluster.KMeans(n_clusters=11, random_state=12)

# Now we can cluster our data and access the results through the class attributes
kmeans_insect.fit(insect_data[:,1:])

# the labels are one such attribute. Others are inertia and cluster centers.
labels = kmeans_insect.labels_

In [19]:
colors = ['k','r','c','y','g']
np.random.seed(9)
for i, l in enumerate(np.random.choice(np.unique(labels), size=3, replace=False)):
    plt.plot(insect_data[:,1:][np.where(labels==l)].transpose(), c=colors[i], 
             alpha=0.1)

plt.xlim(0, insect_data.shape[1]);    
plt.xlabel('frequency (a.u.)')
plt.ylabel('power spectral density');

The clusters seem to make sense; each kind of insect does seem to sound somewhat different. Since we know the labels of the data, we can compare them to the results of our clustering. This information would normally not be available, but comparing clustering results can be useful, e.g. to compare different algorithms.

Scikit-learn also has a few evaluation metrics for clustering analysis. Because we have the ground truth, i.e. the labels, we are interested in a supervised evaluation. We will try two metrics: the adjusted Rand score and the completeness score (there are more, these are just easy to explain). The adjusted Rand score ranges from -1 to 1, with a value close to zero for random labeling and 1 for perfect labeling, so it gives us an idea of how we are doing compared to random; it is adjusted for label matches that we would expect by chance alone. The completeness score ranges from 0 to 1 and tells us to what extent points with the same true label were clustered together, with a score of 1 being a perfect match.
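As a tiny made-up example, the adjusted Rand score only cares about which points end up grouped together, not about the label values themselves:

# a perfect score despite the swapped label names
print(sklearn.metrics.adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0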

In [20]:
labels_true = insect_data[:,0]
labels_pred = labels
rand_score = sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)
comp_score = sklearn.metrics.completeness_score(labels_true, labels_pred)
print('adjusted Rand score = {0}\ncompleteness score = {1}'.format(rand_score, comp_score))
adjusted Rand score = 0.3060668991166375
completeness score = 0.5467206029163796

Looks like we are doing better than random, but we are not able to correctly classify all the insects. We might want to try another clustering method. In particular, a different distance metric, such as dynamic time warping, may be useful for this kind of data.

Clustering for utility: color quantization

Clustering can be useful for getting an intuition about our data, but it can also have direct practical uses. For example, we can use clustering to summarize or compress data. Color quantization of an image is a good example of this, as we are very familiar with images and assessing their quality by eye is easy.

Let's look at a picture from Zion National Park, taken by the well-known photographer Griffin Chure.

In [21]:
painting_dir = '../data/zion.jpeg'
painting = skimage.io.imread(painting_dir)
imwidth, imheight, _ = painting.shape
plt.imshow(painting)
print("This is an RGB image of {0} by {1} pixels".format(imheight,imwidth))
This is an RGB image of 2048 by 1536 pixels

Each combination of three values makes a unique color in RGB encoding. Let's figure out how many colors this image has.

In [22]:
def transform_count_colors(im):
    """
    Flatten an RGB image into a (pixels x 3) array and count its unique colors
    """
    # convert pixel values to floats in the range [0, 1]
    im = im / np.max(im.ravel())
    # unravel the image pixels into a (width * height, 3) array
    w, h, d = im.shape
    trans_im = np.reshape(im, (w * h, d))
    # count the number of unique colors
    color_df = pd.DataFrame(trans_im)
    n_colors = color_df.drop_duplicates().shape[0]
    return trans_im, n_colors

_, n_colors = transform_count_colors(painting)
print("There are {0} unique colors in this image.".format(n_colors))
There are 126932 unique colors in this image.

That is a lot of colors. RGB encoding requires 3 bytes per pixel, yet a single byte can index up to 256 colors. Can we cluster the colors so that each pixel only needs 1 byte, plus a small color palette? Let's try it.
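Before we do, a rough back-of-the-envelope estimate of what that would buy us, ignoring any file-format compression (the file on disk is a JPEG):

# raw RGB: 3 bytes per pixel; indexed: 1 byte per pixel plus a 256-color palette
n_pixels = imwidth * imheight
raw_bytes = 3 * n_pixels
indexed_bytes = n_pixels + 256 * 3
print(raw_bytes / indexed_bytes)   # roughly a threefold reduction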

Before clustering, we need to get the image array into an appropriate form. We want something similar to the insect data: one row per data object, here a pixel. The previous function already does that transformation, so we just need to call it again.

In [23]:
painting_, _ = transform_count_colors(painting)

Let's plot a slice of the data to check that we have what we want.

In [24]:
plt.plot(painting_[:100].transpose(), alpha=0.5);

Each line is a pixel, so we have the same thing as with the insect data. Now we can cluster.

Because there are many points in the image, we will use the slightly less accurate but faster variant of K-means, mini-batch K-means, and we will only use a sample of the picture to fit the model.

In [25]:
n_colors = 256

# Instantiate the mini-batch k-means class
kmeans_painting = sklearn.cluster.MiniBatchKMeans(n_clusters=n_colors, 
                                                  random_state=12)

# cluster and get attributes
im_sample = sklearn.utils.shuffle(painting_, random_state=0)[:1000]
kmeans_painting.fit(im_sample)
labels = kmeans_painting.predict(painting_)
centroids = kmeans_painting.cluster_centers_

That did not take too long, and just a few lines of code thanks to the scikit-learn crew!

Let's recreate the image with our new color palette to judge the effect of color quantization. We can map each pixel to its new respective color through its cluster label.

In [26]:
def recreate_image(centroids, labels, width, height):
    """
    Create a new image with centroids as color palette
    """
    d = centroids.shape[1]
    image = np.zeros((width, height, d))
    ind = 0
    for i in range(width):
        for j in range(height):
            image[i][j] = centroids[labels[ind]]
            ind += 1
    return image

new_painting = recreate_image(centroids, labels, imwidth, imheight)

_, ncolors_original = transform_count_colors(painting)
_, ncolors_new = transform_count_colors(new_painting)

fig, axes = plt.subplots(1,2)
for ax, im, title in ((axes[0], painting, 'Original'),
                      (axes[1], new_painting, 'K-means quantized')):
    ax.imshow(im)
    ax.set_title(title)
    ax.set_xticks([])
    ax.set_yticks([]);
print("""
The original painting has {0} colors.
The new painting has only {1} colors!""".format(ncolors_original, ncolors_new))
The original painting has 126932 colors.
The new painting has only 255 colors!

Can you tell the difference?
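As an aside, the pixel-by-pixel loop in recreate_image can be replaced by NumPy fancy indexing, which is much faster. A minimal sketch, assuming the same centroids, labels, imwidth and imheight as above:

# map every pixel to its centroid color in one shot, then restore the image shape
new_painting_fast = centroids[labels].reshape(imwidth, imheight, -1)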