Auxiliary tutorial 4: Principal Component Analysis

(c) 2016 Heidi Klumpe and Manuel Razo. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained therein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [2]:
import numpy as np
import numba
import pandas as pd
import scipy.special
import scipy.stats as st

# Package to perform PCA
import sklearn.datasets
import sklearn.decomposition

# BE/Bi 103 Utilities from Justin
import bebi103

import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 14, 'axes.titlesize': 14}
sns.set(rc=rc)

# Make Matplotlib plots appear inline
%matplotlib inline

import bokeh.io
bokeh.io.output_notebook()

Overview

  1. Principal component analysis (PCA) summarizes a multi-dimensional dataset with fewer dimensions, while still preserving important underlying structure. Stated differently, it discards the smaller sources of variation in the hope of revealing the larger (and presumably more important) ones.
  2. We can implement PCA in two ways: using the eigenvectors of the covariance matrix or with packages from scikit-learn.
  3. We need to be careful about whether we are using PCA as a visual heuristic or as a statistical tool, whether we're analyzing our own data or reading about someone else's.

Using PCA in the biological sciences

Biological datasets, thanks to technological advances and the suffix "-omics," are high-dimensional. We can record information about every gene, protein, or brain voxel, but we would then like to isolate the most descriptive (or perhaps most predictive) features of those measurements. Ideally, we could remove apparent redundancies and unimportant features and focus on the underlying structure that remains. For this reason, we often want to reduce the dimensionality of the data, i.e., summarize it with a smaller number of features.

PCA, invented by Karl Pearson in 1901, provides a way to do just that! Given a dataset with $p$ dimensions (where each dimension could be the measurement of a single gene, or cytokine, or SNP), we can project all observations into a set of $k$ new dimensions, which end up being linear combinations of the original dimensions. Though some information is "lost" in this reduction, PCA preserves the maximal variation in those new dimensions. So projecting data onto its principal components removes the potentially superficial distinctions of the extra dimensions whilst highlighting important differences. To get a more intuitive idea of how this works, try clicking around in these interactive plots.
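To state the projection a bit more concretely (the notation below is ours, introduced only for this aside): if $\mathbf{X}$ is the $n \times p$ matrix of mean-centered observations, PCA finds a $p \times k$ matrix $\mathbf{W}_k$ whose orthonormal columns are the top $k$ eigenvectors of the covariance matrix of the data, and the reduced representation is

$$\mathbf{Y} = \mathbf{X} \mathbf{W}_k,$$

so each new coordinate (each column of $\mathbf{Y}$) is a linear combination of the original $p$ measurements, and no other $k$-dimensional linear projection retains more of the total variance.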

PCA has been used in many interesting ways. One two-dimensional rendering of SNP data recapitulated the geography of Europe. Other examples include classifying different soft tissue tumors; visualizing gene oscillations and subsequently intuiting genetic networks; and removing noise from genetic data.

Using PCA in Python

We will perform PCA two ways. First, with techniques from linear algebra, to get an idea of what we are doing. Second, with packages from scikit-learn, which allow us to do PCA in a single line!
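As a preview of the scikit-learn route, here is a minimal sketch; the array X below is just a stand-in for an $n \times p$ data matrix and is not part of this tutorial's data.

import numpy as np
import sklearn.decomposition

# Stand-in data: 150 observations with 4 features
X = np.random.rand(150, 4)

# Fit PCA and project the data onto the first two principal components
pca = sklearn.decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component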

With techniques from linear algebra

This blog post by Sebastian Raschka provides a clean and useful overview of our approach to PCA. This is what we will implement in the following lines to reduce our data to $k$ dimensions:

  1. Standardize the data. (To make quantitative comparisons of variance, we want to be sure each measurement varies to a similar extent; more on this later.)
  2. Compute the covariance matrix and use eigenvalue decomposition to obtain the eigenvectors and eigenvalues.
  3. Select the $k$ largest eigenvalues and their associated eigenvectors.
  4. Transform the data into a $k$ dimensional subspace using those $k$ eigenvectors.
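Before turning to real data, here is a compact sketch of these four steps on a stand-in data matrix (all variable names below are ours, chosen only for illustration):

import numpy as np

# Stand-in data: n = 150 observations, p = 4 features
X = np.random.rand(150, 4)

# 1. Standardize: zero mean and unit variance for each feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix and its eigendecomposition
cov_mat = np.cov(X_std, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)   # eigh: the covariance matrix is symmetric

# 3. Keep the k eigenvectors with the largest eigenvalues
k = 2
order = np.argsort(eig_vals)[::-1]             # indices of eigenvalues, descending
W = eig_vecs[:, order[:k]]                     # p x k projection matrix

# 4. Project the standardized data onto the k-dimensional subspace
X_pca = X_std @ W
print(X_pca.shape)                             # (150, 2)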

Let's give it a try!

The famous iris data set

We will use one of the most famous datasets available online. This data set, collected by Edgar Anderson and popularized by the one and only Ronald Fisher, contains the petal and sepal length and width measured for flowers from three different species of Iris (Iris setosa, Iris virginica, and Iris versicolor).

This data set is popular enough to have its own Wikipedia entry, and you can import it with seaborn, scikit-learn, or even pandas using

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(filepath_or_buffer=url, header=None, sep=',')

For this tutorial, we will import it using scikit-learn and transform it into a tidy data frame.

In [3]:
# Import the Iris dataset and convert it into a Pandas DataFrame
iris = sklearn.datasets.load_iris()

# Uncomment if you want to print the dataset description
# print(iris.DESCR)

# Make a DataFrame with a species column
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['species'] = iris.target_names[iris.target]

# Take a look at df_iris
df_iris.head()
Out[3]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
0                5.1               3.5                1.4               0.2  setosa
1                4.9               3.0                1.4               0.2  setosa
2                4.7               3.2                1.3               0.2  setosa
3                4.6               3.1                1.5               0.2  setosa
4                5.0               3.6                1.4               0.2  setosa

We can plot pairwise comparisons using Seaborn's pairplot() function to see if there are striking correlations between any two features of the iris data set. If any two features are highly correlated, we may be able to summarize them with a single axis (i.e. a "new" feature) that captures both.

In [4]:
# Plot pairwise comparison to explore the data
_ = sns.pairplot(df_iris, hue='species')

From this plot we immediately notice that versicolor and virginica are more similar to each other than either is to setosa. There is also a strong correlation between petal length and petal width, which we can use to illustrate the concept behind PCA. So, let's focus on these two parameters.
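To put a number on that correlation before we plot it, we can ask pandas for the Pearson correlation coefficients of these two columns; the off-diagonal entry quantifies the relationship visible in the pairplot.

# Pearson correlation matrix of petal length and petal width
df_iris[['petal length (cm)', 'petal width (cm)']].corr()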

In [21]:
# Compute the mean
m = np.array([df_iris['petal length (cm)'].mean(), 
              df_iris['petal width (cm)'].mean()])

# Plot petal length vs petal width only
for key, group in df_iris.groupby('species'):
    plt.plot(group['petal length (cm)'], group['petal width (cm)'],
               label=key, marker='.', linestyle='none')

# Add the mean value to the plot
plt.plot(m[0], m[1], marker='*', color='black', markersize=15,
         linestyle='none', label='mean')

plt.legend(loc=0)
plt.margins(0.02)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)');
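To connect this picture to the recipe above, here is a short sketch (the variable names are ours) that computes the covariance matrix of these two features, centered on the mean we just plotted, and its eigendecomposition. The eigenvector with the larger eigenvalue points along the long axis of the point cloud; it is the first principal component of this two-dimensional slice of the data.

# Two-column array of petal length and petal width, centered on the mean m
X_petal = df_iris[['petal length (cm)', 'petal width (cm)']].values - m

# 2 x 2 covariance matrix and its eigendecomposition
cov_petal = np.cov(X_petal, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov_petal)

# The eigenvector paired with the largest eigenvalue is the first principal component
pc1 = eig_vecs[:, np.argmax(eig_vals)]
print('covariance matrix:\n', cov_petal)
print('first principal component:', pc1)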