Auxiliary tutorial 4: Principal Component Analysis

(c) 2016 Heidi Klumpe and Manuel Razo. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained therein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [2]:
import numpy as np
import numba
import pandas as pd
import scipy.special
import scipy.stats as st

# Packages to perform PCA and to preprocess data for it
import sklearn.datasets
import sklearn.decomposition
import sklearn.preprocessing

# BE/Bi 103 Utilities from Justin
import bebi103

import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 14, 'axes.titlesize': 14}
sns.set(rc=rc)

# Make Matplotlib plots appear inline
%matplotlib inline

import bokeh
bokeh.io.output_notebook()

Overview

  1. Principal component analysis (PCA) summarizes a multi-dimensional dataset with fewer dimensions while still preserving important underlying structure. Stated differently, it discards the smaller, higher-order sources of variation in the hope of revealing the larger (and presumably more important) first-order variation.
  2. We can implement PCA in two ways: using the eigenvectors of the covariance matrix or with packages from scikit-learn.
  3. We need to be careful about whether we use PCA as a visual heuristic or as a statistical tool, whether we're analyzing our own data or reading about someone else's.

Using PCA in the biological sciences

Biological datasets, thanks to technological advances and the suffix "-omics," are high-dimensional. We can record information about every gene, protein, or brain voxel, but then would like to isolate the most descriptive (or maybe predictive) features of those measurements. Ideally, we could remove apparent redundancies or unimportant features, and so be able to focus on the underlying structure that remains. For this reason, we often want to reduce the dimensionality of the data, i.e. summarize it with a smaller number of features.

PCA, invented by Karl Pearson in 1901, provides a way to do just that! Given a dataset with $p$ dimensions (where each dimension could be the measurement of a single gene, or cytokine, or SNP), we can project all observations into a set of $k$ new dimensions, which end up being linear combinations of the original dimensions. Though some information is "lost" in this reduction, PCA preserves the maximal variation in those new dimensions. So projecting data onto its principal components removes the potentially superficial distinctions of the extra dimensions whilst highlighting important differences. To get a more intuitive idea of how this works, try clicking around in these interactive plots.

PCA has been used in many interesting ways. One 2-D rendering of SNPs regenerated the geography of Europe. Other examples include classifying different soft tissue tumors; visualizing gene oscillations and subsequently intuiting genetic networks; and removing noise from genetic data.

Using PCA in Python

We will perform PCA two ways. First, with techniques from linear algebra, to get an idea of what we are doing. Second, with packages from scikit-learn, which allow us to do PCA in a single line!

With techniques from linear algebra

This blog post by Sebastian Raschka provides a clean and useful overview of our approach to PCA. This is what we will implement in the following lines to reduce our data to $k$ dimensions:

  1. Standardize the data. (To make quantitative comparisons of variance, we want to be sure each measurement varies to a similar extent; more on this later.)
  2. Compute the covariance matrix and use eigenvalue decomposition to obtain the eigenvectors and eigenvalues.
  3. Select the $k$ largest eigenvalues and their associated eigenvectors.
  4. Transform the data into a $k$ dimensional subspace using those $k$ eigenvectors.

Let's give it a try!
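
Before working through these steps by hand, here is the whole recipe condensed into a single function. This is only a minimal sketch (not part of the original analysis); it assumes the data live in a NumPy array X with one row per observation and one column per feature.

def pca_manual(X, k):
    """Sketch of PCA by eigendecomposition: project X onto its top k PCs."""
    # 1) Center the data (full standardization would also divide each
    #    column by its standard deviation)
    X_centered = X - X.mean(axis=0)

    # 2) Covariance matrix and its eigendecomposition
    cov_mat = np.cov(X_centered, rowvar=False)
    eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

    # 3) Keep the eigenvectors with the k largest eigenvalues
    order = np.argsort(eig_vals)[::-1][:k]
    W = eig_vecs[:, order]

    # 4) Project the data onto the k-dimensional subspace
    return np.dot(X_centered, W)

Now let's carry out each of these steps explicitly on a real data set.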

The famous iris data set

We will use one of the most famous datasets available online. This data set, collected by Edgar Anderson and popularized by the one and only Ronald Fisher, contains the petal and sepal length and width for 50 flowers from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor).

This data set is popular enough to have its own Wikipedia entry, and you can import it with seaborn, scikit-learn, or even pandas using

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(filepath_or_buffer=url, header=None, sep=',')

For this tutorial, we will import it using scikit-learn and transform it into a tidy data frame.

In [3]:
# Import the Iris dataset and convert it into a Pandas DataFrame
iris = sklearn.datasets.load_iris()

# Uncomment if you want to print the dataset description
# print(iris.DESCR)

# Make a DataFrame with a species column
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['species'] = iris.target_names[iris.target]

# Take a look at df_iris
df_iris.head()
Out[3]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

We can plot pairwise comparisons using Seaborn's pairplot() function to see if there are striking correlations between any two features of the iris data set. If any two features are highly correlated, we may assume we can summarize those two features with a single axis (i.e. "new" feature) that includes both.

In [4]:
# Plot pairwise comparison to explore the data
_ = sns.pairplot(df_iris, hue='species')

Right away, we notice from this plot that versicolor and virginica are more similar to each other than either is to setosa. There is also a strong correlation between petal length and petal width, which we can use to illustrate the concept behind PCA. So, let's focus on these two measurements.

In [21]:
# Compute the mean
m = np.array([df_iris['petal length (cm)'].mean(), 
              df_iris['petal width (cm)'].mean()])

# Plot petal length vs petal width only
for key, group in df_iris.groupby(['species']):
    plt.plot(group['petal length (cm)'], group['petal width (cm)'],
               label=key, marker='.', linestyle='none')

# Add the mean value to the plot
plt.plot(m[0], m[1], marker='*', color='black', markersize=15,
         linestyle='none', label='mean')

plt.legend(loc=0)
plt.margins(0.02)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)');

Now, we do "manual" PCA (or "Manuel" PCA, as he coded this approach first). Our goal will be to reduce our 2-D data to 1-D. Note that by taking the mean, we already reduced it to 0-D in a sense.

1) Standardize the data. Because our plot labels have units, we know the two dimensions we're considering are both measured in centimeters, and so have similar scale. But they're different distances from zero! We'll center all measurements at the same point, the mean.

In [22]:
# Subtract the mean from the measurements.
df_centered = df_iris.loc[:, ['petal length (cm)', 'petal width (cm)']]
for col in df_centered.columns:
    df_centered[col] -= df_centered[col].mean()

# Take a look
df_centered.head()
Out[22]:
petal length (cm) petal width (cm)
0 -2.358667 -0.998667
1 -2.358667 -0.998667
2 -2.458667 -0.998667
3 -2.258667 -0.998667
4 -2.358667 -0.998667

2) Compute the covariance matrix and use eigenvalue decomposition to obtain the eigenvectors and eigenvalues.

We won't cover the math behind this procedure. However, it can be shown that the principal component directions are given by the eigenvectors of the covariance matrix, and the variance along each of those directions is given by the corresponding eigenvalue.

Most of the available algorithms for PCA actually use singular value decomposition of the centered data matrix for computational efficiency. But regardless of the algorithm, the objective is still the same: compute the eigenvectors and eigenvalues of the covariance matrix.
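
As a quick check (a sketch, not part of the original notebook), the singular value decomposition of the centered data recovers the same principal directions, and the eigenvalues of the covariance matrix are the squared singular values divided by $n - 1$. You can compare the output of this snippet with the eigenvectors and eigenvalues computed in the next cells.

# Sketch: SVD of the centered data gives the same principal directions
# (up to sign) as eigendecomposition of the covariance matrix
U, s, Vt = np.linalg.svd(df_centered.values, full_matrices=False)

# Rows of Vt are the principal directions; eigenvalues of the covariance
# matrix are recovered as s**2 / (n - 1)
print(Vt)
print(s**2 / (len(df_centered) - 1))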

In [25]:
cov_mat = np.cov(df_centered.transpose())
print('Covariance matrix \n', cov_mat)
Covariance matrix 
 [[ 3.11317942  1.29638747]
 [ 1.29638747  0.58241432]]

Next, we'll compute the eigensystem.

In [26]:
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors\n', eig_vecs)
print('\nEigenvalues\n', eig_vals)
Eigenvectors
 [[ 0.92154695 -0.38826694]
 [ 0.38826694  0.92154695]]

Eigenvalues
 [ 3.65937449  0.03621925]

We can plot the eigenvectors on top of our data to get a sense of how these principal components can capture the variation in the data.

In [28]:
# Plot Petal length vs petal width only
for key, group in df_iris.groupby(['species']):
    plt.plot(group['petal length (cm)'], group['petal width (cm)'],
               label=key, marker='o', linestyle='none')

# Add the mean value to the plot
plt.plot(m[0], m[1], marker='*', color='black', markersize=15)

# Add arrows showing the eigenvectors: x-components are the first row of
# eig_vecs, y-components are the second row (one arrow per eigenvector)
plt.quiver([m[0]]*2, [m[1]]*2, eig_vecs[0, :], eig_vecs[1, :], zorder=11,
           width=0.01, scale=3)
    
# Tidy up plot
plt.legend(loc=0, fontsize=15)
plt.margins(0.02)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.title('Iris data with principal components');

3) Select the $k$ largest eigenvalues and their associated eigenvectors.

As Sebastian Raschka points out in his explanation of PCA:

The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.

We chose this pair of measurements originally because there was a clear correlation between them. This is indeed confirmed by the relative magnitudes of the eigenvalues, where one is two orders of magnitude larger than the other. Clearly, describing these data with two axes, rather than one, does not add much additional information. We also know which eigenvector to take: the one with the largest eigenvalue.

Lior Pachter's great blog post on PCA, which we will discuss more later, explains this 2-D to 1-D case by thinking of the projection as a triangle. Let's say we want to project a single data point onto the first principal component. The centroid (fixed) and the data point (also fixed) form a triangle with the transformed data point (an orthogonal projection onto the first principal component).

The hypotenuse of the triangle (the distance from the centroid to the data point) is fixed, so the other two sides must change together (by the Pythagorean Theorem). The first principal component has a minimal orthogonal distance to all the data points. So the side of the triangle between the data point and the transformed data point is as small as possible. This means the distance between the centroid and the transformed point is as large as possible (though still smaller than the original distance between the data point and the centroid)!

This is what we mean when we say that using PCA to project the data into fewer dimensions maximizes the sample variation (i.e. their distances from each other after they've been transformed), despite some information being lost.
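
We can check this decomposition numerically. Below is a minimal sketch (using the centered data and eigensystem computed above) verifying that, for a single data point, the squared distance to the centroid equals the squared length of the projection onto the first principal component plus the squared orthogonal residual.

# Sketch: Pythagoras for one centered data point
x = df_centered.iloc[0].values          # one centered observation
pc1 = eig_vecs[:, np.argmax(eig_vals)]  # direction of largest variance

proj_length = np.dot(x, pc1)            # signed length along PC1
residual = x - proj_length * pc1        # orthogonal remainder

# These two numbers are equal (up to floating point error)
print(np.dot(x, x))
print(proj_length**2 + np.dot(residual, residual))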

Luckily, we can quantify how much information is lost. As the eigenvalues are a relative measure of the data variance along the associated eigenvector (i.e. the "new feature axis"), we can use them to quantify how much of the variance is explained by our $k$ dimensions, which could be a useful way to decide what $k$ should even be in the first place. In this case, we want to know how much of the 2-D spread is described by our 1-D simplification.

In [29]:
# Compute how much variance is explained by each principal component
print("""
PCA 1: {0:.2f}% of the variance
PCA 2:  {1:.2f}% of the variance
""".format(*tuple(eig_vals / np.sum(eig_vals) * 100)))
PCA 1: 99.02% of the variance
PCA 2:  0.98% of the variance

4) Transform the data into a $k$ dimensional subspace using those $k$ eigenvectors.

To actually project our data into this new dimension, we have to multiply our data by the so-called projection matrix, which is just a fancy name for the matrix formed by concatenating the top $k$ eigenvectors as columns. The dot product of the data with the projection matrix then "projects" the data onto our new axes.
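
In general (a sketch using the variables defined above), the recipe looks like this for any $k$:

# Stack the top k eigenvectors as columns of a projection matrix W,
# then project the centered data onto them
k = 1
order = np.argsort(eig_vals)[::-1][:k]   # indices of the k largest eigenvalues
W = eig_vecs[:, order]                   # projection matrix, shape (p, k)
X_proj = np.dot(df_centered.values, W)   # projected data, shape (n, k)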

Since in this simple example we are projecting into a 1-D space, we just have to matrix multiply our data by the eigenvector with the largest corresponding eigenvalue. (Note that below we project the original, uncentered measurements; this only shifts the projected values by a constant and does not change their spread.)

In [30]:
# Project data to our 1D space
df_1D = pd.DataFrame(np.dot(df_iris.loc[:,['petal length (cm)',
                                           'petal width (cm)']], eig_vecs[:,0]),
                     columns=['projection'])

# Add back the species column
df_1D['species'] = df_iris['species']
df_1D.head()
Out[30]:
projection species
0 1.367819 setosa
1 1.367819 setosa
2 1.275664 setosa
3 1.459974 setosa
4 1.367819 setosa

Now we can plot our data in 1-D while retaining ≈99% of the variability in the data!

In [32]:
for key, group in df_1D.groupby(['species']):
    plt.plot(group['projection'], np.zeros_like(group['projection']), alpha=0.4, 
             label=key, marker='o', linestyle='none')

plt.margins(0.05)
plt.xlabel('PCA 1')
plt.legend(np.array(['setosa', 'versicolor','virginica']))
plt.title('Projection of petal length v. petal width onto one axis');

Exciting! We were able to project 2-D data onto a 1-D axis, and still have a sense of how these various iris species differ. Note that while the new dimension has numerical values, its interpretation is a bit fuzzy. It's in fact a weighted combination of petal length and petal width, but this doesn't necessarily tell us anything new about biology, except that these two variables are related and potentially very different in various iris species. However, we might have known this by computing their covariance.

Scikit-learn shortcut

Now that we explored step-by-step how to do PCA, we can use scikit-learn to do it in a single line. For this, we will take all 4 dimensions of the original dataset (petal length, petal width, sepal length, and sepal width) and explore how much variability is explained by each of the resulting principal components.

Scikit-learn utilizes Python's object orientation. An object is first instantiated by the user, and has various variables (i.e. class and instance variables) and methods (i.e. functions) you can access (more definitions here). We first instantiate a sklearn.decomposition.PCA object, and then use the fit() method to perform PCA on our data. The attributes of the PCA instance that end in underscores are the computed values.

In [35]:
# Instantiate the PCA object
sklearn_pca = sklearn.decomposition.PCA()

# Pass the data to the fit method
sklearn_pca.fit(df_iris[iris.feature_names])

# Print the variance explained
print('Variance percent explained\n', sklearn_pca.explained_variance_ratio_)
Variance percent explained
 [ 0.92461621  0.05301557  0.01718514  0.00518309]

We can see that the first component captures 92% of the variability in the data! We can now easily project our 4-D dataset into any $k$ dimensional space we would like. Since we've already seen a 0-D and 1-D reduction, let's look at the data in 2-D space.
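
As an aside, the cumulative variance explained is a handy way to choose how many components to keep. Here is a quick sketch using the sklearn_pca object we just fit; with the numbers above, the first two components already account for nearly 98% of the variance.

# Sketch: cumulative fraction of variance explained as components are added
print(np.cumsum(sklearn_pca.explained_variance_ratio_))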

In [36]:
# Perform the PCA again retaining only the top 2 components
sklearn_pca = sklearn.decomposition.PCA(n_components=2)
sklearn_pca.fit(df_iris[iris.feature_names])

# Project the data into this 2D space and convert it back to a tidy dataframe
df_2D = pd.DataFrame(sklearn_pca.transform(df_iris[iris.feature_names]),
                     columns=['PCA1', 'PCA2'])

# Create a column for species name
df_2D['species'] = df_iris['species']

# Look at the result
df_2D.head()
Out[36]:
PCA1 PCA2 species
0 -2.684207 0.326607 setosa
1 -2.715391 -0.169557 setosa
2 -2.889820 -0.137346 setosa
3 -2.746437 -0.311124 setosa
4 -2.728593 0.333925 setosa

Now we can plot our original 4-D data onto a 2-D space that retains nearly 98% of the variability.

In [38]:
for key, group in df_2D.groupby(['species']):
    plt.plot(group.PCA1, group.PCA2, 'o', alpha=0.7, label=key)

# Tidy up plot
plt.legend(loc=0)
plt.margins(0.05)
plt.xlabel('PCA 1')
plt.ylabel('PCA 2');

Interesting! This single plot summarizes what we intuited from looking at all the dimensions with our pairwise plot. The species are relatively distinct, with versicolor and virginica being more similar to each other than to setosa. Also, the second axis doesn't add much additional information; for the most part, it seems to capture the variation within species, rather than something that makes these flowers phenotypically distinct.

Summary

So what have we learned by doing this analysis?

We can implement PCA in Python.

Methodologically, there are two ways to do it: approaches from linear algebra, or with scikit-learn.

Practically, PCA takes a $p$ dimensional dataset as input and outputs two things that might interest us:

  • a transformed dataset in $k$ dimensions and
  • a set of eigenvalues and associated eigenvectors that indicate the "percent variance" explained.

Here, the transformed dataset was our endpoint, since we were satisfied to think about high-dimensional differences as 1-D or 2-D distances. But can we learn something about data from the way we transformed it? Is it more than a visual heuristic? Is a plot my only reward for all this adventurous math?

PCA doesn't necessarily provide statistical information or a useful metric. But it will make your data easier to look at.

Even if, as in the case of the iris data, we had a single principal component that described most of the variation, it is not necessarily wise to use that as a mathematical or statistical model for the data. In fact, we may know beforehand that we do not expect to predict the data with purely linear relationships. What's more, as the weights of the principal components do not necessarily represent physical quantities, such models can be hard to verify with pre-existing data. See an internet rant about "Unprincipled Component Analysis" for more information.

Also note that PCA is not scale invariant; you can easily imagine that if we analyzed the same measurements, but in different units, the eigenvector weights would be very different! Moreover, a dimension varying from 0.1 to 0.2 might be considered much less significant than a dimension varying from 100 to 200. So any implementation of PCA includes a somewhat subjective choice of scale, though scale-invariant methods do exist.
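
To see this concretely, here is a minimal sketch (not part of the original notebook) that redoes the petal PCA after pretending petal width had been recorded in millimeters instead of centimeters. The principal directions and explained variance ratios change.

# Sketch: PCA is not scale invariant
X_cm = df_iris[['petal length (cm)', 'petal width (cm)']].values
X_mixed = X_cm.copy()
X_mixed[:, 1] *= 10  # pretend petal width were recorded in mm

for X in (X_cm, X_mixed):
    pca = sklearn.decomposition.PCA(n_components=1)
    pca.fit(X)
    print(pca.components_, pca.explained_variance_ratio_)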

To capture the confusion around the true usefulness of PCA in data analysis, Lior Pachter collected these quotations in his blog post on PCA (all emphasis ours):

  • "PCA is not a statistical method to infer parameters or test hypotheses. Instead, it provides a method to reduce a complex dataset to lower dimension to reveal sometimes hidden, simplified structure that often underlie it."
  • "PCA is a statistical method routinely used to analyze interrelationships among large numbers of objects."
  • PCA is "more useful as a visualization technique than as an analytical method."
  • PCA is a "mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the data set."

However, as we might intuit, the eigenvalues and eigenvectors do contain some statistical information. Pachter talks about other approaches to using PCA, which are more interested in the new dimensions onto which the data is projected, rather than the projected data itself. For example, the new dimensions define an "affine subspace closest to a set of points." In 0-D, this is the centroid, or the single point closest to all the data points. In 1-D, when we collapse the data onto a single line, that line is the "average line" at a minimal orthogonal distance from all the data.

If that last part sounds a bit like least squares regression, it's because they are related! Pachter's second definition describes how probabilistic PCA is "a generalization of linear regression in which Gaussian noise is isotropic." We showed in class that a least squares regression is equivalent to maximizing the likelihood, if we assume Gaussian distributed errors in the dependent variable. (Note that this minimizes the vertical distance between the data and the line.) However, for biological processes, we often expect noise in the independent and dependent variables. Our "affine subspace," or the dimension onto which we project our data, may describe data better than a linear regression, as it considers both sources of noise (more information here).
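
For a rough numerical comparison, the sketch below (using the centered petal data and eigensystem from earlier) contrasts the ordinary least squares slope, which minimizes vertical distances, with the slope of the first principal direction, which minimizes orthogonal distances.

# Sketch: least squares slope vs. slope of the first principal direction
x = df_centered['petal length (cm)'].values
y = df_centered['petal width (cm)'].values

slope_ls = np.polyfit(x, y, 1)[0]        # least squares (vertical distances)
ind = np.argmax(eig_vals)                # index of the first principal component
slope_pc1 = eig_vecs[1, ind] / eig_vecs[0, ind]  # orthogonal distances

# Similar for these data, but not identical in general
print(slope_ls, slope_pc1)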

PCA invokes certain assumptions, and so works well on specific datasets: similarly-scaled, high signal-to-noise ratio, with primarily linear relationships.

As described above, to make good quantitative comparison of variation, we must have data where "large" and "small" variation occur on the same scale. Also, by stating that the largest-order variation is the most scientifically interesting, we assume that the signal to noise ratio is fairly high, such that the lower-order variation we remove will be noise, or at least less crucial to summarizing the data usefully.

Other clustering approaches, which we will discuss in future auxiliary tutorials, may provide "better" ways to summarize high dimensional data. However, as we've learned throughout this course, the appropriateness of a technique depends on the question that you're asking. PCA will summarize the data without you specifying a structure beforehand, but doesn't necessarily categorize the data; other approaches, like k-means clustering, will return discrete clusters but require you to know something about the structure of the data beforehand, e.g. the number of clusters.

Using PCA to find the perfect human

Genomes are hard to interpret. And so are PCA plots. Lior Pachter makes both these points in an interesting blog post, where he calculates who the perfect human being would be according to their genomic SNP profile.

SNPedia is a database that contains a compilation of SNPs classified as good or bad according to some GWAS criteria. Lior did the following:

  1. He created a "perfect human" in silico by setting the alleles at all SNPs so that they are "good".
  2. He then added the "perfect human" to a panel of genotyped individuals from across a variety of populations and performed PCA to reveal the location and population of origin of the individual that was closest to this hypothetical perfect human.

We will use his data and repeat the analysis to find out who the perfect human being is and where he or she is from. You can download the SNP table, as well as information about the subjects, here.

We will load it into a DataFrame where each column heading refers to a SNP and each row is a given subject. The index of the row is a string representing the subject (including 'perfect'), and the column headings are the SNP IDs.

In [43]:
# Read the SNP table
df_snp = pd.read_csv('../data/pachter/geno_table.txt', delimiter='\t',
                     index_col='snp_id').transpose()
df_snp.head()
Out[43]:
snp_id rs307377 rs7366653 rs41307846 rs3753242 rs35082957 rs34154371 rs35426403 rs1143016 rs3890745 rs17472401 ... rs6007897 rs9615362 rs9627183 rs28372448 rs121913039 rs121913037 rs5770917 rs6151429 rs743616 rs2071421
perfect 0 0 0 0 0 0 0 0 2 0 ... 0 0 0 0 0 0 0 0 0 0
HG00096 2 0 1 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 1 0
HG00097 2 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 2 0
HG00099 2 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 1 0
HG00100 2 0 0 0 0 0 0 0 1 0 ... 0 0 1 0 0 0 0 0 1 0

5 rows × 4967 columns

Because we will do PCA with these data, they should be floating point numbers instead of integers.

In [44]:
df_snp = df_snp.astype(float)

We also have data about each individual, such as where they are from. The integrated_call_samples_v3.20130502.ALL.panel file is also included in the ZIP file containing the data.

In [45]:
# Read the description of each individual
fname = '../data/pachter/integrated_call_samples_v3.20130502.ALL.panel'
df_info = pd.read_csv(fname, delimiter='\t', 
                      index_col=0).dropna(axis=1, how='all')
df_info.head()
Out[45]:
pop super_pop gender
sample
HG00096 GBR EUR male
HG00097 GBR EUR female
HG00099 GBR EUR female
HG00100 GBR EUR female
HG00101 GBR EUR male

For convenience, we'll add this information to the SNP DataFrame.

In [46]:
# Add to the SNP dataframe the information about gender and super_pop
aux_cols = ['pop', 'super_pop', 'gender']
df_snp[aux_cols] = df_info

# Change the super_pop column for the perfect human being
df_snp.loc['perfect', ['pop', 'super_pop']] = ['Perfect Human']*2

# Take a look at auxiliary columns
df_snp[aux_cols].head()
Out[46]:
snp_id pop super_pop gender
perfect Perfect Human Perfect Human NaN
HG00096 GBR EUR male
HG00097 GBR EUR female
HG00099 GBR EUR female
HG00100 GBR EUR female

A note about units: Remember, as we discussed above, the subspaces you obtain from PCA are not scale invariant. This means that if your dataset contains some quantities measured in nanometers and others in meters, or, even worse, measurements of completely unrelated quantities, the units in which each measurement is stored will affect the results of the PCA.

The simplest way to avoid this issue is to form a "common set of units" by standardizing your values such that each dimension has a common mean and variance (usually set to zero and one, respectively). Though still a somewhat subjective choice, this ensures that the variance in each dimension happens on roughly the same scale.
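
In plain NumPy, this standardization is just a column-wise shift and scale. Here is a sketch (assuming the df_snp and aux_cols defined above) for comparison with the scikit-learn version used below.

# Sketch: standardization "by hand"
X_snp = df_snp.drop(aux_cols, axis=1).values
X_snp_std = (X_snp - X_snp.mean(axis=0)) / X_snp.std(axis=0)

# Caveat: a column with zero variance would cause a division by zero here;
# scikit-learn's StandardScaler handles that case gracefully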

Fortunately, as in almost every case, scikit-learn can do the job for us. We will simply use the StandardScaler class from the preprocessing module of scikit-learn to fix this.

In [49]:
# Standardize the data before performing PCA
# Use .drop to remove the pop, super_pop, and gender columns before scaling.
df_snp_std = sklearn.preprocessing.StandardScaler().fit_transform(
                            df_snp.drop(aux_cols, axis=1))

# Set up the PCA, keeping the top five principal components
n_components = 5
snp_pca = sklearn.decomposition.PCA(n_components=n_components)

# Fit the PCA and project the data onto those components
snp_pca.fit(df_snp_std)
df_snp_pca = snp_pca.transform(df_snp_std)

# Convert back to a nice tidy dataframe
df_snp_pca = pd.DataFrame(df_snp_pca, 
                columns=['PC' + str(x) for x in range(1, n_components+1)], 
                index=df_snp.index)

# Add again the gender, pop, and super_pop columns
df_snp_pca[aux_cols] = df_snp[aux_cols]
df_snp_pca.head()
Out[49]:
PC1 PC2 PC3 PC4 PC5 pop super_pop gender
perfect -2.221873 -13.823952 193.950204 26.726979 -2.303932 Perfect Human Perfect Human NaN
HG00096 -6.037516 -15.947683 0.535965 -3.401567 -3.554958 GBR EUR male
HG00097 -6.310191 -14.804931 1.028855 -13.753593 4.573479 GBR EUR female
HG00099 -7.102287 -15.820245 4.379116 -4.122070 -3.076320 GBR EUR female
HG00100 -6.006800 -17.297061 0.387502 -4.474226 -2.636547 GBR EUR female

Great! Now, we know that plotting our multi-dimensional data in 2-D will give us an idea of which points are similar and dissimilar. So, if we believe genetic fitness is entirely determined by the absence and presence of universally good and bad SNPs, we may be interested in which humans are closest to the "perfect" human created by Lior Pachter. (Shortly after, we may offer our services to Hydra.) We will find this out by plotting all the SNP profiles in 2-D space.

In [51]:
# Plot all populations except the perfect human
cols = ~df_snp_pca['super_pop'].isin(['Perfect Human'])
df_gb = df_snp_pca[cols].groupby(['super_pop'])
for key, group in df_gb:
    plt.plot(group.PC1, group.PC2, 'o', alpha=0.5, label=key)

# Add the perfect human being to the plot
plt.plot(df_snp_pca.loc['perfect'].PC1, df_snp_pca.loc['perfect'].PC2, '*', 
         markersize=15, color='black', label=df_snp_pca.loc['perfect'].super_pop)

# Tidy the plot.
plt.margins(0.05)
plt.legend(loc=0)
plt.xlabel('PC 1')
plt.ylabel('PC 2');

Wow. It's kind of hard to tell from this plot who is closest to our perfect human being. If only there were a way to plot this interactively, where we could zoom in and hover over the data to get more information...

Oh wait! There is such a thing: we can use Bokeh to explore the data better.

In [52]:
# What pops up on hover?
tooltips = [('gender', '@gender'),
            ('pop', '@pop'),
            ('ID', '@index')]

# Make the hover tool
hover = bokeh.models.HoverTool(tooltips=tooltips)

# Create figure
p = bokeh.plotting.figure(plot_width=650, plot_height=450, x_axis_label='PC1',
                          y_axis_label='PC2')

# Add the hover tool
p.add_tools(hover)

# Define colors in a dictionary to access them with
# the key from the pandas groupby function.
keys = df_snp_pca.super_pop.dropna().unique()
color_dict = {k: bebi103.rgb_frac_to_hex(sns.color_palette()[i]) 
                      for i, k in enumerate(sorted(keys))}

for key, group in df_snp_pca.groupby('super_pop'):
    # Specify data source
    source = bokeh.models.ColumnDataSource(group)
    
    # Populate glyphs
    if key == 'Perfect Human':
        p.diamond_cross(x='PC1', y='PC2', size=20, source=source, 
                        color='black', fill_color=None, line_width=2,
                        legend=key)
    else:
        p.circle(x='PC1', y='PC2', size=7, alpha=0.2, source=source,
                 color=color_dict[key], legend=key)

p.legend.background_fill_alpha = 0.25
# Blanched almonds are the best kind of almonds
p.legend.background_fill_color = 'blanchedalmond'
bokeh.io.show(p)

So the closest point to the perfect human being is subject HG00737, a Puerto Rican woman!**

How confident are we that this plot indicates a Puerto Rican woman is the closest to being perfect? What if we considered additional axes?

In [53]:
# Make the hover tool
hover = bokeh.models.HoverTool(tooltips=tooltips)

# Create figure
p = bokeh.plotting.figure(plot_width=650, plot_height=450, x_axis_label='PC1',
                          y_axis_label='PC3')

# Add the hover tool
p.add_tools(hover)

for key, group in df_snp_pca.groupby('super_pop'):
    # Specify data source
    source = bokeh.models.ColumnDataSource(group)
    
    # Populate glyphs
    if key == 'Perfect Human':
        p.diamond_cross(x='PC1', y='PC3', size=20, source=source, 
                        color='black', fill_color=None, line_width=2,
                       legend=key)
    else:
        p.circle(x='PC1', y='PC3', size=7, alpha=0.2, source=source,
                 color=color_dict[key], legend=key)

p.legend.background_fill_alpha = 0.25
p.legend.background_fill_color = 'blanchedalmond'
bokeh.io.show(p)

Wow! The perfect human is actually... not human. What happened here?

We didn't check how much variance the first two axes explained; for the most part, we picked two dimensions because we publish papers in a two-dimensional format. It turns out there is important structure along the third axis: in fact, it separates the "perfect" human from all other humans.

In [54]:
print('Variance percent explained\n', snp_pca.explained_variance_ratio_)
Variance percent explained
 [ 0.05660433  0.04227824  0.00728842  0.00597726  0.00540376]

It seems our plot explains less than 10% of the variation. A 2-D plot is vastly oversimplified. Looking at the 3-D plot, we see how we fooled ourselves.

In [56]:
fig = plt.figure(1, figsize=(8, 6))
ax = mpl_toolkits.mplot3d.Axes3D(fig)
for key, group in df_snp_pca.groupby(['super_pop']):
    if key == 'Perfect Human':
        ax.plot(group.PC1, group.PC2, group.PC3, 'k*', markersize=15,
                label=key)
    else:
        ax.plot(group.PC1, group.PC2, group.PC3, 'o', alpha=0.7, label=key)

ax.set_title("First three PC directions")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
ax.legend(loc='upper left', fontsize=15);

This is an important reminder that PCA removes some variance to help us focus on the largest (and presumably most important) variance. Here, however, there was important information even in the "lower-order" variation. By plotting in 2-D, we implicitly assumed that most of the structure of SNP variation is itself 2-D. Perhaps, for something as complex as genomes, we should expect many higher-order interactions to still be important to understanding the system.

A note on Pachter's analysis: This blog post ended up being fairly controversial because it is so easy to (mis)interpret the 2-D PCA plot. Lior's original intent was to show that James Watson's racist ideas about a perfect human being were completely wrong. Instead, however, most people took the sarcastic title seriously. Eventually, scientists from Puerto Rico, who collected the samples, explained the actual results for the people who only read Lior's title.

Footnotes

*Coincidentally, iris is the state cultivated flower of Tennessee, where I (Heidi) grew up, though the state wildflower is much more exciting.

**But we are sure that Lin-Manuel Miranda, a Puerto Rican man, is a close second. That man is non-stop.

***If you're looking for additional inspiration for Homework 4:

For sampling posteriors, we've set the bar low-
Just compute with Markov Chain Monte Carlo.
Without any math, you get a map to the MAP
Where walkers stop wand'ring to start doing laps.
To begin, you can place them wherever you like.
Ignore when they grumble, or protest outright:
"Why would you drop me HERE of all places?
Don't you know how scary parameter space is?
Inhabitable regions, incalculable zones!
Cliffs of probability you fall off of, alone!"
Allow them a moment to sulk with arms crossed,
For, while the danger is real, the bluster is not.
With their transition kernel passports, they've seen
The world ergodic and made its landscape with their feet.
We can know only what they have known first.
They know very well the glory of their post,
And whether they are there or back again,
They never forget jobs come only so often.
So when you exit the script and put them away,
You may be at a loss for what to feel or say.
Just remember this tumultuous relation
Is still preferable to integration.

(dedicated to Christina Su, the human version of GitHub, who restored the accidentally discarded first draft)