Auxiliary tutorial 3: Linear regression with MCMC

This lecture nugget was shown in lecture 4 and was generated from an IPython notebook. You can download the notebook here.

In [1]:
from __future__ import division, absolute_import, \
                                    print_function, unicode_literals

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.special
import scipy.optimize

import emcee

# We'll use ColorBrewer for color maps
from brewer2mpl import sequential

# We will also use numdifftools (install using Canopy)
import numdifftools as nd

# Necessary to display plots in this IPython notebook
%matplotlib inline    

Loading the data

We will load the data as in Tutorial 3a. (You can download the data set here.)

In [2]:
# Load data into DataFrame
df = pd.read_csv('../data/good_et_al/invitro_droplet_data.csv', 
                 comment='#')

# Rename them all at once
df.columns = ['droplet_diameter', 'droplet_volume', 'spindle_length', 
              'spindle_width', 'tubulin_area']

# Put droplet volume in units of µm^3 for consistency of units
df.droplet_volume *= 1e9

# Only use data for droplets with diameter less than 90 µm
df = df[df.droplet_diameter < 90.0]
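As a quick check that the renaming and filtering did what we expect (not shown in the original notebook), we can peek at the first few rows.

# Inspect the first few rows of the renamed, filtered DataFrame
df.head()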

Performing a linear regression using MCMC

Recall our linear model for the dependence of spindle length on droplet diameter.

\begin{align} l = \nu + \zeta d. \end{align}

We assume Gaussian-distributed error in the measurements, and we use an uninformative prior. The log posterior is then

\begin{align} \ln P(\zeta,\nu,\sigma~|~D,I) = \text{constant} - (n+1)\ln \sigma - \frac{3}{2}\ln (1+\zeta^2) - \frac{1}{2\sigma^2}\sum_{i\in D} \left(l_i - (\nu + \zeta d_i)\right)^2 \end{align}
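For reference, this expression follows from combining the Gaussian likelihood with an uninformative prior, $P(\zeta,\nu,\sigma~|~I) \propto \sigma^{-1}(1+\zeta^2)^{-3/2}$ (a Jeffreys prior on $\sigma$ and an uninformative prior on the slope):

\begin{align} P(\zeta,\nu,\sigma~|~D,I) \propto \frac{1}{\sigma\,(1+\zeta^2)^{3/2}}\,\frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i\in D}\left(l_i - (\nu + \zeta d_i)\right)^2\right]. \end{align}

Taking the logarithm and absorbing the factors of $2\pi$ into the constant gives the expression above.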

We will use emcee to sample the posterior. First, we define the log posterior as a Python function.

In [3]:
def log_posterior(p, droplet_diameter, spindle_length):
    """
    Returns the log of the posterior (sans an additive constant)
    
    p = [zeta, nu, sigma]
    """
    zeta, nu, sigma = p
    n = len(droplet_diameter)
    
    # Enforce positivity of sigma (log(sigma) also diverges at zero)
    if sigma <= 0.0:
        return -np.inf
        
    return -((spindle_length - zeta * droplet_diameter - nu)**2).sum() \
                / 2.0 / sigma**2 \
                - (n + 1) * np.log(sigma) - 1.5 * np.log(1.0 + zeta**2)

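As a quick sanity check (with made-up parameter values, not part of the original notebook), the log posterior should be finite for positive sigma and -inf otherwise.

# Evaluate the log posterior at a test point and at an invalid (negative) sigma
p_test = np.array([0.2, 20.0, 4.0])
print(log_posterior(p_test, df.droplet_diameter, df.spindle_length))
print(log_posterior([0.2, 20.0, -1.0], df.droplet_diameter, df.spindle_length))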
Now, we set up the MCMC calculation.

In [4]:
n_dim = 3  # number of parameters in the model (zeta, nu, sigma)
n_walkers = 10  # number of MCMC walkers
n_burn = 200  # "burn-in" period to let chains stabilize
n_steps = 10000  # number of MCMC steps to take after burn-in

# Seed random numbers for reproducibility. 
np.random.seed(42)

# Generate random starting points for walkers.  
# p0[i,j] is the starting point for walk i along variable j.
p0 = np.empty((n_walkers, n_dim))
p0[:,0] = np.random.uniform(0.0, 1.0, n_walkers)   # zeta
p0[:,1] = np.random.uniform(10.0, 90.0, n_walkers) # nu 
p0[:,2] = np.random.exponential(1.0, n_walkers)   # sigma

# Set up the EnsembleSampler instance
sampler = emcee.EnsembleSampler(n_walkers, n_dim, log_posterior, 
                                args=(df.droplet_diameter, df.spindle_length))

# Do the burn-in
pos, prob, state = sampler.run_mcmc(p0, n_burn)

# Reset sampler and run from the burn-in state we got to
sampler.reset()
pos, prob, state = sampler.run_mcmc(pos, n_steps)
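As a quick diagnostic (not shown in the original lecture), we can check the mean acceptance fraction of the walkers; values roughly between 0.2 and 0.5 are generally considered healthy for an ensemble sampler.

# Mean acceptance fraction across all walkers
print('Mean acceptance fraction: {0:.3f}'.format(
        np.mean(sampler.acceptance_fraction)))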

Now for post-processing. The samples drawn from the posterior are stored in sampler.chain. It is easier to work with sampler.flatchain, which concatenates the chains of the individual walkers into a single array.
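To see the difference concretely (an illustrative check, not in the original output), we can compare the shapes of the two arrays.

# sampler.chain has shape (n_walkers, n_steps, n_dim);
# sampler.flatchain concatenates the walkers into (n_walkers * n_steps, n_dim)
print(sampler.chain.shape)       # (10, 10000, 3)
print(sampler.flatchain.shape)   # (100000, 3)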

In [7]:
# Compute mean parameter values
p_mean = sampler.flatchain.mean(axis=0)
p_std = sampler.flatchain.std(axis=0)

# Pull out values
zeta_mcmc, nu_mcmc, sigma_mcmc = p_mean
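Since we already computed p_std, we might also print the posterior mean and standard deviation for each parameter (a small addition, not part of the original output).

# Report posterior mean ± standard deviation for each parameter
for name, mean, std in zip(['zeta', 'nu', 'sigma'], p_mean, p_std):
    print('{0:>5s} = {1:.4f} ± {2:.4f}'.format(name, mean, std))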

We'll compute the most probable parameter values using scipy.optimize.curve_fit as before.

In [8]:
# Define our function to pass to scipy.optimize.curve_fit
def linear(droplet_diameter, zeta, nu):
    return zeta * droplet_diameter + nu

# Generate an initial guess to put into the solver
zeta_0 = 0.5
nu_0 = 20.0
p0 = np.array([zeta_0, nu_0])

# Rock-n-roll!
p_opt, junk_output = scipy.optimize.curve_fit(
               linear, df.droplet_diameter, df.spindle_length, p0)

# Unpack results
zeta_most_prob, nu_most_prob = p_opt
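Incidentally, the second return value of curve_fit (discarded as junk_output above) is the covariance matrix of the fit. As a side check, not part of the lecture, we could compare the square roots of its diagonal to the MCMC standard deviations for zeta and nu.

# Approximate standard errors from curve_fit vs. MCMC posterior std. devs.
print(np.sqrt(np.diag(junk_output)))
print(p_std[:2])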

Let's compare the results!

In [9]:
print("""
            curve_fit                  MCMC
           ==============          ==============
zeta   =      {0:.3f}                   {2:.3f}
nu     =     {1:.3f} µm               {3:.3f} µm
""".format(zeta_most_prob, nu_most_prob, zeta_mcmc, nu_mcmc))

            curve_fit                  MCMC
           ==============          ==============
zeta   =      0.205                   0.205
nu     =     21.004 µm               21.005 µm


And with our MCMC results, we can plot the posterior!

In [19]:
# Plot the posterior as a density plot, using every 15th sample
fig, ax = plt.subplots()
ax.plot(sampler.flatchain[::15, 1], sampler.flatchain[::15, 0], 'k.', alpha=0.1)
ax.set_xlabel(r'$\nu$ (µm)')
ax.set_ylabel(r'$\zeta$')
Out[19]:
<matplotlib.text.Text at 0x113581ed0>

A cool thing about MCMC results is that marginalization is trivial. You just ignore the other variables! We can plot

\begin{align} P(\zeta~|~D,I) = \int \mathrm{d}\nu \int \mathrm{d} \sigma\, P(\zeta,\nu,\sigma~|~D,I) \end{align}

just by making a histogram of the $\zeta$ values.

In [17]:
# Make histogram
n, b, p = plt.hist(sampler.flatchain[:,0], 
                   bins=int(np.sqrt(len(sampler.flatchain[:,0]))), 
                   normed=True, histtype='step', color='k')
plt.xlabel(r'$\zeta$')
plt.ylabel(r'$P(\zeta|D,I)$')
Out[17]:
<matplotlib.text.Text at 0x11355b690>
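Because marginalization just means working with the sampled $\zeta$ values directly, a credible region is only a line of NumPy away (a sketch, not part of the original notebook).

# A 95% credible interval for zeta, computed directly from the samples
zeta_low, zeta_high = np.percentile(sampler.flatchain[:, 0], [2.5, 97.5])
print('zeta 95% credible interval: [{0:.3f}, {1:.3f}]'.format(zeta_low, zeta_high))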