Auxiliary tutorial 3: Linear regression with MCMC

This lecture nugget was shown in lecture 4 and was generated from an IPython notebook. You can download the notebook here.

In [1]:
from __future__ import division, absolute_import, \
                                    print_function, unicode_literals

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.special
import scipy.optimize

import emcee

# We'll use ColorBrewer for color maps
from brewer2mpl import sequential

# We will also use numdifftools (install using Canopy)
import numdifftools as nd

# Necessary to display plots in this IPython notebook
%matplotlib inline    

Loading the data

We will load the data as in Tutorial 3a. (You can download the data set here.)

In [2]:
# Load data into DataFrame
df = pd.read_csv('../data/good_et_al/invitro_droplet_data.csv', 
                 comment='#')

# Rename them all at once
df.columns = ['droplet_diameter', 'droplet_volume', 'spindle_length', 
              'spindle_width', 'tubulin_area']

# Put droplet volume in units of µm^3 for consistency of units
df.droplet_volume *= 1e9

# Only use data for droplets with diameter less than 90 µm
df = df[df.droplet_diameter < 90.0]
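As a quick check that the renaming and filtering did what we expect (not shown in the original notebook), we can peek at the first few rows.

# Inspect the first few rows of the renamed, filtered DataFrame
df.head()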

Performing a linear regression using MCMC

Recall our linear model for the dependence of spindle length on droplet diameter.

\begin{align} l = \nu + \zeta d. \end{align}

We assume Gaussian-distributed error in the measurements, and we use an uninformative prior. The log posterior is then

\begin{align} \ln P(\zeta,\nu,\sigma~|~D,I) = \text{constant} - (n+1)\ln \sigma - \frac{3}{2}\ln (1+\zeta^2) - \frac{1}{2\sigma^2}\sum_{i\in D} \left(l_i - (\nu + \zeta d_i)\right)^2 \end{align}
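For reference, this expression follows from combining the Gaussian likelihood with an uninformative prior, $P(\zeta,\nu,\sigma~|~I) \propto \sigma^{-1}(1+\zeta^2)^{-3/2}$ (a Jeffreys prior on $\sigma$ and an uninformative prior on the slope):

\begin{align} P(\zeta,\nu,\sigma~|~D,I) \propto \frac{1}{\sigma\,(1+\zeta^2)^{3/2}}\,\frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i\in D}\left(l_i - (\nu + \zeta d_i)\right)^2\right]. \end{align}

Taking the logarithm and absorbing the factors of $2\pi$ into the constant gives the expression above.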

We will use emcee to sample the posterior. First, we define the log posterior as a Python function.

In [3]:
def log_posterior(p, droplet_diameter, spindle_length):
    """
    Returns the log of the posterior (sans an additive constant)
    
    p = [zeta, nu, sigma]
    """
    zeta, nu, sigma = p
    n = len(droplet_diameter)
    
    # Enforce positivity of sigma (log(sigma) also diverges at zero)
    if sigma <= 0.0:
        return -np.inf
        
    return -((spindle_length - zeta * droplet_diameter - nu)**2).sum() \
                / 2.0 / sigma**2 \
                - (n + 1) * np.log(sigma) - 1.5 * np.log(1.0 + zeta**2)

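As a quick sanity check (with made-up parameter values, not part of the original notebook), the log posterior should be finite for positive sigma and -inf otherwise.

# Evaluate the log posterior at a test point and at an invalid (negative) sigma
p_test = np.array([0.2, 20.0, 4.0])
print(log_posterior(p_test, df.droplet_diameter, df.spindle_length))
print(log_posterior([0.2, 20.0, -1.0], df.droplet_diameter, df.spindle_length))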
Now, we set up the MCMC calculation.

In [4]:
n_dim = 3  # number of parameters in the model (zeta, nu, sigma)
n_walkers = 10  # number of MCMC walkers
n_burn = 200  # "burn-in" period to let chains stabilize
n_steps = 10000  # number of MCMC steps to take after burn-in

# Seed random numbers for reproducibility. 
np.random.seed(42)

# Generate random starting points for walkers.  
# p0[i,j] is the starting point for walk i along variable j.
p0 = np.empty((n_walkers, n_dim))
p0[:,0] = np.random.uniform(0.0, 1.0, n_walkers)   # zeta
p0[:,1] = np.random.uniform(10.0, 90.0, n_walkers) # nu 
p0[:,2] = np.random.exponential(1.0, n_walkers)   # sigma

# Set up the EnsembleSampler instance
sampler = emcee.EnsembleSampler(n_walkers, n_dim, log_posterior, 
                                args=(df.droplet_diameter, df.spindle_length))

# Do the burn-in
pos, prob, state = sampler.run_mcmc(p0, n_burn)

# Reset sampler and run from the burn-in state we got to
sampler.reset()
pos, prob, state = sampler.run_mcmc(pos, n_steps)
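As a quick diagnostic (not shown in the original lecture), we can check the mean acceptance fraction of the walkers; values roughly between 0.2 and 0.5 are generally considered healthy for an ensemble sampler.

# Mean acceptance fraction across all walkers
print('Mean acceptance fraction: {0:.3f}'.format(
        np.mean(sampler.acceptance_fraction)))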

Now for post-processing. The samples drawn from the posterior are stored in sampler.chain. It is easier to work with sampler.flatchain, which concatenates the chains of the individual walkers into a single array.
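To see the difference concretely (an illustrative check, not in the original output), we can compare the shapes of the two arrays.

# sampler.chain has shape (n_walkers, n_steps, n_dim);
# sampler.flatchain concatenates the walkers into (n_walkers * n_steps, n_dim)
print(sampler.chain.shape)       # (10, 10000, 3)
print(sampler.flatchain.shape)   # (100000, 3)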

In [7]:
# Compute mean parameter values
p_mean = sampler.flatchain.mean(axis=0)
p_std = sampler.flatchain.std(axis=0)

# Pull out values
zeta_mcmc, nu_mcmc, sigma_mcmc = p_mean
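Since we already computed p_std, we might also print the posterior mean and standard deviation for each parameter (a small addition, not part of the original output).

# Report posterior mean ± standard deviation for each parameter
for name, mean, std in zip(['zeta', 'nu', 'sigma'], p_mean, p_std):
    print('{0:>5s} = {1:.4f} ± {2:.4f}'.format(name, mean, std))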

We'll compute the most probable parameter values using scipy.optimize.curve_fit as before.

In [8]:
# Define our function to pass to scipy.optimize.curve_fit
def linear(droplet_diameter, zeta, nu):
    return zeta * droplet_diameter + nu

# Generate an initial guess to put into the solver
zeta_0 = 0.5
nu_0 = 20.0
p0 = np.array([zeta_0, nu_0])

# Rock-n-roll!
p_opt, junk_output = scipy.optimize.curve_fit(
               linear, df.droplet_diameter, df.spindle_length, p0)

# Unpack results
zeta_most_prob, nu_most_prob = p_opt
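Incidentally, the second return value of curve_fit (discarded as junk_output above) is the covariance matrix of the fit. As a side check, not part of the lecture, we could compare the square roots of its diagonal to the MCMC standard deviations for zeta and nu.

# Approximate standard errors from curve_fit vs. MCMC posterior std. devs.
print(np.sqrt(np.diag(junk_output)))
print(p_std[:2])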

Let's compare the results!

In [9]:
print("""
            curve_fit                  MCMC
           ==============          ==============
zeta   =      {0:.3f}                   {2:.3f}
nu     =     {1:.3f} µm               {3:.3f} µm
""".format(zeta_most_prob, nu_most_prob, zeta_mcmc, nu_mcmc))

            curve_fit                  MCMC
           ==============          ==============
zeta   =      0.205                   0.205
nu     =     21.004 µm               21.005 µm


And with our MCMC results, we can plot the posterior!

In [19]:
# Plot the posterior as a density plot, using every 15th sample
fig, ax = plt.subplots()
ax.plot(sampler.flatchain[::15, 1], sampler.flatchain[::15, 0], 'k.', alpha=0.1)
ax.set_xlabel(r'$\nu$ (µm)')
ax.set_ylabel(r'$\zeta$')
Out[19]:
<matplotlib.text.Text at 0x113581ed0>

A cool thing about MCMC results is that marginalization is trivial. You just ignore the other variables! We can plot

\begin{align} P(\zeta~|~D,I) = \int \mathrm{d}\nu \int \mathrm{d} \sigma\, P(\zeta,\nu,\sigma~|~D,I) \end{align}

just by making a histogram of the $\zeta$ values.

In [17]:
# Make histogram
n, b, p = plt.hist(sampler.flatchain[:,0], 
                   bins=int(np.sqrt(len(sampler.flatchain[:,0]))), 
                   normed=True, histtype='step', color='k')
plt.xlabel(r'$\zeta$')
plt.ylabel(r'$P(\zeta|D,I)$')
Out[17]:
<matplotlib.text.Text at 0x11355b690>
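Because marginalization just means working with the sampled $\zeta$ values directly, a credible region is only a line of NumPy away (a sketch, not part of the original notebook).

# A 95% credible interval for zeta, computed directly from the samples
zeta_low, zeta_high = np.percentile(sampler.flatchain[:, 0], [2.5, 97.5])
print('zeta 95% credible interval: [{0:.3f}, {1:.3f}]'.format(zeta_low, zeta_high))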