This lecture nugget was shown in Lecture 4 and was generated from an IPython notebook. You can download the notebook here.
from __future__ import division, absolute_import, \
                       print_function, unicode_literals
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.special
import scipy.optimize
import emcee
# We'll use ColorBrewer for color maps
from brewer2mpl import sequential
# We will also use numdifftools (install using Canopy)
import numdifftools as nd
# Necessary to display plots in this IPython notebook
%matplotlib inline
We will load the data as in Tutorial 3a. (You can download the data set here.)
# Load data into DataFrame
df = pd.read_csv('../data/good_et_al/invitro_droplet_data.csv',
                 comment='#')
# Rename them all at once
df.columns = ['droplet_diameter', 'droplet_volume', 'spindle_length',
              'spindle_width', 'tubulin_area']
# Put droplet volume in units of µm^3 for consistency of units
df.droplet_volume *= 1e9
# Only use data for droplets less than 90 um
df = df[df.droplet_diameter < 90.0]
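Before moving on, it doesn't hurt to take a quick look at the first few rows to make sure the renaming and filtering behaved as expected.
# Take a quick look at the data
df.head()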
Recall our linear model for dependence of spindle length on droplet diameter.
\begin{align} l = \nu + \zeta d. \end{align}
We assume Gaussian-distributed errors in the measurements, and we will go ahead and use an uninformative prior. The log posterior is then
\begin{align} \ln P(\zeta,\nu,\sigma~|~D,I) = \text{constant} - (n+1)\ln \sigma - \frac{3}{2}\ln (1+\zeta^2) - \frac{1}{2\sigma^2}\sum_{i\in D} \left(l_i - (\nu + \zeta d_i)\right)^2 \end{align}
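As a reminder of where this expression comes from: the Gaussian likelihood contributes the $-n\ln\sigma$ term and the sum of squares, and the uninformative prior is the standard choice for fitting a line with unknown Gaussian error,
\begin{align} P(\zeta,\nu,\sigma~|~I) \propto \frac{1}{\sigma\left(1+\zeta^2\right)^{3/2}}, \end{align}
which is a Jeffreys prior on $\sigma$ and is uniform in the angle the line makes with the $d$-axis. Taking the log of the product of the likelihood and this prior gives the expression above.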
We will use emcee to sample the posterior. First, we define the log posterior function.
def log_posterior(p, droplet_diameter, spindle_length):
    """
    Returns the log of the posterior (sans an additive constant).
    p = [zeta, nu, sigma]
    """
    zeta, nu, sigma = p
    n = len(droplet_diameter)
    # Enforce positivity of sigma (the posterior is zero for sigma <= 0)
    if sigma <= 0.0:
        return -np.inf
    return -((spindle_length - zeta * droplet_diameter - nu)**2).sum() \
                / (2.0 * sigma**2) \
           - (n + 1) * np.log(sigma) - 1.5 * np.log(1.0 + zeta**2)
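As a quick sanity check (the parameter values here are arbitrary), we can verify that the function returns a finite number for reasonable parameters and $-\infty$ when $\sigma$ is nonpositive.
# Sanity check: finite for reasonable parameters, -inf for bad sigma
print(log_posterior([0.5, 20.0, 4.0], df.droplet_diameter, df.spindle_length))
print(log_posterior([0.5, 20.0, -1.0], df.droplet_diameter, df.spindle_length))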
Now, we set up the MCMC calculation.
n_dim = 3 # number of parameters in the model (zeta, nu, sigma)
n_walkers = 10 # number of MCMC walkers
n_burn = 200 # "burn-in" period to let chains stabilize
n_steps = 10000 # number of MCMC steps to take after burn-in
# Seed random numbers for reproducibility.
np.random.seed(42)
# Generate random starting points for walkers.
# p0[i,j] is the starting point for walker i along variable j.
p0 = np.empty((n_walkers, n_dim))
p0[:,0] = np.random.uniform(0.0, 1.0, n_walkers) # zeta
p0[:,1] = np.random.uniform(10.0, 90.0, n_walkers) # nu
p0[:,2] = np.random.exponential(1.0, n_walkers) # sigma
# Set up the EnsembleSampler instance
sampler = emcee.EnsembleSampler(n_walkers, n_dim, log_posterior,
                                args=(df.droplet_diameter, df.spindle_length))
# Do the burn-in
pos, prob, state = sampler.run_mcmc(p0, n_burn)
# Reset sampler and run from the burn-in state we got to
sampler.reset()
pos, prob, state = sampler.run_mcmc(pos, n_steps)
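Before working with the samples, it is worth checking the acceptance fraction of the walkers; as a rule of thumb (exact thresholds vary), values of roughly 0.2 to 0.5 indicate healthy sampling.
# Check the mean acceptance fraction of the walkers
print('Mean acceptance fraction: {0:.3f}'.format(
        sampler.acceptance_fraction.mean()))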
Now for post-processing. The samples drawn from the posterior are stored in sampler.chain. It is easier to work with sampler.flatchain, which concatenates the samples from the individual walkers into a single array.
# Compute mean parameter values
p_mean = sampler.flatchain.mean(axis=0)
p_std = sampler.flatchain.std(axis=0)
# Pull out values
zeta_mcmc, nu_mcmc, sigma_mcmc = p_mean
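Since we also have the standard deviations of the samples, we can report each parameter estimate with an error bar (mean ± standard deviation of the posterior samples).
# Print mean ± standard deviation for each parameter
for name, mean, std in zip(['zeta', 'nu', 'sigma'], p_mean, p_std):
    print('{0:5s} = {1:7.4f} ± {2:.4f}'.format(name, mean, std))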
We'll compute the most probable parameter values using scipy.optimize.curve_fit as before.
# Define our function to pass to scipy.optimize.curve_fit
def linear(droplet_diameter, zeta, nu):
return zeta * droplet_diameter + nu
# Generate an initial guess to put into the solver
zeta_0 = 0.5
nu_0 = 20.0
p0 = np.array([zeta_0, nu_0])
# Rock-n-roll!
p_opt, junk_output = scipy.optimize.curve_fit(
        linear, df.droplet_diameter, df.spindle_length, p0)
# Unpack results
zeta_most_prob, nu_most_prob = p_opt
Let's compare the results!
print("""
curve_fit MCMC
============== ==============
zeta = {0:.3f} {2:.3f}
nu = {1:.3f} µm {3:.3f} µm
""".format(zeta_most_prob, nu_most_prob, zeta_mcmc, nu_mcmc))
And with our results from MCMC, we can plot the posterior!
# Plot the posterior as a density plot, plotting every 15th sample
fig, ax = plt.subplots()
ax.plot(sampler.flatchain[::15, 1], sampler.flatchain[::15, 0], 'k.', alpha=0.1)
ax.set_xlabel(r'$\nu$ (µm)')
ax.set_ylabel(r'$\zeta$')
A cool thing about MCMC results is that marginalization is trivial. You just ignore the other variables! We can plot
\begin{align} P(\zeta~|~D,I) = \int \mathrm{d}\nu \int \mathrm{d} \sigma\, P(\zeta,\nu,\sigma~|~D,I) \end{align}
just by making a histogram of the $\zeta$ values.
# Make histogram
n, b, p = plt.hist(sampler.flatchain[:,0],
                   bins=int(np.sqrt(len(sampler.flatchain[:,0]))),
                   normed=True, histtype='step', color='k')
plt.xlabel(r'$\zeta$')
plt.ylabel(r'$P(\zeta|D,I)$')
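From the same marginalized samples we can read off a credible region directly; for example, the middle 95% of the $\zeta$ samples (the percentile bounds are a conventional choice).
# 95% credible region for zeta from the marginalized samples
zeta_low, zeta_high = np.percentile(sampler.flatchain[:,0], [2.5, 97.5])
print('zeta 95% credible region: [{0:.4f}, {1:.4f}]'.format(zeta_low, zeta_high))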