Auxiliary lesson 6: The MCMC Hammer

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [43]:
import collections
import math

import numpy as np
import pandas as pd
import scipy.stats as st
import numba

import emcee
import pymc3 as pm

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

In Tutorial 5a, you were introduced to Markov chain Monte Carlo using PyMC3. In this tutorial, we will showcase another MCMC package, emcee (formerly known as "MCMC Hammer," clearly the best name of a software package ever).

The MCMC Hammer

Emcee uses the affine invariant ensemble sampler of Goodman and Weare, which is not a Metropolis-Hastings sampler. I will not go into the details of the sampling method, but instead refer you to the emcee paper. Qualitatively, the sampler uses many walkers at the same time, and the proposal distribution for the next step of each walker uses the positions of all of the other walkers. So, a salient feature of this algorithm is that many walkers are walking at the same time and talking to each other. emcee's lead author, Dan Foreman-Mackey, suggests a rule of thumb that you should have at least twice as many walkers as you do parameters in your model.
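As a minimal sketch of that rule of thumb (the numbers here are illustrative, not anything prescribed by emcee):

n_dim = 5                        # e.g., r1, r2, p1, p2, w1 for the mixture model
n_walkers = max(2 * n_dim, 50)   # at least twice the number of parameters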

If you have not already, you can install emcee using pip.

pip install emcee

For the rest of this tutorial, I will demonstrate how to use emcee to do inference using a data set we have already seen.

The data set

We will use the same data set as in Tutorial 5a, from Singer, et al., where they looked at single molecule FISH data. The authors acquired counts of transcripts of four different genes, Rex1, Rest, Prdm14, and Nanog, in a cell population. First, we'll load in the data.

In [2]:
# Load DataFrame
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')

As a reminder, here are the ECDFs.

In [3]:
plots = [bebi103.viz.ecdf(df[gene],
                          x_axis_label='mRNA count',
                          title=gene,
                          plot_height=150)
         for gene in df.columns]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=1))

The posterior

Specification of a model in emcee is simple. You simply pass a function that computes the log posterior. The first argument of the function is a Numpy array containing the parameter values that you are sampling, and all subsequent arguments are used in the computation but not directly manipulated by the sampler. Conveniently, emcee uses the same API for the posterior definition as scipy.optimize does. So, we need to write out the log posterior to specify it.

Recall that we considered two models, one with a Negative Binomial likelihood and one with a mixture of two Negative Binomials for a likelihood. We assumed Uniform priors on the parameters of the Negative Binomials, giving a posterior of

\begin{align} g(r, p \mid \mathbf{n}) &\propto \prod_{n\in\mathbf{n}} \frac{(n + r - 1)!}{n!(r-1)!}\,p^r(1-p)^n \end{align}

for the single Negative Binomial, and

\begin{align} g(r_1, r_2, p_1, p_2, f\mid \mathbf{n}) &\propto \prod_{n\in\mathbf{n}}\left[f\,\frac{(n + r_1 - 1)!}{n!\,(r_1-1)!}\,p_1^{r_1}(1-p_1)^{n} \right.\\[1em] &\;\;\;\;\left.+\; (1-f)\, \frac{(n + r_2 - 1)!}{n!\,(r_2-1)!}\,p_2^{r_2}(1-p_2)^{n}\right] \end{align}

for the double Negative Binomial.

Now, we'll code up the log posterior for the two models, in much the same way we do for our treatment using optimization. We will specify that $p_1 < p_2$.

In [4]:
def log_likelihood(params, n):
    """
    Log likelihood for single Negative Binomial.
    """
    r, p = params

    return np.sum(st.nbinom.logpmf(n, r, p))


def log_prior(params, r_max):
    """
    Log prior for single Negative Binomial.
    """
    r, p = params
    
    # Zero probability of having p < 0 or p > 1
    if p < 0 or p > 1:
        return -np.inf
    
    # Zero probability of r <= 0 or r > r_max
    if r <= 0 or r > r_max:
        return -np.inf

    return -np.log(r_max - 1)


def log_posterior(params, n, r_max):
    """
    Log posterior for single Negative Binomial.
    """
    # Compute log prior
    lp = log_prior(params, r_max)

    # If log prior is -inf, return that
    if lp == -np.inf:
        return -np.inf

    # Compute and return posterior
    return lp + log_likelihood(params, n)


def log_likelihood_bimodal(params, n):
    """
    Log likelihood for linear combination of neg. binomials.
    """
    r_1, r_2, p_1, p_2, f = params
    
    return np.sum(np.log(f * st.nbinom.pmf(n, r_1, p_1)
                  + (1-f) * st.nbinom.pmf(n, r_2, p_2)))


def log_prior_bimodal(params, r_max):
    """
    Log prior for linear combination of neg. binomials.
    """
    r_1, r_2, p_1, p_2, f = params
    
    test = ( (f <= 0) or (f >= 1) or (r_1 <= 0) or (r_2 <= 0) or (p_1 > p_2) 
                      or (p_1 < 0) or (p_2 > 1) or (r_1 > r_max) or (r_2 > r_max) )

    if test:
        return -np.inf
    
    return -2 * np.log(r_max - 1)


def log_posterior_bimodal(params, n, r_max):
    """
    Log posterior for the Singer data for lin. comb. of neg. binom
    """
    # Compute log prior
    lp = log_prior_bimodal(params, r_max)

    # If log prior is -inf, return that
    if lp == -np.inf:
        return -np.inf

    # Compute and return posterior
    return lp + log_likelihood_bimodal(params, n)

Finally, because we are taking a Uniform prior for $r$, we will specify $r_\mathrm{max} = 50$.

In [44]:
r_max = 50

Sampling the posterior with MCMC

We now set up the Markov chain Monte Carlo calculation. Remember that the algorithm employed by emcee, the affine invariant ensemble sampler, relies on multiple walkers sampling in parallel. We therefore have to specify how many walkers we want. In addition to the number of walkers, we also need to specify

  • The number of parameters in the posterior.
  • The number of burn-in steps.
  • The number of steps to take after burn-in.

We'll start with the single Negative Binomial model. Let's specify these as we set up our calculation.

In [45]:
n_dim = 2        # number of parameters in the model (r and p)
n_walkers = 50   # number of MCMC walkers
n_burn = 2000    # "burn-in" period to let chains stabilize
n_steps = 2000   # number of MCMC steps to take after burn-in

Next, for consistency, I will seed the random number generator so we can compare our results as we go through the tutorial.

In [46]:
np.random.seed(42)

Now, we need to specify where in parameter space we will start each of our 50 walkers. We will randomly draw the starting point for $p$ out of a uniform distribution on $[0,1]$. Because we expect $r$ values far from $r_\mathrm{max}$, which is just a generous upper bound, we will choose our starting $r$ values from an Exponential distribution with a mean of 1. Though we've made this choice, we could choose many other starting points. We could, for example, find the MAP and then start our walkers from its vicinity. This is often done to reduce burn-in time.
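As a sketch of that alternative (we do not use it below), we could exploit the fact that our log posterior has the same call signature that scipy.optimize.minimize() expects, find the MAP by minimizing the negative log posterior, and start the walkers in a tight Gaussian ball around it. The helper function, the choice of Powell's method, and the 1e-4 jitter scale here are illustrative choices, not part of this tutorial's workflow.

import scipy.optimize

def neg_log_posterior(params, n, r_max):
    """Negative log posterior, for use with scipy.optimize.minimize()."""
    return -log_posterior(params, n, r_max)

# Find the MAP starting from a rough guess for (r, p); Powell's method
# avoids needing derivatives of the log posterior
res = scipy.optimize.minimize(neg_log_posterior, np.array([1.0, 0.5]),
                              args=(df['Prdm14'].values, r_max),
                              method='powell')

# Start the walkers in a small Gaussian ball around the MAP
p0_near_map = res.x + 1e-4 * np.random.normal(size=(n_walkers, n_dim))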

In [47]:
# p0[i,j] is the starting point for walk i along variable j.
p0 = np.empty((n_walkers, n_dim))
p0[:,0] = np.random.exponential(1, n_walkers)       # r
p0[:,1] = np.random.uniform(0, 1, n_walkers)        # p

# Make sure we didn't by chance pick an r that was too big
p0[:,0] = np.minimum(p0[:,0], r_max)

Now we are ready to instantiate our sampler, an emcee.EnsembleSampler instance. The syntax is straightforward. It takes as required arguments the number of walkers, the number of dimensions (that is, the number of parameters), and a function giving the log of the posterior. The kwarg args is as in scipy.optimize.minimize(); it is a tuple of other arguments to be passed into your log posterior function. These arguments almost always include your data. Finally, the threads kwarg is useful: you can specify how many cores of your machine you want to use for the calculation, and the MCMC sampling will be done with parallel processing. It is a good idea to choose two fewer than the number of cores you have on your machine. For me, that means six threads.

For this example, we will look at Prdm14 expression. It is important to pass the values of the DataFrame column (a Numpy array) rather than the column itself, since the Numpy array gives better computational performance.

In [48]:
sampler = emcee.EnsembleSampler(n_walkers, n_dim, log_posterior, 
                                args=(df['Prdm14'].values, r_max), threads=6)

Now, we'll do the burn-in. This is just using the run_mcmc() method of the sampler without storing the results. The first argument to the run_mcmc() method is the starting point of the parameter values. The second is the number of MCMC steps to take (in this case n_burn). We use the kwarg storechain=False to indicate that we are not saving the samples. The method returns the ending position (set of parameter values), the log of the posterior probability at that position, and the state of the random number generator. We need to keep the positions of the walkers because this is where we start sampling after burn-in.

In [49]:
# Do burn-in
pos, prob, state = sampler.run_mcmc(p0, n_burn, storechain=False)

Now that burn-in is complete, we can run the sampler to generate the MCMC samples!

In [50]:
# Sample again, starting from end burn-in state
_ = sampler.run_mcmc(pos, n_steps)

Parsing MCMC results

The sampler object has lots of properties and methods that are useful to us. The most useful are sampler.chain and sampler.flatchain, which are the samples themselves. You can read the documentation about these, but I find it is easier to work with the samples in a Pandas DataFrame. I wrote a utility function to convert the chain stored in the sampler instance to a DataFrame. You just need to specify the names of the parameters.
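If you want to see roughly what such a conversion involves, here is a hand-rolled sketch. This is not the actual implementation of bebi103.emcee.sampler_to_dataframe(); it assumes emcee's flatchain and flatlnprobability attributes, which store the samples in walker-major order.

# Hand-rolled sketch of converting the samples to a DataFrame; not the
# actual implementation of bebi103.emcee.sampler_to_dataframe().
# sampler.chain has shape (n_walkers, n_steps, n_dim) and
# sampler.flatchain has shape (n_walkers * n_steps, n_dim).
df_manual = pd.DataFrame(sampler.flatchain, columns=['r', 'p'])

# Log posterior value for each sample, flattened the same way
df_manual['lnprob'] = sampler.flatlnprobability

# Chain (walker) ID for each sample, assuming walker-major ordering
df_manual['chain'] = np.repeat(np.arange(n_walkers), n_steps)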

In [51]:
# Convert sampler output to DataFrame
df_mcmc = bebi103.emcee.sampler_to_dataframe(sampler, columns=['r', 'p'])

# Take a look
df_mcmc.head()
Out[51]:
r p lnprob chain
0 0.554329 0.110686 -717.005795 0
1 0.586584 0.115463 -717.149443 0
2 0.586231 0.116646 -717.172708 0
3 0.603870 0.119197 -717.359086 0
4 0.603870 0.119197 -717.359086 0

As we can see, the DataFrame has the values of each parameter for each sample, as well as the log probability of the posterior and the ID of the chain. Since we have 50 walkers in this case, we have 50 chains.
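With the samples in a DataFrame, posterior summaries are one-liners. For example, a quick sketch of posterior medians and an approximate 95% credible interval from the 2.5th and 97.5th percentiles:

# Posterior medians and approximate 95% credible intervals for r and p
df_mcmc[['r', 'p']].median()
df_mcmc[['r', 'p']].quantile([0.025, 0.975])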

We can generate a corner plot from this data frame.

In [52]:
g = bebi103.viz.corner(df_mcmc, vars=['r', 'p'])
bokeh.io.show(g)

MCMC with a double Negative Binomial

We can repeat the analysis with the Rex1 data set using a linear combination of Negative Binomials as the likelihood. Setting up the sampler and running it can get lengthy, so I wrote a convenient function to automate this a bit, bebi103.emcee.run_ensemble_emcee(). This takes an ordered dictionary (instantiated with collections.OrderedDict()) as a specification of the parameters and their initial values. Each key is a parameter name, and associated with each key is a tuple containing the function used to generate starting points for that parameter and the arguments for that function. The starting point function must have the call signature f(*args_for_function, n_walkers). Once we have this set up, we specify the number of walkers, the number of burn-in steps, and the number of sampling steps as kwargs to bebi103.emcee.run_ensemble_emcee(). The output is a DataFrame with our samples, log posterior values, and chain IDs.

In [53]:
# Parameters and how we start them (make sure p2 > p1)
params = collections.OrderedDict(
        [('r1', (np.random.exponential, (1,))),
         ('r2', (np.random.exponential, (1,))),
         ('p1', (np.random.uniform, (0, 0.15))),
         ('p2', (np.random.uniform, (0.15, 0.3))),
         ('w1', (np.random.uniform, (0, 1)))])

# Obtain samples
df_mcmc = bebi103.emcee.run_ensemble_emcee(
                            log_post=log_posterior_bimodal,
                            n_burn=5000,
                            n_steps=5000,
                            n_walkers=50,
                            p_dict=params,
                            args=(df['Rex1'].values, r_max),
                            threads=6)

And, let's take a look at the corner plot.

In [54]:
g = bebi103.viz.corner(df_mcmc, vars=['r1', 'p1', 'r2', 'p2', 'w1'])
bokeh.io.show(g)

Very nice!

Gelman-Rubin statistics

The walkers in this algorithm are not independent. You therefore cannot compute the Gelman-Rubin statistic from the traces of individual walkers within a single run. Instead, you have to run separate MCMC calculations and use their flat chains to compute the variance among chains.
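Here is a rough sketch of how that could look. The gelman_rubin() function below is a hypothetical helper (not part of bebi103 or emcee) implementing the standard Gelman-Rubin formula for a single parameter from several independent runs.

def gelman_rubin(chains):
    """
    Gelman-Rubin R-hat for a single parameter, given a list of equally
    sized 1D arrays of samples from independent MCMC runs.
    (Hypothetical helper for illustration; not part of bebi103 or emcee.)
    """
    chains = np.array(chains)
    m, n = chains.shape   # number of runs, samples per run

    # Within-run variance: average of the per-run sample variances
    W = np.mean(np.var(chains, axis=1, ddof=1))

    # Between-run variance, computed from the per-run means
    B = n * np.var(np.mean(chains, axis=1), ddof=1)

    # Pooled estimate of the marginal posterior variance, and R-hat
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

# Usage sketch: r_chains would be a list of 1D arrays of r samples, one per
# independent call to run_ensemble_emcee(); R-hat close to 1 is what we want.
# r_hat = gelman_rubin(r_chains)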

A note on speed

The MCMC calculations can get long, so speed becomes an important consideration. PyMC3 uses compiled models to speed up its calculations. When sampling with emcee, computing the log posterior is often the most costly step. One way to speed things up is to use the fabulous Numba package, which allows for just-in-time (JIT) compilation of functions. You can convert your Python function for the log posterior into compiled code, which typically runs faster, even than the equivalent NumPy code.

It is often as simple as putting a @numba.jit() decorator on top of your functions. However, Numba only supports a subset of NumPy functions and of the features of the Python programming language; see here and here.

Unfortunately, it does not support anything from the scipy.stats module, so we'll have to hand-code our Negative Binomial log PMF.

In [55]:
@numba.vectorize([numba.float64(numba.float64)], nopython=True)
def lngamma(x):
    """
    Log of Gamma function.
    """
    if x < 0.0:
        return 0.0
    return math.lgamma(x)


@numba.vectorize([numba.float64(numba.int64, numba.float64, numba.float64)],
                 nopython=True)
def nbinom_logpmf(n, r, p):
    """
    Log of the Negative Binomial PMF evaluated at n-values for given r, p.
    """
    return lngamma(n+r) - lngamma(n+1) - lngamma(r) \
                    + n * np.log(1-p) + r * np.log(p)

Here, we have used the numba.vectorize() decorator. Now, we can code up our log likelihood and log prior. We do not JIT the log posterior function because the JITted function object cannot be pickled. This means that we cannot use multiple threads if we JIT the log posterior. The speed loss is negligible, since the likelihood is by far the most expensive calculation.

In [56]:
@numba.jit(nopython=True)
def log_likelihood_numba(params, n):
    """
    Log likelihood for single Negative Binomial.
    """
    r, p = params
    return np.sum(nbinom_logpmf(n, r, p))


@numba.jit(nopython=True)
def log_prior_numba(params, r_max):
    """
    Log prior for single Negative Binomial.
    """
    r, p = params
    
    # Zero probability of having p < 0 or p > 1
    if p < 0 or p > 1:
        return -np.inf
    
    # Zero probability of r <= 0 or r > r_max
    if r <= 0 or r > r_max:
        return -np.inf

    return -np.log(r_max - 1)


def log_posterior_numba(params, n, r_max):
    """
    Log posterior for single Negative Binomial.
    """
    # Compute log prior
    lp = log_prior_numba(params, r_max)

    # If log prior is -inf, return that
    if lp == -np.inf:
        return -np.inf

    # Compute and return posterior
    return lp + log_likelihood_numba(params, n)

Now let's test the speed of the JITted version versus the original one we used.

In [57]:
r = 0.1
p = 0.3
n = df['Prdm14'].values

%timeit log_posterior((r, p), n, r_max)
%timeit log_posterior_numba((r, p), n, r_max)
177 µs ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
24.1 µs ± 659 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This is a speed boost of about 7-fold, which is valuable. There is a big difference between a calculation that takes 10 minutes and one that takes 70.
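Because the JITted log posterior has the same call signature as before, swapping it into the sampler requires no other changes. A quick sketch, reusing the setup from earlier in the tutorial:

# The JITted log posterior drops straight into the sampler
sampler_numba = emcee.EnsembleSampler(n_walkers, n_dim, log_posterior_numba,
                                      args=(df['Prdm14'].values, r_max),
                                      threads=6)

# Burn-in and sampling proceed exactly as before
pos, prob, state = sampler_numba.run_mcmc(p0, n_burn, storechain=False)
_ = sampler_numba.run_mcmc(pos, n_steps)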