Auxiliary lesson 7: Model selection with PTMCMC

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [11]:
import math
import pickle

import numpy as np
import numba
import pandas as pd
import scipy.stats as st
import theano.tensor as tt

import pymc3 as pm

import bebi103

import bokeh.io
bokeh.io.output_notebook()

In this tutorial, we will revisit the analysis of the FISH data from Singer, et al. We will use techniques involving sampling at multiple "temperatures" to perform model comparison.

The data set

We will revisit the data of Singer, et al., which you can download here. As a reminder, Singer and coworkers used single molecule FISH to get mRNA transcript counts of four different genes in each cell in a population of mouse embryonic stem cells. The mRNA counts of a given gene in a cell are negative binomially distributed,

\begin{align} f(n\mid r, p) = \frac{\Gamma(n+r)}{n!\,\Gamma(r)}\,p^{r}(1-p)^{n}. \end{align}

If there are two different cell types that express the gene differently, the distribution is a linear combination of negative binomials,

\begin{align} g(n\mid r_1, p_1, r_2, p_2, w) = w\,\frac{\Gamma(n+r_1)}{n!\,\Gamma(r_1)}\,p_1^{r_1}(1-p_1)^{n} + (1-w)\,\frac{\Gamma(n + r_2)}{n!\,\Gamma(r_2)}\,p_2^{r_2}(1-p_2)^{n}. \end{align}

For this exercise, we will only look at the Rex1 gene. Let's start by looking at the ECDF of counts.

In [2]:
# Load DataFrame
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')

# Extract Rex1 data
n = df['Rex1'].values

# Plot the ECDF
bokeh.io.show(bebi103.viz.ecdf(n, x_axis_label='mRNA count'))

The double inflection points for Rex1 suggest bimodality.

Model comparison of the Singer, et al. data

We will define Model 1 to be a single negative binomial distribution and Model 2 to be a double negative binomial. As before, since we don't have a preference for either model a priori, we will take $g(M_1\mid I) \approx g(M_2\mid I)$. We will take uniform priors for $p$, $p_1$, and $p_2$ on the interval $[0,1]$. Similarly, we will take $w$ to have a uniform prior on $[0,1]$. We will take uniform priors for $r$, $r_1$, and $r_2$ on the interval $[1, r_\mathrm{max}]$, where $r_\mathrm{max}$ is the most bursts we would expect during an mRNA lifetime.

To estimate $r_\mathrm{max}$, we will refer to the excellent BioNumbers website. The typical mRNA degradation time in mice is 9 hours. A burst every half hour seems pretty fast, which would give at most $9/0.5 = 18$ bursts during an mRNA lifetime, so we'll estimate $r_\mathrm{max} \approx 20$ and use this in our calculations.

Computing the log likelihood and log prior

In Tutorial 6a, we wrote functions to compute the log likelihood and log prior for the two models. We will reuse them here. Remember, for the model selection problem, the likelihood and prior must be normalized.
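
For model 1, for example, the prior is uniform on the ranges above, so its normalized density is

\begin{align} g(r, p\mid I) = \frac{1}{r_\mathrm{max} - 1}, \qquad 1 \le r \le r_\mathrm{max},\; 0 \le p \le 1, \end{align}

which integrates to one over the allowed region. This is why log_prior_1 below returns $-\log(r_\mathrm{max} - 1)$ inside the bounds and $-\infty$ outside.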

In [3]:
@numba.vectorize([numba.float64(numba.float64)], nopython=True)
def lngamma(x):
    """
    Log of Gamma function.
    """
    if x < 0.0:
        return 0.0
    return math.lgamma(x)


@numba.jit(nopython=True)
def nbinom_logpmf(n, r, p):
    """
    Negative binomial PMF evaluated a n-values for a given r, p.
    """
    return lngamma(n+r) - lngamma(n+1) - lngamma(r) \
                    + n * np.log(1-p) + r * np.log(p)

    
@numba.jit(nopython=True)
def double_nbinom_logpmf(n, r1, p1, r2, p2, w):
    """
    Double negative binomial PMF evaluated at n-values.
    """
    return np.logaddexp(nbinom_logpmf(n, r1, p1) + np.log(w),
                        nbinom_logpmf(n, r2, p2) + np.log(1-w))


@numba.jit(nopython=True)
def log_likelihood_1(params, n):
    """
    Log likelihood for negative binomial distributed parameters.
    """
    r, p = params
    return np.sum(nbinom_logpmf(n, r, p))


@numba.jit(nopython=True)
def log_likelihood_2(params, n):
    """
    Log likelihood for double negative binomial distributed parameters.
    """
    r1, p1, r2, p2, w = params
    return np.sum(double_nbinom_logpmf(n, r1, p1, r2, p2, w))

        
@numba.jit(nopython=True)        
def log_prior_1(params, r_max):
    """
    Log prior for negative binomially distributed bursty gene expression.
    """
    r, p = params
    # Zero probability of having p < 0 or p > 1
    if p < 0 or p > 1 or r <= 0 or r > r_max:
        return -np.inf
    return -np.log(r_max - 1)


@numba.jit(nopython=True)      
def log_prior_2(params, r_max):
    """
    Log prior for double negative binomially distributed bursty 
    gene expression.
    """
    r_1, p_1, r_2, p_2, w = params
    
    # All the things that could be wrong
    test = (w <= 0) or (w >= 1) or (r_1 <= 0) or (r_2 <= 0) or (p_1 < p_2) \
           or (p_2 < 0) or (p_1 > 1) or (r_1 > r_max) or (r_2 > r_max)

    if test:
        return -np.inf
    return -2 * np.log(r_max - 1)
        
    
@numba.jit(nopython=True)       
def log_posterior_1(params, n, r_max):
    """
    Log posterior for negative binomials.
    """
    lp = log_prior_1(params, r_max)
    if lp == -np.inf:
        return -np.inf
    
    return lp + log_likelihood_1(params, n)


@numba.jit(nopython=True)       
def log_posterior_2(params, n, r_max):
    """
    Log posterior for negative binomials.
    """
    lp = log_prior_2(params, r_max)
    if lp == -np.inf:
        return -np.inf
    
    return lp + log_likelihood_2(params, n)
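
As a quick spot check of the likelihood, we can compare our hand-coded log PMF to scipy.stats.nbinom, which uses the same $(r, p)$ parametrization. This is just a sanity check and is not part of the main calculation.

In [ ]:
# Spot check: our log PMF should agree with scipy's negative binomial
n_check = np.arange(100).astype(float)
r_check, p_check = 4.5, 0.3
print(np.allclose(nbinom_logpmf(n_check, r_check, p_check),
                  st.nbinom.logpmf(n_check, r_check, p_check)))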

Doing the PTMCMC calculation

We will use the utility function bebi103.emcee.run_pt_emcee() to perform the PTMCMC calculations. You should review its doc string to see how to call it.

Now, we write a function to do PTMCMC for the Singer, et al. data.

In [4]:
def sample_ptmcmc(n, model, r_max=20, n_temps=20, n_walkers=50, n_burn=5000, 
                  n_steps=5000):
    """
    Sample posterior using PTMCMC.
    
    n = array of mRNA counts in cells
    """
    # Arguments for likelihood
    loglargs = (n,)
    
    # Arguments for prior
    logpargs = (r_max,)
    
    # Column headings for the output DataFrames
    columns = {1: ['r', 'p'],
               2: ['r1', 'p1', 'r2', 'p2', 'w']}
        
    # Parameters and how we start them
    if model == 1:
        # Build p0
        p0 = np.empty((n_temps, n_walkers, 2))
        p0[:,:,0] = np.random.uniform(0, r_max, (n_temps, n_walkers))  # r
        p0[:,:,1] = np.random.uniform(0, 1, (n_temps, n_walkers))      # p

        # Get samples
        return bebi103.emcee.run_pt_emcee(
            log_likelihood_1, log_prior_1, n_burn, n_steps, n_temps=n_temps, 
            p0=p0, columns=columns[model], loglargs=loglargs, 
            logpargs=logpargs, return_lnZ=True)
    elif model == 2:
        # Build p0
        p0 = np.empty((n_temps, n_walkers, 5))
        p0[:,:,0] = np.random.uniform(0, r_max, (n_temps, n_walkers))  # r1
        p0[:,:,1] = np.random.uniform(0, 1, (n_temps, n_walkers))      # p1
        p0[:,:,2] = np.random.uniform(0, r_max, (n_temps, n_walkers))  # r2
        p0[:,:,3] = np.random.uniform(0, 1, (n_temps, n_walkers))      # p2
        p0[:,:,4] = np.random.uniform(0, 1, (n_temps, n_walkers))      # w

        # Make sure p1 > p2
        p0[:,:,1], p0[:,:,3] = np.maximum(p0[:,:,1], p0[:,:,3]), \
                                np.minimum(p0[:,:,1], p0[:,:,3])
    
        return bebi103.emcee.run_pt_emcee(
            log_likelihood_2, log_prior_2, n_burn, n_steps,
            n_temps=n_temps, p0=p0, loglargs=loglargs, logpargs=logpargs, 
            columns=columns[model], return_lnZ=True)

PTMCMC for Rex1

We'll now use our handy functions to perform PTMCMC for the Rex1 gene. The calculation is lengthy. We will burn in for 5,000 steps and sample for 5,000 steps, using 50 walkers at each of 20 temperatures. So, we will get a quarter of a million samples of the cold distribution, and five million samples in total over all temperatures (ten million steps counting burn-in), which is a lot of MCMC sampling!

In [5]:
# Sample both models using PTMCMC; This will take a long time
df_1, lnZ_1, dlnZ_1 = sample_ptmcmc(n, 1)
df_2, lnZ_2, dlnZ_2 = sample_ptmcmc(n, 2)

Now that we have these samples, we can compute the posterior probabilities and do the usual plotting for the parameter estimation problem. We pull out the cold distribution by indexing, e.g., df_1[df_1.beta_ind==0]. First, we'll make a triangle plot for model 1.

In [6]:
bokeh.io.show(bebi103.viz.corner(df_1[df_1.beta_ind==0], vars=['r', 'p']))

And now for model 2.

In [7]:
bokeh.io.show(bebi103.viz.corner(df_2[df_2.beta_ind==0],
                                 vars=['r1', 'p1', 'r2', 'p2', 'w']))

All of the marginalized posteriors are unimodal, so we can cleanly report parameter estimates for model 2.
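
Because the samples are in a tidy DataFrame, we can also quickly summarize the cold-chain parameter estimates, for example with medians (a rough point summary, not a credible region calculation).

In [ ]:
# Median parameter estimates from the cold (beta = 1) chain
df_2.loc[df_2.beta_ind==0, ['r1', 'p1', 'r2', 'p2', 'w']].median()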

Let's now compute our goal: the odds ratio! It is easy now that we have lnZ.

In [8]:
print('Odds ratio:', np.exp(lnZ_1 - lnZ_2))
Odds ratio: 1.28552577424e-21

We get an odds ratio of about $10^{-21}$, strongly favoring the double negative binomial model.
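
Since we took $g(M_1\mid I) \approx g(M_2\mid I)$, we can also express this as an approximate posterior probability for model 2 (considering only these two models).

In [ ]:
# With equal prior probabilities for the two models, the posterior
# probability of model 2 is 1 / (1 + odds ratio), essentially unity here
odds_ratio = np.exp(lnZ_1 - lnZ_2)
print('Approx. probability of model 2:', 1 / (1 + odds_ratio))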

Model selection using PyMC3

We can also perform model selection using PyMC3. It does not do parallel tempering, but we can nonetheless define "hot" distributions, which have $\beta < 1$. We can use NUTS (or whatever other PyMC3 sampler we want) to sample out of these hot distributions. We can then perform numerical quadrature to get the odds ratio.
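
The idea behind the quadrature is thermodynamic integration: for a normalized prior, the log evidence is the integral over $\beta$ of the average log likelihood under the hot posterior,

\begin{align} \ln Z = \int_0^1 \mathrm{d}\beta\,\left\langle \ln f(D\mid \theta)\right\rangle_\beta. \end{align}

As a minimal sketch (with hypothetical function and variable names), the quadrature could be done with the trapezoidal rule, given the $\beta$ values of the ladder and the mean log likelihood of the samples drawn at each $\beta$.

In [ ]:
def log_evidence_thermo(beta_vals, mean_log_like):
    """Estimate ln Z by trapezoidal quadrature of <ln likelihood> over beta."""
    return np.trapz(mean_log_like, x=beta_vals)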

I wrote some convenience functions to do this. The bebi103.pm submodule has "hot" distributions for all of PyMC3's named distributions. For example, to use a hot Negative Binomial distribution, we use bebi103.pm.HotNegativeBinomial(). The only difference from the standard distribution is that the first argument after the name of the distribution is beta_temp, the value of $\beta$ to be used for the hot distribution.

The function bebi103.pm.sample_beta_ladder() will perform sampling on a "ladder" of $\beta$ values. Its first argument is a Python function with call signature model_fun(beta_temp, *args) that returns a PyMC3 model for the hot posterior at a given value of $\beta$. The second argument is a list of $\beta$ values to use for sampling. See the documentation for other kwargs.

The sampling can be done in parallel across $\beta$ values by specifying the njobs kwarg. Note, though, that you cannot use multiple cores to sample at a single $\beta$ value. The output of the bebi103.pm.sample_beta_ladder() function is a list of tuples containing the traces for each $\beta$ value, the compiled PyMC3 models, and the $\beta$ values.

This list of tuples can be passed into the bebi103.pm.log_evidence_estimate() function to compute the approximate log of the fully marginalized likelihood.

To do the calculation, I ran the jobs in parallel on an EC2 instance. I used the script below. You can check out the model definition function and the call to bebi103.pm.sample_beta_ladder().

In [ ]:
# %load pymc3_model_comparison.py
import pickle

import numpy as np
import pandas as pd
import theano.tensor as tt

import pymc3 as pm

import bebi103



def m_nbinom_model(beta_temp, n_types, n):
    """Model for mixture model of Negative Binomials."""
    if n_types == 1:
        with pm.Model() as model:
            # Priors
            r = pm.HalfCauchy('r', beta=1)
            p = pm.Beta('p', alpha=1, beta=1)

            # Convert p to mu
            mu = pm.Deterministic('mu', r * (1 - p) / p)

            # Likelihood
            n_obs = bebi103.pm.HotNegativeBinomial('n_obs',
                                                   beta_temp=beta_temp,
                                                   alpha=r,
                                                   mu=mu,
                                                   observed=n)
        return model       

    with pm.Model() as model:
        # Priors
        r = pm.HalfCauchy('r', beta=1, shape=n_types)
        p = pm.Beta('p', alpha=1, beta=1, shape=n_types,
                    testval=np.linspace(0.1, 0.3, n_types))
        w = pm.Dirichlet('w', a=np.ones(n_types))

        # Break symmetry by sorting
        p_sorted = tt.sort(p)
        
        # Convert p's to mu's
        mu = pm.Deterministic('mu', r*(1-p_sorted)/p_sorted)

        # Pieces of likelihood
        likelihood = pm.NegativeBinomial.dist(alpha=r, mu=mu)

        # Likelihood
        n_obs = bebi103.pm.HotMixture('n_obs', beta_temp, w, likelihood, observed=n)
        
    return model


if __name__ == '__main__':
    # Load DataFrame
    df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')

    # Extract Rex1 data
    n = df['Rex1'].values


    beta_vals = np.logspace(-3, 0, 10)
    n_types = [1, 2, 3, 4, 5, 6]
    tmbs = [bebi103.pm.sample_beta_ladder(m_nbinom_model,
                                          beta_vals,
                                          args=(nt, n),
                                          tune=20000,
                                          draws=20000,
                                          njobs=10)
                for nt in n_types]

    with open('pymc3_model_comparison.pkl', 'wb') as f:
        pickle.dump(tmbs, f)

After running this calculation, I gathered the pickle file. Let's crack it open and compute the log evidence.

In [12]:
with open('pymc3_model_comparison.pkl', 'rb') as f:
    tmbs = pickle.load(f)
    
lnZ = np.array([bebi103.pm.log_evidence_estimate(tmb) for tmb in tmbs])

Now, let's compute the odds ratios, comparing each to the most probable (which is two cell types).

In [17]:
print('Odds ratios')
for n_types in [1, 3, 4, 5, 6]:
    print('g(n={0:d})/g(n=2):'.format(n_types), np.exp(lnZ[n_types-1] - lnZ[1]))
Odds ratios
g(n=1)/g(n=2): 3.16771621686e-18
g(n=3)/g(n=2): 0.509666513829
g(n=4)/g(n=2): 0.0648629551754
g(n=5)/g(n=2): 0.00734378595937
g(n=6)/g(n=2): 0.000745926014383

The odds ratios suggest that there are either 2 or 3 cell types.