Tutorial 6b: Model selection with PTMCMC

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here. You can also view it here.

In [1]:
import collections
import itertools
import math

import numpy as np
import numba
import pandas as pd
import scipy.stats as st

import emcee
import corner
import bebi103

import matplotlib.pyplot as plt
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 14, 'axes.titlesize': 14}
sns.set(rc=rc)
%matplotlib inline

Reminder of Bayesian model selection problem

We did model selection in the previous tutorial, but as a reminder, our goal is to compute the probability that a given model is true, given the data. By Bayes's theorem, this is

\begin{align} P(M_i\mid D, I) = \frac{P(D \mid M_i, I)\,P(M_i\mid I)}{P(D\mid I)}. \end{align}

Unless we can somehow sum probabilities over all models to compute

\begin{align} P(D\mid I) = \sum_{i\in\text{all models}} P(D \mid M_i, I)\,P(M_i\mid I), \end{align}

we cannot compute $P(M_i\mid D, I)$ exactly. Instead, we compare a pair of models, $M_i$ and $M_j$, by computing the odds ratio,

\begin{align} O_{ij} = \frac{P(M_i\mid I)}{P(M_j\mid I)}\,\frac{P(D\mid M_i, I)}{P(D\mid M_j,I)}. \end{align}
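
Note that if $M_i$ and $M_j$ are the only two models under consideration (an assumption we make here), so that $P(M_i\mid D, I) + P(M_j\mid D, I) = 1$, the odds ratio directly gives the posterior probability of a model,

\begin{align} P(M_i\mid D, I) = \frac{O_{ij}}{1 + O_{ij}}. \end{align}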

The second fraction in the odds ratio is the ratio of the evidences from the respective parameter estimation problems. By Bayes's theorem, we have, for the parameter estimation problem,

\begin{align} P(\mathbf{a}_i\mid D, M_i, I) = \frac{P(D\mid \mathbf{a}_i, M_i, I)\,P(\mathbf{a}_i\mid M_i, I)}{P(D\mid M_i, I)}, \end{align}

where the parameters are $\mathbf{a}_i$. Because the posterior of the parameter estimation problem must be normalized,

\begin{align} P(D\mid M_i, I) = \int \mathrm{d}\mathbf{a}_i\,P(D\mid \mathbf{a}_i, M_i, I)\,P(\mathbf{a}_i\mid M_i, I). \end{align}

In Tutorial 6a, we evaluated this integral by approximating the posterior as sharply peaked and then using the Laplace approximation to compute the integral. We found, however, that we run into trouble when the sharply-peaked assumption breaks down. To remedy this, we showed in lecture 5 that we can use parallel-tempering Markov chain Monte Carlo (PTMCMC) to compute the integral. Once we have done this, we only need to specify our prior belief in the respective models, and we have then computed the odds ratio without approximation.
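
As a reminder of the key result (stated here without derivation), PTMCMC samples a series of tempered distributions proportional to $P(D\mid \mathbf{a}_i, M_i, I)^\beta\, P(\mathbf{a}_i\mid M_i, I)$ for $0 \le \beta \le 1$, and the log evidence then follows by thermodynamic integration,

\begin{align} \ln P(D\mid M_i, I) = \int_0^1 \mathrm{d}\beta \left\langle \ln P(D\mid \mathbf{a}_i, M_i, I)\right\rangle_\beta, \end{align}

where $\langle\cdot\rangle_\beta$ denotes an average over the samples taken at inverse temperature $\beta$.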

The data set

We will revisit the data of Singer, et al., which you can download here. As a reminder, Singer and coworkers used single molecule FISH to get mRNA transcript counts of four different genes in each cell in a population of mouse embryonic stem cells. The mRNA counts for a given gene are negative binomially distributed across cells,

\begin{align} P(n\mid r, p) = \frac{\Gamma(n+r)}{n!\,\Gamma(r)}\,p^{r}(1-p)^{n}. \end{align}

If there are two different cell types that express the gene differently, the distribution is a linear combination of negative binomials,

\begin{align} P(n\mid r_1, p_1, r_2, p_2, f) = f\,\frac{\Gamma(n+r_1)}{n!\,\Gamma(r_1)}\,p_1^{r_1}(1-p_1)^{n} + (1-f) \frac{\Gamma(n + r_2)}{n!\,\Gamma(r_2)}\,p_2^{r_2}(1-p_2)^{n}. \end{align}
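
As a quick sanity check of conventions (this check is not part of the original analysis), the PMF above is exactly that of scipy.stats.nbinom, with our $r$ and $p$ playing the roles of scipy's n and p parameters; np and st are the aliases imported above.

import scipy.special

def nbinom_pmf(n, r, p):
    """Negative binomial PMF as written above."""
    return np.exp(scipy.special.gammaln(n + r) - scipy.special.gammaln(n + 1)
                  - scipy.special.gammaln(r) + r * np.log(p) + n * np.log(1 - p))

# Compare to scipy's implementation for arbitrary test values of r and p
n = np.arange(20)
print(np.allclose(nbinom_pmf(n, 2.5, 0.3), st.nbinom.pmf(n, 2.5, 0.3)))

This prints True.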

We can visually inspect the ECDFs of the mRNA counts for each population of cells to get an idea about their distributions.

In [2]:
# Load DataFrame
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')

# Make plots of ECDFs
fig, ax = plt.subplots(2, 2)
sp_inds = list(itertools.product([0,1], [0,1]))

for i, gene in enumerate(df.columns):
    # Build ECDF
    y = np.arange(1, len(df[gene])+1) / len(df[gene])
    x = np.sort(df[gene].values)
    
    # Plot
    ax[sp_inds[i]].plot(x, y, '.')
    ax[sp_inds[i]].text(0.7, 0.25, gene, transform=ax[sp_inds[i]].transAxes,
                       fontsize=18)
    ax[sp_inds[i]].margins(0.02)
    
# Clean up
for i in [0,1]:
    ax[1,i].set_xlabel('mRNA count')
    ax[i,0].set_ylabel('ECDF')
plt.tight_layout(w_pad=2, h_pad=2)

The double inflection points of the Rex1 ECDF suggest bimodality. The others seem to be unimodal, but as we saw in Tutorial 4a, it is not obvious that they actually are. So, we want to rigorously compute an odds ratio to discern unimodality from bimodality.

Model selection of the Singer, et al. data

We will define Model 1 to be a single negative binomial distribution and Model 2 to be a double negative binomial. As before, since we don't have a preference for either model a priori, we will take $P(M_1\mid I) \approx P(M_2\mid I)$. We will take uniform priors for $p$, $p_1$, and $p_2$ on the interval $[0,1]$. Similarly, we will take $f$ to have a uniform prior on $[0,1]$. We will take uniform priors for $r$, $r_1$, and $r_2$ on the interval $[1, r_\mathrm{max}]$, where $r_\mathrm{max}$ is the most bursts we would expect during an mRNA lifetime.

To estimate $r_\mathrm{max}$, we will refer to the excellent BioNumbers website. The typical mRNA degradation time in mice is 9 hours. A burst every half hour seems pretty fast to me, which would give $9/0.5 = 18$ bursts in an mRNA lifetime, so we'll estimate $r_\mathrm{max} \approx 20$ and use this in our calculations.

So, to do model selection, we need to solve the parameter estimation problem using PTMCMC for each model. We'll do this first for Rex1.

Computing the log likelihood and log prior

In Tutorial 6a, we wrote functions to compute the log likelihood and log prior for the two models. We will reuse them here, noting that we have to make sure they are normalized. Remember: for the model selection problem, the likelihood and prior must be normalized.

In [3]:
@numba.vectorize([numba.float64(numba.float64)],nopython=True)
def lngamma(x):
    """
    Log of Gamma function.
    """
    # Guard against negative arguments, which can arise from out-of-bounds
    # parameter values; the prior assigns these zero probability anyway
    if x < 0.0:
        return 0.0
    return math.lgamma(x)


@numba.jit(nopython=True)
def nbinom_logpmf(n, r, p):
    """
    Negative binomial PMF evaluated a n-values for a given r, p.
    """
    return lngamma(n+r) - lngamma(n+1) - lngamma(r) \
                    + n * np.log(1-p) + r * np.log(p)

    
@numba.jit(nopython=True)
def double_nbinom_logpmf(n, r1, p1, r2, p2, f):
    """
    Double negative binomial PMF evaluated at n-values.
    """
    # logaddexp of the two log PMFs for numerical stability
    return np.logaddexp(nbinom_logpmf(n, r1, p1) + np.log(f),
                        nbinom_logpmf(n, r2, p2) + np.log(1-f))


@numba.jit(nopython=True)
def log_likelihood_1(params, n):
    """
    Log likelihood for negative binomially distributed mRNA counts.
    """
    r, p = params
    return np.sum(nbinom_logpmf(n, r, p))


@numba.jit(nopython=True)
def log_likelihood_2(params, n):
    """
    Log likelihood for double negative binomially distributed mRNA counts.
    """
    r1, p1, r2, p2, f = params
    return np.sum(double_nbinom_logpmf(n, r1, p1, r2, p2, f))

        
@numba.jit(nopython=True)        
def log_prior_1(params, r_max):
    """
    Log prior for negative binomially distributed bursty gene expression.
    """
    r, p = params
    # Zero probability for parameter values out of bounds
    if p < 0 or p > 1 or r <= 0 or r > r_max:
        return -np.inf
    return -np.log(r_max - 1)


@numba.jit(nopython=True)      
def log_prior_2(params, r_max):
    """
    Log prior for double negative binomially distributed bursty 
    gene expression.
    """
    r_1, p_1, r_2, p_2, f = params
    
    # All the things that could be wrong (note that we require p_1 >= p_2
    # to break the label-switching degeneracy between the two components)
    test = (f <= 0) or (f >= 1) or (r_1 <= 0) or (r_2 <= 0) or (p_1 < p_2) \
           or (p_2 < 0) or (p_1 > 1) or (r_1 > r_max) or (r_2 > r_max)

    if test:
        return -np.inf
    return -2 * np.log(r_max - 1)
        
    
@numba.jit(nopython=True)       
def log_posterior_1(params, n, r_max):
    """
    Log posterior for negative binomials.
    """
    lp = log_prior_1(params, r_max)
    if lp == -np.inf:
        return -np.inf
    
    return lp + log_likelihood_1(params, n)


@numba.jit(nopython=True)       
def log_posterior_2(params, n, r_max):
    """
    Log posterior for the double negative binomial model.
    """
    lp = log_prior_2(params, r_max)
    if lp == -np.inf:
        return -np.inf
    
    return lp + log_likelihood_2(params, n)
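
As a quick numerical check (not part of the original notebook), we can verify that the prior for model 1 is properly normalized over its stated support, $r \in [1, r_\mathrm{max}]$ and $p \in [0, 1]$, by integrating it with the trapezoidal rule.

r_max = 20

# Grids over the stated support of the prior
r_vals = np.linspace(1, r_max, 101)
p_vals = np.linspace(0, 1, 101)

# Evaluate the prior PDF on the grid
prior = np.empty((len(r_vals), len(p_vals)))
for i, r in enumerate(r_vals):
    for j, p in enumerate(p_vals):
        prior[i, j] = np.exp(log_prior_1(np.array([r, p]), r_max))

# Double integral via the trapezoidal rule; should be very close to 1
print(np.trapz(np.trapz(prior, p_vals, axis=1), r_vals))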

Doing the PTMCMC calculation

We will use the utility function bebi103.run_pt_emcee() to perform the PTMCMC calculations. You should review its doc string to see how to call it. You can get the bebi103 module by cloning the repository:

git clone https://github.com/justinbois/bebi103_utils.git

If you have already cloned it, do a git pull to make sure you have all updates.
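
For reference, here is a minimal sketch of roughly what such a utility does under the hood with emcee's (version 2.x) PTSampler. This is illustrative only; the actual internals of bebi103.run_pt_emcee() may differ, and the iteration counts here are kept tiny just so the snippet runs quickly.

# Illustrative PTMCMC for model 1 with emcee's PTSampler
n_temps, n_walkers, ndim, r_max = 20, 50, 2, 20
n = df['Rex1'].values

# Starting positions, shape (n_temps, n_walkers, ndim)
p0 = np.empty((n_temps, n_walkers, ndim))
p0[:,:,0] = np.random.uniform(0, r_max, (n_temps, n_walkers))  # r
p0[:,:,1] = np.random.uniform(0, 1, (n_temps, n_walkers))      # p

sampler = emcee.PTSampler(n_temps, n_walkers, ndim, log_likelihood_1,
                          log_prior_1, loglargs=(n,), logpargs=(r_max,))

# Burn in, then sample again from the final burn-in positions
for p, lnprob, lnlike in sampler.sample(p0, iterations=100):
    pass
sampler.reset()
for p, lnprob, lnlike in sampler.sample(p, iterations=100):
    pass

# ln(evidence) and an error estimate by thermodynamic integration
lnZ, dlnZ = sampler.thermodynamic_integration_log_evidence(fburnin=0)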

Now, we write our function to do PTMCMC for the Singer, et al. samples.

In [4]:
def sample_ptmcmc(n, model, r_max=20, n_temps=20, n_walkers=50, n_burn=5000, 
                  n_steps=5000):
    """
    Sample posterior using PTMCMC.
    
    n = array of mRNA counts in cells
    """
    # Arguments for likelihood
    loglargs = (n,)
    
    # Arguments for prior
    logpargs = (r_max,)
    
    # Column headings for output DataFrames
    columns = {1: ['r', 'p'],
               2: ['r1', 'p1', 'r2', 'p2', 'f']}
        
    # Parameters and how we start them
    if model == 1:
        # Build p0
        p0 = np.empty((n_temps, n_walkers, 2))
        p0[:,:,0] = np.random.uniform(0, r_max, (n_temps, n_walkers))  # r
        p0[:,:,1] = np.random.uniform(0, 1, (n_temps, n_walkers))      # p

        # Get samples
        return bebi103.run_pt_emcee(
            log_likelihood_1, log_prior_1, n_burn, n_steps, n_temps=n_temps, 
            p0=p0, columns=columns[model], loglargs=loglargs, 
            logpargs=logpargs, return_lnZ=True)
    elif model == 2:
        # Build p0
        p0 = np.empty((n_temps, n_walkers, 5))
        p0[:,:,0] = np.random.uniform(0, r_max, (n_temps, n_walkers))  # r1
        p0[:,:,1] = np.random.uniform(0, 1, (n_temps, n_walkers))      # p1
        p0[:,:,2] = np.random.uniform(0, r_max, (n_temps, n_walkers))  # r2
        p0[:,:,3] = np.random.uniform(0, 1, (n_temps, n_walkers))      # p2
        p0[:,:,4] = np.random.uniform(0, 1, (n_temps, n_walkers))      # f

        # Make sure p1 > p2
        p0[:,:,1], p0[:,:,3] = np.maximum(p0[:,:,1], p0[:,:,3]), \
                                np.minimum(p0[:,:,1], p0[:,:,3])
    
        return bebi103.run_pt_emcee(
            log_likelihood_2, log_prior_2, n_burn, n_steps,
            n_temps=n_temps, p0=p0, loglargs=loglargs, logpargs=logpargs, 
            columns=columns[model], return_lnZ=True)

PTMCMC for Rex1

We'll now use our handy functions to perform PTMCMC for the Rex1 gene. The calculation is lengthy. We will burn in for 5,000 steps and then sample for another 5,000 steps. So, with 50 walkers, we will get a quarter of a million samples of the cold distribution. We will have a total of five million samples over all 20 temperatures, which is a lot of MCMC sampling!

In [5]:
# Specify gene
gene = 'Rex1'

# Sample both models using PTMCMC; This will take a long time
df_1, lnZ_1, dlnZ_1 = sample_ptmcmc(df[gene].values, 1)
df_2, lnZ_2, dlnZ_2 = sample_ptmcmc(df[gene].values, 2)

Now that we have these samples, we can compute the posterior probabilities and do the usual plotting for the parameter estimation problem. We pull out the cold distribution by indexing, e.g., df_1[df_1.beta_ind==0]. First, we'll make a triangle plot for model 1.

In [6]:
corner.corner(df_1[df_1.beta_ind==0][['r', 'p']], bins=50);

And now for model 2.

In [7]:
corner.corner(df_2[df_2.beta_ind==0][['r1', 'p1', 'r2', 'p2', 'f']], bins=50,
             smooth=1);

All of the marginalized posteriors are unimodal, which allows us to cleanly parametrize model 2.
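
Since the posterior is unimodal, we can summarize it compactly. As a quick example (not in the original notebook), we can report the median and a quantile-based 95% credible region for each parameter from the cold-chain samples.

# Median and 2.5th/97.5th percentiles of the cold distribution samples
cold = df_2.loc[df_2.beta_ind==0, ['r1', 'p1', 'r2', 'p2', 'f']]
print(cold.quantile([0.025, 0.5, 0.975]))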

We can plot smooth CDFs using the parameters we obtained from the MCMC calculation and compare them to the data. We will write a function to do this, since we will use it again later. We will randomly select 500 of the parameter values sampled by the walkers at the lowest temperature ($\beta = 1$) and plot the corresponding CDFs.

In [8]:
def plot_cdfs(df, df_1, df_2, gene, show=(1, 1), ax=None):
    """
    Plot the ECDF of mRNA counts for gene, overlaid with CDFs computed
    from posterior samples of model 1 and/or model 2, as set by show.
    """
    if ax is None:
        fig, ax = plt.subplots(1, 1)

    # Indices for MCMC trace
    inds = df_1['beta_ind'] == 0

    # Build ECDF
    y = np.arange(1, len(df[gene])+1) / len(df[gene])
    x = np.sort(df[gene].values)

    # Plot ECDF
    ax.plot(x, y, '.', zorder=2)
    ax.margins(0.02)

    # Overlay theoretical CDFs
    if show[0]:
        plot_inds = np.random.choice(np.arange(np.sum(inds), dtype=int), 500)
        n_plot = np.arange(0, df[gene].max()+1)
        for ind in plot_inds:
            r, p = df_1.loc[inds, ['r', 'p']].iloc[ind].values.flatten()
            ax.plot(n_plot, st.nbinom.cdf(n_plot, r, p), '-', lw=0.2, 
                     alpha=0.2, color=sns.color_palette()[2], zorder=1)

    # Overlay theoretical CDFs for model 2
    if show[1]:
        cols = ['r1', 'r2', 'p1', 'p2', 'f']
        inds = df_2['beta_ind'] == 0
        plot_inds = np.random.choice(np.arange(np.sum(inds), dtype=int), 500)
        n_plot = np.arange(0, df[gene].max()+1)
        for ind in plot_inds:
            r_1, r_2, p_1, p_2, f = \
                        df_2.loc[inds, cols].iloc[ind].values.flatten()
            cdf = f * st.nbinom.cdf(n_plot, r_1, p_1) +\
                (1-f) * st.nbinom.cdf(n_plot, r_2, p_2)
            ax.plot(n_plot, cdf, '-', lw=0.2, alpha=0.2, 
                     color=sns.color_palette()[5], zorder=1)
            
    # Label axes
    ax.set_xlabel('num. of mRNA transcripts')
    ax.set_ylabel('cdf')

    return ax

In [9]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0] = plot_cdfs(df, df_1, df_2, gene, show=(1,0), ax=ax[0])
ax[1] = plot_cdfs(df, df_1, df_2, gene, show=(0,1), ax=ax[1]);

We see that the unimodal distribution cannot fit the data, but the bimodal one does. Let's now compute our goal: the odds ratio! It is easy now that we have lnZ for both models.

In [10]:
print('Odds ratio:', np.exp(lnZ_1 - lnZ_2))
Odds ratio: 1.63950212188e-19

We get an odds ratio of about $10^{-19}$, which is close to the result we got with an approximate calculation in Tutorial 6a. There is clearly more than one cell type when considering Rex1 expression.
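
The sampler also returned error estimates, dlnZ_1 and dlnZ_2, for the log evidences. If we assume these errors are independent (a rough estimate; this was not part of the original calculation), we can attach an error bar to the log odds ratio by adding them in quadrature.

# Propagate the lnZ error estimates in quadrature
d_ln_odds = np.sqrt(dlnZ_1**2 + dlnZ_2**2)
print('ln(odds ratio): {0:.1f} ± {1:.1f}'.format(lnZ_1 - lnZ_2, d_ln_odds))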

Analysis of Prdm14

We will do the same analysis with Prdm14, which showed qualitatively single negative binomial behavior. Before we do, let's save our results from Rex1 in a dictionary.

In [11]:
results = {'Rex1': {'lnZ_1': lnZ_1,
                    'dlnZ_1': dlnZ_1,
                    'lnZ_2': lnZ_2,
                    'dlnZ_2': dlnZ_2,
                    'df_1': df_1.copy(),
                    'df_2': df_2.copy()}}

Now we'll run the calculation for Prdm14 and store the results in the results dictionary.

In [12]:
gene = 'Prdm14'

# Sample both models using PTMCMC; This will take a long time
df_1, lnZ_1, dlnZ_1 = sample_ptmcmc(df[gene].values, 1)
df_2, lnZ_2, dlnZ_2 = sample_ptmcmc(df[gene].values, 2)

# Store results in dictionary
results[gene] = {'lnZ_1': lnZ_1,
                 'dlnZ_1': dlnZ_1,
                 'lnZ_2': lnZ_2,
                 'dlnZ_2': dlnZ_2,
                 'df_1': df_1.copy(),
                 'df_2': df_2.copy()}

We looked at the corner plot for this model before, when we used an ensemble sampler. We'll look again to see if it is just as nasty.

In [13]:
corner.corner(df_2.loc[df_2.beta_ind==0, ['r1', 'p1', 'r2', 'p2', 'f']], 
              bins=50, smooth=1);

The posterior has a broad shape, particularly in parameters $r_1$, $p_1$, and $f$. If the underlying distribution were bimodal, we would not be able to cleanly parametrize it and would instead have to show plots of the posterior.

For completeness, let's look at the triangle plot for model 1.

In [14]:
corner.corner(df_1.loc[df_1.beta_ind==0, ['r', 'p']], bins=50, smooth=1);

Here, we have a unimodal posterior and we can cleanly parametrize the distribution of mRNA counts for Prdm14.

We can also compare the respective CDFs.

In [15]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0] = plot_cdfs(df, df_1, df_2, gene, show=(1,0), ax=ax[0])
ax[1] = plot_cdfs(df, df_1, df_2, gene, show=(0,1), ax=ax[1])

It is hard to tell the difference between the two. We might therefore expect an Occam penalty to make the single negative binomial more likely. Let's compute the odds ratio and see.

In [16]:
print('Odds ratio:', np.exp(lnZ_1 - lnZ_2))
Odds ratio: 5.96005792967

Indeed, the single negative binomial distribution is more likely, by about a factor of 6. While this tips the odds in favor of the single negative binomial model, it is not overwhelmingly convincing. We might want to do more experiments.

Analysis of Nanog and Rest

We'll finish our analysis by doing model selection on the Nanog and Rest counts.

In [17]:
for gene in ['Nanog', 'Rest']:
    # Sample both models using PTMCMC; This will take a long time
    df_1, lnZ_1, dlnZ_1 = sample_ptmcmc(df[gene].values, 1)
    df_2, lnZ_2, dlnZ_2 = sample_ptmcmc(df[gene].values, 2)

    # Store results in dictionary
    results[gene] = {'lnZ_1': lnZ_1,
                     'dlnZ_1': dlnZ_1,
                     'lnZ_2': lnZ_2,
                     'dlnZ_2': dlnZ_2,
                     'df_1': df_1.copy(),
                     'df_2': df_2.copy()}

Let's plot the triangle plots, first for Nanog.

In [18]:
df_2 = results['Nanog']['df_2']
corner.corner(df_2.loc[df_2.beta_ind==0, ['r1', 'p1', 'r2', 'p2', 'f']], 
              bins=50, smooth=1);

We see that pretty much any value of $f$ goes and that $p_1$ and $p_2$ are both small. The burst size distributions are also strongly bimodal. Again, this precludes a clean parametrization of the posterior, so we should show the posterior if we do believe that the Nanog mRNA distribution is bimodal.

Note that this calls into question results from MLE, since the posterior is so clearly multimodal. This also demonstrates the utility of PTMCMC, which allows for multimodal sampling.

Again, for completeness, we'll look at the posterior for the unimodal distribution.

In [19]:
df_1 = results['Nanog']['df_1']
corner.corner(df_1.loc[df_1.beta_ind==0, ['r', 'p']], bins=50, smooth=1);

As we might expect, it is also unimodal, as it was for Rex1 and Prdm14.

Now let's look at Rest.

In [20]:
df_2 = results['Rest']['df_2']
corner.corner(df_2.loc[df_2.beta_ind==0, ['r1', 'p1', 'r2', 'p2', 'f']], 
              bins=50, smooth=1);

Again, we have a multimodal posterior. We'll do another quick check of the posterior for model 1.

In [21]:
df_1 = results['Rest']['df_1']
corner.corner(df_1[df_1.beta_ind==0][['r', 'p']], bins=50);

Let's look at the CDFs, first for Nanog.

In [22]:
gene = 'Nanog'
df_1 = results[gene]['df_1']
df_2 = results[gene]['df_2']
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0] = plot_cdfs(df, df_1, df_2, gene, show=(1,0), ax=ax[0])
ax[1] = plot_cdfs(df, df_1, df_2, gene, show=(0,1), ax=ax[1])

We see some systematic failure of the unimodal curve to capture the Nanog data. The bimodal distribution works better. This should play out in the odds ratio.

Now, let's look at the CDFs for Rest.

In [23]:
gene = 'Rest'
df_1 = results[gene]['df_1']
df_2 = results[gene]['df_2']
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0] = plot_cdfs(df, df_1, df_2, gene, show=(1,0), ax=ax[0])
ax[1] = plot_cdfs(df, df_1, df_2, gene, show=(0,1), ax=ax[1])

We do see some systematic error in the unimodal distribution, but it is not as severe as in the Rex1 or Nanog examples.

Ok, now let's compute the odds ratios! We'll do them all to provide a summary. Remember that $O_{12} > 1$ favors the unimodal (single negative binomial) model, while $O_{12} < 1$ favors the bimodal model.

In [24]:
for gene in df.columns:
    print('Odds ratio for {0:s}: {1:.2e}'.format(gene,
          np.exp(results[gene]['lnZ_1'] - results[gene]['lnZ_2'])))
Odds ratio for Rex1: 1.64e-19
Odds ratio for Rest: 3.25e-01
Odds ratio for Nanog: 1.05e-03
Odds ratio for Prdm14: 5.96e+00

So, Rex1 is surely bimodal, and it is quite likely that Nanog is as well. For Rest and Prdm14, the odds ratio is not really conclusive. It is more likely than not that Rest is double negative binomially distributed and Prdm14 is negative binomially distributed, but that is about all we can say.