Tutorial 7a: Parameter estimation with Markov chain Monte Carlo

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import itertools

import numpy as np
import pandas as pd

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

In this tutorial, we will learn how to use Markov chain Monte Carlo to do parameter estimation. To get the basic idea behind MCMC, imagine for a moment that we can draw samples out of the posterior distribution. This means that the probability of choosing given values of a set of parameters is proportional to the posterior probability of that set of values. If we drew many many such samples, we could reconstruct the posterior from the samples, e.g., by making histograms. That's a big thing to imagine: that we can draw properly weighted samples. But, it turns out that we can! That is what MCMC allows us to do.
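
As a quick illustration of what we can do with samples (a toy sketch, not part of the analysis; the distribution and sample size are arbitrary), note that samples alone suffice to reconstruct a distribution and to compute summaries from it.

In [ ]:
# Stand-in for a posterior: a distribution we can already sample from
samples = np.random.gamma(2.0, 2.0, size=100000)

# Reconstruct the distribution from the samples, e.g., with a histogram...
counts, edges = np.histogram(samples, bins=50, density=True)

# ...or compute summaries (mean, middle 95%) directly from the samples
print(samples.mean(), np.percentile(samples, [2.5, 97.5]))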

We will discuss some theory behind this seemingly miraculous capability in lecture. For this tutorial, we will just use the fact that we can do the sampling to learn about posterior distributions in the context of parameter estimation.

Our MCMC engine

We will use Stan as our main engine for performing MCMC, and will use its Python interface, PyStan. Stan is the state of the art for MCMC. Importantly, it is also a probabilistic programming language, which allows us to more easily specify Bayesian generative models. The Stan manual will be very useful for you.

The data set

The data come from the Elowitz lab, published in Singer et al., Dynamic Heterogeneity and DNA Methylation in Embryonic Stem Cells, Molec. Cell, 55, 319-331, 2014, available here.

In this paper, the authors investigated cell populations of embryonic stem cells using RNA single molecule fluorescence in situ hybridization (smFISH), a technique that enables them to count the number of mRNA transcripts in a cell for a given gene. They were able to measure four different genes in the same cells. So, for one experiment, they get the counts of four different genes in a collection of cells.

The authors focused on genes that code for pluripotency-associated regulators to study cell differentiation. Indeed, differing gene expression levels are a hallmark of differentiated cells. The authors do not just look at counts in a given cell at a given time. The temporal nature of gene expression is also important. While the authors do not directly look at temporal data using smFISH (since the technique requires fixing the cells), they did look at time lapse fluorescence movies of other regulators. We will not focus on these experiments here, but will discuss how the distribution of mRNA counts acquired via smFISH can serve to provide some insight about the dynamics of gene expression.

The data set we are analyzing now comes from an experiment where smFISH was performed in 279 cells for the genes rex1, rest, nanog, and prdm14. The data set may be downloaded here.

ECDFs of mRNA counts

Let's go ahead and explore the data. We will load in the data set and generate ECDFs for the mRNA counts for each of the four genes.

In [2]:
# Load DataFrame
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')

# Take a look
df.head()
Out[2]:
Rex1 Rest Nanog Prdm14
0 11 34 39 0
1 172 91 33 5
2 261 70 68 0
3 178 54 88 1
4 129 54 41 0

The data are already tidy, since each row is a single measurement (a single cell). We can plot ECDFs for each gene.

In [3]:
# Make a list of plots
plots = [bebi103.viz.ecdf(df[gene], 
                          plot_height=125, 
                          title=gene)
         for gene in df]
plots[-1].xaxis.axis_label = 'mRNA count'

# Show them as a grid
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=1))

Note the difference in the $x$-axis scales. Clearly, prdm14 has far fewer mRNA copies than the other genes. The presence of two inflection points in the Rex1 ECDF implies bimodality.

The distribution of transcript counts and the "story"

The smFISH experiments give the number of mRNA copies of a given gene in a given cell. How would we expect these numbers to be distributed?

If gene expression is a purely Poisson process, we might expect a Poisson distribution. Or, if the copy number is itself somehow tightly regulated, we might expect a Gaussian distribution.

Study of gene expression dynamics, largely through fluorescence imaging, has led to a different story. Expression of many important genes can be bursty, which means that the promoter is on for a period of time in which transcripts are made, and then it is off for a while. The "on" periods are called "bursts" and are themselves well-modeled as a Poisson process. That is to say that the amount of time that a promoter is on is Exponentially distributed. Thus, we can think of a burst as a series of Bernoulli trials. A "failure" is production of an mRNA molecule, and a "success" is a switch to an off state. The number of "successes" we get is equal to the number of bursts we get per decay time of the mRNA. We can define the number of bursts before degradation of the mRNA as $\alpha$. This is the so-called burst frequency. So, we have a series of Bernoulli trials and we wait for $\alpha$ successes. Then, $n$, the total number of failures (which is the number of mRNA transcripts), is Negative Binomially distributed, since this matches the Negative Binomial story. Referring to the parametrization of the Negative Binomial distribution from Tutorial 3c,

\begin{align} n \sim \text{NegBinom}(\alpha, \beta), \end{align}

where $\beta$ is related to the probability $p$ of success of the Bernoulli trial (the switch to an off state) by $p = \beta/(1+\beta)$. The meaning of the parameter $\beta$, and the related quantity $p$, can be a little mystical here. We would like to relate it to the typical burst size, i.e., the typical number of transcripts made per burst. It will be easier for us to reason about priors of burst sizes and burst frequencies. The size of a single given burst (that is, the number of transcripts made in a burst) is geometrically distributed (since it matches that story), so

\begin{align} f(n_\mathrm{burst}\mid p) = (1-p)^{n_\mathrm{burst}}p. \end{align}

The mean number of transcripts $b$ in a burst is

\begin{align} b \equiv \left\langle n_\mathrm{burst}\right\rangle &= \sum_{n_\mathrm{burst}=0}^\infty n_\mathrm{burst}(1-p)^{n_\mathrm{burst}}p\\[1em] &= p \sum_{n_\mathrm{burst}=0}^\infty n_\mathrm{burst}(1-p)^{n_\mathrm{burst}} \\[1em] &= p(1-p)\, \frac{\mathrm{d}}{\mathrm{d}(1-p)}\sum_{n_\mathrm{burst}=0}^\infty(1-p)^{n_\mathrm{burst}} \\[1em] &= p(1-p)\, \frac{\mathrm{d}}{\mathrm{d}(1-p)}\,\frac{1}{p}\\[1em] &= -p(1-p)\, \frac{\mathrm{d}}{\mathrm{d}p}\,\frac{1}{p} \\[1em] &= \frac{1-p}{p} \\[1em] & = 1/\beta. \end{align}

So we now see that $1/\beta$ is the typical burst size.
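
As a quick numerical sanity check of this result (a sketch with an arbitrary value of $p$, not part of the analysis), we can simulate geometric burst sizes and compare the sample mean to $1/\beta$.

In [ ]:
# Arbitrary success probability and the corresponding beta
p = 0.1
beta = p / (1 - p)

# np.random.geometric counts trials up to and including the success,
# so subtract 1 to get the number of failures (transcripts per burst)
burst_sizes = np.random.geometric(p, size=100000) - 1

# Sample mean should be close to (1-p)/p = 1/beta
print(burst_sizes.mean(), 1 / beta)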

Using the Negative Binomial property of mRNA copy numbers of bursty gene expression, we can characterize the expression levels of a given cell type by the two parameters of the Negative Binomial, the burst frequency $\alpha$ and the burst size $b$. These are the two parameters we would like to infer from transcript count data.

The conclusion of all this is that we have our likelihood.

\begin{align} &\beta = 1/b,\\[1em] &n_i \sim \text{NegBinom}(\alpha, \beta)\;\; \forall i. \end{align}
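
As a check on this parametrization (again a sketch with arbitrary parameter values, and an integer $\alpha$ so the story can be simulated directly), we can simulate the bursty story and compare with NumPy's Negative Binomial sampler, which takes the number of successes and the success probability $p = \beta/(1+\beta)$.

In [ ]:
# Arbitrary burst frequency (integer for the simulation) and burst size
alpha = 3
b = 10.0
beta = 1.0 / b
p = beta / (1.0 + beta)

# Simulate the story: transcripts per cell = sum of alpha geometric bursts
n_story = np.sum(np.random.geometric(p, size=(100000, alpha)) - 1, axis=1)

# Sample directly from the Negative Binomial with the same parameters
n_nbinom = np.random.negative_binomial(alpha, p, size=100000)

# Both should have mean alpha * b = alpha / beta
print(n_story.mean(), n_nbinom.mean(), alpha * b)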

Building the generative model

Given that we have a Negative Binomial likelihood, we are left to specify priors for the burst size $b$ and the burst frequency $\alpha$. I would expect the time between bursts to be of order minutes, since that is a typical response time to signaling of a cell. This is of the same order of magnitude as an RNA lifetime, so I might then expect $\alpha$ to be of order unity. I am not very sure about this, so I will choose a rather broad Log-Normal prior.

\begin{align} \alpha \sim \text{LogNorm}(0, 2). \end{align}

I would expect the burst size to depend on promoter strength and/or strength of transcriptional activators. I could imagine anywhere from a few to a thousand transcripts per burst. So, I will again choose a broad Log-Normal prior for $b$.

\begin{align} b \sim \text{LogNorm}(2, 3). \end{align}

We then have the following model.

\begin{align} &\alpha \sim \text{LogNorm}(0,2), \\[1em] &b \sim \text{LogNorm}(2, 3), \\[1em] &\beta = 1/b,\\[1em] &n_i \sim \text{NegBinom}(\alpha, \beta) \;\forall i. \end{align}
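
To get a sense of how broad these priors are, we can look at their middle 95 percent (a quick sketch using NumPy's Log-Normal sampler; the percentile choice is arbitrary).

In [ ]:
# Middle 95% of the Log-Normal priors for alpha and b
alpha_prior = np.random.lognormal(0.0, 2.0, size=100000)
b_prior = np.random.lognormal(2.0, 3.0, size=100000)

print(np.percentile(alpha_prior, [2.5, 97.5]))
print(np.percentile(b_prior, [2.5, 97.5]))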

Prior predictive checks

We will use Stan for our prior predictive checks. I remind you that you should write separate .stan files for this, but I am writing the Stan program as a string here in the notebook so that it is stand-alone.

In [4]:
model_code_prior_pred = """
data {
  int N;
}


generated quantities {
  int n[N];

  real alpha = lognormal_rng(0.0, 2.0);
  real b = lognormal_rng(2.0, 3.0);
  real beta = 1.0 / b;
  
  for (i in 1:N) {
    n[i] = neg_binomial_rng(alpha, beta);
  }
}
"""

Let's compile this into a Stan model and draw our prior predictive data sets.

In [6]:
sm_gen = bebi103.stan.StanModel(model_code=model_code_prior_pred)

data = dict(N=len(df))
samples_gen = sm_gen.sampling(data=data,
                              algorithm='Fixed_param',
                              warmup=0,
                              chains=1,
                              iter=100)
Using cached StanModel.

Let's take a look at the ECDFs we generated.

In [7]:
df_samples = bebi103.stan.extract_array(samples_gen, name='n')

bokeh.io.show(bebi103.viz.ecdf_collection(df_samples, 
                                          val='n', 
                                          cats='chain_idx', 
                                          color='#4e79a7', 
                                          alpha=0.1,
                                          show_legend=False,
                                          val_axis_type='log'))

Most of the data sets have reasonable ECDFs, but some have very high copy numbers, close to 100,000. This is the upper limit for mammalian cells (see here). So, this seems like an acceptable model to use.

Before moving on, note the diversity in the shapes of the ECDFs. This is a feature of the Negative Binomial distribution; it can give rise to a variety of shapes in the distribution function.
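
To see this flexibility concretely, here is a small sketch (parameter values chosen arbitrarily) of Negative Binomial PMFs for a few burst frequencies at fixed burst size, using SciPy's parametrization in terms of the number of successes and the success probability $\beta/(1+\beta)$. For small $\alpha$ the PMF is peaked at zero; for larger $\alpha$ it has a peak at a positive count.

In [ ]:
import scipy.stats

n = np.arange(1001)
b = 10.0
p = (1.0 / b) / (1.0 + 1.0 / b)

for alpha in [1, 5, 20]:
    pmf = scipy.stats.nbinom.pmf(n, alpha, p)
    # Location of the peak of the PMF shifts with alpha
    print(alpha, n[np.argmax(pmf)])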

Sampling the posterior

To draw samples out of the posterior, we need to use some new Stan syntax. I will first code up the model in Stan and then explain the syntax.

In [8]:
model_code = """
data {
  int N;
  int n[N];
}


parameters {
  real<lower=0> alpha;
  real<lower=0> b;
}


transformed parameters {
  real beta_ = 1.0 / b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}
"""
  • The data block contains the counts $n$ of the mRNA transcripts. There are $N$ cells that are measured. Most data blocks look like this. There is an integer parameter that specifies the size of the data set, and then the data set is given as an array.
  • The parameters block tells us what the parameters of the posterior are. In this case, we wish to sample out of the posterior $g(\alpha, b \mid \{n\})$, where $\{n\}$ is the set of transcript counts for the gene. So, the two parameters are $\alpha$ and $b$. The sampler assumes that the variables listed in the parameters block are those you wish to sample.
  • The transformed parameters block allows you to do any transformation of the parameters you are sampling for convenience. In this case, Stan's Negative Binomial distribution is parametrized by $\beta = 1/b$, so we make the transformation from b to beta_. Notice that I have called this variable beta_ and not beta. I did this because beta is one of Stan's distributions, and you should avoid naming a variable after a word that is already in the Stan language.
  • Finally, the model block is where the model is specified. The syntax of the model block is almost identical to that of the hand-written model.

Now that we have specified our model, we can compile it.

In [10]:
sm = bebi103.stan.StanModel(model_code=model_code)
Using cached StanModel.

Now, we just need to specify the data and let Stan's sampler do the work! The data has to be passed in as a dictionary with keys corresponding to the variable names declared in the data block of the Stan program.

In [12]:
data = dict(N=len(df),
            n=df['Rest'].values.astype(int))

samples = sm.sampling(data=data)

To look at our samples, it is convenient to convert them to a data frame. For now, we will not include the diagnostic information in the data frame, but will discuss the diagnostics at length in a later tutorial.

In [13]:
df_mcmc = bebi103.stan.to_dataframe(samples, diagnostics=False, inc_warmup=False)

# Take a look
df_mcmc.head()
Out[13]:
chain chain_idx warmup alpha b beta_ lp__
0 1 1 0 4.016804 19.076744 0.052420 -1378.205886
1 1 2 0 3.804635 19.390038 0.051573 -1379.024612
2 1 3 0 3.734151 19.688832 0.050790 -1379.492057
3 1 4 0 4.895272 15.582572 0.064174 -1377.690316
4 1 5 0 5.115551 15.875496 0.062990 -1382.314135

Four of the columns in the data frame are always present and returned by Stan's sampler.

  • chain is the index (starting at 1) of the chain. By default, Stan samples using four chains.
  • chain_idx is the index of the step for the given chain.
  • warmup is 1 if the step constitutes a warmup step and 0 if it is a sampling step. By default, bebi103.stan.to_dataframe() does not include warmup steps.
  • lp__ is the unnormalized value of the log posterior.

The remaining columns are the parameters, the two of interest, alpha and b, and the transformed parameter beta_.
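
Because the samples are just rows in a data frame, we can also compute posterior summaries directly from them. A minimal sketch (the percentile choice for the credible interval is arbitrary):

In [ ]:
# Posterior means, medians, and middle 95% credible intervals
for param in ['alpha', 'b']:
    print(param,
          df_mcmc[param].mean(),
          df_mcmc[param].median(),
          np.percentile(df_mcmc[param], [2.5, 97.5]))

# Correlation between the two parameters of interest
print(df_mcmc[['alpha', 'b']].corr())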

Plots of the samples

There are many ways of looking at the samples. In this case, since we have two parameters of interest, the burst frequency and burst size, we can plot the samples as a scatter plot to get the approximate density. We could easily use Altair for this, but I will use Bokeh. I use transparency to help visualize the density of points.

In [14]:
p = bokeh.plotting.figure(width=450, height=400, 
                          x_axis_label='α (bursts per mRNA lifetime)', 
                          y_axis_label='b (transcripts per burst)')
p.circle(df_mcmc['alpha'], df_mcmc['b'], alpha=0.05)
bokeh.io.show(p)

We see very strong correlation between $\alpha$ and $b$. This does not necessarily mean that they depend on each other. Rather, it means that our degree of belief about their values depends on both in a correlated way. The measurements we made cannot effectively separate the effects of $\alpha$ and $b$ on the transcript counts.

We can also plot the marginalized distributions. Remember that the marginalized distributions properly take into account the effects of the other variable, including the strong correlation I just mentioned. To obtain the marginalized distribution of a parameter, we simply ignore the samples of the other parameters. It is convenient to look at the marginalized distributions as ECDFs.

In [15]:
plots = [bebi103.viz.ecdf(df_mcmc[param], x_axis_label=param, plot_height=200, plot_width=250) 
                 for param in ['alpha', 'b']]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Marginalizing the posterior

Alternatively, we can visualize the marginalized posterior PDFs as histograms. Remember, to get the marginal posterior of a single parameter, you simply ignore the other parameters!

In [16]:
plots = [bebi103.viz.histogram(df_mcmc[param], x_axis_label=param, 
                               plot_height=200, plot_width=250, 
                               bins=30, density=True) 
                 for param in ['alpha', 'b']]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Analysis of all genes

We can do the same analysis for all genes. To do so, we input the data sets for each gene into the sampler and make our plot of the posterior.

In [17]:
plots = []
for gene in df.columns:
    data = dict(N=len(df),
                n=df[gene].values.astype(int))

    samples = sm.sampling(data=data)
    
    df_mcmc = bebi103.stan.to_dataframe(samples)

    p = bokeh.plotting.figure(width=250, height=200, title=gene,
                              x_axis_label='α (bursts per mRNA lifetime)', 
                              y_axis_label='b (transcripts per burst)')
    p.circle(df_mcmc['alpha'], df_mcmc['b'], alpha=0.025)

    plots.append(p)
    
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Note that this single Negative Binomial model probably does not describe the Rex1 data, as can be seen from the ECDF of the measurements. Nonetheless, we can still assume the model is true and compute (i.e., sample) the posterior as if the model were true. This is always what we are doing when we perform parameter estimation. That said, we should seek a more apt model for Rex1.

Mixture models

As we saw when we explored the data, the distribution in mRNA copies for some genes is clearly bimodal. Since the negative binomial distribution is unimodal, what could be the story here? It is quite possible we are seeing two different states of cells, one with one level of bursty expression of the gene of interest, and another with a different level of bursty expression. This could mean the cells are differentiating. So, we would expect the number of mRNA transcripts to be distributed according to a linear combination of negative binomial distributions. We can write out the PMF as

\begin{align} f(n\mid \alpha_1, \alpha_2, \beta_1, \beta_2, w) &= w\,\frac{(n + \alpha_1 - 1)!}{n!\,(\alpha_1-1)!}\,\left(\frac{\beta_1}{1+\beta_1}\right)^{\alpha_1}\left(\frac{1}{1+\beta_1}\right)^{n} \\[1em] &\;\;\;\;+ (1-w) \,\frac{(n + \alpha_2 - 1)!}{n!\,(\alpha_2-1)!}\,\left(\frac{\beta_2}{1+\beta_2}\right)^{\alpha_2}\left(\frac{1}{1+\beta_2}\right)^{n} , \end{align}

where $w$ is the probability that the burst size and frequency are determined by $1/\beta_1$ and $\alpha_1$. Such a model, in which a variable is distributed as a linear combination of distributions, is called a mixture model.

For this case, we can write this likelihood more concisely as

\begin{align} n_i \sim w \, \text{NegBinom}(\alpha_1, \beta_1) + (1-w)\,\text{NegBinom}(\alpha_2, \beta_2). \end{align}

We have to specify priors on $\alpha_1$, $\beta_1$, $\alpha_2$, $\beta_2$, and $w$. We can retain the same priors for the $\alpha$'s and $\beta$'s, and we will assume a Uniform prior for $w$. We then have the following model.

\begin{align} &\alpha_i \sim \text{LogNorm}(0,2) \text{ for } i \in [1, 2] \\[1em] &b_i \sim \text{LogNorm}(2, 3) \text{ for } i \in [1, 2], \\[1em] &\beta_i = 1/b_i,\\[1em] &w \sim \text{Beta}(1, 1), \\[1em] &n_i \sim w \, \text{NegBinom}(\alpha_1, \beta_1) + (1-w)\,\text{NegBinom}(\alpha_2, \beta_2). \end{align}

Prior predictive checks

Let's code up the model for prior predictive checks. For each cell in the data set we are generating, we randomly decide whether it is a cell with parameters $(\alpha_1, b_1)$ or $(\alpha_2, b_2)$, with the probability of drawing the former being $w$. This is accomplished by drawing a random number between zero and one and then using parameters $(\alpha_1, b_1)$ if this random number is less than $w$. Doing this task also exposes the Stan syntax for if statements, which is apparent in the code below.

In [19]:
model_code_mix_gen = """
data {
  int N;
}


generated quantities {
  int n[N];

  real alpha_1 = lognormal_rng(0.0, 2.0);
  real alpha_2 = lognormal_rng(0.0, 2.0);

  real b_1 = lognormal_rng(2.0, 3.0);
  real b_2 = lognormal_rng(2.0, 3.0);
  
  real w = beta_rng(1.0, 1.0);
  
  real beta_1 = 1.0 / b_1;
  real beta_2 = 1.0 / b_2;
  
  for (i in 1:N) {
    // Randomly choose parameter set 1 or 2 with weight w
    if (uniform_rng(0.0, 1.0) < w) {
      n[i] = neg_binomial_rng(alpha_1, beta_1);
    }
    else {
      n[i] = neg_binomial_rng(alpha_2, beta_2);
    }
  }
}
"""

sm_mix_gen = bebi103.stan.StanModel(model_code=model_code_mix_gen)
Using cached StanModel.

Now, let's generate our samples and check the ECDFs.

In [20]:
data = dict(N=len(df))
samples_mix_gen = sm_mix_gen.sampling(data=data,
                                      algorithm='Fixed_param',
                                      warmup=0,
                                      chains=1,
                                      iter=100)

df_samples = bebi103.stan.extract_array(samples_mix_gen, name='n')

bokeh.io.show(bebi103.viz.ecdf_collection(df_samples, 
                                          val='n', 
                                          cats='chain_idx', 
                                          color='#4e79a7', 
                                          alpha=0.1,
                                          show_legend=False,
                                          val_axis_type='log'))

These look reasonable, with a large range of values, and plenty with bimodality. So, we will continue to code up the model for sampling.

Coding up a mixture model

There are a few considerations for coding up a mixture model that also introduce Stan syntax. Importantly, under the hood, Stan uses the log posterior, as do almost all samplers, when sampling out of the posterior. Dealing with a mixture model presents a unique challenge for computing the log likelihood (which is one of the summands of the log posterior). Consider the log likelihood in the present example.

\begin{align} \ln f(n\mid \alpha_1, \alpha_2, \beta_1, \beta_2, w) = \ln(w\,a_1 + (1-w)a_2), \end{align}

where

\begin{align} a_i = \frac{(n + \alpha_i - 1)!}{n!\,(\alpha_i-1)!}\,\left(\frac{\beta_i}{1+\beta_i}\right)^{\alpha_i}\left(\frac{1}{1+\beta_i}\right)^{n}. \end{align}

While the logarithm of a product is conveniently split, we cannot split the logarithm of a sum. If we compute the sum directly, we will get serious underflow errors for parameters for which the terms $a_1$ or $a_2$ are small. To compute this in a more numerically stable way, we need to use the log-sum-exp trick. Fortunately, Stan has a built-in function to compute the contributions to the log posterior of a mixture, the log_mix function. To update the posterior with this log_mix function, we need to add to target. In Stan, the keyword target is a special variable that holds the running sum of the contributions to the log posterior. When you make statements like alpha ~ lognormal(0.0, 2.0), Stan is adding the appropriate terms to target under the hood. In the case of mixture models, we need to add to target explicitly. More generally, you can add any terms to target, and Stan considers these terms as part of the log posterior.
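
As a quick aside before the Stan program, here is a small Python sketch (toy numbers, not from the model) of the numerical issue and how the log-sum-exp trick avoids it; this is essentially what log_mix computes from the component log likelihoods.

In [ ]:
import scipy.special

# Toy log likelihoods of the two mixture components and a toy weight
w = 0.3
log_a1 = -900.0
log_a2 = -905.0

# Naive computation underflows: exp(-900) is 0 in double precision,
# so this gives -inf (and a warning)
naive = np.log(w * np.exp(log_a1) + (1 - w) * np.exp(log_a2))

# log-sum-exp computes log(w * a1 + (1-w) * a2) stably
stable = scipy.special.logsumexp([log_a1, log_a2], b=[w, 1 - w])

print(naive, stable)

With that in mind, the Stan program below implements the mixture model.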

In [21]:
model_code_mix = """
data {
  int N;
  int n[N];
}


parameters {
  vector<lower=0>[2] alpha;
  vector<lower=0>[2] b;
  real<lower=0, upper=1> w;
}


transformed parameters {
  vector[2] beta_ = 1.0 ./ b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);
  w ~ beta(1.0, 1.0);

  // Likelihood
  for (i in 1:N) {
    target += log_mix(w,
                      neg_binomial_lpmf(n[i] | alpha[1], beta_[1]),
                      neg_binomial_lpmf(n[i] | alpha[2], beta_[2]));
  }
}
"""

In addition to the log_mix function, there is some more new syntax.

  • To add the contribution of the Negative Binomial PMF to the log posterior, we use neg_binomial_lpmf. Every discrete Stan distribution has a function <distribution>_lpmf that gives the value of the log PMF, and every continuous distribution has a similar function <distribution>_lpdf. The arguments of the function are as you would write them on paper, with a bar (|) signifying the conditioning on the parameters.
  • In the above, I have specified alpha, b, and beta_ as vectors. A vector is an array that behaves like a column vector, which means you can do matrix operations with it. It also allows you to do element-wise operations. Notice that I have written
    vector[2] beta_ = 1.0 ./ b;
    This means that beta_ is a 2-vector and that each element is given by the inverse of the corresponding element in b. The ./ operator accomplishes this (note the dot in front of the slash). In general, preceding an operator with a . indicates that the operation should be done elementwise.
  • Note also that Stan is smart enough to know that if I give alpha a prior, it needs to assign that prior independently to each element of alpha. The same is of course true for b.

Now that we have our model set up, let's compile and sample from it! I will use the seed kwarg to set the seed for the random number generator to ensure that I always get the same result for illustrative purposes.

In [23]:
# Compile
sm_mix = bebi103.stan.StanModel(model_code=model_code_mix)

# Sample!
data = dict(N=len(df),
            n=df['Rex1'].values.astype(int))
samples = sm_mix.sampling(data=data, seed=523921)
Using cached StanModel.

Let's take a look at our samples, first by looking at the data frame.

In [24]:
df_mcmc = bebi103.stan.to_dataframe(samples, diagnostics=False)

df_mcmc.head()
Out[24]:
chain chain_idx warmup alpha[1] alpha[2] b[1] b[2] w beta_[1] beta_[2] lp__
0 1 1 0 2.444427 5.526869 8.006713 29.224196 0.197404 0.124895 0.034218 -1594.558236
1 1 2 0 5.323232 4.474884 2.681878 36.056928 0.130638 0.372873 0.027734 -1595.060263
2 1 3 0 3.817855 4.425974 3.747441 37.565225 0.130080 0.266849 0.026620 -1595.119786
3 1 4 0 2.890999 5.174937 5.278922 29.838419 0.204832 0.189433 0.033514 -1595.938595
4 1 5 0 2.881455 5.169793 5.166839 29.684050 0.204584 0.193542 0.033688 -1596.358520

We see that Stan names the variables of vectors by indexing with square brackets. Note that this will mess up how Altair reads column names, so you have to change the column names if you want to plot with Altair.
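
If you do want to use Altair, one way to do the renaming is with pandas (a sketch; the new column names are arbitrary).

In [ ]:
df_altair = df_mcmc.rename(columns={'alpha[1]': 'alpha_1',
                                    'alpha[2]': 'alpha_2',
                                    'b[1]': 'b_1',
                                    'b[2]': 'b_2'})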

A nonidentifiable model

Let's take a look at a plot of the posterior. We have five parameters of interest, $\alpha_1$, $\alpha_2$, $b_1$, $b_2$, and $w$. It would be easiest to generate a facet plot using Altair, but I will do it in Bokeh because I don't feel like changing the column headings and it's not so bad in Bokeh.

In [25]:
params = ['alpha[1]', 'b[1]', 'alpha[2]', 'b[2]', 'w']
plots = []
for param_1, param_2 in itertools.product(params, params):
    p = bokeh.plotting.figure(width=150, height=140,
                              x_axis_label=param_1, 
                              y_axis_label=param_2)
    p.circle(df_mcmc[param_1], df_mcmc[param_2], alpha=0.01)
    plots.append(p)
    
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=len(params)))

This looks peculiar. We see a strong bimodality in the posterior. The cause of this can be revealed if we color the glyphs by the chain ID. We only need to show one plot, so I will show $\alpha_2$ versus $w$.

In [26]:
# Set up colors
df_mcmc['chain'] = df_mcmc['chain'].astype(str)
color = bokeh.transform.factor_cmap('chain', 
                                    palette=['#4e79a7', '#f28e2b', '#e15759', '#76b7b2'], 
                                    factors=df_mcmc['chain'].unique())

p = bokeh.plotting.figure(width=350, height=250,
                          x_axis_label='w',
                          y_axis_label='α₂')
p.circle(source=df_mcmc, x='w', y='alpha[2]', color=color, alpha=0.1)

bokeh.io.show(p)

We see that three of the chains (colored orange, teal, and red) are centered around w = 0.8, while the other (colored blue) is around w = 0.2. Note that these two values of w sum to unity. We have just uncovered a nonidentifiable model. A nonidentifiable model is a model for which we cannot unambiguously determine the parameter values. That is, two or more parameter sets are observationally equivalent.

Label switching

There are many reasons why a model may be nonidentifiable. In this case, we are seeing a manifestation of label switching (see section 25.2 of the Stan Manual). Before launching into label switching in this particular case, I emphasize that this is certainly not the only way models can be nonidentifiable, and I am presenting this as a case study on how to deal with a particular kind of nonidentifiability. If you are seeing nonidentifiability in your model, do not automatically assume that it is because of label switching. I also am going through this case study to demonstrate that even in fairly simple models (in this case, two cell populations, each characterized by their burst size and frequency), devilish problems can arise when trying to do inference. You need to be vigilant.

In this mixture model, it is arbitrary which $(\alpha, b)$ pair we label as $(\alpha_1, b_1)$ or $(\alpha_2, b_2)$. We can switch the labels, and also change $w$ to $1-w$, and we have exactly the same posterior probability. To demonstrate that this is the case, I will generate the same grid of plots as above, but switch the appropriate labels and convert $w$ to $1-w$ for every sample with $w > 0.5$. (Note that this will not in general work, especially if the different modes from label switching overlap, and is not a good idea for analysis; I'm just doing it here to illustrate how label switching leads to nonidentifiability.)

In [27]:
# Perform the label switch
switch = df_mcmc.loc[df_mcmc['w']>0.5, params]
switch = switch.rename(columns={'b[1]': 'b[2]',
                                'b[2]': 'b[1]',
                                'alpha[2]': 'alpha[1]',
                                'alpha[1]': 'alpha[2]'})
switch['w'] = 1 - switch['w']

df_switch = pd.concat([df_mcmc.loc[df_mcmc['w']<0.5, params], switch], 
                      sort=True)

# Make the plots
plots = []
for param_1, param_2 in itertools.product(params, params):
    p = bokeh.plotting.figure(width=150, height=140,
                              x_axis_label=param_1, 
                              y_axis_label=param_2)
    p.circle(df_switch[param_1], df_switch[param_2], alpha=0.01)
    p.xaxis.major_label_orientation = 'vertical'
    plots.append(p)
    
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=len(params)))

We see that if we fix the label switching, the posterior is indeed unimodal. So, making an identifiable model in this case means that we have to deal with the label switching problem. There are many approaches to doing this, and you can see a very detailed discussion about dealing with label switching and other problems associated with mixture models here.

We will take a strategy suggested in the Stan manual. We noted earlier that the chains tended to stay on a single mode. We will initialize the chains to instead start near only one mode. This, like any of the fixes for mixture models, does not guarantee that we get good sampling (and we will discuss more diagnostics in a future tutorial), but in practice it can work quite well, as it will here.

To determine where to start the chains, we will select a chain and start the samplers at the mean. First, we need to compute the parameter means for the first chain.

In [28]:
# Previously made chain ID a string for plotting; convert back to int
df_mcmc['chain'] = df_mcmc['chain'].astype(int)

# Compute mean of parameters for chain 1
params = ['alpha[1]', 'alpha[2]', 'b[1]', 'b[2]', 'w']
param_means = df_mcmc.loc[df_mcmc['chain']==1, params].mean()

# Take a look
param_means
Out[28]:
alpha[1]     3.591666
alpha[2]     5.188468
b[1]         5.574927
b[2]        31.940156
w            0.167107
dtype: float64

Now that we have the means for chain 1, we can use them to pass into Stan's sampler. An easy way to do that is to write a function that PyStan uses to provide the sampler with starting points. The function should return a dictionary of the starting values. Note that vector-valued parameters (such as alpha and b) need to be specified as vectors, which means you can use a Python list.

In [29]:
def init_fun():
    """Initialization function for sample at mean of one mode."""
    return {'alpha': [param_means['alpha[1]'], param_means['alpha[2]']],
            'b': [param_means['b[1]'], param_means['b[2]']],
            'w': param_means['w']}

# Get the samples
samples = sm_mix.sampling(data=data, init=init_fun)

Let's take a quick look at a plot of the posterior.

In [30]:
df_mcmc = bebi103.stan.to_dataframe(samples)

params = ['alpha[1]', 'b[1]', 'alpha[2]', 'b[2]', 'w']
plots = []
for param_1, param_2 in itertools.product(params, params):
    p = bokeh.plotting.figure(width=150, height=140,
                              x_axis_label=param_1, 
                              y_axis_label=param_2)
    p.circle(df_mcmc[param_1], df_mcmc[param_2], alpha=0.01)
    plots.append(p)
    
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=len(params)))

This looks good! Everything is nicely behaved and we seem to have fixed the identifiability problem.

Another example: nonlinear regression

We performed nonlinear regression using optimization methods in a previous tutorial using the data set from Good, et al., Science, 342, 856-860, 2013. Here, we will perform the same analysis using MCMC. We start by loading in the data set.

In [31]:
# Load in Data Frame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')

As a reminder, our statistical model for the conserved tubulin model is

\begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\gamma \sim \text{Beta}(2, 2), \\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\sigma = \sigma_0\,\phi,\\[1em] &\mu = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu, \sigma) \;\forall i. \end{align}
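
For reference, here is the theoretical spindle length written as a plain Python function (a sketch mirroring the ell_theor function in the Stan program below).

In [ ]:
def spindle_length(d, phi, gamma):
    """Theoretical spindle length (µm) for droplet diameter d (µm)."""
    return gamma * d / (1 + (gamma * d / phi)**3)**(1 / 3)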

We already did prior predictive checks, so let's proceed to code up this model in Stan and compile it.

In [33]:
model_code = """
functions {
  real ell_theor(real d, real phi, real gamma) {
    real denom_ratio = (gamma * d / phi)^3;
    return gamma * d / (1 + denom_ratio)^(1.0 / 3.0); 
  }
}


data {
  int N;
  real d[N];
  real ell[N];
}


parameters {
  // Constrain parameters to the support of their priors
  real<lower=0> phi;
  real<lower=0, upper=1> gamma;
  real<lower=0> sigma_0;
}


transformed parameters {
  real sigma = sigma_0 * phi;
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  gamma ~ beta(2.0, 2.0);
  sigma_0 ~ gamma(2.0, 10.0);
  
  for (i in 1:N) {
    ell[i] ~ normal(ell_theor(d[i], phi, gamma), sigma);
  }
}
"""

sm = bebi103.stan.StanModel(model_code=model_code)
Using cached StanModel.

All right! Let's do some sampling!

In [34]:
data = dict(N=len(df),
            d=df['Droplet Diameter (um)'].values,
            ell=df['Spindle Length (um)'].values)

samples = sm.sampling(data=data)

With samples in hand, let's make a plot of the posterior samples. Remember that in Tutorial 6a we brute-force plotted the posterior after numerically marginalizing out $\sigma$ using the trapezoidal rule. Here, we can just plot our samples of $\phi$ and $\gamma$ and get the same result.

In [35]:
df_mcmc = bebi103.stan.to_dataframe(samples)

p = bokeh.plotting.figure(width=350, height=250,
                          x_axis_label='φ (µm)',
                          y_axis_label='γ')
p.circle(source=df_mcmc, x='phi', y='gamma', alpha=0.05)

bokeh.io.show(p)

We can further marginalize the distribution to get the CDFs of the respective parameters.

In [36]:
p1 = bebi103.viz.ecdf(df_mcmc['phi'], x_axis_label='φ (µm)', 
                      plot_height=250, plot_width=300)
p2 = bebi103.viz.ecdf(df_mcmc['gamma'], x_axis_label='γ',
                      plot_height=250, plot_width=300)

bokeh.io.show(bokeh.layouts.gridplot([p1, p2], ncols=2))

Conclusions

You have learned how to use Stan to use MCMC to sample out of a posterior distribution. I hope it is evident how convenient and powerful this is. I also hope you have an understanding of how fragile statistical modeling can be, as you saw with the nonidentifiability caused by label switching.

We have looked at some visualizations of MCMC results in this tutorial, and in the next one, we will take a closer look at how to visualize and report MCMC results.

Computing environment

In [31]:
%load_ext watermark
In [32]:
%watermark -v -p numpy,pandas,pystan,bokeh,bebi103,jupyterlab
CPython 3.7.0
IPython 7.0.1

numpy 1.15.3
pandas 0.23.4
pystan 2.17.1.0
bokeh 1.0.0
bebi103 0.0.32
jupyterlab 0.35.2