Tutorial 9a: Posterior predictive checks and principled model building

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import pandas as pd

import bebi103

import bokeh.io
import bokeh.layouts
bokeh.io.output_notebook()
Loading BokehJS ...

Our process for building models has been as follows.

  1. Propose a story for how the data are generated. This could involve deriving some theoretical equations, as was the case for the conserved tubulin model for microtubule spindle lengths.
  2. Match the story to a likelihood.
  3. For the parameters present in the likelihood, propose priors.
  4. Perform prior predictive checks to make sure that the realm of possible data sets is appropriately covered by the model. Adjust the model as needed to ensure the prior predictive checks pass.

After building the model, we can then sample out of the posterior (or summarize the posterior in other ways, such as finding the MAP) to get an understanding of the parameter values.

In this tutorial, we will look at how we can assess how well our model performs after we have the data. We will then discuss how to come up with other models if we are not satisfied with the performance of this one.

The data set

We will again use the data from Singer and coworkers, where they performed RNA FISH to determine the copy numbers of RNA transcripts of specific genes in individual cells in their samples. As a reminder, here are ECDFs of the data sets for the respective genes. (The data set can be downloaded here.)

In [2]:
# Load DataFrame
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')

# Make a list of plots
plots = [bebi103.viz.ecdf(df[gene], 
                          plot_height=125, 
                          title=gene)
         for gene in df]
plots[-1].xaxis.axis_label = 'mRNA count'

# Show them as a grid
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=1))

Posterior predictive checks

In tutorial 7a we devised a model for the transcript counts of a given gene based on observations about bursty gene expression. We chose a Negative Binomial likelihood and then assigned appropriate priors.

\begin{align} &\alpha \sim \text{LogNorm}(0,2), \\[1em] &b \sim \text{LogNorm}(2, 3), \\[1em] &\beta = 1/b,\\[1em] &n_i \sim \text{NegBinom}(\alpha, \beta) \;\forall i. \end{align}

We used this model to perform parameter estimates for the burst size $b$ and burst frequency $\alpha$ for the Rest gene. How can we assess whether the model is actually a reasonable model for the data? This question is quite similar to one we asked earlier in the course: How can we tell if our model can generate all of the possible data sets we could imagine? To address that question, we performed prior predictive checks. To ask how well the model describes the data we have measured, we use a similar procedure of posterior predictive checks.

We perform posterior predictive checks in much the same way we perform prior predictive checks. To draw a data set that would be generated by our model after we have seen the data, we draw a set of parameter values out of the posterior (as opposed to out of the prior, as we do for prior predictive checks) and use those parameter values in the likelihood to draw a data set. Fortunately, we already have the draws of parameter values out of the posterior. That is what sampling with MCMC provides!

Once we have our draws of data sets out of the posterior predictive distribution, we can make plots of these and compare them to the measured data set.
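To make the procedure concrete, below is a minimal sketch of how we could make posterior predictive draws by hand with NumPy. The arrays alpha_samples and b_samples are hypothetical; they stand in for the posterior draws of $\alpha$ and $b$ we will obtain from MCMC below. In practice, we will let Stan do this for us.

# A sketch of posterior predictive draws done by hand in NumPy. The arrays
# alpha_samples and b_samples are hypothetical stand-ins for posterior draws.
def draw_ppc_data_sets(alpha_samples, b_samples, N, n_data_sets=100):
    """Draw `n_data_sets` simulated data sets of N transcript counts."""
    # Use a random subset of posterior draws, one simulated data set per draw
    inds = np.random.choice(len(alpha_samples), size=n_data_sets, replace=False)

    # Stan's NegBinom(alpha, beta) with beta = 1/b corresponds to NumPy's
    # negative_binomial with success probability p = beta/(1+beta) = 1/(1+b)
    return np.array(
        [np.random.negative_binomial(alpha_samples[i],
                                     1 / (1 + b_samples[i]),
                                     size=N)
         for i in inds])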

We are now tasked with adding draws from the posterior predictive distribution to our Stan code. We do this in the generated quantities block in much the same way we did for prior predictive checks. Remember that the generated quantities block is computed for each sample out of the posterior. So, we use the posterior values of the parameters in the likelihood to get the posterior predictive sample.

In [3]:
model_code = """
data {
  int N;
  int n[N];
}


parameters {
  real<lower=0> alpha;
  real<lower=0> b;
}


transformed parameters {
  real beta_ = 1.0 / b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}


generated quantities {
  int n_ppc[N];

  // Draw posterior predictive data set
  for (i in 1:N) {
    n_ppc[i] = neg_binomial_rng(alpha, beta_);
  }
}
"""

sm = bebi103.stan.StanModel(model_code=model_code)
Using cached StanModel.

Now, when we draw our samples using Stan, we also get our posterior predictive check samples.

In [4]:
data = dict(N=len(df),
            n=df['Rest'].values.astype(int))

samples = sm.sampling(data=data)

We should always do diagnostic checks whenever we make draws.

In [5]:
bebi103.stan.check_all_diagnostics(samples)
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
Out[5]:
0

The checks all look good. In addition to the diagnostic checks, we should, at the very least, also plot the samples, so let's make a corner plot.

In [6]:
bokeh.io.show(bebi103.viz.corner(samples, pars=['alpha', 'b']))

There are also no obvious issues with the corner plot, so the sampling seems to be in good order.

Now we can analyze our posterior predictive draws. To understand how they are stored, let's convert the samples to a data frame and take a look.

In [7]:
df_mcmc = bebi103.stan.to_dataframe(samples)

df_mcmc.head()
Out[7]:
chain chain_idx warmup divergent__ energy__ treedepth__ accept_stat__ stepsize__ n_leapfrog__ alpha ... n_ppc[271] n_ppc[272] n_ppc[273] n_ppc[274] n_ppc[275] n_ppc[276] n_ppc[277] n_ppc[278] n_ppc[279] lp__
0 1 1 0 0.0 1378.937176 4 0.894455 0.201318 19 4.184201 ... 53.0 103.0 34.0 89.0 25.0 80.0 84.0 82.0 114.0 -1377.955644
1 1 2 0 0.0 1380.011083 3 0.994952 0.201318 15 4.873648 ... 38.0 70.0 26.0 61.0 38.0 63.0 90.0 70.0 137.0 -1377.441902
2 1 3 0 0.0 1377.780808 2 0.979397 0.201318 5 4.943121 ... 107.0 45.0 72.0 61.0 41.0 92.0 103.0 72.0 68.0 -1377.614487
3 1 4 0 0.0 1377.709113 2 0.987062 0.201318 3 4.927576 ... 84.0 181.0 50.0 48.0 64.0 156.0 83.0 56.0 22.0 -1377.692898
4 1 5 0 0.0 1378.206537 3 0.962624 0.201318 11 4.804724 ... 114.0 35.0 28.0 111.0 62.0 85.0 134.0 85.0 84.0 -1377.572372

5 rows × 292 columns

Each entry of the posterior predictive data set is its own column in the data frame. We can extract them using the bebi103.stan.extract_array() function.

In [8]:
df_n_ppc = bebi103.stan.extract_array(samples, 'n_ppc')

df_n_ppc.head()
Out[8]:
index_1 n_ppc chain chain_idx warmup
0 1 57.0 1 1 0
1 1 108.0 1 2 0
2 1 50.0 1 3 0
3 1 37.0 1 4 0
4 1 66.0 1 5 0

The column index_1 gives the index of the entry in the array of outputs, and chain_idx gives the index of the MCMC draw from which it was acquired. To see what kinds of data sets we get, we can plot all of the ECDFs, as we did for the prior predictive checks. For comparison, it helps to overlay the ECDF of the measured data. We have a lot of posterior predictive data sets, so we will plot every 40th sample (thereby plotting 100 posterior predictive ECDFs).

In [9]:
# Plot measured data set
p = bebi103.viz.ecdf(df['Rest'].values,
                     x_axis_label='n',
                     color='orange',
                     level='overlay')

# Plot posterior predictive ECDFs
for i in df_n_ppc['chain_idx'].unique()[::40]:
    p = bebi103.viz.ecdf(df_n_ppc.loc[df_n_ppc['chain_idx']==i, 'n_ppc'],
                         alpha=0.1, p=p)

bokeh.io.show(p)

Based on this plot, the posterior predictive distribution seems to encompass what was actually measured. We conclude, at least, that the observed data are captured by the model.

A better plot (which also does not choke your browser with so many glyphs) is to plot percentiles of the ECDFs from the posterior predictive distribution. This functionality is included in the bebi103.viz.predictive_ecdf() function.

In [10]:
bokeh.io.show(bebi103.viz.predictive_ecdf(samples, name='n_ppc', data=df['Rest']))

In this plot, the middle dark blue line is the median, and the shades of blue expand to encompass the middle 20%, 40%, 60%, and 80% of the posterior predictive ECDFs. This is hard to see in the above plot without zooming. A better use of space is to plot the difference between the measured ECDF and the median of the posterior predictive ECDFs. This is accomplished using the diff=True kwarg.

In [11]:
bokeh.io.show(bebi103.viz.predictive_ecdf(samples, name='n_ppc', data_line=False, 
                                          data=df['Rest'], diff=True))

It is much clearer now. We have a few points outside of the middle 80% envelope, which is expected. We can be satisfied, then, that the model is capturing the real data.

Principled model building

Let us now look at the Rex1 gene using the same analysis.

In [12]:
data = dict(N=len(df),
            n=df['Rex1'].values.astype(int))

# Draw samples
samples = sm.sampling(data=data)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples)

# Make a corner plot
bokeh.io.show(bebi103.viz.corner(samples, pars=['alpha', 'b']))
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

The diagnostics and the corner plot check out. Let's make the predictive ECDF plot.

In [13]:
bokeh.io.show(bebi103.viz.predictive_ecdf(samples, name='n_ppc', data_line=False, 
                                          data=df['Rex1'], diff=True))

Yikes! The model fails pretty spectacularly to capture the measured data. We suspected this might be the case just looking at the ECDFs from the measured data. Now that we know the model does not describe the data well, we need to come up with another model. This is a good time to think about principled model building.

Steps for principled model building

  1. Start with a simple model, one that captures the simplest story for data generation.
  2. Perform prior predictive checks to make sure the model can generate the kinds of data sets it should.
  3. Perform parameter estimates. Make sure the sampler performs properly by checking diagnostics.
  4. Perform posterior predictive checks.
  5. If the model is ok, great! You can stop here. If not, proceed.
  6. Build upon the simple model. When adding complexity to a model, it is important that the simple model is a special or limiting case of the more complex model. This helps evaluate how important the added complexity is and provides a clear connection among the models you propose.
  7. Go to 2.

This recipe is almost complete. You should also do some checks to make sure the model is identifiable, that the sampler can handle it, and also that the data can inform parameters. We will discuss techniques to do these things in tutorial 10.

If you want to read a more detailed treatment about principled modeling, check out this blog post from Michael Betancourt.

So, let's proceed to build upon the model. We will propose a mixture model in which we have two different cell populations, each with a different burst size and burst frequency.

\begin{align} &\alpha_i \sim \text{LogNorm}(0,2) \text{ for } i \in [1, 2] \\[1em] &b_i \sim \text{LogNorm}(2, 3) \text{ for } i \in [1, 2], \\[1em] &\beta_i = 1/b_i,\\[1em] &w \sim \text{Beta}(1, 1), \\[1em] &n_i \sim w \, \text{NegBinom}(\alpha_1, \beta_1) + (1-w)\,\text{NegBinom}(\alpha_2, \beta_2). \end{align}

Note that this model reduces to the original model when $\alpha_1 = \alpha_2$ and $\beta_1 = \beta_2$. In that case, $w$ can be anything; it is nonidentifiable. Nonetheless, a model with this added complexity does reduce to the simpler model for a well-defined special case.
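To make the generative story concrete, here is a small sketch of how a single data set could be simulated from this mixture with NumPy. The parameter values are made up for illustration; the Stan model below does the equivalent in its generated quantities block.

# Sketch of generating one data set from the two-component mixture
# (made-up parameter values for illustration only)
alpha_1, b_1 = 3.0, 5.0     # burst frequency and burst size, population 1
alpha_2, b_2 = 5.0, 30.0    # burst frequency and burst size, population 2
w = 0.2                     # probability a cell belongs to population 1
N = len(df)

# For each cell, choose a population, then draw a Negative Binomial count
from_pop_1 = np.random.uniform(size=N) < w
n_sim = np.where(from_pop_1,
                 np.random.negative_binomial(alpha_1, 1 / (1 + b_1), size=N),
                 np.random.negative_binomial(alpha_2, 1 / (1 + b_2), size=N))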

We'll code up the model in Stan, again putting in posterior predictive checks.

In [14]:
model_code_mix = """
data {
  int N;
  int n[N];
}


parameters {
  vector<lower=0>[2] alpha;
  vector<lower=0>[2] b;
  real<lower=0, upper=1> w;
}


transformed parameters {
  vector[2] beta_ = 1.0 ./ b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);
  w ~ beta(1.0, 1.0);

  // Likelihood
  for (i in 1:N) {
    target += log_mix(w,
                      neg_binomial_lpmf(n[i] | alpha[1], beta_[1]),
                      neg_binomial_lpmf(n[i] | alpha[2], beta_[2]));
  }
}


generated quantities {
  int n_ppc[N];
  
  // Posterior predictive checks
  for (i in 1:N) {
    if (uniform_rng(0.0, 1.0) < w) {
      n_ppc[i] = neg_binomial_rng(alpha[1], beta_[1]);
    }
    else {
      n_ppc[i] = neg_binomial_rng(alpha[2], beta_[2]);
    }
  }
}
"""

sm_mix = bebi103.stan.StanModel(model_code=model_code_mix)
Using cached StanModel.

Recall that there is some difficulty sampling out of this mixture model because of nonidentifiability related to label switching. Our strategy is to sample, and then to sample again, starting all of the chains at the mean values from the first sampling. We'll write a function to do this.

In [15]:
def sample_mix(data, **kwargs):
    """Sample a mixture model."""
    samples = sm_mix.sampling(data=data, chains=1, **kwargs)
    df_mcmc = bebi103.stan.to_dataframe(samples)
    params = ['alpha[1]', 'alpha[2]', 'b[1]', 'b[2]', 'w']
    param_means = df_mcmc.loc[df_mcmc['chain']==1, params].mean()
    
    def init_fun():
        """Initialization function for sample at mean of one mode."""
        return {'alpha': [param_means['alpha[1]'], param_means['alpha[2]']],
                'b': [param_means['b[1]'], param_means['b[2]']],
                'w': param_means['w']}

    # Get the samples
    return sm_mix.sampling(data=data, init=init_fun, **kwargs)

Now we'll use the function to get our samples. We'll check diagnostics and make a corner plot.

In [16]:
samples_mix = sample_mix(data)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)

# Make a corner plot
bokeh.io.show(bebi103.viz.corner(samples_mix, 
                                 pars=['alpha[1]', 'b[1]', 'alpha[2]', 'b[2]', 'w']))
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

No problems with the diagnostics, and the corner plot looks good. Now let's make a posterior predictive ECDF plot.

In [17]:
bokeh.io.show(bebi103.viz.predictive_ecdf(samples_mix, name='n_ppc', data_line=False, 
                                          data=df['Rex1'], diff=True))

The data set is well covered by the posterior predictive distribution. Note that the spread in the posterior predictive distribution goes well beyond the observed data. This suggests that the model has more than enough flexibility to cover the observed data. There may be a simpler model that describes the generative process, but it is important that any model we use is firmly grounded in a theory of the generative process. A mixture of two negative binomials is such a model, so these results are good.

Using a mixture model with the Rest data set

Finally, let's see what happens when we apply the mixture model to the Rest counts, which the single Negative Binomial model already described well.

In [18]:
data['n'] = df['Rest'].values.astype(int)
samples_mix = sample_mix(data, seed=4092374)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)
n_eff / iter looks reasonable for all parameters.
Rhat for parameter alpha[1] is 1.105910067994271.
Rhat for parameter w is 1.1904664474294557.
  Rhat above 1.1 indicates that the chains very likely have not mixed
8.0 of 4000 (0.2%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
Out[18]:
6

We got a few divergences, and the R-hat values for $\alpha_1$ and $w$ indicate that the chains have not yet mixed. So, let's run it again with a longer warmup and a little thinning to help the chains converge, and with a larger adapt_delta to mitigate the divergences.

In [19]:
samples_mix = sample_mix(data, warmup=2000, iter=4000, thin=2, 
                         control=dict(adapt_delta=0.99), seed=4092374)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
1 of 4000 (0.025%) iterations saturated the maximum tree depth of 10.
  Try running again with max_treedepth set to a larger value to avoid saturation.
E-BFMI indicated no pathological behavior.
Out[19]:
8

That looks better for the divergences, but the R-hat for $w$ still looks problematic. Let's take a look at the corner plot.

In [20]:
# Make a corner plot
bokeh.io.show(bebi103.viz.corner(samples_mix, 
                                 pars=['alpha[1]', 'b[1]', 'alpha[2]', 'b[2]', 'w']))

In looking at this corner plot, you will need to zoom to see the shapes of the distributions.

We see $w$ tending strongly toward zero or one, and essentially uniform in between. Bimodality in a parameter, even if it is a real and correct feature of a model, will throw off the R-hat calculation. So, these samples look legitimate.

The distribution of the values of $w$ suggests that the two distributions in the mixture are difficult to identify. A value of $w$ close to one or zero suggests that a single distribution dominates, and a uniform $w$ suggests that the two distributions have almost the same parameters, so $w$ is indifferent. In fact, if we look at the plots of $\alpha_1$ and $\alpha_2$, we see that their values are very close to one another. The same is true for $b_1$ and $b_2$. This suggests that there is really a single negative binomial, and that the second negative binomial is difficult to identify.
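As a quick quantitative check (a sketch using the posterior samples as a data frame), we can compare the draws of the two components' parameters directly.

# Compare the posterior draws of the two mixture components' parameters
df_mix = bebi103.stan.to_dataframe(samples_mix)

for p1, p2 in [('alpha[1]', 'alpha[2]'), ('b[1]', 'b[2]')]:
    diff = df_mix[p1] - df_mix[p2]
    print('median of', p1, '-', p2, ':', diff.median())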

This underscores the importance of careful graphical analysis of MCMC samples. It also shows the utility of constructing the more complex model (here, a mixture model) so that it reduces to the simpler model in certain limits. The mixture weight $w$ either tended toward zero or one, indicating a strong preference for a single component of the mixture, or was uniform in between, indicating indifference to the weighting because the parameter values of the two components are nearly identical (as we saw in the plots of the samples of $\alpha_1$ and $\alpha_2$ and of $b_1$ and $b_2$).

So, it is clear from our graphical analysis that the Rest gene's transcript counts are best described by a single Negative Binomial distribution.

Conclusions

We have seen the utility of graphical posterior predictive checks in assessing whether a model can appropriately describe the acquired data set. Importantly, as we build models, we should start simple and then add complexity, each time making the simpler model a limiting or special case of the more complex model.

Sometimes, it is not so easy to graphically compare models and we need quantitative comparisons. We will discuss methods to do this in the next tutorial.

Computing environment

In [21]:
%load_ext watermark
In [22]:
%watermark -v -p numpy,pandas,pystan,bokeh,bebi103,jupyterlab
CPython 3.7.0
IPython 7.1.1

numpy 1.15.4
pandas 0.23.4
pystan 2.17.1.0
bokeh 1.0.1
bebi103 0.0.37
jupyterlab 0.35.3