Tutorial 10b: Calibrating your models

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import pandas as pd

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

In this tutorial, we will first describe simulation-based calibration (SBC) and then demonstrate it on the mRNA FISH data set from Singer et al. You should refer to the original paper on SBC by Talts and coworkers. You might also want to read Mike Betancourt's writing on principled Bayesian workflows. Much of the material in this tutorial comes from those two sources.

Note on terminology: The term "simulation-based calibration" is meant to describe the self-consistency check using rank statistics described below. We are using the term here as an umbrella term covering both SBC and sensitivity analysis in computing shrinkages and z-scores, also described below. The computational effort for all of these is the same, and they all use the same samples, so I hope this abuse of terminology is forgivable.

Simulation-based calibration

We have performed prior predictive checks on our generative models to ensure that the models produce reasonable data sets. Accepting the model after the prior predictive check, we sample out of the posterior using Markov chain Monte Carlo. We check the samples by performing diagnostics, and then check the ability of the model to generate the actual data set we obtained by doing posterior predictive checks. Importantly, when we look at the posterior, we can compare it to the prior to make sure that the data actually were informative. If the posterior looks too much like the prior, we really have not learned anything. The process is quite powerful, but it would be nice to know if our model makes sense and if we will learn from making measurements before we perform the experiments. Specifically, we would like to know, ante experimentum,

  • Can our model and sampling procedure capture the real parameter values of the generative process?
  • Can measured data inform the parameters?
  • Can our sampling technique handle the model and any conceivable data set we can throw at it?

The approach of simulation-based calibration addresses these questions. The procedure is as follows.

  1. Draw a parameter set $\tilde{\theta}$ out of the prior.
  2. Use $\tilde{\theta}$ to draw a data set $\tilde{y}$ out of the likelihood.
  3. Perform MCMC sampling of the posterior using $\tilde{y}$ as if it were the actual measured data set. Draw $L$ MCMC samples of the parameters.
  4. Do steps 1-3 many times (hundreds to a thousand is usually a good number).

In step 3, we are using a data set for which we know the underlying parameters that generated it. Because the data were generated using $\tilde{\theta}$ as the parameter set, $\tilde{\theta}$ is now the ground truth parameter set. So, we can check whether we uncover the ground truth in the posterior sampling. We can also check diagnostics for each of the trials to make sure the sampler had no problems. Furthermore, we can see if the posterior is narrower than the prior, meaning that the data are informing the model. Finally, we can do a self-consistency check, described below, to ensure the sampler is working properly, which is the main idea in the SBC paper by Talts et al.
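
To make the procedure concrete, here is a minimal sketch of the SBC loop for a toy conjugate model (a Normal prior on a location parameter with a Normal likelihood of known scale), not the mRNA model analyzed below. Conjugacy lets us do step 3 by exact sampling; in practice, step 3 is an MCMC calculation, which the bebi103.stan.sbc() function used later in this tutorial handles for you.

import numpy as np

np.random.seed(3252)

# Toy model: theta ~ Normal(0, tau);  y_i ~ Normal(theta, sigma), sigma known.
# Because the model is conjugate, the posterior is Normal and step 3 needs no MCMC.
tau = 3.0        # prior standard deviation
sigma = 1.0      # known measurement standard deviation
N_data = 10      # number of measurements in each simulated data set
L = 400          # number of posterior draws per trial
n_trials = 1000  # number of SBC trials

ground_truth = np.empty(n_trials)
post_draws = np.empty((n_trials, L))

for k in range(n_trials):
    # Step 1: draw a parameter value from the prior
    theta_tilde = np.random.normal(0.0, tau)

    # Step 2: draw a data set from the likelihood
    y_tilde = np.random.normal(theta_tilde, sigma, size=N_data)

    # Step 3: draw L samples from the posterior (exact here, MCMC in general)
    post_var = 1.0 / (1.0 / tau**2 + N_data / sigma**2)
    post_mean = post_var * np.sum(y_tilde) / sigma**2
    post_draws[k] = np.random.normal(post_mean, np.sqrt(post_var), size=L)

    ground_truth[k] = theta_tilde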

z-score

To check whether the posterior encompasses the ground truth, we can compute a z-score for each parameter $\theta_i$. The z-score is a measure of how close the mean sampled parameter value is to the ground truth, relative to the posterior uncertainty in the parameter value.

\begin{align} z_i = \frac{\langle\theta_i\rangle_{\mathrm{post}} - \tilde{\theta}_i}{\sigma_{i,\mathrm{post}}}. \end{align}

Here, $\langle\theta_i\rangle_{\mathrm{post}}$ is the average value of $\theta_i$ over all posterior samples, and $\sigma_{i,\mathrm{post}}$ is the standard deviation of $\theta_i$ over all posterior samples. As a rule of thumb, the z-scores across trials should be distributed symmetrically about zero (indicating that there is no bias toward under- or overestimating the ground truth) and should have magnitudes less than five.
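
As a minimal sketch with stand-in arrays (theta_post standing in for MCMC draws of one parameter, theta_tilde for the ground truth drawn in step 1), the z-score for a single trial is computed like this:

import numpy as np

# Stand-ins: ground truth from step 1 and posterior draws from step 3
theta_tilde = 2.5
theta_post = np.random.normal(2.3, 0.4, size=400)

# z-score: distance of the posterior mean from the ground truth,
# in units of the posterior standard deviation
z = (np.mean(theta_post) - theta_tilde) / np.std(theta_post)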

Shrinkage

To have a metric for how informative the data are in the posterior, we define the shrinkage for parameter $\theta_i$ to be

\begin{align} s_i = 1 - \frac{\sigma_{i,\mathrm{post}}^2}{\sigma_{i,\mathrm{prior}}^2}. \end{align}

Here, $\sigma_{i,\mathrm{prior}}^2$ is the variance in parameter $\theta_i$ in the prior distribution. When the posterior is substantially narrower than the prior (meaning that the data have informed the model), the shrinkage tends toward unity.
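
Similarly, as a sketch with stand-in arrays, the shrinkage compares the posterior variance to the prior variance, which we can estimate from draws out of the prior:

import numpy as np

# Stand-ins: draws from the prior and posterior draws for one parameter
theta_prior = np.random.lognormal(0.0, 2.0, size=1000)
theta_post = np.random.normal(2.3, 0.4, size=400)

# Shrinkage approaches one when the posterior is much narrower than the prior
shrinkage = 1.0 - np.var(theta_post) / np.var(theta_prior)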

Rank statistics

To check that the sampler is working properly, we can test a self-consistency relation that must hold for any statistical model. To derive such a relation, we consider a joint distribution $\pi(\theta, \tilde{y}, \tilde{\theta})$. Then,

\begin{align} g(\theta) &= \int\mathrm{d}\tilde{y}\,\int\mathrm{d}\tilde{\theta}\,\pi(\theta, \tilde{y}, \tilde{\theta}) = \int\mathrm{d}\tilde{y}\,\int\mathrm{d}\tilde{\theta}\,g(\theta \mid \tilde{y}, \tilde{\theta}) \,\pi(\tilde{y},\tilde{\theta}), \end{align}

where we have used the definition of conditional probability. We note that

\begin{align} g(\theta \mid \tilde{y}, \tilde{\theta}) = g(\theta \mid \tilde{y}) \end{align}

because $\theta$ is conditioned directly on $\tilde{y}$ and not directly on $\tilde{\theta}$, though $\tilde{y}$ is conditioned on $\tilde{\theta}$. Using this latter conditioning,

\begin{align} \pi(\tilde{y},\tilde{\theta}) = f(\tilde{y}\mid \tilde{\theta})\,g(\tilde{\theta}). \end{align}

Inserting these expressions into the above integral yields

\begin{align} g(\theta) = \int\mathrm{d}\tilde{y}\,\int\mathrm{d}\tilde{\theta}\,g(\theta \mid \tilde{y}) \,f(\tilde{y}\mid \tilde{\theta})\,g(\tilde{\theta}). \end{align}

This equation says that if we average the posterior distribution over data sets generated from the prior predictive distribution, we recover the prior. This means that the prior sample, $\tilde{\theta}$, and the $L$ posterior samples $\theta$, are drawn from the same distribution. Talts and coworkers derived a general theorem which says that the rank statistic of any variable sampled over $\theta$ is Uniformly distributed. So, we can compute a rank statistic for $\tilde{\theta}$ over the $L$ samples of $\theta$ we have, and this rank statistic should be Uniformly distributed on the interval $[0, L]$.
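
Concretely, the rank statistic for one trial is the number of posterior draws that fall below the ground truth; a minimal sketch with stand-in arrays is below. (Talts and coworkers also recommend thinning the posterior draws to reduce autocorrelation before computing the ranks.)

import numpy as np

# Stand-ins: ground truth and L posterior draws for one parameter
L = 400
theta_tilde = 2.5
theta_post = np.random.normal(2.5, 0.4, size=L)

# Rank of the ground truth among the posterior draws; across many trials,
# this should be Uniformly distributed on {0, 1, ..., L}
rank_statistic = np.sum(theta_post < theta_tilde)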

ECDFs of mRNA counts

Let's go ahead and load the data. We will use the Rest gene.

In [2]:
# Load DataFrame
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')

# Pull out data for Stan
n = df['Rest'].values
data = dict(N=len(n), n=n)

# Take a look
bokeh.io.show(bebi103.viz.ecdf(n, x_axis_label='mRNA count'))

The generative model

In Tutorial 7a, we used a Negative Binomial likelihood (which has both a theoretical and empirical justification), parametrized by the burst size $b$ and the burst frequency $\alpha$. We had the following generative model.

\begin{align} &\alpha \sim \text{LogNorm}(0,2), \\[1em] &b \sim \text{LogNorm}(2, 3), \\[1em] &\beta = 1/b,\\[1em] &n_i \sim \text{NegBinom}(\alpha, \beta) \;\forall i. \end{align}

We can reuse the code for the prior predictive checks and for the sampling.

In [3]:
model_code_prior_pred = """
data {
  int N;
}


generated quantities {
  int n[N];

  real alpha = lognormal_rng(0.0, 2.0);
  real b = lognormal_rng(2.0, 3.0);
  real beta_ = 1.0 / b;
  
  for (i in 1:N) {
    n[i] = neg_binomial_rng(alpha, beta_);
  }
}
"""


model_code = """
data {
  int N;
  int n[N];
}


parameters {
  real<lower=0> alpha;
  real<lower=0> b;
}


transformed parameters {
  real beta_ = 1.0 / b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}
"""

sm_prior_pred = bebi103.stan.StanModel(model_code=model_code_prior_pred)
sm = bebi103.stan.StanModel(model_code=model_code)
Using cached StanModel.
Using cached StanModel.

We can remind ourselves what the prior predictive checks looked like.

In [4]:
samples_prior_pred = sm_prior_pred.sampling(data=data,
                                            algorithm='Fixed_param',
                                            warmup=0,
                                            chains=1,
                                            iter=100)

df_samples = bebi103.stan.extract_array(samples_prior_pred, name='n')

bokeh.io.show(bebi103.viz.ecdf_collection(df_samples, 
                                          val='n', 
                                          cats='chain_idx', 
                                          color='#4e79a7', 
                                          alpha=0.1,
                                          show_legend=False,
                                          val_axis_type='log'))

The prior predictive checks show a wide range of mRNA counts, and all seem reasonable, even though some data sets have counts upwards of 10,000, which is large considering that the typical total mRNA count in a mammalian cell is about 100,000. Nonetheless, we will now proceed to perform SBC.

Performing SBC

Performing SBC really only requires a few ingredients. First, we need the requisite data to be used for prior predictive checks. In this case, it is just the number of measurements we are making, $N$. Second, we need a Stan model to generate the prior predictive data sets. Finally, we need a Stan model to sample out of the posterior. The bebi103.stan.sbc() function will then perform SBC and give the results back in a data frame. That is, it will draw a prior predictive data set, use that data set in a posterior sampling by MCMC calculation, and then compute the useful diagnostics and statistics (z-score, shrinkage, and rank statistic) from those samples. It does this N times (not to be confused with $N$, the number of measurements in the experiment). Let's now put it to use to perform SBC.

In [6]:
df_sbc = bebi103.stan.sbc(prior_predictive_model=sm_prior_pred,
                          posterior_model=sm,
                          prior_predictive_model_data=data,
                          posterior_model_data=data,
                          measured_data=['n'],
                          parameters=['alpha', 'b'],
                          n_jobs=4,
                          N=400,
                          measured_data_dtypes=dict(n=int),
                          progress_bar='notebook')

The bebi103.stan.sbc() function gives a data frame with the SBC analysis results. Let's take a look at the data frame to see what it has.

In [7]:
df_sbc.head()
Out[7]:
ground_truth rank_statistic mean sd shrinkage z_score rhat n_eff/n_iter n_divergences n_bad_ebfmi n_max_treedepth warning_code L trial parameter
0 0.577662 249 0.566653 0.238509 0.999961 -0.046157 1.001456 0.7550 0 0 0 0 400 0 alpha
1 2.214967 392 1.926874 0.147458 0.999985 -1.953737 1.018560 0.5350 0 0 0 0 400 1 alpha
2 1.931602 375 1.719937 0.132515 0.999988 -1.597297 0.995388 0.9175 0 0 0 0 400 2 alpha
3 5.011142 279 4.771711 0.410025 0.999884 -0.583942 0.999979 1.0000 0 0 0 0 400 3 alpha
4 0.188215 262 0.338404 0.761030 0.999601 0.197350 0.997571 1.0000 0 0 0 0 400 4 alpha

For each trial and each parameter, we get the ground truth used to generate the simulated data, the rank statistic, the posterior mean and standard deviation, the shrinkage and z-score, and the sampling diagnostics. The warning_code column gives a succinct summary of the diagnostic warnings. You can parse a warning code using the bebi103.stan.parse_warning_code() function. As an example, I'll test it on warning code 14.

In [8]:
bebi103.stan.parse_warning_code(14)
rhat warning
divergence warning
treedepth warning

To visualize the results of SBC, we can first make a plot of the z-scores and of shrinkage. Ideally, the shrinkage should all be close to one, and the magnitude of the z-scores should all be less than five. Let's take a look!

In [9]:
p = bebi103.viz.colored_scatter(df_sbc, 
                                cats='parameter', 
                                x='shrinkage', 
                                y='z_score',
                                x_axis_label='shrinkage',
                                y_axis_label='z-score')
p.legend.location = 'bottom_left'
bokeh.io.show(p)

There are several data sets that led to the posterior actually being broader than the prior. Let's display the plot again, this time zooming in on the region where most of the samples are.

In [10]:
p.x_range = bokeh.models.Range1d(0, 1.05)
bokeh.io.show(p)

For most of the parameters, we get decent shrinkage, and the z-scores are all reasonable.

We should check for which ground truth parameter values we got poor shrinkage. We can look directly at the data frame to check.

In [11]:
df_sbc.groupby('parameter').median()
Out[11]:
ground_truth rank_statistic mean sd shrinkage z_score rhat n_eff/n_iter n_divergences n_bad_ebfmi n_max_treedepth warning_code L trial
parameter
alpha 1.031229 184.5 1.185424 0.252536 0.999956 0.217018 0.999724 0.9850 0.0 0.0 0.0 0.0 400.0 199.5
b 7.710490 208.5 7.707477 1.382030 1.000000 0.069754 0.998848 0.9975 0.0 0.0 0.0 0.0 400.0 199.5
In [12]:
trials = df_sbc.loc[df_sbc['shrinkage'] < 0, 'trial']
df_sbc.loc[df_sbc['trial'].isin(trials), :].sort_values(by='trial')
Out[12]:
ground_truth rank_statistic mean sd shrinkage z_score rhat n_eff/n_iter n_divergences n_bad_ebfmi n_max_treedepth warning_code L trial parameter
21 0.403167 317 0.381601 0.027068 0.999999 -0.796734 0.997551 1.0000 0 0 0 0 400 21 alpha
421 480672.184990 143 506077.645328 59722.814676 -139.268997 0.425390 0.995466 1.0000 0 0 0 0 400 21 b
344 57.478115 317 45.525055 49.405143 -0.683187 -0.241940 1.005729 0.6375 0 0 0 0 400 344 alpha
744 0.051259 81 0.110177 0.067968 1.000000 0.866847 0.997911 0.8575 0 0 0 0 400 344 b

The shrinkage is poor when the burst size $b$ is small and the burst frequency $\alpha$ is high. Having a typical burst size less than one is actually unphysical, since no transcripts are created. So, the SBC has exposed a problem in our modeling that we didn't see before. Not only can the data fail to inform the model for these parameter values, we have also discovered that our model can give unphysical parameter values. We will abort continued analysis of our SBC results and instead adapt our model.

An adjusted prior

I would expect the time between bursts to be of order minutes, since that is a typical time scale for a cell's response to signaling. This is of the same order of magnitude as an RNA lifetime, so I might then expect $\alpha$ to be of order unity.

\begin{align} \alpha \sim \text{Gamma}(1.5, 0.25). \end{align}

I would expect the burst size to depend on promoter strength and/or strength of transcriptional activators. I could imagine anywhere from a few to a thousand transcripts per burst.

\begin{align} b \sim \text{Gamma}(3, 0.1). \end{align}

This prior moves $b$ off of zero, which we saw was problematic in our previous prior. We then have the following model.

\begin{align} &\alpha \sim \text{Gamma}(1.5, 0.25), \\[1em] &b \sim \text{Gamma}(3, 0.1), \\[1em] &\beta = 1/b,\\[1em] &n_i \sim \text{NegBinom}(\alpha, \beta) \;\forall i. \end{align}
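
As a quick sanity check (not part of the original notebook), we can look at the middle 95% of these priors using scipy. Note that Stan's gamma(alpha, beta) is parametrized by a rate beta, which corresponds to scale=1/beta in scipy.

import scipy.stats

# Middle 95% of the new priors; Stan's rate beta is scipy's scale = 1 / beta
alpha_prior = scipy.stats.gamma(1.5, scale=1/0.25)
b_prior = scipy.stats.gamma(3.0, scale=1/0.1)

print('alpha 2.5th and 97.5th percentiles:', alpha_prior.ppf([0.025, 0.975]))
print('b     2.5th and 97.5th percentiles:', b_prior.ppf([0.025, 0.975]))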

We can code this model up and check the prior predictive checks.

In [13]:
model_code_prior_pred = """
data {
  int N;
}


generated quantities {
  int n[N];

  real alpha = gamma_rng(1.5, 0.25);
  real b = gamma_rng(3.0, 0.1);
  real beta_ = 1.0 / b;
  
  for (i in 1:N) {
    n[i] = neg_binomial_rng(alpha, beta_);
  }
}
"""

sm_prior_pred = bebi103.stan.StanModel(model_code=model_code_prior_pred)
samples_prior_pred = sm_prior_pred.sampling(data=data,
                                            algorithm='Fixed_param',
                                            warmup=0,
                                            chains=1,
                                            iter=100)
Using cached StanModel.

Let's take a look at the ECDFs of these prior predictive data sets.

In [14]:
df_samples = bebi103.stan.extract_array(samples_prior_pred, name='n')

bokeh.io.show(bebi103.viz.ecdf_collection(df_samples, 
                                          val='n', 
                                          cats='chain_idx', 
                                          color='#4e79a7', 
                                          alpha=0.1,
                                          show_legend=False,
                                          val_axis_type='log'))

Most of the data sets have reasonable ECDFs. These data sets again seem to match our intuition. We can now code up the Stan model and run SBC on this, hopefully improved, model.

In [15]:
model_code = """
data {
  int N;
  int n[N];
}


parameters {
  real<lower=0> alpha;
  real<lower=0> b;
}


transformed parameters {
  real beta_ = 1.0 / b;
}


model {
  // Priors
  alpha ~ gamma(1.5, 0.25);
  b ~ gamma(3.0, 0.1);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}


generated quantities {
  int n_ppc[N];
  for (i in 1:N) {
    n_ppc[i] = neg_binomial_rng(alpha, beta_);
  }
}
"""

sm = bebi103.stan.StanModel(model_code=model_code)
Using cached StanModel.

And now let's let SBC go to work on this model!

In [16]:
df_sbc = bebi103.stan.sbc(prior_predictive_model=sm_prior_pred,
                          posterior_model=sm,
                          prior_predictive_model_data=data,
                          posterior_model_data=data,
                          measured_data=['n'],
                          parameters=['alpha', 'b'],
                          n_jobs=4,
                          N=400,
                          measured_data_dtypes=dict(n=int),
                          progress_bar='notebook')

This time, let's check the diagnostics first.

In [17]:
df_sbc['warning_code'].unique()
Out[17]:
array([0])

The only warning code we got was zero, meaning that there were no diagnostic issues with any of the simulations.

Now, let's make a plot of the z-score versus shrinkage.

In [18]:
p = bebi103.viz.colored_scatter(df_sbc, 
                                cats='parameter', 
                                x='shrinkage', 
                                y='z_score',
                                x_axis_label='shrinkage',
                                y_axis_label='z-score')
p.legend.location='bottom_left'
bokeh.io.show(p)

We have good z-scores for all trials, and decent shrinkage. This all looks good. Let's now do the self-consistency check with the rank statistic. Recall that the rank statistics should be Uniformly distributed. Therefore, the ECDFs of the rank statistics should fall on a diagonal line. When we plot the ECDF, we can also plot an envelope which encompasses the 99% confidence interval for the ECDF of a Uniformly distributed random variable.
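
The bebi103.viz.sbc_rank_ecdf() function draws that envelope for us. For intuition, here is one simple way to construct a pointwise 99% interval for the ECDF of Uniformly distributed rank statistics, using the fact that the number of ranks at or below a given value is Binomially distributed; the bebi103 implementation may construct its envelope differently.

import numpy as np
import scipy.stats

n_trials = 400  # number of SBC trials
L = 400         # ranks take values in {0, 1, ..., L}

# Theoretical CDF of a discrete Uniform rank statistic at each rank r
r = np.arange(L + 1)
F = (r + 1) / (L + 1)

# Pointwise 99% interval for an ECDF computed from n_trials rank statistics
low, high = scipy.stats.binom.interval(0.99, n_trials, F)
low, high = low / n_trials, high / n_trials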

In [19]:
bokeh.io.show(bebi103.viz.sbc_rank_ecdf(df_sbc, diff=False))

It looks like the rank statistic is Uniformly distributed. We can see this more clearly if we instead plot the difference between the ECDF and the theoretical ECDF of a Uniformly distributed random variable.

In [20]:
bokeh.io.show(bebi103.viz.sbc_rank_ecdf(df_sbc))

In this clearer view, we see that a few samples for either parameter escape the 99% envelope. This is not unexpected for the 400 samples we took, so the check of the rank statistics looks fine.

With everything checking out, we can perform our sampling with real data!

Sampling with our new model

We'll now use our model with updated priors to perform parameter estimation using our real data set, checking all diagnostics after the fact, of course.

In [21]:
samples = sm.sampling(data=data)

bebi103.stan.check_all_diagnostics(samples)
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
Out[21]:
0

Let's take a look at the corner plot.

In [22]:
bokeh.io.show(bebi103.viz.corner(samples, pars=['alpha', 'b']))

This result looks very much like what we achieved in Tutorial 7a, so the small adjustment in the prior did not affect our results. Nonetheless, making that adjustment improved our model, since we caught a problem in the prior (it gave burst sizes that were too small). In my experience, taking a principled approach to model building often uncovers issues, even in simple models like this one, that you were not aware of before performing the checks.

Finally, let's perform a posterior predictive check to make sure the model adequately captures our data.

In [23]:
bokeh.io.show(bebi103.viz.predictive_ecdf(samples, 
                                          name='n_ppc', 
                                          data=data['n'], 
                                          x_axis_label='mRNA transcript count',
                                          data_line=False,
                                          diff=True))

The model mostly captures the distribution, with just a few data points, about as many as would be expected, falling outside the 80th percentile of the posterior predictive samples.

Conclusions

The simulation-based calibration procedure (and the associated sensitivity analysis) is effective at identifying problem areas in Bayesian modeling. After passing the checks in this procedure, you can have more confidence in your modeling and the inferences you draw.

Computing environment

In [24]:
%load_ext watermark
In [25]:
%watermark -v -p numpy,pandas,pystan,bokeh,bebi103,jupyterlab
CPython 3.7.0
IPython 7.1.1

numpy 1.15.4
pandas 0.23.4
pystan 2.17.1.0
bokeh 1.0.1
bebi103 0.0.40
jupyterlab 0.35.3