14. Posterior predictive checks

Data set download


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import pandas as pd

import cmdstanpy
import arviz as az

import bebi103

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

The approach to prior predictive checks that we worked out in the previous part of this lesson is:

  1. Draw parameter values out of the prior.

  2. Use those parameter values in the likelihood to generate a new data set.

  3. Store the result and repeat.

After collecting the dataset, you can plot summaries of the data sets you generated. Which summaries you choose to plot is up to you, and is often not a trivial choice; as Michael Betancourt says, “Constructing an interpretable yet informative summary statistic is very much a fine art.” For univariate measurements, the ECDF is a good summary, and for regressions, we have seen that predictive regression plots are useful. You may need to choose particular summaries that are best for your modeling task at hand.

In this, lesson, we will learn how to do posterior predictive checks. The procedure is the same as prior predictive checks with one difference (highlighted in bold below).

  1. Draw parameter values out of the posterior.

  2. Use those parameter values in the likelihood to generate a new data set.

  3. Store the result and repeat.

Conveniently, we get samples of parameter values out of the posterior from Markov chain Monte Carlo. Once we have the generated data sets, we can compare them to the measured data. This helps answer the question: Could this generative model actually produce the observed data? If the answer is yes, the generative model is not ruled out by the data (though it still may be a bad model). If the answer is no, then the generative model cannot fully describe the process by which the data were generated.

We will again use the Good, et al. data set from the last lesson and will again consider two models, the independent size model and the conserved tubulin model.

Before proceeding, we’ll load in the data set and store the droplet diameters and spindle lengths as Numpy arrays so we can use them as data to pass to Stan.

[2]:
df = pd.read_csv(os.path.join(data_path, "good_invitro_droplet_data.csv"), comment="#")

d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values

The independent size model

The independent size model is

\begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\sigma = \sigma_0\,\phi,\\[1em] &l_i \sim \text{Norm}(\phi, \sigma)\; \forall i. \end{align}

Under this model, there is no dependence on the droplet diameter of the spindle length. To enable posterior predictive checking, we simply add the data generation to the generated quantities block in Stan. The Stan code below implements this model with generation of posterior predictive checks.

data {
  int N;
  real ell[N];
}


parameters {
  real<lower=0> phi;
  real<lower=0> sigma_0;
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  sigma_0 ~ gamma(2.0, 10.0);

  ell ~ normal(phi, sigma_0 * phi);
}


generated quantities {
  real ell_ppc[N];

  for (i in 1:N) {
    ell_ppc[i] = normal_rng(phi, sigma_0 * phi);
  }
}

The values in the generated quantities block are generated at every MCMC step. Furthermore, all parameters that the sampler is exploring (in this case, phi and sigma_0) are available within the generated quantities block. So, for each MCMC sample, we generate a new data set ell.

Let’s use this model to generate samples, including posterior predictive checks.

[3]:
data = {"N": len(ell), "ell": ell}

sm = cmdstanpy.CmdStanModel(stan_file="indep_size_model.stan")

samples = sm.sample(data=data, iter_sampling=1000, chains=4)

samples = az.from_cmdstanpy(posterior=samples, posterior_predictive=["ell_ppc"])
INFO:cmdstanpy:compiling stan program, exe file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/14/indep_size_model
INFO:cmdstanpy:compiler options: stanc_options=None, cpp_options=None
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/14/indep_size_model
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 4
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:finish chain 3

We can take a quick look at our samples in a corner plot.

[4]:
bokeh.io.show(bebi103.viz.corner(samples, xtick_label_orientation=np.pi / 4))

The samples look ok. Now, let’s look at the posterior predictive checks. As we did for prior predictive checks, we can plot a posterior predictive ECDF. Since we are checking these ECDFs against measured data in posterior predictive checks, we can overlay the data as well.

The bebi103.viz.predictive_ecdf() function expects input having n_samples rows and N columns, where N is the number of data points and n_samples is the total number of posterior predictive data sets we generated. Because we sampled with four chains, the posterior predictive array is three-dimensional.

[5]:
samples.posterior_predictive['ell_ppc'].shape
[5]:
(4, 1000, 670)

The first index is the chain, the second the draw, and the third is the number of data points. The samples are stored as an xarray, which we can reshape using the stack function. We will collapse the chain and draw indexes into a single sample index.

[6]:
ell_ppc = samples.posterior_predictive["ell_ppc"].stack({"sample": ("chain", "draw")})

We also want to be sure to specify the ordering of the indexes; samples should go first, followed by the number of the data point. We can do this using the transpose() method of an xarray DataArray, which lets us specify the ordering of the indexes.

[7]:
ell_ppc = ell_ppc.transpose('sample', 'ell_ppc_dim_0')

With this nicely shaped array, we can make our posterior predictive ECDF. We can overlay the ECDF of the data using the data kwarg.

I will adjust the percentiles we use in the plot to include the middle 99th percentile, since we have lots of data points.

[8]:
bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        ell_ppc,
        percentiles=[30, 50, 70, 99],
        data=ell,
        x_axis_label='spindle length [µm]'
    )
)

The range of the ECDF is tight (this is because we have 670 data points). A better use of space would be to plot the difference of the ECDF compared to the median of the posterior predictive ECDFs. We do this using the diff kwarg.

[9]:
bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        ell_ppc,
        percentiles=[30, 50, 70, 99],
        data=ell,
        diff='ecdf',
        x_axis_label='spindle length [µm]'
    )
)

The posterior predictive checks verify that this is a reasonable model for the data set.

Beware the summary statistic

In the above posterior predictive check, we used the ECDF as our summary and compared the ECDF from the data to the ECDF from the posterior predictive data sets. We chose this because the independent size model ignores the droplet diameter.

We could have instead chosen to do a posterior regression and plotted the spindle length versus the droplet diameter. Let’s see what happens when we do that.

[10]:
bokeh.io.show(
    bebi103.viz.predictive_regression(
        ell_ppc,
        samples_x=d,
        percentiles=[30, 50, 70, 99],
        data=np.vstack((d, ell)).transpose(),
        x_axis_label='droplet diameter [µm]',
        y_axis_label='spindle length [µm]'
    )
)

Plotting the data like this again shows that the model could produce the data. However, this fails the eye test: the data clearly depend on the droplet diameter when we plot the data. We should be sure to look more closely for trends like this, not just if all the data lie in the envelope.

The tubulin conservation model

The tubulin conservation model is

\begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\gamma \sim \text{Beta}(1.1, 1.1), \\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &\sigma_i = \sigma_0\,\mu_i,\\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma_i) \;\forall i. \end{align}

We follow similar steps to code up the model in Stan. The Stan code is:

functions {
  real ell_theor(real d, real phi, real gamma_) {
    real denom_ratio = (gamma_ * d / phi)^3;
    return gamma_ * d / (1 + denom_ratio)^(1.0 / 3.0);
  }
}


data {
  int N;
  int N_ppc;
  real d[N];
  real d_ppc[N_ppc];
  real ell[N];
}


parameters {
  real<lower=0> phi;
  real<lower=0, upper=1> gamma_;
  real<lower=0> sigma_0;
}


transformed parameters {
  real mu[N];
  real sigma[N];

  for (i in 1:N) {
    mu[i] = ell_theor(d[i], phi, gamma_);
    sigma[i] = sigma_0 * mu[i];
  }
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  gamma_ ~ beta(1.1, 1.1);
  sigma_0 ~ gamma(2.0, 10.0);

  ell ~ normal(mu, sigma);
}


generated quantities {
  real ell_ppc[N_ppc];

for (i in 1:N_ppc) {
    real mu_ppc = ell_theor(d_ppc[i], phi, gamma_);
    ell_ppc[i] = normal_rng(mu_ppc, sigma_0 * mu_ppc);
  }
}

Note that I have chosen a different set of droplet diameter values to use for the posterior predictive checks. We are free to do this, and I do it so that the posterior predictive plot is a smooth curve.

Let’s set up our data input now.

[11]:
N_ppc = 200
d_ppc = np.linspace(0.1, 250, N_ppc)
data = {
    "N": len(ell),
    "d": d,
    "ell": ell,
    "N_ppc": N_ppc,
    "d_ppc": d_ppc,
}

And now we can compile the model and sample!

[12]:
sm = cmdstanpy.CmdStanModel(stan_file='cons_tubulin_model.stan')

samples = sm.sample(data=data, iter_sampling=1000, chains=4)

samples = az.from_cmdstanpy(posterior=samples, posterior_predictive=["ell_ppc"])
INFO:cmdstanpy:compiling stan program, exe file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/14/cons_tubulin_model
INFO:cmdstanpy:compiler options: stanc_options=None, cpp_options=None
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/14/cons_tubulin_model
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4

Let’s look at the corner plot to check out the samples.

[13]:
bokeh.io.show(
    bebi103.viz.corner(
        samples,
        parameters=["phi", "gamma_", "sigma_0"],
        xtick_label_orientation=np.pi / 4,
    )
)

Looks good. We can now perform posterior predictive checks by plotting the predictive regression. Again, we need to convert the xarray DataArray to have the right dimensions for the bebi103.viz.predictive_regression() function.

[14]:
ell_ppc = (
    samples.posterior_predictive["ell_ppc"]
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "ell_ppc_dim_0")
)

bokeh.io.show(
    bebi103.viz.predictive_regression(
        ell_ppc,
        samples_x=d_ppc,
        percentiles=[30, 50, 70, 99],
        data=np.vstack((d, ell)).transpose(),
        x_axis_label="droplet diameter [µm]",
        y_axis_label="spindle length [µm]",
        x_range=[0, 250],
    )
)

The predictive regression looks good. The model captures the variability of the spindle length with droplet diameter as well as the variability in the measured data.

Both the independent size model and the tubulin conservation model can generate the observed data, though the droplet diameter dependence is lost in the former. In the next couple of weeks, we will discuss model comparison, allows us to quantitatively compare models that may both be commensurate with data.

[15]:
bebi103.stan.clean_cmdstan()

Computing environment

[16]:
%load_ext watermark
%watermark -v -p numpy,pandas,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
CPython 3.8.5
IPython 7.19.0

numpy 1.19.2
pandas 1.2.1
cmdstanpy 0.9.67
arviz 0.11.0
bokeh 2.2.3
bebi103 0.1.2
jupyterlab 2.2.6
cmdstan   : 2.26.0