Example model selection: regression

Data set download


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import pandas as pd

import cmdstanpy
import arviz as az

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

We return to the data set from Good, et al., Science, 342, 856-860, 2013, measuring the length of mitotic spindles as a function of droplet size. We considered two models for how spindle length depends on droplet diameter.

  1. The spindle length is set; there is no dependence on droplet diameter. We call this the independent size model.

  2. The spindle length is set by the total amount of tubulin available. We call this the tubulin conservation model.

We can state the two models as follows.

Model 1 \begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\sigma = \sigma_0 \, \phi,\\[1em] &l_i \sim \text{Norm}(\phi, \sigma) \;\forall i. \end{align}

Model 2 \begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\gamma \sim \text{Beta}(1.1, 1.1), \\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &\sigma_i = \sigma_0\,\mu_i,\\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma_i) \;\forall i. \end{align}

Our task now is to compare these two models. Though we have already plotted posterior predictive checks for this model in a previous lesson, we will do them again here, since they are an integral part of the model comparison workflow.

Note that model 2 reduces to model 1 in the limit of \(\gamma d \gg \phi\) so we have a clear connection between these two models. Let’s code up and compile the models, including posterior predictive checks and pointwise log likelihood calculations.

First, the Stan code for the independent size model.

data {
  int N;
  real ell[N];
}


parameters {
  real<lower=0> phi;
  real<lower=0> sigma_0;
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  sigma_0 ~ gamma(2.0, 10.0);

  ell ~ normal(phi, sigma_0 * phi);
}


generated quantities {
  real ell_ppc[N];
  real log_lik[N];

  // Posterior predictive checks
  for (i in 1:N) {
    ell_ppc[i] = normal_rng(phi, sigma_0 * phi);
  }

  // Pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(ell[i] | phi, sigma_0 * phi);
  }
}

And the code for the conserved tubulin model….

functions {
  real ell_theor(real d, real phi, real gamma_) {
    real denom_ratio = (gamma_ * d / phi)^3;
    return gamma_ * d / (1 + denom_ratio)^(1.0 / 3.0);
  }
}


data {
  int N;
  int N_ppc;
  real d[N];
  real d_ppc[N_ppc];
  real ell[N];
}


parameters {
  real<lower=0> phi;
  real<lower=0, upper=1> gamma_;
  real<lower=0> sigma_0;
}


transformed parameters {
  real mu[N];
  real sigma[N];

  for (i in 1:N) {
    mu[i] = ell_theor(d[i], phi, gamma_);
    sigma[i] = sigma_0 * mu[i];
  }
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  gamma_ ~ beta(1.1, 1.1);
  sigma_0 ~ gamma(2.0, 10.0);

  ell ~ normal(mu, sigma);
}


generated quantities {
  real ell_ppc[N_ppc];
  real log_lik[N];

  for (i in 1:N_ppc) {
    real mu_ppc = ell_theor(d_ppc[i], phi, gamma_);
    ell_ppc[i] = normal_rng(mu_ppc, sigma_0 * mu_ppc);
  }

  // Pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(ell[i] | ell_theor(d[i], phi, gamma_), sigma[i]);
  }
}

We’ll now compile the models.

[2]:
with bebi103.stan.disable_logging():
    sm_indep = cmdstanpy.CmdStanModel(stan_file='indep_size.stan')
    sm_cons = cmdstanpy.CmdStanModel(stan_file='cons_tubulin.stan')

All right! Let’s do some sampling, first for the independent size model.

[3]:
# Load in Data Frame
df = pd.read_csv(os.path.join(data_path, "good_invitro_droplet_data.csv"), comment="#")

# Set up data dict
data = dict(
    N=len(df),
    d=df["Droplet Diameter (um)"].values,
    ell=df["Spindle Length (um)"].values,
)

# Draw samples
with bebi103.stan.disable_logging():
    samples_indep = sm_indep.sample(data=data)

samples_indep = az.from_cmdstanpy(
    posterior=samples_indep, posterior_predictive="ell_ppc", log_likelihood="log_lik"
)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_indep)

# Corner plot
bokeh.io.show(
    bebi103.viz.corner(
        samples_indep, parameters=["phi", "sigma_0"], xtick_label_orientation=np.pi / 4
    )
)
Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.

Everything looks good with the sampler. Since there is no \(d\) dependence, we can use an ECDF for our posterior predictive check. (Note, though, that this is a bad idea. There are \(d\) measurements in the data set, and we should plot the data against the droplet diameter. We did this in the previous lesson on posterior predictive checks.) I will adjust the percentiles we use in the plot to include the middle 99th percentile, since we have lots of data points.

[4]:
ell_ppc = samples_indep.posterior_predictive.ell_ppc.stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "ell_ppc_dim_0")

bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        ell_ppc,
        percentiles=[99, 70, 50, 30],
        name="ell_ppc",
        data=df["Spindle Length (um)"].values,
        diff="ecdf",
        x_axis_label='spindle length (µm)',
    )
)

None of the data points lie outside the 99th percentile. It seems that this model passes the posterior predictive check. Let us now analyze the conserved tubulin model in the same way.

[5]:
# Add to the data dictionary
data["N_ppc"] = 200
d_ppc = np.linspace(0.1, 250, data["N_ppc"])
data["d_ppc"] = d_ppc

# Draw samples
with bebi103.stan.disable_logging():
    samples_cons = sm_cons.sample(data=data)

samples_cons = az.from_cmdstanpy(
    posterior=samples_cons, posterior_predictive="ell_ppc", log_likelihood="log_lik"
)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_cons)

# Corner plot
bokeh.io.show(
    bebi103.viz.corner(
        samples_cons,
        parameters=["phi", "gamma_", "sigma_0"],
        xtick_label_orientation=np.pi / 4,
    )
)
Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.

This looks fine. The \(d\) dependence makes it so we cannot directly use the predictive_ecdf() function. Rather, we should plot the spindle length versus diameter, along with the percentiles from the posterior predictive checks. For regressions of this sort, the bebi103.viz.predictive_regression() function will help with these plots.

[6]:
ell_ppc = samples_cons.posterior_predictive['ell_ppc'].stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "ell_ppc_dim_0")

bokeh.io.show(
    bebi103.viz.predictive_regression(
        ell_ppc,
        samples_x=d_ppc,
        percentiles=[30, 50, 70, 99],
        data=df[['Droplet Diameter (um)', 'Spindle Length (um)']].values,
        x_axis_label='droplet diameter [µm]',
        y_axis_label='spindle length [µm]',
        x_range=[0, 250],
    )
)

Like the set spindle length model, the conserved tubulin model also captures the data set, with just a few data points of the 670 just outside the 99th percentile of the samples out of the posterior predictive distribution.

So, both models cover the data set and pass the posterior predictive checks. We can then turn to the model comparisons to see which model is closer to the true generative distribution.

[7]:
az.compare(
    {"independent size model": samples_indep, "conserved tubulin model": samples_cons},
    ic="loo",
    scale="deviance",
)
/Users/bois/opt/anaconda3/lib/python3.8/site-packages/arviz/stats/stats.py:146: UserWarning: The default method used to estimate the weights for each model,has changed from BB-pseudo-BMA to stacking
  warnings.warn(
[7]:
rank loo p_loo d_loo weight se dse warning loo_scale
conserved tubulin model 0 3662.447560 3.174666 0.000000 1.000000e+00 37.461443 0.000000 False deviance
independent size model 1 4002.842434 1.890391 340.394874 3.014407e-10 36.417581 31.627608 False deviance

The conserved tubulin model is much better! When we cannot rule out models based on posterior predictive checks, computing weights based on information criteria allows us to select which model(s) best match the true generative model.

[8]:
bebi103.stan.clean_cmdstan()

Computing environment

[9]:
%load_ext watermark
%watermark -v -p numpy,pandas,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

numpy     : 1.19.2
pandas    : 1.2.1
cmdstanpy : 0.9.67
arviz     : 0.11.0
bokeh     : 2.2.3
bebi103   : 0.1.2
jupyterlab: 2.2.6

cmdstan   : 2.25.0