Example model selection: regression
[1]:
import numpy as np
import pandas as pd
import cmdstanpy
import arviz as az
import bebi103
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
We return to the data set from Good, et al., Science, 342, 856-860, 2013, measuring the length of mitotic spindles as a function of droplet size. We considered two models for how spindle length depends on droplet diameter.
1. The spindle length is set; there is no dependence on droplet diameter. We call this the independent size model.
2. The spindle length is set by the total amount of tubulin available. We call this the tubulin conservation model.
We can state the two models as follows.
Model 1 \begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\sigma \sim \text{HalfNorm}(10),\\[1em] &l_i \sim \text{Norm}(\phi, \sigma) \;\forall i. \end{align}
Model 2 \begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\gamma \sim \text{Beta}(1.1, 1.1), \\[1em] &\sigma \sim \text{HalfNorm}(10),\\[1em] &\mu = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu, \sigma) \;\forall i. \end{align}
Our task now is to compare these two models. Though we have already performed posterior predictive checks for these models in a previous lesson, we will do them again here, since they are an integral part of the model comparison workflow.
Note that model 2 reduces to model 1 in the limit \(\gamma d \gg \phi\), so we have a clear connection between the two models.
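To see this, note that when \(\gamma d \gg \phi\), the cubed term dominates the denominator, giving

\begin{align}
\mu = \frac{\gamma d}{\left(1+(\gamma d/\phi)^3\right)^{1/3}} \approx \frac{\gamma d}{\gamma d/\phi} = \phi,
\end{align}

which is the constant spindle length of model 1.

Let's code up and compile the models, including posterior predictive checks and pointwise log likelihood calculations.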
First, the Stan code for the independent size model.
data {
  int N;
  real ell[N];
}


parameters {
  real<lower=0> phi;
  real<lower=0> sigma;
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  sigma ~ normal(0.0, 10.0);

  for (i in 1:N) {
    ell[i] ~ normal(phi, sigma);
  }
}


generated quantities {
  real ell_ppc[N];
  real log_lik[N];

  // Posterior predictive checks
  for (i in 1:N) {
    ell_ppc[i] = normal_rng(phi, sigma);
  }

  // Pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(ell[i] | phi, sigma);
  }
}
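To be clear about what the pointwise log likelihood holds: each log_lik[i] is just the log of the Normal PDF evaluated at the i-th measured spindle length. A minimal Python sketch of the same computation (it uses scipy.stats, which is not otherwise used in this notebook, purely for illustration):

import scipy.stats

def log_like_indep(ell, phi, sigma):
    """Log of the Normal PDF for each measured spindle length,
    matching normal_lpdf(ell[i] | phi, sigma) in the Stan code."""
    return scipy.stats.norm.logpdf(ell, loc=phi, scale=sigma)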
And the Stan code for the conserved tubulin model.
functions {
  real ell_theor(real d, real phi, real gamma_) {
    real denom_ratio = (gamma_ * d / phi)^3;
    return gamma_ * d / (1 + denom_ratio)^(1.0 / 3.0);
  }
}


data {
  int N;
  int N_ppc;
  real d[N];
  real d_ppc[N_ppc];
  real ell[N];
}


parameters {
  real<lower=0> phi;
  real<lower=0, upper=1> gamma_;
  real<lower=0> sigma;
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  gamma_ ~ beta(1.1, 1.1);
  sigma ~ normal(0.0, 10.0);

  for (i in 1:N) {
    ell[i] ~ normal(ell_theor(d[i], phi, gamma_), sigma);
  }
}


generated quantities {
  real ell_ppc[N_ppc];
  real log_lik[N];

  // Posterior predictive checks
  for (i in 1:N_ppc) {
    ell_ppc[i] = normal_rng(ell_theor(d_ppc[i], phi, gamma_), sigma);
  }

  // Pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(ell[i] | ell_theor(d[i], phi, gamma_), sigma);
  }
}
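Before compiling, it can help to get a feel for the theoretical curve. Here is a quick sketch of \(\ell(d)\) for a few values of \(\gamma\) (the parameter values below are made up purely for visualization): the curve rises linearly for small droplets and plateaus at \(\phi\) for large ones.

# Illustrative sketch of the theoretical curve in the tubulin
# conservation model; parameter values are made up for visualization.
def ell_theor(d, phi, gamma):
    return gamma * d / (1 + (gamma * d / phi) ** 3) ** (1 / 3)

d_plot = np.linspace(0, 250, 200)

p = bokeh.plotting.figure(
    frame_height=200,
    frame_width=350,
    x_axis_label='droplet diameter (µm)',
    y_axis_label='spindle length (µm)',
)
for gamma in [0.2, 0.5, 0.9]:
    p.line(d_plot, ell_theor(d_plot, phi=40, gamma=gamma), line_width=2)

bokeh.io.show(p)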
We’ll now compile the models.
[2]:
sm_indep = cmdstanpy.CmdStanModel(stan_file='indep_size.stan')
sm_cons = cmdstanpy.CmdStanModel(stan_file='cons_tubulin.stan')
INFO:cmdstanpy:stan to c++ (/Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_07/indep_size.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_07/indep_size
INFO:cmdstanpy:stan to c++ (/Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_07/cons_tubulin.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_07/cons_tubulin
All right! Let’s do some sampling, first for the independent size model.
[3]:
# Load in Data Frame
df = pd.read_csv("../data/good_invitro_droplet_data.csv", comment="#")
# Set up data dict
data = dict(
    N=len(df),
    d=df["Droplet Diameter (um)"].values,
    ell=df["Spindle Length (um)"].values,
)

# Draw samples
samples_indep = sm_indep.sample(data=data)
samples_indep = az.from_cmdstanpy(
    posterior=samples_indep, posterior_predictive="ell_ppc", log_likelihood="log_lik"
)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_indep)

# Corner plot
bokeh.io.show(
    bebi103.viz.corner(
        samples_indep, pars=["phi", "sigma"], xtick_label_orientation=np.pi / 4
    )
)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
Everything looks good with the sampler. Since there is no \(d\) dependence, we can use an ECDF for our posterior predictive check. I will adjust the percentiles we use in the plot to include the middle 99% of the posterior predictive samples, since we have lots of data points.
[4]:
ell_ppc = samples_indep.posterior_predictive.ell_ppc.stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "ell_ppc_dim_0")

bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        ell_ppc,
        percentiles=[99, 70, 50, 30],
        name="ell_ppc",
        data=df["Spindle Length (um)"].values,
        diff=True,
        x_axis_label='spindle length (µm)',
    )
)
None of the data points lie outside the middle 99% envelope. It seems that this model passes the posterior predictive check. Let us now analyze the conserved tubulin model in the same way.
[5]:
# Add to the data dictionary
data['N_ppc'] = 200
d_ppc = np.linspace(0, 250, 200)
data['d_ppc'] = d_ppc

# Draw samples
samples_cons = sm_cons.sample(data=data)
samples_cons = az.from_cmdstanpy(
    posterior=samples_cons, posterior_predictive="ell_ppc", log_likelihood="log_lik"
)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_cons)

# Corner plot
bokeh.io.show(
    bebi103.viz.corner(
        samples_cons, pars=["phi", "gamma_", "sigma"], xtick_label_orientation=np.pi / 4
    )
)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
This looks fine. Because of the \(d\) dependence, we cannot directly use the predictive_ecdf() function. Instead, we plot the spindle length versus droplet diameter, along with the percentiles from the posterior predictive distribution. For regressions of this sort, the bebi103.viz.predictive_regression() function makes these plots.
[6]:
ell_ppc = samples_cons.posterior_predictive['ell_ppc'].stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "ell_ppc_dim_0")

bokeh.io.show(
    bebi103.viz.predictive_regression(
        ell_ppc,
        samples_x=d_ppc,
        percentiles=[30, 50, 70, 99],
        data=df[['Droplet Diameter (um)', 'Spindle Length (um)']].values,
        x_axis_label='droplet diameter [µm]',
        y_axis_label='spindle length [µm]',
        x_range=[0, 250],
    )
)
Like the independent size model, the conserved tubulin model also captures the data set, with just a few of the 670 data points falling outside the 99% envelope of the posterior predictive samples.
So, both models cover the data set and pass their posterior predictive checks. We now turn to model comparison to see which model is closer to the true generative distribution.
[7]:
az.compare(
    {"independent size model": samples_indep, "conserved tubulin model": samples_cons},
    ic="loo",
)
[7]:
|                         | rank | loo     | p_loo   | d_loo   | weight      | se      | dse     | warning | loo_scale |
|-------------------------|------|---------|---------|---------|-------------|---------|---------|---------|-----------|
| conserved tubulin model | 0    | 3680.42 | 3.13856 | 0       | 1           | 33.6177 | 0       | False   | deviance  |
| independent size model  | 1    | 4003.17 | 2.05972 | 322.752 | 3.08934e-52 | 35.7015 | 30.8557 | False   | deviance  |
The conserved tubulin model is much better! When we cannot rule out models based on posterior predictive checks, computing weights based on information criteria allows us to select which model(s) best match the true generative model.
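If you want to dig a bit deeper than the comparison table, you can also compute the LOO estimate for each model on its own. A short sketch (the formatting of the printed result will vary with the ArviZ version):

# Sketch: PSIS-LOO for each model separately. With pointwise=True, the
# result also includes the Pareto k diagnostic for every data point;
# values below about 0.7 indicate the approximation is reliable.
loo_indep = az.loo(samples_indep, pointwise=True)
loo_cons = az.loo(samples_cons, pointwise=True)

print(loo_indep)
print(loo_cons)

Roughly speaking, an Akaike-type weight on the deviance scale decays like \(\mathrm{e}^{-\Delta_\text{loo}/2}\), so a LOO difference of a few hundred, as we have here, translates into an essentially zero weight for the losing model.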
[8]:
bebi103.stan.clean_cmdstan()
Computing environment
[9]:
%load_ext watermark
%watermark -v -p numpy,pandas,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
CPython 3.7.6
IPython 7.12.0
numpy 1.18.1
pandas 0.24.2
cmdstanpy 0.8.0
arviz 0.6.1
bokeh 1.4.0
bebi103 0.0.52
jupyterlab 1.2.6