Model comparison in practice

Data set download


[1]:
import numpy as np
import pandas as pd

import cmdstanpy
import arviz as az

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

When comparing models, posterior predictive checks are a must. As discussed in lecture, we can use information criteria to assess the relative predictive effectiveness of models. To understand this lesson, you will need to have carefully read and understood those lecture notes.

The key quantity we compute for model comparison is the expected log pointwise predictive density, or elpd. There are a few key ideas and assumptions involved in using the elpd.

  1. Up to an additive constant that is the same for every model, the elpd approximates the negative Kullback-Leibler divergence between the true generative distribution and the posterior predictive distribution. Differences in elpd between models therefore approximate differences in this divergence.

  2. For our set of measurements \(y = (y_1, y_2, \ldots y_N)\), where each \(y_i\) may be multidimensional (as you would have, for example, with beak length/beak depth measurements for a single finch), we assume that the \(y_i\)’s are independently distributed, both in the model and in the true generative distribution.

With these assumptions, we can approximately compute the elpd using the Watanabe-Akaike information criterion (WAIC) or leave-one-out cross validation (LOO). See this paper by Vehtari, Gelman, and Gabry (arXiv version) for more details about the implementation. As described in that paper, LOO, when computed using Pareto-smoothed importance sampling, is the preferred method for computing an approximate elpd.

Importantly, the (approximate) elpd by itself is not terribly useful in assessing a model; the elpd of one prospective model needs to be compared to that of another. For this comparison, we can compute Akaike weights. This is the most straightforward calculation of the relative weights of respective models, and perhaps the easiest to understand. However, it may not be the best way to assess the predictive capabilities of a model, especially in situations where the true generative model is not known (which is often the case for us as scientists). As we think about generative models and are not sure which model best generates observed data, it is useful to think about model averaging if our aim is to be predictive. The idea here is that we do not know which model generates the data. Instead, we seek the combination of the models under consideration that best describes the data, with the respective weights of the models giving their contributions to this combination. As a scientist, I tend to shy away from model averaging; I am seeking to understand how nature generates the observations I see, and nature is not trying to predict, nor to average models. However, taking a model averaging approach with an eye toward optimizing predictive performance leads to more robust estimates of model weights, as outlined in this paper by Yao and coworkers, which describes a technique known as stacking.
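To make the model averaging idea concrete, the averaged posterior predictive distribution is a weighted combination of the posterior predictive distributions of the individual models,

\begin{align} f(\tilde{y}\mid y) = \sum_k w_k\, f_k(\tilde{y}\mid y), \end{align}

where the weights \(w_k\) are nonnegative and sum to one. In stacking, the weights are chosen to optimize the predictive performance of this combined distribution, as estimated by leave-one-out cross validation.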

In this tutorial, we will demonstrate how to calculate model weights both by using Akaike weights (and variants thereof) and by stacking. Conveniently, both may be computed approximately, directly from samples out of the posterior distributions. The ArviZ package provides much of the functionality we need.

An example model comparison

To show how we can do a model comparison, we will again consider the data set from Singer and coworkers, where they performed RNA FISH to determine the copy numbers of RNA transcripts of specific genes in individual cells in their samples. The data set can be downloaded here.

We will work with the Rex1 gene. In previous lessons, we considered two models. First, a model in which transcript counts are generated from a single Negative Binomial distribution. Second, a mixture model in which the transcript counts are generated from two Negative Binomial distributions. Note that this is a purely academic exercise, since the first model will fail posterior predictive checks quite spectacularly. We would never really come to this point where we needed to do a model comparison, but we proceed to demonstrate how it is done in practice.

Computing the pointwise log likelihood

Recall from lecture that the elpd depends on the logarithm of the posterior predictive distribution, \(f(\tilde{y}_i\mid y)\),

\begin{align} \text{elpd} = \sum_{i=1}^N\int\mathrm{d}\tilde{y}_i\,\,f_t(\tilde{y}_i)\,\ln f(\tilde{y}_i\mid y). \end{align}

When generating our samples, we therefore also need to compute samples of the value of the log likelihood. To do this, for each set of parameters \(\theta\) that we sample out of the posterior, we compute the pointwise log likelihood of the data set using those parameters. The Stan code below includes these log likelihood samples, as well as posterior predictive checks, for the single Negative Binomial model. Read the code carefully.

data {
  int N;
  int n[N];
}


parameters {
  real<lower=0> alpha;
  real<lower=0> b;
}


transformed parameters {
  real beta_ = 1.0 / b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}


generated quantities {
  int n_ppc[N];
  real log_lik[N];

  // Draw posterior predictive data set
  for (i in 1:N) {
    n_ppc[i] = neg_binomial_rng(alpha, beta_);
  }

  // Compute pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = neg_binomial_lpmf(n[i] | alpha, beta_);
  }
}

In the array log_lik, I store the pointwise log likelihood. That is, for each measurement (in this case, for each \(n_i\)), I compute the log likelihood for that data point using the parameters (in this case alpha and beta_) that I sampled out of the posterior. Conveniently, each of Stan’s distributions has a function ending in _lpdf that computes the log probability density function for the distribution (or _lpmf for discrete distributions, which computes the log probability mass function).
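To make explicit what is being computed, here is a minimal NumPy/SciPy sketch of the pointwise log likelihood for a single posterior draw. The counts and parameter values below are hypothetical, chosen only for illustration; Stan's neg_binomial(alpha, beta) corresponds to SciPy's nbinom distribution with n = alpha and p = beta / (1 + beta).

import numpy as np
import scipy.stats

# Hypothetical data and a single hypothetical posterior draw
n = np.array([5, 33, 152, 160])
alpha = 1.6
b = 85.0
beta_ = 1.0 / b

# One log likelihood value per data point, as in Stan's log_lik array
log_lik = scipy.stats.nbinom.logpmf(n, alpha, beta_ / (1.0 + beta_))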

Let’s sample out of this generative model.

[2]:
# Load data and make data dictionary
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')
data = {'N': len(df), 'n': df['Rex1'].values.astype(int)}

# Compile the model
sm = cmdstanpy.CmdStanModel(stan_file="neg_binom.stan")

# Perform sampling
samples = sm.sample(data=data)
INFO:cmdstanpy:stan to c++ (/Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_07/neg_binom.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_07/neg_binom
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4

When we convert the samples to an ArviZ object, we need to specify which variables hold the posterior predictive samples and which hold the log likelihood.

[3]:
# Convert to ArviZ object
samples = az.from_cmdstanpy(
    posterior=samples, posterior_predictive="n_ppc", log_likelihood="log_lik"
)

Now, we will check the diagnostics (as we always should) and make a corner plot.

[4]:
# Check diagnostics
bebi103.stan.check_all_diagnostics(samples)

# Make a corner plot
bokeh.io.show(bebi103.viz.corner(samples, pars=['alpha', 'b']))
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

The sampling looks good, but that does not mean everything is fine; we still need to do posterior predictive checks.

[5]:
n_ppc = samples.posterior_predictive.n_ppc.stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "n_ppc_dim_0")

bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        n_ppc,
        name="n_ppc",
        data=df["Rex1"].values,
        diff=True,
        x_axis_label="mRNA copy number",
    )
)

We have clearly failed the posterior predictive checks. We could stop here, but we will continue to compute the WAIC and LOO for illustrative purposes. As we do that, let’s take a quick look at the output so we can see how the log likelihood samples are organized. The log likelihood is stored in the sample_stats group of ArviZ InferenceData objects.

[6]:
samples.sample_stats
[6]:
<xarray.Dataset>
Dimensions:               (chain: 4, draw: 1000, log_likelihood_dim_0: 279)
Coordinates:
  * chain                 (chain) int64 0 1 2 3
  * draw                  (draw) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
  * log_likelihood_dim_0  (log_likelihood_dim_0) int64 0 1 2 3 ... 276 277 278
Data variables:
    lp                    (chain, draw) float64 -1.639e+03 ... -1.639e+03
    accept_stat           (chain, draw) float64 0.9777 0.9932 ... 0.8441 1.0
    stepsize              (chain, draw) float64 0.314 0.314 ... 0.329 0.329
    treedepth             (chain, draw) int64 2 2 3 1 2 2 2 2 ... 3 4 2 2 1 2 2
    n_leapfrog            (chain, draw) int64 7 3 11 3 7 7 5 ... 15 15 3 7 3 7 3
    diverging             (chain, draw) bool False False False ... False False
    energy                (chain, draw) float64 1.641e+03 ... 1.641e+03
    log_likelihood        (chain, draw, log_likelihood_dim_0) float64 -5.824 ... -6.865
Attributes:
    created_at:                 2020-02-13T03:38:44.922401
    inference_library:          cmdstanpy
    inference_library_version:  0.8.0

The log likelihood is three dimensional: one dimension each for chain and draw, with the final dimension indexing the pointwise log-likelihood estimates.
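If you ever need the flat 4000 by 279 log likelihood matrix yourself, you can stack the chain and draw dimensions just as we did for the posterior predictive samples above. A quick sketch, assuming samples is the ArviZ object from above:

# Flatten (chain, draw) into a single sample dimension
log_lik = samples.sample_stats.log_likelihood.stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "log_likelihood_dim_0").values

# log_lik.shape is (4000, 279): one row per posterior draw,
# one column per data point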

Given that the log likelihood is stored in the ArviZ object, we can directly use the samples to compute the WAIC and LOO.

Computing the WAIC and LOO

We will start with the WAIC.

[7]:
az.waic(samples)
[7]:
Computed from 4000 by 279 log-likelihood matrix

        Estimate       SE
IC_waic  3281.65    17.26
p_waic      2.15        -

Here, IC_waic stands for “information criterion WAIC” (which is kind of like saying “ATM machine”). The estimate for the WAIC is 3281, with a standard error of 17. We can also compute the LOO.

[8]:
single_loo = az.loo(samples)

single_loo
[8]:
Computed from 4000 by 279 log-likelihood matrix

       Estimate       SE
IC_loo  3281.65    17.26
p_loo      2.15        -

We see that the LOO and WAIC give almost identical results (as they should). Remember, though, that LOO has better performance across a wider variety of models.

Calculations with the mixture model

Now, let’s do the same calculation for the mixture model. In this case, we do not have a built-in distribution with an _lpmf function we can use. Fortunately, the log_mix() function in Stan accomplishes exactly what we need, as it did when we added it to target in the model specification. (A sketch of what log_mix computes follows the Stan code below.)

data {
  int N;
  int n[N];
}


parameters {
  vector<lower=0>[2] alpha;
  vector<lower=0>[2] b;
  real<lower=0, upper=1> w;
}


transformed parameters {
  vector[2] beta_ = 1.0 ./ b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);
  w ~ beta(1.0, 1.0);

  // Likelihood
  for (i in 1:N) {
    target += log_mix(
      w,
      neg_binomial_lpmf(n[i] | alpha[1], beta_[1]),
      neg_binomial_lpmf(n[i] | alpha[2], beta_[2]));
  }
}


generated quantities {
  int n_ppc[N];
  real log_lik[N];

  // Posterior predictive checks
  for (i in 1:N) {
    if (uniform_rng(0.0, 1.0) < w) {
      n_ppc[i] = neg_binomial_rng(alpha[1], beta_[1]);
    }
    else {
      n_ppc[i] = neg_binomial_rng(alpha[2], beta_[2]);
    }
  }

  // Pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = log_mix(
      w,
      neg_binomial_lpmf(n[i] | alpha[1], beta_[1]),
      neg_binomial_lpmf(n[i] | alpha[2], beta_[2]));
  }
}
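As a reminder, log_mix(w, \(\lambda_1\), \(\lambda_2\)) computes \(\ln\left(w\,\mathrm{e}^{\lambda_1} + (1-w)\,\mathrm{e}^{\lambda_2}\right)\) in a numerically stable way, where \(\lambda_1\) and \(\lambda_2\) are the log likelihoods of the two mixture components. If you wanted the same quantity outside of Stan, a minimal NumPy sketch using logsumexp would look like this:

import numpy as np
from scipy.special import logsumexp

def log_mix(w, log_like_1, log_like_2):
    """Stable log of w * exp(log_like_1) + (1 - w) * exp(log_like_2)."""
    return logsumexp(
        [np.log(w) + log_like_1, np.log(1.0 - w) + log_like_2], axis=0
    )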

Let’s compile the model and get some samples.

[9]:
sm_mix = cmdstanpy.CmdStanModel(stan_file="neg_binom_mix.stan")

samples_mix = sm_mix.sample(data=data)
samples_mix = az.from_cmdstanpy(
    posterior=samples_mix, posterior_predictive="n_ppc", log_likelihood="log_lik"
)
INFO:cmdstanpy:stan to c++ (/Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_07/neg_binom_mix.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_07/neg_binom_mix
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 4
INFO:cmdstanpy:finish chain 3

We’ll do our usual diagnostic checks and make a corner plot.

[10]:
# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)

# Make corner plot
bokeh.io.show(
    bebi103.viz.corner(
        samples_mix,
        pars=["alpha[0]", "alpha[1]", "b[0]", "b[1]", "w"],
        plot_width=100,
        xtick_label_orientation=np.pi / 4,
    )
)
ESS for parameter alpha[0] is 24.10334569087535.
ESS for parameter alpha[1] is 14.169449045054352.
tail-ESS for parameter alpha[1] is 30.975407533515767.
ESS for parameter b[0] is 7.238196929145973.
tail-ESS for parameter b[0] is 33.481207843647155.
ESS for parameter b[1] is 7.348911591336188.
tail-ESS for parameter b[1] is 32.430899698858376.
ESS for parameter w is 7.215628332310587.
tail-ESS for parameter w is 32.61049728673094.
ESS for parameter beta_[0] is 7.238197128608312.
tail-ESS for parameter beta_[0] is 33.48120784364758.
ESS for parameter beta_[1] is 7.348843427199784.
tail-ESS for parameter beta_[1] is 32.43089969885849.
  ESS or tail-ESS below 100 per chain indicates that expectation values
  computed from samples are unlikely to be good approximations of the
  true expectation values.
Rhat for parameter alpha[0] is 1.1185921715495748.
Rhat for parameter alpha[1] is 1.2161771961165373.
Rhat for parameter b[0] is 1.5316544260281295.
Rhat for parameter b[1] is 1.5194938752573814.
Rhat for parameter w is 1.529347374592077.
Rhat for parameter beta_[0] is 1.5316542059503648.
Rhat for parameter beta_[1] is 1.5212994341634194.
  Rank-normalized Rhat above 1.01 indicates that the chains very likely have not mixed
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

Oof! We failed the ESS and Rhat diagnostics. Recall that sampling out of mixture models is tricky because of label switching. To protect against label switching, we will initialize the walkers near where chain 1 sampled.

[11]:
# Compute mean of parameters for chain 1
inits = {
    "alpha": [
        float(samples_mix.posterior.sel(chain=1, alpha_dim_0=0).alpha.mean()),
        float(samples_mix.posterior.sel(chain=1, alpha_dim_0=1).alpha.mean()),
    ],
    "b": [
        float(samples_mix.posterior.sel(chain=1, b_dim_0=0).b.mean()),
        float(samples_mix.posterior.sel(chain=1, b_dim_0=1).b.mean()),
    ],
    "w": float(samples_mix.posterior.sel(chain=1).w.mean()),
}

# Take a look
inits
[11]:
{'alpha': [5.12722252, 3.552391071],
 'b': [32.2517472, 32.2517472],
 'w': 0.8337721379999999}

Now we’ll use this as our initial walker positions.

[12]:
# Sample
samples_mix = sm_mix.sample(data, inits=inits)

# Convert to ArviZ
samples_mix = az.from_cmdstanpy(
    posterior=samples_mix, posterior_predictive="n_ppc", log_likelihood="log_lik"
)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)

# Make corner plot
bokeh.io.show(
    bebi103.viz.corner(
        samples_mix,
        pars=["alpha[0]", "alpha[1]", "b[0]", "b[1]", "w"],
        plot_width=100,
        xtick_label_orientation=np.pi / 4,
    )
)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

Everything looks good! We can do a quick posterior predictive check.

[13]:
n_ppc = samples_mix.posterior_predictive.n_ppc.stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "n_ppc_dim_0")

bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        n_ppc,
        name="n_ppc",
        data=df["Rex1"].values,
        diff=False,
        x_axis_label="mRNA copy number",
    )
)

Much nicer! The model has plenty of flexibility to capture the observed bimodal behavior. Let’s proceed to compute the LOO and WAIC, starting with the WAIC.

[14]:
az.waic(samples_mix)
[14]:
Computed from 4000 by 279 log-likelihood matrix

        Estimate       SE
IC_waic  3191.99    21.97
p_waic      5.96        -

And the LOO.

[15]:
mix_loo = az.loo(samples_mix)

mix_loo
[15]:
Computed from 4000 by 279 log-likelihood matrix

       Estimate       SE
IC_loo  3191.99    21.97
p_loo      5.97        -

As expected, the LOO gives nearly the same value for the information criterion as the WAIC. Let’s take a quick look at the difference of the LOOs.

[16]:
mix_loo.loo - single_loo.loo
[16]:
-89.65856552855712

Remember that for historical reasons,

\begin{align} \text{WAIC} \approx -2\,\text{elpd}, \\[1em] \text{LOO} \approx -2\,\text{elpd}. \\[1em] \end{align}

The bigger the elpd, the smaller the Kullback-Leibler divergence, and therefore the better the model. A bigger elpd thus means a smaller WAIC or LOO, so the smaller the WAIC or LOO, the closer the model is to the true generative model. Both the WAIC and LOO are smaller for the mixture model than for the single Negative Binomial model, so the mixture model is the better model.
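If you prefer to think on the elpd scale, you can convert the deviance-scale values reported above by dividing by -2. A quick sketch, using the LOO results computed above:

# Convert deviance-scale LOO to (approximate) elpd
elpd_single = -single_loo.loo / 2   # about -1640.8
elpd_mix = -mix_loo.loo / 2         # about -1596.0

# The mixture model has the larger elpd, and is therefore
# the more predictive model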

Computing the weights

We can directly compute the Akaike weights from the values of the LOO, using

\begin{align} w_i = \frac{\exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}{1 + \exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}. \end{align}

[17]:
d_loo = mix_loo.loo - single_loo.loo
w_single = np.exp(d_loo/2) / (1 + np.exp(d_loo/2))
w_mix = 1 - w_single

print('           Mixture model weight:', w_mix)
print('Single Neg. Binom. model weight:', w_single)
           Mixture model weight: 1.0
Single Neg. Binom. model weight: 3.3953915835199315e-20

In agreement with our posterior predictive checks, the mixture model is far more predictive than the single negative binomial model.

As I mentioned above, ArviZ offers a more sophisticated means of computing the weights using stacking. The results tend to be less extreme than those from directly computing the Akaike weights. We can use the az.compare() function to do the calculation. We will do it using the LOO (WAIC is the default, so we use the ic kwarg). The first input is a dictionary containing the MCMC samples, where the keys of the dictionary are the names of the models.

[18]:
az.compare({'single': samples, 'mix': samples_mix}, ic='loo')
[18]:
        rank      loo    p_loo    d_loo       weight       se      dse  warning  loo_scale
mix        0  3191.99  5.96877        0            1  17.7151        0    False   deviance
single     1  3281.65  2.15077  89.6586  2.76402e-08  22.5856  17.8182    False   deviance

The mixture model is still dominant.

[19]:
bebi103.stan.clean_cmdstan()

Computing environment

[20]:
%load_ext watermark
%watermark -v -p numpy,pandas,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
CPython 3.7.6
IPython 7.12.0

numpy 1.18.1
pandas 0.24.2
cmdstanpy 0.8.0
arviz 0.6.1
bokeh 1.4.0
bebi103 0.0.52
jupyterlab 1.2.6