Tutorial 9b: Model comparison

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import pandas as pd

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

As we saw in the previous tutorial, posterior predictive checks are essential for assessing how well a model can capture experimental results. As I alluded to in that tutorial, and spelled out in lecture 8, we can use information criteria to assess the relative predictive effectiveness of models. In order to understand this tutorial, you will need to have carefully read and understood lecture 8.

The key quantity we compute for model comparison is the expected log pointwise predictive density, or elpd. There are a few key ideas and assumptions behind using the elpd.

  1. The elpd is an approximation of the difference in the Kullback-Leibler divergence between the posterior predictive distribution and the true generative distribution.
  2. For our set of measurements $y = (y_1, y_2, \ldots y_N)$, where each $y_i$ may be multidimensional (as you would have, for example, with beak length/beak depth measurements for a single finch), we assume that the $y_i$ are independently distributed, both in the model and in the true generative distribution.

With these assumptions, we can approximately compute the elpd using the Watanabe-Akaike information criterion (WAIC) or leave-one-out cross validation (LOO). See lecture 8 for the basic ideas, and this paper by Vehtari, Gelman, and Gabry (arXiv version) for more details about the implementation. As described in that paper, LOO, when computed using Pareto-smoothed importance sampling, is the preferred method for computing an approximate elpd.
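As a reminder (this is a sketch of the estimator discussed in lecture 8; the notation may differ slightly), the WAIC estimates the elpd directly from $S$ posterior samples $\theta^{(1)}, \ldots, \theta^{(S)}$ as

\begin{align} \widehat{\text{elpd}}_\text{WAIC} = \sum_{i=1}^N\ln\left(\frac{1}{S}\sum_{s=1}^S f(y_i\mid \theta^{(s)})\right) - \sum_{i=1}^N\text{Var}_s\left[\ln f(y_i\mid \theta^{(s)})\right], \end{align}

where the first term is the log pointwise predictive density computed from the samples and the second acts as an effective number of parameters that penalizes model flexibility. LOO instead estimates the leave-one-out predictive density of each data point by importance sampling over the posterior draws, with Pareto smoothing stabilizing the importance weights.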

Importantly, the (approximate) elpd by itself is not terribly useful in assessing a model. The elpd of one prospective model needs to be compared to another. For this comparison, as shown in lecture 8, we can compute Akaike weights. This is the most straightforward calculation of the relative weights of the respective models, and perhaps the easiest to understand. However, it may not be the best way to assess the predictive capabilities of a model, especially in situations where the true generative model is not known (which is often the case for us as scientists). When we think about generative models and are not sure which one best generates the observed data, it is useful to think about model averaging if our aim is to be predictive. The idea is that we do not know which model generates the data. Instead, we seek a combination of the models under consideration that best describes the data; the respective weights of the models give their contributions to this combination. As a scientist, I tend to shy away from model averaging; I am seeking to understand how nature generates the observations I see, and nature is neither trying to predict nor to average models. Nonetheless, taking a model averaging approach with an eye toward optimizing predictive performance leads to more robust estimates of model weights, as outlined in this paper by Yao and coworkers, which describes a technique known as stacking.
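For concreteness, stacking chooses the weights by maximizing the LOO-estimated predictive density of the weighted combination of models (a sketch of the objective; see the Yao et al. paper for details and regularization considerations),

\begin{align} \hat{w} = \underset{w}{\text{arg max}}\,\sum_{i=1}^N \ln\sum_{k=1}^K w_k\,f(y_i\mid y_{-i}, M_k), \qquad w_k \geq 0, \;\;\sum_{k=1}^K w_k = 1, \end{align}

where $f(y_i\mid y_{-i}, M_k)$ is the leave-one-out predictive density of data point $i$ under model $M_k$, approximated from the posterior samples.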

In this tutorial, we will demonstrate how to calculate model weights, both using Akaike weights (and variants thereof) and using stacking. Conveniently, this may be done directly, if approximately, from samples out of the posterior distributions. The ArviZ package provides much of the functionality we need.

The ArviZ package

You should install ArviZ using

pip install arviz

on the command line. You should also make sure your bebi103 module is up to date.

pip install --upgrade bebi103

ArviZ is still very much in active development, by many of the same developers as PyStan. Its API may change going forward. Importantly, it requires that the inputs, which are MCMC samples, be in its own format. Its format converters work differently for different versions of PyStan. We will therefore interface with ArviZ through the bebi103 module, which will appropriately convert your MCMC samples to the format ArviZ wants, whether you are using PyStan 2.17 or 2.18.

An example model comparison

To show how we can do a model comparison, we will again consider the data set from Singer and coworkers, in which they performed RNA FISH to determine the copy numbers of RNA transcripts of specific genes in individual cells in their samples. The data set can be downloaded here.

We will work with the Rex1 gene. In previous tutorials, we considered two models. First, a model in which transcript counts are generated from a single Negative Binomial distribution. Second, a mixture model in which the transcript counts are generated from two Negative Binomial distributions. Note that this is a purely academic exercise, since the first model fails posterior predictive checks quite spectacularly. We would never really come to the point of needing a formal model comparison, but we proceed to demonstrate how it is done in practice.

Computing the pointwise log likelihood

Recalling from lecture, the elpd depends on the logarithm of the posterior predictive distribution, $f(\tilde{y}_i\mid y)$,

\begin{align} \text{elpd} = \sum_{i=1}^N\int\mathrm{d}\tilde{y}_i\,\,f_t(\tilde{y}_i)\,\ln f(\tilde{y}_i\mid y). \end{align}

When generating our samples, we therefore also need to compute samples of the value of the log likelihood. To do this, for each set of parameters $\theta$ that we sample out of the posterior, we compute the pointwise log likelihood of the data set, using the parameters $\theta$. The Stan code below includes these log likelihood samples as well as posterior predictive checks for the single Negative Binomial model. Read the code carefully.

In [2]:
model_code = """
data {
  int N;
  int n[N];
}


parameters {
  real<lower=0> alpha;
  real<lower=0> b;
}


transformed parameters {
  real beta_ = 1.0 / b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}


generated quantities {
  int n_ppc[N];
  real log_lik[N];

  // Draw posterior predictive data set
  for (i in 1:N) {
    n_ppc[i] = neg_binomial_rng(alpha, beta_);
  }
  
  // Compute pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = neg_binomial_lpmf(n[i] | alpha, beta_);
  }
}
"""

sm = bebi103.stan.StanModel(model_code=model_code)
Using cached StanModel.

In the array log_lik, I store the pointwise log likelihood. That is, for each measurement (in this case for each $n_i$), I compute the log likelihood for that data point using the parameters (in this case alpha and beta_) that I sampled out of the posterior. Conveniently, all of Stan's distributions have a function ending in _lpdf that computes the log probability density function for the distribution (or _lpmf for discrete distributions, which computes the log probability mass function).
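If you want to convince yourself of what neg_binomial_lpmf computes, here is a minimal sketch in Python using scipy.stats. The parameter values and counts below are hypothetical; the key point is the conversion from Stan's (alpha, beta) parametrization to scipy's (n, p) parametrization, with p = beta / (1 + beta).

import numpy as np
import scipy.stats as st

# Hypothetical parameter values (in practice, one draw of alpha and b from the posterior)
alpha = 4.5
b = 84.0
beta_ = 1.0 / b

# Hypothetical transcript counts
n = np.array([10, 150, 280])

# Stan's neg_binomial(alpha, beta_) corresponds to scipy's nbinom(alpha, beta_ / (1 + beta_)),
# so this reproduces neg_binomial_lpmf(n[i] | alpha, beta_) for each count
log_lik_check = st.nbinom.logpmf(n, alpha, beta_ / (1.0 + beta_))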

Let's sample out of this generative model.

In [3]:
# Load data and make data dictionary
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')
data = {'N': len(df), 'n': df['Rex1'].values.astype(int)}

# Perform sampling
samples = sm.sampling(data=data)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples)

# Make a corner plot
bokeh.io.show(bebi103.viz.corner(samples, pars=['alpha', 'b']))
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

Everything looks good. Actually, not everything: the sampling looks good, but we still need to do posterior predictive checks.

In [4]:
bokeh.io.show(bebi103.viz.predictive_ecdf(samples, 
                                          name='n_ppc', 
                                          data=df['Rex1'].values,
                                          diff=True,
                                          data_line=False))

We have clearly failed the posterior predictive checks. We can stop here, but we will continue to compute the WAIC and LOO for illustrative purposes. As we do that, let's take a quick look at the output so we can see how the log likelihood samples are organized.

In [5]:
df_mcmc = bebi103.stan.to_dataframe(samples)
df_mcmc.columns
Out[5]:
Index(['chain', 'chain_idx', 'warmup', 'divergent__', 'energy__',
       'treedepth__', 'accept_stat__', 'stepsize__', 'n_leapfrog__', 'alpha',
       ...
       'log_lik[271]', 'log_lik[272]', 'log_lik[273]', 'log_lik[274]',
       'log_lik[275]', 'log_lik[276]', 'log_lik[277]', 'log_lik[278]',
       'log_lik[279]', 'lp__'],
      dtype='object', length=571)

We have a log likelihood for each of the 279 data points for each sample out of the posterior.
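As a quick check (a sketch assuming the df_mcmc data frame from above and the default four chains of 1,000 post-warmup iterations), the pointwise log likelihoods form a samples-by-data-points array.

# Pull out the log likelihood columns; one row per posterior draw, one column
# per data point, so the shape should be (4000, 279)
log_lik = df_mcmc.filter(like='log_lik').values
log_lik.shape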

Computing the WAIC and LOO

With these samples, we can compute the WAIC and LOO using (bebi103 wrappers around) ArviZ functions. You will likely get annoying warning messages about dtypes and permuted kwargs. This is happening as ArviZ is calling some functions within PyStan. They are superfluous warnings. You can ignore them.

In [6]:
waic_results = bebi103.stan.waic(samples, log_likelihood='log_lik')
loo_results = bebi103.stan.loo(samples, log_likelihood='log_lik')

print(waic_results, end='\n\n')
print(loo_results)
          waic   waic_se    p_waic  warning
0  3281.345729  17.32826  1.991239        0

          loo    loo_se     p_loo  warning
0  3281.34582  17.32826  1.991284        0

The functions waic and loo output data frames with the respective information criterion and estimates of the standard error for them. We see that the LOO and WAIC give almost identical results (as they should). Remember, though, that LOO has better performance across a wider variety of models.
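To demystify what these functions are doing, here is a minimal sketch of the WAIC computation by hand from the log likelihood samples (assuming the df_mcmc data frame from above; the ArviZ implementation is more careful about numerics and also reports standard errors).

import numpy as np
import scipy.special

# Pointwise log likelihoods: one row per posterior draw, one column per data point
log_lik = df_mcmc.filter(like='log_lik').values
S = log_lik.shape[0]

# Log pointwise predictive density: log of the average likelihood over draws
lppd = np.sum(scipy.special.logsumexp(log_lik, axis=0) - np.log(S))

# Effective number of parameters: pointwise variance of the log likelihood
p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))

# WAIC on the deviance scale, as reported above
waic = -2 * (lppd - p_waic)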

Calculations with the mixture model

Now, let's do the same calculation for the mixture model. In this case, we do not have a built-in distribution whose _lpmf function we can use. Fortunately, the log_mix() function in Stan accomplishes exactly what we need, as it did when we added it to target in the model specification.
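As a reminder, Stan's log_mix(w, lp1, lp2) computes the log of a two-component mixture density directly from the component log densities,

\begin{align} \texttt{log\_mix}(w, \lambda_1, \lambda_2) = \ln\left(w\,\mathrm{e}^{\lambda_1} + (1-w)\,\mathrm{e}^{\lambda_2}\right), \end{align}

which is exactly the pointwise log likelihood we need when $\lambda_1$ and $\lambda_2$ are the Negative Binomial log PMFs of a data point under the two components.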

In [7]:
model_code_mix = """
data {
  int N;
  int n[N];
}


parameters {
  vector<lower=0>[2] alpha;
  vector<lower=0>[2] b;
  real<lower=0, upper=1> w;
}


transformed parameters {
  vector[2] beta_ = 1.0 ./ b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);
  w ~ beta(1.0, 1.0);

  // Likelihood
  for (i in 1:N) {
    target += log_mix(w,
                      neg_binomial_lpmf(n[i] | alpha[1], beta_[1]),
                      neg_binomial_lpmf(n[i] | alpha[2], beta_[2]));
  }
}


generated quantities {
  int n_ppc[N];
  real log_lik[N];
  
  // Posterior predictive checks
  for (i in 1:N) {
    if (uniform_rng(0.0, 1.0) < w) {
      n_ppc[i] = neg_binomial_rng(alpha[1], beta_[1]);
    }
    else {
      n_ppc[i] = neg_binomial_rng(alpha[2], beta_[2]);
    }
  }
  
  // Pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = log_mix(w,
                      neg_binomial_lpmf(n[i] | alpha[1], beta_[1]),
                      neg_binomial_lpmf(n[i] | alpha[2], beta_[2]));
  }
}
"""

sm_mix = bebi103.stan.StanModel(model_code=model_code_mix)
Using cached StanModel.

Recall that sampling is tricky. We'll use the same function from the last tutorial to sample out of the mixture model.

In [8]:
def sample_mix(data, **kwargs):
    """Sample a mixture model."""
    samples = sm_mix.sampling(data=data, chains=1, **kwargs)
    df_mcmc = bebi103.stan.to_dataframe(samples)
    params = ['alpha[1]', 'alpha[2]', 'b[1]', 'b[2]', 'w']
    param_means = df_mcmc.loc[df_mcmc['chain']==1, params].mean()
    
    def init_fun():
        """Initialization function for sample at mean of one mode."""
        return {'alpha': [param_means['alpha[1]'], param_means['alpha[2]']],
                'b': [param_means['b[1]'], param_means['b[2]']],
                'w': param_means['w']}

    # Get the samples
    return sm_mix.sampling(data=data, init=init_fun, **kwargs)

# Sample
samples_mix = sample_mix(data)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)

# Make a corner plot
bokeh.io.show(bebi103.viz.corner(samples_mix, 
                                 pars=['alpha[1]', 'b[1]', 'alpha[2]', 'b[2]', 'w']))
/Users/Justin/anaconda3/lib/python3.7/site-packages/pystan/misc.py:399: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  elif np.issubdtype(np.asarray(v).dtype, float):
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

We already performed posterior predictive checks for this model/data set in the previous tutorial, and it looks good. Let's proceed to compute the LOO and WAIC.

In [9]:
waic_results_mix = bebi103.stan.waic(samples_mix, log_likelihood='log_lik')
loo_results_mix = bebi103.stan.loo(samples_mix, log_likelihood='log_lik')

print('Single Negative Binomial:')
print(waic_results, end='\n\n')
print(loo_results, end='\n\n\n\n')
print('Mixture model:')
print(waic_results_mix, end='\n\n')
print(loo_results_mix)
Single Negative Binomial:
          waic   waic_se    p_waic  warning
0  3281.345729  17.32826  1.991239        0

          loo    loo_se     p_loo  warning
0  3281.34582  17.32826  1.991284        0



Mixture model:
          waic   waic_se    p_waic  warning
0  3191.985353  21.88561  5.974564        0

           loo     loo_se     p_loo  warning
0  3192.001042  21.884966  5.982409        0

Remember that for historical reasons,

\begin{align} \text{WAIC} \approx -2\,\text{elpd}, \\[1em] \text{LOO} \approx -2\,\text{elpd}. \end{align}

The bigger the elpd, the smaller the Kullback-Leibler divergence, and therefore the better the model. Because of the factor of -2, a bigger elpd means a smaller WAIC or LOO, so the smaller the WAIC or LOO, the closer the model is to the true generative model. The WAIC and LOO are smaller for the mixture model than for the single Negative Binomial model, so the mixture model is the better one.

Computing the weights

We can directly compute the Akaike weights from the values of the LOO, using

\begin{align} w_i = \frac{\exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}{1 + \exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}. \end{align}

In [10]:
d_loo = (loo_results_mix['loo'] - loo_results['loo'])[0]
w_single = np.exp(d_loo/2) / (1 + np.exp(d_loo/2))
w_mix = 1 - w_single

print('           Mixture model weight:', w_mix)
print('Single Neg. Binom. model weight:', w_single)
           Mixture model weight: 1.0
Single Neg. Binom. model weight: 3.972172683456721e-20

In agreement with our posterior predictive checks, the mixture model is far more predictive than the single Negative Binomial model.

As I mentioned above, ArviZ offers a more sophisticated means of computing the weights using stacking. The results tend to be less extreme than those obtained by directly computing Akaike weights. We can use the compare() function to do the calculation. We will do it using the LOO (WAIC is the default, so we use the ic kwarg). The first input is a dictionary containing the MCMC samples, where the keys of the dictionary are the names of the models.

In [11]:
bebi103.stan.compare({'single': samples, 'mixture': samples_mix},
                     log_likelihood='log_lik', ic='loo')
Out[11]:
                  loo     ploo     dloo     weight       se      dse  warning
mixture          3192  5.98241        0   0.975242   21.885        0        0
single        3281.35  1.99128  89.3448  0.0247581  17.3283  17.7765        0

The mixture model is still dominant.

Another example: nonlinear regression

In a previous tutorial, we performed nonlinear regression using optimization methods with the data set from Good et al. (Science, 342, 856–860, 2013). We also analyzed this data set using MCMC. We considered two models for how spindle length depends on droplet diameter.

  1. The spindle length is set; there is no dependence on droplet diameter.
  2. The spindle length is set by the total amount of tubulin available.

We can state the two models as follows.

Model 1 \begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\sigma = \sigma_0\,\phi,\\[1em] &l_i \sim \text{Norm}(\phi, \sigma) \;\forall i. \end{align}


Model 2 \begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\gamma \sim \text{Beta}(2, 2), \\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\sigma = \sigma_0\,\phi,\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma) \;\forall i. \end{align}

Our task now is to compare these two models. Note that model 2 reduces to model 1 in the limit of $\gamma d \gg \phi$, so we have a clear connection between the two models. Let's code up and compile the models, including posterior predictive checks and pointwise log likelihood calculations.
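Before we do, here is a quick numerical sketch of that limit (the parameter values are hypothetical; the Python function mirrors the ell_theor function in the Stan code that follows).

import numpy as np

def spindle_length(d, phi, gamma):
    """Theoretical spindle length (µm) under the conserved tubulin model."""
    return gamma * d / (1 + (gamma * d / phi)**3)**(1 / 3)

# For large droplet diameters, the curve plateaus at phi, recovering model 1
d = np.array([5.0, 50.0, 500.0])
print(spindle_length(d, phi=35.0, gamma=0.9))  # approaches phi = 35 as d grows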

In [12]:
model_code_set_length = """
data {
  int N;
  real ell[N];
}


parameters {
  real phi;
  real sigma_0;
}


transformed parameters {
  real sigma = sigma_0 * phi;
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  sigma_0 ~ gamma(2.0, 10.0);
  
  for (i in 1:N) {
    ell[i] ~ normal(phi, sigma);
  }
}


generated quantities {
  real ell_ppc[N];
  real log_lik[N];
  
  // Posterior predictive checks
  for (i in 1:N) {
    ell_ppc[i] = normal_rng(phi, sigma);
  }
  
  // Pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(ell[i] | phi, sigma);
  }
}
"""

sm_set_length = bebi103.stan.StanModel(model_code=model_code_set_length)
Using cached StanModel.
In [13]:
model_code_cons_tubulin = """
functions {
  real ell_theor(real d, real phi, real gamma) {
    real denom_ratio = (gamma * d / phi)^3;
    return gamma * d / (1 + denom_ratio)^(1.0 / 3.0); 
  }
}


data {
  int N;
  real d[N];
  real ell[N];
}


parameters {
  real phi;
  real gamma;
  real sigma_0;
}


transformed parameters {
  real sigma = sigma_0 * phi;
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  gamma ~ beta(2.0, 2.0);
  sigma_0 ~ gamma(2.0, 10.0);
  
  for (i in 1:N) {
    ell[i] ~ normal(ell_theor(d[i], phi, gamma), sigma);
  }
}


generated quantities {
  real ell_ppc[N];
  real log_lik[N];
  
  // Posterior predictive checks
  for (i in 1:N) {
    ell_ppc[i] = normal_rng(ell_theor(d[i], phi, gamma), sigma);
  }
  
  // Pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(ell[i] | ell_theor(d[i], phi, gamma), sigma);
  }
}
"""

sm_cons_tubulin = bebi103.stan.StanModel(model_code=model_code_cons_tubulin)
Using cached StanModel.

All right! Let's do some sampling, first for model 1.

In [14]:
# Load in Data Frame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')

# Set up data dict
data = dict(N=len(df),
            d=df['Droplet Diameter (um)'].values,
            ell=df['Spindle Length (um)'].values)

# Draw samples
samples_1 = sm_set_length.sampling(data=data)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_1)

# Corner plot
bokeh.io.show(
    bebi103.viz.corner(samples_1, pars=['phi', 'sigma']))
/Users/Justin/anaconda3/lib/python3.7/site-packages/pystan/misc.py:399: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  elif np.issubdtype(np.asarray(v).dtype, float):
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

Since there is no $d$ dependence, we can use an ECDF for our posterior predictive check. I will adjust the percentiles we use in the plot to include the middle 99% of the posterior predictive samples, since we have lots of data points.

In [15]:
bokeh.io.show(bebi103.viz.predictive_ecdf(samples_1, 
                                          percentiles=[99, 70, 50, 30],
                                          name='ell_ppc', 
                                          data=df['Spindle Length (um)'].values,
                                          diff=True,
                                          data_line=False))

None of the data points lie outside the middle 99% envelope. It seems that this model passes the posterior predictive check. Let us now analyze the conserved tubulin model in the same way.

In [20]:
# Draw samples
samples_2 = sm_cons_tubulin.sampling(data=data)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_2)

# Corner plot
bokeh.io.show(
    bebi103.viz.corner(samples_2, pars=['phi', 'gamma', 'sigma'],
                       labels=['φ (µm)', 'γ', 'σ (µm)'],
                       xtick_label_orientation=np.pi/6))
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

This looks fine. The $d$ dependence makes it so we cannot directly use the predictive_ecdf() function. Rather, we should plot the spindle length versus diameter, along with the percentiles from the posterior predictive checks. For regressions of this sort, the bebi103.viz.predictive_regression() function will help with these plots.

In [19]:
bokeh.io.show(bebi103.viz.predictive_regression(
        samples_2, 
        x_axis_label='droplet diameter (µm)',
        y_axis_label='spindle length (µm)',
        percentiles=[99, 70, 50, 30],
        name='ell_ppc', 
        data_x=df['Droplet Diameter (um)'].values,
        data_y=df['Spindle Length (um)'].values,
        data_alpha=0.5,
        diff=False))

Like the set spindle length model, the conserved tubulin model also captures the data set, with only a few of the 670 data points falling just outside the middle 99% envelope of the samples out of the posterior predictive distribution.

So, both models cover the data set and pass the posterior predictive checks. We can then turn to the model comparisons to see which model is closer to the true generative distribution.

In [21]:
bebi103.stan.compare({'set length': samples_1,
                      'conserved tubulin': samples_2},
                     log_likelihood='log_lik',
                     ic='loo')
Out[21]:
                       loo     ploo     dloo       weight       se      dse  warning
conserved tubulin  3680.58  3.23764        0     0.999763   38.274        0        0
set length         4003.17   2.0631  322.597  0.000237141  36.4386  30.9535        0

The conserved tubulin model is much better! When we cannot rule out models based on posterior predictive checks, computing weights based on information criteria allows us to select which model(s) best match the true generative model.

Conclusions

You have learned how to use Stan to sample out of posterior distributions using MCMC. I hope it is evident how convenient and powerful this is. I also hope you have an understanding of how fragile statistical modeling can be, as you saw with the label switching-based nonidentifiability.

We have looked at some visualizations of MCMC results in this tutorial, and in the next one, we will take a closer look at how to visualize and report MCMC results.

Computing environment

In [22]:
%load_ext watermark
In [23]:
%watermark -v -p numpy,pandas,pystan,arviz,bokeh,bebi103,jupyterlab
CPython 3.7.0
IPython 7.1.1

numpy 1.15.4
pandas 0.23.4
pystan 2.17.1.0
arviz 0.2.1
bokeh 1.0.1
bebi103 0.0.40
jupyterlab 0.35.3