Tutorial 6b: Model selection

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

In [1]:
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import statsmodels.tools.numdiff as smnd

import pymc3 as pm
import theano.tensor as tt

import bebi103

import bokeh.io
import bokeh.plotting
import bokeh.layouts

# Display graphics in this notebook
bokeh.io.output_notebook()
Loading BokehJS ...

Reminder of Bayesian model comparison problem

As a reminder of the material in lecture 5, our goal is to compute the probability that a given model is true, given the data. By Bayes's theorem, this is

\begin{align} g(M_i\mid D) = \frac{f(D \mid M_i)\,g(M_i)}{f(D)}. \end{align}

Unless we can somehow sum probabilities over all models to compute

\begin{align} f(D) = \sum_{i\in\text{all models}} f(D \mid M_i)\,g(M_i), \end{align}

we cannot compute $g(M_i\mid D)$ exactly. Instead, we compare a pair of models, $M_i$ and $M_j$, by computing the odds ratio,

\begin{align} O_{ij} = \frac{g(M_i)}{g(M_j)}\,\frac{f(D\mid M_i)}{f(D\mid M_j)}. \end{align}

The second fraction in this equation is the evidence from the parameter estimation problem. By Bayes's theorem, we have, for the parameter estimation problem,

\begin{align} g(\theta_i\mid D, M_i) = \frac{f(D\mid \theta_i, M_i)\,g(\theta_i\mid M_i)}{f(D\mid M_i)}, \end{align}

where the parameters are $\theta_i$. Because the posterior of the parameter estimation problem must be normalized,

\begin{align} f(D\mid M_i) = \int \mathrm{d}\theta_i\,f(D\mid \theta_i, M_i)\,g(\theta_i\mid M_i). \end{align}

We can evaluate this integral by assuming the posterior is sharply peaked about the MAP and applying the Laplace approximation.
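
To make this concrete in code: if $\theta^*$ is the MAP of a model with $k$ parameters and $\mathsf{\Sigma}$ is the covariance matrix there (the inverse of the negative Hessian of the log posterior), the Laplace approximation gives $\ln f(D\mid M) \approx \ln\left[f(D\mid\theta^*, M)\,g(\theta^*\mid M)\right] + \frac{k}{2}\,\ln 2\pi + \frac{1}{2}\,\ln\det\mathsf{\Sigma}$. Below is a minimal sketch of that calculation; the function name and signature are our own, not from any library.

import numpy as np

def log_evidence_laplace(log_posterior_at_map, cov):
    """
    Laplace approximation to the log evidence, ln f(D | M).

    log_posterior_at_map : unnormalized log posterior (log likelihood
        plus log prior) evaluated at the MAP.
    cov : covariance matrix at the MAP, i.e., the inverse of the
        negative Hessian of the log posterior.
    """
    n_params = cov.shape[0]
    return (log_posterior_at_map
            + n_params / 2 * np.log(2 * np.pi)
            + 0.5 * np.log(np.linalg.det(cov)))

The log odds ratio is then the difference of two such log evidences, plus the log prior odds of the models.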

An alternative to this approach is to use information criteria. From MCMC samples, we can compute the Watanabe-Akaike information criterion (WAIC), which is a measure of the predictive performance of a statistical model. We can compare two models by computing their respective Akaike weights,

\begin{align} w_i = \frac{\exp\left[-\frac{1}{2}\,WAIC_i\right]}{\exp\left[-\frac{1}{2}\,WAIC_i\right] + \exp\left[-\frac{1}{2}\,WAIC_j\right]}. \end{align}

The WAIC (and indeed the Akaike weights) may be computed automatically using PyMC3.
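
For concreteness, here is a small illustrative helper (our own, not a library function) that computes Akaike weights for any number of models from their WAIC values; the two-model formula above generalizes by normalizing over all models under consideration.

import numpy as np

def akaike_weights(waics):
    """
    Akaike weights from a sequence of WAIC values. Subtracting off
    the minimum WAIC before exponentiating avoids numerical underflow.
    """
    waics = np.asarray(waics)
    log_w = -0.5 * (waics - waics.min())
    w = np.exp(log_w)
    return w / w.sum()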

Model selection for the Good, et al. data

Recall from Tutorial 4a that we were considering two different theoretical models for the spindle length vs. droplet size data presented in the Good, et al. paper. You can download the data here.

Model A

In Model A, the spindle size is independent of droplet size. The corresponding equation is

\begin{align} l = l_s. \end{align}

We assume the data are Gaussian distributed about this $l_s$, with an unknown standard deviation $\sigma$.

Model B

We define by Model B the full relation between spindle length and droplet diameter,

\begin{align} l(d;\gamma,\phi) \approx \frac{\gamma d}{\left(1+(d/\phi)^3\right)^{\frac{1}{3}}}. \end{align}

As in Model A, we assume the data are Gaussian distributed about this theoretical length, with an unknown standard deviation $\sigma$.

The approximate odds ratio

As given in lecture 5, the approximate odds ratio is

\begin{align} O_{AB} \approx \left(\frac{g(A)}{g(B)}\right) \left(\frac{f(D\mid l_s^*, \sigma^*, A)}{f(D\mid \phi^*, \gamma^*, \sigma^*, B)}\right) \left(\frac{g(l_s^*, \sigma^* \mid A) \,2\pi\,\sqrt{\det\mathsf{\Sigma}_A}} {g(\phi^*, \gamma^*, \sigma^*\mid B)\,(2\pi)^{3/2} \sqrt{\det\mathsf{\Sigma}_B}}\right), \end{align}

where the asterisks denote the most probable parameter values (the MAP), and $\mathsf{\Sigma}_i$ is the covariance matrix computed for Model $i$. So, to compute the odds ratio, we need to first specify the prior odds of the two models. We will assume that

\begin{align} \frac{g(A)}{g(B)} \approx 1, \end{align}

since we are not sure a priori which model is correct (that was the whole purpose of the experiment). We also need to specify the respective priors. We will assume that $\phi$, $\gamma$, and $\sigma$ are all independent, as are $l_s$ and $\sigma$. Further, $g(\sigma)$ is the same for both models A and B. Since $l_s = \gamma \phi$ in the large $d$ regime, a change of variables ($\mathrm{d}l_s = \gamma\,\mathrm{d}\phi$) gives $g(\phi) = \gamma\, g(l_s)$. Finally, we know that $0 \le \gamma \le 1$ by physical considerations. We will take a uniform prior for $\gamma$, so the ratio

\begin{align} \frac{g(l_s^*, \sigma^*\mid A)} {g(\phi^*, \gamma^*, \sigma^*\mid B)} = 1/\gamma^*. \end{align}
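
To spell out the arithmetic behind this result: the $g(\sigma^*)$ factors cancel, $g(\gamma^*) = 1$ for the uniform prior, and $g(\phi^*) = \gamma^*\,g(l_s^*)$, so

\begin{align} \frac{g(l_s^*, \sigma^*\mid A)}{g(\phi^*, \gamma^*, \sigma^*\mid B)} = \frac{g(l_s^*)\,g(\sigma^*)}{\gamma^*\,g(l_s^*)\cdot 1 \cdot g(\sigma^*)} = \frac{1}{\gamma^*}. \end{align}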

All we have left to do is to compute the MAP and the covariance at the MAP for each model, and then to evaluate the likelihood at the MAP. So, we will first define the likelihoods for both models.

In Tutorial 4a, we found the MAP and covariance for both Model A and Model B. The problem, though, is that we always worked with the marginalized posterior, where $\sigma$, the error in the data, was eliminated from the problem. Here, we need to find $\sigma^*$, the most probable value of $\sigma$, and use it in computing the (unmarginalized) likelihood. There is an analytical solution for Model A, but we will just compute both MAPs using scipy.optimize.minimize(). We'll start by defining the functions we need, first for Model A. Note that our array of parameters, params, now has sigma in it. As a reminder, if $l(d;\theta)$ is the theoretically predicted spindle length for a droplet of diameter $d$ depending on parameters $\theta$ under model $M$, the likelihood is

\begin{align} f(\{d_i, l_i\}\mid \theta, M) = \prod_i \frac{1}{\sqrt{2\pi \sigma^2}}\, \exp\left\{-\frac{\left(l_i - l(d_i; \theta)\right)^2}{2\sigma^2}\right\}. \end{align}

In [2]:
def spindle_length(params, d, model):
    """
    Theoretical models for spindle length.
    """
    if model == 'A':
        return params[0] * np.ones_like(d)
    elif model == 'B':
        phi, gamma, _ = params
        return gamma * d / np.cbrt(1 + (d / phi)**3)
    else:
        raise RuntimeError('Model not properly specified.')

        
def resid(params, d, ell, model):
    """
    Residuals for spindle length model.
    """
    return ell - spindle_length(params, d, model)


def log_likelihood(params, d, ell, model):
    """
    Log likelihood for mitotic spindle length vs droplet size.
    """
    sigma = params[-1]
    
    return -np.sum(resid(params, d, ell, model)**2) / 2 / sigma**2 \
                - len(ell) / 2 * np.log(2 * np.pi * sigma**2)
    
def log_prior(params, model):
    """
    Log prior for mitotic spindle length vs droplet size.
    """
    if (params < 0).any():
        return -np.inf

    if model == 'B' and params[1] > 1:
        return -np.inf

    if model == 'A':
        return -np.log(params[-1])
    else:
        return -np.log(params[0]) - np.log(params[-1])


def log_posterior(params, d, ell, model):
    """
    Log posterior for mitotic spindle length vs droplet size.
    Model A:
        params[0] = l_s
        params[1] = sigma
    Model B:
        params[0] = phi
        params[1] = gamma
        params[2] = sigma    
    """
    lp = log_prior(params, model)
    
    if lp == -np.inf:
        return -np.inf
    
    return lp + log_likelihood(params, d, ell, model)


def neg_log_posterior(params, d, ell, model):
    """Negative log posterior, for use with scipy.optimize.minimize()."""
    return -log_posterior(params, d, ell, model)
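
As a quick smoke test of these functions, we can evaluate the log posterior at made-up parameter values on made-up data (the numbers below are arbitrary, purely to check that the code runs).

# Hypothetical data and parameter values, just to exercise the functions
d_test = np.array([20.0, 60.0, 150.0])
ell_test = np.array([25.0, 33.0, 38.0])

# Model A: params = (l_s, sigma); Model B: params = (phi, gamma, sigma)
print(log_posterior(np.array([35.0, 4.0]), d_test, ell_test, 'A'))
print(log_posterior(np.array([45.0, 0.8, 4.0]), d_test, ell_test, 'B'))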

We can now proceed with MAP finding using Powell's method.

In [3]:
# Load data into DataFrame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')

# Extra arguments as a tuple
args = (df['Droplet Diameter (um)'].values, df['Spindle Length (um)'].values)

# Model A
p0 = np.array([40.0, 5.0])
args_A = args + ('A',)
res = scipy.optimize.minimize(neg_log_posterior,
                              p0,
                              args=args_A, 
                              method='powell')
popt_A = res.x
# Covariance: inverse of the negative Hessian of the log posterior at the MAP
cov_A = -np.linalg.inv(smnd.approx_hess(popt_A, log_posterior, args=args_A))

# Model B
p0 = np.array([40, 0.5, 5])
args_B = args + ('B',)
res = scipy.optimize.minimize(neg_log_posterior,
                              p0,
                              args=args_B, 
                              method='powell')
popt_B = res.x
# Covariance at the MAP, as for Model A
cov_B = -np.linalg.inv(smnd.approx_hess(popt_B, log_posterior, args=args_B))

Note that that calculation was very fast. As you will soon see, it takes much more code than putting together a model to run MCMC on, but the advantage is speed.
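
It is worth printing the MAP estimates to see what the optimizer found (we do not reproduce the numbers here; they come from your run).

print('Model A MAP (l_s, sigma):', popt_A)
print('Model B MAP (phi, gamma, sigma):', popt_B)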

Now, just as a reminder, let's plot our best-fit results.

In [4]:
p = bokeh.plotting.figure(plot_width=600,
                          plot_height=350,
                          x_axis_label='droplet diameter (µm)',
                          y_axis_label='spindle length (µm)')

# Plot data
p.circle(df['Droplet Diameter (um)'], df['Spindle Length (um)'], alpha=0.3)

# Smooth droplet diameter for plotting
d_plot = np.linspace(0, 250, 300)

# Theoretical curves (Model A in purple, Model B in orange)
p.line(d_plot, spindle_length(popt_A, d_plot, 'A'), color='#beaed4', line_width=3)
p.line(d_plot, spindle_length(popt_B, d_plot, 'B'), color='#fdc086', line_width=3)

bokeh.io.show(p)

We can now compute the components of the approximate odds ratio. As is often good practice, we will work with the log odds ratio, computing it piece by piece. As we already established, the prior ratio is unity, so let's look first at the goodness of fit ratio.

In [5]:
log_good_fit_ratio = (log_likelihood(popt_A, *args_A)
                      - log_likelihood(popt_B, *args_B))

print('Goodness of fit ratio:', np.exp(log_good_fit_ratio))
Goodness of fit ratio: 3.29459302752e-71

Wow. Model A really gives a lousy fit to the data compared to Model B. Now, let's see if the Occam factor ratio can overwhelm this bad goodness of fit. Remember that a model with fewer parameters, like Model A, has a bigger Occam factor, making it more probable.

In [6]:
log_occam_factor = (-np.log(2 * np.pi) + np.log(np.linalg.det(cov_A)) 
                    - np.log(np.linalg.det(cov_B))) / 2

print('Occam factor ratio:', np.exp(log_occam_factor))
Occam factor ratio: 19.9720013511

Yes, the Occam factor penalizes Model B, but nowhere near enough to compensate for the superior goodness of fit it provides. (Note that we have left the prior ratio factor of $1/\gamma^*$ out of this Occam term; we will fold it in below.) So, we can compute the approximate odds ratio as follows.

In [7]:
log_odds_ratio = log_good_fit_ratio + log_occam_factor

print('Odds ratio:', np.exp(log_odds_ratio))
Odds ratio: 6.5799616397e-70

So, we conclude that Model B is waaaaaay more probable than Model A.
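
Strictly speaking, the third factor in our expression for $O_{AB}$ also contains the prior ratio $1/\gamma^*$, which we neglected above. Since $\gamma^*$ is of order one, it cannot change the conclusion, but folding it in is a one-liner ($\gamma^*$ is stored in popt_B[1]).

# Fold in the neglected prior ratio 1/gamma*, with gamma* = popt_B[1]
log_odds_ratio_full = log_odds_ratio - np.log(popt_B[1])
print('Odds ratio with prior factor:', np.exp(log_odds_ratio_full))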

The Akaike weights

We will now take another approach where we assess the models based on their WAIC and Akaike weights. We first need to set up PyMC3 models for sampling.

In [8]:
# Get Numpy arrays for speed
d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values

with pm.Model() as set_length_model:
    # Priors
    l_s = pm.Uniform('l_s', lower=0, upper=1000)
    sigma = bebi103.pm.Jeffreys('sigma', lower=0.1, upper=50)
    
    # Likelihood
    ell_obs = pm.Normal('ell_obs', mu=l_s, sd=sigma, observed=ell)

And now the conserved tubulin model.

In [9]:
def spindle_length(d, gamma, phi):
    """Spindle length as a function of droplet diameter."""
    # Use a power of 1/3 rather than np.cbrt, since np.cbrt does not
    # operate on Theano tensors. (This overwrites our earlier
    # spindle_length() function, which we no longer need.)
    return gamma * d / (1 + (d/phi)**3)**(1/3)

with pm.Model() as cons_tubulin_model:
    # Priors
    gamma = pm.Uniform('gamma', lower=0, upper=1)
    phi = bebi103.pm.Jeffreys('phi', lower=1, upper=100)
    sigma = bebi103.pm.Jeffreys('sigma', lower=0.1, upper=50)

    # Compute spindle length
    ell_theor = spindle_length(d, gamma, phi)

    # Likelihood
    ell_obs = pm.Normal('ell_obs', mu=ell_theor, sd=sigma, observed=ell)

With the models in hand, let's get some samples!

In [10]:
with set_length_model:
    trace_set_length = pm.sample(draws=10000,
                                 tune=10000,
                                 init='advi+adapt_diag',
                                 njobs=4)
    
with cons_tubulin_model:
    trace_cons_tubulin = pm.sample(draws=10000,
                                   tune=10000,
                                   init='advi+adapt_diag',
                                   njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 2,157.8:  10%|▉         | 19060/200000 [00:10<01:43, 1747.09it/s]
Convergence archived at 19100
Interrupted at 19,099 [9%]: Average Loss = 2.6039e+06
100%|██████████| 20000/20000 [00:13<00:00, 1467.99it/s]
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 1,930.4:   8%|▊         | 16447/200000 [00:11<02:03, 1483.03it/s]
Convergence archived at 16600
Interrupted at 16,599 [8%]: Average Loss = 11,885
 98%|█████████▊| 19673/20000 [01:21<00:01, 240.30it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 1 does not match the target. It is 0.888981196172, but should be close to 0.8. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
100%|██████████| 20000/20000 [01:22<00:00, 241.05it/s]
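
Before using the samples, it is a good idea to check convergence. Assuming your version of PyMC3 exposes pm.gelman_rubin() (it did in the 3.x releases of this era), we can compute the Gelman-Rubin statistic for each parameter; values near one indicate that the chains have converged.

# Gelman-Rubin statistic for each variable; values near 1 are good
print(pm.gelman_rubin(trace_set_length))
print(pm.gelman_rubin(trace_cons_tubulin))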

PyMC3 has a very convenient function, pm.compare(), that computes the WAIC, the Akaike weights, and more. We pass in a tuple of traces, along with a tuple of the models under which they were computed, and pm.compare() does the rest.

In [11]:
pm.compare((trace_set_length, trace_cons_tubulin), (set_length_model, cons_tubulin_model))
Out[11]:
     WAIC     pWAIC  dWAIC   weight  SE     dSE    warning
1    3680.61  3.25   0       1       38.29  0      0
0    4003.04  1.99   322.43  0       36.46  30.89  0

The result is a data frame with the comparison results. The index is an integer corresponding to the order of the inputs; in this case, index 0 corresponds to the set length model and index 1 to the conserved tubulin model. The WAIC column gives the WAIC for the respective models, computed on the deviance scale as $-2(\text{lppd} - p_\mathrm{WAIC})$, where lppd is the log pointwise predictive density and $p_\mathrm{WAIC}$ is the effective number of parameters. The pWAIC column gives, you guessed it, the effective number of parameters. The dWAIC column gives the difference in the WAICs (always with the lowest subtracted out). This is akin to a log odds ratio, in the sense that it is proportional to the logarithm of a ratio of predictive densities. The weight column contains the Akaike weights. The SE column gives the error in the WAIC estimate, as computed by a technique called the Bayesian bootstrap, which we will not discuss in detail in this course. The dSE column is the error in dWAIC. Finally, the warning column alerts you to any warning flags raised during the calculation.
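
As a sanity check on the weight column, we can convert the tabulated dWAIC into an Akaike weight by hand. For two models, the weight of the better one is $1/(1 + \mathrm{e}^{-\text{dWAIC}/2})$, which for dWAIC = 322.43 is one to machine precision.

# Akaike weight of the better model from the tabulated dWAIC
dwaic = 322.43
w_best = 1 / (1 + np.exp(-dwaic / 2))
print(w_best)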

Using the WAIC, we again see that the conserved tubulin model is overwhelmingly favored.

Finally, as a sanity check, it is always a good idea to check out the posterior.

In [12]:
g = bebi103.viz.corner(trace_cons_tubulin,
                       vars=['gamma', 'phi', 'sigma'],
                       plot_width=200)
bokeh.io.show(g)
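
For a numerical summary of the posterior to go with the corner plot, we can also use pm.summary(), assuming it is available in your PyMC3 version.

# Means, standard deviations, and credible intervals of the parameters
pm.summary(trace_cons_tubulin, varnames=['gamma', 'phi', 'sigma'])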

Model selection for the Singer, et al. data

I will update this tutorial with a second example, from the Singer, et al. data, soon.