(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import statsmodels.tools.numdiff as smnd
import pymc3 as pm
import theano.tensor as tt
import bebi103
import bokeh.io
import bokeh.plotting
import bokeh.layouts
# Display graphics in this notebook
bokeh.io.output_notebook()
As a reminder of the material in lecture 5, our goal is to compute the probability that a given model is true, given the data. By Bayes's theorem, this is
g(Miβ£D)=f(Dβ£Mi)g(Mi)f(D).
Unless we can somehow sum probabilities over all models to compute
f(D)=βiβall modelsf(Dβ£Mi)g(Mi),
we cannot compute g(Miβ£D) exactly. Instead, we compare a pair of models, Mi and Mj, by computing the odds ratio,
Oij=g(Mi)g(Mj)f(Dβ£Mi)f(Dβ£Mj).
The second fraction in this equation is the evidence from the parameter estimation problem. By Bayes's theorem, we have, for the parameter estimation problem,
g(ΞΈiβ£D,Mi)=f(Dβ£ΞΈi,Mi,I)g(ΞΈiβ£Mi,I)f(Dβ£Mi),
where the parameters are ΞΈi. Because the posterior of the parameter estimation problem must be normalized,
f(Dβ£Mi)=β«dΞΈif(Dβ£ΞΈi,Mi)P(ΞΈiβ£Mi).
We can evaluate this integral by approximating the posterior as sharply peaked and then using the Laplace approximation to compute the integral.
An alternative to this approach is to use information criteria. From MCMC samples, we can compute the Watanabe-Akaike Information criterion (WAIC), which is a measure of the predictive performance of a statistical model. We can compare two models by computing their respective Akaike weights,
wi=exp[β12WAICi]exp[β12WAICi]+exp[β12WAICj].
The WAIC (and indeed the Akaike weights) may be computed automatically using PyMC3.
Recall from Tutorial 4a that we were considering two different theoretical models for the spindle length vs. droplet size data presented in the Good, et al. paper. You can download the data here.
In Model A, the spindle size is independent of droplet size. The corresponding equation is
l=ls.
We assume the data are Gaussian distributed about this ls, which an unknown variance Ο.
We define by Model B the full relation between spindle length and droplet diameter,
l(d;Ξ³,Ο)βΞ³d(1+(d/Ο)3)13.
As in Model A, we assume the data are Gaussian distributed about this theoretical length, which an unknown variance Ο.
As given in lecture 5, the approximate odds ratio is
OABβ(g(A)g(B))(f(Dβ£lβs,Οβ,A)f(Dβ£Οβ,Ξ³β,Οβ,B))(g(lβs,Οββ£A)2ΟβdetΞ£AP(Οβ,Ξ³β,Οββ£B)(2Ο)3/2βdetΞ£B),
where the asterisks denote the most probable parameter values (the MAP), and Ξ£i is the covariance matrix computed for Model i. So, to compute the odds ratio, we need to first specify the prior odds of the two models. We will assume that
g(A)P(B)β1,
since we are not sure which model is correct a priori (that was the whole purpose of the experiment). We also need to specify the respective priors. We will assume that, Ο, Ξ³, and Ο are all independent, as are ls and Ο. Further, g(Ο) is the same for both model A and B. Since ls=Ξ³Ο in the large d regime, we take g(Ο)=Ξ³β1g(ls). Finally, we know that 0β€Ξ³β€1 by physical considerations. We will take a uniform prior for Ξ³, so the ratio
g(lβs,Οββ£A)g(Οβ,Ξ³β,Οββ£B)=1/Ξ³β.
All we have left to do is to compute the MAP and the covariance at the MAP for each model, and then to evaluate the likelihood at the MAP. So, we will first define the likelihoods for both models.
In Tutorial 4a, we found the MAP and covariance for both Model A and Model B. The problem, though is that we always worked with the marginalized posterior, where Ο, the error in the data, was eliminated from the problem. Here, we need to find Οβ, the most probable variance in the data, and use that in computing the (unmarginalized) likelihood. There is an analytical solution for Model A, but we will just compute both MAPs using scipy.optimize.minimize()
. We'll start by defining the functions we need, first for Model A. Note that our array of parameters, p
, now has sigma
in it. As a reminder, if l(d;ΞΈ) is the theoretically predicted spindle length for a droplet of diameter d depending on parameters ΞΈ under model M, the likelihood is
f({di,li}β£ΞΈ,M)=βi1β2ΟΟ2exp{β(liβl(di;ΞΈ)22Ο2}.
def spindle_length(params, d, model):
"""
Theoretical models for spindle length.
"""
if model == 'A':
return params[0] * np.ones_like(d)
elif model == 'B':
phi, gamma, _ = params
return gamma * d / np.cbrt(1 + (d / phi)**3)
else:
raise RuntimeError('Model not properly specified.')
def resid(params, d, ell, model):
"""
Residuals for spindle length model.
"""
return ell - spindle_length(params, d, model)
def log_likelihood(params, d, ell, model):
"""
Log likelihood for mitotic spindle length vs droplet size.
"""
sigma = params[-1]
return -np.sum(resid(params, d, ell, model)**2) / 2 / sigma**2 \
- len(ell) / 2 * np.log(2 * np.pi * sigma**2)
def log_prior(params, model):
"""
Log prior for mitotic spindle length vs droplet size.
"""
if (params < 0).any():
return -np.inf
if model == 'B' and params[1] > 1:
return -np.inf
if model == 'A':
return -np.log(params[-1])
else:
return -np.log(params[0]) - np.log(params[-1])
def log_posterior(params, d, ell, model):
"""
Log posterior for mitotic spindle length vs droplet size.
Model A:
params[0] = phi
params[1] = sigma
Model B:
params[0] = phi
params[1] = gamma
params[2] = sigma
"""
lp = log_prior(params, model)
if lp == -np.inf:
return -np.inf
return lp + log_likelihood(params, d, ell, model)
def neg_log_posterior(params, d, ell, model):
return -log_posterior(params, d, ell, model)
We can now proceed with MAP finding using Powell's method.
# Load data into DataFrame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')
# Extra arguments as a tuple
args = (df['Droplet Diameter (um)'].values, df['Spindle Length (um)'].values)
# Model A
p0 = np.array([40.0, 5.0])
args_A = args + ('A',)
res = scipy.optimize.minimize(neg_log_posterior,
p0,
args=args_A,
method='powell')
popt_A = res.x
cov_A = -np.linalg.inv(smnd.approx_hess(popt_A, log_posterior, args=args_A))
# Model B
p0 = np.array([40, 0.5, 5])
args_B = args + ('B',)
res = scipy.optimize.minimize(neg_log_posterior,
p0,
args=args_B,
method='powell')
popt_B = res.x
cov_B = -np.linalg.inv(smnd.approx_hess(popt_B, log_posterior, args=args_B))
Note that that calculation was very fast. As you will soon see, it is much more code that putting together a model to run MCMC on, but the advantage is speed.
Now, just as a reminder, let's plot our best-fit results.
p = bokeh.plotting.figure(plot_width=600,
plot_height=350,
x_axis_label='droplet diameter (Β΅m)',
y_axis_label='spindle length (Β΅m)')
# Plot data
p.circle(df['Droplet Diameter (um)'], df['Spindle Length (um)'], alpha=0.3)
# Smooth droplet diameter for plotting
d_plot = np.linspace(0, 250, 300)
# Theoretical curves
p.line(d_plot, spindle_length(popt_A, d_plot, 'A'), color='#beaed4', line_width=3)
p.line(d_plot, spindle_length(popt_B, d_plot, 'B'), color='#fdc086', line_width=3)
bokeh.io.show(p)
We can now compute the components of the approximate odds ratio. As is often good practice, we will compute the log odds ratio. We will do it piece by piece. As we already established, the prior ratio is unity. Now, let's look at the goodness of fit ratio.
log_good_fit_ratio = log_likelihood(popt_A, *args_A) - \
log_likelihood(popt_B, *args_B)
print('Goodness of fit ratio:', np.exp(log_good_fit_ratio))
Wow. Model A really gives a lousy fit to the data compared to model B. Now, let's see if the Occam factor ratio can overwhelm this bad goodness of fit. Remember that a model with fewer parameters, like Model A, has a bigger Occam factor, making it more probable.
log_occam_factor = (-np.log(2 * np.pi) + np.log(np.linalg.det(cov_A))
- np.log(np.linalg.det(cov_B))) / 2
print('Occam factor ratio:', np.exp(log_occam_factor))
Yes, the Occam factor penalizes Model B, but nowhere near enough to compensate for the superior goodness of fit it provides. So, we can compute approximate odds ratio as follows.
log_odds_ratio = log_good_fit_ratio + log_occam_factor
print('Odds ratio:', np.exp(log_odds_ratio))
We, we conclude that Model B is waaaaaay more probable than Model A.
We will now take another approach where we assess the models based on their WAIC and Akaike weights. We first need to set up PyMC3 models for sampling.
# Get Numpy arrays for speed
d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values
with pm.Model() as set_length_model:
# Priors
l_s = pm.Uniform('l_s', lower=0, upper=1000)
sigma = bebi103.pm.Jeffreys('sigma', lower=0.1, upper=50)
# Likelihood
ell_obs = pm.Normal('ell_obs', mu=l_s, sd=sigma, observed=ell)
And now the conserved tubulin model.
def spindle_length(d, gamma, phi):
"""Spindle length as a function of droplet diameter."""
return gamma * d / (1 + (d/phi)**3)**(1/3)
with pm.Model() as cons_tubulin_model:
# Priors
gamma = pm.Uniform('gamma', lower=0, upper=1)
phi = bebi103.pm.Jeffreys('phi', lower=1, upper=100)
sigma = bebi103.pm.Jeffreys('sigma', lower=0.1, upper=50)
# Compute spindle length
ell_theor = spindle_length(d, gamma, phi)
# Likelihood
ell_obs = pm.Normal('ell_obs', mu=ell_theor, sd=sigma, observed=ell)
With the models in hand, let's get some samples!
with set_length_model:
trace_set_length = pm.sample(draws=10000,
tune=10000,
init='advi+adapt_diag',
njobs=4)
with cons_tubulin_model:
trace_cons_tubulin = pm.sample(draws=10000,
tune=10000,
init='advi+adapt_diag',
njobs=4)
PyMC3 has a very convenient function, pm.compare()
, that computes the WAIC and Akaike weights,etc. We pass in a tuple of traces, along with a tuple of models under which they were calculated, and the pm.compare()
function does the rest.
pm.compare((trace_set_length, trace_cons_tubulin), (set_length_model, cons_tubulin_model))
The result is a data frame with the comparison results. The index is an integer corresponding to the order of the inputs. In this case, index 0 corresponds to the set length model, and index 1 corresponds to the conserved tubulin model. The WAIC
column gives the WAIC for the respective models. This is the log predictive posterior density (the lppd) plus the effective number of parameters, pWAIC. The pWAIC
column gives, you guessed it, the effective number of parameters. The 'dWAIC' column gives the difference in the WAICs (always with the lowest subtracted out). This is akin to the log odds ratio, in the sense that it is the logarithm of a ratio of predictive likelihoods. The weight
column contains the Akaike weights. The SE
is the error in the WAIC measurement as computed by a techinque called Bayesian bootstrap, which we will not discuss in detail in this course. The dSE
column is the error in the dWAIC. Finally, the warning
column alerts you to any warning flags in the calculation.
Using the WAIC, we again see that the conserved tubulin model is overwhelmingly favored.
Finally, as a sanity check, it is always a good idea to check out the posterior.
g = bebi103.viz.corner(trace_cons_tubulin,
vars=['gamma', 'phi', 'sigma'],
plot_width=200)
bokeh.io.show(g)
I will update this tutorial with a second example, from the Singer, et al. data, soon.