Auxiliary lesson 8: Tips for MCMC

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import math
import pickle

import numpy as np
import numba
import pandas as pd
import scipy.stats as st
import theano.tensor as tt
import theano

import pymc3 as pm

import bebi103

import bokeh.io
bokeh.io.output_notebook()

In this tutorial, we will explore some tips and traps of performing Bayesian analyses using MCMC.

Priors for positive definite matrices

In the example homework, we considered a bivariate Gaussian model for measured beak depths and beak lengths from the Grant and Grant data set. I repeat that analysis here to show how we can set up priors for positive definite matrices.

We start by loading in the data set and adjusting column headings for convenience.

In [2]:
# Read data
df = pd.read_csv('../data/grant_and_grant_2014.csv')

# Rename columns
df.columns = ('offspring_bd', 'male_bd', 'female_bd')

We will need the mean parental beak depth for our regression, so we create a new column in the DataFrame that has that.

In [3]:
# Make a new column with the mean of the male and female parent columns
df['parent_bd'] = (df['male_bd'] + df['female_bd']) / 2

We'll first plot the data to see what we are dealing with.

In [4]:
p = bokeh.plotting.figure(width=450, 
                          height=300, 
                          x_axis_label='parental beak depth (mm)',
                          y_axis_label='offspring beak depth (mm)')
p.circle(df['parent_bd'], df['offspring_bd'])
bokeh.io.show(p)

By eye, we see correlation between the parental and offspring beak depths. Let's now write down our statistical model. We will take the likelihood to be a bivariate Gaussian with mean $\boldsymbol{\mu} = (\mu_p, \mu_o)^T$ and covariance matrix

\begin{align} \mathsf{\Sigma} = \begin{pmatrix} \sigma_p^2 & \sigma_{po} \\ \sigma_{po} & \sigma_o^2 \end{pmatrix}, \end{align}

where the subscripts $p$ and $o$ denote respectively parents and offspring. Thus, the likelihood for a single parent/offspring beak depth pair, $\mathbf{d} = (d_p, d_o)^T$ is

\begin{align} f(\mathbf{d}\mid\boldsymbol{\mu},\mathsf{\Sigma}) = \frac{1}{2\pi\sqrt{\mathrm{det}(\mathsf{\Sigma})}}\,\exp\left[-\frac{1}{2}(\mathbf{d}-\boldsymbol{\mu})^T\cdot\mathsf{\Sigma}^{-1}\cdot(\mathbf{d}-\boldsymbol{\mu})\right]. \end{align}

Taking all $n$ parent/offspring pairs in the data set $D$ together, we have

\begin{align} f(D\mid\boldsymbol{\mu},\mathsf{\Sigma}) = \left((2\pi)^2\,\mathrm{det}(\mathsf{\Sigma})\right)^{-n/2}\,\exp\left[-\frac{1}{2}\sum_{\mathbf{d}_i\in D}(\mathbf{d}_i-\boldsymbol{\mu})^T\cdot\mathsf{\Sigma}^{-1}\cdot(\mathbf{d}_i-\boldsymbol{\mu})\right]. \end{align}
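As a quick sanity check of this expression, we can evaluate the log likelihood numerically with scipy.stats.multivariate_normal. The parameter values below are arbitrary placeholders chosen only for illustration; they are not estimates.

In [ ]:
# Evaluate the log likelihood at arbitrary trial parameter values (illustration only)
d = df[['parent_bd', 'offspring_bd']].values
mu_trial = np.array([9.0, 9.0])           # trial (mu_p, mu_o) in mm
Sigma_trial = np.array([[0.6, 0.4],
                        [0.4, 0.6]])      # trial covariance matrix in mm^2
log_like = st.multivariate_normal.logpdf(d, mean=mu_trial, cov=Sigma_trial).sum()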

In specifying our priors, we will assume that $\mu_p$ and $\mu_o$ are independent parameters and take them to have uniform priors. The prior for $\mathsf{\Sigma}$ is trickier. The covariance matrix has three independent entries, $\sigma_p^2$, $\sigma_o^2$, and $\sigma_{po}$. We could choose Jeffreys priors for each of these three entries, but then we would have a problem: the matrix $\mathsf{\Sigma}$ must be positive definite. To aid in thinking about this, recall that we can write the covariance matrix as

\begin{align} \mathsf{\Sigma} = \mathrm{diag}(\boldsymbol{\sigma}) \cdot \mathsf{\Omega} \cdot \mathrm{diag}(\boldsymbol{\sigma}), \end{align}

where $\boldsymbol{\sigma}$ is the array of standard deviations of each variable, and $\mathsf{\Omega}$ is the correlation matrix. In general, if matrices $\mathsf{A}$ and $\mathsf{B}$ are positive definite, then $\mathsf{A}\cdot\mathsf{B}\cdot\mathsf{A}$ is positive definite. Because the diagonal matrices $\mathrm{diag}(\boldsymbol{\sigma})$ have strictly positive diagonal entries and are therefore positive definite, the covariance matrix $\mathsf{\Sigma}$ is positive definite if the correlation matrix $\mathsf{\Omega}$ is positive definite.

In the present case with two variables (parental and offspring beak depth), the matrices are 2$\times$2, with

\begin{align} \mathsf{\Omega} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \end{align}

where $\rho = \sigma_{po}/(\sigma_p\sigma_o)$ is the Pearson correlation. Positive definiteness is thus assured if $|\rho| < 1$. So, in this case, we could specify priors for $\sigma_o$ and $\sigma_p$, ensuring they are positive (with something like a Jeffreys prior), and also specify a prior for $\rho$, ensuring $-1 < \rho < 1$.
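As a quick numerical check of this criterion (the value of $\rho$ below is arbitrary), the eigenvalues of the 2$\times$2 correlation matrix are $1 \pm \rho$, so a Cholesky factorization succeeds only when $|\rho| < 1$.

In [ ]:
# Check positive definiteness of a 2x2 correlation matrix (rho is arbitrary)
rho = 0.7
Omega = np.array([[1, rho],
                  [rho, 1]])
print(np.linalg.eigvalsh(Omega))   # eigenvalues: 1 - rho and 1 + rho
np.linalg.cholesky(Omega)          # succeeds; raises LinAlgError if |rho| >= 1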

Another option is to use an LKJ prior (Lewandowski, Kurowicka, and Joe). This is a convenient method for specifying priors for positive definite correlation matrices. (Wow, Lewandowski can score five goals in nine minutes and derive a clever algorithm for priors on positive definite matrices! Amazing. Or maybe it's a different Lewandowski.) We will employ this, since it is conveniently built into PyMC3. You can read more about it in the PyMC3 docs. Notably, we need to specify the parameter $\eta = 1$ for an uninformative prior on $\mathsf{\Omega}$. So, our prior is

\begin{align} g(\boldsymbol{\mu}, \mathsf{\Sigma}) = g(\boldsymbol{\mu}, \boldsymbol{\sigma}, \mathsf{\Omega}) = g(\boldsymbol{\mu})\,g_\mathrm{LKJ}(\mathsf{\Omega})\,g(\boldsymbol{\sigma}) \propto g_\mathrm{LKJ}(\mathsf{\Omega})\,\frac{1}{\sigma_p \sigma_o}. \end{align}

There is one other aspect to be aware of. An N×N symmetric positive definite matrix can be completely specified by N(N+1)/2 entries, due to its symmetry. Furthermore, any real symmetric positive definite matrix has a unique Cholesky decomposition, $\mathsf{A} = \mathsf{L}\cdot \mathsf{L}^\mathrm{T}$, where $\mathsf{L}$ is lower triangular with positive diagonal entries. Working with the Cholesky factor instead of the full matrix brings numerical benefits in the calculation, so it is generally good practice to specify a positive definite matrix in terms of its Cholesky decomposition. This is achieved by using the pm.LKJCholeskyCov class.
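For concreteness, here is a minimal NumPy sketch (with made-up numbers) of building $\mathsf{\Sigma}$ from $\boldsymbol{\sigma}$ and $\mathsf{\Omega}$ and verifying that its Cholesky factor reconstructs it.

In [ ]:
# Build Sigma = diag(sigma) . Omega . diag(sigma) and verify that its
# Cholesky factor L satisfies L @ L.T == Sigma (numbers are made up)
sigma_vec = np.array([0.8, 0.9])
Omega = np.array([[1.0, 0.7],
                  [0.7, 1.0]])
Sigma = np.diag(sigma_vec) @ Omega @ np.diag(sigma_vec)
L = np.linalg.cholesky(Sigma)
np.allclose(L @ L.T, Sigma)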

We proceed by coding up our model in PyMC3, using the Cholesky decomposition of $\mathsf{\Sigma}$ constructed from an LKJ distribution. First, for speed, we convert the data in the DataFrame to a Numpy array.

In [5]:
data = df.loc[:,['parent_bd', 'offspring_bd']].values

Now, we construct the model.

In [6]:
with pm.Model() as bigauss_model:
    # Prior on the mean parental and offspring beak depths
    mu = pm.Uniform('mu', lower=1, upper=20, shape=2)

    # Jeffreys prior on the two standard deviations
    sigma = bebi103.pm.Jeffreys.dist(lower=0.01, upper=10, shape=2)

    # Packed Cholesky decomposition of covariance matrix with LKJ prior
    chol_packed = pm.LKJCholeskyCov('chol_packed', n=2, eta=1, sd_dist=sigma)

    # Expand into 2D Cholesky matrix
    chol = pm.expand_packed_triangular(2, chol_packed)

    # Multivariate Gaussian likelihood
    data_obs = pm.MvNormal('data_obs', mu=mu, chol=chol, observed=data)

Very nice! Now let's do some sampling!

In [7]:
with bigauss_model:
    trace = pm.sample(tune=1000, draws=10000, njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 11000/11000 [00:36<00:00, 300.87it/s]

This trace is useful, but we need to convert the Cholesky decomposition to a covariance matrix. The bebi103.pm.chol_to_cov() function conveniently does this.

In [8]:
df = bebi103.pm.trace_to_dataframe(trace)
df = pd.concat(
    (df,
    bebi103.pm.chol_to_cov(df[df.columns[df.columns.str.contains('chol')]], 'cov')),
    axis=1)
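To see what this conversion is doing, here is a minimal sketch for a single sample. It assumes the packed entries are stored in row-major lower-triangular order and that the flattened column names follow the chol_packed__i pattern; both are assumptions about the packing and naming conventions, and bebi103.pm.chol_to_cov() handles all of this for every sample.

In [ ]:
# Manually convert one packed Cholesky sample to a covariance matrix
# (assumes row-major lower-triangular packing and chol_packed__i column names)
packed = df.loc[0, ['chol_packed__0', 'chol_packed__1', 'chol_packed__2']].values
L = np.array([[packed[0], 0.0],
              [packed[1], packed[2]]])
L @ L.T   # entries should match cov__0__0, cov__1__0, and cov__1__1 for this sample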

Now that we have the covariances, we can compute the heritability, $h$.

In [9]:
# Heritability is the ratio of the parent/offspring covariance to the parental variance
df['heritability'] = df['cov__1__0'] / df['cov__0__0']

Now that we have the heritability, let's look at a corner plot.

In [10]:
g = bebi103.viz.corner(df, 
                       vars=['cov__0__0', 'cov__1__1', 'heritability'],
                       plot_ecdf=True)
bokeh.io.show(g)

The heritability is peaked at 0.72, as we would expect. Let's compute the median and 95% HPD.

In [11]:
print("""median heritability: {0:.2f}
HDP: [{1:.2f}, {2:.2f}]""".format(np.median(df['heritability']),
                                  *pm.hpd(df['heritability'], alpha=0.05)))
median heritability: 0.72
HDP: [0.65, 0.79]

So, the result is that the heritability is $0.72^{+0.07}_{-0.07}$.

Hierarchical models and the funnel of hell

This section draws on this article by Mike Betancourt and Mark Girolami. The term "funnel of hell" comes from this very nice blog post by Thomas Wiecki.

While very powerful for modeling the results of repeated experiments, hierarchical models can present some unique challenges. We already saw some issues with coming up with uninformative priors while keeping proper posteriors. Here, we investigate another pathology that arises with hierarchical models, known as the "funnel of hell." (Actually, Thomas Wiecki gave it that name, and it's not widely used, but it's memorable and I like it.)

To illustrate the funnel, I will first consider a data-less model, first proposed by Radford Neal in section 8 of this paper and modified in the Betancourt and Girolami paper. Consider a variable $x$ that is Normally distributed with mean zero and variance equal to $\mathrm{e}^v$. The parameter $v$ is Normally distributed with mean zero and variance of nine. Writing the Normal distributions in terms of their standard deviations, the model is

\begin{align} v &\sim \text{Norm}(0, 3) \\[1em] x &\sim \text{Norm}(0, \mathrm{e}^{v/2}). \end{align}

We will look at the joint probability density function $f(x, v)$. To visualize it, we will use DataShader.

In [12]:
# Sample out of the distribution; passing the array of scales draws one x for each v
v = np.random.normal(0, 3, size=10000000)
x = np.random.normal(0, np.exp(v/2))

# Build visualization with DataShader
df = pd.DataFrame(data=dict(x=x, v=v))
bebi103.viz.ds_point_plot(df, 'x', 'v', x_axis_label='x', y_axis_label='v', cmap='black')
Out[12]:

Now, let's try to use PyMC3 to sample out of this distribution.

In [13]:
with pm.Model() as funnel_model:
    v = pm.Normal('v', mu=0, sd=3)
    x = pm.Normal('x', mu=0, sd=tt.exp(v/2))
    
    funnel_trace = pm.sample(tune=100000, draws=100000, njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
 92%|█████████▏| 184875/200000 [03:58<00:19, 775.61it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 1 does not match the target. It is 0.675070264424, but should be close to 0.8. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 1 contains 11746 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|█████████▉| 199933/200000 [04:15<00:00, 783.13it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 10521 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 200000/200000 [04:15<00:00, 783.17it/s]
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 2 contains 7059 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 3 contains 9526 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
In [14]:
df = bebi103.pm.trace_to_dataframe(funnel_trace)
bebi103.viz.ds_point_plot(df, 'x', 'v', x_axis_label='x', y_axis_label='v', cmap='black')
Out[14]:

It's not immediately clear without zooming into the funnel part, but we get very few samples below v = -5, whereas the true distribution has significant probability density there. NUTS is failing to sample the thin part of the funnel.

Why is this the case? In the thin part of the funnel, x and v are very tightly correlated; the samples fall nearly on a line. This does not give the walker much room at all to step. It gets stuck down in the funnel and cannot move; proposed steps keep getting rejected.

To further illustrate this, let's look at a trace of one of the trajectories.

In [15]:
df_0 = df.loc[df['chain']==0, :].copy()
df_0['ind'] = np.arange(len(df_0))
bebi103.viz.ds_line_plot(df_0,
                         'ind',
                         'v',
                         x_axis_label='sample number',
                         y_axis_label='v',
                         cmap='black')
Out[15]:

We see that when the walker gets down to v ≈ -4 or so, it stalls, rejecting many proposed steps. This is indicative of a sampler being unable to sample a region effectively.

How do we get around this? We can do a clever trick: have the sampler explore an auxiliary variable, which we will call $x_\mathrm{var}$, that is uncorrelated with $v$, and then compute the value of $x$ by transforming $x_\mathrm{var}$. Specifically, if $x\sim \text{Norm}(\mu, \sigma)$, we define $x_\mathrm{var}\sim \text{Norm}(0, 1)$ and then compute $x = \mu + \sigma x_\mathrm{var}$. You can verify for yourself that this transformation gives the appropriate distribution for $x$.
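A quick numerical check of this claim, using arbitrary values of $\mu$ and $\sigma$: samples drawn directly from Norm($\mu$, $\sigma$) and samples built as $\mu + \sigma x_\mathrm{var}$ have the same mean and standard deviation.

In [ ]:
# Verify the non-centering transformation numerically (mu and sigma are arbitrary)
mu_check, sigma_check = 1.5, 2.0
x_direct = np.random.normal(mu_check, sigma_check, size=100000)
x_noncentered = mu_check + sigma_check * np.random.normal(0, 1, size=100000)
print(x_direct.mean(), x_direct.std())
print(x_noncentered.mean(), x_noncentered.std())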

A model built in this way is said to be a non-centered model. Let's give it a shot.

In [16]:
with pm.Model() as noncentered_funnel_model:
    v = pm.Normal('v', mu=0, sd=3)
    x_var = pm.Normal('x_var', mu=0, sd=1)
    x = pm.Deterministic('x', x_var*tt.exp(v/2))
    
    noncentered_funnel_trace = pm.sample(tune=100000, draws=100000, njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 200000/200000 [01:59<00:00, 1678.88it/s]

Let's look at our samples now.

In [17]:
df = bebi103.pm.trace_to_dataframe(noncentered_funnel_trace)
bebi103.viz.ds_point_plot(df, 'x', 'v', x_axis_label='x', y_axis_label='v', cmap='black')
Out[17]:

Much better! We are now effectively sampling the funnel. In general, it is a good idea to reparametrize your model so that the sampler explores uncorrelated parameters.

Now, let's try to sample this distribution with the MCMC Hammer. We should not see this problem, since the performance of the algorithm it uses is unchanged by affine transformations. (For each value of v, the reparametrization we just did is an affine transformation of x.)

In [18]:
@numba.jit(nopython=True)
def log_post(p):
    """Unnormalized log posterior for the toy funnel model."""
    x, v = p
    
    # Guard against numerical overflow when exp(v) is vanishingly small
    if v < -50:
        return -np.inf

    return -v**2/18 - v/2 - x**2/2/np.exp(v)

# Set up initial positions of walkers (around 0)
n_walkers = 50
n_dim = 2
p0 = np.zeros((n_walkers, n_dim)) + np.random.normal(0, 1, size=(n_walkers, n_dim))

# Perform sampling
df = bebi103.emcee.run_ensemble_emcee(log_post,
                                      n_burn=20000,
                                      n_steps=20000,
                                      n_walkers=50,
                                      p0=p0,
                                      columns=['x', 'v'])

# Look at samples
bebi103.viz.ds_point_plot(df, 'x', 'v', x_axis_label='x', y_axis_label='v', cmap='black')
Out[18]:

Indeed, we do not have the problem with the MCMC Hammer. It is important to note that the algorithm of the MCMC Hammer will be incorporated in a future release of PyMC3.

Funnels and hierarchical models

What does this have to do with hierarchical models? It turns out that hierarchical models have a similar structure to the toy funnel model. The parameters drawn from priors conditioned on hyperparameters can become strongly correlated with those hyperparameters when the variance on which the prior is conditioned gets small. We will see this with an example using some synthetic data. Imagine we repeat an experiment 10 times, with between three and ten replicates in each repeat. We will assume that the replicates for experiment i are Gaussian distributed with mean $\theta_i$, which is in turn Gaussian distributed about $\mu$ with standard deviation $\tau$. We further assume that every measurement has the same error, $\sigma = 10$. We will take weakly informative hyperpriors for $\mu$ and $\tau$. Our model is then

\begin{align} \mu &\sim \text{Norm}(0, 5) \\[1em] \tau &\sim \text{HalfCauchy}(2.5) \\[1em] \theta_i &\sim \text{Norm}(\mu, \tau) \;\;\forall i \\[1em] x_{ij} &\sim \text{Norm}(\theta_i, 10) \;\;\forall i, j \end{align}

Now, we will generate data for the model.

In [23]:
# Specify parameters for random data
mu_val = 8
tau_val = 3
sigma_val = 10
n_trials = 10

# Generate number of replicates for each repeat
n = np.random.randint(low=3, high=10, size=n_trials, dtype=int)

# Useful quantities to have
n_cumsum = np.cumsum(n)
n_total = n.sum()

# Generate data set
x = np.concatenate(
        [np.random.normal(np.random.normal(mu_val, tau_val), sigma_val, size=n_val)
                  for n_val in n])

Now that we have made our fabricated data set, let's take a quick look at it.

In [24]:
bokeh.io.show(bebi103.viz.ecdf(x, x_axis_label='x'))

Now, let's build a centered model and sample out of it. In building this model, I will demonstrate an important piece of PyMC3 syntax that is often essential for hierarchical models. We have multiple experiments, which we index by 0, 1, 2, etc. We want the likelihood contributions from all of the experiments to be multiplied together. But, as we know, writing for loops within PyMC3 models can be really problematic, and we should use built-in Theano functionality where we can; in this case, that means indexing the array of θ values with an array of experiment indices, as in the small illustration below.
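Here is a small NumPy illustration of the indexing trick with made-up numbers; the same theta[exp_ind] pattern appears in the model that follows.

In [ ]:
# Illustration of the indexing trick with made-up numbers:
# three experiments with 2, 3, and 2 replicates
n_demo = [2, 3, 2]
exp_ind_demo = np.concatenate([[i]*n_val for i, n_val in enumerate(n_demo)])
theta_demo = np.array([10.0, 20.0, 30.0])
print(exp_ind_demo)              # [0 0 1 1 1 2 2]
print(theta_demo[exp_ind_demo])  # [10. 10. 20. 20. 20. 30. 30.]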

In [27]:
exp_ind = np.concatenate([[i]*n_val for i, n_val in enumerate(n)])

with pm.Model() as centered_model:
    # Hyperpriors
    mu = pm.Normal('mu', mu=0, sd=5)
    tau = pm.HalfCauchy('tau', beta=2.5)
    
    # Prior on theta
    theta = pm.Normal('theta', mu=mu, sd=tau, shape=n_trials)
    
    # Likelihood
    x_obs = pm.Normal('x_obs',
                       mu=theta[exp_ind],
                       sd=sigma_val,
                       observed=x)
    
    centered_trace = pm.sample(tune=10000,
                               draws=10000,
                               njobs=4,
                               nuts_kwargs=dict(max_treedepth=20, target_accept=0.9))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
 78%|███████▊  | 15578/20000 [01:01<00:17, 251.39it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 3 does not match the target. It is 0.799241897221, but should be close to 0.9. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 3 contains 100 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
 82%|████████▏ | 16476/20000 [01:06<00:14, 248.80it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 1 does not match the target. It is 0.801574180793, but should be close to 0.9. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 1 contains 546 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
 86%|████████▌ | 17111/20000 [01:09<00:11, 247.45it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 2 contains 143 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|█████████▉| 19995/20000 [01:21<00:00, 245.34it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 0 does not match the target. It is 0.768836292836, but should be close to 0.9. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 599 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 20000/20000 [01:21<00:00, 245.32it/s]
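The warnings above report divergent samples. Assuming NUTS records the 'diverging' sampler statistic in the trace (as the warnings suggest), we can tally the divergences programmatically.

In [ ]:
# Count post-tuning divergent samples across all chains (assumes the
# 'diverging' sampler statistic is recorded by NUTS)
centered_trace.get_sampler_stats('diverging').sum()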

Now that we have our samples, we can check the Gelman-Rubin statistic for convergence.

In [28]:
pm.gelman_rubin(centered_trace)
Out[28]:
{'mu': 1.002296536823547,
 'tau': 1.0035573233818333,
 'theta': array([ 1.00133834,  1.00179886,  1.00147957,  1.00058513,  1.00119157,
         1.00077647,  1.00083679,  1.00249094,  1.00136004,  1.0013294 ])}
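For reference, here is a minimal sketch of the classic (non-split) Gelman-Rubin computation for a single parameter; pm.gelman_rubin may use a refinement of this formula, so treat this as illustrative only.

In [ ]:
# Classic Gelman-Rubin R-hat for one parameter, computed by hand (illustrative)
def gelman_rubin_classic(chains):
    """R-hat from a list of 1D arrays of samples, one array per chain."""
    m, n = len(chains), len(chains[0])
    chain_means = np.array([chain.mean() for chain in chains])
    W = np.mean([chain.var(ddof=1) for chain in chains])  # within-chain variance
    B = n * chain_means.var(ddof=1)                       # between-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

gelman_rubin_classic(
    [centered_trace.get_values('mu', chains=[c]) for c in centered_trace.chains])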

Everything looks good, so we might not suspect any real problems with the sampling. This is where improper sampling of hierarchical models can present a real problem: the diagnostics can look fine even when part of the posterior is not being explored. Let's plot the samples.

In [29]:
bebi103.viz.ds_point_plot(bebi103.pm.trace_to_dataframe(centered_trace),
                          'mu',
                          'tau',
                          x_axis_label='µ',
                          y_axis_label='τ',
                          cmap='black')
Out[29]:

We see the funnel, and, upon zooming, that the sampling extends down to τ ≈ 0.5 and not much further. Let's try sampling with a non-centered model.

In [31]:
with pm.Model() as noncentered_model:
    # Hyperpriors
    mu = pm.Normal('mu', mu=0, sd=5)
    tau = pm.HalfCauchy('tau', beta=2.5)
    
    # Prior on theta
    var_theta = pm.Normal('var_theta', mu=0, sd=1, shape=n_trials)
    theta = pm.Deterministic('theta', mu + var_theta * tau)
    
    # Likelihood
    x_obs = pm.Normal('x_obs',
                       mu=theta[exp_ind],
                       sd=sigma_val,
                       observed=x)
    
    noncentered_trace = pm.sample(tune=10000,
                                  draws=10000,
                                  njobs=4,
                                  nuts_kwargs=dict(max_treedepth=20, target_accept=0.9))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
 98%|█████████▊| 19533/20000 [00:48<00:01, 404.52it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 3 does not match the target. It is 0.833009325121, but should be close to 0.9. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 3 contains 3 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
 98%|█████████▊| 19576/20000 [00:48<00:01, 404.50it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 2 contains 1 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 20000/20000 [00:49<00:00, 405.31it/s]

The sampling was considerably faster, nearly twice as fast, likely because the walkers are less constrained. Now, let's look at the trace.

In [32]:
bebi103.viz.ds_point_plot(bebi103.pm.trace_to_dataframe(noncentered_trace),
                          'mu',
                          'tau',
                          x_axis_label='µ',
                          y_axis_label='τ',
                          cmap='black')
Out[32]:

Here, we see that the funnel has effectively been sampled, all the way down to τ ≈ 0.