Tutorial 8a: MCMC diagnostics

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import pandas as pd

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

NOTE: You will need to update the bebi103 module for this tutorial. Do: pip install --upgrade bebi103.

In previous tutorials, you have seen that we can sample out of arbitrary probability distributions, including posterior probability distributions in the context of Bayesian inference, using Markov chain Monte Carlo. However, there are a few questions we need to answer to make sure our MCMC samplers are in fact sampling the target distribution.

  1. Have we achieved stationarity? That is, have the chains sampled long enough that they have converged to the limiting distribution and are effectively giving us samples out of the target distribution?
  2. Can the chains access all areas of parameter space?
  3. Have we taken enough samples to get a good picture of the posterior?

There are diagnostic checks we can do to address these questions, and these checks are the topic of this tutorial.

The data set

As we set out to learn about MCMC diagnostics, we will again use the data set from Singer, et al., consisting of mRNA transcript counts in cells from single-molecule FISH experiments. We'll start by loading in the data set. We will work with the Rest data, which I will go ahead and pull out as a NumPy array. I'll make a quick plot of the ECDF as a reminder of the data set.

In [2]:
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')
n = df['Rest'].values

bokeh.io.show(bebi103.viz.ecdf(n, x_axis_label='transcript count'))

The model

As a reminder from Tutorial 7a, our model is

\begin{align} &\alpha \sim \text{LogNorm}(0,2), \\[1em] &b \sim \text{LogNorm}(2, 3), \\[1em] &\beta = 1/b,\\[1em] &n_i \sim \text{NegBinom}(\alpha, \beta) \;\forall i. \end{align}

Here, $\alpha$ is the frequency of bursts in gene expression and $b$ is the size of the bursts. We do a change of variables to convert $b$ to $\beta$, as required for parametrization with Stan. The Stan model is below. We'll compile it to have it ready for sampling.
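
Before turning to Stan, it can help to see the generative model in NumPy. The sketch below is for intuition only and is not part of the sampling workflow; it relies on the correspondence that NegBinom($\alpha$, $\beta$) with rate $\beta = 1/b$ matches NumPy's negative_binomial with success probability $\beta/(1+\beta)$, and the seed and number of draws are arbitrary choices for illustration.

import numpy as np

# For intuition only: draw parameters from the priors, then draw
# transcript counts from the negative binomial likelihood.
np.random.seed(3252)

alpha = np.random.lognormal(0, 2)    # burst frequency, alpha ~ LogNorm(0, 2)
b = np.random.lognormal(2, 3)        # burst size, b ~ LogNorm(2, 3)
beta = 1 / b                         # change of variables

# NegBinom(alpha, beta) corresponds to NumPy's parametrization with
# success probability beta / (1 + beta)
n_sim = np.random.negative_binomial(alpha, beta / (1 + beta), size=100)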

In [3]:
model_code = """
data {
  int N;
  int n[N];
}


parameters {
  real<lower=0> alpha;
  real<lower=0> b;
}


transformed parameters {
  real beta_ = 1.0 / b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.0);
  b ~ lognormal(2.0, 3.0);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}
"""

sm = bebi103.stan.StanModel(model_code=model_code)
Using cached StanModel.

Let's get some samples to work with.

In [4]:
samples = sm.sampling(data=dict(N=len(n), n=n), seed=29234)

Diagnostics for any MCMC sampler

We will first investigate diagnostics that apply to any MCMC sampler, not just Hamiltonian Monte Carlo samplers like Stan uses.

The Gelman-Rubin R-hat statistic

The Gelman-Rubin R-hat statistic is a useful metric to determine if we have achieved stationarity with our chains. The idea is that we run multiple chains in parallel (at least four). For a given parameter, we then compute the variance in the samples between the chains, and then the variance of samples within the chains. The ratio of these two is the Gelman-Rubin R-hat statistic, usually denoted as $\hat{R}$, and we compute $\hat{R}$ for each parameter.

\begin{align} \hat{R} = \frac{\text{variance between chains}}{\text{variance within chains}}. \end{align}

The value of $\hat{R}$ approaches unity if the chains are properly sampling the target distribution, because the chains should be statistically indistinguishable in their sampling of the posterior if they have all reached the limiting distribution.
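
To make the calculation concrete, here is a minimal sketch of the classic between/within-chain construction for a single parameter. Stan actually computes a split-chain version with further refinements, so treat this purely as an illustration of the idea.

import numpy as np

def crude_rhat(chains):
    """Crude Gelman-Rubin R-hat for one parameter.

    `chains` is a 2D array of shape (n_chains, n_samples_per_chain).
    This is the classic between/within-chain construction, not Stan's
    split-chain calculation.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)

    # Between-chain variance
    B = n * chain_means.var(ddof=1)

    # Within-chain variance (average of the per-chain sample variances)
    W = chains.var(axis=1, ddof=1).mean()

    # Pooled estimate of the marginal posterior variance
    var_hat = (n - 1) / n * W + B / n

    return np.sqrt(var_hat / W)

With four chains of 1000 post-warmup draws each, you would pass a (4, 1000) array of draws for a single parameter.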

Stan automatically computes $\hat{R}$. You can view the results in Jupyter Lab by just executing samples (or whatever variable name you have used to store the output of sm.sampling()).

In [5]:
samples
Out[5]:
Inference for Stan model: anon_model_19725d24b6aa24995430b147ace33ea9.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha    4.5    0.02   0.41   3.72   4.22   4.49   4.79   5.36    741   1.01
b       16.8    0.06   1.64  13.86  15.63  16.69  17.83  20.33    823   1.01
beta_   0.06  2.1e-4 5.8e-3   0.05   0.06   0.06   0.06   0.07    756   1.01
lp__   -1378    0.03   1.08  -1380  -1378  -1377  -1377  -1377   1007    1.0

Samples were drawn using NUTS at Fri Nov 16 14:12:00 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We see that Rhat for each of the three parameters is 1.01. As a rule of thumb, if $|1-\hat{R}| < 0.1$, the chains have converged.
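
If you want to check $\hat{R}$ (and $n_\mathrm{eff}$, which we discuss below) programmatically rather than reading them off the printed table, one way is to organize PyStan's summary into a data frame, as in this sketch; the column names mirror the printed summary.

import pandas as pd

# Organize the sampling summary into a data frame
summary = samples.summary()
df_summary = pd.DataFrame(summary['summary'],
                          index=summary['summary_rownames'],
                          columns=summary['summary_colnames'])

# Rhat should be close to unity for all parameters
df_summary['Rhat']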

To see an example where the chains have not converged, we will sample again, but only allow the chains seven warm-up steps.

In [6]:
samples_limited_warmup = sm.sampling(data=dict(N=len(n), n=n), 
                                     warmup=7, 
                                     iter=1007, 
                                     seed=29234)

samples_limited_warmup
Out[6]:
Inference for Stan model: anon_model_19725d24b6aa24995430b147ace33ea9.
4 chains, each with iter=1007; warmup=7; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha    9.9    5.16    7.3   1.48   2.85  10.14  17.46  21.31      2   8.89
b      17.18   11.12  15.73   3.77    4.7   8.64  27.79  49.47      2   6.25
beta_   0.12    0.06   0.09   0.02   0.04   0.12   0.22   0.27      2   8.26
lp__   -1466   55.01   77.8  -1593  -1534  -1455  -1412  -1377      2   9.74

Samples were drawn using NUTS at Fri Nov 16 14:12:01 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Now, the $\hat{R}$ values are large; the chains have not converged to the limiting distribution. Note also that the mean values of $\alpha$ and $b$ are not the same as for the properly warmed-up sampler. This emphasizes the point that warm-up is crucial for the performance of the sampler. If you see $\hat{R}$ values that are too large, you may be able to fix the problem by having the sampler take more warm-up steps.

We can also see the poor mixing of the chains by looking at the trace plot.

In [7]:
bokeh.io.show(bebi103.viz.trace_plot(samples_limited_warmup, pars=['alpha', 'b'], line_width=2))

This is pathological; three of the chains are essentially not moving, and one of the chains is moving very poorly. This means that most proposed steps are being rejected.

As is the case with all diagnostic metrics, there are caveats. You can read about them for $\hat{R}$ in section 30.3 of the Stan manual.

Number of effective steps

Recall that MCMC samplers do not draw independent samples from the target distribution. Rather, the samples are correlated. Ideally, though, we would draw independent samples. We would therefore like an estimate of the number of effectively independent samples we draw, $n_\mathrm{eff}$. I will not go into the calculation of $n_\mathrm{eff}$ here, but it is described in section 30.4 of the Stan manual. An approximate $n_\mathrm{eff}$ is computed by Stan, and it is also given in the summary of the sampling.
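
To get a feel for where this number comes from, here is a back-of-the-envelope estimate for a single chain, based on summing the positive autocorrelations. Stan's calculation pools information across chains and uses a more careful truncation rule, so this is only for intuition.

import numpy as np

def crude_n_eff(chain):
    """Crude effective sample size for a single chain.

    Estimates n_eff = n / (1 + 2 * sum of positive autocorrelations).
    """
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    x = chain - chain.mean()

    # Normalized autocorrelation function; rho[0] = 1
    rho = np.correlate(x, x, mode='full')[n-1:] / (n * x.var())

    # Sum autocorrelations until the first negative value
    rho_sum = 0.0
    for r in rho[1:]:
        if r < 0:
            break
        rho_sum += r

    return n / (1 + 2 * rho_sum)

You could apply this to a single chain's draws of, say, $\alpha$; the n_eff Stan reports will generally differ somewhat because it combines information from all chains.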

In [8]:
samples
Out[8]:
Inference for Stan model: anon_model_19725d24b6aa24995430b147ace33ea9.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha    4.5    0.02   0.41   3.72   4.22   4.49   4.79   5.36    741   1.01
b       16.8    0.06   1.64  13.86  15.63  16.69  17.83  20.33    823   1.01
beta_   0.06  2.1e-4 5.8e-3   0.05   0.06   0.06   0.06   0.07    756   1.01
lp__   -1378    0.03   1.08  -1380  -1378  -1377  -1377  -1377   1007    1.0

Samples were drawn using NUTS at Fri Nov 16 14:12:00 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We took a total of 4000 steps (1000 on each of four chains), and got $n_\mathrm{eff}$ of roughly 800 for each parameter, giving $n_\mathrm{eff} / n_\mathrm{steps} \approx 0.2$. This is a reasonable number; as a rule of thumb, you should have $n_\mathrm{eff}/n_\mathrm{steps} > 0.001$. Note, though, that if $n_\mathrm{eff}/n_\mathrm{steps}$ is that small, you will need to take a very large number of Monte Carlo steps to get enough effective samples.

Thinning

A strategy to boost the independence of your samples is to thin them. In this example, the autocorrelation time of the chains is about 5 steps, since $n_\mathrm{eff} / n_\mathrm{steps} \approx 0.2$. To get 4000 samples that are more nearly independent, we can take 20,000 samples and keep only every fifth one. This is accomplished using the thin kwarg.
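
To see why thinning decorrelates the samples we keep, here is a toy demonstration with a hypothetical AR(1) chain, not MCMC output; the parameters are arbitrary and chosen only to illustrate the effect.

import numpy as np

# Toy autocorrelated chain: AR(1) with lag-1 autocorrelation 0.8
rng = np.random.RandomState(3252)
n_steps = 20000
phi = 0.8

chain = np.empty(n_steps)
chain[0] = 0.0
for i in range(1, n_steps):
    chain[i] = phi * chain[i-1] + rng.normal()

# Keep every fifth sample
thinned = chain[::5]

# Lag-1 autocorrelation before and after thinning
print(np.corrcoef(chain[:-1], chain[1:])[0, 1])      # close to 0.8
print(np.corrcoef(thinned[:-1], thinned[1:])[0, 1])  # close to 0.8**5 ≈ 0.33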

In [9]:
samples = sm.sampling(data=dict(N=len(n), n=n), 
                      seed=29234, 
                      warmup=1000, 
                      iter=6000, 
                      thin=5)

samples
Out[9]:
Inference for Stan model: anon_model_19725d24b6aa24995430b147ace33ea9.
4 chains, each with iter=6000; warmup=1000; thin=5; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha    4.5  7.2e-3    0.4   3.76   4.23    4.5   4.76    5.3   3026    1.0
b      16.79    0.03   1.57  13.98  15.71  16.68  17.78  20.21   3101    1.0
beta_   0.06  1.0e-4 5.6e-3   0.05   0.06   0.06   0.06   0.07   3080    1.0
lp__   -1378    0.02   1.01  -1380  -1378  -1377  -1377  -1377   3117    1.0

Samples were drawn using NUTS at Fri Nov 16 14:12:12 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Now, $n_\mathrm{eff}/n_\mathrm{steps} \approx 3/4$, so the samples we keep are more independent. We could push this more aggressively if we have the patience for the computation.

Bear in mind that the $n_\mathrm{eff}$ calculation is approximate and subject to error. There are, as usual, other caveats, which are discussed in section 30.4 of the Stan manual.

Diagnostics for HMC

Both $\hat{R}$ and $n_\mathrm{eff}$ are useful diagnostics for any MCMC sampler, but Hamiltonian Monte Carlo offers other diagnostics to help ensure that the sampling is going as it should. It is important to note that these diagnostics are a feature of HMC, not a bug. By that I mean that the absence of these diagnostics, particularly divergences, from other sampling methods makes it harder to ensure that those methods are sampling properly. The ability to check that it is working properly makes HMC all the more powerful.

Divergences

Hamiltonian Monte Carlo enables large step sizes by taking into account the shape of the target distribution and tracing trajectories along it. (This is of course a very loose description. You should read Michael Betancourt's wonderful introduction to HMC to get a more complete picture.) When the trajectory encounters a region of parameter space where the posterior (target) distribution has high curvature, the trajectory can veer sharply. These events can be detected and are registered as divergences; a given Monte Carlo step ends in a divergence if this happens. This does not necessarily mean that there is a problem with the sample, but there is a good chance that there is.

Stan keeps track of divergences and reports them. Let's look first at our good samples, where we properly warmed up the sampler. We can convert the samples to a data frame; the divergent__ column is zero for a sample that did not have a divergence, and one for a sample that did.

In [10]:
df_mcmc = bebi103.stan.to_dataframe(samples)
print('Number of divergences:', df_mcmc['divergent__'].sum())
Number of divergences: 0.0

So, the properly warmed up sampler had no divergences. Let's look at the improperly warmed-up sampler.

In [11]:
df_mcmc_limited_warmup = bebi103.stan.to_dataframe(samples_limited_warmup)
print('Number of divergences:', df_mcmc_limited_warmup['divergent__'].sum())
Number of divergences: 2191.0

Yikes! All kinds of divergences there. This is symptomatic of a sampler in trouble.

We will talk more about divergences later in this tutorial and in the next one when we deal with distributions that are inherently difficult to sample, regardless of whether or not we warmed up properly.

Tree depth

The explanation of this diagnostic is a little computer-sciencey, so you can skip to the last sentence of this section if the CS terms are unfamiliar to you.

The HMC algorithm used by Stan uses recursion. In practice, when doing recursive calculations, you need to put a bound on how deep the recursion can go, i.e., you need to cap the tree depth, lest you get a stack overflow. Stan therefore has a limit on tree depth, the default of which is 10. If this tree depth is hit while trying to take a sample, the sampling is not wrong, just less efficient. Stan therefore reports the tree depth information for each sample. These are also included in the data frame you get from the samples, and you can check to see if the sampler slammed against the maximum tree depth.

In [12]:
(df_mcmc['treedepth__'] == 10).sum()
Out[12]:
0

This looks good. To get a feel for the tree depth for this particular calculation, we can look at some summaries.

In [13]:
df_mcmc['treedepth__'].median()
Out[13]:
3.0
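
For reference, if many iterations did saturate the tree depth, you could raise the cap with the control keyword argument when sampling. We do not need this for our model, so the call below is just a sketch.

# Not needed here, but if the sampler were saturating the default
# maximum tree depth of 10, you could raise it like this:
samples_deeper = sm.sampling(data=dict(N=len(n), n=n),
                             seed=29234,
                             control=dict(max_treedepth=15))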

E-BFMI

The energy Bayesian fraction of missing information, or E-BFMI, is another metric that is specific to HMC samplers. Loosely speaking (again), it is a measure of how effective the sampler is at taking long steps. Some details are given in the Betancourt paper on HMC, and we will not go into them here, except to say that as a rule of thumb, values below 0.2 can be indicative of inefficient sampling.
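
For the curious, the E-BFMI of a chain is computed from the energies recorded along that chain, roughly as in the sketch below, which follows the construction in Betancourt's diagnostic code; treat it as an illustration rather than Stan's exact implementation.

import numpy as np

def e_bfmi(energies):
    """Crude E-BFMI for a single chain, given its energy__ values."""
    energies = np.asarray(energies, dtype=float)
    numer = np.sum(np.diff(energies)**2) / len(energies)
    denom = np.var(energies)
    return numer / denom

The check_all_diagnostics() function we use below performs a check along these lines for each chain.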

Stan automatically keeps track of the energy of each sample, which is stored in the energy__ column of the output data frame, and the E-BFMI is computed from these energies. We can do a quick check here.

In [14]:
print('Number of samples with small E-BFMI:', (df_mcmc['energy__'] < 0.2).sum())
print('Mean E-BFMI of the samples:', df_mcmc['energy__'].mean())
Number of samples with small E-BFMI: 0
Mean E-BFMI of the samples: 1379.0966354447526

Quickly checking the diagnostics

I wrote a function, based on work by Michael Betancourt, to quickly check these diagnostics for a set of samples. It is available in the bebi103.stan submodule.

In [15]:
bebi103.stan.check_all_diagnostics(samples)
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
Out[15]:
0

This is a quick check you can do to make sure everything is in order after obtaining samples. But it is very important to note that passing all of these diagnostic checks does not ensure that you achieved effective sampling. And perhaps even more importantly, getting effective sampling certainly does not guarantee that your model is a good one. Nonetheless, good, identifiable models tend to pass the diagnostic checks more often than poor ones.

A pathological example

We will now consider a probability distribution that is difficult to sample. As you will see in the next tutorial, it has many of the same pathologies that are often present in hierarchical models.

\begin{align} & v \sim \text{Norm}(0, 3),\\[1em] & \theta \sim \text{Norm}(0, \mathrm{e}^{v/2}). \end{align}

That is, $v$ is Gaussian distributed with mean zero and variance 9, and $\theta$ is Gaussian distributed with mean zero and variance $\mathrm{e}^v$. The joint distribution is then

\begin{align} P(\theta, v) = P(\theta\mid v) \,P(v) = \frac{\mathrm{e}^{-v/2}}{6\pi}\,\exp\left[-\frac{1}{2}\left(\frac{v^2}{9} + \frac{\theta^2}{\mathrm{e}^v}\right)\right] \end{align}

We can compute this analytically, so let's make a plot of it so we know what we're sampling out of.

In [16]:
theta, v = np.meshgrid(np.linspace(-4, 4, 400), np.linspace(-15, 5, 400))
P = np.exp(-v/2) / 6 / np.pi * np.exp(-(v**2 / 9 + theta**2 / np.exp(v))/2)

bokeh.io.show(bebi103.viz.imshow(im=P,
                                 x_range=[-4, 4], 
                                 y_range=[-15, 5], 
                                 flip=False,
                                 x_axis_label='θ',
                                 y_axis_label='v'))

This probability density function is funnel shaped, and it was named "the Funnel of Hell" by Thomas Wiecki because the probability distribution is difficult to sample out of using MCMC. I like to call it the microinjection needle, since it looks like the tip of an injection needle.

Note that much of the probability density lies deep in the needle, which is a region of high curvature. The sampler may have some real troubles down there.

Before proceeding to attempt to sample this, I note that use of this funnel originates from section 8 of this paper by Radford Neal, and this section of the tutorial draws from this paper by Betancourt and Girolami.

Sampling out of the funnel

This simple distribution allows for independent sampling without MCMC. First, let's generate some of these samples so we can see what effective sampling should look like.

In [17]:
# Sample out of distribution
np.random.seed(45234)
v = np.random.normal(0, 3, size=4000)
theta = np.random.normal(0, np.exp(v/2))

p = bokeh.plotting.figure(height=400, 
                          width=450, 
                          x_range=[-100, 100],
                          x_axis_label='θ', 
                          y_axis_label='v')
p.circle(theta, v, 
         alpha=0.3, 
         color='#66c2a5',
         legend='indep. samples')
p.legend.location = 'bottom_left'
bokeh.io.show(p)

Now, we'll code up a Stan model for the funnel and draw some samples using MCMC.

In [18]:
model_code_funnel = """
parameters {
  real theta;
  real v; 
}


model {
v ~ normal(0, 3);
theta ~ normal(0, exp(v/2));
}
"""

sm_funnel = bebi103.stan.StanModel(model_code=model_code_funnel)
samples = sm_funnel.sampling(seed=71283)
Using cached StanModel.

Let's first take a quick look at the diagnostics.

In [19]:
bebi103.stan.check_all_diagnostics(samples)
n_eff / iter looks reasonable for all parameters.
Rhat for parameter v is 1.153555342174175.
  Rhat above 1.1 indicates that the chains very likely have not mixed
388.0 of 4000 (9.7%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
Out[19]:
6

The diagnostics report a large $\hat{R}$ for $v$ and many divergences, which, as I mentioned before, tend to happen in regions where the target distribution has high curvature.

Let's look at a plot of the samples, overlaid with the samples we trust. (You can click on the legend to display or hide respective samples.)

In [20]:
# Pull out samples into data frame
df_mcmc = bebi103.stan.to_dataframe(samples)

p.circle(df_mcmc['theta'],
         df_mcmc['v'],
         color='#fc8d62',
         alpha=0.3, 
         legend='default sampling')
p.legend.click_policy = 'hide'
bokeh.io.show(p)

Stan's sampler is clearly not penetrating to the lower regions of the funnel. If we did not have the correctly generated independent samples to compare to, we might never discover that this is an issue. So how can we clue ourselves in to sampling issues like this?

First off, the divergences clue us in that there is a problem. We can start to investigate what the chains are doing by taking a graphical approach, beginning with the trace plot.

In [21]:
bokeh.io.show(bebi103.viz.trace_plot(samples, pars=['theta', 'v']))

We immediately see a pathology in the trace plot for $v$. When $v$ is small, the chains get stuck and keep rejecting steps. They cannot move. This is because the proposal steps keep ending in divergences and the steps cannot be taken.

We can look at this another way using a parallel coordinate plot. To allow for easy comparison, we will apply a transformation to $\theta$ so that we show the logarithm of its absolute value. The function bebi103.viz.parcoord_plot() displays divergent samples in orange.

In [22]:
transformation = [lambda x: np.log(np.abs(x)), None, None]
bokeh.io.show(bebi103.viz.parcoord_plot(samples, 
                                        transformation=transformation,
                                        divergence_alpha=0.1, 
                                        divergence_line_width=0.5))

From the parallel coordinate plot, we see that the divergences occur when $v$ is small and $\theta$ is close to zero, which is the bottom of the funnel. The log posterior is also high for these divergences. There is substantial probability mass in the funnel, so we really do need to sample it.

As an alternative plot, we can plot the divergent samples in a different color in a scatter plot of our samples. The bebi103.viz.corner() function automatically does this.

In [23]:
bokeh.io.show(bebi103.viz.corner(samples, pars=['theta', 'v'], plot_width=300))

The graphical display of divergences, in particular in the colored scatter plot above and in the parallel coordinate plot, helps diagnose the problem.

Conquering the Funnel of Hell

How can we get our MCMC sampler to get deep into the funnel? The funnel arises because the variance of the distribution of $\theta$ becomes very small as $v$ decreases. This narrows the funnel, and any step the sampler takes is too large, carrying it out of the funnel. We need to sample down into the funnel to get true samples out of the target distribution.

Adjusting adapt_delta

We could try to take the advice of Stan's warning messages and increase the adapt_delta parameter so that the sampler takes smaller steps. The default value is 0.8, so let's crank it up to 0.99 and see if that works.

In [24]:
samples = sm_funnel.sampling(seed=123712, control=dict(adapt_delta=0.99))

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples)

# Pull out samples into data frame
df_mcmc = bebi103.stan.to_dataframe(samples)

# Add plot of samples
p.circle(df_mcmc['theta'],
         df_mcmc['v'],
         color='#8da0cb',
         alpha=0.3, 
         legend='small adapt_delta')
bokeh.io.show(p) 
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
96.0 of 4000 (2.4%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.
2 of 4000 (0.05%) iterations saturated the maximum tree depth of 10.
  Try running again with max_treedepth set to a larger value to avoid saturation.
E-BFMI indicated no pathological behavior.

That helped. We have far fewer divergences. However, we are still just a bit shy of the bottom of the funnel.

Uncentering

Instead of making the sampler sample out of a distribution with tiny variance, we can make it sample out of a distribution that has a more reasonable variance, and then apply a transformation to those samples to get samples from the tiny-variance distribution. To devise a strategy for doing this, we use the change of variables formula for probability distributions. Imagine we have a probability distribution of $\theta$ with probability density function $\pi(\theta)$. If we instead wish to work with a probability density function of another variable $\tilde{\theta}$, which we can express as a function of $\theta$, $\tilde{\theta} = \tilde{\theta}(\theta)$, we need to ensure that $\pi(\tilde{\theta})$ is normalized,

\begin{align} \int \mathrm{d}\tilde{\theta}\,\pi(\tilde{\theta}) = 1. \end{align}

To relate this integral to the integral of $\pi(\theta)$, we need to properly change variables in the integral. This leads to the change of variables formula,

\begin{align} \pi(\tilde{\theta}) = \left|\frac{\mathrm{d}\theta}{\mathrm{d}\tilde{\theta}}\right|\,\pi(\theta). \end{align}

Now, if we choose

\begin{align} \tilde{\theta} = \frac{\theta - \mu}{\sigma}, \end{align}

then

\begin{align} \left|\frac{\mathrm{d}\theta}{\mathrm{d}\tilde{\theta}}\right| = \sigma \end{align}

and

\begin{align} \pi(\tilde{\theta}) = \sigma \pi(\theta). \end{align}

If $\theta$ is Gaussian distributed with mean $\mu$ and variance $\sigma^2$, we have

\begin{align} \pi(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\mathrm{e}^{-(\theta-\mu)^2/2\sigma^2}. \end{align}

Then, to satisfy the change of variables formula,

\begin{align} \pi(\tilde{\theta}) = \frac{1}{\sqrt{2\pi}}\,\mathrm{e}^{-\tilde{\theta}^2/2}. \end{align}

This means that $\tilde{\theta} \sim \text{Norm}(0, 1)$. Thus, we can reparametrize using the fact that $\theta \sim \text{Norm}(\mu, \sigma)$ is equivalent to

\begin{align} &\tilde{\theta} \sim \text{Norm}(0, 1),\\[1em] &\theta = \mu + \sigma\,\tilde{\theta}. \end{align}

So, in our case, we can instead sample using $\tilde{\theta}$ with

\begin{align} &\tilde{\theta} \sim \text{Norm}(0, 1),\\[1em] &\theta = \mathrm{e}^{v/2}\,\tilde{\theta}. \end{align}

This process is called uncentering. A non-centered parametrization has the sampler exploring away from the mean of the target distribution (hence, it is non-centered), and a transformation then ensures that the samples come from the target.
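
As a quick numerical sanity check, separate from the Stan model, we can verify that the centered and non-centered parametrizations give the same distribution of $\theta$. The seed and number of draws here are arbitrary.

import numpy as np

rng = np.random.RandomState(8675309)
v = rng.normal(0, 3, size=100000)

# Centered: theta ~ Norm(0, exp(v/2))
theta_centered = rng.normal(0, np.exp(v/2))

# Non-centered: theta_tilde ~ Norm(0, 1), then theta = exp(v/2) * theta_tilde
theta_noncentered = np.exp(v/2) * rng.normal(0, 1, size=len(v))

# The quantiles of the two sets of samples should be close
print(np.percentile(theta_centered, [10, 25, 50, 75, 90]))
print(np.percentile(theta_noncentered, [10, 25, 50, 75, 90]))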

Let's implement the non-centered parametrization of this pathological distribution in Stan.

In [25]:
model_code_noncentered = """
parameters {
  real theta_tilde;
  real v; 
}


transformed parameters {
  real theta = exp(v/2) * theta_tilde;
}

model {
v ~ normal(0, 3);
theta_tilde ~ normal(0, 1);
}
"""

sm_noncentered = bebi103.stan.StanModel(model_code=model_code_noncentered)
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_8a2a74c2f62e38a97c72b634dc8f6135 NOW.

Now let's sample and update the plot. We will not bother adjusting adapt_delta, and we'll just see what we get.

In [26]:
samples = sm_noncentered.sampling(seed=23784)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples)

# Pull out samples into data frame
df_mcmc = bebi103.stan.to_dataframe(samples)

# Add plot of samples
p.circle(df_mcmc['theta'],
         df_mcmc['v'],
         color='#e78ac3',
         alpha=0.3, 
         legend='non-centered')
bokeh.io.show(p) 
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

Look at that! No divergences, and we have managed to sample all the way down the funnel! We have conquered the Funnel of Hell.

Hierarchical models feature a Funnel of Hell

It turns out that many hierarchical models feature a Funnel of Hell, so uncentering is often crucial. We will explore this in the next tutorial.

Computing environment

In [27]:
%load_ext watermark
In [28]:
%watermark -v -p numpy,pandas,pystan,bokeh,bebi103,jupyterlab
CPython 3.7.0
IPython 7.1.1

numpy 1.15.4
pandas 0.23.4
pystan 2.17.1.0
bokeh 1.0.1
bebi103 0.0.35
jupyterlab 0.35.3