A diagnostics case study: Artifical funnel of hell¶
[1]:
import numpy as np
import cmdstanpy
import arviz as az
import bebi103
import holoviews as hv
hv.extension('bokeh')
import bokeh.io
bokeh.io.output_notebook()
We will now consider a probability distribution that is difficult to sample. As you will see in forthcoming lessons, it has many of the same pathologies that are often present in hierarchical models.
\begin{align} & v \sim \text{Norm}(0, 3),\\[1em] & \theta \sim \text{Norm}(0, \mathrm{e}^{v/2}). \end{align}
That is, \(v\) is Normally distribution with mean zero and variance 9, and \(\theta\) is Normally distributed with mean zero and variance \(\mathrm{e}^v\). The joint distribution is then
\begin{align} P(\theta, v) = P(\theta\mid v) \,P(v) = \frac{\mathrm{e}^{-v/2}}{6\pi}\,\exp\left[-\frac{1}{2}\left(\frac{v^2}{9} + \frac{\theta^2}{\mathrm{e}^v}\right)\right] \end{align}
We can compute this analytically, so let’s make a plot of it so we know what we’re sampling out of.
[2]:
theta = np.linspace(-4, 4, 400)
v = np.linspace(-15, 5, 400)
THETA, V = np.meshgrid(theta, v)
P = np.exp(-V/2) / 6 / np.pi * np.exp(-(V**2 / 9 + THETA**2 / np.exp(V))/2)
hv.Image((theta, v, P), kdims=['θ', 'v'])
[2]:
This probabilty density function is funnel shaped, named “the Funnel of Hell” by Thomas Wiecki because the probability distribution is difficult to sample out of using MCMC. I like to call it the microinjection needle, since it looks like the tip of an injection needle.
Note that much of the probability density lies deep in the needle, which is a region of high curvature. The sampler may have some real troubles down there.
Before proceeding to attempt to sample this, I note that use of this funnel originates from section 8 of this paper by Radford Neal, and this section of this tutorial draws from this paper by Betancourt and Girolami.
Sampling out of the funnel¶
This simple distribution allows for independent sampling without MCMC. First, let’s generate some of these samples so we can see what effective sampling should look like.
[3]:
# Sample out of distribution
np.random.seed(45234)
v = np.random.normal(0, 3, size=4000)
theta = np.random.normal(0, np.exp(v / 2))
p = bokeh.plotting.figure(
    height=400, width=450, x_range=[-100, 100], x_axis_label="θ", y_axis_label="v"
)
p.circle(theta, v, alpha=0.3, color="#66c2a5", legend_label="indep. samples")
p.legend.location = "bottom_left"
bokeh.io.show(p)
Now, we’ll code up a Stan model for the funnel and draw some samples using MCMC. The Stan code is short and simple.
parameters {
  real theta;
  real v;
}
model {
v ~ normal(0, 3);
theta ~ normal(0, exp(v/2));
}
Let’s compile and sample!
[4]:
sm = cmdstanpy.CmdStanModel(stan_file='funnel.stan')
samples = sm.sample(seed=71283)
samples = az.from_cmdstanpy(samples)
INFO:cmdstanpy:stan to c++ (/Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_06/funnel.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_06/funnel
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4
Let’s first take a quick look at the diagnostics.
[5]:
bebi103.stan.check_all_diagnostics(samples)
ESS for parameter v is 185.91057053537065.
tail-ESS for parameter v is 243.10005846225746.
  ESS or tail-ESS below 100 per chain indicates that expectation values
  computed from samples are unlikely to be good approximations of the
  true expectation values.
Rhat for parameter theta is 1.0312854953913977.
Rhat for parameter v is 1.0506633290792524.
  Rank-normalized Rhat above 1.01 indicates that the chains very likely have not mixed
296 of 4000 (7.4%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
[5]:
7
The diagnostics indicated several divergences, which, as I mentioned before, tend to happen in regions where the target distribution has high curvature. We also have poor effective sample sizes for the parameter \(v\), and the R-hats are a touch large.
Let’s look at a plot of the samples, overlaid with the samples we trust. (You can click on the legend to display or hide respective samples.)
[6]:
p.circle(
    samples.posterior.theta.values.flatten(),
    samples.posterior.v.values.flatten(),
    color="#fc8d62",
    alpha=0.3,
    legend_label="default sampling",
)
p.legend.click_policy = "hide"
bokeh.io.show(p)
Stan’s sampler is clearly not penetrating to the lower regions of the funnel. If we did not have the correctly generated independent samples to compare to, we might not ever discover that this is an issue. So how can we clue ourselves in to sampling issues like this?
First, off, the divergences clue us in that there is a problem. We can start to investigate what the chains are doing by taking a graphical approach. We can start with the trace plot.
[7]:
bokeh.io.show(bebi103.viz.trace_plot(samples, pars=['theta', 'v']))
We immediately see a pathology in the trace plot for \(v\). When \(v\) is small, the chains get stuck and keep rejecting steps. They cannot move. This is because the proposal steps keep ending in divergences and the steps cannot be taken.
We can look at this another way using a parallel coordinate plot. To allow for easy comparison, we will apply a transformation to \(\theta\) such that we show its logarithm (of the absolute value). The function bebi103.viz.parcoord_plot() displays divergent samples in orange.
[8]:
transformation = {'theta': lambda x: np.log10(np.abs(x))}
bokeh.io.show(
    bebi103.viz.parcoord_plot(
        samples,
        transformation=transformation,
        divergence_kwargs={"line_width": 1, "line_alpha": 0.15},
    )
)
From the parallel coordinate plot, the divergences come when \(v\) is small and \(\theta\) is close to zero, which is the bottom of the funnel. The log posterior is also high for these divergences. There is substantial probability mass in the funnel, so we do really need to sample it.
As an alternative plot, we can plot the divergent samples in a different color in a scatter plot of our samples. The bebi103.viz.corner() function automatically does this.
[9]:
bokeh.io.show(bebi103.viz.corner(samples, pars=['theta', 'v'], plot_width=200))
The graphical display of divergences, in particular in the colored scatter plots as above and in the parallel coordinate plot help diagnose the problem.
Conquering the Funnel of Hell¶
How can we get our MCMC sampler to get deep into the funnel? The funnel is caused by the variance of the distribution of \(\theta\) getting very small. This narrows the funnel and any step the sampler takes is too large such that it steps out of the funnel. We need to sample down into the funnel to get true samples out of the target distribution.
Adjusting adapt_delta¶
We could try to take the advice of Stan’s warning messages and decrease the adapt_delta parameter to take smaller steps. The default value if 0.8, so let’s crank it up to 0.99 and see if that works.
[10]:
samples = sm.sample(seed=123712, adapt_delta=0.99)
samples = az.from_cmdstanpy(samples)
# Check diagnostics
bebi103.stan.check_all_diagnostics(samples)
# Add plot of samples
p.circle(
    samples.posterior.theta.values.flatten(),
    samples.posterior.v.values.flatten(),
    color="#8da0cb",
    alpha=0.3,
    legend_label="small adapt_delta",
)
bokeh.io.show(p)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4
ESS for parameter v is 292.4367715960507.
tail-ESS for parameter v is 281.93643603821573.
  ESS or tail-ESS below 100 per chain indicates that expectation values
  computed from samples are unlikely to be good approximations of the
  true expectation values.
Rhat for parameter theta is 1.0181941952665776.
Rhat for parameter v is 1.0206428453011591.
  Rank-normalized Rhat above 1.01 indicates that the chains very likely have not mixed
53 of 4000 (1.325%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
That helped. We have far fewer divergences. However, we are still just a bit shy of the bottom of the funnel.
Noncentering¶
Instead of making the sampler sample out of a distribution with tiny variance, we can make it sample out of a distribution that has a more reasonable variance, and then apply a transformation to those samples to get samples from the tiny variance distribution. To devise a strategy for doing this, we use the change of variables formula for probability distributions. Imagine we have a probability distribution of \(\theta\) with probability density function \(\pi(\theta)\). If we wish to instead had a probability density function of another variable \(\tilde{\theta}\), which we can express as a function of \(\theta\), \(\tilde{\theta} = \tilde{\theta}(\theta)\), we need to ensure that \(\pi(\tilde{\theta})\) is normalized,
\begin{align} \int \mathrm{d}\tilde{\theta}\,\pi(\tilde{\theta}) = 1. \end{align}
To relate this integral to the integral of \(\pi(\theta)\), we need to properly change variables in the integral. This leads to the change of variables formula,
\begin{align} \pi(\tilde{\theta}) = \left|\frac{\mathrm{d}\theta}{\mathrm{d}\tilde{\theta}}\right|\,\pi(\theta). \end{align}
Now, if we choose
\begin{align} \tilde{\theta} = \frac{\theta - \mu}{\sigma}, \end{align}
then
\begin{align} \left|\frac{\mathrm{d}\theta}{\mathrm{d}\tilde{\theta}}\right| = \sigma \end{align}
and
\begin{align} \pi(\tilde{\theta}) = \sigma \pi(\theta). \end{align}
If \(\theta\) is Normally distributed with mean \(\mu\) and variance \(\sigma^2\), we have
\begin{align} \pi(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\mathrm{e}^{-(\theta-\mu)^2/2\sigma^2}. \end{align}
Then, to satisfy the change of variables formula,
\begin{align} \pi(\tilde{\theta}) = \frac{1}{\sqrt{2\pi}}\,\mathrm{e}^{-\tilde{\theta}^2/2}. \end{align}
This means that \(\tilde{\theta} \sim \text{Norm}(0, 1)\). Thus, we can reparametrize using the fact that \(\theta \sim \text{Norm}(\mu, \sigma)\) is equivalent to
\begin{align} &\tilde{\theta} \sim \text{Norm}(0, 1),\\[1em] &\theta = \mu + \sigma\,\tilde{\theta}. \end{align}
So, in our case, we can instead sample using \(\tilde{\theta}\) with
\begin{align} &\tilde{\theta} \sim \text{Norm}(0, 1),\\[1em] &\theta = \mathrm{e}^{v/2}\,\tilde{\theta}. \end{align}
This process is called uncentering. A non-centered parametrization has the sampler exploring away from the mean of the target distribution (hence, it is non-centered), and then a transformation ensures that the samples come from the target.
Let’s implement the non-centered parametrization of this pathological distribution in Stan. The Stan code is
parameters {
  real theta_tilde;
  real v;
}
transformed parameters {
  real theta = exp(v/2) * theta_tilde;
}
model {
  v ~ normal(0, 3);
  theta_tilde ~ normal(0, 1);
}
Let’s compile and sample. We won’t bother adjusting adapt_delta; we’ll just see what we get.
[11]:
sm = cmdstanpy.CmdStanModel(stan_file='funnel_noncentered.stan')
samples = sm.sample(seed=23784)
samples = az.from_cmdstanpy(samples)
bebi103.stan.check_all_diagnostics(samples)
INFO:cmdstanpy:stan to c++ (/Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_06/funnel_noncentered.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_06/funnel_noncentered
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
[11]:
0
Excellent! No divergences and all diagnostics check out. Let’s overlay a plot of the samples to see if we got the whole funnel.
[12]:
p.circle(
    samples.posterior.theta.values.flatten(),
    samples.posterior.v.values.flatten(),
    color="#e78ac3",
    alpha=0.3,
    legend_label="non-centered",
)
bokeh.io.show(p)
Look at that! We have managed to sample all the way down the funnel! We have conquered the Funnel of Hell.
Hierarchical models feature a Funnel of Hell¶
It turns out that many hierarchical models feature a Funnel of Hell, so noncentering is ofter crucial. We will explore this in the future lessons.
[13]:
bebi103.stan.clean_cmdstan()
Computing environment¶
[14]:
%load_ext watermark
%watermark -v -p numpy,pandas,cmdstanpy,bokeh,holoviews,bebi103,jupyterlab
CPython 3.7.6
IPython 7.11.1
numpy 1.18.1
pandas 0.24.2
cmdstanpy 0.8.0
bokeh 1.4.0
holoviews 1.12.7
bebi103 0.0.51
jupyterlab 1.2.5