Tutorial 7b: Display of MCMC results¶

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from an Jupyter notebook. You can download the notebook here.

import numpy as np
import pandas as pd
import scipy.stats as st

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

NOTE: You will need to update the bebi103 module for this tutorial. Do: pip install --upgrade bebi103.

In the previous tutorial, we learned how to use Stan to sample out of posterior distributions specified by a generative model. In this tutorial, we will learn about some techniques to visualize the results. As our first working example, we will consider again the conserved tubulin spindle size model, discussed at length in tutorial 6b, using the data set from Good, et al., Science, 342, 856-860, 2013.

The model and samples¶

We will briefly repeat the sampling of the model as we did in the previous tutorial. As a reminder, the model is

\begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\gamma \sim \text{Beta}(2, 2), \\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\sigma = \sigma_0\,\phi,\\[1em] &\mu = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu, \sigma) \;\forall i. \end{align}

This is coded up in Stan and compiled as follows. (Reminder: You should have your Stan codes in their own .stan file. I am embedding them as strings here so the notebook is stand-alone.)

model_code = """
functions {
  real ell_theor(real d, real phi, real gamma) {
    real denom_ratio = (gamma * d / phi)^3;
    return gamma * d / (1 + denom_ratio)^(1.0 / 3.0); 
  }
}


data {
  int N;
  real d[N];
  real ell[N];
}


parameters {
  real phi;
  real gamma;
  real sigma_0;
}


transformed parameters {
  real sigma = sigma_0 * phi;
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  gamma ~ beta(2.0, 2.0);
  sigma_0 ~ gamma(2.0, 10.0);
  
  for (i in 1:N) {
    ell[i] ~ normal(ell_theor(d[i], phi, gamma), sigma);
  }
}
"""

sm = bebi103.stan.StanModel(model_code=model_code)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_7bc07d175dc2621737aa512411914de8 NOW.
/Users/Justin/anaconda3/lib/python3.7/site-packages/Cython/Compiler/Main.py:367: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /var/folders/y7/r162f1fd0zv1vzxkb5nktz300000gn/T/tmpu9au2phs/stanfit4anon_model_7bc07d175dc2621737aa512411914de8_2274355879154270300.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)

Now, we can load in the data set and get our samples.

# Load in data frame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')

# Set up the data dictionary
data = dict(N=len(df),
            d=df['Droplet Diameter (um)'].values,
            ell=df['Spindle Length (um)'].values)

# Sample!
samples = sm.sampling(data=data)

/Users/Justin/anaconda3/lib/python3.7/site-packages/pystan/misc.py:399: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  elif np.issubdtype(np.asarray(v).dtype, float):

Examining traces¶

The first type of visualization we will explore is useful for diagnosing potential problems with the sampler.

Trace plots¶

Trace plots are a common way of visualizing the trajectory a sampler takes through parameter space. We plot the value of the parameter versus step number. You can make these plots using bebi103.viz.trace_plot(). By default, the traces are colored according to chain number. In our sampling we just did, we used Stan's default of four chains.

bokeh.io.show(bebi103.viz.trace_plot(samples, pars=['phi', 'gamma', 'sigma_0']))

These trace plots look pretty good; the sampler is bounding around a central value. Trace plots are very common, but are not particularly useful. Here is what Dan Simpson has to say about trace plots.

They are pedagogically useful in that you can also visualize how the sampler warms up. You can see the warmup steps as well using the inc_warmup kwarg.

bokeh.io.show(bebi103.viz.trace_plot(samples, 
                                     pars=['phi', 'gamma', 'sigma_0'], 
                                     inc_warmup=True))

We see that the sampler jumps around a bit before it settles into stationarity sampling the target distribution (the posterior). This is why we disregard the warmup steps when using our samples.

Parallel coordinate plots¶

As Dan Simpson pointed out in his talk I linked to above, plots that visualize diagnostics effectively are better. We will talk more about diagnostics of MCMC in a later tutorial, but for now, I will display one of the diagnostic plots Dan showed in his talk. Here, we put the parameter names on the x-axis, and we plot the values of the parameters on the y-axis. To make sure we compare things of the same magnitude, we center the parameter values by their median and then scale them by the width of the middle 95th percentile of the samples. Each line in the plot represents a single sample.

# Apply a center-and-scale transformation for each of comparison
transformation = lambda x: (x - np.mean(x)) / np.std(x)

bokeh.io.show(bebi103.viz.parcoord_plot(samples, 
                                        transformation=transformation, 
                                        pars=['phi', 'gamma', 'sigma_0']))

Normally, samples with problems are plotted in another color to help diagnose the problems. In this particular set of samples, there were no problems, so everything looks fine.

Interestingly, the "neck" between phi and gamma is indicative of anticorrelation. When φ is high, γ is low, and vice-versa. We also include lp__, which is the log posterior value for the sample.

Plots of marginalized distributions¶

We are often interested in displaying a marginal distribution for a single parameter. We may do this by directly plotting the ECDF of the samples for that parameter. To do this, we first need to convert the samples to a data frame.

df_mcmc = bebi103.stan.to_dataframe(samples)

bokeh.io.show(bebi103.viz.ecdf(df_mcmc['phi'],
                               x_axis_label='φ (µm)',
                               y_axis_label='posterior ECDF'))

We could also plot this as a histogram to show what the PDF looks like.

bokeh.io.show(bebi103.viz.histogram(df_mcmc['phi'],
                                    bins=int(np.sqrt(len(df))),
                                    line_width=2,
                                    density=True,
                                    x_axis_label='φ (µm)',
                                    y_axis_label='g(φ|y)'))

We can of course do the same for the parameter γ. We can also plot the two-dimensional distribution of φ and γ, which we showed in the previous tutorial. That involves plotting each sample as a semitransparent dot in a 2D scatter plot.

Corner plots¶

It would be convenient to plot all visualizable marginal distributions (that is one- and two-dimensional distribution). We can conveniently do that with a corner plot.

bokeh.io.show(bebi103.viz.corner(samples, 
                                 pars=['phi', 'gamma', 'sigma_0'],
                                 labels=['φ (µm)', 'γ', 'σ₀ (µm)']))

This is a nice way to summarize the posterior and is useful for visualizing how various parameters covary.

Reporting summaries of MCMC samples¶

Performing MCMC calculations gives you full information about the posterior, which you can summarize in beautiful and informative corner plots. But very often, we wish to summarize the posterior in a few simple numbers. In particular, we wish to report a credible region, the Bayesian analog to a confidence interval, for a single parameter. This is computed from the marginalized posterior. The credible region is a region in parameter space where we might expect a parameter value to lie. This credible region is often reported and plotted as an error bar.

We will consider three commonly used ways of plotting a value plus error bar. We will use a 95% credible region for this demonstration.

mean $\pm$ standard deviation: The most commonly used confidence interval is $\mu \pm k\sigma$, where $k$ is chosen to give the appropriate confidence interval, assuming the posterior is Gaussian. We'll do 95%, which means $k = 1.96$.
median with quantile: The posterior need not be Gaussian. If it is not, we would like a more robust way to summarize it. A simple method is to report the median, and then give lower and upper bounds to the error bar based on quantile. In our case, we would report the 2.5th percentile and the 97.5th percentile.
mode with HPD: This method uses the highest posterior density interval, or HPD. If we're considering a 95% confidence interval, the HPD interval is the shortest interval that contains 95% of the probability of the posterior. So, we report the mode (the most probable parameter value) and then the bounds on the HPD interval.
median with HPD: We report the HPD, but report the median parameter value from our samples instead of the mode.

To illustrate the relative merits of these reporting schemes, we'll draw MCMC samples out of some artificial distributions.

Some distributions to sample¶

We will generate some sample distributions to sample using MCMC. We will consider an exponential distribution, a Gaussian, the sum of two Gaussians, and a distribution with a long tail. We choose these to illustrate how various choices of the credible region will be reported.

Let's define the models and draw samples. MCMC is not necessary here for this illustrative exercise, so I will just draw the samples using Numpy.

# Parametrize models
mu = 1.0
sigma = 0.25
mu_2 = np.array([mu, 3.0])
sigma_2 = np.array([sigma, 0.5])

# Draw out of the distributions, start with exponential
x_expon = np.random.exponential(mu, size=15000)

# Normal
x_norm = np.random.normal(mu, sigma, size=15000)

# Bimodal mixture of Normals
which_norm = np.random.choice([0, 1], 15000)
x_2norm = np.random.normal(mu_2[which_norm], sigma_2[which_norm])

# Heavy tailed Pareto
x_heavytail = np.random.pareto(1.2, size=15000)

# Store samples in a list for easy access
samples = [x_expon, x_norm, x_2norm, x_heavytail]

Let's look at what we got by plotting the histograms.

names = ['Exponential', 'Gaussian', 'Two Gaussians', 'Heavy tail']
plots = [bebi103.viz.histogram(x,
                               density=True,
                               bins=50,
                               plot_width=350,
                               plot_height=250,
                               x_axis_label='x',
                               y_axis_label='pdf',
                               title=t,
                               line_width=2,
                               color='gray')
             for x, t in zip(samples[:-1], names[:-1])]

# Make plot for heavy tail (needs different bins because of heavy tail)
plots.append(bebi103.viz.histogram(samples[-1],
                                   density=True,
                                   bins=np.arange(21),
                                   plot_width=350,
                                   plot_height=250,
                                   x_axis_label='x',
                                   y_axis_label='pdf',
                                   title='Heavy tail',
                                   line_width=2,
                                   color='gray'))

# Show in a grid
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

/Users/Justin/Dropbox/git/bebi103/bebi103/viz.py:289: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if bins == 'exact':
/Users/Justin/Dropbox/git/bebi103/bebi103/viz.py:297: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  elif bins == 'integer':

Summarizing the "MCMC" results with error bars¶

We'll start by computing the mean, standard deviation, median, and quantiles, since these are easy to compute with NumPy.

# DataFrame to store summary stats
df_summary = pd.DataFrame(index=['mean', 'std', '2.5', 'median', '97.5', 
                                 'mode', 'hpd_low', 'hpd_high'],
                          columns=names)

for x, name in zip(samples, names):
    df_summary.loc['mean', name] = np.mean(x)
    df_summary.loc['std', name] = np.std(x)
    df_summary.loc[['2.5', 'median', '97.5'], name] = np.percentile(x, [2.5, 50, 97.5])

Computation of the HPD is a little trickier. The idea is that we rank-order the MCMC trace. We know that the number of samples that are included in the HPD is 0.95 times the total number of MCMC samples. We then consider all intervals that contain that many samples and find the shortest one. The bebi103.stan.hpd() function does this.

for x, name in zip(samples, names):
    df_summary.loc[['hpd_low', 'hpd_high'], name] = bebi103.stan.hpd(x, mass_frac=0.95)

Finally, we want to add the mode of our samples to our summary statistics. We could do that from MCMC samples using the log posterior (lp__) values, and find the parameter values for which lp__ is maximal. But since we quickly generated our samples using Numpy, we do not have those, so I will hand-code the known modes.

df_summary.loc['mode', :] = [0, mu, mu, 1]

Now, we can take a look at our various summary statistics. (Of course, this data frame is not tidy, but remember, organization for data for display and for analysis are two different things. This data frame is organized for display.)

df_summary

It is easier to visualize these summaries as error bars on plots. There's a bit of code below to generate the plots, but bear with me.

# y_values for bars on plots
y_vals = [[0.8, 0.6, 0.4, 0.2], # Exponential
          [1.33, 1, 0.66, 0.33], # Gaussian
          [0.6, 0.45, 0.3, 0.15], # Two Gaussians
          [0.2, 0.15, 0.1, 0.05]] # Heavy tail

# Color scheme
colors = bokeh.palettes.Category10[10]

def plot_errorbar(x, y, barx, color, legend, p):
    p.circle([x], [y], size=10, color=color, legend=legend)
    p.line(barx, [y, y], line_width=4, color=color)

for p, name, y in zip(plots, names, y_vals):
    if name == 'Exponential':
        legends = ['mean ± std', 'quantile', 'mode/HPD', 'median/HPD']
    else:
        legends = [None] * 4

    # Mean ± std
    x = df_summary.loc['mean', name]
    barx = x + np.array([1, -1]) * 1.96*df_summary.loc['std', name]
    plot_errorbar(x, y[0], barx, colors[0], legends[0], p)

    # Median with central 95% interval
    x = df_summary.loc['median', name]
    barx = df_summary.loc[['2.5', '97.5'], name]
    plot_errorbar(x, y[1], barx, colors[1], legends[1], p)

    # Mode with HPD
    x = df_summary.loc['mode', name]
    barx = df_summary.loc[['hpd_low', 'hpd_high'], name]
    plot_errorbar(x, y[2], barx, colors[2], legends[2], p)

    # Median with HPD
    x = df_summary.loc['median', name]
    barx = df_summary.loc[['hpd_low', 'hpd_high'], name]
    plot_errorbar(x, y[3], barx, colors[3], legends[3], p)

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Relative merits of each method¶

The mean/std does not respect bounds on the posterior, nor any asymmetry. Unless we are going for speed and using a MAP finder/Gaussian approximation, there is no need for this method or summarizing the posterior.

The primary advantage of the quantile approach is that it is very easy to interpret, especially for the researcher uninitiated to Bayesian statistics. It does not suffer from the problems that the mean/std method does. It does not rely on any approximations.

The mode/HPD method gives just that: where the parameter value is most likely to fall, which is not necessarily the interquantile region with the median at its center. It is also nice to know the most probable parameter value (the MAP). The drawback is the possible difficulty of interpretability for the uninitiated. Furthermore, I think it places too much emphasis on the MAP. This is not always the most relevant thing to know.

The median/HPD method is my personal favorite. It gives the HPD (with the advantages I just stated) and also the median, which to me is a more useful statistic than the mode.

In any case, attempting to describe a multi-modal posterior with an error bar is misleading and futile. A distribution with a long tail can also be deceiving. Even if you report a 95% confidence interval, there is still a 5% chance the parameter value would be reeeeally big.

How to display the summary in text.¶

One issue that may be worrying you is how to report the asymmetric error bars in text. This is best seen by example. For the example of the Exponential, we may report the median with HPD as $0.71^{+2.31}_{-0.70}$.

Computing environment¶

%load_ext watermark

%watermark -v -p numpy,pandas,scipy,pystan,bokeh,bebi103,jupyterlab

CPython 3.7.0
IPython 7.1.1

numpy 1.15.4
pandas 0.23.4
scipy 1.1.0
pystan 2.17.1.0
bokeh 1.0.1
bebi103 0.0.35
jupyterlab 0.35.3

	Exponential	Gaussian	Two Gaussians	Heavy tail
mean	1.01216	0.995102	2.00231	7.55238
std	1.01477	0.248925	1.07234	350.634
2.5	0.0266582	0.514143	0.590876	0.0208824
median	0.698712	0.995621	1.7314	0.777857
97.5	3.7479	1.47958	3.80221	23.5115
mode	0	1	1	1
hpd_low	7.11723e-05	0.514147	0.524746	6.08679e-05
hpd_high	3.02962	1.4796	3.71739	11.8243