(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
*This tutorial was generated from an Jupyter notebook. You can download the notebook here.
import collections
import numpy as np
import pandas as pd
import scipy.stats as st
import pymc3 as pm
import theano.tensor as tt
import bebi103
import bokeh.io
import bokeh.plotting
colors = bokeh.palettes.d3['Category10'][10]
bokeh.io.output_notebook()
Performing MCMC calculations gives you full information about the posterior, which you can summarize in beautiful and informative corner plots. But, very often, we wish to summarize the posterior in a few simple numbers. In particular, we wish to report a credible region, the Bayesian analog to a confidence interval. The credible region is a region in parameter space where we might expect a parameter value to lie. This credible region is often reported and plotted as an error bar.
We will consider three commonly used ways of plotting a value plus error bar. We will use a 95% credible region for this demonstration.
To illustrate the relative merits of these reporting schemes, we'll draw MCMC samples out of some artificial distributions.
We will generate some sample distributions to sample using MCMC. We will consider an exponential distribution, a Gaussian, the sum of two Gaussians, and a distribution with a long tail. We choose these to illustrate how various choices of the credible region will be reported.
Let's define the models and draw samples. (Note that MCMC is not necessary here, since these distributions are simple and we can draw out of them using the np.random
module, but we use MCMC and the outputs we normally get so we can illustrate how to display MCMC results.)
# Parametrize models
mu = 1.0
sigma = 0.25
mu_2 = 3.0
sigma_2 = 0.5
with pm.Model() as expon:
x = pm.Exponential('x', mu)
trace_expon = pm.sample(draws=20000, tune=5000)
df_expon = bebi103.pm.trace_to_dataframe(trace_expon, log_post=True)
with pm.Model() as norm:
x = pm.Normal('x', mu=mu, sd=sigma)
trace_norm = pm.sample(draws=20000, tune=5000)
df_norm = bebi103.pm.trace_to_dataframe(trace_norm, log_post=True)
with pm.Model() as two_norms:
x1_dist = pm.Normal.dist(mu=mu, sd=sigma)
x2_dist = pm.Normal.dist(mu=mu_2, sd=sigma_2)
x = pm.Mixture('x', [0.5, 0.5], [x1_dist, x2_dist])
trace_two_norms = pm.sample(draws=20000, tune=5000)
df_two_norms = bebi103.pm.trace_to_dataframe(trace_two_norms, log_post=True)
with pm.Model() as heavy_tail:
def logp(y):
return pm.distributions.dist_math.bound(tt.log(y / (y + 4.0)**5),
tt.gt(y, 0))
x = pm.DensityDist('x', logp, testval=1.0)
trace_heavy_tail = pm.sample(draws=20000, tune=5000)
df_heavy_tail = bebi103.pm.trace_to_dataframe(trace_heavy_tail, log_post=True)
Let's look at what we got from MCMC.
traces = [trace_expon, trace_norm, trace_two_norms, trace_heavy_tail]
dfs = [df_expon, df_norm, df_two_norms, df_heavy_tail]
names = ['Exponential', 'Gaussian', 'Two Gaussians', 'Heavy tail']
plots = [bebi103.viz.histogram(df['x'],
bins=50,
plot_width=350,
plot_height=250,
x_axis_label='x',
y_axis_label='pdf',
title=t,
line_width=2,
color='gray')
for df, t in zip(dfs[:-1], names[:-1])]
# Make plot for heavy tail (needs different bins because of heavy tail)
plots.append(bebi103.viz.histogram(df_heavy_tail['x'],
bins=np.arange(51),
plot_width=350,
plot_height=250,
x_axis_label='x',
y_axis_label='pdf',
title='Long tail',
line_width=2,
color='gray'))
# Show in a grid
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))
We'll start by computing the mean, standard deviation, median, and quantiles, since these are easy to compute with NumPy.
# DataFrame to store summary stats
df_summary = pd.DataFrame(index=['mean', 'std', '2.5', 'median', '97.5',
'mode', 'hpd_low', 'hpd_high'],
columns=names)
for df, name in zip(dfs, names):
df_summary.loc['mean', name] = np.mean(df['x'])
df_summary.loc['std', name] = np.std(df['x'])
df_summary.loc[['2.5', 'median', '97.5'], name] = np.percentile(
df['x'], [2.5, 50, 97.5])
Computation of the HPD is a little trickier. The function below will compute the HPD interval. The idea is that we rank-order the MCMC trace. We know that the number of samples that are included in the HPD is 0.95 times the total number of MCMC samples. We then consider all intervals that contain that many samples and find the shortest one.
def hpd(x, mass_frac) :
"""
Returns highest probability density region given by
a set of samples.
Parameters
----------
x : array
1D array of MCMC samples for a single variable
mass_frac : float with 0 < mass_frac <= 1
The fraction of the probability to be included in
the HPD. For example, `massfrac` = 0.95 gives a
95% HPD.
Returns
-------
output : array, shape (2,)
The bounds of the HPD
"""
# Get sorted list
d = np.sort(np.copy(x))
# Number of total samples taken
n = len(x)
# Get number of samples that should be included in HPD
n_samples = np.floor(mass_frac * n).astype(int)
# Get width (in units of data) of all intervals with n_samples samples
int_width = d[n_samples:] - d[:n-n_samples]
# Pick out minimal interval
min_int = np.argmin(int_width)
# Return interval
return np.array([d[min_int], d[min_int+n_samples]])
Fortunately, we do not need to use this function, since PyMC3 will compute it for you. Instead of the argument mass_frac
above, it has a kwarg, alpha
, which is 1 - mass_frac
. So, if you wanted a 95% HPD, you would use alpha=0.05
. I included the above function to show you how it is done to help you get a better understanding for what the HPD is. (And this function is available as bebi103.pm.hpd()
.)
for trace, name in zip(traces, names):
df_summary.loc[['hpd_low', 'hpd_high'], name] = pm.hpd(trace, alpha=0.05)['x']
Finally, we want to add the mode of our samples to our summary statistics. To do that, we would do this:
for df, name in zip(dfs, names):
df_summary.loc['mode', name] = df.loc[df['log_posterior'].idxmax(), 'x']
In this case, because there is no posterior (we are sampling out of distributions where there are no observations, so PyMC3 knows there is no likelihood, and therefore no posterior), there is no log posterior to calculate and use for finding the mode. Instead, I will hand-enter the modes, which we know to be one for all of the distributions.
df_summary.loc['mode', :] = [0, mu, mu, 1]
Now, we can take a look at our various summary statistics. (Of course, this data frame is not tidy, but remember, organization for data for display and for analysis are two different things. This data frame is organized for display.)
df_summary
It is easier to visualize these summaries as error bars on plots. There's a bit of code below to generate the plots, but bear with me.
# y_values for bars on plots
y_vals = [[0.8, 0.6, 0.4, 0.2], # Exponential
[1.33, 1, 0.66, 0.33], # Gaussian
[0.6, 0.45, 0.3, 0.15], # Two Gaussians
[0.2, 0.15, 0.1, 0.05]] # Heavy tail
def plot_errorbar(x, y, barx, color, legend, p):
p.circle([x], [y], size=10, color=color, legend=legend)
p.line(barx, [y, y], line_width=4, color=color)
for p, name, y in zip(plots, names, y_vals):
if name == 'Exponential':
legends = ['mean ± std', 'quantile', 'mode/HPD', 'median/HPD']
else:
legends = [None] * 4
# Mean ± std
x = df_summary.loc['mean', name]
barx = x + np.array([1, -1]) * 1.96*df_summary.loc['std', name]
plot_errorbar(x, y[0], barx, colors[0], legends[0], p)
# Median with central 95% interval
x = df_summary.loc['median', name]
barx = df_summary.loc[['2.5', '97.5'], name]
plot_errorbar(x, y[1], barx, colors[1], legends[1], p)
# Mode with HPD
x = df_summary.loc['mode', name]
barx = df_summary.loc[['hpd_low', 'hpd_high'], name]
plot_errorbar(x, y[2], barx, colors[2], legends[2], p)
# Median with HPD
x = df_summary.loc['median', name]
barx = df_summary.loc[['hpd_low', 'hpd_high'], name]
plot_errorbar(x, y[3], barx, colors[3], legends[3], p)
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))
In any case, attempting to describe a multi-modal posterior with an error bar is misleading and futile. A distribution with a long tail can also be deceiving. Even if you report a 95% confidence interval, there is still a 5% chance the parameter value would be reeeeally big.
One issue that may be worrying you is how to report the asymmetric error bars in text. This is best seen by example. For the example of the long tailed, we may report the median with HPD as $2.5^{+13.4}_{-2.1}$.
In Tutorial 4a, we displayed the theoretical curve given by the MAP parameter values for fitting the spindle size data. In Tutorial 4b, we plotted the theoretical CDF parametrized by the MAP along with the ECDF of the measured mRNA counts. With MCMC samples, we can display many curves on the same plot, which enables a better feel for how tightly constrained the parameter values are.
Let's start with the Singer, et al., single molecule FISH data. First, we'll get some samples for the Rex gene and make a corner plot to see how we are doing.
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')
n = df['Rest'].values
with pm.Model() as model:
# Priors
r = pm.HalfFlat('r')
mu = pm.HalfFlat('mu')
# Likelihood
n_obs = pm.NegativeBinomial('n_obs', mu=mu, alpha=r, observed=n)
# Draw the samples
trace = pm.sample(draws=10000, tune=10000, njobs=2, init='advi+adapt_diag')
# Convert to data frame
df_mcmc = bebi103.pm.trace_to_dataframe(trace, log_post=True)
# Compute p
df_mcmc['p'] = df_mcmc['r'] / (df_mcmc['r'] + df_mcmc['mu'])
# Plot result
g = bebi103.viz.corner(df_mcmc, vars=['r', 'p'], plot_ecdf=True)
bokeh.io.show(g)
Now, we can visualize the samples as CDFs laid over the ECDF of the original counts. That is, for each sample of $r$ and $p$, we make a CDF. We plot this CDF as a thin, partially transparent line. Actually we need not do this for 40,000 samples; 400 should suffice.
# Plot the CDF
p = bebi103.viz.ecdf(df['Rest'], x_axis_label='mRNA counts')
# x-values for theoretical CDF
n_plot = np.arange(0, df['Rest'].max()+1)
# Populate plot with theoretical CDFs
for _, row in df_mcmc.loc[::100, ['r', 'p']].iterrows():
cdf_theor = st.nbinom.cdf(n_plot, row['r'], row['p'])
p.line(n_plot, cdf_theor, line_width=0.5, alpha=0.05, color='tomato')
bokeh.io.show(p)
This is a convenient method of showing what you might get, given knowledge of the parameter values and, of course, the assumption that the model is correct.
We will now to the same thing with the data set pertaining to spindle size. Let's get our samples first.
# Load in Data Frame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')
# Get Numpy arrays for speed
d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values
def spindle_length(d, gamma, phi):
"""Spindle length as a function of droplet diameter."""
return gamma * d / (1 + (d/phi)**3)**(1/3)
with pm.Model() as model:
# Priors
gamma = pm.Uniform('gamma', lower=0, upper=1)
phi = bebi103.pm.Jeffreys('phi', lower=1, upper=100)
# Compute spindle length
ell_theor = spindle_length(d, gamma, phi)
# Likelihood
ell_obs = bebi103.pm.MarginalizedHomoscedasticNormal('ell_obs', mu=ell_theor,
observed=ell)
trace = pm.sample(draws=10000, tune=10000, init='advi+adapt_diag', njobs=2)
df_mcmc = bebi103.pm.trace_to_dataframe(trace)
As we did with the theoretical CDFs, we can plot the regression lines with the data.
# Plot the data
p = bokeh.plotting.figure(plot_width=450,
plot_height=350,
x_axis_label='droplet diameter (µm)',
y_axis_label='spindle length (µm)')
p.circle(df['Droplet Diameter (um)'], df['Spindle Length (um)'], alpha=0.3)
# x-values for theoretical curve
d = np.linspace(0, 250, 200)
# Populate plot with theoretical curves
for _, row in df_mcmc.loc[::100, ['gamma', 'phi']].iterrows():
y = spindle_length(d, row['gamma'], row['phi'])
p.line(d, y, line_width=0.5, alpha=0.05, color='tomato')
bokeh.io.show(p)
This demonstrates that although the data have a wide spread, the parameter values that describe them under the conserved tubulin model are tightly constrained. This nicely complements the HPD.
print('γ 95% HPD: [{0:.2f}, {1:.2f}]'.format(*pm.hpd(df_mcmc['gamma'])))
print('φ 95% HPD: [{0:.2f}, {1:.2f}] µm'.format(*pm.hpd(df_mcmc['phi'])))
The plot shows how much play in the theoretical curve is allowed for the ranges specified by the HPD.
Parameter estimates are often summaries as some sort of central value with an error bar. There are several ways you could construct these summaries. I prefer reporting the median with the HPD (the median is guaranteed to be within the HPD provided you ask for more than 50% of the total probability). I think it is quite clear that unless you expect your parameter to be Gaussian distributed, reporting a mean ± standard deviation is a bad idea.