Tutorial 5b: Credible regions

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

*This tutorial was generated from an Jupyter notebook. You can download the notebook here.

In [1]:
import collections

import numpy as np
import pandas as pd
import scipy.stats as st

import pymc3 as pm
import theano.tensor as tt

import bebi103

import bokeh.io
import bokeh.plotting
colors = bokeh.palettes.d3['Category10'][10]
bokeh.io.output_notebook()
Loading BokehJS ...

Performing MCMC calculations gives you full information about the posterior, which you can summarize in beautiful and informative corner plots. But, very often, we wish to summarize the posterior in a few simple numbers. In particular, we wish to report a credible region, the Bayesian analog to a confidence interval. The credible region is a region in parameter space where we might expect a parameter value to lie. This credible region is often reported and plotted as an error bar.

We will consider three commonly used ways of plotting a value plus error bar. We will use a 95% credible region for this demonstration.

  1. mean $\pm$ standard deviation: The most commonly used confidence interval is $\mu \pm k\sigma$, where $k$ is chosen to give the appropriate confidence interval, assuming the posterior is Gaussian. We'll do 95%, which means $k = 1.96$.
  2. median with quantile: The posterior need not be Gaussian. If it is not, we would like a more robust way to summarize it. A simple method is to report the median, and then give lower and upper bounds to the error bar based on quantile. In our case, we would report the 2.5th percentile and the 97.5th percentile.
  3. mode with HPD: This method uses the highest posterior density interval, or HPD. If we're considering a 95% confidence interval, the HPD interval is the shortest interval that contains 95% of the probability of the posterior. So, we report the mode (the most probable parameter value) and then the bounds on the HPD interval.
  4. median with HPD: This is a mix of (2) and (3). We report the HPD, but report the median parameter value from our samples instead of the mode.

To illustrate the relative merits of these reporting schemes, we'll draw MCMC samples out of some artificial distributions.

Some distributions to sample

We will generate some sample distributions to sample using MCMC. We will consider an exponential distribution, a Gaussian, the sum of two Gaussians, and a distribution with a long tail. We choose these to illustrate how various choices of the credible region will be reported.

Let's define the models and draw samples. (Note that MCMC is not necessary here, since these distributions are simple and we can draw out of them using the np.random module, but we use MCMC and the outputs we normally get so we can illustrate how to display MCMC results.)

In [2]:
# Parametrize models
mu = 1.0
sigma = 0.25
mu_2 = 3.0
sigma_2 = 0.5

with pm.Model() as expon:
    x = pm.Exponential('x', mu)
    trace_expon = pm.sample(draws=20000, tune=5000)
    df_expon = bebi103.pm.trace_to_dataframe(trace_expon, log_post=True)
    
with pm.Model() as norm:
    x = pm.Normal('x', mu=mu, sd=sigma)
    trace_norm = pm.sample(draws=20000, tune=5000)
    df_norm = bebi103.pm.trace_to_dataframe(trace_norm, log_post=True)
    
with pm.Model() as two_norms:
    x1_dist = pm.Normal.dist(mu=mu, sd=sigma)
    x2_dist = pm.Normal.dist(mu=mu_2, sd=sigma_2)
    x = pm.Mixture('x', [0.5, 0.5], [x1_dist, x2_dist])
    trace_two_norms = pm.sample(draws=20000, tune=5000)
    df_two_norms = bebi103.pm.trace_to_dataframe(trace_two_norms, log_post=True)

with pm.Model() as heavy_tail:
    def logp(y):
        return pm.distributions.dist_math.bound(tt.log(y / (y + 4.0)**5),
                                                tt.gt(y, 0))
    x = pm.DensityDist('x', logp, testval=1.0)
    trace_heavy_tail = pm.sample(draws=20000, tune=5000)
    df_heavy_tail = bebi103.pm.trace_to_dataframe(trace_heavy_tail, log_post=True)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
 99%|█████████▉| 24813/25000 [00:10<00:00, 2474.66it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 31 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 25000/25000 [00:10<00:00, 2476.17it/s]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 25000/25000 [00:09<00:00, 2641.72it/s]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|█████████▉| 24882/25000 [00:10<00:00, 2365.05it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 0 does not match the target. It is 0.576607297316, but should be close to 0.8. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
100%|██████████| 25000/25000 [00:10<00:00, 2365.65it/s]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
 99%|█████████▉| 24873/25000 [00:10<00:00, 2295.05it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 0 does not match the target. It is 0.678082867202, but should be close to 0.8. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 7336 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 25000/25000 [00:10<00:00, 2296.11it/s]

Let's look at what we got from MCMC.

In [3]:
traces = [trace_expon, trace_norm, trace_two_norms, trace_heavy_tail]
dfs = [df_expon, df_norm, df_two_norms, df_heavy_tail]
names = ['Exponential', 'Gaussian', 'Two Gaussians', 'Heavy tail']
plots = [bebi103.viz.histogram(df['x'], 
                               bins=50,
                               plot_width=350,
                               plot_height=250,
                               x_axis_label='x',
                               y_axis_label='pdf',
                               title=t,
                               line_width=2,
                               color='gray')
             for df, t in zip(dfs[:-1], names[:-1])]

# Make plot for heavy tail (needs different bins because of heavy tail)
plots.append(bebi103.viz.histogram(df_heavy_tail['x'], 
                                   bins=np.arange(51),
                                   plot_width=350,
                                   plot_height=250,
                                   x_axis_label='x',
                                   y_axis_label='pdf',
                                   title='Long tail',
                                   line_width=2,
                                   color='gray'))

# Show in a grid
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Summarizing the MCMC results with error bars

We'll start by computing the mean, standard deviation, median, and quantiles, since these are easy to compute with NumPy.

In [4]:
# DataFrame to store summary stats
df_summary = pd.DataFrame(index=['mean', 'std', '2.5', 'median', '97.5', 
                                 'mode', 'hpd_low', 'hpd_high'],
                          columns=names)

for df, name in zip(dfs, names):
    df_summary.loc['mean', name] = np.mean(df['x'])
    df_summary.loc['std', name] = np.std(df['x'])
    df_summary.loc[['2.5', 'median', '97.5'], name] = np.percentile(
        df['x'], [2.5, 50, 97.5])

Computation of the HPD is a little trickier. The function below will compute the HPD interval. The idea is that we rank-order the MCMC trace. We know that the number of samples that are included in the HPD is 0.95 times the total number of MCMC samples. We then consider all intervals that contain that many samples and find the shortest one.

In [5]:
def hpd(x, mass_frac) :
    """
    Returns highest probability density region given by
    a set of samples.

    Parameters
    ----------
    x : array
        1D array of MCMC samples for a single variable
    mass_frac : float with 0 < mass_frac <= 1
        The fraction of the probability to be included in
        the HPD.  For example, `massfrac` = 0.95 gives a
        95% HPD.
        
    Returns
    -------
    output : array, shape (2,)
        The bounds of the HPD
    """
    # Get sorted list
    d = np.sort(np.copy(x))

    # Number of total samples taken
    n = len(x)
    
    # Get number of samples that should be included in HPD
    n_samples = np.floor(mass_frac * n).astype(int)
    
    # Get width (in units of data) of all intervals with n_samples samples
    int_width = d[n_samples:] - d[:n-n_samples]
    
    # Pick out minimal interval
    min_int = np.argmin(int_width)
    
    # Return interval
    return np.array([d[min_int], d[min_int+n_samples]])

Fortunately, we do not need to use this function, since PyMC3 will compute it for you. Instead of the argument mass_frac above, it has a kwarg, alpha, which is 1 - mass_frac. So, if you wanted a 95% HPD, you would use alpha=0.05. I included the above function to show you how it is done to help you get a better understanding for what the HPD is. (And this function is available as bebi103.pm.hpd().)

In [6]:
for trace, name in zip(traces, names):
    df_summary.loc[['hpd_low', 'hpd_high'], name] = pm.hpd(trace, alpha=0.05)['x']

Finally, we want to add the mode of our samples to our summary statistics. To do that, we would do this:

for df, name in zip(dfs, names):
    df_summary.loc['mode', name] = df.loc[df['log_posterior'].idxmax(), 'x']

In this case, because there is no posterior (we are sampling out of distributions where there are no observations, so PyMC3 knows there is no likelihood, and therefore no posterior), there is no log posterior to calculate and use for finding the mode. Instead, I will hand-enter the modes, which we know to be one for all of the distributions.

In [7]:
df_summary.loc['mode', :] = [0, mu, mu, 1] 

Now, we can take a look at our various summary statistics. (Of course, this data frame is not tidy, but remember, organization for data for display and for analysis are two different things. This data frame is organized for display.)

In [8]:
df_summary
Out[8]:
Exponential Gaussian Two Gaussians Heavy tail
mean 1.00402 0.99681 2.07568 4.29049
std 0.986694 0.249456 1.08514 5.85388
2.5 0.0282695 0.512217 0.593893 0.345304
median 0.702995 0.997995 2.25357 2.57751
97.5 3.6001 1.48035 3.8728 19.0982
mode 0 1 1 1
hpd_low 0.000118046 0.510614 0.507884 0.164978
hpd_high 2.99096 1.47638 3.75082 13.4792

It is easier to visualize these summaries as error bars on plots. There's a bit of code below to generate the plots, but bear with me.

In [9]:
# y_values for bars on plots
y_vals = [[0.8, 0.6, 0.4, 0.2], # Exponential
          [1.33, 1, 0.66, 0.33], # Gaussian
          [0.6, 0.45, 0.3, 0.15], # Two Gaussians
          [0.2, 0.15, 0.1, 0.05]] # Heavy tail

def plot_errorbar(x, y, barx, color, legend, p):
    p.circle([x], [y], size=10, color=color, legend=legend)
    p.line(barx, [y, y], line_width=4, color=color)

for p, name, y in zip(plots, names, y_vals):
    if name == 'Exponential':
        legends = ['mean ± std', 'quantile', 'mode/HPD', 'median/HPD']
    else:
        legends = [None] * 4

    # Mean ± std
    x = df_summary.loc['mean', name]
    barx = x + np.array([1, -1]) * 1.96*df_summary.loc['std', name]
    plot_errorbar(x, y[0], barx, colors[0], legends[0], p)

    # Median with central 95% interval
    x = df_summary.loc['median', name]
    barx = df_summary.loc[['2.5', '97.5'], name]
    plot_errorbar(x, y[1], barx, colors[1], legends[1], p)

    # Mode with HPD
    x = df_summary.loc['mode', name]
    barx = df_summary.loc[['hpd_low', 'hpd_high'], name]
    plot_errorbar(x, y[2], barx, colors[2], legends[2], p)

    # Median with HPD
    x = df_summary.loc['median', name]
    barx = df_summary.loc[['hpd_low', 'hpd_high'], name]
    plot_errorbar(x, y[3], barx, colors[3], legends[3], p)

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Relative merits of each method

  • The mean/std does not respect bounds on the posterior, nor any asymmetry. Unless we are going for speed and using a MAP finder/Gaussian approximation, there is no need for this method or summarizing the posterior.


  • The primary advantage of the quantile approach is that it is very easy to interpret, especially for the researcher uninitiated to Bayesian statistics. It does not suffer from the problems that the mean/std method does. It does not rely on any approximations.


  • The mode/HPD method gives just that: where the parameter value is most likely to fall, which is not necessarily the interquantile region with the median at its center. It is also nice to know the most probable parameter value (the MAP). The drawback is the possible difficulty of interpretability for the uninitiated. Furthermore, I think it places too much emphasis on the MAP. This is not always the most relevant thing to know.


  • The median/HPD method is my personal favorite. It gives the HPD (with the advantages I just stated) and also the median, which to me is a more useful statistic than the mode.


In any case, attempting to describe a multi-modal posterior with an error bar is misleading and futile. A distribution with a long tail can also be deceiving. Even if you report a 95% confidence interval, there is still a 5% chance the parameter value would be reeeeally big.

How to display the summary in text.

One issue that may be worrying you is how to report the asymmetric error bars in text. This is best seen by example. For the example of the long tailed, we may report the median with HPD as $2.5^{+13.4}_{-2.1}$.

Display of fits

In Tutorial 4a, we displayed the theoretical curve given by the MAP parameter values for fitting the spindle size data. In Tutorial 4b, we plotted the theoretical CDF parametrized by the MAP along with the ECDF of the measured mRNA counts. With MCMC samples, we can display many curves on the same plot, which enables a better feel for how tightly constrained the parameter values are.

ECDFs and CDFs

Let's start with the Singer, et al., single molecule FISH data. First, we'll get some samples for the Rex gene and make a corner plot to see how we are doing.

In [10]:
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')
n = df['Rest'].values

with pm.Model() as model:
    # Priors
    r = pm.HalfFlat('r')
    mu = pm.HalfFlat('mu')

    # Likelihood
    n_obs = pm.NegativeBinomial('n_obs', mu=mu, alpha=r, observed=n)

    # Draw the samples
    trace = pm.sample(draws=10000, tune=10000, njobs=2, init='advi+adapt_diag')

    # Convert to data frame
    df_mcmc = bebi103.pm.trace_to_dataframe(trace, log_post=True)
        
    # Compute p
    df_mcmc['p'] = df_mcmc['r'] / (df_mcmc['r'] + df_mcmc['mu'])

# Plot result
g = bebi103.viz.corner(df_mcmc, vars=['r', 'p'], plot_ecdf=True)
bokeh.io.show(g)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 1,383.2:  10%|█         | 20648/200000 [00:13<01:54, 1567.69it/s]
Convergence archived at 20800
Interrupted at 20,799 [10%]: Average Loss = 3,211.2
100%|██████████| 20000/20000 [00:15<00:00, 1298.17it/s]

Now, we can visualize the samples as CDFs laid over the ECDF of the original counts. That is, for each sample of $r$ and $p$, we make a CDF. We plot this CDF as a thin, partially transparent line. Actually we need not do this for 40,000 samples; 400 should suffice.

In [11]:
# Plot the CDF
p = bebi103.viz.ecdf(df['Rest'], x_axis_label='mRNA counts')

# x-values for theoretical CDF
n_plot = np.arange(0, df['Rest'].max()+1)

# Populate plot with theoretical CDFs
for _, row in df_mcmc.loc[::100, ['r', 'p']].iterrows():
    cdf_theor = st.nbinom.cdf(n_plot, row['r'], row['p'])    
    p.line(n_plot, cdf_theor, line_width=0.5, alpha=0.05, color='tomato')
    
bokeh.io.show(p)

This is a convenient method of showing what you might get, given knowledge of the parameter values and, of course, the assumption that the model is correct.

Regressions

We will now to the same thing with the data set pertaining to spindle size. Let's get our samples first.

In [12]:
# Load in Data Frame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')

# Get Numpy arrays for speed
d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values

def spindle_length(d, gamma, phi):
    """Spindle length as a function of droplet diameter."""
    return gamma * d / (1 + (d/phi)**3)**(1/3)

with pm.Model() as model:
        # Priors
        gamma = pm.Uniform('gamma', lower=0, upper=1)
        phi = bebi103.pm.Jeffreys('phi', lower=1, upper=100)
        
        # Compute spindle length
        ell_theor = spindle_length(d, gamma, phi)
        
        # Likelihood
        ell_obs = bebi103.pm.MarginalizedHomoscedasticNormal('ell_obs', mu=ell_theor,
                                                             observed=ell)
        
        trace = pm.sample(draws=10000, tune=10000, init='advi+adapt_diag', njobs=2)
        
        df_mcmc = bebi103.pm.trace_to_dataframe(trace)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 1,899.5:   9%|▉         | 17995/200000 [00:11<02:00, 1505.09it/s]
Convergence archived at 18100
Interrupted at 18,099 [9%]: Average Loss = 2,382.6
100%|██████████| 20000/20000 [00:43<00:00, 460.04it/s]

As we did with the theoretical CDFs, we can plot the regression lines with the data.

In [13]:
# Plot the data
p = bokeh.plotting.figure(plot_width=450,
                          plot_height=350,
                          x_axis_label='droplet diameter (µm)',
                          y_axis_label='spindle length (µm)')
p.circle(df['Droplet Diameter (um)'], df['Spindle Length (um)'], alpha=0.3)

# x-values for theoretical curve
d = np.linspace(0, 250, 200)

# Populate plot with theoretical curves
for _, row in df_mcmc.loc[::100, ['gamma', 'phi']].iterrows():
    y = spindle_length(d, row['gamma'], row['phi'])
    p.line(d, y, line_width=0.5, alpha=0.05, color='tomato')
    
bokeh.io.show(p)

This demonstrates that although the data have a wide spread, the parameter values that describe them under the conserved tubulin model are tightly constrained. This nicely complements the HPD.

In [22]:
print('γ 95% HPD: [{0:.2f}, {1:.2f}]'.format(*pm.hpd(df_mcmc['gamma'])))
print('φ 95% HPD: [{0:.2f}, {1:.2f}] µm'.format(*pm.hpd(df_mcmc['phi'])))
γ 95% HPD: [0.83, 0.90]
φ 95% HPD: [41.99, 47.05] µm

The plot shows how much play in the theoretical curve is allowed for the ranges specified by the HPD.

Conclusions

Parameter estimates are often summaries as some sort of central value with an error bar. There are several ways you could construct these summaries. I prefer reporting the median with the HPD (the median is guaranteed to be within the HPD provided you ask for more than 50% of the total probability). I think it is quite clear that unless you expect your parameter to be Gaussian distributed, reporting a mean ± standard deviation is a bad idea.