Tutorial 5a: Parameter estimation with Markov chain Monte Carlo

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as st

# Our main MCMC package
import pymc3 as pm
import theano.tensor as tt
import theano

# BE/Bi 103 utils
import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

# Some of PyMC3's output uses Matplotlib
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
Loading BokehJS ...

In this tutorial, we will learn how to use Markov chain Monte Carlo to do parameter estimation. To get the basic idea behind MCMC, imagine for a moment that we can draw samples out of the posterior distribution. This means that the probability of choosing given values of a set of parameters is proportional to the posterior probability of that set of values. If we drew many many such samples, we could reconstruct the posterior from the samples, e.g., by making histograms. That's a big thing to imagine: that we can draw properly weighted samples. But, it turns out that we can! That is what MCMC allows us to do.

We will discuss the theory behind this seemingly miraculous capability in lecture. For this tutorial, we will just use the fact that we can do the sampling to learn about posterior distributions in the context of parameter estimation.

Our MCMC engines

We will use PyMC3 as our main engine for performing MCMC. This package, along with PyStan, Edward, and emcee (formerly known as "MCMC Hammer," clearly the best name of a software package ever), represents the state of the art for MCMC. Each has its relative advantages and disadvantages, and I think PyMC3 has the most going for it.

  1. It is actively developed.
  2. It has nice, concise syntax for specifying your statistical models.
  3. It employs state-of-the-art sampling, namely the No-U-Turn Sampler (NUTS) for continuous parameters.
  4. It uses Theano, which results in fast, compiled sampling.
  5. It can automatically find starting positions for the sampler.
  6. It has built-in analyses of performance.
  7. It has built-in information-based assessment for model selection.

We will talk about these items in more detail as we go along. We will cover item 7 when we talk about model selection.

That said, the other packages are also very good and very powerful. I especially like emcee, which we will cover in auxiliary lessons.

To install PyMC3, we should do

conda install -c conda-forge pymc3
conda upgrade pymc3

if you have not already. You should also make sure you have the most up-to-date version of the bebi103 module.

pip install bebi103 --upgrade

Without further ado, let's start doing some statistical inference with PyMC3!

The data set

We will use the data set from the previous tutorial from Singer, et al., while it is still fresh in our minds. Recall that this data set has mRNA counts for four genes in a population of cells. First, we'll load in the data.

In [2]:
# Load DataFrame
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')

As a reminder, here are the ECDFs.

In [3]:
# Make a list of plots
plots = [bebi103.viz.ecdf(df[gene],
                          x_axis_label='mRNA count',
                          plot_width=300, 
                          plot_height=200,
                          title=gene)
         for gene in df]

# Show them as a grid
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Analysis of a Negative Binomially distributed mRNA count

Recall from the previous tutorial that the mRNA counts are Negative Binomially distributed for some of the genes, but not Rex1. We will start our foray into MCMC by studying the Rest gene, whose mRNA copy number we assume to be Negative Binomially distributed.

The statistical model

As we did last time, we use a Negative Binomial likelihood. Let $n$ be the copy number of a given gene in a given cell. Then the likelihood is

\begin{align} f(n\mid r,p) = \frac{(n + r - 1)!}{n!(r-1)!}\,p^r(1-p)^n, \end{align}

where $r$ is the frequency of bursts in gene expression relative to the decay rate of mRNA and $p$ is related to the number of transcripts in each burst, the burst size $b = (1-p)/p$. These two parameters, $r$ and $p$ are what we are trying to estimate.

Since each measurement is independent, the likelihood of all measurements for a given gene is

\begin{align} f(\mathbf{n}\mid r,p) = \prod_i f(n_i\mid r,p) = \prod_i \frac{(n_i + r - 1)!}{n_i!(r-1)!}\,p^r(1-p)^{n_i}. \end{align}

We assume that $r$ and $p$ are independent, so that $g(r,p) = g(r)g(p)$. For $p$, we can choose a Uniform prior on the domain, $0 \le p \le 1$,

\begin{align} g(p) = \left\{\begin{array}{lll} 1 && 0 \le p \le 1 \\[0.5em] 0 && \text{otherwise}. \end{array} \right. \end{align}

Note that if we instead wanted to parametrize using $b$ instead of $p$, this implies that the prior on $b$ is

\begin{align} g(b) = \frac{1}{(1+b)^2}, \end{align}

defined for all nonnegative $b$. This was computed using the change of variables formula.
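Explicitly, since $b = (1-p)/p$ implies $p = 1/(1+b)$, the change of variables formula gives

\begin{align} g(b) = g(p)\left|\frac{\mathrm{d}p}{\mathrm{d}b}\right| = \frac{1}{(1+b)^2}. \end{align}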

For the prior of $r$, we might also choose a Uniform prior, this time on the domain $r_\mathrm{min} < r \le r_\mathrm{max}$, where $r_\mathrm{min} \ge 0$.

\begin{align} g(r) = \left\{\begin{array}{lll} 1/(r_\mathrm{max} - r_\mathrm{min}) && r_\mathrm{min} < r \le r_\mathrm{max} \\[0.5em] 0 && \text{otherwise}. \end{array} \right. \end{align}

With these functions defined, the posterior is

\begin{align} g(r,p \mid \mathbf{n}) \propto f(\mathbf{n}\mid r,p)\,g(r)\,g(p). \end{align}

We plotted this posterior in the last tutorial, and we do it again here for the Rest gene.

In [4]:
def log_likelihood(params, n):
    """
    Log likelihood for MLE of Singer data.
    """
    r, p = params

    return np.sum(st.nbinom.logpmf(n, r, p))


def log_prior(params, r_max):
    """
    Log prior for MLE of Singer data.
    """
    r, p = params
    
    # Zero probability of having p < 0 or p > 1
    if p < 0 or p > 1:
        return -np.inf
    
    # Zero probability of r outside of (0, r_max]
    if r <= 0 or r > r_max:
        return -np.inf

    # Log of 1/(r_max - r_min), with r_min = 0
    return -np.log(r_max)


def log_posterior(params, n, r_max):
    """
    Log posterior for the Singer data.
    """
    # Compute log prior
    lp = log_prior(params, r_max)

    # If log prior is -inf, return that
    if lp == -np.inf:
        return -np.inf

    # Compute and return posterior
    return lp + log_likelihood(params, n)

# Set up ranges for plotting
r = np.linspace(3, 6, 200)
p = np.linspace(0.04, 0.07, 200)

# Fetch data
n = df['Rest'].values

# Specify r_max
r_max = 20

# Compute log posterior
log_post = np.empty((len(p), len(r)))
for j, r_val in enumerate(r):
    for i, p_val in enumerate(p):
        log_post[i,j] = log_posterior((r_val, p_val), n, r_max)
                    
# Compute values for plot
R, P = np.meshgrid(r, p)
post = np.exp(log_post - log_post.max())

# Show posterior plot
p = bebi103.viz.contour(R,
                        P,
                        post,
                        overlaid=True,
                        x_axis_label='r',
                        y_axis_label='p')
bokeh.io.show(p)

Looking at what we did above, that was a lot of code, and also a fair number of mathematical expressions we had to write out. We will now do the same thing using MCMC.

Model specification

Instead of writing out the whole posterior, we can write the statistical model as follows.

\begin{align} r &\sim \text{Uniform}(r_\mathrm{min}, r_\mathrm{max}) \\[0.5em] p &\sim \text{Uniform}(0, 1) \\[0.5em] n_i \mid r,p &\sim \text{NBinom}(n_i;r,p) \;\forall i, \end{align}

where the last "$\forall i $" implies that each $n_i$ is independent and identically distributed (i.i.d.). As we can see above, we have specified a prior and a likelihood, so all necessary information has been supplied for a posterior. This is more concise than writing out the whole posterior (though that is often a good idea). This type of notation is how we build our model in PyMC3.

When we build our model, we will specify the distributions using PyMC3's built-in probability distributions. If we look at the definition of the Negative Binomial from PyMC3, it has a different definition than we have been using. PyMC3 uses

\begin{align} f(n;\mu, \alpha) = \frac{\Gamma(n + \alpha)}{\Gamma(\alpha) n!}\,\left(\frac{\alpha}{\mu + \alpha}\right)^\alpha \left(\frac{\mu}{\mu + \alpha}\right)^n. \end{align}

The change of variables is $\alpha = r$ and $\mu = r(1-p)/p$, which is proportional to the expression we derived for the burst size, $b$, before. If we are parametrizing by $\mu$ and $\alpha$ and we want to maintain uniform priors on $p$ and $r$, we should use a prior of

\begin{align} g(\mu, \alpha) \propto \frac{\alpha}{(\alpha + \mu)^2}, \end{align}

which is improper. We will use that later on in this tutorial when we specify custom priors in PyMC3, but for now, we will just take $\mu$ (which is proportional to the burst size $b$) to have a Uniform prior for the purposes of demonstrating how to use PyMC3's built-in distributions.
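To see where the prior $g(\mu, \alpha) \propto \alpha/(\alpha+\mu)^2$ comes from, note that $\alpha = r$ and $p = \alpha/(\alpha + \mu)$, so the Jacobian of the transformation from $(r, p)$ to $(\alpha, \mu)$ is

\begin{align} \left|\frac{\partial(r, p)}{\partial(\alpha, \mu)}\right| = \left|\frac{\partial p}{\partial \mu}\right| = \frac{\alpha}{(\alpha + \mu)^2}, \end{align}

and multiplying the constant prior in $(r, p)$ by this Jacobian gives the expression above.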

So, here is the model we will use for this demonstration.

\begin{align} \alpha &\sim \text{Uniform}(r_\mathrm{min}, r_\mathrm{max}) \\[0.5em] \mu &\sim \text{Uniform}(0, \infty) \\[0.5em] n_i \mid \mu, \alpha &\sim \text{NBinom}(n_i;\mu, \alpha) \;\forall i, \end{align}

where in the last line we are using PyMC3's parametrization scheme for the Negative Binomial. Note that a Uniform prior is not really defined on an infinite domain. This is not a problem in practice, since we can just treat the prior as a positive constant, which is accomplished via PyMC3's HalfFlat distribution.

Ok, let's build the model, knowing that we have imported PyMC3 as pm.

In [5]:
# Specify parameters
r_min = 0
r_max = 20

# Pull out data we want to use as a NumPy array
n = df['Rest'].values

with pm.Model() as nbinom_model:
    # Priors
    alpha = pm.Uniform('alpha', lower=r_min, upper=r_max)
    mu = pm.HalfFlat('mu')
    
    # Likelihood
    n_obs = pm.NegativeBinomial('n_obs', mu=mu, alpha=alpha, observed=n)

You may notice that that cell takes a while to run. That is because PyMC3 is building a compiled model, so that when we sample out of it, only compiled code runs. This results in big speed boosts.

Now let's parse the model set-up piece by piece. First, before we specify the model, we need to have the data and the bounds on the prior for $r$ already defined. Note that I have extracted the data out of the DataFrame as a Numpy array. This is not necessary, since PyMC3 can handle Pandas data frames, but it does result in a speed boost.

Next, we have a with statement. This is the way you do context management in Python. We use pm.Model() to instantiate a model, which we will call nbinom_model. All of what follows is done in the context of this model. So, all of the compiled code that is generated when PyMC3 builds the model exists in the context of nbinom_model.

We then go on to define the priors for $\alpha$ and $\mu$. To do this, we instantiate PyMC3's built-in distributions. The first argument is always the name of the variable as a string. In almost all cases, this is trivial, here just alpha and mu. The next arguments are args and kwargs for the distribution we are instantiating. Because the distributions for alpha and mu do not have an observed kwarg, PyMC3 knows they are priors and not a likelihood.

Finally, we define the likelihood as Negative Binomial. The parameters of the Negative Binomial come from the priors. We use the observed kwarg to specify that this quantity is observed (and the distribution is therefore a likelihood) and to specify that those observations are in the variable n.

The model is now built. Note that the data are now part of the compiled model. This is why they had to be specified before we set up the model.

Drawing samples

Ok, let's sample! We have to do the sampling within the context of nbinom_model so that PyMC3 knows what model is being used in the sampling. Note that the kwarg njobs below says how many samplers to run in parallel. I generally use the number of cores on my machine minus 2 as the maximum number of jobs I run. So, if your machine does not have many cores, you might want to lower njobs.

In [6]:
with nbinom_model:
    trace = pm.sample(draws=2000, tune=5000, njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 7000/7000 [00:06<00:00, 1082.24it/s]

PyMC3 used a no U-turn sampler (NUTS) to draw samples out of the posterior. We specified that we wanted the sampler to take 5000 tuning steps (which are then discarded) and then take 2000 samples, or draws, out of the posterior. We used the njobs=4 kwarg to tell the sampler to do this whole procedure four times in parallel, using four cores. I have an eight core machine, so this is fine.

Note that PyMC3 determined that we can use the NUTS sampler (although that is kind of like saying "ATM machine") and employed it. That is a very efficient sampler that uses Hamiltonian Monte Carlo, which you can read about here.

Note also that PyMC3 initialized the sampler using jitter+adapt_diag. This is how PyMC3 chooses the starting point for MCMC sampling by default. Roughly, it starts the sampler at the mean of the priors, with some random jitter to the starting point. You may use a more sophisticated method, automatic differentiation variational inference (ADVI) to start your walkers. This method is more computationally intensive, but tends to give good starting positions for walkers.

In [7]:
with nbinom_model:
    trace = pm.sample(init='advi+adapt_diag', draws=10000, tune=10000, njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 1,383.2:  10%|â–ˆ         | 20851/200000 [00:12<01:45, 1694.67it/s]
Convergence archived at 21000
Interrupted at 20,999 [10%]: Average Loss = 8,133.8
100%|██████████| 20000/20000 [00:18<00:00, 1061.04it/s]

Checking stationarity

To make sure that the sampler is properly sampling the posterior, we can plot a trace of the sampler as it moves through space. To extract the values of the samples, we can use the get_values() method of a PyMC3 Trace or MultiTrace object.

In [8]:
# Get trace of values of b for first chain
trace_vals = trace.get_values('mu', chains=[0])

# Make a plot
p = bokeh.plotting.figure(plot_width=700,
                          plot_height=250,
                          x_axis_label='sample number',
                          y_axis_label='µ')
p.line(np.arange(len(trace_vals)), trace_vals, line_join='bevel')
bokeh.io.show(p)

This kind of trace is a sign of good sampling. The sampler bounces around an average value.

PyMC3 has a nice utility to show all the traces, and also the marginalized PDFs of the traces, implemented as pm.traceplot(). It displays the plots using Matplotlib, so be sure you use the %matplotlib inline magic function in your notebook so that the graphics are properly displayed.

In [9]:
pm.traceplot(trace);

The plots on the right show the samples in $\alpha$ and $\mu$, with each chain (a total of four, from the four cores we used to do the calculation) shown in a different color. To the left is an approximation using kernel density estimation of the marginalized posterior, in this case $g(\alpha\mid\mathbf{n})$ and $g(\mu\mid\mathbf{n})$. I will discuss how these are calculated from the samples in a moment.

We can also compute the Gelman-Rubin statistic if we have multiple chains, which we defined in lecture to be

\begin{align} \hat{R} = \frac{\text{variance between the chains}}{\text{mean variance within the chains}}. \end{align}

This is another argument for doing at least two chains in parallel; you are able to test for stationarity. The test is implemented in the pm.gelman_rubin() function.
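As a rough illustration of this ratio, we could compute it by hand for a single parameter from the per-chain samples. This is just a sketch of the idea; pm.gelman_rubin() uses a more careful estimator, so the numbers will not match exactly.

# Rough sketch of the Gelman-Rubin ratio for mu; use pm.gelman_rubin() in practice
chains = np.array([trace.get_values('mu', chains=[i]) for i in range(4)])
n_samples = chains.shape[1]

# Between-chain variance (conventionally scaled by the number of samples per chain)
between_chain_var = n_samples * chains.mean(axis=1).var(ddof=1)

# Mean within-chain variance
within_chain_var = chains.var(axis=1, ddof=1).mean()

r_hat_rough = between_chain_var / within_chain_var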

In [10]:
pm.gelman_rubin(trace)
Out[10]:
{'alpha': 0.99997839457848559, 'mu': 0.99999443263277421}

In this case, we have $\hat{R}$ values very close to one, which is a good sign for stationarity.

Marginalizing the posterior

If we can build a model, we are able to write down the posterior analytically. But if we want to learn anything from it, we must compute integrals over the posterior. The easiest integrals to compute from MCMC samples are marginalizations. To compute a marginalized distribution, say $g(\alpha\mid\mathbf{n})$, we simply ignore the values of $\mu$ that we sampled! So, we can directly plot the approximate CDF of $g(\alpha\mid \mathbf{n})$, which we will call $G(\alpha\mid \mathbf{n})$, by plotting the ECDF of the samples.

In [11]:
plots = [bebi103.viz.ecdf(trace.get_values(param),
                          formal=True,
                          plot_height=250,
                          plot_width=400,
                          x_axis_label=param,
                          y_axis_label=f'G({param} | n)')
             for param in ('mu', 'alpha')]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

We can also plot them as a histogram.

In [12]:
plots = [bebi103.viz.histogram(trace.get_values(param),
                               bins=50,
                               plot_height=250,
                               plot_width=400,
                               line_width=2,
                               x_axis_label=param,
                               y_axis_label=f'g({param} | n)')
             for param in ('alpha', 'mu')]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Histograms are a common way of looking at MCMC results, but I prefer ECDFs.

Corner plots

To get a full graphical summary of the posterior we can imagine making the following set of plots.

  1. The posterior marginalized to include each single parameter in $\theta$ (in our case, $\theta = \{\alpha, \mu\}$).
  2. The posterior marginalized to include each unique pair of parameters (in our case, there is only one pair, $(\alpha, \mu)$).

Such a collection of plots is called a corner plot. The bebi103.viz.corner() function generates this from a trace. It is heavily inspired by Dan Forman-Mackey's corner package.

In [13]:
g = bebi103.viz.corner(trace, vars=['alpha', 'mu'])
bokeh.io.show(g)

In this case, since there are only two parameters, the plot in the $\alpha$-$\mu$ plane is the complete posterior. Alternatively, we can plot ECDFs instead of histograms.

In [14]:
g = bebi103.viz.corner(trace, vars=['alpha', 'mu'], plot_ecdf=True)
bokeh.io.show(g)

Computing from the samples

Recall that we originally considered parameters $r$ and $p$ instead of $\alpha$ and $\mu$. We chose the latter because that was what was demanded by the built-in pm.NegativeBinomial class. Conveniently, if we want to compute any quantity from the samples, we just directly compute it, and then we can study the probability distribution of the new quantity. So, we will just take our $\alpha$ and $\mu$ values and compute $p = \alpha/(\alpha+\mu)$. This is most easily accomplished if we first convert the samples from a PyMC3 MultiTrace object into a Pandas DataFrame. PyMC3 has a function to do this, pm.trace_to_dataframe(), but bebi103.pm.trace_to_dataframe() does the same thing with a few more bells and whistles.

In [15]:
# Convert trace to DataFrame
df_mcmc = bebi103.pm.trace_to_dataframe(trace, model=nbinom_model, log_post=True)

# Take a look
df_mcmc.head()
Out[15]:
alpha mu energy_error step_size_bar energy diverging tune tree_size max_energy_error mean_tree_accept depth step_size chain log_posterior
0 4.443611 76.812753 -0.475233 1.263103 1375.522226 False False 3.0 -0.533517 1.000000 2 1.263103 0 -1377.128804
1 4.150964 74.191971 0.083921 1.263103 1375.704341 False False 3.0 0.316227 0.798209 2 1.263103 0 -1377.281641
2 4.150964 74.191971 0.000000 1.263103 1375.988383 False False 1.0 0.676804 0.508239 1 1.263103 0 -1377.281641
3 4.371177 78.717741 0.268521 1.263103 1378.233213 False False 3.0 1.610404 0.454870 2 1.263103 0 -1378.208374
4 4.156399 76.129494 -0.165157 1.263103 1375.685067 False False 3.0 -0.230219 1.000000 2 1.263103 0 -1377.350783

You will notice lots of statistics from the sampler, such as energy, diverging, etc. We will not go into these now, but instead mention that the log_posterior column has the value of the log posterior for each sample. To calculate this when building the DataFrame, use the model and log_post kwargs. We will not use this just yet, but it is useful to have for computing the MAP.

Now, we can add a column to this DataFrame, p, which has the values of that parameter. We can also compute the burst size, $b$.

In [16]:
# Compute p from samples
df_mcmc['p'] = df_mcmc['alpha'] / (df_mcmc['alpha'] + df_mcmc['mu'])
df_mcmc['b'] = (1 - df_mcmc['p']) / df_mcmc['p']

# Make corner plot
g = bebi103.viz.corner(df_mcmc,
                       vars=['alpha', 'mu', 'b', 'p'],
                       plot_ecdf=True,
                       show_contours=False)
bokeh.io.show(g)

Note that the plot of the posterior in the $\alpha$-$p$ plane looks quite similar to the analytical posterior we plotted at the beginning of this tutorial (though they will be slightly different because they have different priors). Also note that for this particular corner plot, I chose not to display the contours, since the $b$-$p$ dependence would give contours that are too wide due to the smoothing that is employed to generate the contours.

Finding the MAP

Having a data frame with a log_posterior column enables us to compute the MAP. We just find the index of the row that gives the largest log posterior. We can use the idxmax() method that is built in to Pandas DataFrames.

In [17]:
# Get the index of the most probable parameter set
max_ind = df_mcmc['log_posterior'].idxmax()

# Pull out values.
r_MAP, p_MAP = df_mcmc.loc[max_ind, ['alpha', 'p']]

# Print the results
print("""
Most probable parameter values:
r: {0:.3f}
p: {1:.3f}""".format(r_MAP, p_MAP))
Most probable parameter values:
r: 4.530
p: 0.057

This is pretty much what we got with optimization in the last tutorial, though we were using a different prior there.

Writing a function to define the model and sample

It is a little bit annoying that we need to have the data defined outside of the model, especially if we need to do the analysis multiple times. To ease this issue, we could write a function to define a model and sample.

In [18]:
def sample_neg_binom(n, r_min=0, r_max=20, draws=500, tune=500, njobs=1):
    """Sample the mRNA count data with Neg. Binom. likelihood."""
    with pm.Model() as model:
        # Priors
        r = pm.Uniform('r', lower=r_min, upper=r_max)
        mu = pm.HalfFlat('mu')

        # Likelihood
        n_obs = pm.NegativeBinomial('n_obs', mu=mu, alpha=r, observed=n)
        
        # Draw the samples
        trace = pm.sample(draws=draws, tune=tune, njobs=njobs, init='advi+adapt_diag')
        
        # Convert to data frame
        df_mcmc = bebi103.pm.trace_to_dataframe(trace, log_post=True)
        
    return df_mcmc, trace, model

As an example, we'll run it on Prdm14.

In [19]:
# Perform sampling
df_mcmc, _, _ = sample_neg_binom(df['Prdm14'].values, tune=10000, draws=10000, njobs=4)

# Plot result
g = bebi103.viz.corner(df_mcmc, vars=['r', 'mu'], plot_ecdf=True)
bokeh.io.show(g)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 719.06:   8%|â–Š         | 16831/200000 [00:10<01:49, 1672.50it/s]
Convergence archived at 17000
Interrupted at 16,999 [8%]: Average Loss = 924.62
100%|██████████| 20000/20000 [00:18<00:00, 1079.54it/s]

You may get some warnings about diverging samples (I did on a few of the times I ran this). During sampling, PyMC3 uses some heuristic tests to make sure the sampler is working properly. If it encounters very high curvature in a trajectory of a sample, it deems the sample to be diverging. This may mean there could be some error in sampling, but it does not necessarily mean that is the case. When the number of diverging samples is small, as is the case here, we are usually ok.
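If you want to see how many samples diverged, you can count them directly, since the diverging column we saw in the output of trace_to_dataframe() is included in the data frame of samples we just made.

# Count diverging samples using the diverging column of the samples data frame
n_diverging = df_mcmc['diverging'].sum()
print('Number of diverging samples:', n_diverging)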

Setting a custom prior

As I mentioned before, we should be using

\begin{align} g(\mu, \alpha) \propto \frac{\alpha}{(\alpha + \mu)^2}. \end{align}

with $g = 0$ whenever $\mu \le 0$ or $\alpha \le 0$, as our prior. This is not a distribution that is built in to PyMC3, so we need to make our own prior. There are a couple of ways of doing this: first by defining a deterministic variable, and second by building a custom prior.

For our first approach, we can specify the prior we want on $p$, Uniform(0, 1), and then explicitly do the change of variables to $\mu$, as the PyMC3 NegativeBinomial distribution requires.

In [20]:
# Specify parameters
r_min = 0
r_max = 20

# Pull out data we want to use as a NumPy array
n = df['Rest'].values

with pm.Model() as nbinom_model:
    # Priors
    r = pm.Uniform('r', lower=r_min, upper=r_max)
    p = pm.Uniform('p', lower=0, upper=1)
                
    # Convert p to mu
    mu = r * (1 - p) / p

    # Likelihood
    n_obs = pm.NegativeBinomial('n_obs', alpha=r, mu=mu, observed=n)

Let's sample, just to take a look.

In [21]:
with nbinom_model:
    trace = pm.sample(init='advi+adapt_diag', draws=10000, tune=5000, njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 1,545.6:   6%|▌         | 11364/200000 [00:06<01:50, 1706.15it/s]
Convergence archived at 11500
Interrupted at 11,499 [5%]: Average Loss = 3,151.1
100%|██████████| 15000/15000 [00:25<00:00, 583.27it/s]

If we convert the trace to a data frame and look, we see that mu was not computed during the sampling.

In [22]:
df_mcmc = bebi103.pm.trace_to_dataframe(trace)

df_mcmc.head()
Out[22]:
r p energy_error step_size_bar energy diverging tune tree_size max_energy_error mean_tree_accept depth step_size chain
0 4.157824 0.053017 0.018166 0.32536 1382.312339 False False 7.0 -0.046211 0.992396 3 0.32536 0
1 4.073795 0.053359 0.458063 0.32536 1383.058044 False False 1.0 0.458063 0.632507 1 0.32536 0
2 4.177351 0.051320 -0.347642 0.32536 1384.078803 False False 3.0 1.460288 0.502289 2 0.32536 0
3 4.129454 0.050235 0.177673 0.32536 1383.432002 False False 3.0 0.177673 0.878013 2 0.32536 0
4 5.719044 0.071628 -0.892058 0.32536 1387.543680 False False 15.0 2.590398 0.512998 4 0.32536 0

We only got $r$ and $p$, but that's fine. It's what we wanted in the first place. We can always compute $\mu$ from the samples anyway. So, let's do that and make a corner plot to check it out.

In [23]:
# Add mu to data frame
df_mcmc['mu'] = df_mcmc['r'] * (1 - df_mcmc['p']) / df_mcmc['p']

# Make plot
p = bebi103.viz.corner(df_mcmc, vars=['r', 'mu'], plot_ecdf=True)
bokeh.io.show(p)

Deterministic variables

In the above example, $\mu$ was a deterministic variable, computed directly from parameters that can vary ($r$ and $p$). We saw that PyMC3 did not keep track of its value during sampling; we had to recompute it after sampling. If we want the sampler to keep track of a variable, we need to say that it is "deterministically" distributed using pm.Deterministic. We can redo the calculation using this.

In [24]:
with pm.Model() as nbinom_model:
    # Priors
    r = pm.Uniform('r', lower=r_min, upper=r_max)
    p = pm.Uniform('p', lower=0, upper=1)
                
    # Convert p to mu
    mu = pm.Deterministic('mu', r * (1 - p) / p)

    # Likelihood
    n_obs = pm.NegativeBinomial('n_obs', alpha=r, mu=mu, observed=n)
    
    # Get the samples
    trace = pm.sample(init='advi+adapt_diag', draws=10000, tune=5000, njobs=4)
    
p = bebi103.viz.corner(trace, vars=['r', 'mu'], plot_ecdf=True)
bokeh.io.show(p)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 1,548.1:   6%|▌         | 11379/200000 [00:06<01:52, 1677.16it/s]
Convergence archived at 11500
Interrupted at 11,499 [5%]: Average Loss = 3,166.3
100%|██████████| 15000/15000 [00:26<00:00, 561.41it/s]

Custom priors

Now let's try this by defining a custom prior. One way to specify an arbitrary continuous distribution is to use the pm.DensityDist class. We instantiate it like other distributions, with the first argument being the name of the distribution. The second argument is a function that computes the logarithm of the PDF. So, we need to write a function that returns

\begin{align} \ln g(\mu, \alpha) = \ln\alpha - 2\ln(\alpha + \mu), \end{align}

given an input array containing $\alpha$ and $\mu$. Here is how I would write the function using NumPy.

In [25]:
def custom_prior_logp_numpy(value):
    """Log of prior for alpha and mu."""
    alpha, mu = value
    if alpha <= 0 or mu <= 0:
        return -np.inf
    
    return np.log(alpha) - 2*np.log(alpha + mu)

This unfortunately will not work. The reason is that under PyMC3's hood, the arrays are not Numpy arrays, but Theano tensors. The Theano package is what makes building compiled models with symbolic differentiation possible (read: SPEED). So, when writing functions that will be used in PyMC3 models, we need to use Theano operations. I have already imported the theano.tensor submodule as tt. To enforce bounds on the prior (namely that $\alpha, \mu > 0$), I will use the pm.distributions.dist_math.bound() function to conveniently dictate that the prior is only nonzero within certain bounds. The function for the custom prior is below.

In [26]:
def custom_prior_logp(value):
    """Log of prior for alpha and mu."""
    return pm.distributions.dist_math.bound(
                tt.log(value[0]) - 2.0 * tt.log(value[0] + value[1]),
                tt.all(value > 0.0))

Note that the prior is on both $\alpha$ and $\mu$, since they are no longer independent in the prior. This means that we need to specify the shape of the distribution, which is PyMC3 speak for the dimensionality of the prior, in this case 2. Finally, we also specify a testval kwarg when setting up the distribution to tell PyMC3 where to start walkers. We have to do this because PyMC3 is unaware of this custom distribution and its properties, so it cannot smartly pick a starting point. Now, let's specify the model.

In [27]:
# Specify parameters
r_min = 0
r_max = 20

# Pull out data we want to use as a NumPy array
n = df['Rest'].values

with pm.Model() as nbinom_model:
    # Priors
    alpha_mu = pm.DensityDist('alpha_mu',
                              custom_prior_logp,
                              shape=2,
                              testval=[1.0, 1.0])
                
    # Likelihood
    n_obs = pm.NegativeBinomial('n_obs', alpha=alpha_mu[0], mu=alpha_mu[1], observed=n)

Once we have the custom prior defined, it's just a matter of writing the likelihood. Let's sample and plot!

In [28]:
with nbinom_model:
    trace = pm.sample(draws=10000, tune=5000, njobs=4)

g = bebi103.viz.corner(trace,
                       vars=['alpha_mu__0', 'alpha_mu__1'],
                       labels=['α', 'µ'],
                       plot_ecdf=True)
bokeh.io.show(g)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 15000/15000 [00:12<00:00, 1193.49it/s]

Notice that I specified the vars kwarg to be ['alpha_mu__0', 'alpha_mu__1']. This is because PyMC3 stores vector-valued parameters with double underscores and their index when the trace is converted to a data frame. In our case, alpha_mu__0 is $\alpha$ and alpha_mu__1 is $\mu$. The labels kwarg of bebi103.viz.corner() enables us to label the axes with something a little cleaner.

Note that we did not really need to write a custom prior because we could do a change of variables; we specified priors for $r$ and $p$ that are built in to PyMC3, and then computed $\alpha$ and $\mu$ for use in the likelihood. Whatever you can do with built-in functions, you should.

But what if we really need some custom functions? And what if those custom functions cannot be coded up in Theano (for example, things like ODE solving)? Can we just write Numpy functions? The answer is yes and no. Yes, you can, but no, you can't use NUTS. You will have to use a sampler that does not rely on gradient information, like Metropolis sampling. The reason for this is that PyMC3 uses Theano to do automatic differentiation of all of the functions. Without this differentiation, you cannot use NUTS. If you define your custom functions using only basic operations (like +, *, -, /, **, etc.) and Theano and PyMC3 functions, the gradient is automatically available. Otherwise, you will have to specify gradient information, which involves getting more into the guts of Theano, which we will not cover in this course.

To make a Numpy function available to PyMC3 as a Theano op, you decorate the function of interest using theano.compile.ops.as_op(). It is important to note that even though the function, in this case a prior, returns a scalar, it must still be returned as a Numpy array (e.g., np.array(-np.inf) and not -np.inf).

In [29]:
@theano.compile.ops.as_op(itypes=[tt.dvector], otypes=[tt.dscalar])
def custom_prior_logp_numpy(value):
    """Log of prior for alpha and mu."""
    alpha, mu = value
    if alpha <= 0 or mu <= 0:
        return np.array(-np.inf)
    
    return np.array(np.log(alpha) - 2*np.log(alpha + mu))

We can now use this in our model, just as we did last time with the custom prior built from Theano functions.

In [30]:
with pm.Model() as nbinom_model:
    # Priors
    alpha_mu = pm.DensityDist('alpha_mu',
                              custom_prior_logp_numpy,
                              shape=2,
                              testval=[1.0, 1.0])
                
    # Likelihood
    n_obs = pm.NegativeBinomial('n_obs', alpha=alpha_mu[0], mu=alpha_mu[1], observed=n)

Unfortunately, we cannot sample this posterior with NUTS because we have no gradient information. Nonetheless, PyMC3 will try NUTS by default. Instead, we have to specify our sampler using the step kwarg. We will use Metropolis.

In [31]:
with nbinom_model:
    trace = pm.sample(draws=10000, tune=100000, njobs=4, step=pm.Metropolis())

g = bebi103.viz.corner(trace,
                       vars=['alpha_mu__0', 'alpha_mu__1'],
                       labels=['α', 'µ'],
                       plot_ecdf=True)
bokeh.io.show(g)
100%|██████████| 110000/110000 [00:25<00:00, 4383.34it/s]

A more complicated case: The bimodal Rex1 distribution

We will now look at a more complicated case, the bimodal Rex1 distribution. Referring again to the previous tutorial, the likelihood and prior for the model of two Negative Binomial distributions are respectively

\begin{align} f(\mathbf{n} \mid r_1, r_2, p_1, p_2, w_1) &\propto \prod_i \left(w_1\,\frac{(n_i + r_1 - 1)!}{n_i!\,(r_1-1)!}\,p_1^{r_1}(1-p_1)^{n_i}\right. \nonumber \\[1em] &\;\;\;\;\;\;\;\;\;\;\;\; \left.+\;(1-w_1) \frac{(n_i + r_2 - 1)!}{n_i!\,(r_2-1)!}\,p_2^{r_2}(1-p_2)^{n_i}\right),\\[1em] g(r_1, r_2, p_1, p_2, w_1) &= \text{constant}, \end{align}

where in the Uniform prior it is understood that the parameters lie within bounds.

For reasons that will become clear in a moment, let us define $w_2 = 1-w_1$ and rewrite this as

\begin{align} f(\mathbf{n} \mid r_1, r_2, p_1, p_2, w_1, w_2) &\propto \prod_i \left(w_1\,\frac{(n_i + r_1 - 1)!}{n_i!\,(r_1-1)!}\,p_1^{r_1}(1-p_1)^{n_i}\right. \nonumber \\[1em] &\;\;\;\;\;\;\;\;\;\;\;\; \left.+\;w_2 \frac{(n_i + r_2 - 1)!}{n_i!\,(r_2-1)!}\,p_2^{r_2}(1-p_2)^{n_i}\right),\\[1em] g(r_1, r_2, p_1, p_2, w_1, w_2) &= \text{constant, s.t. } w_1 + w_2 = 1, \end{align}

Importantly, each of $p_1$, $p_2$, $w_1$, and $w_2$ lies between zero and one, with $w_1 + w_2 = 1$. As we attempt to write this in a form for use in PyMC3, we note that the likelihood is a mixture model. A mixture model describes a situation where there are two or more subpopulations, each of which has its own probability distribution, but where we do not know which measurement belongs to which subpopulation. Each of these distributions has a weight, which gives the probability that a given measurement came from it. So, we can write our model like this.

\begin{align} \mathbf{r} &\sim \text{Uniform}(\mathbf{r}_\mathrm{min}, \mathbf{r}_\mathrm{max}) \\[1em] \mathbf{p} &\sim \text{Uniform}(0, 1) \\[1em] \mathbf{w} &\sim \text{Dirichlet}(\mathbf{1}) \\[1em] n_i \mid \mathbf{r}, \mathbf{p}, \mathbf{w} &\sim \sum_j w_j\,\text{NBinom}(n_i;r_j,p_j) \;\forall i. \end{align}

I have written this for a general mixture model, but in this case there are only two distributions in the mixture, so $\mathbf{r} = (r_1, r_2)$, $\mathbf{p} = (p_1, p_2)$, and $\mathbf{w} = (w_1, w_2)$.

Let's code up this model and let it run.

In [32]:
n = df['Rex1'].values
r_min = 0
r_max = 50

with pm.Model() as bi_nbinom_model:
    # Priors
    r = pm.Uniform('r', lower=r_min, upper=r_max, shape=2)
    p = pm.Uniform('p', lower=0, upper=1, shape=2)
    w = pm.Dirichlet('w', a=np.ones(2))
        
    # Convert p's to mu's
    mu = pm.Deterministic('mu', r*(1-p)/p)

    # Pieces of likelihood
    likelihood = pm.NegativeBinomial.dist(alpha=r, mu=mu)

    # Likelihood
    n_obs = pm.Mixture('n', w, likelihood, observed=n)

    # Perform MCMC
    trace = pm.sample(draws=10000, tune=10000, init='advi+adapt_diag', njobs=4)

# Make corner plot
g = bebi103.viz.corner(trace,
                       vars=['r__0', 'p__0', 'r__1', 'p__1', 'w__0'],
                       labels=['r1', 'p1', 'r2', 'p2', 'w1'])
bokeh.io.show(g)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 4,048.6:   3%|â–Ž         | 5581/200000 [00:05<03:11, 1013.66it/s]
Convergence archived at 5600
Interrupted at 5,599 [2%]: Average Loss = 6,459.3
100%|█████████▉| 19996/20000 [03:20<00:00, 99.80it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 0 does not match the target. It is 0.389509183723, but should be close to 0.8. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 48 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 20000/20000 [03:20<00:00, 99.78it/s]
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 3 does not match the target. It is 0.890400282453, but should be close to 0.8. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))

Yikes! We see some weird bimodality (though if we were to run it again, we may not see the bimodality, for reasons that will become clear in a moment). We can check the Gelman-Rubin statistic.

In [33]:
pm.gelman_rubin(trace)
Out[33]:
{'mu': array([ 15.06671885,  21.019613  ]),
 'p': array([ 2.08092439,  1.43836978]),
 'r': array([ 1.04035536,  1.01510737]),
 'w': array([ 12.85642219,  12.85642219])}

These look really out of whack, as well. All of this is due to the model having a degenerate mixture. We could exchange subscripts 1 and 2 and get exactly the same model. The sampler may start at $w_1$ being high and $w_2$ being low, or vice versa, and will sample two pieces of parameter space that are equally probable. The Gelman-Rubin statistic is also off because different chains can have $p_1$ larger or smaller than $p_2$. We can break this degeneracy by stipulating that the parameters $p_1$ and $p_2$ are ordered. That is, $p_1 < p_2$. Ideally, we would like to change the model specification we just used as follows.

# Old line
p = pm.Uniform('p', lower=0, upper=1, shape=2)

# New line
p = pm.Uniform('p', lower=0, upper=1, shape=2, transform=pm.Ordered())

This feature is not yet implemented (see also here), but likely will be in future releases. Instead, we will hack this together by adding a pm.Deterministic variable that is the sorted probabilities.

In [34]:
n = df['Rex1'].values
r_min = 0
r_max = 50

with pm.Model() as bi_nbinom_model:
    # Priors
    r = pm.Uniform('r', lower=r_min, upper=r_max, shape=2)
    p = pm.Uniform('p', lower=0, upper=1, shape=2)
    w = pm.Dirichlet('w', a=np.ones(2))
    
    # Break symmetry by sorting
    p_sorted = pm.Deterministic('p_sorted', tt.sort(p))
        
    # Convert p's to mu's
    mu = pm.Deterministic('mu', r*(1-p_sorted)/p_sorted)

    # Pieces of likelihood
    likelihood = pm.NegativeBinomial.dist(alpha=r, mu=mu)

    # Likelihood
    n_obs = pm.Mixture('n', w, likelihood, observed=n)

    # Perform MCMC
    trace = pm.sample(draws=10000, tune=10000, init='advi+adapt_diag', njobs=4)

# Make corner plot
g = bebi103.viz.corner(trace,
                       vars=['r__0', 'mu__0', 'r__1', 'mu__1', 'w__0'],
                       labels=['r1', 'µ1', 'r2', 'µ2', 'w1'])
bokeh.io.show(g)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 1,982.2:   8%|â–Š         | 15858/200000 [00:17<03:18, 925.73it/s]
Convergence archived at 15900
Interrupted at 15,899 [7%]: Average Loss = 3,681.3
 92%|█████████▎| 18500/20000 [03:51<00:18, 79.83it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 3 contains 11 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
 95%|█████████▌| 19088/20000 [03:58<00:11, 80.09it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 1 contains 1 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|█████████▉| 19997/20000 [04:08<00:00, 80.43it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 0 does not match the target. It is 0.886508908344, but should be close to 0.8. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
100%|██████████| 20000/20000 [04:08<00:00, 80.43it/s]

We can now check the Gelman-Rubin statistic, knowing that it is meaningless for p, but has meaning for p_sorted and the other parameters.

In [35]:
pm.gelman_rubin(trace)
Out[35]:
{'mu': array([ 1.00009426,  1.0004231 ]),
 'p': array([ 1.79321328,  1.80604863]),
 'p_sorted': array([ 1.00028757,  1.00034039]),
 'r': array([ 1.00036812,  1.00034803]),
 'w': array([ 1.00006822,  1.00006822])}

This has remedied the problem. For a more detailed discussion on degeneracy in mixture models, see this blog post by Michael Betancourt. It is written using Stan with R, but the general principles are the same.

We can do a quick calculation of the MAP to see how it matches our results from Tutorial 4b.

In [36]:
df_mcmc = bebi103.pm.trace_to_dataframe(trace, model=bi_nbinom_model, log_post=True)

# Get the index of the most probable parameter set
max_ind = df_mcmc['log_posterior'].idxmax()

# Look at max
df_mcmc.loc[max_ind, ['r__0', 'r__1', 'p_sorted__0', 'p_sorted__1', 'w__0']]
Out[36]:
r__0             5.02063
r__1             3.65597
p_sorted__0    0.0300456
p_sorted__1     0.204321
w__0            0.839718
Name: 77, dtype: object

The results are very close to what we got from optimization. In the next tutorial, we will discuss ways other than corner plots for displaying summaries of MCMC results.

Another example: nonlinear regression

We also performed nonlinear regression using optimization methods in the last tutorial using the data set from Good, et al., Science, 342, 856-860, 2013 from lecture 3. Here, we will perform the same analysis using MCMC. We start by loading in the data set.

In [37]:
# Load in Data Frame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')

# Get Numpy arrays for speed
d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values

Next, we define a function to compute the spindle length. As a reminder,

\begin{align} l(d;\gamma, \phi) = \frac{\gamma d}{(1 + (d/\phi)^3)^\frac{1}{3}}. \end{align}

Note that we cannot use np.cbrt() as we did in the last tutorial, since whatever expressions we use must be able to be evaluated using Theano. So, we instead raise the expression to the 1/3 power.

In [38]:
def spindle_length(d, gamma, phi):
    """Spindle length as a function of droplet diameter."""
    return gamma * d / (1 + (d/phi)**3)**(1/3)

Now, we define our model. We want to specify

\begin{align} \gamma &\sim \text{Uniform}(0, 1) \\[1em] \phi &\sim \text{Jeffreys}(\phi_\mathrm{min}, \phi_\mathrm{max}) \\[1em] \sigma &\sim \text{Jeffreys}(\sigma_\mathrm{min}, \sigma_\mathrm{max}) \\[1em] l_i\mid \gamma, \phi, \sigma &\sim \text{Norm}(l(d_i;\gamma, \phi), \sigma) \; \forall i. \end{align}

A problem is that PyMC3 does not have a built-in Jeffreys prior. However, we can use the fact that a Jeffreys prior is the same as a Uniform prior on the logarithm of the parameter. So, we can do a transformation to get a Jeffreys prior.
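To see why this works, note that if $\ln\phi$ has a constant (Flat) prior, the change of variables formula gives

\begin{align} g(\phi) = g(\ln\phi)\left|\frac{\mathrm{d}\ln\phi}{\mathrm{d}\phi}\right| \propto \frac{1}{\phi}, \end{align}

which is the Jeffreys prior; the same argument applies to $\sigma$.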

In [39]:
with pm.Model() as model:
    # Priors
    gamma = pm.Uniform('gamma', lower=0, upper=1)
    log_phi = pm.Flat('log_phi')
    log_sigma = pm.Flat('log_sigma')

    # Convert logs (this gives Jeffreys priors)
    phi = pm.Deterministic('phi', tt.exp(log_phi))
    sigma = pm.Deterministic('sigma', tt.exp(log_sigma))

    # Compute spindle length
    ell_theor = spindle_length(d, gamma, phi)

    # Likelihood
    ell_obs = pm.Normal('ell_obs', mu=ell_theor, sd=sigma, observed=ell)

Now, let's draw some samples and make a corner plot.

In [40]:
with model:
    trace = pm.sample(draws=10000, tune=10000, init='advi+adapt_diag', njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 4,681.7:   5%|▌         | 10071/200000 [00:06<02:04, 1522.74it/s]  
Convergence archived at 10200
Interrupted at 10,199 [5%]: Average Loss = 1.479e+05
100%|██████████| 20000/20000 [01:07<00:00, 298.24it/s]

And, as usual, we'll make a corner plot.

In [41]:
g = bebi103.viz.corner(trace,
                       vars=['gamma', 'phi', 'sigma'],
                       labels=['γ', 'φ (µm)', 'σ (µm)'],
                       plot_width=200,
                       smooth=1)
bokeh.io.show(g)

This is a similar result to what we got when we plotted the whole posterior in Tutorial 4a. In that case, we marginalized over $\sigma$ and used the marginalized posterior. Here, we have not done that, and have instead let the sampler include $\sigma$. We still get a marginalized posterior because all we need to do to marginalize is ignore the samples of the variables we are not interested in.

Now, suppose for a moment that we did want to sample out of the marginalized posterior, where

\begin{align} g(\gamma, \phi \mid \mathbf{l},\mathbf{d}) = \frac{1}{\phi}\,\frac{\Gamma(n/2)}{2\pi^{n/2}}\left(\sum_i(l_i - l(d_i;\gamma, \phi))^2\right)^{-\frac{n}{2}}. \end{align}
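This expression comes from marginalizing the Normal likelihood over $\sigma$ with its Jeffreys prior, $g(\sigma) \propto 1/\sigma$, and the leading $1/\phi$ is the Jeffreys prior on $\phi$. Writing $s \equiv \sum_i(l_i - l(d_i;\gamma, \phi))^2$ for the sum of squared residuals, the $\sigma$ integral is

\begin{align} \int_0^\infty \frac{\mathrm{d}\sigma}{\sigma}\,\left(2\pi\sigma^2\right)^{-n/2}\,\mathrm{e}^{-s/2\sigma^2} = \frac{\Gamma(n/2)}{2\pi^{n/2}}\,s^{-n/2}. \end{align}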

Coding up this likelihood is a bit trickier. You need to build a class that inherits from a PyMC3 class to do it. You can see some discussion on it here. This particular distribution is included in the bebi103 module, as is a Jeffreys prior. We can use these to code up the model.

In [42]:
with pm.Model() as model:
    # Priors
    gamma = pm.Uniform('gamma', lower=0, upper=1)
    phi = bebi103.pm.Jeffreys('phi', lower=1, upper=100)

    # Compute spindle length
    ell_theor = spindle_length(d, gamma, phi)

    # Likelihood
    ell_obs = bebi103.pm.MarginalizedHomoscedasticNormal('ell_obs', mu=ell_theor,
                                                         observed=ell)

    # Draw samples
    trace = pm.sample(draws=10000, tune=10000, init='advi+adapt_diag', njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 1,899.5:   9%|â–‰         | 17971/200000 [00:11<01:51, 1626.29it/s]
Convergence archived at 18100
Interrupted at 18,099 [9%]: Average Loss = 2,382.6
100%|█████████▉| 19966/20000 [00:50<00:00, 398.57it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 0 does not match the target. It is 0.880793218046, but should be close to 0.8. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
100%|██████████| 20000/20000 [00:50<00:00, 398.61it/s]

And now, the corner plot, this time without $\sigma$, since we did not calculate it.

In [43]:
g = bebi103.viz.corner(trace,
                       vars=['gamma', 'phi'],
                       labels=['γ', 'φ (µm)'],
                       plot_width=200,
                       smooth=1)
bokeh.io.show(g)

And, as expected, we get the same result.

Conclusions

You now have an introduction to PyMC3, probably the leading Python-based package for MCMC. At the moment, it uses NUTS for sampling, which is more or less the state of the art for continuous parameters. But research in MCMC samplers develops rapidly, and there are already improvements to NUTS, such as this one. You should try not to get too caught up in the syntax of PyMC3, but think about how models are specified. This type of specification is similar to packages like Stan and Edward (see for example this blog post from Bob Carpenter). Given that the landscape is so rapidly changing (e.g., PyMC4 will likely use TensorFlow or MXNet instead of Theano), what is most important is your understanding of what MCMC is, how it works, and how to think about building your models.

Now that you can build statistical models and generate samples, you can make corner plots to visualize the posteriors. In the next tutorial, we discuss other ways to report results from MCMC calculations.