(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import pandas as pd
import scipy.stats as st
# Our main MCMC package
import pymc3 as pm
import theano.tensor as tt
import theano
# BE/Bi 103 utils
import bebi103
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
# Some of PyMC3's output uses Matplotlib
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
In this tutorial, we will learn how to use Markov chain Monte Carlo to do parameter estimation. To get the basic idea behind MCMC, imagine for a moment that we can draw samples out of the posterior distribution. This means that the probability of choosing given values of a set of parameters is proportional to the posterior probability of that set of values. If we drew many many such samples, we could reconstruct the posterior from the samples, e.g., by making histograms. That's a big thing to imagine: that we can draw properly weighted samples. But, it turns out that we can! That is what MCMC allows us to do.
We will discuss the theory behind this seemingly miraculous capability in lecture. For this tutorial, we will just use the fact that we can do the sampling to learn about posterior distributions in the context of parameter estimation.
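To build intuition for why samples are enough, here is a quick sketch using a distribution we can already sample from directly (the Gamma distribution and its parameters are arbitrary, chosen just for illustration). A normalized histogram of many draws closely tracks the true PDF, which is exactly the property we will rely on for posterior samples.
# Draw many samples from a known distribution (arbitrary choice for illustration)
samples = np.random.gamma(2.0, 2.0, size=100000)
# Normalized histogram of the samples
hist, bin_edges = np.histogram(samples, bins=50, density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
# The histogram closely matches the true PDF
print(np.max(np.abs(hist - st.gamma.pdf(bin_centers, 2.0, scale=2.0))))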
We will use PyMC3 as our main engine for performing MCMC. This package, along with PyStan, Edward, and emcee (formerly known as "MCMC Hammer," clearly the best name of a software package ever), represents the state of the art for MCMC. Each has its relative advantages and disadvantages, and I think PyMC3 has the most going for it.
We will discuss PyMC3's features in more detail as we go along, and we will return to its model comparison capabilities when we talk about model selection.
That said, the other packages are also very good and very powerful. I especially like emcee, which we will cover in auxiliary lessons.
To install PyMC3, we should do
conda install -c conda-forge pymc3
conda upgrade pymc3
if you have not already. You should also make sure you have the most up-to-date version of the bebi103 module.
pip install bebi103 --upgrade
Without further ado, let's start doing some statistical inference with PyMC3!
We will use the data set from the previous tutorial from Singer, et al., while it is still fresh in our minds. Recall that this data set has mRNA counts for four genes in a population of cells. First, we'll load in the data.
# Load DataFrame
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')
As a reminder, here are the ECDFs.
# Make a list of plots
plots = [bebi103.viz.ecdf(df[gene],
x_axis_label='mRNA count',
plot_width=300,
plot_height=200,
title=gene)
for gene in df]
# Show them as a grid
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))
Recall from the previous tutorial that the mRNA count is Negative Binomially distributed for some of the genes, but not for Rex1. We will start our foray into MCMC by studying the Rest gene, whose RNA copy number we assume to be Negative Binomially distributed.
As we did last time, we use a Negative Binomial likelihood. Let $n$ be the copy number of a given gene in a given cell. Then the likelihood is
\begin{align} f(n\mid r,p) = \frac{(n + r - 1)!}{n!(r-1)!}\,p^r(1-p)^n, \end{align}
where $r$ is the frequency of bursts in gene expression relative to the decay rate of mRNA and $p$ is related to the number of transcripts in each burst, the burst size $b = (1-p)/p$. These two parameters, $r$ and $p$, are what we are trying to estimate.
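For later reference, note that the mean of this Negative Binomial distribution is the burst frequency times the burst size,
\begin{align} \langle n \rangle = \frac{r(1-p)}{p} = rb, \end{align}
a quantity that will reappear when we meet PyMC3's parametrization of the Negative Binomial below.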
Since each measurement is independent, the likelihood of all measurements for a given gene is
\begin{align} f(\mathbf{n}\mid r,p) = \prod_i f(n_i\mid r,p) = \prod_i \frac{(n_i + r - 1)!}{n_i!(r-1)!}\,p^r(1-p)^{n_i}. \end{align}
We assume that $r$ and $p$ are independent, so that $g(r,p) = g(r)g(p)$. For $p$, we can choose a Uniform prior on the domain, $0 \le p \le 1$,
\begin{align} g(p) = \left\{\begin{array}{lll} 1 && 0 \le p \le 1 \\[0.5em] 0 && \text{otherwise}. \end{array} \right. \end{align}
Note that if we wanted to parametrize using $b$ instead of $p$, the implied prior on $b$ is
\begin{align} g(b) = \frac{1}{(1+b)^2}, \end{align}
defined for all nonnegative $b$. This was computed using the change of variables formula.
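As a quick sketch of that calculation: since $p = 1/(1+b)$, the change of variables formula gives
\begin{align} g(b) = g(p)\left|\frac{\mathrm{d}p}{\mathrm{d}b}\right| = 1\cdot\frac{1}{(1+b)^2} = \frac{1}{(1+b)^2}. \end{align}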
For the prior of $r$, we might also choose a Uniform prior, this time on the domain $r_\mathrm{min} < r \le r_\mathrm{max}$, where $r_\mathrm{min} \ge 0$.
\begin{align} g(r) = \left\{\begin{array}{lll} 1/(r_\mathrm{max} - r_\mathrm{min}) && r_\mathrm{min} < r \le r_\mathrm{max} \\[0.5em] 0 && \text{otherwise}. \end{array} \right. \end{align}
With these functions defined, the posterior is
\begin{align} g(r,p \mid \mathbf{n}) \propto f(\mathbf{n}\mid r,p)\,g(r)\,g(p). \end{align}
We plotted this posterior in the last tutorial; we do it again here for the Rest gene.
def log_likelihood(params, n):
"""
Log likelihood for MLE of Singer data.
"""
r, p = params
return np.sum(st.nbinom.logpmf(n, r, p))
def log_prior(params, r_max):
"""
Log prior for MLE of Singer data.
"""
r, p = params
# Zero probability of having p < 0 or p > 1
if p < 0 or p > 1:
return -np.inf
# Zero probability of r outside of (0, r_max]
if r <= 0 or r > r_max:
return -np.inf
return -np.log(r_max - 1)
def log_posterior(params, n, r_max):
"""
Log posterior for the Singer data.
"""
# Compute log prior
lp = log_prior(params, r_max)
# If log prior is -inf, return that
if lp == -np.inf:
return -np.inf
# Compute and return posterior
return lp + log_likelihood(params, n)
# Set up ranges for plotting
r = np.linspace(3, 6, 200)
p = np.linspace(0.04, 0.07, 200)
# Fetch data
n = df['Rest'].values
# Specify r_max
r_max = 20
# Compute log posterior
log_post = np.empty((len(p), len(r)))
for j, r_val in enumerate(r):
for i, p_val in enumerate(p):
log_post[i,j] = log_posterior((r_val, p_val), n, r_max)
# Compute values for plot
R, P = np.meshgrid(r, p)
post = np.exp(log_post - log_post.max())
# Show posterior plot
p = bebi103.viz.contour(R,
P,
post,
overlaid=True,
x_axis_label='r',
y_axis_label='p')
bokeh.io.show(p)
Looking at what we did above, it was a lot of code, and we also had to write out a fair number of mathematical expressions. We will now do the same thing using MCMC.
Instead of writing out the whole posterior, we can write the statistical model as follows.
\begin{align} r &\sim \text{Uniform}(r_\mathrm{min}, r_\mathrm{max}) \\[0.5em] p &\sim \text{Uniform}(0, 1) \\[0.5em] n_i &\mid r,p \sim \text{NBinom}(n_i;r,p) \;\forall i, \\[0.5em] \end{align}
where the last "$\forall i $" implies that each $n_i$ is independent and identically distributed (i.i.d.). As we can see above, we have specified a prior and a likelihood, so all necessary information has been supplied for a posterior. This is more concise than writing out the whole posterior (though that is often a good idea). This type of notation is how we build our model in PyMC3.
When we build our model, we will specify the distributions using PyMC3's built-in probability distributions. If we look at PyMC3's definition of the Negative Binomial, it is parametrized differently from what we have been using. PyMC3 uses
\begin{align} f(n;\mu, \alpha) = \frac{\Gamma(n + \alpha)}{\Gamma(\alpha) n!}\,\left(\frac{\alpha}{\mu + \alpha}\right)^\alpha \left(\frac{\mu}{\mu + \alpha}\right)^n. \end{align}
The change of variables is $\alpha = r$ and $\mu = r(1-p)/p$, which is proportional to the expression we derived for the burst size, $b$, before. If we are parametrizing by $\mu$ and $\alpha$ and we want to maintain uniform priors on $p$ and $r$, we should use a prior of
\begin{align} g(\mu, \alpha) \propto \frac{\alpha}{(\alpha + \mu)^2}, \end{align}
which is improper. We will use that later on in this tutorial when we specify custom priors in PyMC3, but for now, we will just take $b$ to have a Uniform prior for the purposes of demonstrating how to use PyMC3's built-in distributions.
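For completeness, here is a sketch of where that prior comes from. Inverting the change of variables gives $r = \alpha$ and $p = \alpha/(\alpha+\mu)$, so uniform priors on $r$ and $p$ transform as
\begin{align} g(\mu, \alpha) = g(r, p)\left|\det\frac{\partial(r, p)}{\partial(\alpha, \mu)}\right| \propto \left|\det\begin{pmatrix}1 & 0 \\ \mu/(\alpha+\mu)^2 & -\alpha/(\alpha+\mu)^2\end{pmatrix}\right| = \frac{\alpha}{(\alpha + \mu)^2}. \end{align}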
So, here is the model we will use for this demonstration.
\begin{align} r &\sim \text{Uniform}(r_\mathrm{min}, r_\mathrm{max}) \\[0.5em] b &\sim \text{Uniform}(0, \infty) \\[0.5em] n_i &\mid \mu, \alpha \sim \text{NBinom}(n_i;\mu, \alpha) \;\forall i, \\[0.5em] \end{align}
where in the last line we are using PyMC3's parametrization scheme for the Negative Binomial. Note, though, that a Uniform prior is not really defined on an infinite domain. This is not a problem in practice, since we can just treat the prior as a positive constant, which is accomplished via PyMC3's HalfFlat distribution.
Ok, let's build the model, knowing that we have imported PyMC3 as pm.
# Specify parameters
r_min = 0
r_max = 20
# Pull out data we want to use as a NumPy array
n = df['Rest'].values
with pm.Model() as nbinom_model:
# Priors
alpha = pm.Uniform('alpha', lower=r_min, upper=r_max)
mu = pm.HalfFlat('mu')
# Likelihood
n_obs = pm.NegativeBinomial('n_obs', mu=mu, alpha=alpha, observed=n)
You may notice that that cell takes a while to run. That is because PyMC3 is building a compiled model, so that when we sample out of it, only compiled code runs. This results in big speed boosts.
Now let's parse the model set-up piece by piece. First, before we specify the model, we need to have the data and the bounds on the prior for $r$ already defined. Note that I have extracted the data out of the DataFrame as a Numpy array. This is not necessary, since PyMC3 can handle Pandas data frames, but it does result in a speed boost.
Next, we have a with statement. This is the way you do context management in Python. We use pm.Model() to instantiate a model, which we will call nbinom_model. All of what follows is done in the context of this model. So, all of the compiled code that is generated when PyMC3 builds the model exists in the context of nbinom_model.
We then go on to define the priors. To do this, we instantiate PyMC3's built-in distributions. The first argument is always the name of the variable as a string. In almost all cases, this is trivial, here just alpha and mu. The next arguments are args and kwargs for the distribution we are instantiating. Because the distributions for alpha and mu do not have an observed kwarg, PyMC3 knows they are priors and not a likelihood.
Finally, we define the likelihood as Negative Binomial. The parameters of the Negative Binomial come from the priors. We use the observed kwarg to specify that this quantity is observed (and the distribution is therefore a likelihood) and to specify that those observations are in the variable n.
The model is now built. Note that the data are now part of the compiled model. This is why they had to be specified before we set up the model.
Ok, let's sample! We have to do the sampling within the context of nbinom_model so that PyMC3 knows what model is being used in the sampling. Note that the kwarg njobs below says how many samplers to run in parallel. I generally use the number of cores on my machine minus 2 as the maximum number of jobs I run. So, if your machine does not have many cores, you might want to lower njobs.
with nbinom_model:
trace = pm.sample(draws=2000, tune=5000, njobs=4)
PyMC3 used a No-U-Turn Sampler (NUTS) to draw samples out of the posterior. We specified that we wanted the sampler to take 5000 tuning steps (which are then discarded) and then take 2000 samples, or draws, out of the posterior. We used the njobs=4 kwarg to tell the sampler to do this whole procedure four times in parallel, using four cores. I have an eight-core machine, so this is fine.
Note that PyMC3 determined that we can use the NUTS sampler (although that is kind of like saying "ATM machine") and employed it. That is a very efficient sampler that uses Hamiltonian Monte Carlo, which you can read about here.
Note also that PyMC3 initialized the sampler using jitter+adapt_diag. This is how PyMC3 chooses the starting point for MCMC sampling by default. Roughly, it starts the sampler at the mean of the priors, with some random jitter to the starting point. You may use a more sophisticated method, automatic differentiation variational inference (ADVI), to start your walkers. This method is more computationally intensive, but tends to give good starting positions for the walkers.
with nbinom_model:
trace = pm.sample(init='advi+adapt_diag', draws=10000, tune=10000, njobs=4)
To make sure that the sampler is properly sampling the posterior, we can plot a trace of the sampler as it moves through parameter space. To extract the values of the samples, we can use the get_values() method of a PyMC3 Trace or MultiTrace object.
# Get trace of values of mu for the first chain
trace_vals = trace.get_values('mu', chains=[0])
# Make a plot
p = bokeh.plotting.figure(plot_width=700,
plot_height=250,
x_axis_label='sample number',
y_axis_label='µ')
p.line(np.arange(len(trace_vals)), trace_vals, line_join='bevel')
bokeh.io.show(p)
This kind of trace is a sign of good sampling. The sampler bounces around an average value.
PyMC3 has a nice utility to show all the traces, and also the marginalized PDFs of the traces, implemented as pm.traceplot(). It displays the plots using Matplotlib, so be sure you use the %matplotlib inline magic function in your notebook so that the graphics are properly displayed.
pm.traceplot(trace);
The plots on the right show the samples in $\alpha$ and $\mu$, with each chain (a total of four, from the four cores we used to do the calculation) shown in a different color. To the left is an approximation using kernel density estimation of the marginalized posterior, in this case $g(\alpha\mid\mathbf{n})$ and $g(\mu\mid\mathbf{n})$. I will discuss how these are calculated from the samples in a moment.
We can also compute the Gelman-Rubin statistic if we have multiple chains, which we defined in lecture to be
\begin{align} \hat{R} = \frac{\text{variance between the chains}}{\text{mean variance within the chains}}. \end{align}
This is another argument for doing at least two chains in parallel; you are able to test for stationarity. The test is implemented in the pm.gelman_rubin() function.
pm.gelman_rubin(trace)
In this case, we have $\hat{R}$ values very close to one, which is a good sign for stationarity.
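If you would like to see roughly what is behind this number, here is a hedged NumPy sketch of the classic Gelman-Rubin estimator computed from the $\mu$ samples. pm.gelman_rubin() is the real implementation and may include refinements, so the value will be close to, but need not exactly match, its output. Four chains are assumed because we sampled with njobs=4.
# Rough sketch of the classic Gelman-Rubin estimator for mu
# (assumes four chains, since we sampled with njobs=4)
chains = [trace.get_values('mu', chains=[i]) for i in range(4)]
n_draws = len(chains[0])
# Between-chain and mean within-chain variances
B = n_draws * np.var([np.mean(chain) for chain in chains], ddof=1)
W = np.mean([np.var(chain, ddof=1) for chain in chains])
# Pooled variance estimate and R-hat
var_hat = (n_draws - 1) / n_draws * W + B / n_draws
print(np.sqrt(var_hat / W))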
If we can build a model, we are able to write down the posterior analytically. But if we want to learn anything from it, we must compute integrals over the posterior. The easiest integrals to compute from MCMC samples are marginalizations. To compute a marginalized distribution, say $g(r\mid\mathbf{n})$, we simply ignore the values of $\mu$ that we sampled! So, we can directly plot the approximate CDF of $g(r\mid \mathbf{n})$, which we will call $G(r\mid \mathbf{n})$, by plotting the ECDF of the samples.
plots = [bebi103.viz.ecdf(trace.get_values(param),
formal=True,
plot_height=250,
plot_width=400,
x_axis_label=param,
y_axis_label=f'G({param} | n)')
for param in ('mu', 'alpha')]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))
We can also plot them as a histogram.
plots = [bebi103.viz.histogram(trace.get_values(param),
bins=50,
plot_height=250,
plot_width=400,
line_width=2,
x_axis_label=param,
y_axis_label=f'g({param} | n)')
for param in ('alpha', 'mu')]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))
Histograms are a common way of looking at MCMC results, but I prefer ECDFs.
To get a full graphical summary of the posterior, we can plot the marginalized distribution of each individual parameter along with the marginalized distribution of each pair of parameters. Such a collection of plots is called a corner plot. The bebi103.viz.corner() function generates this from a trace. It is heavily inspired by Dan Foreman-Mackey's corner package.
g = bebi103.viz.corner(trace, vars=['alpha', 'mu'])
bokeh.io.show(g)
In this case, since there are only two parameters, the plot in the $\alpha$-$\mu$ plane is the complete posterior. Alternatively, we can plot ECDFs instead of histograms.
g = bebi103.viz.corner(trace, vars=['alpha', 'mu'], plot_ecdf=True)
bokeh.io.show(g)
Recall that we originally considered the parameters $r$ and $p$ instead of $\alpha$ and $\mu$. We chose the latter because that was what was demanded by the built-in pm.NegativeBinomial class. Conveniently, if we want to compute any quantity from the samples, we just directly compute it, and then we can study the probability distribution of the new quantity. So, we will just take our $\alpha$ and $\mu$ values and compute $p = \alpha/(\alpha+\mu)$. This is most easily accomplished if we first convert the samples from a PyMC3 MultiTrace object into a Pandas DataFrame. PyMC3 has a function to do this, pm.trace_to_dataframe(), but bebi103.pm.trace_to_dataframe() does the same thing with a few more bells and whistles.
# Convert trace to DataFrame
df_mcmc = bebi103.pm.trace_to_dataframe(trace, model=nbinom_model, log_post=True)
# Take a look
df_mcmc.head()
You will notice lots of statistics from the sampler, such as energy, diverging, etc. We will not go into these now, but instead mention that the log_posterior column has the value of the log posterior for each sample. To calculate this when building the DataFrame, use the model and log_post kwargs. We will not use this just yet, but it is useful to have for computing the MAP.
Now, we can add a column to this DataFrame, p, which has the values of that parameter. We can also compute the burst size, $b$.
# Compute p from samples
df_mcmc['p'] = df_mcmc['alpha'] / (df_mcmc['alpha'] + df_mcmc['mu'])
df_mcmc['b'] = (1 - df_mcmc['p']) / df_mcmc['p']
# Make corner plot
g = bebi103.viz.corner(df_mcmc,
vars=['alpha', 'mu', 'b', 'p'],
plot_ecdf=True,
show_contours=False)
bokeh.io.show(g)
Note that the plot of the posterior in the $\alpha$-$p$ plane looks quite similar to the analytical posterior we plotted at the beginning of this tutorial (though they will be slightly different because they have different priors). Also note that for this particular corner plot, I chose not to display the contours, since the $b$-$p$ dependence would give contours that are too wide due to the smoothing that is employed to generate the contours.
Having a data frame with a log_posterior column enables us to compute the MAP. We just find the index of the row that gives the largest log posterior. We can use the idxmax() method that is built in to Pandas DataFrames.
# Get the index of the most probable parameter set
max_ind = df_mcmc['log_posterior'].idxmax()
# Pull out values.
r_MAP, p_MAP = df_mcmc.loc[max_ind, ['alpha', 'p']]
# Print the results
print("""
Most probable parameter values:
r: {0:.3f}
p: {1:.3f}""".format(r_MAP, p_MAP))
This is pretty much what we got with optimization in the last tutorial, though we were using a different prior in the last tutorial.
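The samples also make it easy to summarize uncertainty directly. As a rough sketch (percentile-based intervals are just one of several reasonable summaries), here are 95% credible intervals for $\alpha$ and $p$ computed straight from the samples.
# Percentile-based 95% credible intervals from the samples
for param in ['alpha', 'p']:
    low, high = np.percentile(df_mcmc[param], [2.5, 97.5])
    print('{0:s}: [{1:.3f}, {2:.3f}]'.format(param, low, high))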
It is a little bit annoying that we need to have the data defined outside of the model, especially if we need to do the analysis multiple times. To ease this issue, we could write a function to define a model and sample.
def sample_neg_binom(n, r_min=0, r_max=20, draws=500, tune=500, njobs=1):
"""Sample the mRNA count data with Neg. Binom. likelihood."""
with pm.Model() as model:
# Priors
r = pm.Uniform('r', lower=r_min, upper=r_max)
mu = pm.HalfFlat('mu')
# Likelihood
n_obs = pm.NegativeBinomial('n_obs', mu=mu, alpha=r, observed=n)
# Draw the samples
trace = pm.sample(draws=draws, tune=tune, njobs=njobs, init='advi+adapt_diag')
# Convert to data frame
df_mcmc = bebi103.pm.trace_to_dataframe(trace, log_post=True)
return df_mcmc, trace, model
As an example, we'll run it on Prdm14.
# Perform sampling
df_mcmc, _, _ = sample_neg_binom(df['Prdm14'].values, tune=10000, draws=10000, njobs=4)
# Plot result
g = bebi103.viz.corner(df_mcmc, vars=['r', 'mu'], plot_ecdf=True)
bokeh.io.show(g)
You may get some warnings about diverging samples (I did on a few of the times I ran this). During sampling, PyMC3 uses some heuristic tests to make sure the sampler is working properly. If it encounters very high curvature in the trajectory of a sample, it deems the sample to be diverging. This may mean there is some error in the sampling, but it does not necessarily mean that is the case. When the number of diverging samples is small, as is the case here, we are usually ok.
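If you want to see how many samples were flagged, the diverging sampler statistic mentioned earlier is carried along with the samples. A quick way to count them (assuming the diverging column survives the conversion to a data frame, as it did above) is shown below.
# Count how many samples the sampler flagged as diverging
print('Number of diverging samples:', df_mcmc['diverging'].sum())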
As I mentioned before, we should be using
\begin{align} g(\mu, \alpha) \propto \frac{\alpha}{(\alpha + \mu)^2}, \end{align}
with $g(\mu\le 0, \alpha\le 0) = 0$, as our prior. This is not a distribution that is built in to PyMC3, so we need to make our own prior. There are a couple of ways of doing this, first by defining a deterministic variable, and second by building a custom prior.
For our first approach, we can specify the prior we want on $p$, Uniform(0, 1), and then explicitly do the change of variables to $\mu$, as the PyMC3 NegativeBinomial distribution requires.
# Specify parameters
r_min = 0
r_max = 20
# Pull out data we want to use as a NumPy array
n = df['Rest'].values
with pm.Model() as nbinom_model:
# Priors
r = pm.Uniform('r', lower=r_min, upper=r_max)
p = pm.Uniform('p', lower=0, upper=1)
# Convert p to mu
mu = r * (1 - p) / p
# Likelihood
n_obs = pm.NegativeBinomial('n_obs', alpha=r, mu=mu, observed=n)
Let's sample, just to take a look.
with nbinom_model:
trace = pm.sample(init='advi+adapt_diag', draws=10000, tune=5000, njobs=4)
If we convert the trace to a data frame and look, we see that mu was not stored during the sampling.
df_mcmc = bebi103.pm.trace_to_dataframe(trace)
df_mcmc.head()
We only got $r$ and $p$, but that's fine. It's what we wanted in the first place. We can always compute $\mu$ from the samples anyway. So, let's do that and make a corner plot to check it out.
# Add mu to data frame
df_mcmc['mu'] = df_mcmc['r'] * (1 - df_mcmc['p']) / df_mcmc['p']
# Make plot
p = bebi103.viz.corner(df_mcmc, vars=['r', 'mu'], plot_ecdf=True)
bokeh.io.show(p)
In the above example, $\mu$ was a deterministic variable, computed directly from a parameter that can vary ($p$). We saw that PyMC3 did not keep track of its value during sampling; we had to recompute it after sampling. If we want the sampler to keep track of a variable, we need to say that it is "deterministically" distributed using pm.Deterministic. We can redo the calculation using this.
with pm.Model() as nbinom_model:
# Priors
r = pm.Uniform('r', lower=r_min, upper=r_max)
p = pm.Uniform('p', lower=0, upper=1)
# Convert p to mu
mu = pm.Deterministic('mu', r * (1 - p) / p)
# Likelihood
n_obs = pm.NegativeBinomial('n_obs', alpha=r, mu=mu, observed=n)
# Get the samples
trace = pm.sample(init='advi+adapt_diag', draws=10000, tune=5000, njobs=4)
p = bebi103.viz.corner(trace, vars=['r', 'mu'], plot_ecdf=True)
bokeh.io.show(p)
Now let's try this by defining a custom prior. One way to specify an arbitrary continuous distribution is to use the pm.DensityDist class. We instantiate it like other distributions, with the first argument being the name of the distribution. The second argument is a function that computes the logarithm of the PDF. So, we need to write a function that returns
\begin{align} \ln g(\mu, \alpha) = \ln\alpha - 2 \ln(\alpha + \mu), \end{align}
given an input array containing $\alpha$ and $\mu$. Here is how I would write the function using NumPy.
def custom_prior_logp_numpy(value):
"""Log of prior for alpha and mu."""
alpha, mu = value
if alpha <= 0 or mu <= 0:
return -np.inf
return np.log(alpha) - 2*np.log(alpha + mu)
This unfortunately will not work. The reason is that under PyMC3's hood, the arrays are not Numpy arrays, but Theano tensors. Theano is what makes building compiled models with symbolic differentiation possible (read: SPEED). So, when writing functions that will be used in PyMC3 models, we need to use Theano. I have already imported the theano.tensor submodule as tt. To enforce bounds on the prior (namely that $\alpha, \mu > 0$), I will use the pm.distributions.dist_math.bound() function to conveniently dictate that the prior is only nonzero within certain bounds. The function for the custom prior is below.
def custom_prior_logp(value):
"""Log of prior for alpha and mu."""
return pm.distributions.dist_math.bound(
tt.log(value[0]) - 2.0 * tt.log(value[0] + value[1]),
tt.all(value > 0.0))
Note that the prior is on both $\alpha$ and $\mu$, since they are no longer independent in the prior. This means that we need to specify the shape of the distribution, which is PyMC3 speak for the dimensionality of the prior, in this case 2. Finally, we also specify a testval kwarg when setting up the distribution to tell PyMC3 where to start the walkers. We have to do this because PyMC3 is unaware of this custom distribution and its properties, so it cannot smartly pick a starting point. Now, let's specify the model.
# Specify parameters
r_min = 0
r_max = 20
# Pull out data we want to use as a NumPy array
n = df['Rest'].values
with pm.Model() as nbinom_model:
# Priors
alpha_mu = pm.DensityDist('alpha_mu',
custom_prior_logp,
shape=2,
testval=[1.0, 1.0])
# Likelihood
n_obs = pm.NegativeBinomial('n_obs', alpha=alpha_mu[0], mu=alpha_mu[1], observed=n)
Once we have the custom prior defined, it's just a matter of writing the likelihood. Let's sample and plot!
with nbinom_model:
trace = pm.sample(draws=10000, tune=5000, njobs=4)
g = bebi103.viz.corner(trace,
vars=['alpha_mu__0', 'alpha_mu__1'],
labels=['α', 'µ'],
plot_ecdf=True)
bokeh.io.show(g)
Notice that I specified the vars kwarg to be ['alpha_mu__0', 'alpha_mu__1']. This is because PyMC3 stores vector-valued parameters with double underscores and their index when the trace is converted to a data frame. In our case, alpha_mu__0 is $\alpha$ and alpha_mu__1 is $\mu$. The labels kwarg of bebi103.viz.corner() enables us to label the axes with something a little cleaner.
Note that we did not really need to write a custom prior because we could do a change of variables; we specified priors for $r$ and $p$ that are built in to PyMC3, and then computed $\alpha$ and $\mu$ for use in the likelihood. Whatever you can do with built-in functions, you should.
But what if we really need some custom functions? And what if those custom functions cannot be coded up in Theano (for example, things like ODE solving)? Can we just write Numpy functions? The answer is yes and no. Yes, you can, but no, you can't use NUTS. You will have to use a sampler that does not rely on gradient information, like Metropolis sampling. The reason for this is that PyMC3 uses Theano to do automatic differentiation of all of the functions. Without this differentiation, you cannot use NUTS. If you define your custom functions using only basic operations (like +, *, -, /, **, etc.) and Theano and PyMC3 functions, the gradient is automatically available. Otherwise, you will have to specify gradient information, which involves getting more into the guts of Theano, which we will not cover in this course.
To make a Numpy function available to PyMC3 as a Theano op, you decorate the function of interest using theano.compile.ops.as_op(). It is important to note that even though the function, in this case a prior, returns a scalar, it still must be returned as a Numpy array (e.g., np.array(-np.inf) and not -np.inf).
@theano.compile.ops.as_op(itypes=[tt.dvector], otypes=[tt.dscalar])
def custom_prior_logp_numpy(value):
"""Log of prior for alpha and mu."""
alpha, mu = value
if alpha <= 0 or mu <= 0:
return np.array(-np.inf)
return np.array(np.log(alpha) - 2*np.log(alpha + mu))
We can now use this in our model, just as we did last time with the custom prior built from Theano functions.
with pm.Model() as nbinom_model:
# Priors
alpha_mu = pm.DensityDist('alpha_mu',
custom_prior_logp_numpy,
shape=2,
testval=[1.0, 1.0])
# Likelihood
n_obs = pm.NegativeBinomial('n_obs', alpha=alpha_mu[0], mu=alpha_mu[1], observed=n)
Unfortunately, we cannot sample this posterior with NUTS because we have no gradient information. Nonetheless, PyMC3 will try NUTS by default. Instead, we have to specify our sampler using the step kwarg. We will use Metropolis.
with nbinom_model:
trace = pm.sample(draws=10000, tune=100000, njobs=4, step=pm.Metropolis())
g = bebi103.viz.corner(trace,
vars=['alpha_mu__0', 'alpha_mu__1'],
labels=['α', 'µ'],
plot_ecdf=True)
bokeh.io.show(g)
We will now look at a more complicated case, the bimodal Rex1 distribution. Referring again to the previous tutorial, the likelihood and prior for the model of two Negative Binomial distributions are respectively
\begin{align} f(\mathbf{n} \mid r_1, r_2, p_1, p_2, w_1) &\propto \prod_i \left(w_1\,\frac{(n_i + r_1 - 1)!}{n_i!\,(r_1-1)!}\,p_1^{r_1}(1-p_1)^{n_i}\right. \nonumber \\[1em] &\;\;\;\;\;\;\;\;\;\;\;\; \left.+\;(1-w_1) \frac{(n_i + r_2 - 1)!}{n_i!\,(r_2-1)!}\,p_2^{r_2}(1-p_2)^{n_i}\right),\\[1em] g(r_1, r_2, p_1, p_2, w_1) &= \text{constant}, \end{align}
where in the Uniform prior it is understood that the parameters lie within bounds.
For reasons that will become clear in a moment, let us define $w_2 = 1-w_1$ and rewrite this as
\begin{align} f(\mathbf{n} \mid r_1, r_2, p_1, p_2, w_1, w_2) &\propto \prod_i \left(w_1\,\frac{(n_i + r_1 - 1)!}{n_i!\,(r_1-1)!}\,p_1^{r_1}(1-p_1)^{n_i}\right. \nonumber \\[1em] &\;\;\;\;\;\;\;\;\;\;\;\; \left.+\;w_2 \frac{(n_i + r_2 - 1)!}{n_i!\,(r_2-1)!}\,p_2^{r_2}(1-p_2)^{n_i}\right),\\[1em] g(r_1, r_2, p_1, p_2, w_1, w_2) &= \text{constant, s.t. } w_1 + w_2 = 1. \end{align}
Importantly, each of $p_1$, $p_2$, $w_1$, and $w_2$ lies between zero and one, with $w_1 + w_2 = 1$. As we attempt to write this in a form for use in PyMC3, we note that the likelihood is a mixture model. A mixture model describes a situation where there are two or more subpopulations, each of which has its own probability distribution, but where we do not know which measurement belongs to which subpopulation. Each of these distributions has a weight, which gives the probability that a given measurement came from it. So, we can write our model like this.
\begin{align} \mathbf{r} &\sim \text{Uniform}(\mathbf{r}_\mathrm{min}, \mathbf{r}_\mathrm{max}) \\[1em] \mathbf{p} &\sim \text{Uniform}(0, 1) \\[1em] \mathbf{n} \mid \mathbf{r}, \mathbf{p}, \mathbf{w} &\sim \sum_j w_j\,\text{NBinom}(n_i;r_j,p_j) \;\forall i\\[1em] \mathbf{w} &\sim \text{Dirichlet}(\mathbf{1}). \end{align}
I have written this for a general mixture model, but in this case there are only two distributions in the model, so $\mathbf{r} = (r_1, r_2)$, $\mathbf{p} = (p_1, p_2)$, and $\mathbf{w} = (w_1, w_2)$.
Let's code up this model and let it run.
n = df['Rex1'].values
r_min = 0
r_max = 50
with pm.Model() as bi_nbinom_model:
# Priors
r = pm.Uniform('r', lower=r_min, upper=r_max, shape=2)
p = pm.Uniform('p', lower=0, upper=1, shape=2)
w = pm.Dirichlet('w', a=np.ones(2))
# Convert p's to mu's
mu = pm.Deterministic('mu', r*(1-p)/p)
# Pieces of likelihood
likelihood = pm.NegativeBinomial.dist(alpha=r, mu=mu)
# Likelihood
n_obs = pm.Mixture('n', w, likelihood, observed=n)
# Perform MCMC
trace = pm.sample(draws=10000, tune=10000, init='advi+adapt_diag', njobs=4)
# Make corner plot
g = bebi103.viz.corner(trace,
vars=['r__0', 'p__0', 'r__1', 'p__1', 'w__0'],
labels=['r1', 'p1', 'r2', 'p2', 'w1'])
bokeh.io.show(g)
Yikes! We see some weird bimodality (though if we were to run it again, we may not see the bimodality, for reasons that will become clear in a moment). We can check the Gelman-Rubin statistic.
pm.gelman_rubin(trace)
These look really out of whack, as well. All of this is due to the model having a degenerate mixture. We could exchange subscripts 1 and 2 and get exactly the same model. The sampler may start with $w_1$ high and $w_2$ low, or vice versa, and will sample two pieces of parameter space that are equally probable. The Gelman-Rubin statistic is also off because different chains can end up with $p_1$ larger or smaller than $p_2$. We can break this degeneracy by stipulating that the parameters $p_1$ and $p_2$ are ordered, that is, $p_1 < p_2$. Ideally, we would like to change the model specification we just used as follows.
# Old line
p = pm.Uniform('p', lower=0, upper=1, shape=2)
# New line
p = pm.Uniform('p', lower=0, upper=1, shape=2, transform=pm.Ordered())
This feature is not yet implemented (see also here), but likely will be in future releases. Instead, we will hack this together by adding a pm.Deterministic variable that is the sorted probabilities.
n = df['Rex1'].values
r_min = 0
r_max = 50
with pm.Model() as bi_nbinom_model:
# Priors
r = pm.Uniform('r', lower=r_min, upper=r_max, shape=2)
p = pm.Uniform('p', lower=0, upper=1, shape=2)
w = pm.Dirichlet('w', a=np.ones(2))
# Break symmetry by sorting
p_sorted = pm.Deterministic('p_sorted', tt.sort(p))
# Convert p's to mu's
mu = pm.Deterministic('mu', r*(1-p_sorted)/p_sorted)
# Pieces of likelihood
likelihood = pm.NegativeBinomial.dist(alpha=r, mu=mu)
# Likelihood
n_obs = pm.Mixture('n', w, likelihood, observed=n)
# Perform MCMC
trace = pm.sample(draws=10000, tune=10000, init='advi+adapt_diag', njobs=4)
# Make corner plot
g = bebi103.viz.corner(trace,
vars=['r__0', 'mu__0', 'r__1', 'mu__1', 'w__0'],
labels=['r1', 'µ1', 'r2', 'µ2', 'w1'])
bokeh.io.show(g)
We can now check the Gelman-Rubin statistic, knowing that it is meaningless for p, but has meaning for p_sorted and the other parameters.
pm.gelman_rubin(trace)
This has remedied the problem. For a more detailed discussion on degeneracy in mixture models, see this blog post by Michael Betancourt. It is written using Stan with R, but the general principles are the same.
We can do a quick calculation of the MAP to see how it matches our results from Tutorial 4b.
df_mcmc = bebi103.pm.trace_to_dataframe(trace, model=bi_nbinom_model, log_post=True)
# Get the index of the most probable parameter set
max_ind = df_mcmc['log_posterior'].idxmax()
# Look at max
df_mcmc.loc[max_ind, ['r__0', 'r__1', 'p_sorted__0', 'p_sorted__1', 'w__0']]
The results are very close to what we got from optimization. In the next tutorial, we will discuss ways other than corner plots for displaying summaries of MCMC results.
We also performed nonlinear regression using optimization methods in the last tutorial using the data set from Good, et al., Science, 342, 856-860, 2013 from lecture 3. Here, we will perform the same analysis using MCMC. We start by loading in the data set.
# Load in Data Frame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')
# Get Numpy arrays for speed
d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values
Next, we define a function to compute the spindle length. As a reminder,
\begin{align} l(d;\gamma, \phi) = \frac{\gamma d}{(1 + (d/\phi)^3)^\frac{1}{3}}. \end{align}
Note that we cannot use np.cbrt() as we did in the last tutorial, since whatever expressions we use must be able to be evaluated using Theano. So, we instead raise the expression to the 1/3 power.
def spindle_length(d, gamma, phi):
"""Spindle length as a function of droplet diameter."""
return gamma * d / (1 + (d/phi)**3)**(1/3)
Now, we define our model. We want to specify
\begin{align} \gamma &\sim \text{Uniform}(0, 1) \\[1em] \phi &\sim \text{Jeffreys}(\phi_\mathrm{min}, \phi_\mathrm{max}) \\[1em] \sigma &\sim \text{Jeffreys}(\sigma_\mathrm{min}, \sigma_\mathrm{max}) \\[1em] l_i\mid \gamma, \phi, \sigma &\sim \text{Norm}(l(d_i;\gamma, \phi), \sigma) \; \forall i. \end{align}
A problem is that PyMC3 does not have a built-in Jeffreys prior. However, we can use the fact that a Jeffreys prior is the same as a Uniform prior on the logarithm of the parameter. So, we can do a transformation to get a Jeffreys prior.
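To see why this works, note that a flat prior on $\ln \phi$ implies, by the change of variables formula,
\begin{align} g(\phi) = g(\ln \phi)\left|\frac{\mathrm{d}\ln\phi}{\mathrm{d}\phi}\right| \propto \frac{1}{\phi}, \end{align}
which is the Jeffreys prior. The same argument applies to $\sigma$.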
with pm.Model() as model:
# Priors
gamma = pm.Uniform('gamma', lower=0, upper=1)
log_phi = pm.Flat('log_phi')
log_sigma = pm.Flat('log_sigma')
# Convert logs (this gives Jeffreys priors)
phi = pm.Deterministic('phi', tt.exp(log_phi))
sigma = pm.Deterministic('sigma', tt.exp(log_sigma))
# Compute spindle length
ell_theor = spindle_length(d, gamma, phi)
# Likelihood
ell_obs = pm.Normal('ell_obs', mu=ell_theor, sd=sigma, observed=ell)
Now, let's draw some samples and make a corner plot.
with model:
trace = pm.sample(draws=10000, tune=10000, init='advi+adapt_diag', njobs=4)
And, as usual, we'll make a corner plot.
g = bebi103.viz.corner(trace,
vars=['gamma', 'phi', 'sigma'],
labels=['γ', 'φ (µm)', 'σ (µm)'],
plot_width=200,
smooth=1)
bokeh.io.show(g)
This is a similar result to what we got when we plotted the whole posterior in Tutorial 4a. In that case, we marginalized over $\sigma$ and used the marginalized posterior. Here, we have not done that, and have instead let the sampler include $\sigma$. We still get a marginalized posterior because all we need to do to marginalize is ignore the samples of the variables we are not interested in.
Now, suppose for a moment that we did want to sample out of the marginalized posterior, where
\begin{align} g(\gamma, \phi \mid \mathbf{l},\mathbf{d}) = \frac{1}{\phi}\,\frac{\Gamma(n/2)}{2\pi^{n/2}}\left(\sum_i(l_i - l(d_i;\gamma, \phi))^2\right)^{-\frac{n}{2}}. \end{align}
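As a sketch of where this expression comes from: with the Jeffreys prior $g(\sigma) \propto 1/\sigma$, we integrate $\sigma$ out of the full posterior,
\begin{align} g(\gamma, \phi \mid \mathbf{l},\mathbf{d}) \propto \frac{1}{\phi}\int_0^\infty \frac{\mathrm{d}\sigma}{\sigma^{n+1}}\,\exp\left[-\frac{1}{2\sigma^2}\sum_i\left(l_i - l(d_i;\gamma,\phi)\right)^2\right] \propto \frac{1}{\phi}\left(\sum_i\left(l_i - l(d_i;\gamma,\phi)\right)^2\right)^{-\frac{n}{2}}, \end{align}
where the remaining integral over $\sigma$ supplies the $\Gamma(n/2)$ prefactor.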
Coding up this likelihood is a bit trickier. You need to build a class that inherits a PyMC3 class to do it. You can see some discussion on it here. This particular distribution is included in the bebi103 module, as is a Jeffreys prior. We can use these to code up the model.
with pm.Model() as model:
# Priors
gamma = pm.Uniform('gamma', lower=0, upper=1)
phi = bebi103.pm.Jeffreys('phi', lower=1, upper=100)
# Compute spindle length
ell_theor = spindle_length(d, gamma, phi)
# Likelihood
ell_obs = bebi103.pm.MarginalizedHomoscedasticNormal('ell_obs', mu=ell_theor,
observed=ell)
trace = pm.sample(draws=10000, tune=10000, init='advi+adapt_diag', njobs=4)
And now, the corner plot, this time without $\sigma$, since we did not calculate it.
g = bebi103.viz.corner(trace,
vars=['gamma', 'phi'],
labels=['γ', 'φ (µm)'],
plot_width=200,
smooth=1)
bokeh.io.show(g)
And, as expected, we get the same result.
You now have an introduction to PyMC3, probably the leading Python-based package for MCMC. At the moment, it uses NUTS for sampling, which is more or less the state of the art for continuous parameters. But research in MCMC samplers develops rapidly, and there are already improvements to NUTS, such as this one. You should try not to get too caught up in the syntax of PyMC3, but think about how models are specified. This type of specification is similar to packages like Stan and Edward (see for example this blog post from Bob Carpenter). Given that the landscape is so rapidly changing (e.g., PyMC4 will likely use TensorFlow or MXNet instead of Theano), what is most important is your understanding of what MCMC is, how it works, and how to think about building your models.
Now that you can build statistical models and generate samples, you can make corner plots to visualize the posteriors. In the next tutorial, we discuss other ways to report results from MCMC calculations.