4. Conjugacy


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import scipy.stats as st

import bebi103

import holoviews as hv
hv.extension('bokeh')
bebi103.hv.set_defaults()

We have talked about Bayes's theorem as a model for learning. The idea there was that we know something before (a priori) acquiring data, and then we update our knowledge after (a posteriori). So, we come in with the prior and out with the posterior after acquiring data. It might make sense, then, that the prior and the posterior distributions have the same functional form. That is, the prior and the posterior have the same distribution, and the parameters of the distribution are updated from the prior to the posterior by the likelihood. When the prior and posterior have the same functional form, the prior is said to be conjugate to the likelihood. This seems pleasing: the likelihood serves to update the prior into the posterior, so it should determine the functional form of the prior/posterior such that they are the same.

Conjugate priors are especially useful because they enable analytical calculation of the posterior, typically without having to do any mathematics yourself. You can look up the result (though I will likely ask you to derive one in the homework to make sure you understand it). This is perhaps best seen through example.

The Beta-Binomial conjugate pair

As a motivating example of the use of a conjugate prior, we will use data from one of the experiments in a paper by Mossman et al., 2019, in which the authors investigated the effect of the age of Drosophila parents on the viability of the eggs of their offspring. In one vial, they mated young (≤ 3 days old) males with old females (≥ 45 days old). Of the 176 eggs laid by their offspring over a 10-day period, 94 of them hatched, while the remainder failed to hatch. In another vial, they mated young males with young females. Of the 190 eggs laid by their offspring over a 10-day period, 154 hatched. (They took many more data than these, but we’re using this experiment as a demonstration of conjugacy.)

The story behind the data matches that of the Binomial distribution. Each egg can be thought of as a Bernoulli trial, with success being hatching and failure being failure to hatch. The number \(n\) of hatches out of \(N\) eggs is Binomially distributed.

\begin{align} n \mid N, \theta \sim \text{Binom}(N, \theta). \end{align}

Writing out the PMF, this is

\begin{align} f(n\mid N, \theta) = \begin{pmatrix}N\\n\end{pmatrix} \theta^n(1-\theta)^{N-n}. \end{align}

This identifies a single parameter to estimate, \(\theta\), which is the probability of hatching. Our goal, then, is to compute \(g(\theta\mid n,N)\).
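
As a quick numerical illustration (my addition, not part of the original analysis; the \(\theta\) values are arbitrary), we can evaluate this likelihood with scipy.stats for the old-mother data at a few values of \(\theta\), relying on the imports from the first code cell.

# Likelihood of n = 94 hatches out of N = 176 eggs for a few values of θ
for theta in [0.3, 0.5, 0.7]:
    print(f"θ = {theta}: f(n | N, θ) = {st.binom.pmf(94, 176, theta):.3e}")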

Finding the conjugate

We seek a prior that is conjugate to the likelihood. That is, we want to choose \(g(\theta)\) such that \(g(\theta \mid n, N)\) and \(g(\theta)\) have the same functional form. Intuitively, we would like a prior with a functional form similar to that of the likelihood. The Beta distribution is one such distribution. Its PDF is

\begin{align} g(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}, \end{align}

where \(B(\alpha,\beta)\) is the Beta function,

\begin{align} B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}. \end{align}
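
As an aside, we can verify this identity numerically with scipy.special (not imported in the code cell above; this is just a sketch of a sanity check).

import scipy.special

# B(2, 3) computed directly and via the Gamma function identity; both give 1/12
print(scipy.special.beta(2, 3))
print(scipy.special.gamma(2) * scipy.special.gamma(3) / scipy.special.gamma(5))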

Note that there are two parameters, \(\alpha\) and \(\beta\), that parametrize the prior. If the Beta distribution is in fact conjugate to the Binomial, it is these two parameters that get updated by the data.

You should explore the Beta distribution in the Distribution Explorer to get a feel for its shape and how the parameters \(\alpha\) and \(\beta\) affect it. Importantly, if \(\alpha = \beta = 1\), we get a Uniform distribution on the interval [0, 1].
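
As a quick check of that special case (a one-liner of my own, not from the original lesson), the Beta(1, 1) PDF evaluates to one everywhere on [0, 1], just like the Uniform PDF.

# Beta(1, 1) PDF is one everywhere on [0, 1], i.e., Uniform
print(st.beta.pdf(np.linspace(0.1, 0.9, 5), 1, 1))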

We can now check to see if the Beta distribution is conjugate to the Binomial. If we insert a Beta distribution for the prior, we have

\begin{align} g(\theta \mid n, N, \alpha, \beta) &= \frac{f(n\mid N, \theta)\, g(\theta\mid \alpha, \beta)}{f(n \mid N)} \\[1em] &= \frac{1}{f(n\mid N)}\,\begin{pmatrix}N\\n\end{pmatrix}\,\theta^n(1-\theta)^{N-n}\,\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)} \\[1em] &= \frac{1}{f(n\mid N)\,B(\alpha,\beta)}\,\begin{pmatrix}N\\n\end{pmatrix}\,\theta^{n+\alpha-1}(1-\theta)^{N-n+\beta-1}. \end{align}

In looking at this expression, the only bit that depends on \(\theta\) is \(\theta^{n+\alpha-1}(1-\theta)^{N-n+\beta-1}\), which is exactly the \(\theta\)-dependence of a Beta distribution with parameters \(n+\alpha\) and \(N-n+\beta\). Thus, the posterior must also be a Beta distribution. Because the posterior must be normalized,

\begin{align} \frac{1}{f(n\mid N)\,B(\alpha,\beta)}\,\begin{pmatrix}N\\n\end{pmatrix} = \frac{1}{B(n+\alpha, N-n+\beta)}. \end{align}

We have just normalized the posterior without doing any nasty integrals! So, the posterior is

\begin{align} g(\theta \mid n, N, \alpha, \beta) = \frac{\theta^{n+\alpha-1}(1-\theta)^{N-n+\beta-1}}{B(n+\alpha, N-n+\beta)}. \end{align}
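
As a sanity check (my addition, not part of the derivation), we can confirm the normalization numerically for the old-mother data with \(\alpha = \beta = 1\), comparing a trapezoidal integral of the unnormalized posterior to the Beta function.

import scipy.special

# Unnormalized posterior θ^(n+α-1) (1-θ)^(N-n+β-1) for n = 94, N = 176, α = β = 1
theta = np.linspace(0, 1, 10001)
unnorm = theta**94 * (1 - theta)**82

# The trapezoidal integral should match B(n + α, N - n + β) = B(95, 83)
print(np.trapz(unnorm, theta))
print(scipy.special.beta(95, 83))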

Thus, we have prior and posterior distributions

\begin{align} &\theta \sim \text{Beta}(\alpha, \beta),\\[1em] &\theta \mid n, N \sim \text{Beta}(n+\alpha, N-n+\beta). \end{align}

So, we can see that conjugacy is useful. For a given likelihood, if we know its conjugate prior, we can just immediately write down the posterior in a clear form. The Wikipedia page on conjugate priors has a useful table of likelihood-conjugate pairs.
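
For example (an illustration beyond the original text), with a Uniform prior (\(\alpha = \beta = 1\)), we can summarize the posterior for the old-mother data immediately using a frozen Beta distribution from scipy.stats.

# Posterior for old mothers: Beta(n + α, N - n + β) = Beta(95, 83)
post = st.beta(94 + 1, 176 - 94 + 1)

print(f"posterior mean:        {post.mean():.3f}")
print(f"95% credible interval: [{post.ppf(0.025):.3f}, {post.ppf(0.975):.3f}]")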

Plots of the posteriors

We can now rapidly plot the exact posteriors for the egg hatch probability for old and young mothers.

[2]:
# Observed data
n_old = 94
N_old = 176
n_young = 154
N_young = 190

# Assume uniform prior
alpha = 1
beta = 1

# Theoretical theta values
theta = np.linspace(0, 1, 400)

# Old posterior
post_old = hv.Curve(
    (theta, st.beta.pdf(theta, n_old + alpha, N_old - n_old + beta)),
    kdims='θ',
    vdims='g(θ | n, N)',
    label='old mothers',
)

# Young posterior
post_young = hv.Curve(
    (theta, st.beta.pdf(theta, n_young + alpha, N_young - n_young + beta)),
    kdims='θ',
    vdims='g(θ | n, N)',
    label='young mothers',
)

post_young * post_old

[Bokeh plot: posterior PDFs g(θ | n, N) for young and old mothers]
In plotting the posteriors, we see that the eggs of the offspring of young mothers are clearly more likely to hatch than those of the offspring of old mothers.
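
To put a number on this comparison (my addition, not in the original notebook; the seed is arbitrary), we can draw samples from each posterior and estimate the probability that the hatch probability for young mothers exceeds that for old mothers.

# Sample from both posteriors and estimate P(θ_young > θ_old)
rg = np.random.default_rng(seed=3252)
theta_old_samples = rg.beta(n_old + alpha, N_old - n_old + beta, size=100_000)
theta_young_samples = rg.beta(n_young + alpha, N_young - n_young + beta, size=100_000)

print((theta_young_samples > theta_old_samples).mean())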

Comments on conjugates

Use of conjugate priors has advantages and disadvantages. First, the advantages.

  1. They are conceptually pleasing in that the prior and posterior have the same form, so the update of the prior by the data via the likelihood simply updates the parameters of the distribution describing the parameter values. (See, however, the criticisms of this approach below.)

  2. They make evaluation of the posterior analytically tractable, and in fact easy. For example, you now know how to immediately write down the posterior after an experiment consisting of a series of Bernoulli trials (a Binomial likelihood) using a Beta distribution for the prior and posterior.

There are disadvantages and criticisms, however.

  1. Finding conjugates for a given likelihood is typically very challenging for all but the simplest distributions. The table on Wikipedia is fairly complete for known useful conjugates, and it contains only a paltry few distributions. For any sort of complex model, finding conjugates is hopeless, which restricts their utility.

  2. Our prior information about a parameter might not actually match the form of a conjugate prior. For example, if someone tells you a coin is slightly biased, but not in which direction, you might choose a prior that is bimodal, as sketched after this list. This cannot be described with a Beta distribution. Sivia comments on this in his book: “While we might expect our initial understanding of the object of interest to have a bearing on the experiment we conduct, it seems strange that the choice of the prior pdf should have to wait for, and depend in detail upon, the likelihood function.”
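
Here is a minimal sketch of such a bimodal prior, built as a 50/50 mixture of two Beta distributions (the mixture weights and parameters are my own arbitrary illustrative choices):

# A bimodal prior for a slightly biased coin: a 50/50 mixture of
# Beta(100, 80) and Beta(80, 100), peaked near θ ≈ 0.56 and θ ≈ 0.44
theta = np.linspace(0, 1, 400)
bimodal_prior = 0.5 * st.beta.pdf(theta, 100, 80) + 0.5 * st.beta.pdf(theta, 80, 100)

hv.Curve((theta, bimodal_prior), kdims='θ', vdims='g(θ)')

No single Beta distribution has two interior modes, which is exactly the mismatch described above.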

Computing environment

[3]:
%load_ext watermark
%watermark -v -p numpy,scipy,bokeh,holoviews,bebi103,jupyterlab
CPython 3.8.5
IPython 7.19.0

numpy 1.19.2
scipy 1.5.2
bokeh 2.2.3
holoviews 1.14.0
bebi103 0.1.2
jupyterlab 2.2.6