Aux Lesson 7: Revisiting Our Introduction to Bayesian Modeling

(c) 2018 Julian Wagner and John Ciemniecki. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import scipy.stats as st
import bokeh.io
import bokeh.plotting
import pandas as pd
import bebi103
import tqdm

bokeh.io.output_notebook()

In this lesson we'll walk through generating, coding, and calculating parameter values for a Bayesian model that describes C. elegans egg cross-sectional area.

The data we'll be using comes from here.

We want to find the mean and standard deviation for a dataset of C. elegans egg sizes measured in square microns. We could easily do this with plug-in estimates (i.e. frequentist statistics), but for the sake of the exercise let's build a Bayesian model.

The model

Our model is specified as below:

\begin{align} d_i &\sim \mathrm{Norm}(\mu, \sigma)\\ \mu &\sim \mathrm{Norm}(4000,1200)\\ \sigma &\sim \mathrm{HalfNorm}(0,300) \end{align}

We chose a Gaussian likelihood because we think it is reasonable to assume that many processes compound to determine the egg size outcome. The central limit theorem states that such outcomes are approximately Gaussian distributed.

For $\mu$, we chose a Normal prior distribution with mean 4000 $\mu m^2$ and standard deviation 1200 $\mu m^2$. We reasoned that the egg's diameter should fall somewhere between 2 and 100 $\mu$m (between the size of a bacterium and 1/10th the size of an adult worm). Transforming these into circular cross-sectional areas, we estimate our lower and upper bounds to be approximately 3.14 $\mu m^2$ and 7.85$\times 10^{3}$ $\mu m^2$. The average of these values is about 4000 $\mu m^2$, so we chose that for our mean, with a standard deviation of around 1200 to achieve the shape coded below:
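The diameter-to-area conversion above is just the area of a circle; a quick check of the bounds (a minimal sketch, assuming circular cross sections):

# Circular cross-sectional areas for the guessed diameter bounds of 2 and 100 microns
print(np.pi * (2 / 2)**2, np.pi * (100 / 2)**2)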

In [2]:
mu = np.linspace(0, 8000, 8000)
prob = st.norm.pdf(mu, 4000, 1200)

p = bokeh.plotting.figure(
    height=200,
    width=350,
    x_axis_label='mu (um^2)',
    y_axis_label='PDF')
p.line(mu, prob, line_width=2)

bokeh.io.show(p)

For $\sigma$, we chose a HalfNormal distribution centered at 0 with scale parameter 300, which is weakly informative but makes extremely large spreads relative to our chosen prior mean (anything greater than about 1000 $\mu m^2$, i.e. a quarter of 4000 $\mu m^2$) exceedingly improbable. See the shape below:

In [3]:
sigma = np.linspace(0, 4000, 4000)
prob = st.halfnorm.pdf(sigma, 0, 300)

p = bokeh.plotting.figure(
    height=200,
    width=350,
    x_axis_label='sigma (um^2)',
    y_axis_label='PDF')
p.line(sigma, prob, line_width=2)

bokeh.io.show(p)
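To back up that claim numerically, we can compute how much prior probability the HalfNorm(0, 300) distribution places on $\sigma$ values larger than 1000 $\mu m^2$ (a minimal sketch using the scipy.stats survival function; the 1000 $\mu m^2$ threshold is just the illustrative cutoff from above):

# Prior probability that sigma exceeds 1000 square microns under HalfNorm(0, 300).
# The survival function is 1 - CDF; we expect a value on the order of 1e-3.
print(st.halfnorm.sf(1000, 0, 300))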

Executing the prior predictive check (PPC)

1) Draw single, random parameter set from prior distribution(s)
2) Plug parameter set into likelihood
3) Draw a random dataset from this likelihood
4) Repeat 1-3 for the number of PPC samples you want to draw
5) Plot the collection of PPC samples
6) Check that values you see in these datasets are physically reasonable

In [4]:
def draw_parameters_from_priors():
    """Produces a random set of mu and sigma parameters 
       from the specified priors, step 1 of PPC."""
    
    mu = np.random.normal(4000, 1200)
    sigma = np.abs(np.random.normal(0, 300))
    return mu, sigma

Next, we will use a function to perform one 'trial' of the experiment with the parameters drawn from the prior. That is, we draw a value for the egg cross-sectional area. This is analogous to performing one measurement of an egg.

In [5]:
def draw_datum_from_likelihood(mu, sigma):
    """Produces a random datum from the specified 
       likelihood with the parameters passed as arguments, 
       steps 2-3 of PPC."""
    
    datum = np.random.normal(mu, sigma)
    return datum

To construct a full data set, we need to make as many egg cross-sectional area draws for a given parameter set as the experimenters made measurements of eggs. Note from the data set that this is 57 measurements for the low-food worms.

In [6]:
def draw_ppc_samples(number_datum_per_data_set):
    """Produces a random dataset from the specified 
       likelihood and priors, steps 1-3 of PPC."""
    
    mu, sigma = draw_parameters_from_priors()
    data_set = [draw_datum_from_likelihood(mu, sigma) for i in range(number_datum_per_data_set)]
    return data_set

We can now construct many data sets (we will draw 50), each composed of 57 draws of egg cross-sectional area from a likelihood whose parameters come from a single random draw from the priors.

In [7]:
p = bokeh.plotting.figure(x_axis_label='Cross sectional area (um^2)', 
                          y_axis_label='ECDF', 
                          plot_height=300, 
                          plot_width=400)

# Steps 4 and 5 of PPC
for i in range(50):
    bebi103.viz.ecdf(draw_ppc_samples(57), p=p, alpha=0.2)
bokeh.io.show(p)

Step 6: These ECDFs look reasonable, i.e., there are no values smaller than a bacterium or larger than 1/10th of an adult worm, which are the bounds we specified when thinking about our priors. While it's hard to have an expectation for the variance of egg sizes, the spread in these ECDFs seems reasonable; most cover a range no larger than about 1000 square microns. We'll move forward with calculating and plotting the posterior.
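If we want a more quantitative version of this eyeball check, we can tally how often the PPC draws fall outside the physically reasonable bounds we used to build the priors (a minimal sketch; the bounds of roughly 3.14 and 7850 $\mu m^2$ come from the reasoning above, and we expect the fraction to be small):

# Fraction of PPC draws falling outside the physically reasonable range
ppc_draws = np.concatenate([draw_ppc_samples(57) for _ in range(50)])
print(np.mean((ppc_draws < 3.14) | (ppc_draws > 7850)))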

Plotting Posterior

We first load the data so we can check that our PPC lines up with it.

In [10]:
# Load data
df = pd.read_csv('../data/c_elegans_egg_xa.csv', comment='#')
df.head()
Out[10]:
   food  area (sq um)
0  high          1683
1  high          2061
2  high          1792
3  high          1852
4  high          2091

We plot the data.

In [11]:
# Take a look
bokeh.io.show(bebi103.viz.ecdf(df.loc[(df['food']=='low'), 'area (sq um)'], 
                               x_axis_label='Cross sectional area (um^2)'))
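For later comparison with the Bayesian result, we can also compute the plug-in estimates mentioned at the start of the lesson (a quick sketch; ddof=1 gives the usual sample standard deviation):

# Plug-in (frequentist) estimates for the low food condition
low_areas = df.loc[df['food'] == 'low', 'area (sq um)'].values
print(np.mean(low_areas), np.std(low_areas, ddof=1))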

0) For this dataset D, assume independence of the individual measurements; i.e., the total likelihood is a product of independent likelihoods, one per datum: \begin{align} f(D \mid \mu, \sigma) = \prod_{i=1}^{N} \mathrm{Norm}(d_i \mid \mu, \sigma) \end{align}
1) Plug the dataset D (i.e. $d_i \, \forall i$) into the above equation.
2) Construct a function that multiplies this likelihood by the priors.
3) Feed a numpy linspace of candidate $\mu$ and $\sigma$ values into this posterior function. The output is a posterior probability value for each $(\mu, \sigma)$ pair.
4) Plot the posterior, i.e. $g(\mu, \sigma \mid D)$ vs. $\mu, \sigma$; we can use Justin's bebi103.viz.contour.

\begin{align} \mathrm{normal\, pdf} = \frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(x-\mu)^{2}}{2\sigma^2}\right) \end{align}

In [12]:
def normal_pdf(x, params):
    """Function for a gaussian likelihood.
       -----
       x (float): value to calculate the probability of.
       Params (tuple or array): Single set of mu, sigma parameter set."""
    
    mu, sigma = params
    
    norm_pdf = (np.sqrt(1/(2*np.pi*sigma**2)) *
                np.exp(-(x-mu)**2/(2*sigma**2)))
    
    return norm_pdf

\begin{align} \mathrm{prior \, \mu} = \frac{1}{\sqrt{2 \pi (1200)^{2}}} \mathrm{exp}\left(\frac{-(\mu-4000)^{2}}{2 \cdot \mathrm{(1200)}^2}\right) \end{align}

\begin{align} \mathrm{prior \, \sigma} = \frac{2}{\sqrt{2 \pi (300)^{2}}} \mathrm{exp}\left(\frac{-\sigma^{2}}{2 \cdot (300)^2}\right), \quad \sigma \geq 0 \end{align}

(The factor of 2 in the half-normal normalization is a constant, so dropping it in the code below does not change the posterior.)

In [13]:
def log_prior_explicit(params):
    """Function for the log of the specified priors.
       -----
       params (tuple or array): single (mu, sigma) parameter set."""
    
    mu, sigma = params
    
    prior_mu = np.log(normal_pdf(mu, (4000, 1200)))
    # The half-normal PDF is twice the normal PDF for sigma >= 0; the factor
    # of 2 is a constant, so we drop it without changing the posterior.
    prior_sigma = np.log(normal_pdf(sigma, (0, 300)))
    log_prior = prior_mu + prior_sigma
    
    return log_prior

Each term of our likelihood takes the same form, though there are many such terms (one per datum), which we will address momentarily.

\begin{align} \mathrm{likelihood}_i = \frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(d_i-\mu)^{2}}{2\sigma^2}\right) \quad \forall i. \end{align}

Since we are assuming that these are i.i.d., we can write this as the product of all these terms. That is,

\begin{align} f(d \, | \, \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(d_i-\mu)^{2}}{2\sigma^2}\right). \end{align}

We will work with the log likelihood here for numerical stability. By the algebra of logarithms, the product becomes a sum of terms, giving the log likelihood. That is,

\begin{align} \log(\mathrm{likelihood}) &= \log{\left(\prod_{i=1}^{N} \frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(d_i-\mu)^{2}}{2\sigma^2}\right)\right)} \\ &= \log{\left(\frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(d_1-\mu)^{2}}{2\sigma^2}\right)\right)} \\ &\quad + \log{\left(\frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(d_2-\mu)^{2}}{2\sigma^2}\right)\right)} \\ &\quad + \cdots \\ &\quad + \log{\left(\frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(d_N-\mu)^{2}}{2\sigma^2}\right)\right)} \end{align}

In [14]:
def log_like_explicit(params, d):
    """Function for log of gaussian likelihood.
       -----
       Params (tuple or array): Single set of mu, sigma parameter set.
       d (array): 1D dataset of egg sizes in square microns."""
    
    mu, sigma = params
    total_log_like = 0
    
    for datum in d:
        datum_log_like = np.log(normal_pdf(datum, (mu, sigma)))
        total_log_like += datum_log_like
        
    return total_log_like

With these equations in hand, we can easily get the posterior from the other terms. Note that the evidence is the constant term; recall that it is essentially a normalization constant, which we will ignore for this analysis.

\begin{align} \mathrm{posterior} &= \mathrm{const} \cdot \mathrm{prior \, \mu} \cdot \mathrm{prior \, \sigma} \cdot \mathrm{likelihood}\\ \log(\mathrm{posterior}) &= \log(\mathrm{const} \cdot \mathrm{prior \, \mu} \cdot \mathrm{prior \, \sigma} \cdot \mathrm{likelihood})\\ &= \mathrm{const} + \log(\mathrm{prior \, \mu}) + \log(\mathrm{prior \, \sigma}) + \log(\mathrm{likelihood}) \end{align}

In [15]:
def log_posterior_explicit(params, d):
    """Function for posterior of c. elegans egg size model.
       -----
       Params (tuple or array): Single set of mu, sigma parameter set.
       d (array): 1D dataset of egg sizes in square microns."""
    
    return log_prior_explicit(params) + log_like_explicit(params, d)

We will now use a brute force approach to evaluate the posterior. Since we have two parameters, we can plot the whole thing using a contour plot.

In [16]:
# Set up plotting range
mu = np.linspace(2000, 2200, 200)
sigma = np.linspace(100, 200, 200)
MU, SIGMA = np.meshgrid(mu, sigma)

We run the calculations for a contour plot of the posterior.

In [17]:
# Only consider low food condition in dataset for now.
areas = df.loc[(df['food']=='low'), 'area (sq um)']

# Compute log posterior
LOG_POST = np.empty_like(MU)
for i in tqdm.tqdm_notebook(range(MU.shape[0])):
    for j in range(MU.shape[1]):
        LOG_POST[i, j] = log_posterior_explicit(
                            np.array([MU[i, j], SIGMA[i,j]]), areas)
        
# Exponentiate. Ignore normalization, so set max LOG_POST to zero
POST = np.exp(LOG_POST - LOG_POST.max())

Plot the results!

In [18]:
# Make plot
p = bebi103.viz.contour(MU,
                        SIGMA,
                        POST,
                        x_axis_label='µ (µm^2)',
                        y_axis_label='σ (µm^2)',
                        title='Posterior for egg cross sectional area',
                        overlaid=True)
bokeh.io.show(p)

Looks good! The shape looks like an egg, so it's obviously the correct posterior for this dataset. Just kidding!
From here we can eyeball that µ is about 2100 $\mu m^2$ and σ is about 150 $\mu m^2$. But would we get the same answer if we had different priors or a smaller dataset?
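Rather than eyeballing, we can also pull the maximizing grid point directly out of the POST array we just computed (a small sketch; this only locates the maximum on our plotting grid, not a true optimum):

# Grid point with maximal posterior probability (the MAP estimate, up to grid resolution)
i_max, j_max = np.unravel_index(np.argmax(POST), POST.shape)
print(MU[i_max, j_max], SIGMA[i_max, j_max])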

An exercise on shrinkage

Let's now switch over to scipy.stats functions to demonstrate shrinkage! Again, our model is:

\begin{align} \mathrm{prior\, on \,} \mu &\sim \mathrm{Norm}(4000,1200)\\ \mathrm{prior\, on \,} \sigma &\sim \mathrm{HalfNorm}(0,300)\\ \mathrm{likelihood \, for \,} d_i &\sim \mathrm{Norm}(\mu, \sigma) \end{align} For the log of these, we have \begin{align} \log{(\mathrm{prior\, on \,} \mu)} &= \log{(\mathrm{Norm}(4000,1200))}\\ \log{(\mathrm{prior\, on \,} \sigma)} &= \log{(\mathrm{HalfNorm}(0,300))}\\ \log{(\mathrm{likelihood \, for \,} d_i)} &= \log{(\mathrm{Norm}(\mu, \sigma))} \end{align}

Now that we have demonstrated how to implement the PDFs manually, we return to using the scipy.stats functionality.

In [19]:
def log_prior(params):
    mu, sigma = params
    prior_mu = st.norm.logpdf(mu, 4000, 1200)
    prior_sigma = st.halfnorm.logpdf(sigma, 0, 300)
    log_prior = prior_mu + prior_sigma
    return log_prior

def log_like(params, d):
    mu, sigma = params
    return np.sum(st.norm.logpdf(d, mu, sigma))

def log_post(params, d):
    return log_prior(params) + log_like(params, d)

We check our previous work and make sure the posterior looks good.

In [20]:
# Set up plotting range
mu = np.linspace(2000, 2200, 200)
sigma = np.linspace(100, 200, 200)
MU, SIGMA = np.meshgrid(mu, sigma)

With these parameters, we make the plot of the posterior.

In [21]:
# Compute log posterior
LOG_POST = np.empty_like(MU)
for i in tqdm.tqdm_notebook(range(MU.shape[0])):
    for j in range(MU.shape[1]):
        LOG_POST[i, j] = log_post(
                            np.array([MU[i, j], SIGMA[i,j]]), areas)
        
# Exponentiate. Ignore normalization, so set max LOG_POST to zero
POST = np.exp(LOG_POST - LOG_POST.max())

Time to make the plot again.

In [22]:
# Make plot
p = bebi103.viz.contour(MU,
                        SIGMA,
                        POST,
                        x_axis_label='µ (µm^2)',
                        y_axis_label='σ (µm^2)',
                        title='Posterior for egg cross sectional area',
                        overlaid=True)
bokeh.io.show(p)

As expected, we get the same-looking posterior using the scipy.stats functions as when we coded up the equations manually.

We are interested in looking at how the prior and the data interact. That is, how does changing how 'informative' our prior is affect the resulting posterior? We first note how the sheer number of terms might play into this.

\begin{align} \log(\mathrm{posterior}) = \, &\mathrm{const}\, + \\ &\log(\mathrm{prior \, \mu})\, + &&(1\ \mathrm{term}) \\ &\log(\mathrm{prior \, \sigma})\, + &&(1\ \mathrm{term})\\ &\log(\mathrm{likelihood}) &&(N\ \mathrm{terms}) \end{align}

If we have a lot of data points (or even moderately many), the likelihood term will begin to dominate the effects of the priors on the posterior. Even if the prior gives a low probability for a given choice of parameters, if the likelihood gives a high probability, it does so ~N times over, outweighing the prior. To see how this works, we will start with the same priors as above, but consider different-sized subsets of the data points and see how the posterior shape changes as the number of data points increases.
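To get a feel for this bookkeeping, we can compare the size of the log prior to the size of the log likelihood at a fixed parameter set as the number of data points grows (a rough sketch; the parameter values 2100 and 150 are arbitrary illustrative choices near the peak we found above):

# The log prior stays fixed while the log likelihood accumulates one term per datum
params = np.array([2100, 150])
low_areas = df.loc[df['food'] == 'low', 'area (sq um)'].values
for n in [1, 10, 57]:
    print(n, log_prior(params), log_like(params, low_areas[:n]))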

In [23]:
# Set up plotting range
mu = np.linspace(1600, 6000, 200)
sigma = np.linspace(1,500, 200)
MU, SIGMA = np.meshgrid(mu, sigma)

For this, we will take a random subsample (drawn with replacement) of 1, 10, 20, 30, 40, 50, or 57 data points and observe the posterior shape.

In [26]:
plots = []

# Compute log posterior
LOG_POST = np.empty_like(MU)
for i in range(MU.shape[0]):
    for j in range(MU.shape[1]):
        LOG_POST[i, j] = log_prior(
                            np.array([MU[i, j], SIGMA[i,j]]))
POST = np.exp(LOG_POST - LOG_POST.max())

plots.append(bebi103.viz.contour(MU,
                        SIGMA,
                        POST,
                        x_axis_label='µ (µm^2)',
                        y_axis_label='σ (µm^2)',
                        title='Prior for egg cross sectional area',
                        overlaid=True))

for num_dat in tqdm.tqdm_notebook([1, 10, 20, 30, 40, 50, 57]):
    areas = np.random.choice(df.loc[(df['food'] == 'low'), 'area (sq um)'], num_dat)

    # Compute log posterior
    LOG_POST = np.empty_like(MU)
    for i in range(MU.shape[0]):
        for j in range(MU.shape[1]):
            LOG_POST[i, j] = log_post(
                                np.array([MU[i, j], SIGMA[i,j]]), areas)

    # Exponentiate. Ignore normalization, so set max LOG_POST to zero
    POST = np.exp(LOG_POST - LOG_POST.max())
    plots.append(bebi103.viz.contour(MU,
                        SIGMA,
                        POST,
                        x_axis_label='µ (µm^2)',
                        y_axis_label='σ (µm^2)',
                        title=str(num_dat)+' data point(s), posterior for egg area',
                        overlaid=True))

With our set of plots made, let's take a look!

In [27]:
bokeh.io.show(bokeh.layouts.gridplot(plots,ncols=2))

Even one data point has a potent effect in drawing the high probability region of the mean towards the value we saw above! In general, we can see that the more data points we consider, the tighter the high probability region of the posterior becomes. This is exactly what we want in a model: the more data we have, the more certain we are of a parameter value.

In order to see what happens if we choose a very 'informative' prior, we will make the peaks of our previous prior choices much sharper. If we don't have much prior information about the system, this seems like a foolish thing to do: it encodes an inflated sense of certainty in where we expect the parameter values to be.

In [28]:
plots = []
mu = np.linspace(0, 8000, 8000)
prob = st.norm.pdf(mu, 4000, 100)

p = bokeh.plotting.figure(height=200, width=350, x_axis_label='mu (um^2)', y_axis_label='PDF', title='Sharply peaked prior for mu')
p.line(mu, prob, line_width=2)
plots.append(p)

prob = st.norm.pdf(mu, 4000, 1200)
p = bokeh.plotting.figure(height=200, width=350, x_axis_label='mu (um^2)', y_axis_label='PDF', title='Previously used prior for mu')
p.line(mu, prob, line_width=2)
plots.append(p)

sigma = np.linspace(0, 4000, 4000)
prob = st.halfnorm.pdf(sigma, 0, 10)

p = bokeh.plotting.figure(height=200, width=350, x_axis_label='sigma (um^2)', y_axis_label='PDF', title='Sharply peaked prior for sigma')
p.line(sigma, prob, line_width=2)
plots.append(p)

prob = st.halfnorm.pdf(sigma, 0, 300)
p = bokeh.plotting.figure(height=200, width=350, x_axis_label='sigma (um^2)', y_axis_label='PDF', title='Previously used prior for sigma')
p.line(sigma, prob, line_width=2)
plots.append(p)

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

We will put these new priors into our model and see how the posterior looks for varying numbers of data points.

In [29]:
def log_prior(params):
    mu, sigma = params
    prior_mu = st.norm.logpdf(mu, 4000, 100)
    prior_sigma = st.halfnorm.logpdf(sigma, 0, 10)
    log_prior = prior_mu + prior_sigma
    return log_prior

def log_like(params, d):
    mu, sigma = params
    return np.sum(st.norm.logpdf(d, mu, sigma))

def log_post(params, d):
    return log_prior(params) + log_like(params, d)

We set a range of parameter values for plotting.

In [30]:
# Set up plotting range
mu = np.linspace(1600, 6000, 200)
sigma = np.linspace(1,500, 200)
MU, SIGMA = np.meshgrid(mu, sigma)

We will now iterate through differing numbers of data points and look at the posterior. In addition to the full data set size, we will also inflate the number of data points by resampling the data with replacement. For example, we will draw 5000 points (roughly 100 copies of each measurement, on average) to get an idea of how having 5000 data points instead of 57 would influence the posterior.

In [31]:
plots = []

# Compute log posterior
LOG_POST = np.empty_like(MU)
for i in range(MU.shape[0]):
    for j in range(MU.shape[1]):
        LOG_POST[i, j] = log_prior(
                            np.array([MU[i, j], SIGMA[i,j]]))
POST = np.exp(LOG_POST - LOG_POST.max())

plots.append(bebi103.viz.contour(MU,
                        SIGMA,
                        POST,
                        x_axis_label='µ (µm^2)',
                        y_axis_label='σ (µm^2)',
                        title='Prior for egg cross sectional area',
                        overlaid=True))

for num_dat in tqdm.tqdm_notebook([1, 10, 20, 30, 40, 50, 57, 500, 5000]):
    areas = np.random.choice(df.loc[(df['food'] == 'low'), 'area (sq um)'], num_dat)

    # Compute log posterior
    LOG_POST = np.empty_like(MU)
    for i in range(MU.shape[0]):
        for j in range(MU.shape[1]):
            LOG_POST[i, j] = log_post(
                                np.array([MU[i, j], SIGMA[i,j]]), areas)

    # Exponentiate. Ignore normalization, so set max LOG_POST to zero
    POST = np.exp(LOG_POST - LOG_POST.max())
    plots.append(bebi103.viz.contour(MU,
                        SIGMA,
                        POST,
                        x_axis_label='µ (µm^2)',
                        y_axis_label='σ (µm^2)',
                        title=str(num_dat)+' data point(s), posterior for egg area',
                        overlaid=True))

Time to plot!

In [32]:
bokeh.io.show(bokeh.layouts.gridplot(plots,ncols=2))

This looks terribly wrong even when we use the full 57 data points!

Our prior on sigma was so sharply peaked near zero that the data were not able to pull it to the value we obtained previously. Notice also that every posterior plot has a much smaller area of high probability than it did before. The sharpness of our priors makes it look like we have high certainty about the wrong parameter values! Don't let this scare you away from priors: the data easily overcome a broadly peaked prior distribution, as we saw in the first case. However, you should never use a strongly informative prior unless you have a compelling reason to do so (e.g., you have heaps of information before the experiment).