Aux Lesson 7: Revisiting Our Introduction to Bayesian Modeling

(c) 2018 Julian Wagner and John Ciemniecki. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import tqdm

import numpy as np
import pandas as pd
import scipy.stats as st

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

In this lesson we'll walk through specifying, coding, and calculating parameter values for a Bayesian model that describes C. elegans egg cross-sectional area. The data we'll be using comes from Harvey and Orbidans, PLoS ONE, 6(10): e25840, and can be downloaded here.

We want to find a model that accurately describes C. elegans egg cross-sectional area, measured in square microns.

First: The model

Please specify a generative Bayesian model for this dataset, including a likelihood and associated priors. Provide physical reasoning for all of your choices.

Second: PPC

1) Draw a single, random parameter set from the prior distribution(s)
2) Plug the parameter set into the likelihood
3) Draw a random dataset from this likelihood
4) Repeat 1-3 for the number of PPC samples you want to draw
5) Plot the collection of PPC samples
6) Check that the values you see in these datasets are physically reasonable

Use the skeleton code below to guide you in drawing prior predictive check samples from your model.

In [1]:
def draw_parameters_from_priors():
    """Produces a random set of parameters 
       from the specified priors."""
    
    return params
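
As a point of reference, here is one way this function might be filled in, assuming a normal prior on µ and a half-normal prior on σ (matching the prior equations given later in this lesson). The hyperparameter values mean_mu, std_mu, and std_sigma below are placeholder choices for illustration, not values prescribed by the exercise.

In [ ]:
# Placeholder hyperparameters for the priors (illustrative values only)
mean_mu = 2100     # center of the prior on µ (sq µm); assumed value
std_mu = 200       # width of the prior on µ (sq µm); assumed value
std_sigma = 200    # scale of the half-normal prior on σ (sq µm); assumed value

def draw_parameters_from_priors():
    """Produces a random set of parameters 
       from the specified priors."""
    mu = np.random.normal(mean_mu, std_mu)
    sigma = np.abs(np.random.normal(0, std_sigma))  # half-normal draw
    return np.array([mu, sigma])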

Next, we will use a function to perform one 'trial' of the experiment with the parameters drawn from the prior. That is, we draw a value for egg cross-sectional area. This is analogous to performing one measurement of the egg.

In [2]:
def draw_datum_from_likelihood(params):
    """Produces a random datum from the specified 
       likelihood with parameters passed in arguments."""
    
    return datum
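
A minimal sketch of one possible implementation, assuming the normal likelihood described above:

In [ ]:
# One possible implementation, assuming a normal likelihood
def draw_datum_from_likelihood(params):
    """Produces a random datum from the specified 
       likelihood with parameters passed in arguments."""
    mu, sigma = params
    return np.random.normal(mu, sigma)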

To construct a full data set, we need to make as many egg cross-sectional area draws with a given parameter set as the experimenters made measurements of eggs. Note from the data set that this is 57 measurements for the worms in the low-food condition.

In [3]:
def draw_ppc_sample(n_data_points):
    """Produces a random dataset with n_data_points 
       from the specified likelihood."""
    
    return ppc_sample
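
One way to build this from the two functions above: draw a single parameter set, then draw n_data_points data points from the likelihood with those parameters.

In [ ]:
# One possible implementation, reusing the functions above
def draw_ppc_sample(n_data_points):
    """Produces a random dataset with n_data_points 
       from the specified likelihood."""
    params = draw_parameters_from_priors()
    return np.array([draw_datum_from_likelihood(params) 
                     for _ in range(n_data_points)])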

We can now construct many data sets (we will draw 50), each composed of 57 draws of egg cross-sectional area, with all draws in a given data set sharing the same randomly drawn parameter set.

In [ ]:
n_ppc_samples = 50

p = bokeh.plotting.figure(x_axis_label='Cross sectional area (um^2)', 
                          y_axis_label='ECDF', 
                          plot_height=300, 
                          plot_width=400)

# Overlay the ECDF of each prior predictive dataset
for i in range(n_ppc_samples):
    bebi103.viz.ecdf(draw_ppc_sample(57), p=p, alpha=0.2)

bokeh.io.show(p)

Third: The posterior

Use the skeleton code below to guide you in plotting the posterior of your model. Do not use scipy.stats functions. The relevant mathematical equations for the model are given above each code cell.

Since our priors and likelihood are all built from the normal distribution, our first task is to define it mathematically.

\begin{align} \mathrm{normal\, pdf} = \frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(x-\mu)^{2}}{2\sigma^2}\right). \end{align}

In [4]:
def normal_pdf(x, params):
    """Explicitly coded function for a gaussian likelihood.
       -----
       x (float): value to calculate the probability of.
       Params (tuple or array): Single set of parameters."""

    return p
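
One way to code this up explicitly (without scipy.stats), following the equation above:

In [ ]:
# Explicitly coded normal PDF (no scipy.stats), following the equation above
def normal_pdf(x, params):
    """Explicitly coded function for a Gaussian likelihood.
       -----
       x (float): value to calculate the probability of.
       Params (tuple or array): Single set of parameters."""
    mu, sigma = params
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)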

The mathematical expressions for our priors take the same form as the equation above, with the relevant parameters inserted.

\begin{align} \mathrm{prior \, \mu} = \frac{1}{\sqrt{2 \pi (\mathrm{std_\mu})^{2}}} \mathrm{exp}\left(\frac{-(\mu-\mathrm{mean_\mu})^{2}}{2 \cdot \mathrm{(\mathrm{std_\mu})}^2}\right). \end{align}

\begin{align} \mathrm{prior \, \sigma} = \, \left| \frac{1}{\sqrt{2 \pi (\mathrm{std_\sigma})^{2}}} \mathrm{exp}\left(\frac{-(\sigma-0)^{2}}{2 \cdot (\mathrm{std_\sigma})^2}\right) \right|. \end{align}

In [4]:
def log_prior_explicit(params):
    """Function for log of specified priors.
       -----
       Params (tuple or array): Single set of parameters."""
    
    return log_prior
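
A sketch of one possible implementation, working in log space and reusing the placeholder hyperparameters (mean_mu, std_mu, std_sigma) defined in the prior predictive section above. The half-normal prior on σ is handled by returning negative infinity for negative σ.

In [ ]:
# One possible implementation, reusing the placeholder hyperparameters above
def log_prior_explicit(params):
    """Function for log of specified priors.
       -----
       Params (tuple or array): Single set of parameters."""
    mu, sigma = params
    
    # Half-normal prior on sigma: zero probability density for sigma < 0
    if sigma < 0:
        return -np.inf
    
    log_prior_mu = (-(mu - mean_mu)**2 / (2 * std_mu**2)
                    - 0.5 * np.log(2 * np.pi * std_mu**2))
    log_prior_sigma = (-sigma**2 / (2 * std_sigma**2)
                       - 0.5 * np.log(2 * np.pi * std_sigma**2))
    
    return log_prior_mu + log_prior_sigma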

Our likelihood takes the same form, though it has many terms that we will address momentarily.

\begin{align} f(y_i \, | \, \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(y_i-\mu)^{2}}{2\sigma^2}\right) \;\; \forall \, i. \end{align}

Since we are assuming that the measurements are i.i.d., we can write the full likelihood as the product of all of these terms. That is,

\begin{align} f(y \, | \, \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(y_i-\mu)^{2}}{2\sigma^2}\right). \end{align}

We will work with the log likelihood here for numerical reasons. By the algebra of logarithms, this product becomes a sum of terms, giving the log likelihood. That is,

\begin{align} \log(\mathrm{likelihood}) &= \log{\left(\prod_{i=1}^{N} \frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(y_i-\mu)^{2}}{2\sigma^2}\right)\right)} \\ &= \log{\left(\frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(y_1-\mu)^{2}}{2\sigma^2}\right)\right)} \\ &\quad + \log{\left(\frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(y_2-\mu)^{2}}{2\sigma^2}\right)\right)} \\ &\quad + \cdots \\ &\quad + \log{\left(\frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{exp}\left(\frac{-(y_N-\mu)^{2}}{2\sigma^2}\right)\right)}. \end{align}

In [6]:
def log_like_explicit(params, d):
    """Function for log of likelihood.
       -----
       Params (tuple or array): Single set of parameters.
       d (array): 1D dataset of egg sizes in square microns."""
        
    return total_log_like
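
One possible implementation, summing the log of the normal PDF over all data points as in the expansion above:

In [ ]:
# One possible implementation: sum the log of the normal PDF over the data
def log_like_explicit(params, d):
    """Function for log of likelihood.
       -----
       Params (tuple or array): Single set of parameters.
       d (array): 1D dataset of egg sizes in square microns."""
    mu, sigma = params
    
    return np.sum(-(d - mu)**2 / (2 * sigma**2)
                  - 0.5 * np.log(2 * np.pi * sigma**2))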

With these equations in hand, we can easily get the posterior from the other terms. Note that the evidence is the constant term. Recall that this is essentially a normalization constant, which we will ignore for this analysis.

\begin{align} \mathrm{posterior} &= \mathrm{const} \cdot \mathrm{prior \, \mu} \cdot \mathrm{prior \, \sigma} \cdot \mathrm{likelihood}\\ \log(\mathrm{posterior}) &= \log\left(\mathrm{const} \cdot \mathrm{prior \, \mu} \cdot \mathrm{prior \, \sigma} \cdot \mathrm{likelihood}\right)\\ &= \mathrm{const} + \log(\mathrm{prior \, \mu}) + \log(\mathrm{prior \, \sigma}) + \log(\mathrm{likelihood}). \end{align}

In [7]:
def log_posterior_explicit(params, d):
    """Function for posterior of c. elegans egg size model.
       -----
       Params (tuple or array): Single set of parameters.
       d (array): 1D dataset of egg sizes in square microns."""
    
    return log_prior_explicit(params) + log_like_explicit(params, d)

We will now use a brute-force approach to evaluate the posterior. Since we have only two parameters, we can plot the whole thing using a contour plot.

In [ ]:
# Set up plotting range
mu = np.linspace(2000, 2200, 200)
sigma = np.linspace(100, 200, 200)
MU, SIGMA = np.meshgrid(mu, sigma)

# Load data
df = pd.read_csv('../data/c_elegans_egg_xa.csv', comment='#')

# Only consider low food condition in dataset for now.
areas = df.loc[(df['food']=='low'), 'area (sq um)']

# Compute log posterior
LOG_POST = np.empty_like(MU)

for i in tqdm.tqdm_notebook(range(MU.shape[0])):
    for j in range(MU.shape[1]):
        LOG_POST[i, j] = log_posterior_explicit(
                            np.array([MU[i, j], SIGMA[i,j]]), areas)
        
# Exponentiate. Ignore normalization, so set max LOG_POST to zero
POST = np.exp(LOG_POST - LOG_POST.max())

# Make plot
p = bebi103.viz.contour(MU,
                        SIGMA,
                        POST,
                        x_axis_label='µ (µm^2)',
                        y_axis_label='σ (µm^2)',
                        title='Posterior for egg cross sectional area',
                        overlaid=True)

bokeh.io.show(p)