R3. Choosing priors
This recitation was conducted by Cece Andrews and Ariana Tribby.
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
cmd = "pip install --upgrade iqplot watermark"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
data_path = "../data/"
# ------------------------------
import numpy as np
import pandas as pd
import iqplot
import bokeh.io
bokeh.io.output_notebook()
import scipy.special
import scipy.optimize
import scipy.stats as st
Part 1: Review of Lecture
In Part 1 we reviewed the Lecture 2 section on choosing priors. A quick recap of what we learned:
Uniform prior: Not the best. Improper prior – it gives us no information and is not normalizable!
Jeffreys prior: Also not great. Fixes the problem of the uniform prior where the priors vary with change of variables, but it is very hard to compute for some distributions.
Weakly informative prior: Good! We can choose a prior probability distribution to encode what we know about the parameter before we measured data. We will discuss how to come up with weakly informative priors below.
Conjugate prior: Can be helpful! Makes the posterior possible to analyze analytically, but only a few distributions have conjugates, so not always possible.
Part 2: Tips and Practice for Choosing Weakly Informative Priors
Recommended steps for choosing priors:
Write down Bayes’ Theorem first!
In words:
Choose your likelihood (please see Recitation 2/Lecture 2 for help).
Assuming the parameters are independently distributed in the prior, you will need to choose a prior for each parameter in your likelihood. For example, if you choose
, you will need to choose priors and .Consider what you know about the parameter. Do you know what it means in the system? Are there any bounds on its value? Is it related to other parameters? Will its value be very close to zero/very far from zero?
Come up with a sketch of what you think the probability distribution for your prior will look like. Your sketch should cover all possible values of the parameter.
Use the Distribution Explorer to help you choose a probability distribution that matches your sketch. Consider the story and shape of the distribution. Play around with the distribution explorer, how does changing the parameters change the shape of the distribution?
Go for more broad instead of more narrow!
Don’t stress too much! Most often there will be several reasonable choices for priors. And they will be overwhelmed by the likelihood. As long as your prior is reasonable and not overly narrow, you will be fine!
We will now do a couple examples of choosing priors with data used in previous terms/years of the course. In these examples we will not worry too much about the likelihood and focus more on practicing choosing priors.
Example 1: Spindle Size
From our favorite data set involving mitotic spindle sizes….
“The data set we will use for this analysis comes from Good, et al., Cytoplasmic volume modulates spindle size during embryogenesis Science, 342, 856-860, 2013. You can download the data set here. In this work, Matt Good and coworkers developed a microfluidic device where they could create droplets cytoplasm extracted from Xenopus eggs and embryos (see figure below; scale bar 20 µm). A remarkable property about Xenopus extract is that mitotic spindles spontaneously form; the extracted cytoplasm has all the ingredients to form them. This makes it an excellent model system for studying spindles. With their device, Good and his colleagues were able to study how the size of the cell affects the dimensions of the mitotic spindle; a simple, yet beautiful, question. The experiment is conceptually simple; they made the droplets and then measured their dimensions and the dimensions of the spindles using microscope images.”
Load in the data:
[2]:
df_spindle = pd.read_csv(
os.path.join(data_path, "good_invitro_droplet_data.csv"), comment="#"
)
df_spindle.head()
[2]:
Droplet Diameter (um) | Droplet Volume (uL) | Spindle Length (um) | Spindle Width (um) | Spindle Area (um2) | |
---|---|---|---|---|---|
0 | 27.1 | 0.000010 | 28.9 | 10.8 | 155.8 |
1 | 28.2 | 0.000012 | 22.7 | 7.2 | 81.5 |
2 | 29.4 | 0.000013 | 26.2 | 10.5 | 138.3 |
3 | 31.0 | 0.000016 | 19.2 | 9.4 | 90.5 |
4 | 31.0 | 0.000016 | 28.4 | 12.1 | 172.4 |
Plot of Spindle Length vs. Droplet Diameter
[3]:
p = bokeh.plotting.figure(
frame_width=300,
frame_height=200,
x_axis_label="droplet diamater (µm)",
y_axis_label="spindle length (µm)",
)
p.circle(
df_spindle["Droplet Diameter (um)"],
df_spindle["Spindle Length (um)"],
size=3,
alpha=0.3,
)
bokeh.io.show(p)
ECDF of Spindle Length
[4]:
bokeh.io.show(iqplot.ecdf(df_spindle, q='Spindle Length (um)'))
Say the spindle length is independent of droplet size. We would then expect the spindle length to be given by
where
Let us use the following likelihood (for explanation on how this likelihood was chosen, please see the discussion of this data set from last term).
So, we will need to choose the priors
Possibility #1:
Let’s think about what we know about mitotic spindles. I don’t know their exact range in size, but I would guess a reasonable choice for

Great! I can use the Distribution Explorer to choose a reasonable prior based on my sketch. I will choose:
We can plot the result.
[5]:
phi = np.linspace(-50, 100, 200)
p = bokeh.plotting.figure(
frame_width=300, frame_height=200, x_axis_label="φ (µm)", y_axis_label="g(φ)"
)
p.line(phi, st.norm.pdf(phi, 20, 20), line_width=2)
bokeh.io.show(p)
For
[6]:
sigma = np.linspace(-10, 20, 100)
p = bokeh.plotting.figure(
frame_width=300,
frame_height=200,
x_axis_label="sigma (µm)",
y_axis_label="g(sigma)",
)
p.line(sigma, st.norm.pdf(sigma, 5, 5), line_width=2)
bokeh.io.show(p)
In this case, I chose the shape for the prior above because 1) I am speculating that the average length might vary by 25-70%, and 2) this standard deviation is likely to have a distribution that has the shape of a normal distribution.
Visualizing the Prior
The prior can be a bit confusing because it’s hard to think about what it means. When choosing priors, we think about our knowledge of the parameter before collecting data. From what we know, we can estimate the probability of the parameter taking on certain values, and therefore can choose a reasonable distribution for our prior probability. One question that came up in recitation a lot is: what does each part of the prior mean? For example, what does the µ in
Consider our prior
What can be even more confusing is the
Possibility #2:
You may be thinking: hold on,
If the location parameter for our Normal prior is far away from 0, negative values have a very low probability anyway due to the light tails, so we won’t have to worry about them too much. We can just include a conditional statement in our model to avoid negative values for our prior.
We could choose a Log-Normal or Half-Normal prior to avoid this issue. Be sure to review those distributions on the distribution explorer to make sure they are reasonable choices.
Rethinking my prior
Which looks like:
[7]:
phi = np.linspace(0, 100, 200)
p = bokeh.plotting.figure(
frame_width=300, frame_height=200, x_axis_label="phi (µm)", y_axis_label="g(phi)"
)
p.line(phi, st.lognorm.pdf(phi, 0.75, loc=0, scale=20), line_width=2)
bokeh.io.show(p)
Looks reasonable!
Revisiting
This would look like:
[8]:
sigma = np.linspace(0, 40, 200)
p = bokeh.plotting.figure(
frame_width=300,
frame_height=200,
x_axis_label="sigma (µm)",
y_axis_label="g(sigma)",
)
p.line(phi, st.halfnorm.pdf(sigma, 0, 10), line_width=2)
bokeh.io.show(p)
So, our final model would be:
Possibility #3: A New Model
Let’s use a new model for spindle length where droplet diameter is related to spindle length. To understand how we came up with this model, and its caveats, please refer to the discussion of the model from last term. Be sure to read through this first, otherwise the following will not make sense.
We have a theoretical model relating the droplet diameter to the spindle length. Let us now build a generative model. For spindle, droplet pair i, we assume
We will assume that
where
which is equivalently stated as
Yikes, we have a lot more parameters in this model! Not to worry, we will choose priors for one parameter at a time. Starting with
Now let’s consider
Conversely, for $ d \ll `:nbsphinx-math:phi`/:nbsphinx-math:`gamma`$, the spindle length varies approximately linearly with diameter.
Of course, these cases are physically possible. Please see the notes I referenced from BE/Bi 103a to review the limitations of this model. So, we know our model will only work in cases of droplets with intermediate diameter. Thus, we will choose:
Let’s take a look.
[9]:
gamma = np.linspace(0, 1, 100)
p = bokeh.plotting.figure(
frame_width=300,
frame_height=200,
x_axis_label="γ (µm)",
y_axis_label="γ(gamma)",
x_range=[0, 1],
)
p.line(gamma, st.beta.pdf(gamma, 2, 2), line_width=2)
bokeh.io.show(p)
Notice how useful the beta distribution can be!
Next, we will consider
We have now introduced a new prior parameter,
Gamma makes sense in this case because, as stated in the Distribution Explorer: “it imparts a heavier tail than the Half-Normal distribution (but not too heavy; it keeps parameters from growing too large), and allows the parameter value to come close to zero.” We want to leave room for the case where
Let’s plot it up to take a look.
[10]:
sigma_0 = np.linspace(0, 2, 100)
p = bokeh.plotting.figure(
frame_width=300, frame_height=200, x_axis_label="σ₀", y_axis_label="g(σ₀)"
)
p.line(sigma_0, st.gamma.pdf(sigma_0, 2, loc=0, scale=0.1), line_width=2)
bokeh.io.show(p)
We have now finished building the model:
Example 2: Microtubule Catastrophe - Try it yourself!
[11]:
df_mt = pd.read_csv(
os.path.join(data_path, "gardner_time_to_catastrophe_dic_tidy.csv"), index_col=0
)
labeled = df_mt.loc[df_mt["labeled"], "time to catastrophe (s)"].values
df_mt.head()
[11]:
time to catastrophe (s) | labeled | |
---|---|---|
0 | 470.0 | True |
1 | 1415.0 | True |
2 | 130.0 | True |
3 | 280.0 | True |
4 | 550.0 | True |
ECDF of Catastrophe Times for Labeled Tubulin
[12]:
p = iqplot.ecdf(labeled, q='t (s)', conf_int=True)
bokeh.io.show(p)
Last term, we explored a Gamma model for this dataset, which gives the amount of time we have to wait for
In the distribution explorer, it is easier to see the impacts of changing the parameters. When
We have two parameters -
For
I am not a biologist, but I would bet to say it is unlikely that microtubule catastrophe will take longer than one hours = 3600 seconds. 1/3600 ~ 2e-4
The range 0 – 5000 is a pretty wide, but we could use a lognormal, since I expect more probability towards faster rates than really slow ones. Let’s use a mean of 500.
[13]:
theta = np.linspace(0, 5000, 200)
p = bokeh.plotting.figure(
frame_width=400, frame_height=300, x_axis_label="θ (s)", y_axis_label="g(θ)"
)
p.line(theta, st.lognorm.pdf(theta, 0.7, loc=0, scale=500), line_width=2)
bokeh.io.show(p)
For
[14]:
alpha = np.linspace(0, 20, 200)
p = bokeh.plotting.figure(
frame_width=300, frame_height=200, x_axis_label="α", y_axis_label="g(α)"
)
p.line(alpha, st.halfnorm.pdf(alpha, 0, 5), line_width=2)
bokeh.io.show(p)
Computing environment
[15]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,iqplot,jupyterlab
CPython 3.8.5
IPython 7.19.0
numpy 1.19.2
pandas 1.2.1
bokeh 2.2.3
iqplot 0.2.0
jupyterlab 2.2.6