BE/Bi 103, Fall 2018: Homework 9¶

Due 1pm, December 11¶

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This homework was generated from an Jupyter notebook. You can download the notebook here.

Problem 9.1: Caulobacter growth: exponential or linear, 80 pts¶

Remind yourself about Homework 4.2, in which you computed the growth and division events of Caulobacter crescentus over time using date from this paper. In this problem, you will use your results from the image processing of those data sets to perform parameter estimation of the growth rates of the individual mother cells and also determine if bacterial growth on a single cell level is linear or exponential. You should use your own results form homework 4 for this problem, but in the event that you had trouble getting those results, you can use these results from the HW4 solutions.

We know that under ideal conditions, bacterial cells experience exponential growth in bulk. That is, the number of cells grows exponentially. This is possible regardless of how individual cells growth; the repeated divisions lead to exponential growth. In their paper, the authors argue that the growth rate of each cell is also exponential. I.e.,

\begin{align} a(t) = a_0 \mathrm{e}^{k t}, \end{align}

where $a(t)$ is the area of the cell in the image as a function of time and $a_0$ is the area of the cell right after a division has been completed, which we mark as $t = 0$.

As an alternative model, the authors consider a linear growth model, in which

\begin{align} a(t) = a_0 + b t. \end{align}

An exponential curve is approximately linear (with $b = a_0k$) for short time scales. So, it is often difficult to distinguish between a linear and an exponential growth. Your goal is to perform parameter estimates and do an effective comparison between these two models for growth. You should use hierarchical models, and be sure to take a principled approach in your model construction and evaluation.

Since you are using a hierarchical model, here are a few tips for building and implementing the models. You do not need to take this advice if you do not want to, but I have found that these strategies help.

Think carefully about your hyperpriors. If you choose an uninformative hyperprior for a level of the model that is data poor, you end up underpooling. For example, in this problem, there are only two mother cells. So, there are only two Caulobacter samples in your data set. If you put a broad prior on the growth rate of Caulobacter cells, these two cells can be effectively decoupled.
The hierarchical structure can make things difficult to code up and make it harder to hunt down bugs. As I'm building my hierarchical model, often approach it with "baby steps." I like to start off with a non-hierarchical model, often with a subset of the data. I perform sampling on this simpler model, taking a small number of samples so I do not have to wait for too long. After making sure everything is ok with this simpler structure, I then add a level to the hierarchy. I again do the sampling with a subset of the data, make sure everything works ok, and then add the next level of hierarchy, and so on. I find this helps me find bugs and little details along the way.
You will probably encounter the funnel of hell, so you should strongly consider using noncentered parametrizations.
When you sample out of the full hierarchical model, the sampler may be slower than you are used to seeing. It will likely also be much slower than sampling out of the non-hierarchical model, even though there are only a few more parameters. Stan may also be particularly slow during the warmup phase. You may see it progress taking a few seconds per iteration. This is natural.
Finally, just to give you a sense of what kind of computation time you might expect, for my hierarchical model, it took many hours to do the sampling on a c5.xlarge instance on AWS.

Problem 9.2: Outliers in FRET binding curve, 20 pts¶

We often want to ascertain how tightly two proteins are bound by measuring their dissociation constant, $K_d$. This is usually done by doing a titration experiment and then performing a regression. For example, imagine two proteins, a and b may bind to each other in the reaction

\begin{align} \text{ab} \rightleftharpoons \text{a} + \text{b} \end{align}

with dissociation constant $K_d$. At equilibrium

\begin{align} K_d = \frac{c_a\,c_b}{c_{ab}}, \end{align}

were $c_i$ is the concentration of species $i$. If we add known amounts of a and b to a solution such that the total concentration of a is $c_a^0$ and the total concentration of b is $c_b^0$, we can compute the equilibrium concentrations of all species. Specifically, in addition to the equation above, we have conservation of mass equations,

\begin{align} c_a^0 &= c_a + c_{ab}\\[1em] c_b^0 &= c_b + c_{ab}, \end{align}

fully specifying the problem. We can solve the three equations for $c_{ab}$ in terms of the known quantities $c_a^0$ and $c_b^0$, along with the parameter we are trying to measure, $K_d$. We get

\begin{align} c_{ab} = \frac{2c_a^0\,c_b^0}{K_d+c_a^0+c_b^0 + \sqrt{\left(K_d+c_a^0+c_b^0\right)^2 - 4c_a^0\,c_b^0}}. \end{align}

The technique, then, is to hold $c_a^0$ fixed in the experiment and measure $c_{ab}$ for various $c_b^0$. We can then perform a regression to get $K_d$.

In order to do this, though, we need some readout of $c_{ab}$. For this problem, we will use FRET (fluorescence resonance energy transfer) to monitor how much of a is bound to b. Specifically, we consider a to have a fluorophore and b to be its receptor. When the two are unbound, we get a fluorescence signal per molecule of $f_0$. When they are bound, the receptor absorbs the light coming out of the fluorophore, so we get less fluorescence per molecule, which we will call $f_q$ (for "quenched"). Let $f$ be the total per-fluorophore fluorescence signal. Then, the measured fluorescence signal, $F$, is

\begin{align} F = c_a^0\,V f = \left(c_a \,f_0 + c_{ab}\, f_q\right)V, \end{align}

where $V$ is the reaction volume. We can absorb $V$ into the other parameters such that $\hat{f}_0 = f_0 V$ and $\hat{f}_q = f_q V$, giving

\begin{align} F = \hat{f}_0(c_a^0 - c_{ab}) + \hat{f}_q\, c_{ab} = \hat{f}_0\,c_a^0 - \frac{2(\hat{f}_0 - \hat{f}_q)c_a^0\,c_b^0}{K_d+c_a^0+c_b^0 + \sqrt{\left(K_d+c_a^0+c_b^0\right)^2 - 4c_a^0\,c_b^0}}. \end{align}

Compute parameter estimates for $K_d$ with and without an outlier detection scheme for this data set. How do the results differ depending on whether or not you were trying to detect outliers?

Note: These are real data, but they are from an unpublished experiment here on campus. I therefore have not exposed the identities of the proteins a and b.