BE/Bi 103, Fall 2014: Homework 5

Due 1pm, Monday, November 17

This homework was generated from an IPython notebook. You can download the notebook here.

Problem 5.1 (Microtubule catastrophe and the Gamma distribution, 50 pts)

In Homework 1, we plotted data of microtubule catastrophe times. In this problem, we return to the data from the Gardner, Zanic, et al. paper. We will carefully analyze the data and draw some conclusions about the processes underlying microtubule catastrophe. You can download the data set here.

In the file gardner_mt_catastrophe_only_tubulin.csv, we have observed catastrophe times of microtubules with different concentrations of tubulin. So, our data set $D$ consists of a set of measurements of the amount of time to catastrophe; $D = \{t_i\}$. We will consider two models for microtubule catastrophe. In the first model ($M_1$), our likelihood is

\begin{align} P(D~|~\lambda, M_1, I) = \prod_{i\in D} \lambda \mathrm{e}^{-\lambda t_i}. \end{align}

In the second model ($M_2$), the likelihood is

\begin{align} P(D~|~r, a, M_2, I) = \prod_{i\in D} \frac{(rt_i)^a}{t_i\,\Gamma(a)}\,\mathrm{e}^{-rt_i}, \end{align}

where $\Gamma(a)$ is the gamma function.

a) I have given you the models mathematically as expressions of the likelihoods. Describe the two models in words. Give physical descriptions of the meanings of the parameters $\lambda$, $r$, and $a$.

b) For the trials where the tubulin concentration is 12 µM (from the file gardner_mt_catastrophe_only_tubulin.csv), estimate the value of the parameter $\lambda$ for model 1 and the values of the parameters $r$ and $a$ for model 2. Hint: Do not compute the gamma function and then take its logarithm when computing the log likelihood. Instead, use the scipy.special.gammaln function to compute $\ln \Gamma(a)$ directly.
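As a minimal sketch of how that hint might be used (assuming the 12 µM catastrophe times are loaded into a NumPy array t; the function name, initial guess, and optimizer choice are illustrative):

```python
import numpy as np
import scipy.optimize
import scipy.special

def log_like_gamma(params, t):
    """Log likelihood of model 2 evaluated at params = (r, a)."""
    r, a = params
    if r <= 0.0 or a <= 0.0:
        return -np.inf
    # ln[(r t)^a / (t Gamma(a)) exp(-r t)], summed over the data
    return np.sum(a * np.log(r * t) - np.log(t)
                  - scipy.special.gammaln(a) - r * t)

# t = ... load the 12 µM catastrophe times as a NumPy array ...

# For model 1 with a uniform prior, the most probable lambda is 1/<t>
lam_mp = 1.0 / t.mean()

# For model 2, maximize the log likelihood numerically (illustrative guess)
res = scipy.optimize.minimize(lambda p: -log_like_gamma(p, t),
                              x0=np.array([0.005, 3.0]),
                              method='Nelder-Mead')
r_mp, a_mp = res.x
```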

c) Plot the probability density functions (pdf) for each model over a histogram of the data using your most probable parameter values. Also plot the cumulative distribution functions (cdf) over the data. You will need to bin the data for the pdf histogram, but not for the cdf. Hint: If you wish to compute

\begin{align} P(x;a,r) = \frac{(rx)^a}{x\,\Gamma(a)}\,\mathrm{e}^{-rx}, \end{align}

the gamma distribution, use scipy.stats.gamma.pdf(x, a, scale=1.0/r). Similarly, use scipy.stats.gamma.cdf(x, a, scale=1.0/r) for the gamma cdf.
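For instance, a plotting sketch under the same assumptions as in part (b) (data in t and most probable values lam_mp, r_mp, a_mp already computed; bin count and figure layout are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

x = np.linspace(0.0, t.max(), 400)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# pdf over a normalized histogram of the data
ax[0].hist(t, bins=20, density=True, color='lightgray', label='data')
ax[0].plot(x, lam_mp * np.exp(-lam_mp * x), label='model 1')
ax[0].plot(x, scipy.stats.gamma.pdf(x, a_mp, scale=1.0 / r_mp), label='model 2')
ax[0].set_xlabel('time to catastrophe')
ax[0].legend()

# cdf over the empirical CDF (no binning needed)
t_sorted = np.sort(t)
ecdf = np.arange(1, len(t) + 1) / len(t)
ax[1].plot(t_sorted, ecdf, '.', color='gray', label='data')
ax[1].plot(x, 1.0 - np.exp(-lam_mp * x), label='model 1')
ax[1].plot(x, scipy.stats.gamma.cdf(x, a_mp, scale=1.0 / r_mp), label='model 2')
ax[1].set_xlabel('time to catastrophe')
ax[1].legend(loc='lower right')
```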

d) Calculate the approximate odds ratio between the two models. Be careful with your priors! You should use a proper prior. For example, if $a$ has a Jeffreys prior, you can assume a minimum and maximum value for $a$. In this case,

\begin{align} P(a~|~I) = \frac{1}{a\ln\left(a_\mathrm{max}/a_\mathrm{min}\right)}. \end{align}
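For reference, one common route to the approximate odds ratio (following Sivia) is a Laplace (Gaussian) approximation of each model's evidence about the most probable parameters: with $d$ parameters, the evidence is approximately the maximum likelihood, times the prior evaluated there, times $(2\pi)^{d/2}\sqrt{\det\boldsymbol{\Sigma}}$, where $\boldsymbol{\Sigma}$ is the posterior covariance at the maximum. Assuming equal prior odds for the two models, this gives

\begin{align} O_{12} \approx \frac{P(D~|~\lambda^*, M_1, I)\,P(\lambda^*~|~M_1, I)\,\sqrt{2\pi\sigma_\lambda^2}}{P(D~|~r^*, a^*, M_2, I)\,P(r^*, a^*~|~M_2, I)\,2\pi\sqrt{\det\boldsymbol{\Sigma}_2}}, \end{align}

where asterisks denote most probable parameter values, $\sigma_\lambda^2$ is the posterior variance of $\lambda$ under model 1, and $\boldsymbol{\Sigma}_2$ is the posterior covariance matrix of $(r, a)$ under model 2.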

e) Using whichever model you found more probable when you computed the odds ratio, compute either $\lambda$ or $a$ and $r$ for the other concentrations of tubulin. Given that microtubules polymerize faster with higher tubulin concentrations, is there anything you can say about the occurrence of catastrophe by looking at the values of $\lambda$ or $a$ and $r$ versus tubulin concentration?

f) In the files gardner_mt_catastrophe_kip3.csv and gardner_mt_catastrophe_mcak.csv, there are measurements of catastrophe times in the presence of the kinesins Kip3 and MCAK with 12 µM tubulin. Analyze these data and discuss conclusions about their respective roles in microtubule catastrophe. Note: This part of the problem is intentionally open-ended. You should think carefully, and perform a complete analysis to draw your conclusions.



Problem 5.2 (Oscillating reactions in droplets, 50 pts)

In this problem, we will be analyzing data from Weitz, et al., Diversity in the dynamical behaviour of a compartmentalized programmable biochemical oscillator, Nature Chemistry, 6, 295-302, 2014. As discussed in Tutorial 5b, the authors engineered a synthetic two-switch negative-feedback oscillator circuit. When the components of this circuit are mixed together, the concentrations of the respective species oscillate in time. The concentration is reported by a fluorescently labeled DNA strand whose base-pairing state affects the fluorescent signal.

The authors did this experiment in bulk solution and observed oscillations. Interestingly, they also encapsulated the reaction mixture in small droplets ranging from 33 to 16000 cubic microns. They observed differences in the oscillation period and amplitude in the small droplets compared to bulk. This highlights the effect that statistical variation in the concentration of components in the droplets can have on the observed dynamics. Understanding these effects is important as we seek to engineer cell-scale systems.

In doing this problem, the Data Analysis section of the supplement of the paper (pages 8-14) will be useful to you.

The data sets for this problem contain the area and integrated fluorescent intensity of droplets over time. There are a total of seven sets taken from drops showing sustained oscillations. The file names are organized as follows:

area_1.csv: Each column has the area of a given droplet. This is $x_{area,i}$ as defined in the paper.

intensity_1.csv: Each column has the Alexa488 fluorescent signal from the reaction network. This is $x_{fl,i}$ as defined in the paper.

intensity_norm_1.csv: Each column has the TexasRed signal. This is used to normalize the fluorescent signal. This is $x_{norm,i}$ in the paper.

In all of these CSV files, the first column is labeled SliceNumber. This simply gives the number of the image from which the data were taken. Images were acquired every five minutes (12 frames per hour), so SliceNumber 3 means $t = 10$ minutes.
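For example, converting slice numbers of a loaded DataFrame df to time in minutes (assuming SliceNumber 1 corresponds to $t = 0$, as the example above implies):

```python
# Five minutes per frame; SliceNumber 1 is t = 0, so SliceNumber 3 is t = 10 min
t_min = 5 * (df['SliceNumber'] - 1)
```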

In this problem, you will work on your scripting skills to deal with large data sets that have many trials, in addition to using smoothing techniques.

a) Load the data (from this ZIP file) and construct a data set that has all of the "good" traces of droplet fluorescence you want to consider. This means you will remove traces that have problems. For example, you will want to remove traces that are too short, as this is indicative that droplet tracking failed. You should find the discussion in the supplement very useful in deciding how to define good traces, though you may come up with your own definitions. Hint: You may find some of the functions in pandas useful. pd.concat and pd.rolling_median may help. If df is a DataFrame, the methods df.diff and df.quantile may also be useful. Furthermore, df.ix is useful because it allows indexing like NumPy arrays.
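A sketch of one way to start, using pandas as hinted above. The column-labeling scheme and length cutoff are illustrative choices, and matching column layouts between the intensity and normalization files are assumed:

```python
import pandas as pd

# Load the seven sets of traces, indexing by SliceNumber
df_fl, df_norm = [], []
for i in range(1, 8):
    fl = pd.read_csv('intensity_%d.csv' % i, index_col='SliceNumber')
    norm = pd.read_csv('intensity_norm_%d.csv' % i, index_col='SliceNumber')
    # Make trace labels unique across sets before concatenating
    fl.columns = ['%d_%s' % (i, c) for c in fl.columns]
    norm.columns = ['%d_%s' % (i, c) for c in norm.columns]
    df_fl.append(fl)
    df_norm.append(norm)
df_fl = pd.concat(df_fl, axis=1)
df_norm = pd.concat(df_norm, axis=1)

# Normalize the fluorescent signal by the TexasRed signal
df = df_fl / df_norm

# Keep only traces with enough frames (illustrative cutoff); short
# traces indicate that droplet tracking failed
min_frames = 100
df_good = df[df.columns[df.count() >= min_frames]]
```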

b) Using your good set of traces, smooth the traces and find the period of oscillation for each trace. This should be calculated as $T_{n,m}$ as defined on page 10 of the supplement.
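As a sketch of the smoothing step for a single trace, using pd.rolling_median as hinted in part (a) (the window size and peak-finding order are illustrative choices; computing $T_{n,m}$ itself follows the supplement's definition, using the peak times found here):

```python
import numpy as np
import pandas as pd
import scipy.signal

# Smooth one good trace with a centered rolling median
trace = df_good[df_good.columns[0]].dropna()
smooth = pd.rolling_median(trace, window=5, center=True).dropna()

# Find local maxima of the smoothed trace
peaks = scipy.signal.argrelmax(smooth.values, order=5)[0]

# Peak times in minutes (5 minutes per frame, SliceNumber 1 at t = 0);
# differences of these feed into the T_{n,m} defined in the supplement
peak_times = 5.0 * (smooth.index.values[peaks] - 1)
peak_to_peak = np.diff(peak_times)
```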

c) Plot $T_{3,1}$ versus droplet radius for your good data and provide any comments you may have.



Problem 5.3 (Outlier detection, 20 pts extra credit)

During the evening session of our tutorial on outlier detection, a student suggested that instead of choosing the Cauchy distribution for our likelihood, we could instead define a distribution that has a parameter that describes the "tailedness" of the likelihood distribution and we could marginalize over that parameter. This is suggested by Sivia in section 8.3.3. In this problem you will perform a regression on our fake linear data with outliers. The data are contained in the file linear_data_with_outliers.csv, which you can get from the Tutorial 5 data file.

Recalling Tutorial 5a, we defined our Cauchy likelihood as

\begin{align} P(D~|~\mathbf{a},\beta,I) = \prod_{i\in D} \left[\pi\beta\left(1 + \left(\frac{y_i - f(x_i;\mathbf{a})}{\beta}\right)^2\right)\right]^{-1}. \end{align}

The Cauchy distribution is a special case of the Student-t distribution. We can write our likelihood in terms of the Student-t distribution as

\begin{align} P(D~|~\mathbf{a},\beta,\nu,I) = \prod_{i\in D} \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi \nu}\,\beta}\left(1 + \frac{1}{\nu}\left(\frac{y_i - f(x_i;\mathbf{a})}{\beta}\right)^2\right)^{-\frac{\nu+1}{2}}, \end{align}

where we have introduced an additional parameter $\nu$. When $\nu = 1$, the Cauchy description of the likelihood is obtained. For large $\nu$, we have

\begin{align} \lim_{\nu \to \infty} P(D~|~\mathbf{a},\beta,\nu,I) = \prod_{i\in D}\frac{1}{\sqrt{2\pi \beta^2}}\,\exp\left\{-\frac{\left(y_i - f(x_i;\mathbf{a})\right)^2}{2\beta^2}\right\}, \end{align}

which is the Gaussian likelihood with (unknown) variance $\beta^2$. So, we could take $\nu$ as a parameter describing the tailedness of the likelihood and marginalize over it, as well as over $\beta$, when doing our regression. By doing our regression/outlier detection this way, we make no assumption about the tailedness of the likelihood. Our only assumption is that it is unimodal and described by the Student-t distribution.

Perform a linear regression on our fake data using the technique I've just described. In this case, $\mathbf{a} = \{a,b\}$, the slope and intercept of the line modeling the data. Also plot the marginalized posterior distribution,

\begin{align} P(a,b~|~D,I) = \int_0^\infty \mathrm{d}\beta\,\int_1^\infty \mathrm{d}\nu\,P(a,b,\beta,\nu~|~D,I). \end{align}

Hint: You are probably going to want to use MCMC for this.
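As a minimal sketch with emcee (one possible sampler; the walker initialization is illustrative, and improper Jeffreys priors on $\beta$ and $\nu$ with uniform priors on $a$ and $b$ are assumptions):

```python
import numpy as np
import scipy.special
import emcee

def log_post(p, x, y):
    """Log posterior for the Student-t regression; p = (a, b, beta, nu)."""
    a, b, beta, nu = p
    if beta <= 0.0 or nu < 1.0:
        return -np.inf
    r = (y - (a * x + b)) / beta
    log_like = np.sum(scipy.special.gammaln((nu + 1.0) / 2.0)
                      - scipy.special.gammaln(nu / 2.0)
                      - 0.5 * np.log(np.pi * nu) - np.log(beta)
                      - (nu + 1.0) / 2.0 * np.log(1.0 + r**2 / nu))
    # Uniform priors on a, b; Jeffreys priors on beta and nu (illustrative)
    return log_like - np.log(beta) - np.log(nu)

# x, y would be loaded from linear_data_with_outliers.csv
n_walkers, n_dim = 50, 4
p0 = np.empty((n_walkers, n_dim))
p0[:, 0] = np.random.uniform(-1.0, 1.0, n_walkers)   # slope a
p0[:, 1] = np.random.uniform(-1.0, 1.0, n_walkers)   # intercept b
p0[:, 2] = np.random.exponential(1.0, n_walkers)     # beta > 0
p0[:, 3] = np.random.uniform(1.0, 10.0, n_walkers)   # nu >= 1

sampler = emcee.EnsembleSampler(n_walkers, n_dim, log_post, args=(x, y))
sampler.run_mcmc(p0, 2000)

# Because MCMC samples the full posterior, the double integral over beta
# and nu amounts to simply ignoring those columns of the chain:
# samples_ab = sampler.flatchain[:, :2]
```

Note that the marginalization over $\beta$ and $\nu$ comes for free: histogramming only the $(a, b)$ columns of the chain performs the integral in the equation above.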