BE/Bi 103, Fall 2017: Homework 2¶

Due 1pm, Sunday, October 8¶

This homework was generated from an Jupyter notebook. You can download the notebook here.

Problem 2.1 (A temperature controlled Gal4-UAS system, 20 pts)¶

One of the students in last year's class, Han Wang from the Sternberg lab, recently published (Wang, H., ..., Sternberg, P.W. (2017). cGAL, a temperature-robust GAL4-UAS system for Caenorhabditis elegans, Nat. Methods, 14(2), 145-148) an improved Gal4/UAS system in C. elegans. Briefly, the Gal4-UAS system was hijacked from budding yeast and incorporated into the genomes of other organisms, Drosophila being the first. The idea is to insert the Gal4 gene into the genome such that it is under control of a driver that is native to the organism. When Gal4 is expressed, it binds to UAS (upstream activation sequence) and it is activating, leading to expression of the UAS target gene.

Han is using the system with UAS activating production of green fluorescent protein (GFP). The Gal4 production is driven by Pmyo-2, which is only expressed in the pharynx of the worm.

The Gal4/UAS system typically works only at high temperatures. This does not work as well in worms that are stored at lower temperatures. Han therefore has been engineering "cool" Gal4, which will work at lower temperatures. To test how they are working, he measured the GFP flourescent signal in the pharynx of worms.

He generously donated his data set for us to work with. He sent me a MS Excel file with the data, along with some comments via email. Here is what he said about the data set:

SC (orignal Gal4)

SK (cool Gal4)

m3 Pmyo-3::GFP fusion (control; measure of driver expression)

15 20 and 25 at the end for the name of each column shows the experimental temperature.

You can download the MS Excel file here.

a) Load the data into a Pandas DataFrame using pd.read_excel(). That's right, Pandas can read Excel files! You might want to read the Pandas documentation to see how it works.

b) Tidy the DataFrame. Be sure to remove any NaNs.

c) Do some exploratory data analysis of the data set. That is, make some instructive plots. What can you say about Han's cool Gal4 just by looking at the plots?

Problem 2.2 (Exponential likelihoods, 15 points)¶

In next week's tutorial, we will discuss lots of probability distributions. An important distribution is the Exponential distribution, which is often used to describe how waiting times are distributed. For example, we might think that the waiting times for microtubule catastrophe are Exponentially distributed. We will talk much more about its meaning next week, but for now, we will work out some of the mechanics of dealing with this distribution.

The PDF of the Exponential distribution is

\begin{align} P(x\mid \lambda) = \frac{1}{\lambda}\,\mathrm{e}^{-x/\lambda}. \end{align}

There is one parameter to this distribution, $\lambda$. Now, say we made repeated measurements of of a quantity (such as MT catastrophe times), $\{x_i\}$, that we think is Exponentially distributed.

a) Write down the likelihood, $P(\{x_i\}\mid \lambda)$, assuming each measurement is independent of all others. Write it in terms of the mean of all of the $n$ measured quantities,

\begin{align} \bar{x} = \frac{1}{n}\sum_i x_i, \end{align}

and $n$.

b) Assuming $P(\lambda) \propto \lambda^{-1}$, show that

\begin{align} P(\lambda \mid \{x_i\}) = P(\lambda \mid \bar{x}, n) = \frac{(n\bar{x})^n}{\lambda^{n+1} (n-1)!}\,\mathrm{e}^{-n\bar{x}/\lambda}, \end{align}

assuming that you can take the bounds of $\lambda$ to go from zero to infinity. You may use the result that

\begin{align} \int_0^\infty \mathrm{d}\lambda\,\frac{\mathrm{e}^{-t/\lambda}}{\lambda^m} = \frac{(m-2)!}{t^{m-1}} \end{align}

for integer $m \ge 2$.

c) Plot this posterior for $\bar{x} = 10$ and $n = 100$. Hint: First compute the logarithm of the posterior, and then exponentiate it to get the posterior itself. For computing the log of a factorial, note that $m! = \Gamma(m+1)$, and you can compute the log of a gamma function using scipy.special.gammaln(). Watchout: You cannot name a Python variable lambda because it is a Python keyword.

d) Show that the most probable value for the parameter $\lambda$ is $n\bar{x}/(n+1)$.

e) Show that if we approximate the posterior as Gaussian near its peak, the variance of that Gaussian is $n^2\bar{x}^2/(n+1)^3$.

Problem 2.3 (Exploring fish sleep data, 65 pts)¶

In Tutorial 2, we used a data set dealing with zerbafish sleep to learn about tidy data and split-apply-combine. It was fun to work with the data and to make some plots of fish activity over time. In this problem, you will work with your group to come up with some good ways to parametrize sleep behavior and estimate the values of these parameters.

Choose two different ways to parametrize sleep behavior. You can use sleep metrics from the Prober, et al. paper or (for more fun) invent your own. For each of the ways of parametrizing sleep, provide instructive plots and estimate the values of the parameters. Be sure to discuss the rationale behind choosing your parametrizations.

As you work through this problem, much of what you will do is exploratory data analysis. You will work with data frames to compute the behavioral metrics of interest and make plots. Once you compute the metric for the fish of a given genotype, you will need to do a parameter estimation, which requires a statistical model. Because we have not yet delved too far into statistical models and parameter estimation, you should do the following when it becomes time to estimate parameters.

Plot an ECDF of the measured values of the metric.
If the ECDF looks sigmoidal (shaped like an "S"), then assume a Gaussian model. Use the results derived in lecture to get an estimate for $\mu$ and its error bar.
If the ECDF does not have a strong inflection point (which is what the CDF of the Exponential distribution looks like), use the results you derived in problem 2.2 to get an estimate for $\lambda$ and its error bar.

You do not need to do any data validation (we'll get to that next week). You can download and use the resampled data set you generated in Tutorial 2 here. If you feel that you need to use the original data set, you can get the activity file here and the genotypes file here.