BE/Bi 103, Fall 2018: Homework 2

Due 1pm, Sunday, October 14

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This homework was generated from a Jupyter notebook. You can download the notebook here.

Problem 2.1 (A temperature controlled Gal4-UAS system, 25 pts)

One of the students in a previous version of the class, Han Wang from the Sternberg lab, published an improved Gal4/UAS system in C. elegans (Wang, H., ..., Sternberg, P.W. (2017). cGAL, a temperature-robust GAL4-UAS system for Caenorhabditis elegans, Nat. Methods, 14(2), 145-148). Briefly, the Gal4-UAS system was hijacked from budding yeast and incorporated into the genomes of other organisms, Drosophila being the first. The idea is to insert the Gal4 gene into the genome such that it is under control of a driver that is native to the organism. When Gal4 is expressed, it binds to the UAS (upstream activation sequence) and activates transcription, leading to expression of the UAS target gene.

Han is using the system with UAS activating production of green fluorescent protein (GFP). The Gal4 production is driven by Pmyo-2, which is only expressed in the pharynx of the worm.

The Gal4/UAS system typically works only at high temperatures, so it does not work as well in worms that are stored at lower temperatures. Han has therefore been engineering a "cool" Gal4 that works at lower temperatures. To test how well the original and cool Gal4 work, he measured the GFP fluorescence signal in the pharynx of worms.

He generously donated his data set for us to work with. He sent me an MS Excel file with the data, along with some comments via email. Here is what he said about the data set:

SC (original Gal4)

SK (cool Gal4)

m3 Pmyo-3::GFP fusion (control; measure of driver expression)

15 20 and 25 at the end for the name of each column shows the experimental temperature.

You can download the MS Excel file here.

a) Load the data into a Pandas DataFrame using pd.read_excel(). That's right, Pandas can read Excel files! You might want to read the Pandas documentation to see how it works.

b) Tidy the DataFrame. Be sure to remove any NaNs.

c) Do some exploratory data analysis of the data set. That is, make some instructive plots. Discuss why you chose to visualize the data set the way(s) you did. What can you say about Han's cool Gal4 just by looking at the plots?
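As a rough illustration of parts (a) and (b), the sketch below tidies a small wide-format table of the kind a spreadsheet often produces, where each column is one strain/temperature condition and shorter columns are padded with NaNs. The column labels (`SC15`, `SK25`, and so on), the numbers, and the file path in the comment are all made up for illustration; the real file may be laid out differently, so treat this as one possible approach rather than the intended solution.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the spreadsheet. Column labels and values are
# invented; the real file may use different names and layout.
df_wide = pd.DataFrame(
    {
        "SC15": [10.0, 12.0, np.nan],
        "SC25": [30.0, 28.0, 31.0],
        "SK15": [25.0, np.nan, np.nan],
        "SK25": [29.0, 33.0, 30.0],
    }
)
# With the real data, the frame would come from something like
# df_wide = pd.read_excel("path/to/file.xlsx")  # hypothetical path

# Melt to one row per measurement, then drop the NaN padding.
df = df_wide.melt(var_name="condition", value_name="fluorescence").dropna()

# Assumes every label ends in a two-digit temperature.
df["strain"] = df["condition"].str[:-2]
df["temperature"] = df["condition"].str[-2:].astype(int)

df = df.drop(columns="condition").reset_index(drop=True)
print(df)
```

Each row of the result is a single measurement with its strain and temperature, which is the form most plotting libraries expect for exploratory plots.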


Problem 2.2 (Exploring fish sleep data, 65 pts)

In Tutorial 2, we used a data set dealing with zebrafish sleep to learn about tidy data and split-apply-combine. It was fun to work with the data and to make some plots of fish activity over time. In this problem, you will work with your group to come up with some good ways to parametrize sleep behavior and estimate the values of these parameters.

Choose two different ways to parametrize sleep behavior. You can use sleep metrics from the Prober, et al. paper or (for more fun) invent your own. For each of the ways of parametrizing sleep, provide instructive plots and estimate the values of the parameters. Be sure to discuss the rationale behind choosing your parametrizations.

Note that there is a lot of debate among scientists studying sleep about how best to quantify the behavior. This is generally true in studies of behavior, and much of the process of understanding the measurements is deciding on what to use as metrics. This problem obviously has no right answer. What is important is that you can provide a clear rationale for your choices.

As you work through this problem, much of what you will do is exploratory data analysis. You will work with data frames to compute the behavioral metrics of interest and make instructive plots. Again, this problem is intentionally open-ended. You are taking a data set and making plots that you might put in a presentation or in a paper to describe the behavior. As you do the analysis, provide text that discusses your choices and what conclusions you can draw from your analyses.
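To make the split-apply-combine workflow concrete, here is a minimal sketch that computes one hypothetical metric: the fraction of nighttime time points in which a fish shows no activity. The data are synthetic, and the column names (`fish`, `genotype`, `light`, `activity`) are assumptions that may not match the Tutorial 2 file. The metric itself is only an illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Synthetic tidy data: four fish, 200 time points each.
# Column names are assumptions; adjust them to the real data set.
rng = np.random.default_rng(3252)
n = 200
df = pd.DataFrame(
    {
        "fish": np.repeat(["F1", "F2", "F3", "F4"], n),
        "genotype": np.repeat(["wt", "wt", "mut", "mut"], n),
        "light": np.tile(np.arange(n) < n // 2, 4),  # True = lights on
        "activity": rng.poisson(2.0, size=4 * n),
    }
)

# Keep only the lights-off time points.
night = df.loc[~df["light"]]

# Hypothetical metric: fraction of night time points with zero activity,
# computed separately for each fish.
rest_fraction = (
    night.assign(inactive=night["activity"] == 0)
    .groupby(["genotype", "fish"])["inactive"]
    .mean()
    .rename("rest_fraction")
    .reset_index()
)

# Summarize across the fish of each genotype.
summary = rest_fraction.groupby("genotype")["rest_fraction"].median()
print(rest_fraction)
print(summary)
```

Computing the metric per fish first and summarizing per genotype afterward keeps fish-to-fish variability visible, which is useful when plotting.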

You do not need to do any data validation (we'll get to that next week). You can download and use the resampled data set you generated in Tutorial 2 here. If you feel that you need to use the original data set, you can get the activity file here and the genotypes file here.


Problem 2.3 (Bayes's theorem as a model for learning, 10 pts)

Say you did an experiment to investigate a model with parameters $\theta$ and acquired a data set $y_1$. Recall that in building a model for the experiment, you will have specified the likelihood, $f(y_1\mid \theta)$ and prior, $g(\theta)$. The result of your experiment and modeling is the posterior, $g(\theta\mid y_1)$. Now, say you did a second experiment under the same model and acquired a data set $y_2$. Show that the posterior distribution you obtain by using the result of the first experiment as prior information for the second is the same as what you would obtain by pooling the results of the two experiments together into a single data set.

In this way, we see how we learn more by doing more and more experiments.
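A quick numerical sanity check of this idea, which does not replace the derivation the problem asks for, can be sketched on a grid. The example assumes a single parameter $\theta$, a uniform prior, binomial likelihoods, and that $y_1$ and $y_2$ are conditionally independent given $\theta$. The counts are invented for illustration.

```python
import numpy as np

# Grid of parameter values on (0, 1).
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]


def normalize(p):
    """Scale p so that it sums to one on the grid (Riemann sum)."""
    return p / (p.sum() * dtheta)


def likelihood(k, n):
    """Binomial likelihood of k successes in n trials, up to a constant."""
    return theta**k * (1 - theta) ** (n - k)


prior = np.ones_like(theta)  # uniform prior, an assumption of this sketch

# Invented data: (successes, trials) for the two experiments.
k1, n1 = 7, 20
k2, n2 = 12, 30

# Sequential: the posterior from experiment 1 is the prior for experiment 2.
post_1 = normalize(prior * likelihood(k1, n1))
post_sequential = normalize(post_1 * likelihood(k2, n2))

# Pooled: both data sets enter the likelihood at once.
post_pooled = normalize(prior * likelihood(k1, n1) * likelihood(k2, n2))

print(np.allclose(post_sequential, post_pooled))
```

The two posteriors agree on the grid because normalization constants cancel. The analytical argument in the problem makes the same point for general likelihoods and priors.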