BE/Bi 103, Fall 2016: Homework 7

Due 1pm, Sunday, November 13

This homework was generated from a Jupyter notebook. You can download the notebook here. You can also view it here.

Problem 7.1: Practice writing posteriors (25 points + 10 points extra credit)

This problem is worth a total of 25 points. You can do any subset of these problems to get full (or full plus extra) credit.

a) (10 pts) Write and/or draw a flow chart that details the steps from "Write Bayes' Theorem" to arriving at the final form of the posterior for a parameter estimation problem. If you hand-sketch the flow chart, you can include it in your Jupyter notebook as a scan. To include an image with Markdown, do this:

![description of image](file_of_image.png)

Be sure to include the image itself in your repository.
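
As a reminder of the starting point for the flow chart in part (a), Bayes' theorem for parameter estimation can be written as below. The symbols are a notational assumption and may differ from those used in lecture: $\theta$ denotes the parameter(s), $D$ the measured data, and $I$ any background information.

\begin{align} P(\theta \mid D, I) = \frac{P(D \mid \theta, I)\,P(\theta \mid I)}{P(D \mid I)}. \end{align}

Here $P(D \mid \theta, I)$ is the likelihood, $P(\theta \mid I)$ is the prior, and $P(D \mid I)$ is the evidence, which does not depend on $\theta$.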

b) The Elowitz lab is interested in the design principles of cellular signaling pathway architectures, or how the interactions between signaling pathway components (things like extracellular ligands and receptors) give rise to different signal processing capabilities. Below are some experiments we might run to get a better quantitative understanding of cell signaling.

For each of the following scenarios:

(3 pts) Write the full form of the likelihood and prior you would use to estimate the parameter(s). You must define all symbols (e.g. parameters and variables).
(2 pts) Explain why you chose the form you did, including what you chose to neglect or exclude. More than one version may be appropriate, so give convincing reasons to select the form you wrote.

Exercise 1: You have images of many fields of cells, where fluorescently-labeled receptors at the cell surface appear as dots. (Assume these are maximum projections of confocal images, so that the image includes the entire cell membrane). You would like to estimate the mean number of receptors expressed by this cell line, and are using an automated image analysis tool that can count the number of dots on each cell. A previous paper reported that there are $10^6 \pm 10^3$ of this receptor type expressed by this cell line.

Exercise 2: You decide to get higher throughput counts by flowing your fluorescently-labeled cells. However, first you need to know how the fluorescence depends on the number of fluorophores, which we presume to be equal to the number of receptors. To start approximating this, you measure the fluorescence of beads ($F$) with a known number of fluorophores attached to each ($N$). You assume the fluorescence depends linearly on the number of fluorophores, and that there is some background fluorescence:

\begin{align} F(N \mid a, b) = aN + b. \end{align}

We are interested in estimating the values of $a$ and $b$ (though we recognize the background fluorescence will probably be different in our cells).

Exercise 3: A fully-formed signaling complex requires two receptor subunits, and sometimes these receptors come together spontaneously. You want to estimate the average rate at which this happens. You put one half of a fluorescent protein on one receptor subunit and the other half on the other receptor subunit. If the subunits spontaneously come together, you will observe a fluorescent dot. You use time-lapse microscopy to image these cells, and generate a list of the times between dots on given single cells (e.g. you waited $t_1$, $t_2$, etc. seconds between seeing receptors spontaneously come together on cell $j$).

Exercise 4: You weren't able to do the experiment described in Exercise 3, but you read about it in a paper and request the data. They send you a table that reports the number of times receptor complexes spontaneously formed in ten minute intervals for each cell they analyzed. As above, you still want to estimate the average rate at which this happens. You also have an estimate and error bar for this rate from a different paper.

Exercise 5: You are studying a signaling pathway that, when the ligand binds the receptor, forms a multimeric protein complex at the intracellular side of the membrane. You assume that the rate of each protein joining this complex is roughly the same. You want to estimate the number of proteins in this complex, and the average rate at which they join. You employ a similar approach as in Exercise 3, using protein fusions that produce a fluorescent signal only when the complex is fully-formed. You use time-lapse microscopy to time how long it takes the complex to fully form after you add the ligand.

Problem 7.2: Hacker stats and Darwin's finches (75 pts + 25 pts extra credit)

Peter and Rosemary Grant of Princeton University have visited the island of Daphne Major on the Galápagos every year for over forty years and have been taking a careful inventory of the finches there. The Grants recently published a wonderful book, 40 years of evolution: Darwin's finches on Daphne Major Island. They were generous and made their data publicly available on the Dryad data repository. (In general, it is a very good idea to put your published data in public data repositories, both to preserve the data and also to make your findings public.) We will use this data set to learn about the evolution of Darwin's finches and to practice your hacker statistics skills. Up until part (f), all of your analyses will use nonparametric frequentist hacker stats.

We will focus on the primary two species of ground finch on Daphne Major, Geospiza fortis and Geospiza scandens. In this data set, you will find measurements of the beak length (tip to base) and beak depth (top to bottom) of these finches in the years 1973, 1975, 1987, 1991, and 2012. Also included in that data set is the band number for the bird, which gives a unique identifier.

a) We start with a little tidying of the data. Think about how you will deal with duplicate measurements of the same bird and make a decision on how those data are to be treated.
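
One possible treatment of duplicates, shown only as a sketch, is to average repeated measurements of the same bird within a year. The column names (`band`, `year`, `species`, `beak_length`, `beak_depth`) and the toy values below are assumptions for illustration; check the actual file before adapting this.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions,
# not taken from the Grants' data set.
df = pd.DataFrame({
    "band": [101, 101, 102, 103],
    "year": [1975, 1975, 1975, 2012],
    "species": ["scandens", "scandens", "scandens", "fortis"],
    "beak_length": [14.0, 14.2, 13.5, 10.9],
    "beak_depth": [9.0, 9.2, 8.8, 9.5],
})


def average_duplicates(df):
    """Average repeated measurements of the same bird in the same year."""
    return (
        df.groupby(["band", "year", "species"], as_index=False)
          [["beak_length", "beak_depth"]]
          .mean()
    )


tidy = average_duplicates(df)
```

Averaging is only one defensible choice. Keeping a single measurement per bird, for example with `df.drop_duplicates(subset=["band", "year"])`, is another, and whichever you choose should be justified in your answer.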

b) Plot ECDFs of the beak depths of Geospiza scandens in 1975 and in 2012. Then, estimate the mean beak depth for each of these years with confidence intervals.
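
A minimal sketch of the two ingredients in part (b), an ECDF and a percentile bootstrap confidence interval for the mean, is given below. The function names and the synthetic `depths` array are assumptions for illustration, not part of the assignment or the real data.

```python
import numpy as np


def ecdf(data):
    """Return sorted values x and cumulative fractions y for an ECDF."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


def bootstrap_mean_ci(data, n_reps=10000, ci=95, seed=None):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Resample with replacement; each row is one bootstrap sample
    samples = rng.choice(data, size=(n_reps, len(data)), replace=True)
    reps = samples.mean(axis=1)
    lower = (100 - ci) / 2
    return np.percentile(reps, [lower, 100 - lower])


# Synthetic stand-in data (not the Grants' measurements)
rng = np.random.default_rng(0)
depths = rng.normal(9.0, 0.6, size=80)

x, y = ecdf(depths)
ci_low, ci_high = bootstrap_mean_ci(depths, seed=1)
```

The `x` and `y` arrays can be passed to any plotting library as points or a step plot, one curve per year.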

c) Perform a hypothesis test comparing the G. scandens beak depths in 1975 and 2012. Carefully state your null hypothesis, your test statistic, and your definition of what it means to be at least as extreme as the test statistic. Comment on the results. It might be interesting to know that a severe drought in 1976 and 1977 resulted in the death of the plants that produce small seeds on the island.
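
One way such a test could be set up is a permutation test, sketched below. The choices made here are assumptions for illustration: the null hypothesis is that both years share the same distribution, the test statistic is the difference of means, and "at least as extreme" means a permuted difference at least as large as the observed one. The data are synthetic.

```python
import numpy as np


def permutation_test_diff_means(a, b, n_reps=10000, seed=None):
    """Return the observed mean(b) - mean(a) and a one-sided p-value.

    Null hypothesis: a and b are drawn from the same distribution.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    observed = b.mean() - a.mean()
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(n_reps):
        # Scramble the labels, then split back into two groups
        perm = rng.permutation(pooled)
        diff = perm[len(a):].mean() - perm[:len(a)].mean()
        if diff >= observed:
            count += 1
    return observed, count / n_reps


# Synthetic stand-in data with a built-in shift in the mean
rng = np.random.default_rng(0)
depth_early = rng.normal(9.0, 0.5, size=60)
depth_late = rng.normal(10.0, 0.5, size=60)

observed, p_value = permutation_test_diff_means(
    depth_early, depth_late, n_reps=2000, seed=1
)
```

A different null hypothesis, such as equal means without assuming identical distributions, would call for a different resampling scheme, so state yours explicitly.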

d) Devise a measure for the shape of a beak. That is, some scalar measure that combines both the length and depth of the beak. Compare this measure between species and through time. (This is very open-ended. It is up to you to define the measure, make relevant plots, compute confidence intervals, and possibly do hypothesis tests to see how shape changes over time and between the two species.)

e) Introgressive hybridization occurs when a G. scandens bird mates with a G. fortis bird, and then the offspring mates again with pure G. scandens. This brings traits from G. fortis into the G. scandens genome. As this may be a mode by which beak geometries of G. scandens change over time, it is useful to know how heritable a trait is. Heritability is defined as the ratio of the covariance between parents and offspring to the variance of the parents alone. To be clear, the heritability is defined as follows.

Compute the average value of a trait in a pair of parents.
Compute the average value of that trait among the offspring of those parents.
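Do this for each set of parents and offspring. Using these averages, compute the covariance between the average parental values and the average offspring values, and the variance of the average parental values. The heritability is the ratio of this covariance to this variance.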

This is a more apt definition than, say, the Pearson correlation, because it is a direct comparison between parents and offspring.

Heritability data for beak depth for G. fortis and G. scandens can be found here and here, respectively. (Be sure to look at the files before reading them in; they do have different formats.) From these data, compute the heritability of beak depth in the two species, with confidence intervals. How do they differ, and what consequences might this have for introgressive hybridization?
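
The definition of heritability above, with a pairs bootstrap for the confidence interval, could be sketched as follows. The function names and the synthetic parent and offspring arrays are assumptions for illustration; the real files have their own formats and must be read in separately.

```python
import numpy as np


def heritability(parents, offspring):
    """Covariance of parent and offspring averages over parental variance."""
    cov = np.cov(parents, offspring)  # 2x2 covariance matrix
    return cov[0, 1] / cov[0, 0]


def pairs_bootstrap_heritability(parents, offspring, n_reps=5000, seed=None):
    """Bootstrap replicates of heritability, resampling families as pairs."""
    rng = np.random.default_rng(seed)
    parents, offspring = np.asarray(parents), np.asarray(offspring)
    n = len(parents)
    reps = np.empty(n_reps)
    for i in range(n_reps):
        # Draw whole (parent, offspring) pairs with replacement
        idx = rng.integers(0, n, size=n)
        reps[i] = heritability(parents[idx], offspring[idx])
    return reps


# Synthetic stand-in data: one average parental value and one average
# offspring value per family (not the real measurements)
rng = np.random.default_rng(0)
parents = rng.normal(9.0, 0.5, size=200)
offspring = 9.0 + 0.7 * (parents - 9.0) + rng.normal(0.0, 0.3, size=200)

h = heritability(parents, offspring)
reps = pairs_bootstrap_heritability(parents, offspring, n_reps=2000, seed=1)
ci_low, ci_high = np.percentile(reps, [2.5, 97.5])
```

Resampling indices rather than the two arrays independently keeps each family's parent and offspring values together, which is what preserves the covariance being estimated.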

f) (25 pts extra credit) Repeat all of the above analysis using parametric Bayesian modeling.