BE/Bi 103, Fall 2017: Homework 1

Due 1pm, Sunday, October 1

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This homework was generated from a Jupyter notebook. You can download the notebook here.

Problem 1.1 (Using Git/GitHub to submit this homework, 20 pts)

Do parts (a)-(d) before doing the rest of the homework. Part (e) is the final thing to do for this homework.

This may be your first time using Git/GitHub. Git is a version control system that allows for collaborative work. GitHub is a website that can host Git repositories. We will be using Git/GitHub exclusively for submitting homework and tutorial exercises. We do this for two main reasons. First, this allows for much easier tracking, submission, and grading of homework. Second, and most importantly, you should be using version control when working with real data. It is good practice both for the security of your own work and for reproducibility of your research.

Prior to submitting this homework, you formed a team of three (and possibly four) people. This team has a repository containing your work for this class that is hosted at GitHub. For example, if you are team number 6, your repository is hosted at https://github.com/bebi103/06-bebi103. The rest of this problem is written as if you are team 6, but you can make the obvious substitutions for your team.

a) Clone your repository to your machine. I like to keep my repositories in the directory ~/git, but the location on your machine is your choice. To clone into that directory, I would do the following on my machine.

cd ~/git
git clone https://github.com/bebi103/06-bebi103.git

b) Upon cloning the repository, you will have a new directory, ~/git/06-bebi103, containing your team's repository. Within that directory, you will see three subdirectories: homework/, tutorial_exercises/, and sandbox/. The homework/ and tutorial_exercises/ directories are for submitting homework and tutorial exercises. The sandbox/ directory is for messing around with various ideas that are not for serious submission.

You will need to create a directory on your local machine within the repository called data/. For my machine, I would do

mkdir ~/git/06-bebi103/data

This directory is not under version control, and anything you put in there will not be uploaded to GitHub. This is because we do not want to have large data files under version control. Git will automatically ignore the contents of this directory because it is included in the .gitignore file of the repository.

For Problem 1.3, you will need to download a data set. You should put this data set in your data/ directory exactly as downloaded. Go ahead and download the data set here and put it in your data/ directory. This is where all of your class data will go.

c) You will place the solution to this homework problem in the file homework/hw1.1.ipynb. One of your team members should create this file. After it is created and saved, he or she can add it, commit, and push.

cd ~/git/06-bebi103/homework
git add hw1.1.ipynb
git commit -m "Initial commit of homework 1.1."
git push origin master

d) Now, the other members of your team should pull the updates.

git pull

Each team member should, in turn on their own machine, write a haiku about data analysis in a Markdown cell, save the notebook, add it, commit, and push. (Don't put too much time or thought into the haiku; it's just for fun to add some text.) Something like this:

cd ~/git/06-bebi103/homework
git add hw1.1.ipynb
git commit -m "Justin added his haiku. It's undoubtedly bad."
git push origin master

As you are working through your notebooks, it is wise to commit and push often. Before you start working, you should also pull to get your teammates' changes. To avoid conflicts, you may want to edit notebooks separately and then copy and paste them into the notebooks you will submit at the end. These other notebooks should be in the sandbox/ directory to keep your homework/ directory clean. More experienced Git users can use branches and merges instead.

e) After you have completed the other two problems in files homework/hw1.2.ipynb and homework/hw1.3.ipynb, along with your haikus in homework/hw1.1.ipynb, it is time to submit the homework. To do this, commit and push your final submission. Something like this.

cd ~/git/06-bebi103/homework
git add hw1.*.ipynb
git commit -m "Final commit for homework submission."
git push origin master

Then, follow the instructions here to tag this commit as the homework submission. When you type the version number of your release, use hw1_submission. You do not need to include any binaries or select anything about pre-releases.



Problem 1.2 (Your goals, 20 pts)

  • Write down your goals for the class.
  • In what context do you expect to use the skills you learn in this class in the future?
  • Are there specific techniques you would like to learn? Is there something that has been confusing for you that you would like cleared up?

Each member of your group should write his or her own response in a Markdown cell (and identify who each response belongs to), but the responses should be turned in together.



Problem 1.3 (Microtubule catastrophes I, 40 pts + 5 pts extra credit)

Throughout the class, we will analyze data from several sources. We will look at some data sets repeatedly because there is plenty of interesting data analysis to be done. One of these data sets comes from this paper by Gardner, Zanic, and coworkers. The full reference is: Gardner, Zanic, et al., Depolymerizing kinesins Kip3 and MCAK shape cellular microtubule architecture by differential control of catastrophe, Cell, 147, 1092-1103, 2011, 10.1016/j.cell.2011.10.037.

We will discuss the paper more throughout the class, and I encourage you to read it. Briefly, the authors investigated the dynamics of microtubule catastrophe, the switching of a microtubule from a growing to a shrinking state. In particular, they were interested in the time between the start of growth of a microtubule and the catastrophe event. They monitored microtubules by using tubulin (the monomer that comprises a microtubule) that was labeled with a fluorescent marker. As a control to make sure that fluorescent labels and exposure to laser light did not affect the microtubule dynamics, they performed a similar experiment using differential interference contrast (DIC) microscopy. They measured the time until catastrophe with labeled and unlabeled tubulin.

In this problem, we will look at the data used to generate Fig. 2a of their paper. In the end, we will generate a plot similar to Fig. 2a.

a) If you haven't already, download the data file here. Read the data from the data file into a Pandas DataFrame.
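As a minimal sketch of reading a delimited text file into a DataFrame: the file contents, column names, and '#' comment line below are invented for illustration and are not the actual data set. It is worth opening the downloaded file in a text editor first to see how it is really laid out, since the arguments you need may differ.

```python
import io

import pandas as pd

# Hypothetical file contents standing in for a downloaded CSV file.
# The real file's header, columns, and comment style may differ.
toy_file = io.StringIO(
    "# toy data, not from the paper\n"
    "a,b\n"
    "1.0,2.0\n"
    "3.0,4.0\n"
)

# comment="#" tells Pandas to ignore anything after a '#' on a line,
# so a line that starts with '#' is skipped entirely.
# For a real file, pass its path (somewhere under ../data/) instead.
df = pd.read_csv(toy_file, comment="#")

print(df)
```

Because the notebook lives in homework/ and the data live in data/, a relative path starting with ../data/ should work from the notebook, assuming the directory layout described in Problem 1.1.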

b) These data are not tidy. Why? It is possible to tidy these data without the fancy techniques we will learn next week. Tidy the data. Hint: The dropna() method of DataFrames may come in handy.
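To illustrate the hint, here is a small sketch of what dropna() does, using a made-up Series rather than the homework data. It only shows the behavior of the method; how to use it to tidy the data is left to you.

```python
import numpy as np
import pandas as pd

# Toy Series with missing values (made up for illustration).
s = pd.Series([1.5, np.nan, 3.0, np.nan], name="toy")

# dropna() returns a new Series without the NaN entries;
# the original Series is left unchanged.
s_clean = s.dropna()

print(s_clean.tolist())
```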

c) Your goal in this part of the problem is to plot the empirical cumulative distribution function (ECDF) as in Fig. 2a of the Gardner, Zanic, et al. paper. You do not need to plot the inset of that figure. To construct the plot, first write a function with the call signature ecdf_vals(data), which takes a one-dimensional Numpy array (or Pandas Series; the same construction of your function will work for both) of data and returns the x and y values for plotting the ECDF. As a reminder,

ECDF(x) = fraction of data points ≤ x.

Use the ecdf_vals() function that you wrote to plot the ECDFs shown in Fig. 2a of the Gardner, Zanic, et al. paper.
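To make the definition of the ECDF concrete, the sketch below evaluates it at a single value of x for a made-up data set. The helper ecdf_at() is hypothetical and is not the ecdf_vals() function requested above, which must return x and y values for the whole data set.

```python
import numpy as np

# Toy data (made up for illustration).
data = np.array([3.0, 1.0, 2.0, 2.0])


def ecdf_at(data, x):
    """Fraction of entries in `data` that are less than or equal to x.

    Hypothetical helper for a single x; not the ecdf_vals() function
    the problem asks for.
    """
    # The comparison gives an array of booleans; their mean is the
    # fraction that are True.
    return np.mean(np.asarray(data) <= x)


# Three of the four points (1, 2, and 2) are <= 2, so this prints 0.75.
print(ecdf_at(data, 2.0))
```

Note that the ECDF is zero for any x below the smallest data point and one for any x at or above the largest.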

d) [5 points extra credit] While many researchers plot ECDFs as in the Gardner, Zanic, et al. paper, this is not the typical convention. Given the definition of the ECDF above, it is defined for all values of x along the real x-axis. So, formally, the ECDF should be plotted as a line.

Write a function with call signature plot_ecdf_formal(data) that takes a one-dimensional Numpy array of data and returns a Bokeh figure with the ECDF plotted as a line. Use this function to generate a plot analogous to the one you did in part (c).


Problem 1.4 (Probability distribution of a marginalized parameter, 20 pts)

This problem should be completed after the first Wednesday lecture of the course.

a) Explain in words what marginalization is.

b) Say I have a statistical model that has two continuous parameters, that is, $\theta = (\theta_1, \theta_2)$. A statement of Bayes's theorem in this case is

\begin{align} P(\theta \mid D, I) = \frac{P(D\mid \theta, I)\,P(\theta \mid I)}{P(D\mid I)}. \end{align}

We can write this explicitly in terms of the two parameters as

\begin{align} P(\theta_1, \theta_2 \mid D, I) = \frac{P(D\mid \theta_1, \theta_2, I)\,P(\theta_1, \theta_2 \mid I)}{P(D\mid I)}. \end{align}

Derive an expression for $P(\theta_1 \mid D, I)$ that features only terms appearing on the right hand side of the above equation.