BE/Bi 103, Fall 2016: Homework 1

Due 1pm, Sunday, October 2

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This homework was generated from a Jupyter notebook. You can download the notebook here.

Problem 1.1 (Using Git/GitHub to submit this homework, 20 pts)

Do parts (a)-(d) before doing the rest of the homework. Part (e) is the final thing to do for this homework.

This may be your first time using Git/GitHub. Git is a version control system that allows for collaborative work. GitHub is a website that can host Git repositories. We will be using Git/GitHub exclusively for submitting homework and tutorial exercises. We do this for two main reasons. First, this allows for much easier tracking, submission, and grading of homework. Second, and most importantly, you should be using version control when working with real data. It is good practice both for the security of your own work and for reproducibility of your research.

Prior to submitting this homework, you formed a team of three (and possibly four) people. This team has a repository containing your work for this class that is hosted at GitHub. For example, if you are team number 6, your repository is hosted at https://github.com/bebi103/06-bebi103. For the rest of this problem, we will be working as if you are team 6, but you can make the obvious substitutions for your team.

a) Clone your repository to your machine. I like to keep my repositories in the directory ~/git, but the location on your machine is your choice. To clone into that directory, I would do the following on my machine.

cd ~/git
git clone https://github.com/bebi103/06-bebi103.git

b) Upon cloning the repository, you will have a new directory, ~/git/06-bebi103, containing your team's repository. Within that directory, you will see three subdirectories: homework/, tutorial_exercises/, and sandbox/. The homework/ and tutorial_exercises/ directories are for submitting homework and tutorial exercises. The sandbox/ directory is for messing around with various ideas that are not for serious submission.

You will need to create a directory on your local machine within the repository called data/. For my machine, I would do

mkdir ~/git/06-bebi103/data

This directory is not under version control, and anything you put in there will not be uploaded to GitHub. This is because we do not want to have large data files under version control. Git will automatically ignore the contents of this directory because it is included in the .gitignore file of the repository.

For Problem 1.3, you will need to download a data set. You should put this data set in your data/ directory exactly as downloaded. Go ahead and download the data set here and put it in your data/ directory. This is where all of your class data will go.

c) You will place the solution to this homework problem in the file homework/hw1.1.ipynb. One of your team members should create this file. After it is created and saved, he or she can add it, commit, and push.

cd ~/git/06-bebi103/homework
git add hw1.1.ipynb
git commit -m "Initial commit of homework 1.1."
git push origin master

d) Now, the other members of your team should pull the updates.

git pull

Now, each team member should, in turn on their own machine, write a haiku about data analysis in a Markdown cell, save the notebook, add it, commit, and push. (Don't put too much time or thought into the haiku; it's just for fun to add some text.) Something like this:

cd ~/git/06-bebi103/homework
git add hw1.1.ipynb
git commit -m "Justin added his haiku. It's undoubtedly bad."
git push origin master

As you are working through your notebooks, it is wise to commit and push often. Before you start working, you should also pull to get your teammates' changes. To avoid conflicts, you may want to edit notebooks separately and then copy and paste them into the notebooks you will submit at the end. These other notebooks should be in the sandbox/ directory to keep your homework/ directory clean. More experienced Git users can use branches and merges instead.

e) After you have completed the other two problems in files homework/hw1.2.ipynb and homework/hw1.3.ipynb, along with your haikus in homework/hw1.1.ipynb, it is time to submit the homework. To do this, commit and push your final submission. Something like this.

cd ~/git/06-bebi103/homework
git add hw1.*.ipynb
git commit -m "Final commit for homework submission."
git push origin master

Then, follow the instructions here to tag this commit as the homework submission. When you type the version number of your release, use hw1_submission. You do not need to include any binaries or select anything about pre-releases.



Problem 1.2 (Your goals, 30 pts)

  • Write down your goals for the class.
  • In what context do you expect to use the skills you learn in this class in the future?
  • Are there specific techniques you would like to learn? Is there something that has been confusing for you that you would like cleared up?

Each member of your group should write his or her own response in a Markdown cell (and identify who each response belongs to), but the responses should be turned in together.



Problem 1.3 (Microtubule catastrophes I, 50 pts + 10 pt extra credit)

Throughout the class, we will analyze data from several sources. We will look at some data sets repeatedly because there is plenty of interesting data analysis to be done. One of these data sets comes from this paper by Gardner, Zanic, and coworkers. The full reference is: Gardner, Zanic, et al., Depolymerizing kinesins Kip3 and MCAK shape cellular microtubule architecture by differential control of catastrophe, Cell, 147, 1092-1103, 2011, 10.1016/j.cell.2011.10.037.

We will discuss the paper more throughout the class, and I encourage you to read it. Briefly, the authors investigated the dynamics of microtubule catastrophe, the switching of a microtubule from a growing to a shrinking state. In particular, they were interested in the time between the start of growth of a microtubule and the catastrophe event. They monitored microtubules by using tubulin (the monomer that comprises a microtubule) that was labeled with a fluorescent marker. As a control to make sure that fluorescent labels and exposure to laser light did not affect the microtubule dynamics, they performed a similar experiment using differential interference contrast (DIC) microscopy. They measured the time until catastrophe with labeled and unlabeled tubulin.

In this problem, we will look at the data used to generate Fig. 2a of their paper. In the end, we will generate a plot similar to Fig. 2a.

a) If you haven't already, download the data file here. Read the data from the data file into a DataFrame.

b) I would argue that these data are not tidy. Why? It is possible to tidy these data without the fancy techniques we will learn next week. Tidy the data. Hint: The dropna() method of DataFrames may come in handy.
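As a rough illustration of the hint, the sketch below reads a small made-up table in which two measurement columns have different lengths, so the shorter one is padded with NaN. It then uses dropna() to build a table with one measurement per row. The column names, the values, and the use of comment lines starting with # are all hypothetical; they are not taken from the downloaded data file.

```python
import io

import pandas as pd

# Hypothetical CSV contents; the real data file may be laid out differently.
csv_text = """# made-up example data
labeled,unlabeled
10.0,12.0
20.0,15.0
30.0,
"""

# Lines starting with '#' are treated as comments and skipped.
df = pd.read_csv(io.StringIO(csv_text), comment="#")

# Each column is a separate experiment. Drop the NaN padding from each one
# and stack the results so that every row is a single measurement.
df_tidy = pd.concat(
    [
        pd.DataFrame({"time (s)": df[col].dropna(), "tubulin": col})
        for col in df.columns
    ],
    ignore_index=True,
)
```

In this toy case, df has three rows and one NaN, while df_tidy has five rows and no missing values.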

c) Plot histograms of the catastrophe times for the experiments with labeled and unlabeled tubulin. Try different settings of the plotting parameters to see what works best. In particular, you might want to experiment with the bins, normed (renamed density in newer versions of Matplotlib), and histtype keyword arguments. You can show a few candidates for how you would display the data. For your "official" histogram(s), discuss the design decisions you made to plot it the way you did.
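For orientation, here is a minimal sketch of how these keyword arguments are passed to Matplotlib's hist(). The data are synthetic stand-ins drawn from an exponential distribution, not the catastrophe times, and the particular values chosen for bins and histtype are arbitrary. The sketch uses density, the keyword that replaced normed in newer Matplotlib releases.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data; not the catastrophe times from the paper.
rng = np.random.default_rng(0)
fake_times = rng.exponential(scale=400.0, size=200)

fig, ax = plt.subplots()

# density=True rescales the bar heights so the histogram integrates to one,
# which makes data sets of different sizes comparable on the same axes.
counts, bin_edges, _ = ax.hist(
    fake_times, bins=20, density=True, histtype="step"
)
ax.set_xlabel("time (s)")
ax.set_ylabel("probability density")
```

Changing bins, switching histtype between "bar", "step", and "stepfilled", and toggling the normalization are quick ways to generate the candidate plots asked for above.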

d) Plot empirical cumulative distribution functions (ECDFs) as in Fig. 2a of the Gardner, Zanic, et al. paper. You do not need to plot the inset of that figure. Since you will do this over and over again in exploratory data analysis, write a function with the call signature ecdf(data), which takes a one-dimensional NumPy array of data and returns the x and y values for plotting the ECDF. As a reminder, if the data set is sorted such that $x_i \le x_{i+1}$, with $i = 1, 2, \ldots, n$, then the y-values for the ECDF, $\hat{F}(x_i)$, are

\begin{align} \hat{F}(x_i) = \frac{i}{n}. \end{align}

Use this ecdf() function that you wrote to plot the ECDFs shown in Fig. 2a of the Gardner, Zanic, et al. paper.
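To make the formula $\hat{F}(x_i) = i/n$ concrete, here is one possible sketch of such a function, tried on a made-up three-point data set. It is only an illustration of the formula, not a reference solution.

```python
import numpy as np


def ecdf(data):
    """Return x and y values for plotting the ECDF of a 1D array as dots."""
    x = np.sort(data)
    # The i-th smallest point (i = 1, ..., n) gets the value i / n.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Made-up data: three points, deliberately unsorted.
x, y = ecdf(np.array([3.0, 1.0, 2.0]))
# x is [1, 2, 3] and y is [1/3, 2/3, 1]
```

The output can be plotted as dots with, for example, plt.plot(x, y, marker='.', linestyle='none').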

e) Discuss the relative merits of the ways of showing the data in part (c) versus part (d). (If you do part (f), include that in the discussion as well.)

f) [10 points extra credit] While many researchers plot ECDFs as in the Gardner, Zanic, et al. paper, this is not the typical convention. More formally, an ECDF, $\hat{F}(x)$, of a data set $X$ consisting of $n$ points indexed from $1$ to $n$ is defined as (see Wasserman, All of Nonparametric Statistics, eq. 2.2)

\begin{align} \hat{F}(x) &= \frac{1}{n}\sum_{i=1}^n I(X_i \le x), \\[1em] \text{where } I(X_i \le x) &= \left\{\begin{array}{ccl} 1 && \text{if } X_i \le x, \\ 0 && \text{if } X_i > x. \end{array} \right. \end{align}

The ECDF is then plotted as a line. Write a function with call signature ecdf_conventional(data) that takes a one-dimensional NumPy array of data and returns the x and y values for plotting the ECDF as a line. That is, if you call plt.plot(x, y) with the output of ecdf_conventional(data), you will get an appropriate-looking ECDF.
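One way to approach this, sketched below under simplifying assumptions, is to note that the definition gives a staircase: $\hat{F}(x)$ jumps from $(i-1)/n$ to $i/n$ at the $i$-th smallest data point. Listing every data point twice, once with the value before the jump and once with the value after, produces that staircase when the points are connected by a line. This sketch does not draw the flat tails to the left of the smallest point and to the right of the largest, which you may want to add.

```python
import numpy as np


def ecdf_conventional(data):
    """Return x and y values for plotting the ECDF of a 1D array as a line."""
    x = np.sort(data)
    n = len(x)
    # Repeat every data point so the line can jump vertically at each one.
    x_line = np.repeat(x, 2)
    y_line = np.empty(2 * n)
    y_line[0::2] = np.arange(0, n) / n      # value just before each jump
    y_line[1::2] = np.arange(1, n + 1) / n  # value at and after each jump
    return x_line, y_line


# Made-up data: three points, deliberately unsorted.
x_line, y_line = ecdf_conventional(np.array([3.0, 1.0, 2.0]))
# x_line is [1, 1, 2, 2, 3, 3] and y_line is [0, 1/3, 1/3, 2/3, 2/3, 1]
```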