BE/Bi 103, Fall 2018: Homework 1

Due 1pm, Sunday, October 7

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This homework was generated from a Jupyter notebook. You can download the notebook here.



Problem 1.1 (Your goals, 20 pts)

  • Write down your goals for the class.
  • In what context do you expect to use the skills you learn in this class in the future?
  • Are there specific techniques you would like to learn? Is there something that has been confusing for you that you would like cleared up?

Each member of your group should write his or her own response in a Markdown cell (and identify who each response belongs to), but the responses should be turned in together.



Problem 1.2 (Beetle hypnotists, 40 pts)

The Parker lab at Caltech studies rove beetles that can infiltrate ant colonies. In one of their experiments, they place a rove beetle and an ant in a circular arena and track the movements of the ant. They do this by using a deep learning algorithm to identify the ant's head, thorax, abdomen, and right and left antennae. While deep learning applied to biological images is a beautiful and useful topic, we will not cover it in this course (be on the lookout for future courses that do!). We will instead work with a data set that is the output of the deep learning algorithm.

For the experiment you are considering in this problem, an ant and a beetle were placed in a circular arena and recorded with video at a frame rate of 28 frames per second. The positions of the body parts of the ant were tracked throughout the video recording. You can download the data set here. Pro tip: Pandas's read_csv() function will automatically load in a zip file, so you do not need to unzip it. Be sure to use the comment='#' kwarg, though, since there are header comments on the top of the data file.
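The pro tip above can be tried out on a toy file before touching the real data. The following is a minimal sketch: the file name `toy_ant_data.csv.zip`, the comment lines, and the numerical values are all made up for illustration and are not the real data set.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Toy stand-in for the real data file; file name and values are hypothetical.
csv_text = """# Header comment line 1
# Header comment line 2
frame,beetle_treatment,ID,bodypart,x_coord,y_coord,likelihood
0,dalotia,0,head,73.086,193.835,1.0
0,dalotia,0,thorax,71.2,190.1,0.99
"""

with tempfile.TemporaryDirectory() as tmpdir:
    # Write the toy CSV into a zip archive containing a single file.
    zip_path = os.path.join(tmpdir, "toy_ant_data.csv.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("toy_ant_data.csv", csv_text)

    # Compression is inferred from the .zip extension, and
    # comment='#' makes Pandas skip the header comment lines.
    df = pd.read_csv(zip_path, comment="#")

print(df.columns.tolist())
```

For the real data set, the same `pd.read_csv()` call should work with the path to the downloaded zip file substituted for `zip_path`.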

To save you from having to unzip and read the comments for the data file, here they are:

# This data set was kindly donated by Julian Wagner from 
# Joe Parker's lab at Caltech. In the experiment, an ant
# and a beetle were placed in a circular arena and 
# recorded with video at a frame rate of 28 frames per 
# second. The positions of the body parts the ant are 
# tracked throughout the video recording.
#
# The experiment aims to distinguish the ant behavior in 
# the presence of a beetle from the genus Sceptobius, which
# secretes a chemical that modifies the behavior of the ant,
# versus in the presence of a beetle from the species 
# Dalotia, which does not.
#
# The data set has the following columns.
#  frame : frame number from the video acquisition
#  beetle_treatment : Either dalotia or sceptobius
#  ID : The unique integer identifier of the ant in the 
#       experiment
#  bodypart : The body part being tracked in the experiment. 
#             Possible values are head, thorax, abdomen, 
#             antenna_left, antenna_right.
#  x_coord : x-coordinate of the body part in units of pixels
#  y_coord : y-coordinate of the body part in units of pixels
#  likelihood : A rating, ranging from zero to one, given by 
#               the deep learning algorithm that approximately 
#               quantifies confidence that the body part was 
#               correctly identified.
#
# The interpixel distance for this experiment was 0.08 
# millimeters.

Your task in this problem is to extract records of interest out of the tidy data frame containing the data from the experiment, perform calculations on the data, and make informative plots.
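As a reminder of what extracting records from a tidy data frame can look like, here is a minimal sketch using Boolean indexing. The data frame is a toy with made-up values that borrows the column names from the header comments. Which records are actually of interest for your analysis is up to you.

```python
import pandas as pd

# Toy tidy data frame with made-up values, using the column names
# listed in the header comments of the data file.
df = pd.DataFrame(
    {
        "frame": [0, 0, 1, 1, 0, 0],
        "beetle_treatment": ["dalotia"] * 4 + ["sceptobius"] * 2,
        "ID": [0, 0, 0, 0, 6, 6],
        "bodypart": ["head", "thorax", "head", "thorax", "head", "thorax"],
        "x_coord": [73.1, 71.2, 73.5, 71.9, 150.0, 148.2],
        "y_coord": [193.8, 190.1, 194.0, 190.6, 88.4, 90.3],
        "likelihood": [1.0, 0.99, 1.0, 0.98, 0.97, 1.0],
    }
)

# Boolean indexing: keep only the thorax records for the ant with ID 0.
inds = (df["bodypart"] == "thorax") & (df["ID"] == 0)
df_thorax = df.loc[inds, ["frame", "x_coord", "y_coord"]]

print(df_thorax)
```

Each row of a tidy data frame is a single observation, so a Boolean mask built from the columns is enough to pull out any subset of records.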

a) The columns x_coord and y_coord give the coordinates of the ant's body parts in units of pixels. Create a column 'x (mm)' and a column 'y (mm)' in the data frame that have the coordinates in units of millimeters.
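The mechanics of adding a computed column can be sketched on a toy data frame. The pixel coordinates below are made up, and the variable name `interpixel_distance_mm` is just an illustrative choice. Only the value of 0.08 mm comes from the header comments.

```python
import pandas as pd

# Interpixel distance in mm, as stated in the header comments of the data file.
interpixel_distance_mm = 0.08

# Toy data frame with made-up pixel coordinates.
df = pd.DataFrame(
    {"x_coord": [0.0, 100.0, 250.0], "y_coord": [50.0, 125.0, 200.0]}
)

# Vectorized arithmetic on a column creates the new column in one step.
df["x (mm)"] = df["x_coord"] * interpixel_distance_mm
df["y (mm)"] = df["y_coord"] * interpixel_distance_mm

print(df)
```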

b) Make a plot displaying the position over time of the thorax of an ant or ants placed in an arena with a Dalotia beetle and the position over time of an ant or ants with a Sceptobius beetle. I am intentionally not giving more specification for your plot. You need to make decisions about how to effectively extract and display the data. Think carefully about your viz. This is in many ways how you let your data speak. You could make a plot for a single ant from each beetle treatment, or for many. You will also probably need to refer to the Altair or Bokeh documentation to specify the plot as you wish.

c) From this quick, exploratory analysis, what would you say about the relative activities of ants with Dalotia versus Sceptobius rove beetles?


Problem 1.3 (Probability distribution of a marginalized parameter, 20 pts)

This problem should be completed after the first Wednesday lecture of the course.

a) Explain in words what marginalization is.

b) Say I have a statistical model that has two continuous parameters, that is, $\theta = (\theta_1, \theta_2)$, and measured data $y$. A statement of Bayes's theorem in this case is

\begin{align} P(\theta \mid y) = \frac{P(y\mid \theta)\,P(\theta)}{P(y)}. \end{align}

We can write this explicitly in terms of the two parameters as

\begin{align} P(\theta_1, \theta_2 \mid y) = \frac{P(y\mid \theta_1, \theta_2)\,P(\theta_1, \theta_2)}{P(y)}. \end{align}

Derive an expression for $P(\theta_1 \mid y)$ that features only terms appearing on the right-hand side of the above equation.



Problem 1.4 (Using Git/GitHub to submit this homework, 20 pts)

This may be your first time using Git/GitHub. Git is a version control system that allows for collaborative work. GitHub is a website that can host Git repositories. We will be using Git/GitHub exclusively for submitting homework and tutorial exercises. We do this for two main reasons. First, this allows for much easier tracking, submission, and grading of homework. Second, and most importantly, you should be using version control when working with real data. It is good practice both for the security of your own work and for reproducibility of your research.

Prior to submitting this homework, you formed a team of three (and possibly four) people. This team has a repository containing your work for this class that is hosted at GitHub. For example, if you are team number 6, your repository is hosted at https://github.com/bebi103/06-bebi103-2018. The rest of this problem is written as if you are team 6, but you can make the obvious substitutions for your team.

a) Clone your repository to your machine. I like to keep my repositories in the directory ~/git, but the location on your machine is your choice. To clone into that directory, I would do the following on my machine.

cd ~/git
git clone https://github.com/bebi103/06-bebi103-2018.git

b) Upon cloning the repository, you will have a new directory, ~/git/06-bebi103-2018, containing your team's repository. Within that directory, you will see three subdirectories, homework/, tutorial_exercises/, and sandbox/. The homework/ and tutorial_exercises/ directories are for submitting homework and tutorial exercises. The sandbox/ directory is for messing around with various ideas that are not for serious submission.

You will need to create a directory on your local machine within the repository called data/. For my machine, I would do

mkdir ~/git/06-bebi103-2018/data

This directory is not under version control, and anything you put in there will not be uploaded to GitHub. This is because we do not want to have large data files under version control. Git will automatically ignore the contents of this directory because it is included in the .gitignore file of the repository.

For Problem 1.2, you will need to download a data set. You should put this data set in your data/ directory exactly as downloaded. Go ahead and download the data set here and put it in your data/ directory. This is where all of your class data will go.

c) You will place the solution to this homework problem in the file homework/hw1.1.ipynb. One of your team members should create this file. After it is created and saved, he or she can add it, commit, and push.

cd ~/git/06-bebi103-2018/homework
git add hw1.1.ipynb
git commit -m "Initial commit of homework 1.1."
git push origin master

d) Now, the other members of your team should pull the updates.

git pull

Each team member should, in turn on their own machine, write a haiku about data analysis in a Markdown cell, save the notebook, add it, commit, and push. (Don't put too much time or thought into the haiku; it's just for fun to add some text.) Something like this:

cd ~/git/06-bebi103-2018/homework
git add hw1.1.ipynb
git commit -m "Justin added his haiku. It's undoubtedly bad."
git push origin master

As you are working through your notebooks, it is wise to commit and push often. Before you start working, you should also pull to get your teammates' changes. To avoid conflicts, you may want to edit notebooks separately and then copy and paste them into the notebooks you will submit at the end. These other notebooks should be in the sandbox/ directory to keep your homework/ directory clean. More experienced Git users can use branches and merges instead.

e) After you have completed the other two problems in files homework/hw1.2.ipynb and homework/hw1.3.ipynb, along with your haikus in homework/hw1.1.ipynb, it is time to submit the homework. To do this, commit and push your final submission. Something like this.

cd ~/git/06-bebi103-2018/homework
git add hw1.*.ipynb
git commit -m "Final commit for homework submission."
git push origin master

Then, follow the instructions here to tag this commit as the homework submission. When you type the version number of your release, use hw1_submission. You do not need to include any binaries or select anything about pre-releases.