This homework was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline
# This enables SVG graphics inline (only use with static plots (non-Bokeh))
%config InlineBackend.figure_format = 'svg'
# JB's favorite Seaborn settings for notebooks
rc = {'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18,
      'axes.facecolor': '#DFDFE5'}
sns.set_context('notebook', rc=rc)
Write down your goals for the class. Is there something that has been confusing for you that you would like cleared up? Are there specific techniques you would like to learn?
Each member of your group should write his or her own response (and identify who each response belongs to), but the responses should be turned in together.
We will soon be doing regression analysis. We will have a set of $(x,y)$ data and a model that we think describes the observed trends in the data. For example, we may think that $y$ depends linearly on $x$, so we would propose

\begin{align}
y(x) = a x + b,
\end{align}

where $a$ and $b$ are parameters.
In order to do the regression, we will need to write a Python function of the form f(p, x), where p is a NumPy array containing the fit parameters. For example, if we wanted to make a linear function, we might define the following.
def lin_func(p, x):
    """
    Returns p[0] * x + p[1].
    """
    a, b = p
    return a * x + b
One of the tricks is that your function should work if x is a single number or a NumPy array. In the above example, it does, as we can see by plotting.
# Make a set of evenly spaced points in x
x = np.linspace(-1.0, 2.0, 50)
# Compute y
y = lin_func(np.array([7.0, -3.0]), x)
# Plot as dots to verify it was calculated for each value of x
plt.plot(x, y, 'o')
plt.margins(x=0.02, y=0.02)
plt.xlabel(r'$x$')
plt.ylabel(r'$y$');
Write Python functions of this form (f(p, x)) for the following functions and make smooth plots of them for a few sets of parameter values over appropriate ranges of $x$ values. If you think it is appropriate, plot the functions on a logarithmic or semilogarithmic scale. (Check out functions like plt.loglog and plt.semilogy for this sort of thing.) Whatever you choose, give an explanation as to why you chose to plot the function the way you did. A minimal sketch for part (a) is shown after the list of functions below.
a) Exponential decay + background signal:

\begin{align}
y = a + b\,\mathrm{e}^{-x/\lambda}
\end{align}

b) The Cauchy distribution:

\begin{align}
y = \frac{\beta}{\pi\left(\beta^2 + (x - \alpha)^2\right)}
\end{align}

c) The Hill function:

\begin{align}
y = \frac{x^\alpha}{k^\alpha + x^\alpha}.
\end{align}
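As one possible starting point (a sketch only, not the required solution), here is a function of the form f(p, x) for part (a), using the np and plt imports from the first code cell. It assumes the hypothetical parameter ordering p = [a, b, lam]; the Cauchy and Hill functions follow the same pattern.

def exp_decay(p, x):
    """
    Exponential decay plus a constant background: a + b * exp(-x / lam).
    Assumes the parameter ordering p = [a, b, lam].
    """
    a, b, lam = p
    return a + b * np.exp(-x / lam)

# Smooth curve: many x values over a range where the decay is visible
x = np.linspace(0.0, 10.0, 200)
y = exp_decay(np.array([1.0, 5.0, 2.0]), x)

# Semilog-y axes are one reasonable choice for a function with
# exponential behavior; justify whatever scale you choose
plt.semilogy(x, y, '-')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$');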
Throughout the class, we will analyze data from several sources. We will look at some data sets repeatedly because there is plenty of interesting data analysis to be done. One of these data sets comes from this paper by Gardner, Zanic, and coworkers. The full reference is: Gardner, Zanic, et al., Depolymerizing kinesins Kip3 and MCAK shape cellular microtubule architecture by differential control of catastrophe, Cell, 147, 1092-1103, 2011, doi: 10.1016/j.cell.2011.10.037.
We will discuss the paper more throughout the class, and I encourage you to read it. Briefly, the authors investigated the dynamics of microtubule catastrophe, the switching of a microtubule from a growing to a shrinking state. In particular, they were interested in the time between the start of growth of a microtubule and the catastrophe event. They monitored microtubules by using tubulin (the monomer that comprises a microtubule) that was labeled with a fluorescent marker. As a control to make sure that fluorescent labels and exposure to laser light did not affect the microtubule dynamics, they performed a similar experiment using differential interference contrast (DIC) microscopy. They measured the time until catastrophe with labeled and unlabeled tubulin.
In this problem, we will look at the data used to generate Fig. 2a of their paper. In the end, we will generate a plot similar to Fig. 2a.
a) If you haven't already, download the data file here. Read the data from the data file into a DataFrame.
b) I would argue that these data are not tidy. Why? It is possible to tidy these data without the fancy techniques we will learn next week. Tidy the data. Hint: The dropna() method of DataFrames may come in handy.
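A minimal sketch of how parts (a) and (b) might start, assuming you save the download under the hypothetical name gardner_time_to_catastrophe.csv and that the file has one column of catastrophe times per experiment, padded with NaNs (check the actual layout of the file before copying this):

import pandas as pd

# Hypothetical file name; use whatever name you gave the downloaded file
df = pd.read_csv('gardner_time_to_catastrophe.csv')

# If each column holds the times for one experiment (labeled vs.
# unlabeled tubulin), padded with NaNs because the experiments have
# different numbers of measurements, one simple way to tidy is to drop
# the NaNs from each column and stack into long format with an
# identifier column.
tidy_pieces = []
for col in df.columns:
    times = df[col].dropna()
    tidy_pieces.append(pd.DataFrame({'time to catastrophe (s)': times,
                                     'labeled': col}))
df_tidy = pd.concat(tidy_pieces, ignore_index=True)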
c) Plot histograms of the catastrophe times for the experiments with labeled and unlabeled tubulin. Try different settings of the plotting parameters to see what works best. In particular, you might want to experiment with the bins, normed, and histtype keyword arguments. You can show a few candidates for how you would display the data. For your "official" histogram(s), discuss the design decisions you made to plot it the way you did.
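As one candidate plot, assuming the df_tidy DataFrame from the sketch above (the column names and the 'labeled'/'unlabeled' identifiers are assumptions), and noting that newer versions of Matplotlib use density in place of the normed keyword argument:

# Pull out the two sets of catastrophe times (column names are assumed)
labeled = df_tidy.loc[df_tidy['labeled'] == 'labeled',
                      'time to catastrophe (s)']
unlabeled = df_tidy.loc[df_tidy['labeled'] == 'unlabeled',
                        'time to catastrophe (s)']

# One candidate set of plotting parameters; try others as the problem asks
plt.hist(labeled, bins=20, density=True, histtype='stepfilled', alpha=0.5,
         label='labeled')
plt.hist(unlabeled, bins=20, density=True, histtype='stepfilled', alpha=0.5,
         label='unlabeled')
plt.xlabel('time to catastrophe (s)')
plt.ylabel('frequency')
plt.legend();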
d) Plot cumulative histograms for the labeled and unlabeled experiments using the same binning you used in part (c). Hint: You might find the cumulative keyword argument of plt.hist useful.
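Continuing the same sketch, the cumulative keyword argument of plt.hist turns the same call into a cumulative histogram:

# Same binning as in part (c), but cumulative; histtype='step' keeps
# the two distributions from hiding one another
plt.hist(labeled, bins=20, density=True, histtype='step', cumulative=True,
         label='labeled')
plt.hist(unlabeled, bins=20, density=True, histtype='step', cumulative=True,
         label='unlabeled')
plt.xlabel('time to catastrophe (s)')
plt.ylabel('cumulative frequency')
plt.legend(loc='lower right');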
e) Plot cumulative histograms as in Fig. 2a of the Gardner, Zanic, et al. paper. You do not need to plot the inset of that figure. Hint: Think about how to compute a cumulative histogram with no binning. The np.arange function might be useful.
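One way to read the hint: with no binning, the cumulative histogram of a data set is just the sorted values plotted against the fraction of points less than or equal to each value. A minimal sketch, assuming times is a NumPy array (or Pandas Series) of catastrophe times for one of the experiments:

# Unbinned cumulative distribution: sort the data and plot against
# i / n for i = 1, ..., n
x_ecdf = np.sort(times)
y_ecdf = np.arange(1, len(times) + 1) / len(times)
plt.plot(x_ecdf, y_ecdf, '.')
plt.xlabel('time to catastrophe (s)')
plt.ylabel('cumulative fraction');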
f) Which do you think is a better way to do your plots, as in part (d) or (e)? Is there ever a reason to use the one you deem inferior?
This problem is very open-ended. Dryad is one of many open data repositories that contain raw or processed data from scientific experiments. Among the many little nuggets of goodness in the Dryad repository are the measurements that David Lack made of finches in the Galápagos and published in his famous book, Darwin's Finches. You can get the data here. You should download the .TAB file, which is a tab-delimited table of his measurements.
Explore the data set as you like. Make plots, describe what interests you, and give your impressions about what you see in your exploratory data analysis. Use Pandas and the tools we learned in the first tutorial.
Hint: Use the delimiter='\t' keyword argument when using pd.read_csv() to read in the data.
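For example, assuming the downloaded file was saved under the hypothetical name finch_beaks.tab:

import pandas as pd

# The Dryad table is tab-delimited, so specify the delimiter explicitly
df_finch = pd.read_csv('finch_beaks.tab', delimiter='\t')

# Take a first look at what is in the table
df_finch.head()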