Tutorial 3a: Model generation and checking

This tutorial was generated from an IPython notebook. You can download the notebook here.

In part (b) of this tutorial, we will perform regression on data that we model as a linear function. Before we can do a regression, we need to think hard about what the model means and what the appropriate parameters are. It sounds trivial that we should do this, but it is a very important and often overlooked step in data analysis.

The data set

The data set we will use for this analysis comes from Good, et al., Cytoplasmic volume modulates spindle size during embryogenesis Science, 342, 856-860, 2013. You can download the data set here. In this work, Matt Good and coworkers developed a microfluidic device where they could create droplets cytoplasm extracted from Xenopus eggs and embryos. A remarkable property about Xenopus extract is that mitotic spindles spontaneously form; the extracted cytoplasm has all the ingredients to form them. This makes it an excellent model system for studying spindles. With their device, Good and his colleagues were able to study how the size of the cell affects the dimensions of the mitotic spindle; a simple, yet beautiful, question. The experiment is conceptually simple; they made the droplets and then measured their dimensions and the dimensions of the spindles using microscope images. In this tutorial, we will analyze their data relating droplet size and spindle size.

Before we begin, as usual, we will import out modules. One module we will use is brewer2mpl, which provides convenient access to the colormaps of ColorBrewer2. ColorBrewer2 is used for generating colormaps for use in cartography, but they are very useful for plotting as well. They are based on the research of Cynthia Brewer, who studies human perception of color in cartography. brewer2mpl is not part of Canopy, so you can install it from an IPython window by doing !pip install brewer2mpl.

Even if you cannot install brewer2mpl in this tutorial, make sure you can use pip to install software on your machine. You will need to do it for the next tutorial, when we use emcee.

In [1]:
from __future__ import division, absolute_import, \
                                    print_function, unicode_literals

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.special

# We'll use ColorBrewer for color maps
from brewer2mpl import sequential

# Necessary to display plots in this IPython notebook
%matplotlib inline    

Theoretical models for spindle size

Since we will be fitting the data with a theoretical model, I briefly summarize the two competing models for spindle size proposed by Good, et al. in their paper.

Model 1: Set spindle size

As a first model, we propose that the size of a mitotic spindle is inherent to the spindle itself. This means that the size of the spindle is independent of the size of the droplet or cell in which it resides. This would be the case, for example, if construction of the spindle involves length-sensing molecules, such as depolymerizing motor proteins (see, e.g., this paper by Varga, et al.). We define that set length as $\theta$.

As we saw in the last tutorial, we will assume that if we measure the lengths, $L_\mathrm{s}$ of many spindle, each datum is

\begin{align} l_i = \theta + e_i, \end{align}

where $e_i$ is the noise component of the $i$the datum. The noise is assumed to come from a Gaussian distribution with variance $\sigma^2$. The value of $\sigma$ is the same for all data. In this case, the likelihood of a set of measurements $D = \{l_1, l_2,\ldots\}$ is

\begin{align} P(D~|~\theta, \sigma, M_1, I) = \prod_{i\in D} \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left[-\frac{(l_i - \theta)^2}{2\sigma^2}\right], \end{align}

where $M_1$ denotes that model $M_1$ is true.

Model 2: Spindle size dependent on total tubulin concentration

This model hinges on the following principles:

  1. The total amount of tubulin in the droplet or cell is conserved.
  2. The total length of polymerized microtubules is a function of the total tubulin concentration after assembly of the spindle. This results from the balances of microtubule polymerization rate with catastrophe frequencies.
  3. The density of tubulin in the spindle is independent of droplet or cell volume.

Assumption 1 (conservation of tubulin) implies

\begin{align} T_0 V_0 = T_1(V_0 - V_\mathrm{s}) + T_\mathrm{s}V_\mathrm{s}, \end{align}

where $V_0$ is the volume of the droplet or cell, $V_\mathrm{s}$ is the volume of the spindle, $T_0$ is the total tubulin concentration (polymerized or not), $T_1$ is the tubulin concentration in the cytoplasm after the the spindle has formed, and $T_\mathrm{s}$ is the concentration of tubulin in the spindle. If we assume the spindle does not take up much of the total volume of the droplet or cell ($V_0 \gg V_\mathrm{s}$, which is the case as we will see when we look at the data), we have

\begin{align} T_1 \approx T_0 - \frac{V_\mathrm{s}}{V_0}\,T_\mathrm{s}. \end{align}

The amount of tubulin in the spindle can we written in terms of the total length of polymerized microtubules, $L_\mathrm{MT}$ as

\begin{align} T_s V_\mathrm{s} = \alpha L_\mathrm{MT}, \end{align}

where $\alpha$ is the tubulin concentration per unit microtubule length. (We will see that it is unimportant, but from the known geometry of microtubules, $\alpha \approx 2.7$ nmol/µm.)

We formalize assumption 2 into a mathematical expression. Microtubule length should grow with increasing $T_1$. There should also be a minimal threshold $T_\mathrm{min}$ where polymerization stops. We therefore approximate the total microtubule length as a linear function,

\begin{align} L_\mathrm{MT} \approx \left\{\begin{array}{ccl} 0 & &T_1 \le T_\mathrm{min} \\ \beta(T_1 - T_\mathrm{min}) & & T_1 > T_\mathrm{min}. \end{array}\right. \end{align}

Because spindles form in Xenopus extract, $T_0 > T_\mathrm{min}$, so there exists a $T_1$ with $T_\mathrm{min} < T_1 < T_0$, so going forward, we are assured that $T_1 > T_\mathrm{min}$. Thus, we have

\begin{align} V_\mathrm{s} \approx \alpha\beta\,\frac{T_1 - T_\mathrm{min}}{T_\mathrm{s}}. \end{align}

With insertion of our expression for $T_1$, this becomes

\begin{align} V_{\mathrm{s}} \approx \alpha \beta\left(\frac{T_0 - T_\mathrm{min}}{T_\mathrm{s}} - \frac{V_\mathrm{s}}{V_0}\right). \end{align}

Solving for $V_\mathrm{s}$, we have

\begin{align} V_\mathrm{s} \approx \frac{\alpha\beta}{1 + \alpha\beta/V_0}\,\frac{T_0 - T_\mathrm{min}}{T_\mathrm{s}} =\frac{V_0}{1 + V_0/\alpha\beta}\,\frac{T_0 - T_\mathrm{min}}{T_\mathrm{s}}. \end{align}

We approximate the shape of the spindle as a prolate spheroid with major axis length $l$ and minor axis length $w$, giving

\begin{align} V_\mathrm{s} = \frac{\pi}{6}\,l w^2 = \frac{\pi}{6}\,k^2 l^3, \end{align}

where $k \equiv w/l$ is the aspect ratio of the spindle. We can now write an expression for the spindle length as

\begin{align} l \approx \left(\frac{6}{\pi k^2}\, \frac{T_0 - T_\mathrm{min}}{T_\mathrm{s}}\, \frac{V_0}{1+V_0/\alpha\beta}\right)^{\frac{1}{3}}. \end{align}

For small droplets, with $V_0\ll \alpha \beta$, this becomes

\begin{align} l(d) \approx \left(\frac{6}{\pi k^2}\, \frac{T_0 - T_\mathrm{min}}{T_\mathrm{s}}\, V_0\right)^{\frac{1}{3}} = \left(\frac{T_0 - T_\mathrm{min}}{k^2T_\mathrm{s}}\right)^{\frac{1}{3}}\,d, \end{align}

where $d$ is the diameter of the spherical droplet or cell. For large $V_0$, the spindle size becomes independent of droplet size;

\begin{align} l \approx \left(\frac{6 \alpha \beta}{\pi k^2}\, \frac{T_0 - T_\mathrm{min}}{T_\mathrm{s}}\right)^{\frac{1}{3}}. \end{align}

We can define two parameters to describe the data,

\begin{align} \gamma &= \left(\frac{T_0-T_\mathrm{min}}{k^2T_\mathrm{s}}\right)^\frac{1}{3} \\ \theta &= \gamma\left(\frac{6\alpha\beta}{\pi}\right)^{\frac{1}{3}}. \end{align}

We assume that $\gamma$ and $\theta$ are the same for all data. We can rewrite the general model expression in terms of these parameters as

\begin{align} l(d) \approx \frac{\gamma d}{\left(1+(\gamma d/\theta)^3\right)^{\frac{1}{3}}}. \end{align}

The linear and plateau regimes are then, respectively,

\begin{align} l(d) &\approx \gamma d ~\text{ for } \gamma d/\theta \ll 1, \\ l &\approx \theta ~~~\text{ for } \gamma d/\theta \gg 1. \end{align}

Note that the expression for the linear regime gives a bounds for $\gamma$. Obviously, $\gamma > 0$. Because $l \le d$, lest the spindle not fit in the droplet, we also have $\gamma \le 1$. The parameter $\theta$ is independent of the system geometry, so it only has the physical lower bound of $\theta > 0$.

To get an idea of what we might expect from the shapes of curves from these two models, let's make plots similar to what we did in homework 1. This is easiest if we choose $\theta$ as the units for both the $x$- and $y$-axes.

In [12]:
# Generate dimensionless d range to plot
d = np.linspace(0.0, 10.0, 200)

# Values of gamma to use in the plot
gamma = np.array([0.1, 0.3, 0.7, 1.0])

# Let's use ColorBrewer to get colormap 
# (I don't like to use lightest color, hence the +1)
bmap = sequential.Purples[len(gamma) + 1]

# Make the plots
legend_labels = []
for i in range(len(gamma)-1, -1, -1):
    plt.plot(d, gamma[i] * d / (1 + (gamma[i] * d)**3)**(1.0 / 3.0),
             '-', color=bmap.mpl_colors[i+1])
    legend_labels.append(r'$\gamma = {0:g}$'.format(gamma[i]))

# Make plot pretty
plt.margins(y=0.02)
plt.xlabel(r'$d/\theta$')
plt.ylabel(r'$l/\theta$')
plt.legend(legend_labels, loc='lower right')
Out[12]:
<matplotlib.legend.Legend at 0x1120bde10>

We can also look at how the different limits compare to the theoretical curves. We will do this for $\gamma = 0.3$.

In [11]:
# Plot the full theoretical curve for gamma = 0.3
gamma = 0.3
plt.plot(d, gamma * (d**3 / (1 + (gamma * d)**3))**(1.0 / 3.0), 'k-')

# Plot plateau region (horizontal line at y = 1)
plt.plot([d[0], d[-1]], [1.0, 1.0], 'k--')

# Plot linear region
plt.plot([d[0], d[-1]], gamma * np.array([d[0], d[-1]]), 'k--')

# Make plot pretty
plt.ylim((-0.02, 1.02))
plt.xlabel(r'$d/\theta$')
plt.ylabel(r'$l/\theta$')
Out[11]:
<matplotlib.text.Text at 0x1120810d0>

A comment on the model parameters

We went through some algebraic manipulations to get model 2 in a form with two parameters. It is important to identify independent parameters in your model before doing regression analysis. In a trivial example, imagine someone proposed the following model to use in a regression on $(x, y)$ data:

\begin{align} y = ax + bx + c. \end{align}

Obviously, it would be silly to have both $a$ and $b$ as regression parameters, and we should instead define a new parameter $d = a + b$ and use that as a regression parameter. In the case of spindle length, we had parameters $T_0$, $T_\mathrm{min}$, $T_\mathrm{s}$, $k$, $\alpha$, and $\beta$, but, as we saw, we can only resolve two parameters, $\gamma$ and $\theta$. Furthermore, if we happen to be in the linear regime, $\theta$ does not enter the expressions, so we obviously cannot resolve it. Similary, we can only determine $\theta$ if we are in the plateau regime.

Finally, we note that if the parameters are such that we are in the plateau regime, we cannot discern model 1 from model 2.

Loading the data

Matt Good sent me beautifully organized and formatted data files (you should have these available for all of your publications!). We will load in the data used to generate Fig. 1C of the paper.

In [4]:
# Load data into DataFrame
df = pd.read_csv('../data/good_et_al/invitro_droplet_data.csv', 
                 comment='#')

# Check it out (only works in iPython)
df.head()
Out[4]:
Droplet Diameter (um) Droplet Volume (uL) Spindle Length (um) Spindle Width (um) Spindle Area (um2)
0 27.1 0.000010 28.9 10.8 155.8
1 28.2 0.000012 22.7 7.2 81.5
2 29.4 0.000013 26.2 10.5 138.3
3 31.0 0.000016 19.2 9.4 90.5
4 31.0 0.000016 28.4 12.1 172.4

Let's rename the columns for more convenient access later. We should also put the droplet volume in units of µm$^3$ so that we have consistent units throughout the DataFrame.

In [5]:
# Rename them all at once
df.columns = ['droplet_diameter', 'droplet_volume', 'spindle_length', 
              'spindle_width', 'spindle_area']

# Put droplet volume in units of µm^3 for consistency of units
df.droplet_volume *= 1e9

Checking the model assumptions

Is $V_\mathrm{s} / V_0 \ll 1$?

Let's do a quick verification that the droplet volume is indeed much larger than the spindle volume. Remember, the spindle volume for a prolate spheroid of length $l$ and width $w$ is $V_\mathrm{s} = \pi l w^2 / 6$.

In [6]:
# Compute spindle volume
spindle_volume = np.pi * df.spindle_length * df.spindle_width**2 / 6.0

# Compute the ratio V_s / V_0
vol_ratio = spindle_volume / df.droplet_volume

# Plot a histogram of the results
n, b, p = plt.hist(vol_ratio, bins=np.sqrt(len(vol_ratio)))
plt.xlabel(r'$V_\mathrm{s} / V_0$')
plt.ylabel('count')
Out[6]:
<matplotlib.text.Text at 0x11118b690>

We see that for the vast majority of spindles that were measured, $V_\mathrm{s} / V_0$ is small.

Do all spindles have the same aspect ratio $k$?

In setting up our model, we assumed that all spindles had the same aspect ratio. We can check this assumption because we have the data to do so available to us.

In [7]:
# Compute the aspect ratio
k = df.spindle_width / df.spindle_length

# Plot a histogram of the results
n, b, p = plt.hist(k, bins=np.sqrt(len(k)))
plt.xlabel(r'$k$')
plt.ylabel('count')
Out[7]:
<matplotlib.text.Text at 0x111814ad0>

The mean aspect ratio is about 0.4, and we see spindle lengths about $\pm 25\%$ of that. This could be significant variation, but we will leave it to the homework to test that. For the purposes of this tutorial, we will assume that $k$ is constant.

Exploratory data analysis

As we have mentioned before in class, it is often useful to explore the data by plotting it ahead of more formal analysis. So, to start, we will plot the data generating a plot similar to Fig. 1C in the paper. We plot spindle length vs. droplet size.

In [13]:
# Use alpha when you have lots of points; helps see overlap
plt.plot(df.droplet_diameter, df.spindle_length, 'k.', alpha=0.25)

# Use unicode to do plot label with microns with nicely formatted mu
plt.ylim((0.0, 50.0)) # Be sure to include origin
plt.xlabel(u'droplet diameter (µm)')
plt.ylabel(u'spindle length (µm)');

At first glance, it looks as though the plateau region is very sparcely sampled. Since we are focusing on the linear regime in this tutorial, we will focus on drops with diameter less than 90 µm.

In [14]:
# Get indices for droplet diameters <= 90 for convenience going forward
inds_90 = df.droplet_diameter <= 90.0

# Redo plot only with smaller droplets
# Use alpha when you have lots of points; helps see overlap
plt.plot(df.droplet_diameter[inds_90], df.spindle_length[inds_90], 'k.', 
         alpha=0.25)

# Use unicode to do plot label with microns with nicely formatted mu
plt.xlim((0.0, 80.0))
plt.ylim((0.0, 50.0))
plt.xlabel(u'droplet diameter (µm)')
plt.ylabel(u'spindle length (µm)');

Just by looking at this plot, we see that our interpretation of Model 2 may not be right. If these data are well-described by a line, the intercept is clearly not zero, as Model 2 requires. It is possible we are not in the regime where spindle diameters depend linearly on the droplet diameter, or Model 2 is not correct.

Summary of models

So, going forward, here is how we will define our models.

Model 1

In Model 1, the spindle size is independent of droplet size. The corresponding equation is

\begin{align} l = \theta. \end{align}

Model 2

We define by Model 2 the full relation between spindle length and droplet diameter,

\begin{align} l(d) \approx \frac{\gamma d}{\left(1+(\gamma d/\theta)^3\right)^{\frac{1}{3}}}. \end{align}

Model 2a

We will call "Model 2a" the model in which we assume we are in the linear regime. Here, the equation relating spindle length to droplet diameter is

\begin{align} l(d) = \gamma d. \end{align}

Model 2b

Finally, we consider "Model 2b", which assumes are are in the nonlinear regime, but we approximate it as a linear function. We could get this result by applying a Taylor series expansion about point $d_0$ in the nonlinear regime, giving

\begin{align} l(d) = \nu + \zeta d, \end{align}

where

\begin{align} \nu &= \frac{(\gamma d_0/\theta)^4}{\left(1+(\gamma d_0/\theta)^3\right)^{\frac{4}{3}}}, \\[1mm] \zeta &= \frac{\gamma/\theta}{\left(1+(\gamma d_0/\theta)^3\right)^{\frac{4}{3}}}. \end{align}

This model is kind of silly because it involves unnecessarily linearizing the function about a point that is, by construction, nonlinear. However, it serves to help us learn about linear regression, so it has pedagogical use.