Building a model
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import numpy as np
import pandas as pd
import iqplot
import bebi103
import bokeh.io
bokeh.io.output_notebook()
import holoviews as hv
hv.extension('bokeh')
bebi103.hv.set_defaults()
As machine learning methods grow in power and prominence, and as data acquisition becomes ever easier, we see more and more methods in which a machine “learns” directly from data. Over a decade ago, Chris Anderson wrote an article in Wired Magazine entitled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” Anderson claimed that because we have access to large data sets, we no longer need the scientific method of testable hypotheses. Specifically, he says we do not need models; we can just use lots and lots of data to make predictions. This is absurd, because if we only try to learn from data, we do not learn anything fundamental about how nature works. If you are working for Netflix and trying to figure out what movies people want to watch, learning from data is fine. But if you’re a scientist and want to increase knowledge, you need models.
In this lesson, we introduce two competing models for how the size of mitotic spindles is set. I take the time to set up these models because it is important to have a firm grasp of the theory behind our models.
What sets the size of mitotic spindles?
Matt Good and coworkers (Science, 2013) developed a microfluidic device where they could create droplets of cytoplasm extracted from Xenopus eggs and embryos, as shown in the figure below (scale bar 20 µm; image taken from the paper).

A remarkable property of Xenopus extract is that mitotic spindles form spontaneously; the extracted cytoplasm has all the ingredients to form them. This makes it an excellent model system for studying spindles. With their device, Good and his colleagues were able to study how the size of the cell affects the dimensions of the mitotic spindle, a simple yet beautiful question. The experiment is conceptually simple: they made the droplets and then measured their dimensions and the dimensions of the spindles using microscope images.
Let’s take a quick look at the result.
[2]:
hv.extension("bokeh")
df = pd.read_csv(os.path.join(data_path, 'good_invitro_droplet_data.csv'), comment='#')
hv.Scatter(
    data=df,
    kdims=['Droplet Diameter (um)'],
    vdims=['Spindle Length (um)']
)
We now propose two models for how the droplet diameter affects the spindle length.
1. The spindles have an inherent length, independent of droplet diameter.
2. The length of spindles is determined by the total amount of tubulin available to make them.
Model 1: Spindle size is independent of droplet size
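As a first model, we propose that the size of a mitotic spindle is inherent to the spindle itself. This means that the size of the spindle is independent of the size of the droplet or cell in which it resides. This would be the case, for example, if construction of the spindle involves length-sensing molecules, such as depolymerizing motor proteins. We define that set length as $\phi$.
The statistical model
Not all spindles will be measured to be exactly $\phi$. Writing $l_i$ for the measured length of spindle $i$, we have
$$l_i = \phi + e_i,$$
where $e_i$ is the deviation of measurement $i$ from the set length.
So, we have a theoretical model for spindle length, $l = \phi$, and to make it fully generative we need to model the deviations $e_i$. We make two assumptions.
1. Each measured spindle’s length is independent of all others.
2. The variability in measured spindle length is Normally distributed.
With these assumptions, we can write the probability density function for $l_i$ as
$$f(l_i ; \phi, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left[-\frac{(l_i - \phi)^2}{2\sigma^2}\right].$$
Since each measurement is independent, we can write the joint probability density function of the entire data set, which we will define as $\mathbf{l} = \{l_1, l_2, \ldots, l_n\}$, as
$$f(\mathbf{l} ; \phi, \sigma) = \prod_{i=1}^n f(l_i ; \phi, \sigma) = \frac{1}{(2\pi\sigma^2)^{n/2}}\,\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (l_i - \phi)^2\right].$$
We can write this more succinctly, and perhaps more intuitively, as
$$l_i \sim \text{Norm}(\phi, \sigma) \;\; \forall i.$$
We will generally write our models in this format, which is easier to parse and understand. Note that in writing this generative model, we have necessarily introduced another parameter, $\sigma$, the standard deviation of the Normal distribution. The model therefore has two parameters, $\phi$ and $\sigma$.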
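As a rough illustration of what "generative" means here, the sketch below draws a synthetic data set from model 1 and evaluates the log of the joint probability density. It writes the set length as `phi` and the Normal standard deviation as `sigma`; the numerical values, the seed, and the helper name `log_joint_pdf` are hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
phi = 35.0    # inherent spindle length (µm)
sigma = 4.0   # standard deviation of measured lengths (µm)
n = 500       # number of measured spindles

rng = np.random.default_rng(3252)

# Generative model 1: l_i ~ Norm(phi, sigma), independently for each i
l = rng.normal(phi, sigma, size=n)


def log_joint_pdf(l, phi, sigma):
    """Log of the joint Normal PDF of independent measurements l."""
    return np.sum(
        -0.5 * np.log(2 * np.pi * sigma**2) - (l - phi) ** 2 / (2 * sigma**2)
    )


print("sample mean:", l.mean())
print("sample standard deviation:", l.std(ddof=1))
print("log joint PDF at generating parameters:", log_joint_pdf(l, phi, sigma))
```

Because the measurements are independent, the log joint density is just a sum of per-measurement terms, which is why it is evaluated with a single `np.sum`.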
Model 2: Spindle length is set by total amount of tubulin
The cartoon model
The three key principles of this “cartoon” model are:
1. The total amount of tubulin in the droplet or cell is conserved.
2. The total length of polymerized microtubules is a function of the total tubulin concentration after assembly of the spindle. This results from the balance of microtubule polymerization rate with catastrophe frequency.
3. The density of tubulin in the spindle is independent of droplet or cell volume.
The mathematical model
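From these principles, we need to derive a mathematical model that will provide us with testable predictions. The derivation follows below (following the derivation presented in the paper), and you may read it if you are interested. Since our main focus here is building a statistical model, you can skip ahead to the final equation, where we define a mathematical expression relating the spindle length, $l$, to the droplet diameter, $d$.
Principle 1 above (conservation of tubulin) implies
$$T_0 V_0 = T_1 (V_0 - V_s) + T_s V_s,$$
where $V_0$ is the volume of the droplet or cell, $V_s$ is the volume of the spindle, $T_0$ is the total tubulin concentration (polymerized or not), $T_1$ is the tubulin concentration in the cytoplasm after the spindle has formed, and $T_s$ is the tubulin concentration in the spindle. If the spindle takes up little of the total volume ($V_s \ll V_0$), this gives
$$T_1 \approx T_0 - \frac{V_s}{V_0}\,T_s.$$
The amount of tubulin in the spindle can be written in terms of the total length of polymerized microtubules, $L_\mathrm{MT}$, as
$$T_s V_s = \alpha L_\mathrm{MT},$$
where $\alpha$ is the amount of tubulin per unit microtubule length.
We now formalize principle 2 into a mathematical expression. Microtubule length should grow with increasing $T_1$, and there should be a minimal concentration, $T_\mathrm{min}$, below which polymerization stops. We approximate the total microtubule length as a linear function of $T_1$ above this threshold,
$$L_\mathrm{MT} \approx \begin{cases} 0 & T_1 \le T_\mathrm{min}, \\ \beta\,(T_1 - T_\mathrm{min}) & T_1 > T_\mathrm{min}. \end{cases}$$
Because spindles form in Xenopus extract, $T_0 > T_\mathrm{min}$, and we take $T_1 > T_\mathrm{min}$ going forward. We then have
$$V_s \approx \frac{\alpha\beta}{T_s}\,(T_1 - T_\mathrm{min}).$$
With insertion of our expression for $T_1$, this becomes
$$V_s \approx \alpha\beta\left(\frac{T_0 - T_\mathrm{min}}{T_s} - \frac{V_s}{V_0}\right).$$
Solving for $V_s$, we have
$$V_s \approx \frac{V_0}{1 + V_0/\alpha\beta}\,\frac{T_0 - T_\mathrm{min}}{T_s}.$$
We approximate the shape of the spindle as a prolate spheroid with major axis length $l$ and minor axis length $w$, giving
$$V_s = \frac{\pi}{6}\,l w^2 = \frac{\pi}{6}\,k^2 l^3,$$
where $k \equiv w/l$ is the aspect ratio of the spindle. The spindle length is then
$$l \approx \left(\frac{6}{\pi k^2}\,\frac{T_0 - T_\mathrm{min}}{T_s}\,\frac{V_0}{1 + V_0/\alpha\beta}\right)^{1/3}.$$
For small droplets, with $V_0 \ll \alpha\beta$, this becomes
$$l \approx \left(\frac{6}{\pi k^2}\,\frac{T_0 - T_\mathrm{min}}{T_s}\,V_0\right)^{1/3} = \left(\frac{T_0 - T_\mathrm{min}}{k^2 T_s}\right)^{1/3} d,$$
where $d$ is the diameter of the spherical droplet or cell, so that $V_0 = \pi d^3/6$. We therefore expect the spindle length to increase linearly with droplet diameter for small droplets.
For large $V_0$, the spindle length becomes independent of $d$, with
$$l \approx \left(\frac{6\alpha\beta}{\pi k^2}\,\frac{T_0 - T_\mathrm{min}}{T_s}\right)^{1/3}.$$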
Identifiability of parameters
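We measure the spindle length $l$ and the droplet diameter $d$ directly from the images, and we can measure the spindle aspect ratio $k$ from the images as well. Because the tubulin per unit microtubule length, $\alpha$, and the microtubule growth constant, $\beta$, appear only as the product $\alpha\beta$, that leaves four unknown parameters.

| parameter | meaning |
|---|---|
| $\alpha\beta$ | rate constant for MT growth |
| $T_0$ | total tubulin concentration |
| $T_\mathrm{min}$ | critical tubulin concentration for polymerization |
| $T_s$ | tubulin concentration in the spindle |

We would like to determine all of these parameters. We could measure them all either in this experiment or in other experiments. We could measure the total tubulin concentration $T_0$, for instance, in a separate experiment.
Importantly, though, the parameters only appear in combinations with each other in our theoretical model. Specifically, we can define two parameters,
$$\gamma = \left(\frac{T_0 - T_\mathrm{min}}{k^2 T_s}\right)^{1/3}, \qquad \phi = \gamma\left(\frac{6\alpha\beta}{\pi}\right)^{1/3}.$$
We can then rewrite the general model expression in terms of these parameters as
$$l(d ; \gamma, \phi) \approx \frac{\gamma d}{\left(1 + (\gamma d/\phi)^3\right)^{1/3}}.$$
If we tried to determine all four parameters from this experiment only, we would be in trouble. This experiment alone cannot distinguish all of the parameters. Rather, we can only distinguish two combinations of them, which we have defined as $\gamma$ and $\phi$.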
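The identifiability problem can be made concrete with a small numerical sketch. It assumes the combinations $\gamma = \left((T_0 - T_\mathrm{min})/(k^2 T_s)\right)^{1/3}$ and $\phi = \gamma\,(6\alpha\beta/\pi)^{1/3}$, which are consistent with the model curve plotted in this lesson. The function names and all numerical values are hypothetical and in arbitrary but mutually consistent units. Two different sets of physical parameters that map to the same $(\gamma, \phi)$ give identical predicted curves, so measurements of $l$ versus $d$ cannot tell them apart.

```python
import numpy as np


def gamma_phi(alpha_beta, T0, Tmin, Ts, k):
    """Map the physical parameters to the two identifiable combinations."""
    gamma = ((T0 - Tmin) / (k**2 * Ts)) ** (1 / 3)
    phi = gamma * (6 * alpha_beta / np.pi) ** (1 / 3)
    return gamma, phi


def spindle_length(d, gamma, phi):
    """Spindle length predicted by model 2 for droplet diameter d."""
    return gamma * d / np.cbrt(1 + (gamma * d / phi) ** 3)


k = 0.4  # aspect ratio, assumed known from the images

# Two hypothetical parameter sets. The second doubles both (T0 - Tmin)
# and Ts, which leaves the ratio (T0 - Tmin) / Ts unchanged.
params_a = dict(alpha_beta=2.0e4, T0=10.0, Tmin=2.0, Ts=100.0)
params_b = dict(alpha_beta=2.0e4, T0=18.0, Tmin=2.0, Ts=200.0)

gamma_a, phi_a = gamma_phi(k=k, **params_a)
gamma_b, phi_b = gamma_phi(k=k, **params_b)

d = np.linspace(20, 200, 50)
l_a = spindle_length(d, gamma_a, phi_a)
l_b = spindle_length(d, gamma_b, phi_b)

print("gamma:", gamma_a, gamma_b)
print("phi:", phi_a, phi_b)
print("identical predictions:", np.allclose(l_a, l_b))
```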
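Visualizing the mathematical model
Let’s take a quick look at the mathematical model so we can see how the curve looks. It’s best to nondimensionalize the diameter by $\phi$, since $l/\phi$ then depends only on $d/\phi$ and $\gamma$,
$$\frac{l}{\phi} = \frac{\gamma\,d/\phi}{\left(1 + (\gamma\,d/\phi)^3\right)^{1/3}}.$$
So, we will plot $l/\phi$ versus $d/\phi$ for several values of $\gamma$.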
[3]:
hv.extension("bokeh")
def theor_spindle_length(gamma, d):
    """Compute dimensionless spindle length l/φ from dimensionless diameter d/φ."""
    return gamma * d / np.cbrt(1 + (gamma * d)**3)


# Dimensionless droplet diameters, d/φ
d = np.linspace(0, 20, 200)


def plot_theor(gamma):
    """Plot the dimensionless model curve for a given value of γ."""
    return hv.Curve(
        data=(d, theor_spindle_length(gamma, d)),
        kdims=['d/φ'],
        vdims=['l/φ'],
        label=f'γ = {gamma}',
    ).opts(
        color=hv.Cycle(list(bokeh.palettes.Blues7[1:-1][::-1]))
    )


plots = [plot_theor(gamma) for gamma in [0.03, 0.1, 0.3, 0.7, 1.0]]
hv.Overlay(plots)
The curve grows from zero to a plateau at $l/\phi = 1$, that is, at $l = \phi$.
Limiting behavior
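For large droplets, with $d \gg \phi/\gamma$, the spindle size becomes independent of $d$, with
$$\lim_{d\to\infty} l(d ; \gamma, \phi) = \phi.$$
Conversely, for $d \ll \phi/\gamma$, the spindle length varies approximately linearly with diameter,
$$l(d ; \gamma, \phi) \approx \gamma d.$$
Note that the expression for the linear regime gives bounds for $\gamma$. Because the spindle must fit in the droplet, $l \le d$, so $\gamma \le 1$. Since $\gamma > 0$ on physical grounds, we have $0 < \gamma \le 1$.
Importantly, if the experiment is done in the regime where $d$ is large (and we do not know a priori how large that is, since we do not know $\phi$ and $\gamma$), we cannot tell the difference between the two models, since they are equivalent in that regime. Further, in this regime $\gamma$ cannot be resolved from the data, so the model is unidentifiable.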
This sounds kind of dire, but this is actually a convenient fact. The second model is more complex, but it has the simpler model, model 1, as a limit. Thus, the two models are in fact commensurate with each other. Knowledge of how these limits work also enhances the experimental design. We should strive for small droplets. And perhaps most importantly, if we didn’t consider the second model, we might automatically assume that droplet size has nothing to do with spindle length if we simply did the experiment in larger droplets.
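The two limits are easy to check numerically. The sketch below assumes the model curve $l = \gamma d / (1 + (\gamma d/\phi)^3)^{1/3}$ with hypothetical values of $\gamma$ and $\phi$, and compares it with the linear approximation $\gamma d$ well below the crossover diameter $\phi/\gamma$ and with the plateau $\phi$ well above it.

```python
import numpy as np


def spindle_length(d, gamma, phi):
    """Spindle length predicted by model 2 for droplet diameter d."""
    return gamma * d / np.cbrt(1 + (gamma * d / phi) ** 3)


# Hypothetical parameter values, chosen only for illustration
gamma = 0.8
phi = 35.0  # µm

# Crossover diameter separating the linear regime from the plateau
d_cross = phi / gamma

d_small = 0.05 * d_cross  # deep in the linear regime
d_large = 20.0 * d_cross  # deep in the plateau regime

l_small = spindle_length(d_small, gamma, phi)
l_large = spindle_length(d_large, gamma, phi)

# Relative deviations from the two limiting expressions
rel_err_linear = abs(l_small - gamma * d_small) / (gamma * d_small)
rel_err_plateau = abs(l_large - phi) / phi

print("relative error of l ≈ γd for small d:", rel_err_linear)
print("relative error of l ≈ φ for large d:", rel_err_plateau)
```

In the plateau regime the predicted length is essentially $\phi$ whatever the value of $\gamma$, which is the numerical face of the identifiability problem described above.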
Generative model
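We have a theoretical model relating the droplet diameter to the spindle length. Let us now build a generative model. For spindle-droplet pair $i$, we assume
$$l_i = \frac{\gamma d_i}{\left(1 + (\gamma d_i/\phi)^3\right)^{1/3}} + e_i.$$
We will assume that $e_i$ is Normally distributed with variance $\sigma^2$. This leads to a statistical model of
$$f(l_i ; d_i, \gamma, \phi, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left[-\frac{1}{2\sigma^2}\left(l_i - \frac{\gamma d_i}{\left(1 + (\gamma d_i/\phi)^3\right)^{1/3}}\right)^2\right],$$
which is equivalently stated as
$$\mu_i = \frac{\gamma d_i}{\left(1 + (\gamma d_i/\phi)^3\right)^{1/3}}, \qquad l_i \sim \text{Norm}(\mu_i, \sigma) \;\; \forall i.$$
Importantly, note that this model builds upon our first model. Generally, when doing modeling, it is a good idea to build more complex models on your initial baseline model such that the models are related to each other by limiting behavior. This gives you a continuum of models and a sound basis for making comparisons among models.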
Note that we are assuming the droplet diameters are known. When we generate data sets for prior predictive checks, we will randomly generate them from about 20 µm to 200 µm, since this is the range achievable with the microfluidic device.
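A minimal sketch of drawing one synthetic data set from this generative model follows. It assumes the droplet diameters are drawn uniformly between 20 µm and 200 µm (the text only gives the range, so the uniform distribution is an assumption), and the values of `gamma`, `phi`, and `sigma` are hypothetical rather than draws from any prior.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Hypothetical parameter values, chosen only for illustration
gamma = 0.8   # dimensionless, 0 < gamma <= 1
phi = 35.0    # plateau spindle length (µm)
sigma = 4.0   # standard deviation of spindle length (µm)
n = 500       # number of spindle-droplet pairs

# Droplet diameters are treated as known; here they are drawn uniformly
# over the range reachable with the device (an assumption of this sketch)
d = rng.uniform(20.0, 200.0, size=n)

# Theoretical spindle length for each droplet
mu = gamma * d / np.cbrt(1 + (gamma * d / phi) ** 3)

# Measured lengths: l_i ~ Norm(mu_i, sigma)
l = rng.normal(mu, sigma)

print("diameter range (µm):", d.min(), d.max())
print("mean residual (µm):", (l - mu).mean())
```

Plotting `l` against `d` gives a synthetic counterpart to the scatter plot of the real measurements.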
Checking model assumptions
In deriving the mathematical model, we made a series of assumptions. It is generally a good idea to check to see if assumptions in the mathematical modeling are realized in the experiment. If they are not, you may need to relax the assumptions and have a potentially more complicated model (which may suffer from identifiability issues). This underscores the interconnection between modeling and experimental design. You can allow for modeling assumptions and identifiability if you design your experimental parameters to meet the assumptions (e.g., choosing the appropriate range of droplet sizes).
Let’s do a quick verification that the droplet volume is indeed much larger than the spindle volume. Remember, the spindle volume for a prolate spheroid of length $l$ and width $w$ is $V_s = \pi l w^2/6$.
[4]:
# Compute spindle volume
spindle_volume = np.pi * df['Spindle Length (um)'] * df['Spindle Width (um)']**2 / 6
# Compute the ratio V_s / V_0 (taking care of units)
vol_ratio = spindle_volume / df['Droplet Volume (uL)'] * 1e-9
# Plot an ECDF of the results
bokeh.io.show(iqplot.ecdf(vol_ratio.values, x_axis_label='Vs/V0'))
We see that for pretty much all spindles that were measured, $V_s/V_0$ is small, so this is a sound assumption.
In setting up our model, we assumed that all spindles had the same aspect ratio. We can check this assumption because we have the data to do so available to us.
[5]:
# Compute the aspect ratio
k = df['Spindle Width (um)'] / df['Spindle Length (um)']
# Plot ECDF
bokeh.io.show(iqplot.ecdf(k.values, x_axis_label='k'))
The median aspect ratio is about 0.4, and we see spindle lengths about
These checks highlight the importance of testing your modeling assumptions against your data. Always a good idea!
Computing environment
[6]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,holoviews,iqplot,bebi103,jupyterlab
CPython 3.8.5
IPython 7.19.0
numpy 1.19.2
pandas 1.1.3
bokeh 2.2.3
holoviews 1.13.5
iqplot 0.1.6
bebi103 0.1.1
jupyterlab 2.2.6