Model building with prior predictive checks¶
[1]:
import numpy as np
import pandas as pd
import scipy.stats as st
import cmdstanpy
import arviz as az
import bokeh_catplot
import bebi103
import holoviews as hv
hv.extension('bokeh')
bebi103.hv.set_defaults()
import bokeh.io
bokeh.io.output_notebook()
import panel as pn
pn.extension()
When we do generative modeling, it is important to understand what kind of data might be generated from our model before looking at the data. You may have been struggling to come up with good priors for your models. I find that this struggle is a symptom of the fact that we tend to think about what kind of data we might see in an experiment, rather than about how the parameters of the generative process that produces the data might be distributed.
In this lesson, we discuss how prior predictive checks can help build appropriate priors for a model. The procedure is used to check that the generative model can actually generate reasonable data sets.
Building a generative model¶
When considering how to build the model, we should think about the data generation process. We generally specify the likelihood first, and then use it to identify what the parameters are, which tells us what we need to specify for the prior. Indeed, the prior can often only be understood in the context of the likelihood.
Note, though, that the split between the likelihood and prior need not even be explicit. When we study hierarchical models, we will see that it is not obvious how to unambiguously define the likelihood and prior. Rather, in order to do Bayesian inference, we really only need to specify the joint distribution, \(\pi(y,\theta)\), which is the product of the likelihood and prior.
\begin{align} g(\theta \mid y) = \frac{f(y\mid \theta)\,g(\theta)}{f(y)} = \frac{\pi(y, \theta)}{f(y)}. \end{align}
After defining the joint distribution, either by defining a likelihood and a prior separately (which is what we usually do) or by directly defining the joint, we should check to make sure that making draws of data sets \(y\) out of it gives reasonable results. In other words, we need to ask if the data generation process we have prescribed actually produces data that obey physical constraints and match our intuition. The procedure of prior predictive checks enables this. We use the joint distribution to generate parameter sets and data sets and then check if the results make sense. We will learn how to do this using both Numpy and Stan in this lesson.
The data set¶
The data set we will use for this analysis is familiar and comes from Good, et al., Cytoplasmic volume modulates spindle size during embryogenesis Science, 342, 856-860, 2013. You can download the data set here. Good and colleagues measured the length of mitotic spindles enclosed in droplets and investigated how the spindle length varied with droplet diameter.
Here is a quick peek at the data.
[2]:
# Load in Data Frame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')
hv.Scatter(
data=df,
kdims='Droplet Diameter (um)',
vdims='Spindle Length (um)',
)
[2]:
We have already built generative models for this in our lesson on brute force plotting of posteriors and our lesson on regression using MCMC. We have been working with two models: the independent size model, in which the length of the spindle is independent of the droplet diameter, and the conserved tubulin model, in which the spindle length \(l\) depends on the total amount of tubulin present in the droplet.
We will build generative models for each, starting with the independent size model, and use prior predictive checks to home in on the priors.
In order to do the prior predictive checks, we will assume we will measure 1000 spindle lengths in droplets with diameters ranging from about 25 µm to 250 µm. Though we do have a data set available (which has 670 measurements concentrated between droplet diameters of 25 and 75 µm), we will assume we can make droplets uniformly throughout that range. We will therefore take our droplet diameters to range uniformly from 25 to 250 µm.
[3]:
# Number of measurements
N = 1000
# Droplet diameters
d = np.linspace(25, 250, N)
Model 1: Spindle size is independent of droplet size¶
As a first model, we propose that the size of a mitotic spindle is inherent to the spindle itself. This means that the size of the spindle is independent of the size of the droplet or cell in which it resides. This would be the case, for example, if construction of the spindle involves length-sensing molecules, such as depolymerizing motor proteins. We define that set length as \(\phi\).
The likelihood¶
Not all spindles will be measured to be exactly \(\phi\) µm in length. Rather, there may be some variation about \(\phi\) due to natural variation and measurement error. So, we would expect the measured length of spindle \(i\) to be
\begin{align} l_i = \phi + e_i, \end{align}
where \(e_i\) is the noise component of the \(i\)th datum.
So, we have a theoretical model for spindle length, \(l = \phi\), and to get a fully generative model, we need to model the errors \(e_i\). A reasonable model assumes:
1. Each measured spindle’s length is independent of all others.
2. The variability in measured spindle length is Gaussian.
With these assumptions, we can write the probability density function for \(l_i\) as
\begin{align} f(l_i \mid \phi, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}\,\exp\left[-\frac{(l_i - \phi)^2}{2\sigma^2}\right]. \end{align}
Since each measurement is independent, we can write the likelihood of the entire data set, which we will define as \(l = \{l_1, l_2,\ldots\}\), consisting of \(n\) total measurements.
\begin{align} f(l \mid \phi, \sigma) = \prod_{i} f(l_i \mid \phi, \sigma) = \frac{1}{(2\pi \sigma^2)^{n/2}}\,\exp\left[-\frac{1}{2\sigma^2}\sum_{i}(l_i - \phi)^2\right]. \end{align}
We can write this more succinctly, and perhaps more intuitively, as
\begin{align} l_i \sim \text{Norm}(\phi, \sigma) \;\;\forall i. \end{align}
We will generally write our models in this format, which is easier to parse and understand. It also looks similar to a Stan specification, as we will soon see.
Note that in writing this generative model, we have necessarily introduced another parameter, \(\sigma\), the standard deviation parametrizing the Normal distribution. So, we have two parameters in our model, \(\phi\) and \(\sigma\).
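To make the connection between this succinct notation and an actual computation concrete, here is a minimal sketch that evaluates the log likelihood for hypothetical parameter values and made-up measurements (the numbers are purely illustrative, and the code relies on the Numpy and SciPy imports at the top of this notebook).

# Hypothetical parameter values and made-up measurements (µm)
phi, sigma = 20.0, 4.0
ell = np.array([22.1, 18.5, 24.3, 19.8])

# Because the measurements are i.i.d., the log likelihood is a sum of log PDFs
log_like = np.sum(st.norm.logpdf(ell, phi, sigma))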
The prior¶
With our likelihood specified, we need to give a prior for \(\phi\) and \(\sigma\), \(g(\phi, \sigma)\). We will assume that \(\phi\) and \(\sigma\) are independent; that is, \(g(\phi, \sigma) = g(\phi)\,g(\sigma)\). So, we need to specify separate priors for \(\phi\) and \(\sigma\).
Let’s start with \(\phi\). We ask, “How big are spindles?” and perform an order-of-magnitude estimate to establish the prior. I think a mitotic spindle should be somewhere around 20 µm. It is certainly no smaller than 1 µm, and probably not bigger than 100 µm, which is on the large end of the typical size of a eukaryotic cell. (I know it sounds strange to speak of the typical size of a eukaryotic cell, since they come in so many varieties, but 100 µm seems like a good upper limit to me.) I’m pretty uncertain, though, so I should not assign a sharp distribution. A broad distribution centered at \(\phi = 20\) µm will suffice.
\begin{align} \phi \sim \text{Norm}(20, 20). \end{align}
Next, let’s try \(\sigma\). How much do we expect the spindle length to vary? I would think that 20% variation seems reasonable, but it could be more or less. So, I will again choose a broad distribution centered at \(\sigma = 4\) µm.
\begin{align} \sigma \sim \text{Norm}(4, 5). \end{align}
So, we now have a complete model; the likelihood and prior, and therefore the joint distribution, are specified. Here it is (all units are µm):
\begin{align} &\phi \sim \text{Norm}(20, 20), \\[1em] &\sigma \sim \text{Norm}(4, 5), \\[1em] &l_i \sim \text{Norm}(\phi, \sigma) \;\;\forall i. \end{align}
The measurements of the droplet diameters \(d_i\) are irrelevant under this model.
If we look at the model with a critical eye, we can immediately see that it has a problem. It is possible that parameters \(\phi\) and \(\sigma\) could be nonpositive, as could the length, \(l_i\). This is impossible in the case of \(\sigma\) and unphysical in the cases of \(\phi\) and \(l_i\). So, these really should be truncated distributions.
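We can quantify how much prior probability mass these Normal priors place on unphysical values using the Normal CDF.

# Prior probability mass below zero for ϕ ~ Norm(20, 20) and σ ~ Norm(4, 5)
print("P(ϕ < 0) =", st.norm.cdf(0, 20, 20))
print("P(σ < 0) =", st.norm.cdf(0, 4, 5))

About 16% of the prior mass for \(\phi\) and about 21% for \(\sigma\) lie below zero, so this is not a rare pathology.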
That said, it is generally a bad idea to truncate distributions when building models (except at points where their derivatives vanish, as for the Half-Normal distribution). Truncation is both unrealistic and results in more difficult posterior sampling, which is ultimately what we need to do for statistical inference.
We will therefore jettison this model and develop a model that is closer to being physically reasonable. We will keep the same likelihood, since its tails should decay away fast enough to prevent negative values of \(l_i\). We instead focus on the prior.
The prior, take 2¶
Let’s start with \(\sigma\). It is possible that the spindle size is very carefully controlled. It is also possible that it could be highly variable. So, we can choose a Half Normal prior for \(\sigma\) with a large standard deviation. It looks something like this.
[4]:
sigma = np.linspace(0, 40, 200)
hv.Curve(
(sigma, st.halfnorm.pdf(sigma, 0, 10)),
kdims='σ [µm]',
vdims='g(σ) [1/µm]'
).opts(
xlim=(0, 40)
)
[4]:
For \(\phi\), we can instead take a Log-Normal distribution with a median of 20 µm. The Log-Normal distribution has strictly positive support and is useful for modeling parameters for which we only have order-of-magnitude estimates. It looks like this.
[5]:
phi = np.linspace(0, 80, 200)
hv.Curve(
(phi, st.lognorm.pdf(phi, 0.75, loc=0, scale=20)),
kdims='ϕ [µm]',
vdims='g(ϕ) [1/µm]'
).opts(
xlim=(0, 80)
)
[5]:
So, we have an updated model.
\begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\sigma \sim \text{HalfNorm}(0, 10),\\[1em] &l_i \sim \text{Norm}(\phi, \sigma) \;\forall i. \end{align}
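As a quick sanity check of the parametrization, we can verify that this Log-Normal prior indeed has a median of 20 µm. In SciPy’s parametrization, the scale parameter of the Log-Normal distribution is its median.

# Median of the LogNorm(ln 20, 0.75) prior on ϕ; should be 20 µm
st.lognorm.ppf(0.5, 0.75, loc=0, scale=20)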
Prior predictive checks¶
Let us now generate samples out of this generative model. This procedure is known as prior predictive checking. We first generate parameter values by drawing out of the prior distributions for \(\phi\) and \(\sigma\). We then use those parameter values to generate a data set using the likelihood. We repeat this over and over again to see what kind of data sets we might expect from our generative model. Each of these generated data sets is called a prior predictive sample. We can do this efficiently using Numpy’s random number generators.
[6]:
n_ppc_samples = 1000
# Draw parameters out of the prior
phi = np.random.lognormal(np.log(20), 0.75, size=n_ppc_samples)
sigma = np.abs(np.random.normal(0, 10, size=n_ppc_samples))
# Draw data sets out of the likelihood for each set of prior params
ell = np.array([np.random.normal(ph, sig, size=N) for ph, sig in zip(phi, sigma)])
There are many ways we could visualize the results. One informative plot is to make an ECDF of each of the data sets. We will thin them out a bit, taking only every 20th data set, to keep the file size of the plot down.
[7]:
p = bokeh_catplot.ecdf(
ell[0],
x_axis_label="spindle length (µm)",
marker_kwargs=dict(alpha=0.01, line_alpha=0),
)
for ell_vals in ell[19::20]:
p = bokeh_catplot.ecdf(ell_vals, p=p, marker_kwargs=dict(alpha=0.01, line_alpha=0))
bokeh.io.show(p)
This looks reasonable, except for the occasional negative spindle lengths. We will investigate treating these in a moment.
Another option for display is to plot percentiles of the ECDFs. We can do this using the bebi103.viz.predictive_ecdf() function. It expects input as a Numpy array of shape \(n_s \times N\), where \(n_s\) is the number of samples and \(N\) is the number of data points. This is exactly what the variable ell is now.
[8]:
bokeh.io.show(bebi103.viz.predictive_ecdf(ell, x_axis_label="spindle length (µm)"))
In this plot, the median ECDF is shown in the middle, and the colors going out from the median contain the middle 20%, 40%, 60%, and 80% of the ECDFs. The extent of the envelope gives an indication of the extreme values in the prior predictive data sets. The bulk of the spindle lengths lie in a reasonable region, somewhere between zero and 30 µm.
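Conceptually, the envelope in this plot can be computed by sorting each generated data set and then taking percentiles across data sets at each rank. Here is a rough sketch of the idea, using the ell array from above (the actual implementation of bebi103.viz.predictive_ecdf() may differ in its details).

# Sort each prior predictive data set; each row is one data set
ell_sorted = np.sort(ell, axis=1)

# At each rank, take percentiles across the data sets. The 10th and 90th
# percentiles trace out the middle 80% envelope around the median ECDF.
low_ecdf, median_ecdf, high_ecdf = np.percentile(ell_sorted, [10, 50, 90], axis=0)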
We may be willing to tolerate the negative spindle length values in our model, since any data set containing no negative values could easily tug the posterior into positive spindle lengths. Nonetheless, let’s check how many negative spindle lengths we get from our generative model. We will compute the fraction of spindle lengths that are negative in each of the 1000 data sets we generated and then plot the results as an ECDF.
[9]:
p = bokeh_catplot.ecdf(
    (ell < 0).sum(axis=1) / N, x_axis_label="fraction of negative spindle lengths"
)
bokeh.io.show(p)
Nearly half of the data sets we generated had negative spindle lengths, and many of them had a substantial fraction that were negative. The total fraction of negative spindle lengths across all generated data sets can also be calculated.
[10]:
(ell < 0).sum() / (N * n_ppc_samples)
[10]:
0.055756
So, more than 5% of generated data points are unphysical. Again, we could decide to tolerate this, but I will instead propose another model.
The prior, take 3¶
It makes sense that the standard deviation of the spindle length may scale with the spindle length itself. We can express this as
\begin{align} &\sigma_0 \sim \text{Gamma}(2, 10), \\[1em] &\sigma = \sigma_0\, \phi. \end{align}
That is, \(\sigma\) scales linearly with \(\phi\) with constant of proportionality \(\sigma_0\). The prior we have chosen for \(\sigma_0\) looks like this.
[11]:
sigma_0 = np.linspace(0, 1, 200)
hv.Curve(
(sigma_0, st.gamma.pdf(sigma_0, 2, loc=0, scale=0.1)),
kdims='σ₀',
vdims='g(σ₀)'
).opts(
xlim=(0, 1)
)
[11]:
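To check that this prior encodes the intended scale of variability, we can compute its mean and middle 95%. For a Gamma(α, β) distribution, the mean is α/β, which is 0.2 here, consistent with our earlier intuition of roughly 20% variation.

# Mean and middle 95% of the Gamma(2, 10) prior on σ₀
# (SciPy parametrizes the Gamma distribution with scale = 1/β)
mean_sigma_0 = st.gamma.mean(2, loc=0, scale=0.1)
middle_95 = st.gamma.ppf([0.025, 0.975], 2, loc=0, scale=0.1)

mean_sigma_0, middle_95

The middle 95% runs from about 0.02 to 0.56, so the prior favors roughly 20% variability while leaving substantially smaller and larger values plausible.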
So, our new complete generative model is
\begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\sigma = \sigma_0\,\phi,\\[1em] &l_i \sim \text{Norm}(\phi, \sigma)\; \forall i. \end{align}
Prior predictive checks, take 2¶
We can again generate our prior predictive data sets.
[12]:
# Draw parameters out of the prior
phi = np.random.lognormal(np.log(20), 0.75, size=n_ppc_samples)
sigma_0 = np.random.gamma(2, 1 / 10, size=n_ppc_samples)
sigma = sigma_0 * phi
# Draw data sets out of the likelihood for each set of prior params
ell = np.array([np.random.normal(ph, sig, size=N) for ph, sig in zip(phi, sigma)])
# Show the prior predictive ECDF
bokeh.io.show(bebi103.viz.predictive_ecdf(ell, x_axis_label="spindle length (µm)"))
We still get some negative spindle lengths, but far fewer. The distribution of data in the data sets also seems reasonable. Let’s compute what fraction are negative.
[13]:
p = bokeh_catplot.ecdf(
(ell < 0).sum(axis=1) / N, x_axis_label="fraction of negative spindle lengths"
)
bokeh.io.show(p)
Now most data sets do not contain any negative spindle lengths and those that do contain a much smaller fraction. This seems like a more reasonable prior, and we settle on this as our generative model for the case where spindle length is independent of droplet size.
Prior predictive checks with Stan¶
While generating the data sets to be used in prior predictive checks is intuitive and easy using Numpy, I find it is more convenient to do it in Stan. This may not be obvious now, but will become clearer later when we construct an entire workflow including prior and posterior predictive checks and use simulation-based calibration.
Generating prior predictive samples typically does not require Markov chain Monte Carlo, just the ordinary random number generation we are more familiar with. Stan does allow for this kind of random number generation. If you want to draw a sample out of one of Stan’s distributions, append the name of the distribution with _rng. Unlike Numpy’s and Scipy’s random number generators, Stan’s RNGs can only draw one sample at a time. Therefore, we have to put the random number generation in a for loop. The code below demonstrates this.
data {
  int N;
}


generated quantities {
  // Parameters
  real<lower=0> phi;
  real<lower=0> sigma_0;

  // Data
  real ell[N];

  phi = lognormal_rng(3.0, 0.75);
  sigma_0 = gamma_rng(2.0, 10.0);

  for (i in 1:N) {
    ell[i] = normal_rng(phi, sigma_0 * phi);
  }
}
The data block contains N, the number of ell values we want to generate. The generated quantities block defines variables we want to store, in this case phi, sigma_0, and ell. We first generate values for phi and sigma_0 and then use those values to generate data ell. Note that the parameter sigma, which is given by the product of \(\sigma_0\) and \(\phi\), is also generated and used to generate ell. However, we do not need to store it, since it is determined from the sigma_0 and phi samples. Variables in the generated quantities block that are enclosed in braces (outside of the braces of if statements and for loops, of course) are not saved in the output.
The above code will be useful for our prior predictive checks, but if we want to tweak the parametrization of the priors (e.g., we might want \(\phi\sim \text{LogNorm}(4, 1)\) instead of \(\phi \sim \text{LogNorm}(3, 0.75)\)), we will have to adjust the Stan code and recompile. For the purposes of prior predictive checks, we may instead want to include these values in the data block and pass them in as the data kwarg when sampling.
data {
  int N;
  real phi_mu;
  real phi_sigma;
  real sigma_0_alpha;
  real sigma_0_beta;
}


generated quantities {
  // Parameters
  real<lower=0> phi;
  real<lower=0> sigma_0;

  // Data
  real ell[N];

  phi = lognormal_rng(phi_mu, phi_sigma);
  sigma_0 = gamma_rng(sigma_0_alpha, sigma_0_beta);

  for (i in 1:N) {
    ell[i] = normal_rng(phi, sigma_0 * phi);
  }
}
Let’s build this Stan model.
[14]:
sm_prior_pred = cmdstanpy.CmdStanModel(
stan_file="indep_size_model_prior_predictive.stan"
)
INFO:cmdstanpy:stan to c++ (/Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_05/indep_size_model_prior_predictive.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_05/indep_size_model_prior_predictive
To draw the samples, we specify the contents of the inputted data block, and then sample. We need to use the kwarg fixed_param=True to alert Stan that it will not need to do any MCMC sampling, but rather just run the generated quantities block.
[15]:
data = {
"N": N,
"phi_mu": 3.0,
"phi_sigma": 0.75,
"sigma_0_alpha": 2.0,
"sigma_0_beta": 10.0,
}
samples = sm_prior_pred.sample(data=data, sampling_iters=1000, fixed_param=True)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1
As usual, we would like to convert the output to an ArviZ InferenceData instance. Note that due to a bug in ArviZ, which will be fixed in the next release, we need to pass the samples as both the posterior and the prior in order to extract the samples. Additionally, we specify a list of variables we want to designate as prior predictive using the prior_predictive kwarg. Instead of the call to az.from_cmdstanpy() in the code cell below, in the next release we would use
samples = az.from_cmdstanpy(prior=samples, prior_predictive=['ell'])
Use this call if your ArviZ version number is 0.6.2 or above. For 0.6.1 (current as of January 30, 2020), use the call below.
[16]:
samples = az.from_cmdstanpy(posterior=samples, prior=samples, prior_predictive=['ell'])
We can pass the array stored in samples.prior_predictive['ell'] directly into the bebi103.viz.predictive_ecdf() function to get the predictive ECDF.
[17]:
bokeh.io.show(
bebi103.viz.predictive_ecdf(samples.prior_predictive['ell'])
)
The result is of course the same as when generating data sets with Numpy.
As you can see, it is a bit more verbose to use Stan to generate the prior predictive samples. The calculation time is also substantially longer because of the time required for compilation. Nonetheless, it is often advantageous to use Stan for this stage of an analysis pipeline, because when we do simulation-based calibration (SBC) of our models, it is useful to have everything written in Stan.
Model 2: Spindle size dependent on total tubulin concentration¶
While we have already worked with the conserved tubulin model, it is instructive to consider again how it is derived. This enables us to think about model building in terms of identifiability of parameters.
The conserved tubulin model hinges on the following principles:
1. The total amount of tubulin in the droplet or cell is conserved.
2. The total length of polymerized microtubules is a function of the total tubulin concentration after assembly of the spindle. This results from the balance of microtubule polymerization rates with catastrophe frequencies.
3. The density of tubulin in the spindle is independent of droplet or cell volume.
Assumption 1 (conservation of tubulin) implies
\begin{align} T_0 V_0 = T_1(V_0 - V_\mathrm{s}) + T_\mathrm{s}V_\mathrm{s}, \end{align}
where \(V_0\) is the volume of the droplet or cell, \(V_\mathrm{s}\) is the volume of the spindle, \(T_0\) is the total tubulin concentration (polymerized or not), \(T_1\) is the tubulin concentration in the cytoplasm after the spindle has formed, and \(T_\mathrm{s}\) is the concentration of tubulin in the spindle. If we assume the spindle does not take up much of the total volume of the droplet or cell (\(V_0 \gg V_\mathrm{s}\), which is the case as we will see when we look at the data), we have
\begin{align} T_1 \approx T_0 - \frac{V_\mathrm{s}}{V_0}\,T_\mathrm{s}. \end{align}
The amount of tubulin in the spindle can be written in terms of the total length of polymerized microtubules, \(L_\mathrm{MT}\), as
\begin{align} T_s V_\mathrm{s} = \alpha L_\mathrm{MT}, \end{align}
where \(\alpha\) is the tubulin concentration per unit microtubule length. (We will see that it is unimportant, but from the known geometry of microtubules, \(\alpha \approx 2.7\) nmol/µm.)
We formalize assumption 2 into a mathematical expression. Microtubule length should grow with increasing \(T_1\). There should also be a minimal threshold \(T_\mathrm{min}\) where polymerization stops. We therefore approximate the total microtubule length as a linear function,
\begin{align} L_\mathrm{MT} \approx \left\{\begin{array}{ccl} 0 & &T_1 \le T_\mathrm{min} \\ \beta(T_1 - T_\mathrm{min}) & & T_1 > T_\mathrm{min}. \end{array}\right. \end{align}
Because spindles form in Xenopus extract, \(T_0 > T_\mathrm{min}\), so there exists a \(T_1\) with \(T_\mathrm{min} < T_1 < T_0\). Going forward, then, we are assured that \(T_1 > T_\mathrm{min}\). Thus, we have
\begin{align} V_\mathrm{s} \approx \alpha\beta\,\frac{T_1 - T_\mathrm{min}}{T_\mathrm{s}}. \end{align}
With insertion of our expression for \(T_1\), this becomes
\begin{align} V_{\mathrm{s}} \approx \alpha \beta\left(\frac{T_0 - T_\mathrm{min}}{T_\mathrm{s}} - \frac{V_\mathrm{s}}{V_0}\right). \end{align}
Solving for \(V_\mathrm{s}\), we have
\begin{align} V_\mathrm{s} \approx \frac{\alpha\beta}{1 + \alpha\beta/V_0}\,\frac{T_0 - T_\mathrm{min}}{T_\mathrm{s}} =\frac{V_0}{1 + V_0/\alpha\beta}\,\frac{T_0 - T_\mathrm{min}}{T_\mathrm{s}}. \end{align}
We approximate the shape of the spindle as a prolate spheroid with major axis length \(l\) and minor axis length \(w\), giving
\begin{align} V_\mathrm{s} = \frac{\pi}{6}\,l w^2 = \frac{\pi}{6}\,k^2 l^3, \end{align}
where \(k \equiv w/l\) is the aspect ratio of the spindle. We can now write an expression for the spindle length as
\begin{align} l \approx \left(\frac{6}{\pi k^2}\, \frac{T_0 - T_\mathrm{min}}{T_\mathrm{s}}\, \frac{V_0}{1+V_0/\alpha\beta}\right)^{\frac{1}{3}}. \end{align}
For a spherical droplet, \(V_0 \approx \pi d^3 / 6\), giving
\begin{align} l \approx \left( \frac{T_0 - T_\mathrm{min}}{k^2\,T_\mathrm{s}}\, \frac{d^3}{1+\pi d^3/6\alpha\beta}\right)^{\frac{1}{3}}. \end{align}
Identifiability of parameters¶
Good and coworkers measured the spindle length \(l\) and droplet diameter \(d\) directly from their microscope images. They can also measure the spindle aspect ratio \(k\) directly from the images. Thus, we have four unknown parameters, since we already know \(\alpha \approx 2.7\) nmol/µm. The unknown parameters are:
| parameter | meaning |
|---|---|
| \(\beta\) | rate constant for MT growth |
| \(T_0\) | total tubulin concentration |
| \(T_\mathrm{min}\) | critical tubulin concentration for polymerization |
| \(T_\mathrm{s}\) | tubulin concentration in the spindle |
We would like to determine all of these parameters. We could measure them all, either in this experiment or in other experiments. We could measure the total tubulin concentration \(T_0\) by spectroscopic or other quantitative methods on the Xenopus extract. The parameters \(T_\mathrm{min}\) and \(T_\mathrm{s}\) might be assessed by other in vitro assays, though they may be strongly dependent on the conditions of the extract.
Importantly, though, the parameters only appear in combinations with each other in our theoretical model. Specifically, we can define two parameters,
\begin{align} \gamma &= \left(\frac{T_0-T_\mathrm{min}}{k^2T_\mathrm{s}}\right)^\frac{1}{3} \\[1em] \phi &= \gamma\left(\frac{6\alpha\beta}{\pi}\right)^{\frac{1}{3}}. \end{align}
We can then rewrite the general model expression in terms of these parameters as
\begin{align} l(d) \approx \frac{\gamma d}{\left(1+(\gamma d/\phi)^3\right)^{\frac{1}{3}}}. \end{align}
If we tried to determine all four parameters from this experiment only, we would be in trouble. This experiment alone cannot distinguish all of the parameters. Rather, we can only distinguish two combinations of them, which we have defined as \(\gamma\) and \(\phi\). This is an issue of identifiability. We may not be able to distinguish all parameters in a given model, and it is important to think carefully before the analysis about which ones we can identify. We are not quite done with characterizing identifiability.
Limiting behavior¶
For large droplets, with \(d \gg \phi/\gamma\), the spindle size becomes independent of \(d\), with
\begin{align} l \approx \phi. \end{align}
Conversely, for \(d \ll \phi/\gamma\), the spindle length varies approximately linearly with diameter.
\begin{align} l(d) \approx \gamma\,d. \end{align}
Note that the expression for the linear regime gives bounds for \(\gamma\). Obviously, \(\gamma > 0\), lest we have spindles of negative length in the small \(d\) regime. Because \(l \le d\), lest the spindle not fit in the droplet, we also have \(\gamma \le 1\).
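We can verify these limits numerically with a quick sketch of the theoretical curve. The parameter values below are arbitrary, chosen only for illustration.

def spindle_length(d, phi, gamma):
    """Theoretical spindle length for the conserved tubulin model."""
    return gamma * d / np.cbrt(1 + (gamma * d / phi) ** 3)

# Arbitrary illustrative parameter values
phi_test, gamma_test = 40.0, 0.5

# Large-droplet limit: l → ϕ = 40
print(spindle_length(1e6, phi_test, gamma_test))

# Small-droplet limit: l → γd = 0.05
print(spindle_length(0.1, phi_test, gamma_test))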
Importantly, if the experiment is done in the regime where \(d\) is large (and we do not really know a priori how large that is, since we do not know the parameters \(\phi\) and \(\gamma\)), we cannot tell the difference between the conserved tubulin model and the independent size model, since they are equivalent in that regime. Further, if the experiment is in this regime, the model is unidentifiable because we cannot resolve \(\gamma\).
This sounds kind of dire, but it is actually a convenient fact. The conserved tubulin model is more complex, but it has a simpler model, the independent size model, as a limit. Thus, the two models are in fact commensurate with each other. Knowledge of these limits also enhances experimental design: we should strive for small droplets. And perhaps most importantly, if we didn’t consider the second model and happened to do the experiment in larger droplets, we might automatically assume that droplet size has nothing to do with spindle length.
Generative model¶
We have a theoretical model relating the droplet diameter to the spindle length. Let us now build a generative model. For spindle/droplet pair \(i\), we assume
\begin{align} l_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}} + e_i. \end{align}
We again assume that \(e_i\) is Normally distributed with variance \(\sigma^2\). In the independent size model, we assumed that \(\sigma\) scaled with the characteristic spindle size, which was \(\phi\) in that case. We will do the same here, scaling \(\sigma\) with the theoretical spindle size for each droplet diameter. We can choose the same prior as before for \(\phi\), and we choose a Beta prior for \(\gamma\) with a slight preference for intermediate \(\gamma\) values. Our model is then
\begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\gamma \sim \text{Beta}(1.1, 1.1), \\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &\sigma_i = \sigma_0\,\mu_i,\\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma_i) \;\forall i. \end{align}
Importantly, note that this model builds upon our first model. Generally, when doing Bayesian modeling, it is a good idea to build more complex models on your initial baseline model such that the models are related to each other by limiting behavior. This gives you a continuum of models and a sound basis for making comparisons among models.
Prior predictive checks¶
We will now perform the prior predictive checks using Stan.
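For intuition, the same draws could be generated with Numpy, in direct analogy to what we did for the first model. Here is a minimal sketch, reusing n_ppc_samples and the droplet diameters d defined above.

# Draw parameters out of the prior
phi = np.random.lognormal(np.log(20), 0.75, size=n_ppc_samples)
gamma = np.random.beta(1.1, 1.1, size=n_ppc_samples)
sigma_0 = np.random.gamma(2, 1 / 10, size=n_ppc_samples)

# Theoretical spindle length for each parameter set and droplet diameter
gd = gamma[:, np.newaxis] * d
mu = gd / np.cbrt(1 + (gd / phi[:, np.newaxis]) ** 3)

# Draw data sets out of the likelihood; each row is a data set
ell = np.random.normal(mu, sigma_0[:, np.newaxis] * mu)

The Stan program for generating the samples follows.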
functions {
  real ell_theor(real d, real phi, real gamma) {
    real denom_ratio = (gamma * d / phi)^3;
    return gamma * d / (1 + denom_ratio)^(1.0 / 3.0);
  }
}


data {
  int N;
  real d[N];
  real phi_mu;
  real phi_sigma;
  real gamma_alpha;
  real gamma_beta;
  real sigma_0_alpha;
  real sigma_0_beta;
}


generated quantities {
  // Parameters
  real<lower=0> phi;
  real<lower=0, upper=1> gamma;
  real<lower=0> sigma_0;
  real<lower=0> sigma;

  // Data
  real ell[N];

  phi = lognormal_rng(phi_mu, phi_sigma);
  gamma = beta_rng(gamma_alpha, gamma_beta);
  sigma_0 = gamma_rng(sigma_0_alpha, sigma_0_beta);
  sigma = sigma_0 * phi;

  {
    real mu;
    for (i in 1:N) {
      mu = ell_theor(d[i], phi, gamma);
      ell[i] = normal_rng(mu, sigma_0 * mu);
    }
  }
}
Now let’s compile and generate our prior predictive samples!
[18]:
sm_prior_pred = cmdstanpy.CmdStanModel(
stan_file="cons_tubulin_model_prior_predictive.stan"
)
INFO:cmdstanpy:stan to c++ (/Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_05/cons_tubulin_model_prior_predictive.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_05/cons_tubulin_model_prior_predictive
Now, we’ll set up the input data, draw our samples, and convert the result to an ArviZ InferenceData instance.
[19]:
data = {
"N": N,
"d": d,
"phi_mu": 3.0,
"phi_sigma": 0.75,
"gamma_alpha": 1.1,
"gamma_beta": 1.1,
"sigma_0_alpha": 2.0,
"sigma_0_beta": 10.0,
}
samples = sm_prior_pred.sample(data=data, sampling_iters=1000, fixed_param=True)
samples = az.from_cmdstanpy(posterior=samples, prior=samples, prior_predictive=['ell'])
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1
We can start our graphical prior predictive checks by looking at an ECDF of the values of the spindle length, ignoring for a moment the dependence on droplet diameter.
[20]:
bokeh.io.show(
bebi103.viz.predictive_ecdf(
samples.prior_predictive["ell"], x_axis_label="spindle length (µm)"
)
)
This looks quite similar to the first model. There are a few negative values, but we will accept this and move forward.
Next, we would like to look at some scatter plots of spindle length versus droplet diameter. As a start, we can make a plot of the ranges we might expect the data to span using the bebi103.viz.predictive_regression() function. For each value of \(d\), we plot percentiles of the spindle lengths drawn from the generative model. We will plot the middle 30%, 60%, 90%, and 99% to get an idea of how the data sets are distributed.
[21]:
bokeh.io.show(
bebi103.viz.predictive_regression(
samples.prior_predictive['ell'],
samples_x=d,
percentiles=[30, 60, 90, 99],
x_axis_label='droplet diameter [µm]',
y_axis_label='spindle length [µm]'
)
)
These results look reasonable; there are very few values of spindle length below zero, and the spindle length is also only rarely greater than the droplet diameter.
It is also useful to look at individual curves to make sure they are within reason. We can build a quick dashboard to check this out. In fact, dashboarding is quite useful for prior (and posterior) predictive checks. You can read more about dashboarding in last term’s recitation about it.
Note that the interaction will only work in a running Jupyter notebook, and will not work in the HTML rendering of this notebook.
[22]:
draw_slider = pn.widgets.IntSlider(name="draw", start=0, end=1000, step=1, value=0)
@pn.depends(draw_slider.param.value)
def plot_prior_pred_data(draw):
ell = samples.prior_predictive["ell"].sel(chain=0, draw=draw).values
return hv.Scatter(
(d, ell), kdims="droplet diameter [µm]", vdims="spindle length [µm]"
).opts(frame_height=250, frame_width=250, size=2, xlim=(0, 275), ylim=(-5, 140))
@pn.depends(draw_slider.param.value)
def markdown_params(draw):
gamma = float(samples.prior["gamma"].sel(chain=0, draw=draw))
phi = float(samples.prior["phi"].sel(chain=0, draw=draw))
sigma_0 = float(samples.prior["sigma_0"].sel(chain=0, draw=draw))
return f"""
| Parameter | Value |
|:-----------:|-----------:|
| γ | {gamma} |
| ϕ | {phi} |
| σ₀ | {sigma_0} |
"""
widgets = pn.Column(
pn.Spacer(height=30), draw_slider, pn.Spacer(height=30), pn.panel(markdown_params)
)
pn.Row(plot_prior_pred_data, pn.Spacer(width=30), widgets)
[22]:
We do get some negative spindle lengths, but we may be willing to tolerate these, as they are few. We also see all sorts of behaviors. We see some curves in the asymptotic regime where the spindle length is independent of diameter, some in the linear regime, and some that encompass both regimes.
So, we now have two sound generative models that we can use to perform parameter estimation. Performing prior predictive checks helped us hone our models, priors included.
Before moving on to parameter estimation and posterior predictive checks, let’s take a look at the data set and think about some further considerations of the model.
Checking model assumptions¶
Now that we have two generative models in place, let’s take a quick look at the data. I’ll first replot it, as we did at the top of this notebook.
[23]:
hv.Scatter(
data=df,
kdims='Droplet Diameter (um)',
vdims='Spindle Length (um)',
)
[23]:
The experiment more heavily sampled droplet diameters between 20 and 80 µm. It is hard to say if it sampled the droplet diameter-independent regime. We will explore the posterior distribution for this model, informed by these data in the likelihood, in the next lesson. Before we do, we should check to make sure the model assumptions hold.
In deriving the model, we made a couple key assumptions. First, we assumed that \(V_s/V_0 \ll 1\). Second, we assumed that all spindles have the same aspect ratio, \(k\). We should examine how these assumptions hold up and maybe relax those assumptions and build a more complex model if need be.
Is \(V_\mathrm{s} / V_0 \ll 1\)?¶
Let’s do a quick verification that the droplet volume is indeed much larger than the spindle volume. Remember, the spindle volume for a prolate spheroid of length \(l\) and width \(w\) is \(V_\mathrm{s} = \pi l w^2 / 6\).
[24]:
# Compute spindle volume
spindle_volume = np.pi * df["Spindle Length (um)"] * df["Spindle Width (um)"] ** 2 / 6
# Compute the ratio V_s / V_0 (taking care of units)
vol_ratio = spindle_volume / df["Droplet Volume (uL)"] * 1e-9
# Plot an ECDF of the results
bokeh.io.show(bokeh_catplot.ecdf(vol_ratio.values, x_axis_label="Vs/V0"))
We see that for pretty much all spindles that were measured, \(V_\mathrm{s} / V_0\) is small, so this is a sound assumption.
Do all spindles have the same aspect ratio \(k\)?¶
In setting up our model, we assumed that all spindles had the same aspect ratio. We can check this assumption because we have the data available to us.
[25]:
# Compute the aspect ratio
k = df['Spindle Width (um)'] / df['Spindle Length (um)']
# Plot ECDF
bokeh.io.show(bokeh_catplot.ecdf(k.values, x_axis_label='k'))
The mean aspect ratio is about 0.4, and we see variation of roughly ±25% around that value. This could be significant variation. Going forward, we will assume \(k\) is constant, but you may wish to perform the analysis with nonconstant \(k\) as an exercise.
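We can quantify this variation directly from the aspect ratios we just computed.

# Mean aspect ratio and its coefficient of variation
print("mean k:", k.mean())
print("coefficient of variation:", k.std() / k.mean())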
Importantly, these checks of the model highlight the importance of checking your assumptions against your data. Always a good idea!
[26]:
bebi103.stan.clean_cmdstan()
Computing environment¶
[27]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,cmdstanpy,arviz,bokeh,holoviews,panel,bokeh_catplot,bebi103,jupyterlab
CPython 3.7.6
IPython 7.11.1
numpy 1.18.1
pandas 0.24.2
scipy 1.3.1
cmdstanpy 0.8.0
arviz 0.6.1
bokeh 1.4.0
holoviews 1.12.7
panel 0.7.0
bokeh_catplot 0.1.8
bebi103 0.0.51
jupyterlab 1.2.5