Building a generative model

The process of model building usually involves starting with a cartoon model, mathematizing it, and then forming that into a statistical model that accounts for noise in measurement. Sometimes this process is very simple, and sometimes it involves careful and difficult modeling.

Examples of generative models

It is often easiest to learn from example. I present here some examples of how we might come up with generative models.

The size of eggs laid by C. elegans

The experiment here is repeated measurements of the length of eggs laid by C. elegans worms. We do not pretend to know much about how the process of egg generation sets its length. Surely many processes are involved, and we choose to model the egg length as being Normally distributed, as this story roughly matches what we would expect. We further assume that the length of each egg we measure is independent of all of the other eggs we measure, and further that the distribution we use to describe the egg length of any given egg is the same as any other. That is to say that the eggs are independent and identically distributed, abbreviated i.i.d.

We can then write down the probability density function for the length of egg \(i\), \(y_i\), as

\[\begin{align} f(y_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}\,\mathrm{e}^{-(y_i-\mu)^2/2\sigma^2}, \end{align}\]

which is the PDF for a Normal distribution. Since each measurement is independent, the PDF for the joint distribution of all measurements, \(\mathbf{y} = \{y_1, y_2, \ldots, y_n\}\), is given by the product of the PDFs of the individual measurements.

\[\begin{split}\begin{align} f(\mathbf{y}; \mu, \sigma) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}}\,\mathrm{e}^{-(y_i-\mu)^2/2\sigma^2} \\ &= \left(\frac{1}{2\pi \sigma^2}\right)^{n/2}\,\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2\right]. \end{align}\end{split}\]

The PDF has all of the information for the generative model. Importantly, the statistical model dictates what parameters you are trying to estimate. In this case, there are two parameters, \(\mu\) and \(\sigma\). The generative model tells us that we can infer the characteristic egg length \(\mu\) and the variance, \(\sigma^2\).
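As a quick numerical check of the algebra above, the following sketch evaluates the joint PDF two ways: as a product of single-egg PDFs and with the closed-form expression. The egg lengths and parameter values are made up for illustration and are not measurements, and the function names are hypothetical.

```python
import numpy as np


def normal_pdf(y, mu, sigma):
    """PDF of a single Normally distributed measurement."""
    return np.exp(-((y - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)


def joint_pdf_product(y, mu, sigma):
    """Joint PDF of i.i.d. measurements as a product of individual PDFs."""
    return np.prod(normal_pdf(y, mu, sigma))


def joint_pdf_closed_form(y, mu, sigma):
    """Joint PDF of i.i.d. measurements using the closed-form expression."""
    n = len(y)
    prefactor = (2 * np.pi * sigma**2) ** (-n / 2)
    return prefactor * np.exp(-np.sum((y - mu) ** 2) / (2 * sigma**2))


# Made-up egg lengths (microns) and parameter values, for illustration only
y = np.array([50.2, 48.7, 51.9, 49.5, 50.8])
mu, sigma = 50.0, 1.5

print(joint_pdf_product(y, mu, sigma))
print(joint_pdf_closed_form(y, mu, sigma))
```

The two printed values should agree to within floating point error, since the second expression is just the first with the product carried out.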

In this model, we skipped past the cartoon model and the mathematization and went directly to the generative statistical model, since the former two are trivial.

Short-hand model definition

Writing out the mathematical expression for the PDF can be cumbersome, even for a relatively simple model like we have here. In English, the model is, “The egg lengths are i.i.d. and are Normally distributed with mean \(\mu\) and standard deviation \(\sigma\).” A shorthand for this is

\[\begin{align} y_i \sim \text{Norm}(\mu, \sigma) \;\forall i. \end{align}\]

This is read just like the English sentence describing the model. The tilde symbol means “is distributed as.”
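Because the model is generative, the shorthand translates almost directly into code that produces synthetic data. Here is a minimal sketch using NumPy; the values of \(\mu\), \(\sigma\), and \(n\), as well as the seed, are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Arbitrary, illustrative parameter values (lengths in microns)
mu = 50.0
sigma = 1.5
n = 1000

# y_i ~ Norm(mu, sigma) for all i: n independent draws from the same Normal
y = rng.normal(mu, sigma, size=n)

print(y.mean(), y.std())
```

With many draws, the sample mean and sample standard deviation should land close to the \(\mu\) and \(\sigma\) used to generate the data.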

The amount of time before microtubule catastrophe

In your homework, you have already built a model for the time to microtubule catastrophe. We started with a story: Catastrophe occurs after the arrival of two different successive Poisson processes. The story here is the cartoon model. You derived the probability density function for the time it takes for a single catastrophe.

\[\begin{align} f(t_i;\beta_1, \beta_2) = \frac{\beta_1 \beta_2}{\beta_2 - \beta_1}\left(\mathrm{e}^{-\beta_1 t_i} - \mathrm{e}^{-\beta_2 t_i}\right), \end{align}\]

where we have implicitly assumed that \(\beta_1 \ne \beta_2\). We could explicitly model some errors in measurement of catastrophe times, but the experiment is quite clean. It is obvious from the images when catastrophe occurs, so the mathematical model leads directly to the generative statistical model.

If we again model the catastrophe events as i.i.d., we can write the joint PDF for a set of measured catastrophe times \(\mathbf{t} = \{t_1, t_2, \ldots, t_n\}\).

\[\begin{align} f(\mathbf{t};\beta_1, \beta_2) = \left(\frac{\beta_1 \beta_2}{\beta_2 - \beta_1}\right)^n\prod_{i=1}^n\left(\mathrm{e}^{-\beta_1 t_i} - \mathrm{e}^{-\beta_2 t_i}\right). \end{align}\]
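A product of many small densities can underflow in floating point arithmetic, so it is common to work with the logarithm of the joint PDF instead. The sketch below is one way to do that, assuming \(\beta_1 < \beta_2\) so that every logarithm has a positive argument. The function names, catastrophe times, and rates are hypothetical.

```python
import numpy as np


def pdf_single(t, beta_1, beta_2):
    """PDF for a single catastrophe time; requires beta_1 != beta_2."""
    prefactor = beta_1 * beta_2 / (beta_2 - beta_1)
    return prefactor * (np.exp(-beta_1 * t) - np.exp(-beta_2 * t))


def log_joint_pdf(t, beta_1, beta_2):
    """Log of the joint PDF for i.i.d. catastrophe times.

    This sketch assumes beta_1 < beta_2.
    """
    n = len(t)
    log_prefactor = n * np.log(beta_1 * beta_2 / (beta_2 - beta_1))
    return log_prefactor + np.sum(np.log(np.exp(-beta_1 * t) - np.exp(-beta_2 * t)))


# Made-up catastrophe times and rates (arbitrary time units)
t = np.array([1.2, 3.5, 0.8, 5.1])
beta_1, beta_2 = 0.5, 1.5

print(log_joint_pdf(t, beta_1, beta_2))

# Sanity check: the single-event PDF should integrate to about one.
dt = 0.001
t_grid = np.arange(0.0, 60.0, dt)
total = np.sum(pdf_single(t_grid, beta_1, beta_2)) * dt
print(total)
```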

This model is more difficult to write in shorthand, but we can do so.

\[\begin{split}\begin{align} &t'_i \sim \text{Expon}(\beta_1) \;\forall i,\\ &t_i - t'_i \sim \text{Expon}(\beta_2) \;\forall i. \end{align}\end{split}\]

Note that this construction of the model has a latent variable, \(t_i'\), a random variable that we can define in the model, but we cannot measure.
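The two-line shorthand maps onto a two-step sampling procedure, which makes the role of the latent variable concrete. The following is a sketch with arbitrary rates. Note that NumPy parametrizes the Exponential distribution by its scale, which is the inverse of the rate \(\beta\) used here.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Arbitrary, illustrative rates (inverse time units)
beta_1 = 0.5
beta_2 = 1.5
n = 100_000

# Latent variable: time of arrival of the first Poisson process
t_prime = rng.exponential(scale=1 / beta_1, size=n)

# Observed variable: the second process arrives an Exponentially
# distributed time after the first
t = t_prime + rng.exponential(scale=1 / beta_2, size=n)

# The mean of a sum of two Exponentials is the sum of their means
print(t.mean(), 1 / beta_1 + 1 / beta_2)
```

In an experiment we would only have access to `t`; `t_prime` exists in the simulation but is never observed.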

An alternative model for microtubule catastrophe

As an alternative model, we may consider the case where catastrophe is itself a Poisson process (or triggered by the arrival of a single Poisson process). In that case, our model is simpler.

\[\begin{align} &t_i \sim \text{Expon}(\beta) \;\forall i. \end{align}\]

The change in bacterial mass over time

When you performed segmentation of growing Caulobacter in the homework and determined growth events, you were working toward an ultimate goal of parameter estimation for how the cellular volume changes over time. You cannot really measure the mass, so we start by making the assumption that the volume of a bacterium is calculated from its area, which we can measure, by a multiplicative constant. We had two models in mind, linear growth and exponential growth.

Linear growth

We will start with linear growth; stating that the growth is linear is the cartoon model. More precisely, we model bacterial growth as a constant process for each bacterium; it grows at the same rate regardless of bacterial mass. We can mathematize our model as

\[\begin{align} a(t) = a^0 + b t, \end{align}\]

where \(a(t)\) is the area of the bacterium over time, and \(t\) is the time since the last cell division. So, we now have our mathematical model. The growth rate is \(b\), and the area immediately after the last cell division is \(a^0\).

For the statistical model, we need to model error in measurement. The idea is that the cell grows according to the above equation, but there will be some natural stochastic variation away from that curve. Furthermore, there are errors in measurement for the area at each time point. (We assume that we can measure the time exactly without error.) Thus, the measured area \(a_i\) for a bacterium at time point \(t_i\) is

\[\begin{align} a_i = a^0 + b t_i + e_i, \end{align}\]

where \(e_i\) is the deviation of the measurement from the mathematical model, called a residual. To complete the statistical model, we need to specify how \(e_i\) is distributed, and also the relationship between different time points. We first consider the latter. In time series analysis, the value (in this case the area) at time point \(t_{i+1}\) may be influenced by the value at time point \(t_i\) through some memory process. Nonetheless, we often model the residuals at different time points as independent, so that a measurement is connected to those at previous times only through the explicit time dependence in the mathematical model. This is typically a reasonable assumption, as many processes are memoryless.

Given that the residuals are independent, we can now specify how each residual \(e_i\) is distributed. This is commonly modeled as Normal with mean zero and some finite variance. If that variance is the same for all time points, the residuals are said to be homoscedastic. If the variance changes over time, we have heteroscedasticity. So, if we assume homoscedastic error, we could write

\[\begin{align} f(e_i;\sigma) = \frac{1}{\sqrt{2\pi \sigma^2}} \mathrm{e}^{-e_i^2/2\sigma^2}\;\forall i. \end{align}\]

We can then write the PDF for the joint distribution of all of the measured areas, \(\mathbf{a} = \{a_1, a_2, \ldots, a_n\}\), at time points \(\mathbf{t} = \{t_1, t_2, \ldots, t_n\}\),

\[\begin{align} f(\mathbf{a};\mathbf{t},a^0, b, \sigma) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(a_i - a^0-bt_i)^2\right]. \end{align}\]

It is convenient to write this in shorthand.

\[\begin{split}\begin{align} &a_i = a^0 + b t_i + e_i \;\forall i,\\ &e_i \sim \text{Norm}(0, \sigma)\;\forall i, \end{align}\end{split}\]

or, equivalently,

\[\begin{align} a_i \sim \text{Norm}(a^0 + b t_i, \sigma)\;\forall i. \end{align}\]
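To make the linear growth model concrete, here is a sketch that generates synthetic areas from the shorthand and evaluates the logarithm of the joint PDF for Normal, homoscedastic residuals. All parameter values, units, and time points are invented for illustration, and `log_likelihood` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Invented parameter values, for illustration only
a0 = 1.5      # area immediately after division (square microns)
b = 0.02      # growth rate (square microns per minute)
sigma = 0.05  # standard deviation of the homoscedastic residuals

# Time points, assumed to be measured without error (minutes)
t = np.arange(0.0, 60.0, 1.0)

# a_i ~ Norm(a0 + b * t_i, sigma) for all i
a = rng.normal(a0 + b * t, sigma)


def log_likelihood(a, t, a0, b, sigma):
    """Log of the joint PDF of the areas under the linear growth model."""
    n = len(a)
    resid = a - a0 - b * t
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)


print(log_likelihood(a, t, a0, b, sigma))
print(log_likelihood(a, t, a0, 2 * b, sigma))
```

The second value, computed with a growth rate that is twice the one used to generate the data, should be much lower than the first.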

Exponential growth

We can use exactly the same logic as above to write the model for exponential growth, now with a growth rate constant \(k\).

\[\begin{split}\begin{align} &a_i = a^0 \mathrm{e}^{kt_i} + e_i \;\forall i,\\ &e_i \sim \text{Norm}(0, \sigma)\;\forall i, \end{align}\end{split}\]

or, equivalently,

\[\begin{align} a_i \sim \text{Norm}(a^0 \mathrm{e}^{kt_i}, \sigma)\;\forall i. \end{align}\]
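In code, the linear and exponential growth models differ only in the function that gives the mean area at each time point; the Normal, homoscedastic noise is handled identically. The sketch below illustrates this with invented parameter values and hypothetical function names.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)


def linear_mean(t, a0, b):
    """Mean area under linear growth."""
    return a0 + b * t


def exponential_mean(t, a0, k):
    """Mean area under exponential growth."""
    return a0 * np.exp(k * t)


def simulate_areas(mean_area, sigma, rng):
    """Draw one area per time point: a_i ~ Norm(mean_area_i, sigma)."""
    return rng.normal(mean_area, sigma)


# Invented values, for illustration only
a0, b, k, sigma = 1.5, 0.02, 0.01, 0.05
t = np.arange(0.0, 60.0, 1.0)

a_linear = simulate_areas(linear_mean(t, a0, b), sigma, rng)
a_exponential = simulate_areas(exponential_mean(t, a0, k), sigma, rng)

print(a_linear[:3], a_exponential[:3])
```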

Important notes on generative modeling

In the three examples presented here, we used our best scientific and statistical insights to put forward a generative model. The model for linear growth of bacteria is in some sense “standard,” in that it leads to linear regression, a widely used statistical tool. Nonetheless, your modeling should be bespoke. You should choose models that are appropriate for the experiment and data you are analyzing.

The bulk of next term is about parametric statistical models, how to build them, and how to assess them.