Building a generative model¶
The process of model building usually involves starting with a cartoon model, mathematizing it, and then forming that into a statistical model that accounts for noise in measurement. Sometimes this process is very simple, and sometimes it involves careful and difficult modeling.
Examples of generative models¶
It is often easiest to learn from example. I present here some examples of how we might come up with generative models.
The size of eggs laid by C. elegans¶
The experiment here is repeated measurements of the length of eggs laid by C. elegans worms. We do not pretend to know much about how the process of egg generation sets its length. Surely many processes are involved, so we choose to model the egg length as being Normally distributed, since the story of the Normal distribution (a quantity that results from many small, additive contributions) roughly matches what we would expect. We further assume that the length of each egg we measure is independent of all of the other eggs we measure, and that the distribution we use to describe the length of any given egg is the same as for any other. That is to say that the egg lengths are independent and identically distributed, abbreviated i.i.d.
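We can then write down the probability density function for the length of egg $i$, $y_i$, as

\begin{align}
f(y_i;\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left[-\frac{(y_i-\mu)^2}{2\sigma^2}\right],
\end{align}

which is the PDF for a Normal distribution. Since each measurement is independent, the PDF for the joint distribution of all measurements, $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$, is the product of the PDFs of the individual measurements,

\begin{align}
f(\mathbf{y};\mu,\sigma) = \prod_{i=1}^n f(y_i;\mu,\sigma) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2\right].
\end{align}

The PDF has all of the information for the generative model. Importantly, the statistical model dictates what parameters you are trying to estimate. In this case, there are two parameters, the mean $\mu$ and the standard deviation $\sigma$.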
In this model, we went straight past the cartoon model and the mathematization to the generative statistical model, since the former two are trivial.
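As a minimal sketch of how the joint PDF of i.i.d. Normal measurements can be evaluated, the snippet below sums the logarithms of the individual PDFs, which is equivalent to taking the logarithm of their product. The function names, the parameter names `mu` (mean) and `sigma` (standard deviation), and the egg lengths are all made up for illustration; they are not data from the experiment.

```python
import math


def normal_log_pdf(y, mu, sigma):
    """Log of the Normal PDF evaluated at a single measurement y."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)


def joint_log_pdf(ys, mu, sigma):
    """Log of the joint PDF of i.i.d. measurements.

    Independence makes the joint PDF a product of the individual PDFs,
    so its logarithm is the sum of the individual log PDFs.
    """
    return sum(normal_log_pdf(y, mu, sigma) for y in ys)


# Hypothetical egg lengths in microns (illustrative values, not real data)
egg_lengths = [50.1, 48.7, 51.3, 49.9, 50.6]

# Parameter values near the data give a larger joint log PDF
# than parameter values far from the data
log_f_near = joint_log_pdf(egg_lengths, mu=50.0, sigma=1.0)
log_f_far = joint_log_pdf(egg_lengths, mu=60.0, sigma=1.0)
print(log_f_near, log_f_far)
```

Working with logarithms is a common design choice here, because the product of many small PDF values can underflow in floating point arithmetic.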
Short-hand model definition¶
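Writing out the mathematical expression for the PDF can be cumbersome, even for a relatively simple model like we have here. In English, the model is, “The egg lengths are i.i.d. and are Normally distributed with mean $\mu$ and standard deviation $\sigma$.” Denoting the length of egg $i$ by $y_i$, we can write this in shorthand as

\begin{align}
y_i \sim \text{Norm}(\mu, \sigma) \;\;\forall i.
\end{align}

This is read just like the English sentence describing the model. The tilde symbol means “is distributed as.”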
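One way to read the shorthand is as a recipe for generating data: "is distributed as" becomes "draw a random number from." The sketch below assumes hypothetical values for the mean (`mu`) and standard deviation (`sigma`) and uses only Python's standard library; it is an illustration, not the course's own code.

```python
import random
import statistics

rng = random.Random(3252)  # fixed seed so the run is reproducible

mu = 50.0     # hypothetical mean egg length (microns)
sigma = 2.0   # hypothetical standard deviation (microns)
n = 1000      # number of simulated eggs

# Each draw uses the same distribution and does not depend on the others,
# which is what i.i.d. means in practice
egg_lengths = [rng.gauss(mu, sigma) for _ in range(n)]

# With many draws, the sample statistics should land close to mu and sigma
sample_mean = statistics.mean(egg_lengths)
sample_std = statistics.stdev(egg_lengths)
print(sample_mean, sample_std)
```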
The amount of time before microtubule catastrophe¶
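In your homework, you have already built a model for the time to microtubule catastrophe. We started with a story: Catastrophe occurs after the arrival of two different successive Poisson processes. The story here is the cartoon model. You derived the probability density function for the time $t$ it takes for a single catastrophe,

\begin{align}
f(t;\beta_1,\beta_2) = \frac{\beta_1\beta_2}{\beta_2-\beta_1}\left(\mathrm{e}^{-\beta_1 t} - \mathrm{e}^{-\beta_2 t}\right),
\end{align}

where $\beta_1$ and $\beta_2$ are the rates of arrival of the two processes and we have implicitly assumed that $\beta_1 \ne \beta_2$.

If we again model the catastrophe events as i.i.d., we can write the joint PDF for a set of measured catastrophe times $\mathbf{t} = \{t_1, t_2, \ldots, t_n\}$ as

\begin{align}
f(\mathbf{t};\beta_1,\beta_2) = \prod_{i=1}^n f(t_i;\beta_1,\beta_2) = \left(\frac{\beta_1\beta_2}{\beta_2-\beta_1}\right)^n \prod_{i=1}^n \left(\mathrm{e}^{-\beta_1 t_i} - \mathrm{e}^{-\beta_2 t_i}\right).
\end{align}

This model is more difficult to write in shorthand, but we can. With the Exponential distribution parametrized by its rate,

\begin{align}
&\tau_i \sim \text{Expon}(\beta_1) \;\;\forall i, \\
&t_i - \tau_i \sim \text{Expon}(\beta_2) \;\;\forall i.
\end{align}

Note that this construction of the model has a latent variable, $\tau_i$, the time of arrival of the first process, which is not measured.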
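The two-step story can be sketched as a simulation in which the latent variable is drawn first and then discarded, leaving only the measured catastrophe time. The rates `beta_1` and `beta_2` below are hypothetical values chosen for illustration, and `draw_catastrophe_time` is a made-up helper name.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible

beta_1 = 2.0  # hypothetical rate of arrival of the first process
beta_2 = 5.0  # hypothetical rate of arrival of the second process
n = 20000     # number of simulated catastrophe events


def draw_catastrophe_time(beta_1, beta_2, rng):
    """Draw one catastrophe time as the sum of two successive waiting times."""
    tau = rng.expovariate(beta_1)         # latent: arrival of the first process
    return tau + rng.expovariate(beta_2)  # second process starts after the first


times = [draw_catastrophe_time(beta_1, beta_2, rng) for _ in range(n)]

# The mean of a sum of waiting times is the sum of the mean waiting times
expected_mean = 1 / beta_1 + 1 / beta_2
print(statistics.mean(times), expected_mean)
```

Only `times` would be available in an experiment; `tau` never leaves the function, which is what makes it latent.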
An alternative model for microtubule catastrophe¶
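As an alternative model, we may consider the case where catastrophe is itself a Poisson process (or triggered by the arrival of a single Poisson process). In that case, our model is simpler: the catastrophe times $t_i$ are i.i.d. and Exponentially distributed with a single rate $\beta$,

\begin{align}
t_i \sim \text{Expon}(\beta) \;\;\forall i.
\end{align}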
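For this simpler model, the joint PDF of i.i.d. Exponentially distributed times has a compact logarithm: the number of measurements times the log of the rate, minus the rate times the sum of the times. The sketch below assumes a hypothetical rate `beta_true` and simulated (not measured) catastrophe times.

```python
import math
import random


def expon_joint_log_pdf(times, beta):
    """Log joint PDF of i.i.d. Exponential times with rate beta.

    log prod_i [beta * exp(-beta * t_i)] = n * log(beta) - beta * sum_i t_i
    """
    return len(times) * math.log(beta) - beta * sum(times)


rng = random.Random(7)  # fixed seed so the run is reproducible
beta_true = 3.0         # hypothetical rate used to generate the data
times = [rng.expovariate(beta_true) for _ in range(5000)]

# The joint log PDF is larger at the generating rate than at a rate far from it
log_f_true = expon_joint_log_pdf(times, beta_true)
log_f_off = expon_joint_log_pdf(times, 1.0)
print(log_f_true, log_f_off)
```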
The change in bacterial mass over time¶
You may be familiar with exponential microbial growth. When you put a single cell in growth media, it divides, and then you have two. Those two cells then grow and divide, giving four cells. This continues, and the number of cells grows exponentially with time.
In an interesting paper (PNAS, 2014), Iyer-Biswas and coworkers addressed the question of whether or not a single cell exhibits exponential growth. That is, right after a division, does the total mass of a cell grow exponentially before dividing? Even if individual cells grow linearly, growth in bulk is still exponential, so we cannot really tell from a bulk growth experiment.
Their clever experimental set-up allows imaging of single dividing cells in conditions that are identical through time. This is accomplished by taking advantage of a unique morphological feature of Caulobacter. The mother cell adheres to a surface through its stalk. Upon division, one of the daughter cells does not have a stalk and is mobile. The system is part of a microfluidic device that provides a constant flow. So, every time a mother cell divides, the un-stalked daughter cell gets washed away. In this way, the dividing cells are never in a crowded environment and the buffer is always fresh. Using microscopy and image processing, they obtained many growth curves, each following a single mother cell from one division to the next, with which to assess growth models.
We can consider two models for growth of an individual cell, linear growth and exponential growth.
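To see why a bulk experiment cannot distinguish the two models, consider a deliberately simplified sketch: a perfectly synchronized culture that starts from one newborn cell, in which every cell grows linearly from mass 1 to mass 2 (arbitrary units) and then divides into two cells of mass 1. The synchronization, the mass units, and the function names are assumptions made for illustration only.

```python
import math


def single_cell_mass(t_since_division, division_time):
    """Linear growth: mass rises from 1 to 2 between divisions."""
    return 1.0 + t_since_division / division_time


def bulk_mass(t, division_time):
    """Total mass of a synchronized culture started from one newborn cell."""
    n_divisions = math.floor(t / division_time)
    t_since_division = t - n_divisions * division_time
    # 2**n_divisions cells, each growing linearly since the last division
    return 2**n_divisions * single_cell_mass(t_since_division, division_time)


division_time = 1.0
for t in [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0]:
    exponential = 2 ** (t / division_time)  # pure exponential with the same doubling time
    print(t, bulk_mass(t, division_time), exponential)
```

In this sketch the bulk mass doubles every division time and stays within about 6 percent of the pure exponential curve at all times, even though no individual cell grows exponentially.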
Linear growth¶
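We will start with linear growth; stating that the growth is linear is the cartoon model. More precisely, we model bacterial growth as a constant process for each bacterium; it grows at the same rate regardless of bacterial mass. We can mathematize our model as

\begin{align}
a(t) = a_0 + b t,
\end{align}

where $a(t)$ is the area of the bacterium in the images at time $t$ (which we take to be proportional to its mass), $a_0$ is the area right after a division, and $b$ is the growth rate.

For the statistical model, we need to model error in measurement. The idea is that the cell grows according to the above equation, but there will be some natural stochastic variation away from that curve. Furthermore, there are errors in measurement for the area at each time point. (We assume that we can measure the time exactly without error.) Thus, the measured area $a_i$ at time point $t_i$ is

\begin{align}
a_i = a_0 + b t_i + e_i,
\end{align}

where $e_i$ is the difference between the measured area and the theoretical curve, called the residual.

Assuming that the residuals are i.i.d., we can model the residual, $e_i$, as being Normally distributed with zero mean and standard deviation $\sigma$,

\begin{align}
f(e_i;\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left[-\frac{e_i^2}{2\sigma^2}\right].
\end{align}

We can then write the PDF for the joint distribution of all of the measured data, $\mathbf{a} = \{a_1, a_2, \ldots, a_n\}$, measured at times $\mathbf{t} = \{t_1, t_2, \ldots, t_n\}$,

\begin{align}
f(\mathbf{a};\mathbf{t},a_0,b,\sigma) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (a_i - a_0 - b t_i)^2\right].
\end{align}

It is convenient to write this in shorthand.

\begin{align}
&a_i = a_0 + b t_i + e_i, \\
&e_i \sim \text{Norm}(0, \sigma) \;\;\forall i,
\end{align}

or, equivalently,

\begin{align}
a_i \sim \text{Norm}(a_0 + b t_i, \sigma) \;\;\forall i.
\end{align}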
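The linear growth model can be sketched as a simulation in which each measured area is a Normal draw centered on a straight line in time. Because the residuals are i.i.d. Normal with a common standard deviation, the intercept and slope that maximize the joint PDF are the same ones that minimize the sum of squared residuals, so a closed-form least squares fit is used below. All parameter values and names (`a_0`, `b`, `sigma`, `least_squares_line`) are hypothetical and chosen only for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible

a_0 = 1.0     # hypothetical area right after division (square microns)
b = 0.02      # hypothetical growth rate (square microns per minute)
sigma = 0.05  # hypothetical standard deviation of the residuals
times = [float(t) for t in range(0, 60, 2)]  # measurement times in minutes

# Generative model: each area is Normal with mean a_0 + b * t_i and std sigma
areas = [rng.gauss(a_0 + b * t, sigma) for t in times]


def least_squares_line(ts, ys):
    """Closed-form least squares intercept and slope for y = intercept + slope * t."""
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(ys) / n
    s_ty = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    s_tt = sum((t - t_bar) ** 2 for t in ts)
    slope = s_ty / s_tt
    return y_bar - slope * t_bar, slope


a_0_hat, b_hat = least_squares_line(times, areas)
print(a_0_hat, b_hat)
```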
Exponential growth¶
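We can use exactly the same logic as above to write the model for exponential growth. Here, the growth rate of a cell is proportional to its size, so the theoretical curve is $a(t) = a_0\,\mathrm{e}^{kt}$, where $a_0$ is the area right after a division and $k$ is the growth rate. With $a_i$ the area measured at time $t_i$, $e_i$ the residual, and $\sigma$ the standard deviation of the residuals, the model is

\begin{align}
&a_i = a_0\,\mathrm{e}^{k t_i} + e_i, \\
&e_i \sim \text{Norm}(0, \sigma) \;\;\forall i,
\end{align}

or, equivalently,

\begin{align}
a_i \sim \text{Norm}\left(a_0\,\mathrm{e}^{k t_i}, \sigma\right) \;\;\forall i.
\end{align}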
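The linear and exponential growth models differ only in the theoretical curve that sets the mean of each measurement; the noise model is the same. The sketch below makes that explicit by passing the growth curve as an argument to a single simulation function. The function names and all parameter values are hypothetical.

```python
import math
import random


def linear_growth(t, a_0, b):
    """Theoretical area under linear growth."""
    return a_0 + b * t


def exponential_growth(t, a_0, k):
    """Theoretical area under exponential growth."""
    return a_0 * math.exp(k * t)


def simulate_areas(times, growth_curve, params, sigma, rng):
    """Draw each measured area from a Normal centered on the growth curve."""
    return [rng.gauss(growth_curve(t, *params), sigma) for t in times]


rng = random.Random(1)  # fixed seed so the run is reproducible
times = [float(t) for t in range(0, 60, 2)]  # measurement times in minutes
sigma = 0.05  # hypothetical standard deviation of the residuals

# Same noise model, two different theoretical curves (hypothetical parameters)
areas_lin = simulate_areas(times, linear_growth, (1.0, 0.02), sigma, rng)
areas_exp = simulate_areas(times, exponential_growth, (1.0, 0.012), sigma, rng)
print(areas_lin[-1], areas_exp[-1])
```

Keeping the curve separate from the noise model means a third growth model could be simulated by writing one more short function.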
Important notes on generative modeling¶
In the three example models presented here, we used our best scientific and statistical insights to put forward a generative model. The model for linear growth of bacteria is in some sense “standard,” in that it leads to linear regression, a widely used statistical tool. Nonetheless, your modeling should be bespoke. You should choose models that are appropriate for the experiment and data you are analyzing.
The bulk of next term is about parametric statistical models, how to build them, and how to assess them.