Statistical modeling

As scientists, our goal is to learn how nature works. We make observations and measurements in the service of that understanding. In their book Physical Biology of the Cell, Phillips et al. write that “…quantitative data demand quantitative models and, conversely, that quantitative models need to provide experimentally testable quantitative predictions about biological phenomena.” It is in this spirit that we approach parametric inference. Our goal is to model the process of data generation and to use the measured data to learn about the model, and thereby about the system that generated the data.

As an example to have in mind when thinking about modeling, consider measurements of optical density (OD) of a culture of E. coli in LB medium. We might expect the OD to grow exponentially over time, with some small measurement error. To model this, we specify a probability distribution to describe the measurements. We can then use the data and statistical inference to learn something about the parameters in the model.

Levels of models

The word “model” in the biological literature takes on many different meanings. Since we are now becoming statistical modelers, we need to clearly define what we are talking about when we use the word “model.”

  • Cartoon model. These models are the typical cartoons we see in textbooks or in discussion sections of biological papers. They are a sketch of what we think might be happening in a system of interest, but they do not provide quantifiable predictions.

  • Mathematical model. These models give quantifiable predictions that must be true if the hypothesis (which is sketched as a cartoon model) is true. In many cases, getting to predictions from a hypothesis is easy. For example, if I hypothesize that protein A binds protein B, a quantifiable prediction would be that they are colocalized when I image them. However, sometimes harder work and deeper thought are needed to generate quantitative predictions. This often requires “mathematizing” the cartoon, which is how a mathematical model is derived from a cartoon model. Oftentimes when biological physicists refer to a “model,” they are talking about what we are calling a mathematical model. In the bacterial growth example, the mathematical model is that the OD grows exponentially over time.

  • Statistical model. A statistical model goes a step beyond the mathematical model and uses a probability distribution to describe any measurement error or stochastic noise in the system being measured. This essentially means specifying \(f(y; \theta)\), the probability density function (or probability mass function) for observing data \(y\), parametrized by \(\theta\). The statistical models we will use are generative in that they encompass the cartoon and mathematical models together with any noise, using probability to describe how the data are generated.
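
To make the three levels concrete for the bacterial growth example, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration and is not taken from the text: the names `mathematical_model`, `generate_data`, and `log_likelihood`, the choice of independent Normal measurement error with a constant standard deviation `sigma`, and all parameter values and time points. The mathematical model is the exponential growth curve, and the statistical model adds a noise distribution around it, so that \(\theta\) is the set of parameters (`a0`, `k`, `sigma`).

```python
import numpy as np


def mathematical_model(t, a0, k):
    """Exponential growth: theoretical OD at time t (no noise)."""
    return a0 * np.exp(k * t)


def generate_data(t, a0, k, sigma, rng):
    """Draw one data set from the assumed generative model.

    Each measurement is the theoretical OD plus independent Normal
    noise with standard deviation sigma (an illustrative assumption).
    """
    mu = mathematical_model(t, a0, k)
    return rng.normal(mu, sigma)


def log_likelihood(od, t, a0, k, sigma):
    """Log of f(y; theta) under the assumed independent Normal noise."""
    mu = mathematical_model(t, a0, k)
    return np.sum(
        -0.5 * np.log(2.0 * np.pi * sigma**2) - (od - mu) ** 2 / (2.0 * sigma**2)
    )


# Hypothetical time points (hours) and parameter values
rng = np.random.default_rng(seed=3252)
t = np.linspace(0.0, 4.0, 9)
od = generate_data(t, a0=0.01, k=1.0, sigma=0.005, rng=rng)

# Evaluate the log density of the simulated data at the generating parameters
print(log_likelihood(od, t, a0=0.01, k=1.0, sigma=0.005))
```

Calling `generate_data` repeatedly gives different data sets from the same model, which is what "generative" means here. Note that Normal noise can produce negative OD values when the theoretical OD is small, so a different noise distribution may be a better choice in practice.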