5. Collector’s box of distributions

Probability distributions are the building blocks of any generative model. In building a model, we usually start with the likelihood, prescribing a distribution that generates the data, \(f(y\mid \theta)\). The likelihood makes clear which parameters \(\theta\) we need priors \(g(\theta)\) for. We ultimately characterize the posterior distribution of the parameters, \(g(\theta\mid y)\).

Check out the Distribution Explorer

Sometimes we need to derive a distribution from first principles, but often we can use off-the-shelf named distributions in our modeling. It is therefore very valuable to be familiar with these distributions, so you should read and interact with the content of the Distribution Explorer.

Choosing distributions

Each of the named distributions has a story. Oftentimes, choosing a distribution for a model amounts to matching the story of the data-generating process to that of a distribution. To help pare down which distributions to consider, we answer the following questions about the variable (either data or parameter) of interest.

  1. Is the variable a discrete quantity or a continuous one? Only discrete distributions describe discrete variables, and only continuous distributions describe continuous variables.

  2. Is the variable univariate or multivariate?

  3. On what domain is the variable defined?

When we answer these questions, though, we typically answer them approximately. As an example, consider repeated measurements of C. elegans egg lengths. We might measure these by taking microscope images and counting the number of pixels along the long axis of each egg. We can then convert the pixel count to a length using the known interpixel distance of the optical setup.

  1. Is the egg length discrete or continuous? Strictly speaking, it is discrete: we are using a digital camera, so we have precision to an integer number of pixels. This is generally the case with digital measurements, which most of our measurements are. Nonetheless, we can approximate the measurement as continuous with little error. When we can approximate a discrete quantity as continuous, we usually want to do so, as it simplifies the modeling and calculations.

  2. Is egg length univariate or multivariate? It is univariate.

  3. On what domain is egg length defined? Egg length is defined on the set of positive real numbers, so we might choose a likelihood defined on that domain. However, a Normal distribution is also a good choice. It matches the story of how egg length might be generated (the sum of many processes, none of which has a giant variance), and its light tails mean that negative egg lengths are extremely unlikely (though not entirely impossible). So, a univariate Normal distribution is a good (approximate) distribution to use as a likelihood for egg length.
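To make this concrete, here is a minimal sketch of such a Normal likelihood in Stan. The prior hyperparameters are placeholders for illustration, not recommendations.

data {
  int<lower=1> N;
  vector[N] ell;  // measured egg lengths
}


parameters {
  real mu;
  real<lower=0> sigma;
}


model {
  // Placeholder priors; choose these based on your measurement scale
  mu ~ normal(50.0, 20.0);
  sigma ~ normal(0.0, 10.0);  // Half-Normal because of the lower bound on sigma

  // Approximately continuous egg lengths modeled with a Normal likelihood
  ell ~ normal(mu, sigma);
}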

As another example, imagine we have a likelihood that is a bivariate Normal and we want to come up with a prior distribution for the \(2 \times 2\) covariance matrix.

  1. Is it discrete or continuous? It is continuous.

  2. Is the variable univariate or multivariate? The covariance matrix is multivariate. Note that it is often the case that we approximate multivariate variables as being a set of univariate variables. So, instead of coming up with a prior for the covariance matrix as a whole, we could come up with a prior for each of its three entries (three because the fourth is determined by symmetry). We can write the covariance matrix as \(\mathsf{\Sigma} = \mathsf{S} \cdot \mathsf{C} \cdot \mathsf{S}\), where \(S_{11} = \sigma_1\), \(S_{22} = \sigma_2\), \(S_{12} = S_{21} = 0\), \(C_{11} = C_{22} = 1\), and \(C_{12} = C_{21} = \rho\). We could then assume that \(\sigma_1\), \(\sigma_2\), and \(\rho\) are all independent and come up with priors for each.

  3. On what domain is the covariance matrix defined? The covariance matrix is positive definite, which defines the domain. We would therefore need to choose a distribution defined on that domain; the LKJ distribution, a distribution over correlation matrices, works for that purpose when combined with priors on the scales. Alternatively, we could define a distribution for \(\sigma_1\), \(\sigma_2\), and \(\rho\) separately. The domains of \(\sigma_1\) and \(\sigma_2\) are the positive real numbers, and we could use, e.g., Half-Normal priors for these two parameters. The parameter \(\rho\) is defined on the interval \([-1, 1]\). None of the named distributions in the Distribution Explorer is defined on this domain, but the Beta distribution is defined on the domain \([0, 1]\), so we can use a linear transformation to get a distribution for \(\rho\), as in the Stan sketch following the equations below.

\[\begin{split}\begin{align} &\xi \sim \text{Beta}(\alpha, \beta)\\ &\rho = 2\xi - 1. \end{align}\end{split}\]
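Here is a minimal Stan sketch of this separate-priors approach, assuming placeholder hyperparameters for the Half-Normal and Beta priors.

parameters {
  real<lower=0> sigma_1;
  real<lower=0> sigma_2;
  real<lower=0, upper=1> xi;
}


transformed parameters {
  real rho = 2.0 * xi - 1.0;
  matrix[2, 2] Sigma;
  Sigma[1, 1] = sigma_1^2;
  Sigma[2, 2] = sigma_2^2;
  Sigma[1, 2] = rho * sigma_1 * sigma_2;
  Sigma[2, 1] = Sigma[1, 2];
}


model {
  sigma_1 ~ normal(0.0, 1.0);  // Half-Normal via the lower bound
  sigma_2 ~ normal(0.0, 1.0);
  xi ~ beta(2.0, 2.0);         // placeholder shape parameters

  ...
}

Because \(\rho\) is a linear function of \(\xi\), no Jacobian adjustment is needed when placing the prior on \(\xi\). If we instead wanted a prior on the whole correlation matrix, Stan's lkj_corr_cholesky distribution places an LKJ prior directly on its Cholesky factor.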

Priors for variables with support on the set of positive real numbers

We often have to choose priors with support on the set of positive real numbers. In practice, I find there are three classes of distributions I use for this.

  1. Half-distributions. Listed in order from lightest to heaviest tail, these are the Half-Normal, Half-Student-t, and Half-Cauchy distributions. There is more probability mass near zero in these distributions than in the Gamma/Inverse-Gamma distributions, but they can also have very heavy tails, as is the case with the Half-Cauchy.

  2. Gamma/Inverse-Gamma distributions. The Gamma and Inverse-Gamma distributions are defined on the domain of positive real numbers. The Gamma distribution puts more probability mass near zero than the Inverse-Gamma, and the Inverse-Gamma has a much heavier tail. So, if you want to keep the probability mass away from zero and toward larger values, use an Inverse-Gamma; for the opposite, use a Gamma. (A brief Stan sketch of these first two classes of priors appears at the end of this section.)

  3. Log-Normal. I find the Log-Normal distribution to be very useful when I have order-of-magnitude estimates of the maximum and minimum values I might expect a parameter to have. Consider, for example, a single-molecule experiment where we are assessing the speed of a kinesin motor. I am sure that the kinesin motor walks slower than 0.1 meters per second and faster than a nanometer per second. That is, I know a priori that kinesin's walking speed is between \(10^{-3}\) µm/s and \(10^{5}\) µm/s. So, I might choose a prior such that the base-ten logarithm of the kinesin speed is between -3 and 5. If I choose to have about 95% of the probability mass between these two values, I can choose a Normal distribution with a location parameter at the midpoint, \((5 + (-3)) / 2 = 1\), and a scale parameter of \((5 - (-3)) / 2 / 2 = 2\), since roughly 95% of a Normal distribution's mass lies within two standard deviations of its location. Thus, \(\log_{10} v \sim \text{Norm}(1, 2)\). This is equivalent to a Log-Normal prior for \(v\). In practice, I almost always specify a Log-Normal prior as a Normal prior on the logarithm of the parameter and then exponentiate. Here is how that could look in Stan.

parameters {
  // Sample the base-ten logarithm of the speed
  real log10_v;
}


transformed parameters {
  // Exponentiate to recover the speed itself
  real v = 10^log10_v;
}


model {
  log10_v ~ normal(1, 2);

  ...
}
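
For comparison, here is a minimal sketch of the first two classes of positive-real priors in Stan; the parameter names and hyperparameter values are placeholders. Declaring a parameter with a lower bound of zero is all it takes to turn a Normal (or Student-t or Cauchy) prior into its half-distribution counterpart.

parameters {
  real<lower=0> sigma;  // lower bound turns the Normal prior into a Half-Normal
  real<lower=0> tau;
}


model {
  sigma ~ normal(0.0, 1.0);   // Half-Normal; use student_t or cauchy for heavier tails
  tau ~ inv_gamma(2.0, 1.0);  // Inverse-Gamma with placeholder hyperparameters

  ...
}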