Tutorial 3b: Probability distributions and their stories

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import scipy.stats as st

import bokeh.io
import bokeh.layouts
import bokeh.plotting

bokeh.io.output_notebook()

We have talked about the three levels of building a model in the biological sciences.

  • Cartoon model: A cartoon or verbal phenomenological description of the system of interest.
  • Mathematical model: A mathematization of a cartoon model. This gives quantitative predictions on how a biological system should behave.
  • Statistical model: A description of what we expect from an experiment, given our mathematical model. In the Bayesian context, this is specification of the likelihood and prior.

We have talked briefly about specifying priors. We endeavor to be uninformative when we do not know much about model parameters or about the hypothesis in question. We will be a bit sloppy in our prior formulation, knowing that if we are sufficiently uninformative and have enough data, the prior has little impact on the posterior.

To specify a likelihood, we need to be careful, as this is very much at the heart of our model. Specifying the likelihood amounts to choosing a probability distribution that describes how the data will look under your model. In many cases, you need to derive the likelihood (or even numerically compute it when it cannot be written in closed form). In many practical cases, though, the choice of likelihood is among standard probability distributions. These distributions all have stories associated with them. Importantly, if your data and model match the story of a distribution, you know that this is the distribution to choose for your likelihood.

Review of probability distributions

Before we begin talking about distributions, let's remind ourselves what probability distributions are. We cut some corners in our definitions here, but these definitions are functional for most of our data analysis purposes.

A probability mass function (PMF), $P(x)$, describes the probability of a discrete variable obtaining value $x$. The variable $x$ takes on discrete values, so the normalization condition is

\begin{align} \sum_x P(x) = 1. \end{align}

A probability density function (PDF), which we shall also call $P(x)$, is defined such that the probability that a continuous variable $x$ is $a \le x \le b$ is

\begin{align} \int_a^b \mathrm{d}x\,P(x). \end{align}

A cumulative distribution function (CDF), denoted $F(x)$, is defined such that $F(x)$ is the probability that a variable $X$ is less than or equal to $x$.
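
As a quick sanity check of these definitions (this snippet is not in the original notebook, and the distributions and parameters are chosen arbitrarily), we can verify the normalization conditions and the meaning of the CDF numerically with scipy.stats, imported as st above.

# A PMF should sum to one; use a Poisson distribution with λ = 5 as an example
k = np.arange(100)
print(st.poisson.pmf(k, 5).sum())          # ≈ 1.0

# A PDF should integrate to one; integrate a standard Gaussian numerically
x = np.linspace(-20, 20, 10000)
print(np.trapz(st.norm.pdf(x), x))         # ≈ 1.0

# The CDF gives P(X ≤ x), so P(a ≤ X ≤ b) = F(b) - F(a)
print(st.norm.cdf(1) - st.norm.cdf(-1))    # ≈ 0.683 for a standard Gaussian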

Moments

Probability distributions have moments. The way moments are defined can vary, but we will define the $n$th moment computed from a PMF as

\begin{align} \left\langle x^n \right\rangle = \sum_i x_i^n P(x_i), \end{align}

and computed from a PDF as

\begin{align} \left\langle x^n \right\rangle = \int \mathrm{d}x\, x^n P(x). \end{align}

These moments are often used to compute summary statistics.

\begin{align} \text{mean} &= \langle x \rangle \\[1em] \text{variance} &= \left\langle x^2 \right\rangle - \langle x \rangle^2 = \left\langle(x - \langle x \rangle)^2 \right\rangle \\[1em] \text{skew} &= \frac{\left\langle(x - \langle x \rangle)^3 \right\rangle} {\left\langle(x - \langle x \rangle)^2 \right\rangle^{\frac{3}{2}}} \\[1em] \text{Pearson kurtosis} &= \frac{\left\langle(x - \langle x \rangle)^4 \right\rangle} {\left\langle(x - \langle x \rangle)^2 \right\rangle^2} \\[1em] \text{Fisher kurtosis} &= \frac{\left\langle(x - \langle x \rangle)^4 \right\rangle} {\left\langle(x - \langle x \rangle)^2 \right\rangle^2} - 3. \end{align}
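
As an illustration (a sketch not present in the original notebook; the Gamma distribution and its parameter are arbitrary choices), scipy.stats can report these summary statistics directly, and we can also compute the first moments by numerically integrating against the PDF. Note that the kurtosis scipy returns follows the Fisher convention.

# Mean, variance, skew, and Fisher kurtosis of a Gamma distribution with a = 3
mean, var, skew, kurt = st.gamma.stats(3, moments='mvsk')
print(mean, var, skew, kurt)

# The same mean and variance from the moment definitions, by numerical integration
x = np.linspace(0, 50, 100000)
pdf = st.gamma.pdf(x, 3)
m1 = np.trapz(x * pdf, x)          # ⟨x⟩
m2 = np.trapz(x**2 * pdf, x)       # ⟨x²⟩
print(m1, m2 - m1**2)              # mean and variance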

We will present PMFs, PDFs, and CDFs for distributions below. We only show the univariate forms; the multivariate versions are easily derived or looked up.

Sampling

Given that we know a probability distribution, we can take samples out of it. This means that we can randomly draw numbers such that the probability of drawing a certain number $x$ is proportional to the PMF or PDF, $P(x)$. As we will soon discover in the course, sampling out of a distribution is often easier than computing the distribution over a range of values, because the probability is essentially zero for most values.
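
For example (a sketch that is not in the original notebook; the parameters are arbitrary), every scipy.stats distribution has an rvs() method for drawing samples, and NumPy provides samplers for the common distributions as well.

# Draw samples out of a Gaussian and a Poisson distribution with scipy.stats
gaussian_samples = st.norm.rvs(loc=10, scale=2, size=10000)
poisson_samples = st.poisson.rvs(5, size=10000)

# NumPy offers equivalent samplers
gaussian_samples_np = np.random.normal(10, 2, size=10000)

# Sample means should be close to the distribution means
print(gaussian_samples.mean(), poisson_samples.mean())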

Distributions and their stories

In what follows, I will present some commonly used probability distributions and their associated stories. For each distribution, I will give the name of the distribution, its story, its parameters, a biological example that follows this distribution, expressions for the PDF or PMF and CDF, as well as plots of these. Along the way, we'll define other terms, such as Bernoulli trial and Poisson process. I will use the scipy.stats module (imported as st) to compute the PDF/PMF and CDF. The API for this module will be apparent as I use it.

Before we begin, I'll write some functions to generate the distributions for plotting. For this tutorial, I'll use Bokeh.

In [2]:
def plot_pmf(p, x, dist, params, **kwargs):
    """
    Generate plot of PMF at specified values of x.
    """
    y = dist.pmf(x, *params, **kwargs)
    p.circle(x, y, size=7, color='dodgerblue')
    p.segment(x0=x, x1=x, y0=0, y1=y, line_width=3, color='dodgerblue')
    return p


def plot_pdf(p, x, dist, params, **kwargs):
    """
    Generate plot of PDF at specified values of x.
    """
    y = dist.pdf(x, *params, **kwargs)
    p.line(x, y, line_width=3, color='dodgerblue')
    return p


def plot_discrete_cdf(p, x, dist, params, show_segments=False, **kwargs):
    """
    Plot CDF of discrete distribution.
    """
    y = dist.cdf(x, *params, **kwargs)
    if show_segments:
        p.circle(x[:-1], y[:-1], size=7, color='dodgerblue')
        p.segment(x0=x[:-1], x1=x[1:], y0=y[:-1], y1=y[:-1], line_width=3,
                  color='dodgerblue')
    else:
        p.circle(x, y, size=7, color='dodgerblue')
    return p


def plot_continuous_cdf(p, x, dist, params, **kwargs):
    """
    Plot CDF of continuous distribution.
    """
    y = dist.cdf(x, *params, **kwargs)
    p.line(x, y, line_width=3, color='dodgerblue')
    return p


def plot_dists(x, dist, params, param_names, dist_name, **kwargs):
    """
    Plot PDF/PMF and CDF next to each other.
    """
    # Tools for plots
    tools='pan,wheel_zoom,reset'
    
    # Title for plots
    t1 = dist_name + ', ' \
            + ''.join([param_names[i] + ' = ' + str(params[i]) + ', ' 
                                      for i in range(len(params) - 1)])
    t1 += param_names[-1] + ' = ' + str(params[-1])
    
    # Last half of y-axis label
    ylabel = ''.join(pname + ', ' for pname in param_names[:-1])
    ylabel += param_names[-1] + ')'
    
    # Set up plots
    p1 = bokeh.plotting.figure(width=325, height=250, tools=tools, title=t1)
    p2 = bokeh.plotting.figure(width=325, height=250, tools=tools, 
                               y_range=[-0.05, 1.05])

    # Make plots
    if hasattr(dist, 'pmf'):
        p1 = plot_pmf(p1, x, dist, params, **kwargs)
        p1.xaxis.axis_label = 'k'
        p1.yaxis.axis_label = 'P(k; ' + ylabel
        p2 = plot_discrete_cdf(p2, x, dist, params, **kwargs)
        p2.xaxis.axis_label = 'k'
        p2.yaxis.axis_label = 'F(k; ' + ylabel
    else:
        p1 = plot_pdf(p1, x, dist, params, **kwargs)
        p1.xaxis.axis_label = 'x'
        p1.yaxis.axis_label = 'P(x; ' + ylabel
        p2 = plot_continuous_cdf(p2, x, dist, params, **kwargs)
        p2.xaxis.axis_label = 'x'
        p2.yaxis.axis_label = 'F(x; ' + ylabel
        
    # Link the x-axes
    p1.x_range = p2.x_range

    return bokeh.layouts.row([p1, p2])

With these in place, we now describe some useful named distributions. For each, I will give the mathematical form of the PDF or PMF, but will omit the CDF, because CDFs are often expressed in terms of special functions, such as regularized incomplete beta functions or error functions. Remember that in general, for a discrete distribution

\begin{align} F(k) = \sum_{k'=k_\mathrm{min}}^k P(k'), \end{align}

and for a continuous distribution,

\begin{align} F(x) = \int_{-\infty}^x \mathrm{d}x'\,P(x'). \end{align}

I do plot the CDF for each distribution.

Discrete distributions

Bernoulli distribution

Story. A single trial with either a success ($k = 1$) or failure ($k = 0$) is performed. Such a trial is called a Bernoulli trial. The Bernoulli distribution defines the probability of getting each outcome.

Parameter. The Bernoulli distribution is parametrized by a single value, $p$, the probability that the trial is successful.

Example. Check to see if a given bacterium is competent, given that it has probability $p$ of being competent.

Probability mass function.

\begin{align} P(k;p) = \left\{ \begin{array}{ccc} 1-p & & k = 0 \\[0.5em] p & & k = 1 \\[0.5em] 0 & & \text{otherwise.} \end{array} \right. \end{align}
In [3]:
k = np.array([0, 1])
p = plot_dists(k, st.bernoulli, (0.4,), ('p',), 'Bernoulli')
bokeh.io.show(p)

Geometric distribution

Story. We perform a series of Bernoulli trials until we get a success. We have $k$ failures before the success.

Parameter. The Geometric distribution is parametrized by a single value, $p$, the probability that the Bernoulli trial is successful.

Example. Consider actin polymerization. At each time step, an actin monomer may add to the filament ("failure"), or an actin monomer may fall off ("success") with (usually very low) probability $p$. The lengths of actin filaments are Geometrically distributed.

Probability mass function.

\begin{align} P(k;p) = (1-p)^k p. \end{align}

Notes. The Geometric distribution is only defined for non-negative integer $k$.
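
To see the story in action, we can simulate Bernoulli trials until a success and compare the fraction of sequences with $k$ failures to the Geometric PMF. This sketch is not part of the original notebook; the choice $p = 0.4$ matches the plot below.

def failures_before_success(p, size=10000):
    """Simulate Bernoulli trials until a success; return the number of failures for each sequence."""
    n_failures = np.empty(size, dtype=int)
    for i in range(size):
        k = 0
        while np.random.rand() >= p:   # failure with probability 1 - p
            k += 1
        n_failures[i] = k
    return n_failures

samples = failures_before_success(0.4)
for k in range(3):
    print(k, (samples == k).mean(), st.geom.pmf(k, 0.4, loc=-1))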

In [4]:
k = np.arange(11)
# scipy's geom counts the number of trials up to and including the first success;
# loc=-1 shifts it to count the number of failures before the success
p = plot_dists(k, st.geom, (0.4,), ('p',), 'Geometric', **{'loc': -1})
bokeh.io.show(p)

Negative Binomial distribution

Story. We perform a series of Bernoulli trials until we get $r$ successes. The number of failures, $k$, before we get $r$ successes is Negative Binomially distributed.

Parameters. There are two parameters: the probability $p$ of success for each Bernoulli trial, and the desired number of successes, $r$.

Example. Bursty gene expression can give mRNA count distributions that are Negative Binomially distributed. Here, "success" is that a burst in gene expression stops. So, the parameter $p$ is related to the length of a burst in expression (lower $p$ means a longer burst). The parameter $r$ is related to the frequency of the bursts. If multiple bursts are possible within the lifetime of mRNA, then $r > 1$. Then, the number of "failures" is the number of mRNA transcripts that are made in the characteristic lifetime of mRNA.

Probability mass function.

\begin{align} P(k;r,p) = \begin{pmatrix} k+r-1 \\ r-1 \end{pmatrix} p^r (1-p)^k. \end{align}

Here, we use a combinatorial notation:

\begin{align} \begin{pmatrix} k+r-1 \\ r-1 \end{pmatrix} = \frac{(k+r-1)!}{(r-1)!\,k!}. \end{align}

Notes. If $r = 1$, this distribution reduces to the Geometric distribution.
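
As with the Geometric distribution, we can check the story by simulation: perform Bernoulli trials until $r$ successes and tally the failures. This sketch is not part of the original notebook; $r = 3$ and $p = 0.4$ match the plot below.

def failures_before_r_successes(r, p, size=10000):
    """Number of failures before accumulating r successes in repeated Bernoulli trials."""
    n_failures = np.empty(size, dtype=int)
    for i in range(size):
        successes, failures = 0, 0
        while successes < r:
            if np.random.rand() < p:
                successes += 1
            else:
                failures += 1
        n_failures[i] = failures
    return n_failures

samples = failures_before_r_successes(3, 0.4)
print(samples.mean(), st.nbinom.mean(3, 0.4))   # both ≈ r(1-p)/p = 4.5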

In [5]:
k = np.arange(21)
p = plot_dists(k, st.nbinom, (3, 0.4), ('r', 'p',), 'Negative Binomial')
bokeh.io.show(p)

Binomial distribution

Story. We perform $n$ Bernoulli trials with probability $p$ of success. The number of successes, $k$, is binomially distributed.

Parameters. There are two parameters: the probability $p$ of success for each Bernoulli trial, and the number of trials, $n$.

Example. Distribution of plasmids between daughter cells in cell division. Each of the $n$ plasmids has a chance $p$ of being in daughter cell 1 ("success"). The number of plasmids, $k$, in daughter cell 1 is binomially distributed.

Probability mass function.

\begin{align} P(k;n,p) = \begin{pmatrix} n \\ k \end{pmatrix} p^k (1-p)^{n-k}. \end{align}
In [6]:
k = np.arange(21)
p = plot_dists(k, st.binom, (20, 0.4), ('n', 'p',), 'Binomial')
bokeh.io.show(p)

Poisson distribution

Story. Rare events occur with a rate $\lambda$ per unit time. There is no "memory" of previous events; i.e., that rate is independent of time. A process that generates such events is called a Poisson process. The occurrence of a rare event in this context is referred to as an arrival. The number $k$ of arrivals in unit time is Poisson distributed.

Parameter. The single parameter is the rate $\lambda$ of the rare events occurring.

Example. The number of mutations in a strand of DNA per unit length (since mutations are rare) is Poisson distributed.

Probability mass function. \begin{align} P(k;\lambda) = \frac{\lambda^k}{k!}\,\mathrm{e}^{-\lambda}. \end{align}

Notes. The Poisson distribution is a limit of the binomial distribution in which the number of trials goes to infinity, but the expected number of successes, $np$, stays fixed. Thus,

\begin{align} P_\mathrm{Poisson}(k;\lambda) \approx P_\mathrm{Binomial}(k;n, p), \end{align}

with $\lambda = np$. Considering the biological example of mutations: there are $n$ bases, each with a probability $p$ of mutation, so the number of mutations, $k$, is binomially distributed. Since $n$ is large and $p$ is small, the distribution is approximately Poisson.
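
We can check this limit numerically. This sketch is not in the original notebook; $n$ and $p$ are chosen arbitrarily so that $\lambda = np = 1$ matches the Poisson plot below.

# Binomial PMF for large n and small p versus the Poisson PMF with λ = np
k = np.arange(6)
n, p_mut = 1000, 0.001                   # λ = np = 1
print(st.binom.pmf(k, n, p_mut))
print(st.poisson.pmf(k, n * p_mut))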

In [7]:
k = np.arange(11)
p = plot_dists(k, st.poisson, (1,), ('λ',), 'Poisson')
bokeh.io.show(p)

Hypergeometric distribution

Story. Consider an urn with $w$ white balls and $b$ black balls. Draw $n$ balls from this urn without replacement. The number of white balls drawn, $k$, is Hypergeometrically distributed.

Parameters. There are three parameters: the number of draws $n$, the number of white balls $w$, and the number of black balls $b$.

Example. There are $N$ finches on an island, and $n_t$ of them are tagged. You capture $n$ finches. The number of tagged finches $k$ is Hypergeometrically distributed, $P(k;n_t, N-n_t, n)$, as defined below.

Probability mass function.

\begin{align} P(k;w, b, n) = \frac{\begin{pmatrix}w\\k\end{pmatrix}\begin{pmatrix}b\\n-k\end{pmatrix}} {\begin{pmatrix}w+b\\n\end{pmatrix}}. \end{align}

Alternatively, if we define $N = w + b$, we could write

\begin{align} P(k;N, w, n) = \frac{\begin{pmatrix}w\\k\end{pmatrix}\begin{pmatrix}N-w\\n-k\end{pmatrix}} {\begin{pmatrix}N\\n\end{pmatrix}}. \end{align}

Notes. This distribution is analogous to the Binomial distribution, except that the Binomial distribution describes draws from an urn with replacement. In the analogy, $p = w/(w+b)$.
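
To see the connection, compare the two PMFs when the urn holds many more balls than we draw, in which case drawing with or without replacement hardly matters. This sketch is not in the original notebook and the numbers are arbitrary.

# A large urn: drawing without replacement ≈ drawing with replacement
w, b, n = 500, 1000, 10
k = np.arange(11)
print(st.hypergeom.pmf(k, w + b, w, n))   # without replacement
print(st.binom.pmf(k, n, w / (w + b)))    # with replacement, p = w/(w+b)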

In [8]:
k = np.arange(11)
p = plot_dists(k, st.hypergeom, (25, 5, 10), ('N', 'w', 'n'), 'Hypergeometric')
bokeh.io.show(p)

Continuous distributions

Uniform distribution

Story. Any outcome in a given range has equal probability.

Parameters. The Uniform distribution is not defined on an infinite or semi-infinite domain, so the bounds, $x_\mathrm{min}$ and $x_\mathrm{max}$, are necessary parameters.

Example. Anything in which all possibilities are equally likely.

Probability density function.

\begin{align} P(x;x_\mathrm{min}, x_\mathrm{max}) = \left\{ \begin{array}{ccc} \frac{1}{x_\mathrm{max} - x_\mathrm{min}} & & x_\mathrm{min} \le x \le x_\mathrm{max} \\[0.5em] 0 & & \text{otherwise.} \end{array} \right. \end{align}
In [9]:
x = np.array([0, 1])
p = plot_dists(x, st.uniform, (0, 1), ('xmin', 'xmax'), 'Uniform')
bokeh.io.show(p)

Gaussian (a.k.a. Normal) distribution

Story. Any quantity that emerges from a large number of subprocesses tends to be Gaussian distributed provided none of the subprocesses is very broadly distributed.

Parameters. The Gaussian distribution has two parameters, the mean $\mu$, which determines the location of its peak, and the standard deviation $\sigma$, which is strictly positive (the $\sigma\to 0$ limit defines a Dirac delta function) and determines the width of the peak.

Example. We measure the length of many C. elegans eggs. The lengths are Gaussian distributed.

Probability density function.

\begin{align} P(x;\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}\,\mathrm{e}^{-(x-\mu)^2/2\sigma^2}. \end{align}

Notes. This is a limiting distribution in the sense of the central limit theorem, but also in that many distributions have a Gaussian distribution as a limit. This is seen by formally taking limits of, e.g., the Gamma, Student-t, Binomial distributions, which allows direct comparison of parameters.
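
A quick numerical illustration of the central limit theorem (a sketch that is not in the original notebook): the sum of many Uniform random numbers, none of which is broadly distributed, is approximately Gaussian.

# Sum 50 Uniform(0, 1) draws, repeated 100,000 times
sums = np.random.uniform(0, 1, size=(100000, 50)).sum(axis=1)

# Compare sample mean and variance to the Gaussian limit (mean 25, variance 50/12)
print(sums.mean(), sums.var())
print(st.norm.fit(sums))   # maximum likelihood estimates of µ and σ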

In [10]:
x = np.linspace(0, 20, 200)
p = plot_dists(x, st.norm, (10, 2), ('µ', 'σ'), 'Gaussian')
bokeh.io.show(p)

Log-Normal distribution

Story. If $\ln x$ is Gaussian distributed, $x$ is Log-Normally distributed.

Parameters. As for a Gaussian, there are two parameters: the mean of the logarithm, $\ln \mu$, and the variance of the logarithm, $\sigma^2$.

Example. A measure of fold change in gene expression can be Log-Normally distributed.

Probability density function.

\begin{align} P(x;\mu, \sigma) = \frac{1}{x\sqrt{2\pi \sigma^2}}\,\mathrm{e}^{-(\ln x - \ln \mu)^2/2\sigma^2}. \end{align}
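
We can connect directly to the story by exponentiating Gaussian samples (a sketch not in the original notebook; the parameters are arbitrary).

# If ln x is Gaussian with mean ln µ and standard deviation σ, then x is Log-Normal
ln_mu, sigma = 0.0, 0.5                 # µ = 1
x_samples = np.exp(np.random.normal(ln_mu, sigma, size=100000))
print(x_samples.mean(), st.lognorm.mean(sigma, scale=np.exp(ln_mu)))
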
In [11]:
x = np.linspace(0, 10, 200)
# scipy's lognorm is parametrized as (s=σ, loc, scale=µ); loc must be zero to match the PDF above
p = plot_dists(x, st.lognorm, (0.5,), ('σ',), 'Log-normal', **{'loc': 0, 'scale': 1})
p.children[0].title.text = 'Log-normal, σ = 0.5, µ = 1'
bokeh.io.show(p)

Von Mises distribution

Story. Gaussian, except on a periodic domain.

Parameters. As for a Gaussian, with $\mu$ being the location of the peak, and $\beta$ being a concentration parameter, analogous to the inverse of the variance.

Example. Repeated measurements on a periodic domain, e.g., the location of an ingression along the azimuthal angle of a developing embryo.

Probability density function.

\begin{align} P(\theta;\mu, \beta) = \frac{1}{2\pi I_0(\beta)}\,\mathrm{e}^{\beta \cos(\theta - \mu)}, \end{align}

where $I_0(\beta)$ is a modified Bessel function of the first kind.

In [12]:
x = np.linspace(0, 2*np.pi, 200)
p = plot_dists(x, st.vonmises, (0.5, 3.14), ('β', 'µ'), 'Von Mises')
bokeh.io.show(p)

Chi-square distribution

Story. If $X_1$, $X_2$, $\ldots$, $X_n$ are Gaussian distributed with zero mean and unit variance, $X_1^2 + X_2^2 + \cdots + X_n^2$ is $\chi^2$-distributed with $n$ degrees of freedom.

Parameters. There is only one parameter, the degrees of freedom $n$.

Probability density function. \begin{align} P(x;n) \equiv \chi^2_n(x;n) = \frac{1}{2^{n/2}\Gamma\left(\frac{n}{2}\right)}\, x^{\frac{n}{2}-1}\,\mathrm{e}^{-x/2}. \end{align}

In [13]:
x = np.linspace(0, 30, 200)
p = plot_dists(x, st.chi2, (10,), ('n',), 'Chi-square')
bokeh.io.show(p)

Student-t/Cauchy distribution

Story. We get this distribution whenever we marginalize an unknown $\sigma$ out of a Gaussian distribution with a Jeffreys prior for $\sigma$.

Parameters. The Student-t distribution is peaked, and its peak is located at $m$. The peak's width is dictated by parameter $s$. Finally, we define the "degrees of freedom" as $n$.

Example. The story says it all!

Probability density function.

\begin{align} P(x;m, s, n) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)\sqrt{\pi n s^2}}\, \left(1 + \frac{(x-m)^2}{ns^2}\right)^{-\frac{n+1}{2}}. \end{align}

Notes. For $n\to \infty$, we get a Gaussian distribution. When $n = 1$, we get the Cauchy distribution,

\begin{align} P(x;m, s) = \left[\pi s \left(1 + \left(\frac{x-m}{s}\right)^2\right)\right]^{-1}. \end{align}
In [14]:
x = np.linspace(0, 20, 200)
p = plot_dists(x, st.t, (5, 10, 2), ('n', 'm', 's'), 'Student-t')
bokeh.io.show(p)

Exponential distribution

Story. This is the waiting time for an arrival from a Poisson process. I.e., the inter-arrival time of a Poisson process is Exponentially distributed.

Parameter. The single parameter is the average arrival rate, $r$.

Example. The time between conformational switches in a protein is Exponentially distributed (under simple mass action kinetics).

Probability density function. \begin{align} P(x;r) = r \mathrm{e}^{-rx}. \end{align}

Alternatively, we could parametrize it as

\begin{align} P(x;\lambda) = \frac{1}{\lambda}\, \mathrm{e}^{-x/\lambda}. \end{align}

Notes. The Exponential distribution is the continuous analog of the Geometric distribution. The "rate" in the Exponential distribution is analogous to the probability of success of the Bernoulli trial. Note also that because a Poisson process has no memory, the amount of time between any two arrivals is independent of all other inter-arrival times.
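
Memorylessness is easy to verify numerically: for an Exponential distribution, $P(X > s + t \mid X > s) = P(X > t)$. This sketch is not in the original notebook and the rate is arbitrary.

# Memorylessness: P(X > s + t | X > s) = P(X > t)
r = 2.0                                     # arrival rate
samples = st.expon.rvs(scale=1/r, size=1000000)
s, t = 0.5, 1.0
p_cond = (samples > s + t).sum() / (samples > s).sum()
print(p_cond, np.exp(-r * t))               # both ≈ P(X > t)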

In [15]:
x = np.linspace(0, 10, 200)
p = plot_dists(x, st.expon, (0,), ('λ',), 'Exponential')
p.children[0].title.text = 'Exponential, λ = 1'
bokeh.io.show(p)

Gamma distribution

Story. The amount of time we have to wait for $a$ arrivals of a Poisson process. More concretely, if the inter-arrival times $X_1$, $X_2$, $\ldots$, $X_a$ are Exponentially distributed, $X_1 + X_2 + \cdots + X_a$ is Gamma distributed.

Parameters. The number of arrivals, $a$, and the rate of arrivals, $r$.

Example. Any multistep process where each step happens at the same rate. This is common in molecular rearrangements, and we will use it in class to describe the nature of processes triggering microtubule catastrophe.

Probability density function.

\begin{align} P(x;a, r) = \frac{1}{\Gamma(a)}\,\frac{(rx)^a}{x}\,\mathrm{e}^{-rx}, \end{align}

where $\Gamma(a)$ is the gamma function.

Notes. The Gamma distribution is the continuous analog of the Negative Binomial distribution.
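
We can check the story by adding up Exponentially distributed inter-arrival times. This sketch is not in the original notebook; $a = 3$ and $r = 1$ match the plot below.

# Total waiting time for a = 3 arrivals is the sum of 3 Exponential inter-arrival times
a, r = 3, 1.0
waiting_times = st.expon.rvs(scale=1/r, size=(100000, a)).sum(axis=1)

# Compare sample mean and variance to the Gamma distribution's a/r and a/r²
print(waiting_times.mean(), waiting_times.var())
print(st.gamma.mean(a, scale=1/r), st.gamma.var(a, scale=1/r))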

In [16]:
x = np.linspace(0, 10, 200)
p = plot_dists(x, st.gamma, (3, 0), ('a', 'r'), 'Gamma')
p.children[0].title.text = 'Gamma, a = 3, r = 1'
bokeh.io.show(p)

Weibull distribution

Story. Distribution of $x = y^{1/\beta}$ if $y$ is Exponentially distributed. For $\beta > 1$, the longer we have waited, the more likely the event is to come, and vice versa for $\beta < 1$.

Parameters. There are two parameters, both strictly positive: the shape parameter $\beta$, which dictates the shape of the curve, and the scale parameter $\lambda$, which dictates the rate of arrivals of the event.

Example. This is a model for aging. The longer an organism lives, the more likely it is to die.

Probability density function.

\begin{align} P(x;\lambda, \beta) = \frac{\beta}{\lambda}\left(\frac{x}{\lambda}\right)^{\beta - 1}\, \mathrm{e}^{-(x/\lambda)^\beta}. \end{align}
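
We can also verify the story by transforming Exponential samples. This sketch is not in the original notebook; $\beta = 3$ and $\lambda = 1$ match the plot below.

# If y is Exponentially distributed with rate 1, then x = λ y**(1/β) is Weibull distributed
beta, lam = 3, 1.0
y = st.expon.rvs(size=100000)
x_samples = lam * y**(1 / beta)

# Compare the sample mean to the Weibull mean, λΓ(1 + 1/β)
print(x_samples.mean(), st.weibull_min.mean(beta, scale=lam))
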
In [17]:
x = np.linspace(0, 3, 200)
# st.weibull_min is scipy's Weibull distribution (st.frechet_r was an alias that has since been removed)
p = plot_dists(x, st.weibull_min, (3, 0), ('β', 'λ'), 'Weibull')
p.children[0].title.text = 'Weibull, β = 3, λ = 1'
bokeh.io.show(p)