Parameter estimation by optimization case study: Normal likelihood

Data set download

https://s3.amazonaws.com/bebi103.caltech.edu/data/good_invitro_droplet_data.csv

[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import statsmodels.tools.numdiff as smnd

import tqdm

import bebi103

import holoviews as hv
hv.extension('bokeh')
bebi103.hv.set_defaults()

import bokeh.io
bokeh.io.output_notebook()

We will begin our exploration of parameter estimation by optimization using, once again, the data set from Good et al., Science, 342, 856-860, 2013. You should refamiliarize yourself with the data set if you need to.

Exploratory data analysis

To start, let’s make a quick plot of the spindle length versus droplet data, which you can download here: https://s3.amazonaws.com/bebi103.caltech.edu/data/good_invitro_droplet_data.csv.

[2]:
# Load in Data Frame
df = pd.read_csv(os.path.join(data_path, "good_invitro_droplet_data.csv"), comment="#")

# Pull out numpy arrays
ell = df["Spindle Length (um)"].values
d = df["Droplet Diameter (um)"].values

# Make a plot
data_plot = hv.Scatter(
    data=df,
    kdims="Droplet Diameter (um)",
    vdims="Spindle Length (um)",
).opts(
    xlim=(0, None),
    ylim=(0, None),
)

data_plot

(Interactive Bokeh plot: scatter plot of spindle length (µm) versus droplet diameter (µm).)

Our goal is to get parameter estimates for models describing how spindle length varies with droplet size. We will start with the independent size model, in which we assume that spindle length is independent of droplet diameter.

Independent size model

We again choose a Normal likelihood for the spindle size, assuming all spindle lengths are distributed about the same location parameter \(\phi\):

\begin{align} l_i \sim \text{Norm}(\phi, \sigma) \;\forall i. \end{align}

In specifying this likelihood, we see that there are two parameters to estimate, \(\phi\) and \(\sigma\). To choose priors for these, I reason as follows.

Now that we have thought a bit more about Bayesian modeling, we will abandon using Jeffreys priors and instead use weakly informative priors. The parameter \(\phi\) is necessarily positive and is the characteristic length of a spindle. I doubt any spindle would be as small as a bacterial cell, which is about one micron in length. On the other hand, I’m not sure how much longer it could be. I will therefore choose a prior distribution that is pushed away from zero but has a heavy tail toward larger values. A Log-Normal distribution fits this bill. I will choose a location parameter of 3, which puts the median of the prior at \(\mathrm{e}^3 \approx 20\) µm, and a scale parameter of 3/4, which gives an appropriate tail. So, I have

\begin{align} \phi \sim \text{LogNorm}(3, 0.75). \end{align}

Note that for brevity, I do not include units in the specifications of the distributions; units of microns are assumed.
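
As a quick check on this choice, we can compute where the central 95% of the prior lies (a quick sketch using the percent point function).

# Central 95% of the LogNorm(3, 0.75) prior; spans roughly 5 to 90 µm
st.lognorm.ppf([0.025, 0.975], 0.75, loc=0, scale=np.exp(3))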

To understand the prior, here is a plot.

[3]:
phi = np.linspace(0, 100, 200)

hv.Curve(
    (phi, st.lognorm.pdf(phi, 0.75, loc=0, scale=np.exp(3))),
    kdims='ϕ [µm]',
    vdims='g(ϕ) [1/µm]',
).opts(
    frame_height=200,
    xlim=(0, 100),
)

(Interactive Bokeh plot of the Log-Normal prior for ϕ.)

The parameter \(\sigma\) describes how the data vary off of the characteristic value of \(\phi\). I am not really sure how variable spindle length is, so I will choose a weakly informative Half-Normal prior.

\begin{align} \sigma \sim \text{HalfNorm}(20). \end{align}

Here is a plot of this prior.

[4]:
sigma = np.linspace(0, 100, 200)

hv.Curve(
    (sigma, st.halfnorm.pdf(sigma, 0, 20)),
    kdims='σ [µm]',
    vdims='g(σ) [1/µm]',
).opts(
    frame_height=200,
    xlim=(0, 100),
)

(Interactive Bokeh plot of the Half-Normal prior for σ.)

Note that zero is the most probable value of \(\sigma\), and I don’t really believe this, so the Half-Normal is probably not the very best choice for capturing my prior knowledge. Nonetheless, this prior is very broad, so these details will have little effect on the posterior.
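
To see just how broad, here is a quick computation of the 95th percentile of the prior; spindle-to-spindle variability of tens of microns is comfortably within its reach.

# 95th percentile of the HalfNorm(20) prior; about 39 µm
st.halfnorm.ppf(0.95, 0, 20)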

Thus, my complete generative model for the independent spindle size model is

\begin{align} &\phi \sim \text{LogNorm}(3, 0.75), \\[1em] &\sigma \sim \text{HalfNorm}(20),\\[1em] &l_i \sim \text{Norm}(\phi, \sigma) \;\forall i. \end{align}

Estimation of the MAP parameters

We could plot the entire posterior distribution (and in fact we did, albeit with different priors, in a previous lesson) to get a full understanding of the posterior. Instead, we will compute the MAP and approximate the posterior as Normal near the MAP.

Finding the MAP amounts to an optimization problem in which we find the arguments (parameters) for which a function (the posterior distribution) is maximal. It is almost always much easier to find the maximum for the logarithm of the posterior. In fact, in the vast majority of Bayesian inference problems, we work with log posteriors instead of the posteriors themselves. Since the logarithm function is monotonic, a maximum of the log posterior is also a maximum of the posterior. So, our first step is to code up a function that returns the log posterior.

In considering a log posterior, it is useful to know that

\begin{align} \ln g(\theta \mid y) = \text{constant} + \ln f(y \mid \theta) + \ln g(\theta). \end{align}

Furthermore, if the priors for each parameter are independent, such that

\begin{align} g(\theta) = g(\theta_1)\, g(\theta_2)\cdots, \end{align}

and the measurements are independent and identically distributed (i.i.d.), such that

\begin{align} f(y\mid \theta) = f(y_1 \mid \theta)\,f(y_2 \mid \theta)\,f(y_3 \mid \theta)\cdots, \end{align}

then

\begin{align} \ln g(\theta \mid y) = \text{constant} + \sum_i \ln f(y_i \mid \theta) + \sum_j \ln g(\theta_j). \end{align}

This means that we can construct the log posterior by summing up all of the terms in the log likelihood and log prior.
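
For the independent size model specifically, with \(n\) measured spindle lengths \(l_i\), the Normal log likelihood term works out to

\begin{align} \sum_i \ln f(l_i \mid \phi, \sigma) = -n \ln \sigma - \frac{1}{2\sigma^2}\sum_i \left(l_i - \phi\right)^2 + \text{constant}, \end{align}

and the log priors for \(\phi\) and \(\sigma\) are added on to give the log posterior.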

Fortunately, we do not have to hand-code any of the mathematical expressions for the log PDFs; we can use the built-in scipy.stats functions.

[5]:
def log_prior(phi, sigma):
    """Log prior for independent size model"""
    lp = st.lognorm.logpdf(phi, 0.75, loc=0, scale=np.exp(3))
    lp += st.halfnorm.logpdf(sigma, 0, 20)

    return lp


def log_likelihood(ell, phi, sigma):
    """Log likelihood for independent size model"""
    return np.sum(st.norm.logpdf(ell, phi, sigma))


def log_posterior(params, ell):
    """Log posterior of indpendent size model."""
    phi, sigma = params

    lp = log_prior(phi, sigma)
    if lp == -np.inf:
        return lp

    return lp + log_likelihood(ell, phi, sigma)

Now that we have the log posterior, we can perform numerical optimization. The function scipy.optimize.minimize() offers many options for algorithms, which you can read about in the documentation. I find that Powell’s method tends to work well. It does not rely on derivative information, so discontinuities don’t hurt, nor do the \(-\infty\) values we get in the log posterior for parameter values that are out of bounds. That is particularly important for the present example because we have hard bounds on \(\phi\) and \(\sigma\); both must be positive. We could use constrained optimizers such as L-BFGS-B or COBYLA (we will do a quick cross-check with L-BFGS-B after computing the MAP), but Powell’s method generally suffices. It is sometimes slow, though.

The optimizers offered by scipy.optimize.minimize() find minimizers, so we need to define our objective function as the negative log posterior.

[6]:
def neg_log_posterior(params, ell):
    return -log_posterior(params, ell)

Next, we have to provide an initial guess for parameter values. The optimization routine only finds a local maximum and is not in general guaranteed to converge. Therefore, the initial guess can be very important. Looking at the data, we would expect \(\phi\) to be somewhere around 35 µm and \(\sigma\) to be around 7 µm.
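
If we did not want to eyeball the plot, a reasonable alternative (sketched here) is to seed the optimizer with the sample mean and standard deviation of the measured spindle lengths; we proceed with the hand-picked guess below.

# Data-driven alternative initial guess: sample mean and standard deviation of ell
params_0_from_data = np.array([np.mean(ell), np.std(ell)])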

[7]:
params_0 = np.array([35, 7])

We are now ready to use scipy.optimize.minimize() to compute the MAP. We use the args kwarg to pass in the other arguments to the neg_log_posterior() function. In our case, these arguments are the data points. The scipy.optimize.minimize() function returns an object that has several attributes about the resulting optimization. Most important is x, the optimal parameter values.

[8]:
# Specify extra arguments into posterior function
args = (ell,)

# Compute the MAP
res = scipy.optimize.minimize(
    neg_log_posterior, params_0, args=args, method="powell"
)

# Extract optimal parameters
popt = res.x

# For convenience...
phi_MAP, sigma_MAP = popt

# Print results
print(
    """
Most probable parameters
------------------------
φ = {0:.2f} µm
σ = {1:.3f} µm
""".format(
        phi_MAP, sigma_MAP
    )
)

Most probable parameters
------------------------
φ = 32.86 µm
σ = 4.784 µm
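
As a quick cross-check on Powell’s method, here is a sketch using the bounded L-BFGS-B solver, with explicit bounds keeping \(\phi\) and \(\sigma\) positive; its result should closely match the values above.

# Cross-check: bounded optimization with L-BFGS-B
res_lbfgsb = scipy.optimize.minimize(
    neg_log_posterior,
    params_0,
    args=args,
    method="L-BFGS-B",
    bounds=[(1e-6, None), (1e-6, None)],  # φ > 0, σ > 0
)

# Optimal parameter values; should agree with the Powell result
res_lbfgsb.x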

We have successfully found the MAP, which provides just that: the maximally probable parameter values. This tells us only one piece of information about the posterior: where it is maximal. Equivalently, we know where all of the first derivatives of the log posterior are zero. We want to learn more about the posterior, at least near the MAP. To that end, we can compute all of the second derivatives of the log posterior near the MAP. Using the second derivatives (the Hessian), we can start to understand the posterior, at least locally, by approximating it as Normal.

Normal approximation of the posterior

To compute the Normal approximation of the posterior, our task is two-fold. First, find the parameter values that give the maximal posterior probability, which we have already done. Second, compute the Hessian of the log posterior at the MAP; its negative inverse is the covariance matrix of the approximating Normal distribution. Note that the covariance matrix is not obtained by inverting each element of the Hessian; it is the matrix inverse of the (negative) Hessian. The credible interval for each parameter can then be calculated from the diagonal elements of the covariance matrix.
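
Concretely, writing the MAP parameter values as \(\theta^*\) and the number of parameters as \(k\), the Normal approximation of the posterior is

\begin{align} g(\theta \mid y) \approx \frac{1}{\sqrt{(2\pi)^k \det \mathsf{\Sigma}}}\,\exp\left[-\frac{1}{2}\left(\theta - \theta^*\right)^\mathsf{T}\cdot\mathsf{\Sigma}^{-1}\cdot\left(\theta - \theta^*\right)\right], \end{align}

where the covariance matrix is \(\mathsf{\Sigma} = -\mathsf{H}^{-1}\), with \(\mathsf{H}\) the Hessian of the log posterior evaluated at \(\theta^*\).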

To compute the covariance matrix, we need to compute the Hessian of the log of the posterior at the MAP. We will do this numerically using the statsmodels.tools.numdiff module, which we imported as smnd. We use the function smnd.approx_hess() to compute the Hessian at the optimal parameter values. Note that since we already have the log posterior coded up and the MAP parameter values, we can just directly shove these into the numerical Hessian calculator.

[9]:
hes = smnd.approx_hess(popt, log_posterior, args=args)

Now that we have the Hessian, we take its negative inverse to get the covariance matrix.

[10]:
# Compute the covariance matrix
cov = -np.linalg.inv(hes)

# Look at it
cov
[10]:
array([[ 3.41658852e-02, -1.39151314e-05],
       [-1.39151314e-05,  1.70799661e-02]])

The diagonal terms give the approximate variance in the parameters in the posterior. The off-diagonal terms give the covariance, which describes how the parameters relate to each other. Nonzero covariance indicates that they are not completely independent.
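
To put the off-diagonal entry on a more interpretable scale, we can convert it to a correlation coefficient (a quick sketch; this is not needed for the credible intervals below).

# Correlation between the parameters under the Normal approximation; essentially zero here
cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])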

Credible intervals

We can compute the approximate 95% credible intervals for the parameters by multiplying the square roots of the diagonal terms of the posterior covariance matrix \(\mathsf{\Sigma}\) (that is, the approximate standard deviations) by 1.96.

[11]:
# Report results
print("""
Results for independent size model (≈ 95% of total probability density)
------------------------------------------------------------------------
ϕ = {0:.2f} ± {1:.2f} µm
σ = {2:.3f} ± {3:.3f} µm
""".format(phi_MAP, 1.96*np.sqrt(cov[0,0]), sigma_MAP, 1.96*np.sqrt(cov[1,1])))

Results for independent size model (≈ 95% of total probability density)
------------------------------------------------------------------------
ϕ = 32.86 ± 0.36 µm
σ = 4.784 ± 0.256 µm

How good is the approximation?

To assess how close the Normal approximation is to the posterior, let’s plot the Normal approximation and the exact posterior on the same contour plot. We can use the scipy.stats.multivariate_normal.pdf() function to generate the PDF for the Normal approximation.

The calculation will take a while, since we have to compute the whole posterior in the neighborhood of the MAP to plot it. The whole reason we used optimization in the first place was to avoid this calculation! But we are doing it here to visualize the Normal approximation.

[12]:
# Set up ranges for plots; should be near MAP
phi = np.linspace(32.4, 33.5, 200)
sigma = np.linspace(4.4, 5.2, 200)
PHI, SIGMA = np.meshgrid(phi, sigma)

# Compute posteriors
post_norm = np.empty_like(PHI)
log_post_exact = np.empty_like(PHI)
for i in tqdm.tqdm(range(PHI.shape[0])):
    for j in range(PHI.shape[1]):
        post_norm[i, j] = st.multivariate_normal.pdf(
            (PHI[i, j], SIGMA[i, j]), popt, cov
        )
        log_post_exact[i, j] = log_posterior((PHI[i, j], SIGMA[i, j]), ell)

# Compute exact posterior for plotting
post_exact = np.exp(log_post_exact - log_post_exact.max())

# Make contours
p = bebi103.viz.contour(
    PHI, SIGMA, post_exact, x_axis_label="φ [µm]", y_axis_label="σ [µm]"
)
p = bebi103.viz.contour(
    PHI, SIGMA, post_norm, line_kwargs=dict(line_color="orange"), p=p
)
bokeh.io.show(p)
100%|██████████| 200/200 [00:25<00:00,  7.96it/s]

The Normal approximation, shown in orange, is not that far off from the posterior, especially at the peak. This is not always the case! This is one of the main dangers of using optimization: you are using very local information to summarize a posterior that may have relevant features away from the MAP.

Computing environment

[13]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,tqdm,statsmodels,bokeh,holoviews,bebi103,jupyterlab
Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

numpy      : 1.19.2
pandas     : 1.1.5
scipy      : 1.5.2
tqdm       : 4.54.1
statsmodels: 0.12.1
bokeh      : 2.2.3
holoviews  : 1.14.0
bebi103    : 0.1.1
jupyterlab : 2.2.6