Parameter estimation by optimization

Data set download


[1]:
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import statsmodels.tools.numdiff as smnd

import tqdm

import bebi103

import holoviews as hv
hv.extension('bokeh')
bebi103.hv.set_defaults()

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

In this lesson, we will continue our analysis of the data set from Good, et al., Science, 342, 856-860, 2013. You should refamiliarize yourself with the data set if you need to.

Exploratory data analysis

To start, let’s make a quick plot of spindle length versus droplet diameter using the data set, which you can download here.

[2]:
# Load in Data Frame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')

# Pull out numpy arrays
ell = df['Spindle Length (um)'].values
d = df['Droplet Diameter (um)'].values

# Make a plot
data_plot = hv.Scatter(
    data=df,
    kdims='Droplet Diameter (um)',
    vdims='Spindle Length (um)',
).opts(
    xlim=(0, None),
    ylim=(0, None),
)

data_plot
[2]:

As was the case the last time we visited this data set, our goal is to get parameter estimates for models describing how spindle length varies with droplet size.

Independent size model

We will demonstrate parameter estimation by optimization first using the independent size model, in which we assume that spindle length is independent of droplet diameter. Now that we have thought a bit more about Bayesian modeling, we will abandon using Jeffreys priors and instead use weakly informative priors.

Full generative model

As in our previous treatment of this model, we choose a Normal likelihood for the spindle length, assuming all spindles are distributed about the same location parameter \(\phi\) with scale parameter \(\sigma\).

\begin{align} l_i \sim \text{Norm}(\phi, \sigma) \;\forall i. \end{align}

In specifying this likelihood, we see that there are two parameters to estimate, \(\phi\) and \(\sigma\). To choose priors for these, I reason as follows.

The parameter \(\phi\) is necessarily positive and is the characteristic length of a spindle. I doubt any spindle would be as small as a bacterial cell, which is about one micron in length. On the other hand, I’m not sure how much longer it could be. I will therefore choose a prior distribution whose bulk is away from zero but that has a heavy tail toward larger values. A Log-Normal distribution fits this bill. I will center it around 20 µm (a location parameter of \(\ln 20 \approx 3\) on the log scale), with a scale parameter of 0.75, which gives an appropriate tail. So, I have

\begin{align} \phi \sim \text{LogNorm}(3, 0.75). \end{align}

Note that for brevity, I do not include the units of the parameters in the model specification; units of microns are assumed.
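To get a quantitative feel for how heavy the upper tail is, we can ask scipy.stats for the central 95% interval of this prior (a quick aside; this check is not essential to the analysis).

# Central 95% interval of the LogNorm prior on ϕ, in units of µm
st.lognorm.ppf([0.025, 0.975], 0.75, loc=0, scale=np.exp(3))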

To understand the prior, here is a plot.

[3]:
phi = np.linspace(0, 100, 200)

hv.Curve(
    (phi, st.lognorm.pdf(phi, 0.75, loc=0, scale=np.exp(3))),
    kdims='ϕ [µm]',
    vdims='g(ϕ) [1/µm]',
).opts(
    frame_height=200
)
[3]:

The parameter \(\sigma\) describes how the data vary off of the characteristic value of \(\phi\). I am not really sure how variable spindle length is, so I will choose a weakly informative Half-Normal prior.

\begin{align} \sigma \sim \text{HalfNorm}(20). \end{align}

Here is a plot of this prior.

[4]:
sigma = np.linspace(0, 100, 200)

hv.Curve(
    (sigma, st.halfnorm.pdf(sigma, 0, 20)),
    kdims='σ [µm]',
    vdims='g(σ) [1/µm]',
).opts(
    frame_height=200,
    xlim=(0, None),
)
[4]:

Note that zero is the most probable value of \(\sigma\), and I don’t really believe this, so the Half-Normal is probably not the very best choice for capturing my prior knowledge. Nonetheless, this prior is very broad, so these details will have little effect on the posterior.

Thus, my complete model for the independent spindle size model is

\begin{align} &\phi \sim \text{LogNorm}(3, 0.75), \\[1em] &\sigma \sim \text{HalfNorm}(20),\\[1em] &l_i \sim \text{Norm}(\phi, \sigma) \;\forall i. \end{align}

Estimation of the MAP parameters

We could plot the entire posterior distribution (and in fact we did, albeit with different priors, in a previous lesson) to get a full understanding of it. In this lesson, we will explore a commonly used approach in which we find the parameter values that maximize the posterior, the maximum a posteriori (MAP) parameter values. We then locally approximate the posterior as Normal near the MAP, which we will do in the next section.

Finding the MAP amounts to an optimization problem in which we find the arguments (parameters) for which a function (the posterior distribution) is maximal. It is often much easier to find the maximum of the logarithm of the posterior. Since the logarithm function is monotonic, a maximum of the log posterior is also a maximum of the posterior. So, our first step is to code up a function that returns the log posterior. Fortunately, we do not have to hand-code any of the mathematical expressions for the log PDFs; we can use the built-in scipy.stats functions.

[5]:
def log_prior(phi, sigma):
    """Log prior for independent size model"""
    lp = st.lognorm.logpdf(phi, 0.75, loc=0, scale=np.exp(3))
    lp += st.halfnorm.logpdf(sigma, 0, 20)

    return lp


def log_likelihood(ell, phi, sigma):
    """Log likelihood for independent size model"""
    return np.sum(st.norm.logpdf(ell, phi, sigma))


def log_posterior(params, ell):
    """Log posterior of indpendent size model."""
    phi, sigma = params

    lp = log_prior(phi, sigma)
    if lp == -np.inf:
        return lp

    return lp + log_likelihood(ell, phi, sigma)

Now that we have the log posterior, we can perform numerical optimization. The function scipy.optimize.minimize() offers many options for algorithms, which you can read about in the documentation. I find that Powell’s method tends to work well. It does not rely on derivative information, so discontinuities don’t hurt, nor do the \(-\infty\) values we get in the log posterior for parameter values that are out of bounds. That is particularly important for the present example because we have hard bounds on the parameters (\(\phi\) and \(\sigma\) must be positive). We could use constrained optimizers such as L-BFGS-B or COBYLA, but Powell’s method generally suffices. It is sometimes slow, though.

The optimizers offered by scipy.optimize.minimize() find minimizers, so we need to define our objective function as the negative log posterior.

[6]:
def neg_log_posterior(params, ell):
    return -log_posterior(params, ell)

Next, we have to provide an initial guess for the parameter values. The optimization routine only finds a local maximum and is not in general guaranteed to converge. Therefore, the initial guess can be very important. Looking at the data, we would expect \(\phi\) to be somewhere around 35 µm and \(\sigma\) to be about 7 µm.
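If you would rather base the guess on the data than on eyeballing the plot, the sample mean and standard deviation of the measured spindle lengths serve the same purpose (a small aside; the hard-coded guess below works just as well).

# Data-driven initial guess: sample mean and standard deviation of spindle lengths
np.mean(ell), np.std(ell)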

[7]:
params_0 = np.array([35, 7])

We are now ready to use scipy.optimize.minimize() to compute the MAP. We use the args kwarg to pass in the other arguments to the neg_log_posterior() function. In our case, these arguments are the data points. The scipy.optimize.minimize() function returns an object that has several attributes describing the result of the optimization. Most important is x, the optimal parameter values.

[8]:
# Specify extra arguments into posterior function
args = (ell,)

# Compute the MAP
res = scipy.optimize.minimize(
    neg_log_posterior, params_0, args=args, method="powell"
)

# Extract optimal parameters
popt = res.x

# For convenience...
phi_MAP, sigma_MAP = popt

# Print results
print(
    """
Most probable parameters
------------------------
φ = {0:.2f} µm
σ = {1:.3f} µm
""".format(
        phi_MAP, sigma_MAP
    )
)

Most probable parameters
------------------------
φ = 32.86 µm
σ = 4.784 µm
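As an aside, the constrained optimizers mentioned above can also be used here. Below is a minimal sketch using L-BFGS-B with explicit positivity bounds on ϕ and σ; this is an alternative added for illustration, not the method used in the rest of this lesson, and it should land on essentially the same MAP.

# Alternative: gradient-based optimization with explicit bounds on the parameters
res_lbfgsb = scipy.optimize.minimize(
    neg_log_posterior,
    params_0,
    args=args,
    method="L-BFGS-B",
    bounds=[(1e-6, None), (1e-6, None)],  # ϕ and σ must be positive
)

# Optimal parameters found by L-BFGS-B
res_lbfgsb.x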

We have successfully found the MAP, which provides just that: the most probable parameter values. But with only the MAP, we know next to nothing about the posterior probability density function, \(g(\phi, \sigma\mid y)\), which is what we want to characterize; we know only where it is maximal. Can we find an approximate expression for the posterior that is more tractable, so that we can put error bars on these parameters?

Normal approximation of the posterior

If we have an expression for the posterior, we can write an approximate posterior that is easier to work with. Specifically, we will perform a multidimensional Taylor series expansion of the log posterior to give (in this case) a two-dimensional Normal distribution. In general, for a set of parameters \(\theta\), where \(\theta^*\) is the most probable set of parameter values, the approximation to give a Normal log posterior is

\begin{align} \ln g(\theta\mid y) \approx \text{constant} + \frac{1}{2}\left(\theta - \theta^*\right)^T \cdot \mathsf{B} \cdot \left(\theta - \theta^*\right), \end{align}

where \(\mathsf{B}\) is the symmetric Hessian matrix of second derivatives, with entries

\begin{align} B_{ij} = \left.\frac{\partial^2\,\ln g(\theta\mid y)}{\partial \theta_i\,\partial \theta_j}\right|_{\theta = \theta^*}.\\[0.1em] \nonumber \phantom{blah} \end{align}

Because we are evaluating the second derivatives at the point of maximal posterior probability, the Hessian is negative definite and therefore invertible. Its negative inverse is the covariance matrix, \(\mathsf{\Sigma}\), of the Normal approximation of the posterior. That is,

\begin{align} \mathsf{\Sigma} = -\mathsf{B}^{-1}. \end{align}

So, to compute the Normal approximation of the posterior, our task is two-fold. First, find the parameter values that give the maximal posterior probability, which we have already done. Second, compute the Hessian of the log posterior at the MAP and invert it to get the covariance matrix. Note that the covariance matrix is not the element-wise reciprocal of the Hessian; it is the inverse of the Hessian matrix. The credible interval for each parameter can then be calculated from the diagonal elements of the covariance matrix. (I’ll clarify what a credible interval is in a moment.)
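To make the distinction concrete, here is a quick demonstration with a made-up 2×2 matrix (not our Hessian) that taking the reciprocal of each element is not the same as inverting the matrix.

# Element-wise reciprocal versus matrix inverse of a made-up matrix
A = np.array([[2.0, 1.0], [1.0, 3.0]])

print(1 / A)             # reciprocal of each element
print(np.linalg.inv(A))  # true matrix inverse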

To compute the covariance matrix, we need to compute the Hessian of the log of the posterior at the MAP. We will do this numerically using the statsmodels.tools.numdiff module, which we imported as smnd. We use the function smnd.approx_hess() to compute the Hessian at the optimal parameter values. Note that since we already have the log posterior coded up and the MAP parameter values, we can just directly shove these into the numerical Hessian calculator.

[9]:
hes = smnd.approx_hess(popt, log_posterior, args=args)

Now that we have the Hessian, we take its negative inverse to get the covariance matrix.

[10]:
# Compute the covariance matrix
cov = -np.linalg.inv(hes)

# Look at it
cov
[10]:
array([[ 3.41658852e-02, -1.39151314e-05],
       [-1.39151314e-05,  1.70799661e-02]])

The diagonal terms give the approximate variance of the parameter estimates. The off-diagonal terms give the covariance, which describes how the parameters are related. Nonzero covariance indicates that they are not completely independent.
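To put the off-diagonal entry on an interpretable scale, we can convert it to a correlation coefficient (a quick check added here; it is not needed for the rest of the analysis). A value near zero means the parameters are nearly uncorrelated under the Normal approximation.

# Correlation between ϕ and σ under the Normal approximation
cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])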

Credible intervals

An \(\alpha\)-percent Bayesian credible interval for a parameter is an interval that contains \(\alpha\) percent of the posterior probability density. (This is different from a frequentist confidence interval, though the latter is often erroneously interpreted as the former.) For example, if \(g(\theta \mid y)\) is the marginalized posterior probability density function for a single parameter \(\theta\) and \(G(\theta\mid y)\) is the corresponding cumulative distribution function, then a 95% credible interval is \([G^{-1}(0.025),\;G^{-1}(0.975)]\).

If we approximate the posterior as a multivariate Normal distribution, we have analytical results for the credible intervals of each parameter. Specifically, if \(\Sigma_{ii}\) is a diagonal element of the covariance matrix \(\mathsf{\Sigma}\), then the 95% credible interval for parameter \(\theta_i\) is centered on the MAP estimate \(\theta_i^*\) and is \([\theta_i^* - 1.96\sqrt{\Sigma_{ii}},\;\theta_i^* + 1.96\sqrt{\Sigma_{ii}}]\). Because the credible intervals are symmetric under this approximation, we can report the estimates as \(\theta_i^* \pm 1.96 \sqrt{\Sigma_{ii}}\).
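The factor of 1.96 is just the 97.5th percentile of the standard Normal distribution. As a sketch, we can also let scipy.stats compute the interval endpoints for ϕ directly from the Normal approximation.

# 1.96 is the 97.5th percentile of the standard Normal distribution
print(st.norm.ppf(0.975))

# 95% credible interval for ϕ under the Normal approximation
print(st.norm.interval(0.95, loc=phi_MAP, scale=np.sqrt(cov[0, 0])))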

[11]:
# Report results
print("""
Results for independent size model (≈ 95% of total probability density)
------------------------------------------------------------------------
ϕ = {0:.2f} ± {1:.2f} µm
σ = {2:.3f} ± {3:.3f} µm
""".format(phi_MAP, 1.96*np.sqrt(cov[0,0]), sigma_MAP, 1.96*np.sqrt(cov[1,1])))

Results for independent size model (≈ 95% of total probability density)
------------------------------------------------------------------------
ϕ = 32.86 ± 0.36 µm
σ = 4.784 ± 0.256 µm

How good is the approximation?

To assess how close the Normal approximation is to the posterior, let’s plot the Normal approximation and the exact marginalized posterior on the same contour plot. We can use the scipy.stats.multivariate_normal.pdf() function to generate the PDF for the Normal approximation.

[12]:
# Set up ranges for plots; should be near MAP
phi = np.linspace(32.4, 33.5, 200)
sigma = np.linspace(4.4, 5.2, 200)
PHI, SIGMA = np.meshgrid(phi, sigma)

# Compute posteriors
post_norm = np.empty_like(PHI)
log_post_exact = np.empty_like(PHI)
for i in tqdm.tqdm(range(PHI.shape[0])):
    for j in range(PHI.shape[1]):
        post_norm[i, j] = st.multivariate_normal.pdf(
            (PHI[i, j], SIGMA[i, j]), popt, cov
        )
        log_post_exact[i, j] = log_posterior((PHI[i, j], SIGMA[i, j]), ell)

# Compute exact posterior for plotting
post_exact = np.exp(log_post_exact - log_post_exact.max())

# Make contours
p = bebi103.viz.contour(
    PHI, SIGMA, post_exact, x_axis_label="ϕ [µm]", y_axis_label="σ [µm]"
)
p = bebi103.viz.contour(
    PHI, SIGMA, post_norm, line_kwargs=dict(line_color="orange"), p=p
)
bokeh.io.show(p)
100%|██████████| 200/200 [00:24<00:00,  8.04it/s]

The Normal approximation is not that far off from the posterior, especially at the peak. This is not always the case! You should always check your posterior after you find the MAP to make sure there is nothing pathological going on.

The tubulin conservation model

We will now perform a similar analysis with the tubulin conservation model. In this model, recall that the theoretical relationship between spindle length and droplet diameter is

\begin{align} l(d) = \frac{\gamma d}{\left(1+(\gamma d/\phi)^3\right)^{\frac{1}{3}}} \end{align}

We will assume homoscedastic variation about this theoretical curve, giving us a likelihood of

\begin{align} &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma) \;\forall i. \end{align}

We therefore need to provide priors for \(\gamma\), \(\phi\), and \(\sigma\). We will use the same priors for \(\phi\) and \(\sigma\) as we did for the independent size model because they have the same meaning in the context of this model. Physically, \(0 < \gamma \le 1\), so we need to provide a prior for \(\gamma\) that is defined between zero and one. We can use a Beta distribution. I think values of \(\gamma\) close to zero are very unlikely, since that would set the spindle lengths to zero. Similarly, I do not think values of \(\gamma\) close to one are likely, since that would set the spindle length equal to the droplet diameter, meaning that the spindle would press against the sides of the droplet. So, I will choose a prior for \(\gamma\) that is fairly flat for intermediate values, but goes toward zero at \(\gamma = 0\) and \(\gamma = 1\). A Beta distribution with \(\alpha = \beta = 3/2\) accomplishes this. The prior is plotted below.

[13]:
gamma = np.linspace(0, 1, 200)

hv.Curve(
    (gamma, st.beta.pdf(gamma, 1.5, 1.5)),
    kdims='γ',
    vdims='g(γ)',
).opts(
    frame_height=200,
)
[13]:

So, our complete generative model is

\begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\gamma \sim \text{Beta}(1.5, 1.5), \\[1em] &\sigma \sim \text{HalfNorm}(20),\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}\;\forall i, \\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma) \;\forall i. \end{align}

Parameter estimation

We will now proceed to compute the MAP and credible intervals. Before embarking on this calculation, I write down the steps for making sense of the posterior using optimization and the multivariate Normal approximation. After defining your model, do the following.

  1. Write a function to compute the log posterior.

  2. Define the negative log posterior function.

  3. Set up the args that need to be passed to the log posterior function.

  4. Provide an initial guess for the solver.

  5. Use scipy.optimize.minimize() to find the MAP.

  6. Extract the optimal parameters from the output.

  7. Compute the Hessian evaluated at the MAP of the log posterior by numerical differentiation.

  8. Compute the covariance by inverting the Hessian.

  9. Report the results (usually by reporting credible intervals).

We will step through this procedure for this model, starting with coding up the posterior.

[14]:
def theoretical_spindle_length(d, phi, gamma):
    """Theoretical spindle length for tubulin conservation model."""
    return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)


def log_prior(phi, gamma, sigma):
    """Log prior for tubulin conservation model"""
    lp = st.lognorm.logpdf(phi, 0.75, loc=0, scale=np.exp(3))
    lp += st.beta.logpdf(gamma, 1.5, 1.5)
    lp += st.halfnorm.logpdf(sigma, 0, 20)

    return lp


def log_likelihood(d, ell, phi, gamma, sigma):
    """Log likelihood for independent size model"""
    mu = theoretical_spindle_length(d, phi, gamma)

    return np.sum(st.norm.logpdf(ell, mu, sigma))


def log_posterior(params, d, ell):
    """Log posterior of indpendent size model."""
    phi, gamma, sigma = params

    lp = log_prior(phi, gamma, sigma)
    if lp == -np.inf:
        return lp

    return lp + log_likelihood(d, ell, phi, gamma, sigma)

Next, we define the negative log posterior, which we need for minimization.

[15]:
def neg_log_posterior(params, d, ell):
    return -log_posterior(params, d, ell)

And the arguments we need to pass…

[16]:
args = (d, ell)

And now for the initial guesses for \(\phi\), \(\gamma\), and \(\sigma\), respectively.

[17]:
params_0 = np.array([35, 0.8, 4])

Now, we can find the MAP and extract the parameter values.

[18]:
# Compute the MAP
res = scipy.optimize.minimize(neg_log_posterior, params_0, args=args, method="powell")

# Extract optimal parameters
popt = res.x

Now to compute the Hessian and invert it to get the covariance.

[19]:
# Compute Hessian and covariance matrix
hes = smnd.approx_hess(popt, log_posterior, args=args)
cov = -np.linalg.inv(hes)

Finally, we can report the results.

[20]:
# For convenience...
phi_MAP, gamma_MAP, sigma_MAP = popt

# Print results
print(
    """
Most probable parameters
------------------------
φ = {0:.2f} ± {1:.2f}  µm
γ = {2:.3f} ± {3:.3f}
σ = {4:.3f} ± {5:.3f} µm
""".format(
        phi_MAP,
        1.96 * np.sqrt(cov[0, 0]),
        gamma_MAP,
        1.96 * np.sqrt(cov[1, 1]),
        sigma_MAP,
        1.96 * np.sqrt(cov[2, 2]),
    )
)

Most probable parameters
------------------------
φ = 38.26 ± 0.77  µm
γ = 0.859 ± 0.034
σ = 3.754 ± 0.201 µm

Plotting the (marginalized) posterior

Plotting this distribution is harder than in the independent size model because we now have three parameters. We therefore have to marginalize out one of the parameters to make a contour plot of the two parameters we are interested in. To do so, we first need to grid up the parameter values and compute the log posterior. Unlike in our previous treatment of this model, when we could analytically marginalize away \(\sigma\), we cannot do so here because of our choice of prior. We need to compute the posterior for many values of \(\sigma\) and then use numerical quadrature to marginalize it away.

Note that this problem of computing the posterior for plotting does not scale well. For the independent size model, we evaluated the posterior at 200 \(\phi\) values and 200 \(\sigma\) values for a total of 40,000 posterior evaluations. It took a few seconds to run. Now, we will evaluate the posterior at 100 \(\phi\) values, 100 \(\gamma\) values, and 100 \(\sigma\) values, for a total of a million posterior evaluations, and you can expect the calculation in the next cell to take several minutes. If we had any more parameters, this brute-force style of posterior evaluation would become intractable. Furthermore, notice below that I chose a tight range of parameter values. I was able to do this because I already found the MAP and credible regions first. Without this, the posterior would be mostly a big sea of low probability, and finding where it is not is like finding a needle in a haystack.

[21]:
# Set up plotting range
phi = np.linspace(37, 39.5, 100)
gamma = np.linspace(0.8, 0.91, 100)
sigma = np.linspace(3.0, 4.5, 100)
PHI, GAMMA, SIGMA = np.meshgrid(phi, gamma, sigma)

# Compute log posterior
LOG_POST = np.empty_like(PHI)
for i in tqdm.tqdm(range(PHI.shape[0])):
    for j in range(PHI.shape[1]):
        for k in range(PHI.shape[2]):
            LOG_POST[i, j, k] = log_posterior(
                    np.array([PHI[i,j,k], GAMMA[i,j,k], SIGMA[i,j,k]]), d, ell)

# Exponentiate. Ignore normalization, so set max LOG_POST to zero
POST_exact = np.exp(LOG_POST - LOG_POST.max())
100%|██████████| 100/100 [09:52<00:00,  5.92s/it]

To plot the marginalized posterior \(g(\phi, \gamma\mid d, l)\), which is what we are interested in, we need to marginalize over \(\sigma\). We can do so numerically using np.trapz(), integrating over the \(\sigma\) values.

[22]:
# Compute marginalized posterior
POST_marginalized = np.trapz(POST_exact, x=sigma, axis=2)

We can now make the plot.

[23]:
# Plot
p = bebi103.viz.contour(PHI[:,:,0],
                        GAMMA[:,:,0],
                        POST_marginalized,
                        x_axis_label='ϕ (µm)',
                        y_axis_label='γ',
                        overlaid=False)
bokeh.io.show(p)

To see how good the Normal approximation is, we can plot the marginalized Normal distribution as well. In this case, as we explored in the first homework, we can marginalize a multivariate Normal simply by dropping the parameters we want to marginalize out. We cut \(\sigma\) out of the MAP and out of the covariance matrix and compute the PDF of the resulting multivariate Normal. This calculation is almost instantaneous compared to the calculation of the full posterior.

[24]:
# Set up plotting range
POST_norm = st.multivariate_normal.pdf(np.dstack((PHI[:,:,0], GAMMA[:,:,0])), popt[:-1], cov[:-1, :-1])

p = bebi103.viz.contour(
    PHI[:,:,0],
    GAMMA[:,:,0],
    POST_norm,
    line_kwargs=dict(line_color='orange'),
    overlaid=False,
    p=p
)
bokeh.io.show(p)

We are again fortunate that the Normal approximation is a good one. This is not always the case, though, and this is a major danger in using optimization methods to make sense of the posterior.

Displaying the best fit line

Researchers typically display the best-fit line with their data points. I generally discourage this; you should instead show results from posterior predictive checks, which we will discuss in coming weeks. However, for now we do not know how to do that, so we will just display the best fit line.

[25]:
# Extract parameters
phi, gamma = popt[:2]

# Make theoretical curve
d_theor = np.linspace(0, 250, 250)
ell_theor = theoretical_spindle_length(d_theor, phi, gamma)

# Make the plot
data_plot * hv.Curve(data=(d_theor, ell_theor)).opts(color='orange')
[25]:

It looks like most of the measured droplets lie in the curved, crossover part of the plot, meaning we have few data points in the small-droplet regime where the only relevant parameter is \(\gamma\), and few in the large-droplet regime where \(\phi\) is the only relevant parameter.
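To see these two regimes directly from the theoretical curve, we can evaluate it with the MAP parameters for a very small and a very large droplet (a quick numerical check added here): for small d, l(d) ≈ γd, and for large d, l(d) plateaus at ϕ.

# Small-droplet regime: spindle length is approximately γ·d
d_small = 1.0
print(theoretical_spindle_length(d_small, phi, gamma), gamma * d_small)

# Large-droplet regime: spindle length plateaus at ϕ
d_large = 1000.0
print(theoretical_spindle_length(d_large, phi, gamma), phi)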

Computing environment

[26]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,tqdm,statsmodels,bokeh,holoviews,bebi103,jupyterlab
CPython 3.7.6
IPython 7.11.1

numpy 1.17.4
pandas 0.24.2
scipy 1.3.1
tqdm 4.41.1
statsmodels 0.10.1
bokeh 1.4.0
holoviews 1.12.7
bebi103 0.0.49
jupyterlab 1.2.4