Parameter estimation by optimization case study: Regression
[2]:
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import statsmodels.tools.numdiff as smnd
import tqdm
import bebi103
import holoviews as hv
hv.extension('bokeh')
bebi103.hv.set_defaults()
import bokeh.io
bokeh.io.output_notebook()
We continue our analysis of the data set from Good et al., Science, 342, 856-860, 2013, now considering the tubulin conservation model. Before we continue, we'll make a quick plot of the data.
[3]:
# Load in data frame
df = pd.read_csv("../data/good_invitro_droplet_data.csv", comment="#")

# Pull out Numpy arrays
ell = df["Spindle Length (um)"].values
d = df["Droplet Diameter (um)"].values

# Make a plot
data_plot = hv.Scatter(
    data=df,
    kdims="Droplet Diameter (um)",
    vdims="Spindle Length (um)",
).opts(
    xlim=(0, None),
    ylim=(0, None),
)

data_plot
[3]:
(Output: scatter plot of spindle length versus droplet diameter.)
The tubulin conservation model
In the tubulin conservation model, recall that the theoretical relationship between spindle length and droplet diameter is
\begin{align} l(d) = \frac{\gamma d}{\left(1+(\gamma d/\phi)^3\right)^{\frac{1}{3}}} \end{align}
We will assume homoscedastic variation about this theoretical curve, giving us a likelihood of
\begin{align} &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma) \;\forall i. \end{align}
We therefore need to provide priors for \(\gamma\), \(\phi\), and \(\sigma\). We will use the same priors as we did for the independent size model for \(\phi\) and \(\sigma\) because they have the same meaning in the context of this model.
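As an aside I am adding here on scipy's parametrization, which can be confusing: a LogNorm(\(\mu\), \(\sigma\)) distribution corresponds to st.lognorm with shape parameter \(\sigma\) and scale np.exp(\(\mu\)), and HalfNorm(\(\sigma\)) corresponds to st.halfnorm with scale \(\sigma\). A minimal sketch, using the modules imported above:

# LogNorm(2.3, 2.3): ln φ ~ Norm(2.3, 2.3). scipy's shape parameter is the
# standard deviation of ln φ, and its scale is the exponential of the mean.
phi_prior = st.lognorm(2.3, loc=0, scale=np.exp(2.3))

# HalfNorm(20): loc=0, scale=20
sigma_prior = st.halfnorm(0, 20)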
Physically, \(0 < \gamma \le 1\), so we need to provide a prior for \(\gamma\) that is defined between zero and one. We can use a Beta distribution. I think \(\gamma\) taking values close to zero is very unlikely, since that sets the spindle lengths to zero. Similarly, I do not think values of \(\gamma\) close to one are likely, since this sets the spindle length to be the same as the droplet diameter, meaning that the spindle would always press against the sides of the droplet. So, I will choose a prior for \(\gamma\) that is fairly flat for intermediate values of \(\gamma\), but goes toward zero at \(\gamma = 0, 1\). A Beta distribution with \(\alpha = \beta = 3/2\) accomplishes this. The prior is plotted below.
[4]:
gamma = np.linspace(0, 1, 200)

hv.Curve((gamma, st.beta.pdf(gamma, 1.5, 1.5)), kdims="γ", vdims="g(γ)").opts(
    frame_height=200,
    ylim=(0, None),
)
[4]:
(Output: plot of the Beta(3/2, 3/2) prior for γ.)
So, our complete generative model is
\begin{align} &\phi \sim \text{LogNorm}(2.3, 2.3),\\[1em] &\gamma \sim \text{Beta}(1.5, 1.5), \\[1em] &\sigma \sim \text{HalfNorm}(20),\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}\;\forall i, \\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma) \;\forall i. \end{align}
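To build intuition for this generative model, we can draw a parameter set from the priors and generate a synthetic data set for the measured droplet diameters. Below is a minimal sketch I am adding (the seed is an arbitrary choice), using the modules and the array d from above:

# Seed the random number generator for reproducibility (arbitrary seed)
rng = np.random.default_rng(3252)

# Draw one parameter set from the priors
phi_draw = st.lognorm.rvs(2.3, loc=0, scale=np.exp(2.3), random_state=rng)
gamma_draw = st.beta.rvs(1.5, 1.5, random_state=rng)
sigma_draw = st.halfnorm.rvs(0, 20, random_state=rng)

# Generate spindle lengths for the measured droplet diameters
mu = gamma_draw * d / np.cbrt(1 + (gamma_draw * d / phi_draw)**3)
ell_gen = st.norm.rvs(mu, sigma_draw, random_state=rng)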
Parameter estimation
We will now proceed to compute the MAP and credible intervals. Before embarking on this calculation, I write down the steps for making sense of the posterior using optimization and a multivariate Normal approximation. After defining your model, do the following.
1. Write a function to compute the log posterior.
2. Define the negative log posterior function.
3. Set up the args that need to be passed to the log posterior function.
4. Provide an initial guess for the solver.
5. Use scipy.optimize.minimize() to find the MAP.
6. Extract the optimal parameters from the output.
7. Compute the Hessian of the log posterior evaluated at the MAP by numerical differentiation.
8. Compute the covariance matrix by inverting the Hessian.
9. Report the results (usually by reporting credible intervals).
We will step through this procedure for this model, starting with coding up the posterior.
[5]:
def theoretical_spindle_length(d, phi, gamma):
    """Theoretical spindle length for tubulin conservation model."""
    return gamma * d / np.cbrt(1 + (gamma * d / phi) ** 3)


def log_prior(phi, gamma, sigma):
    """Log prior for tubulin conservation model."""
    lp = st.lognorm.logpdf(phi, 2.3, loc=0, scale=np.exp(2.3))
    lp += st.beta.logpdf(gamma, 1.5, 1.5)
    lp += st.halfnorm.logpdf(sigma, 0, 20)

    return lp


def log_likelihood(d, ell, phi, gamma, sigma):
    """Log likelihood for tubulin conservation model."""
    mu = theoretical_spindle_length(d, phi, gamma)
    return np.sum(st.norm.logpdf(ell, mu, sigma))


def log_posterior(params, d, ell):
    """Log posterior of tubulin conservation model."""
    phi, gamma, sigma = params

    lp = log_prior(phi, gamma, sigma)
    if lp == -np.inf:
        return lp

    return lp + log_likelihood(d, ell, phi, gamma, sigma)
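Before minimizing, it is wise to check that the log posterior returns a finite value for a reasonable parameter set (a quick check I am adding; the values match the initial guess we use below):

# Should be finite (not -inf or nan) for sensible parameters
log_posterior(np.array([35, 0.8, 4]), d, ell)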
Next, we define the negative log posterior, which we need for minimization.
[6]:
def neg_log_posterior(params, d, ell):
    """Negative log posterior for minimization."""
    return -log_posterior(params, d, ell)
And the arguments we need to pass…
[7]:
args = (d, ell)
And now for the initial guesses for \(\phi\), \(\gamma\), and \(\sigma\), respectively.
[8]:
params_0 = np.array([35, 0.8, 4])
Now, we can find the MAP and extract the parameter values.
[9]:
# Compute the MAP
res = scipy.optimize.minimize(neg_log_posterior, params_0, args=args, method="powell")
# Extract optimal parameters
popt = res.x
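It is prudent to confirm that the solver actually converged before trusting the result; a small check I am adding here:

# res.success is False if Powell's method failed to converge
if not res.success:
    raise RuntimeError("MAP optimization failed: " + res.message)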
Now to compute the Hessian of the log posterior at the MAP and invert its negative to get the covariance matrix of the Normal approximation.
[10]:
# Compute Hessian and covariance matrix
hes = smnd.approx_hess(popt, log_posterior, args=args)
cov = -np.linalg.inv(hes)
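If the MAP is a true local maximum, the Hessian of the log posterior there is negative definite, so the covariance matrix we just computed should be positive definite. A quick check I am adding:

# All eigenvalues of the covariance matrix should be positive
np.linalg.eigvalsh(cov)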
Finally, we can report the results, quoting ±1.96 standard deviations to give approximate 95% credible intervals under the Normal approximation.
[11]:
# For convenience...
phi_MAP, gamma_MAP, sigma_MAP = popt
# Print results
print(
    """
Most probable parameters
------------------------
φ = {0:.2f} ± {1:.2f} µm
γ = {2:.3f} ± {3:.3f}
σ = {4:.3f} ± {5:.3f} µm
""".format(
        phi_MAP,
        1.96 * np.sqrt(cov[0, 0]),
        gamma_MAP,
        1.96 * np.sqrt(cov[1, 1]),
        sigma_MAP,
        1.96 * np.sqrt(cov[2, 2]),
    )
)
Most probable parameters
------------------------
φ = 38.26 ± 0.77 µm
γ = 0.859 ± 0.034
σ = 3.754 ± 0.201 µm
Checking the Normal approximation
Plotting this distribution is harder than in the independent size model because we now have three parameters. We therefore have to marginalize out one of the parameters to make a contour plot of the two parameters we are interested in. To do so, we first need to grid up the parameter values and compute the log posterior. Unlike in our previous treatment of this model, where we could analytically marginalize away \(\sigma\), we cannot do that here because of our choice of prior. We need to compute the posterior for many values of \(\sigma\) and then use numerical quadrature to marginalize it away.
Note that this problem of computing the posterior for plotting does not scale well. For the independent size model, we evaluated the posterior at 200 \(\phi\) values and 200 \(\sigma\) values, for a total of 40,000 posterior evaluations. It took a few seconds to run. Now, we will evaluate the posterior at 100 \(\phi\) values, 100 \(\gamma\) values, and 100 \(\sigma\) values, for a total of a million posterior evaluations, and you can expect the calculation in the next cell to take several minutes. With any more parameters, this brute force style of posterior evaluation would become intractable. Furthermore, notice below that I chose a tight range of parameter values. I was able to do this because I had already found the MAP and credible regions. Without them, the posterior would be mostly a big sea of low probability, and finding where it is not is like finding a needle in a haystack.
[12]:
# Set up plotting range
phi = np.linspace(37, 39.5, 100)
gamma = np.linspace(0.8, 0.91, 100)
sigma = np.linspace(3.0, 4.5, 100)
PHI, GAMMA, SIGMA = np.meshgrid(phi, gamma, sigma)

# Compute log posterior
LOG_POST = np.empty_like(PHI)
for i in tqdm.tqdm(range(PHI.shape[0])):
    for j in range(PHI.shape[1]):
        for k in range(PHI.shape[2]):
            LOG_POST[i, j, k] = log_posterior(
                np.array([PHI[i, j, k], GAMMA[i, j, k], SIGMA[i, j, k]]), d, ell
            )

# Exponentiate. Ignore normalization, so set max LOG_POST to zero
POST_exact = np.exp(LOG_POST - LOG_POST.max())
100%|██████████████████████████████████████████████████████████████████████| 100/100 [08:32<00:00, 5.13s/it]
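As an aside, the triple loop above can be sped up considerably. Since \(\mu_i\) depends only on \(\phi\) and \(\gamma\), we can compute it once per \((\phi, \gamma)\) pair and evaluate the Normal log likelihood for all \(\sigma\) values at once by broadcasting. A sketch I am adding, assuming the same grids as above:

LOG_POST_fast = np.empty_like(PHI)
for i in tqdm.tqdm(range(PHI.shape[0])):
    for j in range(PHI.shape[1]):
        # µ is independent of σ; compute it once per (φ, γ) pair
        mu = theoretical_spindle_length(d, PHI[i, j, 0], GAMMA[i, j, 0])

        # Broadcast over σ: shape (n_sigma, n_data), then sum over data points
        log_like = st.norm.logpdf(ell, mu, sigma[:, np.newaxis]).sum(axis=1)

        # log_prior broadcasts over the array of σ values
        LOG_POST_fast[i, j, :] = (
            log_prior(PHI[i, j, 0], GAMMA[i, j, 0], sigma) + log_like
        )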
To plot the marginalized posterior \(g(\phi, \gamma\mid d, l)\), which is what we are interested in, we need to marginalize over \(\sigma\). We can do so numerically using np.trapz(), integrating over the \(\sigma\) values.
[13]:
# Compute marginalized posterior
POST_marginalized = np.trapz(POST_exact, x=sigma, axis=2)
We can now make the plot.
[14]:
# Plot
p = bebi103.viz.contour(
    PHI[:, :, 0],
    GAMMA[:, :, 0],
    POST_marginalized,
    x_axis_label="ϕ (µm)",
    y_axis_label="γ",
    overlaid=False,
)

bokeh.io.show(p)
To see how good the Normal approximation is, we can plot the marginalized Normal distribution as well. In this case, we can marginalize a multivariate Normal simply by ignoring the parameters we are marginalizing out. We simply cut \(\sigma\) out of the MAP and out of the posterior covariance matrix and compute the PDF for a multivariate Normal. This calculation is almost instantaneous compared to the calculation of the full posterior.
[15]:
# Compute the marginal Normal approximation for (φ, γ)
POST_norm = st.multivariate_normal.pdf(
    np.dstack((PHI[:, :, 0], GAMMA[:, :, 0])), popt[:-1], cov[:-1, :-1]
)

p = bebi103.viz.contour(
    PHI[:, :, 0],
    GAMMA[:, :, 0],
    POST_norm,
    line_kwargs=dict(line_color="orange"),
    p=p,
)

bokeh.io.show(p)
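If it is not obvious that marginalizing a multivariate Normal amounts to simply dropping entries from the mean vector and covariance matrix, a quick numerical check can build confidence (my addition; the seed is arbitrary):

rng = np.random.default_rng(3252)

# Sample from the full 3-D Normal approximation of the posterior
samples = st.multivariate_normal.rvs(popt, cov, size=100_000, random_state=rng)

# The (φ, γ) columns should have covariance close to cov[:2, :2]
print(np.cov(samples[:, :2], rowvar=False))
print(cov[:2, :2])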
We are again fortunate that the Normal approximation is a good one. This is not always the case, though, and this is a major danger of using optimization methods to make sense of the posterior.
Displaying the best fit line
Researchers typically display the best-fit line with their data points. I generally discourage this; you should instead show results from posterior predictive checks, which we will discuss in coming weeks. However, for now we do not know how to do that, so we will just display the best-fit line.
[16]:
# Extract parameters
phi, gamma = popt[:2]
# Make theoretical curve
d_theor = np.linspace(0, 250, 250)
ell_theor = theoretical_spindle_length(d_theor, phi, gamma)
# Make the plot
data_plot * hv.Curve(data=(d_theor, ell_theor)).opts(color="orange")
[16]:
(Output: data with the best-fit theoretical curve overlaid.)
It looks like most of the measurements fall in the curved part of the plot, meaning we have few samples in the linear regime where the only relevant parameter is \(\gamma\), and few in the plateau regime where \(\phi\) is the only relevant parameter.
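To see why from the model, consider the limits of the theoretical curve:
\begin{align} l(d) \approx \gamma d \;\text{ for }\; \gamma d/\phi \ll 1, \qquad l(d) \approx \phi \;\text{ for }\; \gamma d/\phi \gg 1, \end{align}
so small droplets inform only \(\gamma\), while large droplets inform only \(\phi\).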
Computing environment
[17]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,tqdm,statsmodels,bokeh,holoviews,bebi103,jupyterlab
Python implementation: CPython
Python version : 3.9.7
IPython version : 7.29.0
numpy : 1.20.3
pandas : 1.3.5
scipy : 1.7.3
tqdm : 4.62.3
statsmodels: 0.12.2
bokeh : 2.3.3
holoviews : 1.14.6
bebi103 : 0.1.10
jupyterlab : 3.2.1