3. Marginalization by numerical quadrature¶
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
cmd = "pip install --upgrade bebi103 watermark"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
data_path = "../data/"
# ------------------------------
import numpy as np
import pandas as pd
import scipy.integrate
import scipy.stats as st
import bebi103
import holoviews as hv
hv.extension('bokeh')
bebi103.hv.set_defaults()
We will continue working with the spindle length data set, this time using the tubulin conservation model. We’ll start by loading in the data set and making a plot to remind us of the structure of the data.
[2]:
# Load in Data Frame
df = pd.read_csv(os.path.join(data_path, 'good_invitro_droplet_data.csv'), comment='#')
# Pull out numpy arrays
ell = df['Spindle Length (um)'].values
d = df['Droplet Diameter (um)'].values
# Make a plot
hv.Scatter(
    data=df,
    kdims='Droplet Diameter (um)',
    vdims='Spindle Length (um)',
)
[2]:
[Bokeh scatter plot of spindle length vs. droplet diameter]
The tubulin conservation model¶
In the tubulin conservation model, the spindle length depends on the droplet diameter \(d\) according to
\begin{align} l(d;\gamma, \phi) = \frac{\gamma d}{(1 + (\gamma d/\phi)^3)^\frac{1}{3}}. \end{align}
Note that \(0 \le \gamma \le 1\) and \(\phi\) is again positive. We again assume that the measurements are Normally distributed about this theoretical curve. We assume a Uniform prior for \(\gamma\), giving a generative model of
\begin{align} &g(\phi) = \text{constant} \\ &\gamma \sim \text{Uniform}(0, 1) \\ &g(\sigma) = 1/\sigma \\ &l_i \sim \text{Normal}(l(d_i;\gamma,\phi), \sigma) \;\forall i. \end{align}
Analytical marginalization¶
The posterior distribution is
\begin{align} g(\gamma, \phi, \sigma \mid \{d_i, l_i\}) \propto \frac{1}{\sigma^{n+1}}\,\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(l_i-\frac{\gamma d_i}{(1 + (\gamma d_i/\phi)^3)^\frac{1}{3}}\right)^2\right], \end{align}
a trivariate distribution. We cannot easily visualize a trivariate distribution, so we need to marginalize it. If we want the marginalized posterior \(g(\gamma, \phi \mid \{d_i, l_i\})\), we can analytically marginalize out \(\sigma\). We again use the fact that
\begin{align} \int_0^\infty \mathrm{d}\sigma \,\frac{1}{\sigma^{n+1}}\,\mathrm{e}^{-x/2\sigma^2} \propto x^{-n/2} \end{align}
to get
\begin{align} g(\gamma, \phi \mid \{d_i, l_i\}) = \int_0^\infty \mathrm{d}\sigma\,g(\gamma, \phi, \sigma \mid \{d_i, l_i\}) \propto \left(\sum_{i=1}^n\left(l_i-\frac{\gamma d_i}{(1 + (\gamma d_i/\phi)^3)^\frac{1}{3}}\right)^2\right)^{-n/2}. \end{align}
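If you want to convince yourself of this proportionality without redoing the calculus, the short sketch below (our own illustrative check, not part of the original analysis; the function sigma_integral and the test values are ours) evaluates the \(\sigma\)-integral numerically with scipy.integrate.quad for two arbitrary values of \(x\) and compares the ratio of the results to \((x_1/x_2)^{-n/2}\).
def sigma_integral(x, n):
    """Numerically evaluate ∫₀^∞ dσ σ^(-(n+1)) exp(-x / (2σ²)). Illustrative only."""
    integrand = lambda sigma: sigma ** (-(n + 1)) * np.exp(-x / (2 * sigma**2))
    result, _ = scipy.integrate.quad(integrand, 0, np.inf)
    return result

# Arbitrary test values; the ratio of integrals should match the x^(-n/2) proportionality
n_test, x1, x2 = 5, 2.0, 7.0
print(sigma_integral(x1, n_test) / sigma_integral(x2, n_test), (x1 / x2) ** (-n_test / 2))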
We can then plot this marginalized posterior, which contains information about the two parameters we care most about, \(\phi\) and \(\gamma\), as a contour plot. First, we code up a function to compute the log marginalized posterior.
[3]:
def theor_spindle_length(gamma, phi, d):
    """Compute spindle length using the tubulin conservation model."""
    return gamma * d / np.cbrt(1 + (gamma * d / phi) ** 3)


def log_marginalized_posterior(params, d, ell):
    """Log marginalized posterior for the tubulin conservation model
    (σ marginalized out analytically)."""
    gamma, phi = params

    return (
        -len(ell) * np.log(np.sum((ell - theor_spindle_length(gamma, phi, d)) ** 2)) / 2
    )
Next, we select ranges to make the plot. We can eyeball the plot of the data to guess a good range. We know that \(\phi\) should be somewhere around 37 µm or so, which is where the data asymptote for large \(d\). The initial slope of the data (which is what \(\gamma\) gives) is somewhere around 3/4. So, we will plot the marginalized posterior in the range \(0.7 \le \gamma \le 0.9\) and \(30 \le \phi \le 45\) µm.
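If you prefer a quick computation over pure eyeballing, a rough sketch like the one below (our own illustration; the percentile cutoffs are arbitrary) gives similar numbers: the plateau of spindle length for the largest droplets approximates \(\phi\), and the ratio \(l/d\) for the smallest droplets approximates the initial slope \(\gamma\).
# Rough, eyeball-style estimates from the data; 10th/90th percentile cutoffs are arbitrary
print("rough φ estimate (µm):", ell[d > np.percentile(d, 90)].mean())
print("rough γ estimate:", (ell / d)[d < np.percentile(d, 10)].mean())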
[4]:
gamma = np.linspace(0.7, 0.9, 200)
phi = np.linspace(30, 45, 200)
LOG_POST = np.zeros((len(gamma), len(phi)))
for j, phi_val in enumerate(phi):
    for i, gamma_val in enumerate(gamma):
        LOG_POST[i, j] = log_marginalized_posterior(
            np.array([gamma_val, phi_val]), d, ell
        )
We can now construct a contour plot, again adjusting the log posterior to avoid underflow or overflow when exponentiating.
[5]:
POST = np.exp(LOG_POST - LOG_POST.max())
im = hv.Image((phi, gamma, POST), kdims=["φ (µm)", "γ"])
hv.operation.contours(im, levels=5, overlaid=True).opts(
    hv.opts.Contours(
        cmap=["white"],
        frame_height=300,
        frame_width=300,
        line_width=2,
        show_legend=False,
    )
).opts(padding=0)
[5]:
[Bokeh contour plot of the marginalized posterior g(γ, φ | {dᵢ, lᵢ})]
We have zoomed out too much, but we can identify the peak. Let’s plot with more instructive ranges.
[6]:
gamma = np.linspace(0.8, 0.9, 200)
phi = np.linspace(36, 40, 200)
LOG_POST = np.zeros((len(gamma), len(phi)))
for j, phi_val in enumerate(phi):
    for i, gamma_val in enumerate(gamma):
        LOG_POST[i, j] = log_marginalized_posterior(
            np.array([gamma_val, phi_val]), d, ell
        )
# Store log_post_max for later calcs
log_post_max = LOG_POST.max()
POST = np.exp(LOG_POST - log_post_max)
im = hv.Image((phi, gamma, POST), kdims=["φ (µm)", "γ"])
hv.operation.contours(im, levels=5, overlaid=True).opts(
    hv.opts.Contours(
        cmap=["white"],
        frame_height=300,
        frame_width=300,
        line_width=2,
        show_legend=False,
    )
).opts(padding=0)
[6]:
[Bokeh contour plot of the marginalized posterior over the zoomed-in ranges]
Numerical marginalization¶
The marginal distributions \(g(\gamma \mid \{d_i, l_i\})\) and \(g(\phi \mid \{d_i, l_i\})\) tell us about the individual parameters \(\gamma\) and \(\phi\). For example, to get \(g(\gamma \mid \{d_i, l_i\})\), we have to perform the integral
\begin{align} g(\gamma \mid \{d_i, l_i\}) = \int_0^\infty \mathrm{d}\phi\, g(\gamma, \phi \mid \{d_i, l_i\}) \propto \int_0^\infty \mathrm{d}\phi\,\left(\sum_{i=1}^n\left(l_i-\frac{\gamma d_i}{(1 + (\gamma d_i/\phi)^3)^\frac{1}{3}}\right)^2\right)^{-n/2}. \end{align}
Unfortunately, we cannot do this integral analytically. We can do it numerically, though, using numerical quadrature. To get the marginal distributions, we will start by obtaining \(g(\phi \mid \{d_i, l_i\})\), which we get by integrating over \(\gamma\). If we consider a particular value of \(\phi\), say \(\varphi\), then
\begin{align} g(\varphi\mid \{d_i, l_i\}) = \int_0^1 \mathrm{d}\gamma\,g(\gamma, \varphi\mid \{d_i, l_i\}). \end{align}
Note that the bounds are between zero and one, since those are the physical bounds on \(\gamma\). If we were integrating over \(\phi\), the bounds would be zero to infinity.
For every value of \(\phi\) we want, we need to perform the above integral. The result is a curve, evaluated at a series of points \(\varphi\), that is the unnormalized marginalized posterior \(g(\varphi\mid\{d_i, l_i\})\). Let's perform the calculation, beginning by specifying a marginalized posterior function that computes the unnormalized posterior while avoiding over/underflow problems, using the log_post_max value we calculated when making the plot of the two-dimensional marginalized posterior.
[7]:
def marginalized_posterior(gamma, phi, d, ell, log_post_max):
    """Unnormalized marginalized posterior, shifted by log_post_max
    to protect against under/overflow when exponentiating."""
    return np.exp(log_marginalized_posterior((gamma, phi), d, ell) - log_post_max)
Next, we set up the values of \(\phi\) for which we want the posterior and then evaluate the integrals.
[8]:
phi = np.linspace(30, 45, 200)
unnormalized_marg_post_phi = np.empty_like(phi)
for i, phi_val in enumerate(phi):
    unnormalized_marg_post_phi[i], _ = scipy.integrate.quad(
        marginalized_posterior, 0, 1, args=(phi_val, d, ell, log_post_max)
    )
We can now plot the resulting curve, which is an unnormalized probability density function \(g(\phi \mid \{d_i, l_i\})\).
[9]:
hv.Curve(
    (phi, unnormalized_marg_post_phi),
    kdims='φ (µm)',
    vdims='∝ g(φ | {dᵢ, lᵢ})'
)
[9]:
[Bokeh curve of the unnormalized marginal posterior g(φ | {dᵢ, lᵢ})]
To normalize this PDF, we need to integrate over the unnormalized curve and divide by the result. In this case, though, we do not have an analytical expression for the PDF, so we cannot write a function for the integrand and therefore cannot use scipy.integrate.quad() to do the integration. Instead, we can integrate by summing up the area under the curve using the trapezoidal rule, which is implemented in np.trapz().
[10]:
normalization_constant = np.trapz(unnormalized_marg_post_phi, x=phi)
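As a quick sanity check (an illustrative aside, not part of the original analysis; the names dphi and manual are ours), the trapezoidal rule is just the sum of trapezoid areas between neighboring grid points, so we can reproduce np.trapz by hand for our uniformly spaced φ grid.
# Manual trapezoidal rule: Σ (y[i] + y[i+1]) / 2 · Δφ for uniform spacing
dphi = phi[1] - phi[0]
manual = np.sum((unnormalized_marg_post_phi[:-1] + unnormalized_marg_post_phi[1:]) / 2 * dphi)

print(manual, normalization_constant)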
Now, we can plot the normalized result.
[11]:
hv.Curve(
    (phi, unnormalized_marg_post_phi / normalization_constant),
    kdims='φ (µm)',
    vdims='g(φ | {dᵢ, lᵢ}) (1/µm)'
)
[11]:
[Bokeh curve of the normalized marginal posterior g(φ | {dᵢ, lᵢ})]
We can use the same procedure to get the marginalized distribution \(g(\gamma\mid \{d_i, l_i\})\). The wrinkle is that scipy.integrate.quad() integrates over the first argument of the function that gives the integrand. As we have written marginalized_posterior(), \(\phi\), the variable we need to integrate over, is the second argument. We can get around this by using a lambda function to swap the order of the inputs.
[12]:
gamma = np.linspace(0.7, 1.0, 200)
unnormalized_marg_post_gamma = np.empty_like(gamma)
# Integrate φ over a range (30–45 µm) where the posterior has appreciable mass
for i, gamma_val in enumerate(gamma):
    unnormalized_marg_post_gamma[i], _ = scipy.integrate.quad(
        lambda phi: marginalized_posterior(gamma_val, phi, d, ell, log_post_max), 30, 45
    )
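An equivalent approach (a small sketch of our own; the helper name integrand is hypothetical) is to write a wrapper whose first argument is φ and pass the current value of γ through quad's args keyword. The result should match the lambda-based calculation above.
def integrand(phi_val, gamma_val):
    """marginalized_posterior with φ as the first argument, so quad integrates over it."""
    return marginalized_posterior(gamma_val, phi_val, d, ell, log_post_max)

# Compare against the first entry computed with the lambda above
check, _ = scipy.integrate.quad(integrand, 30, 45, args=(gamma[0],))
print(check, unnormalized_marg_post_gamma[0])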
We can then compute the normalization constant and plot the normalized marginalized posterior \(g(\gamma\mid \{d_i, l_i\})\).
[13]:
norm_const = np.trapz(unnormalized_marg_post_gamma, x=gamma)

hv.Curve(
    (gamma, unnormalized_marg_post_gamma / norm_const),
    kdims='γ',
    vdims='g(γ | {dᵢ, lᵢ})'
)
[13]:
[Bokeh curve of the marginal posterior g(γ | {dᵢ, lᵢ})]
The curse of dimensionality and overcoming it¶
Having a complete picture of the marginalized posterior PDF is certainly valuable. For this problem, where the posterior is trivariate, we were able to analytically integrate out one of the parameters (\(\sigma\)) and then numerically integrate out another (either \(\phi\) or \(\gamma\)). But what if we chose another prior for \(\sigma\) such that we could not do the integral analytically? We would have to do the following to get the marginal PDF for \(\gamma\).
1. For each pair of values \(\phi, \gamma\), perform numerical quadrature to marginalize out \(\sigma\).
2. For each value of \(\gamma\), numerically integrate over \(\phi\) to marginalize it out. For this calculation, because the expression is no longer analytical, we need to use, e.g., the trapezoidal rule.
3. Perform a final set of integrations to normalize.
If we wish to evaluate the marginalized PDF of \(\gamma\) at \(n\) values of \(\gamma\), we now have to do order \(n^2\) numerical integrations, as opposed to order \(n\) that we did when we were able to analytically marginalize over \(\sigma\). In general, if we have \(p\) parameters, we would have to perform order \(n^{p-1}\) integrations to marginalize the posterior down to one variable. This is the curse of dimensionality rearing its head again.
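To make the scaling concrete, here is a brute-force sketch of what those steps could look like in code. It is illustrative only: it assumes a hypothetical Half-Normal(0, 10) prior on \(\sigma\) (so the \(\sigma\)-integral is no longer analytical), uses coarse, hand-picked grid ranges for \(\gamma\), \(\phi\), and \(\sigma\), and lets the trapezoidal rule stand in for the quadrature at every step; all variable names here are ours.
# Illustrative sketch only: hypothetical Half-Normal prior on σ, rough grid bounds
n_grid = 50
gamma_grid = np.linspace(0.8, 0.9, n_grid)
phi_grid = np.linspace(36, 40, n_grid)
sigma_grid = np.linspace(2, 6, n_grid)

LOG_POST_3D = np.empty((n_grid, n_grid, n_grid))
for i, gamma_val in enumerate(gamma_grid):
    for j, phi_val in enumerate(phi_grid):
        resid_sq = np.sum((ell - theor_spindle_length(gamma_val, phi_val, d)) ** 2)
        # log prior on σ plus log likelihood, vectorized over the σ grid
        LOG_POST_3D[i, j, :] = (
            st.halfnorm.logpdf(sigma_grid, 0, 10)
            - len(ell) * np.log(sigma_grid)
            - resid_sq / (2 * sigma_grid**2)
        )

POST_3D = np.exp(LOG_POST_3D - LOG_POST_3D.max())

# Marginalize out σ, then φ: one integral per (γ, φ) pair, then one per γ
post_gamma_3d = np.trapz(np.trapz(POST_3D, x=sigma_grid, axis=2), x=phi_grid, axis=1)
post_gamma_3d /= np.trapz(post_gamma_3d, x=gamma_grid)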
It has become clear that making sense of posterior distributions involves computing integrals. But we have just shown that for all but the simplest posteriors, we are doomed by the curse of dimensionality. Much of Bayesian inference involves finding ways of handling the curse. Going forward, we will investigate several ways of doing this:
Choosing prior/likelihood pairs where the integrals can be computed analytically (conjugacy).
Approximating the posterior, at least locally, with distributions that have known analytical expressions.
Computing integrals by sampling out of the posterior using Markov chain Monte Carlo.
In the next lesson, we will consider conjugacy. After that, we will use optimization methods to find where the posterior is peaked, and then approximate the posterior as Gaussian near the peak. We will then spend the bulk of the class using Markov chain Monte Carlo methods, before returning to variational inference, a more sophisticated way of approximating posteriors.
Computing environment¶
[14]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,bokeh,holoviews,bebi103,jupyterlab
CPython 3.8.5
IPython 7.19.0
numpy 1.19.2
pandas 1.2.0
scipy 1.5.2
bokeh 2.2.3
holoviews 1.14.0
bebi103 0.1.2
jupyterlab 2.2.6