Marginalization by numerical quadrature

Data set download


[1]:
import numpy as np
import pandas as pd
import scipy.integrate
import scipy.stats as st

import bebi103

import holoviews as hv
hv.extension('bokeh')
bebi103.hv.set_defaults()

We will continue working with the spindle length data set, this time using the tubulin conservation model. We’ll start by loading in the data set and making a plot to remind us of the structure of the data.

[2]:
# Load in Data Frame
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')

# Pull out numpy arrays
ell = df['Spindle Length (um)'].values
d = df['Droplet Diameter (um)'].values

# Make a plot
hv.Scatter(
    data=df,
    kdims='Droplet Diameter (um)',
    vdims='Spindle Length (um)',
)
[2]:

The tubulin conservation model

In the tubulin conservation model, the spindle length depends on the droplet diameter \(d\) according to

\begin{align} l(d;\gamma, \phi) = \frac{\gamma d}{(1 + (\gamma d/\phi)^3)^\frac{1}{3}}. \end{align}

Note that \(0 \le \gamma \le 1\) and \(\phi\) is again positive. We again assume that the measurements are Normally distributed about this theoretical curve. We assume a Uniform prior for \(\gamma\), giving a generative model of

\begin{align} &g(\phi) = \text{constant} \\ &\gamma \sim \text{Uniform}(0, 1) \\ &g(\sigma) = 1/\sigma \\ &l_i \sim \text{Normal}(l(d_i;\gamma,\phi), \sigma) \;\forall i. \end{align}

Analytical marginalization

The posterior distribution if

\begin{align} g(\gamma, \phi, \sigma \mid \{d_i, l_i\}) \propto \frac{1}{\sigma^{n+1}}\,\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(l_i-\frac{\gamma d_i}{(1 + (\gamma d_i/\phi)^3)^\frac{1}{3}}\right)^2\right], \end{align}

a trivariate distribution. We cannot really visualize a tri-variate distribution, so we need to marginalize the distribution. If we want the marginalized posterior \(g(\gamma, \phi \mid \{l_i\})\), we can analytically marginalize out \(\sigma\). We again use the fact that

\begin{align} \int_0^\infty \mathrm{d}\sigma \,\frac{1}{\sigma^{n+1}}\,\mathrm{e}^{-x/2\sigma^2} \propto x^{-n/2}. \end{align}

to get

\begin{align} g(\gamma, \phi \mid \{d_i, l_i\}) = \int_0^\infty \mathrm{d}\sigma\,g(\gamma, \phi, \sigma \mid \{d_i, l_i\}) \propto \left(\sum_{i=1}^n\left(l_i-\frac{\gamma d_i}{(1 + (\gamma d_i/\phi)^3)^\frac{1}{3}}\right)^2\right)^{-n/2}. \end{align}

We can then plot this marginalized posterior, which contains information about the two parameters we care most about, \(\phi\) and \(\gamma\), as a contour plot. First, we code up a function to compute the log marginalized posterior.

[3]:
def theor_spindle_length(gamma, phi, d):
    """Compute spindle length using mathematical model"""
    return gamma * d / np.cbrt(1 + (gamma * d / phi) ** 3)


def log_marginalized_posterior(params, d, ell):
    """Log posterior for independent size model"""
    gamma, phi = params

    return (
        -len(ell) * np.log(np.sum((ell - theor_spindle_length(gamma, phi, d)) ** 2)) / 2
    )

Next, we select ranges to make the plot. We can eyeball the plot to guess a good range. We know that \(\phi\) should be somewhere around 37 or so, which is where the data asymptote for large \(d\). The initial slope of the data (which is what \(\gamma\) gives) is somewhere around 3/4. So, we will plot the marginalized posterior in the range \(0.7 \le \gamma \le 0.9\) and \(30 \le \phi \le 45\) µm.

[4]:
gamma = np.linspace(0.7, 0.9, 200)
phi = np.linspace(30, 45, 200)

LOG_POST = np.zeros((len(gamma), len(phi)))

for j, phi_val in enumerate(phi):
    for i, gamma_val in enumerate(gamma):
        LOG_POST[i, j] = log_marginalized_posterior(
            np.array([gamma_val, phi_val]), d, ell
        )

We can now construct a contour plot, again adjusting the log posterior to avoid underflow or overflow when exponentiating.

[5]:
POST = np.exp(LOG_POST - LOG_POST.max())
im = hv.Image((phi, gamma, POST), kdims=["φ (µm)", "γ"])

hv.operation.contours(im, levels=5, overlaid=True).opts(
    hv.opts.Contours(
        cmap=["white"],
        frame_height=300,
        frame_width=300,
        line_width=2,
        show_legend=False,
    )
).opts(padding=0)
[5]:

We have zoomed out too much, but we can identify the peak. Let’s plot with more instructive ranges.

[6]:
gamma = np.linspace(0.8, 0.9, 200)
phi = np.linspace(36, 40, 200)

LOG_POST = np.zeros((len(gamma), len(phi)))

for j, phi_val in enumerate(phi):
    for i, gamma_val in enumerate(gamma):
        LOG_POST[i, j] = log_marginalized_posterior(
            np.array([gamma_val, phi_val]), d, ell
        )

# Store log_post_max for later calcs
log_post_max = LOG_POST.max()
POST = np.exp(LOG_POST - log_post_max)

im = hv.Image((phi, gamma, POST), kdims=["φ (µm)", "γ"])

hv.operation.contours(im, levels=5, overlaid=True).opts(
    hv.opts.Contours(
        cmap=["white"],
        frame_height=300,
        frame_width=300,
        line_width=2,
        show_legend=False,
    )
).opts(padding=0)
[6]:

Numerical marginalization

The marginal distributions \(g(\gamma, \mid \{d_i, l_i\})\) and \(g(\phi \mid \{d_i, l_i\})\) tell us about the individual parameters \(\gamma\) and \(\phi\). To get, e.g., \(g(\gamma, \mid \{d_i, l_i\})\), we have to perform an integral

\begin{align} \int_0^\infty \mathrm{d}\phi\, g(\gamma, \mid \{d_i, l_i\}) \propto \int_0^\infty \mathrm{d}\phi\,\left(\sum_{i=1}^n\left(l_i-\frac{\gamma d_i}{(1 + (\gamma d_i/\phi)^3)^\frac{1}{3}}\right)^2\right)^{-n/2}. \end{align}

Unfortunately, we cannot do this integral analytically. We can do it numerically, though, using numerical quadrature. We will start by obtaining \(g(\phi \mid \{d_i, l_i\})\), which we get by integrating over \(\phi\). If we consider a particular value of \(\phi\), say \(\varphi\), then

\begin{align} g(\varphi\mid \{d_i, l_i\}) = \int_0^1 \mathrm{d}\gamma\,g(\gamma, \varphi\mid \{d_i, l_i\}), \end{align}

Note that the bounds are between zero and one, since those are the physical bounds on \(\gamma\). If we were integrating over \(\phi\), the bounds would be zero to infinity.

For every value of \(\phi\) we want, we need to perform the above integral. The result is a curve, evaluated at a series of points \(\varphi\), that is the unnormalized marginalized posterior \(g(\varphi\mid\{d_i, l_i\})\). Let’s perform the calculation, beginning with specifying a marginalized posterior function that compute the unnormalized posterior while avoiding over/underflow problems using the log_post_max value we calculated when making the plot of the two-dimensional marginalized posterior.

[7]:
def marginalized_posterior(gamma, phi, d, ell, log_post_max):
    return np.exp(log_marginalized_posterior((gamma, phi), d, ell) - log_post_max)

Next, we set up the values of \(\phi\) for which we want the posterior and then evaluate the integrals.

[8]:
phi = np.linspace(30, 45, 200)

unnormalized_marg_post_phi = np.empty_like(phi)

for i, phi_val in enumerate(phi):
    unnormalized_marg_post_phi[i], _ = scipy.integrate.quad(
        marginalized_posterior, 0, 1, args=(phi_val, d, ell, log_post_max)
    )

We can now plot the resulting curve, which is an unnormalized probability density function \(g(\phi \mid \{d_i, l_i\})\).

[9]:
hv.Curve(
    (phi, unnormalized_marg_post_phi),
    kdims='φ (mm)',
    vdims='∝ g(φ | {dᵢ, lᵢ})'
)
[9]:

To normalize this PDF, we need to integrate over the unnormalized curve and divide by the result. In this case, though, we do not have an analytical expression for the PDF, so we cannot write a function for the integrand and therefore cannot use scipy.integrate.quad() to do the integration. Instead, we can integrate it by summing up the area under the curve using the trapezoidal rule. This is implemented in np.trapz().

[10]:
normalization_constant = np.trapz(unnormalized_marg_post_phi, x=phi)

Now, we can plot the normalized result.

[11]:
hv.Curve(
    (phi, unnormalized_marg_post_phi / normalization_constant),
    kdims='φ (mm)',
    vdims='g(φ | {dᵢ, lᵢ}) (1/µm)'
)
[11]:

We can do the same procedure to get the marginalized distribution \(g(\gamma\mid \{d_i, l_i\})\). The wrinkle is that scipy.integrate.quad() integrates over the first variable in the function giving the integrand. As we have the function marginalized_posterior() written, \(\phi\), the variable we need to integrate over, is the second variable. We can get around this by using a lambda function to swap the order of the input.

[12]:
gamma = np.linspace(0.7, 1.0, 200)

unnormalized_marg_post_gamma = np.empty_like(gamma)

for i, gamma_val in enumerate(gamma):
    unnormalized_marg_post_gamma[i], _ = scipy.integrate.quad(
        lambda phi: marginalized_posterior(gamma_val, phi, d, ell, log_post_max), 30, 45
    )

We can then proceed to plot the marginalized posterior \(g(\gamma\mid \{d_i, l_i\})\).

[13]:
norm_const = np.trapz(unnormalized_marg_post_gamma, x=gamma)

hv.Curve(
    (gamma, unnormalized_marg_post_gamma),
    kdims='γ',
    vdims='g(γ | {dᵢ, lᵢ})'
)
[13]:

Computing environment

[14]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,bokeh,holoviews,bebi103,jupyterlab
CPython 3.7.5
IPython 7.11.1

numpy 1.17.4
pandas 0.24.2
scipy 1.3.1
bokeh 1.4.0
holoviews 1.12.7
bebi103 0.0.49
jupyterlab 1.2.4