Bayesian approach to parameter estimation by optimization¶
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
cmd = "pip install --upgrade bebi103 watermark"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
data_path = "../data/"
# ------------------------------
import numpy as np
import scipy.stats as st
import bebi103.viz
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
We have learned in previous lessons that the posterior distribution for a set of parameters \(\theta\) given a data set \(y\) is given by the likelihood \(f(y\mid \theta)\) and prior \(g(\theta)\) according to
\begin{align} g(\theta\mid y) \propto f(y\mid \theta)\,g(\theta). \end{align}
The central goal of Bayesian inference is making sense of this posterior distribution; writing it down is the easy part.
Summarizing the posterior near its maximum¶
For many posteriors, there exists a set of parameters for which the posterior is maximal. This set of parameters, denoted \(\theta^*\), is referred to as the maximum a posteriori parameter set, abbreviated MAP. Given that \(\theta^*\) maximizes the posterior, it is of some importance (but certainly not paramount importance, as we will learn throughout the course), and may be useful in summarizing a posterior.
Finding the MAP is an optimization problem. In some simple, rare cases, the optimization problem may be solved analytically. We will not dwell on these, because they are so seldom encountered. Instead, we will discuss numerical methods for finding \(\theta^*\) as we proceed through this lesson. For now, we will proceed assuming we have found \(\theta^*\) and seek to approximate the posterior distribution near the MAP.
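As a preview of those mechanics, here is a minimal sketch of finding a MAP numerically. The model, the made-up data, the priors, and the choice of the Powell method are all illustrative, not prescriptions; the idea is simply to minimize the negative log posterior with scipy.optimize.minimize().
[ ]:
# A minimal sketch of finding a MAP numerically (hypothetical model and data)
import numpy as np
import scipy.optimize
import scipy.stats as st

# Made-up data for illustration
rng = np.random.default_rng(3252)
y = rng.normal(10.0, 2.0, size=50)

def log_posterior(params, y):
    """Log posterior (up to an additive constant) for y ~ Normal(mu, sigma)."""
    mu, sigma = params
    if sigma <= 0:
        return -np.inf
    log_prior = st.norm.logpdf(mu, 10, 10) + st.halfnorm.logpdf(sigma, 0, 10)
    log_likelihood = np.sum(st.norm.logpdf(y, mu, sigma))
    return log_prior + log_likelihood

# Minimize the *negative* log posterior to find the MAP
res = scipy.optimize.minimize(
    fun=lambda params, y: -log_posterior(params, y),
    x0=np.array([5.0, 1.0]),
    args=(y,),
    method="powell",
)
theta_star = res.x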
Near the MAP, we can express the log posterior as a Taylor series expansion,
\begin{align} \ln g(\theta\mid y) \approx \ln g(\theta^*\mid y) + \frac{1}{2}\left(\theta - \theta^*\right)^\mathsf{T} \cdot \mathsf{B} \cdot \left(\theta - \theta^*\right), \end{align}
where \(\mathsf{B}\) is the symmetric Hessian matrix of second derivatives, with entries
\begin{align} B_{ij} = \left.\frac{\partial^2\,\ln g(\theta\mid y)}{\partial \theta_i\,\partial \theta_j}\right|_{\theta = \theta^*}. \end{align}
In one dimension, this is
\begin{align} \ln g(\theta\mid y) \approx \ln g(\theta^*\mid y) + \frac{1}{2}\left(\left.\frac{\partial^2\,\ln g(\theta\mid y)}{\partial \theta^2}\right|_{\theta = \theta^*}\right) \left(\theta - \theta^*\right)^2 = \ln g(\theta^*\mid y) - \frac{\left(\theta - \theta^*\right)^2}{2\sigma^2}, \end{align}
where we have defined
\begin{align} \sigma^2 = -\left(\left.\frac{\partial^2\,\ln g(\theta\mid y)}{\partial \theta^2}\right|_{\theta = \theta^*}\right)^{-1}. \end{align}
We note that the \(-(\theta-\theta^*)^2/2\sigma^2\) term is the \(\theta\)-dependence of the logarithm of the PDF of a Normal distribution! We can therefore locally approximate the posterior distribution as Normal near the MAP;
\begin{align} g(\theta\mid y) \approx \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left[-\frac{\left(\theta - \theta^*\right)^2}{2\sigma^2}\right]. \end{align}
This generalizes to multiple dimensions, where the covariance matrix \(\mathsf{\Sigma}\) of the approximate multivariate Normal distribution is related to the Hessian matrix by
\begin{align} \mathsf{\Sigma} = - \mathsf{B}^{-1}. \end{align}
Note that the covariance matrix is not the elementwise inverse of the Hessian; it is the inverse of the Hessian matrix as a whole. Note also that optimization theory tells us that because the Hessian is evaluated at \(\theta^*\), which maximizes the posterior, the Hessian must be negative definite. Therefore, the covariance matrix must be positive definite, which we already know must be the case for a multivariate Normal distribution.
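In practice we seldom have the Hessian analytically, so we compute it numerically at the MAP. The sketch below, which continues from the hypothetical model above, uses a simple central finite difference; the step size h is an illustrative tuning choice, not a universal recommendation.
[ ]:
def hessian_fd(f, theta, h=1e-4):
    """Central finite-difference Hessian of scalar function f at theta."""
    n = len(theta)
    B = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            B[i, j] = (
                f(theta + ei + ej)
                - f(theta + ei - ej)
                - f(theta - ei + ej)
                + f(theta - ei - ej)
            ) / (4.0 * h**2)
    return B

# Hessian of the log posterior, evaluated at the MAP from the sketch above
B = hessian_fd(lambda params: log_posterior(params, y), theta_star)

# Covariance matrix of the approximate Normal posterior: Σ = -B⁻¹
cov = -np.linalg.inv(B)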
Demonstration of the Normal approximation¶
Actually, the probability density function of any continuous, peaked distribution can be approximated as Normal near its mode. (A mode of a distribution is a point at which its PDF is maximal; the MAP is a mode of a posterior distribution.) We can demonstrate this fact by approximating a Gamma distribution locally as Normal. Recall that the PDF for a Gamma distribution is
\begin{align} f(x ; \alpha, \beta) = \frac{1}{\Gamma(\alpha)}\,\frac{(\beta x)^\alpha}{x}\,\mathrm{e}^{-\beta x}. \end{align}
We can analytically find the mode by setting
\begin{align} \frac{\mathrm{d}\,\ln f}{\mathrm{d}x} = \frac{\mathrm{d}}{\mathrm{d}x}\left((\alpha-1)\ln x - \beta x\right) = \frac{\alpha-1}{x} - \beta = 0, \end{align}
and solving to get \(x^* = (\alpha - 1) / \beta\). We can also compute the second derivative as
\begin{align} \frac{\mathrm{d}^2\,\ln f}{\mathrm{d}x^2} = -\frac{\alpha - 1}{x^2}, \end{align}
which evaluated at \(x^*\) is \(-\beta^2 / (\alpha - 1)\). Therefore, the scale parameter for the approximate Normal distribution is \(\sigma = \sqrt{\alpha - 1} / \beta\). We can plot the distribution and its Normal approximation as a visualization. We do this for \(\alpha = 50\) and \(\beta = 20\).
[2]:
# Parameters of the Gamma distribution
alpha = 50
beta = 20

# Gamma PDF and its Normal approximation near the mode
x = np.linspace(1, 5, 400)
y_gamma = st.gamma.pdf(x, alpha, loc=0, scale=1 / beta)
y_norm = st.norm.pdf(x, (alpha - 1) / beta, np.sqrt(alpha - 1) / beta)
p = bokeh.plotting.figure(
frame_width=400,
frame_height=200,
x_range=[1, 5],
x_axis_label='x',
y_axis_label='PDF'
)
p.line(x, y_gamma, line_width=2, legend_label='Gamma')
p.line(x, y_norm, line_width=2, color='tomato', legend_label='Normal')
bokeh.io.show(p)
Credible intervals¶
An \(\alpha\)-percent Bayesian credible interval for a parameter is an interval that contains \(\alpha\) percent of the posterior probability density. (This is different from a frequentist confidence interval, though the latter is often erroneously interpreted as the former.) For example, if \(g(\theta \mid y)\) is the marginalized posterior probability density function for a single parameter \(\theta\) and \(G(\theta\mid y)\) is the corresponding cumulative distribution function, then a 95% credible interval is \([G^{-1}(0.025),\;G^{-1}(0.975)]\).
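As a quick illustration (assuming, hypothetically, a marginal posterior that happens to be Gamma distributed with \(\alpha = 5\) and \(\beta = 2\)), we can get the bounds of a 95% credible interval from the inverse CDF, available as the ppf() method of scipy.stats distributions.
[ ]:
# 95% credible interval for a hypothetical Gamma(α=5, β=2) marginal posterior,
# computed from the inverse CDF
low, high = st.gamma.ppf([0.025, 0.975], 5, loc=0, scale=1 / 2)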
If we approximate the posterior as a multivariate Normal distribution, we have analytical results for credible intervals for each parameter. This is the main benefit of using optimization; it affords us an approximate (Normal) expression for the posterior that has convenient, known analytical results. This enables us to make sense of the posterior.
Specifically, if \(\Sigma_{ii}\) is a diagonal element of the covariance matrix \(\mathsf{\Sigma}\), then, e.g., the 95% credible interval for parameter \(\theta_i\) is centered on the MAP estimate \(\theta_i^*\) and is \([\theta_i^* - 1.96\sqrt{\Sigma_{ii}},\;\theta_i^* + 1.96\sqrt{\Sigma_{ii}}]\). Because the credible intervals are symmetric under this approximation, we can report the estimates as \(\theta_i^* \pm 1.96 \sqrt{\Sigma_{ii}}\).
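Continuing the sketch from earlier (with theta_star and cov as computed above), these approximate intervals come straight from the diagonal of the covariance matrix.
[ ]:
# Approximate 95% credible intervals from the Normal approximation:
# θ*_i ± 1.96 √Σ_ii
half_width = 1.96 * np.sqrt(np.diag(cov))
ci_low = theta_star - half_width
ci_high = theta_star + half_width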
Credible intervals may be skewed¶
We are approximating the PDF of a distribution close to its mode as Normal. However, for an asymmetric distribution, which posterior distributions can certainly be, the mode may not be all that relevant as far as credible regions go. As an extreme example, consider an Exponential distribution. Its mode is zero, which is also the minimum allowed value. If we were to compute a central 95% credible region for a posterior that happened to be Exponential, the mode would not be included. In cases like these, though the Normal distribution approximately describes the PDF of the posterior near the MAP, the true credible regions can be quite skewed, and the symmetric intervals given by the Normal approximation can be misleading.
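We can check this numerically (using a unit rate parameter, chosen arbitrarily): the central 95% credible interval of an Exponential distribution excludes its mode at zero.
[ ]:
# Central 95% credible interval for an Exponential distribution (rate = 1)
low, high = st.expon.ppf([0.025, 0.975])
# low ≈ 0.025, high ≈ 3.7; the mode, zero, lies outside the interval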
To demonstrate this, let us again return to the Gamma example, this time with \(\alpha = 5\) and \(\beta = 2\). The mode of this Gamma distribution is 2, but the median is about 2.34. If we look graphically, we can see right away that the Normal approximation is not as good as the \(\alpha = 50\), \(\beta = 20\) case.
[3]:
# Same comparison, now for a more strongly skewed Gamma distribution
x = np.linspace(0, 8, 400)
alpha = 5
beta = 2
y_gamma = st.gamma.pdf(x, alpha, loc=0, scale=1 / beta)
y_norm = st.norm.pdf(x, (alpha - 1) / beta, np.sqrt(alpha - 1) / beta)
p = bokeh.plotting.figure(
frame_width=400,
frame_height=200,
x_range=[0, 8],
x_axis_label='x',
y_axis_label='PDF'
)
p.line(x, y_gamma, line_width=2, legend_label='Gamma')
p.line(x, y_norm, line_width=2, color='tomato', legend_label='Normal')
bokeh.io.show(p)
Let’s look at how this will affect the credible intervals. We will compute the lower and upper bounds of a credible interval directly from the Gamma distribution and from its Normal approximation. We can compute credible intervals using the ppf() method of distributions in scipy.stats, which returns the inverse of the CDF. We can then plot the 1%, …, 99% credible intervals for the Gamma distribution and for its Normal approximation.
[4]:
# Widths (in percent) of the central credible intervals to compute
perc_cred_int = np.linspace(1, 99, 200)

# Location and scale parameters of the Normal approximation
mu = (alpha - 1) / beta
sigma = np.sqrt(alpha - 1) / beta

# Bounds of the central credible intervals from the inverse CDF
gamma_low = st.gamma.ppf((50 - perc_cred_int / 2) / 100, alpha, loc=0, scale=1 / beta)
gamma_high = st.gamma.ppf((50 + perc_cred_int / 2) / 100, alpha, loc=0, scale=1 / beta)
norm_low = st.norm.ppf((50 - perc_cred_int / 2) / 100, mu, sigma)
norm_high = st.norm.ppf((50 + perc_cred_int / 2) / 100, mu, sigma)
p = bebi103.viz.fill_between(
gamma_low,
perc_cred_int,
gamma_high,
perc_cred_int,
patch_kwargs=dict(alpha=0.5),
x_axis_label="x",
y_axis_label="percent credible interval",
y_range=[0, 100],
)
p = bebi103.viz.fill_between(
norm_low,
perc_cred_int,
norm_high,
perc_cred_int,
patch_kwargs=dict(alpha=0.5, color="tomato"),
line_kwargs=dict(color="tomato"),
p=p,
)
bokeh.io.show(p)
For tiny credible intervals, the shift is clear. The median for a Gamma distribution with these parameters is shifted rightward from its mode, whereas the mode and median for a Normal distribution are equal. As we get broader credible intervals, the Normal approximation gets worse, notably with the 99% credible interval containing negative values (not allowed for a Gamma distribution) and missing important values at the high end.
Why use the MAP for parameter estimation?¶
Despite these drawbacks, using the MAP with a Normal approximation for parameter estimation is still useful in Bayesian inference, at least for simple, well-behaved problems. Why? The most important reason is speed. For problems with simple models (which we will encounter in the next section of this lesson), optimization is much faster than other methods for characterizing the posterior, such as directly plotting it or using Markov chain Monte Carlo. If you have high-throughput data, you may not have the computational resources for a full Bayesian treatment with those methods. Provided your model is tractable by optimization methods, finding the MAP and summarizing with an approximate Normal distribution is an attractive approach.
In many cases, the Normal approximation is also a good one. But this is definitely not guaranteed! In real inference problems, it is very difficult to identify pathologies that would lead to a breakdown of the Normal approximation, since checking against the real posterior would itself require computationally intensive methods.
In the next sections of this lesson, we will work through the mechanics of finding the MAP and computing credible regions.
Computing environment¶
[5]:
%load_ext watermark
%watermark -v -p numpy,scipy,bokeh,bebi103,jupyterlab
Python implementation: CPython
Python version : 3.8.5
IPython version : 7.19.0
numpy : 1.19.2
scipy : 1.5.2
bokeh : 2.2.3
bebi103 : 0.1.1
jupyterlab: 2.2.6