Gaussian process hyperparameters by optimization

Data set download


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st

import cmdstanpy
import arviz as az

import bebi103

import bokeh.io
bokeh.io.output_notebook()

In the previous lesson we introduced Gaussian processes (GPs) with particular emphasis on the following model where the data consist of centered-and-scaled measurements \(\mathbf{y}\) acquired at positions \(\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots)\).

\begin{align} &\theta_k \sim \text{some hyperprior}\\[1em] &\boldsymbol{\sigma} \sim \text{some prior}\\[1em] &f \mid \mathbf{X}, \theta_k \sim \text{GP}(\mathbf{0}, k(\mathbf{x}, \mathbf{x}' ; \theta_k)),\\[1em] &\mathbf{y} \mid \mathbf{X}, f, \boldsymbol{\sigma} \sim \mathrm{MultiNorm}(f(\mathbf{X}), \mathrm{diag}(\boldsymbol{\sigma}^2)). \end{align}

We learned that we can marginalize out \(f\), giving a simpler model

\begin{align} &\theta_k \sim \text{some hyperprior}\\[1em] &\boldsymbol{\sigma} \sim \text{some prior}\\[1em] &\mathbf{y} \mid \mathbf{X}, \boldsymbol{\sigma}, \theta_k \sim \mathrm{MultiNorm}(\mathbf{0}, \mathsf{K}_\mathbf{y}), \end{align}

where \(\mathsf{K}_\mathbf{y} = \mathsf{K}(\mathbf{X}, \mathbf{X}'; \theta_k) + \mathrm{diag}(\boldsymbol{\sigma}^2)\), with \(\mathsf{K}(\mathbf{X}, \mathbf{X}';\theta_k)\) being the covariance matrix produced by the kernel \(k(\mathbf{x}, \mathbf{x}' ; \theta_k)\). This likelihood and prior together define the posterior distribution of the model. The parameters that remain to be determined are \(\theta_k\) and \(\boldsymbol{\sigma}\). In this lesson, we will use an optimization approach to obtain MAP estimates for these parameters.
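
Concretely, if we write the posterior density as \(g(\theta_k, \boldsymbol{\sigma} \mid \mathbf{y}, \mathbf{X})\), obtaining the MAP estimates amounts to solving the optimization problem

\begin{align} (\hat{\theta}_k, \hat{\boldsymbol{\sigma}}) = \underset{\theta_k, \boldsymbol{\sigma}}{\arg\max}\;\ln g(\theta_k, \boldsymbol{\sigma} \mid \mathbf{y}, \mathbf{X}), \end{align}

where the log posterior is, up to an additive constant, the sum of the log prior and log likelihood. In practice we do this by minimizing the negative log posterior, first with SciPy and then with Stan.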

To start with, we will use the same data set as last time from Wolfenden and Snider, which can be downloaded here.

[2]:
df = pd.read_csv(os.path.join(data_path, 'wolfenden_arrhenius.csv'))
df['T (K)'] = 1000 / df['1000/T (1/K)']
df['k (1/s)'] = np.exp(df['ln k (1/s)'])

T = df["T (K)"].values
k = df["k (1/s)"].values

While we’re at it, we should center and scale the data and also set up the points at which we want to evaluate the functions generated by the posterior Gaussian process.

[3]:
k_scaled = (k - k.mean()) / k.std()
T_scaled = (T - T.mean()) / T.std()

# Sample at 250 points
Nstar = 250

# Set up xstar
T_range = T_scaled.max() - T_scaled.min()
xstar = np.linspace(
    T_scaled.min() - 0.05 * T_range, T_scaled.max() + 0.05 * T_range, Nstar
)

Priors for hyperparameters

Before we can proceed with inference, we need to first stipulate which kernel we will use and the hyperpriors for its hyperparameters. We will use a squared exponential kernel, which has hyperparameters \(\rho\) (a length scale) and \(\alpha\) (a marginal deviation). That is, \(\theta_k = (\alpha, \rho)\).
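
As a reminder, the SE kernel is

\begin{align} k(\mathbf{x}, \mathbf{x}'; \alpha, \rho) = \alpha^2\,\exp\left(-\frac{\left\|\mathbf{x} - \mathbf{x}'\right\|^2}{2\rho^2}\right), \end{align}

so \(\alpha\) sets the amplitude of the function and \(\rho\) sets the length scale over which it varies.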

In choosing the priors for these hyperparameters, it is important to use domain knowledge about what kinds of functions might describe the data (remembering that the data are centered and scaled!). Specifically, \(\alpha\) quantifies how large an amplitude the function has and \(\rho\) quantifies how wiggly the function is (smaller \(\rho\) means a more rapidly varying function).

For \(\alpha\), a Half-Normal prior is usually reasonable, capturing both functions that do not vary much and those that do. For centered and scaled data that I suspect are free of extreme outliers, I usually choose \(\alpha \sim \text{HalfNorm}(2)\), but you can be more generous with the tail of your prior if you suspect there may be outliers.

Choosing a prior for the length scale \(\rho\) is a bit trickier. Specifically, we do not want \(\rho\) to be too small, since this would result in a very wiggly function that would contort itself to go right through all of the data and render the parameter \(\sigma\) irrelevant. So, it is wise to use a prior that pushes \(\rho\) away from zero. An Inverse Gamma distribution works well for this purpose: you can choose a parametrization that sufficiently pushes the length scale away from zero, and its heavy tail can readily encompass longer length scales. You should think carefully about your choice, keeping in mind whether or not you centered and scaled the x-data. As a generic Inverse Gamma prior for centered and scaled x-data that varied on a linear scale before centering and scaling, I use InvGamma(0.5, 2), but you should still think carefully about the prior for your specific modeling problem every time.
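
To get a feel for what the InvGamma(0.5, 2) prior implies, we can look at a few of its percentiles. This is just a quick check, not part of the analysis that follows; note that SciPy's invgamma(a, scale=b) parametrization corresponds to InvGamma(a, b).

# Percentiles of the InvGamma(0.5, 2) prior for rho (quick check, not part
# of the analysis below). The low percentiles sit well away from zero and
# the upper ones reach long length scales, reflecting the heavy right tail.
rho_prior = st.invgamma(0.5, loc=0, scale=2)

# 1st, 50th, and 99th percentiles
print(rho_prior.ppf([0.01, 0.5, 0.99]))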

Finally, I will choose a Half-Normal prior for \(\sigma\), giving a complete model of

\begin{align} &\alpha \sim \text{HalfNorm}(2)\\[1em] &\rho \sim \text{InvGamma}(0.5, 2)\\[1em] &\sigma \sim \text{HalfNorm}(0.1)\\[1em] &\mathbf{y} \mid \mathbf{X}, \sigma, \alpha, \rho \sim \mathrm{MultiNorm}(\mathbf{0}, \mathsf{K}_\mathbf{y}), \end{align}

where \(\mathsf{K}_\mathbf{y}\) is constructed from an SE kernel parametrized by \(\alpha\) and \(\rho\), with \(\sigma^2\) added to its diagonal.

Computing the MAP with SciPy

As we did in earlier lessons, we can use SciPy to compute the maximum a posteriori (MAP) parameter values of \(\alpha\), \(\rho\), and \(\sigma\). We proceed exactly as we did before, first by coding up the log prior.

[4]:
def log_prior(sigma, alpha, rho):
    """Log prior for homoscedastic error and GP hyperparameters"""
    # Choose 1e-6 as lower bound for sigma to protect against
    # singular covariance matrix
    if sigma <= 1e-6 or alpha <= 0 or rho <= 0:
        return -np.inf

    lp = st.halfnorm.logpdf(sigma, 0, 0.1)
    lp += st.halfnorm.logpdf(alpha, 0, 2)
    lp += st.invgamma.logpdf(rho, 0.5, loc=0, scale=2)

    return lp

Next, we define the log likelihood. To compute it, we first need to compute the covariance matrix \(\mathsf{K}_\mathbf{y}\) using the SE kernel. This is accomplished using the bebi103.gp.cov_exp_quad() function. We then add \(\sigma^2\) to the diagonal of that matrix. The log likelihood is given by the log PDF of the Multinormal distribution with zero mean and covariance \(\mathsf{K}_\mathbf{y}\) evaluated at the measured data points.

[5]:
def log_likelihood(x, y, sigma, alpha, rho):
    """Normal log likelihood for centered and scaled GP."""
    diag_to_add = sigma ** 2 * np.eye(len(x))
    Ky = bebi103.gp.cov_exp_quad(x, alpha=alpha, rho=rho) + diag_to_add

    return st.multivariate_normal.logpdf(y, np.zeros_like(x), Ky)
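
For a small data set like this one, st.multivariate_normal.logpdf() is perfectly adequate. For larger data sets, or if you run into numerical issues, the same log likelihood can be evaluated more stably using a Cholesky decomposition of \(\mathsf{K}_\mathbf{y}\), the same trick we use in the Stan model later in this lesson. Here is a sketch of such an implementation; the function name log_likelihood_cholesky is ours, not part of the bebi103 package.

import scipy.linalg

def log_likelihood_cholesky(x, y, sigma, alpha, rho):
    """Normal log likelihood for centered and scaled GP, computed
    via a Cholesky decomposition of Ky for numerical stability."""
    Ky = bebi103.gp.cov_exp_quad(x, alpha=alpha, rho=rho) + sigma**2 * np.eye(len(x))

    # Ky = L @ L.T with L lower triangular
    L = np.linalg.cholesky(Ky)

    # Solve L z = y, so that y' Ky^{-1} y = z' z
    z = scipy.linalg.solve_triangular(L, y, lower=True)

    # log |Ky| = 2 * sum(log(diag(L)))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))

    return -0.5 * (len(x) * np.log(2 * np.pi) + log_det + np.dot(z, z))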

Finally, we code up the log posterior and negative log posterior to enable optimization.

[6]:
def log_posterior(params, x, y):
    """Log posterior for GP with normal likelihood."""
    sigma, alpha, rho = params

    lp = log_prior(sigma, alpha, rho)
    if lp == -np.inf:
        return lp

    return lp + log_likelihood(x, y, sigma, alpha, rho)


def neg_log_posterior(params, x, y):
    return -log_posterior(params, x, y)

We are now ready to optimize. We provide an initial guess for the parameters \(\sigma\), \(\alpha\), and \(\rho\), specify the centered-and-scaled temperature and rate constant values as arguments, and then minimize the negative log posterior.

[7]:
# Initial guess
params_0 = np.ones(3)

# Perform optimization
args = (T_scaled, k_scaled)
res = scipy.optimize.minimize(neg_log_posterior, params_0, args=args, method="powell")

# Extract results
sigma, alpha, rho = res.x

# Show results
print("α = {0:.4f}, ρ = {1:.4f}, σ = {2:.4f}".format(alpha, rho, sigma))
/Users/bois/opt/anaconda3/lib/python3.8/site-packages/scipy/optimize/optimize.py:2149: RuntimeWarning: invalid value encountered in double_scalars
  tmp2 = (x - v) * (fx - fw)
α = 1.5887, ρ = 2.5791, σ = 0.0886

Excellent! We can now make a plot of the nonparametric functions coming from the posterior GP with our optimal parameters. First, we compute \(\mathbf{m}_*\) and \(\mathsf{\Sigma}_*\) for the values of temperature, \(T_*\), that we want. We again use the gp.posterior_mean_cov() function.
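
Recall that for fixed \(\alpha\), \(\rho\), and \(\sigma\), the posterior GP evaluated at the points \(\mathbf{X}_*\) is a multivariate Normal with

\begin{align} &\mathbf{m}_* = \mathsf{K}_*^\mathsf{T}\,\mathsf{K}_\mathbf{y}^{-1}\,\mathbf{y},\\[1em] &\mathsf{\Sigma}_* = \mathsf{K}_{**} - \mathsf{K}_*^\mathsf{T}\,\mathsf{K}_\mathbf{y}^{-1}\,\mathsf{K}_*, \end{align}

where \(\mathsf{K}_* = \mathsf{K}(\mathbf{X}, \mathbf{X}_*; \theta_k)\) and \(\mathsf{K}_{**} = \mathsf{K}(\mathbf{X}_*, \mathbf{X}_*; \theta_k)\). This is the calculation that bebi103.gp.posterior_mean_cov() performs for us.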

[8]:
mstar, Sigmastar = bebi103.gp.posterior_mean_cov(
    T_scaled, k_scaled, xstar, sigma=sigma, rho=rho, alpha=alpha
)

Finally, to make the plot, we compute the pointwise 95% credible interval (the posterior mean plus or minus 1.96 standard deviations) and then uncenter and unscale for plotting.

[9]:
# Bounds of 95% credible interval
high = mstar + 1.96 * np.sqrt(np.diag(Sigmastar))
low = mstar - 1.96 * np.sqrt(np.diag(Sigmastar))

# Uncenter/scale
Tstar = T.std() * xstar + T.mean()
kstar = k.std() * mstar + k.mean()
low = k.std() * low + k.mean()
high = k.std() * high + k.mean()

# Make plot
p = bokeh.plotting.figure(
    frame_height=250, frame_width=350, x_axis_label="T (K)", y_axis_label="k (1/s)"
)
p.line(Tstar, kstar, line_width=2, color="orange")
p = bebi103.viz.fill_between(
    Tstar,
    high,
    Tstar,
    low,
    show_line=False,
    patch_kwargs=dict(color="orange", alpha=0.3),
    p=p,
)
p.circle(source=df, x="T (K)", y="k (1/s)")

bokeh.io.show(p)

Again, remember that the line in the plot is the most probable function for the MAP values of the hyperparameters and \(\sigma\). The envelope is a credible interval for that function; it is not a posterior predictive interval for data sets. However, we can sample out of the generative model for the posterior with the parameters \(\alpha\), \(\rho\), and \(\sigma\) fixed at their MAP estimates.

Posterior predictive samples

To sample out of a posterior predictive distribution \(f(\tilde{y} \mid y)\) for a fixed set of hyperparameters and \(\sigma\), we need to draw a sample of a function out of the posterior GP, and then draw samples from a Normal distribution that has a mean given by the function and a standard deviation of \(\sigma\).
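
In other words, each posterior predictive sample is generated according to

\begin{align} &\mathbf{f}_* \sim \mathrm{MultiNorm}(\mathbf{m}_*, \mathsf{\Sigma}_*),\\[1em] &\tilde{y}_i \sim \mathrm{Norm}(f_{*,i}, \sigma) \text{ for each point } i, \end{align}

after which we uncenter and unscale the draws to put them back on the original scale of the data.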

[10]:
# Set up random number generator
rg = np.random.default_rng()

def draw_gp_ppc(mstar, Sigmastar, sigma, y_mean, y_std):
    """Draw a GP posterior predictive sample of centered
    and scaled data that is then uncentered and unscaled."""
    # Draw nonparametric function from the posterior GP
    f = rg.multivariate_normal(mstar, Sigmastar)

    # Draw data with homoscedastic Normal noise
    y = rg.normal(f, sigma)

    # Uncenter and unscale
    return y_std * y + y_mean

k_ppc = draw_gp_ppc(mstar, Sigmastar, sigma, k.mean(), k.std())

# Overlay new data on plot
p.circle(Tstar, k_ppc, color='tomato', alpha=0.5, size=3)

bokeh.io.show(p)

We can generate many of these samples and then make a posterior predictive plot. Note, though, that this is not a full posterior predictive plot. Rather, we are fixing \(\alpha\), \(\rho\), and \(\sigma\) to their MAP values and are then drawing the nonparametric function that serves as the location parameter for the Normal likelihood.

[11]:
k_ppc = np.array(
    [draw_gp_ppc(mstar, Sigmastar, sigma, k.mean(), k.std()) for _ in range(1000)]
)

bokeh.io.show(
    bebi103.viz.predictive_regression(k_ppc, Tstar, data=np.stack((T, k)).transpose(),)
)

The posterior predictive plot looks good!

Obtaining hyperparameters by optimizing with Stan

While we have not discussed it thus far, Stan does have functionality to find MAP parameters by optimization. It may be easier to code up your model in Stan and then use its optimizer, which can be the case for GP-based models. The Stan code for the model we have been considering is

data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}


parameters {
  real<lower=0> alpha;
  real<lower=0> rho;
  real<lower=0> sigma;
}


model {
  matrix[N, N] cov = cov_exp_quad(x, alpha, rho)
                   + diag_matrix(rep_vector(square(sigma), N));
  matrix[N, N] L_cov = cholesky_decompose(cov);

  alpha ~ normal(0, 2);
  rho ~ inv_gamma(0.5, 2.0);
  sigma ~ normal(0, 0.1);

  y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}

Note that we use the Cholesky decomposition, as it is more numerically stable. Note also that the construction of the covariance matrix and its Cholesky decomposition are in the model block instead of transformed parameters because we neither need nor want them stored as output.

To find the MAP parameters, we simply compile the model and then use the optimize() method.

[12]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="gp_kinetics_no_ppc.stan")

    data = dict(N=len(T_scaled), x=T_scaled, y=k_scaled)

    res = sm.optimize(data=data)

The resulting object has the optimal parameters presented in a dictionary, Numpy array, or Pandas data frame. In this case, the dictionary is easiest to use.

[13]:
res.optimized_params_dict
[13]:
OrderedDict([('lp__', 7.79106),
             ('alpha', 1.40337),
             ('rho', 2.25657),
             ('sigma', 0.0893157)])

The “parameter” lp__ is the value of the log posterior at the MAP. The remaining entries are the MAP parameter values, which you can see are close to what we got using SciPy, but differ slightly. This is because Stan does transformations of variables to handle constraints, and with the new transformed variables, the tolerance may be hit with different parameter values. We can verify that it gives the same result by again plotting the resulting nonparametric function.

[14]:
# Plot SciPy result
p = bokeh.plotting.figure(
    frame_height=250, frame_width=350, x_axis_label="T (K)", y_axis_label="k (1/s)"
)
p.line(Tstar, kstar, line_width=2, color="orange")
p = bebi103.viz.fill_between(
    Tstar,
    high,
    Tstar,
    low,
    show_line=False,
    patch_kwargs=dict(color="orange", alpha=0.3),
    p=p,
)

# Plot Stan result
sigma, alpha, rho = res.optimized_params_pd[["sigma", "alpha", "rho"]].values.flatten()
mstar, Sigmastar = bebi103.gp.posterior_mean_cov(
    T_scaled, k_scaled, xstar, sigma=sigma, rho=rho, alpha=alpha
)

high = mstar + 1.96 * np.sqrt(np.diag(Sigmastar))
low = mstar - 1.96 * np.sqrt(np.diag(Sigmastar))

Tstar = T.std() * xstar + T.mean()
kstar = k.std() * mstar + k.mean()
low = k.std() * low + k.mean()
high = k.std() * high + k.mean()

p.line(Tstar, kstar, line_width=2, color="tomato")
p = bebi103.viz.fill_between(
    Tstar,
    high,
    Tstar,
    low,
    show_line=False,
    patch_kwargs=dict(color="tomato", alpha=0.3),
    p=p,
)

# Plot data
p.circle(source=df, x="T (K)", y="k (1/s)")

bokeh.io.show(p)

The results almost exactly overlap.

[15]:
bebi103.stan.clean_cmdstan()

Computing environment

[16]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version       : 3.8.8
IPython version      : 7.21.0

numpy     : 1.19.2
scipy     : 1.6.1
pandas    : 1.2.3
cmdstanpy : 0.9.68
arviz     : 0.11.1
bokeh     : 2.2.2
bebi103   : 0.1.5
jupyterlab: 2.2.6

cmdstan   : 2.26.1