Gaussian processes with non-Normal likelihoods¶
[1]:
%load_ext blackcellmagic
%load_ext autoreload
%autoreload 2
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import numpy as np
import pandas as pd
import cmdstanpy
import arviz as az
import iqplot
import bebi103
import bokeh.io
bokeh.io.output_notebook()
Thus far in our studies of Gaussian processes, we have considered models with Normal likelihoods, which enable marginalization over the latent variables (the values of the nonparametric function we are fitting). After optimizing or sampling the hyperparameters, we could then compute the function values (the latent variables) by drawing samples out of the posterior, for which we have an analytical expression.
In this lesson, we will explicitly sample the latent variables. We will first do so for a model with a Normal likelihood as a demonstration. We will then do it for a non-Normal likelihood, where it is necessary.
Generating samples of latent variables using MCMC¶
We will again consider the Wolfenden, et al., chemical kinetics data set. We’ll start by loading the data set, centering and scaling it, and setting up the points at which we want to evaluate the nonparametric function.
[2]:
# Load data
df = pd.read_csv(os.path.join(data_path, "wolfenden_arrhenius.csv"))
df["T (K)"] = 1000 / df["1000/T (1/K)"]
df["k (1/s)"] = np.exp(df["ln k (1/s)"])
T = df["T (K)"].values
k = df["k (1/s)"].values
Just as a reminder, let’s make a quick plot of the data.
[3]:
# Make plot
p = bokeh.plotting.figure(
    frame_height=250, frame_width=350, x_axis_label="T (K)", y_axis_label="k (1/s)"
)
p.circle(source=df, x="T (K)", y="k (1/s)")
bokeh.io.show(p)
A GP generative model¶
The unmarginalized generative model we have been using is
\begin{align} &\alpha \sim \text{HalfNorm}(2)\\[1em] &\rho \sim \text{InvGamma}(0.5, 2)\\[1em] &\sigma \sim \text{HalfNorm}(0.1)\\[1em] &f \mid \mathbf{T}, \alpha, \rho \sim \text{GP}(\mathbf{0}, k_\mathrm{SE}(\mathbf{T}, \mathbf{T}' ; \alpha, \rho)),\\[1em] &y_i \mid \mathbf{T}, f, \sigma \sim \text{Norm}(f(T_i), \sigma)\;\forall i. \end{align}
We can equivalently write this as
\begin{align} &\alpha \sim \text{HalfNorm}(2)\\[1em] &\rho \sim \text{InvGamma}(0.5, 2)\\[1em] &\sigma \sim \text{HalfNorm}(0.1)\\[1em] &\mathbf{f} \mid \mathbf{T}, \alpha, \rho \sim \text{MultiNorm}(\mathbf{0}, \mathsf{K}),\\[1em] &y_i \mid f_i, \sigma \sim \text{Norm}(f_i, \sigma)\;\forall i, \end{align}
where \(\mathsf{K}\) is the covariance matrix constructed from the squared exponential kernel with hyperparameters \(\rho\) and \(\alpha\). Importantly, we are going to sample the values of \(f\) at temperature \(T_i\).
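To make the construction of \(\mathsf{K}\) concrete, here is a minimal NumPy sketch of building a squared exponential covariance matrix. The helper cov_se and the values of \(\alpha\) and \(\rho\) are illustrative only; in the Stan code below, cov_exp_quad does this for us.
import numpy as np

def cov_se(x1, x2, alpha, rho):
    """Squared exponential covariance, K[i, j] = α² exp(-(x1[i] - x2[j])² / (2ρ²))."""
    diff = np.subtract.outer(x1, x2)
    return alpha**2 * np.exp(-(diff**2) / (2 * rho**2))

# Covariance matrix for five evenly spaced points with illustrative hyperparameters
x = np.linspace(0.0, 1.0, 5)
K = cov_se(x, x, alpha=1.0, rho=0.5)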
Noncentering¶
This model has a hierarchical structure, with a layer of latent variables \(\mathbf{f}\) between the hyperparameters \(\alpha\) and \(\rho\) and the observed data \(\mathbf{y}\). As such, it is useful for sampling to noncenter the model. We have not yet noncentered a multivariate conditional prior. Say we have a multivariate Normal conditional prior for parameters \(\boldsymbol{\theta}\) conditioned on hyperparameters \(\boldsymbol{\phi}\) and \(\mathsf{\Sigma}\),
\begin{align} \boldsymbol{\theta} \sim \text{MultiNorm}(\boldsymbol{\phi}, \mathsf{\Sigma}). \end{align}
If \(\mathsf{L}\) is the Cholesky decomposition of \(\mathsf{\Sigma}\), then this can be equivalently written in noncentered form as
\begin{align} &\tilde{\theta}_i \sim \text{Norm}(0, 1)\;\forall i,\\[1em] &\boldsymbol{\theta} = \boldsymbol{\phi} + \mathsf{L}\cdot \tilde{\boldsymbol{\theta}} \end{align}
Therefore, we can write the generative model in noncentered form,
\begin{align} &\alpha \sim \text{HalfNorm}(2)\\[1em] &\rho \sim \text{InvGamma}(0.5, 2)\\[1em] &\sigma \sim \text{HalfNorm}(0.1)\\[1em] &\tilde{f}_i \sim \text{Norm}(0, 1)\;\forall i,\\[1em] &\mathbf{f} = \mathsf{L}\cdot\tilde{\mathbf{f}},\\[1em] &y_i \mid f_i, \sigma \sim \text{Norm}(f_i, \sigma)\;\forall i, \end{align}
where we have used the fact that because of the preprocessing of the data set, the mean of the GP is zero. Here, \(\mathsf{L}\) is the Cholesky decomposition of \(\mathsf{K}\).
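As a quick numerical check of the noncentering identity, the following NumPy sketch draws \(\tilde{\boldsymbol{\theta}} \sim \text{Norm}(0, 1)\) and verifies that \(\boldsymbol{\phi} + \mathsf{L}\cdot\tilde{\boldsymbol{\theta}}\) has approximately the target mean and covariance. The particular mean and covariance here are arbitrary illustrative choices.
import numpy as np

rg = np.random.default_rng(3252)

# An illustrative target mean and (positive definite) covariance
phi = np.array([1.0, -2.0, 0.5])
A = rg.normal(size=(3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)

# Noncentered draws: θ = φ + L · θ̃ with θ̃ ~ Norm(0, 1)
L = np.linalg.cholesky(Sigma)
theta_tilde = rg.normal(size=(3, 100_000))
theta = phi[:, np.newaxis] + L @ theta_tilde

# Sample mean and covariance should be close to phi and Sigma
print(theta.mean(axis=1))
print(np.cov(theta))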
Sampling latent variables with Stan¶
Now that we have our model, we can code it up in Stan and draw samples.
data {
  int<lower=1> N;
  vector[N] y;
  int<lower=1> Nstar;
  array[N] int xstar_inds;
  array[Nstar] real xstar;
}

transformed data {
  real delta = 1e-8;
}

parameters {
  real<lower=0> alpha;
  real<lower=0> rho;
  real<lower=0> sigma;
  vector[Nstar] f_tilde;
}

transformed parameters {
  vector[Nstar] f;

  {
    matrix[Nstar, Nstar] K = cov_exp_quad(xstar, alpha, rho)
                             + diag_matrix(rep_vector(delta, Nstar));
    matrix[Nstar, Nstar] L = cholesky_decompose(K);

    f = L * f_tilde;
  }
}

model {
  alpha ~ normal(0.0, 2.0);
  rho ~ inv_gamma(0.5, 2.0);
  sigma ~ normal(0.0, 0.1);
  f_tilde ~ normal(0.0, 1.0);

  y ~ normal(f[xstar_inds], sigma);
}

generated quantities {
  array[N] real y_ppc;

  y_ppc = normal_rng(f[xstar_inds], sigma);
}
There are a few things to note in this Stan code.

1. This is generic Stan code for sampling univariate values of \(x\) and \(y\). For the problem of the kinetics data, \(x = T\) and \(y = k\).

2. The array f has the function values at all points where we want to sample, xstar. Note that xstar must also contain the \(x\)-values of the points where data were actually measured. The xstar_inds array is the same length as the measured data y, and it contains the indices of the corresponding values in xstar where the y values were measured.

3. We need to add a small value to the diagonal of the covariance matrix to ensure positive definiteness.
Prior to sampling, we need to generate the arrays xstar and xstar_inds so that they contain the \(x\)-values for the measured data. We also need to ensure that xstar has no repeats. This can be accomplished with the bebi103.gp.append_sort_index() function. It takes as input the values for which you want samples of the latent variables, \(\mathbf{x}_*\), the \(x\)-values from the data set, \(\mathbf{x}\), and an index_origin kwarg that indicates whether to start indexing at zero (for Python) or one (for Stan). It then generates a set of \(x\)-values for which to get samples of \(f\) by appending \(\mathbf{x}\) to \(\mathbf{x}_*\) and deleting duplicates. It also generates the array of indices linking the values of xstar to the measured quantities y.
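For intuition, here is a rough NumPy sketch of the bookkeeping this function performs. The helper append_sort_index_sketch is hypothetical and only illustrates the idea; the actual bebi103 implementation may differ in its details.
import numpy as np

def append_sort_index_sketch(xstar, x_data, index_origin=1):
    """Rough sketch: append the measured x-values to xstar, drop duplicates
    and sort, then return the index of each measured x-value in the result."""
    xstar_out = np.unique(np.concatenate((xstar, x_data)))
    inds = np.searchsorted(xstar_out, x_data) + index_origin

    return xstar_out, inds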
[4]:
# Center and scale
k_scaled = (k - k.mean()) / k.std()
T_scaled = (T - T.mean()) / T.std()
# Sample at 50 points
Nstar = 50
# Set up xstar
T_range = T_scaled.max() - T_scaled.min()
xstar = np.linspace(
    T_scaled.min() - 0.05 * T_range, T_scaled.max() + 0.05 * T_range, Nstar
)
# Generate the arrays for use in Stan
xstar, xstar_inds = bebi103.gp.append_sort_index(xstar, T_scaled, index_origin=1)
Let’s take a look at the results.
[5]:
xstar, xstar_inds, T
[5]:
(array([-1.73988004, -1.67222743, -1.60457482, -1.58919923, -1.53692222,
-1.46926961, -1.401617 , -1.3339644 , -1.26631179, -1.22693001,
-1.19865918, -1.13100658, -1.06335397, -0.99570136, -0.96124699,
-0.92804876, -0.86039615, -0.79274354, -0.72509093, -0.65743833,
-0.58978572, -0.52324311, -0.52213311, -0.45448051, -0.3868279 ,
-0.31917529, -0.25152269, -0.18459205, -0.18387008, -0.11621747,
-0.04856487, 0.01908774, 0.08674035, 0.15439295, 0.22204556,
0.28969817, 0.35735077, 0.42500338, 0.49265599, 0.50367757,
0.56030859, 0.6279612 , 0.69561381, 0.76326641, 0.81156517,
0.83091902, 0.89857163, 0.96622423, 1.03387684, 1.10152945,
1.16918205, 1.23683466, 1.2418742 , 1.30448727, 1.37213987,
1.42441689, 1.43979248, 1.50744509, 1.57509769]),
array([56, 53, 45, 40, 40, 28, 22, 15, 10, 4]),
array([526.97268063, 522.8398459 , 513.09748852, 506.12679808,
506.12679808, 490.5441167 , 482.87692992, 472.96035845,
466.94519538, 458.74328484]))
We see that we have a repeat measurement around \(T = 506\) K. The xstar_inds array has a double entry for index 40, which corresponds to the entry in xstar for that data point.
It is important to note that I chose \(N_* = 50\). I chose this instead of a larger number because the computational cost of sampling scales like \(N_*^3\), since we need to perform a Cholesky decomposition of an \(N_*\times N_*\) matrix for every leapfrog step of the sampler (not just once per sample, as in the generated quantities block when we sample with the latent variables marginalized out). This can lead to very long sampling times if we allow \(N_*\) to be too large.
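To get a feel for that cubic scaling, here is a quick, illustrative timing of NumPy’s Cholesky decomposition for a few matrix sizes. Exact times will vary by machine, and small sizes are dominated by overhead, but the growth with matrix size is apparent.
import time

import numpy as np

rg = np.random.default_rng(3252)

for n in (50, 100, 200, 400):
    # Build a positive definite n-by-n matrix
    A = rg.normal(size=(n, n))
    K = A @ A.T + n * np.eye(n)

    # Time repeated Cholesky decompositions
    start = time.perf_counter()
    for _ in range(100):
        np.linalg.cholesky(K)
    elapsed = (time.perf_counter() - start) / 100

    print(f"n = {n:4d}: {elapsed * 1e6:8.1f} µs per decomposition")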
We now have everything in place to set up sampling!
[6]:
data = dict(
    N=len(T_scaled),
    y=k_scaled,
    Nstar=len(xstar),
    xstar_inds=xstar_inds,
    xstar=xstar
)
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="gp_kinetics_no_marg.stan")
    samples = sm.sample(data=data)
samples = az.from_cmdstanpy(samples, posterior_predictive="y_ppc")
As usual, we should check diagnostics.
[7]:
bebi103.stan.check_all_diagnostics(samples)
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
45 of 4000 (1.125%) iterations ended with a divergence.
Try running with larger adapt_delta to remove divergences.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
[7]:
4
It’s not too bad, but we do see a lot of divergences. This is not surprising given the hierarchical model. We have already noncentered, so we can try dialing up adapt_delta.
[8]:
with bebi103.stan.disable_logging():
    samples = sm.sample(data=data, adapt_delta=0.95)
samples = az.from_cmdstanpy(samples, posterior_predictive="y_ppc")
bebi103.stan.check_all_diagnostics(samples)
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
11 of 4000 (0.275%) iterations ended with a divergence.
Try running with larger adapt_delta to remove divergences.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
[8]:
4
Much better! Let’s take a look at the corner plot of the hyperparameters and \(\sigma\) samples.
[9]:
bokeh.io.show(bebi103.viz.corner(samples, parameters=["alpha", "rho", "sigma"]))
If we compare this corner plot to the one we acquired by sampling the posterior with the latent variables marginalized out (see this portion of the lesson), they are the same, as we would expect.
We can also plot the samples of \(\mathbf{f}_*\) as before.
[10]:
fstar_scaled = (
    samples.posterior["f"]
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "f_dim_0")
)
# Uncenter and unscale
fstar = k.std() * fstar_scaled + k.mean()
Tstar = T.std() * xstar + T.mean()
bokeh.io.show(
    bebi103.viz.predictive_regression(
        fstar,
        Tstar,
        data=np.stack((T, k)).transpose(),
        color="orange",
        data_kwargs=dict(line_color="#1f78b4", fill_color="#1f78b4"),
    )
)
Sampling with a Poisson likelihood¶
In the previous example, we decided to do things the hard way by sampling out of a hierarchical posterior when we could have marginalized out the latent variables because of the convenient properties of a Normal likelihood. Now, we consider a contrived scenario where we really do need to sample the latent variables. While contrived, it could conceivably be encountered in a real analysis.
Imagine we measure the copy number of a given mRNA transcript using, say, single-molecule FISH, over some time course. For each time point, we get transcript counts in 50 or so cells. We model the transcript counts as Poisson distributed (as we would for simple non-bursty gene expression). We suspect that the parameter \(\lambda\) of the Poisson distribution varies in time, and we want to get a plot of how we think it may vary. We therefore model \(\lambda\), or more specifically \(\ln \lambda\), as varying according to a GP. Our generative model is
\begin{align} \ln \lambda(t) \sim \text{GP}(m, k_\mathrm{SE}(t, t')),\\[1em] n_i(t) \sim \text{Poisson}(\lambda(t)) \;\forall i. \end{align}
Note that because we are using count data, we cannot center and scale, so we choose a constant \(m \ne 0\) as our mean function in the GP.
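Before working with data, it can help to see what this model generates. The following NumPy sketch makes a single prior predictive draw: sample \(\ln \lambda\) at a grid of time points from a multivariate Normal with a squared exponential covariance, exponentiate, and draw Poisson counts. The values of \(m\), \(\alpha\), and \(\rho\) here are arbitrary choices for illustration.
import numpy as np

rg = np.random.default_rng(3252)

# Time points and illustrative hyperparameter values
t = np.linspace(0.0, 1.0, 50)
m, alpha, rho = 2.3, 1.0, 0.2

# Squared exponential covariance with a small diagonal jitter
diff = np.subtract.outer(t, t)
K = alpha**2 * np.exp(-(diff**2) / (2 * rho**2)) + 1e-8 * np.eye(len(t))

# Draw ln λ(t) from the GP, then counts from the Poisson likelihood
log_lam = rg.multivariate_normal(m * np.ones(len(t)), K)
counts = rg.poisson(np.exp(log_lam))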
To demonstrate how to use this model, we will first generate synthetic data. We will use an underlying function of \(\lambda(t) = a_0 \mathrm{e}^{a_1 \sin(2 \pi t)}\), with time varying from zero to one in arbitrary time units. We will choose \(a_0 = 10\) and \(a_1 = 0.3\), and will take measurements every 0.1 time units. We will get transcript counts from about 50 cells at each time point.
[11]:
def varying_function(t, a0, a1):
    """Underlying function defining λ(t)."""
    return a0 * np.exp(a1 * np.sin(2 * np.pi * t))
rg = np.random.default_rng()
# Set up time points and number of cells
time_points = np.arange(0, 11) / 10
n_cells = rg.normal(50, 5, size=len(time_points)).astype(int)
# Time points of each measurement
t = np.concatenate([[time_point] * n for n, time_point in zip(n_cells, time_points)])
# Draw counts out of Poisson
counts = np.concatenate(
    [
        rg.poisson(varying_function(time_point, 10, 0.3), size=n)
        for n, time_point in zip(n_cells, time_points)
    ]
)
# Plop everything into a data frame
df = pd.DataFrame(data={"t": t, "count": counts})
# Plot the data set
bokeh.io.show(
    iqplot.strip(
        df,
        q="count",
        cats="t",
        jitter=True,
        q_axis="y",
        x_axis_label="time",
        marker_kwargs=dict(fill_color="black", line_color="black", alpha=0.3),
        frame_width=600,
    )
)
We can code up the model in Stan as follows.
data {
  int<lower=1> N;
  array[N] int y;
  array[N] int t_inds;
  int<lower=1> Nstar;
  array[Nstar] real tstar;
}

transformed data {
  real delta = 1e-8;
}

parameters {
  real<lower=0> alpha;
  real<lower=0> rho;
  real m;
  vector[Nstar] log_f_tilde;
}

transformed parameters {
  vector[Nstar] log_f;

  {
    matrix[Nstar, Nstar] K = cov_exp_quad(tstar, alpha, rho)
                             + diag_matrix(rep_vector(delta, Nstar));
    matrix[Nstar, Nstar] L = cholesky_decompose(K);

    log_f = m + L * log_f_tilde;
  }
}

model {
  alpha ~ normal(0, 5.0);
  rho ~ inv_gamma(0.1, 2.0);
  m ~ normal(0.0, 5.0);
  log_f_tilde ~ normal(0.0, 1.0);

  y ~ poisson_log(log_f[t_inds]);
}

generated quantities {
  array[N] int y_ppc;

  y_ppc = poisson_log_rng(log_f[t_inds]);
}
This proceeds very much like the previous example, except we use a Poisson likelihood and we have a parameter \(m\) for the mean function in the GP. We again noncenter \(f\). Note that we use Stan’s poisson_log distribution, which is a Poisson distribution parametrized by the logarithm of the parameter.
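In NumPy terms, a draw from poisson_log with parameter log_f is the same as a Poisson draw with rate exp(log_f); the small check below illustrates this with an arbitrary value of log_f.
import numpy as np

rg = np.random.default_rng(3252)

log_f = 2.3                  # logarithm of the Poisson rate (illustrative value)
lam = np.exp(log_f)          # the rate itself, ≈ 10 transcripts

counts = rg.poisson(lam, size=10_000)
print(counts.mean())         # should be close to exp(2.3) ≈ 10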
Let’s start by setting up the data array, again using bebi103.gp.append_sort_index() to conveniently add the measured data points to xstar (called tstar here).
[12]:
tstar = np.linspace(0, 1, 50)
tstar, t_inds = bebi103.gp.append_sort_index(tstar, t, index_origin=1)
data = dict(N=len(counts), t_inds=t_inds, y=counts, Nstar=len(tstar), tstar=tstar)
Now, we’re ready to compile and sample!
[13]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="gp_transcript_counts.stan")
    samples = sm.sample(data=data)
samples = az.from_cmdstanpy(samples, posterior_predictive="y_ppc")
We’ll do our usual diagnostic checks.
[14]:
bebi103.stan.check_all_diagnostics(samples)
Effective sample size looks reasonable for all parameters.
Rhat for parameter alpha is 1.0109498403542605.
Rank-normalized Rhat above 1.01 indicates that the chains very likely have not mixed.
203 of 4000 (5.075%) iterations ended with a divergence.
Try running with larger adapt_delta to remove divergences.
2163 of 4000 (54.075%) iterations saturated the maximum tree depth of 10.
Try running again with max_treedepth set to a larger value to avoid saturation.
E-BFMI indicated no pathological behavior.
[14]:
14
Oof. Again, too many divergences and we’re hitting the tree depth. Let’s try changing adapt_delta and allowing for a bigger tree depth.
[15]:
with bebi103.stan.disable_logging():
    samples = sm.sample(data=data, adapt_delta=0.99, max_treedepth=15)
samples = az.from_cmdstanpy(samples, posterior_predictive="y_ppc")
bebi103.stan.check_all_diagnostics(samples, max_treedepth=15)
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 15.
E-BFMI indicated no pathological behavior.
[15]:
0
The diagnostics look a lot better. Let’s take a quick look at a corner plot of the hyperparameters.
[16]:
bokeh.io.show(bebi103.viz.corner(samples, parameters=["alpha", "rho", "m"]))
There is a strong funnel shape in \(\alpha\) and \(\rho\). Small \(\alpha\) means that we have little amplitude in the variation, and the model effectively captures all ranges of variation. The length scale \(\rho\) does not push too close to zero, since we used an Inverse-Gamma prior to keep the length scale from being too small.
Finally, let’s look at the underlying time-variation of \(\lambda\) that we have inferred (which was, after all, our main goal). For comparison, I will overlay the actual function we used to generate the data.
[20]:
log_fstar = (
    samples.posterior["log_f"]
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "log_f_dim_0")
)
fstar = np.exp(log_fstar)
p = bebi103.viz.predictive_regression(fstar, tstar)
p.line(tstar, varying_function(tstar, 10, 0.3), color="orange", line_width=2)
bokeh.io.show(p)
We have done a good job capturing the variation.
[18]:
bebi103.stan.clean_cmdstan()
Computing environment¶
[19]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version : 3.8.8
IPython version : 7.21.0
numpy : 1.19.2
scipy : 1.6.1
pandas : 1.2.3
cmdstanpy : 0.9.68
arviz : 0.11.1
bokeh : 2.2.2
bebi103 : 0.1.7
jupyterlab: 2.2.6
cmdstan : 2.26.1