26: Variational Bayesian inference

This lesson was written by Heidi Klumpe and Justin Bois.


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import io

import numpy as np
import pandas as pd

import cmdstanpy
import arviz as az

import iqplot
import bebi103

import holoviews as hv
hv.extension('bokeh')

bebi103.hv.set_defaults()

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

Large data sets, mathematical models with many variables, or hierarchical models can require computationally expensive sampling with MCMC. We can speed up sampling if we instead construct an analytical approximation of the posterior and then sample from that. This process is called variational inference (VI), and it is typically used when MCMC is intractable for a given problem. Variational inference is sometimes referred to as variational Bayes, a term that is used interchangeably with variational inference in the Stan documentation.

Importantly, MCMC samples asymptotically approach the true posterior density, but VI does not. VI will only give information about an approximation of the posterior, and that approximation may not be good.

In this lesson, we will go over the basic concepts behind VI and its implementation in Stan. For more detail, I recommend this review of VI by David Blei and coworkers.

The main ideas of variational inference

For VI, we want to approximate a posterior distribution \(g(\theta \mid y)\) with an approximate distribution \(Q(\theta)\), so we need a way to quantify how well \(Q(\theta)\) approximates the posterior. We saw in lessons on model comparison that the Kullback-Leibler divergence is used to quantify the dissimilarity between two probability distributions (note that it is not symmetric in the two distributions, and therefore cannot be considered a distance). So, we want the Kullback-Leibler divergence between \(Q\) and the posterior to be as small as possible. That is, we wish to find a \(Q\) such that the K-L divergence

\begin{align} D_{KL} \left(Q\middle\| g \right) = \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{g(\theta \mid y)} \end{align}

is as close to zero as possible. (The K-L divergence is zero if \(Q = g\), and positive otherwise.) Note that we can replace the sum with an integral for continuous \(\theta\). Because the full posterior appears in this expression, to evaluate the K-L divergence directly we would have to compute the evidence, \(f(y) = \int d\theta\, \pi(\theta, y)\), where \(\pi(\theta, y)\) is the joint distribution. This is generally an intractable integral. Instead, we would like the objective function of our optimization to be in terms of distributions that are more easily expressed, so we modify this expression. By the definition of conditional probability,

\begin{align} g(\theta \mid y) = \frac{\pi(\theta, y)}{f(y)}. \end{align}

We can substitute this expression into the above expression for the K-L divergence and rearrange using properties of logarithms.

\begin{align} D_{KL} \left(Q\middle\| g \right) &= \sum_\theta Q(\theta) \, \ln \frac{Q(\theta)}{\frac{\pi(\theta, y)}{f(y)}}\\[1em] &= \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{\pi(\theta, y)} + \sum_\theta Q(\theta) \, \ln f(y)\\[1em] &= \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{\pi(\theta, y)} + \ln f(y)\sum_\theta Q(\theta) \\[1em] &= \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{\pi(\theta, y)} + \ln f(y). \end{align}

Note that we effectively replaced the posterior (the conditional distribution \(g(\theta \mid y)\)) with the product of the likelihood and prior (the joint distribution \(\pi(\theta, y) = f(y \mid \theta)\, g(\theta)\)).

We further rearrange to get

\begin{align} \ln f(y) &= D_{KL} \left(Q\middle\| g \right) - \sum_\theta Q(\theta) \ln \frac{Q(\theta)}{\pi(\theta, y)} \\ &= D_{KL} \left(Q\middle\| g \right) + \mathcal{L}\left(Q(\theta)\right), \end{align}

where

\begin{align} \mathcal{L}\left(Q(\theta)\right) = - \sum_\theta Q(\theta) \ln \frac{Q(\theta)}{\pi(\theta, y)}. \end{align}

\(\mathcal{L}(Q)\) is the evidence lower bound (ELBO), so named because the K-L divergence is nonnegative, which guarantees that the log evidence is at least as big as \(\mathcal{L}(Q)\). \(\mathcal{L}(Q)\) is also sometimes called the “negative variational free energy,” because it has the form of a free energy, summing an expected value for the energy and an entropy.

\begin{align} \mathcal{L}(Q) &= -\sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{\pi(\theta, y)} \\[1em] &= \sum_\theta Q(\theta) \left[ \ln \pi(\theta, y) - \ln Q(\theta) \right] \\[1em] &= \sum_\theta Q(\theta)\, \ln \pi(\theta, y) - \sum_\theta Q(\theta)\, \ln Q(\theta). \end{align}

The first term is an expected value of the “energy” (the log of the joint distribution) under \(Q\), and the second term is the Shannon entropy of \(Q\).

Note that no matter what approximation we pick, \(\ln f(y)\) has the same value, i.e. it is a fixed constant for all possible choices of \(Q(\theta)\). Therefore, making \(\mathcal{L}(Q)\) large is equivalent to making \(D_{KL} \left(Q\middle\| g \right)\) small.

For \(Q(\theta)\) to be a useful approximation of the posterior, we should choose \(Q(\theta)\) such that \(\mathcal{L}(Q)\) is easy to compute and optimize, but not so simple that \(Q(\theta)\) very poorly approximates the posterior.

Choosing Q(θ)

We can choose any family of functions to approximate our posterior, but it is particularly convenient to choose the mean-field variational family. This assumes the parameters of our statistical model, \(\theta_i\), are independent, each with its own variational parameters. Stated mathematically,

\begin{align} Q(\theta) = \prod_i q_i (\theta_i ; \phi_i), \end{align}

where \(q_i\) are partitions of the approximation, each of which is an independent function of only one parameter \(\theta_i\), and \(\phi_i\) are the unique parameters that parametrize \(q_i\). A posterior distribution in general does not factorize this way, which is what makes this an approximation.

To arrive at our final approximation, we need expressions for each \(q_i\). For a given \(q_i\), we want to estimate the posterior as a function of \(\theta_i\) only. To do this, we replace all \(\theta_{j\neq i}\) terms in the posterior with their expected values, \(\langle\theta_j\rangle = \int d\theta_j\,\theta_j\, q_j(\theta_j; \phi_j)\). (That is why this is considered a mean field approach; we approximate the effect of the other variables by a “mean field” rather than by their actual values. Note also that each mean is some constant, for which there is an analytical expression for most probability distributions.)

But now we see our final hurdle. To fully specify \(q_i\), we need information about all the other partitions \(q_{j\neq i}\), since their means may appear in the expression for \(q_i\). Thus, to complete the optimization, we iteratively update the \(\phi_i\) for each parameter \(\theta_i\), one at a time. We stop when we arrive at a solution (i.e., a set of \(q_i\) and \(\phi_i\)) that maximizes the ELBO.
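
As an aside, the update scheme we just described in words can be written compactly. As shown in the Blei, et al. review linked above, the optimal form of a single factor, holding all of the others fixed, is

\begin{align} q_i(\theta_i) \propto \exp\left\{\mathrm{E}_{q_{j\neq i}}\left[\ln \pi(\theta, y)\right]\right\}, \end{align}

where the expectation is taken over all factors except \(q_i\). Iterating these updates one factor at a time is referred to as coordinate ascent variational inference (CAVI) in that review.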

Summary of VI algorithm

  1. Our goal is to find a distribution \(Q(\theta)\) that maximizes the ELBO, which is equivalent to minimizing the dissimilarity between the posterior \(g(\theta \mid y)\) and an approximate distribution \(Q(\theta)\).

  2. Posit that \(Q(\theta) = \prod_i q_i(\theta_i; \phi_i)\). Note that you can derive what the functional form of each \(q_i\) should be (e.g. Normal, Gamma, etc.), but Stan’s VI algorithm will do it for you. This defines the “restricted class” from which you find your best \(Q(\theta)\).

  3. Vary \(\phi_i\) to maximize the ELBO and find the best \(Q(\theta)\). This gives the final form of the approximation of the posterior (a minimal numerical sketch of this optimization follows this list).

  4. Draw samples out of the approximation of the posterior. This does not need MCMC; independent samples may be directly drawn using a random number generator and transforms.
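
To make these steps concrete, here is a minimal sketch for a toy model in which the exact posterior is known: measurements \(y_i \sim \text{Norm}(\mu, 1)\) with prior \(\mu \sim \text{Norm}(0, 1)\), approximated by \(q(\mu) = \text{Norm}(m, s)\). For this simple model the ELBO can be written in closed form, so we can maximize it directly with scipy.optimize.minimize. The function neg_elbo and all of the numbers below are ad hoc choices for illustration, not part of Stan's algorithm.

import numpy as np
import scipy.optimize

# Toy data: y_i ~ Norm(mu, 1), with prior mu ~ Norm(0, 1)
rng = np.random.default_rng(3252)
y = rng.normal(10.0, 1.0, size=20)
n = len(y)


def neg_elbo(params, y):
    """Negative ELBO for q(mu) = Norm(m, s), with s parametrized by ln s."""
    m, log_s = params
    s2 = np.exp(2.0 * log_s)

    # E_q[ln likelihood]; each term is -ln(2π)/2 - ((y_i - m)² + s²)/2
    e_log_like = np.sum(-0.5 * np.log(2.0 * np.pi) - 0.5 * ((y - m) ** 2 + s2))

    # E_q[ln prior] for mu ~ Norm(0, 1)
    e_log_prior = -0.5 * np.log(2.0 * np.pi) - 0.5 * (m ** 2 + s2)

    # Entropy of q(mu) = Norm(m, s)
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * s2)

    return -(e_log_like + e_log_prior + entropy)


# Step 3: vary (m, ln s) to maximize the ELBO
res = scipy.optimize.minimize(neg_elbo, x0=np.array([0.0, 0.0]), args=(y,))
m_opt, s_opt = res.x[0], np.exp(res.x[1])

# Compare to the exact conjugate posterior, Norm(n ȳ / (n+1), 1 / sqrt(n+1))
print("VI:    mean = {:.3f}, std = {:.3f}".format(m_opt, s_opt))
print("Exact: mean = {:.3f}, std = {:.3f}".format(n * y.mean() / (n + 1), 1 / np.sqrt(n + 1)))

# Step 4: draw independent samples out of the approximate posterior
mu_samples = rng.normal(m_opt, s_opt, size=4000)

Because the exact posterior of this toy model is itself Normal, and therefore lies within the chosen family, the optimized \(q\) essentially matches it. For the models below, the posterior is not in the approximating family, and the approximation will be imperfect.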

Automatic Differentiation Variational Inference and Stan

The algorithm Stan uses for VI is called Automatic Differentiation Variational Inference (ADVI). The technique was developed by Alp Kucukelbir and coworkers and first published in 2015.

I will not go into the details of the ADVI algorithm here, but will give a few highlights that are important to understand while using it. First, any parameters that must be positive or are otherwise constrained are transformed such that they may take on any real value. We will call the transformed parameters \(\zeta\). The ADVI algorithm is clever in its choice of transform, and you should see the Kucukelbir, et al. paper for details. Next, we choose a family of distributions for \(Q(\zeta)\). One choice is a Gaussian mean field family where

\begin{align} &Q(\zeta) = \prod_i q_i(\zeta_i),\\[1em] &\zeta_i \sim \text{Norm}(\mu_i, \sigma_i). \end{align}

The ADVI algorithm then finds the values of the \(\mu_i\)’s and \(\sigma_i\)’s that maximize the ELBO using a stochastic gradient ascent algorithm. Note that when we choose the Gaussian mean field family, we lose information about the covariance of parameter values.

Stan’s implementation of ADVI also allows specification of a full-rank Gaussian family, where

\begin{align} &\zeta \sim \text{MultiNorm}(\boldsymbol{\mu}, \mathsf{\Sigma}). \end{align}

There are now many more parameters to adjust to maximize the ELBO (all of the entries of \(\mathsf{\Sigma}\) instead of just the diagonal ones in the mean field case), and the optimization calculation is more difficult than with the mean field family. The benefit is that the extra complexity (and therefore flexibility) allows a better approximation of the posterior, and more of the covariance information about the parameters is retained.

After the optimal parameters of the approximating distributions are found, samples of \(\zeta\) are drawn out of these Gaussian distributions and the samples are transformed back into \(\theta\).
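
As a cartoon of that last step, suppose a single parameter \(\sigma > 0\) has been transformed to \(\zeta = \ln \sigma\) (Stan uses a log transform for parameters with a lower bound of zero), and ADVI has found optimal variational parameters for \(\zeta\). The numerical values below are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(3252)

# Hypothetical optimized variational parameters for ζ = ln σ
mu_zeta, sigma_zeta = -2.1, 0.3

# Draw independent samples of ζ from the fitted Gaussian...
zeta = rng.normal(mu_zeta, sigma_zeta, size=4000)

# ...and transform back to the constrained parameter σ = exp(ζ)
sigma = np.exp(zeta)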

Examples

We will now apply ADVI to two examples we have already studied with MCMC. We will use the same Stan models we used previously. Using ADVI is as simple as using the Stan model as you would with MCMC, but drawing variational samples instead of MCMC samples.

Example 1: Spindle size as a function of volume

We will first consider a model for microtubule spindle size based on Matt Good’s experiments. We considered this model in lesson 11. Here is a quick plot of the data.

[2]:
# Load in Data Frame
df = pd.read_csv(os.path.join(data_path, "good_invitro_droplet_data.csv"), comment="#")

hv.Scatter(data=df, kdims="Droplet Diameter (um)", vdims="Spindle Length (um)")


The generative model we used for this data set is

\begin{align} &\log_{10} \phi \sim \text{Norm}(1.5, 0.75),\\[1em] &\gamma \sim \text{Beta}(1.1, 1.1), \\[1em] &\sigma \sim \text{HalfNorm}(10),\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \mid d_i, \gamma, \phi, \sigma \sim \text{Norm}(\mu_i, \sigma) \;\forall i. \end{align}

The corresponding Stan code is

functions {
  real ell_theor(real d, real phi, real gamma_) {
    real denom_ratio = (gamma_ * d / phi)^3;
    return gamma_ * d / (1 + denom_ratio)^(1.0 / 3.0);
  }
}


data {
  int N;
  real d[N];
  real ell[N];
}


parameters {
  real log10_phi;
  real<lower=0, upper=1> gamma_;
  real<lower=0> sigma;
}


transformed parameters {
  real phi = 10^log10_phi;
}


model {
  log10_phi ~ normal(1.5, 0.75);
  gamma_ ~ beta(1.1, 1.1);
  sigma ~ normal(0.0, 10.0);

  for (i in 1:N) {
    ell[i] ~ normal(ell_theor(d[i], phi, gamma_), sigma);
  }
}

Let’s compile the model. The compilation for ADVI is exactly the same as for MCMC.

[3]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="spindle.stan")

Now, we can perform ADVI. To do so, we use sm.variational() instead of sm.sample(). We specify that we want to use a Gaussian mean field approximation using the algorithm="meanfield" kwarg. Stan will automatically do the optimization. Remember, though, that as for general optimization problems, there is no guarantee of finding a global optimum.

Finally, we use the output_samples=4000 kwarg to get 4000 samples out of the approximate posterior.

[4]:
N_ppc = 200
d_ppc = np.linspace(0.1, 250, N_ppc)

d = df["Droplet Diameter (um)"].values
ell = df["Spindle Length (um)"].values

data = dict(N=len(df), d=d, ell=ell, N_ppc=N_ppc, d_ppc=d_ppc)

samples_vi = sm.variational(
    data=data,
    algorithm="meanfield",
    output_samples=4000,
    seed=3252,
)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1

Unfortunately, ArviZ does not have support for VI results, so we need to extract the information about the VI calculation directly from the object returned by sm.variational(). First, we can look at Stan’s output pertaining to its optimization calculation for maximizing the ELBO. To do this, we will pull out the stdout_files and print their contents.

[5]:
for fname in samples_vi.runset._stdout_files:
    with open(fname, "r") as f:
        print(f.read())
method = variational
  variational
    algorithm = meanfield (Default)
      meanfield
    iter = 10000 (Default)
    grad_samples = 1 (Default)
    elbo_samples = 100 (Default)
    eta = 1 (Default)
    adapt
      engaged = 1 (Default)
      iter = 50 (Default)
    tol_rel_obj = 0.01 (Default)
    eval_elbo = 100 (Default)
    output_samples = 4000
id = 0 (Default)
data
  file = /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpj_q0kkz9/807yr95k.json
init = 2 (Default)
random
  seed = 3252
output
  file = /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpj_q0kkz9/spindle-202103112324-1-j56ylnkp.csv
  diagnostic_file =  (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = profile.csv (Default)

------------------------------------------------------------
EXPERIMENTAL ALGORITHM:
  This procedure has not been thoroughly tested and may be unstable
  or buggy. The interface is subject to change.
------------------------------------------------------------



Gradient evaluation took 0.000295 seconds
1000 transitions using 10 leapfrog steps per transition would take 2.95 seconds.
Adjust your expectations accordingly!


Begin eta adaptation.
Iteration:   1 / 250 [  0%]  (Adaptation)
Iteration:  50 / 250 [ 20%]  (Adaptation)
Iteration: 100 / 250 [ 40%]  (Adaptation)
Iteration: 150 / 250 [ 60%]  (Adaptation)
Iteration: 200 / 250 [ 80%]  (Adaptation)
Success! Found best value [eta = 1] earlier than expected.

Begin stochastic gradient ascent.
  iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes
   100        -3265.390             1.000            1.000
   200        -2237.840             0.730            1.000
   300        -1925.216             0.541            0.459
   400        -1857.383             0.415            0.459
   500        -1842.961             0.333            0.162
   600        -1847.458             0.278            0.162
   700        -1857.168             0.239            0.037
   800        -1843.915             0.210            0.037
   900        -1844.077             0.187            0.008   MEDIAN ELBO CONVERGED

Drawing a sample of size 4000 from the approximate posterior...
COMPLETED.

Stochastic gradient ascent was used to maximize the ELBO, and we got the message that the optimization converged. Stan then drew 4000 samples out of the approximate posterior.

Now, we can obtain the draws using the samples_vi.variational_sample object, which has the samples as a data frame. We need to change the column headings to be descriptive.

[6]:
df_vi = samples_vi.variational_sample
df_vi.columns = samples_vi.column_names

# Take a look
df_vi.head()
[6]:
lp__ log_p__ log_g__ phi gamma_ sigma_0 mu[1] mu[2] mu[3] mu[4] ... ell_ppc[191] ell_ppc[192] ell_ppc[193] ell_ppc[194] ell_ppc[195] ell_ppc[196] ell_ppc[197] ell_ppc[198] ell_ppc[199] ell_ppc[200]
0 0 -1834.08 -3.219880 37.5893 0.893684 0.109048 22.3790 23.0834 23.8232 24.7628 ... 35.3526 41.9736 34.4416 35.1618 38.7766 40.2281 36.7422 33.4592 37.7218 41.7577
1 0 -1833.35 -0.954350 37.4528 0.889917 0.114269 22.2874 22.9892 23.7263 24.6625 ... 33.7560 34.5190 40.2171 37.5276 41.5957 44.5650 33.0368 34.0257 31.2869 39.9706
2 0 -1842.13 -1.061710 37.2047 0.858509 0.116629 21.6294 22.3235 23.0545 23.9857 ... 33.5521 42.1813 29.5439 36.8618 38.7475 40.3001 39.4238 40.6471 41.9217 38.4536
3 0 -1838.20 -1.103570 37.0504 0.883538 0.112488 22.1108 22.8053 23.5346 24.4605 ... 29.9520 38.3768 30.2980 31.9198 39.5240 35.0960 34.9750 47.4871 34.3413 33.8177
4 0 -1836.68 -0.772152 37.3489 0.872675 0.119126 21.9316 22.6298 23.3643 24.2988 ... 42.9722 35.1234 28.4203 30.0156 34.0073 37.7349 29.9380 40.7458 45.5881 34.6173

5 rows × 1546 columns

The lp__ column is always zero (it is a legacy feature that is not ever used). The log_p__ and log_g__ columns are values of the log posterior and log approximate posterior, respectively, so we do not really use those. The remaining columns are samples of the parameter values and posterior predictive checks.

We can also directly get the values of \(\phi\), \(\gamma\), and \(\sigma\) corresponding to the optimized approximate posterior, i.e., the approximation with maximal ELBO.

[7]:
samples_vi.variational_params_pd
[7]:
lp__ log_p__ log_g__ phi gamma_ sigma_0 mu[1] mu[2] mu[3] mu[4] ... ell_ppc[191] ell_ppc[192] ell_ppc[193] ell_ppc[194] ell_ppc[195] ell_ppc[196] ell_ppc[197] ell_ppc[198] ell_ppc[199] ell_ppc[200]
0 0.0 0.0 0.0 37.3375 0.874764 0.114737 21.972 22.6703 23.4047 24.3389 ... 34.6317 38.1759 38.863 36.7289 39.4734 43.4675 33.6988 40.8484 30.7719 30.8126

1 rows × 1546 columns

Let’s take a quick look at the posterior predictive checks to make sure the likelihood parametrized by draws out of the approximate posterior can generate the data.

[8]:
ell_ppc = df_vi[df_vi.columns[df_vi.columns.str.contains("ell_ppc")]].values

bokeh.io.show(
    bebi103.viz.predictive_regression(
        ell_ppc,
        d_ppc,
        data=(d, ell),
        x_axis_label="droplet diameter (µm)",
        y_axis_label="spindle length (µm)",
    )
)

This looks ok, but remember that we are looking at how the likelihood can generate data using parameters sampled from the approximate posterior. Now, let’s look at the samples out of the approximate posterior.

[9]:
parameters = ["phi", "gamma_", "sigma_0"]

bokeh.io.show(bebi103.viz.corner(df_vi[parameters], parameters=parameters))

Notice that the marginal distribution for each pair of parameters shows no tilt; the parameters appear uncorrelated. This is because the covariance between parameters is neglected in the Gaussian mean field approximation of the posterior.
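
We can check this quantitatively by computing the sample correlation matrix of the variational draws; for a mean field approximation, the off-diagonal entries should be close to zero (up to sampling noise).

# Correlations among the ADVI draws of the parameters
df_vi[parameters].corr()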

To compare, let’s look at a corner plot from the parameters obtained using MCMC.

[10]:
with bebi103.stan.disable_logging():
    samples = az.from_cmdstanpy(sm.sample(data=data))

bokeh.io.show(bebi103.viz.corner(samples, parameters=parameters))

The full posterior clearly shows a strong covariance between \(\gamma\) and \(\phi\). For a clearer comparison, let’s compare the ECDFs of the marginal distributions for \(\gamma\) and \(\phi\) obtained from MCMC and from VI.

[11]:
def plot_marginal_ecdfs():
    p_phi = iqplot.ecdf(
        samples.posterior.phi.values.flatten(), legend_label="MCMC", x_axis_label="φ"
    )
    p_phi = iqplot.ecdf(
        df_vi.phi.values,
        marker_kwargs=dict(fill_color="orange", line_color="orange"),
        p=p_phi,
        legend_label="ADVI",
    )

    p_gamma = iqplot.ecdf(samples.posterior.gamma_.values.flatten(), x_axis_label="γ")
    p_gamma = iqplot.ecdf(
        df_vi.gamma_.values,
        marker_kwargs=dict(fill_color="orange", line_color="orange"),
        p=p_gamma,
    )

    return p_phi, p_gamma

bokeh.io.show(bokeh.layouts.column(*plot_marginal_ecdfs()))

We see that there are differences in the marginal distributions between the approximate and exact posteriors. This will in general be the case; the approximate posterior based on VI will never be exact (unless of course the problem is set up such that the posterior is in fact Gaussian), and the approximation may introduce substantial differences.
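
To put a rough number on the discrepancy, we can compare, say, the mean and standard deviation of the marginal distribution of \(\phi\) from the two sets of samples (a crude summary, but enough to see the shift).

# Quick numerical comparison of the marginal distribution of φ
phi_mcmc = samples.posterior.phi.values.flatten()

print("MCMC: φ mean = {:.2f}, std = {:.2f}".format(phi_mcmc.mean(), phi_mcmc.std()))
print("ADVI: φ mean = {:.2f}, std = {:.2f}".format(df_vi["phi"].mean(), df_vi["phi"].std()))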

To attempt to do better, capturing the covariance between \(\phi\) and \(\gamma\), and hopefully also getting the marginal distributions for each closer to the true posterior, let’s use a full rank Gaussian family in our approximation.

[12]:
# Take samples using full rank Gaussian approximation
samples_vi = sm.variational(
    data=data, algorithm="fullrank", output_samples=4000, seed=3251
)

# Convert to data frame
df_vi = samples_vi.variational_sample
df_vi.columns = samples_vi.column_names

# Show corner plot
bokeh.io.show(bebi103.viz.corner(df_vi[parameters], parameters=parameters))
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1

This seems to have at least captured the covariance between \(\phi\) and \(\gamma\). Let’s check the marginal distributions for each, again compared to MCMC.

[13]:
bokeh.io.show(bokeh.layouts.column(*plot_marginal_ecdfs()))

The samples of \(\gamma\) look good, but those of \(\phi\) are off, in the opposite direction from what we saw with the mean field method.

Importantly, since there is some stochasticity in the procedure for maximizing the ELBO, we will not get the same results each time. Let’s try again with a different seed.

[14]:
# Take samples using full rank Gaussian approximation
samples_vi = sm.variational(
    data=data, algorithm="fullrank", output_samples=4000, seed=88812
)

# Convert to data frame
df_vi = samples_vi.variational_sample
df_vi.columns = samples_vi.column_names

# Show corner plot
bokeh.io.show(bebi103.viz.corner(df_vi[parameters], parameters=parameters))
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1

In this case, we do not see covariance between \(\phi\) and \(\gamma\), but rather both covary with \(\sigma_0\). This approximation could still pass posterior predictive checks; the approximate posterior can generate the observed data, but of course the approximate posterior is not the model. Nonetheless, the marginal distributions at least give a reasonable picture of the range of values the parameters may take.

If we were doing a more careful analysis with ADVI, we might run it multiple times, checking the ELBO value of each, and finding the best approximation of the posterior.
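
Here is a sketch of what such a check might look like, reusing the earlier trick of reading the stdout files from the run. The helper function final_elbo is ad hoc (it simply parses the last ELBO value reported in Stan's optimization table), and the seeds are arbitrary; we would keep the run with the largest ELBO.

def final_elbo(vb_result):
    """Parse the last reported ELBO from the stdout of a variational run."""
    elbo = None
    for fname in vb_result.runset._stdout_files:
        with open(fname, "r") as f:
            for line in f:
                tokens = line.split()
                # Rows of the optimization table start with the iteration number
                if len(tokens) >= 2 and tokens[0].isdigit():
                    try:
                        elbo = float(tokens[1])
                    except ValueError:
                        pass
    return elbo


# Run ADVI several times with different seeds and record the final ELBOs
elbos = {}
for seed in [3252, 3251, 88812]:
    with bebi103.stan.disable_logging():
        res = sm.variational(
            data=data, algorithm="fullrank", output_samples=4000, seed=seed
        )
    elbos[seed] = final_elbo(res)

elbos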

Example 2: A multilevel hierarchical model

For our next example, we will use the contrived multilevel hierarchical model we considered in lesson 21. First, let’s “load” in the contrived data and plot it as a reminder.

[15]:
data_str = "".join(
    [
        "day,batch,colony,y\nm,1,1,11.40\nm,1,1,10.54\n",
        "m,1,1,12.17\nm,1,1,12.41\nm,1,1,9.97\nm,1,2,10.76\n",
        "m,1,2,9.16\nm,1,3,9.50\nm,2,1,9.34\nm,2,1,10.14\n",
        "m,2,2,10.72\nm,2,2,10.63\nm,3,1,11.37\nm,3,1,10.51\n",
        "m,4,1,11.06\nm,4,1,10.68\nm,4,1,12.58\nm,4,2,11.21\n",
        "m,4,2,11.07\nm,4,2,10.74\nm,4,2,11.68\nm,4,3,10.65\n",
        "m,4,3,9.06\nw,1,1,10.40\nw,1,2,10.75\nw,1,2,11.42\n",
        "w,1,2,10.42\nw,1,2,9.18\nw,1,2,10.69\nw,1,2,9.37\n",
        "w,1,2,11.32\nw,2,1,9.90\nw,2,1,10.53\nw,2,1,10.76\n",
        "w,3,1,11.08\nw,3,1,9.27\nw,3,1,12.01\nw,3,1,12.20\n",
        "w,3,1,11.23\nw,3,1,10.96\nr,1,1,9.73\nr,1,2,11.25\n",
        "r,1,2,9.99\nr,1,2,10.12\nr,1,3,9.65\nr,1,3,10.18\nr,1,4,12.70\n",
    ]
)

data_str = (
    data_str.replace("m", "monday").replace("w", "wednesday").replace("r", "thursday")
)
df = pd.read_csv(io.StringIO(data_str))

bokeh.io.show(
    iqplot.strip(
        df,
        q="y",
        cats=["day", "batch"],
        color_column="colony",
        marker_kwargs=dict(alpha=0.6),
    )
)

The statistical model we used is

\begin{align} &\theta \sim \text{Norm}(10, 3) \\[1em] &\tau \sim \text{HalfNorm}(5) \\[1em] &\theta_1 \sim \text{Norm}(\theta, \tau) \\[1em] &\theta_2 \sim \text{Norm}(\theta_1, \tau) \\[1em] &\theta_3 \sim \text{Norm}(\theta_2, \tau) \\[1em] &\sigma \sim \text{HalfNorm}(5) \\[1em] &y\sim \text{Norm}(\theta_3, \sigma). \end{align}

We coded this up as a Stan model with noncentering.

data {
  // Total number of data points
  int N;

  // Number of entries in each level of the hierarchy
  int J_1;
  int J_2;
  int J_3;

  //Index arrays to keep track of hierarchical structure
  int index_1[J_2];
  int index_2[J_3];
  int index_3[N];

  // The measurements
  real y[N];
}


parameters {
  // Hyperparameters level 0
  real theta;

  // How hyperparameters vary
  real<lower=0> tau;

  // Hyperparameters level 1
  vector[J_1] theta_1_tilde;

  // Hyperparameters level 2
  vector[J_2] theta_2_tilde;

  // Parameters
  vector[J_3] theta_3_tilde;
  real<lower=0> sigma;
}


transformed parameters {
  // Transformations from noncentered
  vector[J_1] theta_1 = theta + tau * theta_1_tilde;
  vector[J_2] theta_2 = theta_1[index_1] + tau * theta_2_tilde;
  vector[J_3] theta_3 = theta_2[index_2] + tau * theta_3_tilde;
}


model {
  theta ~ normal(10, 3);
  tau ~ normal(0, 5);

  theta_1_tilde ~ normal(0, 1);
  theta_2_tilde ~ normal(0, 1);
  theta_3_tilde ~ normal(0, 1);

  sigma ~ normal(0, 5);

  y ~ normal(theta_3[index_3], sigma);
}

Before we can sample or perform ADVI, we need to encode the data for Stan to use.

[16]:
data, df = bebi103.stan.df_to_datadict_hier(
    df, level_cols=["day", "batch", "colony"], data_cols="y"
)

Now, we’ll load and compile the model and draw MCMC samples and samples using ADVI.

[17]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="hier.stan")
    samples = az.from_cmdstanpy(sm.sample(data=data, adapt_delta=0.95))
    samples_vi = sm.variational(
        data=data, algorithm="fullrank", output_samples=4000, seed=3252
    )

df_vi = samples_vi.variational_sample
df_vi.columns = samples_vi.column_names

First, we’ll make a corner plot for the full MCMC samples. For ease of display, we will only show \(\theta\), \(\sigma\), and \(\tau\).

[18]:
bokeh.io.show(
    bebi103.viz.corner(samples, parameters=["theta", "sigma", "tau"])
)

We see a strong funnel shape with small values of \(\tau\) being heavily sampled. There is a strong tendency toward pooling.

Now, let’s look at the corner plot of the variational samples.

[19]:
bokeh.io.show(bebi103.viz.corner(df_vi, parameters=["theta", "sigma", "tau"]))

Oof! This does not look much like the posterior! The parameter \(\theta\) is underestimated and \(\tau\) is overestimated. Let’s compare the marginal ECDFs.

[20]:
p_theta = iqplot.ecdf(
    samples.posterior.theta.values.flatten(),
    legend_label="MCMC",
    x_axis_label="θ",
    frame_height=150,
)
p_theta = iqplot.ecdf(
    df_vi.theta.values,
    marker_kwargs=dict(fill_color="orange", line_color="orange"),
    p=p_theta,
    legend_label="ADVI",
)

p_sigma = iqplot.ecdf(
    samples.posterior.sigma.values.flatten(), x_axis_label="σ", frame_height=150,
)
p_sigma = iqplot.ecdf(
    df_vi.sigma.values,
    marker_kwargs=dict(fill_color="orange", line_color="orange"),
    p=p_sigma,
)

p_tau = iqplot.ecdf(
    samples.posterior.tau.values.flatten(), x_axis_label="τ", frame_height=150,
)
p_tau = iqplot.ecdf(
    df_vi.tau.values,
    marker_kwargs=dict(fill_color="orange", line_color="orange"),
    p=p_tau,
)

bokeh.io.show(bokeh.layouts.gridplot([p_theta, p_sigma, p_tau], ncols=1))

The VI results are a serious departure from full MCMC. In general, VI performs poorly when there is complex model structure, as in this hierarchical model. There are custom methods for variational inference with hierarchical models, for example as described in this paper by Ranganath and coworkers, but they are not implemented in Stan.

In general, my advice is to employ VI only when you really need the speed and your model is not too complex.

[21]:
bebi103.stan.clean_cmdstan()

Computing environment

[22]:
%load_ext watermark
%watermark -v -p numpy,pandas,cmdstanpy,arviz,bokeh,holoviews,iqplot,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version       : 3.8.8
IPython version      : 7.21.0

numpy     : 1.19.2
pandas    : 1.2.3
cmdstanpy : 0.9.68
arviz     : 0.11.1
bokeh     : 2.2.2
holoviews : 1.14.2
iqplot    : 0.2.1
bebi103   : 0.1.6
jupyterlab: 2.2.6

cmdstan   : 2.26.1