“Hello, world” —Stan¶

[1]:

# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st

import cmdstanpy
import arviz as az

import iqplot

import bebi103

import colorcet

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()


When getting familiar with a new programming language, we often write a “Hello, world” program. This is a simple, often minimal, program that demonstrates some of the basic syntax of the language. Python’s Hello, world program is:

[2]:

print("Hello, world.")

Hello, world.

Here, we introduce Stan, and write a Hello, world program for it.

Before we do, we note that you may run Stan on your own machine if you have managed to get Stan and CmdStanPy installed. Otherwise, you can run it on AWS with the BE/Bi 103 b 2021 Amazon Machine Image. If you wish, you may also use Google Colab, though you will be limited in how many cores you can use and how long you can use them.

Basics of Stan programs¶

This is our first introduction to Stan, a probabilistic programming language that we will use for much of our statistical modeling. Stan is a separate language. It has a command line interface and interfaces for R, Python, Julia, Matlab, Stata, Scala, and Mathematica.

We will be using one of the two Python interfaces, CmdStanPy. PyStan is another popular interface. Remember, though, that Stan is a separate language, and any Stan program you write works across all of these interfaces.

Before we dive in and write our first Stan program to draw samples out of the Normal distribution, I want to tell you a few things about Stan. Briefly, Stan works as follows when using the CmdStanPy interface.

A user writes a model using the Stan language. This is usually stored in a .stan text file.
The model is compiled in two steps. First, Stan translates the model in the .stan file into C++ code. Then, that C++ code is compiled into machine code.
Once the machine code is built, the user can, via the CmdStanPy interface, sample out of the distribution defined by the model and perform other calculations (such as optimization) with the model.
The results from the sampling are written to disk as CSV and txt files. As demonstrated below, we conveniently access these files using ArviZ, so we do not directly interact with them.

We will learn the Stan language structure and syntax as we go along. To start with, a Stan program consists of up to seven sections, called blocks. They are, in order:

functions: Any user-defined functions that can be used in other blocks.
data: Any inputs from the user. Most commonly, these are the measured data themselves. You can also put user-adjustable parameters in this block, but nothing you intend to sample.
transformed data: Any transformations that need to be done on the data.
parameters: The parameters of the model. Stan will give you samples of the variables described in this block. These are the \(\theta\) that the posterior \(g(\theta\mid y)\) describes.
transformed parameters: Any transformations that need to be done on the parameters.
model: Specification of the generative model. The sampler will sample the parameters \(\theta\) out of this model.
generated quantities: Any other quantities you want to generate on each iteration of the sampler.

Not all blocks need to be in a Stan program, but they must be in this order. Some other important points to keep in mind as we venture into Stan:

The Stan documentation will be a very good friend of yours, both the user’s guide and reference manual.
The index origin of Stan is 1, not 0 as in Python.
Stan is strongly statically typed, which means that you need to declare the data type of a variable explicitly before using it.
All Stan commands must end with a semicolon.
Blocks of code are delimited using curly braces.
Stan programs are stored outside of your notebook in a .stan file. These are text files, which you can prepare with your favorite text editor, including the one included in JupyterLab.
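
The rule that blocks must appear in a fixed order can be made concrete with a short Python sketch. The names `STAN_BLOCKS`, `find_blocks`, and `blocks_in_valid_order` are hypothetical helpers written for this illustration; they are not part of Stan or CmdStanPy. The check is a naive text search, not a Stan parser, so the Stan compiler remains the authority on whether a program is valid.

```python
import re

# The seven Stan blocks, in the order they must appear in a program.
STAN_BLOCKS = [
    "functions",
    "data",
    "transformed data",
    "parameters",
    "transformed parameters",
    "model",
    "generated quantities",
]

# Match a block name at the start of a line, followed by an opening brace.
_BLOCK_PATTERN = re.compile(
    r"^[ \t]*(" + "|".join(STAN_BLOCKS) + r")[ \t]*\{",
    flags=re.MULTILINE,
)


def find_blocks(stan_code):
    """Return the names of the blocks in `stan_code`, in order of appearance."""
    return [match.group(1) for match in _BLOCK_PATTERN.finditer(stan_code)]


def blocks_in_valid_order(stan_code):
    """Naively check that the blocks appear in the order Stan requires."""
    positions = [STAN_BLOCKS.index(name) for name in find_blocks(stan_code)]
    return all(a < b for a, b in zip(positions, positions[1:]))


# A small Stan program with only two of the seven blocks.
example_code = """
parameters {
  real x;
}

model {
  x ~ normal(0, 1);
}
"""

print(find_blocks(example_code))
print(blocks_in_valid_order(example_code))
```

For this example program, the helper finds the parameters and model blocks and reports that they are in a valid order. Swapping the two blocks would make the check fail.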

Say hi, Stan¶

With this groundwork laid, let’s just go ahead and write our “Hello, world” Stan program to generate samples out of a standard Normal distribution (that is, a Normal distribution with zero mean and unit variance). (Note that this is not sampling out of a posterior.) Here is the code, which I have stored in the file hello_world.stan.

parameters {
  real x;
}


model {
  x ~ normal(0, 1);
}

Note that there are two blocks in this particular Stan code, the parameters block and the model block. These are two of the seven possible blocks in a Stan code, and we will explore others in the next part of the lesson when we learn more about Stan after we complete our Hello, world program.

In the parameters block, we have the names and types of parameters we want to obtain samples for. In this case, we want to obtain samples of a real number we will call x.

In the model block, we have our statistical model. The syntax is similar to how we would write the model on paper. We specify that x, the parameter we want to get samples of, is Normally distributed with location parameter zero and scale parameter one.
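
If you would rather create the file without leaving the notebook, one option is to write it from Python. This is a minimal sketch using only the standard library. The variable names are illustrative, and preparing the file in a text editor works just as well.

```python
from pathlib import Path

# Stan code for the Hello, world model, identical to the listing above.
hello_world_code = """\
parameters {
  real x;
}


model {
  x ~ normal(0, 1);
}
"""

# Write the code to hello_world.stan in the present working directory.
stan_file = Path("hello_world.stan")
stan_file.write_text(hello_world_code)

# Read the file back to confirm its contents.
print(stan_file.read_text())
```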

Now that we have our code (which I have stored in a file named hello_world.stan), we can use CmdStanPy to compile it and get a CmdStanModel, a Python object that provides access to the compiled Stan executable using Python syntax.

[3]:

sm = cmdstanpy.CmdStanModel(stan_file='hello_world.stan')

INFO:cmdstanpy:compiling stan program, exe file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/09/hello_world
INFO:cmdstanpy:compiler options: stanc_options=None, cpp_options=None
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/09/hello_world

Now that we have the Stan model, stored as the variable sm, we can collect samples from it using the sm.sample() method. We pass in the number of chains; that is, the number of Markov chains to use in sampling. We can also pass in the number of sampling iterations to do. We’ll do four chains, each taking 1000 samples. Let’s do it!

[4]:

samples = sm.sample(
    chains=4,
    iter_sampling=1000,
)

INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:finish chain 4

Direct sampling¶

Stan can also draw samples out of probability distributions without using MCMC, just as Numpy and Scipy can. For a generic posterior, we use MCMC, but for many named distributions we can directly sample.

Let’s draw 300 random numbers from a Normal distribution with location parameter zero and scale parameter one using Numpy and Scipy.

[10]:

rg = np.random.default_rng()
np_samples = rg.normal(0, 1, size=300)

sp_samples = st.norm.rvs(0, 1, size=300)

# Plot samples
p = iqplot.ecdf(
    np_samples,
    style='staircase',
    palette=[colorcet.b_glasbey_category10[0]],
)

p = iqplot.ecdf(
    sp_samples,
    style='staircase',
    palette=[colorcet.b_glasbey_category10[1]],
    p=p,
)

bokeh.io.show(p)
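
The plot shows the empirical cumulative distribution function (ECDF) of each set of samples. To make clear what is being drawn, here is a minimal sketch of computing an ECDF by hand with Numpy. The function name `ecdf_vals` and the seed are illustrative choices; this is not how iqplot is implemented internally.

```python
import numpy as np


def ecdf_vals(samples):
    """Return the sorted samples and the ECDF value at each of them."""
    x = np.sort(samples)
    # The ECDF at the i-th smallest sample is the fraction of samples <= it.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Seeded so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(3252)
x_ecdf, y_ecdf = ecdf_vals(rng.normal(0, 1, size=300))

# For a standard Normal, about half of the samples should lie below zero.
print(np.mean(x_ecdf < 0))
```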

To generate random draws from a standard Normal distribution without using Markov chain Monte Carlo, we use the following Stan code.

generated quantities {
  real x;

  x = normal_rng(0, 1);
}

Let’s compile it, and then comment on the code.

[11]:

sm_rng = cmdstanpy.CmdStanModel(stan_file='norm_rng.stan')

INFO:cmdstanpy:compiling stan program, exe file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/09/norm_rng
INFO:cmdstanpy:compiler options: stanc_options=None, cpp_options=None
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/09/norm_rng

There is just one block in this particular Stan code, the generated quantities block. In the generated quantities block, we have code that tells Stan what to generate for each set of parameters it encounters while doing Markov chain Monte Carlo. Here, we are not performing Markov chain Monte Carlo, so we do the “sampling” in fixed parameter mode when we call sm_rng.sample() by setting the fixed_param kwarg to True.

[12]:

# Draw samples
stan_samples = sm_rng.sample(
    chains=1,
    iter_sampling=300,
    fixed_param=True,
)

INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1

To convert this sampling object to a Numpy array, we can first convert it to an ArviZ InferenceData instance and then extract the Numpy array. Note that we will define the samples as coming from a “posterior,” even though it is not a posterior, since that’s the default for ArviZ.

[13]:

# Convert to ArviZ InferenceData
stan_samples = az.from_cmdstanpy(
    posterior=stan_samples
)

# Extract Numpy array
stan_samples = stan_samples.posterior['x'].values.flatten()
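
To see what `.values.flatten()` does here: ArviZ stores each posterior variable with leading chain and draw dimensions, so a scalar parameter has shape (number of chains, number of draws). Flattening pools the draws from all chains into a single one-dimensional array. The sketch below uses a Numpy array of stand-in random numbers in place of real Stan output, so the values are not draws produced by Stan.

```python
import numpy as np

# Stand-in for the array in posterior['x'].values: one chain of 300 draws.
# These numbers come from Numpy, not from Stan.
rng = np.random.default_rng(3252)
x_by_chain = rng.normal(0, 1, size=(1, 300))

# Flattening pools the draws from all chains into a one-dimensional array.
x_flat = x_by_chain.flatten()

print(x_by_chain.shape, x_flat.shape)
```

With four chains of 1000 draws each, the same call would turn a (4, 1000) array into one of length 4000, with the chains laid end to end.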

Now, we can add the ECDF of these samples to the plot of Numpy and Scipy samples.

[14]:

p = iqplot.ecdf(
    stan_samples,
    style='staircase',
    palette=[colorcet.b_glasbey_category10[2]],
    p=p,
)

bokeh.io.show(p)

Why are we using that?¶

Yes, sampling using MCMC with Stan is a novel feature, and we used it to sample out of a trivial distribution (a standard Normal), but we can use it to sample out of very complex distributions. But with respect to the direct sampling we just did, you might be thinking, “Sampling using Stan was so much harder than with Numpy! Why are we doing that?” The answer is that for more complicated models, and doing things like prior predictive checks and posterior predictive checks, using Stan for all modeling is more convenient.

Recalling also last term’s course, here is a breakdown of when we will use the respective samplers.

We will use Numpy for sampling techniques in frequentist-based inference, that is for things like computing confidence intervals and p-values using resampling methods.
We will use scipy.stats when plotting distributions and using optimization methods in Bayesian inference.
We will occasionally use Numpy for prior predictive checks and posterior predictive checks (defined in coming lessons).
We will use Stan for everything else. This includes all Bayesian modeling that does not use optimization (and even some that does).

Displaying your Stan code¶

When you are working on assignments, your Stan models are written as separate files. They should of course be committed to your repository. It is also instructive to display the Stan code in the Jupyter notebook. This is easily accomplished for any CmdStanPy model using the code() method.

[15]:

print(sm.code())

parameters {
  real x;
}


model {
  x ~ normal(0, 1);
}

You should do this in your notebooks so the code is visible.

Saving samples¶

While your samples are saved in CSV and text files by Stan, it is convenient to save the sampling information in a format that can immediately be read into an ArviZ InferenceData object. The NetCDF format is useful for this. ArviZ enables saving as NetCDF as follows. Note that to_netcdf() is a method of InferenceData, so samples here must already have been converted with az.from_cmdstanpy().

[16]:

samples.to_netcdf('stan_hello_world.nc')

[16]:

'stan_hello_world.nc'

The to_netcdf() method returns the name of the file to which the NetCDF data were written. The samples can be read back from the NetCDF file using az.from_netcdf().

[17]:

samples = az.from_netcdf('stan_hello_world.nc')

Cleaning up the shrapnel¶

When using Stan, CmdStanPy leaves a lot of files on your file system.

Your Stan model is translated into C++, and the result is stored in a .hpp file.
The .hpp file is compiled into an object file (.o file).
The .o file is used to build an executable.

All of these files are deposited in your present working directory, where they add clutter and can get annoying for version control purposes. To clean them up after you are finished running your models, you can run the function below.

[18]:

bebi103.stan.clean_cmdstan()
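
To make concrete what kind of cleanup this involves, here is a minimal standard-library sketch that deletes the .hpp file, the .o file, and the executable built from each .stan file in a directory. The function `remove_build_files` is a hypothetical stand-in written for illustration. It is not the implementation of bebi103.stan.clean_cmdstan(), whose behavior may differ, and it assumes the executable carries the model name with no extension, as on macOS and Linux.

```python
import tempfile
from pathlib import Path


def remove_build_files(directory):
    """Delete the .hpp, .o, and executable files built from each .stan file.

    Hypothetical stand-in for illustration; not the bebi103 implementation.
    """
    removed = []
    for stan_file in Path(directory).glob("*.stan"):
        # Candidates: model.hpp, model.o, and the executable named model.
        for target in (
            stan_file.with_suffix(".hpp"),
            stan_file.with_suffix(".o"),
            stan_file.with_suffix(""),
        ):
            if target.is_file():
                target.unlink()
                removed.append(target.name)
    return sorted(removed)


# Demonstrate on a throwaway directory containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("hello_world.stan", "hello_world.hpp", "hello_world.o", "hello_world"):
        (Path(tmp) / name).touch()

    removed = remove_build_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed)
print(remaining)
```

Only the .stan source file is left behind, which is the file you want under version control.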

When sampling, the results are stored in a /var/ directory in various CSV and text files. We never work with these directly, but rather read them into RAM in a convenient az.InferenceData object using ArviZ. When you exit your session, CmdStanPy deletes all of these CSV files, etc., unless you specify a directory in which to store the results in your call to sm.sample() using the output_dir kwarg.

Computing environment¶

[19]:

%load_ext watermark
%watermark -v -p numpy,pandas,scipy,cmdstanpy,arviz,iqplot,bebi103,bokeh,colorcet,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

numpy     : 1.19.2
pandas    : 1.1.5
scipy     : 1.5.2
cmdstanpy : 0.9.67
arviz     : 0.10.0
iqplot    : 0.1.6
bebi103   : 0.1.2
bokeh     : 2.2.3
colorcet  : 2.0.2
jupyterlab: 2.2.6

cmdstan   : 2.25.0