“Hello, world” —Stan


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st

import cmdstanpy
import arviz as az

import iqplot

import bebi103

import colorcet

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

When getting familiar with a new programming language, we often write a “Hello, world” program. This is a simple, often minimal, program that demonstrates some of the basic syntax of the language. Python’s Hello, world program is:

[2]:
print("Hello, world.")
Hello, world.

Here, we introduce Stan, and write a Hello, world program for it.

Before we do, we note that you may run Stan on your own machine if you have managed to get Stan and CmdStanPy installed. Otherwise, you can use AWS using the BE/Bi 103 b 2021 Amazon Machine Image. If you wish, you may also use Google Colab, though you will be limited in how many cores you can use and how long you can use them.

Basics of Stan programs

This is our first introduction to Stan, a probabilistic programming language that we will use for much of our statistical modeling. Stan is a separate language. It has a command line interface and interfaces for R, Python, Julia, Matlab, Stata, Scala, and Mathematica.

We will be using one of the two Python interfaces, CmdStanPy. PyStan is another popular interface. Remember, though, that Stan is a separate language, and any Stan program you write works across all of these interfaces.

Before we dive in and write our first Stan program to draw samples out of the Normal distribution, I want to tell you a few things about Stan. Briefly, Stan works as follows when using the CmdStanPy interface.

  1. A user writes a model using the Stan language. This is usually stored in a .stan text file.

  2. The model is compiled in two steps. First, Stan translates the model in the .stan file into C++ code. Then, that C++ code is compiled into machine code.

  3. Once the machine code is built, the user can, via the CmdStanPy interface, sample out of the distribution defined by the model and perform other calculations (such as optimization) with the model.

  4. The results from the sampling are written to disk as CSV and txt files. As demonstrated below, we conveniently access these files using ArviZ, so we do not directly interact with them.

We will learn the Stan language structure and syntax as we go along. To start with, a Stan program consists of seven sections, called blocks. They are, in order

  • functions: Any user-defined functions that can be used in other blocks.

  • data: Any inputs from the user. Most commonly, these are measured data themselves. You can also put user-adjustable parameters in this block as well, but nothing you intend to sample.

  • transformed data: Any transformations that need to be done on the data.

  • parameters: The parameters of the model. Stan will give you samples of the variables described in this block. These are the \(\theta\) that the posterior \(g(\theta\mid y)\) describes.

  • transformed parameters: Any transformations that need to be done on the parameters.

  • model: Specification of the generative model. The sampler will sample the parameters \(\theta\) out of this model.

  • generated quantities: Any other quantities you want to generate on each iteration of the sampler.

Not all blocks need to be in a Stan program, but they must be in this order. Some other important points to keep in mind as we venture into Stan:

  1. The Stan documentation will be a very good friend of yours, both the user’s guide and reference manual.

  2. The index origin of Stan is 1, not 0 as in Python.

  3. Stan is strongly statically typed, which means that you need to declare the data type of a variable explicitly before using it.

  4. All Stan commands must end with a semicolon.

  5. Blocks of code are separated using curly braces.

  6. Stan programs are stored outside of your notebook in a .stan file. These are text files, which you can prepare with your favorite text editor, including the one included in JupyterLab.
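
As a small illustration of the block ordering rule, here is a minimal Python sketch that scans the text of a Stan program for block headers and checks that they appear in the required order. The names STAN_BLOCKS, blocks_in_program, and has_valid_block_order are hypothetical helpers invented for this illustration; they are not part of Stan or CmdStanPy. The scan is deliberately naive (it would be fooled by a block name inside a comment), and the Stan compiler performs the real check when a model is compiled.

```python
import re

# The seven Stan program blocks, in the order in which they must appear.
STAN_BLOCKS = [
    "functions",
    "data",
    "transformed data",
    "parameters",
    "transformed parameters",
    "model",
    "generated quantities",
]


def blocks_in_program(stan_code):
    """Return the block names found at the start of a line, in order of appearance."""
    pattern = r"^[ \t]*(" + "|".join(STAN_BLOCKS) + r")[ \t]*\{"
    return re.findall(pattern, stan_code, flags=re.MULTILINE)


def has_valid_block_order(stan_code):
    """True if the blocks that are present appear in the required order."""
    positions = [STAN_BLOCKS.index(name) for name in blocks_in_program(stan_code)]
    return all(a < b for a, b in zip(positions, positions[1:]))


# A program with two of the seven blocks, in the correct order.
good_program = "parameters {\n  real x;\n}\n\nmodel {\n  x ~ normal(0, 1);\n}\n"

# The same two blocks, but with the model block placed first.
bad_program = "model {\n  x ~ normal(0, 1);\n}\n\nparameters {\n  real x;\n}\n"

print(blocks_in_program(good_program), has_valid_block_order(good_program))
print(blocks_in_program(bad_program), has_valid_block_order(bad_program))
```

The first program passes the check even though five blocks are absent, since blocks may be omitted but not reordered.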

Say hi, Stan

With this groundwork laid, let’s just go ahead and write our “Hello, world” Stan program to generate samples out of a standard Normal distribution (with zero mean and unit variance). (Note that this is not sampling out of a posterior.) Here is the code, which I have stored in the file hello_world.stan.

parameters {
  real x;
}


model {
  x ~ normal(0, 1);
}

Note that there are two blocks in this particular Stan code, the parameters block and the model block. These are two of the seven possible blocks in a Stan code, and we will explore others in the next part of the lesson when we learn more about Stan after we complete our Hello, world program.

In the parameters block, we have the names and types of parameters we want to obtain samples for. In this case, we want to obtain samples of a real number we will call x.

In the model block, we have our statistical model. The syntax is similar to how we would write the model on paper. We specify that x, the parameter we want to get samples of, is Normally distributed with location parameter zero and scale parameter one.
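
If you prefer to create the file without leaving the notebook, one option is to write it from Python. This is only a convenience sketch using the standard library. It assumes you are happy to write hello_world.stan into the present working directory, and it overwrites any existing file of that name.

```python
from pathlib import Path

# The same Stan program as shown above, held in a Python string.
stan_code = """\
parameters {
  real x;
}

model {
  x ~ normal(0, 1);
}
"""

# Write the program to a plain text file with the .stan extension.
stan_file = Path("hello_world.stan")
stan_file.write_text(stan_code)

# Read it back to confirm what was written.
print(stan_file.read_text())
```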

Now that we have our code (which I have stored in a file named hello_world.stan), we can use CmdStanPy to compile it and get a CmdStanModel, which is a Python object that provides access to the compiled Stan executable that we can conveniently access using Python syntax.

[3]:
sm = cmdstanpy.CmdStanModel(stan_file='hello_world.stan')
INFO:cmdstanpy:compiling stan program, exe file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/09/hello_world
INFO:cmdstanpy:compiler options: stanc_options=None, cpp_options=None
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/09/hello_world

Now that we have the Stan model, stored as the variable sm, we can collect samples from it using the sm.sample() method. We pass in the number of chains; that is, the number of Markov chains to use in sampling. We can also pass in the number of sampling iterations to do. We’ll do four chains, with each taking 1000 samples. Let’s do it!

[4]:
samples = sm.sample(
    chains=4,
    iter_sampling=1000,
)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:finish chain 4

Parsing output with ArviZ

At this point, Stan did its job and acquired the samples. So, it said “hello, world.” Let’s take a look at the samples. They are stored as a CmdStanMCMC instance.

[5]:
samples
[5]:
CmdStanMCMC: model=hello_world chains=4['method=sample', 'num_samples=1000', 'algorithm=hmc', 'adapt', 'engaged=1']
 csv_files:
        /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpq2rvcjdk/hello_world-202101030855-1-t75ei107.csv
        /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpq2rvcjdk/hello_world-202101030855-2-w3sgwpcc.csv
        /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpq2rvcjdk/hello_world-202101030855-3-vuxyrr9e.csv
        /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpq2rvcjdk/hello_world-202101030855-4-0_zgcsij.csv
 output_files:
        /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpq2rvcjdk/hello_world-202101030855-1-t75ei107-stdout.txt
        /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpq2rvcjdk/hello_world-202101030855-2-w3sgwpcc-stdout.txt
        /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpq2rvcjdk/hello_world-202101030855-3-vuxyrr9e-stdout.txt
        /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpq2rvcjdk/hello_world-202101030855-4-0_zgcsij-stdout.txt

This object that was returned by CmdStanPy points to CSV and text files Stan generated while running. We can load them into a more convenient format using ArviZ (pronounced like “RVs”, the abbreviation for “recreational vehicles” or “random variables”).

[6]:
# There may be a deprecation warning from cmdstanpy, which can be ignored
samples = az.from_cmdstanpy(samples)

# Take a look
samples
[6]:
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:  (chain: 4, draw: 1000)
      Coordinates:
        * chain    (chain) int64 0 1 2 3
        * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
      Data variables:
          x        (chain, draw) float64 0.4528 -0.1188 -0.1828 ... -1.73 -1.142
      Attributes:
          created_at:                 2021-01-03T16:55:37.932824
          arviz_version:              0.10.0
          inference_library:          cmdstanpy
          inference_library_version:  0.9.67

    • <xarray.Dataset>
      Dimensions:      (chain: 4, draw: 1000)
      Coordinates:
        * chain        (chain) int64 0 1 2 3
        * draw         (draw) int64 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999
      Data variables:
          lp           (chain, draw) float64 -0.1025 -0.007057 ... -1.497 -0.6526
          accept_stat  (chain, draw) float64 0.9847 0.954 0.9993 ... 0.9413 0.8053 1.0
          stepsize     (chain, draw) float64 0.9862 0.9862 0.9862 ... 1.028 1.028
          treedepth    (chain, draw) int64 2 2 2 1 1 2 2 1 2 2 ... 2 2 2 1 2 1 1 1 1 1
          n_leapfrog   (chain, draw) int64 3 3 3 1 1 3 3 3 7 3 ... 3 3 3 3 3 3 3 3 3 1
          diverging    (chain, draw) bool False False False ... False False False
          energy       (chain, draw) float64 0.3818 0.4782 0.01679 ... 1.572 1.329
      Attributes:
          created_at:                 2021-01-03T16:55:37.938813
          arviz_version:              0.10.0
          inference_library:          cmdstanpy
          inference_library_version:  0.9.67

We used ArviZ to convert the data type to an ArviZ InferenceData data type. This has two groups, posterior, which contains the samples, and sample_stats which gives information about the sampling. (Note that ArviZ named the group “posterior,” which it does by default, even though these samples are out of a standard Normal distribution and not out of a posterior distribution for some model we may have built.) We’ll start by looking at the samples themselves. Since the samples were taken using the model block, they are assumed to be samples out of a posterior distribution, and are therefore present in the samples.posterior group.
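
As a quick way to connect the two groups, note that the lp variable in sample_stats is the log density Stan tracks, up to an additive constant. For this model, where the ~ statement drops constant terms, that should be \(-x^2/2\). The sketch below checks this against the first two draws of x and lp printed above. The numbers are copied by hand from that output, so agreement is only expected to the printed precision, and the function name std_normal_lp is made up for this illustration.

```python
import math


def std_normal_lp(x):
    """Log density of a standard Normal with additive constants dropped."""
    return -0.5 * x**2


# First two draws of chain 0, copied from the printed output above.
x_draws = [0.4528, -0.1188]
lp_reported = [-0.1025, -0.007057]

for x, lp in zip(x_draws, lp_reported):
    lp_by_hand = std_normal_lp(x)
    agrees = math.isclose(lp_by_hand, lp, abs_tol=1e-4)
    print(f"x = {x: .4f}   -x^2/2 = {lp_by_hand: .6f}   lp = {lp: .6f}   agree: {agrees}")
```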

[7]:
samples.posterior
[7]:
<xarray.Dataset>
Dimensions:  (chain: 4, draw: 1000)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
Data variables:
    x        (chain, draw) float64 0.4528 -0.1188 -0.1828 ... -1.73 -1.142
Attributes:
    created_at:                 2021-01-03T16:55:37.932824
    arviz_version:              0.10.0
    inference_library:          cmdstanpy
    inference_library_version:  0.9.67

This is a new, interesting data type. This is an xarray Dataset. The xarray package is a very powerful package for data analysis. The two main data types we will use are xarray DataArrays and xarray Datasets. You can think of a DataArray like a Pandas data frame, except that the data need not be structured in a two-dimensional table like a data frame is. A Dataset is a collection of DataArrays and associated attributes. Interestingly, if multiple DataArrays in a Dataset have the same indexes, you can index multiple arrays at the same time.

Essentially, you can think of xarray structures as Pandas data frames that can be arbitrarily multidimensional.

If we want to access the samples of x, we do so like this.

[8]:
samples.posterior['x']
[8]:
<xarray.DataArray 'x' (chain: 4, draw: 1000)>
array([[ 0.452796, -0.118799, -0.182823, ...,  1.41755 , -0.851091,
        -0.807712],
       [-1.11476 , -0.781469, -1.17315 , ..., -0.586583, -0.840861,
        -0.718498],
       [ 1.24925 ,  1.19028 ,  1.60511 , ...,  0.353375,  0.353375,
         1.7048  ],
       [-1.02144 ,  1.03154 ,  1.81508 , ..., -0.6151  , -1.73038 ,
        -1.14249 ]])
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999

We see that this is a two-dimensional array, with the first index (the rows) being the chain and the second index (the columns) being the draw, of which there are 1000 for each chain. We can put all of our draws together by converting the DataArray to a Numpy array using the .values attribute and then raveling the Numpy array, and then plot an ECDF. The ECDF should look like the CDF of a Normal distribution with location parameter zero and scale parameter one.

[9]:
bokeh.io.show(
    iqplot.ecdf(
        samples.posterior['x'].values.ravel()
    )
)

Indeed it does! We have just verified that Stan properly said, “Hello, world.”
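
For a numerical complement to the visual check, the sketch below shows what raveling does to a (chain, draw) array and measures the largest gap between the ECDF and the standard Normal CDF. To keep it self-contained, it uses Numpy draws as a stand-in for samples.posterior['x'].values, so the numbers are not those from the Stan run above; the seed is arbitrary.

```python
import math

import numpy as np

# Stand-in for samples.posterior['x'].values: 4 chains of 1000 draws each.
rng = np.random.default_rng(3252)
x = rng.normal(0, 1, size=(4, 1000))

# Raveling concatenates the chains into one flat array (chain 0 first).
x_flat = x.ravel()
print(x.shape, "->", x_flat.shape)

# ECDF evaluated at the sorted draws.
x_sorted = np.sort(x_flat)
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

# Standard Normal CDF at the same points, written with the error function.
normal_cdf = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in x_sorted])

# Largest vertical gap between the ECDF and the theoretical CDF.
max_gap = np.max(np.abs(ecdf - normal_cdf))
print("largest ECDF-CDF gap:", max_gap)
```

A small gap (a few hundredths or less for 4000 draws) is what we expect if the draws really come from a standard Normal distribution.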

Direct sampling

Stan can also draw samples out of probability distributions without using MCMC, just as Numpy and Scipy can. For a generic posterior, we use MCMC, but for many named distributions we can directly sample.

Let’s draw 300 random numbers from a Normal distribution with location parameter zero and scale parameter one using Numpy and Scipy.

[10]:
rg = np.random.default_rng()
np_samples = rg.normal(0, 1, size=300)

sp_samples = st.norm.rvs(0, 1, size=300)

# Plot samples
p = iqplot.ecdf(
    np_samples,
    style='staircase',
    palette=[colorcet.b_glasbey_category10[0]],
)

p = iqplot.ecdf(
    sp_samples,
    style='staircase',
    palette=[colorcet.b_glasbey_category10[1]],
    p=p,
)

bokeh.io.show(p)
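
The draws above will differ each time the cell is run, because the generator is seeded from the operating system. If you want reproducible draws, you can pass a seed to np.random.default_rng(). The seed value below is arbitrary and is not part of the original lesson.

```python
import numpy as np

seed = 3252  # arbitrary value chosen for illustration

# Two generators built from the same seed produce identical draws.
rg_a = np.random.default_rng(seed)
rg_b = np.random.default_rng(seed)

draws_a = rg_a.normal(0, 1, size=300)
draws_b = rg_b.normal(0, 1, size=300)

print("identical draws:", np.array_equal(draws_a, draws_b))
```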

To generate random draws from a standard Normal distribution without using Markov chain Monte Carlo, we use the following Stan code.

generated quantities {
  real x;

  x = normal_rng(0, 1);
}

Let’s compile it, and then comment on the code.

[11]:
sm_rng = cmdstanpy.CmdStanModel(stan_file='norm_rng.stan')
INFO:cmdstanpy:compiling stan program, exe file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/09/norm_rng
INFO:cmdstanpy:compiler options: stanc_options=None, cpp_options=None
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2021/b/content/lessons/09/norm_rng

There is just one block in this particular Stan code, the generated quantities block. In the generated quantities block, we have code that tells Stan what to generate for each set of parameters it encountered while doing Markov chain Monte Carlo. Here, we are not performing Markov chain Monte Carlo, so we do the “sampling” in fixed parameter mode when we call sm_rng.sample() by setting the fixed_param kwarg to True.

[12]:
# Draw samples
stan_samples = sm_rng.sample(
    chains=1,
    iter_sampling=300,
    fixed_param=True,
)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1
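
Conceptually, fixed-parameter sampling just evaluates the generated quantities block once per iteration, with no Markov chain involved. The following pure-Python analogy is meant only to convey that idea. The function names are made up for illustration, Python's random module stands in for Stan's normal_rng, and none of this reflects how CmdStan is implemented internally.

```python
import random


def generated_quantities(rng):
    """Python stand-in for the Stan block: x = normal_rng(0, 1)."""
    return {"x": rng.gauss(0.0, 1.0)}


def sample_fixed_param(iter_sampling, seed=None):
    """Evaluate the generated quantities once per iteration; no MCMC involved."""
    rng = random.Random(seed)
    return [generated_quantities(rng) for _ in range(iter_sampling)]


# One "chain" of 300 iterations, mirroring the call to sm_rng.sample() above.
draws = sample_fixed_param(iter_sampling=300, seed=3252)
xs = [draw["x"] for draw in draws]

print(len(xs), "draws; sample mean:", sum(xs) / len(xs))
```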

To convert this sampling object to a Numpy array, we can first convert it to an ArviZ InferenceData instance and then extract the Numpy array. Note that we will define the samples as coming from a “posterior,” even though it is not a posterior, since that’s the default for ArviZ.

[13]:
# Convert to ArviZ InferenceData
stan_samples = az.from_cmdstanpy(
    posterior=stan_samples
)

# Extract Numpy array
stan_samples = stan_samples.posterior['x'].values.flatten()

Now, we can add the ECDF of these samples to the plot of Numpy and Scipy samples.

[14]:
p = iqplot.ecdf(
    stan_samples,
    style='staircase',
    palette=[colorcet.b_glasbey_category10[2]],
    p=p,
)

bokeh.io.show(p)

Why are we using that?

Yes, sampling using MCMC with Stan is a novel feature, and we used it to sample out of a trivial distribution (a standard Normal), but we can use it to sample out of very complex distributions. But with respect to the direct sampling we just did, you might be thinking, “Sampling using Stan was so much harder than with Numpy! Why are we doing that?” The answer is that for more complicated models, and doing things like prior predictive checks and posterior predictive checks, using Stan for all modeling is more convenient.

Recalling also last term’s course, here is a breakdown of when we will use the respective samplers.

  • We will use Numpy for sampling techniques in frequentist-based inference, that is for things like computing confidence intervals and p-values using resampling methods.

  • We will use scipy.stats when plotting distributions and using optimization methods in Bayesian inference.

  • We will occasionally use Numpy for prior predictive checks and posterior predictive checks (defined in coming lessons).

  • We will use Stan for everything else. This includes all Bayesian modeling that does not use optimization (and even some that does).

Displaying your Stan code

When you are working on assignments, your Stan models are written as separate files. They should of course be committed to your repository. It is also instructive to display the Stan code in the Jupyter notebook. This is easily accomplished for any CmdStanPy model using the code() method.

[15]:
print(sm.code())
parameters {
  real x;
}


model {
  x ~ normal(0, 1);
}

You should do this in your notebooks so the code is visible.

Saving samples

While your samples are saved in CSV and text files by Stan, it is convenient to save the sampling information in a format that can immediately be read into an ArviZ InferenceData object. The NetCDF format is useful for this. ArviZ enables saving as NetCDF as follows.

[16]:
samples.to_netcdf('stan_hello_world.nc')
[16]:
'stan_hello_world.nc'

The to_netcdf() method returns the name of the file to which the samples were written. The samples can be read back from the NetCDF file using az.from_netcdf().

[17]:
samples = az.from_netcdf('stan_hello_world.nc')

Cleaning up the shrapnel

When using Stan, CmdStanPy leaves a lot of files on your file system.

  1. Your Stan model is translated into C++, and the result is stored in a .hpp file.

  2. The .hpp file is compiled into an object file (.o file).

  3. The .o file is used to build an executable.

All of these files are deposited in your present working directory, and can get annoying for version control purposes and can add clutter. To clean them up after you are finished running your models, you can run the function below.

[18]:
bebi103.stan.clean_cmdstan()
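
If you would like to see which of these build files exist for a given model, a sketch along the following lines may help. The helper find_stan_build_files is hypothetical and is not how bebi103.stan.clean_cmdstan() is implemented. It assumes the files are named after the model, and that the executable has no extension, as in the compilation log above (on Windows it would end in .exe). The demonstration runs in a temporary directory with empty placeholder files so that nothing in your working directory is touched.

```python
import tempfile
from pathlib import Path


def find_stan_build_files(directory, model_name):
    """List the .hpp, .o, and executable files for a model, if present."""
    directory = Path(directory)
    candidates = [
        directory / f"{model_name}.hpp",  # C++ translation of the Stan model
        directory / f"{model_name}.o",    # compiled object file
        directory / model_name,           # executable (no extension assumed)
    ]
    return [path for path in candidates if path.is_file()]


# Demonstrate with empty placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["hello_world.stan", "hello_world.hpp", "hello_world.o", "hello_world"]:
        (Path(tmp) / name).touch()

    found_names = [path.name for path in find_stan_build_files(tmp, "hello_world")]

print(found_names)
```

Note that the .stan file itself is not in the list, since it is your source code and should stay under version control.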

When sampling, the results are stored in a /var/ directory in various CSV and text files. We never work with these directly, but rather read them into RAM in a convenient az.InferenceData object using ArviZ. When you exit your session, CmdStanPy deletes all of these CSV files, etc., unless you specify a directory in which to store the results in your call to sm.sample() using the output_dir kwarg.

Computing environment

[19]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,cmdstanpy,arviz,iqplot,bebi103,bokeh,colorcet,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

numpy     : 1.19.2
pandas    : 1.1.5
scipy     : 1.5.2
cmdstanpy : 0.9.67
arviz     : 0.10.0
iqplot    : 0.1.6
bebi103   : 0.1.2
bokeh     : 2.2.3
colorcet  : 2.0.2
jupyterlab: 2.2.6

cmdstan   : 2.25.0