“Hello, world” —Stan


[1]:
import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st

import cmdstanpy
import arviz as az

import bokeh_catplot

import bebi103

import colorcet

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

When getting familiar with a new programming language, we often write a “Hello, world” program. This is a simple, often minimal, to demonstrate some of the basic syntax of the language. Python’s Hello, world program is:

[2]:
print("Hello, world.")
Hello, world.

Here, we introduce Stan, and write a Hello, world program for it.

Before we do, we note that you may run Stan on your own machine if you have managed to get Stan and CmdStanPy installed. Otherwise, you can use AWS using the BE/Bi 103 b 2020 Amazon Machine Image.

Basics of Stan programs

This is our first introduction to Stan, a probabilistic programming language that we will use for much of our statistical modeling. Stan is a separate language. It has a command line interface and interfaces for R, Python, Julia, Matlab, Stata, Scala, and Mathematica.

We will be using one of the two Python interfaces, CmdStanPy. PyStan is another popular interface. Remember, though, that Stan is a separate language, and any Stan program you write works across all of these interfaces.

Before we dive in and write our first Stan program to draw samples out of the Normal distribution, I want to tell you a few things about Stan. Briefly, Stan works as follows when using the CmdStanPy interface.

  1. A user writes a model using the Stan language. This is usually stored in a .stan file.

  2. The model is compiled in two steps. First, Stan translates the model in the .stan file into C++ code. Then, that C++ code is compiled into machine code.

  3. Once the machine code is built, the user can, via the CmdStanPy interface, sample out of the distribution defined by the model and perform other calculations (such as optimization, discussed in the next tutorial) with the model.

  4. The results from the sampling are written to disk as CSV and txt files. As demonstrated below, we conveniently access these files using ArviZ, so we do not directly interact with them.

We will learn the Stan language structure and syntax as we go along. To start with, a Stan program consists of seven sections, called blocks. They are, in order

  • functions: Any user-defined functions that can be used in other blocks.

  • data: Any inputs from the user. Most commonly, these are measured data themselves. You can also put user-adjustable parameters in this block as well, but nothing you intend to sample.

  • transformed data: Any transformations that need to be done on the data.

  • parameters: The parameters of the model. Stan will give you samples of the variable described in this block. These are the \(\theta\) of that the posterior \(g(\theta\mid y)\) describes.

  • transformed parameters: Any transformations that need to be done on the parameters.

  • model: Specification of the generative model. The sampler will sample the parameters \(\theta\) out of this model.

  • generated quantities: Any other quantities you want to generate on each iteration of the sampler.

Not all blocks need to be in a Stan program, but they must be in this order. Some other important points to keep in mind as we venture into Stan:

  1. The Stan documentation will be a very good friend of yours, both the user’s guide and reference manual.

  2. The index origin of Stan is 1, not 0 as in Python.

  3. Stan is strongly statically typed, which means that you need to declare the data type of a variable explicitly before using it.

  4. All Stan commands must end with a semicolon.

  5. Blocks of code are separated using curly braces.

  6. Stan programs are stored outside of your notebook in a .stan file. These are text files, which you can prepare with your favorite text editor, including the one included in JupyterLab.

Say hi, Stan

With this groundwork laid, let’s just go ahead and write our “Hello, world” Stan program to generate samples out of a Normal distribution with a specified mean and variance. Here is the code, which I have stored in the file hello_world.stan.

data {
  real mu;
  real sigma;
}


parameters {
  real x;
}


model {
  x ~ normal(mu, sigma);
}

Note that there are three blocks in this particular Stan code, the data block, the parameters block, and the model block. These are three of the seven possible blocks in a Stan code, and we will explore others in the next part of the lesson when we learn more about Stan after we complete our Hello, world program. The data block contains anything that a user wants to input into the program. In our case, we want to input mu, the location parameter of the Normal distribution we want to draw from, and sigma, its scale parameter. We declare these in the data block, and we will pass them in as a Python dictionary when we do the sampling using CmdStanPy.

In the parameters block, we have the names and types of parameters we want to obtain samples for. In this case, we want to obtain samples of a real number we will call x.

In the model block, we have our statistical model. The syntax is similar to how we would write the model on paper. We specify that x, the parameter we want to get samples of, is Normally distributed with location parameter \(\mu\) and scale parameter \(\sigma\).

Now that we have our code, we can use CmdStanPy to compile it and get CmdStanModel, which is a Python object that provides access to the compiled Stan executable that we can conveniently access using Python syntax.

[3]:
sm = cmdstanpy.CmdStanModel(stan_file='hello_world.stan')
INFO:cmdstanpy:stan to c++ (/Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_04/hello_world.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_04/hello_world

Now that we have the Stan model, stored as the variable sm, we can collect samples from it using the sm.sample() method. Stan is expecting inputted data, though, since we specified a data block. We pass the expected parameters as a dictionary, where the keys are the name of the variables in the data block of our Stan code, and the values are the values we would like to assign those variables.

[4]:
data = {'mu': 0.0, 'sigma': 1.0}

Now, we can draw samples using HMC. We need to pass in the data. We can also pass in the number of chains; that is, the number of Markov chains to use in sampling. We can also pass in the number of sampling iterations to do. We’ll do four chains, which each taking 1000 samples. Let’s do it!

[5]:
samples = sm.sample(
    data=data,
    chains=4,
    sampling_iters=1000,
)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4

Parsing output with ArviZ

At this point, Stan did its job and acquired the samples. So, it said “hello, world.” Let’s take a look at the samples. They are stored as a CmdStanMCMC instance.

[6]:
samples
[6]:
CmdStanMCMC: model=hello_world chains=4['method=sample', 'num_samples=1000', 'algorithm=hmc', 'adapt', 'engaged=1']
 csv_files:
        /var/folders/0y/05rr2ttn5kv_pp6nzsg15mw00000gn/T/tmp23dyrbvy/hello_world-202001241410-1-y02oklnm.csv
        /var/folders/0y/05rr2ttn5kv_pp6nzsg15mw00000gn/T/tmp23dyrbvy/hello_world-202001241410-2-yt1fxv44.csv
        /var/folders/0y/05rr2ttn5kv_pp6nzsg15mw00000gn/T/tmp23dyrbvy/hello_world-202001241410-3-o3xzgsj8.csv
        /var/folders/0y/05rr2ttn5kv_pp6nzsg15mw00000gn/T/tmp23dyrbvy/hello_world-202001241410-4-i0bkf1m9.csv
 console_files
        /var/folders/0y/05rr2ttn5kv_pp6nzsg15mw00000gn/T/tmp23dyrbvy/hello_world-202001241410-1-y02oklnm.txt
        /var/folders/0y/05rr2ttn5kv_pp6nzsg15mw00000gn/T/tmp23dyrbvy/hello_world-202001241410-2-yt1fxv44.txt
        /var/folders/0y/05rr2ttn5kv_pp6nzsg15mw00000gn/T/tmp23dyrbvy/hello_world-202001241410-3-o3xzgsj8.txt
        /var/folders/0y/05rr2ttn5kv_pp6nzsg15mw00000gn/T/tmp23dyrbvy/hello_world-202001241410-4-i0bkf1m9.txt

This object that was returned by CmdStanPy points to CSV and text files Stan generated while running. We can load them into a more convenient format using ArviZ (pronounced like “RVs”, the abbreviation for “recreational vehicles” or “random variables”).

[7]:
samples = az.from_cmdstanpy(samples)

# Take a look
samples
[7]:
Inference data with groups:
        > posterior
        > sample_stats

We used ArviZ to convert the data type to an ArviZ InferenceData data type. This has two groups, posterior, which contains the samples, and sample_stats which gives information about the sampling. We’ll start by looking at the samples themselves. Since the samples were taken using the model block, they are assumed to be samples out of a posterior distribution, and are therefore present in the samples.posterior group.

[8]:
samples.posterior
[8]:
<xarray.Dataset>
Dimensions:  (chain: 4, draw: 1000)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
Data variables:
    x        (chain, draw) float64 -1.28 -1.28 -1.637 ... -0.5305 -0.9143
Attributes:
    created_at:                 2020-01-24T22:10:54.925322
    inference_library:          cmdstanpy
    inference_library_version:  0.8.0

This is a new, interesting data type. This is an xarray Dataset. The xarray package is a very powerful package for data analysis. The two main data type we will use are xarray DataArrays and xarray Datasets. You can think of a DataArray like a Pandas data frame, except that the data need not be structured in a two-dimensional table like a data frame is. A Dataset is a collection of DataArrays and associated attributes. Interestingly, if multiple DataArrays in a Dataset have the same indexes, you can index multiple arrays at the same time.

Essentially, you can think of xarray structures as Pandas data frames that can be arbitrarily multidimensional.

If we want to access the samples of x, we do so like this.

[9]:
samples.posterior['x']
[9]:
<xarray.DataArray 'x' (chain: 4, draw: 1000)>
array([[-1.27992 , -1.27992 , -1.63714 , ...,  0.44351 , -0.20202 ,  0.213056],
       [-0.800211, -1.28095 , -0.529018, ...,  0.08819 , -0.038983, -0.017019],
       [-0.483601, -1.60441 , -1.03922 , ..., -0.337338,  1.43953 ,  1.58611 ],
       [-0.114515, -0.081809, -0.697097, ..., -0.404361, -0.530511, -0.914253]])
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999

We see that this is a two dimensional array, with the first index (the rows) being the chain and the second index (the columns) being the draw, of which there are 1000 for each chain. We can put all of our draws together by converting the DataArray to a Numpy array using the .values attribute and then raveling the Numpy array, and then plot an ECDF. The ECDF should look like a Normal distribution with location parameter zero and scale parameter one.

[10]:
bokeh.io.show(
    bokeh_catplot.ecdf(
        samples.posterior['x'].values.ravel()
    )
)

Indeed it does! We have just verified that Stan properly said, “Hello, world.”

Direct sampling

Stan can also draw samples out of probability distributions without using MCMC, just as Numpy and Scipy can. For a generic posterior, we must use MCMC, but for many named distributions we can directly sample.

Let’s draw 300 random numbers from a Normal distribution with location parameter zero and scale parameter one using Numpy and Scipy.

[11]:
rg = np.random.default_rng()
np_samples = rg.normal(0, 1, size=300)

sp_samples = st.norm.rvs(0, 1, size=300)

# Plot samples
p = bokeh_catplot.ecdf(
    np_samples,
    style='staircase',
    palette=[colorcet.b_glasbey_category10[0]],
)

p = bokeh_catplot.ecdf(
    sp_samples,
    style='staircase',
    palette=[colorcet.b_glasbey_category10[1]],
    p=p,
)

bokeh.io.show(p)

To generate random draws from a Normal distribution without using Markov chain Monte Carlo, we use the following Stan code.

data {
  real mu;
  real sigma;
}


generated quantities {
  real x;

  x = normal_rng(mu, sigma);
}

Let’s compile it, and then comment on the code.

[12]:
sm_rng = cmdstanpy.CmdStanModel(stan_file='norm_rng.stan')
INFO:cmdstanpy:stan to c++ (/Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_04/norm_rng.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/lesson_04/norm_rng

There are two blocks in this particular Stan code, the data block we have seen before and the generated quantities block. In the generated quantities block, we have code for that tells Stan what to generate for each set of parameters it encountered while doing Markov chain Mote Carlo. Here, we are not performing Markov chain Monte Carlo, so we do the “sampling” in fixed parameter mode when we call sm_rng.sample() by setting the fixed_param kwarg to True.

[13]:
# Draw samples
stan_samples = sm_rng.sample(
    data=data,
    chains=1,
    sampling_iters=300,
    fixed_param=True,
)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1

To convert this sampling object to a Numpy array, we can first convert it to an ArviZ InferenceData instance and then extract the Numpy array. Note that we will define the samples as coming from a “posterior,” even though it is not a posterior, since that’s the default for ArviZ.

[14]:
# Convert to ArviZ InferenceData
stan_samples = az.from_cmdstanpy(
    posterior=stan_samples
)

# Extract Numpy array
stan_samples = stan_samples.posterior['x'].values.flatten()

Now, we can add the ECDF of these samples to the plot of Numpy and Scipy samples.

[15]:
p = bokeh_catplot.ecdf(
    stan_samples,
    style='staircase',
    palette=[colorcet.b_glasbey_category10[2]],
    p=p,
)

bokeh.io.show(p)

Why are we using that?

Yes, sampling using MCMC with Stan is a novel feature, and we used it to sample out of a trivial distribution (a unit Normal), but we can use it to sample out of very complex distributions. But with respect to the direct sampling we just did, you might be thinking, “Sampling using Stan was so much harder than with Numpy! Why are we doing that?” The answer is that for more complicated models, and doing things like prior predictive checks and posterior predictive checks, using Stan for all modeling is much more convenient.

Recalling also last term’s course, here is a breakdown of when we will use the respective samplers.

  • We will use Numpy for sampling techniques in frequentist-based inference, that is for things like computing confidence intervals and p-values using resampling methods.

  • We will use scipy.stats when plotting distributions and using optimization methods in Bayesian inference.

  • We will occasionally use Numpy for prior predictive checks and posterior predictive checks (defined in coming lessons).

  • We will use Stan for everything else. This includes all Bayesian modeling that does not use optimization (and even some that does).

Displaying your Stan code

When you are working on assignments, your Stan models are written as separate files. They should of course be committed to your repository. It is also instructive to display the Stan code in the Jupyter notebook. This is easily accomplished for any CmdStanPy model using the code() method.

[16]:
print(sm.code())
data {
  real mu;
  real sigma;
}


parameters {
  real x;
}


model {
  x ~ normal(mu, sigma);
}

You should do this in your notebooks so the code is visible.

Saving samples

While your samples are saved in CSV and text files by Stan, is is convenient to save the sampling information in a format the can immediately be read into an ArviZ InferenceData object. The NetCDF format is useful for this. ArviZ enables saving as NetCDF as follows.

[17]:
samples.to_netcdf('stan_hello_world.nc')
[17]:
'stan_hello_world.nc'

When calling the function, it returns the string of the filename to which the NetCDF file is written. The samples can be read from the NetCDF file using az.from_netcdf().

[18]:
samples = az.from_netcdf('stan_hello_world.nc')

Cleaning up the shrapnel

When using Stan, CmdStanPy leaves a lot of files on your file system.

  1. Your stan model is translated into C++, and the result is stored in a .hpp file.

  2. The .hpp file is compiled into an object file (.o file).

  3. The .o file is used to build an executable.

All of these files are deposited in your present working directory, and can get annoying for version control purposes and can add clutter. To clean them up after you are finished running your models, you can run the function below.

[19]:
bebi103.stan.clean_cmdstan()

When doing sampling the results are stored in a /var/ directory in various CSV and text files. We never work with these directly, but rather read them into RAM in a convenience az.InferenceData object using ArviZ. When exiting your session, CmdStanPy deletes all of these CSV files, etc., unless you specifically say which directory to store the results in your call to sm.sample() using the outpur_dir kwarg.

Computing environment

[20]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,cmdstanpy,arviz,bokeh_catplot,bokeh,colorcet,jupyterlab
CPython 3.7.6
IPython 7.11.1

numpy 1.18.1
pandas 0.24.2
scipy 1.3.1
cmdstanpy 0.8.0
arviz 0.6.1
bokeh_catplot 0.1.7
bokeh 1.4.0
colorcet 2.0.2
jupyterlab 1.2.5