0. Preparing for the course


In this lesson, you will set up the computing resources you need for the class.

Students who took BE/Bi 103 a

If you took BE/Bi 103 a last term, your computer is mostly configured. You should do the following on the command line.

conda update --all
pip install --upgrade arviz cmdstanpy bebi103 iqplot awscli

After applying the above updates, you can skip to the AWS setup section and continue.
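To confirm that the upgrades took effect before moving on, a check along the following lines may help. It is a minimal sketch that uses only the Python standard library; the helper name `installed_versions` is made up for illustration, and the package list simply mirrors the pip command above.

```python
from importlib import metadata

# Packages named in the pip command above.
PACKAGES = ["arviz", "cmdstanpy", "bebi103", "iqplot", "awscli"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so a missing one is easy to spot.
for name, version in installed_versions(PACKAGES).items():
    print(f"{name:10s} {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as not installed can be installed with pip as shown above.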

If you do want to upgrade your Anaconda installation for Python 3.9 instead of Python 3.8, you can follow the instructions in Lesson 0 from BE/Bi 103 a again, being sure to uninstall Anaconda at the appropriate point as you are doing so.

Students who did not take BE/Bi 103 a

If you did not take BE/Bi 103 a last term, complete Lesson 0 from BE/Bi 103 a, and then proceed. The only exception is that you should install the Anaconda distribution for Python 3.9 and not 3.8. You can also install AWS command line utilities by doing the following on the command line.

pip install --upgrade awscli

Use of Google Colab

In order to use Google Colab, you must have a Google account. Caltech students and employees have an account through Caltech’s G Suite. Many of you may have a personal Google account, usually set up for things like GMail, YouTube, etc. For your work in this class, use your Caltech account. This will facilitate collaboration with your teammates in the course, as well as with course staff.

Many of you probably use your personal Google account on your machine, so it can get annoying to log in and out of it. A trick that I find useful is to use one browser, e.g., Safari or Microsoft Edge, for your personal use, web browsing, etc., and a different browser for your scientific work, including the work in this class. Google Colab is most thoroughly tested on Chrome, Firefox, and Safari (in fact, JupyterLab, which you will use on your own machine, supports only these three browsers).

Once you have either logged out of all of your personal accounts or have a different browser open, you can launch a Colab notebook by simply navigating to https://colab.research.google.com/. Alternatively, you can click the “Launch in Colab” badge at the top right of this page, and you will launch this notebook in Colab. That badge will appear in the top right of all pages in the course content generated from notebooks.

Watchouts when using Colab

If you do run a notebook in Colab, you are doing your computing on one of Google’s computers via a virtual machine. You get two CPU cores and 12 GB of RAM. You can also get GPUs and TPUs (Google’s tensor processing units), but we will not use those in this course. The computing resources should be enough for all of our calculations this term (though you will need more computing power in the sequel to this course). However, there are some limitations you should be aware of.

  • If your notebook is idle for too long, you will get disconnected from your notebook. “Idle” means that cells are not being edited or executed. The idle timeout varies depending on the load on Google’s computers; I find that I almost always get disconnected if idle for an hour.

  • Your virtual machine will disconnect if it is used for too long. It is typically available for only 12 hours before disconnecting, though times can vary, again based on load.

These limitations can cause problems if you are running long Stan calculations. If the calculation takes, say, four hours, you can do it on Colab, but you probably want to go do something else while it is running. If the calculation ends and your Colab session sits idle for too long, your virtual machine may disconnect and you may lose your samples. You should therefore have safeguards in place to store your results so you do not lose them. Another obvious limitation is that Stan calculations that take 12 hours or more may not finish before the virtual machine times out and disconnects.

These limitations are in place so that Google can offer Colab for free. If you want more cores, longer timeouts, etc., you might want to check out Colab Pro. You of course can always run on your own machine or on AWS.
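As one way to safeguard results against a disconnect, you can write them to disk as soon as a calculation finishes and reload them instead of recomputing. The following is a minimal sketch, not a prescribed workflow: the helper name `run_or_load` and the stand-in calculation are hypothetical, and pickle is used only to keep the example within the standard library. For ArviZ `InferenceData` objects, the `to_netcdf()` method and `az.from_netcdf()` can play the same role. On Colab, the file would need to live somewhere that outlives the virtual machine, such as a mounted Google Drive folder; the temporary directory used here is only for demonstration.

```python
import os
import pickle
import tempfile


def run_or_load(path, compute):
    """Return cached results from `path` if present; otherwise compute and save."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    result = compute()

    # Write to a temporary file first, then rename, so an interrupted
    # write never leaves a half-written cache file behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(result, f)
    os.replace(tmp_path, path)

    return result


def expensive_calculation():
    """Hypothetical stand-in for a long-running sampling call."""
    return {"mu": [4.1, 3.9, 4.4], "tau": [2.0, 3.5, 1.2]}


# Demonstration only: a fresh temporary directory, not persistent storage.
cache_file = os.path.join(tempfile.mkdtemp(), "samples_cache_demo.pkl")
samples = run_or_load(cache_file, expensive_calculation)

# A second call finds the file and skips the calculation entirely.
samples_again = run_or_load(cache_file, expensive_calculation)
```

Because the result is saved inside the same call that computes it, an idle timeout that hits after the calculation ends does not cost you the samples, provided the file is on storage that persists.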

There are additional software-specific watchouts when using Colab.

  • Colab will not render HoloViews plots unless hv.extension('bokeh') is called in each cell that has a HoloViews plot.

  • Colab does not support fully functional Bokeh apps, nor some of the Panel functionality that we will use later in the course when we do dashboarding.

  • Colab instances have specific software installed, so you will need to install anything else you need in your notebook. This is not a major burden, and is discussed in the next section.

I recommend reading the Colab FAQs for more information about Colab.

Software in Colab

When you launch a Google Colab notebook, much of the software we will use in class is already installed. It is not always the latest version of the software, however. In fact, as of December 2021, Colab is running Python 3.7, whereas you will run Python 3.9 on your machine and on AWS. Nonetheless, most (but not all) of the analyses we do for this class will work just fine in Colab. We will make every effort to let you know when Colab will not be able to handle activities in class, the most important example being some dashboarding applications.

Because the notebooks in Colab have specific software preinstalled, and no more, you will often need to install software before you can run the rest of the code in a notebook. To handle this, the first code cell of each notebook in this class will, when necessary, contain the following code (or a variant of it, depending on what is needed or whether the default installations of Colab change). Running this code will not affect running your notebook on your local machine; the same notebook will work on your local machine or on Colab. Importantly, when using Stan, you will need to install Stan in your Colab session using cmdstanpy.install_cmdstan(), which can take some time, usually several minutes.

[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot colorcet datashader bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
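The setup cell above captures the output of pip and discards it, so a failed installation passes silently. If you ever need to debug such a cell, a variant that raises an error on failure can help. The following is a sketch under that assumption, not the cell used in the course notebooks; the helper name `run_checked` is hypothetical, and the demonstration command only asks the interpreter for its version rather than installing anything.

```python
import subprocess
import sys


def run_checked(args):
    """Run a command, returning its stdout; raise with stderr on failure."""
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Command {' '.join(args)} failed:\n{result.stderr}")
    return result.stdout


# Harmless demonstration: ask the current interpreter for its version.
# In a Colab setup cell, args could instead be something like
# [sys.executable, "-m", "pip", "install", "--upgrade", "iqplot"].
output = run_checked([sys.executable, "--version"])
print(output.strip())
```

Calling pip through `sys.executable -m pip` is a common way to make sure packages are installed for the same interpreter that runs the notebook.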

AWS setup

We will be doing some involved computations that may tax the computing resources of your own computer. We therefore encourage using Amazon Web Services (AWS) to enable access to more powerful machines for doing calculations. Amazon has the AWS Educate program for students. They give students computing credits on AWS, allowing you to use their machines. We will give you instructions on how to use AWS later in the term, but you should request the credits now because there can be a delay in their approval.

Go to the AWS Educate page, set up an account, and request credits ($50 for the term should be enough).

Stan installation

We will be using Stan for much of our statistical modeling. Stan has a probabilistic programming language. Programs written in this language, called Stan programs, are translated into C++ by the Stan parser, and then the C++ code is compiled. As you will see throughout the class, there are many advantages to this approach.

There are many interfaces for Stan, including the two most widely used, RStan and PyStan, which are R and Python interfaces, respectively. We will use a newer interface, CmdStanPy, which has several advantages that will become apparent when you start using it.

Whichever interface you use needs to have Stan installed and functional, which means you have to have an installed C++ toolchain. Installation and compilation can be tricky and vary from operating system to operating system. To facilitate configuration of Stan and also to allow you to tackle more involved calculations, we will use AWS for computing with Stan. Within the first few weeks of class, you will receive instructions on how to use AWS. We have a pre-built Amazon Machine Image (AMI) that has all of the installations you need, and you can run your calculations on those machines. Bear in mind that because of the difficulties involved with local installations, we may not be able to provide support for all local installations of Stan, CmdStanPy, or other Stan interfaces. AWS provides a viable, if not free, alternative with more computing power.

Note that you can also use Stan and CmdStanPy on Google Colab, though the free service limits you to two cores.

That said, if you would like to install Stan and CmdStanPy locally, you may do so. Read on for instructions, though we offer no guarantees that they will work.

Configuring a C++ toolchain for macOS

If you are using macOS and you installed Xcode as was required for the BE/Bi 103 a installations, you should already have a C++ toolchain. You can skip ahead to the section on installing Stan with CmdStanPy.

Configuring a C++ toolchain for Windows

You need to install a C++ toolchain for Windows. One possibility is to install a MinGW toolchain, and one way to do that is using conda.

conda install libpython m2w64-toolchain -c msys2

Configuring a C++ toolchain for Linux

If you are using Linux, we assume you already have the C++ utilities installed.
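Whatever your operating system, you can check whether build tools are visible on your PATH before trying to install Stan. The sketch below uses only the standard library. The helper name `find_tools` and the list of tool names are assumptions made for illustration; which compiler and make program you actually need depends on your platform.

```python
import shutil


def find_tools(candidates=("g++", "clang++", "make", "mingw32-make")):
    """Map each tool name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in candidates}


found = find_tools()
for name, path in found.items():
    print(f"{name:14s} {path if path is not None else 'not found'}")
```

Finding at least one C++ compiler and one make program is an encouraging sign, though it does not guarantee that Stan will compile.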

Installing Stan with CmdStanPy

If you have a functioning C++ toolchain, you can use CmdStanPy to install Stan/CmdStan. You can do this by running the following on the command line.

python -c "import cmdstanpy; cmdstanpy.install_cmdstan()"

This may take several minutes to run. (I did it on my Raspberry Pi, and it took hours.)
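Once the installation finishes, you can ask CmdStanPy where it found CmdStan. The following is a minimal sketch; the wrapper name `find_cmdstan` is hypothetical, and it assumes that `cmdstanpy.cmdstan_path()` raises a `ValueError` when no installation is found. It returns `None` if CmdStanPy is not importable or no CmdStan installation can be located.

```python
def find_cmdstan():
    """Return the CmdStan directory CmdStanPy will use, or None if unavailable."""
    try:
        import cmdstanpy

        return cmdstanpy.cmdstan_path()
    except (ImportError, ValueError):
        # ImportError: CmdStanPy itself is missing.
        # ValueError: CmdStanPy is present but could not locate CmdStan.
        return None


path = find_cmdstan()
print(path if path is not None else "No CmdStan installation found.")
```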

Checking your Stan installation

To check your Stan installation, you can run the following code. It will take several seconds for the model to compile and then sample. In the end, you should see a scatter plot of samples. You might not appreciate it yet, but this is a nifty demonstration of Stan’s power to sample hierarchical models, which is no trivial feat.
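For reference, the Stan program in the cell below encodes the following model, read directly from its `transformed parameters` and `model` blocks. The notation is my own shorthand, not taken from the code.

```latex
\begin{align}
\eta_j &\sim \mathrm{Normal}(0, 1), \\
\theta_j &= \mu + \tau\,\eta_j, \\
y_j &\sim \mathrm{Normal}(\theta_j, \sigma_j), \qquad j = 1, \ldots, J.
\end{align}
```

No prior is written for $\mu$ or $\tau$ in the program, so Stan treats them as having improper uniform priors over their declared supports (all reals for $\mu$, and $\tau \ge 0$).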

[2]:
import numpy as np

import cmdstanpy
import arviz as az

import bokeh.plotting
import bokeh.io
bokeh.io.output_notebook()

schools_data = {
    "J": 8,
    "y": [28, 8, -3, 7, -1, 1, 18, 12],
    "sigma": [15, 10, 16, 11, 9, 11, 10, 18],
}

schools_code = """
data {
  int<lower=0> J; // number of schools
  vector[J] y; // estimated treatment effects
  vector<lower=0>[J] sigma; // s.e. of effect estimates
}

parameters {
  real mu;
  real<lower=0> tau;
  vector[J] eta;
}

transformed parameters {
  vector[J] theta = mu + tau * eta;
}

model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
"""

with open("schools_code.stan", "w") as f:
    f.write(schools_code)

sm = cmdstanpy.CmdStanModel(stan_file="schools_code.stan")
samples = sm.sample(data=schools_data, output_dir="./", show_progress=False)
samples = az.from_cmdstanpy(samples)

# Make a plot of samples
p = bokeh.plotting.figure(
    frame_height=250, frame_width=250, x_axis_label="μ", y_axis_label="τ"
)
p.circle(
    np.ravel(samples.posterior["mu"]),
    np.ravel(samples.posterior["tau"]),
    alpha=0.1
)

bokeh.io.show(p)
INFO:cmdstanpy:compiling stan file /Users/bois/Dropbox/git/bebi103_course/2022/b/content/lessons/00/schools_code.stan to exe file /Users/bois/Dropbox/git/bebi103_course/2022/b/content/lessons/00/schools_code
INFO:cmdstanpy:compiled model executable: /Users/bois/Dropbox/git/bebi103_course/2022/b/content/lessons/00/schools_code
INFO:cmdstanpy:CmdStan start procesing
INFO:cmdstanpy:Chain [1] start processing
INFO:cmdstanpy:Chain [2] start processing
INFO:cmdstanpy:Chain [3] start processing
INFO:cmdstanpy:Chain [4] start processing
INFO:cmdstanpy:Chain [1] done processing
INFO:cmdstanpy:Chain [2] done processing
INFO:cmdstanpy:Chain [3] done processing
INFO:cmdstanpy:Chain [4] done processing

Computing environment

[3]:
%load_ext watermark
%watermark -v -p numpy,bokeh,cmdstanpy,arviz,jupyterlab
print("CmdStan : {0:d}.{1:d}".format(*cmdstanpy.cmdstan_version()))
Python implementation: CPython
Python version       : 3.9.7
IPython version      : 7.29.0

numpy     : 1.20.3
bokeh     : 2.3.3
cmdstanpy : 1.0.0
arviz     : 0.11.4
jupyterlab: 3.2.1

CmdStan : 2.28