1. Preparing your computer


In this lesson, you will set up a Python environment for scientific computing. There are two main ways people do this.

  1. By downloading and installing package by package with tools like apt-get, pip, etc.

  2. By downloading and installing a Python distribution that contains binaries of many of the scientific packages needed. The major distributions of these are Anaconda and Enthought Canopy. Both contain IDEs.

In this class, we will use Anaconda, with its associated package manager, conda. It has become the de facto package manager/distribution for scientific use.

Students who took BE/Bi 103 a

If you took BE/Bi 103 a last term, your computer is mostly configured. You should do the following on the command line.

conda update --all
pip install --upgrade arviz cmdstanpy bebi103 bokeh-catplot awscli

After applying the above updates, you can skip to the Stan installation section and continue.

Students who did not take BE/Bi 103 a

If you did not take BE/Bi 103 a last term, follow the instructions below to set up your machine.

macOS users: Install Xcode

If you are using macOS, you should install Xcode, if you haven’t already. It’s a large piece of software, taking up about 5 GB on your hard drive, so make sure you have enough space. You can install it through the App Store.

After installing it, you need to open the program. Be sure to do that, for example by clicking on the Xcode icon in your Applications folder. Upon opening, Xcode may perform more installations. After these are completed, you can close Xcode.

Windows users: Install Git and Chrome or Firefox

We will be using JupyterLab in this course. It is browser-based, and Chrome, Firefox, and Safari are supported. Internet Explorer is not. Therefore, if you are a Windows user, you need to be sure you have either Chrome or Firefox installed.

Git is installed on Macs with Xcode. Windows users need to install Git separately. You can do this by following the instructions here.
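
Once Python is available (after the Anaconda installation described later in this lesson), you can check whether Git is reachable from your PATH. The following is a minimal sketch, not part of the official setup; the helper name git_version is our own.

```python
import shutil
import subprocess


def git_version():
    """Return the output of `git --version`, or None if Git is unavailable."""
    git_path = shutil.which("git")
    if git_path is None:
        return None

    result = subprocess.run(
        [git_path, "--version"], capture_output=True, text=True
    )

    # A nonzero return code means Git is present but not usable
    if result.returncode != 0:
        return None

    return result.stdout.strip()


version = git_version()
print(version if version is not None else "Git was not found on your PATH.")
```

If this reports that Git was not found, revisit the installation steps above before moving on.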

Uninstalling Anaconda

If you have previously installed Anaconda with a version of Python other than 3.7, you need to uninstall it, removing it completely from your computer. The only exception is if you have experience with Anaconda and know how to set up environments. You can find instructions on how to uninstall in the official uninstallation documentation.
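
If you are not sure which Python version an existing installation provides, you can ask the interpreter itself. This is a small sketch to run with the Python from that installation; the function name is_required_python is hypothetical, and the check compares only the major and minor version numbers.

```python
import sys

REQUIRED = (3, 7)  # the Python version named in this lesson


def is_required_python(version_info=sys.version_info, required=REQUIRED):
    """Return True if the major and minor versions match `required`."""
    return tuple(version_info[:2]) == tuple(required)


print("Running Python", ".".join(str(n) for n in sys.version_info[:3]))

if is_required_python():
    print("This matches Python 3.7.")
else:
    print("This is not Python 3.7.")
```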

Downloading and installing Anaconda

Downloading and installing Anaconda is simple.

  1. Go to the Anaconda distribution homepage and download the graphical installer.

  2. Be sure to download Anaconda for Python 3.7 for the appropriate operating system.

  3. Follow the on-screen instructions for installation. When prompted, be sure to “Install for me only.”

  4. You may be prompted for optional installations, like PyCharm. You will not need these for the course.

That’s it! After you do that, you will have a functioning Python distribution.

Launching JupyterLab and a terminal

After installing the Anaconda distribution, you should be able to launch the Anaconda Navigator. If you are using macOS, this is available in your Applications menu. If you are using Windows, you can do this from the Start menu. Launch Anaconda Navigator.

We will be using JupyterLab throughout the course. You should see an option to launch JupyterLab. When you do that, a new browser window or tab will open with JupyterLab running. Within the JupyterLab window, you will have the option to launch a notebook, a console, a terminal, or a text editor. We will use all of these during the course. For the updating and installation of necessary packages, click on Terminal to launch a terminal. You will get a terminal window (probably black) with a bash prompt. We refer to this text interface in the terminal as the command line.

The conda package manager

conda is a package manager for keeping all of your packages up-to-date. It has plenty of functionality beyond our basic usage in class, which you can learn more about by reading the docs. We will primarily be using conda to install and update packages.

conda works from the command line. Now that you know how to get a command line prompt, you can start using conda. The first thing we’ll do is update the packages that came with the Anaconda distribution. To do this, enter the following on the command line:

conda update --all

If anything is out of date, you will be prompted to perform the updates; press y to continue. (If everything is up to date, you will just see a list of all the installed packages.) There may even be some downgrades. This happens when there are package conflicts in which one package requires an earlier version of another. conda is very smart and figures all of this out for you, so you can almost always say “yes” (or “y”) to conda when it prompts you.
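
To see why a conflict can force a downgrade, consider the toy sketch below. It is not how conda's solver actually works; the package names "libfoo" and "plotbar", the version numbers, and the function newest_compatible are all invented for illustration.

```python
def newest_compatible(available, upper_bounds):
    """Return the newest version in `available` allowed by every upper bound.

    `available` is a list of version tuples, e.g. (1, 17).
    `upper_bounds` maps the name of a dependent package to the newest
    version of the dependency that it accepts (inclusive).
    """
    allowed = [
        version
        for version in available
        if all(version <= bound for bound in upper_bounds.values())
    ]

    return max(allowed) if allowed else None


# Hypothetical releases of an imaginary package "libfoo"
available = [(1, 15), (1, 16), (1, 17)]

# With no constraints, the newest release is chosen
unconstrained = newest_compatible(available, {})

# An imaginary package "plotbar" accepts libfoo only up to 1.16,
# so libfoo has to be downgraded from 1.17 to 1.16
constrained = newest_compatible(available, {"plotbar": (1, 16)})

print(unconstrained, constrained)
```

The real solver handles many packages and many kinds of constraints at once, but the principle is the same: the newest version that satisfies every requirement wins, even if it is older than what you have installed.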

Installations

There are several additional installations you need to do. We will first install some plotting packages we need, which are available as HoloViz.

conda install -c pyviz holoviz

Some packages may again be downgraded with the installation of HoloViz, and that is OK. Next, to configure JupyterLab, we need to install node.js.

conda install nodejs

We will also install watermark, which enables us to conveniently display version numbers of the software we are using. For this installation, we will use pip. There are a few other packages from pip we will need, so we can go ahead and install those now.

pip install --upgrade cmdstanpy arviz
pip install awscli
pip install --upgrade watermark black blackcellmagic bokeh-catplot bebi103
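
To confirm that the pip installations succeeded, you can ask Python whether it can locate each package. This is a minimal sketch; the import names in the list are assumptions based on the package names above (a pip name and its import name can differ, as with bokeh-catplot), and missing_modules is our own helper.

```python
import importlib.util

# Assumed import names for the packages installed with pip above
MODULES = [
    "arviz",
    "cmdstanpy",
    "awscli",
    "watermark",
    "black",
    "blackcellmagic",
    "bokeh_catplot",
    "bebi103",
]


def missing_modules(names):
    """Return the names in `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(MODULES)

if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All packages were found.")
```

This only checks that the packages can be located; it does not import them or verify their versions.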

Finally, we need to configure JupyterLab to work with the plotting packages we will use.

jupyter labextension install --no-build @pyviz/jupyterlab_pyviz

You may also wish to install a spell-checker (this one isn’t necessary).

jupyter labextension install --no-build @ijmbarr/jupyterlab_spellchecker

After installing all of these extensions, you can rebuild JupyterLab.

jupyter lab build

You should close your JupyterLab session and terminate Anaconda Navigator after you have completed the build. Relaunch Anaconda Navigator and launch a fresh JupyterLab instance. As before, after JupyterLab launches, launch a new terminal window so that you can proceed with setting up Git.

Usage of Git/GitHub

We will make extensive use of Git during the course, and we will use GitHub to host the repositories. You need to set up a GitHub account; go to http://github.com/ to get one. You should register with your academic email address so that you get the free private repositories offered to academics. You should also think carefully about picking your user name, since there is a good chance other people in your professional life will see it.

Once you have a GitHub account, send an email to bois at caltech dot edu with your account ID to get access to the BE/Bi 103 Group on GitHub. Within this group, you will form a team. Your team consists of your partners for homework submission.

Stan installation

We will be using Stan for much of our statistical modeling. Stan has a probabilistic programming language. Programs written in this language, called Stan programs, are translated into C++ by the Stan parser, and then the C++ code is compiled. As you will see throughout the class, there are many advantages to this approach.

There are many interfaces for Stan, including the two most widely used, RStan and PyStan, which are R and Python interfaces, respectively. We will use a newer interface, CmdStanPy, which has several advantages that will become apparent when you start using it.

Whichever interface you use needs to have Stan installed and functional, which means you have to have an installed C++ toolchain. Installation and compilation can be tricky and vary from operating system to operating system. To facilitate configuration of Stan and also to allow you to tackle more involved calculations, we will use AWS for computing with Stan. Within the first few weeks of class, you will receive instructions on how to use AWS. We have a pre-built Amazon Machine Image (AMI) that has all of the installations you need, and you can run your calculations on those machines. Because we (and Amazon) are providing this resource and because of the difficulties involved with local installations, we will not provide support for local installations of Stan, CmdStanPy, or other Stan interfaces.

That said, if you would like to install Stan and CmdStanPy locally, you may do so. Read on for instructions, though we offer no guarantees that they will work.

Configuring a C++ toolchain for macOS

If you are using macOS and you installed Xcode as described above, you should already have a C++ toolchain. You can skip ahead to installing Stan with CmdStanPy.

Configuring a C++ toolchain for Windows

You need to install a C++ toolchain for Windows. One possibility is to install a MinGW toolchain, and one way to do that is using conda.

conda install libpython m2w64-toolchain -c msys2
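
On either operating system, you can check from Python whether a C++ compiler and a make tool are visible on your PATH. The sketch below is only a rough check; the executable names it searches for are assumed to be typical, and your toolchain may use different ones. The helper first_on_path is our own.

```python
import shutil

# Executable names assumed to be typical; your toolchain may use others
COMPILERS = ["g++", "clang++"]
MAKE_TOOLS = ["make", "mingw32-make"]


def first_on_path(candidates):
    """Return the full path of the first candidate found on the PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path

    return None


print("C++ compiler:", first_on_path(COMPILERS) or "not found")
print("make tool:", first_on_path(MAKE_TOOLS) or "not found")
```

Finding these executables does not guarantee that Stan will compile, but if neither is found, the installation in the next section is unlikely to work.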

Installing Stan with CmdStanPy

If you have a functioning C++ toolchain, you can use CmdStanPy to install Stan/CmdStan. You can do this by running the following on the command line.

python -c "import cmdstanpy; cmdstanpy.install_cmdstan()"

This may take several minutes to run.
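
After the installation finishes, you can ask CmdStanPy where it found CmdStan. This is a hedged sketch: it relies on cmdstanpy.cmdstan_path(), which we assume raises a ValueError when no CmdStan installation can be found. The wrapper find_cmdstan is our own.

```python
def find_cmdstan():
    """Return the CmdStan directory CmdStanPy would use, or None."""
    try:
        import cmdstanpy
    except ImportError:
        # CmdStanPy itself is not installed
        return None

    try:
        return cmdstanpy.cmdstan_path()
    except ValueError:
        # Assumed to be raised when no CmdStan installation is found
        return None


cmdstan_dir = find_cmdstan()
print(cmdstan_dir if cmdstan_dir is not None else "No CmdStan installation found.")
```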

Checking your distribution

We’ll now run a quick test to make sure things are working properly. We will make a quick plot that requires some of the scientific libraries we will use.

Use the JupyterLab launcher (you can get a new launcher by clicking on the + icon on the left pane of your JupyterLab window) to launch a notebook. In the first cell (the box next to the [ ]: prompt), paste the code below. To run the code, press Shift+Enter while the cursor is active inside the cell. You should see a plot that looks like the one below. If you do, you have a functioning Python environment for scientific computing!

import numpy as np
import bokeh.models
import bokeh.plotting
import bokeh.io

bokeh.io.output_notebook()

# Generate plotting values
t = np.linspace(0, 2*np.pi, 200)
x = 16 * np.sin(t)**3
y = 13 * np.cos(t) - 5 * np.cos(2*t) - 2 * np.cos(3*t) - np.cos(4*t)

p = bokeh.plotting.figure(height=250, width=275)
p.line(x, y, color='red', line_width=3)
text = bokeh.models.Label(x=0, y=0, text='BE/Bi 103 b', text_align='center')
p.add_layout(text)

bokeh.io.show(p)

Checking your Stan installation

To check your Stan installation, you can run the following code. It will take several seconds for the model to compile and then sample. In the end, you should see a scatter plot of samples. You might not appreciate it yet, but this is a nifty demonstration of Stan’s power to sample hierarchical models, which is no trivial feat.

import cmdstanpy
import arviz as az

schools_data = {
    "J": 8,
    "y": [28, 8, -3, 7, -1, 1, 18, 12],
    "sigma": [15, 10, 16, 11, 9, 11, 10, 18],
}

schools_code = """
data {
  int<lower=0> J; // number of schools
  vector[J] y; // estimated treatment effects
  vector<lower=0>[J] sigma; // s.e. of effect estimates
}

parameters {
  real mu;
  real<lower=0> tau;
  vector[J] eta;
}

transformed parameters {
  vector[J] theta = mu + tau * eta;
}

model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
"""

with open("schools_code.stan", "w") as f:
    f.write(schools_code)

sm = cmdstanpy.CmdStanModel(stan_file="schools_code.stan")
samples = sm.sample(data=schools_data, output_dir="./")
samples = az.from_cmdstanpy(samples)

# Make a plot of samples
p = bokeh.plotting.figure(
    frame_height=250, frame_width=250, x_axis_label="mu", y_axis_label="tau"
)
p.circle(
    np.ravel(samples.posterior["mu"]),
    np.ravel(samples.posterior["tau"]),
    alpha=0.1
)

bokeh.io.show(p)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103_course/2020/b/content/lessons/schools_code
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4

Computing environment

%load_ext watermark
%watermark -v -p numpy,bokeh,cmdstanpy,arviz,jupyterlab
CPython 3.7.5
IPython 7.10.2

numpy 1.17.4
bokeh 1.4.0
cmdstanpy 0.8.0
arviz 0.6.1
jupyterlab 1.2.4