Recitation 4: Running Jupyter notebooks with AWS

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

As we start to do more involved statistical inference (and image processing), you will want to run your calculations on more powerful machines. There are many options for this, including Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, and Caltech's own high performance computing center. In this tutorial, we will show you how to get up and running with AWS. While it looks like a lot of steps, some of the steps are done only once, so it is not much more work to launch instances after the initial setup.

1. Create an Amazon Web Services account

The first thing to do is create an AWS account by clicking on the link on the upper right corner of this page. Once you have your account you can go back to it by clicking on Sign In To Console, which should now be in the upper right corner of the same site.

2. Launch your instance

After you have an account, you can launch your instance. We have set up an Amazon Machine Image (AMI), which has the software you need for the course installed and the data sets pre-loaded. The AMI is available in Oregon. Be sure to select this region from the top right corner of the console. You should use the same region throughout the course, since that is physically where your machine will live.

To launch an instance with this AMI, choose EC2 among the services available from your AWS console. You can select EC2 from the Services pulldown menu at the top of your screen.
After selecting EC2, you will see the EC2 Dashboard on the left pane. Under the Images there, click AMIs.
The resulting menu will default to AMIs Owned by me (you likely do not have any). Select instead Public images.
In the search menu, search for BE/Bi 103, and the class AMI should appear. If it does not, double check to make sure your region is Oregon.
You will see the Caltech BE/Bi 103 2018 edition AMI listed. Right click on it and select Launch. (You may also select Spot request if you want to save some money, but when you stop a spot instance, you will lose whatever you stored there.)
You will then be given many choices of instance types. The t2.micro type will work for the early part of the course, which is not computationally intensive. This instance is free, provided your storage does not exceed 30 GB, which it shouldn't. When we use AWS for more involved calculations using Stan, I recommend a c5.xlarge or c5.2xlarge instance.
After you select which instance you want, click Next: Configure Instance Details at the bottom of the screen.
You can leave the defaults in the Step 3: Configure Instance Details page and click Next: Add Storage at the bottom of the page.
You can increase the root storage size in the Size (GiB) box. 30 GB falls within the free tier, and should be enough, but you can add more. (If you have experience with AWS, you can add a new volume, but be sure you know how to mount it once you get your instance running.)
Click Next: Add Tags at the bottom of the page.
Leave the defaults and click Next: Add Security Group.
Select Create a new security group. You can name the group and add a description so you can use it again later. Leave the default SSH rule as it is.
Click Add Rule, and select HTTPS from the pulldown menu. Leave the defaults for that rule that appear.
Click Add Rule, and select Custom TCP Rule. Change the Port Range to 8888.
For the Source column, you may want to adjust the source to be Anywhere for each pulldown menu. That allows you to access your instance from anywhere, but it is not very secure. It is convenient because you can run the instance from your work machine and then from home later on. However, you may want to be more secure and instead select My IP or provide a custom rule for IP addresses that may access the instance.
Click Review and Launch at the bottom of the page and make sure everything is specified as you like.
Click Launch.
You will be prompted for a key pair. For your first time launching, you will need to create a key pair. This will be provided to you only once, so download it and store it locally on your machine. DO NOT, I repeat, DO NOT store it in any git repository or in anything that is backed up to the cloud, like Dropbox. ONLY store it locally on your machine and never let it out to the internet. If it is your second time launching, you can use an existing key pair.
Finally, click Launch Instance.

After following those steps, your instance will spin up. You can view your running instances by clicking on Instances on the dashboard on the left part of the screen.

It will take a while for your instance to spin up. When the Instance State says running and the Status Checks are complete, your instance is ready for you.

3. Connect to your instance

Now that your instance is launched, you can connect to it using your computer and the ssh protocol. The instructions work for Windows, macOS, or Linux, assuming you have a terminal running bash. In Windows, this is accomplished using GitBash. For macOS, use Terminal.

Identify where you put your keypair file. For the purposes of this exercise, I will assume that you have a directory in your home directory called key_pairs/ and that your keypair file is ~/key_pairs/bebi103_aws_keypair.pem.
Change permissions on your keypair for security. Do this in the terminal using

chmod 400 ~/key_pairs/bebi103_aws_keypair.pem
Open a new GitBash (Windows) or Terminal (macOS) window.
SSH into your instance in the terminal. To do this, click on your instance on the Instances page in the Management Console. Information about your instance, including the IPv4 Public IP, will appear at the bottom of the webpage. The IP will look something like 54.92.67.113. Copy this. In what follows, I refer to this as <IPv4 Public IP>. SSH into your instance by doing

ssh -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>
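(optional, may only work for macOS) To avoid having to use -i "~/key_pairs/bebi103_aws_keypair.pem" each time, you can add your keypair to your bash profile by running the following from the directory that contains your keypair file (the command uses $PWD, so do cd ~/key_pairs first).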

echo ssh-add -K $PWD/bebi103_aws_keypair.pem >> ~/.bash_profile; source ~/.bash_profile
Now that you are SSH'd into the instance, you can clone your team's repository. Using team 6 as an example, execute

git clone https://github.com/bebi103/06-bebi103-2018.git

This will clone your repository to your instance. Note that if you log back in to your instance, you do not need to clone again, but can instead do cd ~/06-bebi103-2018; git pull. In fact, you should pretty much always git pull as the first thing you do when you ssh into your instance so that you are up to date with your group's work.
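Returning to the SSH step above: an alternative to typing the -i option each time, which should also work in GitBash, is an entry in your ~/.ssh/config file. The following is a minimal sketch, not part of the course setup. The function name ssh_config_entry and the alias bebi103 are made up for illustration, and the IP address is the placeholder used in the text.

```shell
# ssh_config_entry ALIAS PUBLIC_IP KEY_PATH
# Print an ~/.ssh/config stanza for an EC2 instance.
# ALIAS is any short name you choose; ec2-user is the login name used in this tutorial.
ssh_config_entry() {
    printf 'Host %s\n' "$1"
    printf '    HostName %s\n' "$2"
    printf '    User ec2-user\n'
    printf '    IdentityFile %s\n' "$3"
}

# Example with the placeholder address from the text (not a real instance).
# The key path is quoted so the tilde is written literally; ssh expands it itself.
ssh_config_entry bebi103 54.92.67.113 "~/key_pairs/bebi103_aws_keypair.pem"
```

If you append the printed stanza to ~/.ssh/config (for example by adding >> ~/.ssh/config to the last line), then ssh bebi103 should be enough to connect. Keep in mind that the public IP of an instance usually changes when you stop and restart it, so the HostName line will need to be updated.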

The data for the course are located in the ~/data folder. To keep the same relative paths you have been using from inside your repository (and as the TAs are expecting for grading), you should make a symbolic link to the ~/data directory. To do that, for example if you are in team 6, do the following on the command line.

cd ~/06-bebi103-2018; ln -s $HOME/data ./data
You now have your repository and all of the data for the course on your instance!
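If you want to see what the symbolic link does before making it on your instance, the sketch below repeats the same ln -s pattern in a throwaway directory. The scratch directory and the file example.csv are hypothetical stand-ins for your home directory and the course data.

```shell
# Work in a throwaway directory so nothing in your real home directory is touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/data" "$scratch/06-bebi103-2018"
echo "x,y" > "$scratch/data/example.csv"   # hypothetical data file

# Same pattern as the command in the text: ./data inside the repository
# becomes a link that points at the data directory.
(
    cd "$scratch/06-bebi103-2018" || exit 1
    ln -s "$scratch/data" ./data
    # The file is now reachable through the relative path used inside the repository.
    cat ./data/example.csv
)

# When you are done experimenting, you can remove the scratch directory:
#   rm -rf "$scratch"
```

The link is only a pointer, so the data are not copied and take up no extra space on your instance.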

4. Launch JupyterLab

When you launch JupyterLab, you want to use screen. By running screen, your JupyterLab session will not get interrupted if you disconnect from your instance. So, on the command line in your instance, execute

screen

Then, you can launch JupyterLab by executing

jupyter lab --no-browser

on the command line. This will launch JupyterLab. You will see output like this:

Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
    http://localhost:8888/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4

Keep this window open.

In order to use JupyterLab through a browser on your machine, you need to set up a socket, which here means forwarding a port on your local machine to a port on the instance over SSH. To do so, open up another GitBash or Terminal window and execute the following.

ssh -i "~/key_pairs/bebi103_aws_keypair.pem" -L 8000:localhost:8888 ec2-user@<IPv4 Public IP>

This sets up a socket connecting port 8888 on your EC2 instance to port 8000 on your local machine. You can change these numbers as necessary. For example, in the URL you got when you launched JupyterLab, the port may be localhost:8889, in which case you need to substitute 8889 for 8888 in your ssh command. You may also want a different local port, e.g., 8001, if you already have a JupyterLab instance running on port 8000. In what follows, I will use port numbers 8000 and 8888, which you will probably use 90% of the time, but you can make changes as you see fit.

After you have set up the socket, you can paste the URL given when you launched JupyterLab on your EC2 instance into your browser, but substitute 8000 for 8888. That is, direct your browser to

http://localhost:8000/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4

You will now have JupyterLab up and running!
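The port substitution described above is easy to get backwards, so here is a small sketch of it. The function names forwarded_url and tunnel_spec are made up for illustration; the URL is the example one printed earlier in this section.

```shell
# forwarded_url URL REMOTE_PORT LOCAL_PORT
# Print URL with localhost:REMOTE_PORT replaced by localhost:LOCAL_PORT.
forwarded_url() {
    printf '%s\n' "$1" | sed "s#localhost:$2/#localhost:$3/#"
}

# tunnel_spec REMOTE_PORT LOCAL_PORT
# Print the argument that goes after -L in the ssh command.
# Note the order ssh expects: local port first, then localhost, then remote port.
tunnel_spec() {
    printf '%s:localhost:%s\n' "$2" "$1"
}

# URL as printed by JupyterLab in the example above.
remote_url='http://localhost:8888/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4'

forwarded_url "$remote_url" 8888 8000
tunnel_spec 8888 8000
```

For the default ports this prints the localhost:8000 URL given above, followed by 8000:localhost:8888. If JupyterLab reported port 8889 and you wanted local port 8001, you would pass 8889 and 8001 instead.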

5. If you get detached

If you lose your internet connection, you can reconnect to your session, with JupyterLab still running, by reattaching your screen. To do this, execute screen -r on the command line after SSH-ing back in to your EC2 instance.

You can see what screens are active by doing screen -ls on the command line. You can also detach the current screen by using screen -d.

6. Copying files between AWS and your local machine

As you work on notebooks and create new files you want to save, you may want to move them to your local machine. If you are working on a notebook or .stan file, the best option is to use git and commit and push those files to your repository directly from the command line on your EC2 instance.

Some files, though, such as MCMC results or intermediate data processing results, are not meant to be under version control. For these files, you can use scp. Within your GitBash or Terminal window on your local machine (you probably have to open yet another), you can copy files as follows.

scp -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>:~/my_file.csv ./

This command will copy files from your EC2 instance to your present working directory. Simply put the full path to the file you want to transfer after the colon above (remember ~/ means "home directory"). The second argument of scp is where you want to copy the file.

Similarly, you can upload files to your EC2 instance as follows (in this example to the home directory in your instance).

scp -i "~/key_pairs/bebi103_aws_keypair.pem" my_file.txt ec2-user@<IPv4 Public IP>:~/
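Since the key path and address are repeated in every scp command, it can help to keep them in variables. The sketch below only prints the command it would run, so you can check it before using it. The function name download_cmd and the directory my_results are hypothetical, and the IP address is the placeholder from the text. The -r option tells scp to copy a directory recursively.

```shell
# Placeholders; substitute your own key path and your instance's public IP.
KEY="$HOME/key_pairs/bebi103_aws_keypair.pem"
REMOTE="ec2-user@54.92.67.113"   # example address from the text, not a real instance

# download_cmd REMOTE_PATH LOCAL_DESTINATION
# Print (rather than run) the scp command that copies REMOTE_PATH from the instance.
download_cmd() {
    printf 'scp -r -i "%s" "%s:%s" "%s"\n' "$KEY" "$REMOTE" "$1" "$2"
}

# Single quotes keep ~ from being expanded locally; it refers to the home
# directory on the instance.
download_cmd '~/my_results' ./
```

Once the printed command looks right, you can paste it into your terminal to do the actual transfer.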

7. Exiting

When you are finished with your session, you can shut down your notebook in the browser. Then, in the terminal window, you can shut down JupyterLab by pressing Ctrl-c. After Jupyter is terminated, you should detach your screen by doing screen -d. Finally, you should quit your screen by doing screen -X quit.

In the past, students have ended up with instances littered with detached screens. You should clean house from time to time and run screen -X quit.

After you are finished with your work on your instance, you should stop your instance. To do this, go back to the Instances page on your EC2 console. Right click your instance, and navigate to Instance State → Stop. Do not terminate your instance unless you really want to. Terminating an instance will get rid of any changes you made to it.

8. Seriously. Stop your instances if you are not using them.

If your instance is not stopped and you leave it running, you will get charged. You can rack up a massive bill with idle, but running, instances. You should stop your instances whenever you are not using them. It is a minor pain to wait for them to spin up again, but forgetting about a running instance will cause more pain than that to your pocketbook.
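To get a feel for how quickly this adds up, the cost of an idle instance is just the hourly rate times the number of hours it is left running. The sketch below does that arithmetic. The rate of 0.50 USD per hour is a made-up number for illustration only; look up the actual rate for your instance type and region on the AWS pricing page.

```shell
# idle_cost HOURLY_RATE_USD HOURS
# Print the cost, in USD, of leaving an instance running for HOURS hours.
idle_cost() {
    awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# Hypothetical rate of 0.50 USD per hour, forgotten for 30 days (720 hours).
idle_cost 0.50 720
```

With these assumed numbers, the forgotten instance costs 360.00 USD.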

9. Terminate your instances after the class is over

After the class is over, you might want to terminate your instance. This is because the storage in your instance (stored using AWS's EBS, which is what keeps your repository, installations, etc., all intact) is not free. Once your free tier eligibility expires (after a year if you are new to AWS) and/or your promo codes expire, you will start getting bills for your EBS usage. The storage is wiped if you terminate your instance, and you will no longer be billed for it.

We will distribute promo codes graciously donated by Amazon for you to use resources beyond the free tier. One member of your group will get the promo code, but all members should be able to use the AWS resource. To enable this feature, the owner of your promo code needs to set up identity and access management (IAM). To set this up, perform the following steps.

Once logged in to your AWS console, click the Services pulldown menu at the upper left of the screen. Under Security, Identity & Compliance click IAM.
On the left side of the screen on the IAM console, click Groups.
Click the Create New Group button, and give the group a name on the next screen. Something like "BE/Bi 103 2018 Team 6" will suffice. Once you have done that, click the Next Step button on the lower right of the screen.
On the Attach Policies page, select AmazonEC2FullAccess. Then click the Next Step button.
After reviewing, you can click the Create Group button, and the group is created.
Now click Users on the left IAM dashboard.
Click the Add User button.
Fill in user names for your teammates and then click the AWS Management Console access button.
For Console password, either choose your own strong password or use the autogenerated password. This will only be generated once, so write it down to share with your teammates. Obviously, do not email the password.
You can leave the Require password reset box checked or not, depending on your preference. Then click the Next: Permissions button.
Add the users to the group you just set up. Then click Next: Review.
Make sure everything looks good and then click Create users.
The next page shows the login credentials. Importantly, it gives a link and a password that the users need to log in using IAM. For the users to log in to the console, they need to go to the link displayed and enter their credentials. If you click Download CSV, you get a CSV file with the links and passwords. If you click Send email, you can send an email to your teammates that has the instructions for login. The password will not be sent. They simply need to follow these instructions and use the password you give them to run instances on your account, which gives them access to the promo code.

After the term is over, you should delete the extra users, which you can do from the IAM Users page.

A limitation with Bokeh plots in JupyterLab when using AWS

Serving up interactive Bokeh plots with AWS is a bit more involved than doing it on your local machine. You still get the native interaction, like zooming and tooltips, but if you want to have interactions that require Python code running behind it, you will not be able to do that as configured here. Specifically, you will not be able to use the bebi103.viz.record_clicks() and bebi103.viz.draw_rois() functions. As a workaround, you can use the bebi103.viz.imshow() function with the display_clicks=True kwarg. With this activated, whenever you click on the plot of the image, the location of the click will appear right next to the plot. You can then copy and paste these clicks to make a NumPy array.

Perhaps a better way to do this, though, is to use bebi103.viz.draw_rois() on your local machine, write the result out as a CSV file, and then scp it to your instance and load it from there.

Building the AMI

You should now have the power to use AWS for your computing. If you are curious, to build the AMI, I ran the steps below on the command line of my instance when I first SSH'd into it.

sudo yum -y update;
sudo yum -y install git;
git clone https://github.com/justinbois/bebi103_ami_setup.git;
bash ~/bebi103_ami_setup/setup_ami.sh;
rm -rf bebi103_ami_setup;
rm -f  miniconda.sh;