(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
As we start to do more involved statistical inference (and image processing), you will want to run your calculations on more powerful machines. There are many options for this, including Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, and Caltech's own high performance computing center. In this tutorial, we will show you how to get up and running with AWS. While it looks like a lot of steps, some of the steps are done only once, so it is not much more work to launch instances after the initial setup.
After you have an account, you can launch your instance. We have set up an Amazon Machine Image (AMI), which has the software you need for the course installed and the data sets pre-loaded. The AMI is available in Oregon. Be sure to select this region from the top right corner of the console. You should use the same region throughout the course, since that is physically where your machine will live.
Open the Services pulldown menu at the top of your screen.
Under Images there, click AMIs.
If the filter is set to Owned by me (you likely do not have any), select instead Public images.
Search for BE/Bi 103, and the class AMI should appear. If it does not, double check to make sure your region is Oregon.
Click Launch. (You may also select Spot request if you want to save some money, but when you stop a spot instance, you will lose whatever you stored there.)
The t2.micro type will work for the early part of the course, which is not computationally intensive. This instance is free, provided your storage does not exceed 30 GB, which it shouldn't. When we use AWS for more involved calculations using Stan, I recommend a c5.xlarge or c5.2xlarge instance.
Click Next: Configure Instance Details at the bottom of the screen.
Click Next: Add Storage at the bottom of the page.
Set the storage you want in the Size (GiB) box. 30 GB falls within the free tier, and should be enough, but you can add more. (If you have experience with AWS, you can add a new volume, but be sure you know how to mount it once you get your instance running.)
Click Next: Add Tags at the bottom of the page.
Click Next: Add Security Group.
Select Create a new security group. You can name the group and add a description so you can use it again later. Leave the default SSH rule as it is.
Click Add Rule, and select HTTPS from the pulldown menu. Leave the defaults for that rule that appear.
Click Add Rule, and select Custom TCP Rule. Change the Port Range to 8888.
In the Source column, you may want to adjust the source to be Anywhere for each pulldown menu. That allows you to access your instance from anywhere, but it is not very secure. It is convenient because you can run the instance from your work machine and then from home later on. However, you may want to be more secure and instead select My IP or provide a custom rule for IP addresses that may access the instance.
Click Review and Launch at the bottom of the page and make sure everything is specified as you like.
Click Launch.
Click Launch Instance.
After following those steps, your instance will spin up. You can view your running instances by clicking on Instances on the dashboard on the left part of the screen.
It will take a while for your instance to spin up. When the Instance State says running and the Status Checks are complete, your instance is ready for you.
Now that your instance is launched, you can connect to it using your computer and the ssh protocol. The instructions work for Windows, macOS, or Linux, assuming you have a terminal running bash. In Windows, this is accomplished using GitBash. For macOS, use Terminal.
These instructions assume that your keypair is in a directory key_pairs/ in your home directory and that your keypair file is ~/key_pairs/bebi103_aws_keypair.pem.
Change permissions on your keypair for security. Do this in the terminal using
chmod 400 ~/key_pairs/bebi103_aws_keypair.pem
Open a new GitBash (Windows) or Terminal (macOS) window.
SSH into your instance in the terminal. To do this, click on your instance on the Instances page in the Management Console. At the bottom of the webpage will appear information about your instance, including the IPv4 Public IP. It will look something like 54.92.67.113. Copy this. In what follows, I refer to this as <IPv4 Public IP>. SSH into your instance by doing
ssh -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>
(optional, may only work for macOS) To avoid having to use -i "~/key_pairs/bebi103_aws_keypair.pem" each time, you can add your keypair to your bash profile by doing
echo "ssh-add -K ~/key_pairs/bebi103_aws_keypair.pem" >> ~/.bash_profile;
source ~/.bash_profile
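To confirm that the permission change on the keypair took effect, a quick check along the following lines may help. This is only a sketch: it runs against a throwaway file in a temporary directory, and the file name demo_keypair.pem is a hypothetical stand-in. Point KEY at your real keypair path to check that file instead.

```shell
# Sketch: verify that chmod 400 leaves a key readable only by its owner.
# demo_keypair.pem is a hypothetical stand-in for your real keypair file.
tmpdir=$(mktemp -d)
KEY="$tmpdir/demo_keypair.pem"
touch "$KEY"

chmod 400 "$KEY"

# The first 10 characters of the `ls -l` line are the mode string.
perms=$(ls -l "$KEY" | cut -c1-10)
echo "$perms"

# Clean up the throwaway directory.
rm -rf "$tmpdir"
```

A mode string with read permission for the owner and nothing for group or others is what ssh expects; it typically refuses to use a private key that other users can read.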
Now that you are SSH'd into the instance, you can clone your team's repository. Using team 6 as an example, execute
git clone https://github.com/bebi103/06-bebi103-2018.git
This will clone your repository to your instance. Note that if you log back in to your instance, you may not need to clone again, but can instead do cd ~/06-bebi103-2018; git pull. In fact, you should pretty much always git pull as the first thing you do when you ssh into your instance so that you are up to date with your group's work.
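The clone-once, pull-afterwards habit described above can be sketched as a small helper. The function name sync_action is hypothetical (it is not part of git or the course tools), and it only prints the command it would run, so it is safe to try anywhere.

```shell
# Sketch: decide between cloning and pulling, printing the command (dry run).
# sync_action is a hypothetical helper, not part of git or the course tools.
sync_action() {
    repo_dir="$1"   # where the repository lives (or should live)
    repo_url="$2"   # URL to clone from if it is not there yet
    if [ -d "$repo_dir/.git" ]; then
        echo "git -C $repo_dir pull"
    else
        echo "git clone $repo_url $repo_dir"
    fi
}

# Try it on a throwaway directory that imitates an existing clone.
demo=$(mktemp -d)
mkdir -p "$demo/06-bebi103-2018/.git"
sync_action "$demo/06-bebi103-2018" https://github.com/bebi103/06-bebi103-2018.git
sync_action "$demo/not-cloned-yet" https://github.com/bebi103/06-bebi103-2018.git
```

The first call prints a pull command because a .git directory is present; the second prints a clone command because it is not.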
The data for the course are located in the ~/data folder. To keep the same relative paths you have been using from inside your repository (and as the TAs are expecting for grading), you should make a symbolic link to the ~/data directory. To do that, for example if you are in team 6, do the following on the command line.
cd ~/06-bebi103-2018; ln -s $HOME/data ./data
You now have your repository and all of the data for the course on your instance!
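If symbolic links are new to you, the linking step can be tried safely in a scratch location first. The sketch below uses temporary directories; fake_home and example.csv are hypothetical stand-ins for your home directory and a course data file.

```shell
# Sketch: create and verify a symbolic link using throwaway directories.
# fake_home and example.csv are hypothetical stand-ins.
fake_home=$(mktemp -d)
mkdir -p "$fake_home/data" "$fake_home/06-bebi103-2018"
echo "x,y" > "$fake_home/data/example.csv"

cd "$fake_home/06-bebi103-2018"
ln -s "$fake_home/data" ./data

# -L tests that ./data is a symlink; readlink shows where it points.
if [ -L ./data ]; then
    link_target=$(readlink ./data)
    echo "data -> $link_target"
fi

# Files are reachable through the link using the relative path.
first_line=$(head -n 1 ./data/example.csv)
echo "$first_line"
```

The same two checks ([ -L ./data ] and readlink ./data) can be run inside your real repository on the instance to confirm the link points at ~/data.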
When you launch JupyterLab, you want to use screen. By running screen, your JupyterLab session will not get interrupted if you disconnect from your instance. So, on the command line in your instance, execute
screen
Then, you can launch JupyterLab by executing
jupyter lab --no-browser
on the command line. This will launch JupyterLab. You will see output like this:
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4
Keep this window open.
In order to use JupyterLab through a browser on your machine, you need to set up a socket. To do so, open up another GitBash or Terminal window and execute the following.
ssh -i "~/key_pairs/bebi103_aws_keypair.pem" -L 8000:localhost:8888 ec2-user@<IPv4 Public IP>
This sets up a socket connecting port 8888 on your EC2 instance to port 8000 on your local machine. You can change these numbers as necessary. For example, in the URL listed above that you got when you launched JupyterLab, the port may be localhost:8889, in which case you need to substitute 8889 for 8888 in your ssh command. You may also want a different local port, e.g., 8001, if you already have a JupyterLab instance running on port 8000. In what follows, I will use port numbers 8000 and 8888, which you will probably use 90% of the time, but you can make changes as you see fit.
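Because the order of the two ports in the -L option is easy to mix up, it may help to see the command assembled from named pieces. The helper tunnel_cmd below is hypothetical and only prints the ssh command (a dry run); the IP address is the example value used earlier in this tutorial.

```shell
# Sketch: build the ssh tunnel command for a given pair of ports.
# tunnel_cmd is a hypothetical helper; it only prints the command (dry run).
tunnel_cmd() {
    local_port="$1"    # port on your own machine, e.g. 8000
    remote_port="$2"   # port JupyterLab reports on the instance, e.g. 8888
    ip="$3"            # your instance's IPv4 Public IP
    key="${4:-$HOME/key_pairs/bebi103_aws_keypair.pem}"
    echo "ssh -i \"$key\" -L ${local_port}:localhost:${remote_port} ec2-user@${ip}"
}

# Example: JupyterLab came up on 8889 and local port 8000 is taken, so use 8001.
tunnel_cmd 8001 8889 54.92.67.113 "~/key_pairs/bebi103_aws_keypair.pem"
```

The local port always comes first and the instance's port second, so in this example the browser would be pointed at localhost:8001.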
After you have set up the socket, you can paste the URL given when you launched JupyterLab on your EC2 instance into your browser, but substitute 8000 for 8888. That is, direct your browser to
http://localhost:8000/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4
You will now have JupyterLab up and running!
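If you would rather not edit the URL by hand, the port substitution can be done with sed. This is a minimal sketch; the variable names are illustrative, and the token is the example one shown above, not a real credential.

```shell
# Sketch: swap the instance's port for the local one in the JupyterLab URL.
url="http://localhost:8888/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4"
remote_port=8888   # port in the URL that JupyterLab printed
local_port=8000    # local end of your ssh tunnel

local_url=$(echo "$url" | sed "s/localhost:${remote_port}/localhost:${local_port}/")
echo "$local_url"
```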
If you lose your internet connection, you can reconnect to your session, with JupyterLab running, by reattaching your screen. Execute screen -r on the command line after SSH-ing back in to your EC2 instance to do this.
You can see what screens are active by doing screen -ls on the command line. You can also detach the current screen by using screen -d.
As you work on notebooks and create new files you want to save, you may want to move them to your local machine. If you are working on a notebook or .stan file, the best option is to use git and commit and push those files to your repository directly from the command line on your EC2 instance.
Some files, though, such as MCMC results or intermediate data processing results, are not meant to be under version control. For these files, you can use scp. Within your GitBash or Terminal window on your local machine (you probably have to open yet another), you can copy files as follows.
scp -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>:~/my_file.csv ./
This command will copy files from your EC2 instance to your present working directory. Simply put the full path to the file you want to transfer after the colon above (remember ~/ means "home directory"). The second argument of scp is where you want to copy the file.
Similarly, you can upload files to your EC2 instance as follows (in this example to the home directory in your instance).
scp -i "~/key_pairs/bebi103_aws_keypair.pem" my_file.txt ec2-user@<IPv4 Public IP>:~/
When you are finished with your session, you can shut down your notebook in the browser. Then, in the terminal window, you can shut down JupyterLab by pressing Ctrl-c. After Jupyter is terminated, you should detach your screen by doing screen -d. Finally, you should quit your screen by doing screen -X quit.
In the past, students have left their instances littered with detached screens. You should clean house from time to time and run screen -X quit.
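When several detached screens have piled up, screen may ask for a session name (given with -S) before it will act on one of them. The sketch below shows one way to pick out the detached sessions from the output of screen -ls. The helper detached_screens is hypothetical, and the sample text is made up to resemble typical screen -ls output so that the snippet runs without screen installed.

```shell
# Sketch: list the detached sessions reported by `screen -ls`.
# detached_screens is a hypothetical helper that reads `screen -ls` output on stdin.
detached_screens() {
    grep 'Detached' | awk '{print $1}'
}

# Sample text in the shape `screen -ls` typically prints (session names made up).
sample_output="There are screens on:
    12345.pts-0.ip-172-31-0-1    (Detached)
    12399.pts-1.ip-172-31-0-1    (Attached)
2 Sockets in /run/screen/S-ec2-user."

echo "$sample_output" | detached_screens

# On the instance itself, something like this would quit each detached session:
# screen -ls | detached_screens | while read -r name; do
#     screen -S "$name" -X quit
# done
```

The commented-out loop is left as a suggestion only; check the list of names it would act on before uncommenting it.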
After you are finished with your work on your instance, you should stop your instance. To do this, go back to the Instances page on your EC2 console. Right click your instance, and navigate to Instance State → Stop. Do not terminate your instance unless you really want to. Terminating an instance will get rid of any changes you made to it.
If your instance is not stopped and you leave it running, you will get charged. You can rack up a massive bill with idle, but running, instances. You should stop your instances whenever you are not using them. It is a minor pain to wait for them to spin up again, but forgetting about a running instance will cause more pain than that to your pocketbook.
After the class is over, you might want to terminate your instance. This is because the storage in your instance (stored using AWS's EBS, which is what keeps your repository, installations, etc., all intact) is not free. Once your free tier eligibility expires (after a year, if you are new to AWS) and/or your promo codes expire, you will start getting bills for your EBS usage. The storage gets wiped if you terminate your instance, and you will not get billed for it.
We will distribute promo codes graciously donated by Amazon for you to use resources beyond the free tier. One member of your group will get the promo code, but all members should be able to use the AWS resource. To enable this feature, the owner of your promo code needs to set up identity and access management (IAM). To set this up, perform the following steps.
Open the Services pulldown menu at the upper left of the screen. Under Security, Identity & Compliance click IAM.
Click Groups.
Click the Create New Group button, and give the group a name on the next screen. Something like "BE/Bi 103 2018 Team 6" will suffice. Once you have done that, click the Next Step button on the lower right of the screen.
On the Attach Policies page, select AmazonEC2FullAccess. Then click the Next Step button.
Click the Create Group button, and the group is created.
Click Users on the left IAM dashboard.
Click the Add User button.
Select the AWS Management Console access button.
For Console password, either choose your own strong password or use the autogenerated password. This will only be generated once, so write it down to share with your teammates. Obviously, do not email the password.
Leave the Require password reset box checked or not, depending on your preference. Then click the Next: Permissions button.
Click Next: Review.
Click Create users.
If you click Download CSV, you get a CSV file with the links and passwords. If you click Send email, you can send an email to your teammates that has the instructions for login. The password will not be sent. They simply need to follow these instructions and use the password you give them to run instances on your account, which gives them access to the promo code.
After the term is over, you should delete the extra users, which you can do from the IAM Users page.
Serving up interactive Bokeh plots with AWS is a bit more involved than doing it on your local machine. You still get the native interaction, like zooming and tooltips, but if you want to have interactions that require Python code running behind them, you will not be able to do that as configured here. Specifically, you will not be able to use the bebi103.viz.record_clicks() and bebi103.viz.draw_rois() functions. As a workaround, you can use the bebi103.viz.imshow() function with the display_clicks=True kwarg. With this activated, whenever you click on the plot of the image, the location of the click will appear right next to the plot. You can then copy and paste these clicks to make a NumPy array.
Perhaps a better way to do this, though, is to use bebi103.viz.draw_rois() on your local machine, write the result out as a CSV file, and then scp it to your instance and load it from there.
You should now have the power to use AWS for your computing. If you are curious, to build the AMI, I ran the steps below on the command line of my instance when I first SSH'd into it.
sudo yum -y update;
sudo yum -y install git;
git clone https://github.com/justinbois/bebi103_ami_setup.git;
bash ~/bebi103_ami_setup/setup_ami.sh;
rm -rf bebi103_ami_setup;
rm -f miniconda.sh;