AWS setup¶
As we start to do more involved statistical inference using Stan, you will want to run your calculations on more powerful machines. There are many options for this, including Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, and Caltech’s own high performance computing center. In this tutorial, we will show you how to get up and running with AWS. While it looks like a lot of steps, some of the steps are done only once, so it is not much more work to launch instances after the initial setup. The next part of this lesson serves as a quick reference for how to spin up instances and use them after you have completed the setup outlined in this part of the lesson.
1. Create an Amazon Web Services account¶
Instructions for students¶
If you are a student, you will use AWS Educate to access AWS. You should have received an invitation from AWS.
Follow the instructions in that email and on subsequent pages you encounter in your browser to set up an AWS Educate account. You will need to fill out a form with some basic information about your education (e.g., when you graduate).
AWS will then review the information and you will receive an email from them verifying your account. (This may take some time.)
Again, follow the instructions on their confirmation email and subsequent screens in your browser. You will need to agree to their terms of use, which you should obviously only do if you are comfortable with that.
Once you are able to log in to AWS Educate, do so.
Once logged in, click Classrooms on the top menu.
Choose our class, “Statistical Inference in the Biological Sciences” and click Go to Classroom.
You will be directed to go to AWS’s third party educational platform provider, Vocareum.
Once on Vocareum’s page, click the AWS Console button. This will bring you to the AWS Console, which is where you will launch your instances.
Instructions for postdocs¶
Unfortunately, AWS does not provide credits for people who are not enrolled as students. If your lab has an AWS account, you should speak with whoever in your lab is responsible for administering it and ask for access. Discuss with them how to log in and get to the AWS Console.
If you want to use your own account, you can create an AWS account by clicking on the link on the upper right corner of this page. Once you have your account you can go back to it by clicking on Sign In To Console, which should now be in the upper right corner of the same site.
Importantly, whether you are using your own account or your lab’s account, a credit card will be charged for all of your AWS usage. The total cost for the class should be well under $50.
2. Launch your instance¶
Once you are on the AWS Console page, you can launch your instance. We have set up an Amazon Machine Image (AMI), which has the software you need for the course installed and the data sets pre-loaded. The AMI is available in Oregon and in N. Virginia. If you are a student and using AWS Educate, you must use N. Virginia. Postdocs may use either. Be sure to select this region from the top right corner of the console. You should use the same region throughout the course, since that is physically where your machine will live.
To launch an instance with this AMI, choose EC2 among the services available from your AWS console. You can select EC2 from the Services pulldown menu at the top of your screen.
After selecting EC2, you will see the EC2 Dashboard on the left pane. Under Images there, click AMIs.
The resulting menu will default to AMIs Owned by me (you likely do not have any). Select Public images instead.
In the search menu, search for BE/Bi 103, and the class AMI should appear. If it does not, double check to make sure your region is N. Virginia (if you are a student using AWS Educate) or Oregon.
You will see the Caltech BE/Bi 103 b 2020 AMI listed. Right click on it and select Launch. (You may also select Spot request if you want to save some money, but when you stop a spot instance, you will lose whatever you stored there.)
You will then be given many choices of instance types. For our beginning usage of Stan, I recommend a c5.xlarge instance, which has 4 CPUs. When we start doing simulation based calibration (SBC), which is the most computationally intensive part of the course, I would use a c5.2xlarge instance or larger (8 or more CPUs). Note that the c5.2xlarge instance may not be available for AWS Educate users, but if you do want eight cores, then a t2.2xlarge instance, which is available to AWS Educate users, will work.
After you select which instance you want, click Next: Configure Instance Details at the bottom of the screen.
You can leave the defaults in the Step 3: Configure Instance Details page and click Next: Add Storage at the bottom of the page.
You can increase the root storage size in the Size (GiB) box. 30 GB falls within the free tier, and should be enough, but you can add more. (If you have experience with AWS, you can add a new volume, but be sure you know how to mount it once you get your instance running.) For most users in this class, the free 30 GB should be fine.
Click Next: Add Tags at the bottom of the page.
You can add tags if you wish to describe this instance for future reference, but it is not necessary. After entering tags (or not), click Next: Add Security Group.
Select Create a new security group. You can name the group and add a description so you can use it again later. Leave the default SSH rule as it is.
Click Add Rule, and select HTTPS from the pulldown menu. Leave the defaults for that rule that appear.
Click Add Rule, and select Custom TCP Rule. Change the Port Range to 8888-8892. This will allow you to run as many as five notebooks (on ports 8888, 8889, 8890, 8891, and 8892) from AWS.
For the Source column, you may want to adjust the source to be Anywhere for each pulldown menu. That allows you to access your instance from anywhere, but it is not very secure. It is convenient because you can run the instance from your work machine and then from home later on. However, you may want to be more secure and instead select My IP or provide a custom rule for IP addresses that may access the instance.
Click Review and Launch at the bottom of the page and make sure everything is specified as you like.
Click Launch.
You will be prompted for a key pair. For your first time launching, you will need to create a key pair. This will be provided to you only once, so download it and store it locally on your machine. DO NOT, I repeat, DO NOT store it in any git repository, or anything that is backed up to the cloud, like Dropbox. ONLY store it locally on your machine and never, ever let it out to the internet. If it is your second time launching, you can use an existing key pair.
Finally, click Launch Instance.
After following those steps, your instance will spin up. You can view your running instances by clicking on Instances on the dashboard on the left part of the screen.
It will take a while for your instance to spin up. When the Instance State says running and the Status Checks are complete, your instance is ready for you.
3. Connect to your instance¶
Now that your instance is launched, you can connect to it using your computer and the ssh protocol. The instructions work for Windows, macOS, or Linux, assuming you have a terminal running bash. In Windows, this is accomplished using GitBash, which you should install if you have not already. For macOS, use Terminal.
Identify where you put your keypair file. For the purposes of this exercise, I will assume that you have a directory in your home directory called key_pairs/ and that your keypair file is ~/key_pairs/bebi103_aws_keypair.pem.
Change permissions on your keypair for security. Do this in the terminal using
chmod 400 ~/key_pairs/bebi103_aws_keypair.pem
Open a new GitBash (Windows) or Terminal (macOS) window.
SSH into your instance in the terminal. To do this, click on your instance on the Instances page in the Management Console. At the bottom of the webpage will appear information about your instance, including the IPv4 Public IP. It will look something like 54.92.67.113. Copy this. In what follows, I refer to this as <IPv4 Public IP>. SSH into your instance by doing
ssh -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>
(optional, may only work for macOS) To avoid having to use -i "~/key_pairs/bebi103_aws_keypair.pem" each time, you can add your keypair to your bash profile by doing
echo ssh-add -K $HOME/key_pairs/bebi103_aws_keypair.pem >> ~/.bash_profile;
source ~/.bash_profile
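As an alternative to the ssh-add approach, an SSH client configuration entry can carry the key pair path and the user name. The following is a minimal sketch and not part of the course setup: the alias bebi103 is a hypothetical name, the IP address is the example one used above, and the entry is written to a scratch file rather than to your real ~/.ssh/config so that you can inspect it first.

```shell
# Sketch only: build an SSH config entry in a scratch file for inspection.
# "bebi103" is a hypothetical alias; 54.92.67.113 is the example IP from above.
scratch_dir=$(mktemp -d)
config_file="$scratch_dir/config"

# The quoted 'EOF' keeps the text literal, so ~ is written as-is.
cat > "$config_file" <<'EOF'
Host bebi103
    HostName 54.92.67.113
    User ec2-user
    IdentityFile ~/key_pairs/bebi103_aws_keypair.pem
EOF

# Show the entry; copy it into ~/.ssh/config by hand if you want to use it.
cat "$config_file"
```

With an entry like this in ~/.ssh/config, ssh bebi103 would stand in for the longer command. Keep in mind that the public IP of an instance generally changes when the instance is stopped and started again, so the HostName line would need to be updated each time.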
Now that you are SSH’d into the instance, you can clone your team’s repository. Using team 6 as an example, execute
git clone https://github.com/bebi103/06-bebi103b-2020.git
This will clone your repository to your instance. Note that if you log back in to your instance, you may not need to clone again, but can instead do cd ~/06-bebi103b-2020; git pull. In fact, you should pretty much always git pull as the first thing you do when you ssh into your instance so that you are up to date with your group’s work.
The data for the course are located in the ~/data folder. To keep the same relative paths you have been using from inside your repository (and as the TAs are expecting for grading), you should make a symbolic link to the ~/data directory. To do that, for example if you are in team 6, do the following on the command line.
cd ~/06-bebi103b-2020; ln -s $HOME/data ./data
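To see what the symbolic link does before touching your repository, the same pattern can be tried in a scratch directory. This is a minimal sketch: the scratch directories and the file example.csv are stand-ins invented for illustration, not course files.

```shell
# Sketch in a scratch directory; directory and file names are stand-ins.
scratch_dir=$(mktemp -d)
mkdir -p "$scratch_dir/data" "$scratch_dir/06-bebi103b-2020"
printf 'x,y\n1,2\n' > "$scratch_dir/data/example.csv"   # hypothetical data file

# Same pattern as above: link the data directory into the repository.
cd "$scratch_dir/06-bebi103b-2020"
ln -s "$scratch_dir/data" ./data

# The link is listed as a symbolic link, and files resolve through it.
ls -l ./data
first_line=$(head -n 1 ./data/example.csv)
echo "$first_line"
```

Inside the repository, ./data/example.csv now refers to the file stored in the separate data directory, which is why relative paths in notebooks keep working.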
Whenever you log in to your instance, you should update the instance in case I updated any software or added data sets. To do this, run bebi103_update on the command line after SSH-ing into your instance.
You now have your repository and all of the data for the course on your instance!
4. Launch JupyterLab¶
When you launch JupyterLab, you want to use screen (https://en.wikipedia.org/wiki/GNU_Screen). By running screen, your JupyterLab session will not get interrupted if you disconnect from your instance. So, on the command line in your instance, execute
screen
Then, you can launch JupyterLab by executing
jupyter lab --no-browser
on the command line. This will launch JupyterLab. You will see output like this:
To access the notebook, open this file in a browser:
file:///home/ec2-user/.local/share/jupyter/runtime/nbserver-1821-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=b9910e579549381a3b6dvd359fada1624bcdf718422bab95
or http://127.0.0.1:8888/?token=b9910e579549381a3b6dvd359fada1624bcdf718422bab95
Keep this window open.
In order to use JupyterLab through a browser on your machine, you need to set up a socket, which here means forwarding a port on your local machine to a port on your instance over SSH. To do so, open up another GitBash or Terminal window and execute the following.
ssh -i "~/key_pairs/bebi103_aws_keypair.pem" -L 8000:localhost:8888 ec2-user@<IPv4 Public IP>
This sets up a socket connecting port 8888 on your EC2 instance to port 8000 on your local machine. You can change these numbers as necessary. For example, in the URL listed above that you got when you launched JupyterLab, the port may be localhost:8889, in which case you need to substitute 8889 for 8888 in your ssh command. You may also want a different local port, e.g., 8001, if you already have a JupyterLab instance running on port 8000. In what follows, I will use port numbers 8000 and 8888, which you will probably use 90% of the time, but you can make changes as you see fit.
After you have set up the socket, you can paste the URL given when you launched JupyterLab on your EC2 instance into your browser, but substitute 8000 for 8888. That is, direct your browser to
http://localhost:8000/?token=b9910e579549381a3b6dvd359fada1624bcdf718422bab95
You will now have JupyterLab up and running!
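Because the local port, the remote port, and the URL all have to agree, it can help to generate the ssh command and the browser URL from the same pair of numbers. The following is a minimal sketch under stated assumptions: tunnel_command and local_url are hypothetical helper names, the IP address is the example one from above, and <token> stands for whatever token JupyterLab printed. Nothing here opens a connection; the functions only print text.

```shell
# Sketch: print the tunnel command and the browser URL for a pair of ports.
# Function names are hypothetical; nothing here opens a connection.

# Usage: tunnel_command LOCAL_PORT REMOTE_PORT IP
tunnel_command() {
    printf 'ssh -i "~/key_pairs/bebi103_aws_keypair.pem" -L %s:localhost:%s ec2-user@%s\n' \
        "$1" "$2" "$3"
}

# Usage: local_url URL_FROM_JUPYTER LOCAL_PORT
# Replaces the port in the URL printed by JupyterLab with the local port.
local_url() {
    printf '%s\n' "$1" | sed -E "s#(localhost|127\.0\.0\.1):[0-9]+#\1:$2#"
}

# Example: JupyterLab reported port 8889 and local port 8000 is already taken.
cmd=$(tunnel_command 8001 8889 54.92.67.113)
url=$(local_url 'http://localhost:8889/?token=<token>' 8001)
echo "$cmd"
echo "$url"
```

The first number after -L is always the port on your local machine, and it is that number, not the one JupyterLab reported, that belongs in the URL you give your browser.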
Note that you may be running JupyterLab locally on your own machine. You should make sure you do not use the same port number as any JupyterLab instance running on your local machine when you launch JupyterLab on AWS. You can specify the port number to be, for example, 8890, by launching JupyterLab with
jupyter lab --no-browser --port 8890
If you do that, make sure you use the corresponding port numbers when setting up your socket.
5. If you get detached¶
If you lose your internet connection, you can reconnect to your session, with JupyterLab running, by reattaching your screen. To do this, execute screen -r on the command line after SSH-ing back in to your EC2 instance.
You can see what screens are active by doing screen -ls on the command line. You can also detach the current screen by using screen -d.
6. Copying results between AWS and your local machine¶
As you work on notebooks and create new files you want to save, you may want to move them to your local machine. If you are working on a notebook or .stan file, the best option is to use git and commit and push those files to your repository directly from the command line on your EC2 instance.
Some files, though, such as MCMC results or intermediate data processing results, are not meant to be under version control. For these files, you can use scp. Within your GitBash or Terminal window on your local machine (you probably have to open yet another), you can copy files as follows.
scp -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>:~/my_file.csv ./
This command will copy files from your EC2 instance to your present working directory. Simply put the full path to the file you want to transfer after the colon above (remember ~/ means “home directory”). The second argument of scp is where you want to copy the file.
Similarly, you can upload files to your EC2 instance as follows (in this example to the home directory in your instance).
scp -i "~/key_pairs/bebi103_aws_keypair.pem" my_file.txt ec2-user@<IPv4 Public IP>:~/
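If you copy files often, keeping the key pair path and the instance address in variables shortens the commands. The following is a minimal sketch: fetch_cmd, fetch_dir_cmd, and send_cmd are hypothetical helper names, the directory name mcmc_results is invented for illustration, and the IP address is the example one from above. Each helper only prints the scp command it would run, so you can check it before removing the leading echo.

```shell
# Sketch: helpers that print scp commands (a dry run). Names are hypothetical.
KEYPAIR="$HOME/key_pairs/bebi103_aws_keypair.pem"
INSTANCE="ec2-user@54.92.67.113"   # example IP from above; use your own

# Usage: fetch_cmd REMOTE_PATH LOCAL_DEST (copy one file from the instance)
fetch_cmd() {
    echo scp -i "$KEYPAIR" "$INSTANCE:$1" "$2"
}

# Usage: fetch_dir_cmd REMOTE_DIR LOCAL_DEST (-r copies a directory recursively)
fetch_dir_cmd() {
    echo scp -r -i "$KEYPAIR" "$INSTANCE:$1" "$2"
}

# Usage: send_cmd LOCAL_PATH REMOTE_DEST (copy one file to the instance)
send_cmd() {
    echo scp -i "$KEYPAIR" "$1" "$INSTANCE:$2"
}

# Remote paths are single-quoted so that ~ is interpreted on the instance.
fetch_out=$(fetch_cmd '~/my_file.csv' ./)
dir_out=$(fetch_dir_cmd '~/mcmc_results' ./)
send_out=$(send_cmd my_file.txt '~/')
echo "$fetch_out"
echo "$dir_out"
echo "$send_out"
```

The -r form is convenient when a set of MCMC output files lives in one directory and you want all of them at once.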
7. Exiting¶
When you are finished with your session, you can shut down your notebook in the browser. Then, in the terminal window, you can shut down JupyterLab by pressing Ctrl-c. After Jupyter is terminated, you should detach your screen by doing screen -d. Finally, you should quit your screen by doing screen -X quit.
In the past, I have had students have their instances littered with detached screens. You should clean house from time to time and run screen -X quit.
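When several detached screens have piled up, naming each session with -S makes it explicit which one a quit command is aimed at. The following is a minimal sketch of that cleanup. The listing text is an illustrative stand-in in the general shape of what screen -ls prints, not real output from the course instance, and detached_sessions is a hypothetical helper name.

```shell
# Sketch: pick out detached sessions from `screen -ls` style output.
# The sample text is an illustrative stand-in, not real output.
sample_listing='There are screens on:
    1821.pts-0.ip-172-31-0-1    (Detached)
    1935.pts-1.ip-172-31-0-1    (Attached)
2 Sockets in /run/screen/S-ec2-user.'

# Print the session name (first field) of every line marked Detached.
detached_sessions() {
    awk '/\(Detached\)/ {print $1}'
}

detached=$(printf '%s\n' "$sample_listing" | detached_sessions)
echo "$detached"

# On the instance, the same filter could drive the cleanup.
# It is left commented out here so the sketch has no side effects:
# screen -ls | detached_sessions | while read -r name; do
#     screen -S "$name" -X quit
# done
```

Only sessions marked as detached are selected, so a screen you are currently attached to is left alone.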
After you are finished with your work on your instance, you should stop your instance. To do this, go back to the Instances page on your EC2 console. Right click your instance, and navigate to Instance State → Stop. Do not terminate your instance unless you really want to. Terminating an instance will get rid of any changes you made to it.
8. Seriously. Stop your instances if you are not using them.¶
If your instance is not stopped and you leave it running, you will get charged against the $50 of credits AWS Educate provided. You can rack up a massive bill with idle, but running, instances. You should stop your instances whenever you are not using them. It is a minor pain to wait for them to spin up again, but forgetting about a running instance will cause more pain than that to your pocketbook.
9. Terminate your instances after the class is over¶
After the class is over, you might want to terminate your instance, especially if you plan on using your own account going forward. This is because the storage in your instance (stored using AWS’s EBS, which is what keeps your repository, installations, etc., all intact) is not free. Once your free tier eligibility expires (after a year, if you are new to AWS) and/or your promo codes expire, you will start getting bills for your EBS usage. The storage is wiped when you terminate your instance, and you will then no longer be billed for it.
Building the AMI¶
You should now have the power to use AWS for your computing. If you are curious, to build the AMI, I ran the steps below on the command line of my instance when I first SSH’d into it.
sudo yum -y update;
sudo yum -y install git;
git clone https://github.com/justinbois/bebi103_ami_setup.git;
bash ~/bebi103_ami_setup/setup_ami.sh;
rm -rf bebi103_ami_setup;