AWS setup


As we start to do more involved statistical inference using Stan, you will want to run your calculations on more powerful machines. There are many options for this, including Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, and Caltech’s own high performance computing center. In this lesson, we will show you how to get up and running with AWS. While it looks like a lot of steps, some of the steps are done only once, so it is not much more work to launch instances after the initial setup. The next part of this lesson serves as a quick reference for how to spin up instances and use them after you have completed the setup outlined in this part of the lesson.

1. Create an Amazon Web Services account

Instructions for students

If you are a student, you will use AWS Educate to access AWS. You should have received an invitation from AWS.

  1. Follow the instructions in that email and on subsequent pages you encounter in your browser to set up an AWS Educate account. You will need to fill out a form with some basic information about your education (e.g., when you graduate).

  2. AWS will then review the information and you will receive an email from them verifying your account. (This may take some time.)

  3. Again, follow the instructions in their confirmation email and on subsequent screens in your browser. You will need to agree to their terms of use, which you should obviously only do if you are comfortable with that.

  4. Once you are able to log in to AWS Educate, do so.

  5. Once logged in, click Classrooms on the top menu.

  6. Choose our class, “Statistical Inference in the Biological Sciences” and click Go to Classroom.

  7. You will be directed to go to AWS’s third party educational platform provider, Vocareum.

  8. Once on Vocareum’s page, click the AWS Console button. This will bring you to the AWS Console, which is where you will launch your instances.

Instructions for postdocs

Unfortunately, AWS does not provide credits for people who are not enrolled as students. If your lab has an AWS account, you should speak with whoever in your lab is responsible for administering it and ask for access. Discuss with them how to log in and get to the AWS Console.

If you want to use your own account, you can create an AWS account by clicking on the link on the upper right corner of this page. Once you have your account you can go back to it by clicking on Sign In To Console, which should now be in the upper right corner of the same site.

Importantly, whether you are using your own account or your lab’s account, a credit card will be charged for all of your AWS usage. The total cost for the class should be well under $50.

2. Launch your instance

Once you are on the AWS Console page, you can launch your instance. We have set up an Amazon Machine Image (AMI), which has the software you need for the course installed and the data sets pre-loaded. The AMI is available in N. Virginia (us-east-1). If you are a student and using AWS Educate, you must use N. Virginia. Postdocs may use either. Be sure to select this region from the top right corner of the console. You should use the same region throughout the course, since that is physically where your machine will live.

  1. To launch an instance with this AMI, choose EC2 among the services available from your AWS console. You can select EC2 from the Services pulldown menu at the top of your screen.

  2. After selecting EC2, you will see a menu of options in the left pane. Under Images there, click AMIs.

  3. The resulting menu will default to AMIs Owned by me (you likely do not have any). Select instead Public images.

  4. In the search menu, search for BE/Bi 103, and the class AMI should appear. If it does not, double check to make sure your region is N. Virginia.

  5. You will see the Caltech BE/Bi 103 b 2021 AMI listed. Right click on it and select Launch. (You may also select Spot request if you want to save some money, but when you stop a spot instance, you will lose whatever you stored there.)

  6. You will then be given many choices of instance types. For our beginning usage of Stan, I recommend a c5.xlarge instance, which has 4 cores. When we start doing simulation based calibration (SBC), which is the most computationally intensive part of the course, I would use a c5.2xlarge instance or larger (8 or more cores). Note that the c5.2xlarge instance may not be available for AWS Educate users, but if you do want eight cores, then a t2.2xlarge instance, which is available to AWS Educate users, will work.

  7. After you select which instance you want, click Next: Configure Instance Details at the bottom of the screen.

  8. You can leave the defaults in the Step 3: Configure Instance Details page and click Next: Add Storage at the bottom of the page.

  9. You can increase the root storage size in the Size (GiB) box. 30 GB falls within the free tier, and should be enough, but you can add more. (If you have experience with AWS, you can add a new volume, but be sure you know how to mount it once you get your instance running.) For most users in this class, the free 30 GB should be fine.

  10. Click Next: Add Tags at the bottom of the page.

  11. You can add tags if you wish to describe this instance for future reference, but it is not necessary. After entering tags (or not), click Next: Configure Security Group.

  12. Select Create a new security group. You can name the group and add a description so you can use it again later. Leave the default SSH rule as it is.

  13. Click Add Rule, and select HTTPS from the pulldown menu. Leave the defaults for that rule that appear.

  14. Click Add Rule, and select Custom TCP Rule. Change the Port Range to 8888-8892. This will allow you to run as many as five notebooks (on ports 8888, 8889, 8890, 8891, and 8892) from AWS.

  15. For the Source column, you may want to adjust the source to be Anywhere for each pulldown menu. That allows you to access your instance from anywhere, but it is not very secure. It is convenient because you can run the instance from your work machine and then from home later on. However, you may want to be more secure and instead select My IP or provide a custom rule for IP addresses that may access the instance.

  16. Click Review and Launch at the bottom of the page and make sure everything is specified as you like.

  17. Click Launch.

  18. You will be prompted for a key pair. For your first time launching, you will need to create a key pair. This will be provided to you only once, so download it and store it locally on your machine. DO NOT, I repeat, DO NOT store it in any git repository, or anything that is backed up to the cloud, like Dropbox. ONLY store it locally on your machine and never, ever let it out to the internet. If it is your second time launching, you can use an existing key pair.

  19. Finally, click Launch Instance.

After following those steps, your instance will spin up. You can view your running instances by clicking on Instances on the dashboard on the left part of the screen.

It will take a while for your instance to spin up. When the Instance State says running and the Status Checks are complete, your instance is ready for you.

3. Connect to your instance

Now that your instance is launched, you can connect to it using your computer and the ssh protocol. The instructions work for Windows, macOS, or Linux, assuming you have a terminal running bash or zsh. In Windows, this is accomplished using GitBash, which you should install if you have not already. For macOS, use Terminal.

  1. Identify where you put your keypair file. For the purposes of this exercise, I will assume that you have a directory in your home directory called key_pairs/ and that your keypair file is ~/key_pairs/bebi103_aws_keypair.pem.

  2. Change permissions on your keypair for security. Do this in the terminal using

    chmod 400 ~/key_pairs/bebi103_aws_keypair.pem

  3. Open a new GitBash (Windows) or Terminal (macOS) window.

  4. SSH into your instance in the terminal. To do this, click on your instance on the Instances page in the Management Console. Information about your instance will appear at the bottom of the webpage, including the IPv4 Public IP. It will look something like 54.92.67.113. Copy this. In what follows, I refer to this as <IPv4 Public IP>. SSH into your instance by doing

    ssh -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>

  5. (optional, may only work for macOS) To avoid having to use -i "~/key_pairs/bebi103_aws_keypair.pem" each time, you can add your keypair to your shell profile (here ~/.zshrc, the profile for zsh) by doing

    echo ssh-add -K $HOME/key_pairs/bebi103_aws_keypair.pem >> ~/.zshrc; source ~/.zshrc

  6. Now that you are SSH’d into the instance, you can clone your team’s repository. Using team 6 as an example, execute

    git clone https://github.com/bebi103/06-bebi103b-2021.git

This will clone your repository to your instance. Note that if you log back in to your instance, you may not need to clone again, but can instead do cd ~/06-bebi103b-2021; git pull. In fact, you should pretty much always git pull as the first thing you do when you ssh into your instance so that you are up to date with your group’s work.

  7. The data for the course are located in the ~/data folder. To keep the same relative paths you have been using from inside your repository (and as the TAs are expecting for grading), you should make a symbolic link to the ~/data directory. To do that, for example if you are in team 6, do the following on the command line.

    cd ~/06-bebi103b-2021; ln -s $HOME/data ./data

  8. Whenever you log in to your instance, you should update the instance in case I updated any software or added data sets. To do this, run

    bebi103_update

on the command line after SSH-ing into your instance.

  9. You now have your repository and all of the data for the course on your instance! Any documents you create or edit on your AWS instance can be managed with git and push/pulled to/from GitHub.
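
If typing the keypair path and user name each time becomes tedious, one option that works on any platform is an entry in ~/.ssh/config. The following is a minimal sketch and not part of the course setup: the alias bebi103-aws and the function name make_ssh_config_entry are made up for illustration, and the IP address is the sample one used above. The function only prints the entry, so you can inspect it before adding it to ~/.ssh/config yourself.

```shell
# Print an ~/.ssh/config entry for the instance (hypothetical helper).
# $1: the instance's IPv4 Public IP
# $2: path to the keypair file (defaults to the location assumed in this lesson)
make_ssh_config_entry() {
    ip="$1"
    key="${2:-$HOME/key_pairs/bebi103_aws_keypair.pem}"
    printf 'Host bebi103-aws\n'
    printf '    HostName %s\n' "$ip"
    printf '    User ec2-user\n'
    printf '    IdentityFile %s\n' "$key"
}

# Example with the sample address used in this lesson
entry="$(make_ssh_config_entry 54.92.67.113)"
printf '%s\n' "$entry"
```

With such an entry in place, ssh bebi103-aws should behave like the longer command in step 4. The public IP usually changes when a stopped instance is started again, so the HostName line would need to be updated after each restart.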

4. Launch JupyterLab

When you launch JupyterLab, you want to use screen (https://en.wikipedia.org/wiki/GNU_Screen). By running screen, your JupyterLab session will not get interrupted if you disconnect from your instance. So, on the command line in your instance, execute

screen

Then, you can launch JupyterLab by executing

jupyter lab --no-browser

on the command line. This will launch JupyterLab. You will see output like this:

To access the notebook, open this file in a browser:
    file:///home/ec2-user/.local/share/jupyter/runtime/nbserver-1821-open.html
Or copy and paste one of these URLs:
    http://localhost:8888/?token=b9910e579549381a3b6dvd359fada1624bcdf718422bab95
 or http://127.0.0.1:8888/?token=b9910e579549381a3b6dvd359fada1624bcdf718422bab95

Keep this window open.

In order to use JupyterLab through a browser on your machine, you need to set up a socket. To do so, open up another GitBash or Terminal window and execute the following.

ssh -i "~/key_pairs/bebi103_aws_keypair.pem" -L 8000:localhost:8888 ec2-user@<IPv4 Public IP>

This sets up a socket connecting port 8888 on your EC2 instance to port 8000 on your local machine. You can change these numbers as necessary. For example, in the URL listed above that you got when you launched JupyterLab, the port may be localhost:8889, in which case you need to substitute 8889 for 8888 in your ssh command. You may also want a different local port, e.g., 8001, if you already have a JupyterLab instance running on port 8000. In what follows, I will use port numbers 8000 and 8888, which you will probably use 90% of the time, but you can make changes as you see fit.
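
Because the local and remote ports can vary, it may help to see the command assembled from its parts. This is a sketch under the assumptions of this lesson (keypair location, ec2-user as the user name). The function name tunnel_cmd is hypothetical, and it only prints the command rather than running it.

```shell
# Print the ssh port-forwarding command for a given pair of ports.
# $1: IPv4 Public IP of the instance
# $2: port on your local machine (default 8000)
# $3: port JupyterLab uses on the instance (default 8888)
tunnel_cmd() {
    ip="$1"
    local_port="${2:-8000}"
    remote_port="${3:-8888}"
    printf 'ssh -i "~/key_pairs/bebi103_aws_keypair.pem" -L %s:localhost:%s ec2-user@%s\n' \
        "$local_port" "$remote_port" "$ip"
}

# Case: JupyterLab reported port 8889 and local port 8000 is already taken
cmd="$(tunnel_cmd 54.92.67.113 8001 8889)"
echo "$cmd"
```

The first number after -L is always the local port and the second is the port on the instance, which is the order that is easiest to mix up.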

After you have set up the socket, you can paste the URL given when you launched JupyterLab on your EC2 instance into your browser, but substitute 8000 for 8888. That is, direct your browser to

http://localhost:8000/?token=b9910e579549381a3b6dvd359fada1624bcdf718422bab95

You will now have JupyterLab up and running!
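
If you prefer not to edit the URL by hand, the port substitution can be done with sed. This is a small sketch using the example URL from above; the variable names are only for illustration.

```shell
# URL printed by JupyterLab on the instance (example token from this lesson)
remote_url='http://localhost:8888/?token=b9910e579549381a3b6dvd359fada1624bcdf718422bab95'

# Replace only the port that directly follows "localhost:", leaving the token untouched
local_url="$(printf '%s\n' "$remote_url" | sed 's#localhost:8888#localhost:8000#')"
echo "$local_url"
```

Matching on localhost:8888 rather than on 8888 alone avoids accidentally changing a token that happens to contain the same digits.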

Note that you may be running JupyterLab locally on your own machine. When you launch JupyterLab on AWS, you should make sure you do not use the same port number as any JupyterLab instance running on your local machine. You can specify the port number to be, for example 8890, by launching JupyterLab with

jupyter lab --no-browser --port 8890

If you do that, make sure you use the corresponding port numbers when setting up your socket.

5. If you get detached

If you lose your internet connection, you can reconnect to your session, with JupyterLab running, by reattaching your screen. Execute screen -r on the command line after SSH-ing back in to your EC2 instance to do this.

You can see what screens are active by doing screen -ls on the command line. You can also detach the current screen by using screen -d.
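
If you end up with more than one screen, giving each a name can make it easier to tell them apart. The sketch below only prints the relevant commands so that it can be run anywhere. The session name jupyter is an assumption made for this example and is not something the course AMI sets up.

```shell
# Hypothetical session name chosen for this sketch
session="jupyter"

# Commands for working with a named screen session, printed rather than run
start_cmd="screen -S $session"          # start a new session with that name
reattach_cmd="screen -r $session"       # reattach to it after SSH-ing back in
quit_cmd="screen -S $session -X quit"   # quit that specific session

printf '%s\n' "$start_cmd" "$reattach_cmd" "$quit_cmd"
```

On the instance, you would type the printed commands directly. The name also shows up in the output of screen -ls.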

6. Copying results between AWS and your local machine

As you work on notebooks and create new files you want to save, you may want to move them to your local machine. If you are working on a notebook or .stan file, the best option is to use git and commit and push those files to your repository directly from the command line on your EC2 instance.

Some files, though, such as MCMC results or intermediate data processing results, are not meant to be under version control. For these files, you can use scp. Within your GitBash or Terminal window on your local machine (you probably have to open yet another window), you can copy files as follows.

scp -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>:~/my_file.csv ./

This command will copy files from your EC2 instance to your working directory. Simply put the full path to the file you want to transfer after the colon above (remember ~/ means “home directory”). The second argument of scp is where you want to copy the file.

Similarly, you can upload files to your EC2 instance as follows (in this example to the home directory in your instance).

scp -i "~/key_pairs/bebi103_aws_keypair.pem" my_file.txt ec2-user@<IPv4 Public IP>:~/
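
When you have many result files, it can be convenient to bundle them into a single archive on the instance before copying. The following is a minimal sketch that runs in a scratch directory. The directory and file names (mcmc_results, samples.csv) are made up for illustration.

```shell
# Work in a scratch directory so the sketch does not touch real files
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Stand-in for a results directory on the instance (hypothetical names)
mkdir mcmc_results
echo "chain,draw,theta" > mcmc_results/samples.csv

# Bundle the directory into one compressed archive
tar -czf mcmc_results.tar.gz mcmc_results

# List the archive contents to confirm what would be transferred
contents="$(tar -tzf mcmc_results.tar.gz)"
echo "$contents"

# The archive could then be fetched from your local machine with, e.g.,
# scp -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>:~/mcmc_results.tar.gz ./
```

After copying, tar -xzf mcmc_results.tar.gz unpacks the archive on your local machine. Alternatively, scp -r copies a whole directory without archiving it first.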

7. Exiting

When you are finished with your session, you can shut down your notebook in the browser. If the shutdown does not give you a new prompt in the terminal window, you can do a hard shutdown of JupyterLab by pressing Ctrl-c. After Jupyter is terminated, you should detach your screen by doing screen -d. Finally, you should quit your screen by doing screen -X quit.

In the past, I have had students have their instances littered with detached screens. You should clean house from time to time by running screen -X quit.

After you are finished with your work on your instance, you should stop your instance. To do this, go back to the Instances page on your EC2 console. Right click your instance, and navigate to Instance State → Stop. Do not terminate your instance unless you really want to. Terminating an instance will get rid of any changes you made to it.

8. Seriously. Stop your instances if you are not using them.

If your instance is not stopped and you leave it running, you will get charged against the $50 of credits AWS Educate provides. You can rack up a massive bill with idle, but running, instances. You should stop your instances whenever you are not using them. It is a minor pain to wait for them to spin up again, but forgetting about a running instance will cause more pain than that to your pocketbook.
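
To get a feel for how quickly a forgotten instance can use up credits, here is a back-of-the-envelope sketch. The hourly rate below is a placeholder chosen only to make the arithmetic easy and is not an actual AWS price; look up the current on-demand price for your instance type and region before drawing conclusions.

```shell
# ASSUMPTION: illustrative hourly rate in USD, NOT an actual AWS price
hourly_rate=0.20
# Credits mentioned in this lesson, in USD
budget=50

# Hours of run time the budget covers, and the same figure in days
hours="$(awk -v b="$budget" -v r="$hourly_rate" 'BEGIN { printf "%.0f", b / r }')"
days="$(awk -v h="$hours" 'BEGIN { printf "%.1f", h / 24 }')"

echo "Budget of \$$budget at \$$hourly_rate/hour lasts $hours hours (about $days days)"
```

Under that assumed rate, an instance left running around the clock would exhaust the credits in under two weeks, whether or not it is doing any work.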

9. Terminate your instances after the class is over

After the class is over, you might want to terminate your instance, especially if you plan on using your own account going forward. This is because the storage in your instance (stored using AWS’s EBS, which is what keeps your repository, installations, etc., all intact) is not free. Once your free tier eligibility expires (after a year, if you are new to AWS) and/or your promo codes expire, you will start getting bills for your EBS usage. The storage gets wiped if you terminate your instance, and you will not get billed for it.

Building the AMI

You should now have the power to use AWS for your computing. If you are curious, to build the AMI, I ran the steps below on the command line of my instance when I first SSH’d into it.

sudo yum -y update;
sudo yum -y install git;
git clone https://github.com/justinbois/bebi103_ami_setup.git;
bash ~/bebi103_ami_setup/setup_ami.sh;
rm -rf bebi103_ami_setup;