# Recitation 4: Running Jupyter notebooks with AWS

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

This document was prepared at [Caltech](http://www.caltech.edu) with financial support from the [Donna and Benjamin M. Rosen Bioengineering Center](http://rosen.caltech.edu).

<img src="caltech_rosen.png">

*This tutorial was generated from a Jupyter notebook. You can download the notebook [here](aux4_AWS.ipynb).*

<br />

As we start to do more involved statistical inference (and image processing), you will want to run your calculations on more powerful machines. There are many options for this, including [Amazon Web Services](https://aws.amazon.com/) (AWS), [Google Cloud Platform](https://cloud.google.com), [Microsoft Azure](https://azure.microsoft.com/), and Caltech's own [high performance computing center](http://www.hpc.caltech.edu/). In this tutorial, we will show you how to get up and running with AWS. While it looks like a lot of steps, some of the steps are done only once, so it is not much more work to launch instances after the initial setup.

### 1. Create an Amazon Web Services account

The first thing to do is create an AWS account by clicking on the link on the upper right corner of [this page](https://aws.amazon.com/). Once you have your account you can go back to it by clicking on *Sign In To Console*, which should now be in the upper right corner of the same site.

### 2. Launch your instance

After you have an account, you can launch your instance. We have set up an Amazon Machine Image (AMI), which has the software you need for the course installed and the data sets pre-loaded. The AMI is available in Oregon. Be sure to select this region from the top right corner of the console. You should use the same region throughout the course, since that is physically where your machine will live.

1. To launch an instance with this AMI, choose EC2 among the services available from your AWS console. You can select EC2 from the `Services` pulldown menu at the top of your screen. 
2. After selecting EC2, you will see the EC2 Dashboard on the left pane. Under `Images`, click `AMIs`. 
3. The resulting menu will default to AMIs `Owned by me` (you likely do not have any). Select instead `Public images`. 
4. In the search menu, search for `BE/Bi 103`, and the class AMI should appear. If it does not, double check to make sure your region is Oregon.
5. You will see the Caltech BE/Bi 103 2018 edition AMI listed. Right click on it and select `Launch`. (You may also select `Spot request` if you want to save some money, but when you stop a spot instance, you will lose whatever you stored there.) 
6. You will then be given many choices of instance types. The `t2.micro` type will work for the early part of the course, which is not computationally intensive. This instance is free, provided your storage does not exceed 30 GB, which it shouldn't. When we use AWS for more involved calculations using Stan, I recommend a `c5.xlarge` or `c5.2xlarge` instance.
7. After you select which instance you want, click `Next: Configure Instance Details` at the bottom of the screen.
8. You can leave the defaults in the Step 3: Configure Instance Details page and click `Next: Add Storage` at the bottom of the page.
9. You can increase the root storage size in the `Size (GiB)` box. 30 GB falls within the free tier, and should be enough, but you can add more. (If you have experience with AWS, you can add a new volume, but be sure you know how to mount it once you get your instance running.)
10. Click `Next: Add Tags` at the bottom of the page.
11. Leave the defaults and click `Next: Add Security Group`.
12. Select `Create a` **`new`** ` security group`. You can name the group and add a description so you can use it again later. Leave the default SSH rule as it is.
13. Click `Add Rule`, and select `HTTPS` from the pulldown menu. Leave the defaults that appear for that rule.
14. Click `Add Rule`, and select `Custom TCP Rule`. Change the `Port Range` to `8888`.
15. For the `Source` column, you may want to adjust the source to be `Anywhere` for each pulldown menu. That allows you to access your instance from anywhere, but it is not very secure. It is convenient because you can run the instance from your work machine and then from home later on. However, you may want to be more secure and instead select `My IP` or provide a custom rule for IP addresses that may access the instance.
16. Click `Review and Launch` at the bottom of the page and make sure everything is specified as you like.
17. Click `Launch`.
18. You will be prompted for a key pair. For your first time launching, you will need to create a key pair. This will be provided to you only once, so download it and store it locally on your machine. **DO NOT, I repeat, DO NOT store it in any git repository or in anything that is backed up to the cloud, like Dropbox. ONLY store it locally on your machine and never let it out to the internet.** If it is your second time launching, you can use an existing key pair.
19. Finally, click `Launch Instance`.

After following those steps, your instance will spin up. You can view your running instances by clicking on `Instances` on the dashboard on the left part of the screen. 

It will take a while for your instance to spin up. When the `Instance State` says `running` and the `Status Checks` are complete, your instance is ready for you.

### 3. Connect to your instance

Now that your instance is launched, you can connect to it using your computer and the *ssh* protocol. The instructions work for Windows, macOS, or Linux, assuming you have a terminal running bash. In Windows, this is accomplished using [GitBash](https://git-scm.com). For macOS, use Terminal.

1. Identify where you put your keypair file. For the purposes of this exercise, I will assume that you have a directory in your home directory called `key_pairs/` and that your keypair file is `~/key_pairs/bebi103_aws_keypair.pem`.
2. Change permissions on your keypair for security. Do this in the terminal using

    > `chmod 400 ~/key_pairs/bebi103_aws_keypair.pem`

3. Open a new GitBash (Windows) or Terminal (macOS) window. 
4. SSH into your instance in the terminal. To do this, click on your instance on the `Instances` page in the Management Console. At the bottom of the webpage will appear information about your instance, including the IPv4 Public IP. It will look something like `54.92.67.113`. Copy this. In what follows, I refer to this as `<IPv4 Public IP>`. SSH into your instance by doing 

    > `ssh -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>`

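5. (optional, may only work for macOS) To avoid having to use `-i "~/key_pairs/bebi103_aws_keypair.pem"` each time, you can add your keypair to your bash profile. Because the command below uses `$PWD`, first `cd` into the directory containing your keypair file (here, `~/key_pairs/`), and then do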

    > `echo ssh-add -K $PWD/bebi103_aws_keypair.pem >> ~/.bash_profile;`
    > `source ~/.bash_profile`
    
6. Now that you are SSH'd into the instance, you can clone your team's repository. Using team 6 as an example, execute 

    > `git clone https://github.com/bebi103/06-bebi103-2018.git`
    
This will clone your repository to your instance. Note that if you log back in to your instance, you do not need to clone again, but can instead do `cd ~/06-bebi103-2018; git pull`. In fact, you should pretty much always `git pull` as the first thing you do when you ssh into your instance so that you are up to date with your group's work.
 
7. The data for the course are located in the `~/data` folder. To keep the same relative paths you have been using from inside your repository (and as the TAs are expecting for grading), you should make a symbolic link to the `~/data` directory. To do that, for example if you are in team 6, do the following on the command line.

    > `cd ~/06-bebi103-2018; ln -s $HOME/data ./data`
    
8. You now have your repository and all of the data for the course on your instance!
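
SSH refuses to use a private key file that other users on your machine can read, which is why step 2 sets the permissions to `400`. If a connection fails with a complaint about the key, a quick check like the following Python sketch may help. The function name `key_is_private` is made up for this illustration, and the key path is the one assumed in the steps above.

```python
import os
import stat


def key_is_private(path):
    """Return True if only the file's owner has any permissions on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # 0o077 masks the group and other permission bits; all must be off.
    return mode & 0o077 == 0


if __name__ == "__main__":
    # Path assumed in the steps above; adjust to where you stored your key.
    key_path = os.path.expanduser("~/key_pairs/bebi103_aws_keypair.pem")
    if os.path.exists(key_path):
        print(key_is_private(key_path))
    else:
        print("No key file found at", key_path)
```

On Windows, Python reports these permission bits only approximately, so the check is mainly meaningful on macOS and Linux.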

### 4. Launch JupyterLab

When you launch JupyterLab, you want to use [`screen`](https://en.wikipedia.org/wiki/GNU_Screen). By running screen, your JupyterLab session will not get interrupted if you disconnect from your instance. So, on the command line in your instance, execute

    screen
    
Then, you can launch JupyterLab by executing

    jupyter lab --no-browser
    
on the command line. This will launch JupyterLab. You will see output like this:

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4

Keep this window open.

In order to use JupyterLab through a browser on your machine, you need to set up a socket, which here means forwarding a port on your machine to your instance over SSH. To do so, open up another GitBash or Terminal window and execute the following.
    
    ssh -i "~/key_pairs/bebi103_aws_keypair.pem" -L 8000:localhost:8888 ec2-user@<IPv4 Public IP>
    
This sets up a socket connecting port `8888` on your EC2 instance to port `8000` on your local machine. You can change these numbers as necessary. For example, in the URL listed above that you got when you launched JupyterLab, the port may be `localhost:8889`, in which case you need to substitute `8889` for `8888` in your ssh command. You may also want a different local port, e.g., `8001`, if you already have a JupyterLab instance running on port `8000`. In what follows, I will use port numbers `8000` and `8888`, which you will probably use 90% of the time, but you can make changes as you see fit.
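
Since it is easy to swap the two port numbers by mistake, here is a minimal Python sketch that assembles the command from its parts. The function `tunnel_command` is a hypothetical helper written for this illustration, and the IP address is only a placeholder. The order in the `-L` argument is local port first, then the port on the instance.

```python
def tunnel_command(key_path, public_ip, local_port=8000, remote_port=8888):
    """Build the ssh port-forwarding command as a list of arguments."""
    return [
        "ssh",
        "-i", key_path,
        # -L local_port:host:remote_port forwards the local port to the
        # given port on the instance.
        "-L", f"{local_port}:localhost:{remote_port}",
        f"ec2-user@{public_ip}",
    ]


# Example: JupyterLab reported port 8889, and local port 8000 is taken.
cmd = tunnel_command(
    "~/key_pairs/bebi103_aws_keypair.pem",
    "54.92.67.113",  # placeholder; use your instance's IPv4 Public IP
    local_port=8001,
    remote_port=8889,
)
print(" ".join(cmd))
```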

After you have set up the socket, you can paste the URL given when you launched JupyterLab on your EC2 instance into your browser, but substitute `8000` for `8888`. That is, direct your browser to

    http://localhost:8000/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4
    
You will now have JupyterLab up and running!
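
If you would rather not edit the URL by hand, the port substitution can be sketched in Python with the standard library. The function name `local_url` is made up for this example; it changes only the port and leaves the token untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def local_url(jupyter_url, local_port=8000):
    """Swap the port in a JupyterLab URL for the forwarded local port."""
    parts = urlsplit(jupyter_url)
    # Keep the host name and replace only the port; path and token are untouched.
    new_netloc = f"{parts.hostname}:{local_port}"
    return urlunsplit(parts._replace(netloc=new_netloc))


print(local_url("http://localhost:8888/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4"))
```

For the example URL, this prints the same address given above, with port `8000`.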

### 5. If you get detached

If you lose your internet connection, you can reconnect to your session, with JupyterLab running, by reattaching your screen. To do this, execute `screen -r` on the command line after SSH-ing back in to your EC2 instance.

You can see what screens are active by doing `screen -ls` on the command line. You can also detach the current screen by using `screen -d`.

### 6. Copying files between AWS and your local machine

As you work on notebooks and create new files you want to save, you may want to move them to your local machine. If you are working on a notebook or `.stan` file, the best option is to use git and commit and push those files to your repository directly from the command line on your EC2 instance.

Some files, though, such as MCMC results or intermediate data processing results, are not meant to be under version control. For these files, you can use `scp`. Within your GitBash or Terminal window on your local machine (you probably have to open yet another), you can copy files as follows.

    scp -i "~/key_pairs/bebi103_aws_keypair.pem" ec2-user@<IPv4 Public IP>:~/my_file.csv ./

This command will copy files from your EC2 instance to your present working directory. Simply put the full path to the file you want to transfer after the colon above (remember `~/` means "home directory"). The second argument of `scp` is where you want to copy the file.

Similarly, you can upload files to your EC2 instance as follows (in this example to the home directory in your instance).

    scp -i "~/key_pairs/bebi103_aws_keypair.pem" my_file.txt ec2-user@<IPv4 Public IP>:~/
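
Both directions follow the same pattern: source first, destination second, with the instance side written as `ec2-user@<IPv4 Public IP>:path`. The Python sketch below assembles such commands, including the `-r` flag of `scp` for copying a whole directory. The helper names and the `~/mcmc_results` directory are hypothetical, and the IP address is a placeholder.

```python
def remote_path(public_ip, path, user="ec2-user"):
    """Format a path on the instance the way scp expects: user@host:path."""
    return f"{user}@{public_ip}:{path}"


def scp_command(key_path, source, destination, recursive=False):
    """Build an scp command as a list of arguments."""
    cmd = ["scp", "-i", key_path]
    if recursive:
        # -r copies a directory and everything inside it.
        cmd.append("-r")
    return cmd + [source, destination]


ip = "54.92.67.113"  # placeholder; use your instance's IPv4 Public IP
key = "~/key_pairs/bebi103_aws_keypair.pem"

# Download a whole (hypothetical) results directory to the current directory.
print(" ".join(scp_command(key, remote_path(ip, "~/mcmc_results"), "./", recursive=True)))
```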


### 7. Exiting

When you are finished with your session, you can shut down your notebook in the browser. Then, in the terminal window, you can shut down JupyterLab by pressing `Ctrl-c`. After Jupyter is terminated, you should detach your screen by doing `screen -d`. Finally, you should quit your screen by doing `screen -X quit`.

In the past, students have left their instances littered with detached screens. You should clean house from time to time and run `screen -X quit`.
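
When several screens have piled up, it helps to know which ones are detached. The following Python sketch picks the detached sessions out of the text printed by `screen -ls`. It assumes each session line starts with `<pid>.<name>` and ends with `(Detached)`, which is the usual layout but may differ between versions of `screen`; the function name is made up for this example.

```python
import re


def detached_sessions(screen_ls_output):
    """Return the names of detached sessions listed by `screen -ls`."""
    sessions = []
    for line in screen_ls_output.splitlines():
        # Session lines are assumed to look like
        # "<pid>.<name> ... (Detached)"; other lines are ignored.
        match = re.match(r"\s*(\d+\.\S+)\s.*\(Detached\)", line)
        if match:
            sessions.append(match.group(1))
    return sessions


# On the instance, the input could be obtained with
#     subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
```

Each name returned can then be closed individually with `screen -S <name> -X quit`.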

After you are finished with your work on your instance, you should stop your instance. To do this, go back to the `Instances` page on your EC2 console. Right click your instance, and navigate to `Instance State` → `Stop`. **Do not terminate your instance** unless you really want to. Terminating an instance will get rid of any changes you made to it.

### 8. Seriously. Stop your instances if you are not using them.

If your instance is not stopped and you leave it running, you will get charged. You can rack up a massive bill with idle, but running, instances. You should stop your instances whenever you are not using them. It is a minor pain to wait for them to spin up again, but forgetting about a running instance will cause more pain than that to your pocketbook.
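
If you are comfortable with scripting, you can also check for forgotten instances programmatically. This is only a sketch and goes beyond the setup in this tutorial: it assumes you have installed `boto3` (the AWS SDK for Python) and configured credentials for it, neither of which is covered here. The helper `running_instance_ids` is a made-up name; it simply walks the dictionary returned by the `describe_instances` call.

```python
def running_instance_ids(response):
    """Collect IDs of running instances from a describe_instances response."""
    ids = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance["State"]["Name"] == "running":
                ids.append(instance["InstanceId"])
    return ids


# Usage with boto3 (assumed installed and configured; not part of this
# tutorial's setup):
#
#     import boto3
#     ec2 = boto3.client("ec2", region_name="us-west-2")  # Oregon
#     ids = running_instance_ids(ec2.describe_instances())
#     if ids:
#         ec2.stop_instances(InstanceIds=ids)
```

Checking the `Instances` page in the console remains the simplest way to confirm that nothing is still running.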

### 9. Terminate your instances after the class is over

After the class is over, you might want to terminate your instance. This is because the storage in your instance (stored using AWS's EBS, which is what keeps your repository, installations, etc., all intact) is not free. Once your free tier eligibility expires (after a year, if you are new to AWS) and/or your promo codes expire, you will start getting bills for your EBS usage. The storage is wiped when you terminate your instance, and you will no longer be billed for it.

## Using promo codes with your group

We will distribute promo codes graciously donated by Amazon for you to use resources beyond the free tier. One member of your group will get the promo code, but all members should be able to use the AWS resource. To enable this feature, the owner of your promo code needs to set up identity and access management (IAM). To set this up, perform the following steps.

1. Once logged in to your AWS console, click the `Services` pulldown menu at the upper left of the screen. Under `Security, Identity & Compliance` click `IAM`.
2. On the left side of the screen on the IAM console, click `Groups`.
3. Click the `Create New Group` button, and give the group a name on the next screen. Something like "BE/Bi 103 2018 Team 6" will suffice. Once you have done that, click the `Next Step` button on the lower right of the screen.
4. On the `Attach Policies` page, select `AmazonEC2FullAccess`. Then click the `Next Step` button.
5. After reviewing, you can click the `Create Group` button, and the group is created.
6. Now click `Users` on the left IAM dashboard.
7. Click the `Add User` button.
8. Fill in user names for your teammates and then click the `AWS Management Console access` button.
9. For `Console password`, either choose your own **strong** password or use the autogenerated password. This will only be generated once, so write it down to share with your teammates. Obviously, do not email the password.
10. You can leave the `Require password reset` box checked or not, depending on your preference. Then click the `Next: Permissions` button.
11. Add the users to the group you just set up. Then click `Next: Review`.
12. Make sure everything looks good and then click `Create users`.
13. The next page shows the login credentials. Importantly, it gives a link and a password that the users need to log in using IAM. For the users to log in to the console, they need to go to the link displayed and enter their credentials. If you click `Download CSV`, you get a `CSV` file with the links and passwords. If you click `Send email`, you can send an email to your teammates that has the instructions for login. The password will **not** be sent. They simply need to follow these instructions and use the password you give them to run instances on your account, which gives them access to the promo code.

After the term is over, you should delete the extra users, which you can do from the IAM `Users` page.

## A limitation with Bokeh plots in JupyterLab when using AWS

Serving up interactive Bokeh plots with AWS is a bit more involved than doing it on your local machine. You still get the native interaction, like zooming and tooltips, but if you want to have interactions that require Python code running behind them, you will not be able to do that as configured here. Specifically, you will not be able to use the `bebi103.viz.record_clicks()` and `bebi103.viz.draw_rois()` functions. As a workaround, you can use the `bebi103.viz.imshow()` function with the `display_clicks=True` kwarg. With this activated, whenever you click on the plot of the image, the location of the click will appear right next to the plot. You can then copy and paste these clicks to make a Numpy array.

Perhaps a better way to do this, though, is to use `bebi103.viz.draw_rois()` on your local machine, write the result out as a CSV file, and then `scp` it to your instance and load it from there.
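
Loading the uploaded file on the instance might look like the sketch below. The column names `x` and `y` and the file name `rois.csv` are assumptions made for illustration; use whatever names you chose when you wrote the CSV file on your local machine.

```python
import csv


def read_points(csv_file, x_col="x", y_col="y"):
    """Read point coordinates from an open CSV file as [x, y] pairs."""
    rows = csv.DictReader(csv_file)
    # Column names are assumptions; pass the ones your file actually uses.
    return [[float(row[x_col]), float(row[y_col])] for row in rows]


# Usage on the instance (hypothetical file name):
#
#     with open("rois.csv", newline="") as f:
#         points = read_points(f)
#
# numpy.array(points) then gives an array with one row per point.
```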

## Building the AMI

You should now have the power to use AWS for your computing. If you are curious, to build the AMI, I ran the steps below on the command line of my instance when I first SSH'd into it.

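```
# Update the installed system packages
sudo yum -y update;
# Install git so the setup repository can be cloned
sudo yum -y install git;
# Fetch and run the AMI setup script
git clone https://github.com/justinbois/bebi103_ami_setup.git;
bash ~/bebi103_ami_setup/setup_ami.sh;
# Clean up the cloned repository and the Miniconda installer script
rm -rf bebi103_ami_setup;
rm -f miniconda.sh;
```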