E2. To be completed after lesson 6


Exercise 2.1

The Palmer penguins data set is a nice data set with which to practice various data science skills. For this exercise, we will use a subset of it, which you can download here: https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv. The data set consists of measurements of three different species of penguins acquired at the Palmer Station in Antarctica. The measurements were made between 2007 and 2009 by Kristen Gorman.

a) Load the data set into a Pandas DataFrame called df. You will need to use the header=[0,1] kwarg of pd.read_csv() to load the data set in properly.
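As a hint on what the header kwarg does, here is a minimal sketch using a small made-up CSV (not the penguin data). The file contents and the names csv_text and toy are assumptions for illustration only. Passing header=[0, 1] tells pd.read_csv() to treat the first two rows as column labels, giving a DataFrame with a two-level column index.

```python
import io

import pandas as pd

# Hypothetical toy file with two header rows: the first row names a group,
# the second row names the measurement within that group.
csv_text = """A,A,B,B
x,y,x,y
1,2,3,4
5,6,7,8
"""

# header=[0, 1] uses rows 0 and 1 together as the column labels
toy = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# The columns are now a two-level MultiIndex
print(toy.columns.nlevels)

# A single column is selected with a tuple key, one entry per level
print(toy[("A", "y")])
```

With a two-level column index, each column is addressed by a tuple, which is worth keeping in mind when you inspect df in part (b).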

b) Take a look at df. Is it tidy? Why or why not?

c) Perform the following operations on the DataFrame you loaded in part (a) to make a new DataFrame. You do not need to worry about what these operations do (that is the topic of next week); just do them to answer this question: Is the resulting data frame df_tidy tidy? Why or why not?

df_tidy = df.stack(
    level=0
).sort_index(
    level=1
).reset_index(
    level=1
).rename(
    columns={"level_1": "species"}
)

d) Using df_tidy, slice out all of the bill lengths for Gentoo penguins.

e) Make a new data frame containing the mean measured bill depth, bill length, body mass in kg, and flipper length for each species. You can use millimeters for all length measurements.
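As a hint on the general pattern, here is a minimal sketch of computing per-group means with a unit conversion, using a made-up data frame. The column names group, mass_g, and length_mm are hypothetical and are not assumed to match the penguin data set.

```python
import pandas as pd

# Hypothetical toy data: two groups, a mass in grams, and a length in mm
toy = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b"],
        "mass_g": [1000.0, 3000.0, 500.0, 1500.0],
        "length_mm": [10.0, 20.0, 30.0, 50.0],
    }
)

# Convert grams to kilograms in a new column
toy["mass_kg"] = toy["mass_g"] / 1000

# Split by group, take the mean of the chosen columns, and
# turn the group labels back into an ordinary column
means = toy.groupby("group")[["mass_kg", "length_mm"]].mean().reset_index()

print(means)
```

The same split-apply-combine idea should carry over to df_tidy once you have identified the relevant column names.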

Exercise 2.2

Make a scatter plot of bill length versus flipper length with the glyphs colored by species.

Exercise 2.3

Write down any questions or points of confusion that you have.