(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial exercise was generated from a Jupyter notebook. You can download the notebook here. Use the downloaded notebook to fill out your responses.
import numpy as np
import pandas as pd
The Anderson-Fisher iris data set is a classic data set used in statistical and machine learning applications. Edgar Anderson carefully measured the lengths and widths of the petals and sepals of 50 irises in each of three species, I. setosa, I. versicolor, and I. virginica. Ronald Fisher then used this data set to distinguish the three species from each other.
a) Load the data set, which you can download here, into a Pandas DataFrame called df. Be sure to check out the structure of the data set before loading. You will need to use the header=[0,1] kwarg of pd.read_csv() to load the data set in properly.
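As a minimal sketch of what the header=[0,1] kwarg does, the example below reads a small in-memory CSV whose first two rows are both header rows. The CSV text, its labels, and the name df_toy are made up for illustration and are not the actual iris file; they only assume a layout with species in the first header row and measured quantity in the second.

```python
import io

import pandas as pd

# Hypothetical two-row header: species on top, quantity below (not the real file).
csv_text = "\n".join(
    [
        "setosa,setosa,versicolor,versicolor",
        "sepal length (cm),petal length (cm),sepal length (cm),petal length (cm)",
        "5.1,1.4,7.0,4.7",
        "4.9,1.4,6.4,4.5",
    ]
)

# header=[0, 1] tells read_csv to build the column labels from rows 0 and 1.
df_toy = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# The columns are now a two-level MultiIndex, addressed with tuples.
print(df_toy.columns.nlevels)
print(df_toy[("versicolor", "sepal length (cm)")])
```

Inspecting the raw file in a text editor first is what tells you how many header rows to pass.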
b) Take a look at df. Is it tidy? Why or why not?
c) Melt the DataFrame into a tidy DataFrame called df_tidy with columns ['species', 'quantity', 'value']. Discuss why this is a tidy data frame.
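The following sketch shows how pd.melt() treats a DataFrame with two-level columns. The toy frame df_wide and its values are invented for illustration; the assumption is only that the top column level holds species and the second holds the measured quantity.

```python
import pandas as pd

# Toy wide-format frame with hypothetical values and a two-level column index.
columns = pd.MultiIndex.from_product(
    [["setosa", "versicolor"], ["sepal length (cm)", "petal length (cm)"]]
)
df_wide = pd.DataFrame(
    [[5.1, 1.4, 7.0, 4.7], [4.9, 1.4, 6.4, 4.5]], columns=columns
)

# With no id_vars, every column is melted. Passing a list as var_name gives
# one output column per level of the column MultiIndex.
df_long = pd.melt(df_wide, var_name=["species", "quantity"], value_name="value")

print(df_long)
```

Each row of df_long then holds a single measurement together with the labels that describe it.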
d) Using df_tidy, slice out all of the sepal lengths for I. versicolor as a Numpy array.
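One common way to slice a tidy frame is Boolean indexing with .loc, sketched below on a toy frame. The frame df_long, its values, and the exact label strings are assumptions for illustration; the labels in the real data set may be spelled differently.

```python
import numpy as np
import pandas as pd

# Toy tidy frame with hypothetical values.
df_long = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "versicolor", "versicolor"],
        "quantity": [
            "sepal length (cm)",
            "petal length (cm)",
            "sepal length (cm)",
            "petal length (cm)",
        ],
        "value": [5.1, 1.4, 7.0, 4.7],
    }
)

# Combine two Boolean conditions with &; each condition needs its own parentheses.
mask = (df_long["species"] == "versicolor") & (
    df_long["quantity"] == "sepal length (cm)"
)

# .loc selects the matching rows of one column; .values gives a NumPy array.
selected = df_long.loc[mask, "value"].values

print(type(selected), selected)
```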
a) Starting from the original DataFrame you loaded in exercise 1, perform the following operations to generate a new DataFrame. Do these operations one by one and explain what each one does to the DataFrame. The Pandas documentation might help.
df_new = df.stack(level=0)
df_new = df_new.sort_index(level=1)
df_new = df_new.reset_index(level=1)
df_new = df_new.rename(columns={'level_1': 'species'})
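To see the mechanics of stacking and resetting an index level in isolation, here is a minimal sketch on a toy frame. The frame df_wide and its values are invented, and the column levels are deliberately left unnamed. Depending on the pandas version, stack() may emit a FutureWarning about a newer implementation; the shapes shown here should be the same either way.

```python
import pandas as pd

# Toy frame with hypothetical values and an unnamed two-level column index.
columns = pd.MultiIndex.from_product(
    [["setosa", "versicolor"], ["sepal length (cm)", "petal length (cm)"]]
)
df_wide = pd.DataFrame(
    [[5.1, 1.4, 7.0, 4.7], [4.9, 1.4, 6.4, 4.5]], columns=columns
)

# stack(level=0) moves the top column level into the row index,
# so the rows are now labeled by (original row, species).
stacked = df_wide.stack(level=0)
print(stacked)

# reset_index(level=1) turns that index level back into an ordinary column.
# Because the level has no name, pandas labels the new column 'level_1'.
flat = stacked.reset_index(level=1)
print(flat.columns.tolist())
```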
b) Is the resulting DataFrame tidy? Why or why not?
c) Using df_new, slice out all of the sepal lengths for I. versicolor as a Numpy array.