Tutorial 2: exercise

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial exercise was generated from a Jupyter notebook. You can download the notebook here. Use the downloaded notebook to fill out your responses.

In [1]:
import numpy as np
import pandas as pd

import bebi103

import bokeh.io
import holoviews as hv
bokeh.io.output_notebook()
hv.extension('bokeh')

Exercise 1

The Anderson-Fisher iris data set is a classic data set used in statistical and machine learning applications. Edgar Anderson carefully measured the lengths and widths of the petals and sepals of 50 irises in each of three species, I. setosa, I. versicolor, and I. virginica. Ronald Fisher then used this data set to distinguish the three species from each other.

a) Load the data set, which you can download here, into a Pandas DataFrame called df. Be sure to inspect the structure of the file before loading it. Because the file has two header rows, you will need to use the header=[0,1] kwarg of pd.read_csv() to load it properly.

In [2]:
df = pd.read_csv('../data/anderson-fisher-iris.csv', header=[0,1])
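To see what header=[0,1] does, here is a minimal sketch on a small in-memory CSV. The layout of csv_text (species on the first header row, measurement names on the second) and the name toy are assumptions made for illustration; check the real file to confirm its structure.

```python
import io

import pandas as pd

# Hypothetical miniature file: two header rows, then data rows.
csv_text = """setosa,setosa,versicolor,versicolor
sepal length (cm),sepal width (cm),sepal length (cm),sepal width (cm)
5.1,3.5,7.0,3.2
4.9,3.0,6.4,3.2
"""

# header=[0, 1] tells Pandas to build two-level (hierarchical) column labels
# from the first two rows of the file.
toy = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

print(toy.columns.nlevels)
print(toy[("versicolor", "sepal length (cm)")].tolist())
```

Each column is then addressed by a (species, measurement) tuple, which is why the loaded DataFrame displays with two header rows.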

b) Take a look at df. Is it tidy? Why or why not?

In [3]:
df.head()
Out[3]:
setosa versicolor virginica
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2 7.0 3.2 4.7 1.4 6.3 3.3 6.0 2.5
1 4.9 3.0 1.4 0.2 6.4 3.2 4.5 1.5 5.8 2.7 5.1 1.9
2 4.7 3.2 1.3 0.2 6.9 3.1 4.9 1.5 7.1 3.0 5.9 2.1
3 4.6 3.1 1.5 0.2 5.5 2.3 4.0 1.3 6.3 2.9 5.6 1.8
4 5.0 3.6 1.4 0.2 6.5 2.8 4.6 1.5 6.5 3.0 5.8 2.2

This DataFrame is not tidy because each row contains twelve measurements taken from three different flowers, one of each species. A tidy DataFrame has one row per observation, and each column is an attribute of that observation.

c) Perform the following operations to generate a new DataFrame from the one you loaded in part (a). Do these operations one by one and explain what each one does to the DataFrame. The Pandas documentation might help.

In [4]:
df_tidy = df.stack(level=0)

Stacking takes one level of the hierarchical column index and moves it into the row index. We chose level 0, the outermost level of the column index, which in this case corresponds to the species. After stacking, the species is the innermost level (level 1) of the row index, and the remaining columns are the four measurements.
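As a minimal sketch of what stacking does, consider a toy DataFrame with two species and two measurements. The values and the names toy and stacked are made up for illustration. Newer Pandas versions may emit a FutureWarning about a change in the stack() implementation; the result below is the same either way.

```python
import pandas as pd

# Toy data with two column levels: species (level 0) and measurement (level 1).
columns = pd.MultiIndex.from_product(
    [["setosa", "versicolor"], ["petal length (cm)", "petal width (cm)"]]
)
toy = pd.DataFrame(
    [[1.4, 0.2, 4.7, 1.4],
     [1.4, 0.2, 4.5, 1.5]],
    columns=columns,
)

# Move the species level of the columns into the row index.
stacked = toy.stack(level=0)

print(stacked)
```

Each original row becomes one row per species, indexed by the pair (original row number, species).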

In [5]:
df_tidy = df_tidy.sort_index(level=1)

sort_index() sorts the rows by their index. Here, we are sorting according to level 1 of the hierarchical row index, which is the species. This gives us a DataFrame ordered by species: setosa, followed by versicolor, followed by virginica.
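The effect of sorting on one index level can be sketched on a toy DataFrame (the values and the names toy and by_species are illustrative, not taken from the data set):

```python
import pandas as pd

# Row index with two levels: original row number (level 0), species (level 1).
index = pd.MultiIndex.from_tuples(
    [(0, "setosa"), (0, "versicolor"), (1, "setosa"), (1, "versicolor")]
)
toy = pd.DataFrame({"petal length (cm)": [1.4, 4.7, 1.3, 4.5]}, index=index)

# Sort rows by species; the remaining level breaks ties by default.
by_species = toy.sort_index(level=1)

print(by_species)
```

All setosa rows now come first, and within each species the original row order is preserved.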

In [6]:
df_tidy = df_tidy.reset_index(level=1)

reset_index() converts an index to a data column. In this case, we converted the index corresponding to the species into a column. By default, the column is named 'level_1'.

In [7]:
df_tidy = df_tidy.rename(columns={'level_1': 'species'})

Finally, we rename the species column to have a descriptive name.
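The last two steps can be sketched together on a toy DataFrame. The values and the names toy, flat, and renamed are illustrative only.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(0, "setosa"), (1, "setosa"), (0, "versicolor"), (1, "versicolor")]
)
toy = pd.DataFrame({"petal length (cm)": [1.4, 1.3, 4.7, 4.5]}, index=index)

# The index levels are unnamed, so level 1 becomes a column called 'level_1'.
flat = toy.reset_index(level=1)
print(flat.columns.tolist())

# Give the new column a descriptive name.
renamed = flat.rename(columns={"level_1": "species"})
print(renamed)
```

Only level 1 is reset, so the original row number stays behind as the row index.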

d) Is the resulting DataFrame tidy? Why or why not?

Let's look at the DataFrame.

In [8]:
df_tidy.head()
Out[8]:
   species  petal length (cm)  petal width (cm)  sepal length (cm)  sepal width (cm)
0   setosa                1.4               0.2                5.1               3.5
1   setosa                1.4               0.2                4.9               3.0
2   setosa                1.3               0.2                4.7               3.2
3   setosa                1.5               0.2                4.6               3.1
4   setosa                1.4               0.2                5.0               3.6

This DataFrame could be considered tidy. Each row corresponds to a single observation of a given flower, and each column is an attribute of that flower. As seen below in part (e), it is straightforward to slice values of interest out of df_tidy.

e) Using df_tidy, slice out all of the sepal lengths for I. versicolor as a NumPy array.

In [15]:
df_tidy.loc[df_tidy['species']=='versicolor', 'sepal length (cm)'].values
Out[15]:
array([ 7. ,  6.4,  6.9,  5.5,  6.5,  5.7,  6.3,  4.9,  6.6,  5.2,  5. ,
        5.9,  6. ,  6.1,  5.6,  6.7,  5.6,  5.8,  6.2,  5.6,  5.9,  6.1,
        6.3,  6.1,  6.4,  6.6,  6.8,  6.7,  6. ,  5.7,  5.5,  5.5,  5.8,
        6. ,  5.4,  6. ,  6.7,  6.3,  5.6,  5.5,  5.5,  6.1,  5.8,  5. ,
        5.6,  5.7,  5.7,  6.2,  5.1,  5.7])
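The slicing pattern used above can be broken into steps on a toy DataFrame. The values and the names toy, mask, and lengths are illustrative. This sketch uses .to_numpy(), which in recent Pandas versions is an alternative to .values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "species": ["setosa", "versicolor", "versicolor"],
    "sepal length (cm)": [5.1, 7.0, 6.4],
})

# Boolean Series: True for the rows we want to keep.
mask = toy["species"] == "versicolor"

# .loc takes the row selector first, then the column label.
lengths = toy.loc[mask, "sepal length (cm)"].to_numpy()

print(lengths)
```

Because the DataFrame is tidy, selecting a subset only requires a Boolean condition on one column and the name of another.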


Exercise 2

a) Make a scatter plot of sepal width versus petal length with the glyphs colored by species.

In [10]:
%%opts Scatter [show_grid=True, width=500, height=350] (size=5)
%%opts NdOverlay [legend_position='right']

scatter = hv.Scatter(df_tidy, 
                     kdims=['petal length (cm)'], 
                     vdims=['sepal width (cm)', 'species'])

scatter = bebi103.viz.adjust_range(scatter)

gb = scatter.groupby('species')

gb.overlay()
Out[10]:

b) Make a plot comparing the petal widths of the respective species. Comment on why you chose the plot you chose.

In [14]:
p = None
palette=['#30a2da', '#fc4f30', '#e5ae38']
for i, species in enumerate(df_tidy['species'].unique()):
    p = bebi103.viz.ecdf(df_tidy.loc[df_tidy['species']==species, 'petal width (cm)'],
                         x_axis_label='petal width (cm)', 
                         formal=False,
                         p=p,
                         legend=species,
                         line_width=2,
                         color=palette[i])

p.legend.location = 'bottom_right'
bokeh.io.show(p)

I chose ECDFs because they show every data point, so they most clearly reveal how the petal widths of each species are distributed and how the species differ from one another.
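For reference, the points of an ECDF can be computed directly with NumPy. This is a minimal sketch and not the implementation used by bebi103.viz.ecdf(); the function name ecdf_points and the sample values are hypothetical.

```python
import numpy as np


def ecdf_points(data):
    """Return sorted values x and the fraction of data <= each x."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Made-up petal widths for illustration.
x, y = ecdf_points([1.5, 1.3, 1.4, 1.5])

print(x)
print(y)
```

The y value at each point is the fraction of measurements less than or equal to the corresponding x value, so no binning choice is needed, unlike with a histogram.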