Tutorial 1b: Exploratory data analysis

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [39]:
import numpy as np
import pandas as pd

# Import pyplot for plotting
import matplotlib.pyplot as plt

# Some pretty Seaborn settings
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 14, 'axes.titlesize': 14}
sns.set(rc=rc)

# Make Matplotlib plots appear inline
%matplotlib inline

In this tutorial, we will learn how to load data stored on disk into a Python data structure. We will use pandas to read in CSV (comma-separated value) files and store the results in the very handy Pandas DataFrame. We will then "play" with the data to get a feel for it. This process is called exploratory data analysis, a term coined by John Tukey, and is an important first step in the analysis of a data set.

The data set we will use comes from a fun paper about the adhesive properties of frog tongues. The reference is Kleinteich and Gorb, Tongue adhesion in the horned frog Ceratophrys sp., Sci. Rep., 4, 5225, 2014. You can download the paper here. You might also want to check out a New York Times feature on the paper here.

In this paper, the authors investigated various properties of the adhesive characteristics of the tongues of horned frogs when they strike prey. The authors had a striking pad connected to a cantilever to measure forces. They also used high-speed cameras to capture the strikes and record relevant data.

Importing modules

As I mentioned in the last tutorial, we need to import modules we need for data analysis. An important addition is Pandas, which we import as pd. As is good practice, I import everything at the top of the document.

The data file

The data from the paper are contained in the file frog_tongue_adhesion.csv, which you can download here. We can look at its contents. You can use

head -n 20 data/frog_tongue_adhesion.csv

from the command line. To invoke a command line command in a Jupyter notebook, precede it with an exclamation point.

In [40]:
!head -20 data/frog_tongue_adhesion.csv

The first lines all begin with # signs, signifying that they are comments and not data. They do give important information, though, such as the meaning of the ID data. The ID refers to which specific frog was tested.

Immediately after the comments, we have a row of comma-separated headers. This row sets the number of columns in this data set and labels the meaning of the columns. So, we see that the first column is the date of the experiment, the second column is the ID of the frog, the third is the trial number, and so on.

After this row, each row represents a single experiment where the frog struck the target.

CSV files are generally a good way to store data. Commas make better delimiters than white space (such as tabs) because they have no portability issues. Delimiter collision is avoided by putting the data fields in double quotes when necessary. There are other good ways to store data, such as JSON, but we will almost exclusively use CSV files in this class.
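To make the delimiter-collision point concrete, here is a minimal sketch using Python's built-in csv module (not used elsewhere in this tutorial): a field that itself contains a comma is wrapped in double quotes on write, and the quotes are stripped again on read.

```python
import csv
import io

# Write a row in which one field contains a comma
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerow(['2013_02_26', 'I', 'note, with comma'])

csv_text = buf.getvalue()
print(csv_text)   # 2013_02_26,I,"note, with comma"

# Reading the text back recovers the field intact, comma and all
row = next(csv.reader(io.StringIO(csv_text)))
print(row[2])     # note, with comma
```

The quoting happens automatically, which is why commas remain safe delimiters even when they appear inside data fields.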

Loading a data set

We will use pd.read_csv() to load the data set. The data are stored in a DataFrame, which is one of the data types that makes pandas so convenient for use in data analysis. DataFrames offer mixed data types, including incomplete columns, and convenient slicing, among many, many other convenient features. We will use the DataFrame to look at the data, at the same time demonstrating some of the power of DataFrames. They are like spreadsheets, only a lot better.

We will now load the DataFrame.

In [41]:
# Use pd.read_csv() to read in the data and store in a DataFrame
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')

Notice that we used the kwarg comment to specify that lines that begin with # are comments and are to be ignored. If you check out the doc string for pd.read_csv(), you will see there are lots of options for reading in the data.

Exploring the DataFrame

Let's jump right in and look at the contents of the DataFrame. We can look at the first several rows using the df.head() method.

In [42]:
# Look at the contents (first 5 rows)
df.head()
Out[42]:
date ID trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
0 2013_02_26 I 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030
1 2013_02_26 I 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
2 2013_03_01 I 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239
3 2013_03_01 I 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381
4 2013_03_01 I 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975

We see that the column headings were automatically assigned. pandas also automatically defined the indices (names of the rows) as integers going up from zero. We could have defined the indices to be any of the columns of data.
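As a sketch of what choosing an index column looks like, here is a hypothetical miniature DataFrame whose index is set from the ID column using the set_index() method; the same effect can be achieved at load time with the index_col kwarg of pd.read_csv().

```python
import pandas as pd

# Hypothetical miniature data set, not the real frog data
df_mini = pd.DataFrame({'ID': ['I', 'II', 'III'],
                        'impact force (mN)': [1205, 1612, 324]})

# Use the ID column as the index instead of the default integers
df_mini = df_mini.set_index('ID')

# Rows are now labeled by frog ID
print(df_mini.loc['II', 'impact force (mN)'])   # 1612
```

For the frog data we will keep the default integer index, since no single column uniquely identifies a row.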

To access a column of data, we use the following syntax.

In [43]:
# Slicing a column out of a DataFrame is achieved by using the column name
df['impact force (mN)']
Out[43]:
0     1205
1     2527
2     1745
3     1556
4      493
5     2276
6      556
7     1928
8     2641
9     1897
10    1891
11    1545
12    1307
13    1692
14    1543
15    1282
16     775
17    2032
18    1240
19     473
20    1612
21     605
22     327
23     946
24     541
25    1539
26     529
27     628
28    1453
29     297
      ... 
50     458
51     626
52     621
53     544
54     535
55     385
56     401
57     614
58     665
59     488
60     172
61     142
62      37
63     453
64     355
65      22
66     502
67     273
68     720
69     582
70     198
71     198
72     597
73     516
74     815
75     402
76     605
77     711
78     614
79     468
Name: impact force (mN), dtype: int64

The indexing of the rows is preserved, and we can see that we can easily extract all of the impact forces. Note, though, that pd.read_csv() interpreted the data to be integer (dtype = int64), so we may want to convert these to floats.

In [44]:
# Use df.astype() method to convert it to a NumPy float64 data type.
df['impact force (mN)'] = df['impact force (mN)'].astype(np.float64)

# Check that it worked
print('dtype = ', df['impact force (mN)'].dtype)
dtype =  float64

Now let's say we only want the impact force of strikes above one Newton. Pandas DataFrames can conveniently be sliced with Booleans.

In [45]:
# Generate True/False array of rows for indexing
inds = df['impact force (mN)'] > 1000.0

# Slice out rows we want (big force rows)
df_big_force = df.loc[inds, :]

# Let's look at it; there will be only a few high-force values
df_big_force
Out[45]:
date ID trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
0 2013_02_26 I 3 1205.0 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030
1 2013_02_26 I 4 2527.0 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
2 2013_03_01 I 1 1745.0 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239
3 2013_03_01 I 2 1556.0 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381
5 2013_03_01 I 4 2276.0 31 3.68 -592 969 0.96 -0.176 341 106 0.69 6676 -1737
7 2013_03_05 I 2 1928.0 46 3.11 -804 508 1.30 -0.285 246 178 0.28 7832 -3266
8 2013_03_05 I 3 2641.0 50 4.27 -690 491 1.12 -0.239 269 224 0.17 9824 -2568
9 2013_03_05 I 4 1897.0 41 3.06 -462 839 0.75 -0.328 266 176 0.34 7122 -1733
10 2013_03_12 I 1 1891.0 40 3.06 -766 1069 1.24 -0.380 408 33 0.92 4638 -1879
11 2013_03_12 I 2 1545.0 48 2.50 -715 649 1.15 -0.298 141 112 0.21 10947 -5064
12 2013_03_12 I 3 1307.0 29 2.11 -613 1845 0.99 -0.768 455 92 0.80 2874 -1348
13 2013_03_12 I 4 1692.0 31 2.73 -677 917 1.09 -0.457 186 129 0.31 9089 -3636
14 2013_03_12 I 5 1543.0 38 2.49 -528 750 0.85 -0.353 153 148 0.03 10095 -3453
15 2013_03_15 I 1 1282.0 31 2.07 -452 785 0.73 -0.253 290 105 0.64 4419 -1557
17 2013_03_15 I 3 2032.0 60 3.28 -652 486 1.05 -0.257 147 134 0.09 13784 -4425
18 2013_03_15 I 4 1240.0 34 2.00 -692 906 1.12 -0.317 364 260 0.28 3406 -1901
20 2013_03_19 II 1 1612.0 18 3.79 -655 3087 1.54 -0.385 348 15 0.96 4633 -1881
25 2013_03_21 II 2 1539.0 43 3.62 -664 741 1.56 -0.046 85 24 0.72 18073 -7802
28 2013_03_25 II 1 1453.0 72 3.42 -92 1652 0.22 -0.008 136 0 1.00 10645 -678
34 2013_04_03 II 1 1182.0 28 2.78 -522 1197 1.23 -0.118 281 0 1.00 4213 -1860

Here, we used the .loc feature of a DataFrame. It allows us to index by row and column. We chose all columns (using a colon, :), and put an array of Booleans for the rows. We get back a DataFrame containing the rows associated with True in the arrays of Booleans.

So now we only have the strikes of high force. Note, though, that the original indexing of rows was retained! In our new DataFrame with only the big force strikes, there is no index 4, for example.

In [46]:
# Executing the below will result in an exception
df_big_force.loc[4, 'impact force (mN)']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/Justin/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1394                 if key not in ax:
-> 1395                     error()
   1396             except TypeError as e:

/Users/Justin/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in error()
   1389                 raise KeyError("the label [%s] is not in the [%s]" %
-> 1390                                (key, self.obj._get_axis_name(axis)))
   1391 

KeyError: 'the label [4] is not in the [index]'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-46-9473d2f3a7f3> in <module>()
      1 # Executing the below will result in an exception
----> 2 df_big_force.loc[4, 'impact force (mN)']

/Users/Justin/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1292 
   1293         if type(key) is tuple:
-> 1294             return self._getitem_tuple(key)
   1295         else:
   1296             return self._getitem_axis(key, axis=0)

/Users/Justin/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
    782     def _getitem_tuple(self, tup):
    783         try:
--> 784             return self._getitem_lowerdim(tup)
    785         except IndexingError:
    786             pass

/Users/Justin/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_lowerdim(self, tup)
    906         for i, key in enumerate(tup):
    907             if is_label_like(key) or isinstance(key, tuple):
--> 908                 section = self._getitem_axis(key, axis=i)
    909 
    910                 # we have yielded a scalar ?

/Users/Justin/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1464 
   1465         # fall thru to straight lookup
-> 1466         self._has_valid_type(key, axis)
   1467         return self._get_label(key, axis=axis)
   1468 

/Users/Justin/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1401                 raise
   1402             except:
-> 1403                 error()
   1404 
   1405         return True

/Users/Justin/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in error()
   1388                                     "key")
   1389                 raise KeyError("the label [%s] is not in the [%s]" %
-> 1390                                (key, self.obj._get_axis_name(axis)))
   1391 
   1392             try:

KeyError: 'the label [4] is not in the [index]'

This might seem counterintuitive, but it is actually a good idea. Remember, indices do not have to be integers!
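To make this concrete, here is a tiny sketch with a hypothetical DataFrame whose integer index has gaps, like our sliced df_big_force. If you really do want sequential integer labels again, the reset_index() method (with drop=True) renumbers the rows.

```python
import pandas as pd

# Hypothetical sliced DataFrame with gaps in its integer index
df_slice = pd.DataFrame({'impf': [1205.0, 2276.0, 1928.0]},
                        index=[0, 5, 7])

# .loc looks up labels, and the label 4 simply does not exist here
print(4 in df_slice.index)            # False

# reset_index(drop=True) relabels the rows 0, 1, 2, ...
df_renumbered = df_slice.reset_index(drop=True)
print(df_renumbered.loc[1, 'impf'])   # 2276.0
```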

There is a way around this, though. We can use the iloc attribute of a DataFrame. This gives indexing with sequential integers. This is generally not advised, and I debated even showing this.

In [47]:
# Using iloc enables indexing by the corresponding sequence of integers
df_big_force['impact force (mN)'].iloc[4]
Out[47]:
2276.0

Tidy data

In our DataFrame, the data are tidy. The concept of tidy data originally comes from development of databases, but has taken hold for more general data processing in recent years. See this paper by Hadley Wickham for a great discussion of this. Tidy data refers to data sets arranged in tabular form that have the following format.

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a separate table.

As we will see, structuring the data in this way allows for very easy access. Furthermore, if you know a priori that a data set is tidy, there is little need for munging; you already know how to pull out what you need.

We will work on tidying messy data ("messy data" is anything that is not tidy) next week. For now, we will take advantage of the fact that Kleinteich and Gorb generously provided us with tidy data.
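As a small preview of what tidying can look like, here is a sketch using pd.melt() on a hypothetical messy table with one column per frog. This is an assumed example, not the format of the real data file.

```python
import pandas as pd

# Hypothetical messy data: one column per frog, one row per trial
df_messy = pd.DataFrame({'trial': [1, 2],
                         'I': [1205, 2527],
                         'II': [1612, 605]})

# pd.melt() converts it to tidy format: one observation per row
df_tidy = pd.melt(df_messy, id_vars='trial', var_name='ID',
                  value_name='impact force (mN)')

print(df_tidy)
```

After melting, each row holds a single measurement labeled by trial and frog ID, which is exactly the tidy structure described above.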

More data extraction

Given that the data are tidy and one row represents a single experiment, we might like to extract a row and see all of the measured quantities from a given strike. This is also very easy with DataFrames using their loc attribute.

In [48]:
# Let's slice out experiment with index 42
exp_42 = df.loc[42, :]

# And now let's look at it
exp_42
Out[48]:
date                                                    2013_05_27
ID                                                             III
trial number                                                     3
impact force (mN)                                              324
impact time (ms)                                               105
impact force / body weight                                    2.61
adhesive force (mN)                                           -172
time frog pulls on target (ms)                                 619
adhesive force / body weight                                  1.38
adhesive impulse (N-s)                                      -0.079
total contact area (mm2)                                        55
contact area without mucus (mm2)                                23
contact area with mucus / contact area without mucus          0.37
contact pressure (Pa)                                         5946
adhesive strength (Pa)                                       -3149
Name: 42, dtype: object

Notice how clever the DataFrame is. We sliced a row out, and now the indices describing its elements are the column headings. This is a Pandas Series object.

Now, slicing out index 42 is not very meaningful because the indices are arbitrary. Instead, we might look at our lab notebook, and want to look at trial number 3 on May 27, 2013, of frog III. This is a more common, and indeed meaningful, use case.

In [49]:
# Set up Boolean slicing
date = df['date'] == '2013_05_27'
trial = df['trial number'] == 3
ID = df['ID'] == 'III'

# Slice out the row
df.loc[date & trial & ID, :]
Out[49]:
date ID trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
42 2013_05_27 III 3 324.0 105 2.61 -172 619 1.38 -0.079 55 23 0.37 5946 -3149

The difference here is that the returned slice is a DataFrame and not a Series. We can easily make it a Series using iloc. Note that in most use cases, this is just a matter of convenience for viewing and nothing more.

In [50]:
df.loc[date & trial & ID, :].iloc[0]
Out[50]:
date                                                    2013_05_27
ID                                                             III
trial number                                                     3
impact force (mN)                                              324
impact time (ms)                                               105
impact force / body weight                                    2.61
adhesive force (mN)                                           -172
time frog pulls on target (ms)                                 619
adhesive force / body weight                                  1.38
adhesive impulse (N-s)                                      -0.079
total contact area (mm2)                                        55
contact area without mucus (mm2)                                23
contact area with mucus / contact area without mucus          0.37
contact pressure (Pa)                                         5946
adhesive strength (Pa)                                       -3149
Name: 42, dtype: object

You may take issue with the rather lengthy syntax for accessing column names. I.e., if you were trying to access the ratio of the contact area with mucus to the contact area without mucus for trial number 3 on May 27, 2013, you would do the following.

In [51]:
# Set up criteria for our search of the DataFrame
date = df['date'] == '2013_05_27'
trial = df['trial number'] == 3
ID = df['ID'] == 'III'

# When indexing DataFrames, use & for Boolean and (and | for or; ~ for not)
df.loc[date & trial & ID, 'contact area with mucus / contact area without mucus']
Out[51]:
42    0.37
Name: contact area with mucus / contact area without mucus, dtype: float64

That syntax can be annoying. But many would argue that this is preferred because there is no ambiguity about what you are asking for. However, you may want to use shorter names. Conveniently, when your column names do not have spaces, you can use attribute access. For example....

In [52]:
# Attribute access
df.date[:10]
Out[52]:
0    2013_02_26
1    2013_02_26
2    2013_03_01
3    2013_03_01
4    2013_03_01
5    2013_03_01
6    2013_03_05
7    2013_03_05
8    2013_03_05
9    2013_03_05
Name: date, dtype: object

Attribute access is really for convenience and is not the default, but it can make writing the code much less cumbersome. So, let's change the names of the columns as follows.

'trial number' → 'trial'
'contact area with mucus / contact area without mucus' → 'ca_ratio'

DataFrames have a nice rename method to do this.

In [53]:
# Make a dictionary to rename columns
rename_dict = {'trial number' : 'trial',
        'contact area with mucus / contact area without mucus' : 'ca_ratio'}

# Rename the columns
df = df.rename(columns=rename_dict)

# Try out our new column name
df.ca_ratio[42]
Out[53]:
0.37

Notice that we have introduced a new Python data structure, the dictionary. A dictionary is a collection of objects, each one with a key to use for indexing. For example, indexing rename_dict with the key 'trial number' returns the new name we assigned to that column.

In [54]:
# Indexing of dictionaries looks syntactically similar to cols in DataFrames
rename_dict['trial number']
Out[54]:
'trial'

We can go on and on about indexing Pandas DataFrames, because there are many ways to do it. For much more on all of the clever ways you can access data and subsets thereof in DataFrames, see the pandas docs on indexing.

However, if our data are tidy, there is really no need to go beyond what we have done here. You really only need to generate Boolean querying arrays for tidy data, and you get a tidy data structure out that has exactly what you want. So, we will not really go any more into indexing in the course, except where needed to make data tidy.

"Playing" with data: an important part of data analysis

"Exploratory data analysis" is the time during data analysis where you explore your data. You look at numbers, plot various quantities, and think about what analysis techniques you would like to use. pandas DataFrames help you do this.

As we go through the interactive analysis, we will learn about syntax for various matplotlib plot styles.

The first thing we'll do is look at strike forces. We'll rename the column for impact force for convenient attribute access because we'll access it many times.

In [55]:
df = df.rename(columns={'impact force (mN)': 'impf'})

Now, let's start plotting the impact forces to see what we're dealing with. We'll start with the most naive plot.

In [56]:
# Just make a scatter plot of forces
plt.plot(df.impf, marker='.', linestyle='none')
plt.margins(0.02)
plt.xlabel('order in DataFrame')
plt.ylabel('impact force (mN)');

The $x$-axis is pretty meaningless. So, instead let's plot a histogram of impact forces.

In [57]:
# Make a histogram plot; bins kwarg gives the number of bars
_ = plt.hist(df.impf, bins=20, normed=False)
plt.xlabel('impact force (mN)')
plt.ylabel('number of observations');

This is a better way to look at the impact force measurements. We see that there are a few high-force impacts, but that most of the impacts are about 500 mN or so.

This is still only part of the story. We would like to know how the impacts vary from frog to frog. First, let's see how many trials we have for each frog.

In [58]:
# This is a fancy way to do string formatting; unnecessary, but nice
print("""
Frog ID      Number of samples
=======      =================
   I               {0:d}
  II               {1:d}
 III               {2:d}
  IV               {3:d}
""".format(df.ID[df.ID=='I'].count(), df.ID[df.ID=='II'].count(),
           df.ID[df.ID=='III'].count(), df.ID[df.ID=='IV'].count()))
Frog ID      Number of samples
=======      =================
   I               20
  II               20
 III               20
  IV               20
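As an aside, these per-frog counts can be computed more compactly with the groupby() method, which is not covered above. A minimal sketch using a hypothetical miniature DataFrame:

```python
import pandas as pd

# Hypothetical miniature version of the frog data
df_demo = pd.DataFrame({'ID': ['I', 'I', 'II', 'III', 'III', 'III'],
                        'impf': [1205, 2527, 1612, 324, 614, 502]})

# Count the number of rows for each frog in one line
counts = df_demo.groupby('ID').size()
print(counts)
```

The result is a Series indexed by frog ID, with one count per frog.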

We only have 20 samples for each frog, which is too few to construct a meaningful histogram for each one. Instead, maybe we can make a bar graph showing the mean and standard deviation of each sample. This is very easy using Seaborn when the data are organized into a tidy DataFrame. We use the sns.barplot() function and supply the DataFrame containing the data using the data kwarg, the column corresponding to the $x$-axis using the x kwarg, and the column corresponding to the bar height as the y kwarg.

In [59]:
# Use Seaborn to quickly make bar graph
sns.barplot(data=df, x='ID', y='impf')
plt.xlabel('Frog')
plt.ylabel('impact force (mN)');

Bar graphs are seldom a good way to show results

There. I said it. Despite their prevalence throughout the biological literature, bar graphs are seldom a good way to show results. They are difficult to read and often convey little information. A better alternative is a box plot. Better still are jitter or beeswarm plots, which I will show below. All are easily constructed using Seaborn.

Box plots

There are many formats for box plots, and Seaborn employs what, in my experience, seems to be the most common. (Matplotlib box plots are by default very ugly, which is why we'll directly use Seaborn to make the plots.) Note that the DataFrame must be tidy for Seaborn to make the box plot.

First, we'll make a plot, and then I'll describe what it means.

In [60]:
# Use Seaborn to make box plot
ax = sns.boxplot(x='ID', y='impf', data=df, width=0.5)

# Relabel axes
ax.set_xlabel('Frog')
ax.set_ylabel('Impact force (mN)');

The middle line in a box gives the median of the measurements. The bottom and top of the box are the 25th and 75th percentile values, called the first and third quartile, respectively. The whiskers extend to the most extreme data points lying within 1.5 times the IQR of the box edges (see below). Outliers, in this case, are denoted with pluses.

How are outliers defined? Typically, for box plots, an outlier is defined as follows. The total length of the box is called the interquartile range (IQR), going from the first to the third quartile (Q1 and Q3, respectively). I.e., $\text{IQR} = \text{Q3} - \text{Q1}$. A datum is an outlier if its value $x$ satisfies

\begin{align} &x < \text{Q1} - \frac{3}{2}\,\text{IQR},\\ \text{or } &x > \text{Q3} + \frac{3}{2}\,\text{IQR}. \end{align}

The factor of $3/2$ is sometimes changed to 3 to denote extreme outliers, and this can be adjusted with the whis keyword argument to sns.boxplot(). Any points outside of the range of the whiskers are plotted explicitly and considered to be outliers.
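The outlier rule above is easy to compute by hand with pandas. Here is a sketch using a hypothetical set of impact forces (not the real measurements); the quantile() method gives Q1 and Q3, and Boolean slicing flags anything beyond the 1.5 IQR fences.

```python
import pandas as pd

# Hypothetical impact forces for a single frog, in mN
forces = pd.Series([473.0, 605.0, 1205.0, 1556.0, 1745.0, 2527.0, 5000.0])

# Quartiles and interquartile range
q1, q3 = forces.quantile([0.25, 0.75])
iqr = q3 - q1

# Flag points beyond 1.5 * IQR of the box edges, as in the box plot
outliers = forces[(forces < q1 - 1.5 * iqr) | (forces > q3 + 1.5 * iqr)]
print(outliers.values)
```

For this made-up sample, only the extreme 5000 mN value falls outside the fences.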

Jitter and beeswarm plots

We only have 20 measurements for each frog. Wouldn't it be better just to plot all 20 instead of trying to distill it down to a mean and a standard deviation in the bar graph or quartiles in a box plot? After all, the impact force measurements might not be Gaussian distributed; they may be bimodal or something else. So, we would like to generate a column scatter plot. We will plot each data point for each frog in a column, and "jitter" its position along the $x$-axis. We make the points somewhat transparent to allow visualization of overlap. This is accomplished using Seaborn's stripplot() function.

In [61]:
# Make the jitter plot
ax = sns.stripplot(x='ID', y='impf', data=df, jitter=True, alpha=0.6)

# Relabel axes
ax.set_xlabel('Frog')
ax.set_ylabel('Impact force (mN)');

Very nice! Now we can see that frog I, an adult, strikes with a wide range of impact forces, and can strike really hard. Frog II, also an adult, tends to strike at around 500 mN, but occasionally will strike harder. Juvenile frog III is a pretty consistent striker, while frog IV can have some pretty weak strikes.

The column scatter plot is not difficult to look at. The informational content of the data does not need to be distilled into a box and whisker, and certainly not into a bar with a standard deviation. For large data sets, in my experience when the number of data points is above about 200, it is useful to also include the box plot with the jitter plot. This is also easily done with Seaborn.

In [62]:
# Use Seaborn to make box plot
ax = sns.boxplot(x='ID', y='impf', data=df, width=0.5)

# Make the jitter plot
ax = sns.stripplot(x='ID', y='impf', data=df, jitter=True, marker='o', 
                   alpha=0.8, color=(0.7, 0.7, 0.7))

# Relabel axes
ax.set_xlabel('Frog')
ax.set_ylabel('Impact force (mN)');

Plots like jitter plots in which the dots do not overlap are called beeswarm plots. For large numbers of data points, these are not favored over jitter plots, but for small numbers of data points (fewer than 200 is a good rule of thumb), beeswarm plots tend to be easier to read, as each data point can be clearly seen. We can make a beeswarm plot using Seaborn's swarmplot() function.

In [63]:
# Make the jitter plot
ax = sns.swarmplot(x='ID', y='impf', data=df)

# Relabel axes
ax.set_xlabel('Frog')
ax.set_ylabel('Impact force (mN)');

We can get even more information. We might be interested to see if the variability in impact force is day-to-day or time-independent. So, we would like to make the beeswarm plot with different colors on different days. We can use the hue kwarg of the sns.swarmplot() function to color the dots by the day they were measured.

In [64]:
# Make the jitter plot
ax = sns.swarmplot(x='ID', y='impf', data=df, hue='date')

# Remove legend because it gets in the way
ax.legend_.remove()

# Relabel axes
ax.set_xlabel('Frog')
ax.set_ylabel('Impact force (mN)');

We do see that frogs I and II were measured on different days, and III and IV were measured on the same days. There also does not seem to be any correlation between impact force and the day the frog was measured.

Finally, let's see how the adhesive force relates to the impact force. As we did for impact force, we'll rename the adhesive force column for convenient attribute access.

In [65]:
# Rename the column for convenience in future use
df = df.rename(columns={'adhesive force (mN)' : 'adhf'})

# Plot adhesive force vs. impact force
plt.plot(df.impf, df.adhf, marker='.', linestyle='none', alpha=0.6)
plt.xlabel('impact force (mN)')
plt.ylabel('adhesive force (mN)');

Later in the course, we will learn how to do regressions to test the relationship between two variables.
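As a very rough preview of what such a regression might look like, here is a sketch fitting a least-squares line with np.polyfit() to synthetic data (not the frog measurements; the linear model here is an assumption made for illustration only).

```python
import numpy as np

# Synthetic data with a known linear trend: y = 2x + 1 plus noise
rng = np.random.RandomState(42)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=len(x))

# Least-squares fit of a first-degree polynomial: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)   # close to 2 and 1
```

A proper treatment of regression, including how to assess whether a linear model is even appropriate, comes later in the course.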

Conclusions

In this tutorial, we have learned how to load data from CSV files into Pandas DataFrames. DataFrames are useful objects for looking at data from many angles. Together with matplotlib, you can start to look at qualitative features of your data and think about how you will go about doing your detailed analysis.