(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import pandas as pd
In this tutorial, we will learn how to load data stored on disk into a Python data structure. We will use Pandas to read in CSV (comma-separated value) files and store the results in the very handy Pandas DataFrame. This incredibly flexible and powerful data structure will be a centerpiece in the rest of this course and beyond. In this tutorial, we will learn what a data frame is and how to use it.
The data set we will use comes from a fun paper about the adhesive properties of frog tongues. The reference is Kleinteich and Gorb, Tongue adhesion in the horned frog Ceratophrys sp., Sci. Rep., 4, 5225, 2014. You might also want to check out a New York Times feature on the paper here.
In this paper, the authors investigated the adhesive properties of the tongues of horned frogs when they strike prey. The authors had a striking pad connected to a cantilever to measure forces. They also used high-speed cameras to capture the strike and record relevant data.
with open('../data/frog_tongue_adhesion.csv', 'r') as f:
    for _ in range(20):
        print(next(f), end='')
The first lines all begin with # signs, signifying that they are comments and not data. They do give important information, though, such as the meaning of the ID data. The ID refers to which specific frog was tested.
Immediately after the comments, we have a row of comma-separated headers. This row sets the number of columns in this data set and labels the meaning of the columns. So, we see that the first column is the date of the experiment, the second column is the ID of the frog, the third is the trial number, and so on.
After this row, each row represents a single experiment where the frog struck the target.
CSV files are one of many ways to store data, and are widely used. Commas make better delimiters than white space (such as tabs) because they have no portability issues. Delimiter collision is avoided by putting data fields in double quotes when necessary. There are other good ways to store data, such as JSON, which we will use to specify plots together with data, but we will almost exclusively use CSV files in this class.
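As a quick illustration of how double quotes avoid delimiter collision, here is a minimal sketch using Python's built-in csv module. The strings are made up for this example:

```python
import csv
import io

# A field that contains a comma is wrapped in double quotes,
# so the comma inside it is not treated as a delimiter
csv_text = 'species,note\ncranwelli,"small, juvenile frog"\n'

rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1])  # ['cranwelli', 'small, juvenile frog']
```

The quoted field comes back as a single entry, comma and all.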
We will use pd.read_csv() to load the data set. The data are stored in a DataFrame, which is one of the data types that makes Pandas so convenient for use in data analysis. DataFrames offer mixed data types, including incomplete columns, and convenient slicing, among many, many other convenient features. We will use the DataFrame to look at the data, at the same time demonstrating some of the power of DataFrames. They are like spreadsheets, only a lot better.

We will now load the DataFrame.
# Use pd.read_csv() to read in the data and store in a DataFrame
df = pd.read_csv('../data/frog_tongue_adhesion.csv', comment='#')
Notice that we used the kwarg comment to specify that lines beginning with # are comments and are to be ignored. If you check out the doc string for pd.read_csv(), you will see there are lots of options for reading in the data.
Let's jump right in and look at the contents of the DataFrame. We can look at the first several rows using the df.head() method.
# Look at the contents (first 5 rows)
df.head()
We see that the column headings were automatically assigned. Pandas also automatically defined the indices (names of the rows) as integers going up from zero. We could have defined the indices to be any of the columns of data.
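As a sketch of how that works (the df_demo name and values are made up for this example), the set_index() method swaps a column in for the default integer index:

```python
import pandas as pd

# Hypothetical miniature data set for illustration
df_demo = pd.DataFrame({'trial number': [3, 4],
                        'date': ['2013_02_26', '2013_02_26']})

# Use the trial number column as the row index instead of integers
df_demo = df_demo.set_index('trial number')

print(df_demo.loc[3, 'date'])  # 2013_02_26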
To access a column of data, we use the following syntax.
# Slicing a column out of a DataFrame is achieved by using the column name
df['impact force (mN)']
The indexing of the rows is preserved, and we can see that we can easily extract all of the impact forces. To extract more than one column, we can index with a list.
df[['impact force (mN)', 'adhesive force (mN)']].head()
Note, though, that pd.read_csv() interpreted the data to be integer (dtype = int64), so we may want to convert these to floats, which we do with the .astype() method.
# Use df.astype() method to convert it to a NumPy float64 data type.
df['impact force (mN)'] = df['impact force (mN)'].astype(float)
# Check that it worked
df['impact force (mN)'].dtype
Now let's say we only want the impact force of strikes above two Newtons, the big strikes. Pandas DataFrames can conveniently be sliced with Booleans. We first make an array of Booleans that are True when the condition of interest is met, and False elsewhere.
# Generate True/False array of rows for indexing
inds = df['impact force (mN)'] >= 2000.0
# Take a look
inds
We now have an array of Booleans that is the same length as the DataFrame itself. Now, we can use the .loc feature of a DataFrame to slice what we want out of the DataFrame.
# Slice out rows we want (big force rows)
df_big_force = df.loc[inds, :]
# Let's look at it; there will be only a few high-force values
df_big_force
Using .loc allows us to index by row and column. We chose all columns (using a colon, :), and put an array of Booleans for the rows. We get back a DataFrame containing the rows associated with True in the array of Booleans.
So now we only have the strikes of high force. Note, though, that the original indexing of rows was retained! In our new DataFrame with only the big force strikes, there is no index 3, for example.
# Executing the below will result in an exception
df_big_force.loc[3, 'impact force (mN)']
This might seem counterintuitive, but it is actually a good idea. Remember, indices do not have to be integers!
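If you do want fresh sequential integer indices on a sliced DataFrame, one option (a sketch with made-up values, not something we need in this tutorial) is the reset_index() method:

```python
import pandas as pd

# Hypothetical sliced DataFrame with gaps in its index
df_demo = pd.DataFrame({'impact force (mN)': [2276, 2641, 3294]},
                       index=[1, 4, 17])

# drop=True discards the old index instead of keeping it as a column
df_demo = df_demo.reset_index(drop=True)

print(df_demo.loc[1, 'impact force (mN)'])  # 2641
```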
There is a way around this, though. We can use the iloc attribute of a DataFrame. This gives indexing with sequential integers.
# Using iloc enables indexing by the corresponding sequence of integers
df_big_force['impact force (mN)'].iloc[3]
While this works, I warn that you really should not need to use iloc. If you find yourself using it, remember that you are relying on the sorting of the data frame, which may be neither relevant nor reliable. So, use iloc at your own risk; the consequences could be dire!
In our DataFrame, the data are tidy. The concept of tidy data originally comes from the development of databases, but has taken hold for more general data processing in recent years. We will spend most of next week talking about tidy data and how to work with it; see this paper and this paper by Hadley Wickham for a great discussion. Tidy data refers to data sets arranged in tabular form that have the following format.

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
As we can see, Kleinteich and Gorb generously provided us with tidy data. Much more on tidy data next week....
Given that the data are tidy and one row represents a single experiment, we might like to extract a row and see all of the measured quantities from a given strike. This is also very easy with DataFrames using .loc.
# Let's slice out experiment with index 42
df.loc[42, :]
Notice how clever the DataFrame is. We sliced a row out, and now the indices describing its elements are the column headings. This is a Pandas Series object.
Now, slicing out index 42 is not very meaningful because the indices are arbitrary. Instead, we might look at our lab notebook and want to look at trial number 3 on May 27, 2013, of frog III. This is a more common, and indeed more meaningful, use case.
# Set up Boolean slicing
date = df['date'] == '2013_05_27'
trial = df['trial number'] == 3
ID = df['ID'] == 'III'
# Slice out the row
df.loc[date & trial & ID, :]
The difference here is that the returned slice is a DataFrame and not a Series. We can easily make it a Series using iloc (one of the few uses of iloc that I recommend). Note that in most use cases, this is just a matter of convenience for viewing and nothing more.
# Alternatively, df.loc[date & trial & ID, :].transpose().squeeze()
df.loc[date & trial & ID, :].iloc[0]
You may be annoyed with the rather lengthy syntax of accessing column names. For example, if you were trying to access the ratio of the contact area with mucus to the contact area without mucus for trial number 3 on May 27, 2013, you would do the following.
# Set up criteria for our search of the DataFrame
date = df['date'] == '2013_05_27'
trial = df['trial number'] == 3
ID = df['ID'] == 'III'
# When indexing DataFrames, use & for Boolean and (and | for or; ~ for not)
df.loc[date & trial & ID, 'contact area with mucus / contact area without mucus']
There are reasons to preserve the descriptive (and verbose) nature of the column headings. For example, everything is unambiguous, which is always a plus. Beyond that, many plotting packages, including Altair, which we will learn about in the next lesson, can automatically label axes based on column headings in DataFrames. Nonetheless, you may want to use shorter names to save keystrokes. So, let's change the names of the columns as

'trial number' → 'trial'
'contact area with mucus / contact area without mucus' → 'ca_ratio'

DataFrames have a nice rename method to do this.
# Make a dictionary to rename columns
rename_dict = {'trial number': 'trial',
               'contact area with mucus / contact area without mucus': 'ca_ratio'}
# Rename the columns
df = df.rename(columns=rename_dict)
# Try out our new column name
df.loc[date & trial & ID, 'ca_ratio']
Notice that we are using a Python dictionary. A dictionary is a collection of objects, each one with a key to use for indexing. We will use dictionaries extensively in this course, so if you need to refresh yourself about this native Python data structure, please do.
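As a quick refresher (the frog_weights name is just for this example; the values echo the frog weights given later in this tutorial):

```python
# A dictionary maps keys to values; indexing is by key, not by position
frog_weights = {'I': 63.1, 'II': 72.7, 'III': 12.7}

print(frog_weights['II'])  # 72.7

# Adding a new key-value pair
frog_weights['IV'] = 12.7

print('IV' in frog_weights)  # True
```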
We can go on and on about indexing Pandas DataFrames, because there are many ways to do it. For much more on all of the clever ways you can access data and subsets thereof in DataFrames, see the Pandas docs on indexing.
We have seen that DataFrames are convenient for organizing and selecting data, but we naturally want to compute with data. Furthermore, we might want to store the results of the computations as a new column in the DataFrame. As a simple example, let's say we want to have a column with the impact force in units of Newtons instead of milliNewtons. We can divide the impact force (mN) column elementwise by 1000, just as with Numpy arrays.
# Add a new column with impact force in units of Newtons
df['impact force (N)'] = df['impact force (mN)'] / 1000
# Take a look
df.head()
Creation of the new column was achieved by assigning to it the Pandas Series we calculated.
We can do other calculations on DataFrames besides elementwise calculations. For example, if we wanted the mean impact force in units of milliNewtons, we can do this....
df['impact force (mN)'].mean()
Actually, we can compute all sorts of useful summary statistics about the DataFrame all at once using the describe() method. This gives the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for each column.
df.describe()
Some values are meaningless, like the mean trial number, but much of the information is very useful.
Now, say we wanted to compute the mean impact force of each frog. We can combine Boolean indexing with looping to do that.
for frog in df['ID'].unique():
    inds = df['ID'] == frog
    mean_impact_force = df.loc[inds, 'impact force (mN)'].mean()
    print('{0:s}: {1:.1f} mN'.format(frog, mean_impact_force))
Notice that we have used the .unique() method to get the unique entries in the ID column. We will see next week that we can use Pandas's groupby() function to do this much more cleanly and efficiently, but for now, you can appreciate your ability to pull out the data you need and compute with it.
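As a preview of groupby(), here is a sketch with a hypothetical miniature data set (the df_demo name and force values are made up; the real data set has 80 rows). It collapses the loop above into one line:

```python
import pandas as pd

# Hypothetical miniature version of the frog strike data
df_demo = pd.DataFrame({'ID': ['I', 'I', 'II', 'II'],
                        'impact force (mN)': [1000.0, 2000.0, 500.0, 700.0]})

# Mean impact force for each frog, all at once
mean_forces = df_demo.groupby('ID')['impact force (mN)'].mean()

print(mean_forces['I'])   # 1500.0
print(mean_forces['II'])  # 600.0
```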
The mean() function we used above performed an aggregating operation on the data frame. That is, it took the data in the column, which had 80 entries, and collapsed all of those data into a single number, in this case the mean. The describe() function applies many aggregating functions. Pandas has many built-in ones. For example, we could use the median() function.
df['impact force (mN)'].median()
We can also write our own. Let's say we want to compute the coefficient of variation (CoV, the standard deviation divided by the mean) of the impact forces. There is no built-in function to do this. We have to write our own function to compute the CoV and then use it with the agg() method of a DataFrame object. In the function below, the values of each column are denoted by data.
def coeff_of_var(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)
Now that we have the function (which uses Numpy's std() and mean() functions), we can use the DataFrame's agg() method to apply it. (Note that we slice out the column of interest using a list so that we get a DataFrame and not a Series. For the latter, agg() will apply the function to each entry, not across the entire column.)
df[['impact force (mN)']].agg(coeff_of_var)
While an aggregating function collapses the data down to a single value, a transform operation returns the same number of data points. For example, we might want to rank the entries. Pandas's built-in rank() method does this. We will use the method='first' kwarg to specify how the function deals with ties.
df['impact force (mN)'].rank(method='first').astype(int)
We can also apply custom transforms. Say we wanted to compute a residual for each data point.
def resid(data):
    """Compute residual for each entry in an array of data."""
    return (data - np.mean(data)) / np.std(data)
(df[['impact force (mN)']].transform(resid)
                          .rename(columns={'impact force (mN)': 'impact force resid'}))
Looking back at the header of the original data file, we see that there is information present in the header that we would like to have in our DataFrame. For example, it would be nice to know if each strike came from an adult or juvenile, or what the snout-vent length was. Going toward the goal of including this in our DataFrame, we will first construct a new DataFrame containing information about each frog.

One way to create this new DataFrame is to first construct a dictionary with the respective fields.
data_dict = {'ID': ['I', 'II', 'III', 'IV'],
             'age': ['adult', 'adult', 'juvenile', 'juvenile'],
             'SVL (mm)': [63, 70, 28, 31],
             'weight (g)': [63.1, 72.7, 12.7, 12.7],
             'species': ['cross', 'cross', 'cranwelli', 'cranwelli']}
Now that we have this dictionary, we can convert it into a DataFrame by instantiating the pd.DataFrame class with it, using the data kwarg.
# Make it into a DataFrame
df_frog_info = pd.DataFrame(data=data_dict)
# Take a look
df_frog_info
Nice!
Sometimes the data sets are not small enough to construct a dictionary by hand. Oftentimes, we have a two-dimensional array of data that we want to make into a DataFrame. As an example, let's say we have a Numpy array that has two columns, one for the snout-vent length and one for the weight.
data = np.array([[63, 70, 28, 31], [63.1, 72.7, 12.7, 12.7]]).transpose()
# Verify that it's what we think it is
data
To make this into a DataFrame, we again create a pd.DataFrame instance, but this time we also specify the columns keyword argument.
df_demo = pd.DataFrame(data=data, columns=['SVL (mm)', 'weight (g)'])
# Take a look
df_demo
That also works. Generally, any two-dimensional Numpy array can be converted into a DataFrame in this way. You just need to supply column names.
Our ultimate goal was to add that information to the main DataFrame, df, that we have been working with. We can do it using tools we have already learned. For each row in the DataFrame, we can add the relevant value in each column. Because this will not be the final way I recommend doing this, I will do these operations on a copy of df made with the copy() method.
# Make a copy of df
df_copy = df.copy()
# Build each column
for col in df_frog_info.columns[df_frog_info.columns != 'ID']:
    # Make a new column with empty values
    df_copy[col] = np.empty(len(df_copy))

    # Add in each entry, row by row
    for i, r in df_copy.iterrows():
        ind = df_frog_info['ID'] == r['ID']
        df_copy.loc[i, col] = df_frog_info.loc[ind, col].iloc[0]
# Take a look at the updated DataFrame
df_copy.head()
Note that I used the iterrows() method of the df_copy data frame. This iterator gives an index (which I called i) and a row of a DataFrame (which I called r). This method, and the analogous one for iterating over columns, iteritems(), can be useful.
But this approach seems rather clunky. A much better way is to use Pandas's built-in merge() method. Called with all the default keyword arguments, this function finds common columns between two DataFrames (in this case, the ID column) and then uses those columns to merge them, filling in values that match in the common column. This is exactly what we want.
df = df.merge(df_frog_info)
# Check it out!
df.head()
Boo-yah! I think this example of merging DataFrames highlights the power of using them in your data analysis. It also brings up an important point. When you have to perform operations on data frames, you can often "brute force" it with loops, etc. But if what you are trying to do seems like something a data analyst would frequently encounter, there is a good chance it is already built into Pandas, and you should ask Google how to do it.
%load_ext watermark
%watermark -v -p numpy,pandas,jupyterlab