Tutorial 1a: Data frames

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import pandas as pd

In this tutorial, we will learn how to load data stored on disk into a Python data structure. We will use Pandas to read in CSV (comma separated value) files and store the results in the very handy Pandas DataFrame. This incredibly flexible and powerful data structure will be a centerpiece in the rest of this course and beyond. In this tutorial, we will learn about what a data frame is and how to use it.

The data set we will use comes from a fun paper about the adhesive properties of frog tongues. The reference is Kleinteich and Gorb, Tongue adhesion in the horned frog Ceratophrys sp., Sci. Rep., 4, 5225, 2014. You might also want to check out a New York Times feature on the paper here.

In this paper, the authors investigated various properties of the adhesive characteristics of the tongues of horned frogs when they strike prey. The authors had a striking pad connected to a cantilever to measure forces. They also used high speed cameras to capture the strike and record relevant data.

Pandas

Pandas is a core Python package for data analysis. In this tutorial, we will learn how to use it and its central data structure, the DataFrame. Traditionally, Pandas is imported as pd, which we did in the first cell of this notebook.

The data file

The data from the paper are contained in the file frog_tongue_adhesion.csv, which you can download here. We can look at its contents. You can use

head -n 20 ../data/frog_tongue_adhesion.csv

from the command line. Below, I show the first twenty lines of the data file.

In [2]:
with open('../data/frog_tongue_adhesion.csv', 'r') as f:
    for _ in range(20):
        print(next(f), end='')
# These data are from the paper,
#   Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
# It was featured in the New York Times.
#    http://www.nytimes.com/2014/08/25/science/a-frog-thats-a-living-breathing-pac-man.html
#
# The authors included the data in their supplemental information.
#
# Importantly, the ID refers to the identifites of the frogs they tested.
#   I:   adult, 63 mm snout-vent-length (SVL) and 63.1 g body weight,
#        Ceratophrys cranwelli crossed with Ceratophrys cornuta
#   II:  adult, 70 mm SVL and 72.7 g body weight,
#        Ceratophrys cranwelli crossed with Ceratophrys cornuta
#   III: juvenile, 28 mm SVL and 12.7 g body weight, Ceratophrys cranwelli
#   IV:  juvenile, 31 mm SVL and 12.7 g body weight, Ceratophrys cranwelli
date,ID,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),contact area with mucus / contact area without mucus,contact pressure (Pa),adhesive strength (Pa)
2013_02_26,I,3,1205,46,1.95,-785,884,1.27,-0.290,387,70,0.82,3117,-2030
2013_02_26,I,4,2527,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695
2013_03_01,I,1,1745,34,2.82,-850,211,1.37,-0.157,83,79,0.05,21020,-10239
2013_03_01,I,2,1556,41,2.51,-455,1025,0.74,-0.170,330,158,0.52,4718,-1381
2013_03_01,I,3,493,36,0.80,-974,499,1.57,-0.423,245,216,0.12,2012,-3975

The first lines all begin with # signs, signifying that they are comments and not data. They do give important information, though, such as the meaning of the ID data. The ID refers to which specific frog was tested.

Immediately after the comments, we have a row of comma-separated headers. This row sets the number of columns in this data set and labels the meaning of the columns. So, we see that the first column is the date of the experiment, the second column is the ID of the frog, the third is the trial number, and so on.

After this row, each row represents a single experiment where the frog struck the target.

CSV files are one of many ways to store data, and are widely used. Commas make better delimiters than white space (such as tabs) because they have no portability issues. Delimiter collision is avoided by putting data fields in double quotes when necessary. There are other good ways to store data, such as JSON, which we will use to specify plots together with data, but we will almost exclusively use CSV files in this class.
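As a quick illustration of how quoting avoids delimiter collision, consider the hypothetical snippet below (the `csv_text` string and its field values are made up for this example). The comma inside the double-quoted field is treated as part of the data, not as a delimiter.

```python
import io

import pandas as pd

# Hypothetical CSV snippet: the quoted field contains a comma, so the
# double quotes prevent delimiter collision
csv_text = 'ID,notes\nI,"strong strike, partial contact"\nII,clean strike\n'

df_demo = pd.read_csv(io.StringIO(csv_text))

# The comma inside the quotes is kept as part of the field
print(df_demo.loc[0, 'notes'])
```

Pandas handles the quoting automatically when reading, so fields with embedded commas come through intact.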

Loading a data set

We will use pd.read_csv() to load the data set. The data are stored in a DataFrame, one of the data types that makes Pandas so convenient for data analysis. DataFrames offer mixed data types, incomplete columns, and convenient slicing, among many other useful features. We will use the DataFrame to look at the data, at the same time demonstrating some of the power of DataFrames. They are like spreadsheets, only a lot better.

We will now load the DataFrame.

In [3]:
# Use pd.read_csv() to read in the data and store in a DataFrame
df = pd.read_csv('../data/frog_tongue_adhesion.csv', comment='#')

Notice that we used the kwarg comment to specify that lines that begin with # are comments and are to be ignored. If you check out the doc string for pd.read_csv(), you will see there are lots of options for reading in the data.
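To give a flavor of those options, here is a small sketch on a made-up snippet (the `csv_text` string is invented for this example) combining the comment kwarg with na_values, which tells Pandas to treat a given sentinel string as missing data.

```python
import io

import pandas as pd

# Hypothetical snippet with a comment line and a sentinel for missing data
csv_text = '# units are mN\nID,force\nI,1205\nII,missing\n'

# comment='#' skips the comment line; na_values marks 'missing' as NaN
df_demo = pd.read_csv(io.StringIO(csv_text), comment='#', na_values='missing')

print(df_demo['force'].isna().sum())  # one missing entry
```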

Exploring the DataFrame

Let's jump right in and look at the contents of the DataFrame. We can look at the first several rows using the df.head() method.

In [4]:
# Look at the contents (first 5 rows)
df.head()
Out[4]:
date ID trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
0 2013_02_26 I 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030
1 2013_02_26 I 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
2 2013_03_01 I 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239
3 2013_03_01 I 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381
4 2013_03_01 I 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975

We see that the column headings were automatically assigned. Pandas also automatically defined the indices (names of the rows) as integers going up from zero. We could have defined the indices to be any of the columns of data.
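For example, we could make the ID column the index. Here is a minimal sketch on a toy DataFrame (the values are made-up stand-ins for the frog data):

```python
import pandas as pd

# Toy stand-in for the frog data
df_toy = pd.DataFrame({'ID': ['I', 'II'],
                       'impact force (mN)': [1205, 556]})

# Use the ID column as the row index instead of the default integers
df_toy = df_toy.set_index('ID')

print(df_toy.loc['II', 'impact force (mN)'])  # index by frog ID
```

Equivalently, pd.read_csv() accepts an index_col kwarg to set the index at load time.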

To access a column of data, we use the following syntax.

In [5]:
# Slicing a column out of a DataFrame is achieved by using the column name
df['impact force (mN)']
Out[5]:
0     1205
1     2527
2     1745
3     1556
4      493
5     2276
6      556
7     1928
8     2641
9     1897
10    1891
11    1545
12    1307
13    1692
14    1543
15    1282
16     775
17    2032
18    1240
19     473
20    1612
21     605
22     327
23     946
24     541
25    1539
26     529
27     628
28    1453
29     297
      ... 
50     458
51     626
52     621
53     544
54     535
55     385
56     401
57     614
58     665
59     488
60     172
61     142
62      37
63     453
64     355
65      22
66     502
67     273
68     720
69     582
70     198
71     198
72     597
73     516
74     815
75     402
76     605
77     711
78     614
79     468
Name: impact force (mN), Length: 80, dtype: int64

The indexing of the rows is preserved, and we can see that we can easily extract all of the impact forces. To extract more than one column, we can index with a list.

In [6]:
df[['impact force (mN)', 'adhesive force (mN)']].head()
Out[6]:
impact force (mN) adhesive force (mN)
0 1205 -785
1 2527 -983
2 1745 -850
3 1556 -455
4 493 -974

Note, though, that pd.read_csv() interpreted the data to be integer (dtype = int64), so we may want to convert these to floats, which we do with the .astype() method.

In [7]:
# Use df.astype() method to convert it to a NumPy float64 data type.
df['impact force (mN)'] = df['impact force (mN)'].astype(float)

# Check that it worked
df['impact force (mN)'].dtype
Out[7]:
dtype('float64')

Now let's say we only want the impact force of strikes above two Newtons, the big strikes. Pandas DataFrames can conveniently be sliced with Booleans. We first make an array of Booleans that are True when the condition of interest is met, and False elsewhere.

In [8]:
# Generate True/False array of rows for indexing
inds = df['impact force (mN)'] >= 2000.0

# Take a look
inds
Out[8]:
0     False
1      True
2     False
3     False
4     False
5      True
6     False
7     False
8      True
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17     True
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
      ...  
50    False
51    False
52    False
53    False
54    False
55    False
56    False
57    False
58    False
59    False
60    False
61    False
62    False
63    False
64    False
65    False
66    False
67    False
68    False
69    False
70    False
71    False
72    False
73    False
74    False
75    False
76    False
77    False
78    False
79    False
Name: impact force (mN), Length: 80, dtype: bool

We now have an array of Booleans that is the same length as the DataFrame itself. Now, we can use the .loc feature of a DataFrame to slice what we want out of the DataFrame.

In [9]:
# Slice out rows we want (big force rows)
df_big_force = df.loc[inds, :]

# Let's look at it; there will be only a few high-force values
df_big_force
Out[9]:
date ID trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
1 2013_02_26 I 4 2527.0 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
5 2013_03_01 I 4 2276.0 31 3.68 -592 969 0.96 -0.176 341 106 0.69 6676 -1737
8 2013_03_05 I 3 2641.0 50 4.27 -690 491 1.12 -0.239 269 224 0.17 9824 -2568
17 2013_03_15 I 3 2032.0 60 3.28 -652 486 1.05 -0.257 147 134 0.09 13784 -4425

Using .loc allows us to index by row and column. We chose all columns (using a colon, :), and put an array of Booleans for the rows. We get back a DataFrame containing the rows associated with True in the arrays of Booleans.

So now we only have the strikes of high force. Note, though, that the original indexing of rows was retained! In our new DataFrame with only the big force strikes, there is no index 3, for example.

In [10]:
# Executing the below will result in an exception
df_big_force.loc[3, 'impact force (mN)']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1789                 if not ax.contains(key):
-> 1790                     error()
   1791             except TypeError as e:

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in error()
   1784                                .format(key=key,
-> 1785                                        axis=self.obj._get_axis_name(axis)))
   1786 

KeyError: 'the label [3] is not in the [index]'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-10-e454f2b5c1c3> in <module>()
      1 # Executing the below will result in an exception
----> 2 df_big_force.loc[3, 'impact force (mN)']

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1470             except (KeyError, IndexError):
   1471                 pass
-> 1472             return self._getitem_tuple(key)
   1473         else:
   1474             # we by definition only have the 0th axis

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
    868     def _getitem_tuple(self, tup):
    869         try:
--> 870             return self._getitem_lowerdim(tup)
    871         except IndexingError:
    872             pass

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_lowerdim(self, tup)
    996         for i, key in enumerate(tup):
    997             if is_label_like(key) or isinstance(key, tuple):
--> 998                 section = self._getitem_axis(key, axis=i)
    999 
   1000                 # we have yielded a scalar ?

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1909 
   1910         # fall thru to straight lookup
-> 1911         self._validate_key(key, axis)
   1912         return self._get_label(key, axis=axis)
   1913 

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1796                 raise
   1797             except:
-> 1798                 error()
   1799 
   1800     def _is_scalar_access(self, key):

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in error()
   1783                 raise KeyError(u"the label [{key}] is not in the [{axis}]"
   1784                                .format(key=key,
-> 1785                                        axis=self.obj._get_axis_name(axis)))
   1786 
   1787             try:

KeyError: 'the label [3] is not in the [index]'

This might seem counterintuitive, but it is actually a good idea. Remember, indices do not have to be integers!

There is a way around this, though. We can use the iloc attribute of a DataFrame. This gives indexing with sequential integers.

In [11]:
# Using iloc enables indexing by the corresponding sequence of integers
df_big_force['impact force (mN)'].iloc[3]
Out[11]:
2032.0

While this works, I warn that you really should not need to use iloc. If you find yourself using it, remember that you are relying on the ordering of the data frame, which may be neither relevant nor reliable. So, use iloc at your own risk, the consequences of which could be dire!

Aside: Tidy data

In our DataFrame, the data are tidy. The concept of tidy data originally comes from the development of databases, but it has taken hold for more general data processing in recent years. We will spend most of next week talking about tidy data and how to work with it; this paper and this paper by Hadley Wickham offer a great discussion. Tidy data refers to data sets arranged in tabular form that have the following format.

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a separate table.

As we can see, Kleinteich and Gorb generously provided us with tidy data. Much more on tidy data next week....
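To make the idea concrete, here is a minimal sketch (with made-up numbers) of converting an untidy, wide-format table, with one column per frog, into tidy format using the DataFrame melt() method:

```python
import pandas as pd

# Hypothetical untidy table: one column per frog, one row per trial
df_messy = pd.DataFrame({'trial': [1, 2],
                         'frog I': [1205, 2527],
                         'frog II': [556, 1928]})

# Melt into tidy format: each row is one observation (trial, frog, force)
df_tidy = df_messy.melt(id_vars='trial',
                        var_name='frog',
                        value_name='impact force (mN)')

print(len(df_tidy))  # four rows, one per (trial, frog) observation
```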

More data extraction

Given that the data are tidy and one row represents a single experiment, we might like to extract a row and see all of the measured quantities from a given strike. This is also very easy with DataFrames using .loc.

In [12]:
# Let's slice out experiment with index 42
df.loc[42, :]
Out[12]:
date                                                    2013_05_27
ID                                                             III
trial number                                                     3
impact force (mN)                                              324
impact time (ms)                                               105
impact force / body weight                                    2.61
adhesive force (mN)                                           -172
time frog pulls on target (ms)                                 619
adhesive force / body weight                                  1.38
adhesive impulse (N-s)                                      -0.079
total contact area (mm2)                                        55
contact area without mucus (mm2)                                23
contact area with mucus / contact area without mucus          0.37
contact pressure (Pa)                                         5946
adhesive strength (Pa)                                       -3149
Name: 42, dtype: object

Notice how clever the DataFrame is. We sliced a row out, and now the indices describing its elements are the column headings. This is a Pandas Series object.

Now, slicing out index 42 is not very meaningful because the indices are arbitrary. Instead, we might look at our lab notebook, and want to look at trial number 3 on May 27, 2013, of frog III. This is a more common, and indeed meaningful, use case.

In [13]:
# Set up Boolean slicing
date = df['date'] == '2013_05_27'
trial = df['trial number'] == 3
ID = df['ID'] == 'III'

# Slice out the row
df.loc[date & trial & ID, :]
Out[13]:
date ID trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
42 2013_05_27 III 3 324.0 105 2.61 -172 619 1.38 -0.079 55 23 0.37 5946 -3149

The difference here is that the returned slice is a DataFrame and not a Series. We can easily make it a Series using iloc (one of the few uses of iloc that I recommend). Note that in most use cases, this is just a matter of convenience for viewing and nothing more.

In [14]:
# Alternatively, df.loc[date & trial & ID, :].transpose().squeeze()
df.loc[date & trial & ID, :].iloc[0]
Out[14]:
date                                                    2013_05_27
ID                                                             III
trial number                                                     3
impact force (mN)                                              324
impact time (ms)                                               105
impact force / body weight                                    2.61
adhesive force (mN)                                           -172
time frog pulls on target (ms)                                 619
adhesive force / body weight                                  1.38
adhesive impulse (N-s)                                      -0.079
total contact area (mm2)                                        55
contact area without mucus (mm2)                                23
contact area with mucus / contact area without mucus          0.37
contact pressure (Pa)                                         5946
adhesive strength (Pa)                                       -3149
Name: 42, dtype: object

Renaming columns

You may be annoyed with the rather lengthy syntax of accessing column names. For example, if you were trying to access the ratio of the contact area with mucus to the contact area without mucus for trial number 3 on May 27, 2013, you would do the following.

In [15]:
# Set up criteria for our search of the DataFrame
date = df['date'] == '2013_05_27'
trial = df['trial number'] == 3
ID = df['ID'] == 'III'

# When indexing DataFrames, use & for Boolean and (and | for or; ~ for not)
df.loc[date & trial & ID, 'contact area with mucus / contact area without mucus']
Out[15]:
42    0.37
Name: contact area with mucus / contact area without mucus, dtype: float64

There are reasons to preserve the descriptive (and verbose) nature of the column headings. For example, everything is unambiguous, which is always a plus. Beyond that, many plotting packages, including Altair, which we will learn about in the next lesson, can automatically label axes based on column headings in DataFrames. Nonetheless, you may want to use shorter names to save keystrokes. So, let's change the names of the columns as follows.

'trial number' → 'trial'
'contact area with mucus / contact area without mucus' → 'ca_ratio'

DataFrames have a nice rename method to do this.

In [16]:
# Make a dictionary to rename columns
rename_dict = {'trial number' : 'trial',
               'contact area with mucus / contact area without mucus' : 'ca_ratio'}

# Rename the columns
df = df.rename(columns=rename_dict)

# Try out our new column name
df.loc[date & trial & ID, 'ca_ratio']
Out[16]:
42    0.37
Name: ca_ratio, dtype: float64

Notice that we are using a Python dictionary. A dictionary is a collection of objects, each one with a key to use for indexing. We will use dictionaries extensively in this course, so if you need to refresh yourself about this native Python data structure, please do.
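As a quick refresher, a dictionary maps keys to values, and indexing is by key rather than by position. A tiny example (the `frog_ages` mapping is made up from the header information):

```python
# A dictionary mapping frog IDs to ages (the IDs serve as keys)
frog_ages = {'I': 'adult', 'II': 'adult', 'III': 'juvenile', 'IV': 'juvenile'}

# Index by key, not by position
print(frog_ages['III'])  # juvenile
```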

We can go on and on about indexing Pandas DataFrames, because there are many ways to do it. For much more on all of the clever ways you can access data and subsets thereof in DataFrames, see the Pandas docs on indexing.

Computing with DataFrames and adding columns

We have seen that DataFrames are convenient for organizing and selecting data, but we naturally want to compute with data. Furthermore, we might want to store the results of the computations as a new column in the DataFrame. As a simple example, let's say we want to have a column with the impact force in units of Newtons instead of milliNewtons. We can divide the impact force (mN) columns elementwise by 1000, just as with Numpy arrays.

In [17]:
# Add a new column with impact force in units of Newtons
df['impact force (N)'] = df['impact force (mN)'] / 1000

# Take a look
df.head()
Out[17]:
date ID trial impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) ca_ratio contact pressure (Pa) adhesive strength (Pa) impact force (N)
0 2013_02_26 I 3 1205.0 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030 1.205
1 2013_02_26 I 4 2527.0 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695 2.527
2 2013_03_01 I 1 1745.0 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239 1.745
3 2013_03_01 I 2 1556.0 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381 1.556
4 2013_03_01 I 3 493.0 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975 0.493

Creation of the new column was achieved by assigning the Pandas Series we calculated to it.

We can do other calculations on DataFrames besides elementwise calculations. For example, if we want the mean impact force in units of milliNewtons, we can do this....

In [18]:
df['impact force (mN)'].mean()
Out[18]:
801.6875

Actually, we can compute all sorts of useful summary statistics about the DataFrame all at once using the describe() method. This gives the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for each column.

In [19]:
df.describe()
Out[19]:
trial impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) ca_ratio contact pressure (Pa) adhesive strength (Pa) impact force (N)
count 80.000000 80.000000 80.000000 80.000000 80.000000 80.000000 80.000000 80.000000 80.000000 80.000000 80.000000 80.000000 80.000000 80.000000
mean 2.400000 801.687500 39.062500 2.920375 -397.762500 1132.450000 1.444875 -0.187462 166.475000 61.400000 0.569000 6073.162500 -3005.875000 0.801687
std 1.164887 587.836143 19.558639 1.581092 228.675562 747.172695 0.659858 0.134058 98.696052 58.532821 0.341862 5515.265706 2525.468421 0.587836
min 1.000000 22.000000 6.000000 0.170000 -983.000000 189.000000 0.220000 -0.768000 19.000000 0.000000 0.010000 397.000000 -17652.000000 0.022000
25% 1.000000 456.000000 29.750000 1.470000 -567.750000 682.250000 0.990000 -0.277250 104.750000 16.750000 0.280000 2579.250000 -3443.250000 0.456000
50% 2.000000 601.000000 34.000000 3.030000 -335.000000 927.000000 1.320000 -0.165000 134.500000 43.000000 0.665000 4678.000000 -2185.500000 0.601000
75% 3.000000 1005.000000 42.000000 4.277500 -224.500000 1381.250000 1.772500 -0.081250 238.250000 92.500000 0.885000 7249.750000 -1736.000000 1.005000
max 5.000000 2641.000000 143.000000 6.490000 -92.000000 4251.000000 3.400000 -0.001000 455.000000 260.000000 1.000000 28641.000000 -678.000000 2.641000

Some values are meaningless, like the mean trial number, but much of the information is very useful.

Now, say we wanted to compute the mean impact force of each frog. We can combine Boolean indexing with looping to do that.

In [20]:
for frog in df['ID'].unique():
    inds = df['ID'] == frog
    mean_impact_force = df.loc[inds, 'impact force (mN)'].mean()
    print('{0:s}: {1:.1f} mN'.format(frog, mean_impact_force))
I: 1530.2 mN
II: 707.4 mN
III: 550.1 mN
IV: 419.1 mN

Notice that we have used the .unique() method to get the unique entries in the ID column. We will see next week that we can use Pandas's groupby() function to do this much more cleanly and efficiently, but for now, you can appreciate your ability to pull out what data you need and compute with it.
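As a preview of next week, the loop above can be collapsed into a single groupby operation. Here is a sketch on a toy DataFrame with made-up values, since the data file may not be at hand:

```python
import pandas as pd

# Toy stand-in for the frog data
df_toy = pd.DataFrame({'ID': ['I', 'I', 'II', 'II'],
                       'impact force (mN)': [1205.0, 2527.0, 556.0, 1928.0]})

# Mean impact force for each frog, computed in one line
mean_forces = df_toy.groupby('ID')['impact force (mN)'].mean()

print(mean_forces['I'])  # (1205 + 2527) / 2 = 1866.0
```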

Aggregation and transforms

The mean() method we used above performed an aggregation of the data frame. That is, it took the data in a column, which had 80 entries, and collapsed all of those data into a single number, in this case the mean. The describe() method applies many aggregating functions at once, and Pandas has many more built in. For example, we could use the median() method.

In [21]:
df['impact force (mN)'].median()
Out[21]:
601.0

We can also write our own. Let's say we want to compute the coefficient of variation (CoV, the standard deviation divided by the mean) of the impact forces. There is no built-in function to do this. We have to write our own function to compute the CoV and then use it with the agg() method of a DataFrame object. In the function below, the values of each column are denoted by data.

In [22]:
def coeff_of_var(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)

Now that we have the function (which uses Numpy's std() and mean() functions), we can use the DataFrame's agg() method to apply it. (Note that we slice out the column of interest using a list so that we get a DataFrame and not a Series. For the latter, agg() will apply the function to each entry, not across the entire column.)

In [23]:
df[['impact force (mN)']].agg(coeff_of_var)
Out[23]:
impact force (mN)    0.728651
dtype: float64

While an aggregating function collapses the data down to a single value, a transform operation returns the same number of data points. For example, we might want to rank the entries. Pandas's built-in rank() method does this. We will use the method='first' kwarg to specify how the function deals with ties.

In [24]:
df['impact force (mN)'].rank(method='first').astype(int)
Out[24]:
0     62
1     79
2     73
3     70
4     27
5     78
6     37
7     76
8     80
9     75
10    74
11    69
12    65
13    72
14    68
15    64
16    56
17    77
18    63
19    25
20    71
21    41
22    12
23    60
24    34
25    67
26    31
27    49
28    66
29    10
      ..
50    22
51    48
52    47
53    36
54    32
55    15
56    16
57    45
58    50
59    26
60     4
61     3
62     2
63    20
64    13
65     1
66    28
67     9
68    53
69    39
70     5
71     6
72    40
73    30
74    59
75    17
76    42
77    52
78    46
79    24
Name: impact force (mN), Length: 80, dtype: int64

We can also apply custom transforms. Say we wanted to compute a residual for each data point.

In [25]:
def resid(data):
    """Compute residual for each entry in an array of data."""
    return (data - np.mean(data)) / np.std(data)

(df[['impact force (mN)']].transform(resid)
                          .rename(columns={'impact force (mN)': 'impact force resid'}))
Out[25]:
impact force resid
0 0.690426
1 2.953541
2 1.614845
3 1.291298
4 -0.528438
5 2.523857
6 -0.420589
7 1.928120
8 3.148696
9 1.875052
10 1.864780
11 1.272467
12 0.865038
13 1.524115
14 1.269043
15 0.822241
16 -0.045686
17 2.106156
18 0.750342
19 -0.562676
20 1.387164
21 -0.336707
22 -0.812611
23 0.247047
24 -0.446268
25 1.262196
26 -0.466810
27 -0.297333
28 1.114974
29 -0.863968
... ...
50 -0.588354
51 -0.300757
52 -0.309317
53 -0.441132
54 -0.456539
55 -0.713322
56 -0.685932
57 -0.321300
58 -0.233994
59 -0.536998
60 -1.077954
61 -1.129311
62 -1.309059
63 -0.596914
64 -0.764679
65 -1.334737
66 -0.513031
67 -0.905053
68 -0.139840
69 -0.376080
70 -1.033445
71 -1.033445
72 -0.350402
73 -0.489065
74 0.022789
75 -0.684220
76 -0.336707
77 -0.155247
78 -0.321300
79 -0.571235

80 rows × 1 columns

Creating a DataFrame from scratch

Looking back at the header of the original data file, we see that there is information present in the header that we would like to have in our DataFrame. For example, it would be nice to know if each strike came from an adult or juvenile. Or what the snout-vent length was. Going toward the goal of including this in our DataFrame, we will first construct a new DataFrame containing information about each frog.

One way to create this new DataFrame is to first construct a dictionary with the respective fields.

In [26]:
data_dict = {'ID': ['I', 'II', 'III', 'IV'],
             'age': ['adult', 'adult', 'juvenile', 'juvenile'],
             'SVL (mm)': [63, 70, 28, 31],
             'weight (g)': [63.1, 72.7, 12.7, 12.7],
             'species': ['cross', 'cross', 'cranwelli', 'cranwelli']}

Now that we have this dictionary, we can convert it into a DataFrame by instantiating the pd.DataFrame class with it, using the data kwarg.

In [27]:
# Make it into a DataFrame
df_frog_info = pd.DataFrame(data=data_dict)

# Take a look
df_frog_info
Out[27]:
ID age SVL (mm) weight (g) species
0 I adult 63 63.1 cross
1 II adult 70 72.7 cross
2 III juvenile 28 12.7 cranwelli
3 IV juvenile 31 12.7 cranwelli

Nice!

Sometimes the data sets are not small enough to construct a dictionary by hand. Oftentimes, we have a two-dimensional array of data that we want to make into a DataFrame. As an example, let's say we have a Numpy array that has two columns, one for the snout-vent length and one for the weight.

In [28]:
data = np.array([[63, 70, 28, 31], [63.1, 72.7, 12.7, 12.7]]).transpose()

# Verify that it's what we think it is
data
Out[28]:
array([[63. , 63.1],
       [70. , 72.7],
       [28. , 12.7],
       [31. , 12.7]])

To make this into a DataFrame, we again create a pd.DataFrame instance, but this time we also specify the columns keyword argument.

In [29]:
df_demo = pd.DataFrame(data=data, columns=['SVL (mm)', 'weight (g)'])

# Take a look
df_demo
Out[29]:
SVL (mm) weight (g)
0 63.0 63.1
1 70.0 72.7
2 28.0 12.7
3 31.0 12.7

That also works. Generally, any two-dimensional Numpy array can be converted into a DataFrame in this way. You just need to supply column names.

Merging DataFrames

Our ultimate goal was to add that information to the main DataFrame, df, that we have been working with. We can do it using tools we have already learned. For each row in the DataFrame, we can add the relevant value in each column. Because this will not be the final way I recommend doing this, I will do these operations on a copy of df using the copy() method.

In [30]:
# Make a copy of df
df_copy = df.copy()

# Build each column
for col in df_frog_info.columns[df_frog_info.columns != 'ID']:
    # Make a new column with empty values
    df_copy[col] = np.empty(len(df_copy))
    
    # Add in each entry, row by row
    for i, r in df_copy.iterrows():
        ind = df_frog_info['ID'] == r['ID']
        df_copy.loc[i, col] = df_frog_info.loc[ind, col].iloc[0]
        
# Take a look at the updated DataFrame
df_copy.head()
Out[30]:
date ID trial impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) ca_ratio contact pressure (Pa) adhesive strength (Pa) impact force (N) age SVL (mm) weight (g) species
0 2013_02_26 I 3 1205.0 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030 1.205 adult 63.0 63.1 cross
1 2013_02_26 I 4 2527.0 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695 2.527 adult 63.0 63.1 cross
2 2013_03_01 I 1 1745.0 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239 1.745 adult 63.0 63.1 cross
3 2013_03_01 I 2 1556.0 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381 1.556 adult 63.0 63.1 cross
4 2013_03_01 I 3 493.0 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975 0.493 adult 63.0 63.1 cross

Note that I used the iterrows() method of the df_copy data frame. This iterator gives an index (which I called i) and a row of a DataFrame (which I called r). This method, and the analogous one for iterating over columns, iteritems(), can be useful.

But this approach seems rather clunky. A much better way is to use Pandas's built-in merge() method. Called with all the default keyword arguments, this method finds the common columns between two DataFrames (in this case, the ID column) and uses those columns to merge them, filling in values that match in the common column. This is exactly what we want.

In [31]:
df = df.merge(df_frog_info)

# Check it out!
df.head()
Out[31]:
date ID trial impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) ca_ratio contact pressure (Pa) adhesive strength (Pa) impact force (N) age SVL (mm) weight (g) species
0 2013_02_26 I 3 1205.0 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030 1.205 adult 63 63.1 cross
1 2013_02_26 I 4 2527.0 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695 2.527 adult 63 63.1 cross
2 2013_03_01 I 1 1745.0 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239 1.745 adult 63 63.1 cross
3 2013_03_01 I 2 1556.0 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381 1.556 adult 63 63.1 cross
4 2013_03_01 I 3 493.0 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975 0.493 adult 63 63.1 cross

Boo-yah! I think this example of merging DataFrames highlights the power of using them in your data analysis. It also brings up an important point. When you have to perform operations on data frames, you can often "brute force" it with loops, etc. But if what you are trying to do seems like something a data analyst would frequently encounter, there is a good chance it is already built into Pandas, and you should ask Google how to do it.
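For the record, merge() also lets you name the merge column explicitly with the on kwarg (and left_on/right_on when the column names differ between the two DataFrames). A sketch on toy DataFrames with made-up values:

```python
import pandas as pd

# Toy per-strike and per-frog tables sharing an ID column
df_strikes = pd.DataFrame({'ID': ['I', 'II', 'I'],
                           'impact force (mN)': [1205, 556, 2527]})
df_info = pd.DataFrame({'ID': ['I', 'II'],
                        'age': ['adult', 'juvenile']})

# Merge explicitly on the ID column; each strike row gains the frog's age
df_merged = df_strikes.merge(df_info, on='ID')

print(len(df_merged))
```

Being explicit about the merge key is a good habit once DataFrames share more than one column name.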

Computing environment

In [1]:
%load_ext watermark
In [2]:
%watermark -v -p numpy,pandas,jupyterlab
CPython 3.7.0
IPython 6.5.0

numpy 1.15.1
pandas 0.23.4
jupyterlab 0.34.9