Merging and concatenating data frames
[2]:
import numpy as np
import pandas as pd
import iqplot
import bokeh.io
bokeh.io.output_notebook()
It often happens that experiments consist of multiple data files that need to be brought together into a single data frame to work with in exploratory data analysis and subsequent analyses. Through its concatenation and merging capabilities, Pandas provides powerful tools for handling this sort of data.
The frog tongue strike data set
As usual, we will work with a real data set to learn about concatenation and merging of data frames. The data set we will use comes from a fun paper about the adhesive properties of frog tongues. The reference is Kleinteich and Gorb, Tongue adhesion in the horned frog Ceratophrys sp., Sci. Rep., 4, 5225, 2014. You might also want to check out a New York Times feature on the paper here.
In this paper, the authors investigated various properties of the adhesive characteristics of the tongues of horned frogs when they strike prey. The authors had a striking pad connected to a cantilever to measure forces. They also used high speed cameras to capture the strike and record relevant data.
To get an idea of the experimental set up, you can check out this movie, kindly sent to me by Thomas Kleinteich. If video does not play in your browser, you may download it here.
The data files
I pulled data files from the Kleinteich and Gorb paper. You can download the data files here: https://s3.amazonaws.com/bebi103.caltech.edu/data/frog_strikes.zip.
There are four files, one for each of the four frogs, labeled with IDs I, II, III, and IV, that were studied. To see the format of the files, we can look at the content of the file for frog I. You can use
head -n 20 ../data/frog_tongue_adhesion.csv
from the command line. Here is the content of the first data file.
# These data are from Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
# Frog ID: I
# Age: adult
# Snout-vent-length (SVL): 63 mm
# Body weight: 63.1 g
# Species: Ceratophrys cranwelli crossed with Ceratophrys cornuta
date,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),contact area with mucus / contact area without mucus,contact pressure (Pa),adhesive strength (Pa)
2013_02_26,3,1205,46,1.95,-785,884,1.27,-0.290,387,70,0.82,3117,-2030
2013_02_26,4,2527,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695
2013_03_01,1,1745,34,2.82,-850,211,1.37,-0.157,83,79,0.05,21020,-10239
2013_03_01,2,1556,41,2.51,-455,1025,0.74,-0.170,330,158,0.52,4718,-1381
2013_03_01,3,493,36,0.80,-974,499,1.57,-0.423,245,216,0.12,2012,-3975
2013_03_01,4,2276,31,3.68,-592,969,0.96,-0.176,341,106,0.69,6676,-1737
2013_03_05,1,556,43,0.90,-512,835,0.83,-0.285,359,110,0.69,1550,-1427
2013_03_05,2,1928,46,3.11,-804,508,1.30,-0.285,246,178,0.28,7832,-3266
2013_03_05,3,2641,50,4.27,-690,491,1.12,-0.239,269,224,0.17,9824,-2568
2013_03_05,4,1897,41,3.06,-462,839,0.75,-0.328,266,176,0.34,7122,-1733
2013_03_12,1,1891,40,3.06,-766,1069,1.24,-0.380,408,33,0.92,4638,-1879
2013_03_12,2,1545,48,2.50,-715,649,1.15,-0.298,141,112,0.21,10947,-5064
2013_03_12,3,1307,29,2.11,-613,1845,0.99,-0.768,455,92,0.80,2874,-1348
2013_03_12,4,1692,31,2.73,-677,917,1.09,-0.457,186,129,0.31,9089,-3636
2013_03_12,5,1543,38,2.49,-528,750,0.85,-0.353,153,148,0.03,10095,-3453
2013_03_15,1,1282,31,2.07,-452,785,0.73,-0.253,290,105,0.64,4419,-1557
2013_03_15,2,775,34,1.25,-430,837,0.70,-0.276,257,124,0.52,3019,-1677
2013_03_15,3,2032,60,3.28,-652,486,1.05,-0.257,147,134,0.09,13784,-4425
2013_03_15,4,1240,34,2.00,-692,906,1.12,-0.317,364,260,0.28,3406,-1901
2013_03_15,5,473,40,0.76,-536,1218,0.87,-0.382,259,168,0.35,1830,-2073
The first lines all begin with #
signs, signifying that they are comments. They do give important information about the frog, though.
The first line after the comments are the headers, giving the column names for the data frame we will load.
Concatenating data frames
We would like to have all of the data frames be together in one data frame so we can conveniently do things like make plots comparing the four frogs. Let’s read in the data sets and make a list of data frames.
[3]:
# On a local machine, we would do this: fnames = glob.glob('../data/frog_strikes_*.csv')
# But for Colab compatibility, we will do it by hand
fnames = [
os.path.join(data_path, f"frog_strikes_{frog_id}.csv")
for frog_id in ["I", "II", "III", "IV"]
]
dfs = [pd.read_csv(f, comment="#") for f in fnames]
# Take a look at first data frame
dfs[0].head()
[3]:
date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013_02_26 | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.290 | 387 | 70 | 0.82 | 3117 | -2030 |
1 | 2013_02_26 | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 |
2 | 2013_03_01 | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 |
3 | 2013_03_01 | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.170 | 330 | 158 | 0.52 | 4718 | -1381 |
4 | 2013_03_01 | 3 | 493 | 36 | 0.80 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 |
We have successfully loaded in all of the data frames. They all have the same columns (as given by the CSV files) and they all have the same indexes (range indexes that were applied be default when loading from the CSV files). We do not really care about the indexes. So, we wish to tape the data frames together vertically. We can use the pd.concat()
function to do this.
Before we do that, though, we might notice a problem. We will not have information to tell us which frog is which. We might therefore like to add a column to each data frame that has the frog ID, and then concatenate them. We can parse the ID of the frog from the file name, as we can see by looking at the file names.
[4]:
fnames
[4]:
['../data/frog_strikes_I.csv',
'../data/frog_strikes_II.csv',
'../data/frog_strikes_III.csv',
'../data/frog_strikes_IV.csv']
So, for each data frame/file name pair, we extract the Roman numeral and add a column to the data frame containing the frog ID.
[5]:
for i, f in enumerate(fnames):
frog_id = f[f.rfind('_')+1:f.rfind('.')]
dfs[i]['ID'] = frog_id
# Take a look
dfs[0].head()
[5]:
date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013_02_26 | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.290 | 387 | 70 | 0.82 | 3117 | -2030 | I |
1 | 2013_02_26 | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 | I |
2 | 2013_03_01 | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 | I |
3 | 2013_03_01 | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.170 | 330 | 158 | 0.52 | 4718 | -1381 | I |
4 | 2013_03_01 | 3 | 493 | 36 | 0.80 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 | I |
Good! Now all data frames have an 'ID'
column, and we can concatenate. The pd.concat()
function takes as input a list of data frames to be concatenated. Since we do not care about the index, we can use the ignore_index=True
kwarg.
[6]:
df = pd.concat(dfs, ignore_index=True)
# Make sure we got them all
print('Number of rows:', len(df), '\nUnique IDs:', df['ID'].unique())
Number of rows: 80
Unique IDs: ['I' 'II' 'III' 'IV']
Check!
More advanced concatenation
When we concatenated, we updated each data frame with a fresh column. The pd.concat()
function can handle some of this for you. If we instead passed a dictionary of data frames instead of a list, it applies the keys to each data frame that is concatenated using a multiindex. First, we’ll read in the data frames as a dictionary of data frames instead of a list.
[7]:
# Make dictionary of data frames
dfs = {
f[f.rfind("_") + 1 : f.rfind(".")]: pd.read_csv(f, comment="#")
for i, f in enumerate(fnames)
}
# Verify that keys are in fact IDs
dfs.keys()
[7]:
dict_keys(['I', 'II', 'III', 'IV'])
Now, if we call pd.concat()
with dictionary input, we get a new data frame with a multiindex.
[8]:
df = pd.concat(dfs)
# Take a look
df.head()
[8]:
date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
I | 0 | 2013_02_26 | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.290 | 387 | 70 | 0.82 | 3117 | -2030 |
1 | 2013_02_26 | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 | |
2 | 2013_03_01 | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 | |
3 | 2013_03_01 | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.170 | 330 | 158 | 0.52 | 4718 | -1381 | |
4 | 2013_03_01 | 3 | 493 | 36 | 0.80 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 |
We have a multiindex for the rows, with the high level index being the ID and the low level index being the original index of the data frame that was concatenated. It is useful to give these indexes names so we can conveniently refer to them. We can do that by setting the df.index.names
property as
df.index.names = ['ID', 'original index']
We can instead specify a names
kwarg when we call pd.concat()
. This kwarg specifies the names of the resulting multiindex from the concatenation.
[9]:
df = pd.concat(dfs, names=['ID', 'original index'])
# Take a look
df.head()
[9]:
date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ID | original index | ||||||||||||||
I | 0 | 2013_02_26 | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.290 | 387 | 70 | 0.82 | 3117 | -2030 |
1 | 2013_02_26 | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 | |
2 | 2013_03_01 | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 | |
3 | 2013_03_01 | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.170 | 330 | 158 | 0.52 | 4718 | -1381 | |
4 | 2013_03_01 | 3 | 493 | 36 | 0.80 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 |
We conveniently have labeled indexes, and we can now make ID
a column in the data frame using the reset_index()
method.
[10]:
df = df.reset_index()
df.head()
[10]:
ID | original index | date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | I | 0 | 2013_02_26 | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.290 | 387 | 70 | 0.82 | 3117 | -2030 |
1 | I | 1 | 2013_02_26 | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 |
2 | I | 2 | 2013_03_01 | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 |
3 | I | 3 | 2013_03_01 | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.170 | 330 | 158 | 0.52 | 4718 | -1381 |
4 | I | 4 | 2013_03_01 | 3 | 493 | 36 | 0.80 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 |
We now have a default range index for the data frame that we do not care about. Because the original index was not informative either, we can delete that column if we like, but it is not really a burden to have an unused column laying around in a data set this small. Nonetheless, let’s blow it away.
[11]:
del df['original index']
df.head()
[11]:
ID | date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | I | 2013_02_26 | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.290 | 387 | 70 | 0.82 | 3117 | -2030 |
1 | I | 2013_02_26 | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 |
2 | I | 2013_03_01 | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 |
3 | I | 2013_03_01 | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.170 | 330 | 158 | 0.52 | 4718 | -1381 |
4 | I | 2013_03_01 | 3 | 493 | 36 | 0.80 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 |
We now have a nice, tidy data frame!
Creating a DataFrame from scratch
Looking back at the headers of the original data files, we see that there is information present in the header that we would like to have in our data frame. For example, it would be nice to know if each strike came from an adult or juvenile. Or what the snout-vent length was. Going toward the goal of including this in our data frame, we will first construct a new data frame containing information about each frog.
Data frames from dictionaries
One way do create this new data frame is to first construct a dictionary with the respective fields. Since these data sets are small, we can look at the files and make the dictionary by hand.
[12]:
data_dict = {
"ID": ["I", "II", "III", "IV"],
"age": ["adult", "adult", "juvenile", "juvenile"],
"SVL (mm)": [63, 70, 28, 31],
"body weight (g)": [63.1, 72.7, 12.7, 12.7],
"species": ["cross", "cross", "cranwelli", "cranwelli"],
}
Now that we have this dictionary, we can convert it into a DataFrame
by instantiating a pd.DataFrame
class with it, using the data
kwarg.
[13]:
# Make it into a DataFrame
df_frog_info = pd.DataFrame(data=data_dict)
# Take a look
df_frog_info
[13]:
ID | age | SVL (mm) | body weight (g) | species | |
---|---|---|---|---|---|
0 | I | adult | 63 | 63.1 | cross |
1 | II | adult | 70 | 72.7 | cross |
2 | III | juvenile | 28 | 12.7 | cranwelli |
3 | IV | juvenile | 31 | 12.7 | cranwelli |
Nice!
Data frames from numpy arrays
Sometimes the data sets are not small enough to construct a dictionary by hand. Oftentimes, we have a two-dimensional array of data that we want to make into a DataFrame
. As an example, let’s say we have a Numpy array where the first column is snout vent length and the second is weight.
[14]:
data = np.array([[63, 70, 28, 31], [63.1, 72.7, 12.7, 12.7]]).transpose()
# Verify that it's what we think it is
data
[14]:
array([[63. , 63.1],
[70. , 72.7],
[28. , 12.7],
[31. , 12.7]])
To make this into a DataFrame
, we again create pd.DataFrame
instance, but this time we also specify the column
keyword argument.
[15]:
df_demo = pd.DataFrame(data=data, columns=["SVL (mm)", "weight (g)"])
# Take a look
df_demo
[15]:
SVL (mm) | weight (g) | |
---|---|---|
0 | 63.0 | 63.1 |
1 | 70.0 | 72.7 |
2 | 28.0 | 12.7 |
3 | 31.0 | 12.7 |
That also works. Generally, any two-dimensional Numpy array can be converted into a DataFrame
in this way. You just need to supply column names.
Programmatically creating a data frame
Hand-entering data should be minimized. The information about each frog were hand-entered once by the experimenter. We should not hand-enter them again. We therefore should parse the comment lines of input files to get the pertinent information.
Note, though, that in the case of a single experiment with only four data sets, hand entering might be faster and indeed less error prone than doing it programmatically. We should definitely do it programmatically if we have a large number of data files or will ever do an experiment with the same file format again.
So, let’s programmatically parse the files. We start by writing a function to parse the metadata from a single file. Recall that the comment lines look like this:
# These data are from Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
# Frog ID: I
# Age: adult
# Snout-vent-length (SVL): 63 mm
# Body weight: 63.1 g
# Species: Ceratophrys cranwelli crossed with Ceratophrys cornuta
(The function below will not work with Colab because open()
does not work for files specified by a URL.)
[16]:
def parse_frog_metadata(fname):
with open(fname, 'r') as f:
# Citation line, ignore.
f.readline()
# Frog ID
line = f.readline()
frog_id = line[line.find(':')+1:].strip()
# Age
line = f.readline()
age = line[line.find(':')+1:].strip()
# SVL, assume units given as mm
line = f.readline()
svl = line[line.find(':')+1:line.rfind(' ')].strip()
# Body weight, assume units given as g
line = f.readline()
body_weight = line[line.find(':')+1:line.rfind(' ')].strip()
# Species (either cranwelli or cross)
line = f.readline()
species = line[line.find(':')+1:].strip()
if 'cross' in species:
species = 'cross'
else:
species = 'cranwelli'
return frog_id, age, svl, body_weight, species
Let’s take it for a spin.
[17]:
parse_frog_metadata(os.path.join(data_path, 'frog_strikes_I.csv'))
[17]:
('I', 'adult', '63', '63.1', 'cross')
Looks good! Now we can create a list of tuples to use as data for making a data frame.
[18]:
data = [parse_frog_metadata(f) for f in fnames]
# Take a look
data
[18]:
[('I', 'adult', '63', '63.1', 'cross'),
('II', 'adult', '70', '72.7', 'cross'),
('III', 'juvenile', '28', '12.7', 'cranwelli'),
('IV', 'juvenile', '31', '12.7', 'cranwelli')]
We now input this list of tuples, plus the column names, into pd.DataFrame()
, and we’ve got our data frame.
[19]:
df_frog_info = pd.DataFrame(
data=data,
columns=["ID", "age", "SVL (mm)", "body weight (g)", "species"]
)
# Take a look
df_frog_info
[19]:
ID | age | SVL (mm) | body weight (g) | species | |
---|---|---|---|---|---|
0 | I | adult | 63 | 63.1 | cross |
1 | II | adult | 70 | 72.7 | cross |
2 | III | juvenile | 28 | 12.7 | cranwelli |
3 | IV | juvenile | 31 | 12.7 | cranwelli |
Merging DataFrames
Our ultimate goal is to add the information about the frogs into our main data frame, df
, that we have been working with.
Brute force merge
We can do it using tools we have already learned. For each row in the DataFrame
, we can add the relevant value in each column. Because this will not be the final way I recommend doing this, I will do these operations on a copy of df
using the copy()
method.
[20]:
# Make a copy of df
df_copy = df.copy()
# Build each column
for col in df_frog_info.columns[df_frog_info.columns != 'ID']:
# Make a new column with empty values
df_copy[col] = np.empty(len(df_copy))
# Add in each entry, row by row
for i, r in df_copy.iterrows():
ind = df_frog_info['ID'] == r['ID']
df_copy.loc[i, col] = df_frog_info.loc[ind, col].iloc[0]
# Take a look at the updated DataFrame
df_copy.head()
[20]:
ID | date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | age | SVL (mm) | body weight (g) | species | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | I | 2013_02_26 | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.290 | 387 | 70 | 0.82 | 3117 | -2030 | adult | 63 | 63.1 | cross |
1 | I | 2013_02_26 | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 | adult | 63 | 63.1 | cross |
2 | I | 2013_03_01 | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 | adult | 63 | 63.1 | cross |
3 | I | 2013_03_01 | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.170 | 330 | 158 | 0.52 | 4718 | -1381 | adult | 63 | 63.1 | cross |
4 | I | 2013_03_01 | 3 | 493 | 36 | 0.80 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 | adult | 63 | 63.1 | cross |
Note that I used the iterrows()
method of the df_copy
data frame. This iterator gives an index (which I called i
) and a row of a data frame (which I called r
). This method, and the analogous one for iterating over columns, iteritems()
, can be useful.
But this approach seems rather clunky.
Using pd.merge()
There is a much better way to do it is to use Pandas’s built-in merge() method. Called with all the default keyword arguments, this function finds common columns between two DataFrame
s (in this case, there is just one, the ID
column), and then uses those columns to merge them, filling in values that match in the common columns. This is exactly what we want.
[21]:
df = df.merge(df_frog_info)
# Check it out!
df.head()
[21]:
ID | date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | age | SVL (mm) | body weight (g) | species | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | I | 2013_02_26 | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.290 | 387 | 70 | 0.82 | 3117 | -2030 | adult | 63 | 63.1 | cross |
1 | I | 2013_02_26 | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 | adult | 63 | 63.1 | cross |
2 | I | 2013_03_01 | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 | adult | 63 | 63.1 | cross |
3 | I | 2013_03_01 | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.170 | 330 | 158 | 0.52 | 4718 | -1381 | adult | 63 | 63.1 | cross |
4 | I | 2013_03_01 | 3 | 493 | 36 | 0.80 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 | adult | 63 | 63.1 | cross |
Note that the entries for the added columns were repeated appropriately, e.g., body weight column had 63 for every row corresponding to frog I.
I think this example of merging DataFrame
s highlights the power of using them in your data analysis. Note also that there are plenty of options for how merges are done, and you should consult the Pandas documentation.
This example also brings up an important point. When you have to perform operations on data frames, you can often “brute force” it with loops, etc. But if what you are trying to do seems like something a data analyst would frequently encounter, there is a good chance it’s already built-in to Pandas, and you should ask Google how to do it.
At long last, a plot!
While the purpose of this part of the lesson was to learn how to concatenate and merge data frames, going through all of that wrangling effort would somehow be unsatisfying if we we didn’t generate a plot. Let’s compare the impact force on a per-mass basis for each frog.
[22]:
p = iqplot.strip(
df,
q="impact force / body weight",
cats="ID",
color_column="age",
jitter=True,
x_axis_label="impact force / body weight (mN/g)",
y_axis_label="frog ID"
)
bokeh.io.show(p)
Apparently Frog III consistently packs a powerful punch, er…. tongue.
Computing environment
[23]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,iqplot,jupyterlab
Python implementation: CPython
Python version : 3.8.11
IPython version : 7.26.0
numpy : 1.20.3
pandas : 1.3.2
bokeh : 2.3.3
iqplot : 0.2.3
jupyterlab: 3.1.7