4. Reshaping data¶
This recitation was written by Shyam Saladi with minor edits by Justin Bois.
[1]:
import glob
import numpy as np
import pandas as pd
import scipy.stats as st
import holoviews as hv
hv.extension('bokeh')
import bebi103
bebi103.hv.set_defaults()
For the purposes of this material, it will be useful to have read through the first two sections of Lesson 4.
We’ll start by quickly reviewing some of the essentials of pd.DataFrames, including indexes and concatenation, and then move into split-apply-combine with groupby, apply, & transform and reshaping data with melt, stack, & unstack.
As with Lesson 4, we will be using the frog tongue strike data set (since it’s awesome) from a fun paper about the adhesive properties of frog tongues (Kleinteich and Gorb, 2014). In this work, the authors investigated the adhesive characteristics of the tongues of horned frogs when they strike prey. The authors had a striking pad connected to a cantilever to measure forces. They also used high-speed cameras to capture the strike and record relevant data.
The goal of this exercise is to first get a feel for the data set by reshaping it using stack, unstack, melt, and pivot, with a smattering of set_index and reset_index along the way. Finally, we’ll do some split-apply-combine with the long-format tidy data, working our way to an exploratory plot.
Loading the data¶
After downloading and unzipping the archive with the frog strike data, we need to load the data into a single data frame by concatenating the data from each CSV file:
[2]:
fnames = glob.glob('../data/frog_strikes/frog_strikes_*.csv')
def read_strike_csv(fn):
    """Read a single strike CSV and add the frog ID parsed from the file name."""
    df = pd.read_csv(fn, comment='#')
    # file names look like frog_strikes_II.csv, so the ID sits between
    # the last underscore and the extension
    df['frog_id'] = fn[fn.rfind('_')+1:fn.rfind('.')]
    return df
df = pd.concat([read_strike_csv(fn) for fn in fnames])
# Take a look at data frame
df.head(10)
[2]:
date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | frog_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013_03_19 | 1 | 1612 | 18 | 3.79 | -655 | 3087 | 1.54 | -0.385 | 348 | 15 | 0.96 | 4633 | -1881 | II |
1 | 2013_03_19 | 2 | 605 | 55 | 1.42 | -292 | 1261 | 0.69 | -0.294 | 248 | 24 | 0.90 | 2441 | -1177 | II |
2 | 2013_03_19 | 3 | 327 | 51 | 0.77 | -246 | 1508 | 0.58 | -0.282 | 130 | 34 | 0.74 | 2517 | -1894 | II |
3 | 2013_03_19 | 4 | 946 | 59 | 2.23 | -245 | 1841 | 0.58 | -0.340 | 106 | 26 | 0.76 | 8893 | -2301 | II |
4 | 2013_03_21 | 1 | 541 | 33 | 1.27 | -553 | 3126 | 1.30 | -0.432 | 276 | 16 | 0.94 | 1959 | -2004 | II |
5 | 2013_03_21 | 2 | 1539 | 43 | 3.62 | -664 | 741 | 1.56 | -0.046 | 85 | 24 | 0.72 | 18073 | -7802 | II |
6 | 2013_03_21 | 3 | 529 | 28 | 1.24 | -261 | 2482 | 0.61 | -0.414 | 325 | 33 | 0.90 | 1627 | -803 | II |
7 | 2013_03_21 | 4 | 628 | 31 | 1.48 | -691 | 998 | 1.63 | -0.071 | 242 | 67 | 0.72 | 2600 | -2860 | II |
8 | 2013_03_25 | 1 | 1453 | 72 | 3.42 | -92 | 1652 | 0.22 | -0.008 | 136 | 0 | 1.00 | 10645 | -678 | II |
9 | 2013_03_25 | 2 | 297 | 42 | 0.70 | -566 | 936 | 1.33 | -0.084 | 126 | 4 | 0.97 | 2367 | -4506 | II |
It looks like frogs were tested on various days and multiple times on each day.
If you were running this experiment, would you have tested all frogs on the same dates? What did Kleinteich and Gorb do? Were all four frogs tested on the same days? Let’s count the number of trials by date and frog_id to take a look.
[3]:
df.groupby(['date', 'frog_id'])['trial number'].count()
[3]:
date frog_id
2013_02_26 I 2
2013_03_01 I 4
2013_03_05 I 4
2013_03_12 I 5
2013_03_15 I 5
2013_03_19 II 4
2013_03_21 II 4
2013_03_25 II 4
2013_03_28 II 2
2013_04_03 II 2
2013_04_08 II 3
2013_04_12 II 1
2013_05_27 III 3
IV 3
2013_05_30 IV 3
2013_06_03 IV 1
2013_06_11 III 3
IV 4
2013_06_14 III 2
IV 1
2013_06_18 III 4
IV 4
2013_06_21 III 4
IV 4
2013_06_26 III 4
Name: trial number, dtype: int64
It looks like the experiment took place over the course of 4 months, with frog I tested initially and the others following.
Curiously, each date has a different number of trials/observations. A finer look at the observations across days and frogs would help us better understand the experimental design here.
Let’s try displaying the data for all dates, split up by frog_id and trial number.
To do this, we should move frog_id, trial number, and date from column values to row indices. We will also select a single column of values to focus on.
[4]:
df_force = df.set_index(['frog_id', 'trial number', 'date'])['impact force (mN)']
df_force.head()
[4]:
frog_id trial number date
II 1 2013_03_19 1612
2 2013_03_19 605
3 2013_03_19 327
4 2013_03_19 946
1 2013_03_21 541
Name: impact force (mN), dtype: int64
Unstacking/stacking indexes¶
Now, let’s pivot date to columns.
In general, the operation of moving row indexers to column indexers (or the other way around) is called “pivoting”. Pandas also has a pivot function, but in practice, stack and unstack are your friends. They can be easier to use since they only need an index to pivot on.
To be clear, stacking transforms the innermost column index into the innermost row index, with the cell values reshuffled accordingly. Unstacking performs the reverse operation, moving the innermost row index to the column indexes.
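To make the mechanics concrete, here is a minimal sketch on a toy data frame (the values are hypothetical, not from the frog data):
# a tiny frame with named row and column indexes
toy = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=pd.Index(['a', 'b'], name='row'),
    columns=pd.Index(['x', 'y'], name='col'),
)

# stack: 'col' moves from the columns to the innermost row index,
# yielding a series indexed by (row, col)
stacked = toy.stack()

# unstack: the reverse; 'col' moves back out to the columns
unstacked = stacked.unstack('col')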
Here we want to move frog_id and trial number from row indexers to column indexers. This corresponds to the unstack operation.
[5]:
df_force = df_force.unstack(['frog_id', 'trial number'])
# sort index and tidy for easy viewing
df_force = df_force.sort_index(axis=1).astype(str).replace('nan', '')
df_force
[5]:
frog_id | I | II | III | IV | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
trial number | 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 |
date | |||||||||||||||||
2013_02_26 | 1205.0 | 2527.0 | |||||||||||||||
2013_03_01 | 1745.0 | 1556.0 | 493.0 | 2276.0 | |||||||||||||
2013_03_05 | 556.0 | 1928.0 | 2641.0 | 1897.0 | |||||||||||||
2013_03_12 | 1891.0 | 1545.0 | 1307.0 | 1692.0 | 1543.0 | ||||||||||||
2013_03_15 | 1282.0 | 775.0 | 2032.0 | 1240.0 | 473.0 | ||||||||||||
2013_03_19 | 1612.0 | 605.0 | 327.0 | 946.0 | |||||||||||||
2013_03_21 | 541.0 | 1539.0 | 529.0 | 628.0 | |||||||||||||
2013_03_25 | 1453.0 | 297.0 | 703.0 | 269.0 | |||||||||||||
2013_03_28 | 751.0 | 245.0 | |||||||||||||||
2013_04_03 | 1182.0 | 515.0 | |||||||||||||||
2013_04_08 | 435.0 | 383.0 | 457.0 | ||||||||||||||
2013_04_12 | 730.0 | ||||||||||||||||
2013_05_27 | 614.0 | 414.0 | 324.0 | 172.0 | 142.0 | 37.0 | |||||||||||
2013_05_30 | 453.0 | 355.0 | 22.0 | ||||||||||||||
2013_06_03 | 502.0 | ||||||||||||||||
2013_06_11 | 776.0 | 611.0 | 544.0 | 273.0 | 720.0 | 582.0 | 198.0 | ||||||||||
2013_06_14 | 538.0 | 579.0 | 198.0 | ||||||||||||||
2013_06_18 | 806.0 | 459.0 | 458.0 | 626.0 | 597.0 | 516.0 | 815.0 | 402.0 | |||||||||
2013_06_21 | 621.0 | 544.0 | 535.0 | 385.0 | 605.0 | 711.0 | 614.0 | 468.0 | |||||||||
2013_06_26 | 401.0 | 614.0 | 665.0 | 488.0 |
I find it easier to compare groups of numerical observations vertically, so let’s transpose the axes.
[6]:
df_force.sort_index(axis=1).astype(str).replace('nan', '').transpose()
[6]:
date | 2013_02_26 | 2013_03_01 | 2013_03_05 | 2013_03_12 | 2013_03_15 | 2013_03_19 | 2013_03_21 | 2013_03_25 | 2013_03_28 | 2013_04_03 | 2013_04_08 | 2013_04_12 | 2013_05_27 | 2013_05_30 | 2013_06_03 | 2013_06_11 | 2013_06_14 | 2013_06_18 | 2013_06_21 | 2013_06_26 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
frog_id | trial number | ||||||||||||||||||||
I | 1 | 1745.0 | 556.0 | 1891.0 | 1282.0 | ||||||||||||||||
2 | 1556.0 | 1928.0 | 1545.0 | 775.0 | |||||||||||||||||
3 | 1205.0 | 493.0 | 2641.0 | 1307.0 | 2032.0 | ||||||||||||||||
4 | 2527.0 | 2276.0 | 1897.0 | 1692.0 | 1240.0 | ||||||||||||||||
5 | 1543.0 | 473.0 | |||||||||||||||||||
II | 1 | 1612.0 | 541.0 | 1453.0 | 751.0 | 1182.0 | 435.0 | 730.0 | |||||||||||||
2 | 605.0 | 1539.0 | 297.0 | 245.0 | 515.0 | 383.0 | |||||||||||||||
3 | 327.0 | 529.0 | 703.0 | 457.0 | |||||||||||||||||
4 | 946.0 | 628.0 | 269.0 | ||||||||||||||||||
III | 1 | 614.0 | 776.0 | 538.0 | 806.0 | 621.0 | 401.0 | ||||||||||||||
2 | 414.0 | 611.0 | 579.0 | 459.0 | 544.0 | 614.0 | |||||||||||||||
3 | 324.0 | 544.0 | 458.0 | 535.0 | 665.0 | ||||||||||||||||
4 | 626.0 | 385.0 | 488.0 | ||||||||||||||||||
IV | 1 | 453.0 | 502.0 | 273.0 | 198.0 | 597.0 | 605.0 | ||||||||||||||
2 | 172.0 | 355.0 | 720.0 | 516.0 | 711.0 | ||||||||||||||||
3 | 142.0 | 22.0 | 582.0 | 815.0 | 614.0 | ||||||||||||||||
4 | 37.0 | 198.0 | 402.0 | 468.0 |
Ah, it looks like the measurements were taken successively from frog I to IV across the 4 months. On some days, the first few trials aren’t recorded, and from day to day, different numbers of trials were performed (perhaps the frog had had enough science for the afternoon).
This data frame organizes frog_id and trial number into a MultiIndex on one axis and date onto the other. If we want to pivot trial number from a column indexer to a row indexer, stack is our friend.
[7]:
df_force_stack = df_force.stack('trial number')
# just show the first few rows
df_force_stack.head(10)
[7]:
frog_id | I | II | III | IV | |
---|---|---|---|---|---|
date | trial number | ||||
2013_02_26 | 1 | ||||
2 | |||||
3 | 1205.0 | ||||
4 | 2527.0 | ||||
5 | NaN | NaN | NaN | ||
2013_03_01 | 1 | 1745.0 | |||
2 | 1556.0 | ||||
3 | 493.0 | ||||
4 | 2276.0 | ||||
5 | NaN | NaN | NaN |
We see that NaN values are introduced after stacking, for example in the cell given by date 2013_02_26, trial 5, and frog_id II. Looking closely at the previous data frame df_force, only frog I has observations with a trial value of 5. When stacking, each of the column indexer values being pivoted is nested under each row indexer, and where values are missing in the originating data frame, NaN values are introduced.
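As an aside, DataFrame.stack has a dropna argument (True by default) that silently drops rows that are entirely missing after stacking. A minimal sketch on a numeric version of df_force (i.e., built without the cosmetic astype(str) step above):
df_force_numeric = (
    df.set_index(['frog_id', 'trial number', 'date'])['impact force (mN)']
      .unstack(['frog_id', 'trial number'])
)

# dropna=False keeps (date, trial) rows that are all-NaN, making the
# introduced missing values explicit; the default quietly drops them
df_force_numeric.stack('trial number', dropna=False)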
Melting a data frame¶
Looking back at the original data frame again, it might be useful to separate the measured quantities from those that are calculated (from two or more measured values). Conveniently, the columns with measured data have units specified in parentheses, and those with calculated data do not. Keeping this in mind, it will be useful to add a label to each value indicating whether it’s measured or calculated.
One way to do this is to convert the column indexer to a MultiIndex with two levels: is_measured, indicating the type of variable, and the variable name itself:
[8]:
# label each column by whether its name contains a parenthesis (i.e., has units)
col_names = [('(' in c, c) for c in df.columns]
# copy dataframe to leave initial dataframe unchanged
df2 = df.copy()
df2.columns = pd.MultiIndex.from_tuples(col_names)
df2.columns.names = ['is_measured', 'variable']
df2.sort_index(axis=1).head()
[8]:
is_measured | False | True | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
variable | adhesive force / body weight | contact area with mucus / contact area without mucus | date | frog_id | impact force / body weight | trial number | adhesive force (mN) | adhesive impulse (N-s) | adhesive strength (Pa) | contact area without mucus (mm2) | contact pressure (Pa) | impact force (mN) | impact time (ms) | time frog pulls on target (ms) | total contact area (mm2) |
0 | 1.54 | 0.96 | 2013_03_19 | II | 3.79 | 1 | -655 | -0.385 | -1881 | 15 | 4633 | 1612 | 18 | 3087 | 348 |
1 | 0.69 | 0.90 | 2013_03_19 | II | 1.42 | 2 | -292 | -0.294 | -1177 | 24 | 2441 | 605 | 55 | 1261 | 248 |
2 | 0.58 | 0.74 | 2013_03_19 | II | 0.77 | 3 | -246 | -0.282 | -1894 | 34 | 2517 | 327 | 51 | 1508 | 130 |
3 | 0.58 | 0.76 | 2013_03_19 | II | 2.23 | 4 | -245 | -0.340 | -2301 | 26 | 8893 | 946 | 59 | 1841 | 106 |
4 | 1.30 | 0.94 | 2013_03_21 | II | 1.27 | 1 | -553 | -0.432 | -2004 | 16 | 1959 | 541 | 33 | 3126 | 276 |
It’s somewhat odd to have date, trial number, and frog_id under is_measured False. Let’s first set them as row indexers and then organize the columns:
[9]:
# copy dataframe to leave initial dataframe unchanged
df2 = df.set_index(['frog_id', 'date', 'trial number'])
col_names = [('(' in c, c) for c in df2.columns]
df2.columns = pd.MultiIndex.from_tuples(col_names)
df2.columns.names = ['is_measured', 'variable']
df2.sort_index(axis=1).head()
[9]:
is_measured | False | True | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
variable | adhesive force / body weight | contact area with mucus / contact area without mucus | impact force / body weight | adhesive force (mN) | adhesive impulse (N-s) | adhesive strength (Pa) | contact area without mucus (mm2) | contact pressure (Pa) | impact force (mN) | impact time (ms) | time frog pulls on target (ms) | total contact area (mm2) | ||
frog_id | date | trial number | ||||||||||||
II | 2013_03_19 | 1 | 1.54 | 0.96 | 3.79 | -655 | -0.385 | -1881 | 15 | 4633 | 1612 | 18 | 3087 | 348 |
2 | 0.69 | 0.90 | 1.42 | -292 | -0.294 | -1177 | 24 | 2441 | 605 | 55 | 1261 | 248 | ||
3 | 0.58 | 0.74 | 0.77 | -246 | -0.282 | -1894 | 34 | 2517 | 327 | 51 | 1508 | 130 | ||
4 | 0.58 | 0.76 | 2.23 | -245 | -0.340 | -2301 | 26 | 8893 | 946 | 59 | 1841 | 106 | ||
2013_03_21 | 1 | 1.30 | 0.94 | 1.27 | -553 | -0.432 | -2004 | 16 | 1959 | 541 | 33 | 3126 | 276 |
As earlier, let’s try to stack one of the levels of the column indexer:
[10]:
df2.stack('is_measured').head(20)
[10]:
variable | adhesive force (mN) | adhesive force / body weight | adhesive impulse (N-s) | adhesive strength (Pa) | contact area with mucus / contact area without mucus | contact area without mucus (mm2) | contact pressure (Pa) | impact force (mN) | impact force / body weight | impact time (ms) | time frog pulls on target (ms) | total contact area (mm2) | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
frog_id | date | trial number | is_measured | ||||||||||||
II | 2013_03_19 | 1 | False | NaN | 1.54 | NaN | NaN | 0.96 | NaN | NaN | NaN | 3.79 | NaN | NaN | NaN |
True | -655.0 | NaN | -0.385 | -1881.0 | NaN | 15.0 | 4633.0 | 1612.0 | NaN | 18.0 | 3087.0 | 348.0 | |||
2 | False | NaN | 0.69 | NaN | NaN | 0.90 | NaN | NaN | NaN | 1.42 | NaN | NaN | NaN | ||
True | -292.0 | NaN | -0.294 | -1177.0 | NaN | 24.0 | 2441.0 | 605.0 | NaN | 55.0 | 1261.0 | 248.0 | |||
3 | False | NaN | 0.58 | NaN | NaN | 0.74 | NaN | NaN | NaN | 0.77 | NaN | NaN | NaN | ||
True | -246.0 | NaN | -0.282 | -1894.0 | NaN | 34.0 | 2517.0 | 327.0 | NaN | 51.0 | 1508.0 | 130.0 | |||
4 | False | NaN | 0.58 | NaN | NaN | 0.76 | NaN | NaN | NaN | 2.23 | NaN | NaN | NaN | ||
True | -245.0 | NaN | -0.340 | -2301.0 | NaN | 26.0 | 8893.0 | 946.0 | NaN | 59.0 | 1841.0 | 106.0 | |||
2013_03_21 | 1 | False | NaN | 1.30 | NaN | NaN | 0.94 | NaN | NaN | NaN | 1.27 | NaN | NaN | NaN | |
True | -553.0 | NaN | -0.432 | -2004.0 | NaN | 16.0 | 1959.0 | 541.0 | NaN | 33.0 | 3126.0 | 276.0 | |||
2 | False | NaN | 1.56 | NaN | NaN | 0.72 | NaN | NaN | NaN | 3.62 | NaN | NaN | NaN | ||
True | -664.0 | NaN | -0.046 | -7802.0 | NaN | 24.0 | 18073.0 | 1539.0 | NaN | 43.0 | 741.0 | 85.0 | |||
3 | False | NaN | 0.61 | NaN | NaN | 0.90 | NaN | NaN | NaN | 1.24 | NaN | NaN | NaN | ||
True | -261.0 | NaN | -0.414 | -803.0 | NaN | 33.0 | 1627.0 | 529.0 | NaN | 28.0 | 2482.0 | 325.0 | |||
4 | False | NaN | 1.63 | NaN | NaN | 0.72 | NaN | NaN | NaN | 1.48 | NaN | NaN | NaN | ||
True | -691.0 | NaN | -0.071 | -2860.0 | NaN | 67.0 | 2600.0 | 628.0 | NaN | 31.0 | 998.0 | 242.0 | |||
2013_03_25 | 1 | False | NaN | 0.22 | NaN | NaN | 1.00 | NaN | NaN | NaN | 3.42 | NaN | NaN | NaN | |
True | -92.0 | NaN | -0.008 | -678.0 | NaN | 0.0 | 10645.0 | 1453.0 | NaN | 72.0 | 1652.0 | 136.0 | |||
2 | False | NaN | 1.33 | NaN | NaN | 0.97 | NaN | NaN | NaN | 0.70 | NaN | NaN | NaN | ||
True | -566.0 | NaN | -0.084 | -4506.0 | NaN | 4.0 | 2367.0 | 297.0 | NaN | 42.0 | 936.0 | 126.0 |
You will notice that NaN values are added wholesale. What’s happened here?
Well, each column holds either a measured or a calculated variable. Thus, when we stack by moving is_measured from a column indexer to a row indexer, rows with is_measured = True have NaN values in the columns holding calculated values, and rows with is_measured = False have NaN values in the columns holding directly measured values.
This is an often-encountered subtlety of unstacking and stacking data. In some ways it is also a feature: pandas won’t add data where it does not exist!
Tidying data with pd.melt¶
Thinking about the original data frame from the concatenated CSV files, one might consider it tidy, having one “tongue impact” observation in each row. Another point of view considers each recorded value as an observation, with the name of the variable as part of the metadata.
Let’s carry out this transformation. Then let’s also try to identify and separate the measured variables from the calculated ones, as done above. This time you might find the process a bit more slick.
To turn this wide data into a long, tidy format, we should identify the columns that identify observations, i.e., date, trial number, and frog_id. When pivoting the data, these columns will stay associated with the values. All other column headings will become values in rows. Think of it as similar to the stack operation worked out above.
[11]:
df_long = df.melt(id_vars=['date', 'trial number', 'frog_id'])
# identify measured data by the presence of a parenthesis
df_long['is_measured'] = df_long['variable'].str.contains(r'\(')
df_long.head()
[11]:
date | trial number | frog_id | variable | value | is_measured | |
---|---|---|---|---|---|---|
0 | 2013_03_19 | 1 | II | impact force (mN) | 1612.0 | True |
1 | 2013_03_19 | 2 | II | impact force (mN) | 605.0 | True |
2 | 2013_03_19 | 3 | II | impact force (mN) | 327.0 | True |
3 | 2013_03_19 | 4 | II | impact force (mN) | 946.0 | True |
4 | 2013_03_21 | 1 | II | impact force (mN) | 541.0 | True |
If you’re wondering why we’ve written .str.contains(r'\(') instead of .str.contains('('), take a peek at the documentation for pd.Series.str.contains: the pattern is treated as a regular expression by default, and a lone parenthesis is invalid regex. Hint: We could have also written .str.contains('(', regex=False).
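A quick sketch of the two equivalent spellings, both of which give the same boolean mask:
df_long['variable'].str.contains(r'\(')             # escape the regex metacharacter
df_long['variable'].str.contains('(', regex=False)  # or match the literal string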
Filtering for those values that are the direct result of an experimental measurement is done like so:
[12]:
df_long_direct = df_long[df_long['is_measured']].copy()
Note: .copy() helps avoid bugs, since filtering a pd.DataFrame may return a view rather than a copy. More detail here.
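A minimal sketch of the pitfall (the flag column is hypothetical, added only for illustration):
# filtering may return a view or a copy; pandas cannot always tell which,
# so assigning into the result can raise a SettingWithCopyWarning
df_maybe_view = df_long[df_long['is_measured']]
# df_maybe_view['flag'] = True  # may warn; the effect on df_long is ambiguous

# .copy() makes the filtered frame independent, so assignment is safe
df_safe = df_long[df_long['is_measured']].copy()
df_safe['flag'] = True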
To transform the values back to a wide format, as in the original data frame (though with only the directly measured quantities), we’ll need to do another pivot. Setting the indexes is the first step: date, trial number, frog_id, and variable.
[13]:
df_direct = df_long_direct.set_index(['date', 'trial number', 'frog_id', 'variable'])
df_direct.head(10)
[13]:
value | is_measured | ||||
---|---|---|---|---|---|
date | trial number | frog_id | variable | ||
2013_03_19 | 1 | II | impact force (mN) | 1612.0 | True |
2 | II | impact force (mN) | 605.0 | True | |
3 | II | impact force (mN) | 327.0 | True | |
4 | II | impact force (mN) | 946.0 | True | |
2013_03_21 | 1 | II | impact force (mN) | 541.0 | True |
2 | II | impact force (mN) | 1539.0 | True | |
3 | II | impact force (mN) | 529.0 | True | |
4 | II | impact force (mN) | 628.0 | True | |
2013_03_25 | 1 | II | impact force (mN) | 1453.0 | True |
2 | II | impact force (mN) | 297.0 | True |
Now the index to pivot on should be clear: variable. Only the column value needs to be kept, since is_measured is all True. To move a row indexer to a column indexer (i.e., going long to wide), we use unstack like so:
[14]:
df_wide_direct = df_direct['value'].unstack('variable')
df_wide_direct.head(10)
[14]:
variable | adhesive force (mN) | adhesive impulse (N-s) | adhesive strength (Pa) | contact area without mucus (mm2) | contact pressure (Pa) | impact force (mN) | impact time (ms) | time frog pulls on target (ms) | total contact area (mm2) | ||
---|---|---|---|---|---|---|---|---|---|---|---|
date | trial number | frog_id | |||||||||
2013_02_26 | 3 | I | -785.0 | -0.290 | -2030.0 | 70.0 | 3117.0 | 1205.0 | 46.0 | 884.0 | 387.0 |
4 | I | -983.0 | -0.181 | -9695.0 | 94.0 | 24923.0 | 2527.0 | 44.0 | 248.0 | 101.0 | |
2013_03_01 | 1 | I | -850.0 | -0.157 | -10239.0 | 79.0 | 21020.0 | 1745.0 | 34.0 | 211.0 | 83.0 |
2 | I | -455.0 | -0.170 | -1381.0 | 158.0 | 4718.0 | 1556.0 | 41.0 | 1025.0 | 330.0 | |
3 | I | -974.0 | -0.423 | -3975.0 | 216.0 | 2012.0 | 493.0 | 36.0 | 499.0 | 245.0 | |
4 | I | -592.0 | -0.176 | -1737.0 | 106.0 | 6676.0 | 2276.0 | 31.0 | 969.0 | 341.0 | |
2013_03_05 | 1 | I | -512.0 | -0.285 | -1427.0 | 110.0 | 1550.0 | 556.0 | 43.0 | 835.0 | 359.0 |
2 | I | -804.0 | -0.285 | -3266.0 | 178.0 | 7832.0 | 1928.0 | 46.0 | 508.0 | 246.0 | |
3 | I | -690.0 | -0.239 | -2568.0 | 224.0 | 9824.0 | 2641.0 | 50.0 | 491.0 | 269.0 | |
4 | I | -462.0 | -0.328 | -1733.0 | 176.0 | 7122.0 | 1897.0 | 41.0 | 839.0 | 266.0 |
As with pd.melt, you’ll also see the long-to-wide operation done with pd.pivot. Check this out:
[15]:
df_wide_direct = df_long_direct.set_index(['date', 'trial number', 'frog_id'])
# drop unneeded column
df_wide_direct = df_wide_direct.drop(columns=['is_measured'])
df_wide_direct.pivot(columns='variable').head()
[15]:
value | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
variable | adhesive force (mN) | adhesive impulse (N-s) | adhesive strength (Pa) | contact area without mucus (mm2) | contact pressure (Pa) | impact force (mN) | impact time (ms) | time frog pulls on target (ms) | total contact area (mm2) | ||
date | trial number | frog_id | |||||||||
2013_02_26 | 3 | I | -785.0 | -0.290 | -2030.0 | 70.0 | 3117.0 | 1205.0 | 46.0 | 884.0 | 387.0 |
4 | I | -983.0 | -0.181 | -9695.0 | 94.0 | 24923.0 | 2527.0 | 44.0 | 248.0 | 101.0 | |
2013_03_01 | 1 | I | -850.0 | -0.157 | -10239.0 | 79.0 | 21020.0 | 1745.0 | 34.0 | 211.0 | 83.0 |
2 | I | -455.0 | -0.170 | -1381.0 | 158.0 | 4718.0 | 1556.0 | 41.0 | 1025.0 | 330.0 | |
3 | I | -974.0 | -0.423 | -3975.0 | 216.0 | 2012.0 | 493.0 | 36.0 | 499.0 | 245.0 |
Reshaping advice¶
The approach of setting indexes and then pivoting with stack or unstack is our preferred way of going from long to wide and wide to long. It helps you be explicit and avoid bugs in the process. pd.melt is convenient, and pd.pivot at times, too, but getting familiar with indexes, stack, and unstack will go a long way in reshaping and transforming your data.
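As a compact illustration of the idiom, here is a sketch of the full round trip on the frog data:
# wide: name the identifying indexes explicitly, then unstack one of them
wide = (
    df.set_index(['frog_id', 'trial number', 'date'])['impact force (mN)']
      .unstack('frog_id')
)

# and back to long: stack the column indexer and restore a flat frame
long_again = wide.stack('frog_id').reset_index(name='impact force (mN)')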
You may also come across the function pd.pivot_table. We don’t use it too often, since values identified by the same index positions are aggregated by default. In practice, it’s better to perform this aggregation explicitly once your data is tidied in a long format (think split-apply-combine). Take a look at the docs for pd.pivot_table to learn more.
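To see that default aggregation concretely, here is a minimal sketch: each (date, frog_id) cell covers several trials, which pivot_table silently collapses with its default aggfunc, the mean.
# mean impact force per frog per date; trials sharing a cell are averaged
pd.pivot_table(
    df,
    values='impact force (mN)',
    index='date',
    columns='frog_id',
)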
Tidying your tidy data¶
With the data molded into a tidy format (having covered the various manipulations along the way), we can process groups of observations and variables in uniform ways. The general strategy is to groupby and then use one of apply, agg, or transform on each group. In the context of the data here, we’ll just group by frog_id.
apply is the most general, taking a single function as its argument. This function receives each group as a data frame, performs arbitrary operations on/with it, and returns a data frame, series, or scalar. Consider returning the slope \(m\) and intercept \(b\) of a linear regression (think \(y = mx + b\); check out these numpy docs) for each group using, say, the columns impact force (mN) and impact time (ms).
Here we consider each tongue-hit along with its experimentally measured values as an observation, so we’ll want to use the concatenated data frame df.
[16]:
def calculate_linear_reg(df_grp, x, y):
    """Calculate a linear regression between the specified columns."""
    # linregress returns (slope, intercept, rvalue, pvalue, stderr);
    # we only need the first three
    m, b, r_value = st.linregress(df_grp[x], df_grp[y])[:3]
    return pd.Series({'m': m, 'b': b, 'rsq': r_value**2})
df.groupby('frog_id').apply(calculate_linear_reg, x='impact force (mN)', y='impact time (ms)')
[16]:
m | b | rsq | |
---|---|---|---|
frog_id | |||
I | 0.003814 | 34.013441 | 0.095088 |
II | 0.005618 | 33.675927 | 0.025363 |
III | -0.104138 | 100.636094 | 0.167798 |
IV | 0.020265 | 26.906864 | 0.083563 |
Not much of a relationship here…
Note: pd.groupby.apply is a pretty “heavy” operation when passing large groups (say, many columns). It can be useful to select only the columns necessary for the calculation, like so:
df.groupby('frog_id')[['impact force (mN)', 'impact time (ms)']].apply(
    calculate_linear_reg, x='impact force (mN)', y='impact time (ms)')
Want to see how long an apply operation will take? Check out tqdm. The following will be helpful to get going:
import tqdm
tqdm.tqdm.pandas()
df.groupby([groups]).progress_apply(process_group)
agg takes a single function (or list of functions) that accepts a pd.Series, np.array, list, or similar and returns a single value summarizing the input. The groupby plus agg combination will do the hard work of aggregating each column of each group with the function(s) passed. Consider calculating the mean and standard deviation of the force measurements (say, impact force (mN), adhesive force (mN), and both normalized to body weight).
Here a single observation corresponds not only to a tongue-hit but also to the type of experimentally measured value associated with it. We’ll want to use df_long.
[17]:
df_forces = (
    df_long[df_long['variable'].str.contains('force')]
    .groupby(['frog_id', 'variable'])['value']
    .agg([np.mean, np.std])
)
df_forces
[17]:
mean | std | ||
---|---|---|---|
frog_id | variable | ||
I | adhesive force (mN) | -658.4000 | 167.143619 |
adhesive force / body weight | 1.0645 | 0.268729 | |
impact force (mN) | 1530.2000 | 630.207952 | |
impact force / body weight | 2.4715 | 1.018209 | |
II | adhesive force (mN) | -462.3000 | 203.811600 |
adhesive force / body weight | 1.0875 | 0.479702 | |
impact force (mN) | 707.3500 | 424.573256 | |
impact force / body weight | 1.6640 | 0.998854 | |
III | adhesive force (mN) | -206.7500 | 88.122448 |
adhesive force / body weight | 1.6635 | 0.708484 | |
impact force (mN) | 550.1000 | 124.273849 | |
impact force / body weight | 4.4260 | 0.999407 | |
IV | adhesive force (mN) | -263.6000 | 83.309442 |
adhesive force / body weight | 1.9640 | 0.620801 | |
impact force (mN) | 419.1000 | 234.864328 | |
impact force / body weight | 3.1200 | 1.749009 |
[18]:
# unstack for easy viewing
df_forces.unstack('variable')
[18]:
mean | std | |||||||
---|---|---|---|---|---|---|---|---|
variable | adhesive force (mN) | adhesive force / body weight | impact force (mN) | impact force / body weight | adhesive force (mN) | adhesive force / body weight | impact force (mN) | impact force / body weight |
frog_id | ||||||||
I | -658.40 | 1.0645 | 1530.20 | 2.4715 | 167.143619 | 0.268729 | 630.207952 | 1.018209 |
II | -462.30 | 1.0875 | 707.35 | 1.6640 | 203.811600 | 0.479702 | 424.573256 | 0.998854 |
III | -206.75 | 1.6635 | 550.10 | 4.4260 | 88.122448 | 0.708484 | 124.273849 | 0.999407 |
IV | -263.60 | 1.9640 | 419.10 | 3.1200 | 83.309442 | 0.620801 | 234.864328 | 1.749009 |
Looks like frog_id I packs a punch, although frog_id III isn’t too shabby when normalized by body weight! A strip plot showing all of the force measurements is probably a good next step!
transform is similar to agg in that it accepts a pd.Series, np.array, or list and is applied to each column of each group. However, transform returns a list-like of the same length as its input. Consider calculating the z-score of impact force (mN).
In this case, too, a single observation corresponds not only to a tongue-hit but also to the type of experimentally measured value associated with it. We’ll want to use df_long (and write back into that data frame, in a new column).
[19]:
def calculate_zscore(x):
    """Center and scale values to zero mean and unit standard deviation."""
    return (x - np.mean(x)) / np.std(x)
df_long['zscore'] = df_long.groupby(['frog_id', 'variable'])['value'].transform(calculate_zscore)
df_long.head()
[19]:
date | trial number | frog_id | variable | value | is_measured | zscore | |
---|---|---|---|---|---|---|---|
0 | 2013_03_19 | 1 | II | impact force (mN) | 1612.0 | True | 2.186080 |
1 | 2013_03_19 | 2 | II | impact force (mN) | 605.0 | True | -0.247328 |
2 | 2013_03_19 | 3 | II | impact force (mN) | 327.0 | True | -0.919113 |
3 | 2013_03_19 | 4 | II | impact force (mN) | 946.0 | True | 0.576696 |
4 | 2013_03_21 | 1 | II | impact force (mN) | 541.0 | True | -0.401984 |
Now we have a z-score within each frog_id, variable group for each value… Might be a neat trick for calculating ECDFs!
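For instance, a sketch of computing ECDF values within each group using transform (the ecdf column is our own addition):
# fraction of observations in the group less than or equal to each value
df_long['ecdf'] = df_long.groupby(['frog_id', 'variable'])['value'].transform(
    lambda x: x.rank(method='first') / len(x)
)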
A plot¶
Why play with frogs trying to eat insects if we don’t make it to a plot!? Since we have z-scores for each observation, across frog_ids and for each measurement, let’s plot the values, coloring by z-score to draw our attention to any potentially extreme values. Consider this a start at exploring the frog-tongue data set.
[20]:
df_plot = df_long.copy()
df_plot['extreme'] = np.abs(df_plot['zscore']) > 1.75
df_plot = df_plot.sort_values(by=['frog_id', 'extreme'])
hv.Scatter(
data=df_plot,
kdims=[('frog_id', 'variable')],
vdims=['value', 'variable', 'extreme'],
).groupby(
'variable'
).opts(
color='extreme',
hooks=[bebi103.hv.no_xgrid_hook],
jitter=0.1,
show_legend=False,
width=200,
xlabel="",
ylabel="",
).layout(
).opts(
shared_axes=False,
)
[20]:
Computing environment¶
[21]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,holoviews,bebi103,jupyterlab
CPython 3.7.4
IPython 7.1.1
numpy 1.17.2
scipy 1.3.1
pandas 0.24.2
bokeh 1.3.4
holoviews 1.12.6
bebi103 0.0.42
jupyterlab 1.1.4