4. Reshaping data

This recitation was written by Shyam Saladi with minor edits by Justin Bois.

Data set download


[1]:
import glob

import numpy as np
import pandas as pd

import scipy.stats as st

import holoviews as hv
hv.extension('bokeh')

import bebi103
bebi103.hv.set_defaults()

For the purposes of this material, it will be useful to have read through the first two sections of Lesson 4.

We’ll start by quickly reviewing some of the essentials of pd.DataFrames, including indexes and concatenation, and then move into split-apply-combine with groupby, apply, & transform, and reshaping data with melt, stack, & unstack.

As with Lesson 4, we will be using the frog tongue strike data set (since it’s awesome) from a fun paper about the adhesive properties of frog tongues (Kleinteich and Gorb, 2014). In this work, the authors investigated various adhesive characteristics of the tongues of horned frogs when they strike prey. The authors had a striking pad connected to a cantilever to measure forces. They also used high-speed cameras to capture the strikes and record relevant data.

The goal of this exercise is to first get a feel for the data set by reshaping it using stack, unstack, melt, and pivot, with a smattering of set_index and reset_index along the way. Finally, we’ll do some split-apply-combine with the long-formatted tidy data, making our way to an exploratory plot.

Loading the data

After downloading and unzipping the archive with the frog strike data, we need to load the data into a single data frame by concatenating the data from each CSV file:

[2]:
fnames = glob.glob('../data/frog_strikes/frog_strikes_*.csv')

def read_strike_csv(fn):
    """Read a single strike CSV, parsing the frog ID from the file name."""
    df = pd.read_csv(fn, comment='#')
    # File names end in the frog ID, e.g., frog_strikes_II.csv
    df['frog_id'] = fn[fn.rfind('_')+1:fn.rfind('.')]
    return df

df = pd.concat([read_strike_csv(fn) for fn in fnames])

# Take a look at data frame
df.head(10)
[2]:
date trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa) frog_id
0 2013_03_19 1 1612 18 3.79 -655 3087 1.54 -0.385 348 15 0.96 4633 -1881 II
1 2013_03_19 2 605 55 1.42 -292 1261 0.69 -0.294 248 24 0.90 2441 -1177 II
2 2013_03_19 3 327 51 0.77 -246 1508 0.58 -0.282 130 34 0.74 2517 -1894 II
3 2013_03_19 4 946 59 2.23 -245 1841 0.58 -0.340 106 26 0.76 8893 -2301 II
4 2013_03_21 1 541 33 1.27 -553 3126 1.30 -0.432 276 16 0.94 1959 -2004 II
5 2013_03_21 2 1539 43 3.62 -664 741 1.56 -0.046 85 24 0.72 18073 -7802 II
6 2013_03_21 3 529 28 1.24 -261 2482 0.61 -0.414 325 33 0.90 1627 -803 II
7 2013_03_21 4 628 31 1.48 -691 998 1.63 -0.071 242 67 0.72 2600 -2860 II
8 2013_03_25 1 1453 72 3.42 -92 1652 0.22 -0.008 136 0 1.00 10645 -678 II
9 2013_03_25 2 297 42 0.70 -566 936 1.33 -0.084 126 4 0.97 2367 -4506 II

It looks like frogs were tested on various days and multiple times on each day.

If you were running this experiment, would you have tested all frogs on the same dates? What did Kleinteich and Gorb do? Were all four frogs tested on the same days? Let’s count the number of trials by date and frog_id to take a look.

[3]:
df.groupby(['date', 'frog_id'])['trial number'].count()
[3]:
date        frog_id
2013_02_26  I          2
2013_03_01  I          4
2013_03_05  I          4
2013_03_12  I          5
2013_03_15  I          5
2013_03_19  II         4
2013_03_21  II         4
2013_03_25  II         4
2013_03_28  II         2
2013_04_03  II         2
2013_04_08  II         3
2013_04_12  II         1
2013_05_27  III        3
            IV         3
2013_05_30  IV         3
2013_06_03  IV         1
2013_06_11  III        3
            IV         4
2013_06_14  III        2
            IV         1
2013_06_18  III        4
            IV         4
2013_06_21  III        4
            IV         4
2013_06_26  III        4
Name: trial number, dtype: int64

It looks like the experiment took place over the course of 4 months with frog I tested initially and the others following.

Curiously, each date has a different number of trials/observations. A finer look at the observations across days and frogs would help us better understand the experimental design here.

Let’s try displaying the data for all dates split up by frog_id and trial number.

To do this, we should move frog_id, trial number, and date from column values to row indices. We will also select a single quantity to focus on: impact force (mN).

[4]:
df_force = df.set_index(['frog_id', 'trial number', 'date'])['impact force (mN)']

df_force.head()
[4]:
frog_id  trial number  date
II       1             2013_03_19    1612
         2             2013_03_19     605
         3             2013_03_19     327
         4             2013_03_19     946
         1             2013_03_21     541
Name: impact force (mN), dtype: int64

Unstacking/stacking indexes

Now, let’s pivot date to columns.

In general, the operation of moving row indexers to column indexers (or the other way around) is called “pivoting”. Pandas also has a pivot function, but in practice, stack and unstack are your friends. They can be easier to use since all they need is an index level to pivot on.

To be clear, stacking moves the innermost column index level to the innermost row index level, reshuffling the cell values accordingly. Unstacking performs the reverse operation, moving the innermost row index level to the innermost column index level.
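
Before applying this to the frog data, here is a minimal sketch on a tiny made-up data frame (the names toy, sample, and condition are ours, not part of the frog data set):

import pandas as pd

# A tiny wide data frame: rows labeled by 'sample', columns by 'condition'
toy = pd.DataFrame(
    {'a': [1, 2], 'b': [3, 4]},
    index=pd.Index(['x', 'y'], name='sample'),
)
toy.columns.name = 'condition'

# 'condition' moves from the columns to the innermost row index level
stacked = toy.stack()

# ...and unstack moves it right back, recovering the original frame
stacked.unstack('condition')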

Here we want to move frog_id and trial number from row indexers to column indexers. This corresponds to the unstack operation.

[5]:
df_force = df_force.unstack(['frog_id', 'trial number'])

# sort index and tidy for easy viewing
df_force = df_force.sort_index(axis=1).astype(str).replace('nan', '')
df_force
[5]:
frog_id I II III IV
trial number 1 2 3 4 5 1 2 3 4 1 2 3 4 1 2 3 4
date
2013_02_26 1205.0 2527.0
2013_03_01 1745.0 1556.0 493.0 2276.0
2013_03_05 556.0 1928.0 2641.0 1897.0
2013_03_12 1891.0 1545.0 1307.0 1692.0 1543.0
2013_03_15 1282.0 775.0 2032.0 1240.0 473.0
2013_03_19 1612.0 605.0 327.0 946.0
2013_03_21 541.0 1539.0 529.0 628.0
2013_03_25 1453.0 297.0 703.0 269.0
2013_03_28 751.0 245.0
2013_04_03 1182.0 515.0
2013_04_08 435.0 383.0 457.0
2013_04_12 730.0
2013_05_27 614.0 414.0 324.0 172.0 142.0 37.0
2013_05_30 453.0 355.0 22.0
2013_06_03 502.0
2013_06_11 776.0 611.0 544.0 273.0 720.0 582.0 198.0
2013_06_14 538.0 579.0 198.0
2013_06_18 806.0 459.0 458.0 626.0 597.0 516.0 815.0 402.0
2013_06_21 621.0 544.0 535.0 385.0 605.0 711.0 614.0 468.0
2013_06_26 401.0 614.0 665.0 488.0

I find it easier to compare groups of numerical observations vertically, so let’s transpose the axes.

[6]:
# df_force was already sorted and stringified above, so just transpose
df_force.transpose()
[6]:
date 2013_02_26 2013_03_01 2013_03_05 2013_03_12 2013_03_15 2013_03_19 2013_03_21 2013_03_25 2013_03_28 2013_04_03 2013_04_08 2013_04_12 2013_05_27 2013_05_30 2013_06_03 2013_06_11 2013_06_14 2013_06_18 2013_06_21 2013_06_26
frog_id trial number
I 1 1745.0 556.0 1891.0 1282.0
2 1556.0 1928.0 1545.0 775.0
3 1205.0 493.0 2641.0 1307.0 2032.0
4 2527.0 2276.0 1897.0 1692.0 1240.0
5 1543.0 473.0
II 1 1612.0 541.0 1453.0 751.0 1182.0 435.0 730.0
2 605.0 1539.0 297.0 245.0 515.0 383.0
3 327.0 529.0 703.0 457.0
4 946.0 628.0 269.0
III 1 614.0 776.0 538.0 806.0 621.0 401.0
2 414.0 611.0 579.0 459.0 544.0 614.0
3 324.0 544.0 458.0 535.0 665.0
4 626.0 385.0 488.0
IV 1 453.0 502.0 273.0 198.0 597.0 605.0
2 172.0 355.0 720.0 516.0 711.0
3 142.0 22.0 582.0 815.0 614.0
4 37.0 198.0 402.0 468.0

Ah, it looks like the measurements were taken successively from frog I to IV across four months. On some days, the first few trials aren’t recorded, and from day to day, different numbers of trials were performed (perhaps the frog had had enough science for the afternoon).

This data frame organizes frog_id and trial number into a MultiIndex on one axis and date on the other. If we want to pivot trial number from a column indexer to a row indexer, stack is our friend.

[7]:
df_force_stack = df_force.stack('trial number')

# just show the first few rows
df_force_stack.head(10)
[7]:
frog_id I II III IV
date trial number
2013_02_26 1
2
3 1205.0
4 2527.0
5 NaN NaN NaN
2013_03_01 1 1745.0
2 1556.0
3 493.0
4 2276.0
5 NaN NaN NaN

We see that NaN values are introduced after stacking, for example in the cell given by date 2013_02_26, trial 5, and frog_id II. Looking closely at the previous data frame df_force, only frog I has observations with a trial value of 5. When stacking, each of the column indexer values being pivoted is nested under each row indexer, and where there are missing values in the originating data frame, NaN values are introduced.

Melting a data frame

Looking back at the original data frame again, it might be useful to separate the measured quantities from those that are calculated (from two or more measured values). Conveniently, the columns with directly measured data have units specified in parentheses, while the calculated columns do not. Keeping this in mind, it will be useful to add a label to each value indicating whether it’s measured or calculated.

One way to do this could be to convert the column indexer to a MultiIndex with two levels: is_measured indicating the type of variable and the variable name itself:

[8]:
# (is_measured, variable) pairs: measured columns have units in parentheses
col_names = [('(' in c, c) for c in df.columns]

# copy dataframe to leave initial dataframe unchanged
df2 = df.copy()

df2.columns = pd.MultiIndex.from_tuples(col_names)
df2.columns.names = ['is_measured', 'variable']
df2.sort_index(axis=1).head()
[8]:
is_measured False True
variable adhesive force / body weight contact area with mucus / contact area without mucus date frog_id impact force / body weight trial number adhesive force (mN) adhesive impulse (N-s) adhesive strength (Pa) contact area without mucus (mm2) contact pressure (Pa) impact force (mN) impact time (ms) time frog pulls on target (ms) total contact area (mm2)
0 1.54 0.96 2013_03_19 II 3.79 1 -655 -0.385 -1881 15 4633 1612 18 3087 348
1 0.69 0.90 2013_03_19 II 1.42 2 -292 -0.294 -1177 24 2441 605 55 1261 248
2 0.58 0.74 2013_03_19 II 0.77 3 -246 -0.282 -1894 34 2517 327 51 1508 130
3 0.58 0.76 2013_03_19 II 2.23 4 -245 -0.340 -2301 26 8893 946 59 1841 106
4 1.30 0.94 2013_03_21 II 1.27 1 -553 -0.432 -2004 16 1959 541 33 3126 276

It’s somewhat odd to have date, trial number, and frog_id under is_measured False. Let’s first set them as row indexers and then organize the columns:

[9]:
# copy dataframe to leave initial dataframe unchanged
df2 = df.set_index(['frog_id', 'date', 'trial number'])

col_names = [('(' in c, c) for c in df2.columns]

df2.columns = pd.MultiIndex.from_tuples(col_names)
df2.columns.names = ['is_measured', 'variable']
df2.sort_index(axis=1).head()
[9]:
is_measured False True
variable adhesive force / body weight contact area with mucus / contact area without mucus impact force / body weight adhesive force (mN) adhesive impulse (N-s) adhesive strength (Pa) contact area without mucus (mm2) contact pressure (Pa) impact force (mN) impact time (ms) time frog pulls on target (ms) total contact area (mm2)
frog_id date trial number
II 2013_03_19 1 1.54 0.96 3.79 -655 -0.385 -1881 15 4633 1612 18 3087 348
2 0.69 0.90 1.42 -292 -0.294 -1177 24 2441 605 55 1261 248
3 0.58 0.74 0.77 -246 -0.282 -1894 34 2517 327 51 1508 130
4 0.58 0.76 2.23 -245 -0.340 -2301 26 8893 946 59 1841 106
2013_03_21 1 1.30 0.94 1.27 -553 -0.432 -2004 16 1959 541 33 3126 276

As earlier, let’s try to stack one of the levels of the column indexer:

[10]:
df2.stack('is_measured').head(20)
[10]:
variable adhesive force (mN) adhesive force / body weight adhesive impulse (N-s) adhesive strength (Pa) contact area with mucus / contact area without mucus contact area without mucus (mm2) contact pressure (Pa) impact force (mN) impact force / body weight impact time (ms) time frog pulls on target (ms) total contact area (mm2)
frog_id date trial number is_measured
II 2013_03_19 1 False NaN 1.54 NaN NaN 0.96 NaN NaN NaN 3.79 NaN NaN NaN
True -655.0 NaN -0.385 -1881.0 NaN 15.0 4633.0 1612.0 NaN 18.0 3087.0 348.0
2 False NaN 0.69 NaN NaN 0.90 NaN NaN NaN 1.42 NaN NaN NaN
True -292.0 NaN -0.294 -1177.0 NaN 24.0 2441.0 605.0 NaN 55.0 1261.0 248.0
3 False NaN 0.58 NaN NaN 0.74 NaN NaN NaN 0.77 NaN NaN NaN
True -246.0 NaN -0.282 -1894.0 NaN 34.0 2517.0 327.0 NaN 51.0 1508.0 130.0
4 False NaN 0.58 NaN NaN 0.76 NaN NaN NaN 2.23 NaN NaN NaN
True -245.0 NaN -0.340 -2301.0 NaN 26.0 8893.0 946.0 NaN 59.0 1841.0 106.0
2013_03_21 1 False NaN 1.30 NaN NaN 0.94 NaN NaN NaN 1.27 NaN NaN NaN
True -553.0 NaN -0.432 -2004.0 NaN 16.0 1959.0 541.0 NaN 33.0 3126.0 276.0
2 False NaN 1.56 NaN NaN 0.72 NaN NaN NaN 3.62 NaN NaN NaN
True -664.0 NaN -0.046 -7802.0 NaN 24.0 18073.0 1539.0 NaN 43.0 741.0 85.0
3 False NaN 0.61 NaN NaN 0.90 NaN NaN NaN 1.24 NaN NaN NaN
True -261.0 NaN -0.414 -803.0 NaN 33.0 1627.0 529.0 NaN 28.0 2482.0 325.0
4 False NaN 1.63 NaN NaN 0.72 NaN NaN NaN 1.48 NaN NaN NaN
True -691.0 NaN -0.071 -2860.0 NaN 67.0 2600.0 628.0 NaN 31.0 998.0 242.0
2013_03_25 1 False NaN 0.22 NaN NaN 1.00 NaN NaN NaN 3.42 NaN NaN NaN
True -92.0 NaN -0.008 -678.0 NaN 0.0 10645.0 1453.0 NaN 72.0 1652.0 136.0
2 False NaN 1.33 NaN NaN 0.97 NaN NaN NaN 0.70 NaN NaN NaN
True -566.0 NaN -0.084 -4506.0 NaN 4.0 2367.0 297.0 NaN 42.0 936.0 126.0

You will notice that NaN values are added wholesale. What’s happened here?

Well, each column holds either a measured or a calculated variable. Thus, when we stack by moving is_measured from a column indexer to a row indexer, rows with is_measured = True get NaN values in the columns holding calculated values, and rows with is_measured = False get NaN values in the columns holding directly measured values.

This is an often-encountered subtlety of unstacking and stacking data. In some ways it is also a feature: pandas won’t add data where it does not exist!
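
If the NaN padding is unwanted, one option (a sketch, not the only approach) is to stack both column levels at once; the result is a fully long series, and stack’s default dropna=True discards the combinations that never existed:

# Stack both column levels at once; dropna=True (the default)
# drops the missing entries instead of padding with NaN
df2.stack(['is_measured', 'variable']).head(10)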

Tidying data with pd.melt

When thinking about the original data frame from the concatenated CSV files, one might consider it tidy, having one “tongue impact” observation in each row. Another point of view could consider each recorded value as an observation, with the name of the variable as part of the metadata.

Let’s carry out this transformation. Then let’s also try to identify and separate the measured from the calculated variables, as done above. This time you might find the process a bit more slick.

To turn this wide data into a long, tidy format, we first identify the columns that identify observations, i.e., date, trial number, and frog_id. When pivoting the data, these columns stay associated with the values. All other column headings become values in rows. Think of it as similar to the stack operation worked out above.

[11]:
df_long = df.melt(id_vars=['date', 'trial number', 'frog_id'])

# identify measured data by the presence of a parenthesis
df_long['is_measured'] = df_long['variable'].str.contains(r'\(')
df_long.head()
[11]:
date trial number frog_id variable value is_measured
0 2013_03_19 1 II impact force (mN) 1612.0 True
1 2013_03_19 2 II impact force (mN) 605.0 True
2 2013_03_19 3 II impact force (mN) 327.0 True
3 2013_03_19 4 II impact force (mN) 946.0 True
4 2013_03_21 1 II impact force (mN) 541.0 True

If you’re wondering why we’ve written .str.contains(r'\(') instead of .str.contains('('), take a peek at the documentation for pd.Series.str.contains. Hint: We could have also written .str.contains(')', regex=False).
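
In short, str.contains treats its pattern as a regular expression by default, and ( is a regex metacharacter, so it must be escaped (or regex matching turned off). These calls are all equivalent here:

# Escaped regex metacharacter in a raw string
df_long['variable'].str.contains(r'\(')

# Same escape without a raw string: the backslash must be doubled
df_long['variable'].str.contains('\\(')

# Plain substring match, no regex at all
df_long['variable'].str.contains('(', regex=False)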

Filtering for those values that are the direct result of an experimental measurement is done like so:

[12]:
df_long_direct = df_long[df_long['is_measured']].copy()

Note: .copy() helps avoid bugs: it is ambiguous whether filtering a pd.DataFrame returns a view or a copy, so later assignments to the filtered frame can trigger SettingWithCopyWarning. More detail here.
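
A quick illustration of the gotcha (the column name new_col is made up for this example):

# Assigning into a filtered frame without .copy() may trigger
# SettingWithCopyWarning, and the assignment may not propagate
df_view = df_long[df_long['is_measured']]
df_view['new_col'] = 0   # possible SettingWithCopyWarning

# With .copy(), the new frame is independent and assignment is safe
df_safe = df_long[df_long['is_measured']].copy()
df_safe['new_col'] = 0   # no warning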

To transform values back to a wide format, as in the original data frame (though with only the directly measured quantities), we’ll need to do another pivot. Setting the indexes is the first step: date, trial number, frog_id, and variable.

[13]:
df_direct = df_long_direct.set_index(['date', 'trial number', 'frog_id', 'variable'])
df_direct.head(10)
[13]:
value is_measured
date trial number frog_id variable
2013_03_19 1 II impact force (mN) 1612.0 True
2 II impact force (mN) 605.0 True
3 II impact force (mN) 327.0 True
4 II impact force (mN) 946.0 True
2013_03_21 1 II impact force (mN) 541.0 True
2 II impact force (mN) 1539.0 True
3 II impact force (mN) 529.0 True
4 II impact force (mN) 628.0 True
2013_03_25 1 II impact force (mN) 1453.0 True
2 II impact force (mN) 297.0 True

Now the index to pivot on should be clear: variable. Only the column value needs to be kept, since is_measured is all True. To move a row indexer to a column indexer (i.e., going long to wide), we use unstack, like so:

[14]:
df_wide_direct = df_direct['value'].unstack('variable')
df_wide_direct.head(10)
[14]:
variable adhesive force (mN) adhesive impulse (N-s) adhesive strength (Pa) contact area without mucus (mm2) contact pressure (Pa) impact force (mN) impact time (ms) time frog pulls on target (ms) total contact area (mm2)
date trial number frog_id
2013_02_26 3 I -785.0 -0.290 -2030.0 70.0 3117.0 1205.0 46.0 884.0 387.0
4 I -983.0 -0.181 -9695.0 94.0 24923.0 2527.0 44.0 248.0 101.0
2013_03_01 1 I -850.0 -0.157 -10239.0 79.0 21020.0 1745.0 34.0 211.0 83.0
2 I -455.0 -0.170 -1381.0 158.0 4718.0 1556.0 41.0 1025.0 330.0
3 I -974.0 -0.423 -3975.0 216.0 2012.0 493.0 36.0 499.0 245.0
4 I -592.0 -0.176 -1737.0 106.0 6676.0 2276.0 31.0 969.0 341.0
2013_03_05 1 I -512.0 -0.285 -1427.0 110.0 1550.0 556.0 43.0 835.0 359.0
2 I -804.0 -0.285 -3266.0 178.0 7832.0 1928.0 46.0 508.0 246.0
3 I -690.0 -0.239 -2568.0 224.0 9824.0 2641.0 50.0 491.0 269.0
4 I -462.0 -0.328 -1733.0 176.0 7122.0 1897.0 41.0 839.0 266.0

As with pd.melt, you’ll also see the long-to-wide operation done with pd.pivot. Check this out:

[15]:
df_wide_direct = df_long_direct.set_index(['date', 'trial number', 'frog_id'])

# drop unneeded column
df_wide_direct = df_wide_direct.drop(columns=['is_measured'])

df_wide_direct.pivot(columns='variable').head()
[15]:
value
variable adhesive force (mN) adhesive impulse (N-s) adhesive strength (Pa) contact area without mucus (mm2) contact pressure (Pa) impact force (mN) impact time (ms) time frog pulls on target (ms) total contact area (mm2)
date trial number frog_id
2013_02_26 3 I -785.0 -0.290 -2030.0 70.0 3117.0 1205.0 46.0 884.0 387.0
4 I -983.0 -0.181 -9695.0 94.0 24923.0 2527.0 44.0 248.0 101.0
2013_03_01 1 I -850.0 -0.157 -10239.0 79.0 21020.0 1745.0 34.0 211.0 83.0
2 I -455.0 -0.170 -1381.0 158.0 4718.0 1556.0 41.0 1025.0 330.0
3 I -974.0 -0.423 -3975.0 216.0 2012.0 493.0 36.0 499.0 245.0

Reshaping advice

Setting indexes and then pivoting with stack or unstack is our preferred way of going long-to-wide or wide-to-long. It keeps the process explicit and helps avoid bugs. pd.melt is convenient, and pd.pivot at times, too, but getting familiar with indexes, stack, and unstack will go a long way in reshaping and transforming your data.
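
For instance, here is a sketch of the wide-to-long transformation from earlier, done with set_index and stack instead of pd.melt (level_3 and 0 are the default labels reset_index assigns to the unnamed index level and the values):

# Wide-to-long with set_index + stack, equivalent in spirit to pd.melt
df_long_alt = (
    df.set_index(['date', 'trial number', 'frog_id'])
      .stack()
      .reset_index()
      .rename(columns={'level_3': 'variable', 0: 'value'})
)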

You may also come across the function pd.pivot_table. We don’t use this too often since values identified by the same index positions are aggregated by default. In practice, it’s helpful to explicitly perform this aggregation when your data is tidied in a long format (think split-apply-combine).

Take a look at the docs for pd.pivot_table to learn more.
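
For reference, here is a sketch of pd.pivot_table next to the explicit groupby version, both computing the mean value per frog and variable:

# pivot_table pivots and aggregates in one call (mean by default)
pd.pivot_table(
    df_long,
    index='frog_id',
    columns='variable',
    values='value',
    aggfunc=np.mean,
)

# Roughly the same result, with the aggregation made explicit
df_long.groupby(['frog_id', 'variable'])['value'].mean().unstack('variable')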

Tidying your tidy data

With the data molded into a tidy format (having covered the various manipulations along the way), we can process groups of observations and variables in uniform ways. The general strategy is to groupby and then use one of apply, agg, or transform on each group. In the context of the data here, let’s just group by frog_id.

  • apply is the most general, taking a single function as its argument. This function receives the group as a data frame, applies arbitrary operations on/with it, and returns a data frame, series, or scalar. Consider returning the slope \(m\) and intercept \(b\) of a linear regression (think \(y = mx + b\); see the docs for scipy.stats.linregress) for each group using, say, columns impact force (mN) & impact time (ms).

Here we consider each tongue-hit along with its experimentally measured values as an observation, so we’ll want to use the concatenated data frame df.

[16]:
def calculate_linear_reg(df_grp, x, y):
    """Calculates a linear regression for the columns specified"""
    # linregress returns (slope, intercept, rvalue, pvalue, stderr); keep the first three
    m, b, r_value = st.linregress(df_grp[x], df_grp[y])[:3]
    return pd.Series({'m': m, 'b': b, 'rsq': r_value**2})

df.groupby('frog_id').apply(calculate_linear_reg, x='impact force (mN)', y='impact time (ms)')
[16]:
m b rsq
frog_id
I 0.003814 34.013441 0.095088
II 0.005618 33.675927 0.025363
III -0.104138 100.636094 0.167798
IV 0.020265 26.906864 0.083563

Not much of a relationship here…

Note: groupby followed by apply is a pretty “heavy” operation when passing large groups (say, many columns). It can be useful to select only the columns necessary for the calculation, like so:

df.groupby('frog_id')[['impact force (mN)', 'impact time (ms)']].apply(
    calculate_linear_reg, x='impact force (mN)', y='impact time (ms)')

Want to monitor the progress (and estimated remaining time) of an apply operation? Check out tqdm. The following will help get you going:

import tqdm
tqdm.tqdm.pandas()

df.groupby([groups]).progress_apply(process_group)

  • agg takes a single function (or list of functions) that accepts a pd.Series, np.array, list, or similar and returns a single value summarizing the input. The groupby-agg combination does the hard work of aggregating each column of each group with the function(s) passed. Consider calculating the mean and standard deviation of the force measurements (say, impact force (mN), adhesive force (mN), and both normalized to body weight).

Here a single observation corresponds not only to a tongue-hit but also to the type of experimentally measured value associated with it. We’ll want to use df_long.

[17]:
df_forces = (
    df_long[df_long['variable'].str.contains('force')]
    .groupby(['frog_id', 'variable'])['value']
    .agg([np.mean, np.std])
)
df_forces
[17]:
mean std
frog_id variable
I adhesive force (mN) -658.4000 167.143619
adhesive force / body weight 1.0645 0.268729
impact force (mN) 1530.2000 630.207952
impact force / body weight 2.4715 1.018209
II adhesive force (mN) -462.3000 203.811600
adhesive force / body weight 1.0875 0.479702
impact force (mN) 707.3500 424.573256
impact force / body weight 1.6640 0.998854
III adhesive force (mN) -206.7500 88.122448
adhesive force / body weight 1.6635 0.708484
impact force (mN) 550.1000 124.273849
impact force / body weight 4.4260 0.999407
IV adhesive force (mN) -263.6000 83.309442
adhesive force / body weight 1.9640 0.620801
impact force (mN) 419.1000 234.864328
impact force / body weight 3.1200 1.749009
[18]:
# unstack for easy viewing
df_forces.unstack('variable')
[18]:
mean std
variable adhesive force (mN) adhesive force / body weight impact force (mN) impact force / body weight adhesive force (mN) adhesive force / body weight impact force (mN) impact force / body weight
frog_id
I -658.40 1.0645 1530.20 2.4715 167.143619 0.268729 630.207952 1.018209
II -462.30 1.0875 707.35 1.6640 203.811600 0.479702 424.573256 0.998854
III -206.75 1.6635 550.10 4.4260 88.122448 0.708484 124.273849 0.999407
IV -263.60 1.9640 419.10 3.1200 83.309442 0.620801 234.864328 1.749009

Looks like frog_id I packs a punch although frog_id III isn’t too shabby when normalized by body weight! A strip plot to show all of the force measurements is probably a good next step!

  • transform is similar to agg in that the function passed accepts a pd.Series, np.array, or list and is applied to each column of each group. However, the function must return a list-like of the same length as its input. Consider calculating the z-score of impact force (mN).

In this case too, a single observation corresponds not only to a tongue-hit but also to the type of experimentally measured value associated with it. We’ll want to use df_long (and write back into that data frame, but in a new column).

[19]:
def calculate_zscore(x):
    """Center by the mean and scale by the standard deviation."""
    return (x - np.mean(x)) / np.std(x)

df_long['zscore'] = df_long.groupby(['frog_id', 'variable'])['value'].transform(calculate_zscore)
df_long.head()
[19]:
date trial number frog_id variable value is_measured zscore
0 2013_03_19 1 II impact force (mN) 1612.0 True 2.186080
1 2013_03_19 2 II impact force (mN) 605.0 True -0.247328
2 2013_03_19 3 II impact force (mN) 327.0 True -0.919113
3 2013_03_19 4 II impact force (mN) 946.0 True 0.576696
4 2013_03_21 1 II impact force (mN) 541.0 True -0.401984

Now we have a z-score for each value within its (frog_id, variable) group… This might be a neat trick for calculating ECDFs!
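
For example, here is a sketch of computing ECDF values per group with transform (the column name ecdf and the rank-based approach are our own choices):

# ECDF value of each observation within its (frog_id, variable) group:
# rank the values, then divide by the group size
df_long['ecdf'] = df_long.groupby(['frog_id', 'variable'])['value'].transform(
    lambda s: s.rank(method='first') / len(s)
)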

A plot

Why play with frogs trying to eat insects if we don’t make it to a plot!? Since we have z-scores for each observation across frog_ids and for each measurement, let’s plot the values, coloring by z-score to draw our attention to any potentially extreme values. Consider this a start at exploring the frog-tongue data set.

[20]:
df_plot = df_long.copy()

# Flag observations with |z-score| > 1.75 as extreme
df_plot['extreme'] = np.abs(df_plot['zscore']) > 1.75

# Sort so extreme points are drawn on top
df_plot = df_plot.sort_values(by=['frog_id', 'extreme'])

hv.Scatter(
    data=df_plot,
    kdims=[('frog_id', 'variable')],
    vdims=['value', 'variable', 'extreme'],
).groupby(
    'variable'
).opts(
    color='extreme',
    hooks=[bebi103.hv.no_xgrid_hook],
    jitter=0.1,
    show_legend=False,
    width=200,
    xlabel="",
    ylabel="",
).layout(
).opts(
    shared_axes=False,
)
[20]:

Computing environment

[21]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,holoviews,bebi103,jupyterlab
CPython 3.7.4
IPython 7.1.1

numpy 1.17.2
scipy 1.3.1
pandas 0.24.2
bokeh 1.3.4
holoviews 1.12.6
bebi103 0.0.42
jupyterlab 1.1.4