4. Reshaping data¶
This recitation was written by Shyam Saladi with minor edits by Justin Bois.
[1]:
import glob
import numpy as np
import pandas as pd
import scipy.stats as st
import holoviews as hv
hv.extension('bokeh')
import bebi103
bebi103.hv.set_defaults()
For the purposes of this material, it will be useful to have read through the first two sections of Lesson 4.
We’ll start by quickly reviewing some of the essentials of pd.DataFrames, including indexes and concatenation, and then move into split-apply-combine with groupby, apply, & transform and reshaping data with melt, stack, & unstack.
As with Lesson 4, we will be using the frog tongue strike data set (since it’s awesome) from a fun paper about the adhesive properties of frog tongues (Kleinteich and Gorb, 2014). In this work, the authors investigated the adhesive characteristics of the tongues of horned frogs when they strike prey. The authors had a striking pad connected to a cantilever to measure forces. They also used high-speed cameras to capture the strike and record relevant data.
The goal of this exercise is to first get a feel for the data set by reshaping it using stack, unstack, melt, and pivot, with a smattering of set_index and reset_index along the way. Finally, we’ll do some split-apply-combine with the long-format tidy data, working our way to an exploratory plot.
Loading the data¶
After downloading and unzipping the archive with the frog strike data, we need to load the data into a single data frame by concatenating the data from each CSV file:
[2]:
fnames = glob.glob('../data/frog_strikes/frog_strikes_*.csv')
def read_strike_csv(fn):
    """Read a single strike CSV and add the frog ID parsed from the file name."""
    df = pd.read_csv(fn, comment='#')
    # file names look like frog_strikes_II.csv, so the ID sits between
    # the last underscore and the extension
    df['frog_id'] = fn[fn.rfind('_')+1:fn.rfind('.')]
    return df
df = pd.concat([read_strike_csv(fn) for fn in fnames])
# Take a look at data frame
df.head(10)
[2]:
date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) | frog_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013_03_19 | 1 | 1612 | 18 | 3.79 | -655 | 3087 | 1.54 | -0.385 | 348 | 15 | 0.96 | 4633 | -1881 | II |
1 | 2013_03_19 | 2 | 605 | 55 | 1.42 | -292 | 1261 | 0.69 | -0.294 | 248 | 24 | 0.90 | 2441 | -1177 | II |
2 | 2013_03_19 | 3 | 327 | 51 | 0.77 | -246 | 1508 | 0.58 | -0.282 | 130 | 34 | 0.74 | 2517 | -1894 | II |
3 | 2013_03_19 | 4 | 946 | 59 | 2.23 | -245 | 1841 | 0.58 | -0.340 | 106 | 26 | 0.76 | 8893 | -2301 | II |
4 | 2013_03_21 | 1 | 541 | 33 | 1.27 | -553 | 3126 | 1.30 | -0.432 | 276 | 16 | 0.94 | 1959 | -2004 | II |
5 | 2013_03_21 | 2 | 1539 | 43 | 3.62 | -664 | 741 | 1.56 | -0.046 | 85 | 24 | 0.72 | 18073 | -7802 | II |
6 | 2013_03_21 | 3 | 529 | 28 | 1.24 | -261 | 2482 | 0.61 | -0.414 | 325 | 33 | 0.90 | 1627 | -803 | II |
7 | 2013_03_21 | 4 | 628 | 31 | 1.48 | -691 | 998 | 1.63 | -0.071 | 242 | 67 | 0.72 | 2600 | -2860 | II |
8 | 2013_03_25 | 1 | 1453 | 72 | 3.42 | -92 | 1652 | 0.22 | -0.008 | 136 | 0 | 1.00 | 10645 | -678 | II |
9 | 2013_03_25 | 2 | 297 | 42 | 0.70 | -566 | 936 | 1.33 | -0.084 | 126 | 4 | 0.97 | 2367 | -4506 | II |
It looks like frogs were tested on various days and multiple times on each day.
If you were running this experiment, would you have tested all frogs on the same dates? What did Kleinteich and Gorb do? Were all four frogs tested on the same days? Let’s count the number of trials by date and frog_id to take a look.
[3]:
df.groupby(['date', 'frog_id'])['trial number'].count()
[3]:
date frog_id
2013_02_26 I 2
2013_03_01 I 4
2013_03_05 I 4
2013_03_12 I 5
2013_03_15 I 5
2013_03_19 II 4
2013_03_21 II 4
2013_03_25 II 4
2013_03_28 II 2
2013_04_03 II 2
2013_04_08 II 3
2013_04_12 II 1
2013_05_27 III 3
IV 3
2013_05_30 IV 3
2013_06_03 IV 1
2013_06_11 III 3
IV 4
2013_06_14 III 2
IV 1
2013_06_18 III 4
IV 4
2013_06_21 III 4
IV 4
2013_06_26 III 4
Name: trial number, dtype: int64
It looks like the experiment took place over the course of 4 months, with frog I tested initially and the others following.
Curiously, each date has a different number of trials/observations. A finer look at the observations across days and frogs would help us better understand the experimental design here.
Let’s try displaying the data for all dates, split up by frog_id and trial number.
To do this, we should move frog_id, trial number, and date from column values to row indices. We will also select a single column of values to focus on.
[4]:
df_force = df.set_index(['frog_id', 'trial number', 'date'])['impact force (mN)']
df_force.head()
[4]:
frog_id trial number date
II 1 2013_03_19 1612
2 2013_03_19 605
3 2013_03_19 327
4 2013_03_19 946
1 2013_03_21 541
Name: impact force (mN), dtype: int64
Unstacking/stacking indexes¶
Now, let’s pivot date to columns.
In general, the operation of moving row indexers to column indexers (or the other way around) is called “pivoting”. Pandas also has a pivot function, but in practice, stack and unstack are your friends. They can be easier to use since they only need an index to pivot on.
To be clear, stacking transforms the innermost column index into the innermost row index, with the cell values reshuffled accordingly. Unstacking performs the reverse operation, moving the innermost row index to the column indexes.
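To make the mechanics concrete, here is a minimal sketch on a toy data frame (the values are hypothetical, not from the frog data):
# a tiny frame with named row and column indexes
toy = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=pd.Index(['a', 'b'], name='row'),
    columns=pd.Index(['x', 'y'], name='col'),
)

# stack: 'col' moves from the columns to the innermost row index,
# yielding a series indexed by (row, col)
stacked = toy.stack()

# unstack: the reverse; 'col' moves back out to the columns
unstacked = stacked.unstack('col')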
Here we want to move frog_id and trial number from row indexers to column indexers. This corresponds to the unstack operation.
[5]:
df_force = df_force.unstack(['frog_id', 'trial number'])
# sort index and tidy for easy viewing
df_force = df_force.sort_index(axis=1).astype(str).replace('nan', '')
df_force
[5]:
frog_id | I | II | III | IV | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
trial number | 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 |
date | |||||||||||||||||
2013_02_26 | 1205.0 | 2527.0 | |||||||||||||||
2013_03_01 | 1745.0 | 1556.0 | 493.0 | 2276.0 | |||||||||||||
2013_03_05 | 556.0 | 1928.0 | 2641.0 | 1897.0 | |||||||||||||
2013_03_12 | 1891.0 | 1545.0 | 1307.0 | 1692.0 | 1543.0 | ||||||||||||
2013_03_15 | 1282.0 | 775.0 | 2032.0 | 1240.0 | 473.0 | ||||||||||||
2013_03_19 | 1612.0 | 605.0 | 327.0 | 946.0 | |||||||||||||
2013_03_21 | 541.0 | 1539.0 | 529.0 | 628.0 | |||||||||||||
2013_03_25 | 1453.0 | 297.0 | 703.0 | 269.0 | |||||||||||||
2013_03_28 | 751.0 | 245.0 | |||||||||||||||
2013_04_03 | 1182.0 | 515.0 | |||||||||||||||
2013_04_08 | 435.0 | 383.0 | 457.0 | ||||||||||||||
2013_04_12 | 730.0 | ||||||||||||||||
2013_05_27 | 614.0 | 414.0 | 324.0 | 172.0 | 142.0 | 37.0 | |||||||||||
2013_05_30 | 453.0 | 355.0 | 22.0 | ||||||||||||||
2013_06_03 | 502.0 | ||||||||||||||||
2013_06_11 | 776.0 | 611.0 | 544.0 | 273.0 | 720.0 | 582.0 | 198.0 | ||||||||||
2013_06_14 | 538.0 | 579.0 | 198.0 | ||||||||||||||
2013_06_18 | 806.0 | 459.0 | 458.0 | 626.0 | 597.0 | 516.0 | 815.0 | 402.0 | |||||||||
2013_06_21 | 621.0 | 544.0 | 535.0 | 385.0 | 605.0 | 711.0 | 614.0 | 468.0 | |||||||||
2013_06_26 | 401.0 | 614.0 | 665.0 | 488.0 |
I find it easier to compare groups of numerical observations vertically, so let’s transpose the axes.
[6]:
df_force.sort_index(axis=1).astype(str).replace('nan', '').transpose()
[6]:
date | 2013_02_26 | 2013_03_01 | 2013_03_05 | 2013_03_12 | 2013_03_15 | 2013_03_19 | 2013_03_21 | 2013_03_25 | 2013_03_28 | 2013_04_03 | 2013_04_08 | 2013_04_12 | 2013_05_27 | 2013_05_30 | 2013_06_03 | 2013_06_11 | 2013_06_14 | 2013_06_18 | 2013_06_21 | 2013_06_26 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
frog_id | trial number | ||||||||||||||||||||
I | 1 | 1745.0 | 556.0 | 1891.0 | 1282.0 | ||||||||||||||||
2 | 1556.0 | 1928.0 | 1545.0 | 775.0 | |||||||||||||||||
3 | 1205.0 | 493.0 | 2641.0 | 1307.0 | 2032.0 | ||||||||||||||||
4 | 2527.0 | 2276.0 | 1897.0 | 1692.0 | 1240.0 | ||||||||||||||||
5 | 1543.0 | 473.0 | |||||||||||||||||||
II | 1 | 1612.0 | 541.0 | 1453.0 | 751.0 | 1182.0 | 435.0 | 730.0 | |||||||||||||
2 | 605.0 | 1539.0 | 297.0 | 245.0 | 515.0 | 383.0 | |||||||||||||||
3 | 327.0 | 529.0 | 703.0 | 457.0 | |||||||||||||||||
4 | 946.0 | 628.0 | 269.0 | ||||||||||||||||||
III | 1 | 614.0 | 776.0 | 538.0 | 806.0 | 621.0 | 401.0 | ||||||||||||||
2 | 414.0 | 611.0 | 579.0 | 459.0 | 544.0 | 614.0 | |||||||||||||||
3 | 324.0 | 544.0 | 458.0 | 535.0 | 665.0 | ||||||||||||||||
4 | 626.0 | 385.0 | 488.0 | ||||||||||||||||||
IV | 1 | 453.0 | 502.0 | 273.0 | 198.0 | 597.0 | 605.0 | ||||||||||||||
2 | 172.0 | 355.0 | 720.0 | 516.0 | 711.0 | ||||||||||||||||
3 | 142.0 | 22.0 | 582.0 | 815.0 | 614.0 | ||||||||||||||||
4 | 37.0 | 198.0 | 402.0 | 468.0 |
Ah, it looks like the measurements were taken successively from frog I to IV across the 4 months. On some days, the first few trials aren’t recorded, and from day to day, different numbers of trials were performed (perhaps the frog had had enough science for the afternoon).
This data frame organizes frog_id and trial number into a MultiIndex on one axis and date onto the other. If we want to pivot trial number from a column indexer to a row indexer, stack is our friend.
[7]:
df_force_stack = df_force.stack('trial number')
# just show the first few rows
df_force_stack.head(10)
[7]:
frog_id | I | II | III | IV | |
---|---|---|---|---|---|
date | trial number | ||||
2013_02_26 | 1 | ||||
2 | |||||
3 | 1205.0 | ||||
4 | 2527.0 | ||||
5 | NaN | NaN | NaN | ||
2013_03_01 | 1 | 1745.0 | |||
2 | 1556.0 | ||||
3 | 493.0 | ||||
4 | 2276.0 | ||||
5 | NaN | NaN | NaN |
We see that NaN values are introduced after stacking, for example in the cell given by date 2013_02_26, trial 5, and frog_id II. Looking closely at the previous data frame df_force, only frog I has observations with a trial value of 5. When stacking, each of the column indexer values being pivoted is nested under each row indexer, and where values are missing in the originating data frame, NaN values are introduced.
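As an aside, DataFrame.stack has a dropna argument (True by default) that silently drops rows that are entirely missing after stacking. A minimal sketch on a numeric version of df_force (i.e., built without the cosmetic astype(str) step above):
df_force_numeric = (
    df.set_index(['frog_id', 'trial number', 'date'])['impact force (mN)']
      .unstack(['frog_id', 'trial number'])
)

# dropna=False keeps (date, trial) rows that are all-NaN, making the
# introduced missing values explicit; the default quietly drops them
df_force_numeric.stack('trial number', dropna=False)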
Melting a data frame¶
Looking back at the original data frame again, it might be useful to separate the measured quantities from those that are calculated (from two or more measured values). Conveniently, the columns with measured data have units specified in parentheses, and those with calculated data do not. Keeping this in mind, it will be useful to add a label to each value indicating whether it’s measured or calculated.
One way to do this is to convert the column indexer to a MultiIndex with two levels: is_measured, indicating the type of variable, and the variable name itself:
[8]:
# label each column by whether its name contains a parenthesis (i.e., has units)
col_names = [('(' in c, c) for c in df.columns]
# copy dataframe to leave initial dataframe unchanged
df2 = df.copy()
df2.columns = pd.MultiIndex.from_tuples(col_names)
df2.columns.names = ['is_measured', 'variable']
df2.sort_index(axis=1).head()
[8]:
is_measured | False | True | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
variable | adhesive force / body weight | contact area with mucus / contact area without mucus | date | frog_id | impact force / body weight | trial number | adhesive force (mN) | adhesive impulse (N-s) | adhesive strength (Pa) | contact area without mucus (mm2) | contact pressure (Pa) | impact force (mN) | impact time (ms) | time frog pulls on target (ms) | total contact area (mm2) |
0 | 1.54 | 0.96 | 2013_03_19 | II | 3.79 | 1 | -655 | -0.385 | -1881 | 15 | 4633 | 1612 | 18 | 3087 | 348 |
1 | 0.69 | 0.90 | 2013_03_19 | II | 1.42 | 2 | -292 | -0.294 | -1177 | 24 | 2441 | 605 | 55 | 1261 | 248 |
2 | 0.58 | 0.74 | 2013_03_19 | II | 0.77 | 3 | -246 | -0.282 | -1894 | 34 | 2517 | 327 | 51 | 1508 | 130 |
3 | 0.58 | 0.76 | 2013_03_19 | II | 2.23 | 4 | -245 | -0.340 | -2301 | 26 | 8893 | 946 | 59 | 1841 | 106 |
4 | 1.30 | 0.94 | 2013_03_21 | II | 1.27 | 1 | -553 | -0.432 | -2004 | 16 | 1959 | 541 | 33 | 3126 | 276 |
It’s somewhat odd to have date, trial number, and frog_id under is_measured False. Let’s first set them as row indexers and then organize the columns:
[9]:
# copy dataframe to leave initial dataframe unchanged
df2 = df.set_index(['frog_id', 'date', 'trial number'])
col_names = [('(' in c, c) for c in df2.columns]
df2.columns = pd.MultiIndex.from_tuples(col_names)
df2.columns.names = ['is_measured', 'variable']
df2.sort_index(axis=1).head()
[9]:
is_measured | False | True | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
variable | adhesive force / body weight | contact area with mucus / contact area without mucus | impact force / body weight | adhesive force (mN) | adhesive impulse (N-s) | adhesive strength (Pa) | contact area without mucus (mm2) | contact pressure (Pa) | impact force (mN) | impact time (ms) | time frog pulls on target (ms) | total contact area (mm2) | ||
frog_id | date | trial number | ||||||||||||
II | 2013_03_19 | 1 | 1.54 | 0.96 | 3.79 | -655 | -0.385 | -1881 | 15 | 4633 | 1612 | 18 | 3087 | 348 |
2 | 0.69 | 0.90 | 1.42 | -292 | -0.294 | -1177 | 24 | 2441 | 605 | 55 | 1261 | 248 | ||
3 | 0.58 | 0.74 | 0.77 | -246 | -0.282 | -1894 | 34 | 2517 | 327 | 51 | 1508 | 130 | ||
4 | 0.58 | 0.76 | 2.23 | -245 | -0.340 | -2301 | 26 | 8893 | 946 | 59 | 1841 | 106 | ||
2013_03_21 | 1 | 1.30 | 0.94 | 1.27 | -553 | -0.432 | -2004 | 16 | 1959 | 541 | 33 | 3126 | 276 |
As earlier, let’s try to stack one of the levels of the column indexer:
[10]:
df2.stack('is_measured').head(20)
[10]:
variable | adhesive force (mN) | adhesive force / body weight | adhesive impulse (N-s) | adhesive strength (Pa) | contact area with mucus / contact area without mucus | contact area without mucus (mm2) | contact pressure (Pa) | impact force (mN) | impact force / body weight | impact time (ms) | time frog pulls on target (ms) | total contact area (mm2) | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
frog_id | date | trial number | is_measured | ||||||||||||
II | 2013_03_19 | 1 | False | NaN | 1.54 | NaN | NaN | 0.96 | NaN | NaN | NaN | 3.79 | NaN | NaN | NaN |
True | -655.0 | NaN | -0.385 | -1881.0 | NaN | 15.0 | 4633.0 | 1612.0 | NaN | 18.0 | 3087.0 | 348.0 | |||
2 | False | NaN | 0.69 | NaN | NaN | 0.90 | NaN | NaN | NaN | 1.42 | NaN | NaN | NaN | ||
True | -292.0 | NaN | -0.294 | -1177.0 | NaN | 24.0 | 2441.0 | 605.0 | NaN | 55.0 | 1261.0 | 248.0 | |||
3 | False | NaN | 0.58 | NaN | NaN | 0.74 | NaN | NaN | NaN | 0.77 | NaN | NaN | NaN | ||
True | -246.0 | NaN | -0.282 | -1894.0 | NaN | 34.0 | 2517.0 | 327.0 | NaN | 51.0 | 1508.0 | 130.0 | |||
4 | False | NaN | 0.58 | NaN | NaN | 0.76 | NaN | NaN | NaN | 2.23 | NaN | NaN | NaN | ||
True | -245.0 | NaN | -0.340 | -2301.0 | NaN | 26.0 | 8893.0 | 946.0 | NaN | 59.0 | 1841.0 | 106.0 | |||
2013_03_21 | 1 | False | NaN | 1.30 | NaN | NaN | 0.94 | NaN | NaN | NaN | 1.27 | NaN | NaN | NaN | |
True | -553.0 | NaN | -0.432 | -2004.0 | NaN | 16.0 | 1959.0 | 541.0 | NaN | 33.0 | 3126.0 | 276.0 | |||
2 | False | NaN | 1.56 | NaN | NaN | 0.72 | NaN | NaN | NaN | 3.62 | NaN | NaN | NaN | ||
True | -664.0 | NaN | -0.046 | -7802.0 | NaN | 24.0 | 18073.0 | 1539.0 | NaN | 43.0 | 741.0 | 85.0 | |||
3 | False | NaN | 0.61 | NaN | NaN | 0.90 | NaN | NaN | NaN | 1.24 | NaN | NaN | NaN | ||
True | -261.0 | NaN | -0.414 | -803.0 | NaN | 33.0 | 1627.0 | 529.0 | NaN | 28.0 | 2482.0 | 325.0 | |||
4 | False | NaN | 1.63 | NaN | NaN | 0.72 | NaN | NaN | NaN | 1.48 | NaN | NaN | NaN | ||
True | -691.0 | NaN | -0.071 | -2860.0 | NaN | 67.0 | 2600.0 | 628.0 | NaN | 31.0 | 998.0 | 242.0 | |||
2013_03_25 | 1 | False | NaN | 0.22 | NaN | NaN | 1.00 | NaN | NaN | NaN | 3.42 | NaN | NaN | NaN | |
True | -92.0 | NaN | -0.008 | -678.0 | NaN | 0.0 | 10645.0 | 1453.0 | NaN | 72.0 | 1652.0 | 136.0 | |||
2 | False | NaN | 1.33 | NaN | NaN | 0.97 | NaN | NaN | NaN | 0.70 | NaN | NaN | NaN | ||
True | -566.0 | NaN | -0.084 | -4506.0 | NaN | 4.0 | 2367.0 | 297.0 | NaN | 42.0 | 936.0 | 126.0 |
You will notice that NaN values are added wholesale. What’s happened here?
Well, each column holds either a measured or a calculated variable. Thus, when we stack by moving is_measured from a column indexer to a row indexer, rows with is_measured = True have NaN values in the columns holding calculated values, and rows with is_measured = False have NaN values in the columns holding directly measured values.
This is an often-encountered subtlety of unstacking and stacking data. In some ways it is also a feature: pandas won’t add data where it does not exist!
Tidying data with pd.melt¶
Thinking about the original data frame from the concatenated CSV files, one might consider it tidy, having one “tongue impact” observation in each row. Another point of view considers each recorded value as an observation, with the name of the variable as part of the metadata.
Let’s carry out this transformation. Then let’s also try to identify and separate the measured variables from the calculated ones, as done above. This time you might find the process a bit more slick.
To turn this wide data into a long, tidy format, we should identify the columns that identify observations, i.e., date, trial number, and frog_id. When pivoting the data, these columns will stay associated with the values. All other column headings will become values in rows. Think of it as similar to the stack operation worked out above.
[11]:
df_long = df.melt(id_vars=['date', 'trial number', 'frog_id'])
# identify measured data by the presence of a parenthesis
df_long['is_measured'] = df_long['variable'].str.contains(r'\(')
df_long.head()
[11]:
date | trial number | frog_id | variable | value | is_measured | |
---|---|---|---|---|---|---|
0 | 2013_03_19 | 1 | II | impact force (mN) | 1612.0 | True |
1 | 2013_03_19 | 2 | II | impact force (mN) | 605.0 | True |
2 | 2013_03_19 | 3 | II | impact force (mN) | 327.0 | True |
3 | 2013_03_19 | 4 | II | impact force (mN) | 946.0 | True |
4 | 2013_03_21 | 1 | II | impact force (mN) | 541.0 | True |
If you’re wondering why we’ve written .str.contains(r'\(') instead of .str.contains('('), take a peek at the documentation for pd.Series.str.contains: the pattern is treated as a regular expression by default, and a lone parenthesis is invalid regex. Hint: We could have also written .str.contains('(', regex=False).
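A quick sketch of the two equivalent spellings, both of which give the same boolean mask:
df_long['variable'].str.contains(r'\(')             # escape the regex metacharacter
df_long['variable'].str.contains('(', regex=False)  # or match the literal string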
Filtering for those values that are the direct result of an experimental measurement is done like so:
[12]:
df_long_direct = df_long[df_long['is_measured']].copy()
Note: .copy() helps avoid bugs, since filtering a pd.DataFrame may return a view rather than a copy. More detail here.
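A minimal sketch of the pitfall (the flag column is hypothetical, added only for illustration):
# filtering may return a view or a copy; pandas cannot always tell which,
# so assigning into the result can raise a SettingWithCopyWarning
df_maybe_view = df_long[df_long['is_measured']]
# df_maybe_view['flag'] = True  # may warn; the effect on df_long is ambiguous

# .copy() makes the filtered frame independent, so assignment is safe
df_safe = df_long[df_long['is_measured']].copy()
df_safe['flag'] = True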
To transform the values back to a wide format, as in the original data frame (though with only the directly measured quantities), we’ll need to do another pivot. Setting the indexes is the first step: date, trial number, frog_id, and variable.
[13]:
df_direct = df_long_direct.set_index(['date', 'trial number', 'frog_id', 'variable'])
df_direct.head(10)
[13]:
value | is_measured | ||||
---|---|---|---|---|---|
date | trial number | frog_id | variable | ||
2013_03_19 | 1 | II | impact force (mN) | 1612.0 | True |
2 | II | impact force (mN) | 605.0 | True | |
3 | II | impact force (mN) | 327.0 | True | |
4 | II | impact force (mN) | 946.0 | True | |
2013_03_21 | 1 | II | impact force (mN) | 541.0 | True |
2 | II | impact force (mN) | 1539.0 | True | |
3 | II | impact force (mN) | 529.0 | True | |
4 | II | impact force (mN) | 628.0 | True | |
2013_03_25 | 1 | II | impact force (mN) | 1453.0 | True |
2 | II | impact force (mN) | 297.0 | True |
Now the index to pivot on should be clear: variable. Only the column value needs to be kept, since is_measured is all True. To move a row indexer to a column indexer (i.e., going long to wide), we use unstack like so:
[14]:
df_wide_direct = df_direct['value'].unstack('variable')
df_wide_direct.head(10)
[14]:
variable | adhesive force (mN) | adhesive impulse (N-s) | adhesive strength (Pa) | contact area without mucus (mm2) | contact pressure (Pa) | impact force (mN) | impact time (ms) | time frog pulls on target (ms) | total contact area (mm2) | ||
---|---|---|---|---|---|---|---|---|---|---|---|
date | trial number | frog_id | |||||||||
2013_02_26 | 3 | I | -785.0 | -0.290 | -2030.0 | 70.0 | 3117.0 | 1205.0 | 46.0 | 884.0 | 387.0 |
4 | I | -983.0 | -0.181 | -9695.0 | 94.0 | 24923.0 | 2527.0 | 44.0 | 248.0 | 101.0 | |
2013_03_01 | 1 | I | -850.0 | -0.157 | -10239.0 | 79.0 | 21020.0 | 1745.0 | 34.0 | 211.0 | 83.0 |
2 | I | -455.0 | -0.170 | -1381.0 | 158.0 | 4718.0 | 1556.0 | 41.0 | 1025.0 | 330.0 | |
3 | I | -974.0 | -0.423 | -3975.0 | 216.0 | 2012.0 | 493.0 | 36.0 | 499.0 | 245.0 | |
4 | I | -592.0 | -0.176 | -1737.0 | 106.0 | 6676.0 | 2276.0 | 31.0 | 969.0 | 341.0 | |
2013_03_05 | 1 | I | -512.0 | -0.285 | -1427.0 | 110.0 | 1550.0 | 556.0 | 43.0 | 835.0 | 359.0 |
2 | I | -804.0 | -0.285 | -3266.0 | 178.0 | 7832.0 | 1928.0 | 46.0 | 508.0 | 246.0 | |
3 | I | -690.0 | -0.239 | -2568.0 | 224.0 | 9824.0 | 2641.0 | 50.0 | 491.0 | 269.0 | |
4 | I | -462.0 | -0.328 | -1733.0 | 176.0 | 7122.0 | 1897.0 | 41.0 | 839.0 | 266.0 |
As with pd.melt, you’ll also see the long-to-wide operation done with pd.pivot. Check this out:
[15]:
df_wide_direct = df_long_direct.set_index(['date', 'trial number', 'frog_id'])
# drop unneeded column
df_wide_direct = df_wide_direct.drop(columns=['is_measured'])
df_wide_direct.pivot(columns='variable').head()
[15]:
value | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
variable | adhesive force (mN) | adhesive impulse (N-s) | adhesive strength (Pa) | contact area without mucus (mm2) | contact pressure (Pa) | impact force (mN) | impact time (ms) | time frog pulls on target (ms) | total contact area (mm2) | ||
date | trial number | frog_id | |||||||||
2013_02_26 | 3 | I | -785.0 | -0.290 | -2030.0 | 70.0 | 3117.0 | 1205.0 | 46.0 | 884.0 | 387.0 |
4 | I | -983.0 | -0.181 | -9695.0 | 94.0 | 24923.0 | 2527.0 | 44.0 | 248.0 | 101.0 | |
2013_03_01 | 1 | I | -850.0 | -0.157 | -10239.0 | 79.0 | 21020.0 | 1745.0 | 34.0 | 211.0 | 83.0 |
2 | I | -455.0 | -0.170 | -1381.0 | 158.0 | 4718.0 | 1556.0 | 41.0 | 1025.0 | 330.0 | |
3 | I | -974.0 | -0.423 | -3975.0 | 216.0 | 2012.0 | 493.0 | 36.0 | 499.0 | 245.0 |
Reshaping advice¶
The approach of setting indexes and then pivoting with stack or unstack is our preferred way of going from long to wide and wide to long. It helps you be explicit and avoid bugs in the process. pd.melt is convenient, and pd.pivot at times, too, but getting familiar with indexes, stack, and unstack will go a long way in reshaping and transforming your data.
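As a compact illustration of the idiom, here is a sketch of the full round trip on the frog data:
# wide: name the identifying indexes explicitly, then unstack one of them
wide = (
    df.set_index(['frog_id', 'trial number', 'date'])['impact force (mN)']
      .unstack('frog_id')
)

# and back to long: stack the column indexer and restore a flat frame
long_again = wide.stack('frog_id').reset_index(name='impact force (mN)')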
You may also come across the function pd.pivot_table. We don’t use it too often, since values identified by the same index positions are aggregated by default. In practice, it’s better to perform this aggregation explicitly once your data is tidied in a long format (think split-apply-combine). Take a look at the docs for pd.pivot_table to learn more.
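To see that default aggregation concretely, here is a minimal sketch: each (date, frog_id) cell covers several trials, which pivot_table silently collapses with its default aggfunc, the mean.
# mean impact force per frog per date; trials sharing a cell are averaged
pd.pivot_table(
    df,
    values='impact force (mN)',
    index='date',
    columns='frog_id',
)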
Tidying your tidy data¶
With the data molded into a tidy format (having covered the various manipulations along the way), we can process groups of observations and variables in uniform ways. The general strategy is to groupby and then use one of apply, agg, or transform on each group. In the context of the data here, we’ll just group by frog_id.
apply is the most general, taking a single function as its argument. This function receives each group as a data frame, performs arbitrary operations on/with it, and returns a data frame, series, or scalar. Consider returning the slope \(m\) and intercept \(b\) of a linear regression (think \(y = mx + b\); check out these numpy docs) for each group using, say, the columns impact force (mN) and impact time (ms).
Here we consider each tongue-hit along with its experimentally measured values as an observation, so we’ll want to use the concatenated data frame df.
[16]:
def calculate_linear_reg(df_grp, x, y):
    """Calculate a linear regression between the specified columns."""
    # linregress returns (slope, intercept, rvalue, pvalue, stderr);
    # we only need the first three
    m, b, r_value = st.linregress(df_grp[x], df_grp[y])[:3]
    return pd.Series({'m': m, 'b': b, 'rsq': r_value**2})
df.groupby('frog_id').apply(calculate_linear_reg, x='impact force (mN)', y='impact time (ms)')
[16]:
m | b | rsq | |
---|---|---|---|
frog_id | |||
I | 0.003814 | 34.013441 | 0.095088 |
II | 0.005618 | 33.675927 | 0.025363 |
III | -0.104138 | 100.636094 | 0.167798 |
IV | 0.020265 | 26.906864 | 0.083563 |
Not much of a relationship here…
Note: pd.groupby.apply is a pretty “heavy” operation when passing large groups (say, many columns). It can be useful to select only the columns necessary for the calculation, like so:
df.groupby('frog_id')[['impact force (mN)', 'impact time (ms)']].apply(
    calculate_linear_reg, x='impact force (mN)', y='impact time (ms)')
Want to see how long an apply operation will take? Check out tqdm. The following will be helpful to get going:
import tqdm
tqdm.tqdm.pandas()
df.groupby([groups]).progress_apply(process_group)
agg takes a single function (or list of functions) that accepts a pd.Series, np.array, list, or similar and returns a single value summarizing the input. The groupby plus agg combination will do the hard work of aggregating each column of each group with the function(s) passed. Consider calculating the mean and standard deviation of the force measurements (say, impact force (mN), adhesive force (mN), and both normalized to body weight).
Here a single observation corresponds not only to a tongue-hit but also to the type of experimentally measured value associated with it. We’ll want to use df_long.
[17]:
df_forces = (
    df_long[df_long['variable'].str.contains('force')]
    .groupby(['frog_id', 'variable'])['value']
    .agg([np.mean, np.std])
)
df_forces
[17]:
mean | std | ||
---|---|---|---|
frog_id | variable | ||
I | adhesive force (mN) | -658.4000 | 167.143619 |
adhesive force / body weight | 1.0645 | 0.268729 | |
impact force (mN) | 1530.2000 | 630.207952 | |
impact force / body weight | 2.4715 | 1.018209 | |
II | adhesive force (mN) | -462.3000 | 203.811600 |
adhesive force / body weight | 1.0875 | 0.479702 | |
impact force (mN) | 707.3500 | 424.573256 | |
impact force / body weight | 1.6640 | 0.998854 | |
III | adhesive force (mN) | -206.7500 | 88.122448 |
adhesive force / body weight | 1.6635 | 0.708484 | |
impact force (mN) | 550.1000 | 124.273849 | |
impact force / body weight | 4.4260 | 0.999407 | |
IV | adhesive force (mN) | -263.6000 | 83.309442 |
adhesive force / body weight | 1.9640 | 0.620801 | |
impact force (mN) | 419.1000 | 234.864328 | |
impact force / body weight | 3.1200 | 1.749009 |
[18]:
# unstack for easy viewing
df_forces.unstack('variable')
[18]:
mean | std | |||||||
---|---|---|---|---|---|---|---|---|
variable | adhesive force (mN) | adhesive force / body weight | impact force (mN) | impact force / body weight | adhesive force (mN) | adhesive force / body weight | impact force (mN) | impact force / body weight |
frog_id | ||||||||
I | -658.40 | 1.0645 | 1530.20 | 2.4715 | 167.143619 | 0.268729 | 630.207952 | 1.018209 |
II | -462.30 | 1.0875 | 707.35 | 1.6640 | 203.811600 | 0.479702 | 424.573256 | 0.998854 |
III | -206.75 | 1.6635 | 550.10 | 4.4260 | 88.122448 | 0.708484 | 124.273849 | 0.999407 |
IV | -263.60 | 1.9640 | 419.10 | 3.1200 | 83.309442 | 0.620801 | 234.864328 | 1.749009 |
Looks like frog_id I packs a punch, although frog_id III isn’t too shabby when normalized by body weight! A strip plot showing all of the force measurements is probably a good next step!
transform is similar to agg in that it accepts a pd.Series, np.array, or list and is applied to each column of each group. However, transform returns a list-like of the same length as its input. Consider calculating the z-score of impact force (mN).
In this case, too, a single observation corresponds not only to a tongue-hit but also to the type of experimentally measured value associated with it. We’ll want to use df_long (and write back into that data frame, in a new column).
[19]:
def calculate_zscore(x):
    """Center and scale values to zero mean and unit standard deviation."""
    return (x - np.mean(x)) / np.std(x)
df_long['zscore'] = df_long.groupby(['frog_id', 'variable'])['value'].transform(calculate_zscore)
df_long.head()
[19]:
date | trial number | frog_id | variable | value | is_measured | zscore | |
---|---|---|---|---|---|---|---|
0 | 2013_03_19 | 1 | II | impact force (mN) | 1612.0 | True | 2.186080 |
1 | 2013_03_19 | 2 | II | impact force (mN) | 605.0 | True | -0.247328 |
2 | 2013_03_19 | 3 | II | impact force (mN) | 327.0 | True | -0.919113 |
3 | 2013_03_19 | 4 | II | impact force (mN) | 946.0 | True | 0.576696 |
4 | 2013_03_21 | 1 | II | impact force (mN) | 541.0 | True | -0.401984 |
Now we have a z-score within each frog_id, variable group for each value… Might be a neat trick for calculating ECDFs!
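For instance, a sketch of computing ECDF values within each group using transform (the ecdf column is our own addition):
# fraction of observations in the group less than or equal to each value
df_long['ecdf'] = df_long.groupby(['frog_id', 'variable'])['value'].transform(
    lambda x: x.rank(method='first') / len(x)
)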
A plot¶
Why play with frogs trying to eat insects if we don’t make it to a plot!? Since we have z-scores for each observation, across frog_ids and for each measurement, let’s plot the values, coloring by z-score to draw our attention to any potentially extreme values. Consider this a start at exploring the frog-tongue data set.
[20]:
df_plot = df_long.copy()
df_plot['extreme'] = np.abs(df_plot['zscore']) > 1.75
df_plot = df_plot.sort_values(by=['frog_id', 'extreme'])
hv.Scatter(
data=df_plot,
kdims=[('frog_id', 'variable')],
vdims=['value', 'variable', 'extreme'],
).groupby(
'variable'
).opts(
color='extreme',
hooks=[bebi103.hv.no_xgrid_hook],
jitter=0.1,
show_legend=False,
width=200,
xlabel="",
ylabel="",
).layout(
).opts(
shared_axes=False,
)
[20]:
Computing environment¶
[21]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,holoviews,bebi103,jupyterlab
CPython 3.7.4
IPython 7.1.1
numpy 1.17.2
scipy 1.3.1
pandas 0.24.2
bokeh 1.3.4
holoviews 1.12.6
bebi103 0.0.42
jupyterlab 1.1.4