Tutorial 2b: Exploratory data analysis¶

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from an Jupyter notebook. You can download the notebook here.

import pandas as pd

import altair as alt
import bokeh.io

import bebi103
import altair_catplot as altcat

bokeh.io.output_notebook()

In 1977, John Tukey, one of the great statisticians and mathematicians of all time, published a book entitled Exploratory Data Analysis. In it, he laid out general principles on how researchers should handle their first encounters with their data, before formal statistical inference. Most of us spend a lot of time doing exploratory data analysis, or EDA, without really knowing it. Mostly, EDA involves a graphical exploration of a data set. In this tutorial, we will learn how to efficiently explore data.

We start off with a few wise words from John Tukey himself.

Useful EDA advice from John Tukey¶

"Exploratory data analysis can never be the whole story, but nothing else can serve as a foundation stone—as the first step."

"In exploratory data analysis there can be no substitute for flexibility; for adapting what is calculated—and what we hope plotted—both to the needs of the situation and the clues that the data have already provided."

"There is no excuse for failing to plot and look."

"There is often no substitute for the detective's microscope - - or for the enlarging graphs."

"Graphs force us to note the unexpected; nothing could be more important."

"'Exploratory data analysis' is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there."

Our data set¶

We will again be playing with the data set of frog strikes the tutorial on data frames. As a brief reminder, the data set comes from Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014. The authors measured various physical properties tongues of horned frogs hitting a target connected to a force transducer. Let's load in that data set so we have it available.

df = pd.read_csv('../data/frog_tongue_adhesion.csv', comment='#')

data_dict = {'ID': ['I', 'II', 'III', 'IV'],
             'age': ['adult', 'adult', 'juvenile', 'juvenile'],
             'SVL (mm)': [63, 70, 28, 31],
             'weight (g)': [63.1, 72.7, 12.7, 12.7],
             'species': ['cross', 'cross', 'cranwelli', 'cranwelli']}

df = df.merge(pd.DataFrame(data=data_dict))

# Check it out!
df.head()

Exploring with scatter plots¶

Ok, now that that is loaded, let's try to make a plot of the impact force versus adhesive force like we did in a previous tutorial. As we're exploring the data, it is also useful to color the points for each frog to aid in our visualization.

alt.Chart(df
    ).mark_point(
    ).encode(
        x='impact force (mN):Q',
        y='adhesive force (mN):Q',
        color=alt.Color('ID:N', title='frog')
    )

This is of course a useful plot, but we might like to make a similar plot for many variates so we can see how many pairs of variables are related. We can do this conveniently using Altair but making a repeat chart. As is often the case, Altair's grammar is so intuitive, you will be understand how it works simply by looking at an example.

variates = ['impact force (mN)', 
            'adhesive force (mN)', 
            'contact pressure (Pa)',
            'impact time (ms)']

alt.Chart(df,
        width=100,
        height=100
    ).mark_point(
    ).encode(
        x=alt.X(alt.repeat('column'), type='quantitative'),
        y=alt.Y(alt.repeat('row'), type='quantitative'),
        color=alt.Color('ID:N', title='frog')
    ).repeat(
        row=variates,
        column=variates
    ).interactive()

Display of repeated measurments¶

Thus far, we have focused on scatter plots to highlight how you can use Altair to rapidly explore your data sets. For the rest of this tutorial, we will focus on good practices for displaying a data type commonly encountered in biology. We repeat a measurement many times for given test conditions and wish to compare the results. The different conditions are called categories, and the axis along which the conditions are represented is called a categorical axis. The concrete example we will use will be the impact force of the frog strikes. The categories are the IDs of the frogs. This would be completely analogous to, say, comparing the length of dorsal appendages of eggs from fruit flys of four different genotypes.

Bar graphs (don't do this)¶

I will start with pretty much the worst, and probably most ubiquitous, mode of display: the bar graph.

alt.Chart(df
    ).mark_bar(
    ).encode(
        x='impact force (mN):Q',
        y=alt.Y('ID:N', title='frog')
    )

This way of displaying data is just plain awful. Do not do it. You are only graphically showing the means and using a lot of real estate to do it. Why would you decide to only display four points when you actually measured 80 tongue strikes?

Box plots¶

If you are going to summarize the data, a box-and-whisker plot, also just called a box plot is a better option. Indeed, it was invented by John Tukey himself. Instead of condensing your measurements into one value (or two, if you include an error bar) like in a bar graph, you condense them into at least five. It is easier to describe a box plot if you have one to look at. Unfortunately, Altair currently (as of September 25, 2018) does not have the capability to make box plots because Vega-Lite 2.x does not have that capability. Vega-Lite 3.0 does, however, so Altair will soon as well.

For now, we will use the Altair-Catplot module to make box plots and other plots where one of the axes is categorical. The Altair-Catplot module uses similar syntax as Altair, but uses dictionaries to specify encodings (this is intentional so as not to confuse what is present in Vega-Lite/Altair and what is not). Every plot is generated using the altcat.catplot() function, and the transform kwarg determines how the data are transformed for display.

altcat.catplot(data=df,
               encoding=dict(x=alt.X('ID:N', title=None),
                             y='impact force (mN):Q',
                             color=alt.Color('ID:N', legend=None)),
               transform='box'
              ).configure_bar(
                    fillOpacity=0.75
              ).configure_axisX(
                  labelAngle=0,
              )

The top of a box is the 75th percentile of the measured data. That means that 75 percent of the measurements were less than the top of the box. The bottom of the box is the 25th percentile. The line in the middle of the box is the 50th percentile, also called the median. Half of the measured quantities were less than the median, and half were above. The total height of the box encompasses the measurements between the 25th and 75th percentile, and is called the interquartile region, or IQR. The top whisker extends to the minimum of these two quantities: the largest measured data point and the 75th percentile plus 1.5 times the IQR. Similarly, the bottom whisker extends to the maximum of the smallest measured data point and the 25th percentile minus 1.5 times the IQR. Any data points not falling between the whiskers are then plotting individually, and are typically termed outliers, though there are many ways we could define what an outlier is (more on that later in the course).

So, box-and-whisker plots give much more information than a bar plot. They give a reasonable summary of how data are distributed.

Plot all of your data¶

In a scatter plot, you plot all of your data points. Shouldn't the same be true for categorical plots? You went through all the work to get the data; you should show them all!

Strip plot¶

One convenient way to plot all of your data is a strip plot. This is implemented in Altair by simply specifying one of the axes to be a categorical (nominal or ordinal in Altair-speak) variable.

alt.Chart(df
    ).mark_tick(
    ).encode(
        x='impact force (mN):Q',
        y=alt.Y('ID:N', title='frog'),
        color=alt.Color('ID:N', legend=None)
    )

This plot is more informative even than the box-and-whisker plot. We see that frog I strikes with a pretty broad distribution of impact forces. Frog II likes to strike at around 500 mN, but is capable of much bigger strikes around 1500 mN.

Jitter plots¶

Strip plots are useful, but they have the issue that data points can overlap and it is therefore hard to tell where all of the measurements are. To remedy that, we can make a jitter plot. Here, instead of lining all of the data points up exactly in line with the category, we randomly "jitter" the points about the centerline. There are many approaches to jittering, and some are not even random, like beeswarm plots. See, for example, this package for demonstrations of different jittering algorithms (it's in R, but that's ok). Jitter transforms are not yet implemented in Vega-Lite, and therefore also not in Altair either. You can instead use Altair-catplot to make jitter plots.

altcat.catplot(data=df,
               mark='point',
               encoding=dict(y=alt.Y('ID:N', title=None, fontWeight='bold'),
                             x='impact force (mN):Q', 
                             color=alt.Color('ID:N', legend=None)),
               transform='jitter'
              ).configure_text(
                  fontWeight='bold'
              )

Sometimes we want to overlay jitter and box plots. This is a way to plot all of your data and also to show important summary statistics. The 'jitterbox' transform in Altair-catplot allows this. Here I will also add a tooltip where I can hover over to see the date of the experiment to see if the outliers all happened on the same day. This is part of what exploratory data analysis is all about. Setting up graphical displays of data whereby you can quickly see patterns or problems in your data set.

altcat.catplot(df,
               height=250,
               width=450,
               mark='point',
               encoding=dict(y=alt.Y('ID:N', title=None),
                             x='impact force (mN):Q',
                             color=alt.Color('ID:N', legend=None),
                             tooltip=alt.Tooltip(['date:N'], title='date')),
               transform='jitterbox'
              ).configure_text(
                  fontWeight='bold'
              )

Histograms¶

We often want a way to visualize the probability distribution from which the data are drawn. A common way to do this is to plot a histogram. The data are placed in bins, and the height of the bars corresponding to each bin are proportional to the number of data that lie between the bounds of the respective bin. This is accomplished in Altair using the bin kwarg in setting up the x-axis and specifying specifying the name of the y-axis as 'count()', which instructs Altair to use the count of data points in the bin to set the height. We will demonstrate a histogram for all of the impact forces taken together as a single data set.

alt.Chart(df
    ).mark_bar(
    ).encode(
        alt.X('impact force (mN):Q', bin=alt.Bin(maxbins=15)),
        y='count()'
    )

At first glance, there seems to be some bimodality. We could start playing around with the number of bins to look more closely, but I will not bother to do that here. In general, I contend that a histogram is rarely the best way to plot the distribution of repeated measurements. Instead, an ECDF is better....

The ECDF¶

While informative in this case, I generally frown on using histograms to display distributions of measurements. People for whom I have great respect (e.g., this guy disagree with me on this, but my convictions on it are strong, as are other people I respect (e.g., this guy). The primary reason for this is binning bias. To create a histogram, we necessarily have to consider not the exact values of measurements, but we must place these measurements in bins. We do not plot all of the data, which goes against the "plot all your data mantra," and the choice of bins can change what you infer from the plot. Instead, we should look at the empirical cumulative distribution function, or ECDF. The ECDF evaluated at x for a set of measurements is defined as

ECDF(x) = fraction of measurements ≤ x.

While the histogram is an attempt to visualize a probability density function (PDF) of a distribution, the ECDF visualizes the cumulative density function (CDF). Remember that the CDF, $F(x)$ and PDF, ($f(x)$ both completely define a univariate distribution and are related by

\begin{align} f(x) = \frac{\mathrm{d}F}{\mathrm{d}x}. \end{align}

The 'ecdf' transform of Altair-catplot allows for quick plotting of the ECDF.

altcat.catplot(data=df,
               mark='point',
               encoding=dict(x='impact force (mN):Q'),
               transform='ecdf')

Now, given the above definition of the ECDF, it is defined for all real x. So, formally, the ECDF is a continuous function (with discontinuous derivatives at each data point). So, it should be plotted like a staircase according to the formal definition. This can be accomplished by specifying our mark to be a line.

altcat.catplot(data=df,
               mark='line',
               encoding=dict(x='impact force (mN):Q'),
               transform='ecdf')

While this is certainly a correct depiction of the ECDF, I prefer the dots version for data sets containing more than about 20 data points. I generally prefer this because it is easier to see the values of the measurements, which occur at the concave corners of the staircase representation. That said, when there are too few points, it is harder to make out the shape of the ECDF, so in those cases, I prefer the staircase. This frog data set, split up by frog, is right in the region where my preference could go either way. Either representation contains all of the information of the ECDF, so they are both legitimate, though the staircase technically matches the definition of an ECDF.

To plot the ECDFs for each frog so we can make comparisons, we add a color encoding.

altcat.catplot(data=df,
               mark='line',
               encoding=dict(x='impact force (mN):Q',
                             color=alt.Color('ID:N', title='frog')),
               transform='ecdf')

Here we can make a clear comparison. The broad distribution of frog I's impact force is clear, as it the generally greater magnitude of the impacts.

I typically prefer this way of displaying this sort of data. One problem is that the plot can get crowded and hard to read if there are too many categories. Here, we have four, which is manageable. Generally, I always go with ECDFs up to about six categories. Depending on how spread out the ECDFs are, I can go higher, but if they are crowded, and I absolutely need to have the different categories on the same plot, I will resort to a jitter or strip plot.

A short visit to the zebrafish sleep data¶

To illustrate the importance and efficacy of doing exploratory data analysis with tidy data, we will again visit the zebrafish activity data of the previous tutorial. We will load in the resampled data we generated and saved.

df_fish = pd.read_csv('../data/130315_1A_aanat2_resampled.csv')

# Take a look as a reminder
df_fish.head()

We might be interested in averaging the traces for each genotype. That is, for each time point (which we have marked with the zeit_ind column), we average the activity over all locations of a given genotype, and then plot the resulting trace.

First, we will do a split-apply-combine on the DataFrame to give us a new DataFrame with the averaged traces. The idea is to group by 'genotype' and 'zeit_ind', perform the averaging, and then reset the index to get a tidy DataFrame back.

averaged_df = df_fish.groupby(['genotype', 'zeit_ind']).mean().reset_index()

# Location columns is irrelovant
del averaged_df['location']

# Take a look
averaged_df.head()

Now that we have the data we wish to plot, we will use Altair to build it.

alt.Chart(averaged_df
    ).mark_line(
        strokeJoin='bevel'
    ).encode(
        x=alt.X('zeit:Q', title='time (hours)'),
        y='activity:Q',
        color=alt.Color('genotype:N', sort=['wt', 'het', 'mut']),
        order='zeit:Q'
    )

We can quickly see that at night (the nights are obvious; the periods of low activity, but we could do some extra work and annotate them on the plot if we like), the mutant fish are more active than the wild type or heterozygotic fish.

This is useful, but remember my advice, if you can, plot all of your data. We should do that to look at all traces and make sure we don't have anything anomalous. To make the plots of all trajectories, we can use the row encoding. Since we are using the whole data frame, we will get a max rows error from Altair, so we need to enable a JSON data transformer.

alt.data_transformers.enable('json')

alt.Chart(df_fish
    ).mark_line(
        strokeJoin='bevel'
    ).encode(
        x='zeit:Q',
        y='activity:Q',
        color=alt.Color('genotype:N', sort=['wt', 'het', 'mut']),
        order='zeit:Q',
        row=alt.Row('location:N', sort=['genotype', 'location']),
    ).properties(
        height=50
    ).interactive(
        bind_y=False
    )

We see something interesting in the mutant traces. Specifically, there appear to be two fish, in locations 21 and 67, that are up all night on a couple of the nights. You will also see that fish 21 has very low activity early in the experiment, and then its activity ramps up. It then stays consistently active for a couple of days, and then basically dies. For fish 67, it has almost no activity until the middle of the experiment, and then its activity ramps up. I have spoken with experts on zebrafish behavior, and then tell me that this may likely be the sick fish, showing effects that might not be due to the mutation. We would have had a really hard time identifying these fish if it were not for efficient EDA.

In light of the fact that these fish may be true outliers, we may want to consider a summary statistic that is robust to outliers, like the median, instead of the mean when plotting the characteristic genotype trajectories. So, let's remake the plot with the summary trace for each genotype.

median_df = df_fish.groupby(['genotype', 'zeit_ind']).median().reset_index()

alt.Chart(median_df
    ).mark_line(
        strokeJoin='bevel'
    ).encode(
        x=alt.X('zeit:Q', title='time (hours)'),
        y='activity:Q',
        color=alt.Color('genotype:N', sort=['wt', 'het', 'mut']),
        order='zeit:Q'
    )

The effect of greater activity among mutant fish is still apparent, but less pronouced.

Making categorical plots with Bokeh¶

Altair has very useful syntax for quickly doing exploratory data analysis. The Altair-catplot module extends some of its capabilities for plots with a categorical variable by enabling box, jitter, and ECDF plots. While Altair and Vega-Lite are excellent tools for exploratory data analysis, they can sometimes have performance issues with larger data sets. If this is a concern for you, HoloViews is an excellent package for high-level plotting. We may have an auxiliary lesson on it at some point. Alternatively, we can build categorical plots using Bokeh. This is of course much more verbose than with Altair (or HoloViews). I therefore made some useful plotting functions in the bebi103.viz submodule. In what follows, I will re-do many of the plots above using this module so you can see how the plots are made. For plots with catagorical data, for all of the functions in the submodule, you need to input a tidy data frame, and argument cats, which gives the columns of the data frame that are categorical in your plot, and argument val, which is the column containing the quantitative data to plot.

Box plot¶

p = bebi103.viz.box(data=df, 
                    cats='ID',
                    val='impact force (mN)',
                    x_axis_label=None)
bokeh.io.show(p)

Jitter plot¶

p = bebi103.viz.jitter(data=df, 
                       cats='ID',
                       val='impact force (mN)',
                       y_axis_label=None,
                       horizontal=True)
bokeh.io.show(p)

Box-jitter plot¶

# Box plot
p = bebi103.viz.box(data=df, 
                    cats='ID',
                    val='impact force (mN)',
                    y_axis_label=None,
                    box_kwargs={'fill_alpha': 0.2},
                    horizontal=True)

# Add jitter; note kwarg p=p to add to existing figure
p = bebi103.viz.jitter(data=df, 
                       cats='ID',
                       val='impact force (mN)',
                       horizontal=True,
                       p=p)
bokeh.io.show(p)

Histogram¶

p = bebi103.viz.histogram(data=df['impact force (mN)'], kind='step_filled', fill_alpha=0.5, bins=15, density=False)
bokeh.io.show(p)

ECDF collection¶

p = bebi103.viz.ecdf_collection(data=df, 
                                cats='ID',
                                val='impact force (mN)',
                                formal=True,
                                line_width=2)
bokeh.io.show(p)

Computing environment¶

%load_ext watermark

%watermark -v -p pandas,altair,altair_catplot,bokeh,bebi103,jupyterlab

CPython 3.7.0
IPython 6.5.0

pandas 0.23.4
altair 2.2.2
altair_catplot 0.0.2
bokeh 0.13.0
bebi103 0.0.23
jupyterlab 0.34.9

	date	ID	trial number	impact force (mN)	impact time (ms)	impact force / body weight	adhesive force (mN)	time frog pulls on target (ms)	adhesive force / body weight	adhesive impulse (N-s)	total contact area (mm2)	contact area without mucus (mm2)	contact area with mucus / contact area without mucus	contact pressure (Pa)	adhesive strength (Pa)	age	SVL (mm)	weight (g)	species
0	2013_02_26	I	3	1205	46	1.95	-785	884	1.27	-0.290	387	70	0.82	3117	-2030	adult	63	63.1	cross
1	2013_02_26	I	4	2527	44	4.08	-983	248	1.59	-0.181	101	94	0.07	24923	-9695	adult	63	63.1	cross
2	2013_03_01	I	1	1745	34	2.82	-850	211	1.37	-0.157	83	79	0.05	21020	-10239	adult	63	63.1	cross
3	2013_03_01	I	2	1556	41	2.51	-455	1025	0.74	-0.170	330	158	0.52	4718	-1381	adult	63	63.1	cross
4	2013_03_01	I	3	493	36	0.80	-974	499	1.57	-0.423	245	216	0.12	2012	-3975	adult	63	63.1	cross

	location	time	activity	zeit	zeit_ind	day	genotype	light
0	1	2013-03-15 18:30:00	85.888889	-14.500000	-869	4	het	True
1	1	2013-03-15 18:40:00	4.500000	-14.333333	-860	4	het	True
2	1	2013-03-15 18:50:00	0.000000	-14.166667	-850	4	het	True
3	1	2013-03-15 19:00:00	0.000000	-14.000000	-840	4	het	True
4	1	2013-03-15 19:10:00	0.000000	-13.833333	-830	4	het	True

	genotype	zeit_ind	activity	zeit	day	light
0	het	-869	17.921569	-14.500000	4.0	True
1	het	-860	7.900000	-14.333333	4.0	True
2	het	-850	4.891176	-14.166667	4.0	True
3	het	-840	4.041176	-14.000000	4.0	True
4	het	-830	1.411765	-13.833333	4.0	True