Exploratory data analysis


In 1977, John Tukey, one of the great statisticians and mathematicians of all time, published a book entitled Exploratory Data Analysis. In it, he laid out general principles on how researchers should handle their first encounters with their data, before formal statistical inference. Most of us spend a lot of time doing exploratory data analysis, or EDA, without really knowing it. For the most part, EDA is a graphical exploration of a data set.

We start off with a few wise words from John Tukey himself.

Useful EDA advice from John Tukey

  • “Exploratory data analysis can never be the whole story, but nothing else can serve as a foundation stone—as the first step.”

  • “In exploratory data analysis there can be no substitute for flexibility; for adapting what is calculated—and what we hope plotted—both to the needs of the situation and the clues that the data have already provided.”

  • “There is no excuse for failing to plot and look.”

  • “There is often no substitute for the detective’s microscope—or for the enlarging graphs.”

  • “Graphs force us to note the unexpected; nothing could be more important.”

  • “‘Exploratory data analysis’ is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.”

The tools of EDA

Being able to load a data set and quickly start exploring it graphically lets you think about your data instead of being mired in the mechanics of producing a plot. In the notebooks that follow in this lesson, we will learn how to use Python-based tools for EDA. In particular, we will learn how to use Pandas to keep the data set organized and accessible, and Bokeh and HoloViews to make interactive graphics.

Along the way, we will learn key concepts of data organization and display. Importantly, we will learn about tidy data, split-apply-combine, and how to plot all of your data.
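
As a small preview of two of these ideas, here is a minimal sketch using Pandas. The data set is made up for illustration (the column names `species` and `length_mm` and all values are hypothetical, not taken from this lesson's data). Each row is one observation and each column is one variable, which is the sense in which the data are tidy. The `groupby` call then splits the rows by species, applies a mean to each group, and combines the results into a single object.

```python
import pandas as pd

# Hypothetical tidy data set: one row per observation, one column per variable.
df = pd.DataFrame(
    {
        "species": ["A", "A", "B", "B", "B"],
        "length_mm": [10.0, 12.0, 20.0, 22.0, 24.0],
    }
)

# Split the rows by species, apply the mean to each group,
# and combine the per-group results into one Series indexed by species.
mean_length = df.groupby("species")["length_mm"].mean()

print(mean_length)
```

Because the data are tidy, the grouping variable is simply a column name, so the same one-line pattern would work for any other categorical column in the data frame.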

Before we set out on this path, though, we need to learn a bit about NumPy and SciPy, which form the foundation upon which many of these tools are built.
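
To give a rough sense of what that foundation looks like in practice, the sketch below (with made-up values and a hypothetical column name `x`) shows that the numbers in a Pandas column can be handed to NumPy as an array, and that NumPy functions can be applied to Pandas objects directly.

```python
import numpy as np
import pandas as pd

# A Pandas Series holding a few made-up numerical values.
s = pd.Series([1.0, 4.0, 9.0], name="x")

# The underlying values are available as a NumPy array.
values = s.to_numpy()
print(type(values))

# NumPy functions operate on the array...
mean_value = np.mean(values)

# ...and also on the Series itself, returning a new Series.
roots = np.sqrt(s)
print(roots)
```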