Categorical axes and HoloViews


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade colorcet datashader bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import pandas as pd

import holoviews as hv

import bebi103

hv.extension('bokeh')
bebi103.hv.set_defaults()

We will be using iqplot to make all of the kinds of plots in this notebook. This notebook is therefore for reference, and you may skip reading it.


We have seen how to handle categorical axes with Bokeh, including nested axes. We will now do the same with HoloViews, which automatically infers categorical types for the axes. The caveat is that the entries in a categorical column of the data frame must be strings. For example, if we had a column containing True and False values, as we have had for 'insomnia', and we wanted to use it as a categorical variable, we would have to convert its data type to str.
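As a minimal sketch of that conversion, the snippet below builds a small, hypothetical data frame (the values are made up for illustration and are not from the data set) and converts a Boolean column to strings with the pandas astype() method.

```python
import pandas as pd

# Hypothetical toy data frame; these values are for illustration only
toy = pd.DataFrame(
    {
        "insomnia": [True, False, True],
        "percent correct": [72.5, 90.0, 62.5],
    }
)

# Convert the Boolean column to strings so it can serve as a categorical variable
toy["insomnia"] = toy["insomnia"].astype(str)

print(toy["insomnia"].tolist())
```

After the conversion, the entries are the strings "True" and "False" rather than Booleans.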

With that in mind, let’s load in the data set and make the usual adjustments as we prepare for plotting.

[2]:
fname = os.path.join(data_path, "gfmt_sleep.csv")
df = pd.read_csv(fname, na_values="*")
df["insomnia"] = df["sci"] <= 16
df["sleeper"] = df["insomnia"].apply(lambda x: "insomniac" if x else "normal")
df["gender"] = df["gender"].apply(lambda x: "female" if x == "f" else "male")

df.head()
[2]:
participant number gender age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess insomnia sleeper
0 8 female 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2 True insomniac
1 16 male 42 90 90 90.0 75.5 55.5 70.5 50.0 75.0 50.0 4 11 7 True insomniac
2 18 female 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3 True insomniac
3 22 female 35 100 75 87.5 89.5 NaN 71.0 80.0 88.0 80.0 13 8 20 True insomniac
4 27 female 74 60 65 62.5 68.5 49.0 61.0 49.0 65.0 49.0 13 9 12 True insomniac

Bar graphs (don’t do this)

I will start with pretty much the worst, and probably most ubiquitous, mode of display: the bar graph. To make the bar graph, we have to make a new data frame to specify the bars, which we will take to be the mean percent correct for each (gender, sleeper) pair.

[3]:
df_mean = df.groupby(['gender', 'sleeper'])['percent correct'].mean().reset_index()

Now we can make the bar graph. HoloViews knows that both 'gender' and 'sleeper' are categorical because they have str data types. It also automatically makes a nested categorical axis if we specify two or more kdims.

[4]:
hv.extension("bokeh")

hv.Bars(
    data=df_mean,
    kdims=['gender', 'sleeper'],
    vdims=['percent correct']
).opts(
    xlabel='',
    ylim=(0, 100),
)
[4]:

This way of displaying data is just plain awful. Do not do it. You are only graphically showing the means and using a lot of real estate to do it. Why would you decide to only display four points when you actually measured a whole lot more?

Box plots

If you are going to summarize the data, a box-and-whisker plot, also just called a box plot, is a better option than a bar graph. Indeed, it was invented by John Tukey himself. Instead of condensing your measurements into one value (or two, if you include an error bar) like in a bar graph, you condense them into at least five. It is easier to describe a box plot if you have one to look at.

[5]:
hv.extension("bokeh")

hv.BoxWhisker(
    data=df,
    kdims=['gender', 'sleeper'],
    vdims=['percent correct'],
).opts(
    box_color='sleeper'
)
BokehUserWarning: ColumnDataSource's columns must be of the same length. Current lengths: ('index', 2), ('percent correct', 2), ('percent_correct', 0)
[5]:

The top of a box is the 75th percentile of the measured data. That means that 75 percent of the measurements were less than the top of the box. The bottom of the box is the 25th percentile. The line in the middle of the box is the 50th percentile, also called the median. Half of the measured quantities were less than the median, and half were above. The total height of the box encompasses the measurements between the 25th and 75th percentiles, and is called the interquartile range, or IQR. The top whisker extends to the minimum of these two quantities: the largest measured data point and the 75th percentile plus 1.5 times the IQR. Similarly, the bottom whisker extends to the maximum of the smallest measured data point and the 25th percentile minus 1.5 times the IQR. Any data points not falling between the whiskers are then plotted individually, and are typically termed outliers.
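The quantities described above can be computed directly with NumPy. The following is a minimal sketch that follows the description in the text, not necessarily the exact computation HoloViews performs internally. The function name box_stats and the toy measurements are hypothetical.

```python
import numpy as np


def box_stats(x):
    """Summary statistics for a box plot, following the description in the text."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]  # Ignore missing measurements

    # 25th, 50th, and 75th percentiles
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1

    # Whisker ends as defined in the text
    top_whisker = min(x.max(), q3 + 1.5 * iqr)
    bottom_whisker = max(x.min(), q1 - 1.5 * iqr)

    # Points beyond the whiskers are plotted individually as outliers
    outliers = x[(x < bottom_whisker) | (x > top_whisker)]

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "bottom_whisker": bottom_whisker,
        "top_whisker": top_whisker,
        "outliers": outliers,
    }


# Hypothetical toy measurements with one point far from the rest
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

For the real data, the same function could be applied to each group, for example to df.loc[df["sleeper"] == "normal", "percent correct"].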

So, box-and-whisker plots give much more information than a bar plot. They give a reasonable summary of how data are distributed.

Plot all of your data

In a scatter plot, you plot all of your data points. Shouldn’t the same be true for categorical plots? You went through all the work to get the data; you should show them all!

Strip plot

One convenient way to plot all of your data is a strip plot. In a strip plot, every point is plotted. We use hv.Scatter() to generate strip plots.

Unfortunately, nested categorical axes are currently (as of September 20, 2020) only supported for box, violin, and bar plots, as per the docs, but will eventually be supported for many more plot types, including Scatter, which is used to generate strip plots. So, for now, we will only use insomniacs and normal sleepers as the categories on our axis, and will use gender for color.

[6]:
hv.extension("bokeh")

hv.Scatter(
    data=df,
    kdims=['sleeper'],
    vdims=['percent correct', 'gender'],
).opts(
    color='gender',
    xlabel='',
)
[6]:

An obvious problem with this plot is that the data points overlap. We can get around this issue by adding a jitter to the plot. Instead of lining all of the data points up exactly in line with the category, we randomly “jitter” the points about the centerline. There are many approaches to jittering, and some are not even random, like beeswarm plots. See, for example, this package for demonstrations of different jittering algorithms (it’s in R, but that’s ok). In HoloViews, the amount of jitter is specified with the jitter kwarg of opts.

[7]:
hv.extension("bokeh")

hv.Scatter(
    data=df,
    kdims=['sleeper'],
    vdims=['percent correct', 'gender'],
).opts(
    color='gender',
    jitter=0.3,
    xlabel='',
)
[7]:

We can now better resolve the respective points.
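To make the idea of jittering concrete, here is a minimal sketch in NumPy that is independent of HoloViews. It assumes the simplest approach: each category gets an integer position, and each point is shifted by a uniform random offset within half the jitter width on either side. The function name jitter_positions and the toy labels are hypothetical, and this is not necessarily how HoloViews or Bokeh compute jitter internally.

```python
import numpy as np


def jitter_positions(categories, width=0.3, seed=None):
    """Return the sorted unique labels and a jittered x-position for each entry."""
    rng = np.random.default_rng(seed)

    # Map each label to the integer index of its category
    labels, codes = np.unique(np.asarray(categories), return_inverse=True)

    # Uniform random offsets about each category's centerline
    offsets = rng.uniform(-width / 2, width / 2, size=len(codes))

    return labels, codes + offsets


# Hypothetical toy labels
sleeper = ["insomniac", "normal", "normal", "insomniac", "normal"]
labels, x = jitter_positions(sleeper, width=0.3, seed=3252)
print(labels)
print(x)
```

Each x-position stays within 0.15 of its category's integer position, so points remain visually associated with their category while no longer lying exactly on top of one another.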

We do sometimes wish to overlay the points on top of the graphical display of summary statistics available from box plots. To do that, we can use HoloViews’s overlay functionality. We will set the outliers in the box-and-whisker plot to be transparent, since all points are already plotted.

[8]:
hv.extension("bokeh")

strip = hv.Scatter(
    data=df,
    kdims=['sleeper'],
    vdims=['percent correct', 'gender'],
).opts(
    color='gender',
    jitter=0.3,
    xlabel='',
)

box = hv.BoxWhisker(
    data=df,
    kdims=['sleeper'],
    vdims=['percent correct'],
).opts(
    box_fill_color='lightgray',
    outlier_alpha=0,
)

box * strip
BokehUserWarning: ColumnDataSource's columns must be of the same length. Current lengths: ('index', 1), ('percent correct', 1), ('percent_correct', 0)
[8]:

Again, though, the key feature here is to plot all of your data!

Computing environment

[12]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,holoviews,jupyterlab
CPython 3.8.5
IPython 7.18.1

numpy 1.19.1
scipy 1.5.0
pandas 1.1.1
bokeh 2.2.1
holoviews 1.13.4
jupyterlab 2.2.6