Categorical axes and HoloViews

Data set download

[1]:

import numpy as np
import pandas as pd

import holoviews as hv

import bebi103

hv.extension('bokeh')
bebi103.hv.set_defaults()

We have seen how to handle categorical axes with Bokeh, including nested axes. We will now do the same with HoloViews, which automatically infers categorical types for the axes. The caveat is that the entries in a categorical column of the data frame must be strings. For example, if we had a column that contained Trues and Falses, like we have had for 'insomnia', and we wanted to use it as a categorical variable, we would have to convert its data type to str.
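As a minimal sketch of that conversion, assuming a small made-up data frame (the values below are hypothetical and only the column names mirror the real data set), a Boolean column can be cast to strings with astype(str):

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real data set
toy = pd.DataFrame({'sci': [9, 20, 13], 'percent correct': [72.5, 90.0, 87.5]})

# Boolean column, as we have used for 'insomnia'
toy['insomnia'] = toy['sci'] <= 16

# Convert the Booleans to the strings 'True' and 'False' for categorical use
toy['insomnia'] = toy['insomnia'].astype(str)

print(toy['insomnia'].tolist())
```

In the cells that follow, the 'sleeper' column serves the same purpose with more readable labels.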

With that in mind, let’s load in the data set and make the usual adjustments as we prepare for plotting.

[2]:

df = pd.read_csv('../data/gfmt_sleep.csv', na_values='*')
df['insomnia'] = df['sci'] <= 16
df['sleeper'] = df['insomnia'].apply(lambda x: 'insomniac' if x else 'normal')
df['gender'] = df['gender'].apply(lambda x: 'female' if x == 'f' else 'male')

df.head()

[2]:

	participant number	gender	age	correct hit percentage	correct reject percentage	percent correct	confidence when correct hit	confidence incorrect hit	confidence correct reject	confidence incorrect reject	confidence when correct	confidence when incorrect	sci	psqi	ess	insomnia	sleeper
0	8	female	39	65	80	72.5	91.0	90.0	93.0	83.5	93.0	90.0	9	13	2	True	insomniac
1	16	male	42	90	90	90.0	75.5	55.5	70.5	50.0	75.0	50.0	4	11	7	True	insomniac
2	18	female	31	90	95	92.5	89.5	90.0	86.0	81.0	89.0	88.0	10	9	3	True	insomniac
3	22	female	35	100	75	87.5	89.5	NaN	71.0	80.0	88.0	80.0	13	8	20	True	insomniac
4	27	female	74	60	65	62.5	68.5	49.0	61.0	49.0	65.0	49.0	13	9	12	True	insomniac

Bar graphs (don’t do this)

I will start with pretty much the worst, and probably most ubiquitous, mode of display: the bar graph. To make the bar graph, we have to make a new data frame to specify the bars, which we will take to be the mean percent correct for each (gender, sleeper) pair.

[3]:

df_mean = df.groupby(['gender', 'sleeper'])['percent correct'].mean().reset_index()

Now we can make the bar graph. HoloViews knows that both 'gender' and 'sleeper' are categorical because they have str data types. It also automatically makes a nested categorical axis if we specify two or more kdims.

[4]:

hv.Bars(
    data=df_mean,
    kdims=['gender', 'sleeper'],
    vdims=['percent correct']
).opts(
    xlabel='',
    ylim=(0, 100),
)

[4]:

This way of displaying data is just plain awful. Do not do it. You are only graphically showing the means and using a lot of real estate to do it. Why would you decide to only display four points when you actually measured a whole lot more?

Box plots

If you are going to summarize the data, a box-and-whisker plot, also just called a box plot, is a better option than a bar graph. Indeed, it was invented by John Tukey himself. Instead of condensing your measurements into one value (or two, if you include an error bar) like in a bar graph, you condense them into at least five. It is easier to describe a box plot if you have one to look at.

[5]:

hv.BoxWhisker(
    data=df,
    kdims=['gender', 'sleeper'],
    vdims=['percent correct'],
).opts(
    box_color='sleeper'
)

[5]:

The top of a box is the 75th percentile of the measured data. That means that 75 percent of the measurements were less than the top of the box. The bottom of the box is the 25th percentile. The line in the middle of the box is the 50th percentile, also called the median. Half of the measured quantities were less than the median, and half were above. The total height of the box encompasses the measurements between the 25th and 75th percentile, and is called the interquartile range, or IQR. The top whisker extends to the smaller of two quantities: the largest measured data point and the 75th percentile plus 1.5 times the IQR. Similarly, the bottom whisker extends to the larger of the smallest measured data point and the 25th percentile minus 1.5 times the IQR. Any data points not falling between the whiskers are then plotted individually, and are typically termed outliers.
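The quantities in that description can be computed directly with NumPy. The following is only a sketch that follows the wording above, using hypothetical measurements. Plotting libraries may use a different percentile interpolation or whisker convention, so their boxes need not match these numbers exactly.

```python
import numpy as np

# Hypothetical measurements, chosen so that one point lies far below the rest
x = np.array([40.0, 72.5, 75.0, 77.5, 80.0, 82.5, 85.0, 87.5, 90.0, 92.5])

# Box: 25th, 50th, and 75th percentiles
q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# Whiskers, following the description above
top_whisker = min(x.max(), q3 + 1.5 * iqr)
bottom_whisker = max(x.min(), q1 - 1.5 * iqr)

# Points beyond the whiskers are plotted individually as outliers
outliers = x[(x < bottom_whisker) | (x > top_whisker)]

print('box:', q1, median, q3)
print('whiskers:', bottom_whisker, top_whisker)
print('outliers:', outliers)
```

For these made-up values, the single low measurement falls below the bottom whisker and would be drawn as an individual point.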

So, box-and-whisker plots give much more information than a bar plot. They give a reasonable summary of how data are distributed.

Plot all of your data

In a scatter plot, you plot all of your data points. Shouldn’t the same be true for categorical plots? You went through all the work to get the data; you should show them all!

Strip plot

One convenient way to plot all of your data is a strip plot. In a strip plot, every point is plotted. We use hv.Scatter() to generate strip plots.

Unfortunately, nested categorical axes are currently (as of October 13, 2019) supported only for box, violin, and bar plots, as per the docs. They will eventually be supported for many more plot types, including Scatter, which is used to generate strip plots. So, for now, we will use only the sleeper type (insomniac or normal) as the categorical axis, and will use gender for color.

[6]:

hv.Scatter(
    data=df,
    kdims=['sleeper'],
    vdims=['percent correct', 'gender'],
).opts(
    color='gender',
    xlabel='',
)

[6]:

An obvious problem with this plot is that the data points overlap. We can get around this issue by adding a jitter to the plot. Instead of lining all of the data points up exactly in line with the category, we randomly “jitter” the points about the centerline. There are many approaches to jittering, and some are not even random, like beeswarm plots. See, for example, this package for demonstrations of different jittering algorithms (it’s in R, but that’s ok). In HoloViews, jitter is specified with the jitter kwarg of opts.
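To make the idea of random jitter concrete, the following sketch places each category at an integer position and adds a uniform random offset to each point. It illustrates the concept only and is not how HoloViews or Bokeh implements jitter. The labels, the seed, and the interpretation of the width are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Hypothetical category labels for six measurements
sleeper = np.array(
    ['insomniac', 'normal', 'normal', 'insomniac', 'normal', 'normal']
)

# Each category sits at an integer position (its centerline) on the axis
categories, centers = np.unique(sleeper, return_inverse=True)

# Uniform random offsets spanning a total width of 0.3 about each centerline
width = 0.3
x_jittered = centers + rng.uniform(-width / 2, width / 2, size=len(sleeper))

print(categories)
print(x_jittered)
```

Points in the same category now have slightly different horizontal positions, while their vertical positions (the measured values) are untouched.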

[7]:

hv.Scatter(
    data=df,
    kdims=['sleeper'],
    vdims=['percent correct', 'gender'],
).opts(
    color='gender',
    jitter=0.3,
    xlabel='',
)

[7]:

We can now better resolve the respective points.

We do sometimes wish to overlay the points on top of the graphical display of summary statistics available from box plots. To do that, we can use HoloViews’s overlay functionality. We will set the outliers in the box-and-whisker plot to be transparent, since all points are already plotted.

[9]:

strip = hv.Scatter(
    data=df,
    kdims=['sleeper'],
    vdims=['percent correct', 'gender'],
).opts(
    color='gender',
    jitter=0.3,
    xlabel='',
)

box = hv.BoxWhisker(
    data=df,
    kdims=['sleeper'],
    vdims=['percent correct'],
).opts(
    box_fill_color='lightgray',
    outlier_alpha=0,
)

box * strip

[9]:

Again, though, the key feature here is to plot all of your data!

Histograms

When making a histogram, the values of the bin edges and counts must be computed beforehand using np.histogram().

[99]:

# np.histogram() returns the counts first, then the bin edges
counts, edges = np.histogram(df['percent correct'], bins=int(np.sqrt(len(df))))
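The shape of what np.histogram() returns is easy to check on a small array. This sketch uses hypothetical values and the same square-root rule for the number of bins as the cell above.

```python
import numpy as np

# Hypothetical percent-correct values
x = np.array([62.5, 72.5, 80.0, 85.0, 87.5, 90.0, 90.0, 92.5, 95.0])

# Square-root rule for the number of bins
n_bins = int(np.sqrt(len(x)))

# The counts come first in the returned tuple, then the bin edges
counts, edges = np.histogram(x, bins=n_bins)

print(counts)  # one count per bin
print(edges)   # one more edge than there are bins
```

There is always one more edge than there are counts, because the edges include both the left edge of the first bin and the right edge of the last.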

We can then pass the bin edges and counts into hv.Histogram().

[111]:

hv.Histogram(
    data=(edges, counts),
    kdims='percent correct'
).opts(
    ylim=(0, None),
)

# Alternatively, HoloViews can bin the data itself, grouped by sleeper:
# ds = hv.Dataset(df)
# hv.operation.histogram(ds, dimension='percent correct', groupby='sleeper')

[111]: