Plots with categorical variables


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import pandas as pd

import bokeh.models
import bokeh.plotting
import bokeh.io
bokeh.io.output_notebook()

Types of data for plots

Let us first consider the different kinds of data we may encounter as we think about constructing a plot.

  • Quantitative data may have continuously varying (and therefore ordered) values.

  • Categorical data has discrete, unordered values that a variable can take.

  • Ordinal data has discrete, ordered values. Integers are a classic example.

  • Temporal data refers to time, which can be represented as dates.

In practice, ordinal data can be cast as quantitative or treated as categorical with an ordering enforced on the categories (e.g., the ordinal data [1, 2, 3] become the categories ['1', '2', '3']). Temporal data can also be cast as quantitative (e.g., seconds from the start time). We will therefore focus our attention on quantitative and categorical data.
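
As a rough illustration of these two casts, here is a minimal sketch using made-up values. The names ratings and timestamps are hypothetical and are not part of the sleep data set used in this lesson.

```python
import pandas as pd

# Hypothetical ordinal data: integer ratings (made up for illustration)
ratings = pd.Series([3, 1, 2, 3, 1])

# Treat as categorical with an enforced ordering: sort the unique
# values first, then convert them to strings
categories = [str(v) for v in sorted(ratings.unique())]
ratings_cat = pd.Categorical(
    ratings.astype(str), categories=categories, ordered=True
)

# Hypothetical temporal data cast as quantitative: seconds from start
timestamps = pd.Series(
    pd.to_datetime(
        ["2020-01-01 00:00:00", "2020-01-01 00:01:30", "2020-01-01 01:00:00"]
    )
)
seconds = (timestamps - timestamps.iloc[0]).dt.total_seconds()

print(categories)
print(seconds.tolist())
```

Sorting before converting to strings matters because strings sort lexicographically, so '10' would otherwise come before '2'.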

When we made scatter plots in the previous lesson (note the lowercase “scatter”; we actually used hv.Points because we had two independent variables), both variables were quantitative. We did incorporate categorical information in the form of glyph colors (insomniacs and normal sleepers being colored differently) and in tooltips.

But what if we wanted a single type of measurement, such as percent correct in facial identification, and were interested in how well insomniacs performed versus normal sleepers? Here, we have the quantitative percent correct data and the categorical sleeper type. One of our axes is now categorical.

Note that this kind of plot is commonly encountered in the biological sciences. We repeat a measurement many times for given test conditions and wish to compare the results. The different conditions are the categories, and the axis along which the conditions are represented is called a categorical axis. The quantitative axis contains the result of the measurements from each condition.


The rest of this lesson is mostly for reference so you can see how to handle categorical axes with Bokeh. In practice, we will be using iqplot and HoloViews to do this, and it is done for you automatically. You may therefore skip the rest of this notebook if you like.

Making a bar graph with Bokeh

To demonstrate how to set up a categorical axis with Bokeh, I will make a bar graph of the mean percent correct for insomniacs and normal sleepers. But before I even begin this, I will give you the following piece of advice: Don’t make bar graphs. More on that in a moment.

Setting up a data frame for plotting

Before we do that, we need to set up a data frame to make the plot. We start by reading in the data set and computing the 'insomnia' column, which gives Trues and Falses, as we’ve done in the preceding parts of this lesson.

[2]:
fname = os.path.join(data_path, "gfmt_sleep.csv")
df = pd.read_csv(fname, na_values="*")
df["insomnia"] = df["sci"] <= 16

For convenience in plotting the categorical axis, we would rather not have the values on the axis be True or False, but something more descriptive, like insomniac and normal. So, let’s make a column in the data frame, 'sleeper', that has that for us. We use the apply() method of the 'insomnia' column to apply a function that returns the string 'insomniac' if the entry is True and 'normal' otherwise.

[3]:
df["sleeper"] = df["insomnia"].apply(lambda x: "insomniac" if x else "normal")
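
For comparison, the same relabeling can be sketched with a dictionary lookup via the map() method of a pandas Series instead of a lambda function. The data frame df_toy below is a made-up stand-in for the real one, so this is only an illustration of the idea, not a replacement for the cell above.

```python
import pandas as pd

# Toy stand-in for the 'insomnia' column (values are made up)
df_toy = pd.DataFrame({"insomnia": [True, False, False, True]})

# Look up each Boolean in a dictionary to get a descriptive label
df_toy["sleeper"] = df_toy["insomnia"].map({True: "insomniac", False: "normal"})

print(df_toy)
```

One difference worth knowing: with map(), an entry that is not a key of the dictionary becomes NaN, whereas the lambda function above labels everything that is not truthy as 'normal'.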

Next, we need to make a data frame that has the mean percent correct for each of the two categories of sleeper. We have decided that it is the mean of the respective measurements that will set the height of the bars.

[4]:
df_mean = df.groupby("sleeper")["percent correct"].mean().reset_index()

# Take a look
df_mean
[4]:
sleeper percent correct
0 insomniac 76.100000
1 normal 81.461039

Now we’re ready to make the bar graph. Note that we now have only two data points that we are showing on the plot. We have decided to throw out a lot of information from the data we collected to display only two values. Does this strike you as a terrible idea? It should. Don’t do this. We’re just doing it to show how categorical axes are set up using Bokeh.
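
To get a feel for how much a bar graph of means hides, here is a small sketch that computes a few more summaries per category. The data frame df_toy and its values are invented for illustration; only the column names mirror those used in this lesson.

```python
import pandas as pd

# Made-up measurements, for illustration only
df_toy = pd.DataFrame(
    {
        "sleeper": ["insomniac", "insomniac", "normal", "normal", "normal"],
        "percent correct": [70.0, 80.0, 75.0, 85.0, 95.0],
    }
)

# The mean alone says nothing about how many measurements
# there were or how widely they were spread
summary = (
    df_toy.groupby("sleeper")["percent correct"]
    .agg(["count", "mean", "min", "max"])
    .reset_index()
)

print(summary)
```

Even these few extra numbers are themselves only summaries; plotting all of the measurements keeps the most information.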

Setting up categorical axes

To set up a categorical axis, you need to specify the x_range (or y_range if you want the y-axis to be categorical) as a list with the categories you want on the axis when you instantiate the figure. I will make a horizontal bar graph, so I will specify y_range. I also want my quantitative axis (x in this case) to go from zero to 100, since it signifies a percent. Also, when I instantiate this figure, because it is not very tall and I do not want the reset tool cut off, I will also explicitly set the tools I want in the toolbar.

[5]:
p = bokeh.plotting.figure(
    height=200,
    width=400,
    x_axis_label="percent correct",
    x_range=[0, 100],
    y_range=df_mean["sleeper"].unique(),
    tools="save",
)

Now that we have the figure, we can put the bars on. The p.hbar() method populates the figure with horizontal bar glyphs. The right kwarg says which column of the data source dictates how far to the right each bar extends, while the height kwarg says how thick the bars are.

The quantitative axis already starts at zero because of the x_range we specified when instantiating the figure. I will also turn off the grid lines on the categorical axis, which is commonly done.

[6]:
p.hbar(
    source=df_mean,
    y="sleeper",
    right="percent correct",
    height=0.6,
)

# Turn off gridlines on categorical axis
p.ygrid.grid_line_color = None

bokeh.io.show(p)
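
The steps above can be collected into a self-contained sketch that runs without the data file. The means below are made up, and the names source and p_toy are hypothetical; the Bokeh calls are the same ones used in the cells above, except that the data are passed in an explicit ColumnDataSource rather than a data frame.

```python
import bokeh.models
import bokeh.plotting

# Made-up means, for illustration only
source = bokeh.models.ColumnDataSource(
    data={"sleeper": ["insomniac", "normal"], "percent correct": [75.0, 85.0]}
)

p_toy = bokeh.plotting.figure(
    height=200,
    width=400,
    x_axis_label="percent correct",
    x_range=[0, 100],
    y_range=["insomniac", "normal"],  # list of strings gives a categorical axis
    tools="save",
)

# Bars start at zero (the default for left) and extend to the 'right' values
p_toy.hbar(source=source, y="sleeper", right="percent correct", height=0.6)
p_toy.ygrid.grid_line_color = None

# The list of strings has been converted to a FactorRange
print(type(p_toy.y_range).__name__, p_toy.y_range.factors)

# To render in a notebook: bokeh.io.show(p_toy)
```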

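We can similarly make vertical bar graphs by specifying x_range and using p.vbar(). Here, the order of the categories is reversed with [::-1] so that normal sleepers appear first on the axis.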

[7]:
p = bokeh.plotting.figure(
    height=250,
    width=250,
    x_range=df_mean["sleeper"].unique()[::-1],
    y_range=[0, 100],
    y_axis_label="percent correct",
)

p.vbar(
    source=df_mean,
    x="sleeper",
    top="percent correct",
    width=0.6,
)

p.xgrid.grid_line_color = None

bokeh.io.show(p)
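
The order of the entries in x_range or y_range sets the order of the categories on the axis: left to right on an x-axis and bottom to top on a y-axis. The sketch below, which uses a made-up data frame df_toy, shows what the [::-1] slice does to the categories and an explicit list that gives the same result.

```python
import pandas as pd

# Made-up means, for illustration only
df_toy = pd.DataFrame(
    {"sleeper": ["insomniac", "normal"], "percent correct": [75.0, 85.0]}
)

# Categories in order of appearance in the data frame
order = df_toy["sleeper"].unique()

# Reversed with a slice, as in the vertical bar graph above
reversed_order = order[::-1]

# Writing the list out by hand is equivalent and more explicit
explicit_order = ["normal", "insomniac"]

print(list(order), list(reversed_order))
```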

Nested categorical axes

We may wish to make a bar graph where we have four bars, normal and insomniac for males and also normal and insomniac for females. To start, we will have to re-make the df_mean data frame, now grouping by gender and sleeper. Furthermore, it will be nicer to label the categories as “female” and “male” instead of “f” and “m”.

[8]:
df["gender"] = df["gender"].apply(lambda x: "female" if x == "f" else "male")

df_mean = df.groupby(["gender", "sleeper"])["percent correct"].mean().reset_index()

# Take a look
df_mean
[8]:
gender sleeper percent correct
0 female insomniac 73.947368
1 female normal 82.045455
2 male insomniac 82.916667
3 male normal 80.000000

Because of the way Bokeh handles nested categories, we need to create a new column that has a tuple corresponding to the nested category. To make the tuple, we can again apply a function, this time to each entire row of the data frame (which requires the axis=1 kwarg of df_mean.apply()).

[9]:
df_mean["cats"] = df_mean.apply(lambda x: (x["gender"], x["sleeper"]), axis=1)

# Take a look
df_mean
[9]:
gender sleeper percent correct cats
0 female insomniac 73.947368 (female, insomniac)
1 female normal 82.045455 (female, normal)
2 male insomniac 82.916667 (male, insomniac)
3 male normal 80.000000 (male, normal)
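
As an aside, the same column of tuples can be sketched without a row-wise apply() by zipping the two columns together. The data frame df_toy is a made-up stand-in with the same column names.

```python
import pandas as pd

# Toy stand-in with the same column names as df_mean
df_toy = pd.DataFrame(
    {
        "gender": ["female", "female", "male", "male"],
        "sleeper": ["insomniac", "normal", "insomniac", "normal"],
    }
)

# Pair up the entries of the two columns, row by row
df_toy["cats"] = list(zip(df_toy["gender"], df_toy["sleeper"]))

print(df_toy)
```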

Next, we need to set up factors, which give the nested categories. We could extract them from the 'cats' column of the data frame as

factors = list(df_mean.cats)

Instead, we will specify them by hand to ensure they are ordered as we would like.

[10]:
factors = [
    ("female", "normal"),
    ("female", "insomniac"),
    ("male", "normal"),
    ("male", "insomniac"),
]
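
With only four nested categories, writing the factors by hand is easy. For more categories, one possible approach (a sketch, not something used elsewhere in this lesson) is to write down the desired order of each level and take their Cartesian product with itertools.product from the standard library. The names genders, sleepers, and factors_auto are hypothetical.

```python
import itertools

# Desired order of the outer and inner categories
genders = ["female", "male"]
sleepers = ["normal", "insomniac"]

# The outer level varies slowest, the inner level fastest
factors_auto = list(itertools.product(genders, sleepers))

print(factors_auto)
```

This reproduces the hand-written list above, provided every combination of the two levels is present in the data.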

Finally, to use these factors in a y_range (or x_range), we need to convert them to a factor range using bokeh.models.FactorRange().

[11]:
p = bokeh.plotting.figure(
    height=200,
    width=400,
    x_axis_label="percent correct",
    x_range=[0, 100],
    y_range=bokeh.models.FactorRange(*factors),
    tools="save",
)

Now we are ready to add the bars, taking care to specify the 'cats' column for our y-values.

[12]:
p.hbar(
    source=df_mean,
    y="cats",
    right="percent correct",
    height=0.6,
)

p.ygrid.grid_line_color = None

bokeh.io.show(p)
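
For reference, the nested-axis steps can be put together in one self-contained sketch that does not need the data file. The percent correct values are made up, and the names source and p_toy are hypothetical; the data are supplied as an explicit ColumnDataSource instead of a data frame.

```python
import bokeh.models
import bokeh.plotting

# Nested categories as (outer, inner) tuples, in the desired order
factors = [
    ("female", "normal"),
    ("female", "insomniac"),
    ("male", "normal"),
    ("male", "insomniac"),
]

# Made-up values, for illustration only; one entry per factor
source = bokeh.models.ColumnDataSource(
    data={"cats": factors, "percent correct": [82.0, 74.0, 80.0, 83.0]}
)

p_toy = bokeh.plotting.figure(
    height=200,
    width=400,
    x_axis_label="percent correct",
    x_range=[0, 100],
    y_range=bokeh.models.FactorRange(*factors),
    tools="save",
)

# The y-values are the tuples, which match the entries of the factor range
p_toy.hbar(source=source, y="cats", right="percent correct", height=0.6)
p_toy.ygrid.grid_line_color = None

# To render in a notebook: bokeh.io.show(p_toy)
```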

Computing environment

[13]:
%load_ext watermark
%watermark -v -p pandas,bokeh,jupyterlab
CPython 3.8.5
IPython 7.18.1

pandas 1.1.1
bokeh 2.2.1
jupyterlab 2.2.6