Plots with categorical variables
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import pandas as pd
import bokeh.models
import bokeh.plotting
import bokeh.io
bokeh.io.output_notebook()
Types of data for plots
Let us first consider the different kinds of data we may encounter as we think about constructing a plot.
Quantitative data may have continuously varying (and therefore ordered) values.
Categorical data has discrete, unordered values that a variable can take.
Ordinal data has discrete, ordered values. Integers are a classic example.
Temporal data refers to time, which can be represented as dates.
In practice, ordinal data can be cast as quantitative or treated as categorical with an ordering enforced on the categories (e.g., the ordinal values [1, 2, 3] become the ordered categories ['1', '2', '3']). Temporal data can also be cast as quantitative (e.g., seconds from the start time). We will therefore focus our attention on quantitative and categorical data.
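As a minimal sketch of these two casts using pandas, the ordinal values can be kept numeric or converted to strings with an explicit category order, and time stamps can be converted to seconds from the first time point. The ratings and time stamps below are made up for illustration and are not part of the sleep data set.

```python
import pandas as pd

# Hypothetical ordinal data: integer ratings from 1 to 3
ratings = pd.Series([2, 1, 3, 2])

# Option 1: treat the ordinal data as quantitative
as_quantitative = ratings.astype(float)

# Option 2: treat it as categorical, enforcing the ordering of the categories
as_categorical = pd.Categorical(
    ratings.astype(str), categories=["1", "2", "3"], ordered=True
)

# Hypothetical temporal data, cast to seconds from the start time
times = pd.to_datetime(
    pd.Series(["2020-01-01 00:00:00", "2020-01-01 00:00:30", "2020-01-01 00:02:00"])
)
seconds = (times - times.iloc[0]).dt.total_seconds()
```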
When we made scatter plots (note lowercase “scatter”; we actually used hv.Points because we had two independent variables) in the previous lesson, both variables were quantitative. We did actually incorporate categorical information in the form of colors of the glyph (insomniacs and normal sleepers being colored differently) and in tooltips.
But what if we wanted a single type of measurement, such as percent correct in the facial identification, but were interested in how well insomniacs versus normal sleepers performed? Here, we have the quantitative percent correct data and the categorical sleeper type. One of our axes is now categorical.
Note that this kind of plot is commonly encountered in the biological sciences. We repeat a measurement many times for given test conditions and wish to compare the results. The different conditions are the categories, and the axis along which the conditions are represented is called a categorical axis. The quantitative axis contains the result of the measurements from each condition.
The rest of this lesson is mostly for reference so you can see how to handle categorical axes with Bokeh. In practice, we will be using iqplot and HoloViews to do this, and it is done for you automatically. You may therefore skip the rest of this notebook if you like.
Making a bar graph with Bokeh
To demonstrate how to set up a categorical axis with Bokeh, I will make a bar graph of the mean percent correct for insomniacs and normal sleepers. But before I even begin this, I will give you the following piece of advice: Don’t make bar graphs. More on that in a moment.
Setting up a data frame for plotting
First, we need to set up a data frame to make the plot. We start by reading in the data set and computing the 'insomnia' column, which contains True and False values, as we’ve done in the preceding parts of this lesson.
[2]:
fname = os.path.join(data_path, "gfmt_sleep.csv")
df = pd.read_csv(fname, na_values="*")
df["insomnia"] = df["sci"] <= 16
For convenience in plotting the categorical axis, we would rather not have the values on the axis be True or False, but something more descriptive, like insomniac and normal. So, let’s make a column in the data frame, 'sleeper', that has that for us. We use the apply() method of the 'insomnia' column to apply a function that returns the string 'insomniac' if the entry in the 'insomnia' column is True and 'normal' otherwise.
[3]:
df["sleeper"] = df["insomnia"].apply(lambda x: "insomniac" if x else "normal")
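The same relabeling can also be sketched with the map() method of a pandas Series and a dictionary, which avoids writing a lambda function. The toy 'sci' values below are hypothetical and serve only to make the example self-contained.

```python
import pandas as pd

# Hypothetical sleep condition indicator values
df_toy = pd.DataFrame({"sci": [9, 16, 17, 30]})

# Same cutoff as above: an SCI of 16 or below is labeled as insomnia
df_toy["insomnia"] = df_toy["sci"] <= 16

# Map each Boolean to a descriptive label
df_toy["sleeper"] = df_toy["insomnia"].map({True: "insomniac", False: "normal"})
```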
Next, we need to make a data frame that has the mean percent correct for each of the two categories of sleeper. We have decided that it is the mean of the respective measurements that will set the height of the bars.
[4]:
df_mean = df.groupby("sleeper")["percent correct"].mean().reset_index()
# Take a look
df_mean
[4]:
| | sleeper | percent correct |
|---|---|---|
| 0 | insomniac | 76.100000 |
| 1 | normal | 81.461039 |
Now we’re ready to make the bar graph. Note that we now have only two data points that we are showing on the plot. We have decided to throw out a lot of information from the data we collected to display only two values. Does this strike you as a terrible idea? It should. Don’t do this. We’re just doing it to show how categorical axes are set up using Bokeh.
Setting up categorical axes
To set up a categorical axis, you need to specify the x_range (or y_range if you want the y-axis to be categorical) as a list with the categories you want on the axis when you instantiate the figure. I will make a horizontal bar graph, so I will specify y_range. I also want my quantitative axis (x in this case) to go from zero to 100, since it signifies a percent. Also, because this figure is not very tall and I do not want the reset tool cut off, I will explicitly set the tools I want in the toolbar when I instantiate it.
[5]:
p = bokeh.plotting.figure(
    height=200,
    width=400,
    x_axis_label="percent correct",
    x_range=[0, 100],
    y_range=df_mean["sleeper"].unique(),
    tools="save",
)
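To see what the two range arguments do, the sketch below builds a throwaway figure (the name p_check is only for illustration) and inspects the resulting ranges. A list of strings becomes a FactorRange holding the categories, while a two-element list of numbers becomes a Range1d with a start and an end.

```python
import bokeh.models
import bokeh.plotting

# Throwaway figure used only to inspect the ranges
p_check = bokeh.plotting.figure(
    height=200,
    width=400,
    x_range=[0, 100],
    y_range=["insomniac", "normal"],
)

# The categorical axis is backed by a FactorRange
print(type(p_check.y_range).__name__, p_check.y_range.factors)

# The quantitative axis is backed by a Range1d
print(type(p_check.x_range).__name__, p_check.x_range.start, p_check.x_range.end)
```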
Now that we have the figure, we can put the bars on. The p.hbar() method populates the figure with horizontal bar glyphs. The y kwarg says which column of the data source gives the category of each bar, the right kwarg says which column dictates how far to the right to show the bar, and the height kwarg says how thick the bars are.
The quantitative axis already starts at zero because of the x_range we specified. I will also turn off the grid lines on the categorical axis, which is commonly done.
[6]:
p.hbar(
    source=df_mean,
    y="sleeper",
    right="percent correct",
    height=0.6,
)
# Turn off gridlines on categorical axis
p.ygrid.grid_line_color = None
bokeh.io.show(p)
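We similarly make vertical bar graphs by specifying x_range and using p.vbar(). The [::-1] reverses the order of the categories so that the normal sleepers appear first on the axis.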
[7]:
p = bokeh.plotting.figure(
    height=250,
    width=250,
    x_range=df_mean["sleeper"].unique()[::-1],
    y_range=[0, 100],
    y_axis_label="percent correct",
)
p.vbar(
    source=df_mean,
    x="sleeper",
    top="percent correct",
    width=0.6,
)
p.xgrid.grid_line_color = None
bokeh.io.show(p)
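Given the advice above not to reduce the data to two bars, one possible alternative on the same kind of categorical axis is to plot every measurement and spread the points with bokeh.transform.jitter(). The following is only a sketch: the data frame df_toy holds made-up values rather than the actual measurements, and the figure name p_strip is chosen for illustration.

```python
import bokeh.plotting
import bokeh.transform
import pandas as pd

# Hypothetical measurements, not values from the data set
df_toy = pd.DataFrame(
    {
        "sleeper": ["insomniac"] * 3 + ["normal"] * 4,
        "percent correct": [65.0, 80.0, 90.0, 72.5, 80.0, 85.0, 95.0],
    }
)

p_strip = bokeh.plotting.figure(
    height=250,
    width=250,
    x_range=["normal", "insomniac"],
    y_range=[0, 100],
    y_axis_label="percent correct",
)

# Jitter the categorical coordinate so overlapping points are visible
p_strip.scatter(
    source=df_toy,
    x=bokeh.transform.jitter("sleeper", width=0.3, range=p_strip.x_range),
    y="percent correct",
    alpha=0.7,
)

p_strip.xgrid.grid_line_color = None
```

Calling bokeh.io.show(p_strip) displays the plot. With the real data, df would take the place of df_toy.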
Nested categorical axes
We may wish to make a bar graph where we have four bars, normal and insomniac for males and also normal and insomniac for females. To start, we will have to re-make the df_mean data frame, now grouping by gender and sleeper. Furthermore, it will be nicer to label the categories as “female” and “male” instead of “f” and “m”.
[8]:
df["gender"] = df["gender"].apply(lambda x: "female" if x == "f" else "male")
df_mean = df.groupby(["gender", "sleeper"])["percent correct"].mean().reset_index()
# Take a look
df_mean
[8]:
| | gender | sleeper | percent correct |
|---|---|---|---|
| 0 | female | insomniac | 73.947368 |
| 1 | female | normal | 82.045455 |
| 2 | male | insomniac | 82.916667 |
| 3 | male | normal | 80.000000 |
Because of the way Bokeh handles nested categories, we need to create a new column that has a tuple corresponding to the nested category. To make the tuple, we can again apply a function, this time to each entire row of the data frame (which requires the axis=1 kwarg of df_mean.apply()).
[9]:
df_mean["cats"] = df_mean.apply(lambda x: (x["gender"], x["sleeper"]), axis=1)
# Take a look
df_mean
[9]:
| | gender | sleeper | percent correct | cats |
|---|---|---|---|---|
| 0 | female | insomniac | 73.947368 | (female, insomniac) |
| 1 | female | normal | 82.045455 | (female, normal) |
| 2 | male | insomniac | 82.916667 | (male, insomniac) |
| 3 | male | normal | 80.000000 | (male, normal) |
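As an aside, the same column of tuples can be sketched without a row-wise apply() by zipping the two columns together. The data frame below, named df_toy for illustration, reproduces the means shown in the table above so that the example is self-contained.

```python
import pandas as pd

# Means copied from the grouped data frame displayed above
df_toy = pd.DataFrame(
    {
        "gender": ["female", "female", "male", "male"],
        "sleeper": ["insomniac", "normal", "insomniac", "normal"],
        "percent correct": [73.947368, 82.045455, 82.916667, 80.000000],
    }
)

# Pair up the entries of the two columns to get (gender, sleeper) tuples
df_toy["cats"] = list(zip(df_toy["gender"], df_toy["sleeper"]))
```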
Next, we need to set up factors, which give the nested categories. We could extract them from the 'cats' column of the data frame as
factors = list(df_mean.cats)
Instead, we will specify them by hand to ensure they are ordered as we would like.
[10]:
factors = [
    ("female", "normal"),
    ("female", "insomniac"),
    ("male", "normal"),
    ("male", "insomniac"),
]
Finally, to use these factors in a y_range (or x_range), we need to convert them to a factor range using bokeh.models.FactorRange().
[11]:
p = bokeh.plotting.figure(
    height=200,
    width=400,
    x_axis_label="percent correct",
    x_range=[0, 100],
    y_range=bokeh.models.FactorRange(*factors),
    tools="save",
)
Now we are ready to add the bars, taking care to specify the 'cats' column for our y-values.
[12]:
p.hbar(
    source=df_mean,
    y="cats",
    right="percent correct",
    height=0.6,
)
p.ygrid.grid_line_color = None
bokeh.io.show(p)
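The nested bar graph can also be sketched in a self-contained way, without reading the CSV file, by entering the four means from the table above by hand. As an optional extra that is not part of the plot above, this sketch colors the bars by sleeper type with bokeh.transform.factor_cmap(), where start=1 and end=2 select the second element of each tuple. The names source, p_nested, and sleeper_color, as well as the two colors, are choices made for this illustration.

```python
import bokeh.models
import bokeh.plotting
import bokeh.transform

# Nested categories, ordered by hand
factors = [
    ("female", "normal"),
    ("female", "insomniac"),
    ("male", "normal"),
    ("male", "insomniac"),
]

# Means copied from the grouped data frame above, in the same order as factors
means = [82.045455, 73.947368, 80.000000, 82.916667]

source = bokeh.models.ColumnDataSource({"cats": factors, "percent correct": means})

p_nested = bokeh.plotting.figure(
    height=200,
    width=400,
    x_axis_label="percent correct",
    x_range=[0, 100],
    y_range=bokeh.models.FactorRange(*factors),
    tools="save",
)

# Color by the inner category (the second element of each tuple)
sleeper_color = bokeh.transform.factor_cmap(
    "cats",
    palette=["#1f77b4", "#ff7f0e"],
    factors=["normal", "insomniac"],
    start=1,
    end=2,
)

p_nested.hbar(
    source=source,
    y="cats",
    right="percent correct",
    height=0.6,
    fill_color=sleeper_color,
    line_color=None,
)

p_nested.ygrid.grid_line_color = None
```

Calling bokeh.io.show(p_nested) displays the plot.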
Computing environment
[13]:
%load_ext watermark
%watermark -v -p pandas,bokeh,jupyterlab
CPython 3.8.5
IPython 7.18.1
pandas 1.1.1
bokeh 2.2.1
jupyterlab 2.2.6