Bokeh-catplot¶
[1]:
import pandas as pd
import bokeh_catplot
import bokeh.io
bokeh.io.output_notebook()
This notebook uses an updated version of Bokeh-catplot. Be sure to update it before running this notebook by doing the following on the command line.
pip install --upgrade bokeh-catplot
HoloViews is an excellent high-level package for this purpose, but as we have mentioned before, it lacks two key functionalities.
It does not natively offer a convenient way to make ECDFs, but it will soon.
It does not allow for nested categorical axes for plots other than box plots, bar graphs, and violin plots, but it will soon.
To address these needs, I developed Bokeh-catplot, which generates Bokeh plots from tidy data frames where one or more columns contain categorical data and the column of interest in the plot is quantitative. Eventually, this package will become obsolete when HoloViews incorporates the functionality. But for now, Bokeh-catplot is a convenient package for making plots involving categorical variables.
There are four types of plots that Bokeh-catplot can generate. As you will see, all four of these modes of plotting are meant to give a picture of how the quantitative measurements are distributed for each category.
Plots with a categorical axis
Box plots: bokeh_catplot.box()
Strip plots: bokeh_catplot.strip()
Plots without a categorical axis
Histograms: bokeh_catplot.histogram()
ECDFs: bokeh_catplot.ecdf()
The first three arguments of each of these functions are necessary to build the plot. They are:
data: A tidy data frame
cats: A list of columns in the data frame that are to be considered as categorical variables in the plot. If None, a single box, strip, histogram, or ECDF is plotted.
val: The column of the data frame to be treated as the quantitative variable.
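To make the roles of cats and val concrete, here is a minimal sketch in plain Python of how a tidy table can be split into one group of quantitative values per combination of categorical columns. The rows and the helper name group_values are made up for illustration, and I am not claiming this is how Bokeh-catplot does it internally.

```python
from collections import defaultdict

# Hypothetical tidy rows: one observation per row, one variable per column.
rows = [
    {'gender': 'female', 'sleeper': 'normal', 'percent correct': 85.0},
    {'gender': 'female', 'sleeper': 'insomniac', 'percent correct': 72.5},
    {'gender': 'male', 'sleeper': 'normal', 'percent correct': 90.0},
    {'gender': 'female', 'sleeper': 'normal', 'percent correct': 77.5},
]


def group_values(rows, cats, val):
    """Collect the values of column `val` for each combination of `cats`.

    If `cats` is None, all values land in a single group.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in cats) if cats is not None else ()
        groups[key].append(row[val])
    return dict(groups)


# One group per (gender, sleeper) pair, and a single pooled group.
groups = group_values(rows, ['gender', 'sleeper'], 'percent correct')
single = group_values(rows, None, 'percent correct')
```

Each key of groups corresponds to one box, strip, histogram, or ECDF in the plots that follow.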
With this in mind, we will put Bokeh-catplot to use on the facial identification data set.
[2]:
df = pd.read_csv('../data/gfmt_sleep.csv', na_values='*')
df['insomnia'] = df['sci'] <= 16
df['sleeper'] = df['insomnia'].apply(lambda x: 'insomniac' if x else 'normal')
df['gender'] = df['gender'].apply(lambda x: 'female' if x == 'f' else 'male')
All four plots¶
We now make plots of the percent correct for male and female insomniacs and normal sleepers so you can see how the syntax works.
Box plot¶
[3]:
p = bokeh_catplot.box(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
)
bokeh.io.show(p)
Strip plot¶
For this plot, I will add jitter, which is passed as a Boolean. Note that HoloViews cannot make a plot like this because it cannot have nested categorical axes for Scatter elements.
[4]:
p = bokeh_catplot.strip(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
jitter=True,
)
bokeh.io.show(p)
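The idea behind jitter is simple: each point keeps its quantitative value, but its position along the categorical axis is nudged by a small random offset so that overlapping points separate. The sketch below illustrates this with uniform offsets. The function name, the default width, and the choice of a uniform distribution are my assumptions for illustration, not a description of what Bokeh-catplot does under the hood.

```python
import random


def jitter_positions(n_points, center, width=0.1, seed=0):
    """Return n_points positions spread uniformly within center +/- width."""
    rng = random.Random(seed)
    return [center + rng.uniform(-width, width) for _ in range(n_points)]


# The quantitative values are left untouched; only the position along
# the categorical axis is perturbed.
values = [85.0, 72.5, 90.0, 77.5]
positions = jitter_positions(len(values), center=1.0)
points = list(zip(positions, values))
```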
Histogram¶
For histograms, the number of bins is chosen automatically using the Freedman-Diaconis rule.
[5]:
p = bokeh_catplot.histogram(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
)
p.legend.location = 'top_left'
bokeh.io.show(p)
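For reference, the Freedman-Diaconis rule sets the bin width to h = 2 IQR / n^(1/3), where IQR is the interquartile range and n is the number of data points. The sketch below computes the width and the resulting number of bins in plain Python. The helper names and the linearly interpolated quartiles are assumptions on my part; the package may compute quartiles differently.

```python
import math


def percentile(sorted_x, q):
    """Linearly interpolated percentile (q in [0, 100]) of sorted data."""
    pos = (len(sorted_x) - 1) * q / 100
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_x[lo] * (1 - frac) + sorted_x[hi] * frac


def freedman_diaconis_bins(x):
    """Return (bin_width, n_bins) from the Freedman-Diaconis rule."""
    xs = sorted(x)
    iqr = percentile(xs, 75) - percentile(xs, 25)
    width = 2 * iqr / len(xs) ** (1 / 3)
    if width == 0:
        return 0.0, 1  # degenerate case: zero IQR, fall back to one bin
    n_bins = math.ceil((xs[-1] - xs[0]) / width)
    return width, max(n_bins, 1)


# Made-up scores, for illustration only.
width, n_bins = freedman_diaconis_bins([50, 60, 65, 70, 75, 80, 85, 100])
```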
ECDF¶
HoloViews does not have native ECDF support.
[6]:
p = bokeh_catplot.ecdf(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
style='staircase'
)
p.legend.location = 'top_left'
bokeh.io.show(p)
Note that the ECDFs show a clear difference. Female insomniacs have a distribution that is shifted rightward from all other categories. Of the four plots, the ECDF reveals this most clearly.
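As a reminder of what is being plotted, the ECDF evaluated at x is the fraction of data points less than or equal to x. A minimal sketch of the computation, with function names of my own choosing, could look like this:

```python
def ecdf_values(x):
    """Return the sorted data and the ECDF height at each sorted point."""
    xs = sorted(x)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


def ecdf_at(x, t):
    """Evaluate the ECDF at t: the fraction of data points <= t."""
    return sum(1 for xi in x if xi <= t) / len(x)


# Made-up data, for illustration only.
data = [3, 1, 2, 2]
xs, ys = ecdf_values(data)
```

Plotting ys against xs as points gives the "dot" style of ECDF, while the staircase style draws the same information as a step function.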
Customization with Bokeh-catplot¶
You may have noticed in the ECDF example that I introduced a new keyword argument, style. In fact, each of the four plotting functions also has the following additional optional keyword arguments.
palette: A list of hex colors to use for coloring the markers for each category. By default, it uses the default color scheme of Vega-Lite.
order: If specified, the ordering of the categories to use on the categorical axis and legend (if applicable). Otherwise, the order of the inputted data frame is used.
p: If specified, the bokeh.plotting.Figure object to use for the plot. If not specified, a new figure is created.
The respective plotting functions also have kwargs that are specific to each (such as style for bokeh_catplot.ecdf()). Examples highlighting some, but not all, customizations follow. You can find out what kwargs are available for each function by reading its doc string, e.g., with
bokeh_catplot.box?
Any kwargs not in the function call signature are passed to bokeh.plotting.figure() when the figure is instantiated.
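This pass-through behavior is a common Python pattern. The toy sketch below shows the general idea with stand-in functions; make_figure and toy_strip are hypothetical names that only mimic the pattern and are not part of Bokeh or Bokeh-catplot.

```python
def make_figure(**fig_kwargs):
    """Stand-in for a figure constructor; it just records what it received."""
    return {'figure_kwargs': fig_kwargs}


def toy_strip(data, cats, val, jitter=False, **kwargs):
    """Hypothetical plotting function mimicking the pass-through pattern.

    Arguments named in the signature are handled here; everything else
    is forwarded untouched to the figure constructor.
    """
    fig = make_figure(**kwargs)
    fig['jitter'] = jitter
    return fig


# height and width are not in the signature, so they reach make_figure().
fig = toy_strip(
    data=None,
    cats=['gender'],
    val='percent correct',
    jitter=True,
    height=250,
    width=550,
)
```

This is why a kwarg such as height or width can be given directly in a plotting call.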
Customizing box plots¶
We can also have horizontal box plots. The horizontal kwarg also works for strip plots.
[7]:
p = bokeh_catplot.box(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
horizontal=True
)
bokeh.io.show(p)
We can independently specify properties of the marks using box_kwargs, whisker_kwargs, median_kwargs, and outlier_kwargs. For example, say we wanted our colors to be Betancourt red, and that we wanted the outliers to also be that color and use diamond glyphs.
[8]:
p = bokeh_catplot.box(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
whisker_caps=True,
outlier_marker='diamond',
box_kwargs=dict(fill_color='#7C0000'),
whisker_kwargs=dict(line_color='#7C0000', line_width=2),
)
bokeh.io.show(p)
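To clarify which quantity each of these marks represents, here is a sketch of box plot summary statistics under the common Tukey convention: the box spans the first to third quartile, the whiskers extend to the most extreme data points within 1.5 times the IQR of the box, and points beyond that are outliers. That Bokeh-catplot follows exactly this convention, and the interpolation used for the quartiles, are assumptions on my part; the function names are hypothetical.

```python
import math


def percentile(sorted_x, q):
    """Linearly interpolated percentile (q in [0, 100]) of sorted data."""
    pos = (len(sorted_x) - 1) * q / 100
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_x[lo] * (1 - frac) + sorted_x[hi] * frac


def box_stats(x, whisker_factor=1.5):
    """Summary statistics for a Tukey-style box plot."""
    xs = sorted(x)
    q1, med, q3 = (percentile(xs, q) for q in (25, 50, 75))
    iqr = q3 - q1
    lo_fence = q1 - whisker_factor * iqr
    hi_fence = q3 + whisker_factor * iqr
    inside = [v for v in xs if lo_fence <= v <= hi_fence]
    outliers = [v for v in xs if v < lo_fence or v > hi_fence]
    return {
        'box': (q1, q3),
        'median': med,
        'whiskers': (inside[0], inside[-1]),
        'outliers': outliers,
    }


# Made-up scores, for illustration only.
stats = box_stats([40, 70, 75, 80, 80, 85, 90, 95, 100])
```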
Customizing strip plots¶
To help alleviate the overlap problem, we can make a strip plot with dash markers and add some transparency.
[9]:
p = bokeh_catplot.strip(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
marker='dash',
marker_kwargs=dict(alpha=0.5)
)
bokeh.io.show(p)
I prefer jittering to this, but a strip plot with dashes is an option (also in HoloViews). Below, I add hover tools that give more information about the respective data points in a jittered strip plot.
[10]:
p = bokeh_catplot.strip(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
jitter=True,
tooltips=[
('age', '@{age}'),
('participant number', '@{participant number}')
],
)
bokeh.io.show(p)
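Each tooltip is a (label, field) pair. In Bokeh's hover syntax, @ refers to a column of the underlying data, and the braces are needed when a column name contains spaces, as 'participant number' does. If you want tooltips for several columns, a small helper like the hypothetical make_tooltips below can build the list for you.

```python
def make_tooltips(columns):
    """Build (label, '@{column}') tooltip pairs from column names.

    Wrapping every name in braces is safe whether or not it has spaces.
    """
    return [(col, '@{' + col + '}') for col in columns]


tooltips = make_tooltips(['age', 'participant number'])
```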
Strip-box plots¶
Even while plotting all of the data, we sometimes want to graphically display summary statistics, in which case overlaying a box plot and a jitter plot is useful. To populate an existing Bokeh figure with new glyphs from another catplot, pass in the p kwarg. You should be careful, though, because you need to make sure the cats, val, and horizontal arguments exactly match.
[11]:
p = bokeh_catplot.strip(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
horizontal=True,
jitter=True,
height=250
)
p = bokeh_catplot.box(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
horizontal=True,
whisker_caps=True,
display_points=False,
box_kwargs=dict(fill_color=None, line_color='gray'),
median_kwargs=dict(line_color='gray'),
whisker_kwargs=dict(line_color='gray'),
p=p,
)
bokeh.io.show(p)
Customizing histograms¶
We can plot normalized histograms using the density kwarg, and we’ll make the plot a little wider to accommodate the legend.
[12]:
# Plot the histogram
p = bokeh_catplot.histogram(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
density=True,
width=550,
)
p.legend.location = 'top_left'
bokeh.io.show(p)
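A normalized (density) histogram rescales the bar heights so that the total area of the bars is one: each count is divided by the number of data points times the bin width. The following sketch shows the arithmetic in plain Python. The function name and the made-up data are for illustration only, and it assumes every data point lies within the given bin edges.

```python
def density_histogram(x, edges):
    """Return bar heights normalized so that the total bar area is 1.

    `edges` are increasing bin edges; the last bin is closed on the right.
    Assumes every value in `x` lies within [edges[0], edges[-1]].
    """
    n = len(x)
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for xi in x:
        for i in range(n_bins):
            last = i == n_bins - 1
            if edges[i] <= xi < edges[i + 1] or (last and xi == edges[-1]):
                counts[i] += 1
                break
    return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


edges = [50, 60, 70, 80, 90, 100]
heights = density_histogram([55, 62, 68, 75, 75, 81, 94, 100], edges)

# Summing height times width over all bins recovers a total area of one.
area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
```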
Customizing ECDFs¶
Instead of plotting a separate ECDF for each category, we can put all of the categories together on one ECDF and color the points by the categorical variable by using the kind='colored' kwarg. Note that if we do this, we can only have the “dot” style ECDF, not the formal staircase.
[13]:
p = bokeh_catplot.ecdf(
data=df,
cats=['gender', 'sleeper'],
val='percent correct',
kind='colored',
)
p.legend.location = 'top_left'
bokeh.io.show(p)
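Conceptually, a colored ECDF pools all of the measurements, computes a single ECDF, and remembers which category each point came from so it can be colored accordingly. A minimal sketch of that bookkeeping, with a hypothetical function name and made-up values, might be:

```python
def colored_ecdf(groups):
    """Pool all groups into one ECDF, keeping each point's category label.

    `groups` maps a category label to a list of values. Returns a list of
    (value, ecdf_height, label) tuples sorted by value.
    """
    pooled = sorted(
        (value, label) for label, values in groups.items() for value in values
    )
    n = len(pooled)
    return [
        (value, (i + 1) / n, label) for i, (value, label) in enumerate(pooled)
    ]


points = colored_ecdf({'insomniac': [70.0, 85.0], 'normal': [80.0, 90.0]})
```

Because every point belongs to the single pooled ECDF, there is no separate curve per category, which is consistent with only the dot style being available.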
In general, for customization, the doc strings of the respective plotting functions provide a good sense of what is available.
Computing environment¶
[14]:
%load_ext watermark
%watermark -v -p pandas,bokeh,bokeh_catplot,jupyterlab
CPython 3.7.4
IPython 7.8.0
pandas 0.24.2
bokeh 1.3.4
bokeh_catplot 0.1.4
jupyterlab 1.1.4