iqplot

Data set download


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot colorcet watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import pandas as pd

import iqplot

import bokeh.io
bokeh.io.output_notebook()

HoloViews is an excellent high-level package for this purpose, but as we have mentioned before, it lacks two key functionalities.

  1. It does not offer a convenient native way to make ECDFs, but it will soon.

  2. It does not allow for nested categorical axes for plots other than box plots, bar graphs, and violin plots, but it will soon.

To address these and other needs, I developed iqplot, which generates Bokeh plots for data sets where one variable is quantitative and all other variables of interest, if any, are categorical. This is where the name comes from; the first two letters of the package name are meant to indicate one (Roman numeral I) quantitative (Q) variable. Data sets that contain a single quantitative variable (and possibly several categorical variables) abound in the biological sciences.

There are five types of plots that iqplot can generate. As you will see, all five of these modes of plotting are meant to give a picture of how the quantitative measurements are distributed.

  • Plots with a categorical axis

    • Box plots: iqplot.box()

    • Strip plots: iqplot.strip()

    • Strip-box plots (strip and box plots overlaid): iqplot.stripbox()

  • Plots without a categorical axis

    • Histograms: iqplot.histogram()

    • ECDFs: iqplot.ecdf()

The first seven arguments are the same for all plots. They are:

  • data: A tidy data frame

  • q: The column of the data frame to be treated as the quantitative variable.

  • cats: A list of columns in the data frame that are to be considered as categorical variables in the plot. If None, a single box, strip, histogram, or ECDF is plotted.

  • q_axis: The axis, 'x' or 'y', along which the quantitative variable varies. The default is 'x'.

  • palette: A list of hex colors to use for coloring the markers for each category. By default, it uses the Glasbey Category 10 color palette from colorcet.

  • order: If specified, the ordering of the categories to use on the categorical axis and legend (if applicable). Otherwise, the order of the inputted data frame is used.

  • p: If specified, the bokeh.plotting.Figure object to use for the plot. If not specified, a new figure is created.

If data is given as a NumPy array, it is the only required argument. If data is given as a Pandas DataFrame, q must also be supplied. All other arguments are optional and have reasonable defaults. Any extra kwargs not in the function call signature are passed to bokeh.plotting.figure() when the figure is instantiated.
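To make the roles of q and cats concrete, here is a minimal sketch in plain Python of how a tidy table with one quantitative column and two categorical columns splits into nested groups. This is not iqplot's implementation; the helper name group_quantitative and the rows below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical tidy records: one row per observation (values are invented).
rows = [
    {"gender": "female", "sleeper": "normal", "percent correct": 85.0},
    {"gender": "female", "sleeper": "insomniac", "percent correct": 70.0},
    {"gender": "male", "sleeper": "normal", "percent correct": 80.0},
    {"gender": "female", "sleeper": "normal", "percent correct": 90.0},
]


def group_quantitative(rows, q, cats):
    """Collect the values of column `q` for each unique combination of `cats`."""
    groups = defaultdict(list)
    for row in rows:
        # The key is a tuple, one entry per categorical column (nested categories).
        key = tuple(row[c] for c in cats)
        groups[key].append(row[q])
    return dict(groups)


groups = group_quantitative(rows, q="percent correct", cats=["gender", "sleeper"])
for key, values in groups.items():
    print(key, values)
```

Each key, such as ("female", "normal"), corresponds to one position on a nested categorical axis or one entry in a legend.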

With this in mind, we will put iqplot to use on the facial identification data set to demonstrate how we can make each of the five kinds of plots.

[2]:
fname = os.path.join(data_path, "gfmt_sleep.csv")
df = pd.read_csv(fname, na_values="*")
df["insomnia"] = df["sci"] <= 16
df["sleeper"] = df["insomnia"].apply(lambda x: "insomniac" if x else "normal")
df["gender"] = df["gender"].apply(lambda x: "female" if x == "f" else "male")

All five plots

We now make plots of the percent correct for male and female insomniacs and normal sleepers so you can see how the syntax works.

Box plot

[3]:
p = iqplot.box(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
)

bokeh.io.show(p)
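As a reminder of what a box plot summarizes, the sketch below computes the box, whiskers, and outliers for a small made-up sample. It assumes the common Tukey convention (whiskers reach the most extreme points within 1.5 times the interquartile range of the box edges) and one particular quartile interpolation; check the iqplot documentation for the conventions it actually uses. The function name box_stats is hypothetical.

```python
import statistics


def box_stats(data, whisker_factor=1.5):
    """Summarize data the way a Tukey-style box plot would (assumed convention)."""
    # Quartiles with linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inliers = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inliers),    # most extreme point inside the low fence
        "whisker_high": max(inliers),   # most extreme point inside the high fence
        "outliers": outliers,
    }


# Invented percent-correct values with one unusually low score.
stats = box_stats([40, 70, 75, 80, 85, 90, 95])
print(stats)
```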

Strip plot

For this plot, I will add jitter, which is passed as a Boolean. Note that HoloViews cannot make a plot like this because it cannot have nested categorical axes for Scatter elements.

[4]:
p = iqplot.strip(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    jitter=True,
)

bokeh.io.show(p)
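The idea behind jitter can be sketched in a few lines: each point keeps its quantitative value, but its position along the categorical axis is nudged by a small random offset so that overlapping points separate. This is only an illustration of the concept, not how iqplot implements it; the function name, the uniform distribution, and the width of 0.4 are assumptions.

```python
import random


def jitter_positions(n_points, center, width=0.4, seed=None):
    """Return n_points positions spread uniformly within +/- width/2 of center."""
    rng = random.Random(seed)
    return [center + rng.uniform(-width / 2, width / 2) for _ in range(n_points)]


# Three points in the same category, whose band is centered at position 1.0.
positions = jitter_positions(3, center=1.0, width=0.4, seed=0)
print(positions)
```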

Strip-box plot

For a strip-box plot, a strip plot and box plot are overlaid with reasonable defaults for the box plot to enable visualization.

[5]:
p = iqplot.stripbox(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    jitter=True,
)

bokeh.io.show(p)

Histogram

For histograms, the number of bins is automatically chosen using the Freedman-Diaconis rule.

[6]:
p = iqplot.histogram(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
)

p.legend.location = 'top_left'

bokeh.io.show(p)
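For reference, the Freedman-Diaconis rule used for the histogram sets the bin width to 2 IQR / n^(1/3), where IQR is the interquartile range and n is the number of data points. The sketch below applies the rule to made-up data. The quartile interpolation method and the function name are assumptions, so the bin count iqplot chooses for the same data may differ slightly.

```python
import math
import statistics


def freedman_diaconis_bins(data):
    """Return (bin_width, n_bins) from the Freedman-Diaconis rule."""
    n = len(data)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the rule does not give a finite bin count.")
    width = 2 * iqr / n ** (1 / 3)
    # Enough bins of that width to cover the full range of the data.
    n_bins = math.ceil((max(data) - min(data)) / width)
    return width, n_bins


data = [1, 2, 3, 4, 5, 6, 7, 10]
width, n_bins = freedman_diaconis_bins(data)
print(width, n_bins)
```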

ECDF

[7]:
p = iqplot.ecdf(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    style='staircase'
)

p.legend.location = 'top_left'

bokeh.io.show(p)

Note that the ECDFs show a clear difference: the distribution for female insomniacs is shifted leftward relative to all other categories. This shift is most apparent in the ECDF.
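To see what is being plotted, recall that the ECDF evaluated at x is the fraction of data points less than or equal to x. A minimal sketch of computing the points of a "dot" style ECDF, independent of iqplot and using made-up numbers, is:

```python
def ecdf(values):
    """Return the sorted values and the ECDF height of each plotted point."""
    x = sorted(values)
    n = len(x)
    # The i-th smallest point (counting from 1) sits at height i / n.
    y = [(i + 1) / n for i in range(n)]
    return x, y


x, y = ecdf([3.0, 1.0, 2.0, 2.0])
print(list(zip(x, y)))
```

With repeated values, the dots stack vertically at the same x; the formal ECDF takes the highest of them as its value there.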

Customization with iqplot

You may have noticed in the discussion of ECDFs that I introduced a new keyword argument, style='staircase'. There are plot-type-specific kwargs which enable customization beyond the customization kwargs common to the plot types, such as palette and q_axis.

You can find out what kwargs are available for each function by reading their doc strings, e.g., with

iqplot.box?

or by reading the documentation. Any kwargs not in the function call signature are passed to bokeh.plotting.figure() when the figure is instantiated.

Customizing box plots

We can also have vertical box plots using the q_axis kwarg.

[8]:
p = iqplot.box(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    q_axis='y',
)

bokeh.io.show(p)

We can independently specify properties of the marks using box_kwargs, whisker_kwargs, median_kwargs, and outlier_kwargs. For example, say we wanted our colors to be Betancourt red, and that we wanted the outliers to also be that color and use diamond glyphs.

[9]:
p = iqplot.box(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    q_axis='y',
    whisker_caps=True,
    outlier_marker='diamond',
    box_kwargs=dict(fill_color='#7C0000'),
    whisker_kwargs=dict(line_color='#7C0000', line_width=2),
)

bokeh.io.show(p)

Customizing strip plots

To help alleviate the overlap problem, we can make a strip plot with dash markers and add some transparency.

[10]:
p = iqplot.strip(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    marker='dash',
    marker_kwargs=dict(alpha=0.5)
)

bokeh.io.show(p)

I prefer jittering to this, but a strip plot with dashes is an option (also in HoloViews). Below, I add hover tools that give more information about the respective data points in a jittered strip plot.

[11]:
p = iqplot.strip(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    jitter=True,
    tooltips=[
        ('age', '@{age}'),
        ('participant number', '@{participant number}')
    ],
)

bokeh.io.show(p)

Customizing histograms

We can plot normalized histograms using the density kwarg, and we’ll make the plot a little wider to support the legend.

[12]:
# Plot the histogram
p = iqplot.histogram(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    density=True,
    frame_width=525,
)

p.legend.location = 'top_left'

bokeh.io.show(p)
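Normalizing a histogram as a density means dividing each bin count by the total number of points times the bin width, so that the bar areas sum to one. The sketch below illustrates this with made-up data and hand-picked bin edges; it is not iqplot's code, and it assumes every data point lies within the edges.

```python
def density_histogram(data, edges):
    """Return bar heights normalized so the total area under the bars is 1."""
    n = len(data)
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for x in data:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            # Bins are half-open, [left, right), except the last, which is closed.
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    widths = [edges[i + 1] - edges[i] for i in range(n_bins)]
    return [c / (n * w) for c, w in zip(counts, widths)]


data = [55, 62, 68, 71, 75, 78, 82, 88, 91, 100]
edges = [50, 60, 70, 80, 90, 100]
heights = density_histogram(data, edges)
area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(heights, area)
```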

Customizing ECDFs

Instead of plotting a separate ECDF for each category, we can put all of the categories together on one ECDF and color the points by the categorical variable by using the kind='colored' kwarg. Note that if we do this, we can only have the “dot” style ECDF, not the formal staircase.

[13]:
p = iqplot.ecdf(
    data=df,
    q='percent correct',
    cats=['gender', 'sleeper'],
    kind='colored',
)

p.legend.location = 'top_left'

bokeh.io.show(p)
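One way to understand the colored ECDF is that all measurements are pooled into a single ECDF, and each point simply remembers which category it came from. The sketch below shows that bookkeeping with invented values; the function name is hypothetical and this is not iqplot's implementation.

```python
def colored_ecdf(values, labels):
    """Pool all values into one ECDF and keep each point's category label."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    # Each point is (x, ECDF height, category used to pick its color).
    return [(x, (i + 1) / n, label) for i, (x, label) in enumerate(pairs)]


values = [80.0, 65.0, 90.0, 72.5]
labels = ["normal", "insomniac", "normal", "insomniac"]
points = colored_ecdf(values, labels)
for point in points:
    print(point)
```

Because the category belongs to individual points rather than to a whole curve, coloring is natural for dots but not for a single staircase line.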

In general, for customization, you should check the documentation to see what is available.

Computing environment

[14]:
%load_ext watermark
%watermark -v -p pandas,bokeh,iqplot,jupyterlab
CPython 3.8.5
IPython 7.18.1

pandas 1.1.1
bokeh 2.2.1
iqplot 0.1.6
jupyterlab 2.2.6