Code is meant for sharing

This recitation was written by Patrick Almhjell.


As we progress through this course, we will be sharing a lot of code. Indeed, you already have! You’ve shared your code with your group as you work to complete assignments, and you’ve shared your code with us so we can make sure you’re on the right track. And, most importantly, down the road you will be sharing this code with Future You.

Critically, to complete each assignment you’ve used functions written by other (very talented) people to efficiently accomplish your goals—functions which have been packaged and distributed and made accessible to the world.

I expect you to believe me when I say that your life would be very difficult without packages.

In this recitation, we’ll learn a little bit about making, distributing, and managing packages of our own.

An overview of what this recitation will cover:

a) Sharing code: From functions to modules to packages.

  • Why bother to write a package?

b) Package architecture and good practices.

  • Each module does one consistent task, but the modules are complementary (e.g., data processing, analysis, or visualization).

  • Namespaces: a ‘honking great idea’. (remember The Zen of Python?)

c) Making a new package.

  • When to start. (Hint: It’s probably sooner than you think.)

  • But really. Now is better than never!

d) Setting up and testing the package.

  • Make your code accessible!

  • Use pytest to check your work.

e) Improvement and collaboration.

  • TDD is great, but the best way to make something that works is to have other people break* it.

  • (*They don’t always have to break it… suggesting enhancements is good too.)

  • Have your people use it and raise issues on GitHub!
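On the ‘honking great idea’ in (b): namespaces let two different functions share a name without clashing, which is exactly what you want when stitching complementary modules together. A quick standard-library illustration:

```python
import math   # real-valued math functions
import cmath  # complex-valued math functions

# Two different functions named sqrt coexist, and the prefix at the
# call site tells you exactly which one you're getting.
print(math.sqrt(4))    # 2.0
print(cmath.sqrt(-4))  # 2j
```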

Making useful code

As you progress through this course, you’ll find yourself gaining the ability to quickly write functions that do a useful task with real data. For example, you’ll write parsing functions, functions that wrangle instrument data from Excel-loving manufacturers, and plenty of plotting functions.

You may find yourself reusing these functions (or a handful of cooperating functions) across different projects. Jupyter notebooks facilitate this, as you can quickly drag and drop functions into code cells at the beginning of your notebook. This is a good thing. However, it can eventually result in unruly notebooks.

For example, let’s look at a function I wrote to quickly compare timecourse-like data across different experiments.

(warning: This is a lot of code. That’s the point. Just keep scrolling.)

[1]:
import numpy as np
import pandas as pd

import bokeh.palettes

import holoviews as hv
hv.extension('bokeh')
[2]:
def check_df_col(df, column, name=None):
    """
    Checks for the presence of a column (or columns) in a tidy
    DataFrame with an informative error message. Passes silently,
    otherwise raises error.
    """
    if column is not None:
        if not isinstance(column, list):
            column = [column]

        for col in column:
            if name is None:
                error_message = f"The value '{col}' is not present in any of the columns of your DataFrame."
            else:
                error_message = f"Your {name} value '{col}' is not present in any of the columns of your DataFrame."
            error_message += "\nYou may be looking for:\n  " + str(list(df.columns))

            assert col in df.columns, error_message


def check_replicates(df, variable, value, grouping):
    """Checks for the presence of replicates in the values of a dataset,
    given some experimental conditions. Returns True if the standard
    deviation of the values of each group (if more than one exists) is
    greater than, indicating that replicates were performed under the
    given criteria.

    Parameters
    ----------
    df : Pandas DataFrame in tidy format
        The data set to be checked for replicates
    variable : immutable object
        Name of column of data frame for the independent variable,
        indicating a specific experimental condition.
    value : immutable object
        Name of column of data frame for the dependent variable,
        indicating an experimental observation.
    grouping : immutable object or list of immutable objects
        Column name or list of column names that indicates how the
        data set should be split.

    Returns
    -------
    replicates : boolean
        True if replicates are present.
    df_out : the DataFrame containing averaged 'value' values, if
        replicates is True. Otherwise the original DataFrame.
    """

    # Unpack the experimental conditions into a single list of arguments
    if not isinstance(grouping, list):
        grouping = [grouping]
    args = [elem for elem in [variable, *grouping] if elem is not None]

    # Get stdev of argument groups
    grouped = df.groupby(args)[value]
    group_stdevs = grouped.std().reset_index()
    group_stdev = group_stdevs[value].mean()

    # Determine if there are replicates (mean > 0)
    replicates = bool(group_stdev > 0)

    # Average the values and return; fall back to the original DataFrame
    # if there are no replicates
    df_return = df
    if replicates:
        df_mean = grouped.mean().reset_index()
        df_mean.columns = list(df_mean.columns[:-1]) + ['Mean of ' + str(value)]
        df_return = df.merge(df_mean)

    return replicates, df_return


def plot_timecourse(
    df,
    variable,
    value,
    condition=None,
    split=None,
    sort=None,
    cmap=None,
    show_all=False,
    show_points="default",
    legend=False,
    height=350,
    width=500,
    additional_opts={},
):

    """
    Converts a tidy DataFrame containing timecourse-like data into a
    plot, taking care to show all the data. A line is computed as the
    average of each set of points (grouped by the condition and split,
    if present), and the actual data points are overlaid on top.

    Parameters
    ----------
    df : Pandas DataFrame in tidy format
        The data set to be used for plotting.
    variable : immutable object
        Column in data frame representing the timecourse-like variable,
        plotted on the x-axis
    value : immutable object
        Column in data frame representing the quantitative value,
        plotted on the y-axis
    condition : The way the data is grouped for a single chart.
        Defaults to None.
    split :  The way the data is grouped between different charts.
        Defaults to None.
    sort : Which column is used to determine the sorting of the data.
        Defaults to None, and will sort by the condition column
        (alphabetical) if present, otherwise variable.
    cmap : The colormap to use. Any Holoviews/Bokeh colormap is fine.
        Uses Holoviews default if None.
    show_all : If split is not None, whether to show all the plots
        side-by-side (layout) instead of in a drop-down. Note that this
        can be pretty buggy from Holoviews' layout system. There is
        usually a way to show all the info you want, in a nice way.
        Just play around.
    show_points : Shows all the data points. I don't even know why this
        is an argument. Default will show points if there are multiple
        replicates. Unless you have a real good reason, don't change
        this.
    legend : First controls whether or not the legend is shown, then
        its position. Defaults to False, though 'top' would be a good
        option, or 'top_left' if using split.
    height : int; the height of the chart.
    width : int; the width of the chart.
    additional_opts : A dictionary to pass additional Holoviews options
        to the chart. Flexible; will try all options and only use the
        ones that did not raise an exception. Not verbose.

    Returns
    -------
    chart : the final Holoviews chart
    """

    # Work on a copy so a caller-supplied options dict (or the mutable
    # default) is never modified in place
    additional_opts = dict(additional_opts)

    # Check columns
    check_df_col(df, variable, name="variable")
    check_df_col(df, value, name="value")
    check_df_col(df, condition, name="condition")
    check_df_col(df, split, name="split")
    check_df_col(df, sort, name="sort")

    # Check for replicates; aggregate df
    groups = [grouping for grouping in (condition, split) if grouping is not None]
    if groups == []:
        groups = None
    replicates, df = check_replicates(df, variable, value, groups)

    # Pull out available encodings (column names)
    encodings = list(df.columns)

    # Set options
    base_opts = dict(height=height, width=width, padding=0.1)

    if legend is not False:
        base_opts.update(dict(show_legend=True))
        if legend is not True:
            additional_opts.update(dict(legend_position=legend))

    line_opts = base_opts
    scat_opts = dict(size=6, fill_alpha=0.75, tools=["hover"])
    scat_opts.update(base_opts)

    # Now, start to actually make the chart
    points = hv.Scatter(df, variable, [value, *encodings]).opts(**scat_opts)

    lines = hv.Curve(df, variable, [("Mean of " + str(value), value), *encodings]).opts(
        **line_opts
    )

    if groups is not None:
        points = points.groupby(groups).opts(**scat_opts)
        lines = lines.groupby(groups).opts(**line_opts)

    # Output chart as desired
    if show_points == "default":
        if replicates is True:
            chart = lines * points
        else:
            chart = lines
    elif show_points is True:
        chart = lines * points
    else:
        chart = lines

    # Overlay each line plot
    if condition is not None:
        chart = chart.overlay(condition)

    # Split among different charts
    if split is not None:

        # If split, show as side-by-side, or dropdown
        # Note, this is pretty buggy; on Holoviews' end.
        if show_all is True:
            chart = chart.layout(split)

    # Assign the additional options, as allowed
    if additional_opts != {}:
        try:
            chart = chart.options(**additional_opts)
        except ValueError:
            # Try each option on its own; keep only the ones that work
            good_opts = {}
            bad_opts = {}

            for opt, val in additional_opts.items():
                try:
                    chart.options(**{opt: val})
                    good_opts[opt] = val
                except ValueError:
                    bad_opts[opt] = val

            chart = chart.options(**good_opts)

    # Assign color
    if cmap is not None:
        chart = chart.opts({"Scatter": {"color": cmap}, "Curve": {"color": cmap}})

    return chart

Okay. Done.

Now that all this is out of the way, let’s make some data to plot.

[3]:
# Make a fake Dataset
np.random.seed(8675309)

# Michaelis-Menten equation, because I'm a biochemist
def noisy_mm(concs, kcat, KM=10):
    noise = np.random.normal(1, 0.1, len(concs))
    return kcat*(concs / (concs+KM))*noise

# Make some concentrations in logspace
concs = np.logspace(1e-10, 2, 8)

# Set up the fake MM data
df_indole = pd.concat([pd.DataFrame({
    'Sample' : 'Enzyme '+str(i+1),
    'Substrate Concentration' : np.concatenate([concs for _ in range(8)]),
    'Rate' : np.concatenate([noisy_mm(concs, kcat=1+i**2) for _ in range(8)]),
    'Substrate' : 'Indole',
}) for i in range(3)])

df_azulene = pd.concat([pd.DataFrame({
    'Sample' : 'Enzyme '+str(i+1),
    'Substrate Concentration' : np.concatenate([concs for _ in range(8)]),
    'Rate' : np.concatenate([noisy_mm(concs, kcat=5+i**2.2, KM=100) for _ in range(8)]),
    'Substrate' : 'Azulene',
}) for i in range(3)])

df = pd.concat([df_indole, df_azulene])

# Check
df.head()
[3]:
     Sample  Substrate Concentration      Rate Substrate
0  Enzyme 1                 1.000000  0.096264    Indole
1  Enzyme 1                 1.930698  0.173690    Indole
2  Enzyme 1                 3.727594  0.239982    Indole
3  Enzyme 1                 7.196857  0.395196    Indole
4  Enzyme 1                13.894955  0.536581    Indole

Now let’s use our function to plot the data.

[4]:
cmap = hv.Cycle(bokeh.palettes.inferno(5)[1:-1])

plot_timecourse(df,
                'Substrate Concentration',
                'Rate',
                condition='Sample',
                split='Substrate',
                legend='top_left',
                cmap=cmap)
[4]:

This is a mess!

Well, the plot’s pretty nice. (Pats self on back.) But the huge block of code is pretty distracting. We wouldn’t want to find, copy, and paste this entire thing into our notebook every time we wanted to use it.

We could instead transfer all of this into a single module (a .py file) in the same directory as the notebook, which we import with a single line at the beginning of our notebook. But, as discussed in class, this also isn’t a great idea. The .py file eventually goes rogue, floating around and accruing untracked changes between different users. The result is many different versions of a base module that all do different things.

Furthermore, we may want to add functionality that is complementary but based on conceptually different code. So we may want to keep the code in separate modules, but still use those modules together.
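To make that concrete, here is a minimal sketch of the structure a package gives you. All the names here ("mytools", "parsing", "viz") are made up, and the package is built on the fly only so the example runs end-to-end; in real life, these files live in your repository:

```python
# Hypothetical sketch: a tiny package with two complementary modules.
import sys
import tempfile
from pathlib import Path

pkg_root = Path(tempfile.mkdtemp())
pkg = pkg_root / "mytools"
pkg.mkdir()

# __init__.py marks the directory as a package and pulls in the modules
(pkg / "__init__.py").write_text("from . import parsing, viz\n")
(pkg / "parsing.py").write_text("def tidy(rows):\n    return sorted(rows)\n")
(pkg / "viz.py").write_text("def title(s):\n    return s.upper()\n")

sys.path.insert(0, str(pkg_root))
import mytools

# Separate modules, used together under one namespace
print(mytools.parsing.tidy([3, 1, 2]))  # [1, 2, 3]
print(mytools.viz.title("rates"))       # RATES
```

The point is the import side: conceptually different code stays in its own module, but `mytools.parsing` and `mytools.viz` travel (and get versioned) together.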

Packages hosted on GitHub (or something similar) very conveniently address these issues. If this seems at all relevant to your day-to-day life, you should make a package. (Like, now.) We’ll learn how to do this in the rest of this recitation. I promise, it’s not so scary, and you’ll thank yourself later.
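As a tiny preview of the setup step we’ll cover, a minimal setup.py is often all pip needs to install your package; the names here are placeholders, and real packages usually add more metadata (author, description, install_requires):

```python
# setup.py -- a minimal, hypothetical sketch
from setuptools import setup, find_packages

setup(
    name="mytools",          # placeholder package name
    version="0.0.1",
    packages=find_packages(),
)
```

With this file at the root of the repository, `pip install -e .` installs the package in editable mode, so `import mytools` works from any notebook while edits to the source take effect immediately.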

Computing environment

[5]:
%load_ext watermark

%watermark -v -p jupyterlab,numpy,pandas,holoviews,bokeh
CPython 3.7.4
IPython 7.1.1

jupyterlab 1.1.4
numpy 1.17.2
pandas 0.24.2
holoviews 1.12.6
bokeh 1.3.4