Package basics

This recitation was written by Patrick Almhjell.


Packages were explained to you in Lesson 2, so you should have a general idea of how they work. But we’ll go over the important things again.

You should think of a package as a collection of modules with some instructions for how they interact:

  • Some of these instructions are set in a file called __init__.py.

  • Each module should contain python objects (mainly functions and classes*) that are related and make sense together.

    • Modules should interact with one another intuitively and productively. So, you can and should import modules within other modules.

    • A single module should not mix code that performs fundamentally different tasks.

(*We won’t be discussing classes in this course, but feel free to reach out if you want to learn about them!)

Package architecture

Personally, I’ve developed a preference for having instrument-centric data wrangling and processing modules. So, my package that I use and share with my lab looks something like this:

Package Component         Description
---------------------     ----------------------------------
/arnoldLab_utils          <------ this is the root directory
  /arnoldLab_utils        <------ this is where the guts of the package are kept
    __init__.py           <------ this specifies what/how things are imported into the 'namespace'; see below
    tecan.py              <------ submodule for Tecan plate reader data
    LCMS.py               <------ submodule for Agilent LCMS data
    screening_utils.py    <------ this helps work up screening data from either Tecan or LCMS
    viz.py                <------ my favorite module; helps me visualize my data in many ways
/tests                    <------ tests, to be used with pytest
/templates                <------ useful template files; e.g., to map conditions to a 96-well plate
setup.py                  <------ helps install the package
README.md                 <------ gives details on the package, how to install/contribute, etc.

Here, tecan.py contains code for wrangling data from our Tecan plate reader (you’ll be doing this in an upcoming problem set!), and LCMS.py contains code for wrangling data from our Agilent LCMS.

Most of the time, I’m screening enzyme variants for activity. So, I have a module called screening_utils.py that interacts with these other modules when I’m doing that, allowing me to specify which wells of a 96-well plate are controls, perform background subtractions/normalizations (after validating that I should be doing subtraction or normalization), etc.

However, I’m not always using the Tecan or LCMS for screening, so they don’t have to interact with screening_utils.py. Other times I might be doing a BCA assay (for protein quantification) on the Tecan or looking at single analytical reactions on the LCMS. So they also provide that functionality. But it all starts with working up the data and getting it into a usable format.
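As a rough sketch of the kind of helper a module like screening_utils.py might hold, here is a background-subtraction function that validates its control wells before doing the arithmetic. The function and argument names are hypothetical, not the actual arnoldLab_utils API:

```python
from statistics import mean

def subtract_background(signals, control_wells):
    """Subtract the mean control-well signal from every well.

    signals : dict mapping well name (e.g. 'A1') to a measured value
    control_wells : list of well names to treat as negative controls
    """
    # Validate first: a typo'd control well should fail loudly,
    # not silently skew the background estimate.
    missing = [w for w in control_wells if w not in signals]
    if missing:
        raise ValueError(f"control wells not in data: {missing}")
    background = mean(signals[w] for w in control_wells)
    return {well: value - background for well, value in signals.items()}

plate = {"A1": 0.10, "A2": 0.12, "B1": 0.95, "B2": 0.80}
corrected = subtract_background(plate, control_wells=["A1", "A2"])
print(corrected["B1"])  # ≈ 0.84
```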

To drive this home, I’ll present an excellent quote from Griffin Chure when we were discussing this:

Code you write should be separated by what it does.

In that vein, my module viz.py contains functions for making informative plots from data that I collect on a daily basis, usually (but not limited to) data from the Tecan or LCMS. Visualization is not mixed with the processing or analysis.

Finally, I have the usual setup.py, README.md, and __init__.py files as well as a tests/ directory (more on these soon). You’ll also see a templates/ directory, which is where I keep basic templates that can help streamline some functions. This is another nice thing about a package: you can keep anything in the root directory (or almost anywhere, really) that might be essential or helpful for the user, such as templates, documents, example data, etc.

The __init__.py file and namespaces

As described in Lesson 2, the __init__.py file provides instructions about how the modules are imported.

Mine looks something like this:

from .tecan import *
from .LCMS import *
from .viz import *

__author__ = 'Patrick Almhjell'
__email__ = 'palmhjell@caltech.edu'
__version__ = '0.0.1'

The other modules are handled within those three import statements, so that’s really all I need. This is a pretty common import style.

What this means is that when I run import arnoldLab_utils as ut in my python session, any given function within tecan.py, LCMS.py, or viz.py is accessible to me with ut.function().

In other words, these functions are available within the namespace of ut. Namespaces help keep python objects separate, which is a very good thing.
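To make this concrete, here is a tiny throwaway package built on the fly in a temporary directory. All the names here (demo_utils, tecan, read_plate) are hypothetical stand-ins, not the real arnoldLab_utils; the point is just to show how `from .module import *` in __init__.py flattens the namespace:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk.
tmp = Path(tempfile.mkdtemp())
pkg = tmp / "demo_utils"
pkg.mkdir()

# A submodule with one function (a stand-in for tecan.py).
(pkg / "tecan.py").write_text(
    "def read_plate(path):\n"
    "    '''Pretend to parse a plate-reader file.'''\n"
    "    return f'wrangled {path}'\n"
)

# __init__.py pulls the submodule's names up to the package level.
(pkg / "__init__.py").write_text("from .tecan import *\n")

sys.path.insert(0, str(tmp))
ut = importlib.import_module("demo_utils")

# The function is available directly on the package namespace.
print(ut.read_plate("data.xlsx"))  # wrangled data.xlsx
```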

A quick aside on namespaces:

Say I decide to make a function called slice(), which could be used to slice a dataset at a given value and only give me the entries above it (my “hits”, as they’re called when we’re doing screening). This seems innocuous enough. However, you may notice that slice is already a built-in in python:

slice()

So, if we just had a function we imported called slice(), we’d shadow the built-in and lose easy access to it. This is not something you want to do.

Namespaces solve this issue, because we import our function as ut.slice(), rather than into the global namespace. (Though, generally you should try not to conflict with built-ins.)

So I’ll issue a warning here:

Don’t import a module into the global namespace (``from module import *``) unless you are really sure you will not get a name clash. (And even then, be careful.)

An alternative __init__.py import statement might look something like this:

from . import tecan
from . import LCMS
from . import viz

where you then access a function in tecan.py with ut.tecan.function().
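Repeating the throwaway-package trick from above with this style of __init__.py shows the difference (again, demo_utils2, tecan, and read_plate are hypothetical names for illustration): the function stays behind the submodule's name instead of landing in the package's top-level namespace.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package with `from . import module` in __init__.py.
tmp = Path(tempfile.mkdtemp())
pkg = tmp / "demo_utils2"
pkg.mkdir()

(pkg / "tecan.py").write_text(
    "def read_plate(path):\n"
    "    return f'wrangled {path}'\n"
)
(pkg / "__init__.py").write_text("from . import tecan\n")

sys.path.insert(0, str(tmp))
ut = importlib.import_module("demo_utils2")

# The function now lives one level down, under the submodule name:
print(ut.tecan.read_plate("data.xlsx"))  # wrangled data.xlsx

# and is *not* in the package's top-level namespace:
print(hasattr(ut, "read_plate"))         # False
```

This style is more verbose at the call site, but it makes name clashes between submodules essentially impossible.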

You can find more on __init__.py and package architecture here.

Computing environment

[1]:
%load_ext watermark

%watermark -v -p jupyterlab
CPython 3.7.4
IPython 7.8.0

jupyterlab 1.1.4