The innards of a package

This lesson was developed by Rosita Fu based off of work by Patrick Almhjell.


[1]:
import numpy as np
import bokeh
import chromatose as ct

but first…

What’s the point?

When we write modules, they are typically stored in directories and live in very fixed, stable homes. But if you are going to be re-using code from notebook to notebook or application to application, a proper package saves you the trouble of having to remember the location of said code, and finding a way to either get that into your directory, or brute force changing your sys.path to load it in.

Making a package is a great way to organize your life. Let’s reframe some of Marie Kondo’s words:

“The act of [packaging] is far more than making [code] compact for storage. It is an act of caring, an expression of love and appreciation for the way [your code supports] your lifestyle. Therefore, when we [package], we should put our heart into it, thanking our [code] for protecting our [work].”

Packages revive code that would otherwise fade out of your memory and become obsolete. This is, in all seriousness, very tragic. All your hard work is lost not only to you, but to the rest of the world as well. Packaging that code is a very simple solution. When you have a series of .py files you want to share with your colleagues, modifications and edits can be formally made, and everyone is on the same page. Because you are under version control, you can add and modify as much as you want without the fear of irreversibly breaking anything.

It’s also just fun. The life of a coder oscillates between profound frustration and fleeting triumph. To actually put a project together, no matter how small or simple, is very satisfying and brings lasting joy. Watching methods you’ve written turn blue is delightful. The possibility of someone across the world using your package is a nice reminder that even in very trying times, we are not alone. It also provides enough motivation to sit down and write proper docstrings.

The actual process of bundling healthy, functioning code into a package is really the computational equivalent to filling out a form. The details are a bit technical and the rules seem a bit arbitrary and confining, and most of it consists of populating seemingly bland text files, but I think the result is well worth the busy-work. It also takes significantly less time and brain-power than the actual getting-your-code-to-work part.

Your first package does not need to be perfect, but as long as you are excited about it, future updates and modifications will slowly shape it into something that looks and behaves like it should :-)

General Structure

To create a package for personal use on your own machine, the structure looks something like this:

/pkg_name
  /pkg_name
    __init__.py
    module1.py
    module2.py
    module3.py
    ...
  setup.py
  README.md

But to unleash your code to the world, there are a few more files you have to add. The idea is that other people have to be able to know what environment their machine needs to be in (requirements.txt), and how lift-able your code is legally (LICENSE.md):

/pkg_name
  /pkg_name
    __init__.py
    module1.py
    module2.py
    module3.py
    ...
  setup.py
  requirements.txt
  README.md
  LICENSE.md

It is essential that the name of the root directory be the name of your package, and that there be a subdirectory with the same name. The subdirectory must contain a file __init__.py as well as all the modules that users can access. Additionally, your setup, requirements, README and license all need to exist in the first layer. Treat these file names as fixed: pip and setuptools look for setup.py by exactly that name, and the others are conventions that tools and people rely on to find them. So don’t get cute and rename setup.py to framework.py or requirements.txt to prerequisites.txt. The installer will not understand, and your users will still be looking for the usual files.
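As a minimal sketch of that layout, the snippet below builds the empty skeleton in a temporary directory and lists what it created. The helper name scaffold and the module names are placeholders of my own, not part of any packaging tool.

```python
import tempfile
from pathlib import Path


def scaffold(root, pkg_name, modules=("module1", "module2", "module3")):
    """Create the empty skeleton of a publishable package under `root`.

    `scaffold` is a hypothetical helper for illustration only.
    """
    top = Path(root) / pkg_name      # first layer: setup.py, README.md, ...
    inner = top / pkg_name           # second layer: the importable package
    inner.mkdir(parents=True)
    (inner / "__init__.py").touch()
    for mod in modules:
        (inner / f"{mod}.py").touch()
    for fname in ("setup.py", "requirements.txt", "README.md", "LICENSE.md"):
        (top / fname).touch()
    return top


with tempfile.TemporaryDirectory() as tmp:
    top = scaffold(tmp, "pkg_name")
    # Paths relative to the root directory, with forward slashes on every OS
    layout = sorted(p.relative_to(top).as_posix() for p in top.rglob("*"))
    for entry in layout:
        print(entry)
```

The files are empty; the point is only to show which layer each one lives in.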

You can have miscellaneous directories, but anything outside /pkg_name/pkg_name/ is not part of the importable package: the classes, methods, and variables within will not be accessible to other people. For example, I have a directory called _imgs in my root directory for my README, but you cannot reach it with dot syntax. Similarly, the first layer of NumPy’s repository contains directories called branding/logo (the images it uses for its branding) and doc (resources for its documentation). Neither of these should actually be installed on my computer; it’d be a waste of space for the user. They’re still important for the dressings and utility of the package; they just live in a slightly different place.

Also, NumPy is so large that its GitHub username is also numpy, so the repository lives at numpy/numpy. Don’t be fooled: its package follows the exact same format!

[2]:
np.random         # random/ directory is inside /numpy/numpy
[2]:
<module 'numpy.random' from '/Users/bois/opt/anaconda3/lib/python3.8/site-packages/numpy/random/__init__.py'>
[3]:
np.doc            # doc/ directory is NOT inside /numpy/numpy
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-419bf7ff5ed2> in <module>
----> 1 np.doc            # doc/ directory is NOT inside /numpy/numpy

~/opt/anaconda3/lib/python3.8/site-packages/numpy/__init__.py in __getattr__(attr)
    212                 return Tester
    213             else:
--> 214                 raise AttributeError("module {!r} has no attribute "
    215                                      "{!r}".format(__name__, attr))
    216

AttributeError: module 'numpy' has no attribute 'doc'

Opening Our Package ✂️ 📦 …

Now let’s look at the actual code that goes inside our files.

__init__.py 📂

The existence of this file is critical, but its contents will be relatively simple. You can import your modules here with the following syntax:

"""Docstring goes here!"""

from .module1 import *
from .module2 import *
from .module3 import *

__author__ = 'Pippi Longstocking'
__email__ = 'pippi@package.edu'
__version__ = '0.0.1'
__license__ = 'MIT'
  • Note the dot, and how the actual file name is module1.py, but I’m dropping the .py

  • In the case that you do not want your user to access variables/functions directly via pkg.variable or pkg.function, but instead want them to call pkg.module1.variable or pkg.module2.function (an example of this is numpy’s random module), you can change the import statements to look like:

from . import module1
from . import module2
from . import module3
  • ^^^A way to remember the difference between the two is that the * unfurls the contents of the modules, and when you replace the * with the name of the module, it forces the user to tack on the module’s name.

  • With the first method, your functions are available directly in the top-level namespace of your package; with the second, each module keeps its own namespace. As your package increases in complexity, those separate namespaces help keep Python objects from colliding.

  • The __xxx__ variables, like authorship, email, version, and license information, should be located in this file, and are typically assigned strings. These variables are technically optional, in the sense that Python is not specifically looking for them, but they’re nice to have anyway, so people can conveniently check things like versions and licenses while they’re working. There are creative ways to load in your license: some packages will read in the actual license file line by line and output it.

  • Packages (and not just their methods) have docstrings too! They go at the very very top of the __init__.py file. These will show up for the user when they type pkg?. This is also technically optional, but cool to have. Chromatose sadly does not have a docstring (yet), but bokeh does!

[4]:
ct?
Type:        module
String form: <module 'chromatose' from '/Users/bois/opt/anaconda3/lib/python3.8/site-packages/chromatose/__init__.py'>
File:        ~/opt/anaconda3/lib/python3.8/site-packages/chromatose/__init__.py
Docstring:   <no docstring>

[5]:
bokeh?
Type:        module
String form: <module 'bokeh' from '/Users/bois/opt/anaconda3/lib/python3.8/site-packages/bokeh/__init__.py'>
File:        ~/opt/anaconda3/lib/python3.8/site-packages/bokeh/__init__.py
Docstring:
Bokeh is a Python interactive visualization library that targets modern
web browsers for presentation.

Its goal is to provide elegant, concise construction of versatile graphics,
and also deliver this capability with high-performance interactivity over large
or streaming datasets. Bokeh can help anyone who would like to quickly and
easily create interactive plots, dashboards, and data applications.

For full documentation, please visit: https://docs.bokeh.org
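To make the difference between the two import styles concrete, here is a small sketch that writes a throwaway package to a temporary directory and imports it. The package name demo_pkg and the functions greet and farewell are made up for this illustration; module1 is pulled in with the wildcard style and module2 with the namespaced style.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "demo_pkg"   # hypothetical package name
    pkg_dir.mkdir()
    (pkg_dir / "module1.py").write_text("def greet():\n    return 'hello'\n")
    (pkg_dir / "module2.py").write_text("def farewell():\n    return 'bye'\n")
    (pkg_dir / "__init__.py").write_text(
        '"""A tiny demo package."""\n'
        "from .module1 import *\n"     # unfurls module1 into the package
        "from . import module2\n"      # keeps module2 behind its own name
        "__version__ = '0.0.1'\n"
    )

    # Make the temporary directory importable just long enough to load it
    sys.path.insert(0, tmp)
    try:
        demo_pkg = importlib.import_module("demo_pkg")
    finally:
        sys.path.remove(tmp)

    flat = demo_pkg.greet()                          # wildcard: top level
    nested = demo_pkg.module2.farewell()             # namespaced: via module2
    has_flat_farewell = hasattr(demo_pkg, "farewell")
    doc = demo_pkg.__doc__
    version = demo_pkg.__version__

print(flat, nested, has_flat_farewell)
print(doc, version)
```

The docstring and __version__ written at the top of __init__.py come back as demo_pkg.__doc__ and demo_pkg.__version__, which is what pkg? displays in a notebook.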

moduleX.py 📂

  • Your modules should be python files and not notebooks. The transition from working in a notebook to a .py is pretty seamless; after testing your code out, just copy and paste your imports and functions. Save your markdown for text files or READMEs, and/or turn them into code comments!

  • Each module contains functions and classes that are related, along with utilities that complement each other. Modules often build on one another, so it is perfectly sensible to import one module into another.

  • Organizing your functions into separate modules takes a bit of thinking. It de-clutters huge chunks of code. For my own package, there are some conversion tools that really belong in their own separate module. Since I was not in the room when this was said, I’ll quote Patrick Almhjell quoting Griffin Chure,

“Code you write should be separated by what it does.”

  • To clean things up for the user, you can mark intermediate functions as private by adding an underscore before the function’s name. A wildcard import (from .module import *) skips names that start with an underscore, so these functions never land in your package’s top-level namespace. This is a convention rather than a lock (the function is still reachable through its module), but it is a very simple way to separate what is meant for the user from what is meant for you.

    • For example, I wanted to define a function to clean my plots, getting rid of certain tick marks, etc., but this is so oddly specific to my tastes. So in viz.py:

    def _clean_plot(...):
        ...
    
    def palplot(...):
        ...
    
[6]:
ct.palplot
[6]:
<function chromatose.viz.palplot(palette, plot='all', bg_color='white', alpha=1.0, shuffle=False, scatter_kwargs=None)>
[7]:
ct._clean_plot
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-abc87b1bc755> in <module>
----> 1 ct._clean_plot

AttributeError: module 'chromatose' has no attribute '_clean_plot'

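So even though we can see with our physical eyes that _clean_plot() exists in our module (and is in fact being used by palplot()), it does not show up in the top-level chromatose namespace. Assuming __init__.py uses the wildcard import shown earlier, that is because import * skips names with a leading underscore. The function is not truly locked away (it should still be reachable through its module, as ct.viz._clean_plot), but it stays out of the user’s way. Pretty nifty!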
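Here is a small sketch of what the leading underscore does and does not do. It builds a toy package (the name underscore_demo is made up) whose viz.py borrows the function names from above; the bodies are stand-ins, not chromatose’s actual code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "underscore_demo"   # hypothetical package name
    pkg_dir.mkdir()
    (pkg_dir / "viz.py").write_text(
        "def _clean_plot():\n"
        "    return 'cleaned'\n"
        "\n"
        "def palplot():\n"
        "    return _clean_plot() + ' and plotted'\n"
    )
    # Wildcard import: copies public names only (no __all__ is defined)
    (pkg_dir / "__init__.py").write_text("from .viz import *\n")

    sys.path.insert(0, tmp)
    try:
        pkg = importlib.import_module("underscore_demo")
    finally:
        sys.path.remove(tmp)

    public_result = pkg.palplot()                     # public, uses the helper
    top_level_has_helper = hasattr(pkg, "_clean_plot")
    via_submodule = pkg.viz._clean_plot()             # still reachable this way

print(public_result, top_level_has_helper, via_submodule)
```

If you want finer control than the underscore convention, a module can also define __all__, a list of the names that import * is allowed to export.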

setup.py 📂

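This file must be in the main directory, and it must exist with that name. It contains instructions for setuptools to actually install the package. setuptools is a packaging library that is not part of the standard library, though most Python installations (Anaconda included) come with it. We use the setuptools.setup() function to do the installation.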

import setuptools

with open("README.md", "r") as f:
    long_description = f.read()

setuptools.setup(
    name='pkg_name',
    version='0.0.1',
    author='Pippi Longstocking',
    author_email='pippi@packages.edu',
    description='Pippi is learning how to make packages.',
    long_description=long_description,
    long_description_content_type='text/markdown',
    packages=setuptools.find_packages(),
    install_requires=["numpy","pandas", "bokeh>=1.4.0"],
    classifiers=[
        "Programming Language :: Python :: 3",
        "Operating System :: OS Independent",
    ],
)

Lots of kwargs! Note the important lines are the import statement and the call to setuptools.setup(...). Everything else (file-reading, string manipulation) is just window dressing. You only have to think about most of the kwargs here once you’re ready to share and publish your package. I include the details anyway; hopefully it’ll be a solid starting reference for you later on.

  • name: distribution name of your package, can be any name with letters, numbers, _ and -

    • if you want to publish to PyPI, the name must be original!

  • version: There are many different versioning schemes, e.g., semantic versioning, date-based versioning, local version identifiers, etc.

    • see some guidelines on different schemes here.

    • Local version identifiers might be the easiest to wrap your head around, but with larger collaborative projects, versioning will be a topic of discussion with a group of people, and likely not a decision you will have to make yourself.

  • url: URL for the homepage of the project, the github repo will do, but if you prepare some fancy documentation, that’s even better!

  • packages: list of all Python import packages that should be included in the distribution package. Instead of listing all import packages manually, we can use the find_packages() function in setuptools to automatically find ’em all.

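    • distribution package vs. import package: an import package is what you import in code, i.e. any directory containing an __init__.py file. numpy is one, and so is np.random, which has its own __init__ file. A distribution package is the bundle that gets uploaded to PyPI and installed with pip, and it can contain several import packages. kind of a meta concept.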

  • install_requires: should specify what dependencies a project minimally needs to run, only include what’s needed

    • when users install via pip, this is the specification that is used to install dependencies

    • note that this is a list of strings of dependencies, equalities and/or inequalities can be used

    • the fewer constraints & dependencies you have here the better

    • put the full list of requirements in the requirements.txt file

  • description: a short, one-sentence summary of the package

  • long_description: a detailed description, typically loaded from README.md

  • long_description_content_type: tells the index what type of markup is used for the long description. markdown is a common choice.

  • classifiers: provides PyPI (and anyone browsing it) with additional metadata

    • here we’re saying our package targets Python 3 and is OS-independent (it should run on macOS, Windows, Linux, etc.). Some additional things you could add are the Topics related to your work

    • find a full list of classifiers here

    • we can see the descriptions and classifiers on PyPI.

See more setuptools arguments here.

This part looks deceptively long, but with some copy-pasting from pre-existing packages, the setup.py should not take you more than a minute or two!
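Once a package is installed, the name, version, and description you passed to setuptools.setup() can be read back with the standard library’s importlib.metadata (Python 3.8+). A minimal sketch follows; the helper name describe is my own, and whether numpy or bokeh is found depends on your environment.

```python
from importlib import metadata


def describe(dist_name):
    """Return (version, summary) for an installed distribution, or None.

    `describe` is a hypothetical helper; `dist_name` is the distribution
    name, i.e. the `name` kwarg given to setuptools.setup().
    """
    try:
        info = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return None
    # "Summary" is where the short `description` kwarg ends up
    return info.get("Version"), info.get("Summary")


# The last name is deliberately bogus to show the not-installed case
for name in ("numpy", "bokeh", "surely-not-installed-pkg-name"):
    print(name, "->", describe(name))
```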

requirements.txt 📂

This file is only necessary when you’re publishing your code for others to use. It should be located in the main directory as well. By convention it is called requirements.txt, but the naming is not rigorous: pip install -r accepts whatever file name you give it.
This will be a really simple text file. Each line contains the name of a dependency, and equalities/inequalities for their supported versions.
numpy
scipy
pandas
bokeh>=1.4.0
...

Version specification:

  • numpy (no version specifier): your package supports all versions

  • bokeh==1.4.0: must be exactly version 1.4.0

  • bokeh>=1.4.0: the minimum version is 1.4.0

  • bokeh~=1.4.0: a “compatible release”, equivalent to >=1.4.0 combined with ==1.4.* (so 1.4.5 is fine, 1.5.0 is not)

  • bokeh!=1.4.0: version exclusion, anything but 1.4.0
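The ~= operator is the one that tends to surprise people, so here is a hand-rolled sketch of its rule for plain dotted-integer versions. This is a simplification for illustration, not pip’s actual implementation: real version strings can also carry pre-release or local tags, which this ignores. The function names are my own.

```python
def parse(version):
    """Turn '1.4.0' into (1, 4, 0). Handles plain dotted integers only."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate, spec):
    """True if `candidate` satisfies `~=spec` (simplified sketch).

    ~=1.4.0 means: at least 1.4.0, and everything before the last
    component of the specifier (here '1.4') must match exactly.
    """
    spec_t = parse(spec)
    if len(spec_t) < 2:
        raise ValueError("~= needs at least two components, e.g. 1.4")
    cand_t = parse(candidate)
    # Pad with zeros so that '1.4' and '1.4.0' compare as equal
    width = max(len(cand_t), len(spec_t))
    cand_p = cand_t + (0,) * (width - len(cand_t))
    spec_p = spec_t + (0,) * (width - len(spec_t))
    prefix = spec_t[:-1]
    return cand_p >= spec_p and cand_p[:len(prefix)] == prefix


for candidate in ("1.3.9", "1.4.0", "1.4.5", "1.5.0"):
    print(candidate, satisfies_compatible_release(candidate, "1.4.0"))
```

Note that the number of components matters: ~=1.4 only pins the major version, so 1.9 would be accepted, while ~=1.4.0 pins 1.4.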

Although pip will check install_requires in setup.py, requirements.txt is still a really convenient file to have. Once you start thinking about the user, you’ll want to test your package in a virtual environment, which mimics a fresh environment on a different machine, and requirements.txt lets you populate that environment with a single command.

Setting up a virtual environment

It only takes two commands! The first creates the environment (replace name_of_virtual_env with anything you like), and the second activates it. Your command line prompt will then be prefixed with (name_of_virtual_env).

python3.x -m venv name_of_virtual_env

source name_of_virtual_env/bin/activate

When activated, you can install packages in requirements.txt with the following command:

pip install -r requirements.txt

This will install your package dependencies in a fresh environment, and ignore (for the most part) the configurations of your local machine. When you’re done testing your code, deactivate it simply with

deactivate
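If you would rather stay in Python, the standard library’s venv module (the same one behind python -m venv) can create the environment for you. A minimal sketch, using a temporary directory so nothing is left behind; note that with_pip=False skips installing pip into the environment, whereas the command-line version installs it by default.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "name_of_virtual_env"
    # Same effect as `python -m venv name_of_virtual_env`, minus pip
    venv.create(env_dir, with_pip=False)
    # Every virtual environment gets a pyvenv.cfg file at its root
    cfg_exists = (env_dir / "pyvenv.cfg").exists()
    contents = sorted(p.name for p in env_dir.iterdir())

print(cfg_exists)
print(contents)
```

Activation still happens in the shell, since it works by modifying the shell’s own environment.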

README.md 📂

READMEs are lovely. The name commands so much attention, and it is what I typically open first when a foreign package arrives on my screen. For relatively small packages, your documentation goes here. Another common file extension is .rst (reStructuredText).

LICENSE.txt 📂

Every package uploaded to PyPI should have a license. GitHub has a website to help you choose. Licenses are typically plain text files (hence LICENSE.txt), but markdown (LICENSE.md, as in the layout above) works as well.