(c) 2018 John Ciemniecki and Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This lesson was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import pandas as pd
import altair as alt
alt.data_transformers.enable('json')
This lesson is all about style. Style in the general sense of the word is very important. It can have a big effect on how people interact with a program or software and therefore your data analysis workflow. As an example, we can look at the style of data presentation.
The Keeling curve is a measure of the carbon dioxide concentration on top of Muana Loa over time. Let's look at a plot of the Keeling curve.
I contend that this plot is horrible looking. The green color is hard to see. The dashed curve is difficult to interpret. We do not know when the measurements were made. The grid lines are obtrusive. Awful.
Lest you think this plot is a ridiculous way of showing the data, I can tell you I have seen plots just like this in the literature. Now, let's look at a nicer plot.
Here, it is clear when the measurements were made. The data are clearly visible. The grid lines are not obtrusive. It is generally pleasing to the eye. As a result, the data are easier to interpret. Style matters!
The same arguments about style are true for code. Style matters! Having a well-defined style also keeps your code clean, easy to read, and therefore easier to debug and share.
Before we get any deeper into the philosophy of coding style, let's watch a video that has nothing, and yet everything, to do with it.
%%HTML
<iframe width="420" height="315" src="https://www.youtube.com/embed/_PTvPwnwf-A"></iframe>
We never want to have that regret about interpreting our hard-earned scientific work. How can we best avoid being like Past Ted?
The book, The Art of Readable Code by Boswell and Foucher is a treasure trove of tips about writing well-styled code. At the beginning of their book, they state the Fundamental Theorem of Readability.
Code should be written to minimize the time it would take for someone else to understand it.
This is in general good advice, and this is the essential motivation for using the suggestions in PEP8. Before we dive into PEP8, I want to introduce you to the most important person in the world, Future You. When you are writing code, the person at the front of your mind should be Future You. You really want to make that person happy. Because as far as coding goes, Future You is really someone else, and you want to minimize the time it takes for Future You to understand what Present You (a.k.a. you) did.
Guido van Rossum is the inventor of Python and was the benevolent dictator for life (BDFL) until he stepped down in July of 2018. As new features were added to Python, Guido and a team of others either write or (usually) consider a Python Enhancement Proposal, or a PEP. Each PEP is carefully reviewed, and often there are many iterations with the PEP's author(s).
Perhaps the best-known PEPs are PEP 8 and PEP 20. This lesson is about PEP 8, but we'll pause for a moment to look at PEP 20 to understand why PEP 8 is important. PEP 20 is "The Zen of Python." You can see its text by running import this
.
import this
These are good ideas for coding practice in general. Importantly, beautiful, simple, readable code is a goal of a programmer. That's where PEP 8 comes in. PEP8 is the Python Style Guide, written by Guido, Barry Warsaw, and Nick Coghlan. You can read its full text in the Python PEP index, and I recommend you do that. I also recommend you follow everything it says! It helps you a lot. Trust me; my life got much better after I started following PEP 8's rules.
Note, though, that your code will work just fine if you break PEP 8's rules. In fact, some companies have their own style guides. For example, Google's own style was deprecated and replaced by a much more PEP8 adherent style.
People will sometimes say that one should aim to make their code appear poetic. What is that supposed to mean?
Let's use a poem by William Carlos Williams as an example.
Here's some nonsense:
So much depends upon a red wheel barrow
glazed with rain water beside the white chickens.
And here's a hauntingly beautiful poem:
so much depends
upon
a red wheel
barrow
glazed with rain
water
beside the white
chickens.
Same words. But one of those structures turns the words into a boring sentence conveying no emotion, the other a world-famous poem that evokes a forlorn nostalgia for childhood. In poetic theory, the reason for why this occurs is because the line breaks affect the way poems are read in our minds. Breaks make us slow down our reading pace, process what we've read more thoughtfully, and can help emphasize patterns within the word order. To accomplish this, the poet needs to be simultaneously thoughtful about (1) what he/she wants to convey and (2) how best to convey it for an audience.
Here's some psuedocode to bring us back to programming. Which of these conveys the imagery of the poem better?
so_much.depends(rain_water.glaze(red_wheel_barrow) | white_chickens)
versus
wheel_barrow = red
chickens = white
water = rain
water.glaze(wheel_barrow)
so_much.depends(wheel_barrow | chickens)
While there is a certain satisfaction to writing complex one-liners that work, it's hopefully clear why we want our code to consist of short, well-spaced lines like poetry: a consistent, spacious style reinforces comprehension.
So when we write code, we have not one but TWO goals:
(1) Having the computer understand what you want it to do
(2) Having other humans understand what you're asking the computer to do
How do we accomplish both with our Python code? Here's where the suggestions of PEP 8 come in.
PEP 8 is extensive, but here are some key points for you to keep in mind as you are being style-conscious.
l
, O
, or I
because they are hard to distinguish from ones and zeros.x**2 + y**2
. Low precedence operators should have space.f(x, y=4)
..py
file.Here's an example, the dictionary mapping single-letter residue symbols to the three-letter equivalents.
aa = { 'A' : 'Ala' , 'R' : 'Arg' , 'N' : 'Asn' , 'D' : 'Asp' , 'C' : 'Cys' , 'Q' : 'Gln' , 'E' : 'Glu' , 'G' : 'Gly' , 'H' : 'His' , 'I' : 'Ile' , 'L' : 'Leu' , 'K' : 'Lys' , 'M' : 'Met' , 'F' : 'Phe' , 'P' : 'Pro' , 'S' : 'Ser' , 'T' : 'Thr' , 'W' : 'Trp' , 'Y' : 'Tyr' , 'V' : 'Val' }
My god, that is awful. The PEP 8 version, where we break lines to make things clear, is so much more readable.
aa = {'A': 'Ala',
'R': 'Arg',
'N': 'Asn',
'D': 'Asp',
'C': 'Cys',
'Q': 'Gln',
'E': 'Glu',
'G': 'Gly',
'H': 'His',
'I': 'Ile',
'L': 'Leu',
'K': 'Lys',
'M': 'Met',
'F': 'Phe',
'P': 'Pro',
'S': 'Ser',
'T': 'Thr',
'W': 'Trp',
'Y': 'Tyr',
'V': 'Val'}
How about custom functions? Consider the quadratic formula.
def qf(a, b, c):
return -(b-np.sqrt(b**2-4*a*c))/2/a, (-b-np.sqrt(b**2-4*a*c))/2/a
It works just fine.
qf(2, -3, -9)
But it is illegible. Let's do a PEP 8-ified version.
def quadratic_roots(a, b, c):
"""Real roots of a second order
polynomial of the form ax^2+bx+c."""
# Compute square root of the discriminant
sqrt_disc = np.sqrt(b**2 - 4*a*c)
# Compute two roots
root_1 = (-b + sqrt_disc) / (2*a)
root_2 = (-b - sqrt_disc) / (2*a)
return root_1, root_2
And this also works!
quadratic_roots(2, -3, -9)
PEP8 does not comment extensively on line breaks. I have found that choosing how to do line breaks is often one of the more challenging aspects of making readable code. The Boswell and Foucher book spends lots of space discussing it. There are lots of considerations for choosing line breaks. One of my favorite discussions on this is this blog post from Trey Hunner. It's definitely worth a read, and is about as concise as anything I could put here in this lesson.
PEP 8 gets the aesthetics and readability down, but that style is just the first step in the communication that is possible with Jupyter notebooks.
Let's start with some examples of code from your first homework assignment NOT adhering to PEP 8:
#Plot ant tracks ASAP!
df=pd.read_csv('../data/ant_joint_locations.csv',comment='#')
df['x (mm)']=df['x_coord']*0.08 # mm / pixel
df['y (mm)']=df['y_coord']*0.08
df['ID mod 6']=df['ID']% 6
xax = alt.Scale(domain=[0,20])
chart=alt.Chart(df.loc[df['bodypart']=='head', :],height=150,width=150).mark_circle(size=5,opacity=0.1).encode(x=alt.X('x (mm):Q',scale=xax),
y=alt.Y('y (mm):Q',scale=xax),
color=alt.Color('beetle_treatment:N',title='beetle genus'),
column=alt.Column('beetle_treatment:N', title='beetle genus'),
row='ID mod 6:O')
Ew, this code just looks gross. But as data scientists, there's an even bigger issue here.
Because of the ambiguity of the author's intent with these lines of code, understanding it requires either remembering or actively interpreting exactly what every line accomplishes. If you or someone else comes back to this code, there's a high potential for miscommunication or confusion.
Some lines will remain understandable because of the transparency of python syntax; e.g., it's easy to remember what pd.read_csv() does. You do not need comments for such lines of code. In fact, excessive commenting in this case can clutter your code.
Comments are especially useful when describing why you made a decision in your code; that is when describing intent. In the above example, why use the mod 6 operator? What was the purpose was of instantiating an alt.Scale object? Will you have any idea why you did these things after three months of wet-lab research on another project? Will your PI, labmates, or collaborators?
Compare that to the PEP 8-ified version:
# Plot tracks of individual ants in the presence of
# Dalotia or Sceptobius beetles for comparison
df = pd.read_csv('../data/ant_joint_locations.csv', comment='#')
# Create new columns with x-y coordinates in mm units
interpixel_distance = 0.08
df['x (mm)'] = df['x_coord'] * interpixel_distance
df['y (mm)'] = df['y_coord'] * interpixel_distance
# IDs are unique, but we want ID-beetle treatment pairs for
# ease of plotting. Conveniently, there are six of each treatment.
df['ID mod 6'] = df['ID'] % 6
# Thorax tracking is most useful; keep df small
thorax_df = df.loc[df['bodypart']=='head', :]
# Need to specify domain to get square aspect ratio
axis_scale = alt.Scale(domain=[0, 20])
# Plot the thorax data of both beetle treatments,
# keep height and width same for square aspect
alt.Chart(thorax_df,
height=150,
width=150
).mark_circle(
size=5,
opacity=0.1
).encode(
x=alt.X('x (mm):Q', scale=axis_scale),
y=alt.Y('y (mm):Q', scale=axis_scale),
color=alt.Color('beetle_treatment:N', title='beetle genus'),
column=alt.Column('beetle_treatment:N', title='beetle genus'),
row='ID mod 6:O'
)
The descriptive variable names, the spacing, the appropriate comments all make it much more readable and aesthetically pleasing. And as scientists we're happy because there is little if any ambiguity in the author's intent with each line of code. This makes it understandable, accessible to fruitful debate, and easier to catch mistakes in.
Speaking of which, do you see the mistake in this code which was veiled by ambiguity before?
While this style of coding is great, we can do even better. Let's compare further to the modern Jupyter notebook-ified version (AKA, TA/PI-friendly version 👍):
This experiment aimed to distinguish differences in ant behavior in the presence of a beetle from the genus Sceptobius, which secretes a chemical that modifies the behavior of the ant, versus in the presence of a beetle from the species Dalotia, which does not.
In the experiment, an ant and a beetle were placed in a circular arena and recorded with video at a frame rate of 28 frames per second. The positions of the body parts the ant are tracked throughout the video recording.
We first read in the data frame.
df = pd.read_csv('../data/ant_joint_locations.csv', comment='#')
To convert the units from pixels to millimeters, we need to multiply the x_coord
and y_coord
columns by the interpixel distance.
interpixel_distance = 0.08 # mm
df['x (mm)'] = df['x_coord'] * interpixel_distance
df['y (mm)'] = df['y_coord'] * interpixel_distance
Now, we will find out which ant IDs correspond to which beetle treatment. We can do this by performing a groupby()
operation on the data frame and then noting the groups.
df.groupby(['ID', 'beetle_treatment']).groups.keys()
So, we have found that ants zero through five are paired with Dalotia rove beetles and ants 6 through 11 with Sceptobius. For plotting convenience, we will make a new column in the data frame which is a mod 6 ID. That is, the beetles are now numbered zero through five and are distinguished by the corresponding beetle_treament
value.
df['ID mod 6'] = df['ID'] % 6
Now we can conveniently make a plot with the two genuses of rove beetle side by side. It is important to set the domain of the axes and the height and width of the plots to keep the aspect ratios appropriate so as not to distort spatial dimensions.
I will also choose to plot the trajectories using semitransparent circle marks are small (5 pixels). This way, I can see when the ant walks and when it remains in one place. Normally, I would make the plot interactive, but I am not going to do that here for performance reasons.
# Thorax tracking is most useful; keep df small
thorax_df = df.loc[df['bodypart']=='thorax', :]
# Need to specify domain to get square aspect ratio
axis_scale = alt.Scale(domain=[0, 20])
# Plot the thorax data of both beetle treatments,
# keep height and width same for square aspect
alt.Chart(thorax_df,
height=150,
width=150
).mark_circle(
size=5,
opacity=0.1
).encode(
x=alt.X('x (mm):Q', scale=axis_scale),
y=alt.Y('y (mm):Q', scale=axis_scale),
color=alt.Color('beetle_treatment:N', title='beetle genus'),
column=alt.Column('beetle_treatment:N', title='beetle genus'),
row='ID mod 6:O'
)
This code is not only highly explicit but also explanatory. The Jupyter Notebook allows for a lab notebook style of coding, which is as powerful for communicating and recording data analysis as programming is for reproducing it.
But this is of course dependent on the author putting the effort in to make it so. It will be tedious at first, but as with anything else, over time it will become second-hand. The benefits are tangible: it is extremely satisfying to come back to complex code you wrote six months ago and understanding it all again within minutes.
Perhaps even more satisfying is when your PI walks by, asks to see what you're doing after a few hectic months with disappointingly little progress, and near-immediately understands your poetic code while marveling at your hypnotizingly beautiful plots. It always pays to make your code look like it was made by a data wizard.