(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
import glob
import numpy as np
import pandas as pd
Before you can apply your Pandas ninja skills to your data sets, you need to make sure they contain what you think they do. Data validation is an important part of the data analysis pipeline. When data come to you from your data source (e.g., off of an instrument or from a collaborator), you should first verify that they have the structure you expect and that they are complete. Furthermore, you should check to make sure the data make sense. For example, you should never see a negative absorbance measurement.
You probably already do this just by looking at your data. While this is important, you really should automate the process so that you know certain guarantees about your data before you start working with it.
In this tutorial, I will present some basic principles of data validation and then work through a couple of examples. If you want to learn more about data validation, Eric Ma is developing some tutorials on the topic. You can find his materials in this repository. He goes into significantly more depth than I do here, including introducing the wonderful pytest module, so it is worth your time to work through his materials.
For a nice introduction to the philosophy, you might want to watch Eric's 3-minute 30 second talk on the subject (don't worry, his talk is only the first 3:30 of the video; it's not 48 minutes long).
As you work with a type of data, your validation suite will grow. You will have more and more confidence that your incoming data sets are what you think they are. You are systematically eliminating sources of error in your analyses.
For our first demonstration of how to build a validation suite, we will validate a set of flow cytometry data. The data files may be downloaded here. These are a subset of this data set from Razo, et al., Tuning transcriptional regulation through signaling: A predictive theory of allosteric induction.
Each file contains flow cytometry data for a bacterial strain with a ribosomal binding site (RBS) modification used to tune repressor copy number. Each CSV file consists of four columns, FSC-A, SSC-A, FITC-A, and gate. Each row represents a measurement for a putative single cell. (The data set is tidy.) The FSC-A column contains the forward scattering value, the SSC-A column contains the side scattering value, and the FITC-A column contains a fluorescence intensity. Finally, the gate column contains a 1 or 0 dictating whether or not that given measurement is to be included in the analysis.
In writing the tests for this particular data set, it is convenient to assume we have already loaded the data set into a Pandas DataFrame. Testing will obviously fail if we cannot read in the data (which is kind of a zeroth-order test).
For our first test, we will make sure all of the columns are present and correct.
def test_column_names(df, fname):
    """Ensure DataFrame has proper columns."""
    column_names = ['FSC-A', 'SSC-A', 'FITC-A', 'gate']
    assert list(df.columns) == column_names, fname + ' has wrong column names.'
Notice the assert statement. In Python, an assert statement checks whether a Boolean expression evaluates True (in this case, that the columns of the DataFrame match the correct column names). If the Boolean expression does evaluate True, nothing happens. If it does not, an AssertionError is raised, with the string following the comma as its error message. Let's try testing one of the data sets.
fname = '../data/validation/flow_data/20160804_0_RBS1027_0.0.csv'
# Load in the data
df = pd.read_csv(fname)
# Check the column names
test_column_names(df, fname)
Nothing happened. That means that the test passed. Let's mess with the DataFrame to see what a failure looks like.
# Change one of the column names
df = df.rename(columns={'FSC-A': 'FSC'})
# Run test again
test_column_names(df, fname)
Now we get an AssertionError. We now have a failure.
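The same pass-or-raise behavior can be seen in isolation, without the flow data. This is only a minimal sketch; the function check_positive and its message are hypothetical and made up for illustration.

```python
def check_positive(x):
    """Raise an AssertionError with a message if x is not positive."""
    assert x > 0, 'expected a positive number, got ' + str(x)


# Passes silently because the Boolean expression evaluates True
check_positive(3)

# Fails; we catch the error here only to show the attached message
try:
    check_positive(-1)
except AssertionError as err:
    message = str(err)

print(message)
```

Keep in mind that Python skips assert statements entirely when run with the -O flag, so validation code that relies on them should not be run that way.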
We can write other tests as well. For example, we might want to enforce that we have no missing data, so we have no NaNs in the DataFrame. We also know that we cannot have negative scattering or fluorescence, at least not in something that is gated. Furthermore, we know that the gate column can contain only ones and zeros.
def test_missing_data(df, fname):
    """Look for missing entries."""
    assert np.all(df.notnull()), fname + ' contains missing data'

def test_gate(df, fname):
    """Make sure all gating entries are 0 or 1."""
    assert ((df['gate'] == 0) | (df['gate'] == 1)).all(), \
        fname + ' contains gate entries other than 0 or 1'

def test_negative(df, fname):
    """Look for negative scattering or fluorescence values in gated cells."""
    assert np.all(df.loc[df.gate==1, ['FSC-A', 'SSC-A', 'FITC-A']] >= 0), \
        fname + ' contains negative scattering data'
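Before pointing tests like these at real files, it can help to confirm that they fail when they should. The sketch below does this on a small fabricated DataFrame. The names check_missing_data, check_gate, and fails are hypothetical; the first two are compact restatements of the tests above, renamed so the block runs on its own without shadowing them.

```python
import numpy as np
import pandas as pd


def check_missing_data(df, fname):
    """Look for missing entries."""
    assert df.notnull().all().all(), fname + ' contains missing data'


def check_gate(df, fname):
    """Make sure all gating entries are 0 or 1."""
    assert df['gate'].isin([0, 1]).all(), fname + ' has invalid gate entries'


def fails(test, df):
    """Return True if the test raises an AssertionError on df."""
    try:
        test(df, 'fabricated')
    except AssertionError:
        return True
    return False


# Fabricated data, not taken from the real data set
good = pd.DataFrame({'FSC-A': [1.0, 2.0], 'SSC-A': [3.0, 4.0],
                     'FITC-A': [5.0, 6.0], 'gate': [1, 0]})

# Deliberately break one thing at a time
bad_missing = good.copy()
bad_missing.loc[0, 'FSC-A'] = np.nan

bad_gate = good.copy()
bad_gate.loc[1, 'gate'] = 2

print(fails(check_missing_data, good), fails(check_missing_data, bad_missing))
print(fails(check_gate, good), fails(check_gate, bad_gate))
```

Each line should print False for the clean data and True for the broken copy. A test that cannot be made to fail is not checking anything.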
Let's try running these three tests, along with the column name test.
# Load in the data
df = pd.read_csv(fname)
# Perform tests
test_column_names(df, fname)
test_missing_data(df, fname)
test_gate(df, fname)
test_negative(df, fname)
That data set passed them all! It will be convenient at this point to write a function that loads in a data set, and then performs these tests.
def test_flow(fname):
    """Run a gamut of tests on a flow data set."""
    df = pd.read_csv(fname)
    test_column_names(df, fname)
    test_missing_data(df, fname)
    test_gate(df, fname)
    test_negative(df, fname)

test_flow('../data/validation/flow_data/20160804_0_RBS1027_0.0.csv')
We would of course like to automate testing multiple files in one go. The glob module is useful for this. This is best seen first by example.
# Pattern to match
pattern = '../data/validation/flow_data/*RBS1027*.csv'
# Glob it!
glob.glob(pattern)
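If you do not have the flow data on hand, the following sketch shows the same kind of pattern matching in a temporary directory. The file names created here are made up for illustration; only the first mirrors a file name used in this tutorial.

```python
import glob
import os
import tempfile

# Build a throwaway directory containing empty files with hypothetical names
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ['20160804_0_RBS1027_0.0.csv',
                 '20160804_0_RBS1027_5.0.csv',
                 '20160804_0_RBS446_0.0.csv',
                 'notes_RBS1027.txt']:
        open(os.path.join(tmpdir, name), 'w').close()

    pattern = os.path.join(tmpdir, '*RBS1027*.csv')

    # glob.glob returns paths in arbitrary order, so sort them;
    # strip the directory part to keep the output readable
    matches = sorted(os.path.basename(f) for f in glob.glob(pattern))

print(matches)
```

Only the two CSV files containing RBS1027 are returned. The RBS446 file and the .txt file do not match the pattern.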
Running glob.glob on a pattern string returns a list of all file names that match the pattern. Here, we used the wildcard character (*) to find all files whose names contain the string RBS1027, preceded and followed by anything, so long as they have a .csv extension. So, we could loop over all of these files to test them. Let's modify our test_flow() function to do that.
def test_flow(pattern):
    """Validate all files matching pattern for flow specification."""
    filenames = glob.glob(pattern)
    for fname in filenames:
        df = pd.read_csv(fname)
        test_column_names(df, fname)
        test_missing_data(df, fname)
        test_negative(df, fname)

    # If we get here, all tests pass
    print(len(filenames), 'files passed.')
Let's give this a whirl!
test_flow('../data/validation/flow_data/*RBS1027*.csv')
assert statements are convenient, and having many separate tests with assert statements is useful if you are going to use pytest. But they can be annoying if your data set might have errors you want to be aware of, but that should not necessarily abort the analysis. As an example, let's rewrite our test for negative values to also include cells that were not gated.
def test_negative(df, fname):
    """Look for negative scattering values in all cells."""
    assert np.all(df >= 0), fname + ' contains negative scattering data'
Now, if we rerun our tests, we will get an AssertionError.
test_flow('../data/validation/flow_data/*RBS1027*.csv')
The very first file failed because it had some negative values. With this real data set, the researchers discovered this, and then asked the manufacturer why some fluorescence values were negative. The manufacturer said it had something to do with the calibration of the instrument. (This is exactly the type of thing data validation is supposed to catch!) So, we may expect some negative values, but may still want to use the data set. We may therefore wish to do a little more customization, providing our own messages for the errors without aborting the entire test. This is more work, but we can put it together.
Before we do, I pause to note that this may not be a great strategy. We may want strong error messages whenever something is wrong with our data, and it might be better to build the tolerance (such as allowing values that are only slightly negative) into the tests themselves. Having errors such as AssertionErrors also allows easier use of tools like pytest. It is up to you to think carefully about what your best testing practices are.
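As a sketch of what building the tolerance into a test might look like, the function below raises an error only if some measurement is more negative than a chosen threshold. The function name check_not_too_negative, the parameter tol, and the values used here are assumptions for illustration, not values from the original analysis.

```python
import pandas as pd


def check_not_too_negative(df, fname, tol=1.0):
    """Fail only if some measurement is more negative than -tol.

    tol is a hypothetical tolerance in the instrument's units.
    """
    cols = ['FSC-A', 'SSC-A', 'FITC-A']

    # Most negative (smallest) value across the three measurement columns
    min_val = float(df[cols].min().min())

    assert min_val >= -tol, (
        '{0:s} has a value of {1:g}, below the tolerance of {2:g}'
        .format(fname, min_val, -tol))


# Fabricated data with one slightly negative fluorescence value
df = pd.DataFrame({'FSC-A': [10.0, 12.0], 'SSC-A': [8.0, 9.0],
                   'FITC-A': [-0.5, 30.0], 'gate': [0, 1]})

# Passes silently: -0.5 is within a tolerance of 1.0
check_not_too_negative(df, 'fabricated', tol=1.0)

# Fails: -0.5 is outside a tolerance of 0.1
try:
    check_not_too_negative(df, 'fabricated', tol=0.1)
except AssertionError as err:
    print(err)
```

This keeps the test as a hard failure, so it still works with pytest, while encoding what you have learned about the instrument in a single, visible parameter.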
For this example, I will add more detailed messaging only to the test_negative() function and keep the other tests as throwing errors.
def n_negative_col(df, col):
    """Return number of negative entries for gated and non-gated cells"""
    n_gated = (df.loc[df['gate']==1, col] < 0).sum()
    n_nongated = (df.loc[df['gate']==0, col] < 0).sum()
    return n_gated, n_nongated

def test_negative(df, fname):
    """Look for negative scattering values in all cells."""
    passed = True

    # Check each column for negative values and print result
    for col in ['FSC-A', 'SSC-A', 'FITC-A']:
        n_gated, n_nongated = n_negative_col(df, col)
        if n_gated > 0 or n_nongated > 0:
            msg = ( ('{0:s} had {1:d} ungated and {2:d} gated'
                     + ' negative entries in the {3:s} column')
                   .format(fname, n_nongated, n_gated, col))
            print(msg)
            passed = False

    return passed
Now that we have this function, we should also adjust the test_flow() function.
def test_flow(pattern):
    """Validate all files matching pattern for flow specification."""
    filenames = glob.glob(pattern)
    n_passed = 0
    n_failed = 0
    for fname in filenames:
        df = pd.read_csv(fname)
        test_column_names(df, fname)
        test_missing_data(df, fname)
        if test_negative(df, fname):
            n_passed += 1
        else:
            n_failed += 1

    # Report results
    print('\n*************************************')
    print(n_passed, 'files passed.')
    print(n_failed, 'files failed')
Now let's try running the tests.
test_flow('../data/validation/flow_data/*RBS1027*.csv')
Now we see that every file had failures because of negative entries, though no failures for gated cells.
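Printed messages are fine for a quick look, but the same counts can also be collected into a tidy DataFrame for later inspection. The following is only a sketch under stated assumptions: the function negative_report and its input, a dictionary mapping file names to DataFrames, are hypothetical, and the data are fabricated rather than read from the real files.

```python
import pandas as pd


def negative_report(frames):
    """Count negative entries per file, column, and gate value.

    frames: dict mapping a file name to its DataFrame (hypothetical input).
    """
    rows = []
    for fname, df in frames.items():
        for col in ['FSC-A', 'SSC-A', 'FITC-A']:
            for gate in [0, 1]:
                n_neg = int((df.loc[df['gate'] == gate, col] < 0).sum())
                rows.append({'file': fname, 'column': col,
                             'gate': gate, 'n_negative': n_neg})

    return pd.DataFrame(rows, columns=['file', 'column', 'gate', 'n_negative'])


# Fabricated data standing in for a file read with pd.read_csv
frames = {'fabricated.csv': pd.DataFrame({'FSC-A': [1.0, -2.0, 3.0],
                                          'SSC-A': [1.0, 2.0, 3.0],
                                          'FITC-A': [-1.0, -2.0, 3.0],
                                          'gate': [0, 0, 1]})}

report = negative_report(frames)

# Show only the combinations that actually had negative entries
print(report[report['n_negative'] > 0])
```

With a report like this, a statement such as "no failures for gated cells" becomes a one-line check on the rows where gate is 1.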
As a second example, we will test a directory of images. You can download the images here. This is a fabricated set of images meant to simulate a time series of TIF images.
We would like to verify the following.
File names are of the form im_000013.tif. This is important because if we are going to automate loading in these images, they need to have the correct file names.
For this example, I will illustrate how to use pytest to test the directory. You simply place a .py file somewhere in the directory you want tested; by default, pytest collects files named test_*.py or *_test.py. When you run pytest, it will sniff out any function whose name begins with test_ and run the function. It will then give you a report of the errors you encountered. To see how this works, on the command line, go into the directory containing the images and run
pytest -v
and pytest will do the rest! You can look at the file test_image_collection.py to see how the tests were constructed. The principles are exactly the same as for the flow data.
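To give a flavor of what such a test file can contain, here is a minimal sketch of a file name check. It is not the content of test_image_collection.py. The naming convention (im_ followed by six digits and .tif) is an assumption inferred from the example im_000013.tif, and the helper invalid_names is hypothetical.

```python
import glob
import os
import re

# Assumed naming convention, inferred from the example im_000013.tif
NAME_PATTERN = re.compile(r'im_\d{6}\.tif')


def invalid_names(filenames):
    """Return the file names that do not match the assumed convention."""
    return [f for f in filenames
            if NAME_PATTERN.fullmatch(os.path.basename(f)) is None]


def test_file_names():
    """All TIF files in the current directory follow the naming convention."""
    bad = invalid_names(glob.glob('*.tif'))
    assert len(bad) == 0, 'Badly named files: ' + ', '.join(bad)
```

Saved under a name such as test_names.py (a hypothetical file name) in the image directory, the function test_file_names() would be found and run by pytest because its name begins with test_. The helper is kept separate so that it can be checked on a plain list of names without touching the file system.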
You have a computer. It can automate checking and validating your data. Do it. It will save you headaches when you are doing an analysis and wondering why things might be looking weird. More importantly, when things don't look weird in your analysis, but have a problem nonetheless, formal automated data validation can catch errors and thereby make your work more reproducible.