Tutorial 3a: Data validation

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import glob

import numpy as np
import pandas as pd

Before you can apply your Pandas ninja skills to your data sets, you need to make sure they contain what you think they do. Data validation is an important part of the data analysis pipeline. When data come to you from your data source (e.g., off of an instrument or from a collaborator), you should first verify that they have the structure you expect and that they are complete. Furthermore, you should check to make sure the data make sense. For example, you should never see a negative absorbance measurement.

You probably already do this just by looking at your data. While just looking is important, you really should automate the process so that you have certain guarantees about your data before you start working with them.

In this tutorial, I will present some basic principles of data validation and then work through a couple of examples. If you want to learn more about data validation, Eric Ma is developing some tutorials on the topic. You can find his materials in this repository. He goes into significantly more depth than I do here, including introducing the wonderful pytest module, so it is worth your time to work through his materials.

For a nice introduction to the philosophy, you might want to watch Eric's 3-minute, 30-second talk on the subject (don't worry, his talk is only the first 3:30 of the video; it's not 48 minutes long).

Building a validation suite

  1. Think about all of the expectations you have of your data source.
    • What is its structure?
    • What type is it? (Numeric, sequencing, etc.)
    • Is it complete?
  2. For every expectation, write a validation test to make sure a data set follows these assumptions. Usually, each test is a separate function (in our case in Python).
  3. As you develop your wrangling pipeline for tidying, etc., every time you make an assumption about your data, write a validation test for that assumption if you have not already.
  4. Whenever you get hit with a bug or see a new error in a data set, write a test for it.

As you work with a type of data, your validation suite will grow. You will have more and more confidence that your incoming data sets are what you think they are. You are systematically eliminating sources of error in your analyses.
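As a minimal sketch of what a test from step 2 might look like (the column name here is a hypothetical placeholder; real tests appear in the examples below), each expectation becomes its own small function that complains when the expectation is violated:

def test_nonnegative_absorbance(df):
    """Expectation: absorbance is never negative (hypothetical column name)."""
    assert (df['absorbance'] >= 0).all(), 'found negative absorbance values'

The assert statement used here is explained in the first example.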

Example validation 1: flow data

For our first demonstration of how to build a validation suite, we will validate a set of flow cytometry data. The data files may be downloaded here. These are a subset of this data set from Razo-Mejia, et al., Tuning transcriptional regulation through signaling: A predictive theory of allosteric induction.

Each file contains flow cytometry data for a bacterial strain with a ribosomal binding site (RBS) modification used to tune repressor copy number. Each CSV file consists of four columns, FSC-A, SSC-A, FITC-A, and gate. Each row represents a measurement for a putative single cell. (The data set is tidy.) The FSC-A column contains the forward scattering value, the SSC-A column contains the side scattering value, and the FITC-A column contains a fluorescence intensity. Finally, the gate column contains a 1 or 0 indicating whether or not a given measurement is to be included in the analysis.

In writing the tests for this particular data set, it is convenient to assume we have already loaded the data set into a Pandas DataFrame. Testing will obviously fail if we cannot read in the data (which is kind of a zeroth-order test).

For our first test, we will make sure all of the columns are present and correct.

In [2]:
def test_column_names(df, fname):
    """Ensure DataFrame has proper columns."""
    column_names = ['FSC-A', 'SSC-A', 'FITC-A', 'gate']

    assert list(df.columns) == column_names, fname + ' has wrong column names.'

Notice the assert statement. In Python, an assert statement checks whether a Boolean expression evaluates to True (in this case, that the columns of the DataFrame match the correct column names). If the Boolean expression evaluates to True, nothing happens. If it does not, an AssertionError is raised, with the error message given by the string following the comma.
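As a quick standalone illustration of this behavior (a toy example, not part of the flow data set):

# Passes silently because the expression is True
assert 1 + 1 == 2, 'arithmetic is broken'

# Would raise AssertionError: arithmetic is broken
# assert 1 + 1 == 3, 'arithmetic is broken'

Let's try testing one of the data sets.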

In [3]:
fname = '../data/validation/flow_data/20160804_0_RBS1027_0.0.csv'

# Load in the data
df = pd.read_csv(fname)

# Check the column names
test_column_names(df, fname)

Nothing happened. That means that the test passed. Let's mess with the DataFrame to see what a failure looks like.

In [4]:
# Change one of the column names
df = df.rename(columns={'FSC-A': 'FSC'})

# Run test again
test_column_names(df, fname)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-e0d50a6f71f2> in <module>
      3 
      4 # Run test again
----> 5 test_column_names(df, fname)

<ipython-input-2-800cc64e43be> in test_column_names(df, fname)
      3     column_names = ['FSC-A', 'SSC-A', 'FITC-A', 'gate']
      4 
----> 5     assert list(df.columns) == column_names, fname + ' has wrong column names.'

AssertionError: ../data/validation/flow_data/20160804_0_RBS1027_0.0.csv has wrong column names.

Now we get an AssertionError; this is what a failure looks like.

We can write other tests as well. For example, we might want to enforce that we have no missing data, so we have no NaNs in the DataFrame. We also know that we cannot have negative scattering or fluorescence, at least not in something that is gated. Furthermore, we know that the gate column can contain only ones and zeros.

In [5]:
def test_missing_data(df, fname):
    """Look for missing entries."""
    assert np.all(df.notnull()), fname + ' contains missing data'

def test_gate(df, fname):
    """Make sure all gating entries are 0 or 1"""
    assert ((df['gate'] == 0) | (df['gate'] == 1)).all(), \
            fname + ' has bad gate values.'

def test_negative(df, fname):
    """Look for negative scattering values in gated cells."""
    assert np.all(df.loc[df['gate']==1, ['FSC-A', 'SSC-A', 'FITC-A']] >= 0), \
            fname + ' contains negative scattering data'

Let's try running all three tests.

In [6]:
# Load in the data
df = pd.read_csv(fname)

# Perform tests
test_column_names(df, fname)
test_missing_data(df, fname)
test_gate(df, fname)
test_negative(df, fname)

That data set passed them all! It will be convenient at this point to write a function that loads in a data set, and then performs these tests.

In [7]:
def test_flow(fname):
    """Run a gamut of tests on a flow data set."""
    df = pd.read_csv(fname)
    
    test_column_names(df, fname)
    test_missing_data(df, fname)
    test_gate(df, fname)
    test_negative(df, fname)
    
test_flow('../data/validation/flow_data/20160804_0_RBS1027_0.0.csv')

Testing multiple files

We would of course like to automate testing multiple files in one go. The glob module is useful for this, which is best seen by example.

In [8]:
# Pattern to match
pattern = '../data/validation/flow_data/*RBS1027*.csv'

# Glob it! (and sort them alphabetically)
list(sorted(glob.glob(pattern)))
Out[8]:
['../data/validation/flow_data/20160804_0_RBS1027_0.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_0.1.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_1.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_10.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_100.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_1000.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_25.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_250.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_5.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_50.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_500.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_5000.0.csv',
 '../data/validation/flow_data/20160804_0_RBS1027_75.0.csv']

Running glob.glob on a pattern string returns a list of all file names that match the pattern. Here, we used the wildcard character (*) to find all files that had the string RBS1027 preceded and followed by anything, so long as the file name ended with a .csv extension.
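As an aside, glob patterns support more wildcards than just *: ? matches exactly one character, and bracketed sets like [0-9] match a single character from the set. The patterns below are hypothetical, shown only to illustrate the syntax:

# '?' matches exactly one character
glob.glob('../data/validation/flow_data/*RBS1027_?.0.csv')

# '[0-9]' matches a single digit
glob.glob('../data/validation/flow_data/*RBS1027_[0-9]*.csv')

So, we could loop over all of these files to test them. Let's modify our test_flow() function to do that.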

In [9]:
def test_flow(pattern):
    """Validate all files matching pattern for flow specification."""
    filenames = list(sorted(glob.glob(pattern)))
    
    for fname in filenames:
        df = pd.read_csv(fname)
        test_gate(df, fname)
        test_column_names(df, fname)
        test_missing_data(df, fname)
        test_negative(df, fname)
    
    # If we get here, all tests pass
    print(len(filenames), 'files passed.')

Let's give this a whirl!

In [10]:
test_flow('../data/validation/flow_data/*RBS1027*.csv')
13 files passed.

Assertions are useful, but...

assert statements are convenient, and having many separate tests with assert statements is useful if you are going to use pytest. But they can be annoying if your data set might have errors that you want to be aware of but that should not necessarily abort the analysis. As an example, let's rewrite our test for negative values to also include cells that were not gated.

In [11]:
def test_negative(df, fname):
    """Look for negative scattering values in all cells."""
    assert np.all(df >= 0), fname + ' contains negative scattering data'

Now, if we rerun our tests, we will get an AssertionError.

In [12]:
test_flow('../data/validation/flow_data/*RBS1027*.csv')
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-12-016e69abc68c> in <module>
----> 1 test_flow('../data/validation/flow_data/*RBS1027*.csv')

<ipython-input-9-417bbc1b71b2> in test_flow(pattern)
      8         test_column_names(df, fname)
      9         test_missing_data(df, fname)
---> 10         test_negative(df, fname)
     11 
     12     # If we get here, all tests pass

<ipython-input-11-bbc41d9c1a19> in test_negative(df, fname)
      1 def test_negative(df, fname):
      2     """Look for negative scattering values in all cells."""
----> 3     assert np.all(df >= 0), fname + ' contains negative scattering data'

AssertionError: ../data/validation/flow_data/20160804_0_RBS1027_0.0.csv contains negative scattering data

The very first file failed because it had some negative values. With this real data set, the researchers discovered this and then asked the manufacturer why some fluorescence values were negative. The manufacturer said it had something to do with the calibration of the instrument. (This is exactly the type of thing data validation is supposed to catch!) So, we may expect some negative values but still want to use the data set. Instead, we can do a little more customization, providing our own messages for the errors without aborting the entire test. This is more work, but we can put it together.

Before we do, I pause to note that this may not be a great strategy. We may want strong error messages whenever something is wrong with our data, and it might be better to hand-build the tolerance (such as allowing values to be only slightly negative) into the tests themselves. Having errors such as AssertionErrors also allows easier use of tools like pytest. It is up to you to think carefully about what your best testing practices are.
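For instance, here is a sketch of what hand-building a tolerance into the test might look like; the tolerance value is a made-up placeholder, not something from the original analysis.

def test_negative_tolerant(df, fname, tol=-10.0):
    """Allow values down to tol, a hand-chosen tolerance (placeholder value)."""
    assert np.all(df >= tol), \
            fname + ' contains values below tolerance ' + str(tol)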

For this example, I will add more detailed messaging only for the test_negative() function and keep the other tests raising errors.

In [13]:
def n_negative_col(df, col):
    """Return number of negative entries for gated and non-gated cells"""
    n_gated = (df.loc[df['gate']==1, col] < 0).sum()
    n_nongated = (df.loc[df['gate']==0, col] < 0).sum()
    return n_gated, n_nongated
    
def test_negative(df, fname):
    """Look for negative scattering values in all cells."""
    passed = True
    
    # Check each column for negative values and print result
    for col in ['FSC-A', 'SSC-A', 'FITC-A']:
        n_gated, n_nongated = n_negative_col(df, col)
        if n_gated > 0 or n_nongated > 0:
            msg = ( ('{0:s} had {1:d} ungated and {2:d} gated'
                      + ' negative entries in the {3:s} column')
                   .format(fname, n_nongated, n_gated, col))
            print(msg)
            passed = False
            
    return passed

Now that we have this function, we should also adjust the test_flow() function.

In [14]:
def test_flow(pattern):
    """Validate all files matching pattern for flow specification."""
    filenames = glob.glob(pattern)
    
    n_passed = 0
    n_failed = 0
    
    for fname in filenames:
        df = pd.read_csv(fname)
        test_gate(df, fname)
        test_column_names(df, fname)
        test_missing_data(df, fname)
        if test_negative(df, fname):
            n_passed += 1
        else:
            n_failed += 1
    
    # Report results
    print('\n*************************************')
    print(n_passed, 'files passed.')
    print(n_failed, 'files failed')

Now let's try running the tests.

In [15]:
test_flow('../data/validation/flow_data/*RBS1027*.csv')
../data/validation/flow_data/20160804_0_RBS1027_25.0.csv had 1607 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_500.0.csv had 104 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_1000.0.csv had 1 ungated and 0 gated negative entries in the FSC-A column
../data/validation/flow_data/20160804_0_RBS1027_1000.0.csv had 1 ungated and 0 gated negative entries in the SSC-A column
../data/validation/flow_data/20160804_0_RBS1027_1000.0.csv had 158 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_0.0.csv had 1 ungated and 0 gated negative entries in the SSC-A column
../data/validation/flow_data/20160804_0_RBS1027_0.0.csv had 15063 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_0.1.csv had 14302 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_5.0.csv had 11059 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_1.0.csv had 14302 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_250.0.csv had 1 ungated and 0 gated negative entries in the SSC-A column
../data/validation/flow_data/20160804_0_RBS1027_250.0.csv had 3481 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_50.0.csv had 404 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_10.0.csv had 1 ungated and 0 gated negative entries in the SSC-A column
../data/validation/flow_data/20160804_0_RBS1027_10.0.csv had 6770 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_100.0.csv had 159 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_5000.0.csv had 8 ungated and 0 gated negative entries in the FSC-A column
../data/validation/flow_data/20160804_0_RBS1027_5000.0.csv had 3 ungated and 0 gated negative entries in the SSC-A column
../data/validation/flow_data/20160804_0_RBS1027_5000.0.csv had 260 ungated and 0 gated negative entries in the FITC-A column
../data/validation/flow_data/20160804_0_RBS1027_75.0.csv had 1 ungated and 0 gated negative entries in the SSC-A column
../data/validation/flow_data/20160804_0_RBS1027_75.0.csv had 243 ungated and 0 gated negative entries in the FITC-A column

*************************************
0 files passed.
13 files failed

Now we see that every file had failures because of negative entries, though no failures for gated cells.

Example validation 2: image data

As a second example, we will test a directory of images. You can download the images and .py files used for testing here. This is a fabricated set of images meant to simulate a time series of TIF images.

We would like to verify the following.

  • All file names match a specific pattern, such as im_000013.tif. This is important because if we are going to automate loading these images, they need to have the correct file names.
  • There are no dropped frames, i.e., the numbers of the images go up sequentially.
  • No frame is overexposed (has at least one pixel at the maximum value for the bit depth of the image) or unexposed (is completely black).
  • Each image has the correct dimensions.

For this example, I will illustrate how to use pytest to test the directory. You simply place a .py file whose name begins with test_ in the directory you want tested. When you run pytest, it will sniff out any function in such a file whose name begins with test_ and run it. It will then give you a report of the errors it encountered. To see how this works, on the command line, go into the directory containing the images and run

pytest -v

and pytest will do the rest! You can look at the file test_image_collection.py (contents of which are shown below) to see how the tests were constructed. The principles are exactly the same as for the flow data.

In [16]:
# %load ../data/validation/image_validation/test_image_collection.py
import glob
import re

import numpy as np
import skimage.io

# Get all tiff files in this directory
fnames = glob.glob('*.tif')

# Proper pattern for image file names
pattern = re.compile(r'im_[0-9]{6}\.tif')

# Proper dimensions of images
idim, jdim = 128, 128

# Image bit depth
bitdepth = 12

def get_proper_frames(fnames, pattern):
    """Get all properly named frames in directory."""
    frames = []
    for fname in fnames:
        if re.fullmatch(pattern, fname) is not None:
            frames.append(int(fname[3:-4]))
    return frames


def test_file_names():
    """
    Ensure all TIFF files follow naming convention.
    """
    for fname in fnames:
        assert re.fullmatch(pattern, fname) is not None, \
                            fname + ' does not match pattern.'


def test_dropped_frames():
    """
    Check for dropped frames
    """
    # Get all proper file names (Should be all of them)
    frames = get_proper_frames(fnames, pattern)

    # Look for skipped frames
    actual_frames = set(frames)
    max_frame = max(actual_frames)
    desired_frames = set(np.arange(max_frame+1))
    skipped_frames = desired_frames - actual_frames

    assert not skipped_frames, 'Missing frames: ' + str(skipped_frames)


def test_exposure():
    """
    Check for frames with overexposure and unexposure.
    """
    # Get all proper file names (Should be all of them)
    frames = get_proper_frames(fnames, pattern)

    # Look for unexposed and overexposed frames
    unexposed = []
    overexposed = []
    for frame in frames:
        fname = 'im_{0:06d}.tif'.format(frame)
        im = skimage.io.imread(fname)
        if im.max() == 0:
            unexposed.append(frame)
        if im.max() == 2**bitdepth-1:
            overexposed.append(frame)

    assert not unexposed and not overexposed, \
            'unexposed: ' + str(unexposed) \
                + '  overexposed: ' + str(overexposed)


def test_dimensions():
    """Make sure all images have same dimensions."""
    # Get all proper file names (Should be all of them)
    frames = get_proper_frames(fnames, pattern)

    for frame in frames:
        fname = 'im_{0:06d}.tif'.format(frame)
        im = skimage.io.imread(fname)
        assert im.shape == (idim, jdim)

Conclusions

You have a computer. It can automate checking and validating your data. Do it. It will save you headaches when you are doing an analysis and wondering why things might be looking weird. More importantly, when things don't look weird in your analysis but there is a problem nonetheless, formal automated data validation can catch errors and thereby make your work more reproducible.

Computing environment

In [17]:
%load_ext watermark
In [18]:
%watermark -v -p numpy,pandas
CPython 3.7.0
IPython 7.0.1

numpy 1.15.2
pandas 0.23.4