Hacker’s approach to NHST

Dataset download

import numpy as np
import pandas as pd

import numba

import bokeh_catplot

import colorcet

import bokeh.io
import bokeh.plotting
Loading BokehJS ...

In this part of the lesson, we will perform null hypothesis significance tests (NHST) using the drone sperm quality data set. You should be sure to carefully read the lecture material on NHST so that you are very clear on what NHST is and what is entails.

Of course, we need to load in the data set first.

df = pd.read_csv('../data/bee_sperm.csv', comment='#')
Specimen Treatment Environment TreatmentNCSS Sample ID Colony Cage Sample Sperm Volume per 500 ul Quantity ViabilityRaw (%) Quality Age (d) Infertil AliveSperm Quantity Millions Alive Sperm Millions Dead Sperm Millions
0 227 Control Cage 1 C2-1-1 2 1 1 2150000 2150000 96.7263814616756 96.726381 14 0 2079617 2.1500 2.079617 0.070383
1 228 Control Cage 1 C2-1-2 2 1 2 2287500 2287500 96.3498079760595 96.349808 14 0 2204001 2.2875 2.204001 0.083499
2 229 Control Cage 1 C2-1-3 2 1 3 87500 87500 98.75 98.750000 14 0 86406 0.0875 0.086406 0.001094
3 230 Control Cage 1 C2-1-4 2 1 4 1875000 1875000 93.2874208336941 93.287421 14 0 1749139 1.8750 1.749139 0.125861
4 231 Control Cage 1 C2-1-5 2 1 5 1587500 1587500 97.7925061050061 97.792506 14 0 1552456 1.5875 1.552456 0.035044

Because we will be using Numba’d functions to rapidly do our sampling, we will go ahead and slice out the subsets of the data frame we will use for inference as Numpy arrays. We will use the specimens that have nonzero sperm count of alive and dead sperm.

# Extract NumPy arrays (only use values greater than zero for logs)
inds = (df["Alive Sperm Millions"] > 0) & (df["Dead Sperm Millions"] > 0)
df = df.loc[inds, :]

alive_ctrl = df.loc[df["Treatment"] == "Control", "Alive Sperm Millions"].values

alive_pest = df.loc[df["Treatment"] == "Pesticide", "Alive Sperm Millions"].values

dead_ctrl = df.loc[df["Treatment"] == "Control", "Dead Sperm Millions"].values

dead_pest = df.loc[df["Treatment"] == "Pesticide", "Dead Sperm Millions"].values

Finally, so we have them available, we will define some of the utility functions we defined earlier in the lesson.

def plot_ecdf(data, color="#4e79a7", alpha=1, p=None, **kwargs):
    """Plot an ECDF"""
    # Set up kwargs for the marker
    marker_kwargs = {"color": color, "alpha": alpha}

    # Get the legend label if needed
    if "legend_label" in kwargs:
        marker_kwargs["legend_label"] = kwargs.pop("legend_label")

    # Instantiate new plot if none provided
    if p is None:
        p = bokeh.plotting.figure(**kwargs)

    # Populate ECDF glyphs
    p.circle(np.sort(data), np.arange(1, len(data) + 1) / len(data), **marker_kwargs)

    return p

def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))

def pearson_r(x, y):
    Compute plug-in estimate for the Pearson correlation coefficient.
    return (
        np.sum((x - np.mean(x)) * (y - np.mean(y)))
        / np.std(x)
        / np.std(y)
        / np.sqrt(len(x))
        / np.sqrt(len(y))

Permutation hypothesis tests

Recall the steps for performing a NHST.

  1. Clearly state the null hypothesis.

  2. Define a test statistic, a scalar value that you can compute from data. Compute it directly from your measured data.

  3. Simulate data acquisition for the scenario where the null hypothesis is true. Do this many times, computing and storing the value of the test statistic each time.

  4. The fraction of simulations for which the test statistic is at least as extreme as the test statistic computed from the measured data is called the p-value, which is what you report.

For one special type of hypothesis, there is a very straight-forward way of simulating it. Here is our hypothesis: the control and pesticide-treated samples have exactly the same distribution. To simulate this, we take the following steps for two data sets, a control with \(n\) measurements and a test the other with \(m\).

  1. Concatenate the two data sets into one.

  2. Randomly scramble the order of the combined data set.

  3. Designate the first \(n\) entries in this scrambled array to be “control” and the remaining to be “test.”

This simulation is exact; it is as if the label of the data set has no meaning; hence the distributions of the two data sets are entirely equal. We test such a null hypothesis with a permutation test. A permutation sample is akin to a bootstrap sample; it is a new pair of data sets generated after scrambling the concatenated data set. A permutation replicate is a value of the test statistic computed from a permutation sample. For this test, we will use the difference of means as the test statistic.

Let’s code this up. We can write a function to draw a permutation sample, and then a function to draw permutation replicates for an arbitrary user-supplied summary statistic.

def draw_perm_sample(x, y):
    """Generate a permutation sample."""
    concat_data = np.concatenate((x, y))

    return concat_data[:len(x)], concat_data[len(x):]

def draw_perm_reps(x, y, stat_fun, size=1):
    """Generate array of permuation replicates."""
    return np.array([stat_fun(*draw_perm_sample(x, y)) for _ in range(size)])

Since we will look at the difference of means specifically, we will also code up a custom function to draw replicates of the difference of means.

def draw_perm_reps_diff_mean(x, y, size=1):
    """Generate array of permuation replicates."""
    out = np.empty(size)
    for i in range(size):
        x_perm, y_perm = draw_perm_sample(x, y)
        out[i] = np.mean(x_perm) - np.mean(y_perm)

    return out

To perform the hypothesis test, then, with the difference of means as our test statistic, we have only to draw many replicates and then tally up how many of them are more extreme than the observed difference of mean.

# Compute test statistic for original data set
diff_mean = np.mean(alive_ctrl) - np.mean(alive_pest)

# Draw replicates
perm_reps = draw_perm_reps_diff_mean(alive_ctrl, alive_pest, size=1)

# Compute p-value
p_val = np.sum(perm_reps >= diff_mean) / len(perm_reps)

print('p-value =', p_val)
p-value = 0.0

Whoa! Wait a minute. The p-value is zero? This just means that in all of the 10,000 replicates we took, not one had a test statistic as extreme as was observed. We cannot resolve p-values much less than 0.001 with 10,000 permutation replicates. Let’s try taking ten million replicates as see how we do. This will take a few seconds to run (which again highlights the utility of using Numpy arrays with Numba).

# Compute test statistic for original data set
diff_mean = np.mean(alive_ctrl) - np.mean(alive_pest)

# Draw replicates
perm_reps = draw_perm_reps_diff_mean(alive_ctrl, alive_pest, size=10000000)

# Compute p-value
p_val = np.sum(perm_reps >= diff_mean) / len(perm_reps)

print('p-value =', p_val)
p-value = 3.04e-05

So, our p-value is quite small, less than \(10^{-4}\). This means that the probability of getting a difference of means as extreme as was observed under the null hypothesis that the control and test samples were drawn from identical distribution is quite small.

Permutation hypothesis test for correlation

As we continue to explore the data set, we might want to consider a hypothesis: dead and alive sperm counts are uncorrelated. It is possible that under this hypothesis, we could see correlation at the same level observed in the real data. We can test this hypothesis by permutation. In this case, we scramble the labels “dead” and “alive” and see what correlation we get. We will do 10 million permutation replicates. This will take a while to compute.

# Compute Pearson r for actual data
r_ctrl = pearson_r(np.log(alive_ctrl), np.log(dead_ctrl))
r_pest = pearson_r(np.log(alive_pest), np.log(dead_pest))

# Get permutation replicates
perm_reps_ctrl = draw_perm_reps(
    np.log(alive_ctrl), np.log(dead_ctrl), pearson_r, size=10000000
perm_reps_pest = draw_perm_reps(
    np.log(alive_pest), np.log(dead_pest), pearson_r, size=10000000

# Compute p-value
p_ctrl = np.sum(np.abs(perm_reps_ctrl) > np.abs(r_ctrl)) / len(perm_reps_ctrl)
p_pest = np.sum(np.abs(perm_reps_pest) > np.abs(r_pest)) / len(perm_reps_pest)

p-value control:   {0:.3e}
p-value pesticide: {1:.3e}
        p_ctrl, p_pest

p-value control:   0.000e+00
p-value pesticide: 1.000e-06

Note that here we used the absolute value of the permutation replicates. We did this because we are interested to see if we get correlation at all, including anticorrelation. This is another essential piece of a hypothesis test: you need to define what it means to be more extreme.

The p-value for both is tiny; not one in 10 million permutations for the control sample gave a correlation as high as was observed, and only 10 or so of the pesticide treated group. So, we would report something like \(p < 10^{-6}\) for control and \(p < 10^{-5}\) for pesticide.

Bootstrap hypothesis tests

Permutation tests are great: they exactly simulate the null hypothesis that the two samples are identically distributed. But, they are limited to this specific null hypothesis. What if we had a different hypothesis, say only that the means of the two distributions we are comparing are equal, but other properties of the distributions need not be identical? In this case, we cannot use the permutations to simulate the null hypothesis.

Instead, we can simulate the null hypothesis by shifting the means of the control and test distributions so that they are equal. We then take a bootstrap sample out of each of the shifted data sets. We compute our test statistic from these two bootstrap samples to get a bootstrap replicate. Then, the number of bootstrap replicates that are at least as extreme as the test statistic from the original data is used to compute the p-value. Let’s see this in action. First, we’ll see how to shift the data sets and what the resulting ECDFs look like.

# Shift data sets
total_mean = np.mean(np.concatenate((alive_ctrl, alive_pest)))
alive_ctrl_shift = alive_ctrl - np.mean(alive_ctrl) + total_mean
alive_pest_shift = alive_pest - np.mean(alive_pest) + total_mean

# Plot the ECDFs
p = plot_ecdf(
    x_axis_label="shifted alive sperm count (millions)",
p = plot_ecdf(

p.legend.location = "bottom_right"


The distributions now have the same mean, but nothing else about them has changed. They still have the same shape as before. Now, let’s draw bootstrap samples out of these shifted distributions and see how they compare to the original ECDFs.

p = bokeh_catplot.ecdf(df, cats="Treatment", val="Alive Sperm Millions")

for _ in range(100):
    # Get x-y values for ECDF
    p = plot_ecdf(
    p = plot_ecdf(


The blue-orange haze in between the ECDFs of the original data are the bootstrap samples under the null hypothesis. Only on rare occurrence does that haze reach the original ECDFs, so we might suspect that the observed data might not be all that probable under the null hypothesis. Let’s perform the hypothesis test.

def draw_bs_reps_diff_mean(x, y, size=1):
    Generate bootstrap replicates with difference of means
    as the test statistic.
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(x)) - np.mean(draw_bs_sample(y))
    return out

# Generate samples (10 million again)
bs_reps = draw_bs_reps_diff_mean(alive_ctrl_shift, alive_pest_shift, size=10000000)

# Compute p-value
p_val = np.sum(bs_reps >= diff_mean) / len(bs_reps)

print("p-value =", p_val)
p-value = 7.5e-06

This p-value is of similar magnitude as what we got from the permutation test.

Simulating a hypothesis: challenging and rewarding

As you can see from the above examples with permutation and bootstrap hypothesis tests, the part of the procedure that requires the most thought and care is simulating the null hypothesis. For a test where we are comparing means, we needed to shift the distributions so that all other aspects of the distribution remain the same, and then resample from the shifted distribution.

The strength of doing these kinds of hypothesis tests is that they are entirely nonparametric. There is not need to specify a probability distribution for the null hypothesis; you can just make statements about how the two distributions you are comparing are related.

Computing environment

%load_ext watermark

%watermark -v -p numpy,pandas,numba,altair,bokeh,colorcet,bokeh_catplot,jupyterlab
CPython 3.7.5
IPython 7.1.1

numpy 1.17.3
pandas 0.24.2
numba 0.46.0
altair 3.2.0
bokeh 1.4.0
colorcet 1.0.0
bokeh_catplot 0.1.6
jupyterlab 1.1.4