Introduction to the bootstrap

Dataset download


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot colorcet watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import pandas as pd

import numba

import iqplot

import colorcet

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

We have discussed the plug-in principle, in which we approximate the distribution function of the generative model, \(F\), with the empirical distribution function, \(\hat{F}\). We then approximate all statistical functionals of \(F\) using \(\hat{F}\); \(T(F)\approx T(\hat{F})\). Using the frequentist perspective, we talked about using the plug-in principle and bootstrapping to compute confidence intervals and do hypothesis testing. In this lesson, we will go over how we implement this in practice. We will work with some real data on the reproductive health of male bees that have been treated with pesticide.

It is important as you go through this lesson to remember that you do not know what the generative distribution is. We are only approximating it using the plug-in principle. Since we do not know what \(F\) is, we do not know what its parameters are, so we therefore cannot estimate them. Rather, we can study summary statistics, which are plug-in estimates for statistical functionals of the generative distribution, such as means, variances, or even the ECDF itself, and how they may vary from experiment to experiment. In this sense, we are doing nonparametric inference. We do not have a parametrized model in mind, but are investigating how experimental results might vary upon repetition of the experiment.

The data set

Neonicotinoid pesticides are thought to have inadvertent effects on service-providing insects such as bees. A study of this was featured in the New York Times. The original paper by Straub and coworkers may be found here. The authors put their data in the Dryad repository, which means that it is free to all to work with!

(Do you see a trend here? If you want people to think deeply about your results, explore them, learn from them, in general further science with them, make your data publicly available. Strongly encourage the members of your lab to do the same.)

We will look at the sperm quality of drone bees using this data set. First, let’s load in the data set and check it out.

[2]:
df = pd.read_csv(os.path.join(data_path, "bee_sperm.csv"), comment="#")
df.head()
[2]:
Specimen Treatment Environment TreatmentNCSS Sample ID Colony Cage Sample Sperm Volume per 500 ul Quantity ViabilityRaw (%) Quality Age (d) Infertil AliveSperm Quantity Millions Alive Sperm Millions Dead Sperm Millions
0 227 Control Cage 1 C2-1-1 2 1 1 2150000 2150000 96.7263814616756 96.726381 14 0 2079617 2.1500 2.079617 0.070383
1 228 Control Cage 1 C2-1-2 2 1 2 2287500 2287500 96.3498079760595 96.349808 14 0 2204001 2.2875 2.204001 0.083499
2 229 Control Cage 1 C2-1-3 2 1 3 87500 87500 98.75 98.750000 14 0 86406 0.0875 0.086406 0.001094
3 230 Control Cage 1 C2-1-4 2 1 4 1875000 1875000 93.2874208336941 93.287421 14 0 1749139 1.8750 1.749139 0.125861
4 231 Control Cage 1 C2-1-5 2 1 5 1587500 1587500 97.7925061050061 97.792506 14 0 1552456 1.5875 1.552456 0.035044

We are interested in the number of alive sperm in the samples. Let’s first explore the data by making ECDFs for the two groups we will compare, those treated with pesticide (“Pesticide”) and those that are not (“Control”).

[3]:
p = iqplot.ecdf(df, q="Alive Sperm Millions", cats="Treatment")

bokeh.io.show(p)

The visual inspection of the ECDFs suggests that indeed the control drones have more alive sperm than those treated with pesticide. But how variable would these ECDFs be if we repeated the experiment?

Bootstrap samples and ECDFs

To address this question, we can generate bootstrap samples from the experimental data and make lots of ECDFs. We can then plot them all to see how the ECDF might vary. Recall that a bootstrap sample from a data set of \(n\) repeated measurements is generated by drawing \(n\) data points out of the original data set with replacement. The rg.choice() function allows for randomly drawing elements out of an array. Let’s generate 100 bootstrap samples and plot their ECDFs to visualize how our data set might change as we repeat the experiment.

[4]:
rg = np.random.default_rng()

# Set up Numpy arrays for convenience (also much better performance)
alive_ctrl = df.loc[df["Treatment"] == "Control", "Alive Sperm Millions"].values
alive_pest = df.loc[df["Treatment"] == "Pesticide", "Alive Sperm Millions"].values

# ECDF values for plotting
ctrl_ecdf = np.arange(1, len(alive_ctrl)+1) / len(alive_ctrl)
pest_ecdf = np.arange(1, len(alive_pest)+1) / len(alive_pest)

# Make 100 bootstrap samples and plot them
for _ in range(100):
    bs_ctrl = rg.choice(alive_ctrl, size=len(alive_ctrl))
    bs_pest = rg.choice(alive_pest, size=len(alive_pest))

    # Add semitransparent ECDFs to the plot
    p.circle(np.sort(bs_ctrl), ctrl_ecdf, color=colorcet.b_glasbey_category10[0], alpha=0.02)
    p.circle(np.sort(bs_pest), pest_ecdf, color=colorcet.b_glasbey_category10[1], alpha=0.02)

bokeh.io.show(p)

From this graphical display, we can already see that the ECDFs do not strongly overlap in 100 bootstrap samples, so there is likely a real difference between the two treatments.

Speeding up sampling with Numba

We can make sampling faster (though note that the slowness in generating the above plot is not in resampling, but in adding glyphs to the plot) by employing just-in-time compilation, wherein the Python code is compiled at runtime into machine code. This results in a substantial speed boost. Numba is a powerful tool for this purpose, and we will use it. Bear in mind, though, that not all Python code is Numba-able. Specifically, only the Mersenne Twister bit generator is supported, so we need to use np.random.choice() in Numba’s code to do our resampling.

In order to jit-in-time compile a function, we need to decorate its definition with the @numba.njit decorator, which tells Python to use Numba to just-in-time compile the function. Note that Numba does not currently work with Pandas data frames; it typically requires Numpy arrays to do any operations on arrays.

[5]:
@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))

Let’s quickly quantify the speed boost. Before we do the timing, we will run the Numba’d version once to give it a chance to do the JIT compilation.

[6]:
# Run once to JIT
draw_bs_sample(alive_ctrl)

print('Using the Numpy default RNG (PCG64):')
%timeit rg.choice(alive_ctrl, size=len(alive_ctrl))

print('\nUsing the Numpy Mersenne Twister:')
%timeit np.random.choice(alive_ctrl, size=len(alive_ctrl))

print("\nUsing a Numba'd Mersenne Twister:")
%timeit draw_bs_sample(alive_ctrl)
Using the Numpy default RNG (PCG64):
18.3 µs ± 380 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using the Numpy Mersenne Twister:
18.5 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using a Numba'd Mersenne Twister:
2.86 µs ± 447 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

That’s a factor of 7 or so increase. That can make a big difference when generating lots of bootstrap samples.

Bootstrap replicates and confidence intervals

We have plotted the ECDF of the data, which is instructive, but we would also like to get plug-in estimates for the generative distribution. Remember, when doing nonparametric plug-in estimates, we plug in the ECDF for the CDF. We do not need to specify what the distribution (described mathematically by the CDF, or equivalently by the PDF) is, just that we approximate it by the empirical distribution.

In lecture, we laid out the procedure to compute a confidence interval.

  1. Generate \(B\) independent bootstrap samples.

  2. Compute the plug-in estimate of the statistical functional of interest for each bootstrap sample to get the bootstrap replicates.

  3. The \(100(1-\alpha)\) percent confidence interval consists of the percentiles \(100\alpha/2\) and \(100(1 - \alpha/2)\) of the bootstrap replicates.

A key step here is computing the bootstrap replicate. We will write a couple functions for this. The first is generic; it takes as an argument the function to be used to compute the statistic of interest (e.g., np.mean or np.median). We will also write a few functions for commonly computed statistics, which enables us to use Numba to greatly speed up the process of generating bootstrap replicates.

[7]:
def draw_bs_reps(data, stat_fun, size=1):
    """Draw boostrap replicates computed with stat_fun from 1D data set."""
    return np.array([stat_fun(draw_bs_sample(data)) for _ in range(size)])


@numba.njit
def draw_bs_reps_mean(data, size=1):
    """Draw boostrap replicates of the mean from 1D data set."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out


@numba.njit
def draw_bs_reps_median(data, size=1):
    """Draw boostrap replicates of the median from 1D data set."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.median(draw_bs_sample(data))
    return out


@numba.njit
def draw_bs_reps_std(data, size=1):
    """Draw boostrap replicates of the standard deviation from 1D data set."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.std(draw_bs_sample(data))
    return out

Now, let’s get bootstrap replicates for the mean of each of the two treatments.

[8]:
bs_reps_mean_ctrl = draw_bs_reps_mean(alive_ctrl, size=10000)
bs_reps_mean_pest = draw_bs_reps_mean(alive_pest, size=10000)

We can now compute the confidence intervals by computing the percentiles using the np.percentile() function.

[9]:
# 95% confidence intervals
mean_ctrl_conf_int = np.percentile(bs_reps_mean_ctrl, [2.5, 97.5])
mean_pest_conf_int = np.percentile(bs_reps_mean_pest, [2.5, 97.5])

print("""
Mean alive sperm count 95% conf int control (millions):   [{0:.2f}, {1:.2f}]
Mean alive sperm count 95% conf int treatment (millions): [{2:.2f}, {3:.2f}]
""".format(*(tuple(mean_ctrl_conf_int) + tuple(mean_pest_conf_int))))

Mean alive sperm count 95% conf int control (millions):   [1.68, 2.07]
Mean alive sperm count 95% conf int treatment (millions): [1.11, 1.49]

We can also use the bootstrap replicates to plot the probability distribution of mean alive sperm count. Remember: This is not a confidence interval on a parameter value. That does not make sense in frequentist statistics. It is the confidence interval describing what we would get as a plug-in estimate for the mean if we did the experiment over and over again.

When we plot the ECDF of the bootstrap replicates, we thin them so as not to choke the browser with too many points. Since we will do this again, we write a quick function to do it.

[10]:
def plot_bs_reps_ecdf(ctrl, pest, q, thin=1, **kwargs):
    """Make plot of bootstrap ECDFs for control and pesticide treatment."""
    x_ctrl = np.sort(ctrl)[::thin]
    x_pest = np.sort(pest)[::thin]

    df = pd.DataFrame(
        data={
            "treatment": ["control"] * len(x_ctrl) + ["pesticide"] * len(x_pest),
            q: np.concatenate((x_ctrl, x_pest)),
        }
    )

    p = iqplot.ecdf(
        df, q=q, cats="treatment", frame_height=200, frame_width=350, **kwargs
    )

    return p


p = plot_bs_reps_ecdf(
    bs_reps_mean_ctrl, bs_reps_mean_pest, "mean alive sperm (millions)", thin=50
)
p.legend.location = "top_left"

bokeh.io.show(p)

These are both nice, Normal distributions with almost no overlap in the tails. We also saw in the computation of the 95% confidence intervals above that there is no overlap. This is expected, the mean alive sperm count should be Normally distributed as per the central limit theorem.

We can do the same procedure for other statistical quantities that do not follow the central limit theorem. The procedure is exactly the same. We will do it for the median.

[11]:
# Get the bootstrap replicates
bs_reps_median_ctrl = draw_bs_reps_median(alive_ctrl, size=10000)
bs_reps_median_pest = draw_bs_reps_median(alive_pest, size=10000)

# 95% confidence intervals
median_ctrl_conf_int = np.percentile(bs_reps_median_ctrl, [2.5, 97.5])
median_pest_conf_int = np.percentile(bs_reps_median_pest, [2.5, 97.5])

print(
    """
Median alive sperm count 95% conf int control (millions):   [{0:.2f}, {1:.2f}]
Median alive sperm count 95% conf int treatment (millions): [{2:.2f}, {3:.2f}]
""".format(
        *(tuple(median_ctrl_conf_int) + tuple(median_pest_conf_int))
    )
)

# Plot ECDFs of bootstrap replicates
p = plot_bs_reps_ecdf(
    bs_reps_median_ctrl, bs_reps_median_pest, "median alive sperm (millions)", thin=50
)
p.legend.location = "top_left"

bokeh.io.show(p)

Median alive sperm count 95% conf int control (millions):   [1.63, 2.11]
Median alive sperm count 95% conf int treatment (millions): [0.95, 1.47]

The results are similar, and we clearly see clear non-normality in the ECDFs, since the central limit theorem does not hold here.

Computing environment

[12]:
%load_ext watermark
%watermark -v -p numpy,pandas,numba,bokeh,colorcet,iqplot,jupyterlab
CPython 3.8.5
IPython 7.18.1

numpy 1.19.1
pandas 1.1.3
numba 0.51.2
bokeh 2.2.3
colorcet 2.0.2
iqplot 0.1.6
jupyterlab 2.2.6