8. Bootstrapping

This recitation was prepared by Cecelia Andrews with help from Justin Bois.

Law school dataset download

C. elegans egg cross-sectional area dataset download

Bee sperm dataset download

import pandas as pd
import numpy as np
import numba
import scipy.stats as st

import bokeh_catplot

import colorcet

import bokeh.io
import bokeh.plotting
import bokeh.layouts
Loading BokehJS ...

This this recitation, we will cover

  1. Review of bootstrapping concepts from Lesson 7.

  2. BCa Confidence Intervals

  3. Studentized Bootstrap

An excellent reference, if dated, for bootstrapping calculations is Efron and Tibshirani’s An Introduction to the Bootstrap, 1994 (Amazon).

1. Review

The plug-in principle

This was covered in Lecture 7.

Simplified definition: Say we have a sample of repeated measurements, and we know there is a generative distribution for these measurements. But, we don’t know what that generative distribution is. We can estimate a parameter, \(\theta = T(F)\), of the generative distribution by computing the statistical functional using the empirical CDF rather than the actual CDF.

Thorough definition, from Lecture 7:
We’ll use the same example covered in Lecture 7. Say we have a data set of repeated measurements of the lengths of C. elegans eggs. We know there is a generative distribution for these lengths, but we don’t know any parameters of the distribution. How can we estimate a parameter of the distribution?

Call the CDF of our generative distribution \(F(x)\). We want to compute a statistical functional, \(T(F)\), of the CDF. \(T(F)\) could be the mean, variance, median, etc. The parameter \(\theta\) is defined by the functional, so \(\theta = T(F)\).

However, we only have a sample from the generative distribution (our egg length measurements). We can compute the ECDF of our measurements, called \(\hat{F}\). The plug-in principle simply states we can estimate our parameter of interest, \(\theta\), by calculating:

\begin{align} \hat{\theta} &= T(\hat{F}) \end{align}

Essentially, the plug-in principle says we can estimate the parameter \(\theta\) by computing the statistical functional on the empirical CDF rather than the actual CDF.

Confidence Intervals

These were covered in Lecture 7.

“If an experiment is repeated over and over again, the estimate I compute for a parameter, \(\hat{\theta}\) , will lie between the bounds of the 95% confidence interval for 95% of the experiments.”


Also covered in Lesson 7.

The basic idea of bootstrapping is to simulate many repetitions of the experiment. This allows us to get a better plug-in estimate for the statistical functional as if we did the experiment many times. To see the power of bootstrapping in action, let’s consider the example given in Efron and Tibshirani’s An Introduction to the Bootstrap.

The CSV file law_school_data.csv, available here, contains the population of measurements (LSAT, GPA) for N=82 American law schools for the entering classes of 1973. LSAT is the average score of the class on the LSAT (national law exam), and GPA is the average undergraduate GPA. In this example, we have a census, so we know the actual values of the summary statistics we may want to compute. Here we load in the data and view the first 10 observations.

df_law = pd.read_csv('../data/law_school_data.csv')
0 1 622 3.23
1 2 542 2.83
2 3 579 3.24
3 4 653 3.12
4 5 606 3.09
5 6 576 3.39
6 7 620 3.10
7 8 615 3.40
8 9 553 2.97
9 10 607 2.91

Now, let’s pretend we are a researcher interested in estimating the mean LSAT and GPA for American law schools. Say we collect a random sample of 15 law schools and want to estimate the population mean from our sample. We will simulate this by drawing a random sample of size n=15:

rg = np.random.default_rng(seed=3252)

# Draw 15 random samples and make a new dataframe
indices = rg.choice(np.arange(82), replace=False, size=15)
df_sample = df_law.loc[indices]
21 22 614 3.19
51 52 580 3.07
11 12 596 3.24
17 18 646 3.47
2 3 579 3.24
10 11 558 3.11
33 34 591 3.02
3 4 653 3.12
13 14 581 3.22
36 37 615 3.37
34 35 578 3.03
52 53 594 2.96
16 17 599 3.23
70 71 570 3.01
49 50 555 3.00

How does this sample compare to the population? Let’s compare the population mean LSAT and GPA to the sample means:

# Convert data to numpy arrays
sample_lsat = df_sample['LSAT'].values
sample_gpa = df_sample['GPA'].values
pop_lsat = df_law['LSAT'].values
pop_gpa = df_law['GPA'].values

# Print means
print(f'Sample mean LSAT: {np.mean(sample_lsat)}')
print(f'Sample mean GPA: {np.mean(sample_gpa)}')
print(f'Population mean LSAT: {np.mean(pop_lsat)}')
print(f'Population mean GPA: {np.mean(pop_gpa)}')
Sample mean LSAT: 593.9333333333333
Sample mean GPA: 3.1519999999999997
Population mean LSAT: 597.5487804878048
Population mean GPA: 3.1347560975609756

The accuracy of the mean LSAT and GPA of the small sample will vary based on the sample you draw, but they can be fairly far away from the mean of the population such that the standard error does not capture the actual mean.

Let’s compare the ECDFs as well.

def ecdf(data):
    """Returns ECDF values of data"""
    return np.arange(1, len(data) + 1) / len(data)

# Set up 2 plots
p = bokeh.plotting.figure(
    width=400, height=300, x_axis_label="LSAT", y_axis_label="ECDF",

a = bokeh.plotting.figure(
    width=400, height=300, x_axis_label="GPA", y_axis_label="ECDF",

# Get ECDF values
sample_lsat_ecdf = ecdf(sample_lsat)
pop_lsat_ecdf = ecdf(pop_lsat)
sample_gpa_ecdf = ecdf(sample_gpa)
pop_gpa_ecdf = ecdf(pop_gpa)

# Plot ECDFs
    legend="Sample LSAT",
    legend="Large sample LSAT",
    legend="Sample GPA",
    legend="Large sample GPA",

a.legend.location = "bottom_right"
p.legend.location = "bottom_right"

bokeh.io.show(bokeh.layouts.row(p, a))

The ECDFs match up fairly well, but if we zoom in on the 50th percentile we can see the means may not overlap. Let’s see if we can compute a confidence interval through bootstrapping. We’ll use the function draw_bs_sample() to draw bootstrap samples of the same size as our sample (n=15), and compute the means of each to create 1000 replicates. Remember, our bootstrap samples are always the same size as the original sample because we want to simulate repeating the same experiment.

def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))

sample_lsat_mean_reps = np.empty(1000)
sample_gpa_mean_reps = np.empty(1000)
for i in range(1000):
    sample_lsat_mean_reps[i] = np.mean(draw_bs_sample(sample_lsat))
    sample_gpa_mean_reps[i] = np.mean(draw_bs_sample(sample_gpa))

print(f'Sample mean LSAT: {np.mean(sample_lsat)}')
print(f'Sample mean GPA: {np.mean(sample_gpa)}')
print(f'Mean LSAT 95% C.I.: {np.percentile(sample_lsat_mean_reps, [2.5, 97.5])}')
print(f'Mean GPA 95% C.I.: {np.percentile(sample_gpa_mean_reps, [2.5, 97.5])}')
print(f'Population mean LSAT: {np.mean(pop_lsat)}')
print(f'Population mean GPA: {np.mean(pop_gpa)}')
Sample mean LSAT: 593.9333333333333
Sample mean GPA: 3.1519999999999997

Mean LSAT 95% C.I.: [580.66666667 609.87166667]
Mean GPA 95% C.I.: [3.08465    3.21666667]

Population mean LSAT: 597.5487804878048
Population mean GPA: 3.1347560975609756

The 95% confidence interval captures the true mean for both the LSAT and GPA data. Let’s compare the ECDFs again. We will create 100 bootstrap sample ECDFs and overlay them over the population ECDF.

p = bokeh.plotting.figure(
    width=400, height=300, x_axis_label="LSAT", y_axis_label="ECDF"

a = bokeh.plotting.figure(
    width=400, height=300, x_axis_label="GPA", y_axis_label="ECDF",

# ECDF values for plotting
sample_lsat_ecdf = ecdf(sample_lsat)
pop_lsat_ecdf = ecdf(pop_lsat)
sample_gpa_ecdf = ecdf(sample_gpa)
pop_gpa_ecdf = ecdf(pop_gpa)

# Make 100 bootstrap samples and plot them
for _ in range(100):
    lsat = draw_bs_sample(sample_lsat)
    gpa = draw_bs_sample(sample_gpa)
    # Add semitransparent ECDFs to the plot

    np.sort(pop_lsat), pop_lsat_ecdf, color=colorcet.b_glasbey_category10[1], alpha=2
    np.sort(pop_gpa), pop_gpa_ecdf, color=colorcet.b_glasbey_category10[1], alpha=2

bokeh.io.show(bokeh.layouts.row(p, a))

What do you notice about the bootstrapped ECDFs? Consider the approximate shape and mean.

Now that we’ve shown the power of bootstrapping, let’s explore some more bootstrapping methods! First we’ll review pairs bootstrapping, which we learned about in class, then go into some new methods.

Pairs Bootstrap

Also covered in Lesson 7.

We use pairs bootstrapping when we have paired data, and we will often want to compute the correlation. The important thing to remember with pairs bootstrapping is that we draw a random sample of pairs, not 2 random samples of each array. For example, if we have paired data in 2 arrays: \(\mathbf{x} = [x_1, x_2,\ldots, x_n]\) and \(\mathbf{y} = [y_1, y_2,\ldots, y_n]\), we must draw the same indices from both arrays so we end up with paired samples:

\begin{align} &\mathbf{\hat{x}} = [x_i, x_j, x_k,\ldots, x_n], \text{ and}\\[1em] &\mathbf{\hat{y}} = [y_i, y_j, y_k,\ldots, y_n]. \end{align}

Take a look at Lesson 7 for a coding example.

Question: Why do we have to keep pairs together to estimate correlation? What result will you likely get for the correlation if your pairs get separated?

2. BCa Confidence Intervals

The BCa confidence interval is the bias-corrected accelerated bootstrap confidence interval. This concept was introduced by Bradley Efron. The BCa interval is designed to automatically create confidence intervals that adjust for underlying higher order effects. In simple terms, it corrects for bias and skew in the distribution of bootstrap estimates.

In this recitation, we will focus on the nonparametric case and show a coding example. If you want to dive into this concept even more, which I highly recommend, take a look at Efron, “Better Bootstrap Confidence Intervals.” Journal of the American Statistical Association 82, no. 397 (1987): 171-85..

Let’s consider the C. elegans egg data example again. We have a dataset of repeated measurements of the areas C. elegans eggs. We know there is a generative distribution for this data, but we don’t know what it is. We want to compute a confidence interval of the mean. For the general case, we’ll call the statistic we want to estimate \(\theta\). As we know, the standard confidence interval is given by:

\begin{align} \theta \in [\hat{\theta} + \hat{\sigma}z^{(\alpha)}, \hat{\theta} + \hat{\sigma}z^{(1 - \alpha)}] \tag{2.1} \end{align}

where \(\hat{\theta}\) is our sample estimate of \(\theta\), \(\hat{\sigma}\) is our estimate of the standard deviation of the distribution of \(\hat{\theta}\), and \(z^{(\alpha)}\) is the \(100 \times \alpha\) percentile point of a standard normal variate.

This standard confidence interval (Equation 2.1) is based on taking the asymptotic normal approximation

\begin{align} (\hat{\theta} - \theta)/\hat{\sigma} \sim N(0, 1) \tag{2.2} \end{align}

In our case, and in many cases, we assume there is a generative distribution, \(g\), for our data. Thus, we can achieve normality and constant standard error by considering the monotone transformation \(\hat{\phi} = g(\hat{\theta})\) and \(\phi = g(\theta)\) instead of \(\hat{\theta}\) and \(\theta\). We assume this transformation is both normalizing and variance stabilizing (defined momentarily). Through this method, we can now compute the confidence interval for \(\phi\) based on:

\begin{align} (\hat{\phi} - \phi)/\tau \sim N(-z_0, 1) \tag{2.3} \end{align}

where \(\tau\) is the constant standard error of \(\hat{\phi}\) and \(z_0\) is the bias constant. To get the confidence interval for \(\theta\), we take the inverse transformation \(\theta = g^{-1}(\phi)\). The great part: all of this is achieved automatically simply through bootstrapping! We do not need to know \(g\) or \(z_0\).

How can we improve on this? That’s where BCa comes in. Like we described previously, through bootstrapping alone we are assuming the transformation \(g\) is both normalizing and variance stabilizing. A variance stabilizing transformation is a transformation such that the variance is no longer related to the mean. For some distributions (ex. Binomial), the variance is dependent on the mean. A variance stabilizing transformation removes this dependence. But, since we do not know the transformation, we do not know for sure that it is variance stabilizing. In the BCa confidence interval, we only need to know that \(g\) is normalizing. This makes it more accurate in many cases!

The BCa incorporates some bias constant, \(z_0\), and some acceleration constant, \(a\), such that the BCa confidence interval is based on:

\begin{align} &(\hat{\phi} - \phi)/\tau \sim N(-z_0\sigma_\phi, \sigma_\phi^2),\\[1em] &\sigma_\phi = 1 + a\phi. \tag{2.4} \end{align}

We can determine \(a\) and \(z_0\) by resampling our data. Here, we are considering the nonparametric case. For the nonparametric case, the acceleration constant for the mean is given by

\begin{align} a = \frac{\sum_i (x_i - \bar{x})^3}{6\left(\sum_i (x_i - \bar{x})^2\right)^{3/2}}. \tag{2.5} \end{align}

We can compute \(z_0\) from our bootstrap replicates. The bias-correction constant, \(z_0\), is:

\begin{align} z_0 = \phi^{-1}(\hat{G}(\hat{\theta})) \tag{2.6} \end{align}

From the plug-in principle, we know that for the nonparametric case we can approximate the CDF \(\hat{G}(\hat{\theta})\) as the ECDF of our bootstrap replicates. So, \(\phi^{-1}(\hat{G}(\hat{\theta}))\) is given by the inverse ECDF of our bootstrap replicates.

This may all be a little confusing now, so let’s practice with the worm egg example! I will review how to compute each parameter as we go.

Let’s load in our worm egg area dataset, digitized from Harvey and Orbidans’s 2011 PLoS One paper.

df_worms = pd.read_csv('../data/c_elegans_egg_xa.csv', comment='#')
food area (sq um)
0 high 1683
1 high 2061
2 high 1792
3 high 1852
4 high 2091

We’ll convert the data we’re interested into arrays:

fed_worms = df_worms.loc[df_worms['food'] == 'high']['area (sq um)'].values
hungry_worms = df_worms.loc[df_worms['food'] == 'low']['area (sq um)'].values

Here we write some helper functions to calculate the parameters for the BCa interval. We will combine them in our main function, compute_bca_ci().

def compute_jackknife_reps(data, statfunction=np.mean):
    '''Returns jackknife resampled replicates for the given data and statistical function'''
    # Set up empty array to store jackknife replicates
    jack_reps = np.empty(len(data))

    # For each observation in the dataset, compute the statistical function on the sample
    # with that observation removed
    for i in range(len(data)):
        jack_sample = np.delete(data, i)
        jack_reps[i] = statfunction(jack_sample)
    return jack_reps

def compute_a(jack_reps):
    '''Returns the acceleration constant a'''
    mean = np.mean(jack_reps)
    return (1/6) * np.divide(np.sum(mean - jack_reps)**3, (np.sum(mean - jack_reps)**2)**(3/2))

def bootstrap_replicates(data, n_reps=1000, statfunction=np.mean):
    '''Computes n_reps number of bootstrap replicates for given data and statistical function'''
    boot_reps = np.empty(n_reps)
    for i in range(n_reps):
        boot_reps[i] = statfunction(draw_bs_sample(data))
    return boot_reps

def compute_z0(data, boot_reps, statfunction=np.mean):
    '''Computes z0 for given data and statistical function'''
    s = statfunction(data)
    return st.norm.ppf(np.sum(boot_reps < s) / len(boot_reps))

def compute_bca_ci(data, alpha_level, n_reps=1000, statfunction=np.mean):
    '''Returns BCa confidence interval for given data at given alpha level'''
    # Compute bootstrap and jackknife replicates
    boot_reps = bootstrap_replicates(data, n_reps, statfunction)
    jack_reps = compute_jackknife_reps(data, statfunction)

    # Compute a and z0
    a = compute_a(jack_reps)
    z0 = compute_z0(data, boot_reps, statfunction)

    # Compute confidence interval indices
    alphas = np.array([alpha_level/2., 1-alpha_level/2.])
    zs = z0 + st.norm.ppf(alphas).reshape(alphas.shape+(1,)*z0.ndim)
    avals = st.norm.cdf(z0 + zs/(1-a*zs))
    ints = np.round((len(boot_reps)-1)*avals)
    ints = np.nan_to_num(ints).astype('int')

    # Compute confidence interval
    boot_reps = np.sort(boot_reps)
    ci_low = boot_reps[ints[0]]
    ci_high = boot_reps[ints[1]]
    return (ci_low, ci_high)

We can now compute the BCa confidence intervals.

print(f'Fed worms mean egg area 95% C.I.:{compute_bca_ci(fed_worms, 0.05)}')
print(f'Hungry worms mean egg area 95% C.I.:{compute_bca_ci(hungry_worms, 0.05)}')
Fed worms mean egg area 95% C.I.:(1791.4318181818182, 1886.8181818181818)
Hungry worms mean egg area 95% C.I.:(2068.9298245614036, 2147.1052631578946)

3. Studentized Bootstrap

The Studentized bootstrap error aims to reduce coverage error of our resampling. It is a good idea to use Studentized bootstrapping to approximate your statistic of interest when you suspect the distribution of your sample may not be close to the actual population distribution. The Studentized bootstrapping method works because even when this is the case, the distribution of the sample statistic divided by estimated standard error could be close to the actual population statistic divided by the population standard error. This the same principle employed in the Student-t test.

The Studentized bootstrap involves taking inner and outer bootstrap samples. The algorithm is as follows:

  1. Compute \(\hat{\theta}\), the sample test statistic

  2. A large number of times (number of reps ≈1000), draw a bootstrap sample. Call this the outer sample:

    1. Compute the statistical functional on the sample

    2. Resample your bootstrap sample ≈50 to 100 times. Call this the inner sample:

      1. Compute the statistical function on the resample

    3. For your inner sample replicates, compute the standard deviation

  3. Compute your statistic of interest from your replicates: t = (outer sample replicates) / (standard deviation of the inner sample replicates)

  4. Compute s = standard deviation of your outer sample replicates

  5. Multiply t by s

  6. Compute the confidence interval from the \(\alpha/2\) and \((1 - \alpha)/2\) percentiles of t*s

Let’s try this out on a dataset comparing bee sperm counts in the presence of a pesticide to control bees. The data come from Straub, et al., Proc. Royal Soc. B 283(1835): 20160506.

df_bees = pd.read_csv('../data/bee_sperm.csv', comment='#')
Specimen Treatment Environment TreatmentNCSS Sample ID Colony Cage Sample Sperm Volume per 500 ul Quantity ViabilityRaw (%) Quality Age (d) Infertil AliveSperm Quantity Millions Alive Sperm Millions Dead Sperm Millions
0 227 Control Cage 1 C2-1-1 2 1 1 2150000 2150000 96.7263814616756 96.726381 14 0 2079617 2.1500 2.079617 0.070383
1 228 Control Cage 1 C2-1-2 2 1 2 2287500 2287500 96.3498079760595 96.349808 14 0 2204001 2.2875 2.204001 0.083499
2 229 Control Cage 1 C2-1-3 2 1 3 87500 87500 98.75 98.750000 14 0 86406 0.0875 0.086406 0.001094
3 230 Control Cage 1 C2-1-4 2 1 4 1875000 1875000 93.2874208336941 93.287421 14 0 1749139 1.8750 1.749139 0.125861
4 231 Control Cage 1 C2-1-5 2 1 5 1587500 1587500 97.7925061050061 97.792506 14 0 1552456 1.5875 1.552456 0.035044

Let’s take a quick look at the ECDFs.

    val='Alive Sperm Millions'

For computing the confidence intervals of the means, we’ll want to convert the data of interest into arrays, and take a look at the sample mean:

control = df_bees.loc[df_bees['Treatment'] == 'Control']['Alive Sperm Millions'].values
pesticide = df_bees.loc[df_bees['Treatment'] == 'Pesticide']['Alive Sperm Millions'].values

print(f'Control mean alive sperm (millions): {np.mean(control)}')
print(f'Pesticide mean alive sperm(millions): {np.mean(pesticide)}')
Control mean alive sperm (millions): 1.8727770758620692
Pesticide mean alive sperm(millions): 1.2997477111111113

Now, we’ll employ the Studentized bootstrap to compute the confidence interval.

def studentized_bootstrap_ci(
    ptiles=(2.5, 97.5),
    thetas = np.empty(n_bs_samples)
    ts = np.zeros(n_bs_samples)

    # Outer sample
    for i in range(n_bs_samples):
        sample = draw_bs_sample(data)
        theta = statfunction(sample)
        thetas[i] = theta

        # Inner sample
        resample_stats = np.empty(n_inner_samples)
        for j in range(n_inner_samples):
            resample = draw_bs_sample(sample)
            resample_stats[j] = statfunction(resample)

        # Compute statistic of interest
        ts[i] = np.sum(resample_stats - theta) / np.std(resample_stats)

    s = np.std(thetas)
    ts = np.sort(ts)

    p1, p2 = np.percentile(s*ts, ptiles)
    ci_high = np.mean(data)-(s*p1)
    ci_low = np.mean(data)-(s*p2)
    return (ci_low, ci_high)

studentized_ctrl_ci = studentized_bootstrap_ci(control)
studentized_pest_ci = studentized_bootstrap_ci(pesticide)

print(f"Control Alive Sperm Millions 95% C.I.: {studentized_ctrl_ci}")
print(f"Pesticide Alive Sperm Millions 95% C.I.: {studentized_pest_ci}")
Control Alive Sperm Millions 95% C.I.: (1.6422482149046422, 2.0943950244253298)
Pesticide Alive Sperm Millions 95% C.I.: (1.1189521249528647, 1.4847851245037909)

Let’s get a plot of these confidence intervals to compare.

def plot_with_error_bars(means, confs, names, **kwargs):
    """Make a horizontal plot of means/conf ints with error bars."""
    frame_height = kwargs.pop("frame_height", 150)
    frame_width = kwargs.pop("frame_width", 450)

    p = bokeh.plotting.figure(
        y_range=names, frame_height=frame_height, frame_width=frame_width, **kwargs

    p.circle(x=means, y=names)
    for conf, name in zip(confs, names):
        p.line(x=conf, y=[name, name], line_width=2)

    return p

means = [np.mean(control), np.mean(pesticide)]
confs = [studentized_ctrl_ci, studentized_pest_ci]
names = ['control', 'pesticide']

bokeh.io.show(plot_with_error_bars(means, confs, names, x_axis_label='alive sperm (millions)'))

Success!! We can see the mean alive sperm count in millions for the pesticide-exposed bees is likely lower than the control bees, but the confidence intervals do overlap so we should temper that statement.

For comparison, we can compute the non-Studentized confidence intervals.

ctrl_reps = [np.mean(draw_bs_sample(control)) for _ in range(10000)]
pest_reps = [np.mean(draw_bs_sample(pesticide)) for _ in range(10000)]

ctrl_ci = np.percentile(ctrl_reps, [2.5, 97.5])
pest_ci = np.percentile(pest_reps, [2.5, 97.5])

print(f"Control Alive Sperm Millions 95% C.I.: {ctrl_ci}")
print(f"Pesticide Alive Sperm Millions 95% C.I.: {pest_ci}")
Control Alive Sperm Millions 95% C.I.: [1.67537854 2.07238755]
Pesticide Alive Sperm Millions 95% C.I.: [1.11075824 1.49738931]

Let’s plot.

means += means
confs += [ctrl_ci, pest_ci]
names = ['studentized ctrl', 'studentized pest', 'ctrl', 'pest']

bokeh.io.show(plot_with_error_bars(means, confs, names, x_axis_label='alive sperm (millions)'))

The Studentized confidence intervals are fairly close, but slightly larger. In this case, our sample distribution is likely close to the population distribution, so the rigor of the studentized bootstrap may not dramatically improve our results. However, when the distributions are different, the inner sample will attempt to correct for this.

Computing environment

%load_ext watermark
%watermark -v -p numpy,scipy,numba,bokeh,colorcet,jupyterlab
CPython 3.7.4
IPython 5.8.0

numpy 1.17.4
scipy 1.3.1
numba 0.45.1
bokeh 1.3.4
colorcet 1.0.0
jupyterlab 1.2.0