Introduction to the bootstrap¶
[1]:
import numpy as np
import pandas as pd
import numba
import bokeh_catplot
import colorcet
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
We have discussed the plug-in principle, in which we approximate the distribution function of the generative model, \(F\), with the empirical distribution function, \(\hat{F}\). We then approximate all statistical functionals of \(F\) using \(\hat{F}\); \(T(F)\approx T(\hat{F})\). Using the frequentist perspective we talked about using the plug-in principle and bootstrapping to compute confidence intervals and do hypothesis testing. In this lesson, we will go over how we implement this in practice. We will work with some real data on the reproductive health of male bees that have been treated with pesticide.
It is important as you go through this lesson to remember that you do not know what the generative distribution is. We are only approximating it using the plug-in principle. Since we do not know what \(F\) is, we do not know what its parameters are, so we therefore cannot estimate them. Rather, we can study summary statistics, which are plug-in estimates for statistical functionals of the generative distribution, such as means, variances, or even the ECDF itself, and how they may vary from experiment to experiment. In this sense, we are doing nonparametric inference. We do not have a parametrized model in mind, but are investigating how experimental results might vary upon repetition of the experiment.
The data set¶
Neonicotinoid pesticides are thought to have inadvertent effects on service-providing insects such as bees. A study of this was featured in the New York Times. The original paper by Straub and coworkers may be found here. The authors put their data in the Dryad repository, which means that it is free to all to work with!
(Do you see a trend here? If you want people to think deeply about your results, explore them, learn from them, in general further science with them, make your data publicly available. Strongly encourage the members of your lab to do the same.)
We will look at the sperm quality of drone bees using this data set. First, let’s load in the data set and check it out.
[2]:
df = pd.read_csv('../data/bee_sperm.csv', comment='#')
df.head()
[2]:
Specimen | Treatment | Environment | TreatmentNCSS | Sample ID | Colony | Cage | Sample | Sperm Volume per 500 ul | Quantity | ViabilityRaw (%) | Quality | Age (d) | Infertil | AliveSperm | Quantity Millions | Alive Sperm Millions | Dead Sperm Millions | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 227 | Control | Cage | 1 | C2-1-1 | 2 | 1 | 1 | 2150000 | 2150000 | 96.7263814616756 | 96.726381 | 14 | 0 | 2079617 | 2.1500 | 2.079617 | 0.070383 |
1 | 228 | Control | Cage | 1 | C2-1-2 | 2 | 1 | 2 | 2287500 | 2287500 | 96.3498079760595 | 96.349808 | 14 | 0 | 2204001 | 2.2875 | 2.204001 | 0.083499 |
2 | 229 | Control | Cage | 1 | C2-1-3 | 2 | 1 | 3 | 87500 | 87500 | 98.75 | 98.750000 | 14 | 0 | 86406 | 0.0875 | 0.086406 | 0.001094 |
3 | 230 | Control | Cage | 1 | C2-1-4 | 2 | 1 | 4 | 1875000 | 1875000 | 93.2874208336941 | 93.287421 | 14 | 0 | 1749139 | 1.8750 | 1.749139 | 0.125861 |
4 | 231 | Control | Cage | 1 | C2-1-5 | 2 | 1 | 5 | 1587500 | 1587500 | 97.7925061050061 | 97.792506 | 14 | 0 | 1552456 | 1.5875 | 1.552456 | 0.035044 |
We are interested in the number of alive sperm in the samples. Let’s first explore the data by making ECDFs for the two groups we will compare, those treated with pesticide (“Pesticide”) and those that are not (“Control”).
[3]:
p = bokeh_catplot.ecdf(df, cats='Treatment', val='Alive Sperm Millions')
bokeh.io.show(p)
The visual inspection of the ECDFs suggests that indeed the control drones have more alive sperm than those treated with pesticide. But how variable would these ECDFs be if we repeated the experiment?
Utility plotting functions¶
Since we will use Numpy arrays for much of the analysis going forward, and for our purposes here, we wish to avoid placing them in data frames to use bokeh_catplot
to plot the ECDFs, we will write a utility function to quickly plot an ECDF from a Numpy array, possibly populating an existing plot.
[4]:
def plot_ecdf(data, color="#4e79a7", alpha=1, p=None, **kwargs):
"""Plot an ECDF"""
# Set up kwargs for the marker
marker_kwargs = {"color": color, "alpha": alpha}
# Get the legend label if needed
if "legend_label" in kwargs:
marker_kwargs["legend_label"] = kwargs.pop("legend_label")
# Instantiate new plot if none provided
if p is None:
p = bokeh.plotting.figure(**kwargs)
# Populate ECDF glyphs
p.circle(np.sort(data), np.arange(1, len(data) + 1) / len(data), **marker_kwargs)
return p
Bootstrap samples and ECDFs¶
To address this question, we can generate bootstrap samples from the experimental data and make lots of ECDFs. We can then plot them all to see how the ECDF might vary. Recall that a bootstrap sample from a data set of \(n\) repeated measurements is generated by drawing \(n\) data points out of the original data set with replacement. The rg.choice()
function allows for randomly drawing elements out of an array. Let’s generate 100 bootstrap samples and plot their ECDFs to visualize
how our data set might change as we repeat the experiment.
[5]:
rg = np.random.default_rng()
# Set up Numpy arrays for convenience (also much better performance)
alive_ctrl = df.loc[df["Treatment"] == "Control", "Alive Sperm Millions"].values
alive_pest = df.loc[df["Treatment"] == "Pesticide", "Alive Sperm Millions"].values
# ECDF values for plotting
ctrl_ecdf = np.arange(1, len(alive_ctrl)+1) / len(alive_ctrl)
pest_ecdf = np.arange(1, len(alive_pest)+1) / len(alive_pest)
# Make 100 bootstrap samples and plot them
for _ in range(100):
bs_ctrl = rg.choice(alive_ctrl, size=len(alive_ctrl))
bs_pest = rg.choice(alive_pest, size=len(alive_pest))
# Add semitransparent ECDFs to the plot
p.circle(np.sort(bs_ctrl), ctrl_ecdf, color=colorcet.b_glasbey_category10[0], alpha=0.02)
p.circle(np.sort(bs_pest), pest_ecdf, color=colorcet.b_glasbey_category10[1], alpha=0.02)
bokeh.io.show(p)
From this graphical display, we can already see that the ECDFs do not strongly overlap in 100 bootstrap samples, so there is likely a real difference between the two treatments.
Speeding up sampling with Numba¶
We can make sampling faster (though note that the slowness in generating the above plot is not in resampling, but in adding glyphs to the plot) by employing just-in-time compilation, wherein the Python code is compiled at runtime into machine code. This results in a substantial speed boost. Numba is a powerful tool for this purpose, and we will use it. Bear in mind, though, that not all Python code is Numba-able. Specifically, only the Mersenne Twister bit
generator is supported, so we need to use np.random.choice()
in Numba’s code to do our resampling.
In order to jit-in-time compile a function, we need to decorate its definition with the @numba.njit
decorator, which tells Python to use Numba to just-in-time compile the function. Note that Numba does not currently work with Pandas data frames; it typically requires Numpy arrays to do any operations on arrays.
[6]:
@numba.njit
def draw_bs_sample(data):
"""Draw a bootstrap sample from a 1D data set."""
return np.random.choice(data, size=len(data))
Let’s quickly quantify the speed boost. Before we do the timing, we will run the Numba’d version once to give it a chance to do the JIT compilation.
[7]:
# Run once to JIT
draw_bs_sample(alive_ctrl)
print('Using the Numpy default RNG (PCG64):')
%timeit rg.choice(alive_ctrl, size=len(alive_ctrl))
print('\nUsing the Numpy Mersenne Twister:')
%timeit np.random.choice(alive_ctrl, size=len(alive_ctrl))
print("\nUsing a Numba'd Mersenne Twister:")
%timeit draw_bs_sample(alive_ctrl)
Using the Numpy default RNG (PCG64):
26.8 µs ± 650 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using the Numpy Mersenne Twister:
29.7 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using a Numba'd Mersenne Twister:
2.9 µs ± 80.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That’s a factor of 10 or so increase. That can make a big difference when generating lots of bootstrap samples.
Bootstrap replicates and confidence intervals¶
We have plotted the ECDF of the data, which is instructive, but we would also like to get plug-in estimates for the generative distribution. Remember, when doing nonparametric plug-in estimates, we plug in the ECDF for the CDF. We do not need to specify what the distribution (described mathematically by the CDF, or equivalently by the PDF) is, just that we approximate it by the empirical distribution.
In lecture, we laid out the procedure to compute a confidence interval.
Generate \(B\) independent bootstrap samples.
Compute the plug-in estimate of the statistical functional of interest for each bootstrap sample to get the bootstrap replicates.
The \(100(1-\alpha)\) percent confidence interval consists of the percentiles \(100\alpha/2\) and \(100(1 - \alpha/2)\) of the bootstrap replicates.
A key step here is computing the bootstrap replicate. We will write a couple functions for this. The first is generic; it takes as an argument the function to be used to compute the statistic of interest (e.g., np.mean
or np.median
). We will also write a few functions for commonly computed statistics, which enables us to use Numba to greatly speed up the process of generating bootstrap replicates.
[8]:
def draw_bs_reps(data, stat_fun, size=1):
"""Draw boostrap replicates computed with stat_fun from 1D data set."""
return np.array([stat_fun(draw_bs_sample(data)) for _ in range(size)])
@numba.njit
def draw_bs_reps_mean(data, size=1):
"""Draw boostrap replicates of the mean from 1D data set."""
out = np.empty(size)
for i in range(size):
out[i] = np.mean(draw_bs_sample(data))
return out
@numba.njit
def draw_bs_reps_median(data, size=1):
"""Draw boostrap replicates of the median from 1D data set."""
out = np.empty(size)
for i in range(size):
out[i] = np.median(draw_bs_sample(data))
return out
@numba.njit
def draw_bs_reps_std(data, size=1):
"""Draw boostrap replicates of the standard deviation from 1D data set."""
out = np.empty(size)
for i in range(size):
out[i] = np.std(draw_bs_sample(data))
return out
Now, let’s get bootstrap replicates for the mean of each of the two treatments.
[9]:
bs_reps_mean_ctrl = draw_bs_reps_mean(alive_ctrl, size=10000)
bs_reps_mean_pest = draw_bs_reps_mean(alive_pest, size=10000)
We can now compute the confidence intervals by computing the percentiles using the np.percentile()
function.
[10]:
# 95% confidence intervals
mean_ctrl_conf_int = np.percentile(bs_reps_mean_ctrl, [2.5, 97.5])
mean_pest_conf_int = np.percentile(bs_reps_mean_pest, [2.5, 97.5])
print("""
Mean alive sperm count 95% conf int control (millions): [{0:.2f}, {1:.2f}]
Mean alive sperm count 95% conf int treatment (millions): [{2:.2f}, {3:.2f}]
""".format(*(tuple(mean_ctrl_conf_int) + tuple(mean_pest_conf_int))))
Mean alive sperm count 95% conf int control (millions): [1.68, 2.07]
Mean alive sperm count 95% conf int treatment (millions): [1.12, 1.49]
We can also use the bootstrap replicates to plot the probability distribution of mean alive sperm count. Remember: This is not a confidence interval on a parameter value. That does not make sense on frequentist statistics. It is the confidence interval describing what we would get as a plug-in estimate for the mean if we did the experiment over and over again.
When we plot the ECDF of the bootstrap replicates, we thin them so as not to choke the browser with too many points. Since we will do this again, we write a quick function to do it.
[11]:
def plot_bs_reps_ecdf(ctrl, pest, thin):
"""Make plot of bootstrap ECDFs for control and pesticide treatment."""
x_ctrl = np.sort(ctrl)[::thin]
x_pest = np.sort(pest)[::thin]
p = plot_ecdf(
x_ctrl,
color=colorcet.b_glasbey_category10[0],
legend_label="control",
frame_height=200,
frame_width=350,
x_axis_label="mean alive sperm (millions)",
y_axis_label="ECDF",
)
p = plot_ecdf(
x_pest, p=p, color=colorcet.b_glasbey_category10[1], legend_label="pesticide"
)
return p
p = plot_bs_reps_ecdf(bs_reps_mean_ctrl, bs_reps_mean_pest, 50)
p.legend.location = "top_left"
bokeh.io.show(p)
These are both nice, Gaussian distributions with almost no overlap in the tails. We also saw in the computation of the 95% confidence intervals above that there is no overlap. This is expected, the mean alive sperm count should be Gaussian distributed as per the central limit theorem.
We can do the same procedure for other statistical quantities that do not follow the central limit theorem. The procedure is exactly the same. We will do it for the median.
[12]:
# Get the bootstrap replicates
bs_reps_median_ctrl = draw_bs_reps_median(alive_ctrl, size=10000)
bs_reps_median_pest = draw_bs_reps_median(alive_pest, size=10000)
# 95% confidence intervals
median_ctrl_conf_int = np.percentile(bs_reps_median_ctrl, [2.5, 97.5])
median_pest_conf_int = np.percentile(bs_reps_median_pest, [2.5, 97.5])
print("""
Median alive sperm count 95% conf int control (millions): [{0:.2f}, {1:.2f}]
Median alive sperm count 95% conf int treatment (millions): [{2:.2f}, {3:.2f}]
""".format(*(tuple(median_ctrl_conf_int) + tuple(median_pest_conf_int))))
# Plot ECDFs of bootstrap replicates
p = plot_bs_reps_ecdf(bs_reps_median_ctrl, bs_reps_median_pest, 50)
p.legend.location = 'top_left'
bokeh.io.show(p)
Median alive sperm count 95% conf int control (millions): [1.63, 2.08]
Median alive sperm count 95% conf int treatment (millions): [0.96, 1.47]
The results are similar, and we clearly see clear non-normality in the ECDFs, since the central limit theorem does not hold here.
Computing environment¶
[13]:
%load_ext watermark
%watermark -v -p numpy,pandas,numba,bokeh,colorcet,bokeh_catplot,jupyterlab
CPython 3.7.5
IPython 7.1.1
numpy 1.17.3
pandas 0.24.2
numba 0.46.0
bokeh 1.4.0
colorcet 1.0.0
bokeh_catplot 0.1.6
jupyterlab 1.1.4