Bootstrapping MLEs with dependent variables

This tutorial was prepared by Suzy Beeler and edited and enhanced by Rosita Fu with assistance from Justin Bois.

Dataset download


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade bebi103 tqdm watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import numba
import tqdm

import bebi103

import bokeh.io
bokeh.io.output_notebook()

import holoviews as hv
hv.extension('bokeh')

import panel as pn
pn.extension()

bebi103.hv.set_defaults()

Thus far, we have primarily discussed how to bootstrap univariate data. We have also touched on bootstrapping data with multiple variables where none is considered dependent on another (e.g., beak width and beak depth). In this recitation, we will discuss ways to conduct bootstrapping when one variable is dependent and the other is independent. As a first pass, we will start with an approach we have already seen, the pairs bootstrap. We will discuss why this approach is not preferred when working with dependent variables, and then consider some alternatives. So let's dive in…

The data set

For this recitation, we will revisit the data set we discussed in Lesson 8, from Matt Good and coworkers (Science, 2013). As a reminder, they were exploring to what extent (if any) spindle length depends on droplet size in the artificial cells they constructed. Let's plot the data below to refamiliarize ourselves.

[2]:
df = pd.read_csv(os.path.join(data_path, "good_invitro_droplet_data.csv"), comment="#")

scatter = hv.Scatter(
    data=df,
    kdims=["Droplet Diameter (um)"],
    vdims=["Spindle Length (um)"]
)

scatter
[2]:

It will be faster to work with NumPy arrays throughout, so let’s save those here.

[3]:
d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values
data = df[['Droplet Diameter (um)', 'Spindle Length (um)']].values

Pairs bootstrap

Under a frequentist interpretation of probability, if we would like to understand how much our statistical functionals or parameter estimates would vary were we to repeat the experiment, we need to bootstrap the data set. As a first pass, we will revisit the pairs bootstrap, with the code from Lesson 7 reproduced here.

[4]:
@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))

@numba.njit
def draw_bs_pairs(x, y):
    """Draw a pairs bootstrap sample."""
    inds = np.arange(len(x))
    bs_inds = draw_bs_sample(inds)

    return x[bs_inds], y[bs_inds]
[5]:
# draw bootstrapped pairs
bs_d, bs_ell = draw_bs_pairs(d, ell)

Let’s take it for a spin and see how the bootstrap sample looks.

[6]:
# make a clickable button and checkbox for comparison
button_pairs = pn.widgets.Button(name="Pairs Bootstrap!", button_type="primary")
empirical = pn.widgets.Checkbox(name="original data", value=False)

# define a bootstrap and plot function
@pn.depends(button_pairs.param.clicks, empirical.param.value)
def bs_and_plot(_, empirical):

    bs_d, bs_ell = draw_bs_pairs(d, ell)

    p = hv.Scatter(data=(bs_d, bs_ell),
                      kdims=["Droplet Diameter (um)"],
                      vdims=["Spindle Length (um)"]
                     ).opts(
                        xlim=(20, 225),
                        ylim=(18, 50),
                        color="#483d8b")
    if empirical == True:
        return hv.Scatter(data=(d, ell)) * p
    return p

plot = pn.Column(button_pairs, bs_and_plot)
pn.Row(plot, pn.Column(pn.Spacer(height=80), empirical))
[6]:

Watch the 200+ µm point: it either gets chosen or it doesn't… does this model the variability well? (We will return to this question shortly.) For now, let's see how our maximum likelihood parameter estimates might change with different bootstrap samples. As a reminder, this is the model for spindle length that we have in mind, with parameters \(\gamma\), \(\phi\), and \(\sigma\):

\begin{align} &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma) \;\forall i. \end{align}

[7]:
def theor_spindle_length(gamma, phi, d):
    """Compute spindle length using mathematical model"""
    return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)


def log_likelihood(params, d, ell):
    """Log likelihood of spindle length model."""
    gamma, phi, sigma = params

    if gamma <= 0 or gamma > 1 or phi <= 0 or sigma <= 0:
        return -np.inf

    mu = theor_spindle_length(gamma, phi, d)
    return np.sum(st.norm.logpdf(ell, mu, sigma))


def resid(params, d, ell):
    """Residual for spindle length model."""
    return ell - theor_spindle_length(*params, d)


def spindle_mle(data):
    """Compute MLE for parameters in spindle length model."""
    # Unpack data
    d = data[:, 0]
    ell = data[:, 1]

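    # Least squares minimizes the residual sum of squares; for a Gaussian
    # likelihood with constant σ, this is equivalent to maximizing the
    # likelihood over (γ, φ)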
    res = scipy.optimize.least_squares(
        resid, np.array([0.5, 35]), args=(d, ell), bounds=([0, 0], [1, np.inf])
    )

    # Compute residual sum of squares from optimal params
    rss_mle = np.sum(resid(res.x, d, ell)**2)

    # Compute MLE for sigma
    sigma_mle = np.sqrt(rss_mle / len(d))

    return tuple([x for x in res.x] + [sigma_mle])

With these in hand, we can now get our MLEs.

[8]:
mle_params = spindle_mle(data)

# save these values for later
mle_gamma = mle_params[0]
mle_phi = mle_params[1]
mle_sigma = mle_params[2]

print(f'Value of γ: {np.round(mle_gamma, decimals=2)}')
print(f'Value of φ: {np.round(mle_phi, decimals=2)}')
print(f'Value of σ: {np.round(mle_sigma, decimals=2)}')
Value of γ: 0.86
Value of φ: 38.25
Value of σ: 3.75
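
As a sanity check (our addition, not part of the original recitation), we can verify that the least-squares result is consistent with the log_likelihood function we wrote above. Since the model is Gaussian with constant σ, maximizing the likelihood over γ and φ is equivalent to minimizing the residual sum of squares, so nudging any parameter away from the MLE should only lower the log likelihood.

ll_mle = log_likelihood(mle_params, d, ell)

# Perturbing each parameter by 5% should decrease the log likelihood
for perturbed in [
    (mle_gamma * 1.05, mle_phi, mle_sigma),
    (mle_gamma, mle_phi * 1.05, mle_sigma),
    (mle_gamma, mle_phi, mle_sigma * 1.05),
]:
    assert log_likelihood(perturbed, d, ell) < ll_mle

print(f"Log likelihood at the MLE: {ll_mle:.2f}")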

Now let’s add in our parameter estimates to our interactive plot. We will add markdown panes to display the estimates.

[9]:
button_pairs_MLE = pn.widgets.Button(name="Pairs Bootstrap + MLE", button_type="primary")
gamma = pn.pane.Markdown("γ*:")
phi = pn.pane.Markdown("φ*:")

@pn.depends(button_pairs_MLE.param.clicks)
def bs_MLE_and_plot(_):
    # construct a pairs bootstrap sample and make the scatter plot
    bs_d, bs_ell = draw_bs_pairs(d, ell)

    scatter = hv.Scatter(
        data=(bs_d, bs_ell),
        kdims=["Droplet Diameter (um)"],
        vdims=["Spindle Length (um)"]
    ).opts(xlim=(20, 225), ylim=(18, 50), color="#483d8b")

    # do MLE and update parameter values to display
    mle_params = spindle_mle(np.vstack((bs_d, bs_ell)).T)
    gamma.object = f"γ*: {np.round(mle_params[0], decimals=2)}"
    phi.object = f"φ*: {np.round(mle_params[1], decimals=2)} µm"

    # compute theoretical curve using MLE parameters
    d_theor = np.linspace(0, 250, 200)
    ell_theor = theor_spindle_length(*mle_params[:-1], d_theor)

    # overlay the theoretical curve and the bootstrap replicate data
    return scatter * hv.Curve(
        data=(d_theor, ell_theor)
        ).opts(
        color="orange",
        line_width=2
    )

plot = pn.Column(button_pairs_MLE, bs_MLE_and_plot)
parameters = pn.Column(pn.Spacer(height=140), gamma, phi)
pn.Row(plot, pn.Spacer(width=20), parameters)
[9]:

This allows us to readily see that the parameter values don't change much from one bootstrap sample to the next. This gives us a nice intuitive feel for how we expect our estimates to vary from experiment to experiment. Of course, we would like to put some numbers to this, so let's compute 95% confidence intervals for these estimates.

[10]:
num_trials = 1000
bs_MLEs = np.zeros([3, num_trials])

for i in tqdm.tqdm(range(num_trials)):
    bs_d, bs_ell = draw_bs_pairs(d, ell)
    mle_params = spindle_mle(np.vstack((bs_d, bs_ell)).T)
    bs_MLEs[:,i] = mle_params

CIs_pairs = np.percentile(bs_MLEs, [2.5, 97.5], axis=1)
lower_CI = CIs_pairs[0]
upper_CI = CIs_pairs[1]

CIs_pairs
100%|████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 173.61it/s]
[10]:
array([[ 0.82868667, 37.47646903,  3.54828632],
       [ 0.89512829, 39.15720805,  3.95445391]])
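
The first row holds the lower bounds and the second the upper bounds, with columns γ, φ, and σ. For readability, here is a small labeling loop (a convenience we've added):

for label, low, high in zip(["γ", "φ (µm)", "σ (µm)"], CIs_pairs[0], CIs_pairs[1]):
    print(f"{label}: [{low:.2f}, {high:.2f}]")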

Great! This validates what we saw in the interactive figure, where the value of \(\gamma\) is somewhere around 0.85 and the value of \(\phi\) is just shy of 40 µm. Lastly, let's visualize how the curves parametrized by our MLEs vary from one bootstrap sample to another.

[11]:
scatter = hv.Scatter(
    data=df,
    kdims=["Droplet Diameter (um)"],
    vdims=["Spindle Length (um)"]
)

# Pop out Bokeh plot
p = hv.render(scatter)

# Plot theoretical curves (do every tenth)
d_theor = np.linspace(0, 250, 200)
for params in bs_MLEs[:,::10].T:
    p.line(d_theor, theor_spindle_length(*params[:-1], d_theor), line_width=1, alpha=0.1, color='orange')
p.title.text = "empirical data + bootstrapped MLE"  # we will reuse `scatter` later

bokeh.io.show(p)
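
An aside before we critique this approach: recall the 200+ µm point flickering in and out of the bootstrap samples above. We can quantify that. A given data point is absent from a bootstrap sample with probability \((1-1/n)^n\), which approaches \(1/\mathrm{e} \approx 0.37\) as \(n\) grows. A quick computation (our addition):

n = len(d)
print(f"P(a given point is absent from a bootstrap sample): {(1 - 1/n)**n:.3f}")
print(f"Large-n limit, 1/e: {np.exp(-1):.3f}")

So any single observation, including the most extreme droplet, sits out of roughly a third of the pairs bootstrap samples.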

But let’s now take a step back: remember how I prefaced this by saying the pairs bootstrap is not the way to go with dependent variables? Let’s take a minute to think about why.

In this case, the spindle length depends on the droplet diameter. We aren't really interested in how the droplet diameter itself varies, but rather in how spindle length varies at a given droplet diameter. Accordingly, it's not really appropriate to bootstrap both our x and y values when only the variability in the y values is of interest. This stands in contrast to the bird beak data, where both the length and the depth were random variables whose variation we wished to explore. This leads us to the next approach…

Resampling residuals

First we’ll write a function to plot any set of residuals. This will be useful later.

[12]:
def residuals_plot(x, residuals, color="#483d8b", zero_color="red"):
    """Plot residuals versus x, with a horizontal line at zero."""
    r = hv.Scatter(
        data=(x, residuals),
        kdims=["Droplet Diameter (um)"],
        vdims=["Residual length (um)"],
    ).opts(
        # Fix the y-range from the empirical residuals (the global epsilon_hat,
        # computed below) so the axes do not jump from sample to sample
        ylim=(epsilon_hat.min() - 2, epsilon_hat.max() + 2),
        color=color,
    )
    zero = hv.Path(data=(x, np.zeros(len(x)))).opts(color=zero_color)
    return r * zero

Rather than resampling which data points to include in each bootstrap replicate, resampling the residuals allows us to formulate a bootstrap sample that reflects the variability we expect to see in the response variable (spindle length, in our case). The key assumption is that the residuals are independent and identically distributed, i.e., the scatter about the theoretical curve is the same at every droplet diameter, which is exactly what our homoscedastic model says. The procedure goes as follows.

1. Perform MLE once on the original dataset.

[13]:
hv.extension("bokeh")

# fit the model
mle_params = spindle_mle(data)

d_theor = np.linspace(0, 225, 1000)
y_theor = theor_spindle_length(*mle_params[:-1], d_theor)
scatter * hv.Curve((d_theor, y_theor)).opts(color="red", title="MLE fit to empirical data")
[13]:

2. Retrieve residuals!

We will use the following variables. The array \(\hat{\epsilon}\) holds the empirical residuals and is what we'll effectively be bootstrapping out of. The bootstrapped residuals, stored in \(\epsilon^{*}\), are added back to the fitted values \(\hat{y}_i\), obtained by evaluating the model with our initial MLEs at the empirical diameters \(d_i\).

\begin{align} \hat{y}_i &= \mathrm{theor \hspace{1mm} length}(\hat{\gamma},\hat{\phi},d_i)\\[0.5em] \hat{\epsilon}_i &= l_i - \hat{y}_i \\[0.5em] &\mathrm{bootstrapping...}\\[0.5em] y_i^* &= \hat{y}_i + \epsilon_i^{*} \end{align}

[14]:
# get our spindle length predictions,
y_hat = theor_spindle_length(*mle_params[:-1], d)

# get our residual values
epsilon_hat = ell - y_hat
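
Before resampling, it's worth a quick check (our addition) that the residuals behave the way the model assumes: centered on zero, with a spread close to the MLE of σ.

print(f"Mean residual: {epsilon_hat.mean():.3f} µm")
print(f"Std of residuals: {epsilon_hat.std():.3f} µm (MLE of σ: {mle_sigma:.2f} µm)")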

3. Bootstrap residuals.

[15]:
# get resampled residuals
epsilon_star = np.random.choice(epsilon_hat, replace=True, size=len(epsilon_hat))

# add 'em back in to get resampled y values
y_star = y_hat + epsilon_star

Let’s see if we did that correctly…

[16]:
hv.extension("bokeh")

# Check out our resampled data set
scatter.opts(color="#3a86ba") * hv.Scatter(
        data=(d, y_star),
        kdims=["Droplet Diameter (um)"],
        vdims=["Spindle Length (um)"],
        label="bootstrapped"
).opts(color="#483d8b", title="bootstrapped residuals overlay")
[16]:

Time for a button!

[17]:
button_resid = pn.widgets.Button(
    name="Residual Bootstrap", button_type="primary", width=700
)
empirical = pn.widgets.Checkbox(name="original data")


@pn.depends(button_resid.param.clicks, empirical.param.value)
def bs_resid_and_plot(_, empirical=False):
    epsilon_star = np.random.choice(
        epsilon_hat, replace=True, size=len(epsilon_hat)  # resampled residuals
    )
    y_star = y_hat + epsilon_star  # resampled y-values

    r = residuals_plot(d, epsilon_star, color="#483d8b")  # bootstrapped resids

    p = hv.Scatter(
        data=(d, y_star),  # bootstrapped sample
        kdims=["Droplet Diameter (um)"],
        vdims=["Spindle Length (um)"],
    ).opts(xlim=(20, 225), ylim=(18, 50), color="#483d8b")
    d_theor = np.linspace(0, 250, 200)
    ell_theor = theor_spindle_length(*mle_params[:-1], d_theor)
    fit = hv.Curve(data=(d_theor, ell_theor)).opts(color="red", line_width=2)

    if empirical == True:
        emp_data = hv.Scatter(data=(d, ell)).opts(color="#3a86ba")
        emp_resid = residuals_plot(d, epsilon_hat, color="#3a86ba")
        return emp_data * p * fit + emp_resid * r

    return p * fit + r


pn.Column(pn.Row(button_resid, empirical), bs_resid_and_plot)
[17]:

We can see that the lone droplet of over 200 µm diameter takes on more exciting values, since it's unlikely to be paired with its original residual. This approach to bootstrapping captures the variability we are truly interested in!
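
We can put a number on this (a check of our own): under residual resampling, the resampled length at any droplet, including the largest one, has the spread of the pooled residuals.

i_big = np.argmax(d)  # index of the largest droplet

# Resampled spindle lengths at the largest droplet
y_star_big = y_hat[i_big] + np.random.choice(epsilon_hat, size=10_000)

print(f"Std of resampled lengths at the largest droplet: {y_star_big.std():.2f} µm")
print(f"Std of pooled residuals: {epsilon_hat.std():.2f} µm")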

4. Perform MLE on bootstrapped samples.

Let’s make a widget to do this once and see how our γ and φ change. Here are the important additional lines of code:

[18]:
mle_params_star = spindle_mle(np.vstack((d, y_star)).T)
ell_theor_star = theor_spindle_length(*mle_params_star[:-1], d_theor)

  • y_star holds our bootstrapped spindle lengths.

  • ell_theor_star will store the predicted lengths from the MLE parameters.

[19]:
button_resid_mle = pn.widgets.Button(name="Residual Bootstrap + MLE", button_type="primary")
estimate = pn.widgets.Checkbox(name="initial estimate", value=False)
gamma, phi = pn.pane.Markdown("γ*:"), pn.pane.Markdown("φ*:")

@pn.depends(button_resid_mle.param.clicks, estimate.param.value)
def bs_resid_mle_and_plot(_, estimate):
    # get resampled residuals
    epsilon_star = np.random.choice(epsilon_hat, replace=True, size=len(epsilon_hat))

    # get resampled y values
    y_star = y_hat + epsilon_star

    # perform MLE
    d_theor = np.linspace(0, 250, 200)
    mle_params_star = spindle_mle(np.vstack((d, y_star)).T)
    ell_theor_star = theor_spindle_length(*mle_params_star[:-1], d_theor)

    gamma.object = f"γ*: {np.round(mle_params_star[0], decimals=2)}"
    phi.object = f"φ*: {np.round(mle_params_star[1], decimals=2)} µm"

    # plot bootstrapped sample
    p = hv.Scatter(data=(d, y_star),
                      kdims=["Droplet Diameter (um)"],
                      vdims=["Spindle Length (um)"]
                     ).opts(
                        xlim=(20, 225),
                        ylim=(18, 50),
                        color="#483d8b"
                    )
    fit = hv.Curve(data=(d_theor, ell_theor_star)).opts(color="orange", line_width=2)

    if estimate == True:
        ell_theor = theor_spindle_length(mle_gamma, mle_phi, d_theor)
        initial = hv.Curve(data=(d_theor, ell_theor)).opts(color="red", line_width=2)
        return initial * p * fit
        return initial * p * fit
    return p * fit

plot = pn.Column(button_resid_mle, bs_resid_mle_and_plot)
parameters = pn.Column(pn.Spacer(height=20), gamma, phi)
pn.Row(plot, pn.Column(pn.Spacer(height=50), estimate, parameters))
[19]:

Great. We can now move on to obtaining our confidence intervals!

5. Pete and Repeat went out on a boat…

For this, let’s make a numba’ed function that we can call repeatedly.

[20]:
@numba.njit
def resampled_residuals(y_fit, residuals):
    """Produces resampled fitted response variable with the provided residuals."""
    bs_residuals = np.random.choice(residuals, size=len(residuals))
    return y_fit + bs_residuals

num_trials = 1000
bs_MLEs_rr = np.zeros([3, num_trials])

for i in tqdm.tqdm(range(num_trials)):
    y_star = resampled_residuals(y_hat, epsilon_hat)
    mle_params = spindle_mle(np.vstack((d, y_star)).T)
    bs_MLEs_rr[:,i] = mle_params

CIs_resamp_resids = np.percentile(bs_MLEs_rr, [2.5, 97.5], axis=1)
CIs_resamp_resids
100%|████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 177.65it/s]
[20]:
array([[ 0.82908853, 37.58987543,  3.52610349],
       [ 0.89311242, 38.99318451,  3.95870325]])
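
To compare the two schemes head-to-head (our addition), we can look at the widths of the intervals side by side:

widths_pairs = CIs_pairs[1] - CIs_pairs[0]
widths_rr = CIs_resamp_resids[1] - CIs_resamp_resids[0]

for label, w_pairs, w_rr in zip(["γ", "φ", "σ"], widths_pairs, widths_rr):
    print(f"{label}: pairs width = {w_pairs:.3f}, resampled-residuals width = {w_rr:.3f}")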

Wow. Things didn’t seem to change much, did they? Let’s do a little wrangling so we can plot these parameter estimates and their confidence intervals, making the comparison more easily done.

[21]:
# the original MLE estimates
gamma_estimate = mle_gamma
phi_estimate = mle_phi

# CIs on the estimates from the pairs bootstrap
gamma_pairs_CI = [CIs_pairs[0,0], CIs_pairs[1,0]]
phi_pairs_CI = [CIs_pairs[0,1], CIs_pairs[1,1]]

# CIs on the estimates from the resampled residuals
gamma_rr_CI = [CIs_resamp_resids[0,0], CIs_resamp_resids[1,0]]
phi_rr_CI = [CIs_resamp_resids[0,1], CIs_resamp_resids[1,1]]

Now let’s take a look at our confidence intervals!

[22]:
pairs_CI  = [gamma_pairs_CI, phi_pairs_CI]
rr_CI     = [gamma_rr_CI, phi_rr_CI]
estimates = [gamma_estimate, phi_estimate]
labels    = ['γ','φ']

for i in range(2):
    summaries = [
        {"estimate": estimates[i], "conf_int": pairs_CI[i], "label": "pairs bootstrap"},
        {"estimate": estimates[i], "conf_int": rr_CI[i], "label": "resampled residuals"},
    ]
    p = bebi103.viz.confints(summaries)
    p.xaxis.axis_label = labels[i]

    bokeh.io.show(p)

Okay. So the result didn't change much, although the pairs bootstrap might yield a slightly larger confidence interval. This raises the question of why we went through all this effort in the first place, right? Rather than fixate on the end result, it's more valuable to consider the interpretation of the two approaches and which is more appropriate for our data. When one variable is dependent upon the other, we are generally interested in how the value of y varies at each specific value of x, rather than in the variability of x itself. This reasoning will become even clearer in the next data set, where the experimentalists explicitly constructed bacterial strains with the desired values of the independent variable.

Computing environment

[23]:
%load_ext watermark
%watermark -v -p numpy,scipy,numba,pandas,bebi103,bokeh,holoviews,panel,jupyterlab
Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.29.0

numpy     : 1.21.2
scipy     : 1.7.1
numba     : 0.53.0
pandas    : 1.3.4
bebi103   : 0.1.9
bokeh     : 2.3.3
holoviews : 1.14.6
panel     : 0.12.1
jupyterlab: 3.2.1