Bootstrapping MLEs with dependent variables

This tutorial was prepared by Suzy Beeler with assistance from Justin Bois.

Dataset download


[1]:
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import numba
import tqdm

import bebi103

import bokeh.io
bokeh.io.output_notebook()

import holoviews as hv
hv.extension('bokeh')

import panel as pn
pn.extension()

bebi103.hv.set_defaults()
Loading BokehJS ...

Thus far, we have primarily discussed how to do bootstrapping on univariate data. We have also touched a bit on how we would bootstrap data that have multiple measured variables we are interested in (i.e., beak width and beak depth). In this recitation, we will discuss ways to conduct bootstrapping when one variable is dependent and the other is independent. As a first pass, we will start with an approach we have already seen, the pairs bootstrap. We will discuss why this approach is not preferred when working with dependent variables and then explore some alternatives. So let’s dive in…

The data set

For this recitation, we will revisit the data set we discussed in Lesson 8, with the data from Matt Good (Science, 2013). As a reminder, they were exploring to what extent (if any) spindle length depends on droplet size in the artificial cells they constructed. Let’s plot the data below to refamiliarize ourselves.

[2]:
df = pd.read_csv("../data/good_invitro_droplet_data.csv", comment="#")

scatter = hv.Scatter(
    data=df,
    kdims=["Droplet Diameter (um)"],
    vdims=["Spindle Length (um)"]
)

scatter
[2]:

It will be faster to work with NumPy arrays throughout, so let’s save those here.

[3]:
d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values
data = df[['Droplet Diameter (um)', 'Spindle Length (um)']].values

Pairs bootstrap

Under a frequentist interpretation of probability, if we want to understand how much our statistical functionals or parameter estimates would vary were we to repeat the experiment, we need to bootstrap this data set. As a first pass, we will revisit the pairs bootstrap, with the code from Lesson 7 reproduced here.

[4]:
@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))


@numba.njit
def draw_bs_pairs(x, y):
    """Draw a pairs bootstrap sample."""
    inds = np.arange(len(x))
    bs_inds = draw_bs_sample(inds)

    return x[bs_inds], y[bs_inds]

Let’s take it for a spin and see how the bootstrap sample looks.

[5]:
bs_d, bs_ell = draw_bs_pairs(d, ell)

hv.Scatter(data=(bs_d, bs_ell),
           kdims=["Droplet Diameter (um)"],
           vdims=["Spindle Length (um)"]
)
[5]:

If we look closely, we can see that this bootstrap sample is in fact a bit different from the original data set. Just for kicks and to practice our Panel skills, let’s make an interactive version that will repeatedly update with a new bootstrap sample each time we click a button.

[6]:
# make a clickable button
button = pn.widgets.Button(name="Bootstrap!", button_type="primary")

# define a bootstrap and plot function
@pn.depends(button.param.clicks)
def bs_and_plot(_):

    bs_d, bs_ell = draw_bs_pairs(d, ell)

    return hv.Scatter(data=(bs_d, bs_ell),
                      kdims=["Droplet Diameter (um)"],
                      vdims=["Spindle Length (um)"]
                     ).opts(
                        xlim=(20, 225),
                        ylim=(18, 50)
    )

# profit!
pn.Column(button, bs_and_plot)
[6]:

Great! This can provide hours of endless entertainment. Now that we have picked a way to do the bootstrap, we might be interested in how our maximum likelihood parameter estimates might change with these different bootstrap samples. Below, I’ve reproduced the code from Lesson 8 that was used to get the MLEs for the parameters. As a reminder, this is the model for spindle length that we have in mind, with parameters \(\gamma\), \(\phi\), and \(\sigma\):

\begin{align} &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma) \;\forall i. \end{align}

[7]:
def theor_spindle_length(gamma, phi, d):
    """Compute spindle length using mathematical model"""
    return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)


def log_likelihood(params, d, ell):
    """Log likelihood of spindle length model."""
    gamma, phi, sigma = params

    if gamma <= 0 or gamma > 1 or phi <= 0:
        return -np.inf

    mu = theor_spindle_length(gamma, phi, d)
    return np.sum(st.norm.logpdf(ell, mu, sigma))


def resid(params, d, ell):
    """Residual for spindle length model."""
    return ell - theor_spindle_length(*params, d)


def spindle_mle(data):
    """Compute MLE for parameters in spindle length model."""
    # Unpack data
    d = data[:, 0]
    ell = data[:, 1]

    res = scipy.optimize.least_squares(
        resid, np.array([0.5, 35]), args=(d, ell), bounds=([0, 0], [1, np.inf])
    )

    # Compute residual sum of squares from optimal params
    rss_mle = np.sum(resid(res.x, d, ell)**2)

    # Compute MLE for sigma
    sigma_mle = np.sqrt(rss_mle / len(d))

    return tuple([x for x in res.x] + [sigma_mle])
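Note that spindle_mle does not optimize log_likelihood directly; it calls scipy.optimize.least_squares. As a quick reminder of why that is equivalent here (a standard result for a homoscedastic Normal model, sketched for completeness), the log likelihood is

\begin{align} \ln L(\gamma, \phi, \sigma) = -\frac{n}{2}\,\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_i \left(l_i - \mu_i(\gamma, \phi)\right)^2, \end{align}

so for any fixed \(\sigma\), maximizing over \(\gamma\) and \(\phi\) is the same as minimizing the residual sum of squares, and setting \(\partial \ln L/\partial \sigma = 0\) at that optimum gives \(\hat{\sigma} = \sqrt{\text{RSS}/n}\), which is exactly what the last two lines of spindle_mle compute.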

With these in hand, we can now get our MLEs.

[8]:
mle_params = spindle_mle(data)

# save these values for later
mle_gamma = mle_params[0]
mle_phi = mle_params[1]
mle_sigma = mle_params[2]

print(f'Value of gamma: {np.round(mle_gamma, decimals=2)}')
print(f'Value of phi: {np.round(mle_phi, decimals=2)}')
print(f'Value of sigma: {np.round(mle_sigma, decimals=2)}')
Value of gamma: 0.86
Value of phi: 38.25
Value of sigma: 3.75

Now let’s take our interactive figure from before to the next level, getting new parameter estimates each time we obtain a new bootstrap sample. Let’s first update our plotting function, which will now incorporate new widgets for displaying the MLE values for each new bootstrap sample.

[9]:
button = pn.widgets.Button(name="Bootstrap!", button_type="primary")
gamma = pn.pane.Markdown("γ*:")
phi = pn.pane.Markdown("φ*:")


@pn.depends(button.param.clicks)
def bs_MLE_and_plot(_):

    # construct a pairs bootstrap sample and make the scatter plot
    bs_d, bs_ell = draw_bs_pairs(d, ell)

    scatter = hv.Scatter(
        data=(bs_d, bs_ell),
        kdims=["Droplet Diameter (um)"],
        vdims=["Spindle Length (um)"]
    ).opts(
        xlim=(20, 225),
        ylim=(18, 50)
    )

    # do MLE and update parameter values to display
    mle_params = spindle_mle(np.vstack((bs_d, bs_ell)).T)
    gamma.object = f"γ*: {np.round(mle_params[0], decimals=2)}"
    phi.object = f"φ*: {np.round(mle_params[1], decimals=2)} µm"

    # compute theoretical curve using MLE parameters
    d_theor = np.linspace(0, 250, 200)
    ell_theor = theor_spindle_length(*mle_params[:-1], d_theor)

    # overlay the theoretical curve and the bootstrap replicate data
    return scatter * hv.Curve(
        data=(d_theor, ell_theor)
    ).opts(
        color="orange",
        line_width=2
    )

Now let’s see it in action!

[10]:
plot = pn.Column(button, bs_MLE_and_plot)
parameters = pn.Column(pn.Spacer(height=140), gamma, phi)
pn.Row(plot, pn.Spacer(width=20), parameters)
[10]:

This allows us to readily see that the parameter values don’t change much from each bootstrap sample to the next. This gives us a nice intuitive feel for how we expect our estimates to change from experiment to experiment. Of course, we would like to put some numbers to this, so let’s compute a 95% confidence interval on these estimates.

[11]:
num_trials = 1000
bs_MLEs = np.zeros([3, num_trials])

for i in tqdm.tqdm(range(num_trials)):
    bs_d, bs_ell = draw_bs_pairs(d, ell)
    mle_params = spindle_mle(np.vstack((bs_d, bs_ell)).T)
    bs_MLEs[:,i] = mle_params

CIs_pairs = np.percentile(bs_MLEs, [2.5, 97.5], axis=1)
lower_CI = CIs_pairs[0]
upper_CI = CIs_pairs[1]

CIs_pairs
100%|██████████| 1000/1000 [00:07<00:00, 128.22it/s]
[11]:
array([[ 0.82748526, 37.37804023,  3.53610314],
       [ 0.89503766, 39.13487636,  3.96642846]])
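Here, the rows are the 2.5th and 97.5th percentiles and the columns are \(\gamma\), \(\phi\), and \(\sigma\). If you prefer labeled output, a small convenience (not part of the original analysis) is to wrap the array in a DataFrame.

# Label the CI bounds with the parameter names for easier reading
pd.DataFrame(
    CIs_pairs,
    index=["2.5 percentile", "97.5 percentile"],
    columns=["γ", "φ (µm)", "σ (µm)"],
)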

Great! This validates what we saw in the interactive figure, where the value of \(\gamma\) is somewhere around 0.85 and the value of \(\phi\) is just shy of 40. Lastly, let’s visualize how the curves parametrized by our MLEs vary for each bootstrap sample.

[12]:
scatter = hv.Scatter(
    data=df,
    kdims=["Droplet Diameter (um)"],
    vdims=["Spindle Length (um)"]
)

# Pop out Bokeh plot
p = hv.render(scatter)

# Plot theoretical curves (do every tenth)
d_theor = np.linspace(0, 250, 200)
for params in bs_MLEs[:,::10].T:
    p.line(d_theor, theor_spindle_length(*params[:-1], d_theor), line_width=1, alpha=0.1, color='orange')

bokeh.io.show(p)

But let’s now take a step back: remember how I prefaced this by saying the pairs bootstrap is not the way to go with dependent variables? Let’s take a minute to think about why.

In this case, the spindle length is dependent on the droplet diameter. This means we aren’t really interested in how the droplet diameter itself varies, but rather in how spindle length varies given the droplet diameter. Accordingly, it is not appropriate to bootstrap both our x and y values when only the y values capture the variability of interest. This stands in contrast to the bird beak data, where both the length and the depth were random variables whose variation we wished to explore. This leads us to the next approach…
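Before moving on, a quick check (a minimal sketch using the functions already defined above) makes the contrast concrete: a single pairs bootstrap sample typically leaves out roughly a third of the distinct droplet diameters and duplicates others, whereas the approach below keeps every measured diameter in place.

# Draw one pairs bootstrap sample and ask what fraction of the distinct
# droplet diameters it actually contains; the rest were dropped and
# replaced by duplicates of other droplets.
bs_d, bs_ell = draw_bs_pairs(d, ell)
frac_retained = len(np.unique(bs_d)) / len(np.unique(d))
print(f"Fraction of distinct droplet diameters retained: {frac_retained:.2f}")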

Resampling residuals

Rather than resampling which data points we include in each bootstrap replicate, the approach of resampling the residuals allows us to formulate a bootstrap sample that reflects the variability we expect to see in the response variable (e.g., spindle length in our case). The steps listed below are taken directly from Wikipedia:

  1. Fit the model and retain the fitted values \(\widehat y_i\) and the residuals \(\widehat \epsilon_i = y_i - \widehat y_i\), \(i = 1, \dots, n\).

[13]:
# fit the model
mle_params = spindle_mle(data)

# get our spindle length predictions
y_hat = theor_spindle_length(*mle_params[:-1], d)

# get our residual values
epsilon_hat = ell - y_hat
  2. For each pair, (\(x_i\), \(y_i\)), in which \(x_i\) is the explanatory variable, add a randomly resampled residual, \(\widehat \epsilon_j\), to the fitted value \(\widehat y_i\). In other words, create synthetic response variables \(y^*_i = \widehat y_i + \widehat \epsilon_j\) where \(j\) is selected randomly from the list (1, …, \(n\)) for every \(i\).

[14]:
# get resampled residuals
epsilon_star = np.random.choice(epsilon_hat, replace=True, size=len(epsilon_hat))

# get resampled y values
y_star = y_hat + epsilon_star

# Check out our resampled data set
hv.Scatter(
        data=(d, y_star),
        kdims=["Droplet Diameter (um)"],
        vdims=["Spindle Length (um)"]
)
[14]:

Things haven’t changed too drastically, but we can see that the one spurious droplet with a diameter over 200 µm has shifted up, since it is unlikely to have been paired with its original residual.
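We can verify this directly for that droplet (a small check using the arrays already in hand).

# Compare the measured and synthetic spindle lengths for the largest droplet
i_max = np.argmax(d)
print(f"Droplet diameter:         {d[i_max]:.1f} µm")
print(f"Measured spindle length:  {ell[i_max]:.1f} µm")
print(f"Synthetic spindle length: {y_star[i_max]:.1f} µm")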

  3. Refit the model using the fictitious response variables \(y^*_i\), and retain the quantities of interest, the parameter estimates in this case.

[15]:
mle_params_star = spindle_mle(np.vstack((d, y_star)).T)
mle_params_star
[15]:
(0.8320461581157351, 38.86744324073538, 3.5871776138552915)
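These are quite close to the MLEs we obtained from the measured data. For a side-by-side look (a small sketch using values already computed above), we can print the two sets of estimates together.

# Compare the full-data MLEs to the MLEs from this resampled-residual data set
for name, orig, star in zip(("γ", "φ (µm)", "σ (µm)"), mle_params, mle_params_star):
    print(f"{name}: full-data MLE = {orig:.3f}, resampled-residual MLE = {star:.3f}")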

We can look at a plot of the curve parametrized by the MLE.

[16]:
# plot the resampled data
scatter = hv.Scatter(
        data=(d, y_star),
        kdims=["Droplet Diameter (um)"],
        vdims=["Spindle Length (um)"]
)

# overlay the theoretical curve
d_theor = np.linspace(0, 250, 200)
ell_theor_star = theor_spindle_length(*mle_params_star[:-1], d_theor)

MLE_star = hv.Curve(data=(d_theor, ell_theor_star)).opts(color="orange", line_width=2)

scatter * MLE_star
[16]:

Nice. This seems to have worked well, and the parameter estimates didn’t change too drastically, which is promising. We can now move to the last step, which should be familiar by now:

  4. Repeat steps 2 and 3 a large number of times.

For this, let’s make a numba’ed function that we can call repeatedly.

[17]:
@numba.njit
def resampled_residuals(y_fit, residuals):
    """Produces resampled fitted response variable with the provided residuals."""
    bs_residuals =  np.random.choice(residuals, size=len(residuals))
    return y_fit + bs_residuals


num_trials = 1000
bs_MLEs_rr = np.zeros([3, num_trials])

for i in tqdm.tqdm(range(num_trials)):
    y_star = resampled_residuals(y_hat, epsilon_hat)
    mle_params = spindle_mle(np.vstack((d, y_star)).T)
    bs_MLEs_rr[:,i] = mle_params

CIs_resamp_resids = np.percentile(bs_MLEs_rr, [2.5, 97.5], axis=1)
CIs_resamp_resids
100%|██████████| 1000/1000 [00:05<00:00, 190.53it/s]
[17]:
array([[ 0.82964235, 37.55197845,  3.53569939],
       [ 0.89582985, 39.01863112,  3.96114377]])

Wow. Things didn’t seem to change much, did they? Let’s do a little wrangling so we can plot these parameter estimates and their confidence intervals, making the comparison easier.

[18]:
# the original MLE estimates
gamma_estimate = mle_gamma
phi_estimate = mle_phi

# CIs on the estimates from the pairs bootstrap
gamma_pairs_CI = [CIs_pairs[0,0], CIs_pairs[1,0]]
phi_pairs_CI = [CIs_pairs[0,1], CIs_pairs[1,1]]

# CIs on the estimates from the resampled residuals
gamma_rr_CI = [CIs_resamp_resids[0,0], CIs_resamp_resids[1,0]]
phi_rr_CI = [CIs_resamp_resids[0,1], CIs_resamp_resids[1,1]]

Now let’s take a look at our confidence intervals!

[19]:
estimate = [gamma_estimate, gamma_estimate]
confs = np.array([gamma_pairs_CI, gamma_rr_CI])
names = ['pairs bootstrap', 'resampled residuals']

bokeh.io.show(bebi103.viz.plot_with_error_bars(estimate, confs, names, x_axis_label='γ'))
[20]:
estimate = [phi_estimate, phi_estimate]
confs = np.array([phi_pairs_CI, phi_rr_CI])
names = ['pairs bootstrap', 'resampled residuals']

bokeh.io.show(bebi103.viz.plot_with_error_bars(estimate, confs, names, x_axis_label='φ (µm)'))
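For a more quantitative comparison (a minimal sketch using the CI arrays computed above), we can also print the widths of the intervals from the two approaches.

# Width (upper bound minus lower bound) of each 95% CI for γ, φ, and σ
width_pairs = CIs_pairs[1] - CIs_pairs[0]
width_rr = CIs_resamp_resids[1] - CIs_resamp_resids[0]

for name, wp, wr in zip(("γ", "φ (µm)", "σ (µm)"), width_pairs, width_rr):
    print(f"{name}: pairs bootstrap width = {wp:.3f}, resampled residuals width = {wr:.3f}")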

Okay. So the result didn’t change much, although the pairs bootstrap might yield a slightly larger CI. This raises the question of why we went through all this effort in the first place, right? Rather than fixate on the end result, it’s more valuable to consider the interpretation of the two approaches and which is more appropriate for our data. When one variable is dependent upon the other, we are generally interested in how the value of y varies at each specific value of x rather than in the variability of x itself. This reasoning will become even clearer in the next data set, where the experimentalists explicitly specified and constructed bacterial strains for the desired values of the independent variable.

Computing environment

[21]:
%load_ext watermark
%watermark -v -p numpy,scipy,numba,pandas,bebi103,bokeh,holoviews,panel,jupyterlab
CPython 3.7.5
IPython 7.10.1

numpy 1.17.4
scipy 1.3.1
numba 0.46.0
pandas 0.24.2
bebi103 0.0.46
bokeh 1.4.0
holoviews 1.12.7
panel 0.7.0
jupyterlab 1.2.3