Pitfalls of resampling residuals

This tutorial was prepared by Suzy Beeler and edited and enhanced by Rosita Fu with assistance from Justin Bois.

Dataset download


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade bebi103 tqdm watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import numba
import tqdm

import bebi103

import bokeh.io
bokeh.io.output_notebook()

import holoviews as hv
hv.extension('bokeh')

import panel as pn
pn.extension()

bebi103.hv.set_defaults()
/Users/bois/opt/anaconda3/lib/python3.8/site-packages/arviz/__init__.py:317: UserWarning: Trying to register the cmap 'cet_gray' which already exists.
  register_cmap("cet_" + name, cmap=cmap)
/Users/bois/opt/anaconda3/lib/python3.8/site-packages/arviz/__init__.py:317: UserWarning: Trying to register the cmap 'cet_gray_r' which already exists.
  register_cmap("cet_" + name, cmap=cmap)
Loading BokehJS ...

The data set

For this data set, we will be returning to the flow cytometry data from Rob Phillips group here at Caltech, with one of the co-first authors being a familiar name… Griffin Chure! When we last saw this data in the overplotting lesson, we were just looking at data for one given strain, but the authors were interested in investigating how the level of gene expression of a fluorescent protein changes as a function of a number of tunable parameters: the number of repressors in the cell, the binding energy of the repressor to the DNA, and the concentration of the inducer that is known to bind to the repressor. The data set we load in here is the result of quite a bit of upstream analysis, where the fold-change (i.e a normalized value) of gene expression was computed for each set of these possible cellular conditions.

[2]:
df = pd.read_csv(os.path.join(data_path, "flow_master.csv"), comment="#")
df.tail()
[2]:
date username operator binding_energy rbs repressors IPTG_uM mean_YFP_A mean_YFP_bgcorr_A fold_change_A
2853 20160907 mrazomej O3 -9.7 RBS1 610 1000.0 33656.967752 30578.511268 1.010222
2854 20160907 mrazomej O3 -9.7 RBS1027 130 1000.0 32557.806469 29479.349986 0.973909
2855 20160907 mrazomej O3 -9.7 RBS446 62 1000.0 31654.811988 28576.355504 0.944077
2856 20160907 mrazomej O3 -9.7 RBS1147 30 1000.0 34471.563131 31393.106647 1.037134
2857 20160907 mrazomej O3 -9.7 HG104 11 1000.0 34267.171217 31188.714733 1.030381

Wow there’s a lot here. For the sake of this exercise, we are actually going to disregard the inducer (IPTG) concentration and just get a sense of how gene expression changes as a function of the number of repressors for one given binding site. We are examining such a small part of this data, so check out the paper website to see the full analyses the authors unleashed.

[3]:
df_simple = df[
    (df["operator"] == "O3")
    & (df["IPTG_uM"] == 0.0)
    & (df["rbs"] != "auto")
    & (df["rbs"] != "delta")
]

df_simple = df_simple.rename(
    columns={"repressors": "repressor copy number", "fold_change_A": "fold change"}
)

Let’s see how fold-change (a measure of gene expression) changes with the number of repressors.

[4]:
scatter = hv.Scatter(
    data=df_simple,
    kdims=["repressor copy number"],
    vdims=["fold change"]
).opts(
    logx=True,
)

scatter
[4]:

We see that the level of gene expression (as metricized by fold change) decreases with the number of repressors, which jibes with what we know about gene regulations. Again, it will be faster to work with NumPy arrays throughout, so let’s save those here.

[5]:
R = df_simple["repressor copy number"].values
fc = df_simple['fold change'].values

Pairs bootstrap? No.

With this dataset, it becomes abundantly clear that we are not interested in the variability of the number of repressors. Quite the opposite, the experimentalists worked hard to manipulate various cell strains to have these desired copy numbers and explicitly took 10 replicates for each strain. Doing a pairs bootstrap in this case could wreck havoc on the carefully designed experiment, since the bootstrap samples would likely incorporate data from some strains more than others. As we saw in the previous data set, let’s take the approach of resampling the residuals.

Resampling residuals? Maybe.

Let’s first set up the frame work of our MLE estimate. While I won’t go into the details here, the mathematical model of fold change is

fold change=11+2RNNSeΔϵ / kBT,

where NNS is the number of non-specific binding sites, R is the number of repressors, and Δϵ in the binding energy of the repressor to the operator in units of kBT. In this case, I actually already know the binding energy, but for sake argument, let’s pretend we don’t and get a MLE for this parameter.

[6]:
N_NS = 5e6

def theor_fold_change(N_NS, R, delta_e):
    """Compute fold change using mathematical model"""
    return 1 / (1 + 2*R/N_NS * np.exp(-delta_e))


def log_likelihood(params, R, fc_data):
    """Log likelihood of fold change model."""
    delta_e, sigma = params

    if sigma <= 0:
        return -np.inf

    fc_theory = theor_fold_change(N_NS, R, delta_e)
    return np.sum(st.norm.logpdf(np.log(fc_data), np.log(fc_theory), sigma))


def resid(params, R, fc_data):
    delta_e = params[0]
    return fc_data - theor_fold_change(N_NS, R, delta_e)


def fc_mle(R, fc_data):
    """Compute MLE for parameters in fold-change model."""
    res = scipy.optimize.least_squares(resid, np.array([-5]), args=(R, fc_data))

    # Compute residual sum of squares from optimal params
    rss_mle = np.sum(resid(res.x, R, fc_data)**2)

    # Compute MLE for sigma
    sigma_mle = np.sqrt(rss_mle / len(R))

    return tuple([x for x in res.x] + [sigma_mle])

With these in hand, we can now get our MLEs.

[7]:
mle_params = fc_mle(R, fc)

# save these for later
mle_binding = mle_params[0]
mle_sigma = mle_params[1]

print(f'Value of binding energy: {np.round(mle_binding, decimals=2)} kBT')
print(f'Value of sigma: {np.round(mle_sigma, decimals=2)}')
Value of binding energy: -9.42 kBT
Value of sigma: 0.06

Given my insider knowledge, that binding energy seems reasonable, but let’s make a predictive regression plot to check. First, we need to generate samples out of the generative model parametrized by the MLE.

[8]:
rg = np.random.default_rng()

def sample_fold_change(delta_e, sigma, R, size=1):
    """Generate samples of spindle length vs droplet diameter."""
    samples = np.empty((size, len(R)))

    for i in range(size):
        mu = theor_fold_change(N_NS, R, delta_e)
        samples[i] = np.maximum(0, rg.normal(mu, sigma))

    return samples

R_theor = np.logspace(1, 3, 200)
samples = sample_fold_change(*mle_params, R_theor, size=5000)

Now we can make the plot.

[9]:
p = bebi103.viz.predictive_regression(
    samples=samples,
    samples_x=R_theor,
    data=df_simple[["repressor copy number", 'fold change']].values,
    x_axis_label='repressor copy number',
    y_axis_label='fold change',
    x_axis_type='log',
)

bokeh.io.show(p)

Interestingly, for high repressor copy numbers, the data are much closer to the median data predicted by the model. This is a hallmark of heteroskedasticity, which means that σ can differ based on the measured values and/or the explanatory variables. We will discuss the wild bootstrap as a way of dealing with this momentarily, but first let’s get confidence intervals on the MLEs of the parameters by resampling the residuals as we did before.

[10]:
@numba.njit
def resampled_residuals(y_fit, residuals):
    """Produces resampled fitted response variable with the provided residuals."""
    bs_residuals =  np.random.choice(residuals, size=len(residuals))
    return y_fit + bs_residuals

To get set up, let’s first get the fold change values our MLE would predict and the residuals when we compare our data to these predicited values.

[11]:
# get our fc predictions
MLE_fc_pred = theor_fold_change(N_NS, R, mle_params[0])

# get our residual values
resids = fc - MLE_fc_pred

With these in hand, we are ready to repeatedly resample the residuals.

[12]:
num_trials = 1000
bs_MLEs_resampled_resids = np.zeros([2, num_trials])

for i in tqdm.tqdm(range(num_trials)):
    fc_resamp = resampled_residuals(MLE_fc_pred, resids)
    mle_params = fc_mle(R, fc_resamp)
    bs_MLEs_resampled_resids[:,i] = mle_params

resampled_resids_CI = np.percentile(bs_MLEs_resampled_resids, [2.5, 97.5], axis=1)
resampled_resids_CI
100%|████████████████████████████████████████████████████████████████| 1000/1000 [00:02<00:00, 447.14it/s]
[12]:
array([[-9.5489066 ,  0.04630524],
       [-9.33554537,  0.07323599]])

In the plot of the data, we saw that there is much more variability in the measurement of fold change when the value of fold change is higher, known as heteroskedasticity. Heteroskedasticity can be a source of problems when doing residual resampling, as we will often assign a large residual to an y-value that would realistically not yield that much variability. This issue is most clearly illustrated by showing plots of what our resampled data look like.

[13]:
# make a clickable button
button_resid = pn.widgets.Button(name="Residual Boostrap", button_type="primary")
MLE_estimate = pn.widgets.Checkbox(name="initial MLE estimate", value=False)

# define a bootstrap and plot function
@pn.depends(button_resid.param.clicks, MLE_estimate.param.value)
def bs_and_plot(_, estimate):
    fc_resamp = resampled_residuals(MLE_fc_pred, resids)
    new_resids = fc_resamp - MLE_fc_pred

    r = hv.Scatter(data=(R, new_resids),
                        kdims=["repressor copy number"],
                        vdims=["residual fc"],
                      ).opts(
                        ylim=(resids.min()-.1,resids.max()+.1)
                    )
    zero = hv.Path(data=(R, 0*np.empty(len(R))),).opts(color='orange')

    p = hv.Scatter(data=(R, fc_resamp),
                      kdims=["repressor copy number"],
                      vdims=["fold change"]
                     ).opts(ylim=(0,1.1))

    R_theor = np.linspace(0,900,1000)
    fc_theor = theor_fold_change(N_NS, R_theor, mle_binding)
    fit = hv.Path(data=(R_theor, fc_theor)).opts(color="orange", line_width=2)
    if estimate == True:
        return p * fit + r * zero
    return p + r

plot = pn.Column(button_resid, bs_and_plot)
pn.Row(plot, pn.Column(pn.Spacer(height=80), MLE_estimate))
[13]:

We see that oftentimes, the high repressor copy number strains show much wider variability in fold change than they do in the original data, likely bringing up our estimate on σ. A solution?

Wild bootstrap

An approach you might see in this situation is a wild bootstrap, where rather than scrambling the residuals across all the y values, you retain the association of each residual to its original y value. Then your bootstrap samples allow the residuals to vary around their original value in a normally distributed manner, scaled by the original residual. In other words, if the residual for data point i is ϵi, then the wild bootstrapped residual is ϵi=ϵivi where vi is drawn from a Normal distribution with location parameter zero and scale parameter one.

This approach allows us to make bootstrap samples that retain the non-uniform variability that we see in the original data. We show this approach below for completeness and since it’s a common technique you might see, but we suggest using it with caution for reasons that will be come clear.

[14]:
def wild_bootstrap(y_fit, residuals):
    """Produces resampled fitted response variable with the provided residuals."""
    resampled_residuals = residuals * np.random.normal(size=len(residuals))
    return y_fit + resampled_residuals

Let’s visually check out what happens with the wild bootstrap.

[15]:
def residuals_plot(x, residuals):
    r = hv.Scatter(data=(R, residuals),
                        kdims=["repressor copy number"],
                        vdims=["residual fc"],
                      ).opts(
                        ylim=(resids.min()-.1,resids.max()+.1)
                    )
    return r
[16]:
# make a clickable button
button = pn.widgets.Button(name="Wild Bootstrap", button_type="primary")
MLE_estimate = pn.widgets.Checkbox(name="initial MLE estimate", value=False)

# define a bootstrap and plot function
@pn.depends(button.param.clicks, MLE_estimate.param.value)
def bs_and_plot(_, estimate):

    # now doing wild boostrap
    fc_resamp = wild_bootstrap(MLE_fc_pred, resids)
    r = residuals_plot(R, fc_resamp - MLE_fc_pred)
    zero = hv.Path(data=(R, 0*np.empty(len(R))),).opts(color='orange')
    p = hv.Scatter(data=(R, fc_resamp),
                      kdims=["repressor copy number"],
                      vdims=["fold change"]
                     ).opts(ylim=(0,1))


    R_theor = np.linspace(0,900,1000)
    fc_theor = theor_fold_change(N_NS, R_theor, mle_binding)
    fit = hv.Path(data=(R_theor, fc_theor)).opts(color="orange", line_width=2)

    if estimate == True:
        return p * fit + r * zero
    return p + r

plot = pn.Column(button, bs_and_plot)
pn.Row(plot, pn.Column(pn.Spacer(height=80), MLE_estimate))
[16]:

We can now see that the low repressor copy number strains retain high variability in fold change and vice versa, which better reflects what we saw in the original data.

Now we would like to use this bootstrapping technique to get confidence intervals on our parameter estimates. But wait. What about σ now? By choosing to use a wild bootstrap, we have baked into our log likelihood that the residual for each data point is normally distributed about the empirical residual with a scale parameter that scales with the size of the empirical residual.

This is a way of dealing with heteroskedasticity without seriously considering the variation in measurement in a generative way. This is why we warn away from using the wild bootstrap. In this case, it would be more appropriate to explicitly construct a generative model that reflects how we suspect σ to scale rather than taking a non-parametric approach that simulates the heteroskedasticity.

To implement this, we do not need to estimate a σ, since it is no longer a parameter in our generative model. Since we already did the work of finding the parameters by least squares, we can use the same MLE solver we used before, and just ignore the estimate of σ. So, let’s use wild bootstrap to get our confidence intervals on the binding energy.

[17]:
num_trials = 1000
bs_MLEs_wild = np.zeros(num_trials)

for i in tqdm.tqdm(range(num_trials)):
    fc_resamp = wild_bootstrap(MLE_fc_pred, resids)
    mle_params = fc_mle(R, fc_resamp)
    bs_MLEs_wild[i] = mle_params[0]

wild_CI = np.percentile(bs_MLEs_wild, [2.5, 97.5])
wild_CI
100%|████████████████████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 619.40it/s]
[17]:
array([-9.53015523, -9.32254786])

The confidence interval on the binding energy seems shifted but let’s visualize it!

[18]:
summaries = [{'estimate':mle_binding,'conf_int':resampled_resids_CI.T[0],'label':'residuals bootstrap'},
             {'estimate':mle_binding,'conf_int':wild_CI,'label':'wild bootstrap'},]
p=bebi103.viz.confints(summaries)
p.xaxis.axis_label='∈'
bokeh.io.show(p)

Should we use wild bootstrap?

If we are not explicitly modeling variability off of the theoretical curve, wild bootstrap can be an effective way to handle heteroskedasticity. Recall that when we do explicitly model the residuals in a homoskedastic manner, the MLE we get for σ is inconsequential with respect to the parameters of the theoretical model (we derived this in the lesson on regression). So, in a sense, we can be safe in not modeling variability if we assume homoskedasticity (though we could not do generative regression plots as model checks). So, if we are feeling lazy, or ignorant, and could continue to not model variability generatively and employ wild bootstrap to handle any possible lurking heteroskedasticity.

But, we really should think generatively. If we are going to model, we should model everything. A full generative model, including how we might expect the variability to change as a function of the measured and/or explanatory variables, is preferable. We will spend a lot of time next term building good generative models.

Computing environment

[19]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,numba,tqdm,bebi103,bokeh,holoviews,panel,jupyter
Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.29.0

numpy    : 1.21.2
scipy    : 1.7.1
pandas   : 1.3.4
numba    : 0.53.0
tqdm     : 4.62.3
bebi103  : 0.1.9
bokeh    : 2.3.3
holoviews: 1.14.6
panel    : 0.12.1
jupyter  : 1.0.0