Pitfalls of resampling residuals
This tutorial was prepared by Suzy Beeler and edited and enhanced by Rosita Fu with assistance from Justin Bois.
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
cmd = "pip install --upgrade bebi103 tqdm watermark"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
data_path = "../data/"
# ------------------------------
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import numba
import tqdm
import bebi103
import bokeh.io
bokeh.io.output_notebook()
import holoviews as hv
hv.extension('bokeh')
import panel as pn
pn.extension()
bebi103.hv.set_defaults()
/Users/bois/opt/anaconda3/lib/python3.8/site-packages/arviz/__init__.py:317: UserWarning: Trying to register the cmap 'cet_gray' which already exists.
register_cmap("cet_" + name, cmap=cmap)
/Users/bois/opt/anaconda3/lib/python3.8/site-packages/arviz/__init__.py:317: UserWarning: Trying to register the cmap 'cet_gray_r' which already exists.
register_cmap("cet_" + name, cmap=cmap)
The data set
For this data set, we will be returning to the flow cytometry data from Rob Phillips group here at Caltech, with one of the co-first authors being a familiar name… Griffin Chure! When we last saw this data in the overplotting lesson, we were just looking at data for one given strain, but the authors were interested in investigating how the level of gene expression of a fluorescent protein changes as a function of a number of tunable parameters: the number of repressors in the cell, the binding energy of the repressor to the DNA, and the concentration of the inducer that is known to bind to the repressor. The data set we load in here is the result of quite a bit of upstream analysis, where the fold-change (i.e a normalized value) of gene expression was computed for each set of these possible cellular conditions.
[2]:
df = pd.read_csv(os.path.join(data_path, "flow_master.csv"), comment="#")
df.tail()
[2]:
date | username | operator | binding_energy | rbs | repressors | IPTG_uM | mean_YFP_A | mean_YFP_bgcorr_A | fold_change_A | |
---|---|---|---|---|---|---|---|---|---|---|
2853 | 20160907 | mrazomej | O3 | -9.7 | RBS1 | 610 | 1000.0 | 33656.967752 | 30578.511268 | 1.010222 |
2854 | 20160907 | mrazomej | O3 | -9.7 | RBS1027 | 130 | 1000.0 | 32557.806469 | 29479.349986 | 0.973909 |
2855 | 20160907 | mrazomej | O3 | -9.7 | RBS446 | 62 | 1000.0 | 31654.811988 | 28576.355504 | 0.944077 |
2856 | 20160907 | mrazomej | O3 | -9.7 | RBS1147 | 30 | 1000.0 | 34471.563131 | 31393.106647 | 1.037134 |
2857 | 20160907 | mrazomej | O3 | -9.7 | HG104 | 11 | 1000.0 | 34267.171217 | 31188.714733 | 1.030381 |
Wow there’s a lot here. For the sake of this exercise, we are actually going to disregard the inducer (IPTG) concentration and just get a sense of how gene expression changes as a function of the number of repressors for one given binding site. We are examining such a small part of this data, so check out the paper website to see the full analyses the authors unleashed.
[3]:
df_simple = df[
(df["operator"] == "O3")
& (df["IPTG_uM"] == 0.0)
& (df["rbs"] != "auto")
& (df["rbs"] != "delta")
]
df_simple = df_simple.rename(
columns={"repressors": "repressor copy number", "fold_change_A": "fold change"}
)
Let’s see how fold-change (a measure of gene expression) changes with the number of repressors.
[4]:
scatter = hv.Scatter(
data=df_simple,
kdims=["repressor copy number"],
vdims=["fold change"]
).opts(
logx=True,
)
scatter
[4]:
We see that the level of gene expression (as metricized by fold change) decreases with the number of repressors, which jibes with what we know about gene regulations. Again, it will be faster to work with NumPy arrays throughout, so let’s save those here.
[5]:
R = df_simple["repressor copy number"].values
fc = df_simple['fold change'].values
Pairs bootstrap? No.
With this dataset, it becomes abundantly clear that we are not interested in the variability of the number of repressors. Quite the opposite, the experimentalists worked hard to manipulate various cell strains to have these desired copy numbers and explicitly took 10 replicates for each strain. Doing a pairs bootstrap in this case could wreck havoc on the carefully designed experiment, since the bootstrap samples would likely incorporate data from some strains more than others. As we saw in the previous data set, let’s take the approach of resampling the residuals.
Resampling residuals? Maybe.
Let’s first set up the frame work of our MLE estimate. While I won’t go into the details here, the mathematical model of fold change is
where
[6]:
N_NS = 5e6
def theor_fold_change(N_NS, R, delta_e):
"""Compute fold change using mathematical model"""
return 1 / (1 + 2*R/N_NS * np.exp(-delta_e))
def log_likelihood(params, R, fc_data):
"""Log likelihood of fold change model."""
delta_e, sigma = params
if sigma <= 0:
return -np.inf
fc_theory = theor_fold_change(N_NS, R, delta_e)
return np.sum(st.norm.logpdf(np.log(fc_data), np.log(fc_theory), sigma))
def resid(params, R, fc_data):
delta_e = params[0]
return fc_data - theor_fold_change(N_NS, R, delta_e)
def fc_mle(R, fc_data):
"""Compute MLE for parameters in fold-change model."""
res = scipy.optimize.least_squares(resid, np.array([-5]), args=(R, fc_data))
# Compute residual sum of squares from optimal params
rss_mle = np.sum(resid(res.x, R, fc_data)**2)
# Compute MLE for sigma
sigma_mle = np.sqrt(rss_mle / len(R))
return tuple([x for x in res.x] + [sigma_mle])
With these in hand, we can now get our MLEs.
[7]:
mle_params = fc_mle(R, fc)
# save these for later
mle_binding = mle_params[0]
mle_sigma = mle_params[1]
print(f'Value of binding energy: {np.round(mle_binding, decimals=2)} kBT')
print(f'Value of sigma: {np.round(mle_sigma, decimals=2)}')
Value of binding energy: -9.42 kBT
Value of sigma: 0.06
Given my insider knowledge, that binding energy seems reasonable, but let’s make a predictive regression plot to check. First, we need to generate samples out of the generative model parametrized by the MLE.
[8]:
rg = np.random.default_rng()
def sample_fold_change(delta_e, sigma, R, size=1):
"""Generate samples of spindle length vs droplet diameter."""
samples = np.empty((size, len(R)))
for i in range(size):
mu = theor_fold_change(N_NS, R, delta_e)
samples[i] = np.maximum(0, rg.normal(mu, sigma))
return samples
R_theor = np.logspace(1, 3, 200)
samples = sample_fold_change(*mle_params, R_theor, size=5000)
Now we can make the plot.
[9]:
p = bebi103.viz.predictive_regression(
samples=samples,
samples_x=R_theor,
data=df_simple[["repressor copy number", 'fold change']].values,
x_axis_label='repressor copy number',
y_axis_label='fold change',
x_axis_type='log',
)
bokeh.io.show(p)
Interestingly, for high repressor copy numbers, the data are much closer to the median data predicted by the model. This is a hallmark of heteroskedasticity, which means that
[10]:
@numba.njit
def resampled_residuals(y_fit, residuals):
"""Produces resampled fitted response variable with the provided residuals."""
bs_residuals = np.random.choice(residuals, size=len(residuals))
return y_fit + bs_residuals
To get set up, let’s first get the fold change values our MLE would predict and the residuals when we compare our data to these predicited values.
[11]:
# get our fc predictions
MLE_fc_pred = theor_fold_change(N_NS, R, mle_params[0])
# get our residual values
resids = fc - MLE_fc_pred
With these in hand, we are ready to repeatedly resample the residuals.
[12]:
num_trials = 1000
bs_MLEs_resampled_resids = np.zeros([2, num_trials])
for i in tqdm.tqdm(range(num_trials)):
fc_resamp = resampled_residuals(MLE_fc_pred, resids)
mle_params = fc_mle(R, fc_resamp)
bs_MLEs_resampled_resids[:,i] = mle_params
resampled_resids_CI = np.percentile(bs_MLEs_resampled_resids, [2.5, 97.5], axis=1)
resampled_resids_CI
100%|████████████████████████████████████████████████████████████████| 1000/1000 [00:02<00:00, 447.14it/s]
[12]:
array([[-9.5489066 , 0.04630524],
[-9.33554537, 0.07323599]])
In the plot of the data, we saw that there is much more variability in the measurement of fold change when the value of fold change is higher, known as heteroskedasticity. Heteroskedasticity can be a source of problems when doing residual resampling, as we will often assign a large residual to an y-value that would realistically not yield that much variability. This issue is most clearly illustrated by showing plots of what our resampled data look like.
[13]:
# make a clickable button
button_resid = pn.widgets.Button(name="Residual Boostrap", button_type="primary")
MLE_estimate = pn.widgets.Checkbox(name="initial MLE estimate", value=False)
# define a bootstrap and plot function
@pn.depends(button_resid.param.clicks, MLE_estimate.param.value)
def bs_and_plot(_, estimate):
fc_resamp = resampled_residuals(MLE_fc_pred, resids)
new_resids = fc_resamp - MLE_fc_pred
r = hv.Scatter(data=(R, new_resids),
kdims=["repressor copy number"],
vdims=["residual fc"],
).opts(
ylim=(resids.min()-.1,resids.max()+.1)
)
zero = hv.Path(data=(R, 0*np.empty(len(R))),).opts(color='orange')
p = hv.Scatter(data=(R, fc_resamp),
kdims=["repressor copy number"],
vdims=["fold change"]
).opts(ylim=(0,1.1))
R_theor = np.linspace(0,900,1000)
fc_theor = theor_fold_change(N_NS, R_theor, mle_binding)
fit = hv.Path(data=(R_theor, fc_theor)).opts(color="orange", line_width=2)
if estimate == True:
return p * fit + r * zero
return p + r
plot = pn.Column(button_resid, bs_and_plot)
pn.Row(plot, pn.Column(pn.Spacer(height=80), MLE_estimate))
[13]:
We see that oftentimes, the high repressor copy number strains show much wider variability in fold change than they do in the original data, likely bringing up our estimate on
Wild bootstrap
An approach you might see in this situation is a wild bootstrap, where rather than scrambling the residuals across all the y values, you retain the association of each residual to its original y value. Then your bootstrap samples allow the residuals to vary around their original value in a normally distributed manner, scaled by the original residual. In other words, if the residual for data point
This approach allows us to make bootstrap samples that retain the non-uniform variability that we see in the original data. We show this approach below for completeness and since it’s a common technique you might see, but we suggest using it with caution for reasons that will be come clear.
[14]:
def wild_bootstrap(y_fit, residuals):
"""Produces resampled fitted response variable with the provided residuals."""
resampled_residuals = residuals * np.random.normal(size=len(residuals))
return y_fit + resampled_residuals
Let’s visually check out what happens with the wild bootstrap.
[15]:
def residuals_plot(x, residuals):
r = hv.Scatter(data=(R, residuals),
kdims=["repressor copy number"],
vdims=["residual fc"],
).opts(
ylim=(resids.min()-.1,resids.max()+.1)
)
return r
[16]:
# make a clickable button
button = pn.widgets.Button(name="Wild Bootstrap", button_type="primary")
MLE_estimate = pn.widgets.Checkbox(name="initial MLE estimate", value=False)
# define a bootstrap and plot function
@pn.depends(button.param.clicks, MLE_estimate.param.value)
def bs_and_plot(_, estimate):
# now doing wild boostrap
fc_resamp = wild_bootstrap(MLE_fc_pred, resids)
r = residuals_plot(R, fc_resamp - MLE_fc_pred)
zero = hv.Path(data=(R, 0*np.empty(len(R))),).opts(color='orange')
p = hv.Scatter(data=(R, fc_resamp),
kdims=["repressor copy number"],
vdims=["fold change"]
).opts(ylim=(0,1))
R_theor = np.linspace(0,900,1000)
fc_theor = theor_fold_change(N_NS, R_theor, mle_binding)
fit = hv.Path(data=(R_theor, fc_theor)).opts(color="orange", line_width=2)
if estimate == True:
return p * fit + r * zero
return p + r
plot = pn.Column(button, bs_and_plot)
pn.Row(plot, pn.Column(pn.Spacer(height=80), MLE_estimate))
[16]:
We can now see that the low repressor copy number strains retain high variability in fold change and vice versa, which better reflects what we saw in the original data.
Now we would like to use this bootstrapping technique to get confidence intervals on our parameter estimates. But wait. What about
This is a way of dealing with heteroskedasticity without seriously considering the variation in measurement in a generative way. This is why we warn away from using the wild bootstrap. In this case, it would be more appropriate to explicitly construct a generative model that reflects how we suspect
To implement this, we do not need to estimate a
[17]:
num_trials = 1000
bs_MLEs_wild = np.zeros(num_trials)
for i in tqdm.tqdm(range(num_trials)):
fc_resamp = wild_bootstrap(MLE_fc_pred, resids)
mle_params = fc_mle(R, fc_resamp)
bs_MLEs_wild[i] = mle_params[0]
wild_CI = np.percentile(bs_MLEs_wild, [2.5, 97.5])
wild_CI
100%|████████████████████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 619.40it/s]
[17]:
array([-9.53015523, -9.32254786])
The confidence interval on the binding energy seems shifted but let’s visualize it!
[18]:
summaries = [{'estimate':mle_binding,'conf_int':resampled_resids_CI.T[0],'label':'residuals bootstrap'},
{'estimate':mle_binding,'conf_int':wild_CI,'label':'wild bootstrap'},]
p=bebi103.viz.confints(summaries)
p.xaxis.axis_label='∈'
bokeh.io.show(p)
Should we use wild bootstrap?
If we are not explicitly modeling variability off of the theoretical curve, wild bootstrap can be an effective way to handle heteroskedasticity. Recall that when we do explicitly model the residuals in a homoskedastic manner, the MLE we get for
But, we really should think generatively. If we are going to model, we should model everything. A full generative model, including how we might expect the variability to change as a function of the measured and/or explanatory variables, is preferable. We will spend a lot of time next term building good generative models.
Computing environment
[19]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,numba,tqdm,bebi103,bokeh,holoviews,panel,jupyter
Python implementation: CPython
Python version : 3.8.12
IPython version : 7.29.0
numpy : 1.21.2
scipy : 1.7.1
pandas : 1.3.4
numba : 0.53.0
tqdm : 4.62.3
bebi103 : 0.1.9
bokeh : 2.3.3
holoviews: 1.14.6
panel : 0.12.1
jupyter : 1.0.0