Pitfalls of resampling residuals
This tutorial was prepared by Suzy Beeler and edited and enhanced by Rosita Fu with assistance from Justin Bois.
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade bebi103 tqdm watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import numba
import tqdm
import bebi103
import bokeh.io
bokeh.io.output_notebook()
import holoviews as hv
hv.extension('bokeh')
import panel as pn
pn.extension()
bebi103.hv.set_defaults()
The data set
For this data set, we will be returning to the flow cytometry data from the Rob Phillips group here at Caltech, with one of the co-first authors being a familiar name… Griffin Chure! When we last saw these data in the overplotting lesson, we were only looking at one given strain, but the authors were interested in investigating how the level of gene expression of a fluorescent protein changes as a function of a number of tunable parameters: the number of repressors in the cell, the binding energy of the repressor to the DNA, and the concentration of the inducer that is known to bind to the repressor. The data set we load in here is the result of quite a bit of upstream analysis, in which the fold-change (i.e., a normalized value) of gene expression was computed for each set of these possible cellular conditions.
[2]:
df = pd.read_csv(os.path.join(data_path, "flow_master.csv"), comment="#")
df.tail()
[2]:
 | date | username | operator | binding_energy | rbs | repressors | IPTG_uM | mean_YFP_A | mean_YFP_bgcorr_A | fold_change_A
---|---|---|---|---|---|---|---|---|---|---
2853 | 20160907 | mrazomej | O3 | -9.7 | RBS1 | 610 | 1000.0 | 33656.967752 | 30578.511268 | 1.010222 |
2854 | 20160907 | mrazomej | O3 | -9.7 | RBS1027 | 130 | 1000.0 | 32557.806469 | 29479.349986 | 0.973909 |
2855 | 20160907 | mrazomej | O3 | -9.7 | RBS446 | 62 | 1000.0 | 31654.811988 | 28576.355504 | 0.944077 |
2856 | 20160907 | mrazomej | O3 | -9.7 | RBS1147 | 30 | 1000.0 | 34471.563131 | 31393.106647 | 1.037134 |
2857 | 20160907 | mrazomej | O3 | -9.7 | HG104 | 11 | 1000.0 | 34267.171217 | 31188.714733 | 1.030381 |
Wow, there's a lot here. For the sake of this exercise, we are actually going to disregard the inducer (IPTG) concentration and just get a sense of how gene expression changes as a function of the number of repressors for one given binding site. We are examining only a small part of this data set, so check out the paper website to see the full analyses the authors unleashed.
[3]:
df_simple = df[
    (df["operator"] == "O3")
    & (df["IPTG_uM"] == 0.0)
    & (df["rbs"] != "auto")
    & (df["rbs"] != "delta")
]

df_simple = df_simple.rename(
    columns={"repressors": "repressor copy number", "fold_change_A": "fold change"}
)
Let's see how fold-change (a measure of gene expression) changes with the number of repressors.
[4]:
scatter = hv.Scatter(
    data=df_simple,
    kdims=["repressor copy number"],
    vdims=["fold change"],
).opts(
    logx=True,
)
scatter
[4]:
We see that the level of gene expression (as measured by fold change) decreases with the number of repressors, which jibes with what we know about gene regulation. Again, it will be faster to work with NumPy arrays throughout, so let's save those here.
[5]:
R = df_simple["repressor copy number"].values
fc = df_simple['fold change'].values
Pairs bootstrap? No.
With this data set, it becomes abundantly clear that we are not interested in the variability of the number of repressors. Quite the opposite: the experimentalists worked hard to manipulate various cell strains to have these desired copy numbers and explicitly took 10 replicates for each strain. Doing a pairs bootstrap in this case could wreak havoc on the carefully designed experiment, since the bootstrap samples would likely incorporate data from some strains more than others, as the quick sketch below illustrates. As we did with the previous data set, let's instead take the approach of resampling the residuals.
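To see why, here is a quick sketch (added for illustration; we will not use it below) of a single pairs-bootstrap draw, counting how many times each repressor copy number appears in the resample:

rg_demo = np.random.default_rng()

# Draw indices with replacement, as a pairs bootstrap would, and count how
# often each repressor copy number shows up. The designed 10 replicates per
# strain are no longer respected in the resample.
bs_inds = rg_demo.integers(0, len(R), size=len(R))
np.unique(R[bs_inds], return_counts=True)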
Resampling residuals? Maybe.
Let's first set up the framework for our MLE. While I won't go into the details here, the mathematical model of fold change is
\begin{align} \text{fold change} = \frac{1}{1 + \frac{2R}{N_{NS}} e^{- \Delta \epsilon \ / \ k_BT }}, \end{align}
where \(N_{NS}\) is the number of non-specific binding sites, \(R\) is the number of repressors, and \(\Delta \epsilon\) is the binding energy of the repressor to the operator in units of \(k_BT\). In this case, I actually already know the binding energy, but for the sake of argument, let's pretend we don't and get an MLE for this parameter.
[6]:
N_NS = 5e6

def theor_fold_change(N_NS, R, delta_e):
    """Compute fold change using mathematical model"""
    return 1 / (1 + 2 * R / N_NS * np.exp(-delta_e))

def log_likelihood(params, R, fc_data):
    """Log likelihood of fold change model: fc ~ Normal(fc_theory, sigma)."""
    delta_e, sigma = params
    if sigma <= 0:
        return -np.inf
    fc_theory = theor_fold_change(N_NS, R, delta_e)
    return np.sum(st.norm.logpdf(fc_data, fc_theory, sigma))

def resid(params, R, fc_data):
    """Residuals of measured fold change about the theoretical curve."""
    delta_e = params[0]
    return fc_data - theor_fold_change(N_NS, R, delta_e)

def fc_mle(R, fc_data):
    """Compute MLE for parameters in fold-change model."""
    res = scipy.optimize.least_squares(resid, np.array([-5]), args=(R, fc_data))

    # Compute residual sum of squares from optimal params
    rss_mle = np.sum(resid(res.x, R, fc_data) ** 2)

    # Compute MLE for sigma
    sigma_mle = np.sqrt(rss_mle / len(R))

    return tuple([x for x in res.x] + [sigma_mle])
With these in hand, we can now get our MLEs.
[7]:
mle_params = fc_mle(R, fc)
# save these for later
mle_binding = mle_params[0]
mle_sigma = mle_params[1]
print(f'Value of binding energy: {np.round(mle_binding, decimals=2)} kBT')
print(f'Value of sigma: {np.round(mle_sigma, decimals=2)}')
Value of binding energy: -9.42 kBT
Value of sigma: 0.06
Given my insider knowledge, that binding energy seems reasonable, but let's make a predictive regression plot to check. First, we need to generate samples out of the generative model parametrized by the MLE.
[8]:
rg = np.random.default_rng()

def sample_fold_change(delta_e, sigma, R, size=1):
    """Generate samples of fold change vs. repressor copy number."""
    samples = np.empty((size, len(R)))
    for i in range(size):
        mu = theor_fold_change(N_NS, R, delta_e)
        samples[i] = np.maximum(0, rg.normal(mu, sigma))
    return samples

R_theor = np.logspace(1, 3, 200)
samples = sample_fold_change(*mle_params, R_theor, size=5000)
Now we can make the plot.
[9]:
p = bebi103.viz.predictive_regression(
    samples=samples,
    samples_x=R_theor,
    data=df_simple[["repressor copy number", "fold change"]].values,
    x_axis_label="repressor copy number",
    y_axis_label="fold change",
    x_axis_type="log",
)
bokeh.io.show(p)
Interestingly, for high repressor copy numbers, the data lie much closer to the median prediction of the model. This is a hallmark of heteroskedasticity, in which \(\sigma\) differs depending on the measured values and/or the explanatory variables. We will discuss the wild bootstrap as a way of dealing with this momentarily, but first let's get confidence intervals on the MLEs of the parameters by resampling the residuals as we did before.
[10]:
@numba.njit
def resampled_residuals(y_fit, residuals):
    """Produces resampled fitted response variable with the provided residuals."""
    bs_residuals = np.random.choice(residuals, size=len(residuals))
    return y_fit + bs_residuals
To get set up, let's first get the fold change values our MLE would predict and the residuals when we compare our data to these predicted values.
[11]:
# get our fc predictions
MLE_fc_pred = theor_fold_change(N_NS, R, mle_params[0])
# get our residual values
resids = fc - MLE_fc_pred
With these in hand, we are ready to repeatedly resample the residuals.
[12]:
num_trials = 1000
bs_MLEs_resampled_resids = np.zeros([2, num_trials])

for i in tqdm.tqdm(range(num_trials)):
    fc_resamp = resampled_residuals(MLE_fc_pred, resids)
    mle_params = fc_mle(R, fc_resamp)
    bs_MLEs_resampled_resids[:, i] = mle_params

resampled_resids_CI = np.percentile(bs_MLEs_resampled_resids, [2.5, 97.5], axis=1)
resampled_resids_CI
100%|██████████| 1000/1000 [00:02<00:00, 488.96it/s]
[12]:
array([[-9.54963275, 0.04681566],
[-9.33394636, 0.07305054]])
In the plot of the data, we saw that there is much more variability in the measurement of fold change when the value of fold change is higher, known as heteroskedasticity. Heteroskedasticity can be a source of problems when resampling residuals, since we will often assign a large residual to a y-value that would realistically not show that much variability. This issue is most clearly illustrated by plotting what our resampled data look like.
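Before making those plots, we can quickly quantify the heteroskedasticity (a small check added for illustration; it is not used in what follows) by computing the spread of the residuals separately for each repressor copy number.

# Standard deviation of the residuals for each repressor copy number;
# per the plot above, the spread shrinks as the copy number grows.
for r_val in np.unique(R):
    print(r_val, np.std(resids[R == r_val]))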
[13]:
# Make a clickable button and a checkbox for overlaying the MLE curve
button_resid = pn.widgets.Button(name="Residual Bootstrap", button_type="primary")
MLE_estimate = pn.widgets.Checkbox(name="initial MLE estimate", value=False)

# Define a bootstrap and plot function
@pn.depends(button_resid.param.clicks, MLE_estimate.param.value)
def bs_and_plot(_, estimate):
    fc_resamp = resampled_residuals(MLE_fc_pred, resids)
    new_resids = fc_resamp - MLE_fc_pred

    # Scatter plot of the resampled residuals
    r = hv.Scatter(
        data=(R, new_resids),
        kdims=["repressor copy number"],
        vdims=["residual fc"],
    ).opts(ylim=(resids.min() - 0.1, resids.max() + 0.1))
    zero = hv.Path(data=(R, np.zeros(len(R)))).opts(color="orange")

    # Scatter plot of the resampled fold change data
    p = hv.Scatter(
        data=(R, fc_resamp),
        kdims=["repressor copy number"],
        vdims=["fold change"],
    ).opts(ylim=(0, 1.1))

    # Theoretical curve parametrized by the MLE
    R_theor = np.linspace(0, 900, 1000)
    fc_theor = theor_fold_change(N_NS, R_theor, mle_binding)
    fit = hv.Path(data=(R_theor, fc_theor)).opts(color="orange", line_width=2)

    if estimate:
        return p * fit + r * zero
    return p + r

plot = pn.Column(button_resid, bs_and_plot)
pn.Row(plot, pn.Column(pn.Spacer(height=80), MLE_estimate))
[13]:
We see that, oftentimes, the high repressor copy number strains show much wider variability in fold change than they do in the original data, likely inflating our estimate of \(\sigma\). A solution?
Wild bootstrap
An approach you might see in this situation is the wild bootstrap, where rather than scrambling the residuals across all the y-values, you retain the association of each residual with its original data point. Each bootstrap sample then lets a data point vary about its fitted value in a normally distributed manner, with a scale set by its original residual. In other words, if the residual for data point \(i\) is \(\epsilon_i\), then the wild bootstrapped residual is \(\epsilon^*_i = \epsilon_i \cdot v_i\), where \(v_i\) is drawn from a Normal distribution with location parameter zero and scale parameter one.
This approach allows us to make bootstrap samples that retain the non-uniform variability we see in the original data. We show this approach below for completeness, and because it is a common technique you might encounter, but we suggest using it with caution for reasons that will become clear.
[14]:
def wild_bootstrap(y_fit, residuals):
    """Produces resampled fitted response variable with the provided residuals."""
    resampled_residuals = residuals * np.random.normal(size=len(residuals))
    return y_fit + resampled_residuals
Let's visually check out what happens with the wild bootstrap.
[15]:
def residuals_plot(x, residuals):
    """Scatter plot of residuals vs. the explanatory variable x."""
    r = hv.Scatter(
        data=(x, residuals),
        kdims=["repressor copy number"],
        vdims=["residual fc"],
    ).opts(ylim=(resids.min() - 0.1, resids.max() + 0.1))
    return r
[16]:
# Make a clickable button and a checkbox for overlaying the MLE curve
button = pn.widgets.Button(name="Wild Bootstrap", button_type="primary")
MLE_estimate = pn.widgets.Checkbox(name="initial MLE estimate", value=False)

# Define a bootstrap and plot function
@pn.depends(button.param.clicks, MLE_estimate.param.value)
def bs_and_plot(_, estimate):
    # Now doing the wild bootstrap
    fc_resamp = wild_bootstrap(MLE_fc_pred, resids)
    r = residuals_plot(R, fc_resamp - MLE_fc_pred)
    zero = hv.Path(data=(R, np.zeros(len(R)))).opts(color="orange")

    # Scatter plot of the resampled fold change data
    p = hv.Scatter(
        data=(R, fc_resamp),
        kdims=["repressor copy number"],
        vdims=["fold change"],
    ).opts(ylim=(0, 1))

    # Theoretical curve parametrized by the MLE
    R_theor = np.linspace(0, 900, 1000)
    fc_theor = theor_fold_change(N_NS, R_theor, mle_binding)
    fit = hv.Path(data=(R_theor, fc_theor)).opts(color="orange", line_width=2)

    if estimate:
        return p * fit + r * zero
    return p + r

plot = pn.Column(button, bs_and_plot)
pn.Row(plot, pn.Column(pn.Spacer(height=80), MLE_estimate))
[16]:
We can now see that the low repressor copy number strains retain high variability in fold change and vice versa, which better reflects what we saw in the original data.
Now we would like to use this bootstrapping technique to get confidence intervals on our parameter estimates. But wait. What about \(\sigma\) now? By choosing to use a wild bootstrap, we have baked in the assumption that each data point is normally distributed about its fitted value with a scale parameter given by the magnitude of its empirical residual.
This is a way of dealing with heteroskedasticity without seriously considering the variation in measurement in a generative way, which is why we warn against using the wild bootstrap. In this case, it would be more appropriate to explicitly construct a generative model that reflects how we suspect \(\sigma\) to scale (sketched below) rather than taking a non-parametric approach that simulates the heteroskedasticity.
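As a rough illustration of what such a model might look like (the scaling form here is an assumption for illustration, not a model from the paper), we could posit that \(\sigma\) is proportional to the theoretical fold change and find the MLE by numerically maximizing the log likelihood.

def log_like_scaled_sigma(params, R, fc_data):
    """Log likelihood with sigma proportional to the theoretical fold change
    (an assumed scaling, for illustration only)."""
    delta_e, sigma_0 = params
    if sigma_0 <= 0:
        return -np.inf
    fc_theory = theor_fold_change(N_NS, R, delta_e)
    return np.sum(st.norm.logpdf(fc_data, fc_theory, sigma_0 * fc_theory))

# Maximize the log likelihood by minimizing its negative
res_scaled = scipy.optimize.minimize(
    lambda params: -log_like_scaled_sigma(params, R, fc),
    x0=np.array([-9.0, 0.1]),
    method="Powell",
)
res_scaled.x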
Returning to the wild bootstrap: to implement it, we do not need to estimate \(\sigma\), since it is no longer a parameter in our generative model. Since we already did the work of finding the parameters by least squares, we can use the same MLE solver we used before and just ignore the estimate of \(\sigma\). So, let's use the wild bootstrap to get our confidence interval on the binding energy.
[17]:
num_trials = 1000
bs_MLEs_wild = np.zeros(num_trials)

for i in tqdm.tqdm(range(num_trials)):
    fc_resamp = wild_bootstrap(MLE_fc_pred, resids)
    mle_params = fc_mle(R, fc_resamp)
    bs_MLEs_wild[i] = mle_params[0]

wild_CI = np.percentile(bs_MLEs_wild, [2.5, 97.5])
wild_CI
100%|██████████| 1000/1000 [00:01<00:00, 668.40it/s]
[17]:
array([-9.51671551, -9.3200403 ])
The confidence interval on the binding energy seems slightly shifted, but let's visualize it!
[18]:
summaries = [
    {"estimate": mle_binding, "conf_int": resampled_resids_CI.T[0], "label": "residuals bootstrap"},
    {"estimate": mle_binding, "conf_int": wild_CI, "label": "wild bootstrap"},
]

p = bebi103.viz.confints(summaries)
p.xaxis.axis_label = "Δε (kBT)"
bokeh.io.show(p)
Should we use wild bootstrap?
If we are not explicitly modeling variability off of the theoretical curve, the wild bootstrap can be an effective way to handle heteroskedasticity. Recall that when we do explicitly model the residuals in a homoskedastic manner, the MLE we get for \(\sigma\) is inconsequential with respect to the parameters of the theoretical model (we derived this in the lesson on regression). So, in a sense, we can be safe in not modeling variability if we assume homoskedasticity (though we could not make predictive regression plots as model checks). So, if we are feeling lazy, or ignorant, we could continue to not model variability generatively and employ the wild bootstrap to handle any lurking heteroskedasticity.
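As a reminder of why this is so, under the homoskedastic Normal model we used above the log likelihood separates into a term involving only \(\sigma\) and the residual sum of squares,

\begin{align} \ell(\Delta\epsilon, \sigma) = -n \ln \sigma - \frac{1}{2\sigma^2}\sum_i \left(\text{fc}_i - \text{fc}_\text{theor}(R_i; \Delta\epsilon)\right)^2 + \text{constant}, \end{align}

so for any fixed \(\sigma\), maximizing over \(\Delta\epsilon\) is equivalent to minimizing the residual sum of squares.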
But, we really should think generatively. If we are going to model, we should model everything. A full generative model, including how we might expect the variability to change as a function of the measured and/or explanatory variables, is preferable. We will spend a lot of time next term building good generative models.
Computing environment
[19]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,numba,tqdm,bebi103,bokeh,holoviews,panel,jupyter
CPython 3.8.5
IPython 7.19.0
numpy 1.19.2
scipy 1.5.2
pandas 1.1.3
numba 0.51.2
tqdm 4.51.0
bebi103 0.1.1
bokeh 2.2.3
holoviews 1.13.5
panel 0.9.7
jupyter 1.0.0