Parallel bootstrap calculations

Dataset download


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot bebi103 multiprocess watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

try:
    import multiprocess
except:
    import multiprocessing as multiprocess

import warnings

import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st

import tqdm

import bebi103

import bokeh.io
import bokeh.layouts
bokeh.io.output_notebook()
Loading BokehJS ...

In the last part of this lesson, we saw that computing the MLE confidence intervals can take some time (several minutes). To speed up the calculation, we can compute the bootstrap replicates in parallel: we split the sampling job into parts, have different processing cores work on each part, and then combine the results at the end.

To begin, we load in the data.

[2]:
df = pd.read_csv(os.path.join(data_path, "singer_transcript_counts.csv"), comment="#")

And we also need the functions we used for performing the MLE calculation from earlier in the lesson.

[3]:
def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements, parametrized
    by alpha, b=1/beta."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

    return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))


def mle_iid_nbinom(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    NBinom measurements, parametrized by alpha, b=1/beta"""

    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_nbinom(params, n),
            x0=np.array([3, 3]),
            args=(n,),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)

Explicit RNGs

When we parallelize the calculation, we have to be very careful about the random number generators we use. If we define a random number generator globally with rg = np.random.default_rng(), then each parallel process will use the same sequence of pseudorandom numbers. We therefore need to make a new RNG for each of the parallel processes. As a result, we need to explicitly supply a random number generator to the function we use to generate samples. This changes the call signature to include the RNG, but requires no other changes.
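The functions below take a simple approach to this: if no RNG is passed in, each process creates its own fresh instance with np.random.default_rng(). As a rough sketch of an alternative approach (not part of the lesson’s code), NumPy’s SeedSequence can spawn independent child seeds, one per worker, from a single parent; the worker count of three here is arbitrary.

import numpy as np

# Spawn independent child seed sequences from a single parent
seed_seq = np.random.SeedSequence()
child_seqs = seed_seq.spawn(3)  # three hypothetical worker processes

# One statistically independent RNG per worker
rngs = [np.random.default_rng(s) for s in child_seqs]

# Each generator produces its own stream of pseudorandom numbers
print([rng.integers(0, 100, size=3) for rng in rngs])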

[4]:
def gen_nbinom(alpha, b, size, rg):
    return rg.negative_binomial(alpha, 1 / (1 + b), size=size)

Next, we need to adjust the draw_parametric_bs_reps_mle() function to also accept an explicit RNG. If we do not supply one, a new one should be created within the function.

Finally, since the parallel version will supersede the draw_parametric_bs_reps_mle() function we have already written, we prepend the name of the non-parallel version with an underscore, designating it as a private function (essentially a function an end user will not call), since we will not call it directly again except for demonstration purposes.

[5]:
def _draw_parametric_bs_reps_mle(
    mle_fun, gen_fun, data, args=(), size=1, progress_bar=False, rg=None,
):
    """Draw parametric bootstrap replicates of maximum likelihood estimator.

    Parameters
    ----------
    mle_fun : function
        Function with call signature mle_fun(data, *args) that computes
        a MLE for the parameters
    gen_fun : function
        Function to randomly draw a new data set out of the model
        distribution parametrized by the MLE. Must have call
        signature `gen_fun(*params, size, *args, rg)`.
    data : one-dimensional Numpy array
        Array of measurements
    args : tuple, default ()
        Arguments to be passed to `mle_fun()`.
    size : int, default 1
        Number of bootstrap replicates to draw.
    progress_bar : bool, default False
        Whether or not to display progress bar.
    rg : numpy.random.Generator instance, default None
        RNG to be used in bootstrapping. If None, a new default
        Numpy RNG is created with fresh entropy from the OS.

    Returns
    -------
    output : numpy array
        Bootstrap replicates of MLEs.
    """
    if rg is None:
        rg = np.random.default_rng()

    params = mle_fun(data, *args)

    if progress_bar:
        iterator = tqdm.tqdm(range(size))
    else:
        iterator = range(size)

    return np.array(
        [mle_fun(gen_fun(*params, size=len(data), *args, rg=rg)) for _ in iterator]
    )

These functions still work as advertised. We can call _draw_parametric_bs_reps_mle() exactly as we did earlier in the lesson. To get 100 bootstrap replicates (only 100 so we don’t have to wait too long), we can do the following for Nanog.

[6]:
bs_reps = _draw_parametric_bs_reps_mle(
    mle_iid_nbinom, gen_nbinom, df["Nanog"].values, args=(), size=100, progress_bar=True
)
100%|██████████| 100/100 [00:03<00:00, 27.71it/s]

Enabling parallelization

To enable multiple cores to draw bootstrap replicates, we could use Python’s built-in multiprocessing module. That module has problems running with Jupyter notebooks on some operating systems, though. As a drop-in replacement, we can use Mike McKerns’s multiprocess package, which offers improvements over the built-in multiprocessing module. You can install multiprocess with

pip install --upgrade multiprocess

To enable seamless switching between these two multiprocessing packages, I included the following import statement.

try:
    import multiprocess
except:
    import multiprocessing as multiprocess

We can therefore use the two packages interchangeably, referring to whichever one we import as multiprocess. The syntax is as follows.

with multiprocess.Pool(n_jobs) as pool:
    result = pool.starmap(fun, arg_iterable)

Here, fun is a function and arg_iterable is an iterable that produces the arguments to fun as tuples, which are splatted (unpacked) when passed to fun. This is best seen with a simple example before we get into the more complicated case of bootstrapping. We will choose a function that adds its four arguments. Then, arg_iterable will be an iterable (in this case a list) of tuples, each of which has four elements.

[7]:
def fun(a, b, c, d):
    return a + b + c + d

arg_iterable = [(i, i+1, i+2, i+3) for i in range(8)]

# Display arg_iterable
arg_iterable
[7]:
[(0, 1, 2, 3),
 (1, 2, 3, 4),
 (2, 3, 4, 5),
 (3, 4, 5, 6),
 (4, 5, 6, 7),
 (5, 6, 7, 8),
 (6, 7, 8, 9),
 (7, 8, 9, 10)]

If we call fun with one of the arguments from the arg_iterable using a splat operator, we get the appropriate sum.

[8]:
fun(*arg_iterable[0])
[8]:
6

Instead of looping over the arguments and evaluating fun for each one in turn, we can do the evaluations in parallel, for example using two cores.

[9]:
with multiprocess.Pool(2) as pool:
    result = pool.starmap(fun, arg_iterable)

# Take a look
result
[9]:
[6, 10, 14, 18, 22, 26, 30, 34]

This gives the same result as if we did the following.

[10]:
[fun(*args) for args in arg_iterable]
[10]:
[6, 10, 14, 18, 22, 26, 30, 34]

To summarize the equivalence,

with multiprocess.Pool(2) as pool:
    result = pool.starmap(fun, arg_iterable)

is the same as

[fun(*args) for args in arg_iterable]

We can use this same structure for our bootstrap replicates. Here, fun is _draw_parametric_bs_reps_mle(), and we need to make a list of arguments to pass to that function. This means splitting size into n_jobs chunks of roughly size // n_jobs replicates each, as illustrated in the sketch below.
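As a quick illustration of that chunking (a sketch, not a cell from the lesson), with size = 10000 and n_jobs = 3 the chunk sizes work out as follows.

size, n_jobs = 10000, 3

# Each core gets size // n_jobs replicates...
sizes = [size // n_jobs for _ in range(n_jobs)]

# ...and the last core picks up the remainder so the chunks sum to size
sizes[-1] += size - sum(sizes)

# sizes is now [3333, 3333, 3334]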

[11]:
def draw_parametric_bs_reps_mle(
    mle_fun, gen_fun, data, args=(), size=1, n_jobs=1, progress_bar=False
):
    """Draw nonparametric bootstrap replicates of maximum likelihood estimator.

    Parameters
    ----------
    mle_fun : function
        Function with call signature mle_fun(data, *args) that computes
        a MLE for the parameters
    gen_fun : function
        Function to randomly draw a new data set out of the model
        distribution parametrized by the MLE. Must have call
        signature `gen_fun(*params, size, *args, rg)`.
    data : one-dimensional Numpy array
        Array of measurements
    args : tuple, default ()
        Arguments to be passed to `mle_fun()`.
    size : int, default 1
        Number of bootstrap replicates to draw.
    n_jobs : int, default 1
        Number of cores to use in drawing bootstrap replicates.
    progress_bar : bool, default False
        Whether or not to display progress bar.

    Returns
    -------
    output : numpy array
        Bootstrap replicates of MLEs.
    """
    # Just call the original function if n_jobs is 1 (no parallelization)
    if n_jobs == 1:
        return _draw_parametric_bs_reps_mle(
            mle_fun, gen_fun, data, args=args, size=size, progress_bar=progress_bar
        )

    # Set up sizes of bootstrap replicates for each core, making sure we
    # get all of them, even if sizes % n_jobs != 0
    sizes = [size // n_jobs for _ in range(n_jobs)]
    sizes[-1] += size - sum(sizes)

    # Build arguments
    arg_iterable = [(mle_fun, gen_fun, data, args, s, progress_bar, None) for s in sizes]

    with multiprocess.Pool(n_jobs) as pool:
        result = pool.starmap(_draw_parametric_bs_reps_mle, arg_iterable)

    return np.concatenate(result)

Now, we can specify n_jobs to be greater than one to take advantage of multiple cores. It is important to know how many cores you have on your machine. I usually keep two cores open for other processes on my computer so it doesn’t die on me. You can check how many cores you have using the multiprocess.cpu_count() function.

[12]:
multiprocess.cpu_count()
[12]:
4

My computer has four cores, so I will use three of them. Now that I am unleashing three cores on the problem, I will again draw 10,000 replicates, and this time it should take only about a third as long as the original calculation.

[13]:
bs_reps = draw_parametric_bs_reps_mle(
    mle_iid_nbinom,
    gen_nbinom,
    df["Nanog"].values,
    args=(),
    size=10000,
    n_jobs=3,
    progress_bar=True,
)
100%|██████████| 3333/3333 [02:08<00:00, 25.93it/s]
100%|██████████| 3334/3334 [02:08<00:00, 25.90it/s]
100%|██████████| 3333/3333 [02:08<00:00, 25.88it/s]

From these replicates, we can compute the confidence intervals for \(\alpha\) and \(b\) for Nanog.

[14]:
np.percentile(bs_reps, [2.5, 97.5], axis=0)
[14]:
array([[ 1.09329899, 56.91880427],
       [ 1.48626565, 82.95724807]])

Parameter estimation for all genes

We are now empowered to compute the MLE and confidence intervals for all four genes, which we will do with parametric bootstrap. We can ignore the warnings, as they occur when the solver attempts illegal (negative) parameter values; Powell’s method is robust to these excursions.

[15]:
genes = ["Nanog", "Prdm14", "Rest", "Rex1"]

# Data frame to store results
df_mle = pd.DataFrame(
    columns=["gene", "parameter", "mle", "conf_int_low", "conf_int_high"]
)

# Perform MLE for each gene
for gene in tqdm.tqdm(genes):
    mle = mle_iid_nbinom(df[gene].values)

    bs_reps = draw_parametric_bs_reps_mle(
        mle_iid_nbinom,
        gen_nbinom,
        df[gene].values,
        args=(),
        size=10000,
        n_jobs=3,
        progress_bar=False,
    )
    conf_int = np.percentile(bs_reps, [2.5, 97.5], axis=0)

    sub_df = pd.DataFrame(
        {
            "gene": [gene] * 2,
            "parameter": ["alpha", "b"],
            "mle": mle,
            "conf_int_low": conf_int[0, :],
            "conf_int_high": conf_int[1, :],
        }
    )
    df_mle = pd.concat([df_mle, sub_df])
100%|██████████| 4/4 [09:00<00:00, 135.04s/it]

We now have a data frame with the results from our statistical inference. Let’s take a look.

[16]:
df_mle
[16]:
gene parameter mle conf_int_low conf_int_high
0 Nanog alpha 1.263097 1.091755 1.493337
1 Nanog b 69.347842 56.810812 82.658532
0 Prdm14 alpha 0.552886 0.452151 0.681750
1 Prdm14 b 8.200636 6.180404 10.579209
0 Rest alpha 4.530335 3.862485 5.447868
1 Rest b 16.543054 13.616810 19.451328
0 Rex1 alpha 1.634562 1.411392 1.928309
1 Rex1 b 84.680915 70.001857 100.053863

Finally, we can make a plot of the MLEs of the parameter values with confidence intervals.

[17]:
sub_df = df_mle.loc[df_mle["parameter"] == "alpha", :]
summaries = [
    {"estimate": est, "conf_int": conf, "label": label}
    for est, conf, label in zip(
        sub_df["mle"].values,
        sub_df[["conf_int_low", "conf_int_high"]].values,
        sub_df["gene"],
    )
]

p1 = bebi103.viz.confints(
        summaries,
        x_axis_label="Ξ±*",
        frame_height=75,
    )

sub_df = df_mle.loc[df_mle["parameter"] == "b", :]
summaries = [
    {"estimate": est, "conf_int": conf, "label": label}
    for est, conf, label in zip(
        sub_df["mle"].values,
        sub_df[["conf_int_low", "conf_int_high"]].values,
        sub_df["gene"],
    )
]
p2 = bebi103.viz.confints(
        summaries,
        x_axis_label="b*",
        frame_height=75,
    )

bokeh.io.show(bokeh.layouts.gridplot([p1, p2], ncols=1))

Implementation of MLE bootstrapping in the bebi103 package

The functionality we have developed in this lesson for bootstrapping MLE estimates is available in the function bebi103.bootstrap.draw_bs_reps_mle(). It has call signature

bebi103.bootstrap.draw_bs_reps_mle(
    mle_fun,
    gen_fun,
    data,
    mle_args=(),
    gen_args=(),
    size=1,
    n_jobs=1,
    progress_bar=False,
    rg=None,
)

Here, mle_fun is a function with call signature mle_fun(data, *mle_args) that computes the maximum likelihood estimate from the data. gen_fun is a function with call signature gen_fun(params, *gen_args, size, rg) that generates bootstrap samples used to make bootstrap replicates of the maximum likelihood estimates. Here, params is a tuple of the parameters of the model. Note that not all of the inputs to gen_fun are necessarily used, but having them allows for flexibility. The bebi103.bootstrap.draw_bs_reps_mle() function allows for both parametric and nonparametric bootstrapping.
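For a parametric bootstrap with this function, the generating function must unpack the parameter tuple itself. The following is a minimal sketch under the call signature quoted above (not a cell from the lesson); gen_nbinom_params is a hypothetical helper, and mle_iid_nbinom and df are the objects defined earlier in this notebook.

# Hypothetical generating function matching gen_fun(params, *gen_args, size, rg)
def gen_nbinom_params(params, size, rg):
    alpha, b = params
    return rg.negative_binomial(alpha, 1 / (1 + b), size=size)

# Parametric bootstrap replicates for Nanog (sketch; not run here)
bs_reps_parametric = bebi103.bootstrap.draw_bs_reps_mle(
    mle_iid_nbinom,
    gen_nbinom_params,
    df["Nanog"].values,
    size=10000,
    n_jobs=3,
    progress_bar=True,
)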

As an example, we will repeat the bootstrapped MLE confidence region calculation for Nanog, except we will do it nonparametrically this time.

[18]:
# This is gen_fun; for the nonparametric bootstrap it is just a resampling.
# params = (alpha, b) is passed in but ignored.
# gen_args = (n,); the measured data are what we resample from.
def resample_fun(params, n, size, rg):
    return rg.choice(n, size=size)


bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
    mle_iid_nbinom,
    resample_fun,
    df['Nanog'].values,
    gen_args=(df['Nanog'].values, ),
    size=10000,
    n_jobs=3,
    progress_bar=True,
)

# Show the confidence interval
np.percentile(bs_reps, [2.5, 97.5], axis=0)
100%|██████████| 3333/3333 [02:16<00:00, 24.49it/s]
100%|██████████| 3334/3334 [02:16<00:00, 24.46it/s]
100%|██████████| 3333/3333 [02:16<00:00, 24.45it/s]
[18]:
array([[ 1.06094945, 57.17678826],
       [ 1.54650337, 82.31584567]])

The nonparametric confidence intervals differ slightly from the parametric ones, but are quite similar.

Computing environment

[19]:
%load_ext watermark
%watermark -v -p multiprocess,numpy,scipy,pandas,tqdm,bokeh,bebi103,jupyterlab
CPython 3.8.5
IPython 7.19.0

multiprocess 0.70.11.1
numpy 1.19.2
scipy 1.5.2
pandas 1.1.3
tqdm 4.50.2
bokeh 2.2.3
bebi103 0.1.1
jupyterlab 2.2.6