Graphical model assessment: Predictive ECDFs

[1]:
import warnings

import pandas as pd
import numpy as np
import numba
import scipy.optimize
import scipy.special
import scipy.stats as st

import bebi103

import bokeh_catplot

import bokeh.io
bokeh.io.output_notebook()

We continue with graphical model assessment of repeated measurements, using the Rex1 mRNA counts from the Singer, et al. paper. Before we begin, we need to obtain the MLEs for the parameters of our two models: the single Negative Binomial model and the mixture of two Negative Binomials. You have seen this many times before, so you can skip the large code block below (which computes the MLEs) and move on to the next section. Just recall that the Rex1 mRNA counts are stored in a NumPy array n, the parameters for the single Negative Binomial model are in the variables alpha and b, and the parameters for the mixture model are stored as alpha1, b1, alpha2, b2, and w.

[2]:
def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements, parametrized
    by alpha, b=1/beta."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

    return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))


def mle_iid_nbinom(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    NBinom measurements, parametrized by alpha, b=1/beta"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_nbinom(params, n),
            x0=np.array([3, 3]),
            args=(n,),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)


def initial_guess_mix(n, w_guess):
    """Generate initial guess for mixture model."""
    n_low = n[n < np.percentile(n, 100*w_guess)]
    n_high = n[n >= np.percentile(n, 100*w_guess)]

    alpha1, b1 = mle_iid_nbinom(n_low)
    alpha2, b2 = mle_iid_nbinom(n_high)

    return alpha1, b1, alpha2, b2


def log_like_mix(alpha1, b1, alpha2, b2, w, n):
    """Log-likeihood of binary Negative Binomial mixture model."""
    # Fix nonidentifieability be enforcing values of w
    if w < 0 or w > 1:
        return -np.inf

    # Physical bounds on parameters
    if alpha1 < 0 or alpha2 < 0 or b1 < 0 or b2 < 0:
        return -np.inf

    logx1 = st.nbinom.logpmf(n, alpha1, 1/(1+b1))
    logx2 = st.nbinom.logpmf(n, alpha2, 1/(1+b2))

    # Multipliers for log-sum-exp: the likelihood of each measurement is
    # w * PMF_1 + (1 - w) * PMF_2, computed stably in log space
    lse_coeffs = np.tile([w, 1-w], [len(n), 1]).transpose()

    # Log-likelihood for each measurement
    log_likes = scipy.special.logsumexp(np.vstack([logx1, logx2]), axis=0, b=lse_coeffs)

    return np.sum(log_likes)


def mle_mix(n, w_guess):
    """Obtain MLE estimate for parameters for binary mixture
    of Negative Binomials."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_mix(*params, n),
            x0=[*initial_guess_mix(n, w_guess), w_guess],
            args=(n,),
            method='Powell',
            tol=1e-6,
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)


# Load in data
df = pd.read_csv("../data/singer_transcript_counts.csv", comment="#")
n = df['Rex1'].values

# Single Negative Binomial MLE
alpha, b = mle_iid_nbinom(n)

# Mixture model MLE
alpha1, b1, alpha2, b2, w = mle_mix(n, 0.2)

Predictive ECDFs

If we wish to accept a model with the MLE for the parameters as a good representation of the true generative model, we would like to see how data generated from that model would look. We can do a parametric bootstrap to generate data sets and then plot confidence intervals of the resulting ECDFs. We call ECDFs that come from the generative model predictive ECDFs. We can try that for the single Negative Binomial model. First, we write a function to draw samples out of a Negative Binomial distribution.

[3]:
rg = np.random.default_rng()

def draw_neg_binom(alpha, b, size=1):
    """Draw samples from a Negative Binomial distribution parametrized
    by alpha and b = 1/beta."""
    return rg.negative_binomial(alpha, 1 / (1 + b), size=size)

Now, we can draw 5000 data sets from this distribution, each having 279 (the number of cells that were measured) data points.

[4]:
parametric_bs_samples = draw_neg_binom(alpha, b, size=(5000, len(n)))

We can then compute the predictive ECDFs for each of these samples. We want the values of the predictive ECDF at each possible value of \(n\), which requires us to compute the value of the ECDF at an arbitrary value. You wrote a function, ecdf(x, data), to do this in homework 7.1(e). Referring to the homework solutions, we can use that function here.

[5]:
@numba.njit
def ecdf(x, data):
    """Give the value of an ECDF at arbitrary points x."""
    y = np.arange(len(data) + 1) / len(data)
    return y[np.searchsorted(np.sort(data), x, side="right")]
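
As a quick sanity check on this function (a toy example of our own, not from the homework), consider the data set {1, 2, 3}: its ECDF is 0 below 1, 1/3 on [1, 2), 2/3 on [2, 3), and 1 at and above 3.

# Toy check of ecdf(): evaluate the ECDF of {1, 2, 3} at a few points.
# This should evaluate to [0, 1/3, 1].
ecdf(np.array([0.0, 1.5, 3.0]), np.array([1.0, 2.0, 3.0]))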

Now we can compute the value of each bootstrapped ECDF at every possible value of \(n\).

[6]:
n_theor = np.arange(0, parametric_bs_samples.max() + 1)

ecdfs = np.array([ecdf(n_theor, sample) for sample in parametric_bs_samples])

Now that we have the ECDF values, we can compute the middle 95% envelope of the ECDF, given by the 2.5th and 97.5th percentiles of the ECDF value at each point.

[7]:
ecdf_low, ecdf_high = np.percentile(ecdfs, [2.5, 97.5], axis=0)

And now, we can make a plot.

[8]:
p = bebi103.viz.fill_between(
    x1=n_theor,
    y1=ecdf_high,
    x2=n_theor,
    y2=ecdf_low,
    patch_kwargs={"fill_alpha": 0.5},
    x_axis_label="n",
    y_axis_label="ECDF",
)

bokeh.io.show(p)

This is a predictive ECDF. It is the envelope in which we would expect 95% of the data points to lie if the generative model were true.
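
As a rough quantitative check (a sketch of our own; it assumes every measured count falls within the range of n_theor), we can compute the fraction of measured points whose ECDF value lies inside the envelope.

# Value of the data's ECDF at each measured count
data_ecdf = ecdf(n, n)

# Since n_theor = 0, 1, 2, ..., a count k indexes directly into the envelope
frac_inside = np.mean((data_ecdf >= ecdf_low[n]) & (data_ecdf <= ecdf_high[n]))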

Comparing with data

We can overlay the ECDF of the data to see if it falls within this envelope.

[9]:
p = bokeh_catplot.ecdf(data=df, val='Rex1', palette=['orange'], p=p)

bokeh.io.show(p)

The Negative Binomial model generated some counts that were quite large, which is why the plot extends past counts of 1000. If you zoom in on the region where the data lie, the deviation of the measured ECDF from the ECDF generated by the model is clear.
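
If you would rather set the zoom programmatically than use the plot tools, one way (a sketch using standard Bokeh ranges; the cutoff of 500 is just a convenient choice for this data set) is to restrict the x-range before showing the plot again.

import bokeh.models

# Restrict the view to the region where the data lie
p.x_range = bokeh.models.Range1d(0, 500)

bokeh.io.show(p)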

We can also compare with the mixture model.

Using predictive ECDFs in the bebi103 package

The bebi103 package enables automated generation of predictive ECDFs. To demonstrate its usage, we can generate samples from the mixture model and plot a predictive ECDF.

[10]:
def draw_mix(alpha1, b1, alpha2, b2, w, size):
    """Draw samples from Negative Binomial binary mixture model."""
    n = np.empty(size)
    for i in range(size):
        low_cell_type = rg.uniform() < w

        if low_cell_type:
            n[i] = rg.negative_binomial(alpha1, 1 / (1 + b1))
        else:
            n[i] = rg.negative_binomial(alpha2, 1 / (1 + b2))

    return n

samples = np.array([draw_mix(alpha1, b1, alpha2, b2, w, size=len(n)) for _ in range(5000)])
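
As an aside, the Python loop in draw_mix can be avoided. The number of low-expression cells in a sample of a given size is Binomially distributed with parameter w, so we can draw that number first and then draw each subpopulation in a single vectorized call. Here is a sketch (the draws come out grouped by cell type, but ordering is irrelevant for ECDFs):

def draw_mix_vectorized(alpha1, b1, alpha2, b2, w, size):
    """Vectorized draw from the binary Negative Binomial mixture model."""
    # Number of cells of the low-expression type
    n_low = rg.binomial(size, w)

    return np.concatenate(
        (
            rg.negative_binomial(alpha1, 1 / (1 + b1), size=n_low),
            rg.negative_binomial(alpha2, 1 / (1 + b2), size=size - n_low),
        )
    )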

Now, we can call bebi103.viz.predictive_ecdf() to make the plot.

[11]:
# Use discrete=True for discrete data
p = bebi103.viz.predictive_ecdf(
    samples=samples, data=n, discrete=True, x_axis_label="n"
)

bokeh.io.show(p)

Here we see that the predictive ECDF captures the measured data much better. If you zoom into the plot, you will notice that the bebi103.viz.predictive_ecdf() function plots quantiles in various shades of blue. By default, the outermost region (lightest blue) contains the middle 80% of the generated ECDFs, the next one in captures the middle 60%, the next the middle 40%, and the last the middle 20%. The median ECDF from the generative model is a dark blue line in the middle.
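
To see where those shaded regions come from, we can compute an envelope by hand, just as we did for the single Negative Binomial model. For example, here is a sketch of the middle 80% region of the mixture-model ECDFs (in the spirit of what predictive_ecdf displays; the function's internals may differ in detail).

n_theor_mix = np.arange(0, samples.max() + 1)
ecdfs_mix = np.array([ecdf(n_theor_mix, sample) for sample in samples])

# Middle 80%: 10th and 90th percentiles of the ECDF value at each n
env_low, env_high = np.percentile(ecdfs_mix, [10, 90], axis=0)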

ECDF difference

While the predictive ECDF plot is useful, it is perhaps more illuminating to plot the difference between the predictive ECDF and the measured ECDF. We can do this with the diff=True kwarg.

[12]:
p = bebi103.viz.predictive_ecdf(
    samples=samples, data=n, diff=True, discrete=True, x_axis_label="n"
)

bokeh.io.show(p)

It is now clear that very few of the 279 data points lie outside the 80% envelope of the predictive ECDF. We might expect more to lie outside; it is conceivable that the model is too flexible.
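
Using the middle-80% envelope we computed by hand above, we can count those points directly (again, a sketch that assumes every measured count falls within the range of n_theor_mix).

# ECDF value of the data at each measured count
data_ecdf = ecdf(n, n)

# Integer counts index directly into the envelope's grid of values
n_outside = np.sum((data_ecdf < env_low[n]) | (data_ecdf > env_high[n]))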

Computing environment

[13]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,bokeh,bokeh_catplot,bebi103,jupyterlab
CPython 3.7.5
IPython 7.9.0

numpy 1.17.4
pandas 0.24.2
scipy 1.3.1
bokeh 1.4.0
bokeh_catplot 0.1.6
bebi103 0.0.45
jupyterlab 1.2.3