Graphical model assessment for univariate models
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import warnings

import pandas as pd
import numpy as np
import scipy.optimize
import scipy.special
import scipy.stats as st

import numba

import bebi103
import iqplot

import bokeh.io
bokeh.io.output_notebook()
Graphical model assessment
We have already discussed model assessment. In this lesson, we will discuss the implementation of graphical model assessment.
We will use the transcript counts from Singer, et al., this time focusing on the Rex1 gene, which shows clear bimodality. As a reminder, here is the ECDF for the transcript counts.
[2]:
df = pd.read_csv(os.path.join(data_path, "singer_transcript_counts.csv"), comment="#")
p = iqplot.ecdf(
    data=df, q="Rex1", x_axis_label="Rex1 mRNA count", conf_int=True
)
bokeh.io.show(p)
Before we can proceed to the model assessments, we need to write functions to calculate the MLE.
Likelihood calculators
In this and the following parts of this lesson, we will need the functions we have written to perform maximum likelihood estimation for a Negative Binomial distribution and for a mixture of two Negative Binomials. These are copied directly from the lesson on mixture models, so you do not need to focus on this rather long code cell.
[3]:
def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements, parametrized
    by alpha, b=1/beta."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

    return np.sum(st.nbinom.logpmf(n, alpha, 1 / (1 + b)))
def mle_iid_nbinom(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    NBinom measurements, parametrized by alpha, b=1/beta"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_nbinom(params, n),
            x0=np.array([3, 3]),
            args=(n,),
            method="Powell",
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError("Convergence failed with message", res.message)
def initial_guess_mix(n, w_guess):
    """Generate initial guess for mixture model."""
    n_low = n[n < np.percentile(n, 100 * w_guess)]
    n_high = n[n >= np.percentile(n, 100 * w_guess)]

    alpha1, b1 = mle_iid_nbinom(n_low)
    alpha2, b2 = mle_iid_nbinom(n_high)

    return alpha1, b1, alpha2, b2
def log_like_mix(alpha1, b1, alpha2, b2, w, n):
    """Log-likelihood of binary Negative Binomial mixture model."""
    # Fix nonidentifiability by enforcing bounds on w
    if w < 0 or w > 1:
        return -np.inf

    # Physical bounds on parameters
    if alpha1 < 0 or alpha2 < 0 or b1 < 0 or b2 < 0:
        return -np.inf

    logx1 = st.nbinom.logpmf(n, alpha1, 1 / (1 + b1))
    logx2 = st.nbinom.logpmf(n, alpha2, 1 / (1 + b2))

    # Multipliers for log-sum-exp
    lse_coeffs = np.tile([w, 1 - w], [len(n), 1]).transpose()

    # log-likelihood for each measurement
    log_likes = scipy.special.logsumexp(np.vstack([logx1, logx2]), axis=0, b=lse_coeffs)

    return np.sum(log_likes)
def mle_mix(n, w_guess):
    """Obtain MLE estimate for parameters for binary mixture
    of Negative Binomials."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_mix(*params, n),
            x0=[*initial_guess_mix(n, w_guess), w_guess],
            args=(n,),
            method="Powell",
            tol=1e-6,
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError("Convergence failed with message", res.message)
We can now carry out the MLE for both models.
[4]:
# Load in data
df = pd.read_csv(os.path.join(data_path, "singer_transcript_counts.csv"), comment="#")
n = df["Rex1"].values

# Single Negative Binomial MLE
alpha, b = mle_iid_nbinom(n)

# Mixture model MLE
alpha1, b1, alpha2, b2, w = mle_mix(n, 0.2)
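It is often useful to take a quick look at the estimates to make sure the optimizations returned sensible values. This is just a convenience check; the printed numbers depend on the data set.
[ ]:
# Quick sanity check on the MLEs (illustrative; values depend on the data)
print("Single Neg. Binomial: alpha = {0:.2f}, b = {1:.2f}".format(alpha, b))
print(
    "Mixture: alpha1 = {0:.2f}, b1 = {1:.2f}, alpha2 = {2:.2f}, b2 = {3:.2f}, w = {4:.2f}".format(
        alpha1, b1, alpha2, b2, w
    )
)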
Constructing a Q-Q plot
We will start by making a Q-Q plot for the single Negative Binomial case. The function bebi103.viz.qqplot() generates Q-Q plots. In order to provide the function with samples out of the generative distribution, let's draw out of a single Negative Binomial with our \(\alpha\), \(b\) parametrization.
[5]:
rg = np.random.default_rng()

# Draw 100,000 data sets; NumPy parametrizes by (alpha, p) with p = 1/(1+b)
single_samples = np.array(
    [rg.negative_binomial(alpha, 1 / (1 + b), size=len(n)) for _ in range(100000)]
)
With that in place, we can call bebi103.viz.qqplot() to generate the plot.
[6]:
p = bebi103.viz.qqplot(
    data=n,
    samples=single_samples,
    x_axis_label="transcript counts",
    y_axis_label="transcript counts",
)
bokeh.io.show(p)
Clearly, the generative model cannot produce the observed transcript counts, since the Q-Q plot shows strong separation of the generative quantiles from the observed quantiles.
Q-Q plot for the mixture model
We can also make a Q-Q plot for the mixture model. We need to write a function to generate data out of the mixture model. As we did in a previous lesson, we first randomly choose a cell type with weight \(w\), and then draw out of a Negative Binomial distribution with parameters corresponding to that cell type.
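Concretely, the mixture model's PMF for a count \(n\) is the weighted sum \(f(n\,;\,\alpha_1, b_1, \alpha_2, b_2, w) = w\,f_\mathrm{NBinom}(n\,;\,\alpha_1, b_1) + (1-w)\,f_\mathrm{NBinom}(n\,;\,\alpha_2, b_2)\), so choosing a cell type with probability \(w\) and then sampling from the corresponding Negative Binomial generates draws from exactly this distribution.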
[7]:
@numba.njit
def draw_mix(alpha1, b1, alpha2, b2, w, size):
    """Draw samples from Negative Binomial binary mixture model."""
    n = np.empty(size)
    for i in range(size):
        low_cell_type = np.random.rand() < w

        if low_cell_type:
            n[i] = np.random.negative_binomial(alpha1, 1 / (1 + b1))
        else:
            n[i] = np.random.negative_binomial(alpha2, 1 / (1 + b2))

    return n


mixture_samples = np.array(
    [draw_mix(alpha1, b1, alpha2, b2, w, size=len(n)) for _ in range(100000)]
)
Now, we can make the Q-Q plot!
[8]:
p = bebi103.viz.qqplot(
    data=n,
    samples=mixture_samples,
    x_axis_label="transcript counts",
    y_axis_label="transcript counts",
)
bokeh.io.show(p)
The Q-Q plot shows much stronger agreement between the theoretical distribution and the observed one.
Predictive ECDFs
If we wish to accept a model with the MLE for the parameters as a good representation of the true generative model, we would like to see how data generated from that model would look. We can do a parametric bootstrap to generate data sets and then plot confidence intervals of the resulting ECDFs. We call ECDFs that come from the generative model predictive ECDFs. We can try that for the single Negative Binomial model.
Given the samples (which we already computed and stored in single_samples), we can then compute the predictive ECDFs for each of these samples. We want the values of the predictive ECDF at each possible value of \(n\), which requires us to compute the value of the ECDF at an arbitrary value. You wrote a function, ecdf(x, data), to do this in a homework problem. Referring to the homework solutions, we can use that function here.
[9]:
@numba.njit
def ecdf(x, data):
    """Give the value of an ECDF at arbitrary points x."""
    y = np.arange(len(data) + 1) / len(data)
    return y[np.searchsorted(np.sort(data), x, side="right")]
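As a quick illustration of how this function works, consider a small, made-up data set (the values below are illustrative only, not transcript counts):
[ ]:
# Illustrative check of ecdf() on a toy data set
toy_data = np.array([1.0, 2.0, 3.0, 4.0])

# Expect array([0., 0.5, 1.]): 0.5 lies below all points, 2.5 lies above
# half of them, and 4.0 is >= all of them
ecdf(np.array([0.5, 2.5, 4.0]), toy_data)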
Now we can compute the value of the ECDF for the respective values of \(n\).
[10]:
n_theor = np.arange(0, single_samples.max() + 1)
ecdfs = np.array([ecdf(n_theor, sample) for sample in single_samples])
Now that we have the ECDF values, we can compute the bounds of the middle 95% of the ECDF values at each value of \(n\).
[11]:
ecdf_low, ecdf_high = np.percentile(ecdfs, [2.5, 97.5], axis=0)
And now, we can make a plot.
[12]:
p = bebi103.viz.fill_between(
    x1=n_theor,
    y1=ecdf_high,
    x2=n_theor,
    y2=ecdf_low,
    patch_kwargs={"fill_alpha": 0.5},
    x_axis_label="n",
    y_axis_label="ECDF",
)
bokeh.io.show(p)
This is a predictive ECDF. It is the envelope in which we would expect 95% of the data points to lie if the generative model were true.
Comparing with data
We can overlay the ECDF of the data to see if it falls within this envelope.
[13]:
p = iqplot.ecdf(data=df, q='Rex1', palette=['orange'], p=p)
bokeh.io.show(p)
The Negative Binomial model generated some counts that were quite large, which is why the plot extends past counts of 1000. If you zoom in on the region where the data lie, the deviation of the measured ECDF from the ECDF generated by the model is clear.
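If you prefer to set the view programmatically rather than using the interactive zoom tool, you can assign a new range to the plot. The bounds below are chosen by eye for illustration only.
[ ]:
import bokeh.models

# Restrict the x-axis to roughly where the data lie (bounds are illustrative)
p.x_range = bokeh.models.Range1d(0, 500)

bokeh.io.show(p)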
We can also compare with the mixture model.
Using predictive ECDFs in the bebi103 package
The bebi103 package enables automated generation of predictive ECDFs. To demonstrate its usage, we can use the samples we generated from the mixture model and plot a predictive ECDF.
[14]:
# Use discrete=True for discrete data
p = bebi103.viz.predictive_ecdf(
    samples=mixture_samples, data=n, discrete=True, x_axis_label="n"
)
bokeh.io.show(p)
Here we see that the predictive ECDF captures the measured data much better. If you zoom into the plot, you will notice that the bebi103.viz.predictive_ecdf() function plots quantiles in various shades of blue. By default, the outermost region (lightest blue) contains the middle 95% of the generated ECDFs, and the next one in captures the middle 68%. The median ECDF from the generative model is a dark blue line in the middle.
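If you want different coverage bands, the function accepts a percentiles kwarg; this is an assumption based on the bebi103 documentation, so check the docstring of your installed version. A sketch:
[ ]:
# Sketch: explicitly choosing the plotted quantile regions. The percentiles
# kwarg is assumed from the bebi103 docs; verify against your installed version.
p = bebi103.viz.predictive_ecdf(
    samples=mixture_samples,
    data=n,
    percentiles=[99, 95, 68],
    discrete=True,
    x_axis_label="n",
)

bokeh.io.show(p)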
ECDF difference
While the predictive ECDF plot is useful, it is perhaps more illuminating to plot the difference between the predictive ECDF and the measured ECDF. We can do this with the diff='ecdf' kwarg.
[15]:
p = bebi103.viz.predictive_ecdf(
    samples=mixture_samples, data=n, diff='ecdf', discrete=True, x_axis_label="n"
)
bokeh.io.show(p)
It is now clear that very few of the 279 data points lie outside the 95% envelope of the predictive ECDF. We might expect more to lie outside; it is conceivable that the model is too flexible.
Computing environment
[16]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,bokeh,iqplot,bebi103,jupyterlab
CPython 3.8.5
IPython 7.19.0
numpy 1.19.2
pandas 1.1.3
scipy 1.5.2
bokeh 2.2.3
iqplot 0.1.6
bebi103 0.1.1
jupyterlab 2.2.6