Graphical model assessment: Q-Q plots

Dataset download


[1]:
import warnings

import pandas as pd
import numpy as np
import scipy.optimize
import scipy.special
import scipy.stats as st

import bebi103

import bokeh_catplot

import bokeh.io
bokeh.io.output_notebook()

Graphical model assessment

Thus far, we have not been terribly principled in how we have assessed models. By model assessment, I mean some metric, quantitative or qualitative, of how well matched the proposed generative model and the actual (unknown) generative model are. In my opinion, many of the most effective methods are graphical, meaning that we make plots of data generated from the theoretical generative model along with the measured data. We will cover a few effective ways of generating these graphics, starting with Q-Q plots.

Q-Q plots

Q-Q plots (the “Q” stands for “quantile”) are a convenient way to graphically compare two probability distributions. In our case, we wish to compare the measured distribution to the theoretical one parametrized by the MLE. There are many ways to generate Q-Q plots, and many of the descriptions out there are kind of convoluted. Here is a procedure/description I like for \(N\) total empirical samples.

  1. Sort your samples from lowest to highest.

  2. Draw \(N\) samples from the theoretical distribution and sort them.

  3. Plot your samples against the samples from the theoretical distribution.

If the plotted points fall on a straight line of slope one and intercept zero (“the diagonal”), the distributions are the same. Deviations from the diagonal highlight differences in the distributions.
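
To make the three steps concrete, here is a minimal sketch using a made-up Normal model and fabricated data, purely for illustration (the model, its parameters, and the data are placeholders, not part of this lesson’s analysis; the NumPy and bokeh.io imports from the first cell are assumed).

import bokeh.plotting

rg = np.random.default_rng()

# Made-up "measured" data, for illustration only
data = rg.normal(10, 2, size=200)

# Step 1: sort the measured samples
y = np.sort(data)

# Step 2: draw N samples from the theoretical distribution and sort them
x = np.sort(rg.normal(10, 2, size=len(data)))

# Step 3: plot measured vs. theoretical quantiles; points near the
# diagonal y = x indicate the distributions match
p = bokeh.plotting.figure(plot_width=300, plot_height=300)
p.circle(x, y)
bokeh.io.show(p)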

I actually like to generate many, many samples from the theoretical distribution and then plot the 95% confidence region of the Q-Q plot.
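
Continuing the sketch above, one way to build such a region is to draw many replicate sample sets from the theoretical distribution, sort each replicate, and take percentiles across the replicates at each rank. This is only an illustration of the idea; bebi103.viz.qqplot(), used below, handles this for you, and its details may differ.

# Many sorted replicate draws from the (placeholder) theoretical model
n_reps = 1000
reps = np.sort(rg.normal(10, 2, size=(n_reps, len(data))), axis=1)

# 95% region for each sorted position (each quantile)
low, high = np.percentile(reps, [2.5, 97.5], axis=0)

# Shade the region against the sorted measured data
p = bokeh.plotting.figure(plot_width=300, plot_height=300)
p.varea(x=np.sort(data), y1=low, y2=high, fill_alpha=0.3)
bokeh.io.show(p)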

This plot gives a feel for how plausible it is that the observed data were drawn out of the theoretical distribution. Let’s do this with a Negative Binomial distribution for our measured mRNA transcript counts, using the MLE to parametrize the theoretical distribution. We will do it for the Rex1 gene (which clearly shows bimodality).

As a reminder, here is the ECDF of that data set.

[2]:
df = pd.read_csv("../data/singer_transcript_counts.csv", comment="#")

p = bokeh_catplot.ecdf(
    data=df, val="Rex1", x_axis_label="Rex1 mRNA count", conf_int=True
)

bokeh.io.show(p)

Likelihood calculators

In this and the following parts of this lesson, we will need the functions we have written to perform maximum likelihood estimation for a Negative Binomial distribution and for a mixture of two Negative Binomials. These are copied directly from previous lessons and the previous section of this lesson, so you do not need to focus on this rather long code cell.

[3]:
def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements, parametrized
    by alpha, b=1/beta."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

    # scipy.stats parametrizes the Negative Binomial with the success
    # probability p = 1/(1+b)
    return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))


def mle_iid_nbinom(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    NBinom measurements, parametrized by alpha, b=1/beta"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_nbinom(params, n),
            x0=np.array([3, 3]),
            args=(n,),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)


def initial_guess_mix(n, w_guess):
    """Generate initial guess for mixture model."""
    n_low = n[n < np.percentile(n, 100*w_guess)]
    n_high = n[n >= np.percentile(n, 100*w_guess)]

    alpha1, b1 = mle_iid_nbinom(n_low)
    alpha2, b2 = mle_iid_nbinom(n_high)

    return alpha1, b1, alpha2, b2


def log_like_mix(alpha1, b1, alpha2, b2, w, n):
    """Log-likeihood of binary Negative Binomial mixture model."""
    # Fix nonidentifieability be enforcing values of w
    if w < 0 or w > 1:
        return -np.inf

    # Physical bounds on parameters
    if alpha1 < 0 or alpha2 < 0 or b1 < 0 or b2 < 0:
        return -np.inf

    logx1 = st.nbinom.logpmf(n, alpha1, 1/(1+b1))
    logx2 = st.nbinom.logpmf(n, alpha2, 1/(1+b2))

    # Multipliers for log-sum-exp
    lse_coeffs = np.tile([w, 1-w], [len(n), 1]).transpose()

    # log-likelihood for each measurement
    log_likes = scipy.special.logsumexp(np.vstack([logx1, logx2]), axis=0, b=lse_coeffs)

    return np.sum(log_likes)


def mle_mix(n, w_guess):
    """Obtain MLE estimate for parameters for binary mixture
    of Negative Binomials."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_mix(*params, n),
            x0=[*initial_guess_mix(n, w_guess), w_guess],
            args=(n,),
            method='Powell',
            tol=1e-6,
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)

We can now carry out the MLE for both models.

[4]:
# Load in data
df = pd.read_csv("../data/singer_transcript_counts.csv", comment="#")
n = df['Rex1'].values

# Single Negative Binomial MLE
alpha, b = mle_iid_nbinom(n)

# Mixture model MLE
alpha1, b1, alpha2, b2, w = mle_mix(n, 0.2)
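
As a quick sanity check on the optimization (not part of the original analysis; the printed values depend on the data and the optimizer), we can inspect the estimates.

print("Single NBinom: alpha = {:.2f}, b = {:.2f}".format(alpha, b))
print(
    "Mixture: alpha1 = {:.2f}, b1 = {:.2f}, alpha2 = {:.2f}, "
    "b2 = {:.2f}, w = {:.2f}".format(alpha1, b1, alpha2, b2, w)
)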

Constructing the Q-Q plot

We will first construct a Q-Q plot for the single Negative Binomial case. The function bebi103.viz.qqplot() generates Q-Q plots. To construct the Q-Q plot, it makes draws out of the theoretical distribution, so it needs a function to draw samples out of the generative distribution. The call signature must be gen_fun(*args, size). Let’s write a function to draw out of a single Negative Binomial with our \(\alpha\), \(b\) parametrization.

[5]:
rg = np.random.default_rng()

def draw_neg_binom(alpha, b, size=1):
    return rg.negative_binomial(alpha, 1 / (1 + b), size=size)

With that in place, we can call bebi103.viz.qqplot() to generate the plot.

[6]:
p = bebi103.viz.qqplot(
    data=n,
    gen_fun=draw_neg_binom,
    args=(alpha, b),
    x_axis_label="transcript counts",
    y_axis_label="transcript counts",
)

bokeh.io.show(p)

Clearly, the generative model cannot produce the observed transcript counts, since the Q-Q plot shows strong separation of the generative quantiles from the observed quantiles.

Q-Q plot for the mixture model

We can also make a Q-Q plot for the mixture model. We need to write a function to generate data out of the mixture model. As we did in a previous lesson, we first randomly choose a cell type with weight \(w\), and then draw out of a Negative Binomial distribution with parameters corresponding to that cell type.

[7]:
def draw_mix(alpha1, b1, alpha2, b2, w, size):
    """Draw samples from Negative Binomial binary mixture model."""
    # Transcript counts are integers
    n = np.empty(size, dtype=int)
    for i in range(size):
        low_cell_type = rg.uniform() < w

        if low_cell_type:
            n[i] = rg.negative_binomial(alpha1, 1 / (1 + b1))
        else:
            n[i] = rg.negative_binomial(alpha2, 1 / (1 + b2))

    return n
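
As a design note, the loop above follows the generative story one cell at a time, which is clear but slow for large size. A vectorized sketch of the same idea (my own variant, not from the lesson) draws from both distributions and selects per cell, trading some wasted draws for speed.

def draw_mix_vectorized(alpha1, b1, alpha2, b2, w, size):
    """Vectorized equivalent of draw_mix(): draw from both Negative
    Binomials, then keep one per cell according to the weight w."""
    low_cell_type = rg.uniform(size=size) < w

    return np.where(
        low_cell_type,
        rg.negative_binomial(alpha1, 1 / (1 + b1), size=size),
        rg.negative_binomial(alpha2, 1 / (1 + b2), size=size),
    )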

Now, we can make the Q-Q plot!

[8]:
p = bebi103.viz.qqplot(
    data=n,
    gen_fun=draw_mix,
    args=(alpha1, b1, alpha2, b2, w),
    x_axis_label="transcript counts",
    y_axis_label="transcript counts",
)

bokeh.io.show(p)

The Q-Q plot shows much stronger agreement between the theoretical distribution and the observed one.

Computing environment

[9]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,bokeh,bokeh_catplot,bebi103,jupyterlab
CPython 3.7.5
IPython 7.9.0

numpy 1.17.4
pandas 0.24.2
scipy 1.3.1
bokeh 1.4.0
bokeh_catplot 0.1.6
bebi103 0.0.45
jupyterlab 1.2.3