Model comparison with the AIC

mRNA count dataset download

Spindle length dataset download


[1]:
import warnings

import pandas as pd
import numpy as np
import numba
import scipy.optimize
import scipy.stats as st

import bebi103

import bokeh_catplot

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

We have seen that we can assess models graphically, and even compare one model to another. In the example of the counts of Rex1 transcripts, we compared the model of a single Negative Binomial distribution to that of a mixture of two Negative Binomials. In the example of spindle length, we compared a model where the spindle length is independent of droplet diameter to one where it depends on droplet diameter according to conservation of tubulin.

There are many non-graphical ways to assess models, including likelihood-ratio tests and [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics). Both of these are involved topics (especially cross-validation; there is a lot to learn there), and we will not cover them in much depth here. We will go into much more depth on model selection next term.

The Akaike information criterion

For now, we will consider a metric for model selection called the Akaike information criterion, or AIC. We will not show it, but the AIC can be loosely interpreted as an estimate of a a quantity related to the distance between the true generative distribution and the model distribution. (I am speaking loosely here, there are important details missing, and in fact I have abused some language.) Given two models, the one with the lesser AIC is likely closer to the true generative distribution.

For a set of parameters \(\theta\) with MLE \(\theta^*\) and a model with log-likelihood \(\ell(\theta;\text{data})\), the AIC is given by

\begin{align} \text{AIC} = -2\ell(\theta^*;\text{data}) + 2p, \end{align}

where \(p\) is the number of free parameters in a model. Thus, more complicated models, those with more parameters, are penalized for having too many parameters. Furthermore, models where the likelihood function is broad, meaning that many parameter values are viable, have a lower log likelihood and are also penalized. So, models with few, meaningful parameters have lower AIC and are favored.

Akaike weights

If we compare two models, say model 1 and model 2, then the Akaike weight for model 1 is

\begin{align} w_1 = \frac{\mathrm{e}^{-\text{AIC}_1/2}}{\mathrm{e}^{-\text{AIC}_1/2} + \mathrm{e}^{-\text{AIC}_2/2}}. \end{align}

Because the values of the AIC can be large, it is more numerically stable to compute the Akaike weight as

\begin{align} w_1 = \frac{\mathrm{e}^{-(\text{AIC}_1 - \text{AIC}_\mathrm{max})/2}}{\mathrm{e}^{-(\text{AIC}_1 - \text{AIC}_\mathrm{max})/2}+\mathrm{e}^{-(\text{AIC}_2 - \text{AIC}_\mathrm{max})/2}}, \end{align}

where \(\text{AIC}_\mathrm{max}\) is the maximum of the two AICs. More generally, the Akaike weight of model \(i\) among many is

\begin{align} w_i = \frac{\mathrm{e}^{-(\text{AIC}_i - \text{AIC}_\mathrm{max})/2}}{\sum_j\mathrm{e}^{-(\text{AIC}_j - \text{AIC}_\mathrm{max})/2}}. \end{align}

Interpretation of the Akaike weights is lacking a consensus, but we can loosely interpret the Akaike weight as the probability that a model will give the best predictions of new data, assuming that we have considered all possible models.

AIC for mRNA counts

Let us now compare the single Negative Binomial to the mixture model for mRNA counts. We again need our functions for computing the MLE and computing the log-likelihood from previous lessons.

[2]:
def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements, parametrized
    by alpha, b=1/beta."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

    return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))


def mle_iid_nbinom(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    NBinom measurements, parametrized by alpha, b=1/beta"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_nbinom(params, n),
            x0=np.array([3, 3]),
            args=(n,),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)


def initial_guess_mix(n, w_guess):
    """Generate initial guess for mixture model."""
    n_low = n[n < np.percentile(n, 100*w_guess)]
    n_high = n[n >= np.percentile(n, 100*w_guess)]

    alpha1, b1 = mle_iid_nbinom(n_low)
    alpha2, b2 = mle_iid_nbinom(n_high)

    return alpha1, b1, alpha2, b2


def log_like_mix(alpha1, b1, alpha2, b2, w, n):
    """Log-likeihood of binary Negative Binomial mixture model."""
    # Fix nonidentifieability be enforcing values of w
    if w < 0 or w > 1:
        return -np.inf

    # Physical bounds on parameters
    if alpha1 < 0 or alpha2 < 0 or b1 < 0 or b2 < 0:
        return -np.inf

    logx1 = st.nbinom.logpmf(n, alpha1, 1/(1+b1))
    logx2 = st.nbinom.logpmf(n, alpha2, 1/(1+b2))

    # Multipliers for log-sum-exp
    lse_coeffs = np.tile([w, 1-w], [len(n), 1]).transpose()

    # log-likelihood for each measurement
    log_likes = scipy.special.logsumexp(np.vstack([logx1, logx2]), axis=0, b=lse_coeffs)

    return np.sum(log_likes)


def mle_mix(n, w_guess):
    """Obtain MLE estimate for parameters for binary mixture
    of Negative Binomials."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_mix(*params, n),
            x0=[*initial_guess_mix(n, w_guess), w_guess],
            args=(n,),
            method='Powell',
            tol=1e-6,
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)

Now we can load in the data and compute the MLEs for each of the four genes.

[3]:
# Load in data
df = pd.read_csv("../data/singer_transcript_counts.csv", comment="#")

df_mle = pd.DataFrame(index=['alpha', 'b', 'alpha1', 'b1', 'alpha2', 'b2', 'w'])

for gene in df:
    n = df['Nanog'].values

    # Single Negative Binomial MLE
    alpha, b = mle_iid_nbinom(df[gene].values)

    # Mixture model MLE
    alpha1, b1, alpha2, b2, w = mle_mix(df[gene].values, 0.2)

    # Store results in data frame
    df_mle[gene] = [alpha, b, alpha1, b1, alpha2, b2, w]

# Take a look
df_mle
[3]:
Rex1 Rest Nanog Prdm14
alpha 1.634562 4.530335 1.263097 0.552886
b 84.680915 16.543054 69.347842 8.200636
alpha1 3.497013 2.786600 0.834843 2.274491
b1 4.104911 12.395707 66.535574 4.823948
alpha2 5.089623 6.683425 4.127565 0.560992
b2 31.810385 11.953264 28.132637 4.541668
w 0.160422 0.108772 0.466646 0.235637

For each of the two models, we can compute the log likelihood evaluated at the MLEs for the parameters.

[4]:
for gene in df:
    ell = log_like_iid_nbinom(
        (df_mle.loc["alpha", gene], df_mle.loc["b", gene]), df[gene].values
    )
    df_mle.loc["log_like_single", gene] = ell

    ell_mix = log_like_mix(
        *df_mle.loc[["alpha1", 'b1', 'alpha2', 'b2', 'w'], gene].values,
        df[gene].values,
    )
    df_mle.loc["log_like_mix", gene] = ell_mix

# Take a look
df_mle
[4]:
Rex1 Rest Nanog Prdm14
alpha 1.634562 4.530335 1.263097 0.552886
b 84.680915 16.543054 69.347842 8.200636
alpha1 3.497013 2.786600 0.834843 2.274491
b1 4.104911 12.395707 66.535574 4.823948
alpha2 5.089623 6.683425 4.127565 0.560992
b2 31.810385 11.953264 28.132637 4.541668
w 0.160422 0.108772 0.466646 0.235637
log_like_single -1638.678482 -1376.748398 -1524.928918 -713.091587
log_like_mix -1590.353743 -1372.108896 -1512.444552 -712.704687

We can already see a very large difference between the log likelihood evaluated at the MLE for Rex1, but not much difference for Prdm14. The mixture morel has \(p = 5\) parameters, while the single Negative Binomial model has \(p = 2\). With these numbers, we can compute the AIC and then also the Akaike weights.

[5]:
for gene in df:
    df_mle.loc['AIC_single', gene] = -2 * (df_mle.loc['log_like_single', gene] - 2)
    df_mle.loc['AIC_mix', gene] = -2 * (df_mle.loc['log_like_mix', gene] - 5)

# Take a look
df_mle
[5]:
Rex1 Rest Nanog Prdm14
alpha 1.634562 4.530335 1.263097 0.552886
b 84.680915 16.543054 69.347842 8.200636
alpha1 3.497013 2.786600 0.834843 2.274491
b1 4.104911 12.395707 66.535574 4.823948
alpha2 5.089623 6.683425 4.127565 0.560992
b2 31.810385 11.953264 28.132637 4.541668
w 0.160422 0.108772 0.466646 0.235637
log_like_single -1638.678482 -1376.748398 -1524.928918 -713.091587
log_like_mix -1590.353743 -1372.108896 -1512.444552 -712.704687
AIC_single 3281.356963 2757.496797 3053.857837 1430.183173
AIC_mix 3190.707487 2754.217792 3034.889103 1435.409375

Finally, we can compute the Akaike weight for the mixture model (the weight for the single Negative Binomial model is \(1-w_\mathrm{mix}\).

[6]:
for gene in df:
    AIC_max = max(df_mle.loc[['AIC_single', 'AIC_mix'], gene])
    numerator = np.exp(-(df_mle.loc['AIC_single', gene] - AIC_max)/2)
    denominator = numerator + np.exp(-(df_mle.loc['AIC_mix', gene] - AIC_max)/2)
    df_mle.loc['w_single', gene] = numerator / denominator

# Take a look
df_mle
[6]:
Rex1 Rest Nanog Prdm14
alpha 1.634562e+00 4.530335 1.263097 0.552886
b 8.468092e+01 16.543054 69.347842 8.200636
alpha1 3.497013e+00 2.786600 0.834843 2.274491
b1 4.104911e+00 12.395707 66.535574 4.823948
alpha2 5.089623e+00 6.683425 4.127565 0.560992
b2 3.181038e+01 11.953264 28.132637 4.541668
w 1.604216e-01 0.108772 0.466646 0.235637
log_like_single -1.638678e+03 -1376.748398 -1524.928918 -713.091587
log_like_mix -1.590354e+03 -1372.108896 -1512.444552 -712.704687
AIC_single 3.281357e+03 2757.496797 3053.857837 1430.183173
AIC_mix 3.190707e+03 2754.217792 3034.889103 1435.409375
w_single 2.068789e-20 0.162533 0.000076 0.931700

In looking at the Akaike weight for the mixture (w_mix), is it clear that the mixture model is strongly preferred for Rex1 and Nanog. There is not strong preference for Rest, and a preference for the single Negative Binomial model for Prdm14. Reminding ourselves of the ECDFs, this makes sense.

[7]:
genes = ["Nanog", "Prdm14", "Rest", "Rex1"]

plots = [
    bokeh_catplot.ecdf(
        data=df[gene],
        val=gene,
        x_axis_label="mRNA count",
        title=gene,
        frame_height=150,
        frame_width=200,
    )
    for gene in genes
]

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Rex1 clearly is bimodal, and Nanog appears to have a second where the ECDF reaches a value of about 0.4, which is what we see in the MLE estimates in the mixture model. Rest and Prdm14 both appear to be unimodal, agreeing with what we saw with the AIC analysis.

Note that this underscores something we’ve been stressing all term. You should do good exploratory data analysis first, and the EDA often tells much of the story!

Caveat

Remember, though, that we did not take into account that the measurements of the four genes were done in the same cells. We modeled that when we presented the mixture models at the beginning of this lesson. The analysis of a more complicated model with MLE proved to be out of reach due to computational difficulty. So, we should not make strong conclusions about what the relative quality of the mixture of single Negative Binomial models mean in this context. We will address these kinds of modeling issues next term.

AIC for the spindle model

We can do a similar analysis for the two competing models for mitotic spindle size. We need our functions from earlier.

[8]:
def theor_spindle_length(gamma, phi, d):
    """Compute spindle length using mathematical model"""
    return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)


def log_likelihood(params, d, ell):
    """Log likelihood of spindle length model."""
    gamma, phi, sigma = params

    if gamma <= 0 or gamma > 1 or phi <= 0:
        return -np.inf

    mu = theor_spindle_length(gamma, phi, d)
    return np.sum(st.norm.logpdf(ell, mu, sigma))


def spindle_mle(d, ell):
    """Compute MLE for parameters in spindle length model."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, d, ell: -log_likelihood(params, d, ell),
            x0=np.array([0.5, 35, 5]),
            args=(d, ell),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)

We can now perform MLE to get the parameters for each model and store them in a Pandas Series for convenience.

[9]:
df = pd.read_csv("../data/good_invitro_droplet_data.csv", comment="#")

mle_1 = df["Spindle Length (um)"].apply([np.mean, np.std]).values
mle_2 = spindle_mle(
    df["Droplet Diameter (um)"].values, df["Spindle Length (um)"].values
)

s = pd.Series(
    index=["phi_1", "sigma_1", "gamma", "phi_2", "sigma_2"],
    data=np.concatenate((mle_1, mle_2)),
)

Next, we can compute the log likelihood evaluated at the MLE.

[10]:
s["log_like_1"] = st.norm.logpdf(df["Spindle Length (um)"], s["phi_1"], s["sigma_1"]).sum()
s["log_like_2"] = log_likelihood(
    s[["gamma", "phi_2", "sigma_2"]],
    df["Droplet Diameter (um)"],
    df["Spindle Length (um)"],
)

s
[10]:
phi_1           32.864030
sigma_1          4.788240
gamma            0.860475
phi_2           38.231250
sigma_2          3.753422
log_like_1   -1999.517925
log_like_2   -1837.158982
dtype: float64

The log likeihood for model 2, with spindle size depending on droplet diameter, is much greater than for model 1. And now we can compute the AIC,noting that there are two parameters for model 1 and three for model 2.

[11]:
s["AIC_1"] = -2 * (s['log_like_1'] - 2)
s["AIC_2"] = -2 * (s['log_like_2'] - 3)

s
[11]:
phi_1           32.864030
sigma_1          4.788240
gamma            0.860475
phi_2           38.231250
sigma_2          3.753422
log_like_1   -1999.517925
log_like_2   -1837.158982
AIC_1         4003.035850
AIC_2         3680.317964
dtype: float64

There is a massive disparity in the AICs, so we know that model 2 is strongly preferred. Nonetheless, we can compute the Akaike weight for model 1 to compare.

[12]:
AIC_max = max(s[['AIC_1', 'AIC_2']])
numerator = np.exp(-(s.loc['AIC_1'] - AIC_max)/2)
denominator = numerator + np.exp(-(s['AIC_2'] - AIC_max)/2)
s['w_single'] = numerator / denominator

s
[12]:
phi_1         3.286403e+01
sigma_1       4.788240e+00
gamma         8.604745e-01
phi_2         3.823125e+01
sigma_2       3.753422e+00
log_like_1   -1.999518e+03
log_like_2   -1.837159e+03
AIC_1         4.003036e+03
AIC_2         3.680318e+03
w_single      8.369539e-71
dtype: float64

Model 1 is completely out of the question, with a tiny Akaike weight!

Computing environment

[13]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,bokeh,bokeh_catplot,bebi103,jupyterlab
CPython 3.7.5
IPython 7.9.0

numpy 1.17.4
pandas 0.24.2
scipy 1.3.1
bokeh 1.4.0
bokeh_catplot 0.1.6
bebi103 0.0.45
jupyterlab 1.2.3