Lecture 9: Dancing statistics

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This lecture was generated from a Jupyter notebook. You can download the notebook here.

This tutorial was inspired by Geoff Cumming.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as st
import numba

import tqdm

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

In this fun exercise, we will investigate how replicable certain statistical conclusions are, in both frequentist and Bayesian contexts. What I mean by replicability is best understood by working through the tutorial.

For this tutorial we will use the zebrafish embryo sleep data we have used before. We will study the mean rest bout length on the sixth night, comparing wild type to the homozygous mutant. We are primarily interested in the difference in mean bout length between the two genotypes. You can work up the data set to get the mean rest bout length for each fish on the sixth night, but here they are.

As a reminder, two of the mutant fish had no detectable rest bouts at the one-minute sampling interval. They obviously did have rest bouts, but they were shorter than one minute. We therefore have a decision to make: we can either set the mean rest bout length to zero for these fish, or we can jettison them from our samples because we know we could not properly sample them. For the purposes of this demonstration, we will neglect these points. (The purpose of this tutorial is to demonstrate the reproducibility of statistics, not to analyze this specific data set.)

Let's remind ourselves of what the data look like.

In [2]:
# Load data
df = pd.read_csv('../data/mean_rest_bouts.csv', comment='#')

# Pull out wild type and mutant and drop NAs
df = df[df['genotype'].isin(['wt', 'mut'])].dropna()

Let's look at these data with a jitter plot.

In [3]:
p = bebi103.viz.jitter(df,
                       cats=['genotype'],
                       val='mean_rest_bout_length',
                       order=['wt', 'mut'],
                       y_axis_label='mean rest bout length (min)',
                       width=0.2)
bokeh.io.show(p)

Cohen's d

Cohen's d is a commonly used measure of effect size when comparing two data sets. It is the difference of means scaled by a pooled standard deviation:

\begin{align} d = \frac{|\bar{x} - \bar{y}|}{\sqrt{\left.(n_x r_x^2 + n_y r_y^2) \middle/ (n_x + n_y - 2)\right.}}. \end{align}

Here, $\bar{x}$ is the mean of the data from sample $x$, $r_x^2$ is the plug-in estimate for the variance from sample $x$, and $n_x$ is the number of measurements in sample $x$. The values for sample $y$ are similarly defined.

Roughly speaking, Cohen's d tells us how different the means of the data sets are compared to the variability in the data. A large Cohen's d means that the effect is large compared to the variability of the measurement.
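
As a quick illustration of the formula, here is a minimal sketch with made-up numbers (x_toy and y_toy are hypothetical arrays, not the fish data).

# Toy illustration of Cohen's d with made-up data
x_toy = np.array([2.1, 2.4, 1.9, 2.6, 2.2])
y_toy = np.array([1.7, 2.0, 1.8, 2.3, 1.6])

diff = np.abs(np.mean(x_toy) - np.mean(y_toy))
pooled_variance = (len(x_toy) * np.var(x_toy) + len(y_toy) * np.var(y_toy)) \
                        / (len(x_toy) + len(y_toy) - 2)
print("toy Cohen's d:", diff / np.sqrt(pooled_variance))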

Estimates of the difference of means and Cohen's d

First, we will compute nonparametric estimates from the data. We will estimate the difference in the mean bout length and Cohen's d. For speed, we save the two data sets as NumPy arrays.

In [4]:
wt = df.loc[df.genotype=='wt', 'mean_rest_bout_length'].values
mut = df.loc[df.genotype=='mut', 'mean_rest_bout_length'].values

Next, we'll write some functions to conveniently generate bootstrap replicates and do our hypothesis tests. These borrow heavily from the tutorial on hacker stats.

In [5]:
@numba.jit(nopython=True)
def cohen_d(x, y, return_abs=False):
    """Cohen's d for two data sets."""
    diff = x.mean() - y.mean()
    pooled_variance = (len(x) * np.var(x) + len(y) * np.var(y)) \
                            / (len(x) + len(y) - 2)

    if return_abs:
        return np.abs(diff) / np.sqrt(pooled_variance)
    return diff / np.sqrt(pooled_variance)


@numba.jit(nopython=True)
def t_stat(x, y):
    """Welch's t-statistic."""
    return (np.mean(x) - np.mean(y)) / \
        np.sqrt(np.var(x) / (len(x) - 1) + np.var(y) / (len(y) - 1))

    
@numba.jit(nopython=True)
def draw_perm_sample(x, y):
    """Generate a permutation sample."""
    concat_data = np.concatenate((x, y))
    np.random.shuffle(concat_data)
    return concat_data[:len(x)], concat_data[len(x):]
    
    
@numba.jit(nopython=True)
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))


@numba.jit(nopython=True)
def draw_perm_reps_diff_mean(x, y, size=10000):
    """Permutation replicates of the difference of means."""
    out = np.empty(size)
    for i in range(size):
        x_perm, y_perm = draw_perm_sample(x, y)
        out[i] = np.mean(x_perm) - np.mean(y_perm)
    return out


@numba.jit(nopython=True)
def draw_bs_reps_mean(data, size=10000):
    """Bootstrap replicates of the mean."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out


@numba.jit(nopython=True)
def draw_bs_reps_diff_mean(x, y, size=10000):
    """Bootstrap replicates of the difference of means."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(x)) - np.mean(draw_bs_sample(y))
    return out


@numba.jit(nopython=True)
def draw_bs_reps_cohen_d(x, y, size=10000, return_abs=False):
    """Bootstrap replicates of Cohen's d."""
    out = np.empty(size)
    for i in range(size):
        out[i] = cohen_d(draw_bs_sample(x), draw_bs_sample(y), return_abs)
    return out

@numba.jit(nopython=True)
def draw_bs_reps_t(x, y, size=10000):
    """
    Bootstrap replicates using the Welch's t-statistic.
    """
    out = np.empty(size)
    for i in range(size):
        out[i] = t_stat(draw_bs_sample(x), draw_bs_sample(y))
    return out

Now we can compute the replicates. First, let's look at the means and their respective confidence intervals.

In [6]:
wt_reps = draw_bs_reps_mean(wt)
mut_reps = draw_bs_reps_mean(mut)

And from these, compute the 95% confidence interval.

In [7]:
wt_mean_conf_int = np.percentile(wt_reps, [2.5, 97.5])
mut_mean_conf_int = np.percentile(mut_reps, [2.5, 97.5])

print('WT:  [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
                np.mean(wt), *tuple(wt_mean_conf_int)))
print('MUT: [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
                np.mean(mut), *tuple(mut_mean_conf_int)))
WT:  [1.97, 2.21, 2.45] minutes
MUT: [1.65, 1.90, 2.18] minutes

There is some overlap of the confidence intervals, though wild type fish tend to have longer rest bouts. Just looking at these numbers, we may not be all that certain that there is a discernible difference between wild type and mutant. (Remember, though, that we left out the two fish that had no detectable rest bouts at all. It could be that the key difference between the two genotypes is that some mutants do not sleep at all, while those that do have rest bout lengths similar to wild type.)

Now, let's look at the difference in the mean rest bout length.

In [8]:
reps = draw_bs_reps_diff_mean(wt, mut)
diff_mean_conf_int = np.percentile(reps, [2.5, 97.5])

print('WT - MUT:  [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
                np.mean(wt) - np.mean(mut), *tuple(diff_mean_conf_int)))
WT - MUT:  [-0.05, 0.31, 0.65] minutes

As we might expect, at the tail end of the confidence interval for the difference of means, we see that the mutant might actually have longer rest bouts than wild type.

Finally, let's compute Cohen's d to check the effect size.

In [9]:
reps = draw_bs_reps_cohen_d(wt, mut)
cohen_d_conf_int = np.percentile(reps, [2.5, 97.5])

print("WT - MUT Cohen's d:  [{1:.2f}, {0:.2f}, {2:.2f}]".format(
                cohen_d(wt, mut), *tuple(cohen_d_conf_int)))
WT - MUT Cohen's d:  [-0.10, 0.54, 1.44]

So, the effect size is about 0.54, meaning that the difference in mean rest bout length between wild type and mutant fish is about half of the pooled standard deviation. Jacob Cohen would call this a "medium" sized effect.

Null hypothesis significance testing

We will now perform an NHST on these two data sets. We formulate the hypothesis as follows.

  • $H_0$: Wild type and mutant fish have the same mean rest bout length.
  • Test statistic: Cohen's d.
  • At least as extreme as: Cohen's d larger than what was observed.

We can then perform a bootstrap hypothesis test.

In [10]:
@numba.jit(nopython=True)
def cohen_nhst(wt, mut, size=100000):
    """
    Perform hypothesis test assuming equal means, using
    Cohen's d as the test statistic.
    """
    # Shift data sets so that they have the same mean.
    wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))
    mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))

    # Draw replicates of Cohen's d
    reps = draw_bs_reps_cohen_d(wt_shifted, mut_shifted, size=size)

    # Compute p-value
    return np.sum(reps >= cohen_d(wt, mut)) / len(reps)

print("Cohen's d p-value:", cohen_nhst(wt, mut))
Cohen's d p-value: 0.0667

We get a p-value of about 0.07. If we used the typical bright line value of 0.05 for statistical significance, we would say that this difference is not statistically significant. We could also test what would happen if we used a different test statistic, like the difference of means.

In [11]:
# Shift data sets so that they have the same mean.
wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))
mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))

# Draw replicates of the difference of means
reps = draw_bs_reps_diff_mean(wt_shifted, mut_shifted, size=100000)

# Compute p-value
p_val = np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)

print('Difference of means p-value:', p_val)
Difference of means p-value: 0.04027

Here, we get a p-value of about 0.04. We would say that the result is statistically significant if we used a bright line value of 0.05.

Finally, let's try the canonical test for this circumstance, Welch's t-test. As a reminder, the test statistic for Welch's t-test is

\begin{align} T = \frac{\bar{x}_w - \bar{x}_m}{\sqrt{r_w^2/(n_w - 1) + r_m^2/(n_m - 1)}}, \end{align}

where $r_w^2$ and $r_m^2$ are the plug-in estimates for the variances and $n_w$ and $n_m$ are the sample sizes. Importantly, Welch's t-test assumes that both samples are Gaussian distributed. So, the hypothesis test is defined as follows.

  • $H_0$: The two samples are both Gaussian distributed with equal means.
  • Test statistic: t-statistic.
  • At least as extreme as: t-statistic (wild type minus mutant) greater than or equal to what was observed.

This is implemented as st.ttest_ind() using the kwarg equal_var=False. We divide by two to get the one-tailed test.

In [12]:
print("Welch's p-value:", st.ttest_ind(wt, mut, equal_var=False)[1]/2)
Welch's p-value: 0.052542004908830583

Here, we are just above the bright line value of 0.05. We can perform a similar hypothesis test without the Gaussian assumption, using the same test statistic as in Welch's test.

In [13]:
# Draw replicates of t statistic
reps = draw_bs_reps_t(wt_shifted, mut_shifted, size=100000)

# Compute p-value
p_val = np.sum(reps >= t_stat(wt, mut)) / len(reps)

print("Welch's t-test without Gaussian assumption p-value:", p_val)
Welch's t-test without Gaussian assumption p-value: 0.06469
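
As a quick sanity-check sketch (reusing the wt and mut arrays and the t_stat() function defined above), we can confirm that our t_stat() function returns the same statistic that scipy.stats computes for Welch's test.

# Our t_stat should match the statistic returned by st.ttest_ind with equal_var=False
print(t_stat(wt, mut))
print(st.ttest_ind(wt, mut, equal_var=False)[0])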

Finally, we will perform a permutation test. This test is specified as follows.

  • $H_0$: The sleep bout lengths of mutant and wild type fish are identically distributed.
  • Test statistic: difference of means, wild type minus mutant.
  • At least as extreme as: difference of means is greater than what was observed.
In [14]:
# Draw permutation replicates
reps = draw_perm_reps_diff_mean(wt, mut, size=100000)

# Compute p-value
p_val = np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)

print("Permutation test p-value:", p_val)
Permutation test p-value: 0.05658

So, all of our tests give p-values that are close to each other, ranging from about 0.04 to 0.07. If we choose bright line p-values to deem something as significant or not, similar hypothesis/test statistic pairs can give different results. So, do not use bright line p-values. They are a terrible idea. You went through the trouble of computing the p-value; just report it and leave it at that. Don't change a float to a bool.

A Bayesian analysis

For a Bayesian model comparison, we could do the analogous modeling: assume that the two distributions have the same mean while choosing no model for how the data are distributed. This is a nonparametric Bayesian calculation, which is mathematically more challenging than the frequentist estimates and NHST we just did.

Instead, we will do a parametric Bayesian model selection, comparable to the Welch's t-test analysis. We will consider two models. In model 1, both wild type and mutant sleep bouts are drawn from the same Gaussian distribution. In model 2, they are drawn from different Gaussian distributions. Formally, model 1 is, with weakly informative priors,

\begin{align} &\mu \sim \text{InvGamma}(1.4, 2) \\[1em] &\sigma \sim \text{HalfNorm}(0, 2)\\[1em] &t_i \sim \text{Norm}(\mu, \sigma)\;\forall i\text{, wild type or mutant}. \end{align}

Model 2 is given by

\begin{align} &\mu_\mathrm{wt} \sim \text{InvGamma}(1.4, 2) \\[1em] &\mu_\mathrm{mut} \sim \text{InvGamma}(1.4, 2) \\[1em] &\sigma_\mathrm{wt} \sim \text{HalfNorm}(0, 2)\\[1em] &\sigma_\mathrm{mut} \sim \text{HalfNorm}(0, 2)\\[1em] &t^\mathrm{wt}_i \sim \text{Norm}(\mu_\mathrm{wt}, \sigma_\mathrm{wt})\;\forall i\\[1em] &t^\mathrm{mut}_i \sim \text{Norm}(\mu_\mathrm{mut}, \sigma_\mathrm{mut})\;\forall i. \end{align}

We can code the two models up in Stan.

In [15]:
model_code_1 = """
data {
  int N1;
  int N2;
  real wt[N1];
  real mut[N2];
}


parameters {
  real<lower=0> mu;
  real<lower=0> sigma;
}


model {
  mu ~ inv_gamma(1.4, 2);
  sigma ~ normal(0, 2);
  
  wt ~ normal(mu, sigma);
  mut ~ normal(mu, sigma);
}


generated quantities {
  real wt_ppc[N1];
  real mut_ppc[N2];
  real log_lik[N1+N2];
  
  for (i in 1:N1) {
    wt_ppc[i] = normal_rng(mu, sigma);
    log_lik[i] = normal_lpdf(wt[i] | mu, sigma);
  }

  for (i in 1:N2) {
    mut_ppc[i] = normal_rng(mu, sigma);
    log_lik[i+N1] = normal_lpdf(mut[i] | mu, sigma);
  }
}
"""
sm_1 = bebi103.stan.StanModel(model_code = model_code_1)


model_code_2 = """
data {
  int N1;
  int N2;
  real wt[N1];
  real mut[N2];
}


parameters {
  real<lower=0> mu_wt;
  real<lower=0> mu_mut;
  real<lower=0> sigma_wt;
  real<lower=0> sigma_mut;
}


model {
  mu_wt ~ inv_gamma(1.4, 2);
  sigma_wt ~ normal(0, 2);
  mu_mut ~ inv_gamma(1.4, 2);
  sigma_mut ~ normal(0, 2);
  
  wt ~ normal(mu_wt, sigma_wt);
  mut ~ normal(mu_mut, sigma_mut);
}


generated quantities {
  real wt_ppc[N1];
  real mut_ppc[N2];
  real log_lik[N1+N2];
  
  for (i in 1:N1) {
    wt_ppc[i] = normal_rng(mu_wt, sigma_wt);
    log_lik[i] = normal_lpdf(wt[i] | mu_wt, sigma_wt);
  }

  for (i in 1:N2) {
    mut_ppc[i] = normal_rng(mu_mut, sigma_mut);
    log_lik[i+N1] = normal_lpdf(mut[i] | mu_mut, sigma_mut);
  }
}
"""
sm_2 = bebi103.stan.StanModel(model_code = model_code_2)
Using cached StanModel.
Using cached StanModel.

Let's now sample.

In [16]:
data = dict(N1=len(wt), N2=len(mut), wt=wt, mut=mut)

samples_1 = sm_1.sampling(data=data)
samples_2 = sm_2.sampling(data=data)

bebi103.stan.check_all_diagnostics(samples_1)
print('')
bebi103.stan.check_all_diagnostics(samples_2)
n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
Out[16]:
0

Now, let's plot some posterior predictive checks.

In [17]:
p_wt_1 = bebi103.viz.predictive_ecdf(samples_1, name='wt_ppc', data=wt, title='model 1')
p_mut_1 = bebi103.viz.predictive_ecdf(samples_1, name='mut_ppc', data=mut, title='model 1')

p_wt_2 = bebi103.viz.predictive_ecdf(samples_2, name='wt_ppc', data=wt, title='model 2')
p_mut_2 = bebi103.viz.predictive_ecdf(samples_2, name='mut_ppc', data=mut, title='model 2')

bokeh.io.show(bokeh.layouts.gridplot([[p_wt_1, p_mut_1],
                                      [p_wt_2, p_mut_2]]))

The models both seem to capture the data well, so they both pass posterior predictive checks. Let's compute 95% credible regions for the parameters $\mu_\mathrm{wt}$ and $\mu_\mathrm{mut}$ for model 2.

In [18]:
wt_mean_cred_reg = np.percentile(samples_2.extract('mu_wt')['mu_wt'], [2.5, 50, 97.5])
mut_mean_cred_reg = np.percentile(samples_2.extract('mu_mut')['mu_mut'], [2.5, 50, 97.5])

print('WT:  [{0:.2f}, {1:.2f}, {2:.2f}] minutes'.format(
                *wt_mean_cred_reg))
print('MUT: [{0:.2f}, {1:.2f}, {2:.2f}] minutes'.format(
                *mut_mean_cred_reg))
WT:  [1.92, 2.20, 2.47] minutes
MUT: [1.59, 1.89, 2.19] minutes

These are almost identical to the frequentist 95% confidence intervals we computed above.
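
For a quick side-by-side look, here is a sketch reusing the intervals computed earlier in this notebook.

# Frequentist confidence intervals vs. Bayesian credible regions
print('WT  conf. int.: [{0:.2f}, {1:.2f}]'.format(*wt_mean_conf_int))
print('WT  cred. reg.: [{0:.2f}, {1:.2f}]'.format(wt_mean_cred_reg[0], wt_mean_cred_reg[2]))
print('MUT conf. int.: [{0:.2f}, {1:.2f}]'.format(*mut_mean_conf_int))
print('MUT cred. reg.: [{0:.2f}, {1:.2f}]'.format(mut_mean_cred_reg[0], mut_mean_cred_reg[2]))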

Now, we'll perform a model comparison.

In [30]:
bebi103.stan.compare({'model_1': samples_1, 'model_2': samples_2},
                     ic='loo',
                     log_likelihood='log_lik')
Out[30]:
             loo    p_loo    d_loo     weight       se      dse  warning
model_1  68.8858  2.25893  0         0.918991   10.246   0             0
model_2  71.6503  5.00257  2.76453   0.0810089  12.4828  3.90519       0

The model in which the means are the same is slightly more predictive, so we might favor it. We thus have:

  • A p-value that could be interpreted to mean that we should reject the hypothesis that the means are the same.
  • A LOO that slightly favors the hypothesis that the means are the same.

These seem contradictory. They are not. That is because they each answer different questions.

  • The p-value is the probability of observing a value of the test statistic at least as extreme as what was observed experimentally, assuming the null hypothesis (in this case $H_0$) is true. No other hypothesis is considered.
  • The LOO information criterion and the associated weights are measures of how predictive a model is for new data.

These are different things that can behave similarly (e.g., a "good" model will tend not to give a low p-value and will also tend to be predictive), but they are different assessments of a model. So, you need to think very carefully about what you present as a model assessment, and how you judge what others put forward as assessments of their models.

In this case, the (approximate) Bayesian credible region and the frequentist confidence interval almost exactly overlap, but the Bayesian model comparison and NHST give apparently different results (though, again, this is not a contradiction because they compute two different things).

Dancing

We will now do a fun, instructive experiment. We will "re-acquire" the data by drawing random samples out of Gaussian distributions that have the same mean and variance as the actual data. We will then compute the confidence interval and credible region for $\delta$, the difference in mean rest bout length between wild type and mutant, and see how they vary from experiment to experiment. We will later repeat this with p-values and model comparison weights.

The idea here is that if the data are indeed Gaussian distributed, we are looking at data that could plausibly be generated in an identical experiment.

First, we'll write a function to generate new data and use it to generate 500 new data sets.

In [31]:
def new_data(mu, sigma, n):
    """Generate new data"""
    return np.maximum(np.random.normal(mu, sigma, n), 0.01)

# Values from real data
mu_wt = np.mean(wt)
mu_mut = np.mean(mut)
sigma_wt = np.std(wt, ddof=1)
sigma_mut = np.std(mut, ddof=1)

# How many new data sets to generate
n_new_data = 500

# Generate new data
new_wt = [new_data(mu_wt, sigma_wt, len(wt)) for _ in range(n_new_data)]
new_mut = [new_data(mu_mut, sigma_mut, len(mut)) for _ in range(n_new_data)]

Now we can do the calculations. For the Bayesian calculation, we want to compute the 95% credible region for the difference in mean sleep bout length between wild type and mutant under model 2, the model in which the wild type and mutant bout lengths are drawn from different Gaussian distributions. We call this difference $\delta$. We will also compute LOO-based model weights. We'll repeat this 500 times, so we will write a function to do it. Because this is a high-throughput calculation, we will not perform diagnostic or posterior predictive checks.
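
For reference, the function below also converts the difference in LOO between the two models into an Akaike-like weight for model 2,

\begin{align} w_2 = \frac{\mathrm{e}^{-\Delta_\mathrm{LOO}/2}}{1 + \mathrm{e}^{-\Delta_\mathrm{LOO}/2}}, \end{align}

where $\Delta_\mathrm{LOO} = \mathrm{LOO}_2 - \mathrm{LOO}_1$ is computed on the deviance scale reported by bebi103.stan.compare(), so the model with the smaller LOO gets the larger weight.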

In [63]:
def analyze_bayes(wt, mut):
    """Perform a Bayesian analysis of the two data sets,
    including several forms of model comparison."""
    # Set up data for Stan
    data = dict(N1=len(wt), N2=len(mut), wt=wt, mut=mut)

    # Sample using MCMC
    samples_1 = sm_1.sampling(data=data)
    samples_2 = sm_2.sampling(data=data)

    # Credible region
    delta = (samples_2.extract('mu_wt')['mu_wt'] 
                 - samples_2.extract('mu_mut')['mu_mut'])
    cred_reg = np.percentile(delta, [2.5, 97.5])
    
    # Comparison
    df_compare_BMA = bebi103.stan.compare(
                    {'model_1': samples_1, 'model_2': samples_2},
                    ic='loo',
                    method='BB-pseudo-BMA',
                    log_likelihood='log_lik')
    df_compare_stacking = bebi103.stan.compare(
                    {'model_1': samples_1, 'model_2': samples_2},
                    ic='loo',
                    log_likelihood='log_lik')
    dloo = (df_compare_stacking.loc['model_2', 'loo'] 
            - df_compare_stacking.loc['model_1', 'loo'])
    weight_akaike = np.exp(-dloo/2) / (1 + np.exp(-dloo/2))
    
    return pd.DataFrame(data=dict(
        cred_low=cred_reg[0],
        cred_high=cred_reg[1],
        weight_akaike=weight_akaike,
        weight_BMA=df_compare_BMA.loc['model_2', 'weight'],
        weight_stacking=df_compare_stacking.loc['model_2', 'weight'],
        loo_1=df_compare_stacking.loc['model_1', 'loo'],
        loo_2=df_compare_stacking.loc['model_2', 'loo']),
                       index=[0])

Now let's use this function to do these calculations.

In [34]:
df_res = analyze_bayes(new_wt[0], new_mut[0])

for wt_data, mut_data in tqdm.tqdm_notebook(zip(new_wt[1:], new_mut[1:])):
    new_df = analyze_bayes(wt_data, mut_data)
    df_res = df_res.append(new_df, ignore_index=True)

We can do the same calculations with a frequentist approach. First, we'll compute the confidence intervals for $\delta$.

In [35]:
# Set up arrays for storing results
conf_int = np.empty((n_new_data, 2))
delta = np.empty(n_new_data)

# Do calcs!
for i, (wt_data, mut_data) in enumerate(zip(new_wt, new_mut)):
    # Compute confidence interval
    bs_reps = draw_bs_reps_diff_mean(wt_data, mut_data)
    conf_int[i,:] = np.percentile(bs_reps, (2.5, 97.5))

    # Sample difference of means
    delta[i] = wt_data.mean() - mut_data.mean()
    
# Store the results conveniently
df_conf = pd.DataFrame(columns=['conf_low', 'conf_high', 'delta'],
                       data=np.hstack((conf_int, delta.reshape(n_new_data, 1))))

# Add the results to our data frame.
df_res = pd.concat((df_res, df_conf), axis=1, sort=False)

Finally, we can do some null hypothesis significance testing. We will compute three p-values: our custom bootstrapped p-value with Cohen's d, the p-value from a permutation test, and the p-value from Welch's t-test. Remember, for our custom bootstrapped p-value, the hypothesis is that the mutant and wild type sleep bout lengths were drawn out of distributions with the same mean (and no other assumptions), and the test statistic is Cohen's d. The hypothesis in the permutation test is that the two data sets are identically distributed. The hypothesis in Welch's t-test is that the mutant and wild type bout lengths were drawn from Gaussian distributions with the same mean but possibly different variances, with the t-statistic defined above as the test statistic.

In [36]:
# Set up arrays for storing results
cohen_p = np.empty(n_new_data)
perm_test_p = np.empty(n_new_data)
welch_p = np.empty(n_new_data)

@numba.jit(nopython=True)
def perm_test_diff_mean(wt, mut, size=100000):
    reps = draw_perm_reps_diff_mean(wt, mut, size=size)
    return np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)

# Do calcs!
for i, (wt_data, mut_data) in tqdm.tqdm_notebook(enumerate(zip(new_wt, new_mut))):
    # Compute p-values
    cohen_p[i] = cohen_nhst(wt_data, mut_data)
    perm_test_p[i] = perm_test_diff_mean(wt_data, mut_data)
    welch_p[i] = st.ttest_ind(wt_data, mut_data, equal_var=False)[1] / 2

df_res['cohen_p'] = cohen_p
df_res['perm_p'] = perm_test_p
df_res['welch_p'] = welch_p

To visualize the results, we'll plot the frequentist confidence intervals and the Bayesian credible regions. We'll plot each confidence interval as a bar and the bounds of the corresponding credible region as dots. For ease of viewing, we will only plot 100 of the 500 results, sorted by the plug-in estimate for δ.

In [69]:
# Sort by delta for easy display
df_res_sorted = df_res.sort_values(by='delta').iloc[::5]

# Set up figure
p = bokeh.plotting.figure(plot_height=250,
                          plot_width=700,
                          y_axis_label='δ (min)')

# Populate glyphs
x_conf = [[i, i] for i in range(len(df_res_sorted))]
y_conf = [[r['conf_low'], r['conf_high']] for _, r in df_res_sorted.iterrows()]
p.multi_line(x_conf, y_conf, line_width=2, color='#1f77b4')
p.circle(np.arange(len(df_res_sorted)), df_res_sorted['cred_low'], color='#ff7f0e')
p.circle(np.arange(len(df_res_sorted)), df_res_sorted['cred_high'], color='#ff7f0e')

# Turn off axis ticks for x
p.xaxis.visible = False
p.xgrid.visible = False

bokeh.io.show(p)

The Bayesian credible regions and frequentist confidence intervals almost perfectly overlap, with the credible region always being a bit wider. This is not generally the case, but it is for these data. More importantly, though, we see that the confidence interval can vary from experiment to experiment, but not that much. That is, the confidence intervals "dance" around, but all within about a factor of three of the originally observed values.

Dancing: p-values and odds ratios

Now, let's look at the p-values together with the model comparison weights. The three p-values are the ones described above: the custom bootstrapped p-value with Cohen's d as the test statistic, the permutation test p-value, and Welch's p-value. The model weights play the role of an odds ratio here, comparing the model in which the data come from Gaussian distributions with the same mean to the model in which the data come from Gaussian distributions with different means.

In [71]:
p = bokeh.plotting.figure(x_axis_type='log',
                          y_axis_type='log',
                          x_axis_label="Welch's p-value",
                          plot_width=600,
                          plot_height=400)
p.circle(df_res['welch_p'], df_res['cohen_p'],
         color='#1f77b4', alpha=0.5, legend="Cohen's d p-value")
p.circle(df_res['welch_p'], df_res['perm_p'],
         color='#ffbb78', alpha=0.5, legend='permutation test p-value')
p.circle(df_res['welch_p'], 1-df_res['weight_BMA'],
         color='#2ca02c', alpha=0.5, legend='BMA weight')
p.circle(df_res['welch_p'], 1-df_res['weight_stacking'],
         color='#7570b3', alpha=0.5, legend='stacking weight')

p.legend.location = 'bottom_right'
bokeh.io.show(p)

Most immediately striking is that the stacking weights sometimes get very close to zero and sometimes very close to one. Remember, stacking optimizes the predictive ability of the model combination, and the optimization step can land in a solution in which one model is highly disfavored. This happens rarely, but on the plot it obscures the results. We'll make the plot again, omitting the stacking weight result.

In [73]:
p = bokeh.plotting.figure(x_axis_type='log',
                          y_axis_type='log',
                          x_axis_label="Welch's p-value",
                          plot_width=600,
                          plot_height=400)
p.circle(df_res['welch_p'], df_res['cohen_p'],
         color='#1f77b4', alpha=0.5, legend="Cohen's d p-value")
p.circle(df_res['welch_p'], df_res['perm_p'],
         color='#ffbb78', alpha=0.5, legend='permutation test p-value')
p.circle(df_res['welch_p'], 1-df_res['weight_BMA'],
         color='#2ca02c', alpha=0.5, legend='BMA weight')

p.legend.location = 'bottom_right'
bokeh.io.show(p)

The custom p-value computed with Cohen's d as the test statistic and the permutation test p-value are nearly equal to Welch's p-value. The Bayesian model averaging weight (roughly the Akaike weight) varies with Welch's p-value, but over a much smaller range, only going down to about 0.05 or so. Remember, these quantities compute different things, but they are all in some way measures of the difference between two data sets, and they show similar trends.

But what is really striking here is the scale! Wow! In 500 repeats, we get p-values ranging over almost four orders of magnitude! That's three exclamations in a row! Four, now. Those exclamation points are there to highlight that the p-value is not a reproducible statistic at all.
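
To put a number on that spread, here is a quick sketch reusing the df_res data frame built above.

# Range of Welch's p-values across the simulated experiments
print('Welch p-value ranges from {0:.2e} to {1:.2e}'.format(
            df_res['welch_p'].min(), df_res['welch_p'].max()))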

The information criterion-based weight does not dance nearly as much. In all cases, the weight is between about 0.05 and 0.95, which means that we would never strongly discount the simpler model in which the data come from the same distribution. So, confidence intervals and credible regions do not really dance much, while p-values dance wildly.

The effect of sample size on dancing

The zebrafish sleep experiment had only about 20 samples, so maybe larger sample sizes will result in less extreme dancing of p-values. Let's do a numerical experiment to look into that. We will take 15, 20, 50, and 100 samples for our simulated experimental "repeats" and investigate how the p-value and BMA weight vary. This is computationally intensive, taking about four hours on my machine.

In [54]:
n_new_data = 1000
n_samples = [15, 20, 50, 100]
p_vals = np.empty((n_new_data, len(n_samples)))
weight_BMA = np.empty((n_new_data, len(n_samples)))
weight_stacking = np.empty((n_new_data, len(n_samples)))
weight_akaike = np.empty((n_new_data, len(n_samples)))

# Do calcs!
for i in tqdm.tqdm_notebook(range(n_new_data)):
    for j, n in enumerate(n_samples):
        # Generate new data
        new_wt = new_data(mu_wt, sigma_wt, n)
        new_mut = new_data(mu_mut, sigma_mut, n)

        # Compute p-value
        p_vals[i,j] = st.ttest_ind(new_wt, new_mut, equal_var=False)[1]/2
        
        df_bayes = analyze_bayes(new_wt, new_mut)
        weight_BMA[i,j] = df_bayes.loc[0, 'weight_BMA']
        weight_stacking[i,j] = df_bayes.loc[0, 'weight_stacking']
        weight_akaike[i,j] = df_bayes.loc[0, 'weight_akaike']

Let's look at the ECDFs of p-values and BMA weights.

In [74]:
# Make tidy data frames for convenient plotting
df_p = pd.DataFrame(data=p_vals, columns=['n = ' + str(n) for n in n_samples])
df_p = df_p.melt(var_name='n', value_name='p')
df_bma = pd.DataFrame(data=weight_akaike, columns=['n = ' + str(n) for n in n_samples])
df_bma = df_bma.melt(var_name='n', value_name='weight_BMA')

# Make plots
p1 = bebi103.viz.ecdf_collection(df_p,
                                 cats=['n'],
                                 val='p',
                                 x_axis_label='p-value',
                                 val_axis_type='log',
                                 order=['n = 15', 'n = 20', 'n = 50', 'n = 100'],
                                 plot_width=700,
                                 plot_height=250,
                                 palette=bokeh.palettes.d3['Category20c'][4])
p2 = bebi103.viz.ecdf_collection(df_bma,
                                 cats=['n'],
                                 val='weight_BMA',
                                 order=['n = 15', 'n = 20', 'n = 50', 'n = 100'],
                                 x_axis_label='BMA weight',
                                 val_axis_type='linear',
                                 plot_width=700,
                                 plot_height=250,
                                 palette=bokeh.palettes.d3['Category20c'][8][4:])
p1.legend.location = 'top_left'
p2.legend.location = 'top_left'

bokeh.io.show(bokeh.layouts.gridplot([p1, p2], ncols=1))

Note the scale: the p-value ECDFs are on a logarithmic scale, and the BMA weights are on a linear scale. We see that even though the p-values still have large spreads, they shift leftward as the number of samples increases. This is because small differences between the samples can be discerned with large sample sizes. But notice that the p-value still varies over orders of magnitude for similar data sets. The BMA weight is more stable, never going strongly toward zero or one.
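
To quantify the leftward shift, here is a short sketch reusing the p_vals array and n_samples list from above.

# Fraction of simulated experiments with p < 0.05 for each sample size
for j, n in enumerate(n_samples):
    print('n = {0:3d}: fraction of p-values below 0.05 = {1:.2f}'.format(
            n, np.mean(p_vals[:, j] < 0.05)))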

Conclusion

This little exercise in reproducibility tells us that because the p-values "dance," we had better be sure the dance floor is far to the left. This suggests large $n$.

I would argue that if you are going to use stacking or BMA weights in your analysis pipelines, you should do a "dance of the weights" analysis. If a "repeat" of your experiment might give you an inconclusive model comparison, you need to do more experiments.