(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This homework was generated from an Jupyter notebook. You can download the notebook here.
This tutorial was inspired by Geoff Cumming.
import numpy as np
import pandas as pd
import scipy.stats as st
import numba
import tqdm
import bebi103
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
In this fun exercise, we will investigate how replicable certain statistical conclusions, both the frequentist and Bayesian contexts, are. What I mean by replicability is best understood be working through the tutorial.
For this tutorial we will use the zebrafish embryo sleep data we have been using before. We will study the mean rest bout length on the sixth night, comparing wild type to the homozygous mutant. We are primarily interested in the difference in the mean bout lengths between the two genotypes. You can work up the data set to get the mean rest bouts for the sixth night, but here they are.
As a reminder, two of the mutant fish had no detectable rest bouts on the one-minute sampling interval. They obviously did have rest bouts, but they were smaller than one minute. We therefore have a decision: we can either choose the mean rest bout length to be zero for these fish, or we can jettison them from our samples because we know we could not properly sample them. For the purposes of this demonstration, we will neglect these points. (The purpose of this tutorial is to demonstrate reproducibility of statistics, not analyze this specific data set.)
Let's remind ourselves of what the data look like.
# Load data
df = pd.read_csv('../data/mean_rest_bouts.csv', comment='#')
# Pull out wild type and mutant and drop NAs
df = df[df['genotype'].isin(['wt', 'mut'])].dropna()
Let's look at these data with a jitter plot.
p = bebi103.viz.jitter(df,
cats=['genotype'],
val='mean_rest_bout_length',
order=['wt', 'mut'],
y_axis_label='mean rest bout length (s)',
width=0.2)
bokeh.io.show(p)
Cohen's d is a commonly used measure of effect size in comparison of two data sets. It is the ratio of the difference of means compared to a pooled standard deviation.
\begin{align} d = \frac{|\bar{x} - \bar{y}|}{\sqrt{\left.(n_x r_x^2 + n_y r_y^2) \middle/ (n_x + n_y - 2)\right.}}. \end{align}
Here, $\bar{x}$ is the mean of the data from sample $x$, $r_x^2$ is the plug-in estimate for the variance from sample $x$, and $n_x$ is the number of measurements in sample $x$. The values for sample $y$ are similarly defined.
Roughly speaking, Cohen's d tells us how different the means of the data sets are compared to the variability in the data. A large Cohen's d means that the effect is large compared to the variability of the measurement.
First, we will compute nonparametric estimates from the data. We will estimate the difference in the mean bout length and Cohen's d. For speed, we save the two data sets as NumPy arrays.
wt = df.loc[df.genotype=='wt', 'mean_rest_bout_length'].values
mut = df.loc[df.genotype=='mut', 'mean_rest_bout_length'].values
Next, we'll write some functions to conveniently generate bootstrap replicates and do our hypothesis tests. These borrow heavily from the tutorial on hacker stats.
@numba.jit(nopython=True)
def cohen_d(x, y, return_abs=False):
"""Cohen's d for two data sets."""
diff = x.mean() - y.mean()
pooled_variance = (len(x) * np.var(x) + len(y) * np.var(y)) \
/ (len(x) + len(y) - 2)
if return_abs:
return np.abs(diff) / np.sqrt(pooled_variance)
return diff / np.sqrt(pooled_variance)
@numba.jit(nopython=True)
def t_stat(x, y):
"""Welch's t-statistic."""
return (np.mean(x) - np.mean(y)) / \
np.sqrt(np.var(x) / (len(x) - 1) + np.var(y) / (len(y) - 1))
@numba.jit(nopython=True)
def draw_perm_sample(x, y):
"""Generate a permutation sample."""
concat_data = np.concatenate((x, y))
np.random.shuffle(concat_data)
return concat_data[:len(x)], concat_data[len(x):]
@numba.jit(nopython=True)
def draw_bs_sample(data):
return np.random.choice(data, size=len(data))
@numba.jit(nopython=True)
def draw_perm_reps_diff_mean(x, y, size=10000):
out = np.empty(size)
for i in range(size):
x_perm, y_perm = draw_perm_sample(x, y)
out[i] = np.mean(x_perm) - np.mean(y_perm)
return out
@numba.jit(nopython=True)
def draw_bs_reps_mean(data, size=10000):
out = np.empty(size)
for i in range(size):
out[i] = np.mean(draw_bs_sample(data))
return out
@numba.jit(nopython=True)
def draw_bs_reps_diff_mean(x, y, size=10000):
out = np.empty(size)
for i in range(size):
out[i] = np.mean(draw_bs_sample(x)) - np.mean(draw_bs_sample(y))
return out
@numba.jit(nopython=True)
def draw_bs_reps_cohen_d(x, y, size=10000, return_abs=False):
out = np.empty(size)
for i in range(size):
out[i] = cohen_d(draw_bs_sample(x), draw_bs_sample(y), return_abs)
return out
@numba.jit(nopython=True)
def draw_bs_reps_t(x, y, size=10000):
"""
Bootstrap replicates using the Welch's t-statistic.
"""
out = np.empty(size)
for i in range(size):
out[i] = t_stat(draw_bs_sample(x), draw_bs_sample(y))
return out
Now we can compute the replicates. First, let's look at the means and their respective confidence intervals.
wt_reps = draw_bs_reps_mean(wt)
mut_reps = draw_bs_reps_mean(mut)
And from these, compute the 95% confidence interval.
wt_mean_conf_int = np.percentile(wt_reps, [2.5, 97.5])
mut_mean_conf_int = np.percentile(mut_reps, [2.5, 97.5])
print('WT: [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
np.mean(wt), *tuple(wt_mean_conf_int)))
print('MUT: [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
np.mean(mut), *tuple(mut_mean_conf_int)))
There is some overlap in the confidence interval, though wild type tend to have longer rest bouts. Just looking at these numbers, we may not be all that certain that there is a discernible difference between wild type and mutant. (Remember, though, that we left out the two fish that did not have rest bouts at all. It could be that the key difference between the two genotypes is that some mutants do not sleep at all, while those that do have similar rest bout lengths as wild type.)
Now, let's look at the difference in the mean rest bout length.
reps = draw_bs_reps_diff_mean(wt, mut)
diff_mean_conf_int = np.percentile(reps, [2.5, 97.5])
print('WT - MUT: [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
np.mean(wt) - np.mean(mut), *tuple(diff_mean_conf_int)))
As we might expect, on the tail end of the confidence interval for the difference of means, we see that the mutant might actually have longer rest bouts that wild type.
Finally, lets compute the Cohen's d to check the effect size.
reps = draw_bs_reps_cohen_d(wt, mut)
cohen_d_conf_int = np.percentile(reps, [2.5, 97.5])
print('WT - MUT Cohen''s d: [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
cohen_d(wt, mut), *tuple(cohen_d_conf_int)))
So, the effect size is 0.54, meaning that the mutant tends to have rest bouts 0.5 standard deviations as large as wild type fix. Jacob Cohen would call this a "medium" sized effect.
We will now perform an NHST on these two data sets. We formulate the hypothesis as follows.
We can then perform a bootstrap hypothesis test.
@numba.jit(nopython=True)
def cohen_nhst(wt, mut, size=100000):
"""
Perform hypothesis test assuming equal means, using
Cohen-d as test statistic.
"""
# Shift data sets so that they have the same mean.
wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))
mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))
# Draw replicates of Cohen's d
reps = draw_bs_reps_cohen_d(wt_shifted, mut_shifted, size=size)
# Compute p-value
return np.sum(reps >= cohen_d(wt, mut)) / len(reps)
print("Cohen's d p-value:", cohen_nhst(wt, mut))
We get a p-value of about 0.07, which, if we use the typical bright line p-value for statistical significance, we would say that this difference is not statistically significant. We could also test what would happen if we used a different test statistic, like the difference of means.
# Shift data sets so that they have the same mean.
wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))
mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))
# Draw replicates of Cohen's d
reps = draw_bs_reps_diff_mean(wt_shifted, mut_shifted, size=100000)
# Compute p-value
p_val = np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)
print('Difference of means p-value:', p_val)
Here, we get a p-value of about 0.04. We would say that the result is statistically significant if we used a bright line value of 0.05.
Finally, let's try the canonical test for this circumstance, the Welch's t-test. As a reminder, the test statistic for the Welch's t-test is
\begin{align} T = \frac{\bar{x}_w - \bar{x}_m}{\sqrt{r_w^2/n_w + r_m^2/n_m}}, \end{align}
where $r_w^2$ and $r_m^2$ are plug-in estimates for the variances. Importantly, when performing a Welch's t-test, Gaussianity of the two samples is assumed. So, the hypothesis test is defined as follows.
This is implemented as st.ttest_ind()
using the kwarg equal_var=False
. We divide by two to get the one-tailed test.
print("Welch's p-value:", st.ttest_ind(wt, mut, equal_var=False)[1]/2)
Here, we are just above the bright line value of 0.05. We can perform a similar hypothesis test without the Gaussian assumption using the same test statistic as in the Welch's test.
# Draw replicates of t statistic
reps = draw_bs_reps_t(wt_shifted, mut_shifted, size=100000)
# Compute p-value
p_val = np.sum(reps >= t_stat(wt, mut)) / len(reps)
print("Welch's t-test without Gaussian assumption p-value:", p_val)
Finally, we will perform a permutation test. This test is specified as follows.
# Draw permutation replicates
reps = draw_perm_reps_diff_mean(wt, mut, size=100000)
# Compute p-value
p_val = np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)
print("Permutation test p-value:", p_val)
So, all of our tests give p-values that are close to each other, ranging from about 0.04 to 0.07. If we choose bright line p-values to deem something as significant or not, some similar hypothesis/test statistic pairs can give different results. So, do not use brightline p-values. They are a terrible idea. You went through the trouble of computing the p-value, just report it and leave it at that. Don't change a float
to a bool
.
For a Bayesian model comparison problem, we could do the same modeling, assume that the two distributions have the same mean, and not choose a model for how the data are distributed. This is a nonparametric Bayesian calculation, which is mathematically more challenging than the frequentist estimates and NHST we just did.
Instead, we will do a parametric Bayesian model selection, comparable to the Welch's t-test analysis. We will consider two models. In model 1, both wild type and mutant sleep bouts are drawn from the same Gaussian distribution. In model 2, they are drawn from different Gaussian distributions. Formally, model 1 is, with weakly informative priors,
\begin{align} &\mu \sim \text{HalfNorm}(0, 60) \\[1em] &\sigma \sim \text{HalfNorm}(0, 10)\\[1em] &t_i \sim \text{Norm}(\mu, \sigma)\;\forall i\text{, wild type or mutant}. \end{align}
Model 2 is given by
\begin{align} &\mu_\mathrm{wt} \sim \text{HalfNorm}(0, 60) \\[1em] &\mu_\mathrm{mut} \sim \text{HalfNorm}(0, 60) \\[1em] &\sigma_\mathrm{wt} \sim \text{HalfNorm}(0, 10)\\[1em] &\sigma_\mathrm{mut} \sim \text{HalfNorm}(0, 10)\\[1em] &t^\mathrm{wt}_i \sim \text{Norm}(\mu_\mathrm{wt}, \sigma_\mathrm{wt})\;\forall i\\[1em] &t^\mathrm{mut}_i \sim \text{Norm}(\mu_\mathrm{mut}, \sigma_\mathrm{mut})\;\forall i. \end{align}
We can code the two models up in Stan.
model_code_1 = """
data {
int N1;
int N2;
real wt[N1];
real mut[N2];
}
parameters {
real<lower=0> mu;
real<lower=0> sigma;
}
model {
mu ~ inv_gamma(1.4, 2);
sigma ~ normal(0, 2);
wt ~ normal(mu, sigma);
mut ~ normal(mu, sigma);
}
generated quantities {
real wt_ppc[N1];
real mut_ppc[N2];
real log_lik[N1+N2];
for (i in 1:N1) {
wt_ppc[i] = normal_rng(mu, sigma);
log_lik[i] = normal_lpdf(wt[i] | mu, sigma);
}
for (i in 1:N2) {
mut_ppc[i] = normal_rng(mu, sigma);
log_lik[i+N1] = normal_lpdf(mut[i] | mu, sigma);
}
}
"""
sm_1 = bebi103.stan.StanModel(model_code = model_code_1)
model_code_2 = """
data {
int N1;
int N2;
real wt[N1];
real mut[N2];
}
parameters {
real<lower=0> mu_wt;
real<lower=0> mu_mut;
real<lower=0> sigma_wt;
real<lower=0> sigma_mut;
}
model {
mu_wt ~ inv_gamma(1.4, 2);
sigma_wt ~ normal(0, 2);
mu_mut ~ inv_gamma(1.4, 2);
sigma_mut ~ normal(0, 2);
wt ~ normal(mu_wt, sigma_wt);
mut ~ normal(mu_mut, sigma_mut);
}
generated quantities {
real wt_ppc[N1];
real mut_ppc[N2];
real log_lik[N1+N2];
for (i in 1:N1) {
wt_ppc[i] = normal_rng(mu_wt, sigma_wt);
log_lik[i] = normal_lpdf(wt[i] | mu_wt, sigma_wt);
}
for (i in 1:N2) {
mut_ppc[i] = normal_rng(mu_mut, sigma_mut);
log_lik[i+N1] = normal_lpdf(mut[i] | mu_mut, sigma_mut);
}
}
"""
sm_2 = bebi103.stan.StanModel(model_code = model_code_2)
Let's now sample.
data = dict(N1=len(wt), N2=len(mut), wt=wt, mut=mut)
samples_1 = sm_1.sampling(data=data)
samples_2 = sm_2.sampling(data=data)
bebi103.stan.check_all_diagnostics(samples_1)
print('')
bebi103.stan.check_all_diagnostics(samples_2)
Now, let's plot some posterior predictive checks.
p_wt_1 = bebi103.viz.predictive_ecdf(samples_1, name='wt_ppc', data=wt, title='model 1')
p_mut_1 = bebi103.viz.predictive_ecdf(samples_1, name='mut_ppc', data=mut, title='model 1')
p_wt_2 = bebi103.viz.predictive_ecdf(samples_2, name='wt_ppc', data=wt, title='model 2')
p_mut_2 = bebi103.viz.predictive_ecdf(samples_2, name='mut_ppc', data=mut, title='model 2')
bokeh.io.show(bokeh.layouts.gridplot([[p_wt_1, p_mut_1],
[p_wt_2, p_mut_2]]))
The models both seem to capture the data well, so they both pass posterior predictive checks. Let's compute 95% credible regions for the parameters $\mu_\mathrm{wt}$ and $\mu_\mathrm{mut}$ for model 2.
wt_mean_cred_reg = np.percentile(samples_2.extract('mu_wt')['mu_wt'], [2.5, 50, 97.5])
mut_mean_cred_reg = np.percentile(samples_2.extract('mu_mut')['mu_mut'], [2.5, 50, 97.5])
print('WT: [{0:.2f}, {1:.2f}, {2:.2f}] minutes'.format(
*wt_mean_cred_reg))
print('MUT: [{0:.2f}, {1:.2f}, {2:.2f}] minutes'.format(
*mut_mean_cred_reg))
These are almost identical to the frequetist 95% confidence interval (see above when we calculated those).
Now, we'll perform a model comparison.
bebi103.stan.compare({'model_1': samples_1, 'model_2': samples_2},
ic='loo',
log_likelihood='log_lik')
The model where the means are the same is slightly more predictive, so we might favor this model. So, we have:
These seem contradictory. They are not. That is because they each answer different questions.
These are different things that can behave similarly, (i.e., a "good" model may not have a low p-value and tend to be predictive), but they are different assessments of a model. So, you need to think very carefully about what you present as a model assessment, and how you judge what others put forward as assessments of their models.
In this case, So, the (approximate) Bayesian credible region and the frequentist confidence interval almost exactly overlap, but the Bayesian model comparison and NHST give apparently different results (though, again, this is not a contradiction because they compute two different things).
We will now do a fun, instructive experiment. We will "re-acquire" the data by drawing random samples out of Gaussian distributions that have the same mean and variance as is most likely from the actual data. We will then compute the confidence interval and credible region for $\delta$ and see how they vary from experiment to experiment. We will later repeat this with p-values and odds ratios.
The idea here is that if the data are indeed Gaussian distributed, we are looking at data that could plausibly be generated in an identical experiment.
First, we'll write a function to generate new data and use it to generate 500 new data sets.
def new_data(mu, sigma, n):
"""Generate new data"""
return np.maximum(np.random.normal(mu, sigma, n), 0.01)
# Values from real data
mu_wt = np.mean(wt)
mu_mut = np.mean(mut)
sigma_wt = np.std(wt, ddof=1)
sigma_mut = np.std(mut, ddof=1)
# How many new data sets to generate
n_new_data = 500
# Generate new data
new_wt = [new_data(mu_wt, sigma_wt, len(wt)) for _ in range(n_new_data)]
new_mut = [new_data(mu_mut, sigma_mut, len(mut)) for _ in range(n_new_data)]
Now we can do the calculations. For the Bayesian calculations, we want to compute the 95% credible region for the difference between sleep bout lengths between wild type and mutants under model 2, the model where the wild type and mutant sleep bouts are drawn from different distributions. We call this difference in sleep bout lengths $\delta$. We will also compute weights for the single distribution model based on LOO. We'll repeat this 500 times, so we will write a function to do it. Because this is a high-throughput calculation, we will not perform diagnostic nor posterior predictive checks.
def analyze_bayes(wt, mut):
"""Perform a Bayesian analysis of the two data sets,
including several forms of model comparison."""
# Set up data for Stan
data = dict(N1=len(wt), N2=len(mut), wt=wt, mut=mut)
# Sample using MCMC
samples_1 = sm_1.sampling(data=data)
samples_2 = sm_2.sampling(data=data)
# Credible region
delta = (samples_2.extract('mu_wt')['mu_wt']
- samples_2.extract('mu_mut')['mu_mut'])
cred_reg = np.percentile(delta, [2.5, 97.5])
# Comparison
df_compare_BMA = bebi103.stan.compare(
{'model_1': samples_1, 'model_2': samples_2},
ic='loo',
method='BB-pseudo-BMA',
log_likelihood='log_lik')
df_compare_stacking = bebi103.stan.compare(
{'model_1': samples_1, 'model_2': samples_2},
ic='loo',
log_likelihood='log_lik')
dloo = (df_compare_stacking.loc['model_2', 'loo']
- df_compare_stacking.loc['model_1', 'loo'])
weight_akaike = np.exp(-dloo/2) / (1 + np.exp(-dloo/2))
return pd.DataFrame(data=dict(
cred_low=cred_reg[0],
cred_high=cred_reg[1],
weight_akaike=weight_akaike,
weight_BMA=df_compare_BMA.loc['model_2', 'weight'],
weight_stacking=df_compare_stacking.loc['model_2', 'weight'],
loo_1=df_compare_stacking.loc['model_1', 'loo'],
loo_2=df_compare_stacking.loc['model_2', 'loo']),
index=[0])
Now let's use this function to do these calculations.
df_res = analyze_bayes(new_wt[0], new_mut[0])
for wt_data, mut_data in tqdm.tqdm_notebook(zip(new_wt[1:], new_mut[1:])):
new_df = analyze_bayes(wt_data, mut_data)
df_res = df_res.append(new_df, ignore_index=True)
We can do the same calculations with a frequentist approach. First, we'll compute the confidence intervals for $\delta$.
# Set up arrays for storing results
conf_int = np.empty((n_new_data, 2))
delta = np.empty(n_new_data)
# Do calcs!
for i, (wt_data, mut_data) in enumerate(zip(new_wt, new_mut)):
# Compute confidence interval
bs_reps = draw_bs_reps_diff_mean(wt_data, mut_data)
conf_int[i,:] = np.percentile(bs_reps, (2.5, 97.5))
# Sample difference of means
delta[i] = wt_data.mean() - mut_data.mean()
# Store the results conveniently
df_conf = pd.DataFrame(columns=['conf_low', 'conf_high', 'delta'],
data=np.hstack((conf_int, delta.reshape(n_new_data, 1))))
# Add the results to our data frame.
df_res = pd.concat((df_res, df_conf), axis=1, sort=False)
Finally, we can do some null hypothesis significance testing. We will compute three p-values, our custom bootstraped p-value with Cohen's d, the p-value from a permutaiton test, and a p-value from Welch's t-test. Remember, with our custom bootstrapped p-value, the hypothesis is that the mutant and wild type sleep bout lengths were drawn out of distributions of the same mean (and no other assumptions). The test statistic is Cohen's d. The hypothesis in the permutation test is that the two data sets are identically distributed. The hypothesis in Welch's t-test is that the mutant and wild type were drawn from Gaussian distributions with the same mean, but with difference variances. The test statistic is a t-statistic, defined above. Finally, the odds ratio is the ratio of probability that the data come from Gaussian distributions with the same mean to the probability that the data come from Gaussian distributions with different means.
# Set up arrays for storing results
cohen_p = np.empty(n_new_data)
perm_test_p = np.empty(n_new_data)
welch_p = np.empty(n_new_data)
@numba.jit(nopython=True)
def perm_test_diff_mean(wt, mut, size=100000):
reps = draw_perm_reps_diff_mean(wt, mut, size=size)
return np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)
# Do calcs!
for i, (wt_data, mut_data) in tqdm.tqdm_notebook(enumerate(zip(new_wt, new_mut))):
# Compute p-values
cohen_p[i] = cohen_nhst(wt_data, mut_data)
perm_test_p[i] = perm_test_diff_mean(wt_data, mut_data)
welch_p[i] = st.ttest_ind(wt_data, mut_data, equal_var=False)[1] / 2
df_res['cohen_p'] = cohen_p
df_res['perm_p'] = perm_test_p
df_res['welch_p'] = welch_p
To visualize the results, we'll plot the frequentist and Bayesian confidence intervals and credible regions. We'll plot the confidence interval as a bar, and then the bounds of the credible region as dots. For ease of viewing, we will only plot 100 of these and will sort them by the plug-in estimate for δ.
# Sort by delta for easy display
df_res_sorted = df_res.sort_values(by='delta').iloc[::5]
# Set up figure
p = bokeh.plotting.figure(plot_height=250,
plot_width=700,
y_axis_label='δ (min)')
#Populate glyphs
x_conf = [[i, i] for i in range(len(df_res_sorted))]
y_conf = [[r['conf_low'], r['conf_high']] for _, r in df_res_sorted.iterrows()]
p.multi_line(x_conf, y_conf, line_width=2, color='#1f77b4')
p.circle(np.arange(len(df_res_sorted)), df_res_sorted['cred_low'], color='#ff7f0e')
p.circle(np.arange(len(df_res_sorted)), df_res_sorted['cred_high'], color='#ff7f0e')
# Turn off axis ticks for x
p.xaxis.visible = False
p.xgrid.visible = False
bokeh.io.show(p)
The Bayesian credible region and frequentist confidence interval almost perfectly overlap, with the credible region always being a bit larger. This is not generally the case, but is the case for these data. More importantly, though, we see that the confidence interval can vary from experiment to experiment, but not that much. That is, the confidence intervals "dance" around, but all within about a factor of 3 of the original observed values.
Now, let's look at the p-values. We will compute three p-values, our custom bootstraped p-value with Cohen's d, the p-value from a permutaiton test, and a p-value from Welch's t-test. Remember, with our custom bootstrapped p-value, the hypothesis is that the mutant and wild type sleep bout lengths were drawn out of distributions of the same mean (and no other assumptions). The test statistic is Cohen's d. The hypothesis in the permutation test is that the two data sets are identically distributed. The hypothesis in Welch's t-test is that the mutant and wild type were drawn from Gaussian distributions with the same mean, but with difference variances. The test statistic is a t-statistic, defined above. Finally, the odds ratio is the ratio of probability that the data come from Gaussian distributions with the same mean to the probability that the data come from Gaussian distributions with different means.
p = bokeh.plotting.figure(x_axis_type='log',
y_axis_type='log',
x_axis_label="Welch's p-value",
plot_width=600,
plot_height=400)
p.circle(df_res['welch_p'], df_res['cohen_p'],
color='#1f77b4', alpha=0.5, legend="Cohen's d p-value")
p.circle(df_res['welch_p'], df_res['perm_p'],
color='#ffbb78', alpha=0.5, legend='permuation test p-value')
p.circle(df_res['welch_p'], 1-df_res['weight_BMA'],
color='#2ca02c', alpha=0.5, legend='BMA weight')
p.circle(df_res['welch_p'], 1-df_res['weight_stacking'],
color='#7570b3', alpha=0.5, legend='stacking weight')
p.legend.location = 'bottom_right'
bokeh.io.show(p)
Most immediately striking is that the stacking weights sometimes go very close to zero and sometime very close to one. Remember, stacking is optimizing predictability of the model with an optimization step that could find a local minimum where one model is highly disfavored. This happens rarely, but on the plot it obscures the results. We'll make the plot again omitting the stacking weight result.
p = bokeh.plotting.figure(x_axis_type='log',
y_axis_type='log',
x_axis_label="Welch's p-value",
plot_width=600,
plot_height=400)
p.circle(df_res['welch_p'], df_res['cohen_p'],
color='#1f77b4', alpha=0.5, legend="Cohen's d p-value")
p.circle(df_res['welch_p'], df_res['perm_p'],
color='#ffbb78', alpha=0.5, legend='permuation test p-value')
p.circle(df_res['welch_p'], 1-df_res['weight_BMA'],
color='#2ca02c', alpha=0.5, legend='BMA weight')
p.legend.location = 'bottom_right'
bokeh.io.show(p)
The custom p-value computed with a Cohen's d test statistic and the permutation test p-value are nearly equal to the Welch's p-value. The Bayesian model averaging weight (roughly the Akaike weight) varies with the Welch's p-value, but varies over a much smaller range, only going down to about 0.05 or so. Remember, they are computing different things, but are all in some way measures of difference between two data sets, and show similar trends.
But what is really striking here is the scale! Wow! In 500 repeats, we get p-values ranging over almost four orders of magnitude! That's three exclamations in a row! Four, now. Those exclamation points are there to highlight that the p-value is not a reproducible statistic at all.
The information criterion-based weight does not dance nearly as much. In all cases, the weight is between about 0.05 and 0.95, which means that we would not discount the simpler model of the data coming from the same distribution. Conversely, both confidence intervals and credible regions don't really dance much, and p-values and odds ratios dance like this.
The zebrafish sleep experiment had only about 20 samples, so maybe larger sample sizes will result in less extreme dancing of p-values. Let's do a numerical experiment to look at that. We will take 15, 20, 50, and 100 samples for our experimental "repeats" and investigate how the p-value and BMA weight varies. This is computationally intensive, taking about four hours on my machine.
n_new_data = 1000
n_samples = [15, 20, 50, 100]
p_vals = np.empty((n_new_data, len(n_samples)))
weight_BMA = np.empty((n_new_data, len(n_samples)))
weight_stacking = np.empty((n_new_data, len(n_samples)))
weight_akaike = np.empty((n_new_data, len(n_samples)))
# Do calcs!
for i in tqdm.tqdm_notebook(range(n_new_data)):
for j, n in enumerate(n_samples):
# Generate new data
new_wt = new_data(mu_wt, sigma_wt, n)
new_mut = new_data(mu_mut, sigma_mut, n)
# Compute p-value
p_vals[i,j] = st.ttest_ind(new_wt, new_mut, equal_var=False)[1]/2
df_bayes = analyze_bayes(new_wt, new_mut)
weight_BMA[i,j] = df_bayes.loc[0, 'weight_BMA']
weight_stacking[i,j] = df_bayes.loc[0, 'weight_stacking']
weight_akaike[i,j] = df_bayes.loc[0, 'weight_akaike']
Let's look at the ECDFs of p-values and BMA weights.
# Make tidy data frames for convenient plotting
df_p = pd.DataFrame(data=p_vals, columns=['n = ' + str(n) for n in n_samples])
df_p = df_p.melt(var_name='n', value_name='p')
df_bma = pd.DataFrame(data=weight_akaike, columns=['n = ' + str(n) for n in n_samples])
df_bma = df_bma.melt(var_name='n', value_name='weight_BMA')
# Make plots
p1 = bebi103.viz.ecdf_collection(df_p,
cats=['n'],
val='p',
x_axis_label='p-value',
val_axis_type='log',
order=['n = 15', 'n = 20', 'n = 50', 'n = 100'],
plot_width=700,
plot_height=250,
palette=bokeh.palettes.d3['Category20c'][4])
p2 = bebi103.viz.ecdf_collection(df_bma,
cats=['n'],
val='weight_BMA',
order=['n = 15', 'n = 20', 'n = 50', 'n = 100'],
x_axis_label='BMA weight',
val_axis_type='linear',
plot_width=700,
plot_height=250,
palette=bokeh.palettes.d3['Category20c'][8][4:])
p1.legend.location = 'top_left'
p2.legend.location = 'top_left'
bokeh.io.show(bokeh.layouts.gridplot([p1, p2], ncols=1))
Note the scale. The p-value ECDFs are on a log scale, and the BMA on a linear scale. We see that even though the p-value has large spreads as the number of samples increases, they also shift leftward. This is because small differences in samples can be discerned with large sample sizes. But notice that the p-value varies over orders of magnitude for similar data set. The BMA weight is more stable, not going strongly toward zero or one.
This little exercise in reproducibility tells use that because the p-values "dance", we had better be sure the dancefloor is far to the left. This suggests large $n$.
I would argue that if you are doing to use stacking or BMA weights in your analysis pipelines, you should do a "dance of the weights" analysis. If a "repeat" of your experiment might give you an inconclusive model comparison, you need to do more experiments.