(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
This tutorial was inspired by Geoff Cumming.
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import numba
import pymc3 as pm
import bebi103
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
In this exercise, we will investigate how replicable certain statistical conclusions are, in both the frequentist and Bayesian contexts. What I mean by replicability is best understood by working through the tutorial.
For this tutorial we will use the zebrafish embryo sleep data we have used before. We will study the mean rest bout length on the sixth night, comparing wild type to the homozygous mutant. We are primarily interested in the difference in mean bout length between the two genotypes. You could work up the data set yourself to get the mean rest bouts for the sixth night, but they are provided here.
As a reminder, two of the mutant fish had no detectable rest bouts with the one-minute sampling interval. They obviously did have rest bouts, but they were shorter than one minute. We therefore have a decision to make: we can either set the mean rest bout length to zero for these fish, or we can jettison them from our samples because we know we could not properly sample them. For the purposes of this demonstration, we will neglect these points. (The purpose of this tutorial is to demonstrate the replicability of statistical conclusions, not to analyze this specific data set.)
Let's remind ourselves of what the data look like.
# Load data
df = pd.read_csv('../data/mean_rest_bouts.csv', comment='#')
# Pull out wild type and mutant and drop NAs
df = df[df['genotype'].isin(['wt', 'mut'])].dropna()
Let's look at these data with a jitter plot.
p = bebi103.viz.jitter(df,
                       cats=['genotype'],
                       val='mean_rest_bout_length',
                       order=['wt', 'mut'],
                       y_axis_label='mean rest bout length (min)',
                       jitter_width=0.2)
bokeh.io.show(p)
Cohen's d is a commonly used measure of effect size in comparisons of two data sets. It is the difference of means divided by a pooled standard deviation,
\begin{align} d = \frac{|\bar{x} - \bar{y}|}{\sqrt{\left.(n_x r_x^2 + n_y r_y^2) \middle/ (n_x + n_y - 2)\right.}}. \end{align}
Here, $\bar{x}$ is the mean of the data from sample $x$, $r_x^2$ is the plug-in estimate for the variance from sample $x$, and $n_x$ is the number of measurements in sample $x$. The values for sample $y$ are similarly defined.
Roughly speaking, Cohen's d tells us how different the means of the data sets are compared to the variability in the data. A large Cohen's d means that the effect is large compared to the variability of the measurement.
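As a toy illustration with made-up numbers: if $\bar{x} = 2.0$ min, $\bar{y} = 1.5$ min, and the pooled standard deviation in the denominator works out to 1.0 min, then
\begin{align} d = \frac{|2.0 - 1.5|}{1.0} = 0.5, \end{align}
i.e., the means differ by half a pooled standard deviation.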
First, we will compute nonparametric estimates from the data. We will estimate the difference in the mean bout length and Cohen's d. For speed, we save the two data sets as NumPy arrays.
wt = df.loc[df.genotype=='wt', 'mean_rest_bout_length'].values
mut = df.loc[df.genotype=='mut', 'mean_rest_bout_length'].values
Next, we'll write some functions to conveniently generate bootstrap replicates and do our hypothesis tests. These borrow heavily from the previous tutorial.
@numba.jit(nopython=True)
def cohen_d(x, y, return_abs=False):
    """
    Cohen's d for two data sets.
    """
    diff = x.mean() - y.mean()
    pooled_variance = (len(x) * np.var(x) + len(y) * np.var(y)) \
                        / (len(x) + len(y) - 2)
    if return_abs:
        return np.abs(diff) / np.sqrt(pooled_variance)
    return diff / np.sqrt(pooled_variance)


@numba.jit(nopython=True)
def t_stat(x, y):
    """Welch's t-statistic."""
    return (np.mean(x) - np.mean(y)) / \
        np.sqrt(np.var(x) / (len(x) - 1) + np.var(y) / (len(y) - 1))


@numba.jit(nopython=True)
def draw_perm_sample(x, y):
    """Generate a permutation sample."""
    concat_data = np.concatenate((x, y))
    np.random.shuffle(concat_data)
    return concat_data[:len(x)], concat_data[len(x):]


@numba.jit(nopython=True)
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))


@numba.jit(nopython=True)
def draw_perm_reps_diff_mean(x, y, size=10000):
    """Generate permutation replicates of the difference of means."""
    out = np.empty(size)
    for i in range(size):
        x_perm, y_perm = draw_perm_sample(x, y)
        out[i] = np.mean(x_perm) - np.mean(y_perm)
    return out


@numba.jit(nopython=True)
def draw_bs_reps_mean(data, size=10000):
    """Draw bootstrap replicates of the mean."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out


@numba.jit(nopython=True)
def draw_bs_reps_diff_mean(x, y, size=10000):
    """Draw bootstrap replicates of the difference of means."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(x)) - np.mean(draw_bs_sample(y))
    return out


@numba.jit(nopython=True)
def draw_bs_reps_cohen_d(x, y, size=10000, return_abs=False):
    """Draw bootstrap replicates of Cohen's d."""
    out = np.empty(size)
    for i in range(size):
        out[i] = cohen_d(draw_bs_sample(x), draw_bs_sample(y), return_abs)
    return out


@numba.jit(nopython=True)
def draw_bs_reps_t(x, y, size=10000):
    """
    Draw bootstrap replicates of Welch's t-statistic.
    """
    out = np.empty(size)
    for i in range(size):
        out[i] = t_stat(draw_bs_sample(x), draw_bs_sample(y))
    return out
Now we can compute the replicates. First, let's look at the means and their respective confidence intervals.
wt_reps = draw_bs_reps_mean(wt)
mut_reps = draw_bs_reps_mean(mut)
And from these, compute the 95% confidence interval.
wt_mean_conf_int = np.percentile(wt_reps, [2.5, 97.5])
mut_mean_conf_int = np.percentile(mut_reps, [2.5, 97.5])
print('WT:  [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
        np.mean(wt), *tuple(wt_mean_conf_int)))
print('MUT: [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
        np.mean(mut), *tuple(mut_mean_conf_int)))
There is some overlap between the confidence intervals, though the wild type fish tend to have longer rest bouts. Just looking at these numbers, we may not be all that certain that there is a discernible difference between wild type and mutant. (Remember, though, that we left out the two fish that had no detectable rest bouts at all. It could be that the key difference between the two genotypes is that some mutants do not sleep at all, while those that do have rest bout lengths similar to wild type.)
Now, let's look at the difference in the mean rest bout length.
reps = draw_bs_reps_diff_mean(wt, mut)
diff_mean_conf_int = np.percentile(reps, [2.5, 97.5])
print('WT - MUT: [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
        np.mean(wt) - np.mean(mut), *tuple(diff_mean_conf_int)))
As we might expect, at the tail end of the confidence interval for the difference of means, we see that the mutant might actually have longer rest bouts than wild type.
Finally, let's compute Cohen's d to check the effect size.
reps = draw_bs_reps_cohen_d(wt, mut)
cohen_d_conf_int = np.percentile(reps, [2.5, 97.5])
print("WT - MUT Cohen's d: [{1:.2f}, {0:.2f}, {2:.2f}]".format(
        cohen_d(wt, mut), *tuple(cohen_d_conf_int)))
So, the effect size is about 0.54, meaning that the wild type mean rest bout length exceeds that of the mutant by about half a pooled standard deviation. Jacob Cohen would call this a "medium" sized effect (his rough guidelines: $d \approx 0.2$ is small, 0.5 is medium, and 0.8 is large).
We will now perform an NHST on these two data sets. We formulate the hypothesis test as follows: the null hypothesis is that the wild type and mutant rest bout lengths are drawn from distributions with the same mean, with no other assumptions about those distributions; the test statistic is Cohen's d; and "at least as extreme as" means a Cohen's d greater than or equal to the one computed from the measured data. We can then perform a bootstrap hypothesis test.
@numba.jit(nopython=True)
def cohen_nhst(wt, mut, size=100000):
    """
    Perform hypothesis test assuming equal means, using
    Cohen's d as test statistic.
    """
    # Shift data sets so that they have the same mean.
    wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))
    mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))

    # Draw replicates of Cohen's d
    reps = draw_bs_reps_cohen_d(wt_shifted, mut_shifted, size=size)

    # Compute p-value
    return np.sum(reps >= cohen_d(wt, mut)) / len(reps)
print("Cohen's d p-value:", cohen_nhst(wt, mut))
We get a p-value of about 0.07. If we used the typical bright line p-value of 0.05 for statistical significance, we would say that this difference is not statistically significant. We can also test what happens if we use a different test statistic, such as the difference of means.
# Shift data sets so that they have the same mean.
wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))
mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))
# Draw replicates of the difference of means
reps = draw_bs_reps_diff_mean(wt_shifted, mut_shifted, size=100000)
# Compute p-value
p_val = np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)
print('Difference of means p-value:', p_val)
Here, we get a p-value of about 0.04. We would say that the result is statistically significant if we used a bright line value of 0.05.
Finally, let's try the canonical test for this circumstance, Welch's t-test. As a reminder, the test statistic for Welch's t-test is
\begin{align} T = \frac{\bar{x}_w - \bar{x}_m}{\sqrt{r_w^2/(n_w - 1) + r_m^2/(n_m - 1)}}, \end{align}
where $r_w^2$ and $r_m^2$ are the plug-in estimates for the variances (note that $r^2/(n-1) = s^2/n$, where $s^2$ is the unbiased estimate of the variance, so this is equivalent to the familiar form of Welch's statistic). Importantly, Welch's t-test assumes that the two samples are Gaussian distributed. The hypothesis test is therefore defined as follows: the null hypothesis is that the two samples were drawn from Gaussian distributions with the same mean but possibly different variances; the test statistic is Welch's t; and "at least as extreme as" means a t value greater than or equal to the one observed.
This is implemented as st.ttest_ind() with the kwarg equal_var=False. Because ttest_ind() returns a two-tailed p-value, we divide by two to get the one-tailed test.
print("Welch's p-value:", st.ttest_ind(wt, mut, equal_var=False)[1]/2)
Here, we are just above the bright line value of 0.05. We can perform a similar hypothesis test without the Gaussian assumption using the same test statistic as in the Welch's test.
# Draw replicates of t statistic
reps = draw_bs_reps_t(wt_shifted, mut_shifted, size=100000)
# Compute p-value
p_val = np.sum(reps >= t_stat(wt, mut)) / len(reps)
print("Welch's t-test without Gaussian assumption p-value:", p_val)
Finally, we will perform a permutation test. This test is specified as follows: the null hypothesis is that the wild type and mutant rest bout lengths are identically distributed; the test statistic is the difference of means; and "at least as extreme as" means a difference of means greater than or equal to the one observed.
# Draw permutation replicates
reps = draw_perm_reps_diff_mean(wt, mut, size=100000)
# Compute p-value
p_val = np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)
print("Permutation test p-value:", p_val)
So, all of our tests give p-values that are close to each other, ranging from about 0.04 to 0.07. If we choose bright line p-values to deem something significant or not, similar hypothesis/test statistic pairs can give different results. So, do not use bright line p-values. They are a terrible idea. You went through the trouble of computing the p-value; just report it and leave it at that. Don't change a float to a bool.
For a Bayesian model comparison, we could do the analogous modeling: assume that the two distributions have the same mean, but not choose a model for how the data are distributed. This is a nonparametric Bayesian calculation, which is mathematically more challenging than the frequentist estimates and NHST we just did.
Instead, we will do a parametric Bayesian model selection, comparable to the Welch's t-test analysis.
First, we do some analytical work. We have two data sets, $D_w$ and $D_m$. We define hypothesis $H_1$: these data sets were drawn from independent Gaussian distributions with differing means and differing variances. The competing hypothesis is $H_0$: the data sets were drawn from independent Gaussian distributions with the same mean, $\mu$, but different variances. Since we are computing credible regions for the difference of means, we will not consider $H_0$ in the Bayesian context except for model comparison.
So, using hypothesis $H_1$, the likelihood is
\begin{align} &f(D_w,D_m \mid \mu_w,\sigma_w,\mu_m,\sigma_m, H_1) = \left(\prod_{i\in D_w} \frac{\mathrm{e}^{-(x_i-\mu_w)^2/2\sigma_w^2}}{\sqrt{2\pi\sigma_w^2}}\right) \left(\prod_{i\in D_m} \frac{\mathrm{e}^{-(x_i-\mu_m)^2/2\sigma_m^2}}{\sqrt{2\pi\sigma_m^2}}\right) \\[1em] &= \frac{1}{(2\pi)^{(n_w+n_m)/2}\sigma_w^{n_w}\sigma_m^{n_m}}\,\exp\left[ -\frac{1}{2\sigma_w^2}\sum_{i\in D_w}(x_i-\mu_w)^2 -\frac{1}{2\sigma_m^2}\sum_{i\in D_m}(x_i-\mu_m)^2 \right]. \end{align}
We saw in lecture 2 that
\begin{align} \sum_{i\in D}(x_i-\mu)^2 = n(\bar{x} - \mu)^2 + nr^2, \end{align}
where $\bar{x}$ is the sample mean and $r^2$ is the plug-in estimate for the variance. Thus, we can write our likelihood as
\begin{align} f(D_w,D_m \mid \mu_w,\sigma_w,\mu_m,\sigma_m, H_1) &= \frac{1}{(2\pi)^{(n_w+n_m)/2}\sigma_w^{n_w}\sigma_m^{n_m}}\\[1em] &\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\times \exp\left[ -\frac{n_w}{2\sigma_w^2}\left(r_w^2 + \left(\mu_w-\bar{x}_w\right)^2\right) -\frac{n_m}{2\sigma_m^2}\left(r_m^2 + \left(\mu_m-\bar{x}_m\right)^2\right) \right]. \end{align}
From this expression, we can immediately take $\mu = \mu_m = \mu_w$ to get the likelihood under hypothesis $H_0$.
\begin{align} f(D_w,D_m \mid \mu,\sigma_w, \sigma_m, H_0) &= \frac{1}{(2\pi)^{(n_w+n_m)/2}\sigma_w^{n_w}\sigma_m^{n_m}}\\[1em] &\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\times \exp\left[ -\frac{n_w}{2\sigma_w^2}\left(r_w^2 + \left(\mu-\bar{x}_w\right)^2\right) -\frac{n_m}{2\sigma_m^2}\left(r_m^2 + \left(\mu-\bar{x}_m\right)^2\right) \right]. \end{align}
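As an aside, the identity we used above follows from adding and subtracting the sample mean inside the square; the cross term vanishes because $\sum_{i\in D}(x_i - \bar{x}) = 0$:
\begin{align} \sum_{i\in D}(x_i-\mu)^2 = \sum_{i\in D}\left((x_i-\bar{x}) + (\bar{x}-\mu)\right)^2 = \sum_{i\in D}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2 = nr^2 + n(\bar{x}-\mu)^2. \end{align}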
For the priors of our hypotheses, we should probably use informative priors based on what we know about fish sleep from previous experiments. But, being new to this field, we will use uninformative priors. So, the prior for hypothesis $H_1$ is
\begin{align} g(\mu_w, \sigma_w, \mu_m, \sigma_m\mid H_1) = \frac{1}{(\mu_\mathrm{max}-\mu_\mathrm{min})^2}\, \frac{1}{\sigma_w \sigma_m \left(\ln (\sigma_\mathrm{max}/\sigma_\mathrm{min})\right)^2}, \end{align}
where we have assumed uniform priors on the $\mu$'s and Jeffreys priors on the $\sigma$'s. Similarly, the prior for $H_0$ is
\begin{align} g(\mu, \sigma_w, \sigma_m\mid H_0) = \frac{1}{\mu_\mathrm{max}-\mu_\mathrm{min}}\, \frac{1}{\sigma_w \sigma_m \left(\ln (\sigma_\mathrm{max}/\sigma_\mathrm{min})\right)^2}. \end{align}
To get the likelihood and prior for $H_1$ in a more convenient form, we now make a change of variables. Let $\delta = \mu_w-\mu_m$ and $\gamma = \mu_w + \mu_m$. Then, $\mu_w = (\gamma + \delta) / 2$ and $\mu_m = (\gamma - \delta) / 2$. We have to take care to apply the change of variables formula, which gives an extra factor of $1/2$ in the likelihood and prior. The likelihood is
\begin{align} f(D_w,D_m \mid \delta, \gamma,\sigma_w,\sigma_m, H_1) &= \frac{1}{2(2\pi)^{(n_w+n_m)/2}\sigma_w^{n_w}\sigma_m^{n_m}}\\[1em] &\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\times \exp\left[ -\frac{n_w}{2\sigma_w^2}\left(r_w^2 + \left(\frac{\gamma + \delta}{2}-\bar{x}_w\right)^2\right) -\frac{n_m}{2\sigma_m^2}\left(r_m^2 + \left(\frac{\gamma - \delta}{2}-\bar{x}_m\right)^2\right) \right], \end{align}
and the prior is
\begin{align} g(\delta, \gamma, \sigma_w, \sigma_m\mid H_1) = \frac{1}{2(\mu_\mathrm{max}-\mu_\mathrm{min})^2}\, \frac{1}{\sigma_w \sigma_m \left(\ln (\sigma_\mathrm{max}/\sigma_\mathrm{min})\right)^2}, \end{align}
bearing in mind that we must enforce $\delta \le \gamma$, $2\mu_\mathrm{min} \le\gamma \le 2\mu_\mathrm{max}$, and $|\delta| \le \mu_\mathrm{max} - \mu_\mathrm{min}$.
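The factor of $1/2$ mentioned above is the Jacobian of this change of variables; explicitly,
\begin{align} \left|\det\frac{\partial(\mu_w, \mu_m)}{\partial(\delta, \gamma)}\right| = \left|\det\begin{pmatrix}1/2 & 1/2 \\ -1/2 & 1/2\end{pmatrix}\right| = \frac{1}{2}. \end{align}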
We could proceed analytically all the way to the odds ratio, but to save the algebraic grunge, we will just solve for the MAP, covariance matrix, and ultimately the odds ratio using numerical optimization. So, let's code up the negative log posteriors of both models.
@numba.jit(nopython=True)
def log_like_1(p, wt, mut):
    """
    Log likelihood under model H1.
    """
    nw = len(wt)
    nm = len(mut)
    rw2 = np.var(wt)
    rm2 = np.var(mut)

    delta, gamma, sigma_w, sigma_m = p

    return ( -np.log(2) - (nw + nm) / 2 * np.log(2*np.pi)
            - nw * np.log(sigma_w) - nm * np.log(sigma_m)
            - nw / 2 / sigma_w**2 * (rw2 + ((gamma + delta) / 2 - np.mean(wt))**2)
            - nm / 2 / sigma_m**2 * (rm2 + ((gamma - delta) / 2 - np.mean(mut))**2) )


@numba.jit(nopython=True)
def log_prior_1(p, mu_min, mu_max, sigma_min, sigma_max):
    """
    Log prior under model H1.
    """
    delta, gamma, sigma_w, sigma_m = p

    if ( gamma < 2*mu_min or gamma > 2*mu_max
            or delta > gamma or np.abs(delta) > mu_max - mu_min
            or not (sigma_min <= sigma_m <= sigma_max)
            or not (sigma_min <= sigma_w <= sigma_max) ):
        return -np.inf

    return ( -np.log(2) - np.log(sigma_m) - np.log(sigma_w)
            - 2*np.log(np.log(sigma_max/sigma_min)) - 2*np.log(mu_max - mu_min) )


@numba.jit(nopython=True)
def log_like_0(p, wt, mut):
    """
    Log likelihood under model H0.
    """
    nw = len(wt)
    nm = len(mut)
    rw2 = np.var(wt)
    rm2 = np.var(mut)

    mu, sigma_w, sigma_m = p

    return ( -(nw + nm) / 2 * np.log(2*np.pi)
            - nw * np.log(sigma_w) - nm * np.log(sigma_m)
            - nw / 2 / sigma_w**2 * (rw2 + (mu - np.mean(wt))**2)
            - nm / 2 / sigma_m**2 * (rm2 + (mu - np.mean(mut))**2) )


@numba.jit(nopython=True)
def log_prior_0(p, mu_min, mu_max, sigma_min, sigma_max):
    """
    Log prior under model H0.
    """
    mu, sigma_w, sigma_m = p

    if ( not (mu_min <= mu <= mu_max)
            or not (sigma_min <= sigma_m <= sigma_max)
            or not (sigma_min <= sigma_w <= sigma_max) ):
        return -np.inf

    return ( -np.log(sigma_m) - np.log(sigma_w)
            - 2*np.log(np.log(sigma_max/sigma_min)) - np.log(mu_max - mu_min) )


def neg_log_post(p, wt, mut, mu_min, mu_max, sigma_min, sigma_max, model):
    """
    Negative log posterior.
    """
    if model == 1:
        lp = log_prior_1(p, mu_min, mu_max, sigma_min, sigma_max)
        if lp == -np.inf:
            return np.inf
        return -lp - log_like_1(p, wt, mut)
    elif model == 0:
        lp = log_prior_0(p, mu_min, mu_max, sigma_min, sigma_max)
        if lp == -np.inf:
            return np.inf
        return -lp - log_like_0(p, wt, mut)
Let's specify the ranges on the parameter values for the $\mu$'s and $\sigma$'s. What might be reasonable in the present context? Well, obviously, $0 \le \mu \le 600$ minutes, since the length of a sleep bout must be nonnegative and there are 600 total minutes in the sixth night. These are hard physical limits, but we can constrain $\mu$ further. I think if I saw a fish sitting still for 10 hours, I would be pretty shocked if it weren't floating (i.e., dead). So, perhaps a more reasonable bound on $\mu$ is one hour. That still seems really long, but not outside the realm of reason. How about $\sigma$? Again, the standard deviation cannot be bigger than the actual night, so 600 is a hard upper bound. Again, 10 hours of no movement would imply death, so we will choose $\sigma_\mathrm{max} = 60$ min. For the lower bound, we might pick one second, since that is close to the sampling rate of the experiment. So, $\sigma_\mathrm{min} = 1/60$.
mu_min, mu_max = 0, 60
sigma_min, sigma_max = 1/60, 60
Now, we can find the MAP and error bars using optimization. I will go ahead and write a function to do the full analysis, including model selection, because we will be using it again and again later. The odds ratio is
\begin{align} O_{01} \approx &\left(\frac{g(H_0)}{g(H_1)}\right) \left(\frac{f(D_w, D_m\mid \mu^*, \sigma_w^*, \sigma_m^*, H_0)}{f(D_w, D_m\mid \delta^*, \gamma^*,\sigma_w^*, \sigma_m^*, H_1)}\right) \\[1em] &\;\;\;\;\times \left(\frac{g(\mu^*, \sigma_w^*, \sigma_m^* \mid H_0) \,(2\pi)^{3/2}\,\sqrt{\det\mathsf{\Sigma}_0}} {g(\delta^*, \gamma^*, \sigma_w^*, \sigma_m^*\mid H_1)\,(2\pi)^2 \sqrt{\det\mathsf{\Sigma}_1}}\right), \end{align}
where the asterisks denote parameter values at the MAP, and $\mathsf{\Sigma}_0$ and $\mathsf{\Sigma}_1$ are the covariance matrices (inverse Hessians of the negative log posteriors) evaluated at the respective MAPs. I will take $g(H_0)/g(H_1) \approx 1$, assuming that we do not favor either model a priori. To do the calculation, then, we need the MAP and covariance matrix under each hypothesis, which we obtain by numerically optimizing the log posteriors we just coded up.
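For reference, this expression comes from a Laplace approximation of the evidence of each model, i.e., approximating each posterior as a Gaussian around its peak:
\begin{align} f(D_w, D_m\mid H_i) = \int \mathrm{d}\theta_i\, f(D_w, D_m\mid \theta_i, H_i)\, g(\theta_i \mid H_i) \approx f(D_w, D_m\mid \theta_i^*, H_i)\, g(\theta_i^* \mid H_i)\, (2\pi)^{k_i/2}\sqrt{\det\mathsf{\Sigma}_i}, \end{align}
where $\theta_i$ denotes the $k_i$ parameters of hypothesis $H_i$ ($k_0 = 3$ and $k_1 = 4$ here), which is where the factors of $(2\pi)^{3/2}$ and $(2\pi)^2$ come from.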
def analyze_bayes(wt, mut):
    """
    Compute MAP, covariance, and odds ratio for Bayesian model.
    """
    # Initial guesses
    delta_0 = np.mean(wt) - np.mean(mut)
    gamma_0 = np.mean(wt) + np.mean(mut)
    mu_0 = np.mean(np.concatenate((wt, mut)))
    sigma_w_0 = np.std(wt)
    sigma_m_0 = np.std(mut)
    args_0 = (wt, mut, mu_min, mu_max, sigma_min, sigma_max, 0)
    args_1 = (wt, mut, mu_min, mu_max, sigma_min, sigma_max, 1)

    # Find MAP
    popt_0 = scipy.optimize.minimize(
        neg_log_post, np.array([mu_0, sigma_w_0, sigma_m_0]), args=args_0).x
    popt_1 = scipy.optimize.minimize(
        neg_log_post, np.array([delta_0, gamma_0, sigma_w_0, sigma_m_0]),
        args=args_1).x

    # Compute covariances
    cov_0 = np.linalg.inv(bebi103.tools.approx_hess(popt_0, neg_log_post, args=args_0))
    cov_1 = np.linalg.inv(bebi103.tools.approx_hess(popt_1, neg_log_post, args=args_1))

    # Compute odds ratio
    log_goodfit = log_like_0(popt_0, wt, mut) - log_like_1(popt_1, wt, mut)
    log_occam = log_prior_0(popt_0, mu_min, mu_max, sigma_min, sigma_max) \
                - log_prior_1(popt_1, mu_min, mu_max, sigma_min, sigma_max) \
                - np.log(2*np.pi) / 2 \
                + np.log(np.linalg.det(cov_0)) / 2 \
                - np.log(np.linalg.det(cov_1)) / 2
    log_odds_ratio = log_goodfit + log_occam

    # Return results
    return {'popt_0': popt_0, 'popt_1': popt_1, 'cov_0': cov_0, 'cov_1': cov_1,
            'log_goodfit': log_goodfit, 'log_occam': log_occam,
            'log_odds_ratio': log_odds_ratio}
Let's do the analysis!
# Perform analysis
res = analyze_bayes(wt, mut)
That was easy (it's nice to have convenient functions). Let's see how we did on the parameter estimate for $\delta$, the difference in means. For fun, we'll print out all of the parameter estimates.
# Print results
print("""
Model 0:
µ = {0:.2f} ± {3:.2f} minutes
σw = {1:.2f} ± {4:.2f} minutes
σm = {2:.2f} ± {5:.2f} minutes
""".format(*res['popt_0'], *tuple(1.96*np.sqrt(np.diag(res['cov_0'])))))
print("""
Model 1:
δ = {0:.2f} ± {4:.2f} minutes
γ = {1:.2f} ± {5:.2f} minutes
σw = {2:.2f} ± {6:.2f} minutes
σm = {3:.2f} ± {7:.2f} minutes
""".format(*res['popt_1'], *tuple(1.96*np.sqrt(np.diag(res['cov_1'])))))
Note that we got $\delta =$ 0.31 ± 0.35 minutes, which is almost exactly the same as what we got for an estimate of the difference of means using nonparametric bootstrap.
Now, let's look at the odds ratio.
print("""
Goodness of fit: {0:.2f}
Occam factor: {1:.2f}
Odds ratio: {2:.2f}
""".format(*tuple(np.exp(
np.array([res['log_goodfit'], res['log_occam'], res['log_odds_ratio']])))))
As a check, let's do the model comparison another way, sampling the two models with PyMC3 and comparing them using the WAIC. First, the model in which the two genotypes share the same mean.
with pm.Model() as same_mean_model:
    # Priors
    mu = pm.Uniform('mu', lower=1/60, upper=60)
    sigma_m = bebi103.pm.Jeffreys('sigma_m', lower=1/60, upper=60)
    sigma_w = bebi103.pm.Jeffreys('sigma_w', lower=1/60, upper=60)

    # Likelihood
    wt_obs = pm.Normal('wt_obs', mu=mu, sd=sigma_w, observed=wt)
    mut_obs = pm.Normal('mut_obs', mu=mu, sd=sigma_m, observed=mut)

    # Get the samples
    trace_same_mean = pm.sample(draws=10000, tune=10000, njobs=4)
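If you want a quick sanity check on the sampler before moving on, a minimal sketch using PyMC3's built-in summary function (output not shown) is
# Quick look at posterior summaries for the same-mean model
pm.summary(trace_same_mean)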
And we will do the calculation also for different means.
with pm.Model() as different_mean_model:
    # Priors
    mu_m = pm.Uniform('mu_m', lower=1/60, upper=60)
    mu_w = pm.Uniform('mu_w', lower=1/60, upper=60)
    sigma_m = bebi103.pm.Jeffreys('sigma_m', lower=1/60, upper=60)
    sigma_w = bebi103.pm.Jeffreys('sigma_w', lower=1/60, upper=60)

    # Likelihood
    wt_obs = pm.Normal('wt_obs_m', mu=mu_w, sd=sigma_w, observed=wt)
    mut_obs = pm.Normal('mut_obs_m', mu=mu_m, sd=sigma_m, observed=mut)

    # Get the samples
    trace_different_mean = pm.sample(draws=10000, tune=10000, njobs=4)
And now we can compute the WAIC (the widely applicable information criterion, an estimate of out-of-sample predictive accuracy) and make comparisons.
pm.compare((trace_same_mean, trace_different_mean),
           (same_mean_model, different_mean_model))
The model where the means are different is slightly more predictive by the WAIC, so we might favor that model. So, we have a collection of p-values hovering around the 0.05 bright line, an odds ratio that does not favor the different-mean model, and a WAIC comparison that slightly favors it.
These may seem contradictory. They are not; they each answer a different question. They are three different quantities that often behave similarly (a model that describes the data well tends to be favored by the odds ratio and tends to be predictive, while a competing null hypothesis tends to have a low p-value), but they are different assessments of a model. So, you need to think very carefully about what you present as a model assessment, and how you judge what others put forward as assessments of their models.
In this case, the (approximate) Bayesian credible region and the frequentist confidence interval almost exactly overlap, but the odds ratio and the NHST give apparently different results (though, again, this is not a contradiction, because they compute two different things).
We will now do a fun, instructive experiment. We will "re-acquire" the data by drawing random samples out of Gaussian distributions with means and variances matching those estimated from the actual data. We will then compute the confidence interval and credible region for $\delta$ and see how they vary from experiment to experiment. We will later repeat this with p-values and odds ratios.
The idea here is that if the data are indeed Gaussian distributed, we are looking at data that could plausibly be generated in an identical experiment.
First, we'll write a function to generate new data.
def new_data(mu, sigma, n):
    """
    Generate new data, clipping any negative values to zero.
    """
    return np.maximum(np.random.normal(mu, sigma, n), 0)
Now we can do the calculations. We'll start by computing the confidence intervals for $\delta$.
# Number of new data sets to consider
n_new_data = 100

# Values from real data
mu_wt = np.mean(wt)
mu_mut = np.mean(mut)
sigma_wt = np.std(wt, ddof=1)
sigma_mut = np.std(mut, ddof=1)

# Number of trials to use in frequentist calcs
n_trials = 50000

# Set up arrays for storing results
conf_int = np.empty((n_new_data, 2))
cred_reg = np.empty((n_new_data, 2))
sample_mean = np.empty(n_new_data)

# Do calcs!
for i in range(n_new_data):
    # Generate new data
    new_wt = new_data(mu_wt, sigma_wt, len(wt))
    new_mut = new_data(mu_mut, sigma_mut, len(mut))

    # Compute confidence interval
    bs_reps = draw_bs_reps_diff_mean(new_wt, new_mut)
    conf_int[i,:] = np.percentile(bs_reps, (2.5, 97.5))

    # Compute Bayesian credible region
    res = analyze_bayes(new_wt, new_mut)
    err_bar = np.sqrt(res['cov_1'][0,0])
    cred_reg[i,:] = res['popt_1'][0] + 1.96 * np.array([-err_bar, err_bar])

# Store the results conveniently
df_conf = pd.DataFrame(columns=['conf_int_low', 'conf_int_high',
                                'cred_reg_low', 'cred_reg_high'],
                       data=np.hstack((conf_int, cred_reg)))

# Most probable delta (midpoint of credible region), used for sorting later
df_conf['delta'] = (df_conf['cred_reg_high'] + df_conf['cred_reg_low']) / 2
To visualize the results, we'll plot the frequentist and Bayesian confidence intervals and credible regions. We'll plot the confidence interval as a bar, and then the bounds of the credible region as dots.
# Set up figure
p = bokeh.plotting.figure(plot_height=250,
                          plot_width=700,
                          y_axis_label='δ (min)')

# Populate glyphs
x_conf = [[i, i] for i in range(len(df_conf))]
y_conf = [[r['conf_int_low'], r['conf_int_high']] for _, r in df_conf.iterrows()]
p.multi_line(x_conf, y_conf, line_width=3, color='#1f77b4')
p.circle(np.arange(len(df_conf)), df_conf['cred_reg_low'], color='#ff7f0e')
p.circle(np.arange(len(df_conf)), df_conf['cred_reg_high'], color='#ff7f0e')

# Turn off axis ticks for x
p.xaxis.visible = False
p.xgrid.visible = False

bokeh.io.show(p)
The Bayesian credible region and frequentist confidence interval almost perfectly overlap. This is not generally the case, but is the case for these data. More importantly, though, we see that the confidence interval can vary from experiment to experiment, but not that much. That is, the confidence intervals "dance" around, but all within about a factor of 3 of the original observed values.
It might be easier to visualize the variability if we sort the values.
df_conf = df_conf.sort_values(by='delta')
# Set up figure
p = bokeh.plotting.figure(plot_height=250,
                          plot_width=700,
                          y_axis_label='δ (min)')

# Populate glyphs
x_conf = [[i, i] for i in range(len(df_conf))]
y_conf = [[r['conf_int_low'], r['conf_int_high']] for _, r in df_conf.iterrows()]
p.multi_line(x_conf, y_conf, line_width=3, color='#1f77b4')
p.circle(np.arange(len(df_conf)), df_conf['cred_reg_low'], color='#ff7f0e')
p.circle(np.arange(len(df_conf)), df_conf['cred_reg_high'], color='#ff7f0e')

# Turn off axis ticks for x
p.xaxis.visible = False
p.xgrid.visible = False
bokeh.io.show(p)
Now, let's look at the p-values. We will compute three p-values: our custom bootstrapped p-value with Cohen's d, the p-value from a permutation test, and the p-value from Welch's t-test. Remember, with our custom bootstrapped p-value, the hypothesis is that the mutant and wild type sleep bout lengths were drawn out of distributions with the same mean (and no other assumptions), and the test statistic is Cohen's d. The hypothesis in the permutation test is that the two data sets are identically distributed. The hypothesis in Welch's t-test is that the mutant and wild type bout lengths were drawn from Gaussian distributions with the same mean but with different variances; the test statistic is the t-statistic defined above. Finally, the odds ratio is the ratio of the probability that the data come from Gaussian distributions with the same mean to the probability that they come from Gaussian distributions with different means.
# Number of new data sets to consider
n_new_data = 500
# Number of trials to use in frequentist calcs
n_trials = 50000
# Set up arrays for storing results
cohen_p = np.empty(n_new_data)
perm_test_p = np.empty(n_new_data)
welch_p = np.empty(n_new_data)
odds_ratio = np.empty(n_new_data)
goodfit = np.empty(n_new_data)
@numba.jit(nopython=True)
def perm_test_diff_mean(wt, mut, size=100000):
    """Permutation test p-value using the difference of means."""
    reps = draw_perm_reps_diff_mean(wt, mut, size=size)
    return np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)
# Do calcs!
for i in range(n_new_data):
    # Generate new data
    new_wt = new_data(mu_wt, sigma_wt, len(wt))
    new_mut = new_data(mu_mut, sigma_mut, len(mut))

    # Compute p-values
    cohen_p[i] = cohen_nhst(new_wt, new_mut, size=n_trials)
    perm_test_p[i] = perm_test_diff_mean(new_wt, new_mut, size=n_trials)
    welch_p[i] = st.ttest_ind(new_wt, new_mut, equal_var=False)[1] / 2

    # Compute Bayesian odds ratio
    res = analyze_bayes(new_wt, new_mut)
    odds_ratio[i] = np.exp(res['log_odds_ratio'])
    goodfit[i] = np.exp(res['log_goodfit'])
Now, let's plot these together. For the $x$-axis, we will plot the Welch's p-values to help visualize how the custom p-value and odds ratio vary with it. We will also plot the goodness of fit ratio to enable visualization of how the odds ratio is constructed.
p = bokeh.plotting.figure(x_axis_type='log',
                          y_axis_type='log',
                          x_axis_label="Welch's p-value",
                          plot_width=600,
                          plot_height=400)
p.circle(welch_p, cohen_p, color='#1f77b4', alpha=0.5, legend="Cohen's d p-value")
p.circle(welch_p, perm_test_p, color='#ffbb78', alpha=0.5, legend='permutation test p-value')
p.circle(welch_p, odds_ratio, color='#2ca02c', alpha=0.5, legend='odds ratio')
p.circle(welch_p, goodfit, color='#98df8a', alpha=0.5, legend='goodness of fit')

p.legend.location = 'top_left'
bokeh.io.show(p)
The goodness of fit very closely tracks the Welch's p-value, differing by about a factor of 10 over most of the values. The custom p-value computed with Cohen's d as the test statistic and the permutation test p-value are nearly equal to the Welch's p-value. The Bayesian odds ratio also tracks the p-values, but is almost two orders of magnitude larger. Remember, these quantities compute different things, but they are all in some way measures of difference between two data sets, and they show similar trends.
But what is really striking here is the scale! Wow! In 500 repeats, we get p-values ranging over almost four orders of magnitude! That's three exclamations in a row! Four, now. Those exclamation points are there to highlight that the p-value is not a reproducible statistic at all.
The Bayesian odds ratio also varies over about four orders of magnitude, so it is not very reliable either. The results of hypothesis tests and model selection tend to dance around a lot! Confidence intervals and credible regions don't really dance much, while p-values and odds ratios dance all over the place.
The zebrafish sleep experiment had only about 20 samples, so maybe larger sample sizes will result in less extreme dancing of odds ratios and p-values. Let's do a numerical experiment to look at that. We will take 15, 20, 50, and 100 samples for our experimental "repeats" and investigate how the p-value and odds ratio vary.
n_new_data = 10000
n_samples = [15, 20, 50, 100]
odds = np.empty((n_new_data, len(n_samples)))
p_vals = np.empty((n_new_data, len(n_samples)))

# Do calcs!
for i in range(n_new_data):
    for j, n in enumerate(n_samples):
        # Generate new data
        new_wt = new_data(mu_wt, sigma_wt, n)
        new_mut = new_data(mu_mut, sigma_mut, n)

        # Compute odds ratio
        res = analyze_bayes(new_wt, new_mut)
        odds[i,j] = np.exp(res['log_odds_ratio'])

        # Compute p-value
        p_vals[i,j] = st.ttest_ind(new_wt, new_mut, equal_var=False)[1]/2
We'll plot ECDFs of the p-values and odds ratios for each sample size.
# Make tidy data frames for convenient plotting
df_p = pd.DataFrame(data=p_vals, columns=['n = ' + str(n) for n in n_samples])
df_p = df_p.melt(var_name='n', value_name='p')
df_odds = pd.DataFrame(data=odds, columns=['n = ' + str(n) for n in n_samples])
df_odds = df_odds.melt(var_name='n', value_name='odds ratio')

# Make plots
p1 = bebi103.viz.ecdf_collection(df_p,
                                 cats=['n'],
                                 val='p',
                                 x_axis_label='p-value',
                                 x_axis_type='log',
                                 order=['n = 15', 'n = 20', 'n = 50', 'n = 100'],
                                 plot_width=700,
                                 plot_height=250,
                                 palette=bokeh.palettes.d3['Category20c'][4])

p2 = bebi103.viz.ecdf_collection(df_odds,
                                 cats=['n'],
                                 val='odds ratio',
                                 order=['n = 15', 'n = 20', 'n = 50', 'n = 100'],
                                 x_axis_label='odds ratio',
                                 x_axis_type='log',
                                 plot_width=700,
                                 plot_height=250,
                                 palette=bokeh.palettes.d3['Category20c'][8][4:])

p1.legend.location = 'top_left'
p2.legend.location = 'top_left'
bokeh.io.show(bokeh.layouts.gridplot([p1, p2], ncols=1))
We see that as the number of samples increases, both the odds ratio and the p-value distributions shift leftward, even though they retain large spreads. This is because small differences between the samples can be discerned with large sample sizes. But notice that the p-value and odds ratio still vary over orders of magnitude for similar data sets.
This little exercise in replicability tells us that because p-values and odds ratios "dance," we had better be sure the bonfire around which they dance is far to the left. This suggests large $n$.
I would argue that doing a "dance of the odds ratio" analysis like this is not a bad idea if you are trying to refute or support hypotheses. If a "repeat" of your experiment might give you an inconclusive odds ratio, you need to do more experiments.