Tutorial 7b: Dancing

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

This tutorial was inspired by Geoff Cumming.

In [1]:
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import numba

import pymc3 as pm

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

In this exercise, we will investigate how replicable certain statistical conclusions are, in both the frequentist and Bayesian contexts. What I mean by replicability is best understood by working through the tutorial.

For this tutorial, we will use the zebrafish embryo sleep data we have used before. We will study the mean rest bout length on the sixth night, comparing wild type to the homozygous mutant. We are primarily interested in the difference in the mean bout lengths between the two genotypes. You can work up the data set to get the mean rest bouts for the sixth night, but here they are.

As a reminder, two of the mutant fish had no detectable rest bouts at the one-minute sampling interval. They surely did have rest bouts, but they were shorter than one minute. We therefore have a decision to make: we can either take the mean rest bout length to be zero for these fish, or we can jettison them from our samples because we know we could not properly sample them. For the purposes of this demonstration, we will neglect these points. (The purpose of this tutorial is to demonstrate reproducibility of statistics, not to analyze this specific data set.)

Let's remind ourselves of what the data look like.

In [2]:
# Load data
df = pd.read_csv('../data/mean_rest_bouts.csv', comment='#')

# Pull out wild type and mutant and drop NAs
df = df[df['genotype'].isin(['wt', 'mut'])].dropna()

Let's look at these data with a jitter plot.

In [3]:
p = bebi103.viz.jitter(df,
                       cats=['genotype'],
                       val='mean_rest_bout_length',
                       order=['wt', 'mut'],
                       y_axis_label='mean rest bout length (s)',
                       jitter_width=0.2)
bokeh.io.show(p)

Cohen's d

Cohen's d is a commonly used measure of effect size in the comparison of two data sets. It is the (absolute) difference of the means divided by a pooled standard deviation.

\begin{align} d = \frac{|\bar{x} - \bar{y}|}{\sqrt{\left.(n_x r_x^2 + n_y r_y^2) \middle/ (n_x + n_y - 2)\right.}}. \end{align}

Here, $\bar{x}$ is the mean of the data from sample $x$, $r_x^2$ is the plug-in estimate for the variance from sample $x$, and $n_x$ is the number of measurements in sample $x$. The values for sample $y$ are similarly defined.

Roughly speaking, Cohen's d tells us how different the means of the data sets are compared to the variability in the data. A large Cohen's d means that the effect is large compared to the variability of the measurement.
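
As a quick illustration (a toy example with made-up numbers, not part of the analysis that follows), we can compute Cohen's d directly from the formula using NumPy, which is imported above as np.

# Toy example: two small made-up samples
x = np.array([2.1, 2.4, 1.9, 2.6, 2.2])
y = np.array([1.8, 2.0, 1.7, 2.1, 1.9])

# Difference of means and pooled variance (np.var is the plug-in estimate)
diff = x.mean() - y.mean()
pooled_variance = (len(x)*np.var(x) + len(y)*np.var(y)) / (len(x) + len(y) - 2)

# Cohen's d; comes out to about 1.5 for these made-up numbers
print(np.abs(diff) / np.sqrt(pooled_variance))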

Estimates of the difference of means and Cohen's d

First, we will compute nonparametric estimates from the data. We will estimate the difference in the mean bout length and Cohen's d. For speed, we save the two data sets as NumPy arrays.

In [4]:
wt = df.loc[df.genotype=='wt', 'mean_rest_bout_length'].values
mut = df.loc[df.genotype=='mut', 'mean_rest_bout_length'].values

Next, we'll write some functions to conveniently generate bootstrap replicates and do our hypothesis tests. These borrow heavily from the previous tutorial.

In [5]:
@numba.jit(nopython=True)
def cohen_d(x, y, return_abs=False):
    """
    Cohen's d
    """
    diff = x.mean() - y.mean()
    pooled_variance = (len(x) * np.var(x) + len(y) * np.var(y)) \
                            / (len(x) + len(y) - 2)

    if return_abs:
        return np.abs(diff) / np.sqrt(pooled_variance)
    return diff / np.sqrt(pooled_variance)


@numba.jit(nopython=True)
def t_stat(x, y):
    """Welch's t-statistic."""
    return (np.mean(x) - np.mean(y)) / \
        np.sqrt(np.var(x) / (len(x) - 1) + np.var(y) / (len(y) - 1))

    
@numba.jit(nopython=True)
def draw_perm_sample(x, y):
    """Generate a permutation sample."""
    concat_data = np.concatenate((x, y))
    np.random.shuffle(concat_data)
    return concat_data[:len(x)], concat_data[len(x):]
    
    
@numba.jit(nopython=True)
def draw_bs_sample(data):
    return np.random.choice(data, size=len(data))


@numba.jit(nopython=True)
def draw_perm_reps_diff_mean(x, y, size=10000):
    out = np.empty(size)
    for i in range(size):
        x_perm, y_perm = draw_perm_sample(x, y)
        out[i] = np.mean(x_perm) - np.mean(y_perm)
    return out


@numba.jit(nopython=True)
def draw_bs_reps_mean(data, size=10000):
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out


@numba.jit(nopython=True)
def draw_bs_reps_diff_mean(x, y, size=10000):
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(x)) - np.mean(draw_bs_sample(y))
    return out


@numba.jit(nopython=True)
def draw_bs_reps_cohen_d(x, y, size=10000, return_abs=False):
    out = np.empty(size)
    for i in range(size):
        out[i] = cohen_d(draw_bs_sample(x), draw_bs_sample(y), return_abs)
    return out

@numba.jit(nopython=True)
def draw_bs_reps_t(x, y, size=10000):
    """
    Bootstrap replicates using the Welch's t-statistic.
    """
    out = np.empty(size)
    for i in range(size):
        out[i] = t_stat(draw_bs_sample(x), draw_bs_sample(y))
    return out

Now we can compute the replicates. First, let's look at the means and their respective confidence intervals.

In [6]:
wt_reps = draw_bs_reps_mean(wt)
mut_reps = draw_bs_reps_mean(mut)

And from these, we compute the 95% confidence intervals.

In [7]:
wt_mean_conf_int = np.percentile(wt_reps, [2.5, 97.5])
mut_mean_conf_int = np.percentile(mut_reps, [2.5, 97.5])

print('WT:  [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
                np.mean(wt), *tuple(wt_mean_conf_int)))
print('MUT: [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
                np.mean(mut), *tuple(mut_mean_conf_int)))
WT:  [1.97, 2.21, 2.45] minutes
MUT: [1.66, 1.90, 2.17] minutes

There is some overlap in the confidence intervals, though wild type fish tend to have longer rest bouts. Just looking at these numbers, we may not be all that certain that there is a discernible difference between wild type and mutant. (Remember, though, that we left out the two fish that had no detectable rest bouts at all. It could be that the key difference between the two genotypes is that some mutants do not sleep at all, while those that do have rest bout lengths similar to wild type.)

Now, let's look at the difference in the mean rest bout length.

In [8]:
reps = draw_bs_reps_diff_mean(wt, mut)
diff_mean_conf_int = np.percentile(reps, [2.5, 97.5])

print('WT - MUT:  [{1:.2f}, {0:.2f}, {2:.2f}] minutes'.format(
                np.mean(wt) - np.mean(mut), *tuple(diff_mean_conf_int)))
WT - MUT:  [-0.06, 0.31, 0.66] minutes

As we might expect, on the tail end of the confidence interval for the difference of means, we see that the mutant might actually have longer rest bouts than wild type.

Finally, let's compute Cohen's d to check the effect size.

In [9]:
reps = draw_bs_reps_cohen_d(wt, mut)
cohen_d_conf_int = np.percentile(reps, [2.5, 97.5])

print("WT - MUT Cohen's d:  [{1:.2f}, {0:.2f}, {2:.2f}]".format(
                cohen_d(wt, mut), *tuple(cohen_d_conf_int)))
WT - MUT Cohen's d:  [-0.09, 0.54, 1.41]

So, the effect size is about 0.54, meaning that the difference in mean rest bout lengths between wild type and mutant fish is about half a pooled standard deviation. Jacob Cohen would call this a "medium" sized effect (his rough conventions are 0.2 for small, 0.5 for medium, and 0.8 for large effects).

Null hypothesis significance testing

We will now perform an NHST on these two data sets. We formulate the hypothesis as follows.

  • $H_0$: Wild type and mutant fish have the same mean rest bout length.
  • Test statistic: Cohen's d.
  • At least as extreme as: Cohen's d greater than or equal to what was observed.

We can then perform a bootstrap hypothesis test.

In [10]:
@numba.jit(nopython=True)
def cohen_nhst(wt, mut, size=100000):
    """
    Perform hypothesis test assuming equal means, using
    Cohen-d as test statistic.
    """
    # Shift data sets so that they have the same mean.
    wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))
    mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))

    # Draw replicates of Cohen's d
    reps = draw_bs_reps_cohen_d(wt_shifted, mut_shifted, size=size)

    # Compute p-value
    return np.sum(reps >= cohen_d(wt, mut)) / len(reps)

print("Cohen's d p-value:", cohen_nhst(wt, mut))
Cohen's d p-value: 0.06546

We get a p-value of about 0.07. If we used the typical bright line p-value of 0.05 for statistical significance, we would say that this difference is not statistically significant. We could also test what would happen if we used a different test statistic, such as the difference of means.

In [11]:
# Shift data sets so that they have the same mean.
wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))
mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))

# Draw replicates of the difference of means
reps = draw_bs_reps_diff_mean(wt_shifted, mut_shifted, size=100000)

# Compute p-value
p_val = np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)

print('Difference of means p-value:', p_val)
Difference of means p-value: 0.04005

Here, we get a p-value of about 0.04. We would say that the result is statistically significant if we used a bright line value of 0.05.

Finally, let's try the canonical test for this circumstance, Welch's t-test. As a reminder, the test statistic for Welch's t-test is

\begin{align} T = \frac{\bar{x}_w - \bar{x}_m}{\sqrt{r_w^2/(n_w-1) + r_m^2/(n_m-1)}}, \end{align}

where $r_w^2$ and $r_m^2$ are the plug-in estimates for the variances. (Note that $r^2/(n-1) = s^2/n$, where $s^2$ is the unbiased sample variance, so this is the canonical form of the statistic and is what the t_stat() function above computes.) Importantly, when performing a Welch's t-test, Gaussianity of the two samples is assumed. So, the hypothesis test is defined as follows.

  • $H_0$: The two samples are both Gaussian distributed with equal means.
  • Test statistic: t-statistic.
  • At least as extreme as: t-statistic (wild type minus mutant) greater than or equal to what was observed.

This is implemented as st.ttest_ind() using the kwarg equal_var=False. We divide by two to get the one-tailed test.

In [12]:
print("Welch's p-value:", st.ttest_ind(wt, mut, equal_var=False)[1]/2)
Welch's p-value: 0.0525420049088

Here, we are just above the bright line value of 0.05. We can perform a similar hypothesis test without the Gaussian assumption using the same test statistic as in the Welch's test.

In [13]:
# Draw replicates of t statistic
reps = draw_bs_reps_t(wt_shifted, mut_shifted, size=100000)

# Compute p-value
p_val = np.sum(reps >= t_stat(wt, mut)) / len(reps)

print("Welch's t-test without Gaussian assumption p-value:", p_val)
Welch's t-test without Gaussian assumption p-value: 0.06339

Finally, we will perform a permutation test. This test is specified as follows.

  • $H_0$: The sleep bout lengths of mutant and wild type fish are identically distributed.
  • Test statistic: difference of means, wild type minus mutant.
  • At least as extreme as: difference of means greater than or equal to what was observed.

In [14]:
# Draw permutation replicates
reps = draw_perm_reps_diff_mean(wt, mut, size=100000)

# Compute p-value
p_val = np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)

print("Permutation test p-value:", p_val)
Permutation test p-value: 0.05511

So, all of our tests give p-values that are close to each other, ranging from about 0.04 to 0.07. If we choose bright line p-values to deem something significant or not, similar hypothesis/test statistic pairs can give different results. So, do not use bright line p-values. They are a terrible idea. You went through the trouble of computing the p-value; just report it and leave it at that. Don't convert a float to a bool.

A Bayesian analysis

To set up the analogous Bayesian model comparison problem, we could do the same kind of modeling: assume that the two distributions have the same mean while not choosing a model for how the data are distributed. This would be a nonparametric Bayesian calculation, which is mathematically more challenging than the frequentist estimates and NHST we just did.

Instead, we will do a parametric Bayesian model selection, comparable to the Welch's t-test analysis.

First, we do some analytical work. We have two data sets, $D_w$ and $D_m$. We define hypothesis $H_1$: these data sets were drawn from independent Gaussian distributions with differing means and differing variances. The competing hypothesis is $H_0$: the data sets were drawn from independent Gaussian distributions with the same mean, $\mu$, but different variances. Since we are interested in the difference of means, we will use $H_1$ for parameter estimation and consider $H_0$ only for model comparison.

So, using hypothesis $H_1$, the likelihood is

\begin{align} &f(D_w,D_m \mid \mu_w,\sigma_w,\mu_m,\sigma_m, H_1) = \left(\prod_{i\in D_w} \frac{\mathrm{e}^{-(x_i-\mu_w)^2/2\sigma_w^2}}{\sqrt{2\pi\sigma_w^2}}\right) \left(\prod_{i\in D_m} \frac{\mathrm{e}^{-(x_i-\mu_m)^2/2\sigma_m^2}}{\sqrt{2\pi\sigma_m^2}}\right) \\[1em] &= \frac{1}{(2\pi)^{(n_w+n_m)/2}\sigma_w^{n_w}\sigma_m^{n_m}}\,\exp\left[ -\frac{1}{2\sigma_w^2}\sum_{i\in D_w}(x_i-\mu_w)^2 -\frac{1}{2\sigma_m^2}\sum_{i\in D_m}(x_i-\mu_m)^2 \right]. \end{align}

We saw in lecture 2 that

\begin{align} \sum_{i\in D}(x_i-\mu)^2 = n(\bar{x} - \mu)^2 + nr^2, \end{align}

where $\bar{x}$ is the sample mean and $r^2$ is the plug-in estimate for the variance. Thus, we can write our likelihood as

\begin{align} f(D_w,D_m \mid \mu_w,\sigma_w,\mu_m,\sigma_m, H_1) &= \frac{1}{(2\pi)^{(n_w+n_m)/2}\sigma_w^{n_w}\sigma_m^{n_m}}\\[1em] &\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\times \exp\left[ -\frac{n_w}{2\sigma_w^2}\left(r_w^2 + \left(\mu_w-\bar{x}_w\right)^2\right) -\frac{n_m}{2\sigma_m^2}\left(r_m^2 + \left(\mu_m-\bar{x}_m\right)^2\right) \right]. \end{align}

From this expression, we can immediately take $\mu = \mu_m = \mu_w$ to get the likelihood under hypothesis $H_0$.

\begin{align} f(D_w,D_m \mid \mu,\sigma_w, \sigma_m, H_0) &= \frac{1}{(2\pi)^{(n_w+n_m)/2}\sigma_w^{n_w}\sigma_m^{n_m}}\\[1em] &\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\times \exp\left[ -\frac{n_w}{2\sigma_w^2}\left(r_w^2 + \left(\mu-\bar{x}_w\right)^2\right) -\frac{n_m}{2\sigma_m^2}\left(r_m^2 + \left(\mu-\bar{x}_m\right)^2\right) \right]. \end{align}

For the priors of our hypotheses, we should probably use informative priors based on what we know about fish sleep from previous experiments. But, being new to this field, we will use uninformative priors. So, the prior for hypothesis $H_1$ is

\begin{align} g(\mu_w, \sigma_w, \mu_m, \sigma_m\mid H_1) = \frac{1}{(\mu_\mathrm{max}-\mu_\mathrm{min})^2}\, \frac{1}{\sigma_w \sigma_m \left(\ln (\sigma_\mathrm{max}/\sigma_\mathrm{min})\right)^2}, \end{align}

where we have assumed uniform priors on the $\mu$'s and Jeffreys priors on the $\sigma$'s. Similarly, the prior for $H_0$ is

\begin{align} g(\mu, \sigma_w, \sigma_m\mid H_0) = \frac{1}{\mu_\mathrm{max}-\mu_\mathrm{min}}\, \frac{1}{\sigma_w \sigma_m \left(\ln (\sigma_\mathrm{max}/\sigma_\mathrm{min})\right)^2}. \end{align}

To get the likelihood and prior for $H_1$ in a more convenient form, we now make a change of variables. Let $\delta = \mu_w-\mu_m$ and $\gamma = \mu_w + \mu_m$. Then, $\mu_w = (\gamma + \delta) / 2$ and $\mu_m = (\gamma - \delta) / 2$. We have to take care to apply the change of variables formula, which gives an extra factor of $1/2$ in the likelihood and prior. The likelihood is

\begin{align} f(D_w,D_m \mid \delta, \gamma,\sigma_w,\sigma_m, H_1) &= \frac{1}{2(2\pi)^{(n_w+n_m)/2}\sigma_w^{n_w}\sigma_m^{n_m}}\\[1em] &\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\times \exp\left[ -\frac{n_w}{2\sigma_w^2}\left(r_w^2 + \left(\frac{\gamma + \delta}{2}-\bar{x}_w\right)^2\right) -\frac{n_m}{2\sigma_m^2}\left(r_m^2 + \left(\frac{\gamma - \delta}{2}-\bar{x}_m\right)^2\right) \right], \end{align}

and the prior is

\begin{align} g(\delta, \gamma, \sigma_w, \sigma_m\mid H_1) = \frac{1}{2(\mu_\mathrm{max}-\mu_\mathrm{min})^2}\, \frac{1}{\sigma_w \sigma_m \left(\ln (\sigma_\mathrm{max}/\sigma_\mathrm{min})\right)^2}, \end{align}

bearing in mind that we must enforce $\delta \le \gamma$, $2\mu_\mathrm{min} \le\gamma \le 2\mu_\mathrm{max}$, and $|\delta| \le \mu_\mathrm{max} - \mu_\mathrm{min}$.
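
As a quick sanity check on the factor of $1/2$ from the change of variables (optional, and assuming SymPy is available; it is not imported at the top of this notebook), we can compute the Jacobian determinant symbolically.

import sympy

delta, gamma = sympy.symbols('delta gamma')

# Original parameters in terms of the new ones
mu_w = (gamma + delta) / 2
mu_m = (gamma - delta) / 2

# Jacobian determinant of (mu_w, mu_m) with respect to (delta, gamma)
jac = sympy.Matrix([mu_w, mu_m]).jacobian([delta, gamma])
print(sympy.Abs(jac.det()))    # 1/2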

We can proceed analytically all the way to the odds ratio, but to save the algebraic grunge, we will just solve for the MAP, covariance matrix, and ultimately the odds ratio using numerical optimization. So, let's code up the negative log posteriors of both.

In [15]:
@numba.jit(nopython=True)
def log_like_1(p, wt, mut):
    """
    Log likelihood under model H1.
    """
    nw = len(wt)
    nm = len(mut)
    rw2 = np.var(wt)
    rm2 = np.var(mut)
    delta, gamma, sigma_w, sigma_m = p
    
    return ( -np.log(2) - (nw + nm) / 2 * np.log(2*np.pi)
             - nw * np.log(sigma_w) - nm * np.log(sigma_m)
             - nw / 2 / sigma_w**2 *(rw2 + ((gamma + delta) / 2 - np.mean(wt))**2)
             - nm / 2 / sigma_m**2 *(rm2 + ((gamma - delta) / 2 - np.mean(mut))**2) )
        

@numba.jit(nopython=True)
def log_prior_1(p, mu_min, mu_max, sigma_min, sigma_max):
    """
    Log prior under model H1.
    """
    delta, gamma, sigma_w, sigma_m = p
    if ( gamma < 2*mu_min or gamma > 2*mu_max
            or delta > gamma or np.abs(delta) > mu_max - mu_min
            or not (sigma_min <= sigma_m <= sigma_max)
            or not (sigma_min <= sigma_w <= sigma_max) ):
        return -np.inf
        
    return ( -np.log(2) - np.log(sigma_m) - np.log(sigma_w)
             - 2*np.log(np.log(sigma_max/sigma_min)) - 2*np.log(mu_max - mu_min) )


@numba.jit(nopython=True)
def log_like_0(p, wt, mut):
    """
    Log likelihood under model H0.
    """
    nw = len(wt)
    nm = len(mut)
    rw2 = np.var(wt)
    rm2 = np.var(mut)
    mu, sigma_w, sigma_m = p

    return ( -(nw + nm) / 2 * np.log(2*np.pi)
             - nw * np.log(sigma_w) - nm * np.log(sigma_m)
             - nw / 2 / sigma_w**2 * (rw2 + (mu - np.mean(wt))**2)
             - nm / 2 / sigma_m**2 * (rm2 + (mu - np.mean(mut))**2) )
        
        
@numba.jit(nopython=True)        
def log_prior_0(p, mu_min, mu_max, sigma_min, sigma_max):
    """
    Log prior under model H0.
    """
    mu, sigma_w, sigma_m = p

    if ( not (mu_min <= mu <= mu_max)
            or not (sigma_min <= sigma_m <= sigma_max)
            or not (sigma_min <= sigma_w <= sigma_max) ):
        return -np.inf
    
    return ( -np.log(sigma_m) - np.log(sigma_w)
            - 2*np.log(np.log(sigma_max/sigma_min)) - np.log(mu_max - mu_min) )


def neg_log_post(p, wt, mut, mu_min, mu_max, sigma_min, sigma_max, model):
    """
    Negative log posterior.
    """
    if model == 1:
        lp = log_prior_1(p, mu_min, mu_max, sigma_min, sigma_max)
        if lp == -np.inf:
            return np.inf
        return -lp - log_like_1(p, wt, mut)
    elif model == 0:
        lp = log_prior_0(p, mu_min, mu_max, sigma_min, sigma_max)
        if lp == -np.inf:
            return np.inf
        return -lp - log_like_0(p, wt, mut)

Let's specify the ranges on the parameter values for the $\mu$'s and $\sigma$'s. What might be reasonable in the present context? Well, obviously, $0 \le \mu \le 600$ minutes, since the length of a sleep bout must be nonnegative and there are 600 total minutes in the sixth night. These are hard physical limits, but we can constrain $\mu$ further. I think if I saw a fish sitting still for 10 hours, I would be pretty shocked if it weren't floating (i.e., dead). So, perhaps a more reasonable upper bound on $\mu$ is one hour. That still seems really long, but not outside the realm of reason. How about $\sigma$? Again, the standard deviation cannot be bigger than the length of the night, so 600 minutes is a hard upper bound. And again, 10 hours of no movement would imply death, so we will choose $\sigma_\mathrm{max} = 60$ min. For the lower bound, we might pick one second, since that is close to the sampling rate of the experiment. So, $\sigma_\mathrm{min} = 1/60$ min.

In [16]:
mu_min, mu_max = 0, 60
sigma_min, sigma_max = 1/60, 60

Now, we can find the MAP and error bars using optimization. I will go ahead and write a function to do the full analysis, including model selection, because we will be using it again and again later. The odds ratio is

\begin{align} O_{01} \approx &\left(\frac{g(H_0)}{g(H_1)}\right) \left(\frac{f(D_w, D_m\mid \mu^*, \sigma_w^*, \sigma_m^*, H_0)}{f(D_w, D_m\mid \delta^*, \gamma^*,\sigma_w^*, \sigma_m^*, H_1)}\right) \\[1em] &\;\;\;\;\times \left(\frac{g(\mu^*, \sigma_w^*, \sigma_m^* \mid H_0) \,(2\pi)^{3/2}\,\sqrt{\det\mathsf{\Sigma}_0}} {g(\delta^*, \gamma^*, \sigma_w^*, \sigma_m^*\mid H_1)\,(2\pi)^2 \sqrt{\det\mathsf{\Sigma}_1}}\right), \end{align}

where the asterisks represent the parameter values at the MAP, and $\mathsf{\Sigma}_0$ and $\mathsf{\Sigma}_1$ are the posterior covariance matrices (inverse Hessians of the negative log posterior) evaluated at the respective MAPs. I will take $g(H_0)/g(H_1) \approx 1$, assuming that we do not favor either model a priori. To do the calculation, then, we need the MAP and covariance matrix under each model, which we obtain by numerically optimizing the negative log posteriors we just coded up.

In [17]:
def analyze_bayes(wt, mut):
    """
    Compute MAP, covariance, and odds ratio for Bayesian model
    """
    # Initial guesses
    delta_0 = np.mean(wt) - np.mean(mut)
    gamma_0 = np.mean(wt) + np.mean(mut)
    mu_0 = np.mean(np.concatenate((wt, mut)))
    sigma_w_0 = np.std(wt)
    sigma_m_0 = np.std(mut)
    args_0 = (wt, mut, mu_min, mu_max, sigma_min, sigma_max, 0)
    args_1 = (wt, mut, mu_min, mu_max, sigma_min, sigma_max, 1)

    # Find MAP
    popt_0 = scipy.optimize.minimize(
        neg_log_post, np.array([mu_0, sigma_w_0, sigma_m_0]), args=args_0).x

    popt_1 = scipy.optimize.minimize(
        neg_log_post, np.array([delta_0, gamma_0, sigma_w_0, sigma_m_0]), 
        args=args_1).x

    # Compute covariances
    cov_0 = np.linalg.inv(bebi103.tools.approx_hess(popt_0, neg_log_post, args=args_0))
    cov_1 = np.linalg.inv(bebi103.tools.approx_hess(popt_1, neg_log_post, args=args_1))
    
    # Compute odds ratio
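    # Note: the ratio of Occam factors contributes (2π)^(3/2) / (2π)^2 = (2π)^(-1/2),
    # since H0 has three parameters and H1 has four, hence the -np.log(2*np.pi)/2
    # term below.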
    log_goodfit = log_like_0(popt_0, wt, mut) - log_like_1(popt_1, wt, mut)
    log_occam = log_prior_0(popt_0, mu_min, mu_max, sigma_min, sigma_max) \
                - log_prior_1(popt_1, mu_min, mu_max, sigma_min, sigma_max) \
                - np.log(2*np.pi)/2  \
                + np.log(np.linalg.det(cov_0)) / 2 \
                - np.log(np.linalg.det(cov_1)) / 2
    log_odds_ratio = log_goodfit + log_occam
    
    # Return results
    return {'popt_0': popt_0, 'popt_1': popt_1, 'cov_0': cov_0, 'cov_1': cov_1,
            'log_goodfit': log_goodfit, 'log_occam': log_occam,
            'log_odds_ratio': log_odds_ratio}

Let's do the analysis!

In [18]:
# Perform analysis
res = analyze_bayes(wt, mut)

That was easy (it's nice to have convenient functions). Let's see how we did on the parameter estimate for $\delta$, the difference in means. For fun, we'll print out all of the parameter estimates.

In [19]:
# Print results
print("""
Model 0:
µ  = {0:.2f} ± {3:.2f} minutes
σw = {1:.2f} ± {4:.2f} minutes
σm = {2:.2f} ± {5:.2f} minutes
""".format(*res['popt_0'], *tuple(1.96*np.sqrt(np.diag(res['cov_0'])))))

print("""
Model 1:
δ  = {0:.2f} ± {4:.2f} minutes
γ  = {1:.2f} ± {5:.2f} minutes
σw = {2:.2f} ± {6:.2f} minutes
σm = {3:.2f} ± {7:.2f} minutes
""".format(*res['popt_1'], *tuple(1.96*np.sqrt(np.diag(res['cov_1'])))))
Model 0:
µ  = 2.07 ± 0.19 minutes
σw = 0.51 ± 0.17 minutes
σm = 0.60 ± 0.19 minutes


Model 1:
δ  = 0.31 ± 0.35 minutes
γ  = 4.11 ± 0.35 minutes
σw = 0.49 ± 0.16 minutes
σm = 0.58 ± 0.18 minutes

Note that we got $\delta =$ 0.31 ± 0.35 minutes, which is almost exactly the same as the estimate of the difference of means we got using the nonparametric bootstrap.

Now, let's look at the odds ratio.

In [20]:
print("""
Goodness of fit:  {0:.2f}
Occam factor:    {1:.2f}
Odds ratio:      {2:.2f}
""".format(*tuple(np.exp(
    np.array([res['log_goodfit'], res['log_occam'], res['log_odds_ratio']])))))
Goodness of fit:  0.49
Occam factor:    152.67
Odds ratio:      74.67

As a check, let's do the calculation again, this time sampling with MCMC using PyMC3 and comparing the models with the WAIC. First, the model in which the two genotypes have the same mean.

In [21]:
with pm.Model() as same_mean_model:
    # Priors
    mu = pm.Uniform('mu', lower=1/60, upper=60)
    sigma_m = bebi103.pm.Jeffreys('sigma_m', lower=1/60, upper=60)
    sigma_w = bebi103.pm.Jeffreys('sigma_w', lower=1/60, upper=60)
                
    # Likelihood
    wt_obs = pm.Normal('wt_obs', mu=mu, sd=sigma_w, observed=wt)
    mut_obs = pm.Normal('mut_obs', mu=mu, sd=sigma_m, observed=mut)
    
    # Get the samples
    trace_same_mean = pm.sample(draws=10000, tune=10000, njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 20000/20000 [00:18<00:00, 1061.70it/s]

And we will do the calculation also for different means.

In [22]:
with pm.Model() as different_mean_model:
    # Priors
    mu_m = pm.Uniform('mu_m', lower=1/60, upper=60)
    mu_w = pm.Uniform('mu_w', lower=1/60, upper=60)
    sigma_m = bebi103.pm.Jeffreys('sigma_m', lower=1/60, upper=60)
    sigma_w = bebi103.pm.Jeffreys('sigma_w', lower=1/60, upper=60)
                
    # Likelihood
    wt_obs = pm.Normal('wt_obs_m', mu=mu_w, sd=sigma_w, observed=wt)
    mut_obs = pm.Normal('mut_obs_m', mu=mu_m, sd=sigma_m, observed=mut)
    
    # Get the samples
    trace_different_mean = pm.sample(draws=10000, tune=10000, njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 20000/20000 [00:26<00:00, 754.25it/s]

And now we can compute the WAIC and make comparisons.

In [23]:
pm.compare((trace_same_mean, trace_different_mean),
           (same_mean_model, different_mean_model))
Out[23]:
    WAIC   pWAIC  dWAIC  weight  SE     dSE   warning
1   70.97  4.89   0      0.6     12.66  0     1
0   71.31  3.73   0.34   0.4     11.04  3.67  1

The model where the means are different (model 1) has a slightly lower WAIC and carries about 60% of the Akaike weight, so it is slightly more predictive and we might favor it. So, we have:

  • A p-value that could be interpreted to mean that we should reject the hypothesis that the means are the same.
  • An odds ratio (a Bayes factor with equal prior odds) of about 75 in favor of the hypothesis that the means are the same.
  • A WAIC that very slightly favors the hypothesis that the means are different.

These seem contradictory. They are not. That is because they each answer different questions.

  • The p-value is the probability of observing a value of the test statistic at least as extreme as what was observed experimentally, assuming the null hypothesis (in this case $H_0$) is true. No other hypothesis is considered.
  • The Bayes factor, assuming equal a priori probability for the two specific models, is the odds ratio. That is, it is the ratio of the probabilities of the respective models being true, given the data (the conversion to a posterior probability is spelled out just below this list).
  • The WAIC is a measure of how predictive a model is expected to be for new data.
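
To put a number on the Bayes factor bullet above, the standard conversion between odds and probability (under the strong assumption that $H_0$ and $H_1$ are the only models in play) is

\begin{align} O_{01} = \frac{P(H_0\mid D)}{P(H_1\mid D)} \quad\Longrightarrow\quad P(H_0\mid D) = \frac{O_{01}}{1 + O_{01}}, \end{align}

so an odds ratio of about 75 corresponds to a posterior probability of roughly 0.99 for $H_0$.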

These are three different things that can behave similarly (e.g., a "good" model, used as a null hypothesis, tends not to give a low p-value, tends to have a high Bayes factor in its favor, and tends to be predictive), but they are different assessments of a model. So, you need to think very carefully about what you present as a model assessment, and how you judge what others put forward as assessments of their models.

In this case, the (approximate) Bayesian credible region and the frequentist confidence interval almost exactly overlap, but the odds ratio and the NHST give apparently different results (though, again, this is not a contradiction, because they compute two different things).

Dancing: confidence intervals and credible regions

We will now do a fun, instructive experiment. We will "re-acquire" the data by drawing random samples out of Gaussian distributions that have the same means and variances as those estimated from the actual data. We will then compute the confidence interval and credible region for $\delta$ and see how they vary from experiment to experiment. We will later repeat this with p-values and odds ratios.

The idea here is that if the data are indeed Gaussian distributed, we are looking at data that could plausibly be generated in an identical experiment.

First, we'll write a function to generate new data.

In [24]:
def new_data(mu, sigma, n):
    """
    Generate new data
    """
    return np.maximum(np.random.normal(mu, sigma, n), 0)

Now we can do the calculations. We'll start by computing the confidence intervals for $\delta$.

In [25]:
# Number of new data sets to consider
n_new_data = 100

# Values from real data
mu_wt = np.mean(wt)
mu_mut = np.mean(mut)
sigma_wt = np.std(wt, ddof=1)
sigma_mut = np.std(mut, ddof=1)

# Number of trials to use in frequentist calcs
n_trials = 50000

# Set up arrays for storing results
conf_int = np.empty((n_new_data, 2))
cred_reg = np.empty((n_new_data, 2))
sample_mean = np.empty(n_new_data)

# Do calcs!
for i in range(n_new_data):
    # Generate new data
    new_wt = new_data(mu_wt, sigma_wt, len(wt))
    new_mut = new_data(mu_mut, sigma_mut, len(mut))
    
    # Compute confidence interval
    bs_reps = draw_bs_reps_diff_mean(new_wt, new_mut)
    conf_int[i,:] = np.percentile(bs_reps, (2.5, 97.5))
    
    # Compute Bayesian credible region
    res = analyze_bayes(new_wt, new_mut)
    err_bar = np.sqrt(res['cov_1'][0,0])
    cred_reg[i,:] = res['popt_1'][0] + 1.96 * np.array([-err_bar, err_bar])
    
# Store the results conveniently
df_conf = pd.DataFrame(columns=['conf_int_low', 'conf_int_high', 
                                'cred_reg_low', 'cred_reg_high'],
                      data=np.hstack((conf_int, cred_reg)))

# Record the most probable delta (midpoint of credible region) for later sorting
df_conf['delta'] = (df_conf['cred_reg_high'] + df_conf['cred_reg_low']) / 2

To visualize the results, we'll plot the frequentist and Bayesian confidence intervals and credible regions. We'll plot the confidence interval as a bar, and then the bounds of the credible region as dots.

In [26]:
# Set up figure
p = bokeh.plotting.figure(plot_height=250,
                          plot_width=700,
                          y_axis_label='δ (min)')

# Populate glyphs
x_conf = [[i, i] for i in range(len(df_conf))]
y_conf = [[r['conf_int_low'], r['conf_int_high']] for _, r in df_conf.iterrows()]
p.multi_line(x_conf, y_conf, line_width=3, color='#1f77b4')
p.circle(np.arange(len(df_conf)), df_conf['cred_reg_low'], color='#ff7f0e')
p.circle(np.arange(len(df_conf)), df_conf['cred_reg_high'], color='#ff7f0e')

# Turn off axis ticks for x
p.xaxis.visible = False
p.xgrid.visible = False

bokeh.io.show(p)

The Bayesian credible region and frequentist confidence interval almost perfectly overlap. This is not generally the case, but is the case for these data. More importantly, though, we see that the confidence interval can vary from experiment to experiment, but not that much. That is, the confidence intervals "dance" around, but all within about a factor of 3 of the original observed values.

It might be easier to visualize the variability if we sort the values.

In [27]:
df_conf = df_conf.sort_values(by='delta')

# Set up figure
p = bokeh.plotting.figure(plot_height=250,
                          plot_width=700,
                          y_axis_label='δ (min)')

# Populate glyphs
x_conf = [[i, i] for i in range(len(df_conf))]
y_conf = [[r['conf_int_low'], r['conf_int_high']] for _, r in df_conf.iterrows()]
p.multi_line(x_conf, y_conf, line_width=3, color='#1f77b4')
p.circle(np.arange(len(df_conf)), df_conf['cred_reg_low'], color='#ff7f0e')
p.circle(np.arange(len(df_conf)), df_conf['cred_reg_high'], color='#ff7f0e')

# Turn off axis ticks for x
p.xaxis.visible = False
p.xgrid.visible = False

bokeh.io.show(p)

Dancing: p-values and odds ratios

Now, let's look at the p-values. We will compute three p-values: our custom bootstrapped p-value with Cohen's d, the p-value from a permutation test, and the p-value from Welch's t-test. Remember, with our custom bootstrapped p-value, the hypothesis is that the mutant and wild type sleep bout lengths were drawn out of distributions with the same mean (and no other assumptions), and the test statistic is Cohen's d. The hypothesis in the permutation test is that the two data sets are identically distributed. The hypothesis in Welch's t-test is that the mutant and wild type bout lengths were drawn from Gaussian distributions with the same mean but with different variances, and the test statistic is the t-statistic defined above. Finally, the odds ratio is the ratio of the probability that the data come from Gaussian distributions with the same mean to the probability that they come from Gaussian distributions with different means.

In [28]:
# Number of new data sets to consider
n_new_data = 500

# Number of trials to use in frequentist calcs
n_trials = 50000

# Set up arrays for storing results
cohen_p = np.empty(n_new_data)
perm_test_p = np.empty(n_new_data)
welch_p = np.empty(n_new_data)
odds_ratio = np.empty(n_new_data)
goodfit = np.empty(n_new_data)

@numba.jit(nopython=True)
def perm_test_diff_mean(wt, mut, size=100000):
    reps = draw_perm_reps_diff_mean(wt, mut, size=size)
    return np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)

# Do calcs!
for i in range(n_new_data):
    # Generate new data
    new_wt = new_data(mu_wt, sigma_wt, len(wt))
    new_mut = new_data(mu_mut, sigma_mut, len(mut))
    
    # Compute p-values
    cohen_p[i] = cohen_nhst(new_wt, new_mut, size=n_trials)
    perm_test_p[i] = perm_test_diff_mean(new_wt, new_mut, size=n_trials)
    welch_p[i] = st.ttest_ind(new_wt, new_mut, equal_var=False)[1] / 2

    # Compute Bayesian odds ratio
    res = analyze_bayes(new_wt, new_mut)
    odds_ratio[i] = np.exp(res['log_odds_ratio'])
    goodfit[i] = np.exp(res['log_goodfit'])

Now, let's plot these together. For the $x$-axis, we will plot the Welch's p-values to help visualize how the custom p-value and odds ratio vary with it. We will also plot the goodness of fit ratio to enable visualization of how the odds ratio is constructed.

In [29]:
p = bokeh.plotting.figure(x_axis_type='log',
                          y_axis_type='log',
                          x_axis_label="Welch's p-value",
                          plot_width=600,
                          plot_height=400)
p.circle(welch_p, cohen_p, color='#1f77b4', alpha=0.5, legend="Cohen's d p-value")
p.circle(welch_p, perm_test_p, color='#ffbb78', alpha=0.5, legend='permutation test p-value')
p.circle(welch_p, odds_ratio, color='#2ca02c', alpha=0.5, legend="odds ratio")
p.circle(welch_p, goodfit, color='#98df8a', alpha=0.5, legend="goodness of fit")


p.legend.location = 'top_left'
bokeh.io.show(p)

The goodness of fit ratio very closely tracks the Welch's p-value, differing by roughly a factor of 10 over most of the range. The custom p-value computed with the Cohen's d test statistic and the permutation test p-value are nearly equal to the Welch's p-value. The Bayesian odds ratio also tracks the p-values, but is almost two orders of magnitude larger. Remember, they are computing different things, but they are all in some way measures of the difference between two data sets, and they show similar trends.

But what is really striking here is the scale! Wow! In 500 repeats, we get p-values ranging over almost four orders of magnitude! That's three exclamations in a row! Four, now. Those exclamation points are there to highlight that the p-value is not a reproducible statistic at all.

The Bayesian odds ratio also varies over about four orders of magnitude, so it is not very reproducible either. The results of hypothesis tests and model comparisons tend to dance around a lot! Confidence intervals and credible regions do not dance nearly as much as p-values and odds ratios do.

The effect of sample size on dancing

The zebrafish sleep experiment had only about 20 samples, so maybe larger sample sizes will result in less extreme dancing of odds ratios and p-values. Let's do a numerical experiment to look at that. We will take 15, 20, 50, and 100 samples for our experimental "repeats" and investigate how the p-value and odds ratio vary.

In [30]:
n_new_data = 10000
n_samples = [15, 20, 50, 100]
odds = np.empty((n_new_data, len(n_samples)))
p_vals = np.empty((n_new_data, len(n_samples)))

# Do calcs!
for i in range(n_new_data):
    for j, n in enumerate(n_samples):
        # Generate new data
        new_wt = new_data(mu_wt, sigma_wt, n)
        new_mut = new_data(mu_mut, sigma_mut, n)

        # Compute odds ratio
        res = analyze_bayes(new_wt, new_mut)
        odds[i,j] = np.exp(res['log_odds_ratio'])

        # Compute p-value
        p_vals[i,j] = st.ttest_ind(new_wt, new_mut, equal_var=False)[1]/2

We'll plot ECDFs of the p-values and the odds ratios for each sample size.

In [34]:
# Make tidy data frames for convenient plotting
df_p = pd.DataFrame(data=p_vals, columns=['n = ' + str(n) for n in n_samples])
df_p = df_p.melt(var_name='n', value_name='p')
df_odds = pd.DataFrame(data=odds, columns=['n = ' + str(n) for n in n_samples])
df_odds = df_odds.melt(var_name='n', value_name='odds ratio')

# Make plots
p1 = bebi103.viz.ecdf_collection(df_p,
                                 cats=['n'],
                                 val='p',
                                 x_axis_label='p-value',
                                 x_axis_type='log',
                                 order=['n = 15', 'n = 20', 'n = 50', 'n = 100'],
                                 plot_width=700,
                                 plot_height=250,
                                 palette=bokeh.palettes.d3['Category20c'][4])
p2 = bebi103.viz.ecdf_collection(df_odds,
                                 cats=['n'],
                                 val='odds ratio',
                                 order=['n = 15', 'n = 20', 'n = 50', 'n = 100'],
                                 x_axis_label='odds ratio',
                                 x_axis_type='log',
                                 plot_width=700,
                                 plot_height=250,
                                 palette=bokeh.palettes.d3['Category20c'][8][4:])
p1.legend.location = 'top_left'
p2.legend.location = 'top_left'

bokeh.io.show(bokeh.layouts.gridplot([p1, p2], ncols=1))

We see that as the number of samples increases, both the odds ratio and the p-value still have large spreads, but their distributions shift leftward. This is because small differences between the samples can be discerned with large sample sizes. But notice that the p-value and odds ratio still vary over orders of magnitude for similar data sets.

Conclusion

This little exercise in reproducibility tells us that because the p-values and odds ratios "dance," we had better be sure the bonfire around which they dance is far to the left. This suggests large $n$.

I would argue that doing a "dance of the odds ratio" analysis is not a bad idea if you are trying to refute or support hypotheses. If a "repeat" of your experiment might give you an inconclusive odds ratio, you need to do more experiments.