Dancing statistics

Dataset download

This lesson was inspired by Geoff Cumming.


[1]:
import numpy as np
import pandas as pd
import scipy.stats as st
import numba

import tqdm

import bokeh_catplot

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

In this fun exercise, we will investigate how replicable certain statistical conclusions are. What I mean by replicability is best understood be working through this notebook.

For this lesson we will use the zebrafish embryo sleep data from the Prober lab. A description of their work on the genetic regulation of sleep can be found on the research page of the lab website. In particular, the movie below comes from their experiments watching moving/sleeping larvae over time.

The data we will use are processed from raw data published in Gandhi et al., 2015. In their experiment they were studying the effect of a deletion in the gene coding for arylalkylamine N-acetyltransferase (aanat), which is a key enzyme in the rhythmic production of melatonin. Melatonin is a hormone responsible for regulation of circadian rhythms. It is often taken as a drug to treat sleep disorders. The goal of this study is to investigate the effects of aanat deletion on sleep pattern in 5+ day old zebrafish larvae.

Among other sleep properties, they measured the mean rest bout length on the sixth night, comparing wild type larvae to the homozygous mutant. A rest bout is defined as a period of time in which the fish does not move. The length of a rest bout is just the amount of time the fish is still. We are primarily interested in the difference in the mean bout lengths between the two genotypes. The processed data are found here.

Let’s load the data.

[2]:
# Load data
df = pd.read_csv('../data/mean_rest_bouts.csv', comment='#')

# Pull out wild type and mutant and drop NAs
df = df[df['genotype'].isin(['wt', 'mut'])].dropna()

Let’s look at these data with an ECDF.

[34]:
p = bokeh_catplot.ecdf(
    df,
    cats="genotype",
    val="mean_rest_bout_length",
    order=["wt", "mut"],
    x_axis_label="mean rest bout length (min)",
    style='staircase',
)

bokeh.io.show(p)

The distribution for the mutant appears to be shifted to the left of the wild type, meaning that the mutant has shorted rest bouts. Let’s add 95% confidence intervals to the plot to see how the ECDFs might change if we did the experiment again.

[36]:
p = bokeh_catplot.ecdf(
    df,
    cats="genotype",
    val="mean_rest_bout_length",
    order=["wt", "mut"],
    x_axis_label="mean rest bout length (min)",
    style='staircase',
    conf_int=True,
)

bokeh.io.show(p)

These is strong overlap of the confidence intervals, so it may just be that the differences are due to the finite sample size. We will investigate further with some modeling.

Cohen’s d

Cohen’s d is a commonly used measure of effect size in comparison of two data sets. It is the ratio of the difference of means compared to a pooled standard deviation.

\begin{align} d = \frac{|\bar{x} - \bar{y}|}{\sqrt{\left.(n_x \hat{\sigma}_x^2 + n_y \hat{\sigma}_y^2) \middle/ (n_x + n_y - 2)\right.}}. \end{align}

Here, \(\bar{x}\) is the plug-in estimate for the mean of the data from sample \(x\), \(\hat{\sigma}_x^2\) is the plug-in estimate for the variance from sample \(x\), and \(n_x\) is the number of measurements in sample \(x\). The values for sample \(y\) are similarly defined.

Roughly speaking, Cohen’s d tells us how different the means of the data sets are compared to the variability in the data. A large Cohen’s d means that the effect is large compared to the variability of the measurement.

Estimates of the difference of means and Cohen’s d

First, we will compute nonparametric estimates from the data. We will estimate the difference in the mean bout length and Cohen’s d. For speed, we save the two data sets as NumPy arrays.

[4]:
wt = df.loc[df.genotype=='wt', 'mean_rest_bout_length'].values
mut = df.loc[df.genotype=='mut', 'mean_rest_bout_length'].values

Next, we’ll write some functions to conveniently generate bootstrap replicates and do our hypothesis tests. These borrow heavily from the lessons on hacker stats.

[5]:
@numba.jit(nopython=True)
def cohen_d(x, y, return_abs=False):
    """Cohen's d for two data sets."""
    diff = x.mean() - y.mean()
    pooled_variance = (len(x) * np.var(x) + len(y) * np.var(y)) / (len(x) + len(y) - 2)

    if return_abs:
        return np.abs(diff) / np.sqrt(pooled_variance)
    return diff / np.sqrt(pooled_variance)


@numba.jit(nopython=True)
def t_stat(x, y):
    """Welch's t-statistic."""
    return (np.mean(x) - np.mean(y)) / np.sqrt(
        np.var(x) / (len(x) - 1) + np.var(y) / (len(y) - 1)
    )


@numba.jit(nopython=True)
def draw_perm_sample(x, y):
    """Generate a permutation sample."""
    concat_data = np.concatenate((x, y))
    np.random.shuffle(concat_data)
    return concat_data[: len(x)], concat_data[len(x) :]


@numba.jit(nopython=True)
def draw_bs_sample(data):
    """Draw a single bootstrap sample."""
    return np.random.choice(data, size=len(data))


@numba.jit(nopython=True)
def draw_perm_reps_t(x, y, size=10000):
    out = np.empty(size)
    for i in range(size):
        x_perm, y_perm = draw_perm_sample(x, y)
        out[i] = t_stat(x_perm, y_perm)
    return out


@numba.jit(nopython=True)
def draw_bs_reps_mean(data, size=10000):
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out


@numba.jit(nopython=True)
def draw_bs_reps_diff_mean(x, y, size=10000):
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(x)) - np.mean(draw_bs_sample(y))
    return out


@numba.jit(nopython=True)
def draw_bs_reps_cohen_d(x, y, size=10000, return_abs=False):
    out = np.empty(size)
    for i in range(size):
        out[i] = cohen_d(draw_bs_sample(x), draw_bs_sample(y), return_abs)
    return out


@numba.jit(nopython=True)
def draw_bs_reps_t(x, y, size=10000):
    """
    Bootstrap replicates using the Welch's t-statistic.
    """
    out = np.empty(size)
    for i in range(size):
        out[i] = t_stat(draw_bs_sample(x), draw_bs_sample(y))
    return out

Now we can compute the replicates. First, let’s look at the means and their respective confidence intervals.

[6]:
wt_reps = draw_bs_reps_mean(wt)
mut_reps = draw_bs_reps_mean(mut)

And from these, compute the 95% confidence interval.

[7]:
wt_mean_conf_int = np.percentile(wt_reps, [2.5, 97.5])
mut_mean_conf_int = np.percentile(mut_reps, [2.5, 97.5])

p = bebi103.viz.plot_with_error_bars(
    centers=[wt.mean(), mut.mean()],
    confs=np.vstack([wt_mean_conf_int, mut_mean_conf_int]),
    names=["WT", "mutant"],
)

bokeh.io.show(p)

There is some overlap in the confidence interval, though wild type tend to have longer rest bouts. Just looking at these numbers, we may not be all that certain that there is a discernible difference between wild type and mutant.

Now, let’s look at the difference in the mean rest bout lengths, which we define as \(\delta = \bar{x}_\mathrm{wt} - \bar{x}_\mathrm{mut}\).

[8]:
reps = draw_bs_reps_diff_mean(wt, mut)
diff_mean_conf_int = np.percentile(reps, [2.5, 97.5])

print(
    "δ = WT - MUT:  [{1:.2f}, {0:.2f}, {2:.2f}] minutes".format(
        np.mean(wt) - np.mean(mut), *tuple(diff_mean_conf_int)
    )
)
δ = WT - MUT:  [-0.06, 0.31, 0.65] minutes

As we might expect, on the tail end of the confidence interval for the difference of means, we see that the mutant might actually have longer rest bouts that wild type.

Finally, lets compute the Cohen’s d to check the effect size.

[9]:
reps = draw_bs_reps_cohen_d(wt, mut)
cohen_d_conf_int = np.percentile(reps, [2.5, 97.5])

print(
    "WT - MUT Cohen's d:  [{1:.2f}, {0:.2f}, {2:.2f}] minutes".format(
        cohen_d(wt, mut), *tuple(cohen_d_conf_int)
    )
)
WT - MUT Cohen's d:  [-0.08, 0.54, 1.42] minutes

So, the effect size is 0.54, meaning that the mutant tends to have rest bouts 0.5 standard deviations as large as wild type fix. Jacob Cohen would call this a “medium” sized effect.

Null hypothesis significance testing

We will now perform an NHST on these two data sets. We formulate the hypothesis as follows.

  • \(H_0\): Wild type and mutant fish have the same mean rest about length.

  • Test statistic: Cohen’s d.

  • At least as extreme as: Cohen’s d larger than what was observed.

We can then perform a bootstrap hypothesis test.

[10]:
@numba.jit(nopython=True)
def cohen_nhst(wt, mut, size=100000):
    """
    Perform hypothesis test assuming equal means, using
    Cohen-d as test statistic.
    """
    # Shift data sets so that they have the same mean.
    wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))
    mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))

    # Draw replicates of Cohen's d
    reps = draw_bs_reps_cohen_d(wt_shifted, mut_shifted, size=size)

    # Compute p-value
    return np.sum(reps >= cohen_d(wt, mut)) / len(reps)

print("Cohen's d p-value:", cohen_nhst(wt, mut))
Cohen's d p-value: 0.06738

We get a p-value of about 0.07, which, if we use the typical bright line p-value for statistical significance, we would say that this difference is not statistically significant. We could also test what would happen if we used a different test statistic, like the difference of means.

[11]:
# Shift data sets so that they have the same mean.
wt_shifted = wt - np.mean(wt) + np.mean(np.concatenate((wt, mut)))
mut_shifted = mut - np.mean(mut) + np.mean(np.concatenate((wt, mut)))

# Draw replicates of difference of means
reps = draw_bs_reps_diff_mean(wt_shifted, mut_shifted, size=100000)

# Compute p-value
p_val = np.sum(reps >= np.mean(wt) - np.mean(mut)) / len(reps)

print('Difference of means p-value:', p_val)
Difference of means p-value: 0.03939

Here, we get a p-value of about 0.04. We would say that the result is statistically significant if we used a bright line value of 0.05.

Finally, let’s try a canonical test for this circumstance, the Welch’s t-test. As a reminder, the test statistic for the Welch’s t-test is

\begin{align} T = \frac{\bar{x}_w - \bar{x}_m}{\sqrt{\hat{\sigma}_w^2/n_w + \hat{\sigma}_m^2/n_m}}, \end{align}

where \(\hat{\sigma}_w^2\) and \(\hat{\sigma}_m^2\) are plug-in estimates for the variances. Importantly, when performing a Welch’s t-test, Normality of the two samples is assumed. So, the hypothesis test is defined as follows.

  • \(H_0\): The two samples are both Normally distributed with equal means.

  • Test statistic: t-statistic.

  • At least as extreme as: t-statistic (wild type minus mutant) greater than or equal to what was observed.

This is implemented as scipy.stats.ttest_ind() using the kwarg equal_var=False. We divide by two to get the one-tailed test. Note that Welch’s t-test is not exact, but is asymptotically exact for large sample sizes.

[12]:
print("Welch's p-value:", st.ttest_ind(wt, mut, equal_var=False)[1]/2)
Welch's p-value: 0.052542004908830583

Here, we are just above the bright line value of 0.05. We can perform a similar hypothesis test without the Normal assumption using the same test statistic as in the Welch’s test.

[13]:
# Draw replicates of t statistic
reps = draw_bs_reps_t(wt_shifted, mut_shifted, size=100000)

# Compute p-value
p_val = np.sum(reps >= t_stat(wt, mut)) / len(reps)

print("Welch's t-test without Normal assumption p-value:", p_val)
Welch's t-test without Normal assumption p-value: 0.06448

Finally, we will perform a permutation test. This test is specified as follows.

  • \(H_0\): The sleep bout lengths of mutant and wild type fish are identically distributed.

  • Test statistic: Welch’s t-statistic

  • At least as extreme as : difference of means is greater than what was observed.

[14]:
# Draw permutation replicates
reps = draw_perm_reps_t(wt, mut, size=100000)

# Compute p-value
p_val = np.sum(reps >= t_stat(wt, mut)) / len(reps)

print("Permutation test p-value:", p_val)
Permutation test p-value: 0.05032

So, all of our tests give p-values that are close to each other, ranging from about 0.04 to 0.07. If we choose bright line p-values to deem something as significant or not, some similar hypothesis/test statistic pairs can give different results. So, my advice is do not use brightline p-values. You went through the trouble of computing the p-value, just report it and leave it at that. Don’t change a float to a bool.

Model comparison

As an alternative to NHST, we can ask a similar (but different) question. We can compare two generative models. In one model, wild type and mutant sleep mean bout lengths come from the same Normal distribution. In the other, they come from different Normal distributions. We can compute an AIC for each model and then the Akaike weight for the first model (that they come from the same Normal distribution). Again, we can use the convenient feature that for Normal distributions, the MLE is given by the plug-in estimates for the parameters.

[15]:
def akaike_weight(wt, mut):
    """Compute the Akaike weight for model 1"""
    x_concat = np.concatenate((wt, mut))
    mu_1 = np.mean(x_concat)
    sigma_1 = np.std(x_concat)
    aic_1 = -2 * (st.norm.logpdf(x_concat, mu_1, sigma_1).sum() - 2)

    mu_wt = np.mean(wt)
    sigma_wt = np.std(wt)
    mu_mut = np.mean(mut)
    sigma_mut = np.std(mut)
    aic_2 = -2 * (
        st.norm.logpdf(wt, mu_wt, sigma_wt).sum()
        + st.norm.logpdf(mut, mu_mut, sigma_mut).sum()
        - 4
    )

    aic_max = max(aic_1, aic_2)

    return np.exp(-(aic_1 - aic_max) / 2) / (
        np.exp(-(aic_1 - aic_max) / 2) + np.exp(-(aic_2 - aic_max) / 2)
    )

Now that we have this function, we can compute the Akaike weight for this data set.

[16]:
akaike_weight(wt, mut)
[16]:
0.598909687738875

The Akaike weight says that we should slightly favor the model where the data points come from the same Normal distribution. This runs contrary to our hypothesis tests. The AIC is penalizing the model with more parameters.

Dancing

We will now do a fun, instructive experiment. We will “re-acquire” the data by drawing random samples out of Normal distributions parametrized by the maximum likelihood estimates we obtain from the data. (Recall that the MLE for the mean and variance of a Normally distributed random variable is given by the plug-in estimates.) We will then compute the confidence interval and credible region for \(\delta\) and see how they vary from experiment to experiment. We will later repeat this with p-values and odds ratios.

The idea here is that if the data are indeed Gaussian distributed, we are looking at data that could plausibly be generated in an identical experiment.

First, we’ll write a function to generate new data and use it to generate 500 new data sets.

[17]:
def new_data(mu, sigma, n):
    """Generate new data"""
    return np.maximum(np.random.normal(mu, sigma, n), 0.01)

# Values from real data
mu_wt = np.mean(wt)
mu_mut = np.mean(mut)
sigma_wt = np.std(wt, ddof=0)
sigma_mut = np.std(mut, ddof=0)

# How many new data sets to generate
n_new_data = 500

# Generate new data
new_wt = [new_data(mu_wt, sigma_wt, len(wt)) for _ in range(n_new_data)]
new_mut = [new_data(mu_mut, sigma_mut, len(mut)) for _ in range(n_new_data)]

Now we can do the calculations. First, we’ll compute the confidence intervals for \(\delta\).

[18]:
# Set up arrays for storing results
conf_int = np.empty((n_new_data, 2))
delta = np.empty(n_new_data)

# Do calcs!
for i, (wt_data, mut_data) in enumerate(tqdm.tqdm(zip(new_wt, new_mut))):
    # Compute confidence interval
    bs_reps = draw_bs_reps_diff_mean(wt_data, mut_data)
    conf_int[i,:] = np.percentile(bs_reps, (2.5, 97.5))

    # Sample difference of means
    delta[i] = wt_data.mean() - mut_data.mean()

# Store the results conveniently
df_res = pd.DataFrame(columns=['conf_low', 'conf_high', 'delta'],
                       data=np.hstack((conf_int, delta.reshape(n_new_data, 1))))
500it [00:05, 92.80it/s]

Next, we can do some null hypothesis significance testing. We will compute three p-values, our custom bootstraped p-value with Cohen’s d, the p-value from a permutaiton test, and a p-value from Welch’s t-test. Remember, with our custom bootstrapped p-value, the hypothesis is that the mutant and wild type sleep bout lengths were drawn out of distributions of the same mean (and no other assumptions). The test statistic is Cohen’s d. The hypothesis in the permutation test is that the two data sets are identically distributed. The hypothesis in Welch’s t-test is that the mutant and wild type were drawn from Normal distributions with the same mean, but with difference variances. The test statistic is a t-statistic, defined above.

[19]:
# Set up arrays for storing results
cohen_p = np.empty(n_new_data)
perm_test_p = np.empty(n_new_data)
welch_p = np.empty(n_new_data)

@numba.jit(nopython=True)
def perm_test_t(wt, mut, size=100000):
    reps = draw_perm_reps_t(wt, mut, size=size)
    return np.sum(reps >= t_stat(wt, mut)) / len(reps)

# Do calcs!
for i, (wt_data, mut_data) in enumerate(tqdm.tqdm(zip(new_wt, new_mut))):
    # Compute p-values
    cohen_p[i] = cohen_nhst(wt_data, mut_data)
    perm_test_p[i] = perm_test_t(wt_data, mut_data)
    welch_p[i] = st.ttest_ind(wt_data, mut_data, equal_var=False)[1] / 2

df_res['cohen_p'] = cohen_p
df_res['perm_p'] = perm_test_p
df_res['welch_p'] = welch_p
500it [01:41,  4.91it/s]

Finally, we can compute the Akaike weights for all of our generated data sets.

[20]:
df_res["akaike_weight"] = [
    akaike_weight(wt_data, mut_data) for wt_data, mut_data in zip(new_wt, new_mut)
]

Dancing confidence intervals(?)

To visualize the results, we’ll plot the confidence intervals. We’ll plot the confidence interval as a bar, and then the bounds of the credible region as dots. For ease of viewing, we will only plot 100 of these and will sort them by the plug-in estimate for δ.

[21]:
# Sort by delta for easy display
df_res_sorted = df_res.sort_values(by="delta").iloc[::5]

# Set up figure
p = bokeh.plotting.figure(plot_height=250, plot_width=700, y_axis_label="δ (min)")

# Populate glyphs
p.circle(np.arange(len(df_res_sorted)), df_res_sorted['delta'])
x_conf = [[i, i] for i in range(len(df_res_sorted))]
y_conf = [[r["conf_low"], r["conf_high"]] for _, r in df_res_sorted.iterrows()]
p.multi_line(x_conf, y_conf, line_width=2, color="#1f77b4")

# Turn off axis ticks for x
p.xaxis.visible = False
p.xgrid.visible = False

bokeh.io.show(p)

The confidence interval can vary from experiment to experiment, but not that much. That is, the confidence intervals “dance” around, but all within about a factor of 3 of the original observed values.

Dancing: p-values

Now, let’s look at the p-values and the Akaike weights.

[22]:
p = bokeh.plotting.figure(
    x_axis_type="log",
    y_axis_type="log",
    x_axis_label="Welch's p-value",
    plot_width=600,
    plot_height=400,
)
p.circle(
    df_res["welch_p"],
    df_res["cohen_p"],
    color="#1f77b4",
    alpha=0.5,
    legend_label="Cohen's d p-value",
)
p.circle(
    df_res["welch_p"],
    df_res["perm_p"],
    color="#ffbb78",
    alpha=0.5,
    legend_label="permuation test p-value",
)
p.circle(
    df_res["welch_p"],
    df_res["akaike_weight"],
    color="#2ca02c",
    alpha=0.5,
    legend_label="Akaike weight",
)

p.legend.location = "bottom_right"
bokeh.io.show(p)

The custom p-value computed with a Cohen’s d test statistic and the permutation test p-value are nearly equal to the Welch’s p-value.

But what is really striking here is the scale! Wow! In 500 repeats, we get p-values ranging over four orders of magnitude! That’s three exclamations in a row! Four, now. Those exclamation points are there to highlight that the p-value is not a reproducible statistic at all.

The Akaike weights are less variable, and more conservative. Not many of them dip below 0.1, and we would be unlikely to select one model against another for most of the values of the Akaike weights we calculated. However, they are still rather variable, and in 500 repeats the can var over many orders of magnitude as well. Though better than the p-values, they are still not terribly reproducible.

Conversely, both confidence intervals don’t really dance much, and p-values and odds ratios dance like this.

The effect of sample size on dancing

The zebrafish sleep experiment had only about 20 samples, so maybe larger sample sizes will result in less extreme dancing of p-values. Let’s do a numerical experiment to look at that. We will take 15, 20, 50, and 100 samples for our experimental “repeats” and investigate how the p-value varies. For speed, we will only compute the p-value from Welch’s t-test, which we showed previously to track closely with our custom p-values.

[23]:
n_new_data = 1000
n_samples = [15, 20, 50, 100]
p_vals = np.empty((n_new_data, len(n_samples)))
akaike_weights = np.empty((n_new_data, len(n_samples)))

# Do calcs!
for i in tqdm.tqdm(range(n_new_data)):
    for j, n in enumerate(n_samples):
        # Generate new data
        new_wt = new_data(mu_wt, sigma_wt, n)
        new_mut = new_data(mu_mut, sigma_mut, n)

        # Compute p-values and Akaikie weights
        p_vals[i,j] = st.ttest_ind(new_wt, new_mut, equal_var=False)[1]/2
        akaike_weights[i, j] = akaike_weight(new_wt, new_mut)
100%|██████████| 1000/1000 [00:03<00:00, 278.29it/s]

Let’s look at the ECDFs of p-values and Akaike weights.

[24]:
# Make tidy data frames for convenient plotting
df_p = pd.DataFrame(data=p_vals, columns=["n = " + str(n) for n in n_samples])
df_p = df_p.melt(var_name="n", value_name="p")
df_akaike = pd.DataFrame(
    data=akaike_weights, columns=["n = " + str(n) for n in n_samples]
)
df_akaike = df_akaike.melt(var_name="n", value_name="akaike_weight")

# Make plots
p1 = bokeh_catplot.ecdf(
    df_p,
    cats=["n"],
    val="p",
    x_axis_label="p-value",
    x_axis_type="log",
    order=["n = 15", "n = 20", "n = 50", "n = 100"],
    frame_width=650,
    frame_height=225,
    palette=bokeh.palettes.d3["Category20c"][4],
)
p2 = bokeh_catplot.ecdf(
    df_akaike,
    cats=["n"],
    val="akaike_weight",
    order=["n = 15", "n = 20", "n = 50", "n = 100"],
    x_axis_label="Akaike weight",
    x_axis_type="log",
    frame_width=650,
    frame_height=225,
    palette=bokeh.palettes.d3["Category20c"][8][4:],
)
p1.legend.location = "top_left"
p2.legend.location = "top_left"
p1.x_range = p2.x_range

bokeh.io.show(bokeh.layouts.gridplot([p1, p2], ncols=1))

We see that even though the p-value and Akaike weight have large spreads as the number of samples increases, they also shift leftward. This is because small differences in samples can be discerned with large sample sizes. But notice that the p-value and the Akaike weight varies over orders of magnitude for similar data set.

Conclusion

This little exercise in reproducibility tells use that because the p-values “dance”, and to a lesser extent so do the Akaike weights, we had better be sure the dancefloor is far to the left. This suggests large \(n\) is needed.

I would argue that you should do a similar “dancing” analysis of your data sets when you have a reasonable generative model in mind so that you can decide what constitutes a small p-value of Akaike weight.

Computing environment

[25]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,numba,tqdm,bokeh,bokeh_catplot,bebi103,jupyterlab
CPython 3.7.5
IPython 7.1.1

numpy 1.17.4
scipy 1.3.1
pandas 0.24.2
numba 0.46.0
tqdm 4.40.0
bokeh 1.4.0
bokeh_catplot 0.1.7
bebi103 0.0.45
jupyterlab 1.2.3