import warnings
# Our numerical workhorses
import numpy as np
import pandas as pd
# Import plotting tools
import matplotlib.pyplot as plt
import seaborn as sns
# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline
# This enables high res graphics inline (only use with static plots (non-Bokeh))
# SVG is preferred, but there is a bug in Jupyter with vertical lines
%config InlineBackend.figure_formats = {'png', 'retina'}
# JB's favorite Seaborn settings for notebooks
rc = {'lines.linewidth': 2,
      'axes.labelsize': 18,
      'axes.titlesize': 18,
      'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
sns.set_style('darkgrid', rc=rc)
# Suppress future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In the previous tutorial, we studied parameter estimation from a frequentist perspective. We saw that if we are trying to estimate a parameter from a probability distribution, we should try to define an unbiased estimator for that parameter based on the expectation values of the distribution. Beyond that, we saw that we can do computational "experiments" to get samples from the repeated hypothetical experiments that define frequentist probability.
In this tutorial, we will apply this same computational experimental approach to do hypothesis testing. It is important to note that frequentist hypothesis testing is different from Bayesian model selection: in a frequentist hypothesis test, we only consider how probable it is to get the observed data under a specific hypothesis, often called the null hypothesis. We will not assess other hypotheses or compare them. Remember that the probability of a hypothesis being true is not something that makes any sense to a frequentist.
We will again use the data from Reeves, et al. for this tutorial. Let's load them in again.
# Load data into DataFrame
file_name = '../data/reeves_et_al/reeves_gradient_width_various_methods.csv'
df_fly = pd.read_csv(file_name, comment='#', header=[0, 1])
# Get columns we want
fly_wt = df_fly['wt']['cross-sections'].dropna().values
fly_venus = df_fly['dl1/+; dl-venus/+']['anti-Dorsal'].dropna().values
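Let's take a quick peek at what we extracted, just to check the sample sizes and the first few values of the arrays we will use for the rest of the tutorial.

# Quick look at the extracted arrays
print('WT:   ', len(fly_wt), 'measurements, e.g.,', fly_wt[:3])
print('Venus:', len(fly_venus), 'measurements, e.g.,', fly_venus[:3])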
Though there are many kinds of frequentist hypothesis tests, we will focus on the most commonly used type. The idea is to quantify the size of the effect of an experiment with some test statistic (such as a mean). We then propose a null hypothesis, which says that the effect is not real, and compute the probability of observing a value of the test statistic at least as extreme as the one we measured, given that the null hypothesis is true. This probability is called a p-value. If the p-value is small, the effect is said to be statistically significant.
We will go through this process first to test whether the dorsal gradient width in wild type is different from that in the Venus fusion. The strategy is to computationally repeat the experiment many times over according to the null hypothesis and count the results that show an effect size at least as large as the one we observed experimentally.
For our test statistic, we will use the mean of the gradient width. We estimated the means in the last tutorial.
# Mean and variance of Dorsal gradient widths
mu_wt = fly_wt.mean()
var_wt = fly_wt.var(ddof=1)
mu_venus = fly_venus.mean()
var_venus = fly_venus.var(ddof=1)
# Print results
print("""
µ σ n
---------------------------
WT {0:.4f} {2:.4f} {4:d}
Venus {1:.4f} {3:.4f} {5:d}
""".format(mu_wt, mu_venus, np.sqrt(var_wt), np.sqrt(var_venus),
len(fly_wt), len(fly_venus)))
The question now is whether we could get a difference in means this large if the WT and Venus measurements were generated by exactly the same process. So, our null hypothesis is that there is no difference between WT and Venus fusion flies. We then repeat the experiment under this hypothesis. To do this, we put all the measurements together, scramble them, and then choose 152 to be in one set and 29 in the other. We compute the absolute value of the difference of the means. We do this over and over again.
def permute_sample(x, y, stat_fun, args=(), n_trials=1000):
    """
    Considers two arrays `x` and `y` as one, then repeatedly
    scrambles the combined array and splits it into random samples
    of the same sizes as the original `x` and `y`. Computes the
    test statistic `stat_fun` for each pair of samples.
    """
    # Array to store samples
    samples = np.empty(n_trials)

    # Put the two data sets together
    vals = np.concatenate((x, y))

    for i in range(n_trials):
        # Permute and compute the test statistic
        vals = np.random.permutation(vals)
        samples[i] = stat_fun(vals[:len(x)], vals[len(x):], *args)

    return samples
def p_value_from_samples(samples, actual, greater_or_less='greater'):
    """
    Compute the fraction of samples that are at least as extreme as
    the actual value, either greater than or equal to it ('greater')
    or less than or equal to it ('less').
    """
    # Compute p-value
    if greater_or_less == 'greater':
        return np.sum(samples >= actual) / len(samples)
    else:
        return np.sum(samples <= actual) / len(samples)
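As a quick sanity check of these two functions (an illustrative sketch with made-up numbers, not part of the analysis of the fly data), we can apply them to two synthetic data sets drawn from the same Normal distribution. The null hypothesis is true by construction, so the p-value should not be small.

# Sanity check with synthetic data drawn from a single distribution
np.random.seed(42)
x_synth = np.random.normal(loc=0.15, scale=0.02, size=150)
y_synth = np.random.normal(loc=0.15, scale=0.02, size=30)

# Use the absolute difference of means as the test statistic
def abs_mean_diff(x, y):
    return np.abs(y.mean() - x.mean())

samples_synth = permute_sample(x_synth, y_synth, abs_mean_diff, n_trials=10000)
p_synth = p_value_from_samples(samples_synth, abs_mean_diff(x_synth, y_synth))
print('sanity check p-value =', p_synth)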
Let's do the test to see what the probability of getting the observed absolute difference in means is.
# Function for absolute value of difference in means
def mean_diff(x, y, return_abs=False):
    diff = y.mean() - x.mean()
    if return_abs:
        return np.abs(diff)
    return diff

# Get samples
samples = permute_sample(fly_wt, fly_venus, mean_diff, args=(True,),
                         n_trials=100000)
# Compute p-value
p = p_value_from_samples(samples, mean_diff(fly_wt, fly_venus, True))
print('p-value =', p)
So, we get a p-value of about 0.0002, which says that there is a low probability of observing a difference in means this large under the null hypothesis, which again is that the same process is generating the gradient widths in both fly strains. This p-value is low, so we would say there is a statistically significant difference between the two means.
We can see this more clearly by plotting the CDF of the samples.
# Make the CDF
x = np.sort(samples)
y = np.arange(len(samples)) / len(samples)
plt.plot(x, y, '.')
plt.margins(0.02)
plt.xlabel(r'$|\mu_\mathrm{wt} - \mu_{\mathrm{Venus}}|$')
plt.ylabel('cdf')
# Plot the true difference
plt.plot(mean_diff(fly_wt, fly_venus, True) * np.array([1, 1]),
         plt.gca().get_ylim(), '-', color=sns.color_palette()[2])
We see that very few samples have a difference in means as big as the one we observed, suggesting that this is very unlikely under the null hypothesis.
We can perform this type of permutation test with any statistic we are interested in. Let's try medians.
def median_diff(x, y, return_abs=False):
    diff = np.median(y) - np.median(x)
    if return_abs:
        return np.abs(diff)
    return diff

# Get samples
samples_median = permute_sample(fly_wt, fly_venus, median_diff, args=(True,),
                                n_trials=100000)
# Compute p-value
p = p_value_from_samples(samples_median, median_diff(fly_wt, fly_venus, True))
print('median p-value =', p)
Or with the variance.
def variance_diff(x, y, return_abs=False):
    diff = y.var() - x.var()
    if return_abs:
        return np.abs(diff)
    return diff

# Get samples
samples_var = permute_sample(fly_wt, fly_venus, variance_diff, args=(True,),
                             n_trials=100000)
# Compute p-value
p = p_value_from_samples(samples_var, variance_diff(fly_wt, fly_venus, True))
print('variance p-value =', p)
We see that while it is unlikely to get such a large difference in medians under the null hypothesis, there is some chance we could get a difference in variance this large.
We can also check Cohen's d, a standard deviation weighted effect size we discussed in class.
def cohen_d(x, y, return_abs=False):
    """
    Cohen's d
    """
    diff = y.mean() - x.mean()
    pooled_variance = (len(x) * x.var() + len(y) * y.var()) / (len(x) + len(y))

    if return_abs:
        return np.abs(diff) / np.sqrt(pooled_variance)
    return diff / np.sqrt(pooled_variance)

# Get samples
samples_d = permute_sample(fly_wt, fly_venus, cohen_d, args=(True,),
                           n_trials=100000)

# Compute p-value
p = p_value_from_samples(samples_d, cohen_d(fly_wt, fly_venus, True))
print("Cohen's d p-value =", p)
Again, we would report this result as "statistically significant."
So far, we have looked at the absolute value of the difference between the statistics. We could instead look at the signed difference of the respective statistics. We might get different results.
# Compute p-values
samples = permute_sample(fly_wt, fly_venus, mean_diff, n_trials=100000)
p = p_value_from_samples(samples, mean_diff(fly_wt, fly_venus))
print('Mean p-value =', p)
samples = permute_sample(fly_wt, fly_venus, median_diff, n_trials=100000)
p = p_value_from_samples(samples, median_diff(fly_wt, fly_venus))
print('Median p-value =', p)
samples = permute_sample(fly_wt, fly_venus, variance_diff, n_trials=100000)
p = p_value_from_samples(samples, variance_diff(fly_wt, fly_venus))
print('Variance p-value =', p)
samples = permute_sample(fly_wt, fly_venus, cohen_d, n_trials=100000)
p = p_value_from_samples(samples, cohen_d(fly_wt, fly_venus))
print('Cohen-d p-value =', p)
The p-values have gone down now that we are not taking absolute values. That is, we are no longer asking only whether the magnitude of the effect is expected under the null hypothesis, but also whether its sign is.
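To make the relationship between the two versions of the test concrete, note that for a test statistic whose permutation distribution is roughly symmetric about zero, the p-value computed from the absolute difference is about twice the signed (one-sided) p-value. Here is a minimal sketch for the difference in means, reusing the functions defined above.

# One-sided vs. two-sided p-values for the difference in means
samples_signed = permute_sample(fly_wt, fly_venus, mean_diff, n_trials=100000)
actual = mean_diff(fly_wt, fly_venus)

p_one_sided = p_value_from_samples(samples_signed, actual)
p_two_sided = np.sum(np.abs(samples_signed) >= np.abs(actual)) / len(samples_signed)

print('one-sided p-value =', p_one_sided)
print('two-sided p-value =', p_two_sided)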
You have no doubt heard of many named frequentist hypothesis tests, like the Student-t test, Welch's t-test, the Mann-Whitney U-test, and countless others. What is with all of those names? It helps to think more generally about how frequentist hypothesis testing is usually done.
To do a frequentist hypothesis test, people typically use the following prescription (borrowing heavily from the treatment in Gregory's excellent book).
1) Choose a null hypothesis. This is the hypothesis you want to test the truth of. In our case, it's the hypothesis that the means are the same, $|\mu_1 - \mu_2| = 0$.
2) Choose a suitable statistic that can be computed from measurements and has a predictable distribution. For this example, we choose as our statistic
\begin{align}
T &= \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{S_D\sqrt{n_1^{-1}+n_2^{-1}}},\\[1em]
\text{where } S_D^2 &= \frac{(n_1 - 1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}, \\[1em]
\text{with } S_1^2 &= \frac{1}{n_1-1}\sum_{i\in D_1}(x_i - \bar{x}_1)^2,
\end{align}

and $S_2^2$ similarly defined. The $T$ statistic is the difference between the observed difference in means and the true difference in means, scaled by the spread in the data. This is a reasonable statistic for determining something about means from data. It is the appropriate statistic when $\sigma_1$ and $\sigma_2$ are both unknown but assumed to be equal. (When they are assumed to be unequal, you need to adjust the statistic; that test is called Welch's t-test.) It can be derived that this statistic has the Student-t distribution,
\begin{align}
P(t) &= \frac{1}{\sqrt{\pi \nu}}\,\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}},\\[1em]
\text{where } \nu &= n_1+n_2-2.
\end{align}

3) Evaluate the statistic from measured data. In the case of the Student-t test, we compute $T$, setting $\mu_1 - \mu_2 = 0$ as dictated by the null hypothesis.
4) Plot $P(t)$. The area under the curve for $|t| > |T|$ is the p-value, the probability of observing a test statistic at least as extreme as the one we computed, given the null hypothesis that $|\mu_1 - \mu_2| = 0$. Reject the null hypothesis if this is small. (A quick numerical check of this prescription using scipy.stats is sketched below.)
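As a sketch of how this prescription looks in practice (assuming scipy is available, as it is in any standard scientific Python installation), we can run the Student-t test and Welch's t-test directly with scipy.stats.ttest_ind, which returns the $T$ statistic and the two-sided p-value.

import scipy.stats

# Student-t test: assumes sigma_1 = sigma_2 (equal_var=True)
t, p_t = scipy.stats.ttest_ind(fly_wt, fly_venus, equal_var=True)
print('Student-t test:  T =', t, ' p-value =', p_t)

# Welch's t-test: does not assume equal variances (equal_var=False)
t_w, p_w = scipy.stats.ttest_ind(fly_wt, fly_venus, equal_var=False)
print("Welch's t-test:  T =", t_w, ' p-value =', p_w)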
As you can see from the above prescription, item 2 can be tricky. Coming up with test statistics that also have a distribution that we can write down is difficult. When such a test statistic is found, the test usually gets a name.
In the approach we have taken, we can invent any test statistic we want, and we can test it by numerically "repeating" the experiment, in accordance with the frequentist definition of probability.
So, I would encourage you not to get caught up in names. If someone reports a p-value with a name, simply look up what test statistic they are using, and that will give you an understanding of what is going on with the test.
There are many such tests.
I should also mention that there are names for many of the tests we did in this tutorial, and there are countless other tests. Many of these tests have analytical forms and can be rapidly computed. Most are included in scipy.stats. I have chosen to present a method of hypothesis testing that is intuitive, with the frequentist definition of probability front and center. It also allows you to design your own tests that fit a null hypothesis you are interested in that might not be "off-the-shelf."
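For example (again just a sketch using scipy.stats), the Mann-Whitney U-test mentioned above is a one-liner:

import scipy.stats

# Mann-Whitney U-test comparing the two sets of gradient widths
u, p_u = scipy.stats.mannwhitneyu(fly_wt, fly_venus, alternative='two-sided')
print('Mann-Whitney U-test:  U =', u, ' p-value =', p_u)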
That said, given the many problems with p-values, I generally advocate for their abandonment. They seldom answer the question scientists are actually asking and lead to great confusion.