Null hypothesis significance testing

Imagine we do an experiment where we make repeated measurements under control conditions and another set of repeated measurements under test conditions. We may wish to evaluate a hypothesis that there is some similarity in the data generation process between the control and test. If such a similar data generating process cannot produce the observed results, we might like to reject the hypothesis that the processes are similar. Null hypothesis significance testing (NHST) formalizes this procedure for evaluating hypotheses.

Steps of an NHST

A typical hypothesis test consists of these steps.

  1. Clearly state the null hypothesis.

  2. Define a test statistic, a scalar value that you can compute from data, almost always a statistical functional of the empirical distribution. Compute it directly from your measured data.

  3. Simulate data acquisition for the scenario where the null hypothesis is true. Do this many times, computing and storing the value of the test statistic each time.

  4. The fraction of simulations for which the test statistic is at least as extreme as the test statistic computed from the measured data is called the p-value, which is what you report.

We need to be clear on our definition here. The p-value is the probability of observing a test statistic at least as extreme as what was measured, given that the null hypothesis is true. It is exactly that, and nothing else. It is not the probability that the null hypothesis is true. In the frequentist interpretation of probability, we cannot assign a probability to the truth of a hypothesis because a hypothesis is not a random variable.
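To make these steps concrete, here is a minimal sketch in Python of how such a computation might be organized. The functions test_stat() and draw_simulated_data() are hypothetical placeholders you would supply for your particular test statistic and null hypothesis, and "at least as extreme" is taken here to mean greater than or equal, which you should adjust to fit your own test.

```python
import numpy as np

def nhst_p_value(data, test_stat, draw_simulated_data, size=10000):
    """Compute a p-value by simulating the null hypothesis.

    `test_stat(data)` returns a scalar test statistic and
    `draw_simulated_data()` generates one simulated data set under
    the null hypothesis; both are placeholders for your own test.
    """
    # Step 2: compute the test statistic from the measured data
    stat_measured = test_stat(data)

    # Step 3: simulate data acquisition under the null hypothesis many
    # times, computing and storing the test statistic each time
    stat_null = np.array(
        [test_stat(draw_simulated_data()) for _ in range(size)]
    )

    # Step 4: the p-value is the fraction of simulations with a test
    # statistic at least as extreme (here, at least as large) as measured
    return np.sum(stat_null >= stat_measured) / size
```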

Rejecting the null hypothesis

The above prescription of an NHST ends with a p-value. Many researchers do not end there, but instead end with a decision: either rejecting or failing to reject the null hypothesis. To take this approach, we adjust the procedure above, the first two steps being the same.

  1. Clearly state the null hypothesis.

  2. Define a test statistic, a scalar value that you can compute from data, almost always a statistical functional of the empirical distribution. Compute it directly from your measured data.

  3. Decide on how you will determine the rejection region from the null distribution, which is defined in the next step.

  4. Simulate data acquisition for the scenario where the null hypothesis is true. Do this many times, computing and storing the value of the test statistic each time. This is sampling out of the null distribution, which is the distribution of values of the test statistic under the assumption that the null hypothesis is true.

  5. If the observed test statistic lies within the rejection region, reject the null hypothesis. Otherwise, do not reject it.

We are free to choose the rejection region however we like; we just have to be explicit about how we do it. A common method is to choose a significance level \(\alpha\), such that the values of the test statistic within the rejection region carry a total probability mass of \(\alpha\) under the null distribution, with \(\alpha\) usually small. This approach is equivalent to computing the p-value and rejecting the null hypothesis if the p-value is less than \(\alpha\). (I generally do not advocate such black-and-white decisions, as described in the next section of this lesson.)

Importantly, note that the significance level (or more generally the means of computing the rejection region) is decided before simulating the null hypothesis.
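In code, this decision rule amounts to a single comparison. In the sketch below, both the significance level and the p-value are hypothetical numbers standing in for a level you would choose before simulating and a p-value you would compute as in the procedure above.

```python
alpha = 0.05     # significance level, decided before simulating the null
p_value = 0.021  # hypothetical p-value computed as in the procedure above

# Equivalent to asking whether the observed test statistic falls within a
# rejection region carrying probability mass alpha under the null
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Do not reject the null hypothesis.")
```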

Defining a test

A hypothesis test is defined by the null hypothesis, the test statistic, and what it means to be at least as extreme. That uniquely defines the hypothesis test you are doing. All of the named hypothesis tests, like Student's t-test, the Mann-Whitney U-test, Welch's t-test, etc., describe a specific null hypothesis, a specific test statistic, and a specific definition of what it means to be at least as extreme (e.g., one-tailed or two-tailed). I can never remember what these are, nor do I encourage you to; you can always look them up. Rather, you should just clearly write out what your test is in terms of the hypothesis, test statistic, and definition of extremeness.

Performing tests using hacker stats

Now, the real trick to doing a hypothesis test is simulating data acquisition assuming the null hypothesis is true. For some tests, that is, for some trios of null hypothesis, test statistic, and meaning of “at least as extreme as,” simulation is unnecessary because analytical results exist. This is the case for many of the named tests. However, in many cases, these are not the hypotheses you wish to test. Because you have random number generation at your fingertips, you can test the hypotheses you want to test, not just the ones that have names.

I will demonstrate two hypothesis tests and how we can simulate them. For both examples, we will consider the commonly encountered problem of performing the same measurements under two different conditions, control and test. You might have in mind the example of C. elegans egg lengths for well-fed mothers versus starved mothers.

Test and control come from the same distribution.

Here, the null hypothesis is that the distribution \(F\) of the control (wild type) measurements is the same as the distribution \(G\) of the test (mutant) measurements, or \(F = G\). To simulate this, we can do a permutation test. Say we have \(n\) control measurements and \(m\) test measurements. We concatenate the arrays of control and test measurements to get a single array with \(n + m\) entries. We then randomly scramble the order of the entries (this is implemented in np.random.permutation()). We take the first \(n\) to be labeled “control” and the last \(m\) to be labeled “test.” We then compute the test statistic according to these labels. In this way, we are simulating the null hypothesis: whether a sample is labeled “test” or “control” makes no difference.

For this case, we might define our test statistic to be the difference of means, the difference of medians, or anything else that can be computed from the two data sets and has a scalar value.
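As a sketch of how this might look in Python, the function below performs a permutation test using the difference of means as the test statistic. The choice of test statistic and the two-tailed meaning of "at least as extreme" (at least as large in magnitude) are assumptions here; substitute whatever definitions match the test you have defined.

```python
import numpy as np

def diff_of_means(x, y):
    """Test statistic: difference of means of two data sets."""
    return np.mean(x) - np.mean(y)

def permutation_test(x, y, size=10000):
    """Permutation test of the null hypothesis that the data sets
    x and y are drawn from the same distribution."""
    # Test statistic from the measured data
    stat_measured = diff_of_means(x, y)

    n = len(x)
    concatenated = np.concatenate((x, y))

    stat_null = np.empty(size)
    for i in range(size):
        # Scramble the order; under the null, the labels are arbitrary
        permuted = np.random.permutation(concatenated)

        # First n entries are labeled "control," the rest "test"
        stat_null[i] = diff_of_means(permuted[:n], permuted[n:])

    # Two-tailed: at least as extreme means at least as large in magnitude
    return np.sum(np.abs(stat_null) >= np.abs(stat_measured)) / size
```

Calling permutation_test(x, y) with your arrays of control and test measurements returns the p-value to report.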

Test and control distributions have the same mean.

The null hypothesis here is exactly as I have stated: the test and control distributions have the same mean, and nothing more. To simulate this, we shift the data sets so that they have the same mean. Specifically, if the control data are \(\mathbf{x}\) and the test data are \(\mathbf{y}\), we define the mean of all measurements to be

\[\begin{aligned} \bar{z} = \frac{n\bar{x} + m\bar{y}}{n+m}. \end{aligned}\]

Then, we define

\[\begin{split}\begin{aligned} \mathbf{x}_{\mathrm{shift}} &= \mathbf{x} - \bar{x} + \bar{z}, \\ \mathbf{y}_{\mathrm{shift}} &= \mathbf{y} - \bar{y} + \bar{z}. \end{aligned}\end{split}\]

Now, the data sets \(\mathbf{x}_\mathrm{shift}\) and \(\mathbf{y}_\mathrm{shift}\) have the same mean, but everything else about them is the same as \(\mathbf{x}\) and \(\mathbf{y}\), respectively. They have exactly the same variance, for example.

To simulate the null hypothesis, then, we draw bootstrap samples from \(\mathbf{x}_\mathrm{shift}\) and \(\mathbf{y}_\mathrm{shift}\) and compute the test statistic from the bootstrap samples, over and over again.
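A sketch of this procedure in Python, again assuming the difference of means as the test statistic and a two-tailed definition of "at least as extreme," might look like the following.

```python
import numpy as np

def bootstrap_mean_test(x, y, size=10000):
    """Bootstrap test of the null hypothesis that the NumPy arrays
    x and y are drawn from distributions with the same mean."""
    # Test statistic (difference of means) from the measured data
    stat_measured = np.mean(x) - np.mean(y)

    # Mean of all measurements, z-bar in the text
    z_bar = (len(x) * np.mean(x) + len(y) * np.mean(y)) / (len(x) + len(y))

    # Shift each data set so that it has mean z-bar
    x_shift = x - np.mean(x) + z_bar
    y_shift = y - np.mean(y) + z_bar

    stat_null = np.empty(size)
    for i in range(size):
        # Draw bootstrap samples from the shifted data sets
        bs_x = np.random.choice(x_shift, size=len(x_shift))
        bs_y = np.random.choice(y_shift, size=len(y_shift))
        stat_null[i] = np.mean(bs_x) - np.mean(bs_y)

    # Two-tailed: at least as extreme means at least as large in magnitude
    return np.sum(np.abs(stat_null) >= np.abs(stat_measured)) / size
```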

In both of these cases, no assumptions were made about the underlying distributions. Only the empirical distributions were used; these are nonparametric hypothesis tests.

Interpretation of the p-value

The p-value is best interpreted by its definition. It is the probability of observing a test statistic under the null hypothesis that is at least as extreme as what was observed experimentally. That’s it. It is important to interpret a p-value as what it is, and not anything else. Most importantly, it is not the probability that a null hypothesis is true. Even if it is small, it is not a confirmation that another hypothesis you have in mind is true.