Comments and opinions on NHST

What does it mean to be significant?

If the p-value is small, the effect is said to be statistically significant (hence the term “significance level”). But what is small; what should we choose for the significance level? By choosing a significance level, we draw a line in the sand. On one side of the line is a rejected hypothesis, and on the other an accepted one. I generally discourage using a bright-line p-value to deem a result statistically significant or not. You computed the p-value; it has a specific meaning, and you should report it. I do not see a need to convert a computed value, the p-value, into a Boolean, True/False, on whether or not we attach the word “significant” to the result. That is, I do not see a need to make a decision.

Do we even want to calculate a p-value?

The question that NHST and its result, the p-value, address is rarely the question we want to ask. For example, say we are doing a test of the null hypothesis that two sets of measurements have the same mean. In most cases, what we really want is knowledge about the underlying generative distributions of the control and test data. Which of the following two questions better serves that goal?

  1. How different are the means of the two samples?

  2. Would we say there is a statistically significant difference of the means of the two samples? Or, more precisely, what is the probability of observing a difference in means of the two samples at least as large as the observed difference in means, if the two samples in fact have the same mean?

The second question is convoluted and often of little scientific interest. I would say that the first question is much more relevant. To put it in perspective, say we made trillions of measurements of two different samples and their means differ by one part per million. This difference, though tiny, would still give a low p-value, and therefore would often be deemed “statistically significant.” But, ultimately, it is the size of the difference, or the effect size, that we care about.
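
To see this concretely, here is a minimal simulation with made-up numbers of my own choosing (a million measurements standing in for “trillions”), assuming NumPy and SciPy are available. The difference in means is negligible compared to the spread of the data, yet the p-value comes out small.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(3252)

# Two very large samples whose true means differ by a practically negligible amount
n = 1_000_000
x1 = rng.normal(10.000, 1.0, size=n)
x2 = rng.normal(10.005, 1.0, size=n)

# The effect size: tiny compared to the standard deviation of 1
print("difference in means:", np.mean(x1) - np.mean(x2))

# The p-value: small, so the difference gets deemed "statistically significant"
print("p-value:", scipy.stats.ttest_ind(x1, x2).pvalue)
```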

What is with all those names?

You have no doubt heard of many named hypothesis tests, like the Student-t test, Welch’s t-test, the Mann-Whitney U-test, and countless others. What is with all of those names? It helps to think more generally about how frequentist hypothesis testing is usually done.

To do a hypothesis test, people unfortunately do not usually take the resampling-based approach I have laid out here, but instead use a named test. Here is an example of that prescription for the Student-t test (borrowing heavily from the treatment in Gregory’s book).

  1. Choose a null hypothesis. This is the hypothesis you want to test the truth of.

  2. Choose a suitable test statistic that can be computed from measurements and has a predictable distribution. For the example of two sets of repeated measurements where the null hypothesis is that they come from identical Normal distributions, we can choose as our statistic

    \[\begin{split}\begin{aligned} T &= \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{S_D\sqrt{n_1^{-1}+n_2^{-1}}},\nonumber \\[1em] \text{where }S_D^2 &= \frac{(n_1 - 1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}, \nonumber \\[1em] \text{with } S_1^2 &= \frac{1}{n_1-1}\sum_{i\in D_1}(x_i - \bar{x}_1)^2, \end{aligned}\end{split}\]

    and \(S_2^2\) similarly defined. The T statistic is the difference between the observed difference in means and the true difference in means, scaled by the spread in the data. This is a reasonable statistic for determining something about means from data. It is the appropriate statistic when \(\sigma_1\) and \(\sigma_2\) are both unknown but assumed to be equal. (When they are assumed to be unequal, you need to adjust the statistic you use; that test is called Welch’s t-test.) It can be derived that this statistic has the Student-t distribution,

    \[\begin{split}\begin{aligned} P(t) &= \frac{1}{\sqrt{\pi \nu}} \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \left(\frac{t^2}{\nu}\right)\right)^{-\frac{\nu + 1}{2}},\\[1em] \text{where } \nu &= n_1+n_2-2. \end{aligned}\end{split}\]
  3. Evaluate the test statistic from measured data. In the case of the Student-t test, we compute \(T\).

  4. Plot \(P(t)\). The area under the curve where \(t > T\) is the p-value, the probability of observing a value of the test statistic at least as extreme as the one we computed, given that the null hypothesis is true. Reject the null hypothesis if this is small. (A code sketch of this prescription follows this list.)
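
To make the prescription concrete, here is a minimal sketch in Python of steps 3 and 4 for the Student-t test, using simulated measurements as stand-ins for real data and checking the result against scipy.stats.ttest_ind. The variable names and parameter choices are my own.

```python
import numpy as np
import scipy.stats

# Simulated stand-ins for two sets of repeated measurements
rng = np.random.default_rng(3252)
x1 = rng.normal(10.0, 1.0, size=15)
x2 = rng.normal(10.5, 1.0, size=12)

n1, n2 = len(x1), len(x2)

# Pooled variance S_D^2
SD2 = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)

# T statistic under the null hypothesis mu_1 = mu_2
T = (np.mean(x1) - np.mean(x2)) / np.sqrt(SD2 * (1 / n1 + 1 / n2))

# Two-sided p-value: area under the Student-t distribution with
# nu = n1 + n2 - 2 degrees of freedom beyond the observed statistic
nu = n1 + n2 - 2
p = 2 * scipy.stats.t.sf(np.abs(T), nu)

# Compare with the named test in scipy.stats
print(p, scipy.stats.ttest_ind(x1, x2, equal_var=True).pvalue)
```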

As you can see from the above prescription, item 2 can be tricky. Coming up with test statistics that also have a distribution we can write down is difficult.[1] When such a test statistic is found, the test usually gets a name. The main reason for doing things this way is that most hypothesis tests were developed before computers, so we couldn’t just bootstrap our way through hypothesis tests. (The bootstrap was invented by Brad Efron in 1979.) In contrast, in the approach we have taken, sometimes referred to as “hacker stats,” we can invent any test statistic we want, and we can test it by numerically “repeating” the experiment, in accordance with the frequentist interpretation of probability.
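
For contrast, here is a minimal sketch of the resampling-based approach to the same null hypothesis (that the two samples were drawn from identical distributions), using the difference in sample means as a freely chosen test statistic. The function name permutation_test_diff_means is just an illustrative choice, not a standard API.

```python
import numpy as np

def permutation_test_diff_means(x1, x2, n_perm=10_000, rng=None):
    """p-value: probability of a difference in sample means at least as
    large in magnitude as the observed one, under the null hypothesis
    that x1 and x2 come from the same distribution."""
    if rng is None:
        rng = np.random.default_rng()

    observed = np.abs(np.mean(x1) - np.mean(x2))
    pooled = np.concatenate((x1, x2))
    n1 = len(x1)

    count = 0
    for _ in range(n_perm):
        # "Repeat" the experiment under the null hypothesis by scrambling
        # which measurements are labeled as coming from which sample
        perm = rng.permutation(pooled)
        count += np.abs(np.mean(perm[:n1]) - np.mean(perm[n1:])) >= observed

    return count / n_perm
```

Any other test statistic, say the difference in medians, could be swapped in without changing the rest of the machinery.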

So, I would encourage you not to get caught up in names. If someone reports a p-value with a name, simply look up what you need to define the p-value (the hypothesis being tested, the test statistic, and what it means to be at least as extreme), and that will give you an understanding of what is going on with the test.

That said, many of the tests with names have analytical forms and can be rapidly computed. Most are included in the scipy.stats module. I have chosen to present a method of hypothesis testing that is intuitive, with the frequentist interpretation of probability front and center. It also allows you to design your own test for whatever null hypothesis you are interested in, even one that is not “off-the-shelf.”
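
For reference, the named tests mentioned above are one-line calls in scipy.stats; here they are applied to two small made-up samples.

```python
import scipy.stats

# Two small made-up samples just to illustrate the calls
x1 = [10.2, 9.8, 10.5, 10.1, 9.9]
x2 = [10.9, 10.4, 10.7, 11.0]

print(scipy.stats.ttest_ind(x1, x2, equal_var=True))   # Student-t test
print(scipy.stats.ttest_ind(x1, x2, equal_var=False))  # Welch's t-test
print(scipy.stats.mannwhitneyu(x1, x2))                # Mann-Whitney U-test
```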

Warnings about hypothesis tests

There are many. I will discuss them more when I talk about “statistical pitfalls” later in the term.

  1. An effect being statistically significant does not mean the effect is significant in practice or even important. It only means exactly what it is defined to mean: an effect is unlikely to have happened by chance under the null hypothesis. Far more important is the effect size.

  2. The p-value is not the probability that the null hypothesis is true. It is the probability of observing the test statistic being at least as extreme as what was measured if the null hypothesis is true. I.e., if \(H_0\) is the null hypothesis,

    \[\begin{aligned} \text{p-value} = P(\text{test stat at least as extreme as observed}\mid H_0). \end{aligned}\]

    It is not the probability that the null hypothesis is true given that the test statistic was at least as extreme as the data.

    \[\begin{aligned} \text{p-value} \ne P(H_0\mid\text{test stat at least as extreme as observed}). \end{aligned}\]

    We often actually want the probability that the null hypothesis is true, and the p-value is often erroneously interpreted as this (even though such a probability does not even make sense under the frequentist interpretation of probability), to great peril.

  3. Null hypothesis significance testing does not say anything about alternative hypotheses. Rejection of the null hypothesis does not mean acceptance of any other hypotheses.

  4. P-values are not very reproducible, as we will see when we do the “dance of the p-values” (see the simulation sketch after this list).

  5. Rejecting a null hypothesis is also kind of odd, considering you computed

    \[\begin{aligned} P(\text{test stat at least as extreme as observed}\mid H_0). \end{aligned}\]

    This does not really describe the probability that the hypothesis is true. This, along with point 4, means that the p-value better be really low for you to reject the null hypothesis.

  6. Throughout the literature, you will see null hypothesis testing when the null hypothesis is not relevant at all. People compute p-values because that’s what they are supposed to do. Again, it gets to the point that effect size is waaaaay more important than a null hypothesis significance test.
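
As a preview of that “dance” (item 4 above), here is a small simulation, with illustrative parameters of my own, that repeats the same experiment ten times and prints the Student-t test p-value each time. Nothing about the underlying distributions changes between repeats, yet the p-values vary considerably.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(3252)

# Repeat the identical experiment ten times; only the random draws change
p_values = []
for _ in range(10):
    x1 = rng.normal(10.0, 1.0, size=15)
    x2 = rng.normal(10.5, 1.0, size=15)
    p_values.append(scipy.stats.ttest_ind(x1, x2).pvalue)

print(np.round(p_values, 4))
```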

Given all these problems with p-values, based on my experience, I would advocate for their abandonment. I am not the only one. They seldom answer the question scientists are asking and lead to great confusion.

That said, I know they are widely and effectively used in contexts I am less familiar with. It’s foolish to make strong, blanket statements (like I just did), so consider this a tempering of my advocacy for abandonment.


[1] I am not going to discuss other important considerations that go into the choice of test statistic, especially when you want to make an accept/reject decision, such as statistical power and false positive rates.