16 The permutation test
Consider a linear regression model with intercept: \[ y_i = \beta_0 + \boldsymbol{x}_{i,\text{-}0}^T \boldsymbol{\beta}_{\text{-}0} + \epsilon_i, \quad \epsilon_i \overset{\text{i.i.d.}}{\sim} F, \quad i = 1, \ldots, n, \tag{16.1}\] where \(F\) is an unknown distribution. Suppose we wish to test the null hypothesis that none of the variables in \(\boldsymbol{x}_{\text{-}0}\) are associated with \(y\): \[ H_0: \boldsymbol{\beta}_{\text{-}0} = \boldsymbol{0}. \tag{16.2}\] Suppose further that the sample size \(n\) is small enough that neither asymptotic inference nor the bootstrap is reliable. In this case, we can use the permutation test, which controls Type-I error for any sample size \(n\).
16.1 General formulation of the permutation test
Note that the null hypothesis can be formulated as follows: \[ H_0: y_i \overset{\text{i.i.d.}}{\sim} G \quad \text{for some distribution } G. \] If we view the model matrix \(\boldsymbol X\) as random, then we can also formulate \(H_0\) as an independence null hypothesis: \[ H_0: \boldsymbol{x}_{\text{-}0} \perp\!\!\!\perp y. \] Both of these reformulations suggest that, under the null hypothesis, the distribution of the data \((\boldsymbol X, \boldsymbol y)\) is invariant to permutations of \(\boldsymbol y\) while keeping \(\boldsymbol X\) fixed. In other words, \[ (\boldsymbol X, \pi \boldsymbol y) \overset{d}{=} (\boldsymbol X, \boldsymbol y) \quad \text{for all } \pi \in \mathcal{S}_n, \tag{16.3}\] where \(\mathcal{S}_n\) is the group of all permutations of \(\{1, \ldots, n\}\) and \(\pi \boldsymbol y\) is the permuted response vector. Therefore, we can use permuted instances of the data to approximate the null distribution of any test statistic under \(H_0\). There are two versions of the permutation test: one based on the entire group \(\mathcal S_n\) and the other based on a random sample from \(\mathcal S_n\).
16.1.1 Permutation test based on the entire permutation group
Consider any test statistic \(T: (\boldsymbol X, \boldsymbol y) \mapsto \mathbb R\). For example, this may be the usual \(F\)-statistic for testing the hypothesis (16.2) in the model (16.1). The permutation test based on the entire permutation group then rejects \(H_0\) if and only if \[ T(\boldsymbol X, \boldsymbol y) > \mathbb Q_{1-\alpha}[\{T(\boldsymbol X, \pi \boldsymbol y): \pi \in \mathcal S_n\}], \] where \(\mathbb Q_{1-\alpha}[\cdot]\) denotes the \((1-\alpha)\)-quantile of a finite collection of values.
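The following is a minimal Python sketch of this test, assuming NumPy; the function name `permutation_test_full` is ours, and the statistic \(T\) is passed as a callable (for instance, a function computing the \(F\)-statistic).

```python
import itertools

import numpy as np


def permutation_test_full(X, y, T, alpha=0.05):
    """Permutation test over the entire group S_n.

    Rejects H0 when T(X, y) exceeds the (1 - alpha)-quantile of
    {T(X, pi y) : pi in S_n}. Enumerating all n! permutations is
    feasible only for very small n.
    """
    n = len(y)
    perm_stats = np.array(
        [T(X, y[list(pi)]) for pi in itertools.permutations(range(n))]
    )
    return T(X, y) > np.quantile(perm_stats, 1 - alpha)
```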
As claimed at the outset of this chapter, this test has non-asymptotic Type-I error control.
Theorem 16.1 For any \(n\), the permutation test based on the entire permutation group has Type-I error at most \(\alpha\) for testing the null hypothesis \(H_0: \boldsymbol{\beta}_{\text{-}0} = \boldsymbol{0}\).
Proof. Suppose \(H_0\) holds. Let \(\tau \in \mathcal S_n\). Then, by the permutation invariance property (16.3), we have \[ \begin{split} &\mathbb P[T(\boldsymbol X, \boldsymbol y) > \mathbb Q_{1-\alpha}[\{T(\boldsymbol X, \pi \boldsymbol y): \pi \in \mathcal S_n\}]] \\ &\quad = \mathbb P[T(\boldsymbol X, \tau \boldsymbol y) > \mathbb Q_{1-\alpha}[\{T(\boldsymbol X, \pi \tau \boldsymbol y): \pi \in \mathcal S_n\}]] \\ &\quad = \mathbb P[T(\boldsymbol X, \tau \boldsymbol y) > \mathbb Q_{1-\alpha}[\{T(\boldsymbol X, \pi \boldsymbol y): \pi \in \mathcal S_n\}]], \end{split} \] where the second equality holds because \(\pi \mapsto \pi\tau\) is a bijection of \(\mathcal S_n\), so the two collections of permuted statistics coincide. Therefore, the probability that \(T(\boldsymbol X, \boldsymbol y)\) exceeds the \((1-\alpha)\)-quantile of the permutation distribution is the same as the probability that any permuted test statistic does. Averaging over \(\tau \in \mathcal S_n\), we get \[ \begin{split} &\mathbb P[T(\boldsymbol X, \boldsymbol y) > \mathbb Q_{1-\alpha}[\{T(\boldsymbol X, \pi \boldsymbol y): \pi \in \mathcal S_n\}]] \\ &\quad= \frac{1}{|\mathcal S_n|} \sum_{\tau \in \mathcal S_n} \mathbb P[T(\boldsymbol X, \tau \boldsymbol y) > \mathbb Q_{1-\alpha}[\{T(\boldsymbol X, \pi \boldsymbol y): \pi \in \mathcal S_n\}]] \\ &\quad = \mathbb E\left[\frac{|\{\tau \in \mathcal S_n: T(\boldsymbol X, \tau \boldsymbol y) > \mathbb Q_{1-\alpha}[\{T(\boldsymbol X, \pi \boldsymbol y): \pi \in \mathcal S_n\}]\}|}{|\mathcal S_n|}\right] \\ &\quad \leq \alpha, \end{split} \tag{16.4}\] where the final inequality holds because, by definition of the quantile, at most a fraction \(\alpha\) of the values in a finite collection can exceed its \((1-\alpha)\)-quantile.
16.1.2 Permutation test based on a sample of the permutation group
The permutation test based on the entire permutation group is computationally infeasible for large \(n\), since \(|\mathcal S_n| = n!\). Instead, we can sample permutations \(\pi_1, \dots, \pi_B\) i.i.d. uniformly from \(\mathcal S_n\) to approximate the null distribution of the test statistic, rejecting \(H_0\) if and only if \[ T(\boldsymbol X, \boldsymbol y) > \mathbb Q_{1-\alpha}[\{T(\boldsymbol X, \boldsymbol y)\} \cup \{T(\boldsymbol X, \pi_b \boldsymbol y): b = 1, \dots, B\}]. \]
This gives us not just an approximation to the permutation test based on the entire permutation group, but a finite-sample valid test in its own right. The inclusion of the original test statistic \(T(\boldsymbol X, \boldsymbol y)\) in the quantile computation ensures that the test has finite-sample Type-I error control; see the exchangeability-based argument in the proof.
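Here is a matching Python sketch, under the same assumptions as before (NumPy; \(T\) a callable; the name `permutation_test_sampled` is ours).

```python
import numpy as np


def permutation_test_sampled(X, y, T, B=999, alpha=0.05, seed=None):
    """Permutation test based on B uniformly sampled permutations.

    The observed statistic T(X, y) is included in the reference set,
    which is what gives finite-sample validity (Theorem 16.2).
    """
    rng = np.random.default_rng(seed)
    t_obs = T(X, y)
    t_perm = [T(X, rng.permutation(y)) for _ in range(B)]
    reference = np.append(t_perm, t_obs)  # the {T(X, y)} term in the quantile
    return t_obs > np.quantile(reference, 1 - alpha)
```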
Theorem 16.2 For any \(n\), the permutation test based on a sample of the permutation group has Type-I error at most \(\alpha\) for testing the null hypothesis \(H_0: \boldsymbol{\beta}_{\text{-}0} = \boldsymbol{0}\).
Proof. Suppose \(H_0\) holds. We claim that the \(B+1\) test statistics \[ \{T(\boldsymbol X, \boldsymbol y)\} \cup \{T(\boldsymbol X, \pi_b \boldsymbol y): b = 1, \dots, B\} \] are exchangeable, i.e., their joint distribution is invariant to reordering. To see this, let \(\tau\) be a randomly sampled permutation from \(\mathcal S_n\). Then, by the permutation invariance property (16.3), we have \[ \begin{split} &\{T(\boldsymbol X, \boldsymbol y)\} \cup \{T(\boldsymbol X, \pi_b \boldsymbol y): b = 1, \dots, B\} \\ &\quad\stackrel{d}{=} \{T(\boldsymbol X, \tau \boldsymbol y)\} \cup \{T(\boldsymbol X, \pi_b \tau \boldsymbol y): b = 1, \dots, B\}. \end{split} \] It is not hard to see that \(\{\tau, \pi_1 \tau, \dots, \pi_B\tau\}\) is an i.i.d. sample from \(\mathcal S_n\) (each is uniformly distributed on \(\mathcal S_n\), and they are mutually independent), from which the claimed exchangeability follows. From this exchangeability, we get that \[ \begin{split} &\mathbb P[T(\boldsymbol X, \boldsymbol y) > \mathbb Q_{1-\alpha}[\{T(\boldsymbol X, \boldsymbol y)\} \cup \{T(\boldsymbol X, \pi_b \boldsymbol y): b = 1, \dots, B\}]] \\ &\quad= \mathbb P[T(\boldsymbol X, \pi_b \boldsymbol y) > \mathbb Q_{1-\alpha}[\{T(\boldsymbol X, \boldsymbol y)\} \cup \{T(\boldsymbol X, \pi_b \boldsymbol y): b = 1, \dots, B\}]] \end{split} \] for each \(b = 1, \dots, B\), from which Type-I error control follows by the same argument as in the derivation (16.4).
16.1.3 Obtaining \(p\)-values from permutation tests
In some cases, it is desirable to extract \(p\)-values from permutation tests, rather than just the decision to accept or reject at a fixed level \(\alpha\). The \(p\)-value is the smallest level at which the null hypothesis can be rejected; here, this equals the probability, under the permutation distribution, of observing a test statistic at least as extreme as the one observed. For permutation tests based on the full permutation group, the \(p\)-value can be computed as follows: \[ p = \frac{1}{|\mathcal S_n|}\sum_{\tau \in \mathcal S_n} \mathbb I\{T(\boldsymbol X, \tau \boldsymbol y) \geq T(\boldsymbol X, \boldsymbol y)\}. \] For permutation tests based on a sample of the permutation group, the \(p\)-value can be computed as \[ p = \frac{1}{B+1}\left(1 + \sum_{b=1}^B \mathbb I\{T(\boldsymbol X, \pi_b \boldsymbol y) \geq T(\boldsymbol X, \boldsymbol y)\}\right). \tag{16.5}\]
A common mistake is to omit the “1+” terms from the numerator and denominator of equation (16.5). These terms are essential for constructing a valid \(p\)-value: they correspond to including the observed statistic in the reference set, and in particular they prevent the \(p\)-value from being exactly zero.
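A small Python sketch of (16.5), under the same assumptions as the earlier snippets (the name `permutation_p_value` is ours):

```python
import numpy as np


def permutation_p_value(X, y, T, B=999, seed=None):
    """p-value from equation (16.5).

    The '1 +' terms include the observed statistic in the count,
    so the p-value is at least 1 / (B + 1) and never exactly zero.
    """
    rng = np.random.default_rng(seed)
    t_obs = T(X, y)
    exceed = sum(T(X, rng.permutation(y)) >= t_obs for _ in range(B))
    return (1 + exceed) / (B + 1)
```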
16.2 Special case: Two-groups model
The most common application of the permutation test is to the two-groups model: \[ y_i = \beta_0 + \beta_1 x_{i,1} + \epsilon_i, \quad \text{where } x_{i,1} \in \{0, 1\}. \] The goal here is to test whether the binary “treatment” variable has any effect on the response variable: \[ H_0: \beta_1 = 0. \] To make the two groups more explicit, we can write \[ \{y_i: x_{i,1} = 0\} \overset{\text{i.i.d.}}{\sim} G_0, \quad \{y_i: x_{i,1} = 1\} \overset{\text{i.i.d.}}{\sim} G_1, \] and the null hypothesis can be reformulated as \[ H_0: G_0 = G_1. \] In this case, the permutation mechanism randomly reassigns observations to the two groups. A test statistic commonly used in conjunction with this test is the difference in means between the two groups. While the permutation test controls Type-I error exactly under the hypothesis that the two groups come from the same distribution, we might want to test the weaker hypothesis that the two groups have the same mean. It turns out that, at least asymptotically, the permutation test controls Type-I error under this weaker null hypothesis if it is based on a studentized statistic, such as \[ T(\boldsymbol X, \boldsymbol y) = \frac{\bar y_1 - \bar y_0}{\sqrt{\frac{\hat \sigma_0^2}{n_0} + \frac{\hat \sigma_1^2}{n_1}}}, \] where \(\bar y_0\) and \(\bar y_1\) are the sample means of the two groups, \(\hat \sigma_0^2\) and \(\hat \sigma_1^2\) are the sample variances of the two groups, and \(n_0\) and \(n_1\) are the corresponding sample sizes.
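The studentized statistic plugs directly into the earlier sketches. Below is one hedged implementation, assuming the design matrix has the intercept in column 0 and the binary treatment indicator in column 1 (the function name `studentized_diff` is ours):

```python
import numpy as np


def studentized_diff(X, y):
    """Studentized difference in means between the two groups.

    Assumes column 1 of X is the binary treatment indicator
    (column 0 being the all-ones intercept column).
    """
    treated = X[:, 1].astype(bool)
    y0, y1 = y[~treated], y[treated]
    se = np.sqrt(y0.var(ddof=1) / len(y0) + y1.var(ddof=1) / len(y1))
    return (y1.mean() - y0.mean()) / se
```

For a two-sided test, one can pass the absolute value of this statistic to the earlier functions, e.g. `permutation_p_value(X, y, lambda X, y: abs(studentized_diff(X, y)))`.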
16.3 Permutation test versus bootstrap
The bootstrap and the permutation test are both resampling-based methods that use computation as a substitute for mathematical derivations of sampling distributions. Both methods have better finite-sample performance than their asymptotic counterparts. The bootstrap and the permutation test are typically considered primarily in the context of confidence interval construction and hypothesis testing, respectively, although the bootstrap can also be used for hypothesis testing in certain cases. The key difference is that the permutation test has valid Type-I error control in finite samples, while the bootstrap requires an asymptotic justification (even if the asymptotic convergence is faster than typical CLT-based asymptotics). Furthermore, the bootstrap is somewhat more versatile than the permutation test, as the latter is restricted to testing null hypotheses, such as (16.2), about all non-intercept coefficients.