15 The bootstrap
15.1 Introduction to the bootstrap
The bootstrap can be used for either confidence interval construction or hypothesis testing, with confidence interval construction being the much more common application. For this reason, we will focus on confidence interval construction in the remainder of this section as well as in Section 15.2 and Section 15.3. We will discuss bootstrap hypothesis testing in Section 15.4.
15.1.1 Usual inference paradigm
We typically carry out linear model inference for \(\beta_j\) by approximating the sampling distribution of \(\widehat\beta_j\), or a derivative thereof, such as the \(t\)-statistic. Under the standard linear model assumptions, we have \[ g(\boldsymbol{\widehat \beta}, \boldsymbol \beta) \equiv \widehat{\beta}_j - \beta_j \sim N(0, \sigma^2 [(\boldsymbol{X}^T \boldsymbol{X})^{-1}]_{jj}) \tag{15.1}\] and \[ g(\boldsymbol{\widehat \beta}, \boldsymbol \beta) \equiv \frac{\widehat{\beta}_j - \beta_j}{\text{s.e.}(\widehat \beta_j)} \sim t_{n-p}. \tag{15.2}\] Here, \(g(\boldsymbol{\widehat \beta}, \boldsymbol \beta)\) denotes a derivative quantity, whose distribution is the basis for inference. If the lower and upper quantiles \(\mathbb Q_{\alpha/2}[g(\boldsymbol{\widehat \beta}, \boldsymbol \beta)]\) and \(\mathbb Q_{1-\alpha/2}[g(\boldsymbol{\widehat \beta}, \boldsymbol \beta)]\) are known, then we can construct a \((1-\alpha)\) confidence interval for \(\beta_j\) as \[ \text{CI}(\beta_j) \equiv \{\beta_j: g(\boldsymbol{\widehat \beta}, \boldsymbol \beta) \in [\mathbb Q_{\alpha/2}[g(\boldsymbol{\widehat \beta}, \boldsymbol \beta)], \mathbb Q_{1-\alpha/2}[g(\boldsymbol{\widehat \beta}, \boldsymbol \beta)]]\}. \tag{15.3}\]
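As a concrete illustration of this usual paradigm, the following minimal sketch (with our own simulated data and variable names, not taken from the text) computes the \(t\)-based confidence interval (15.3) by hand:

```python
import numpy as np
from scipy import stats

# Simulate a small dataset satisfying the standard linear model assumptions
rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(size=n)

# Least squares fit, noise variance estimate, and standard error of beta_j
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)
j, alpha = 1, 0.05
se_j = np.sqrt(sigma2_hat * XtX_inv[j, j])

# Invert the t-statistic (15.2): the quantiles in (15.3) are t_{n-p} quantiles
t_quant = stats.t.ppf(1 - alpha / 2, df=n - p)
ci = (beta_hat[j] - t_quant * se_j, beta_hat[j] + t_quant * se_j)
print(ci)
```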
Under model misspecification, the distributions on the right-hand sides of equations (15.1) and (15.2) may no longer be valid.
15.1.2 Bootstrap inference paradigm
The bootstrap is an approach to more robust inference that obtains such sampling distributions by a technique known as resampling. The core idea of the bootstrap is to use the data to construct an approximation to the data-generating distribution and then to approximate the sampling distribution of any statistic by simulating from this approximate data-generating distribution. In more detail, the bootstrap paradigm is as follows:

1. Use the observed data to construct an estimate \(\widehat F\) of the data-generating distribution \(F\).
2. Draw bootstrap datasets \((\boldsymbol X^{(b)}, \boldsymbol y^{(b)}) \sim \widehat F\) for \(b = 1, \dots, B\), and compute the estimate \(\boldsymbol{\widehat \beta}^{(b)}\) on each.
3. Approximate the unknown quantiles of \(g(\boldsymbol{\widehat \beta}, \boldsymbol \beta)\) by the empirical quantiles of \(g(\boldsymbol{\widehat \beta}^{(b)}, \boldsymbol{\widehat \beta})\), which, substituted into (15.3), yield the confidence interval \[ \text{CI}^{\text{boot}}(\beta_j) \equiv \{\beta_j: g(\boldsymbol{\widehat \beta}, \boldsymbol \beta) \in [\mathbb Q_{\alpha/2}[g(\boldsymbol{\widehat \beta}^{(b)}, \boldsymbol{\widehat \beta})], \mathbb Q_{1-\alpha/2}[g(\boldsymbol{\widehat \beta}^{(b)}, \boldsymbol{\widehat \beta})]]\}. \tag{15.4}\]
This approach, pioneered by Brad Efron in 1979, obviates the need for stringent assumptions and mathematical derivations to obtain limiting distributions, replacing these with added computation. The bootstrap is extremely flexible and can be adapted to apply in a variety of settings. Furthermore, bootstrap methods are typically more accurate than their asymptotic counterparts in finite samples. While the justification of the bootstrap is still asymptotic (requiring \(\widehat F\) to approach \(F\)), the rate of convergence is often “second-order” \(O(1/n)\) rather than the usual “first-order” \(O(1/\sqrt{n})\) of standard asymptotic inference. This faster second-order convergence gives the bootstrap an advantage in finite samples.
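To make the resampling paradigm above concrete, here is a minimal generic sketch in Python, with illustrative function names of our choosing; the `resample` argument encodes how \(\widehat F\) is constructed (see Section 15.3), and least squares with row resampling is used purely as a placeholder:

```python
import numpy as np

def bootstrap_estimates(X, y, fit, resample, B=2000, seed=0):
    """Generic bootstrap loop: draw B datasets from an estimate of the
    data-generating distribution and refit the estimator on each."""
    rng = np.random.default_rng(seed)
    return np.array([fit(*resample(X, y, rng)) for _ in range(B)])

# Placeholder choices: least squares as the estimator, and resampling rows
# with replacement as the mechanism for drawing from F-hat.
def fit_ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def resample_rows(X, y, rng):
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx]

# Example usage on simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)
beta_boot = bootstrap_estimates(X, y, fit_ols, resample_rows)  # shape (B, 2)
```

The empirical quantiles of \(\widehat \beta^{(b)}_j - \widehat \beta_j\) computed from `beta_boot` can then be plugged into (15.4).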
15.1.3 Overview of bootstrap flavors
The bootstrap comes in a variety of flavors, dictated by the mechanism by which the data distribution \(F\) is learned (e.g. the parametric, residual, or pairs bootstraps), and the derivative quantity \(g(\cdot, \cdot)\) on which inference is based (e.g. the empirical bootstrap and the bootstrap-\(t\)). These two sets of flavors can be mixed and matched.
15.2 Derivative quantities on which to base inference
15.2.1 The empirical bootstrap
The empirical bootstrap, the most common choice of derivative quantity, is based on \[ g(\boldsymbol{\widehat \beta}, \boldsymbol{\beta}) = \widehat \beta_j - \beta_j. \] We can derive that if \[ \mathbb P\left[\widehat \beta_j - \beta_j \in \left[\mathbb Q_{\alpha/2}[\widehat \beta^{(b)}_j - \widehat \beta_j], \mathbb Q_{1-\alpha/2}[\widehat \beta^{(b)}_j - \widehat \beta_j]\right]\right] \overset \cdot \geq 1-\alpha, \] then \[ \mathbb P\left[\beta_j \in \left[\widehat \beta_j - \mathbb Q_{1-\alpha/2}[\widehat \beta^{(b)}_j - \widehat \beta_j], \widehat \beta_j - \mathbb Q_{\alpha/2}[\widehat \beta^{(b)}_j - \widehat \beta_j]\right]\right] \overset \cdot \geq 1-\alpha. \] For this reason, we define \[ \text{CI}^{\text{boot}}(\beta_j) \equiv \left[\widehat \beta_j - \mathbb Q_{1-\alpha/2}[\widehat \beta^{(b)}_j - \widehat \beta_j], \widehat \beta_j - \mathbb Q_{\alpha/2}[\widehat \beta^{(b)}_j - \widehat \beta_j]\right]. \]
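In code, given the original estimate and the bootstrap replicates for coordinate \(j\), a small sketch (with argument names of our choosing) is:

```python
import numpy as np

def empirical_bootstrap_ci(beta_hat_j, beta_boot_j, alpha=0.05):
    """Empirical bootstrap CI: shift beta_hat_j by the quantiles of the
    bootstrap deviations beta_j^(b) - beta_hat_j."""
    dev = np.asarray(beta_boot_j) - beta_hat_j
    q_lo, q_hi = np.quantile(dev, [alpha / 2, 1 - alpha / 2])
    return beta_hat_j - q_hi, beta_hat_j - q_lo
```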
A commonly used alternative to the empirical bootstrap is the percentile bootstrap, defined by \[ \text{CI}^{\text{boot}}(\beta_j) \equiv \left[\mathbb Q_{\alpha/2}[\widehat \beta^{(b)}_j], \mathbb Q_{1-\alpha/2}[\widehat \beta^{(b)}_j]\right]. \tag{15.5}\] Here, the resampling distribution of \(\widehat \beta_j\) is used directly to construct the confidence interval. However, this approach does not fall within the bootstrap paradigm described above, and in particular, the formula (15.5) is not a special case of (15.4). The formula (15.5) can be viewed as seeking an interval within which \(\widehat \beta_j\) (rather than \(\beta_j\) itself) falls with probability \(1-\alpha\). The percentile bootstrap is only justified when the distribution of \(\widehat \beta^{(b)}_j\) is symmetric about \(\widehat \beta_j\), in which case it coincides with the empirical bootstrap.
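For comparison, a corresponding sketch of the percentile interval (15.5):

```python
import numpy as np

def percentile_bootstrap_ci(beta_boot_j, alpha=0.05):
    """Percentile bootstrap CI: use the quantiles of beta_j^(b) directly."""
    lo, hi = np.quantile(beta_boot_j, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

When the resampling distribution of \(\widehat \beta^{(b)}_j\) is symmetric about \(\widehat \beta_j\), the two functions return the same interval.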
15.2.2 The bootstrap-\(t\) method
A weakness of the empirical bootstrap is that the quantity \(g(\boldsymbol{\widehat \beta}, \boldsymbol{\beta}) = \widehat \beta_j - \beta_j\) has distribution \(N(0, \sigma^2 [(\boldsymbol X^T \boldsymbol X)^{-1}]_{jj})\) (recalling equation (15.1)), which depends on the nuisance parameter \(\sigma^2\). When we approximate this distribution by bootstrapping, we are implicitly substituting in an estimate of \(\sigma^2\), which is itself subject to sampling variability. The empirical bootstrap does not account for this variability, because the distribution \(\widehat F\) on which the estimate of \(\sigma^2\) is based is held fixed throughout. To see this more clearly, consider the following example.
Example 15.1 (Non-pivotality in the normal mean problem) Suppose that \(\boldsymbol y \sim N(\boldsymbol 1 \beta_0, \sigma^2 \boldsymbol I)\), and the goal is to construct a confidence interval for \(\beta_0\). Defining \(\widehat \beta_0 \equiv \bar y\) and \(\widehat \sigma^2 \equiv \tfrac 1 {n-1} \sum_{i=1}^n (y_i - \bar y)^2\), consider the empirical bootstrap based on resampling \(\boldsymbol y^{(b)} \sim N(\boldsymbol 1 \widehat \beta_0, \widehat \sigma^2 \boldsymbol I)\). In this case, we will find that \[ \widehat \beta^{(b)}_0 - \widehat \beta_0 = \bar y^{(b)} - \bar y \sim N(0, \widehat \sigma^2/n), \] which will give rise to the bootstrap confidence interval \[ \text{CI}^{\text{boot}}(\beta_0) = \widehat \beta_0 \pm z_{1-\alpha/2} \tfrac{1}{\sqrt{n}} \widehat \sigma. \] The uncertainty in the estimate \(\widehat \sigma^2\) is not accounted for. We know from Unit 2 that, if the usual linear model assumptions are satisfied, we could account for this uncertainty by using a \(t\)-distribution instead of a normal distribution.
This issue can be addressed by bootstrapping a pivotal quantity \(g(\boldsymbol{\widehat \beta}, \boldsymbol{\beta})\), i.e., a quantity whose distribution does not depend on unknown parameters (at least under standard assumptions). In the context of the linear model, the \(t\)-statistic (15.2) is pivotal. Bootstrapping the \(t\)-statistic, called the bootstrap-\(t\) method, is therefore a way to account for the uncertainty in the estimate of \(\sigma^2\). To derive the bootstrap-\(t\) method, we approximate \[ \mathbb{P} \left[ \frac{\widehat{\beta}_j - \beta_j}{\text{s.e.}(\widehat{\beta}_j)} \in \left[ \mathbb{Q}_{\alpha/2} \left( \frac{\widehat{\beta}_j^{(b)} - \widehat{\beta}_j}{\text{s.e.}(\widehat{\beta}_j^{(b)})} \right), \mathbb{Q}_{1-\alpha/2} \left( \frac{\widehat{\beta}_j^{(b)} - \widehat{\beta}_j}{\text{s.e.}(\widehat{\beta}_j^{(b)})} \right) \right] \right] \overset{\cdot}{\geq} 1-\alpha, \] which justifies the bootstrap-\(t\) confidence interval \[ \begin{split} &\text{CI}^{\text{boot}}(\beta_j) \\ &\quad \equiv \left[\widehat \beta_j - \text{s.e.}(\widehat \beta_j) \mathbb Q_{1-\alpha/2} \left( \frac{\widehat \beta_j^{(b)} - \widehat \beta_j}{\text{s.e.}(\widehat \beta_j^{(b)})} \right), \widehat \beta_j - \text{s.e.}(\widehat \beta_j) \mathbb Q_{\alpha/2} \left( \frac{\widehat \beta_j^{(b)} - \widehat \beta_j}{\text{s.e.}(\widehat \beta_j^{(b)})} \right)\right]. \end{split} \]
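A minimal sketch of the resulting interval, assuming the bootstrap loop also records a standard error for each replicate (the argument names are ours):

```python
import numpy as np

def bootstrap_t_ci(beta_hat_j, se_j, beta_boot_j, se_boot_j, alpha=0.05):
    """Bootstrap-t CI: bootstrap the t-statistic, then rescale its quantiles
    by the standard error from the original fit."""
    t_boot = (np.asarray(beta_boot_j) - beta_hat_j) / np.asarray(se_boot_j)
    q_lo, q_hi = np.quantile(t_boot, [alpha / 2, 1 - alpha / 2])
    return beta_hat_j - se_j * q_hi, beta_hat_j - se_j * q_lo
```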
15.3 Techniques for learning the data distribution
The methods for learning the data distribution lie on a spectrum based on how flexibly they model this distribution: less flexible methods are more stable (i.e. less variable) but less robust, while more flexible methods are more robust but less stable. The following table shows the methods in increasing order of flexibility, indicating which types of model misspecification each is robust to.
Method | Non-normality | Heteroskedasticity | Group correlation | Temporal correlation |
---|---|---|---|---|
Parametric | No | No | No | No |
Residual | Yes | No | No | No |
Pairs | Yes | Yes | No | No |
Clustered | Yes | Yes | Yes | No |
Moving Blocks | Yes | Yes | No | Yes |
We will present these methods in the same order, starting from the parametric bootstrap.
15.3.1 The parametric bootstrap
The parametric bootstrap proceeds by specifying a parametric model for the data \((\boldsymbol X, \boldsymbol y)\), such as the one from Unit 2: \[ \boldsymbol y = \boldsymbol X \boldsymbol \beta + \boldsymbol \epsilon, \quad \boldsymbol \epsilon \sim N(0, \sigma^2 \boldsymbol I). \tag{15.6}\] The model matrix \(\boldsymbol X\) is kept fixed and only the distribution of \(\boldsymbol y\) (conditionally on \(\boldsymbol X\)) is modeled. The parameters \((\boldsymbol \beta, \sigma^2)\) can be fit by maximum likelihood, as usual. Then, the bootstrapped datasets can be generated as follows: \[ \boldsymbol X^{(b)} = \boldsymbol X; \quad \boldsymbol y^{(b)} \sim N(\boldsymbol X \boldsymbol{\widehat \beta}, \widehat \sigma^2 \boldsymbol I). \] In the context of the linear model, particularly under the standard parametric form (15.6) with the least squares estimator \(\boldsymbol{\widehat \beta}\), the parametric bootstrap is rarely used: it offers no robustness to model misspecification, and the distribution of the least squares estimator is known exactly, so there is no need to approximate it by resampling. Beyond linear models, or when estimators other than least squares are of interest, the parametric bootstrap is useful because it can be used to approximate the distributions of analytically intractable estimators, substituting computation for math.
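A minimal sketch of this resampling step for the linear model (the function name and interface are ours):

```python
import numpy as np

def parametric_bootstrap_datasets(X, y, B=2000, seed=0):
    """Parametric bootstrap under (15.6): fit the model, keep X fixed,
    and simulate y^(b) ~ N(X beta_hat, sigma2_hat * I)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # beta_hat is the MLE; sigma2_hat uses the usual n - p denominator
    # (a slight variant of the MLE).
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    fitted = X @ beta_hat
    return [fitted + rng.normal(scale=np.sqrt(sigma2_hat), size=n)
            for _ in range(B)]
```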
15.3.2 The residual bootstrap
The residual bootstrap is based on a parametric model for \(\mathbb E[\boldsymbol y \mid \boldsymbol X]\) and a nonparametric model for the noise terms. In particular, suppose that \[ y_i = \boldsymbol x_{i*}^T \boldsymbol \beta + \epsilon_i, \quad \epsilon_i \overset{\text{i.i.d.}}{\sim} G \quad \text{for } i = 1, \dots, n, \tag{15.7}\] where \(G\) is an unknown distribution without a parametric form. Then, the data-generating distribution \(F\) is specified by the pair \((\boldsymbol \beta, G)\). As with the parametric bootstrap, the model matrix \(\boldsymbol X\) is kept fixed and only the distribution of \(\boldsymbol y\) (conditionally on \(\boldsymbol X\)) is modeled. The parameter vector \(\boldsymbol \beta\) can be fit by least squares, as usual. Then, the distribution of the noise terms \(\epsilon_i\) can be estimated by the empirical distribution of the residuals \(\widehat \epsilon_i\): \[ \widehat G = \frac{1}{n} \sum_{i=1}^n \delta_{\widehat \epsilon_i}, \] where \(\delta_{\widehat \epsilon_i}\) is the point mass at \(\widehat \epsilon_i\). The bootstrapped datasets can then be generated as follows: \[ \boldsymbol x_{i*}^{(b)} = \boldsymbol x_{i*}; \quad y_i^{(b)} = \boldsymbol x_{i*}^T \boldsymbol{\widehat \beta} + \epsilon_i^{(b)}, \quad \epsilon_i^{(b)} \overset{\text{i.i.d.}}{\sim} \widehat G \quad \text{for } i = 1, \dots, n. \] Note that sampling \(\epsilon_i^{(b)} \overset{\text{i.i.d.}}{\sim} \widehat G\) is equivalent to sampling with replacement from the residuals \(\widehat \epsilon_i\). By avoiding modeling \(\epsilon_i\) as normal, the residual bootstrap is robust to non-normality. However, it is not robust to heteroskedasticity or correlated errors, because it models the \(\epsilon_i\) as i.i.d. from some distribution.
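A minimal sketch of this resampling step (the function name is ours):

```python
import numpy as np

def residual_bootstrap_datasets(X, y, B=2000, seed=0):
    """Residual bootstrap: keep X fixed and set y^(b) = X beta_hat + e^(b),
    where e^(b) is drawn by resampling the residuals with replacement
    (i.e., i.i.d. draws from G-hat)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    fitted = X @ beta_hat
    return [fitted + rng.choice(resid, size=n, replace=True)
            for _ in range(B)]
```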
15.3.3 Pairs bootstrap
Weakening the assumptions further, let us assume simply that \[ (\boldsymbol{x}_{i*}, y_i) \overset{\text{i.i.d.}}{\sim} F \] for some joint distribution \(F\). Unlike the parametric and residual bootstraps, the pairs bootstrap treats the predictors \(\boldsymbol X\) as random rather than fixed. We can then fit \(\widehat F\) as the empirical distribution of the data: \[ \widehat F = \frac{1}{n} \sum_{i=1}^n \delta_{(\boldsymbol{x}_{i*}, y_i)}. \] The bootstrapped datasets can then be generated as follows: \[ (\boldsymbol{x}_{i*}^{(b)}, y_i^{(b)}) \overset{\text{i.i.d.}}{\sim} \widehat F \quad \text{for } i = 1, \dots, n. \] Note that sampling the pairs \((\boldsymbol{x}_{i*}^{(b)}, y_i^{(b)})\) from \(\widehat F\) is equivalent to sampling with replacement from the rows of the original data. The benefit of the pairs bootstrap is that it does not assume homoskedasticity, since the error variance is allowed to depend on \(\boldsymbol{x}_{i*}\). Therefore, the pairs bootstrap addresses both non-normality and heteroskedasticity, though it does not address correlated errors (variants of the pairs bootstrap do; see below). Note that the pairs bootstrap does not even assume that \(\mathbb{E}[y_i] = \boldsymbol{x}_{i*}^T \boldsymbol{\beta}\) for some \(\boldsymbol{\beta}\); however, in the presence of model bias, it is unclear for which parameters we are even doing inference. While the pairs bootstrap assumes less than the residual bootstrap, it may be somewhat less efficient when the assumptions of the latter are met. Nevertheless, the pairs bootstrap is the most commonly applied flavor of the bootstrap.
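Putting the pieces together, a minimal end-to-end sketch combining pairs resampling with the empirical bootstrap interval of Section 15.2.1 (the names and interface are ours):

```python
import numpy as np

def pairs_bootstrap_ci(X, y, j, B=2000, alpha=0.05, seed=0):
    """Pairs bootstrap CI for beta_j: resample rows (x_i, y_i) with
    replacement, refit least squares, and shift beta_hat_j by the quantiles
    of the bootstrap deviations (the empirical bootstrap interval)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta_hat_j = np.linalg.lstsq(X, y, rcond=None)[0][j]
    boot = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        boot[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0][j]
    q_lo, q_hi = np.quantile(boot - beta_hat_j, [alpha / 2, 1 - alpha / 2])
    return beta_hat_j - q_hi, beta_hat_j - q_lo
```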
15.3.4 Clustered bootstrap
In the presence of clustered errors, the pairs bootstrap can be modified to the clustered bootstrap. The distributional assumption underlying the clustered bootstrap is the following: \[ \{(\boldsymbol{x}_{i*}, y_i): c(i) = c\} \overset{\text{i.i.d.}}{\sim} F \quad \text{for } c = 1, \dots, C, \tag{15.8}\] where \(c(i)\) is the cluster to which observation \(i\) belongs. Therefore, entire clusters are modeled as i.i.d. draws from some distribution across clusters. As with the pairs bootstrap, this distribution is estimated by the empirical distribution of the data, and resampling from this distribution amounts to sampling entire clusters (rather than individual observations) from the original data, with replacement. This kind of resampling preserves the joint correlation structure within clusters. Note that the pairs bootstrap is the special case of the clustered bootstrap in which each observation forms its own cluster.
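A minimal sketch of the cluster resampling step (the names are ours; `cluster` holds the labels \(c(i)\)):

```python
import numpy as np

def clustered_bootstrap_datasets(X, y, cluster, B=2000, seed=0):
    """Clustered bootstrap: resample whole clusters with replacement,
    preserving the within-cluster correlation structure."""
    rng = np.random.default_rng(seed)
    labels = np.unique(cluster)
    datasets = []
    for _ in range(B):
        chosen = rng.choice(labels, size=len(labels), replace=True)
        idx = np.concatenate([np.where(cluster == c)[0] for c in chosen])
        datasets.append((X[idx], y[idx]))
    return datasets
```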
15.3.5 Moving blocks bootstrap
In the case of temporally (or spatially) correlated errors, the pairs bootstrap can be modified to the moving blocks bootstrap. The distributional assumption underlying the moving blocks bootstrap is the same as that of the clustered bootstrap (15.8), except the clusters are defined as contiguous blocks of observations. The distribution across blocks is fit as the empirical distribution of all contiguous blocks of a given length, and resampling from this distribution amounts to sampling entire blocks (rather than individual observations) from the original data, with replacement. This kind of resampling preserves the joint correlation structure within temporal or spatial blocks, though it ignores correlations across block boundaries. As with the clustered bootstrap, the pairs bootstrap can be viewed as the special case of the moving blocks bootstrap in which the blocks have length one.
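A minimal sketch of the block resampling step (the names and the block length parameter are ours):

```python
import numpy as np

def moving_blocks_bootstrap_datasets(X, y, block_len, B=2000, seed=0):
    """Moving blocks bootstrap: take the empirical distribution over all
    contiguous blocks of length block_len, then paste together randomly
    chosen blocks (with replacement) until the original length is reached."""
    rng = np.random.default_rng(seed)
    n = len(y)
    starts_all = np.arange(n - block_len + 1)      # all contiguous blocks
    n_blocks = int(np.ceil(n / block_len))
    datasets = []
    for _ in range(B):
        starts = rng.choice(starts_all, size=n_blocks, replace=True)
        idx = np.concatenate([np.arange(s, s + block_len) for s in starts])[:n]
        datasets.append((X[idx], y[idx]))
    return datasets
```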
15.4 Bootstrap hypothesis testing
The bootstrap inference paradigm described in Section 15.1.2 is primarily for constructing confidence intervals. For one-dimensional quantities like \(\beta_j\), confidence intervals can be used to perform hypothesis tests via duality. However, it is more challenging to use the bootstrap to create confidence regions for multi-dimensional quantities like \(\boldsymbol{\beta}_S\). Nevertheless, in some cases the bootstrap paradigm can be adapted to perform hypothesis tests directly.
15.4.1 Bootstrap testing paradigm
To test \(H_0: \boldsymbol \beta_S = \boldsymbol 0\), the bootstrap testing paradigm proceeds as follows:

1. Compute a test statistic \(T\) on the observed data.
2. Use the data to construct an estimate \(\widehat F\) of the data-generating distribution \(F\).
3. Derive from \(\widehat F\) a null data distribution \(\widehat F_0\) under which \(H_0\) holds.
4. Generate bootstrap datasets from \(\widehat F_0\) and compute the test statistic \(T^{(b)}\) on each.
5. Reject \(H_0\) if \(T\) is extreme relative to the resampling distribution of the \(T^{(b)}\).

Here, the key new step is the third, in which a null data distribution \(\widehat F_0\) is derived from the approximate data distribution \(\widehat F\). The challenge is that, depending on the form of \(\widehat F\), this may or may not be possible. In fact, obtaining \(\widehat F_0\) from \(\widehat F\) is easily done whenever the model for the data involves a parameter vector \(\boldsymbol \beta\), as is the case for the parametric and residual bootstraps (see the next section for more detail on the latter). In this case, \(\widehat F_0\) can be obtained from \(\widehat F\) by setting the coefficients \(\boldsymbol \beta_S\) to zero. On the other hand, for the pairs bootstrap and its variants, it is not clear how to obtain \(\widehat F_0\) from \(\widehat F\). Finally, note that the bootstrap testing paradigm is in principle compatible with any test statistic \(T\). A popular choice for \(T\) is the \(F\)-statistic for \(H_0: \boldsymbol \beta_S = 0\) from Unit 2.
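Concretely, with \(T\) computed on the original data and \(T^{(b)}\) on the \(b\)-th null dataset, a standard convention (not specific to this text) is to report the bootstrap p-value \[ \widehat p = \frac{1 + \#\{b: T^{(b)} \geq T\}}{B + 1}, \] rejecting \(H_0\) when \(\widehat p \leq \alpha\).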
15.4.2 Testing with the residual bootstrap
A commonly used bootstrap flavor for hypothesis testing is the residual bootstrap. Recalling the data-generating model (15.7), suppose \(\widehat F = (\boldsymbol{\widehat \beta}, \widehat G)\) is the fitted model. Then, we can define \(\widehat F_0 = ((\boldsymbol 0, \boldsymbol{\widehat \beta}_{\text{-}S}), \widehat G)\). Therefore, the bootstrapped null data are drawn from the following distribution: \[ \boldsymbol x_{i*}^{(b)} = \boldsymbol x_{i*}; \quad y_i^{(b)} = \boldsymbol x_{i, \text{-}S}^T \boldsymbol{\widehat \beta}_{\text{-}S} + \epsilon_i^{(b)}, \quad \epsilon_i^{(b)} \overset{\text{i.i.d.}}\sim \widehat G \quad \text{for } i = 1, \dots, n. \] As before, the bootstrapped residuals \(\epsilon_i^{(b)}\) are sampled with replacement from the set of original residuals.
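A minimal end-to-end sketch of this test (the function name and interface are ours; the p-value follows the convention noted above):

```python
import numpy as np

def residual_bootstrap_f_test(X, y, S, B=2000, seed=0):
    """Residual bootstrap test of H0: beta_S = 0, using the F-statistic as the
    test statistic T. Null datasets follow the text: the full-fit coefficients
    in S are set to zero and residuals from the full fit are resampled."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    S = np.asarray(S)
    not_S = np.setdiff1d(np.arange(p), S)

    def f_stat(y_curr):
        # F-statistic comparing the full model to the model without columns S
        beta_full = np.linalg.lstsq(X, y_curr, rcond=None)[0]
        rss_full = np.sum((y_curr - X @ beta_full) ** 2)
        beta_red = np.linalg.lstsq(X[:, not_S], y_curr, rcond=None)[0]
        rss_red = np.sum((y_curr - X[:, not_S] @ beta_red) ** 2)
        return ((rss_red - rss_full) / len(S)) / (rss_full / (n - p))

    T = f_stat(y)

    # F-hat_0: coefficients in S set to zero, G-hat from the full-fit residuals
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    null_mean = X[:, not_S] @ beta_hat[not_S]

    T_boot = np.empty(B)
    for b in range(B):
        y_b = null_mean + rng.choice(resid, size=n, replace=True)
        T_boot[b] = f_stat(y_b)

    return (1 + np.sum(T_boot >= T)) / (B + 1)   # bootstrap p-value
```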