14  Asymptotic methods

In this section, we present a set of asymptotic methods for addressing heteroskedastic or correlated errors, in the setting where \[ \boldsymbol{y} \sim N(\boldsymbol{X} \boldsymbol{\beta}, \boldsymbol{\Sigma}). \tag{14.1}\] These methods are based on estimating \(\boldsymbol{\Sigma}\); the estimate is then used either to (i) build a better estimate \(\boldsymbol{\widehat{\beta}}\) (Section 14.1) or (ii) build better standard errors for the least squares estimate (Section 14.2). We discuss these two approaches in turn, followed by how to carry out inference based on the resulting estimates (Section 14.3).

14.1 Methods that build a better estimate of \(\boldsymbol{\beta}\)

14.1.1 Generalized least squares

Let us premultiply \(\boldsymbol y\) by \(\boldsymbol \Sigma^{-1/2}\) to obtain \[ \boldsymbol \Sigma^{-1/2} \boldsymbol y \sim N(\boldsymbol \Sigma^{-1/2} \boldsymbol X \boldsymbol \beta, \boldsymbol I). \] Viewing \(\boldsymbol \Sigma^{-1/2} \boldsymbol y\) as the new response and \(\boldsymbol \Sigma^{-1/2} \boldsymbol X\) as the new model matrix, we can apply the usual least squares estimator to obtain \[ \boldsymbol{\tilde \beta}^{\text{GLS}} = (\boldsymbol X^T \boldsymbol \Sigma^{-1} \boldsymbol X)^{-1} \boldsymbol X^T \boldsymbol \Sigma^{-1} \boldsymbol y. \tag{14.2}\] This is the generalized least squares estimate for the model (14.1), and has the following distribution: \[ \boldsymbol{\tilde{\beta}}^{\text{GLS}} \sim N(\boldsymbol{\beta}, (\boldsymbol{X}^T \boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1}). \] By the Gauss-Markov theorem, this is the best linear unbiased estimate of \(\boldsymbol{\beta}\), recovering efficiency. We would like to carry out inference based on this distribution, analogously to Chapter 2, which we can do as long as we can estimate \(\boldsymbol \Sigma\) accurately enough.
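
To make the whitening construction concrete, here is a minimal sketch in Python (assuming NumPy; the function name `gls` and all variable names are illustrative, not from the text) that computes (14.2) by Cholesky-whitening the data and running ordinary least squares:

```python
import numpy as np

def gls(X, y, Sigma):
    """GLS via whitening, for a known n x n covariance matrix Sigma."""
    # Whiten with the Cholesky factor: Sigma = L L^T, so L^{-1} plays
    # the role of Sigma^{-1/2} in the transformed regression.
    L = np.linalg.cholesky(Sigma)
    Xw = np.linalg.solve(L, X)   # L^{-1} X
    yw = np.linalg.solve(L, y)   # L^{-1} y
    # Ordinary least squares on the whitened data gives the GLS estimate.
    beta_gls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    # Covariance of the GLS estimate: (X^T Sigma^{-1} X)^{-1}.
    cov = np.linalg.inv(Xw.T @ Xw)
    return beta_gls, cov
```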

14.1.2 Models for \(\boldsymbol{\Sigma}\)

This class of methods typically postulates a parametric form for \(\boldsymbol{\Sigma}\), denoted by \(\boldsymbol{\Sigma}(\boldsymbol{\nu})\), where \(\boldsymbol{\nu}\) is a vector of parameters, and then proceeds by estimating \(\boldsymbol \nu\). Below are a few examples of such parametric models (two of which are sketched in code after the list):

  • Heteroskedastic errors. In this case, we can assume that \(\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \dots, \sigma_n^2)\), where \[ \log \sigma_i^2 = \boldsymbol{x}_i^T \boldsymbol{\nu}. \]
  • Clustered errors. Suppose that each observation \(i\) falls into a cluster \(c(i)\). Then, we can postulate a random effects model \[ y_i = \boldsymbol{x}_i^T \boldsymbol{\beta} + \delta_{c(i)} + \tau_i, \quad \text{where} \quad \delta_c \overset{\text{i.i.d.}}\sim N(0, \sigma^2_{\delta}), \quad \tau_i \overset{\text{i.i.d.}}\sim N(0, \sigma^2_{\tau}). \] This imposes a block-diagonal structure on \(\boldsymbol{\Sigma}\), where each block corresponds to a cluster.
  • Temporal errors. If the observations have a temporal structure, we might impose an AR(1) model on the residuals: \[ \epsilon_1 = \tau_1; \quad \epsilon_i = \rho \epsilon_{i-1} + \tau_i \quad \text{for } i > 1, \quad \text{where} \quad \tau_i \overset{\text{i.i.d.}}\sim N(0, \sigma^2). \] This imposes an approximately banded structure on \(\boldsymbol{\Sigma}\), with correlations decaying geometrically: for indices away from the start of the series, \(\Sigma_{i_1 i_2} \approx \frac{\sigma^2}{1-\rho^2}\rho^{|i_1-i_2|}\).
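
As a concrete illustration of the first and third models, here is a minimal sketch (assuming NumPy; the function and parameter names are illustrative) that builds \(\boldsymbol{\Sigma}(\boldsymbol{\nu})\) for the log-linear heteroskedastic model and for the stationary version of the AR(1) model:

```python
import numpy as np

def heteroskedastic_cov(X, nu):
    """Sigma = diag(sigma_1^2, ..., sigma_n^2) with log sigma_i^2 = x_i^T nu."""
    return np.diag(np.exp(X @ nu))

def ar1_cov(n, sigma2, rho):
    """Stationary AR(1) covariance: Sigma_{ij} = sigma2 / (1 - rho^2) * rho^{|i-j|}."""
    idx = np.arange(n)
    return sigma2 / (1 - rho**2) * rho ** np.abs(idx[:, None] - idx[None, :])
```
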
Random versus fixed effects models

Random effects models address correlated errors but not model bias. The difference is that, in the case of correlated errors, the errors may be correlated with one another but not with the regressors; in the case of model bias, the errors may be correlated with the regressors. To address model bias in the presence of clustering structure, fixed effects are necessary. Fixed effects models decrease model bias at the cost of increased variance, because more parameters must be estimated. Random effects models are more susceptible to model bias but have lower variance.

14.1.3 Estimating \(\boldsymbol{\Sigma}\)

Given a parametric model for \(\boldsymbol{\Sigma}\), we can estimate \(\boldsymbol{\nu}\) by one of two approaches. The first approach, typical in statistics, is to maximize the likelihood as a function of \((\boldsymbol \beta, \boldsymbol \nu)\). The second approach, typical in econometrics, is to estimate \(\boldsymbol{\beta}\) using OLS, and then to fit \(\boldsymbol \nu\) based on the residuals. This gives us the estimate \(\boldsymbol{\widehat \Sigma} = \boldsymbol{\widehat \Sigma}(\boldsymbol{\widehat \nu})\).
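
For the heteroskedastic model above, the residual-based (econometric-style) approach can be sketched as follows (assuming NumPy; all names are illustrative): fit OLS, then regress the log squared residuals on the covariates to estimate \(\boldsymbol{\nu}\).

```python
import numpy as np

def estimate_nu_heteroskedastic(X, y):
    """Crude residual-based estimate of nu in log sigma_i^2 = x_i^T nu."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    # Regress log squared residuals on the covariates; the small constant
    # guards against zero residuals, and the intercept absorbs a
    # multiplicative bias (this is only a rough sketch).
    nu_hat, *_ = np.linalg.lstsq(X, np.log(resid**2 + 1e-12), rcond=None)
    Sigma_hat = np.diag(np.exp(X @ nu_hat))   # Sigma_hat = Sigma(nu_hat)
    return nu_hat, Sigma_hat
```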

14.1.4 Inference for \(\boldsymbol \beta\) based on the estimate \(\boldsymbol{\widehat \Sigma}\)

With an estimate \(\boldsymbol{\widehat \Sigma}\) in hand, we can use it to build a (hopefully) better estimate of \(\boldsymbol{\beta}\), using the following plug-in version of the GLS estimate (14.2): \[ \boldsymbol{\widehat{\beta}}^{\text{FGLS}} \equiv (\boldsymbol{X}^T \boldsymbol{\widehat{\Sigma}}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T \boldsymbol{\widehat{\Sigma}}^{-1}\boldsymbol{y}. \tag{14.3}\]

This is called the feasible generalized least squares (FGLS) estimate in econometrics, to contrast it with the infeasible estimate that assumes \(\boldsymbol{\Sigma}\) is known exactly. Then, we can carry out inference based on the approximate distribution \[ \boldsymbol{\widehat{\beta}}^{\text{FGLS}} \overset \cdot \sim N(\boldsymbol{\beta}, (\boldsymbol{X}^T \boldsymbol{\widehat{\Sigma}}^{-1}\boldsymbol{X})^{-1}). \tag{14.4}\]
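
Here is a minimal sketch of the plug-in computation in (14.3) and (14.4), assuming NumPy and an estimated covariance matrix `Sigma_hat` (for instance, the output of the residual-based fit above); all names are illustrative.

```python
import numpy as np

def fgls(X, y, Sigma_hat):
    """Feasible GLS estimate and its approximate covariance matrix."""
    Sigma_inv = np.linalg.inv(Sigma_hat)
    cov = np.linalg.inv(X.T @ Sigma_inv @ X)   # (X^T Sigma_hat^{-1} X)^{-1}
    beta_fgls = cov @ X.T @ Sigma_inv @ y      # plug-in GLS estimate (14.3)
    return beta_fgls, cov
```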

14.2 Methods that build better standard errors for the OLS estimate

Sometimes we do not trust our estimate of \(\boldsymbol{\Sigma}\) enough to modify the least squares estimator itself. In that case, we keep using the least squares estimator, but seek standard errors that are robust to heteroskedastic or correlated errors. There are several strategies for computing valid standard errors in such situations.

14.2.1 Sandwich standard errors

Let’s say that \(\boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), where \(\boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \boldsymbol{\Sigma})\). Then, we can compute that the covariance matrix of the least squares estimate \(\boldsymbol{\widehat{\beta}}\) is

\[ \text{Var}[\boldsymbol{\widehat{\beta}}] = (\boldsymbol{X}^T \boldsymbol{X})^{-1}(\boldsymbol{X}^T \boldsymbol{\Sigma} \boldsymbol{X})(\boldsymbol{X}^T \boldsymbol{X})^{-1}. \tag{14.5}\]

Note that this expression reduces to the usual \(\sigma^2(\boldsymbol{X}^T \boldsymbol{X})^{-1}\) when \(\boldsymbol{\Sigma} = \sigma^2 \boldsymbol{I}\). It is called the sandwich variance because we have the \((\boldsymbol{X}^T \boldsymbol{\Sigma} \boldsymbol{X})\) term sandwiched between two \((\boldsymbol{X}^T \boldsymbol{X})^{-1}\) terms. If we have some estimate \(\boldsymbol{\widehat{\Sigma}}\) of the covariance matrix, we can construct

\[ \widehat{\text{Var}}[\boldsymbol{\widehat{\beta}}] \equiv (\boldsymbol{X}^T \boldsymbol{X})^{-1}(\boldsymbol{X}^T \boldsymbol{\widehat{\Sigma}} \boldsymbol{X})(\boldsymbol{X}^T \boldsymbol{X})^{-1}. \tag{14.6}\]
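
Computationally, (14.6) is a simple matrix product; here is a minimal sketch (assuming NumPy; names are illustrative) that assembles the sandwich from the "bread" \((\boldsymbol{X}^T \boldsymbol{X})^{-1}\) and the "meat" \(\boldsymbol{X}^T \boldsymbol{\widehat{\Sigma}} \boldsymbol{X}\):

```python
import numpy as np

def sandwich_cov(X, Sigma_hat):
    """Sandwich covariance estimate (14.6) for the OLS coefficients."""
    bread = np.linalg.inv(X.T @ X)
    meat = X.T @ Sigma_hat @ X
    return bread @ meat @ bread
```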

Different estimates \(\boldsymbol{\widehat{\Sigma}}\) are appropriate in different situations. Below we consider three of the most common choices: one for heteroskedasticity (due to Huber and White), one for group-correlated errors (due to Liang and Zeger), and one for temporally correlated errors (due to Newey and West).

14.2.2 Specific instances of sandwich standard errors

Huber-White standard errors

Suppose \(\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \dots, \sigma_n^2)\) for some variances \(\sigma_1^2, \dots, \sigma_n^2 > 0\). The Huber-White sandwich estimator is defined by (14.6), with

\[ \boldsymbol{\widehat{\Sigma}} \equiv \text{diag}(\widehat{\sigma}_1^2, \dots, \widehat{\sigma}_n^2), \quad \text{where} \quad \widehat{\sigma}_i^2 = (y_i - \boldsymbol{x}_{i*}^T \boldsymbol{\widehat{\beta}})^2. \]

While each estimator \(\widehat{\sigma}_i^2\) is very poor, Huber and White’s insight was that the resulting estimate of the (averaged) quantity \(\boldsymbol{X}^T \boldsymbol{\widehat{\Sigma}}\boldsymbol{X}\) is not bad. To see why, assume that \((\boldsymbol{x}_{i*}, y_i) \overset{\text{i.i.d.}}{\sim} F\) for some joint distribution \(F\). Then, we have that

\[ \begin{split} \frac{1}{n}(\boldsymbol{X}^T \widehat{\boldsymbol{\Sigma}} \boldsymbol{X} - \boldsymbol{X}^T \boldsymbol{\Sigma} \boldsymbol{X}) &= \frac{1}{n} \sum_{i=1}^n (\widehat{\sigma}_i^2 - \sigma_i^2) \boldsymbol{x}_{i*} \boldsymbol{x}_{i*}^T \\ &= \frac{1}{n} \sum_{i=1}^n ((\epsilon_i + \boldsymbol{x}_{i*}^T(\boldsymbol{\beta} - \boldsymbol{\widehat{\beta}}))^2 - \sigma_i^2) \boldsymbol{x}_{i*} \boldsymbol{x}_{i*}^T \\ &= \frac{1}{n} \sum_{i=1}^n (\epsilon_i^2 - \sigma_i^2) \boldsymbol{x}_{i*} \boldsymbol{x}_{i*}^T + o_p(1) \\ &\overset p \to 0. \end{split} \]

The last step holds by the law of large numbers, since \(\mathbb{E}[(\epsilon_i^2 - \sigma_i^2) \boldsymbol{x}_{i*} \boldsymbol{x}_{i*}^T] = \boldsymbol{0}\) for each \(i\).
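
Here is a minimal sketch of the resulting estimator (assuming NumPy; names are illustrative), which avoids forming the \(n \times n\) matrix \(\boldsymbol{\widehat{\Sigma}}\) explicitly:

```python
import numpy as np

def huber_white_cov(X, y):
    """Huber-White (HC0) sandwich covariance for the OLS coefficients."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    bread = np.linalg.inv(X.T @ X)
    # X^T diag(resid^2) X, computed by scaling each row of X by resid_i^2.
    meat = (X * resid[:, None]**2).T @ X
    return bread @ meat @ bread
```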

Liang-Zeger standard errors

Next, let’s consider the case of group-correlated errors. Suppose that the observations are clustered, with correlated errors within clusters but not between clusters. Suppose there are \(C\) clusters of observations, with the \(i\)th observation belonging to cluster \(c(i) \in \{1, \dots, C\}\). Suppose for the sake of simplicity that the observations are ordered so that clusters are contiguous. Let \(\boldsymbol{\widehat{\epsilon}}_c\) be the vector of residuals in cluster \(c\), so that \(\boldsymbol{\widehat{\epsilon}} = (\boldsymbol{\widehat{\epsilon}}_1, \dots, \boldsymbol{\widehat{\epsilon}}_C)\). Then, the true covariance matrix is \(\boldsymbol{\Sigma} = \text{block-diag}(\boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_C)\) for some positive definite \(\boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_C\). The Liang-Zeger estimator is then defined by (14.6), with

\[ \boldsymbol{\widehat{\Sigma}} \equiv \text{block-diag}(\boldsymbol{\widehat{\Sigma}_1}, \dots, \boldsymbol{\widehat{\Sigma}_C}), \quad \text{where} \quad \boldsymbol{\widehat{\Sigma}_c} \equiv \boldsymbol{\widehat{\epsilon}}_c \boldsymbol{\widehat{\epsilon}}_c^T. \]

Note that the Liang-Zeger estimator is a generalization of the Huber-White estimator. Its justification is similar as well: while each \(\boldsymbol{\widehat{\Sigma}_c}\) is a poor estimator, the resulting estimate of the (averaged) quantity \(\boldsymbol{X}^T \boldsymbol{\widehat{\Sigma}}\boldsymbol{X}\) is not bad as long as the number of clusters is large. Liang-Zeger standard errors are referred to as “clustered standard errors” in the econometrics community. It is recommended to employ clustered standard errors even when using cluster-level fixed effects, in order to capture remaining within-cluster correlations.
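
Here is a minimal sketch of the clustered sandwich estimator (assuming NumPy and an integer array `clusters` recording \(c(i)\) for each observation; all names are illustrative). It uses the fact that, with \(\boldsymbol{\widehat{\Sigma}}_c = \boldsymbol{\widehat{\epsilon}}_c \boldsymbol{\widehat{\epsilon}}_c^T\), the meat is \(\sum_c (\boldsymbol X_c^T \boldsymbol{\widehat{\epsilon}}_c)(\boldsymbol X_c^T \boldsymbol{\widehat{\epsilon}}_c)^T\):

```python
import numpy as np

def clustered_cov(X, y, clusters):
    """Liang-Zeger (clustered) sandwich covariance for the OLS coefficients."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(clusters):
        idx = clusters == c
        s_c = X[idx].T @ resid[idx]    # X_c^T epsilon_hat_c
        meat += np.outer(s_c, s_c)     # X_c^T eps_c eps_c^T X_c
    return bread @ meat @ bread
```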

Newey-West standard errors

Finally, consider the case when our observations have a temporal structure, and we believe there to be nontrivial correlations between \(\epsilon_{i_1}\) and \(\epsilon_{i_2}\) for \(|i_1 - i_2| \leq L\). Then, a natural extension of the Huber-White estimate of \(\boldsymbol{\Sigma}\) is \(\widehat{\Sigma}_{i_1,i_2} = \widehat{\epsilon}_{i_1}\widehat{\epsilon}_{i_2}\) for each pair \((i_1, i_2)\) such that \(|i_1 - i_2| \leq L\), and \(\widehat{\Sigma}_{i_1,i_2} = 0\) otherwise. Unfortunately, this is not guaranteed to give a positive semidefinite matrix \(\boldsymbol{\widehat{\Sigma}}\). Therefore, Newey and West proposed a slightly modified estimator:

\[ \boldsymbol{\widehat{\Sigma}}_{i_1,i_2} = \max\left(0, 1-\frac{|i_1-i_2|}{L+1}\right)\widehat{\epsilon}_{i_1}\widehat{\epsilon}_{i_2}. \]

This estimator shrinks the off-diagonal estimates \(\widehat{\epsilon}_{i_1}\widehat{\epsilon}_{i_2}\) based on their distance from the diagonal. It can be shown that this modification restores the positive semidefiniteness of \(\boldsymbol{\widehat{\Sigma}}\).
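
Here is a minimal sketch of the corresponding sandwich estimator (assuming NumPy and time-ordered observations; names are illustrative), accumulating the meat lag by lag with the weights \(1 - k/(L+1)\):

```python
import numpy as np

def newey_west_cov(X, y, L):
    """Newey-West sandwich covariance with maximum lag L."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    bread = np.linalg.inv(X.T @ X)
    # Lag-0 (Huber-White) term: X^T diag(resid^2) X.
    meat = (X * resid[:, None]**2).T @ X
    # Lag-k terms, shrunk toward zero by the Bartlett weight.
    for k in range(1, L + 1):
        w = 1 - k / (L + 1)
        Gamma_k = (X[k:] * resid[k:, None]).T @ (X[:-k] * resid[:-k, None])
        meat += w * (Gamma_k + Gamma_k.T)
    return bread @ meat @ bread
```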

14.3 Inference based on an approximate covariance matrix

Whether based on the relations (14.4) or (14.6), we end up with an estimator \(\boldsymbol{\widehat \beta}\) and an approximate covariance matrix \(\widehat{\boldsymbol{\Omega}}\), so that \[ \boldsymbol{\widehat{\beta}} \overset{\cdot}{\sim} N(\boldsymbol{\beta}, \widehat{\boldsymbol{\Omega}}). \]

This allows us to construct confidence intervals and hypothesis tests for each \(\beta_j\), by simply replacing \(\text{SE}(\widehat{\beta}_j)\) with \(\sqrt{\widehat{\Omega}_{jj}}\). For contrasts and prediction intervals, we can use the fact that \(\boldsymbol{c}^T \boldsymbol{\widehat{\beta}} \overset{\cdot}{\sim} N(\boldsymbol{c}^T \boldsymbol{\beta}, \boldsymbol{c}^T \widehat{\boldsymbol{\Omega}} \boldsymbol{c})\), so that \(\text{SE}(\boldsymbol{c}^T \boldsymbol{\widehat{\beta}}) = \sqrt{\boldsymbol{c}^T \widehat{\boldsymbol{\Omega}} \boldsymbol{c}}\). It is less obvious how to use the matrix \(\widehat{\boldsymbol{\Omega}}\) to test the hypothesis \(H_0: \boldsymbol{\beta}_S = \boldsymbol{0}\). To this end, we can use a Wald test (we will discuss Wald tests in more detail in Unit 4). The Wald test statistic is

\[ W = \boldsymbol{\widehat{\beta}}_S^T (\widehat{\boldsymbol{\Omega}}_{S, S})^{-1} \boldsymbol{\widehat{\beta}}_S, \]

which is asymptotically distributed as \(\chi^2_{|S|}\) under the null hypothesis. This is based on the following result.

Lemma 14.1 Let \(\boldsymbol Z \sim N(0, \boldsymbol \Sigma)\) be a \(d\)-dimensional random vector, with \(\boldsymbol \Sigma\) invertible. Then, \[ \boldsymbol Z^T \boldsymbol \Sigma^{-1} \boldsymbol Z \sim \chi^2_d. \]

This gives us the test \[ \phi_{\text{Wald}}(\boldsymbol X, \boldsymbol y) = \mathbb{I}\left\{ \boldsymbol{\widehat{\beta}}_S^T (\widehat{\boldsymbol{\Omega}}_{S, S})^{-1} \boldsymbol{\widehat{\beta}}_S > \chi^2_{|S|} (1-\alpha) \right\}. \tag{14.7}\] As it turns out, under the usual linear model assumptions, this test is asymptotically equivalent to the usual \(F\)-test for the hypothesis \(H_0: \boldsymbol{\beta}_S = \boldsymbol{0}\).
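
Before turning to the equivalence with the \(F\)-test, here is a minimal sketch of the Wald test (14.7), assuming NumPy/SciPy; `S` is an index array and all names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, Omega_hat, S, alpha=0.05):
    """Wald test of H_0: beta_S = 0, given an approximate covariance Omega_hat."""
    b_S = beta_hat[S]
    W = b_S @ np.linalg.solve(Omega_hat[np.ix_(S, S)], b_S)   # Wald statistic
    p_value = chi2.sf(W, df=len(S))
    reject = W > chi2.ppf(1 - alpha, df=len(S))               # test (14.7)
    return W, p_value, reject
```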

Proposition 14.1 The homoskedasticity-based \(F\)-statistic for the null hypothesis \(H_0: \boldsymbol{\beta}_S = \boldsymbol{0}\) can be expressed as \[ F = \boldsymbol{\widehat{\beta}}_S^T (\widehat{\boldsymbol{\Omega}}_{S, S})^{-1} \boldsymbol{\widehat{\beta}}_S / |S|, \] allowing us to rewrite the Wald test as \[ \phi_{\text{Wald}}(\boldsymbol X, \boldsymbol y) = \mathbb{I}\left\{ F > \tfrac{1}{|S|}\chi^2_{|S|} (1-\alpha) \right\}. \] Since \(F_{|S|, n-p} \overset d \to \tfrac{1}{|S|}\chi^2_{|S|}\) as \(n \rightarrow \infty\), it follows that the \(F\)-test and the Wald test are asymptotically equivalent.

Proof. Recall from Chapter 8 that the \(F\)-test statistic can be expressed as \[ F = \frac{\|(\boldsymbol H - \boldsymbol H_{\text{-}S})\boldsymbol y \|^2 / |S|}{\|(\boldsymbol I - \boldsymbol H) \boldsymbol y \|^2 / (n-p)} = \frac{\|(\boldsymbol H - \boldsymbol H_{\text{-}S})\boldsymbol y \|^2 / |S|}{\widehat \sigma^2}, \] where \(\boldsymbol H - \boldsymbol H_{\text{-}S}\) is the projection matrix onto \(C(\boldsymbol X_{*, S}^\perp)\). Now, let \(\boldsymbol{\widehat \beta}\) be the least squares estimate in the regression of \(\boldsymbol y\) on \(\boldsymbol X\). Then, we have \[ \begin{split} (\boldsymbol H - \boldsymbol H_{\text{-}S})\boldsymbol y &= (\boldsymbol H - \boldsymbol H_{\text{-}S})(\boldsymbol X_{*,S} \boldsymbol{\widehat \beta}_S + \boldsymbol X_{*,\text{-}S} \boldsymbol{\widehat \beta}_{\text{-}S} + \boldsymbol{\widehat \epsilon}) \\ &= (\boldsymbol H - \boldsymbol H_{\text{-}S})\boldsymbol X_{*,S} \boldsymbol{\widehat \beta}_S \\ &= \boldsymbol X_{*,S}^\perp \boldsymbol{\widehat \beta}_S. \end{split} \] Therefore, we have \[ \|(\boldsymbol H - \boldsymbol H_{\text{-}S})\boldsymbol y\|^2 = \boldsymbol{\widehat \beta}_S^T (\boldsymbol X_{*,S}^\perp)^T \boldsymbol X_{*,S}^\perp \boldsymbol{\widehat \beta}_S. \] Next, we claim that \[ (\boldsymbol X_{*,S}^\perp)^T \boldsymbol X_{*,S}^\perp = \{[(\boldsymbol X^T \boldsymbol X)^{-1}]_{S,S}\}^{-1}. \tag{14.8}\] To see this, note that \[ \boldsymbol{\widehat \beta}_S \sim N(\boldsymbol \beta_S, \sigma^2 [(\boldsymbol X^T \boldsymbol X)^{-1}]_{S,S}). \tag{14.9}\] On the other hand, since \(\boldsymbol{\widehat \beta}_S\) can be obtained by regressing \(\boldsymbol y\) onto \(\boldsymbol X_{*,S}^{\perp}\), we also have that \[ \boldsymbol{\widehat \beta}_S \sim N(\boldsymbol \beta_S, \sigma^2 [(\boldsymbol X_{*,S}^{\perp})^T \boldsymbol X_{*,S}^{\perp}]^{-1}). \tag{14.10}\] Combining statements (14.9) and (14.10), we verify the claimed relationship (14.8). Therefore, we have \[ F = \frac{\boldsymbol{\widehat \beta}_S^T \{[(\boldsymbol X^T \boldsymbol X)^{-1}]_{S,S}\}^{-1} \boldsymbol{\widehat \beta}_S / |S|}{\widehat \sigma^2}. \] Recalling that \(\boldsymbol{\widehat \Omega} = \widehat \sigma^2 (\boldsymbol X^T \boldsymbol X)^{-1}\), we find that \[ F = \boldsymbol{\widehat \beta}_S^T \left(\boldsymbol{\widehat \Omega}_{S,S}\right)^{-1} \boldsymbol{\widehat \beta}_S / |S|, \] as desired.
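
As a quick numerical illustration of the limit in Proposition 14.1, the following sketch (assuming SciPy; the values of \(\alpha\) and \(|S|\) are illustrative) compares the \(F_{|S|, n-p}\) quantile with the scaled \(\chi^2_{|S|}\) quantile used by the Wald test as \(n - p\) grows:

```python
from scipy.stats import f, chi2

alpha, k = 0.05, 3   # k plays the role of |S|
for n_minus_p in [10, 100, 10_000]:
    f_quantile = f.ppf(1 - alpha, k, n_minus_p)        # F-test critical value
    chi2_quantile = chi2.ppf(1 - alpha, k) / k         # Wald critical value / |S|
    print(n_minus_p, round(f_quantile, 4), round(chi2_quantile, 4))
```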