5  Collinearity and adjustment

See also Agresti 2.2.4, 2.5.6, 2.5.7, 4.6.5

An important part of linear regression analysis is the dependence of the least squares coefficient for a predictor (\(x_j\)) on which other predictors are in the model \((\boldsymbol x_{\text{-}j})\). This relationship is dictated by the extent to which \(\boldsymbol x_{*j}\) is correlated with \(\boldsymbol X_{*,\text{-}j}\). To explore this phenomenon, it will be useful to compare two different regressions: the regression of \(\boldsymbol y\) on all of \(\boldsymbol X\), and the regression of \(\boldsymbol y\) on \(\boldsymbol x_{*j}\) alone.

5.1 Least squares estimates in the orthogonal case

The simplest case to analyze is when \(\boldsymbol x_{*j}\) is orthogonal to \(\boldsymbol X_{*,\text{-}j}\) in the sense that

\[ \boldsymbol x_{*j}^T \boldsymbol X_{*,\text{-}j} = \boldsymbol{0}. \tag{5.1}\]

In this case, we can derive the least squares coefficient vector \(\boldsymbol{\widehat{\beta}} = (\widehat{\beta}_{j|\text{-}j}, \boldsymbol{\widehat{\beta}}_{\text{-}j|j})\) in the regression of \(\boldsymbol y\) on \(\boldsymbol X\):

\[ \begin{split} \begin{pmatrix} \widehat{\beta}_{j|\text{-}j} \\ \boldsymbol{\widehat{\beta}}_{\text{-}j|j} \end{pmatrix} &= (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y} \\ &= \begin{pmatrix} \boldsymbol x_{*j}^T \boldsymbol x_{*j} & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{X}_{*,\text{-}j}^T \boldsymbol{X}_{*,\text{-}j} \end{pmatrix}^{-1} \begin{pmatrix} \boldsymbol x_{*j}^T \\ \boldsymbol{X}_{*,\text{-}j}^T \end{pmatrix} \boldsymbol{y} \\ &= \begin{pmatrix} (\boldsymbol x_{*j}^T \boldsymbol x_{*j})^{-1} \boldsymbol x_{*j}^T \boldsymbol{y} \\ (\boldsymbol{X}_{*,\text{-}j}^T \boldsymbol{X}_{*,\text{-}j})^{-1} \boldsymbol{X}_{*,\text{-}j}^T \boldsymbol{y} \end{pmatrix} \\ &= \begin{pmatrix} \widehat{\beta}_{j} \\ \boldsymbol{\widehat{\beta}}_{\text{-}j} \end{pmatrix}. \end{split} \tag{5.2}\]

Therefore, the least squares coefficient of \(x_j\) is the same regardless of whether the other predictors are included in the regression, i.e.

\[ \widehat{\beta}_{j|\text{-}j} = \widehat{\beta}_{j}. \tag{5.3}\]
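As a quick numerical check of (5.3), here is a minimal NumPy sketch (simulated data; all variable names are illustrative and not part of the notes) that constructs a column exactly orthogonal to the remaining columns and compares the joint and univariate fits:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Other predictors X_{*,-j}, including an intercept column
X_rest = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Make x_j exactly orthogonal to the columns of X_{*,-j} by residualizing a random vector
z = rng.normal(size=n)
x_j = z - X_rest @ np.linalg.lstsq(X_rest, z, rcond=None)[0]

# Response depends on x_j and on the other predictors
y = 2.0 * x_j + X_rest @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

# Coefficient of x_j in the joint regression versus the regression on x_j alone
X_full = np.column_stack([x_j, X_rest])
beta_joint = np.linalg.lstsq(X_full, y, rcond=None)[0][0]
beta_alone = (x_j @ y) / (x_j @ x_j)

print(beta_joint, beta_alone)  # agree up to floating point error
```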

5.2 Least squares estimates via orthogonalization

The orthogonality assumption (5.1) is almost never satisfied in practice. Usually, \(\boldsymbol{x}_{*j}\) has a nonzero projection \(\boldsymbol{X}_{*,\text{-}j}\boldsymbol{\widehat{\gamma}}\) onto \(C(\boldsymbol{X}_{*,\text{-}j})\):

\[ \boldsymbol{x}_{*j} = \boldsymbol{X}_{*,\text{-}j}\boldsymbol{\widehat{\gamma}} + \boldsymbol{x}^{\perp}_{*j}, \]

where \(\boldsymbol{x}^{\perp}_{*j}\) is the residual from regressing \(\boldsymbol{x}_{*j}\) onto \(\boldsymbol{X}_{*,\text{-}j}\) and is therefore orthogonal to \(C(\boldsymbol{X}_{*,\text{-}j})\). In other words, \(\boldsymbol{x}^{\perp}_{*j}\) is the projection of \(\boldsymbol{x}_{*j}\) onto the orthogonal complement of \(C(\boldsymbol{X}_{*,\text{-}j})\). Another way of framing this relationship is that \(\boldsymbol{x}^{\perp}_{*j}\) is the result of adjusting \(\boldsymbol{x}_{*j}\) for \(\boldsymbol{X}_{*,\text{-}j}\).

With this decomposition, let us change basis from \((\boldsymbol{x}_{*j}, \boldsymbol{X}_{*,\text{-}j})\) to \((\boldsymbol{x}^{\perp}_{*j}, \boldsymbol{X}_{*,\text{-}j})\) by the process explored in Homework 1 Question 1. Let us write:

\[ \begin{aligned} \boldsymbol{y} &= \boldsymbol{x}_{*j} \beta_{j|\text{-}j} + \boldsymbol{X}_{*,\text{-}j}\boldsymbol{\beta}_{\text{-}j|j} + \boldsymbol{\epsilon} \\ &\Longleftrightarrow \ \boldsymbol{y} = (\boldsymbol{X}_{*,\text{-}j}\boldsymbol{\widehat{\gamma}} + \boldsymbol{x}^{\perp}_{*j})\beta_{j|\text{-}j} + \boldsymbol{X}_{*,\text{-}j}\boldsymbol{\beta}_{\text{-}j|j} + \boldsymbol{\epsilon} \\ &\Longleftrightarrow \ \boldsymbol{y} = \boldsymbol{x}^{\perp}_{*j}\beta_{j|\text{-}j} + \boldsymbol{X}_{*,\text{-}j}\boldsymbol{\beta}'_{\text{-}j|j} + \boldsymbol{\epsilon}. \end{aligned} \]

where \(\boldsymbol{\beta}'_{\text{-}j|j} \equiv \boldsymbol{\beta}_{\text{-}j|j} + \boldsymbol{\widehat{\gamma}}\,\beta_{j|\text{-}j}\). What this means is that \(\widehat{\beta}_{j|\text{-}j}\), the least squares coefficient of \(\boldsymbol{x}_{*j}\) in the regression of \(\boldsymbol{y}\) on \((\boldsymbol{x}_{*j}, \boldsymbol{X}_{*,\text{-}j})\), is also the least squares coefficient of \(\boldsymbol{x}^{\perp}_{*j}\) in the regression of \(\boldsymbol{y}\) on \((\boldsymbol{x}^{\perp}_{*j}, \boldsymbol{X}_{*,\text{-}j})\). However, since \(\boldsymbol{x}^{\perp}_{*j}\) is orthogonal to \(\boldsymbol{X}_{*,\text{-}j}\) by construction, we can use the result (5.2) to conclude that:

\[ \widehat{\beta}_{j|\text{-}j} \text{ is the least squares coefficient of } \boldsymbol{x}^{\perp}_{*j} \text{ in the } \textit{univariate} \text{ regression of } \boldsymbol{y} \text{ on } \boldsymbol{x}^{\perp}_{*j}. \]

We can solve this univariate regression explicitly to obtain:

\[ \widehat{\beta}_{j|\text{-}j} = \frac{(\boldsymbol{x}^{\perp}_{*j})^T \boldsymbol{y}}{\|\boldsymbol{x}^{\perp}_{*j}\|^2}. \tag{5.4}\]
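To make (5.4) concrete, here is a minimal NumPy sketch (simulated data; variable names are illustrative) that adjusts \(\boldsymbol{x}_{*j}\) for the other columns and checks that the resulting univariate coefficient matches the full least squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Other predictors X_{*,-j}, including an intercept column
X_rest = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# A predictor x_j that is correlated with the other predictors
x_j = 0.8 * X_rest[:, 1] - 0.5 * X_rest[:, 2] + rng.normal(size=n)

# Response depends on x_j and on the other predictors
y = 1.5 * x_j + X_rest @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=n)

# Adjust x_j for X_{*,-j}: x_j^perp is the residual from regressing x_j on X_{*,-j}
gamma_hat = np.linalg.lstsq(X_rest, x_j, rcond=None)[0]
x_j_perp = x_j - X_rest @ gamma_hat

# Coefficient of x_j in the full regression of y on (x_j, X_{*,-j})
X_full = np.column_stack([x_j, X_rest])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0][0]

# Equation (5.4): univariate regression of y on the adjusted predictor x_j^perp
beta_fwl = (x_j_perp @ y) / (x_j_perp @ x_j_perp)

print(beta_full, beta_fwl)  # agree up to floating point error
```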

5.3 Adjustment and partial correlation

Equivalently, letting \(\boldsymbol{\widehat{\beta}}_{\text{-}j}\) be the least squares estimate in the regression of \(\boldsymbol{y}\) on \(\boldsymbol{X}_{*,\text{-}j}\) (note that this is not the same as \(\boldsymbol{\widehat{\beta}}_{\text{-}j|j}\)) and letting \(\boldsymbol y^\perp \equiv \boldsymbol{y} - \boldsymbol{X}_{*,\text{-}j}\boldsymbol{\widehat{\beta}}_{\text{-}j}\) denote the corresponding residual, we can write the following (the first equality holds because \(\boldsymbol{x}^{\perp}_{*j}\) is orthogonal to \(C(\boldsymbol{X}_{*,\text{-}j})\), so subtracting \(\boldsymbol{X}_{*,\text{-}j}\boldsymbol{\widehat{\beta}}_{\text{-}j}\) from \(\boldsymbol{y}\) in (5.4) does not change the numerator):

\[ \widehat{\beta}_{j|\text{-}j} = \frac{(\boldsymbol{x}^{\perp}_{*j})^T(\boldsymbol{y} - \boldsymbol{X}_{*,\text{-}j}\boldsymbol{\widehat{\beta}}_{\text{-}j})}{\|\boldsymbol{x}^{\perp}_{*j}\|^2} \equiv \frac{(\boldsymbol{x}^{\perp}_{*j})^T \boldsymbol y^\perp}{\|\boldsymbol{x}^{\perp}_{*j}\|^2}. \]

We can interpret this result as follows:

Theorem 5.1 The linear regression coefficient \(\widehat{\beta}_{j|\text{-}j}\) results from first adjusting \(\boldsymbol{y}\) and \(\boldsymbol{x}_{*j}\) for the effects of all other variables, and then regressing the residuals from \(\boldsymbol{y}\) onto the residuals from \(\boldsymbol{x}_{*j}\).

In this sense, the least squares coefficient for a predictor in a multiple linear regression reflects the effect of the predictor on the response after controlling for the effects of all other predictors.

Econometricians call this the Frisch-Waugh-Lovell (FWL) theorem, acknowledging economists Ragnar Frisch and Frederick V. Waugh, who first derived the result in 1933, and Michael C. Lovell, who later rediscovered and extended it in 1963. In the statistical literature, this fact was known at least as early as 1907, when Yule documented it in his paper “On the Theory of Correlation for any Number of Variables, treated by a New System of Notation.”

A related quantity is the partial correlation between \(\boldsymbol{x}_{*j}\) and \(\boldsymbol{y}\) after controlling for \(\boldsymbol{X}_{*,\text{-}j}\), defined as the empirical correlation between \(\boldsymbol{x}_{*j}^\perp\) and \(\boldsymbol{y}^\perp\): \[ \rho(\boldsymbol{x}_{*j}, \boldsymbol{y}|\boldsymbol{X}_{*,\text{-}j}) \equiv \frac{(\boldsymbol{x}_{*j}^\perp)^T(\boldsymbol{y}^\perp)}{\|\boldsymbol{x}_{*j}^\perp\| \|\boldsymbol{y}^\perp\|}. \] We can then connect the least squares coefficient \(\widehat{\beta}_{j|\text{-}j}\) to this partial correlation: \[ \widehat{\beta}_{j|\text{-}j} = \frac{(\boldsymbol{x}^{\perp}_{*j})^T \boldsymbol y^\perp}{\|\boldsymbol{x}^{\perp}_{*j}\|^2} = \frac{\|\boldsymbol{y}^\perp\|}{\|\boldsymbol{x}_{*j}^\perp\|}\rho(\boldsymbol{x}_{*j}, \boldsymbol{y}|\boldsymbol{X}_{*,\text{-}j}), \]

in a similar spirit to equation (4.5).
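The following sketch (again simulated data with illustrative names) checks Theorem 5.1 and the partial correlation identity: residualizing both \(\boldsymbol{y}\) and \(\boldsymbol{x}_{*j}\) against \(\boldsymbol{X}_{*,\text{-}j}\) and regressing residuals on residuals reproduces \(\widehat{\beta}_{j|\text{-}j}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

X_rest = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])           # X_{*,-j}
x_j = 0.8 * X_rest[:, 1] + rng.normal(size=n)                             # correlated predictor
y = 1.5 * x_j + X_rest @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=n)  # response

def residualize(v, X):
    """Residual of v after least squares projection onto the columns of X."""
    return v - X @ np.linalg.lstsq(X, v, rcond=None)[0]

x_perp = residualize(x_j, X_rest)   # x_j adjusted for X_{*,-j}
y_perp = residualize(y, X_rest)     # y adjusted for X_{*,-j}

# Theorem 5.1: regress the y-residuals on the x_j-residuals
beta_adj = (x_perp @ y_perp) / (x_perp @ x_perp)

# Partial correlation and its link to the coefficient
partial_corr = (x_perp @ y_perp) / (np.linalg.norm(x_perp) * np.linalg.norm(y_perp))
beta_via_corr = np.linalg.norm(y_perp) / np.linalg.norm(x_perp) * partial_corr

# Coefficient of x_j from the full regression, for comparison
beta_full = np.linalg.lstsq(np.column_stack([x_j, X_rest]), y, rcond=None)[0][0]

print(beta_full, beta_adj, beta_via_corr)  # all three agree up to floating point error
```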

5.4 Effects of collinearity

Collinearity between a predictor \(x_j\) and the other predictors tends to make the estimate \(\widehat{\beta}_{j|\text{-}j}\) unstable. Intuitively, this makes sense because it becomes harder to distinguish between the effects of predictor \(x_j\) and those of the other predictors on the response. To find the variance of \(\widehat{\beta}_{j|\text{-}j}\) for a model matrix \(\boldsymbol{X}\), we could in principle use the formula (3.3). However, this formula involves the inverse of the matrix \(\boldsymbol{X}^T \boldsymbol{X}\), which is hard to reason about. Instead, we can employ the formula (5.4) to calculate directly that:

\[ \text{Var}[\widehat{\beta}_{j|\text{-}j}] = \frac{\sigma^2}{\|\boldsymbol{x}_{*j}^\perp\|^2}. \tag{5.5}\]

We see that the variance of \(\widehat{\beta}_{j|\text{-}j}\) is inversely proportional to \(\|\boldsymbol{x}_{*j}^\perp\|^2\): the greater the collinearity, the less of \(\boldsymbol{x}_{*j}\) is left over after adjusting for \(\boldsymbol{X}_{*,\text{-}j}\), and therefore the greater the variance of \(\widehat{\beta}_{j|\text{-}j}\). To quantify the effect of this adjustment, suppose there were no predictors other than the intercept term. Then, we would have:

\[ \text{Var}[\widehat{\beta}_{j|1}] = \frac{\sigma^2}{\|\boldsymbol{x}_{*j}-\bar{x}_j \boldsymbol{1}_n\|^2}. \]

Therefore, we can rewrite the variance (5.5) as:

\[ \begin{split} \text{Var}[\widehat{\beta}_{j|\text{-}j}] &= \frac{\|\boldsymbol{x}_{*j}-\bar{x}_j \boldsymbol{1}_n\|^2}{\|\boldsymbol{x}_{*j}-\boldsymbol{X}_{*,\text{-}j}\boldsymbol{\widehat{\gamma}}\|^2} \cdot \text{Var}[\widehat{\beta}_{j|1}] \\ &= \frac{1}{1-R_j^2} \cdot \text{Var}[\widehat{\beta}_{j|1}] \equiv \text{VIF}_j \cdot \text{Var}[\widehat{\beta}_{j|1}], \end{split} \tag{5.6}\]

where \(R_j^2\) is the \(R^2\) value when regressing \(\boldsymbol{x}_{*j}\) on \(\boldsymbol{X}_{*,\text{-}j}\) and VIF stands for variance inflation factor. The higher \(R_j^2\), the more of the variance in \(\boldsymbol{x}_{*j}\) is explained by the other predictors, and the higher the variance of \(\widehat{\beta}_{j|\text{-}j}\).
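As an illustration of (5.6), here is a small Monte Carlo sketch (simulated design and made-up coefficients): the ratio between the sampling variance of \(\widehat{\beta}_{j|\text{-}j}\) in the full model and that of \(\widehat{\beta}_{j|1}\) in the intercept-only model should be approximately \(\text{VIF}_j\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 200, 1.0, 2000

# Intercept plus two other predictors; x_j is strongly collinear with them
X_rest = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
x_j = 2.0 * X_rest[:, 1] + X_rest[:, 2] + 0.5 * rng.normal(size=n)
X_full = np.column_stack([x_j, X_rest])

# R_j^2 from regressing x_j on X_{*,-j}, and the implied variance inflation factor
x_perp = x_j - X_rest @ np.linalg.lstsq(X_rest, x_j, rcond=None)[0]
x_centered = x_j - x_j.mean()
R2_j = 1 - (x_perp @ x_perp) / (x_centered @ x_centered)
vif_j = 1 / (1 - R2_j)

# Monte Carlo: variance of the x_j coefficient in the full model vs. the intercept-only model
X_small = np.column_stack([x_j, np.ones(n)])
betas_full, betas_small = [], []
for _ in range(reps):
    y = X_full @ np.array([1.0, 0.5, -1.0, 2.0]) + sigma * rng.normal(size=n)
    betas_full.append(np.linalg.lstsq(X_full, y, rcond=None)[0][0])
    betas_small.append(np.linalg.lstsq(X_small, y, rcond=None)[0][0])

print(np.var(betas_full) / np.var(betas_small), vif_j)  # ratio of variances is approximately VIF_j
```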

5.5 Application: Average treatment effect estimation in causal inference

Suppose we’d like to study the effect of an exposure or treatment (e.g. taking a blood pressure medication) on a response \(y\) (e.g. blood pressure). In the Neyman-Rubin causal model, for a given individual \(i\) we denote by \(y_i(1)\) and \(y_i(0)\) the outcomes that would have occurred had the individual received the treatment and the control, respectively. These are called potential outcomes. Let \(t_i \in \{0,1\}\) indicate whether the \(i\)th individual actually received treatment or control. Therefore, the observed outcome is¹ \[ y_i^{\text{obs}} = y_i(t_i). \tag{5.7}\] Based on the data \(\{(t_i, y_i^{\text{obs}})\}_{i = 1, \dots, n}\), the most basic goal is to estimate the

\[ \textit{average treatment effect} \ \tau \equiv \mathbb{E}[y(1) - y(0)], \]

where averaging is done over the population of individuals (often called units in causal inference). Of course, we do not observe both \(y_i(1)\) and \(y_i(0)\) for any unit \(i\). Additionally, in observational studies we usually have confounding variables \(w_2, \dots, w_{p-1}\): variables that influence both the treatment assignment and the response (e.g. degree of health-seeking activity). It is important to control for these confounders in order to obtain an unbiased estimate of the treatment effect. Suppose the following linear model holds:

\[ y(t) = \beta_0 + \beta_1 t + \beta_2 w_2 + \cdots + \beta_{p-1} w_{p-1} + \epsilon \quad \text{for } t \in \{0, 1\}, \quad \text{where} \ \epsilon \perp\!\!\!\perp t. \]

This assumption can be broken down into the following statements:

  • the treatment effect is constant across units;
  • the response is a linear function of the treatment and observed confounders;
  • there is no unmeasured confounding.

Under these assumptions, we find that

\[ \tau \equiv \mathbb{E}[y(1) - y(0)] = \beta_1. \]

Using the relationship (5.7), we find that

\[ y^{\text{obs}} = \beta_0 + \beta_1 t + \beta_2 w_2 + \cdots + \beta_{p-1} w_{p-1} + \epsilon \quad \text{for } t \in \{0, 1\}. \]

In this case, the average treatment effect \(\tau\) is identified as the coefficient \(\beta_1\) in the above regression for the observed data. In other words, the causal parameter coincides with a parameter of the statistical model for the observed data. Therefore, the least squares estimate \(\widehat{\beta}_1\) is an unbiased estimate of the average treatment effect.

In this context, we can interpret Theorem 5.1 as follows: To get an estimate of the causal effect of an exposure on an outcome in the presence of confounders, first adjust both exposure and outcome for the confounders, and then estimate the effect of the adjusted exposure on the adjusted outcome via a univariate linear model. This is the essence of covariate adjustment in causal inference.
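As an illustration (a simulation sketch with made-up parameters, not a real study or a prescribed analysis), we can generate confounded data and compare the naive difference in means with the regression-adjusted estimate of the treatment effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 5000, 1.0  # true average treatment effect

# A confounder w that drives both treatment assignment and the outcome
w = rng.normal(size=n)
t = (rng.uniform(size=n) < 1 / (1 + np.exp(-2 * w))).astype(float)  # treated more often when w is large
y = 0.5 + tau * t + 2.0 * w + rng.normal(size=n)                    # outcome depends on t and w

# Naive difference in means: biased because w differs between treatment groups
naive = y[t == 1].mean() - y[t == 0].mean()

# Covariate adjustment: coefficient of t in the regression of y on (1, t, w)
X = np.column_stack([np.ones(n), t, w])
adjusted = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(naive, adjusted)  # naive estimate is biased upward; adjusted estimate is close to tau = 1
```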

Note

Causal inference is a vast field, which lies mostly beyond the scope of STAT 9610; see STAT 9210 instead.


  1. This requires the stable unit treatment value assumption (SUTVA), which prevents issues like interference, where the treatment of individual \(i\) can be affected by the assigned treatments of other individuals.↩︎