22  Inference in GLMs

22.1 Preliminaries

22.1.1 Inferential goals

There are two types of inferential goals: hypothesis testing and confidence interval/region construction.

22.1.1.1 Hypothesis testing

  1. Single coefficient: \(H_0: \beta_j = \beta_j^0\) versus \(H_1: \beta_j \neq \beta_j^0\) for some \(\beta_j^0 \in \mathbb{R}\).
  2. Group of coefficients: \(H_0: \boldsymbol{\beta}_S = \boldsymbol{\beta}_S^0\) versus \(H_1: \boldsymbol{\beta}_S \neq \boldsymbol{\beta}_S^0\) for some \(S \subset \{0,\dots,p-1\}\) and some \(\boldsymbol{\beta}_S^0 \in \mathbb{R}^{|S|}\).
  3. Goodness of fit: The goodness of fit null hypothesis is that the GLM (20.1) is correctly specified. Consider the saturated model: \[ y_i \overset{\text{ind}} \sim \text{EDM}(\mu_i, \phi_0/w_i) \quad \text{for} \quad i = 1,\dots,n. \tag{22.1}\] Let \[ \mathcal{M}^{\text{GLM}} \equiv \{\boldsymbol{\mu}: g(\mu_i) = \boldsymbol{x}_{i*}^T \boldsymbol{\beta} + o_i \text{ for some } \boldsymbol{\beta} \in \mathbb{R}^p\} \] be the set of mean vectors consistent with the GLM. Then, the goodness of fit testing problem is \(H_0: \boldsymbol{\mu} \in \mathcal{M}^{\text{GLM}}\) versus \(H_1: \boldsymbol{\mu} \notin \mathcal{M}^{\text{GLM}}\).

22.1.1.2 Confidence interval/region construction

  1. Confidence interval for a single coefficient: Here, the goal is to produce a confidence interval \(\text{CI}(\beta_j)\) for a coefficient \(\beta_j\).
  2. Confidence region for a group of coefficients: Here, the goal is to produce a confidence region \(\text{CR}(\boldsymbol{\beta}_S)\) for a group of coefficients \(\boldsymbol{\beta}_S\).
  3. Confidence interval for a fitted value: In GLMs, fitted values can either be considered for parameters on the linear scale (\(\eta_i = \boldsymbol{x}_{i*}^T \boldsymbol{\beta} + o_i\)) or the mean scale (\(\mu_i = g^{-1}(\boldsymbol{x}_{i*}^T \boldsymbol{\beta} + o_i)\)). The goal, then, is to produce confidence intervals \(\text{CI}(\eta_i)\) or \(\text{CI}(\mu_i)\) for \(\eta_i\) or \(\mu_i\), respectively.

22.1.2 Inferential tools

Inference in GLMs is based on asymptotic likelihood theory, which can rest on either large-sample asymptotics or small-dispersion asymptotics. Large-sample asymptotics apply when the number of parameters is held fixed as the sample size grows; small-dispersion asymptotics apply when the dispersion is small, regardless of the sample size. Testing and estimating the coefficients of the GLM (20.1), whose number of parameters is fixed, can therefore be justified by large-sample asymptotics. Goodness-of-fit testing cannot: it refers to the saturated model (22.1), whose number of parameters grows with \(n\), so it must be justified by small-dispersion asymptotics instead.

Hypothesis tests (and, by inversion, confidence intervals) can be constructed in three asymptotically equivalent ways: Wald tests, likelihood ratio tests (LRT), and score tests. These tests can be justified using either large-sample or small-dispersion asymptotics, depending on the context. Despite their asymptotic equivalence, in finite samples, some tests may be preferable to others (though for normal linear models, these tests are equivalent in finite samples as well). See Figure 22.1.

Figure 22.1: A comparison of the three asymptotic methods for GLM inference.

22.2 Wald inference

Wald inference is based on the following asymptotic normality statement:

\[ \boldsymbol{\widehat \beta} \overset{\cdot}{\sim} N(\boldsymbol{\beta}, \boldsymbol{I}^{-1}(\boldsymbol{\beta})) = N(\boldsymbol{\beta}, \phi_0(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\beta}) \boldsymbol{X})^{-1}), \tag{22.2}\]

recalling our derivation of the Fisher information from equation (21.4). This normal approximation can be justified via the central limit theorem in the context of large-sample asymptotics or small-dispersion asymptotics. Wald inference is easy to carry out, and for this reason, it is considered the default type of inference. However, as we will see in Unit 5, it also tends to be the least accurate in small samples. Furthermore, Wald tests are usually not applied for testing goodness of fit.

22.2.1 Wald test for \(\beta_j = \beta_j^0\) (known \(\phi_0\))

Based on the Wald approximation (22.2), under the null hypothesis, we have:

\[ \begin{split} \widehat \beta_j &\overset{\cdot}{\sim} N(\beta_j^0, \phi_0[(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\beta}) \boldsymbol{X})^{-1}]_{jj}) \\ &\approx N(\beta_j^0, \phi_0[(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1}]_{jj}) \\ &\equiv N(\beta_j^0, \text{SE}(\widehat \beta_j)^2), \end{split} \]

where we have used a plug-in estimator of the variance. This leads us to the Wald \(z\)-test:

\[ \phi(\boldsymbol{X}, \boldsymbol{y}) \equiv 1\left(\left|\frac{\widehat \beta_j - \beta_j^0}{\text{SE}(\widehat \beta_j)}\right| > z_{1-\alpha/2}\right). \]

Since a one-dimensional parameter is being tested, we can make the test one-sided if desired.
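To make this concrete, here is a minimal numerical sketch in Python; the simulated Poisson data and the use of the statsmodels and scipy libraries are illustrative assumptions, not part of the development above.

```python
# Wald z-test for H_0: beta_1 = 0 in a Poisson GLM (log link, phi_0 = 1, w_i = 1).
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))       # simulated responses

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
beta_hat = fit.params

# Plug-in Fisher information X^T W(beta_hat) X; W = diag(mu_hat) for Poisson/log link.
mu_hat = np.exp(X @ beta_hat)
info = X.T @ (mu_hat[:, None] * X)
se = np.sqrt(np.linalg.inv(info)[1, 1])                  # SE(beta_hat_1), phi_0 = 1

z = (beta_hat[1] - 0.0) / se                             # beta_1^0 = 0
p_value = 2 * stats.norm.sf(abs(z))
print(z, p_value)   # should closely match fit.tvalues[1], fit.pvalues[1]
```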

22.2.2 Wald test for \(\boldsymbol{\beta}_S = \boldsymbol{\beta}_S^0\) (known \(\phi_0\))

Extending the reasoning above, we have under the null hypothesis that:

\[ \boldsymbol{\widehat \beta}_S \overset{\cdot}{\sim} N(\boldsymbol{\beta}_S^0, \phi_0[(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\beta}) \boldsymbol{X})^{-1}]_{S,S}) \approx N(\boldsymbol{\beta}_S^0, \phi_0[(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1}]_{S,S}), \]

and therefore:

\[ \frac{1}{\phi_0} (\boldsymbol{\widehat \beta}_S - \boldsymbol{\beta}_S^0)^T \left([(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1}]_{S,S}\right)^{-1}(\boldsymbol{\widehat \beta}_S - \boldsymbol{\beta}_S^0) \overset{\cdot}{\sim} \chi^2_{|S|}. \]

Hence, we have the Wald \(\chi^2\) test:

\[ \begin{split} &\phi(\boldsymbol{X}, \boldsymbol{y}) \\ &\equiv 1\left(\frac{1}{\phi_0} (\boldsymbol{\widehat \beta}_S - \boldsymbol{\beta}_S^0)^T \left([(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1}]_{S,S}\right)^{-1}(\boldsymbol{\widehat \beta}_S - \boldsymbol{\beta}_S^0) > \chi^2_{|S|}(1-\alpha)\right). \end{split} \]
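Below is a sketch of this test for a hypothetical Poisson GLM with \(\boldsymbol{\beta}_S^0 = \boldsymbol 0\), \(\phi_0 = 1\), and \(S\) indexing the last two columns of the model matrix; the simulated data and library calls (statsmodels, scipy) are assumptions for illustration.

```python
# Wald chi-square test for H_0: beta_S = 0 with S = {2, 3} in a Poisson GLM.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, 0.0, 0.0])))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = np.exp(X @ fit.params)
cov = np.linalg.inv(X.T @ (mu_hat[:, None] * X))      # phi_0 (X^T W X)^{-1}, phi_0 = 1

S = [2, 3]
diff = fit.params[S]                                   # beta_hat_S - beta_S^0, with beta_S^0 = 0
T2 = diff @ np.linalg.solve(cov[np.ix_(S, S)], diff)   # Wald quadratic form
p_value = stats.chi2.sf(T2, df=len(S))
print(T2, p_value)
```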

22.2.3 Wald confidence interval for \(\beta_j\) (known \(\phi_0\))

Inverting the Wald test for \(\beta_j\), we get a Wald confidence interval:

\[ \text{CI}(\beta_j) \equiv \widehat \beta_j \pm z_{1-\alpha/2} \cdot \text{SE}(\widehat \beta_j), \tag{22.3}\] where \[ \text{SE}(\widehat \beta_j) \equiv \sqrt{\phi_0[(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1}]_{jj}}. \]
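As a quick check, the interval (22.3) can be compared against the Wald interval reported by a fitting routine; the Poisson example and statsmodels usage below are illustrative assumptions.

```python
# 95% Wald confidence interval (22.3) for beta_1 in a Poisson GLM (phi_0 = 1).
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(150), rng.normal(size=150)])
y = rng.poisson(np.exp(X @ np.array([0.2, 0.4])))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
se = fit.bse[1]                                   # sqrt(phi_0 [(X^T W X)^{-1}]_{11})
z = stats.norm.ppf(0.975)
ci_manual = (fit.params[1] - z * se, fit.params[1] + z * se)
print(ci_manual, fit.conf_int()[1])               # the two intervals should agree
```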

22.2.4 Wald confidence region for \(\boldsymbol{\beta}_S\) (known \(\phi_0\))

By inverting the test of \(H_0: \boldsymbol{\beta}_S = \boldsymbol{\beta}_S^0\), we get the Wald confidence region:

\[ \small \begin{split} \text{CR}(\boldsymbol{\beta}_S) \equiv \left\{\boldsymbol{\beta}_S: \frac{1}{\phi_0} (\boldsymbol{\widehat \beta}_S - \boldsymbol{\beta}_S)^T \left([(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1}]_{S,S}\right)^{-1}(\boldsymbol{\widehat \beta}_S - \boldsymbol{\beta}_S) \leq \chi^2_{|S|}(1-\alpha)\right\}. \end{split} \]

If \(S = \{0, 1, \dots, p-1\}\), we are left with:

\[ \text{CR}(\boldsymbol{\beta}_S) \equiv \left\{\boldsymbol{\beta}: \frac{1}{\phi_0} (\boldsymbol{\widehat \beta} - \boldsymbol{\beta})^T \boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X} (\boldsymbol{\widehat \beta} - \boldsymbol{\beta}) \leq \chi^2_{p}(1-\alpha)\right\}. \]

22.2.5 Wald confidence intervals for \(\eta_i\) and \(\mu_i\) (known \(\phi_0\))

Given the Wald approximation (22.2), we have:

\[ \widehat \eta_i \equiv o_i + \boldsymbol{x}_{i*}^T \boldsymbol{\widehat \beta} \overset{\cdot}{\sim} N(\eta_i, \phi_0 \cdot \boldsymbol{x}_{i*}^T (\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1} \boldsymbol{x}_{i*}) \equiv N(\eta_i, \text{SE}(\hat \eta_i)^2). \]

Hence, the Wald interval for \(\eta_i\) is:

\[ \text{CI}(\eta_i) \equiv o_i + \boldsymbol{x}_{i*}^T \boldsymbol{\widehat\beta} \pm z_{1-\alpha/2} \cdot \text{SE}(\hat \eta_i), \] where \[ \text{SE}(\hat \eta_i) \equiv \sqrt{\phi_0 \boldsymbol{x}_{i*}^T (\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1} \boldsymbol{x}_{i*}}. \]

A confidence interval for \(\mu_i \equiv \mathbb{E}_{\boldsymbol{\beta}}[y_i] = g^{-1}(\eta_i)\) can be obtained by applying the monotonic function \(g^{-1}\) to the endpoints of the confidence interval for \(\eta_i\). Note that the resulting confidence interval may be asymmetric. We can get a symmetric interval by applying the delta method, but this interval would be less accurate because it involves the delta method approximation in addition to the Wald approximation.
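The following sketch computes both intervals for a single observation in a hypothetical Poisson GLM with log link, no offset, and \(\phi_0 = 1\); the simulated data and library calls are assumptions.

```python
# Wald intervals for eta_i and mu_i = exp(eta_i) in a Poisson GLM.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = np.exp(X @ fit.params)
cov = np.linalg.inv(X.T @ (mu_hat[:, None] * X))    # phi_0 (X^T W X)^{-1}, phi_0 = 1

i = 0                                               # observation of interest
x_i = X[i]
eta_hat = x_i @ fit.params
se_eta = np.sqrt(x_i @ cov @ x_i)                   # SE(eta_hat_i)
z = stats.norm.ppf(0.975)

ci_eta = (eta_hat - z * se_eta, eta_hat + z * se_eta)
ci_mu = (np.exp(ci_eta[0]), np.exp(ci_eta[1]))      # monotone map of the endpoints
print(ci_eta, ci_mu)
```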

22.2.6 Wald inference when \(\phi_0\) is unknown

When \(\phi_0\) is unknown, we need to plug in an estimate \(\widetilde \phi_0\) (e.g. the deviance-based or Pearson-based estimate). Now our standard errors are \[ \text{SE}(\widehat \beta_j) \equiv \sqrt{\widetilde \phi_0 \cdot [(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1}]_{jj}}, \] and our test statistic for \(H_0: \beta_j = \beta_j^0\) is \[ \frac{\widehat \beta_j - \beta_j^0}{\sqrt{\widetilde \phi_0}\sqrt{[(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1}]_{jj}}}. \]

Unlike in linear regression, \(\boldsymbol{\widehat \beta}\) and \(\widetilde \phi_0\) are not independent in general. They are, however, asymptotically independent, so the above statistic is approximately distributed as \(t_{n-p}\). Hence, the test for \(H_0: \beta_j = \beta_j^0\) is:

\[ \phi(\boldsymbol{X}, \boldsymbol{y}) \equiv 1\left(\left|\frac{\widehat \beta_j - \beta_j^0}{\text{SE}(\widehat \beta_j)}\right| > t_{n-p}(1-\alpha/2)\right). \]

Likewise, we would replace \(z_{1-\alpha/2}\) with \(t_{n-p}(1-\alpha/2)\) in all tests and confidence intervals concerning univariate quantities. For multivariate quantities, we get approximate \(F\) distributions instead of approximate \(\chi^2\) distributions. For example:

\[ \frac{\frac{1}{|S|}(\boldsymbol{\widehat \beta}_S - \boldsymbol{\beta}_S^0)^T \left([(\boldsymbol{X}^T \boldsymbol{W}(\boldsymbol{\widehat \beta}) \boldsymbol{X})^{-1}]_{S,S}\right)^{-1}(\boldsymbol{\widehat \beta}_S - \boldsymbol{\beta}_S^0)}{\widetilde \phi_0} \overset{\cdot}{\sim} F_{|S|, n-p}. \]
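Here is a sketch of Wald inference with an unknown dispersion, using a hypothetical Gamma GLM with log link and the Pearson estimate \(\widetilde \phi_0\); the simulated data, the statsmodels scale="X2" option, and the capitalized Log link spelling (recent statsmodels versions) are all assumptions.

```python
# Wald t-test for beta_1 in a Gamma GLM with Pearson-estimated dispersion.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu = np.exp(X @ np.array([1.0, 0.5]))
y = rng.gamma(shape=3.0, scale=mu / 3.0)             # Gamma responses with phi_0 = 1/3

fam = sm.families.Gamma(link=sm.families.links.Log())
fit = sm.GLM(y, X, family=fam).fit(scale="X2")       # Pearson estimate of phi_0
phi_tilde = fit.scale

t_stat = fit.params[1] / fit.bse[1]                  # H_0: beta_1 = 0; bse already uses phi_tilde
p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)      # t_{n-p} reference distribution
print(phi_tilde, t_stat, p_value)
```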

22.3 Likelihood ratio inference

22.3.1 Testing one or more coefficients (\(\phi_0\) known)

Let \(\ell(\boldsymbol{y}, \boldsymbol{\mu}) = -\frac{D(\boldsymbol{y}, \boldsymbol{\mu})}{2\phi_0} + C\) be the GLM log-likelihood (recall equation (21.6)). Let \(H_0: \boldsymbol{\beta}_S = \boldsymbol{\beta}_S^0\) be a null hypothesis about some subset of variables \(S \subset \{0, 1, \dots, p-1\}\), and let \(\boldsymbol{\widehat{\mu}}_{\text{-}S}\) be the maximum likelihood estimate under the null hypothesis. Likelihood ratio inference is based on the following asymptotic chi-square distribution:

\[ 2(\ell(\boldsymbol{y}, \boldsymbol{\widehat{\mu}}) - \ell(\boldsymbol{y}, \boldsymbol{\widehat{\mu}}_{\text{-}S})) = \frac{D(\boldsymbol{y}, \boldsymbol{\widehat{\mu}}_{\text{-}S}) - D(\boldsymbol{y}, \boldsymbol{\widehat{\mu}})}{\phi_0} \overset{\cdot}{\sim} \chi^2_{|S|}. \tag{22.4}\]

This approximation holds either in large samples (large-sample asymptotics) or in small samples but with small dispersion (small-dispersion asymptotics). The latter has to do with the fact that under small-dispersion asymptotics,

\[ \frac{d(y_i, \mu_i)}{\phi_0/w_i} \overset{\cdot}{\sim} \chi^2_1, \]

so

\[ \frac{D(\boldsymbol{y}, \boldsymbol{\mu})}{\phi_0} = \sum_{i = 1}^n \frac{d(y_i, \mu_i)}{\phi_0/w_i} \overset{\cdot}{\sim} \chi^2_n. \]

Suppose we wish to test the null hypothesis \(H_0: \boldsymbol{\beta}_S = \boldsymbol{\beta}_S^0\). Then, based on the approximation (22.4), we can define the likelihood ratio test:

\[ \phi(\boldsymbol{X}, \boldsymbol{y}) \equiv 1\left(\frac{D(\boldsymbol{y}, \boldsymbol{\widehat{\mu}}_{\text{-}S}) - D(\boldsymbol{y}, \boldsymbol{\widehat{\mu}})}{\phi_0} > \chi^2_{|S|}(1-\alpha)\right). \]
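A minimal sketch of this test for dropping two coefficients from a hypothetical Poisson GLM (\(\phi_0 = 1\)); the simulated data and statsmodels/scipy calls are assumptions.

```python
# Likelihood ratio (deviance) test of H_0: beta_2 = beta_3 = 0 in a Poisson GLM.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, 0.0, 0.0])))

full = sm.GLM(y, X, family=sm.families.Poisson()).fit()
null = sm.GLM(y, X[:, :2], family=sm.families.Poisson()).fit()   # model without columns 2, 3

lr_stat = (null.deviance - full.deviance) / 1.0                  # phi_0 = 1
p_value = stats.chi2.sf(lr_stat, df=2)                           # |S| = 2
print(lr_stat, p_value)
```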

22.3.2 Confidence interval for a single coefficient

We can obtain a confidence interval for \(\beta_j\) by inverting the likelihood ratio test. Let \(\boldsymbol{\widehat{\mu}}_{\text{-}j}(\beta_j^0)\) be the fitted mean vector under the constraint \(\beta_j = \beta_j^0\). Then, inverting the likelihood ratio test gives us the confidence interval:

\[ \text{CI}(\beta_j) \equiv \left\{\beta_j: \frac{D(\boldsymbol{y}, \boldsymbol{\widehat{\mu}}_{\text{-}j}(\beta_j)) - D(\boldsymbol{y}, \boldsymbol{\widehat{\mu}})}{\phi_0} \leq \chi^2_{1}(1-\alpha)\right\}. \]

Likelihood ratio-based confidence intervals tend to be more accurate than Wald intervals, especially when the parameter is near the edge of the parameter space, but they require more computation because \(\boldsymbol{\widehat{\mu}}_{\text{-}j}(\beta_j)\) must be computed on a large grid of \(\beta_j\) values. If we wanted to create confidence regions for groups of parameters, this would become computationally intensive due to the curse of dimensionality.
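The grid search can be sketched as follows for a single Poisson coefficient with \(\phi_0 = 1\): each candidate value is held fixed through an offset while the remaining coefficients are refit. The simulated data and the use of statsmodels' offset argument are illustrative assumptions.

```python
# Confidence interval for beta_1 by inverting the likelihood ratio test.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))

full = sm.GLM(y, X, family=sm.families.Poisson()).fit()
cutoff = stats.chi2.ppf(0.95, df=1)

kept = []
for b in np.linspace(full.params[1] - 1.0, full.params[1] + 1.0, 401):
    # Refit with beta_1 fixed at b by moving b * x_1 into the offset.
    constrained = sm.GLM(y, X[:, [0]], family=sm.families.Poisson(),
                         offset=b * X[:, 1]).fit()
    if constrained.deviance - full.deviance <= cutoff:            # phi_0 = 1
        kept.append(b)

print(min(kept), max(kept))   # interval endpoints, up to the grid resolution
```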

22.3.3 Goodness of fit testing (\(\phi_0\) known)

For \(\phi_0\) known, we can also construct a goodness of fit test. To this end, we compare the deviances of the GLM and saturated model:

\[ \frac{D(\boldsymbol{y}, \boldsymbol{\widehat{\mu}}) - D(\boldsymbol{y}, \boldsymbol{y})}{\phi_0} = \frac{D(\boldsymbol{y}, \boldsymbol{\widehat{\mu}})}{\phi_0} \overset{\cdot}{\sim} \chi^2_{n-p}. \]

Note that the goodness of fit test is a significance test with respect to the saturated model (22.1), which has \(n\) free parameters. Therefore, the number of free parameters increases with the sample size, so large-sample asymptotics cannot justify this test. Instead, we must rely on small-dispersion asymptotics. In particular, the likelihood ratio test relies on the saddlepoint approximation for small dispersions. To verify whether the saddlepoint approximation is accurate, we can apply the rules of thumb from Section 19.6.2.2 to each observation \(y_i\), as drawn from the distribution fitted under the GLM (rather than the saturated model). For instance, we can check that \(m \hat \mu_i \geq 3\) and \(m(1-\hat \mu_i) \geq 3\) in the case of grouped logistic regression, or \(\hat \mu_i \geq 3\) for Poisson regression. Here, \(\hat \mu_i\) are the fitted means under the GLM.
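Below is a sketch of the deviance goodness-of-fit test for a hypothetical Poisson regression with \(\phi_0 = 1\) and fitted means large enough for the saddlepoint rule of thumb; the simulated data and library calls are assumptions.

```python
# Deviance goodness-of-fit test: compare the residual deviance to chi^2_{n-p}.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, p = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([2.0, 0.4])))     # fitted means mostly >= 3

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
p_value = stats.chi2.sf(fit.deviance, df=n - p)       # small p-value flags lack of fit
print(fit.deviance, p_value)
```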

22.3.4 Likelihood ratio inference for \(\phi_0\) unknown

If \(\phi_0\) is unknown, we can estimate it as discussed above and construct an \(F\)-statistic as follows:

\[ F \equiv \frac{(D(\boldsymbol{y}, \boldsymbol{\widehat{\mu}}_{\text{-}S}) - D(\boldsymbol{y}, \boldsymbol{\widehat{\mu}}))/|S|}{\widetilde{\phi}_0}. \]

In normal linear model theory, the null distribution of \(F\) is exactly \(F_{|S|, n-p}\). For GLMs, the null distribution of \(F\) is approximately \(F_{|S|, n-p}\). We can use this \(F\) distribution to construct hypothesis tests for groups of coefficients, or invert it to get a confidence interval for a single coefficient. We cannot construct a goodness of fit test in the case that \(\phi_0\) is unknown because the residual degrees of freedom would be used up to estimate \(\phi_0\) rather than to carry out inference.
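A sketch of the \(F\)-test for a group of coefficients with unknown dispersion, using a hypothetical Gamma GLM and the Pearson estimate \(\widetilde\phi_0\) from the full model; the simulated data and statsmodels details (scale="X2", the Log link spelling) are assumptions.

```python
# F-test of H_0: beta_2 = beta_3 = 0 in a Gamma GLM with unknown dispersion.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(8)
n, p = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
mu = np.exp(X @ np.array([1.0, 0.5, 0.0, 0.0]))
y = rng.gamma(shape=4.0, scale=mu / 4.0)              # Gamma responses with phi_0 = 1/4

fam = sm.families.Gamma(link=sm.families.links.Log())
full = sm.GLM(y, X, family=fam).fit(scale="X2")       # Pearson estimate of phi_0
null = sm.GLM(y, X[:, :2], family=fam).fit()          # model without columns 2, 3

S = 2                                                 # |S|
F = ((null.deviance - full.deviance) / S) / full.scale
p_value = stats.f.sf(F, dfn=S, dfd=n - p)
print(F, p_value)
```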

22.4 Score-based inference

Score-based inference can be used for the same set of inferential tasks as likelihood ratio inference.

22.4.1 Testing multiple coefficients (\(\phi_0\) known)

Let \(H_0: \boldsymbol{\beta}_S = \boldsymbol{\beta}_S^0\) be a null hypothesis about a subset of variables \(S\), and let \(\boldsymbol{\widehat{\beta}}^0\) be the maximum likelihood estimate under this null hypothesis. In particular, let \(\boldsymbol{\widehat{\beta}}^0_S \equiv \boldsymbol{\beta}_S^0\) and \[ \boldsymbol{\widehat{\beta}}^0_{\text{-}S} = \underset{\boldsymbol{\beta}_{\text{-}S}}{\arg \max}\ \ell(\boldsymbol{\beta}^0_{S}, \boldsymbol{\beta}_{\text{-}S}). \tag{22.5}\]

Let us partition the score vector \(\boldsymbol U(\boldsymbol{\beta}) = \frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \equiv (\boldsymbol U_S(\boldsymbol \beta), \boldsymbol U_{\text{-}S}(\boldsymbol \beta))\). We have

\[ \begin{split} \boldsymbol U(\boldsymbol \beta) &= \begin{pmatrix} \boldsymbol U_S(\boldsymbol \beta) \\ \boldsymbol U_{\text{-}S}(\boldsymbol \beta) \end{pmatrix} \\ &\overset{\cdot}{\sim} N\left(0, \boldsymbol I(\boldsymbol \beta)\right) = N\left(0, \begin{bmatrix} \boldsymbol I_{S,S}(\boldsymbol \beta) & \boldsymbol I_{S,\text{-}S}(\boldsymbol \beta) \\ \boldsymbol I_{\text{-}S,S}(\boldsymbol \beta) & \boldsymbol I_{\text{-}S,\text{-}S}(\boldsymbol \beta) \end{bmatrix}\right). \end{split} \tag{22.6}\]

This approximation can be justified either by small-dispersion asymptotics or large-sample asymptotics (both based on the central limit theorem). The score test statistic is based on plugging the null estimate \(\boldsymbol{\widehat{\beta}}^0\) into the score vector and extracting the components corresponding to \(S\): \(\boldsymbol U_S(\boldsymbol{\widehat{\beta}}^0)\). This vector does not have the distribution obtained from the coordinates \(S\) of equation (22.6), because an estimate has been plugged in. Instead, we can derive the distribution of this vector by conditioning on \(\boldsymbol U_{\text{-}S}(\boldsymbol{\widehat{\beta}}^0) = 0\):

\[ \begin{split} \boldsymbol U_S(\boldsymbol{\widehat{\beta}}^0) &\overset d \approx \left.\mathcal L(\boldsymbol U_S(\boldsymbol \beta) \mid \boldsymbol U_{\text{-}S}(\boldsymbol \beta) = 0)\right|_{\boldsymbol \beta = \boldsymbol{\widehat{\beta}}^0} \\ & \overset \cdot \sim N\left(0, \boldsymbol I_{S,S}(\boldsymbol{\widehat{\beta}}^0) - \boldsymbol I_{S, \text{-}S}(\boldsymbol{\widehat{\beta}}^0)\boldsymbol I_{\text{-}S, \text{-}S}(\boldsymbol{\widehat{\beta}}^0)^{-1}\boldsymbol I_{\text{-}S, S}(\boldsymbol{\widehat{\beta}}^0)\right). \end{split} \tag{22.7}\]

The second line is obtained from (22.6) using the formula for the conditional distribution of a multivariate normal. Therefore, the score test is based on the following chi-square approximation:

\[ \boldsymbol U_S(\boldsymbol{\widehat{\beta}}^0)^T \left[\boldsymbol I_{S,S}(\boldsymbol{\widehat{\beta}}^0) - \boldsymbol I_{S, \text{-}S}(\boldsymbol{\widehat{\beta}}^0)\boldsymbol I_{\text{-}S, \text{-}S}(\boldsymbol{\widehat{\beta}}^0)^{-1}\boldsymbol I_{\text{-}S, S}(\boldsymbol{\widehat{\beta}}^0)\right]^{-1} \boldsymbol U_S(\boldsymbol{\widehat{\beta}}^0) \overset \cdot \sim \chi^2_{|S|}. \]

Recalling the expressions for the score (21.2) and Fisher information matrix (21.4) in GLMs, we derive the score statistic

\[ \begin{split} &T^2_{\text{score}}(\boldsymbol X, \boldsymbol y) \equiv \\ &\frac{1}{\phi_0}(\boldsymbol{y} - \boldsymbol{\widehat{\mu}}^0)^T \boldsymbol{\widehat{W}}^0 \boldsymbol{\widehat{M}}^0 \boldsymbol{X}_{*,S} \times \\ & \quad \left[\boldsymbol X_{*,S}^T \boldsymbol {\widehat W}^{0}\boldsymbol X_{*,S} - \boldsymbol X_{*,S}^T \boldsymbol {\widehat W}^{0}\boldsymbol X_{*,\text{-}S}(\boldsymbol X_{*,\text{-}S}^T \boldsymbol {\widehat W}^{0}\boldsymbol X_{*,\text{-}S})^{-1}\boldsymbol X_{*,\text{-}S}^T \boldsymbol {\widehat W}^{0}\boldsymbol X_{*,S} \right]^{-1} \times \\ & \quad \boldsymbol{X}_{*,S}^T \boldsymbol{\widehat{M}}^0 \boldsymbol{\widehat{W}}^0 (\boldsymbol{y} - \boldsymbol{\widehat{\mu}}^0). \end{split} \]

The score test is therefore

\[ \phi(\boldsymbol X, \boldsymbol y) \equiv 1(T^2_{\text{score}}(\boldsymbol X, \boldsymbol y) > \chi^2_{|S|}(1-\alpha)). \]

An equivalent formulation of the score test can be derived by noting that

\[ \left[\boldsymbol I_{S,S}(\boldsymbol{\widehat{\beta}}^0) - \boldsymbol I_{S, \text{-}S}(\boldsymbol{\widehat{\beta}}^0)\boldsymbol I_{\text{-}S, \text{-}S}(\boldsymbol{\widehat{\beta}}^0)^{-1}\boldsymbol I_{\text{-}S, S}(\boldsymbol{\widehat{\beta}}^0)\right]^{-1} = [\boldsymbol I(\boldsymbol{\widehat{\beta}}^0)^{-1}]_{S,S}. \]

Hence, we have

\[ \begin{split} &\boldsymbol U_S(\boldsymbol{\widehat{\beta}}^0)^T \left[\boldsymbol I_{S,S}(\boldsymbol{\widehat{\beta}}^0) - \boldsymbol I_{S, \text{-}S}(\boldsymbol{\widehat{\beta}}^0)\boldsymbol I_{\text{-}S, \text{-}S}(\boldsymbol{\widehat{\beta}}^0)^{-1}\boldsymbol I_{\text{-}S, S}(\boldsymbol{\widehat{\beta}}^0)\right]^{-1} \boldsymbol U_S(\boldsymbol{\widehat{\beta}}^0) \\ &\quad= \boldsymbol U_S(\boldsymbol{\widehat{\beta}}^0)^T [\boldsymbol I(\boldsymbol{\widehat{\beta}}^0)^{-1}]_{S,S} \boldsymbol U_S(\boldsymbol{\widehat{\beta}}^0) \\ &\quad= \boldsymbol U(\boldsymbol{\widehat{\beta}}^0)^T \boldsymbol I(\boldsymbol{\widehat{\beta}}^0)^{-1} \boldsymbol U(\boldsymbol{\widehat{\beta}}^0), \end{split} \]

where the last step used the fact that \(\boldsymbol U_{\text{-}S}(\boldsymbol{\widehat \beta}^{0}) = 0\). Specializing to GLMs, we find that the score test statistic can be written as

\[ \begin{split} &T^2_{\text{score}}(\boldsymbol X, \boldsymbol y) = \\ &\frac{1}{\phi_0} (\boldsymbol y - \boldsymbol{\widehat \mu}^0)^T \boldsymbol{\widehat W}^0 \boldsymbol{\widehat M}^0 \boldsymbol X (\boldsymbol X^T \boldsymbol{\widehat W}^0 \boldsymbol X)^{-1} \boldsymbol X^T \boldsymbol{\widehat M}^0 \boldsymbol{\widehat W}^0 (\boldsymbol y - \boldsymbol{\widehat \mu}^0). \end{split} \tag{22.8}\]
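A numerical sketch of the statistic in the form \(\boldsymbol U(\boldsymbol{\widehat\beta}^0)^T \boldsymbol I(\boldsymbol{\widehat\beta}^0)^{-1} \boldsymbol U(\boldsymbol{\widehat\beta}^0)\) for a hypothetical Poisson GLM (log link, \(\phi_0 = 1\), \(w_i = 1\)); only the null model is fit. The simulated data and library calls are assumptions.

```python
# Score test of H_0: beta_S = 0 with S = {2, 3} in a Poisson GLM.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, 0.0, 0.0])))

S = [2, 3]
null = sm.GLM(y, X[:, :2], family=sm.families.Poisson()).fit()   # fit under H_0 only
mu0 = null.fittedvalues                                          # mu_hat^0

# For the Poisson/log link, W = diag(mu) and M = diag(1/mu), so W M (y - mu) = y - mu.
U = X.T @ (y - mu0)                                   # full score at beta_hat^0 (U_{-S} ~ 0)
info = X.T @ (mu0[:, None] * X)                       # Fisher information at beta_hat^0

T2 = U @ np.linalg.solve(info, U)                     # equals the form (22.8)
p_value = stats.chi2.sf(T2, df=len(S))
print(T2, p_value)
```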

22.4.2 Testing a single coefficient (\(\phi_0\) known)

If \(S = \{j\}\), the normal approximation (22.7) specializes to \[ T_{\text{score}} \overset{\cdot}{\sim} N(0, 1), \] where \[ \begin{split} &T_{\text{score}} = \\ &\frac{\boldsymbol{x}_{*,j}^T \boldsymbol{\widehat{M}}^0 \boldsymbol{\widehat{W}}^0 (\boldsymbol{y} - \boldsymbol{\widehat{\mu}}^0)}{\sqrt{\phi_0 (\boldsymbol x_{*,j}^T \boldsymbol {\widehat W}^{0}\boldsymbol x_{*,j} - \boldsymbol x_{*,j}^T \boldsymbol {\widehat W}^{0}\boldsymbol X_{*,\text{-}j}(\boldsymbol X_{*,\text{-}j}^T \boldsymbol {\widehat W}^{0}\boldsymbol X_{*,\text{-}j})^{-1}\boldsymbol X_{*,\text{-}j}^T \boldsymbol {\widehat W}^{0}\boldsymbol x_{*,j})}}. \end{split} \]

Unlike its multivariate counterparts, we can construct not just a two-sided test but also one-sided tests based on this normal approximation. For example, below is a right-sided score test for \(H_0: \beta_j = \beta_j^0\):

\[ \phi(\boldsymbol{X}, \boldsymbol{y}) = 1\left(T_{\text{score}} > z_{1-\alpha}\right). \]

The nice thing about the score test is that the model need only be fit under the null hypothesis. Therefore, computation can be recycled if \(\boldsymbol X_{*, \text{-}j}\) is a standard set of control variables and there are several candidate variables \(\boldsymbol x_{*, j}\) to test, as in the sketch below.
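Here, one null fit on a hypothetical set of controls yields a score \(z\)-statistic for each of several candidate columns without any refitting; the simulated data and library calls are assumptions.

```python
# Score z-statistics for several candidate variables from a single null Poisson fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 300
controls = np.column_stack([np.ones(n), rng.normal(size=n)])
candidates = rng.normal(size=(n, 5))                  # columns to screen
y = rng.poisson(np.exp(controls @ np.array([0.5, 0.3]) + 0.4 * candidates[:, 0]))

null = sm.GLM(y, controls, family=sm.families.Poisson()).fit()   # fit once
W0 = null.fittedvalues                                # W = diag(mu_hat^0) for Poisson/log link
resid = y - null.fittedvalues                         # equals W M (y - mu_hat^0) here
XtWX_inv = np.linalg.inv(controls.T @ (W0[:, None] * controls))

for j in range(candidates.shape[1]):
    xj = candidates[:, j]
    num = xj @ resid                                  # U_j(beta_hat^0), phi_0 = 1
    a = controls.T @ (W0 * xj)
    var = xj @ (W0 * xj) - a @ XtWX_inv @ a           # conditional (profiled) information
    print(j, num / np.sqrt(var))                      # score z-statistic for column j
```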

22.4.3 Confidence interval for a single coefficient (\(\phi_0\) known)

Just as with the likelihood ratio test, it is possible to invert a score test for a single coefficient to obtain a confidence interval. It is uncommon to invert a multivariate test to obtain a confidence region for multiple coordinates of \(\boldsymbol{\beta}\), given the computationally expensive search across a grid of possible \(\boldsymbol{\beta}\) values.

22.4.4 Goodness of fit testing (\(\phi_0\) known)

We can view goodness of fit testing as testing a hypothesis about the coefficients in the following augmented GLM: \[ y_i \overset{\text{ind}}{\sim} \text{EDM}(\mu_i, \phi_0/w_i); \quad g(\mu_i) = \boldsymbol x_{i*}^T \boldsymbol \beta + \boldsymbol z_{i*}^T \boldsymbol \gamma, \] where \(\boldsymbol Z \in \mathbb R^{n \times (n-p)}\) is a matrix of extra variables chosen so that the columns of the augmented model matrix \(\widetilde{\boldsymbol X} \equiv [\boldsymbol X, \boldsymbol Z]\) form a basis for \(\mathbb R^n\). Then, the goodness of fit null hypothesis is \(H_0: \boldsymbol \gamma = \boldsymbol 0\), and the alternative hypothesis is the saturated model.

To test this hypothesis, we use the score test statistic in the form of equation (22.8), with the augmented model matrix \(\widetilde{\boldsymbol X}\) in place of \(\boldsymbol X\) and the GLM fits \(\boldsymbol {\widehat \mu}, \boldsymbol{\widehat W}, \boldsymbol{\widehat M}\) in place of \(\boldsymbol{\widehat \mu}^0, \boldsymbol{\widehat W}^0, \boldsymbol{\widehat M}^0\), reflecting that the original GLM is the model fit under the null hypothesis. Therefore, we get the score statistic \[ T^2_{\text{score}} = \frac{1}{\phi_0} (\boldsymbol y - \boldsymbol{\widehat \mu})^T \boldsymbol{\widehat W} \boldsymbol{\widehat M} \widetilde{\boldsymbol X} (\widetilde{\boldsymbol X}^T \boldsymbol{\widehat W} \widetilde{\boldsymbol X})^{-1} \widetilde{\boldsymbol X}^T \boldsymbol{\widehat M} \boldsymbol{\widehat W} (\boldsymbol y - \boldsymbol{\widehat \mu}). \]

Now, note that \(\widetilde{\boldsymbol X}\) is a full-rank square matrix, and therefore it is invertible. Hence, we can simplify the statistic as follows: \[ \begin{split} T^2_{\text{score}} &= \frac{1}{\phi_0} (\boldsymbol y - \boldsymbol{\widehat \mu})^T \boldsymbol{\widehat W} \boldsymbol{\widehat M} \widetilde{\boldsymbol X} \widetilde{\boldsymbol X}^{-1} \boldsymbol{\widehat W}^{-1}((\widetilde{\boldsymbol X})^{T})^{-1} \widetilde{\boldsymbol X}^T \boldsymbol{\widehat M} \boldsymbol{\widehat W} (\boldsymbol y - \boldsymbol{\widehat \mu}) \\ &= \frac{1}{\phi_0} (\boldsymbol y - \boldsymbol{\widehat \mu})^T \boldsymbol{\widehat W} \boldsymbol{\widehat M} \boldsymbol{\widehat W}^{-1} \boldsymbol{\widehat M} \boldsymbol{\widehat W} (\boldsymbol y - \boldsymbol{\widehat \mu}) \\ &= \frac{1}{\phi_0}(\boldsymbol{y} - \boldsymbol{\widehat{\mu}})^T \boldsymbol{\widehat{W}} \boldsymbol{\widehat{M}}^2 (\boldsymbol{y} - \boldsymbol{\widehat{\mu}}) \\ &= \frac{1}{\phi_0}(\boldsymbol{y} - \boldsymbol{\widehat{\mu}})^T \text{diag}\left( \frac{w_i}{V(\widehat{\mu}_i)} \right) (\boldsymbol{y} - \boldsymbol{\widehat{\mu}}) \\ &= \frac{1}{\phi_0}\sum_{i=1}^n \frac{w_i (y_i - \widehat{\mu}_i)^2}{V(\widehat{\mu}_i)} \\ &\equiv \frac{1}{\phi_0} X^2, \end{split} \] where \(X^2\) is the Pearson chi-square statistic. Therefore, the score test for goodness of fit is:

\[ \phi(\boldsymbol{X}, \boldsymbol{y}) \equiv 1\left(X^2 > \phi_0 \chi^2_{n-p}(1-\alpha)\right). \]

In the context of contingency table analysis (see the next chapter), this test reduces to the Pearson chi-square test of independence between two categorical variables. This test was proposed in 1900; it was only pointed out about a century later that this is a score test (Smyth 2003).

As with the likelihood ratio test for goodness of fit, the score test for goodness of fit must be justified by small-dispersion asymptotics. In particular, the score test for goodness of fit relies on the central limit theorem for small dispersions. To verify whether this approximation is accurate, we can apply the rules of thumb from Section 19.6.1.2 to each observation \(y_i\), as drawn from the distribution fitted under the GLM (rather than the saturated model). For instance, we can check that \(m \hat \mu_i \geq 5\) and \(m(1-\hat \mu_i) \geq 5\) in the case of grouped logistic regression, or \(\hat \mu_i \geq 5\) for Poisson regression. Here, \(\hat \mu_i\) are the fitted means under the GLM.
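Here is a sketch of the Pearson goodness-of-fit test for a hypothetical Poisson regression with \(\phi_0 = 1\); the simulated data and the statsmodels attribute pearson_chi2 used as a cross-check are assumptions.

```python
# Pearson (score) goodness-of-fit test: compare X^2 to chi^2_{n-p}.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(11)
n, p = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([2.3, 0.3])))     # fitted means mostly >= 5

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = fit.fittedvalues
X2 = np.sum((y - mu_hat) ** 2 / mu_hat)               # V(mu) = mu, w_i = 1 for Poisson
p_value = stats.chi2.sf(X2, df=n - p)
print(X2, fit.pearson_chi2, p_value)                  # X2 should match fit.pearson_chi2
```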

22.4.5 Score test inference for \(\phi_0\) unknown

Score test inference for one or more coefficients \(\boldsymbol{\beta}_S\) can be achieved by replacing \(\phi_0\) with one of its estimators and replacing the normal and chi-square distributions with \(t\) and \(F\) distributions, respectively. For example, the score test for a single coefficient \(\beta_j\) is:

\[ \phi(\boldsymbol{X}, \boldsymbol{y}) = 1\left(T_{\text{score}} > t_{n-p}(1-\alpha)\right), \] where \[ \begin{split} &T_{\text{score}} = \\ &\frac{\boldsymbol{x}_{*,j}^T \boldsymbol{\widehat{M}}^0 \boldsymbol{\widehat{W}}^0 (\boldsymbol{y} - \boldsymbol{\widehat{\mu}}^0)}{\sqrt{\tilde \phi_0 (\boldsymbol x_{*,j}^T \boldsymbol {\widehat W}^{0}\boldsymbol x_{*,j} - \boldsymbol x_{*,j}^T \boldsymbol {\widehat W}^{0}\boldsymbol X_{*,\text{-}j}(\boldsymbol X_{*,\text{-}j}^T \boldsymbol {\widehat W}^{0}\boldsymbol X_{*,\text{-}j})^{-1}\boldsymbol X_{*,\text{-}j}^T \boldsymbol {\widehat W}^{0}\boldsymbol x_{*,j})}}. \end{split} \]

The \(t\) and \(F\) distributions are not exact in finite samples, but are better approximations than the normal and chi-square distributions. The score test for goodness of fit is not applicable in the case when \(\phi_0\) is unknown, similarly to the likelihood ratio test. Indeed, note the relationship between the Pearson goodness of fit test, which rejects when \(\frac{1}{\phi_0}X^2 > \chi^2_{n-p}(1-\alpha)\), and the Pearson estimator of the dispersion parameter: \(\widetilde{\phi}_0 \equiv \frac{X^2}{n-p}\). If we try to plug in the Pearson estimator for the dispersion into the Pearson goodness of fit test, we end up with a test statistic deterministically equal to \(n-p\). This reflects the fact that the residual degrees of freedom can either be used to estimate the dispersion or to test goodness of fit; they cannot be used for both.