25 Poisson regression
25.1 Model definition and interpretation
The Poisson regression model (with offsets) is:
\[ y_i \overset{\text{ind}}\sim \text{Poi}(\mu_i); \quad \log \mu_i = o_i + \boldsymbol{x}_{i*}^T \boldsymbol{\beta}. \tag{25.1}\]
Because the log of the mean is linear in the predictors, Poisson regression models are often called loglinear models. To interpret the coefficients, note that a unit increase in \(x_j\) (while keeping the other variables fixed) is associated with an increase in the predicted mean by a factor of \(e^{\beta_j}\).
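As a concrete illustration, here is a minimal sketch of fitting a Poisson regression and interpreting its coefficients, using Python's statsmodels as one possible toolchain; the data, variable names, and coefficient values below are hypothetical.

```python
# A minimal sketch (assumed setup): simulate data from a Poisson regression,
# fit it with statsmodels, and interpret exp(beta_j) as the multiplicative
# change in the predicted mean per unit increase in x_j.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=(n, 2))                 # two hypothetical predictors
beta_true = np.array([0.5, 0.8, -0.3])      # intercept and slopes
X = sm.add_constant(x)                      # design matrix with intercept
mu = np.exp(X @ beta_true)                  # log link: log(mu_i) = x_i' beta
y = rng.poisson(mu)

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)          # estimates of (beta_0, beta_1, beta_2)
print(np.exp(fit.params))  # multiplicative effects on the predicted mean
```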
25.2 Example: Modeling rates
One cool feature of the Poisson model is that rates can be easily modeled with the help of offsets. Let’s say that the count \(y_i\) is collected over the course of a time interval of length \(t_i\), or a spatial region with area \(t_i\), or a population of size \(t_i\), etc. Then, it is meaningful to model:
\[ y_i \overset{\text{ind}} \sim \text{Poi}(\pi_i t_i), \quad \log \pi_i = \boldsymbol{x}^T_{i*}\boldsymbol{\beta}, \]
where \(\pi_i\) represents the rate of events per day / per square mile / per capita, etc. In other words:
\[ y_i \overset{\text{ind}} \sim \text{Poi}(\mu_i), \quad \log \mu_i = \log t_i + \boldsymbol{x}^T_{i*}\boldsymbol{\beta}, \]
which is exactly equation (25.1) with offsets \(o_i = \log t_i\). For example, in single-cell RNA-sequencing, \(y_i\) is the number of RNA molecules aligning to a gene in cell \(i\) and \(t_i\) is the total number of RNA molecules measured in the cell, a quantity called the library size. We might use a Poisson regression model to carry out a differential expression analysis between two cell types.
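Below is a hedged sketch of how such a rate model might be fit with an offset, again using statsmodels on simulated data; the predictor `cell_type` and the exposures `t` are made up for illustration.

```python
# Sketch of a rate model fit with an offset (hypothetical names and values):
# y_i counts over exposure t_i, with log(mu_i) = log(t_i) + x_i' beta.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
cell_type = rng.integers(0, 2, size=n)        # hypothetical binary predictor
t = rng.integers(1000, 5000, size=n)          # exposures, e.g. library sizes
X = sm.add_constant(cell_type.astype(float))
rate = np.exp(-6.0 + 0.7 * cell_type)         # events per unit of exposure
y = rng.poisson(rate * t)

fit = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(t)).fit()
print(fit.params[1])   # estimated log fold change in the rate across groups
```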
25.3 Estimation and inference
25.3.1 Score, Fisher information, and Wald inference
We found in Chapter 4 that the score and Fisher information for Poisson regression are:
\[ \boldsymbol{U}(\boldsymbol{ \beta}) = \boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{\mu}), \]
and:
\[ \boldsymbol{I}(\boldsymbol{\beta}) = \boldsymbol{X}^T \boldsymbol{W} \boldsymbol{X} = \boldsymbol{X}^T \text{diag}(V(\mu_i))\boldsymbol{X} = \boldsymbol{X}^T \text{diag}(\mu_i)\boldsymbol{X}, \]
respectively. Hence, the normal equations for the MLE are:
\[ \boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{\hat \mu}) = \boldsymbol{0}. \]
Wald inference is based on the approximation:
\[ \boldsymbol{\hat \beta} \overset \cdot \sim N(\boldsymbol{\beta}, (\boldsymbol{X}^T \text{diag}(\hat \mu_i)\boldsymbol{X})^{-1}). \]
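The Wald standard errors can be reproduced by hand from the Fisher information \(\boldsymbol{X}^T \text{diag}(\hat \mu_i)\boldsymbol{X}\) evaluated at the MLE; here is a sketch on simulated data, assuming statsmodels for the fit.

```python
# Sketch: reproduce the Wald standard errors by hand from the Fisher
# information X' diag(mu_hat) X evaluated at the MLE (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 400
X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.poisson(np.exp(X @ np.array([0.2, 0.5, -0.4])))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = fit.fittedvalues                     # hat mu_i
info = X.T @ (mu_hat[:, None] * X)            # X' diag(mu_hat) X
se_by_hand = np.sqrt(np.diag(np.linalg.inv(info)))

print(se_by_hand)
print(fit.bse)   # statsmodels' standard errors; should agree numerically
```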
25.3.2 Likelihood ratio inference
For likelihood ratio inference, we first derive the total deviance. The unit deviance of a Poisson distribution is:
\[ d(y, \mu) = 2\left(y \log \frac{y}{\mu} - (y - \mu)\right). \]
Hence, the total deviance is:
\[ D(\boldsymbol{y}, \boldsymbol{\mu}) = \sum_{i = 1}^n d(y_i, \mu_i) = 2\sum_{i = 1}^n \left(y_i \log \frac{y_i}{\mu_i} - (y_i - \mu_i)\right). \]
The residual deviance is then:
\[ D(\boldsymbol{y}, \boldsymbol{\hat\mu}) = 2\sum_{i = 1}^n \left(y_i \log \frac{y_i}{\hat \mu_i} - (y_i - \hat \mu_i)\right) = 2\sum_{i = 1}^n y_i \log \frac{y_i}{\hat \mu_i}. \]
The last equality holds for any model containing the intercept, since by the normal equations we have \(\sum_{i = 1}^n (y_i - \hat \mu_i) = \boldsymbol{1}^T (\boldsymbol{y} - \boldsymbol{\hat \mu}) = 0\). We can carry out a likelihood ratio test for \(H_0: \boldsymbol \beta_S = \boldsymbol{0}\) via:
\[ D(\boldsymbol{y}, \boldsymbol{\hat \mu}^0) - D(\boldsymbol{y}, \boldsymbol{\hat \mu}) = 2\sum_{i = 1}^n y_i \log \frac{\hat \mu_i}{\hat \mu^0_{i}} \overset{\cdot}\sim \chi^2_{|S|}. \]
We can carry out a goodness-of-fit test via:
\[ D(\boldsymbol{y}, \boldsymbol{\hat\mu}) = 2\sum_{i = 1}^n y_i \log \frac{y_i}{\hat \mu_i} \overset{\cdot}\sim \chi^2_{n - p}. \] This approximation is justified by the saddlepoint approximation in small-dispersion asymptotics, which is reliable if \(\hat \mu_i \geq 3\) for all \(i\).
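A sketch of the likelihood ratio test for dropping one predictor, computed as a difference of residual deviances; the data are simulated and the dropped predictor is truly null, so this is only an illustration of the mechanics.

```python
# Sketch: likelihood ratio test for dropping one predictor, computed as the
# difference in residual deviances and compared to chi^2 with |S| = 1 df.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=(n, 2))
X_full = sm.add_constant(x)
X_null = sm.add_constant(x[:, [0]])           # drop the second predictor
y = rng.poisson(np.exp(0.3 + 0.6 * x[:, 0]))  # second predictor is truly null

fit_full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()
fit_null = sm.GLM(y, X_null, family=sm.families.Poisson()).fit()

lr_stat = fit_null.deviance - fit_full.deviance   # D(y, mu0_hat) - D(y, mu_hat)
print(lr_stat, chi2.sf(lr_stat, df=1))            # approximate p-value
```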
25.3.3 Score-based inference
Recalling Section 22.4.2, the score test for \(H_0: \beta_j = 0\) is based on the approximation \[ \frac{\boldsymbol{x}_{*,j}^T (\boldsymbol{y} - \boldsymbol{\widehat{\mu}}^0)}{\sqrt{\boldsymbol x_{*,j}^T \boldsymbol {\widehat W}^{0}\boldsymbol x_{*,j} - \boldsymbol x_{*,j}^T \boldsymbol {\widehat W}^{0}\boldsymbol X_{*,\text{-}j}(\boldsymbol X_{*,\text{-}j}^T \boldsymbol {\widehat W}^{0}\boldsymbol X_{*,\text{-}j})^{-1}\boldsymbol X_{*,\text{-}j}^T \boldsymbol {\widehat W}^{0}\boldsymbol x_{*,j}}} \overset{\cdot}\sim N(0,1), \] where \[ \boldsymbol {\widehat W}^{0} = \text{diag}(\boldsymbol{\widehat{\mu}}^0). \]
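For concreteness, the score statistic above can be computed directly from the null-model fit; the following is a sketch on simulated data (the tested column is hypothetical and has no true effect), assuming statsmodels for the null fit.

```python
# Sketch: compute the score statistic for H0: beta_j = 0 directly from the
# formula above, using only the null-model fit (simulated data; the tested
# column has no true effect).
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 400
X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.poisson(np.exp(0.5 + 0.4 * X[:, 1]))   # column 2 is truly null

j = 2                                          # index of the tested coefficient
X0 = np.delete(X, j, axis=1)                   # X_{*,-j}
xj = X[:, j]                                   # x_{*,j}

fit0 = sm.GLM(y, X0, family=sm.families.Poisson()).fit()
mu0 = fit0.fittedvalues                        # hat mu^0
W0 = mu0                                       # diagonal of hat W^0

num = xj @ (y - mu0)
A = xj @ (W0 * xj)                             # x_j' W0 x_j
B = (xj * W0) @ X0                             # x_j' W0 X_{-j}
var = A - B @ np.linalg.solve(X0.T @ (W0[:, None] * X0), B)
z = num / np.sqrt(var)
print(z, 2 * norm.sf(abs(z)))                  # two-sided p-value
```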
On the other hand, the score test for goodness-of-fit is based on the Pearson \(X^2\) statistic:
\[ X^2 \equiv \sum_{i = 1}^n \frac{(y_i - \hat \mu_i)^2}{\hat \mu_i} \overset{\cdot}\sim \chi^2_{n - p}. \] This approximation is justified by the central limit theorem in small-dispersion asymptotics, which is reliable if \(\hat \mu_i \geq 5\) for all \(i\).
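A sketch of the Pearson goodness-of-fit test on simulated data with fitted means well above 5, where the \(\chi^2_{n-p}\) approximation should be reliable; the setup is hypothetical.

```python
# Sketch: Pearson goodness-of-fit test for a Poisson regression, with fitted
# means well above 5 so the chi^2_{n-p} approximation should be reasonable.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 200
X = sm.add_constant(rng.normal(size=n))
y = rng.poisson(np.exp(3.0 + 0.2 * X[:, 1]))   # means roughly 10 to 40

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = fit.fittedvalues
X2 = np.sum((y - mu_hat) ** 2 / mu_hat)        # Pearson X^2 by hand

print(X2, fit.pearson_chi2)                    # same statistic
print(chi2.sf(X2, df=n - X.shape[1]))          # goodness-of-fit p-value
```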
25.4 Relationship between Poisson and multinomial distributions
Suppose that \(y_i \overset{\text{ind}}\sim \text{Poi}(\mu_i)\) for \(i = 1, \dots, n\). Then,
for any nonnegative integers \(m_1, \dots, m_n\) with \(\sum_i m_i = m\), \[ \begin{split} \mathbb{P}\left[y_1 = m_1, \dots, y_n = m_n \left| \sum_{i}y_i = m\right.\right] &= \frac{\mathbb{P}[y_1 = m_1, \dots, y_n = m_n]}{\mathbb{P}[\sum_{i}y_i = m]} \\ &= \frac{\prod_{i = 1}^n e^{-\mu_i}\frac{\mu_i^{m_i}}{m_i!}}{e^{-\sum_i \mu_i}\frac{(\sum_i \mu_i)^m}{m!}} \\ &= {m \choose m_1, \dots, m_n} \prod_{i = 1}^n \pi_i^{m_i}, \end{split} \] where \[ \pi_i \equiv \frac{\mu_i}{\sum_{i' = 1}^n \mu_{i'}}. \] We recognize the last expression in the previous display as the probability mass function of the \(\text{Mult}(m, (\pi_1, \dots, \pi_n))\) distribution, whose probabilities \((\pi_1, \dots, \pi_n)\) sum to one. In words, the joint distribution of independent Poisson random variables, conditional on their sum, is multinomial.
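This identity is easy to verify numerically; the sketch below compares the conditional Poisson probability with the corresponding multinomial pmf using scipy (the means and the particular outcome are arbitrary).

```python
# Numerical check of the Poisson-multinomial relationship: the conditional
# probability of independent Poisson counts given their sum equals the
# multinomial pmf with probabilities pi_i = mu_i / sum(mu).
import numpy as np
from scipy.stats import poisson, multinomial

mu = np.array([2.0, 5.0, 3.0])
m_vec = np.array([1, 4, 2])                 # one particular outcome
m = m_vec.sum()

joint = np.prod(poisson.pmf(m_vec, mu))     # P[y_1 = m_1, ..., y_n = m_n]
denom = poisson.pmf(m, mu.sum())            # P[sum_i y_i = m]
print(joint / denom)                        # conditional probability
print(multinomial.pmf(m_vec, n=m, p=mu / mu.sum()))   # multinomial pmf; agrees
```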
25.5 Example: \(2 \times 2\) contingency tables
25.5.1 Poisson model for \(2 \times 2\) contingency tables
Let’s say that we have two binary random variables \(x_1, x_2 \in \{0,1\}\) with joint distribution \(\mathbb{P}(x_1 = j, x_2 = k) = \pi_{jk}\) for \(j,k \in \{0,1\}\). We collect a total of \(m\) samples from this joint distribution and summarize the counts in a \(2 \times 2\) table, where \(y_{jk}\) is the number of times we observed \((x_1, x_2) = (j,k)\), so that:
\[ (y_{00}, y_{01}, y_{10}, y_{11}) \sim \text{Mult}(m, (\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11})). \]
Our primary question is whether these two random variables are independent, i.e.
\[ \pi_{jk} = \pi_{j+}\pi_{+k} \text{ for all } j, k \in \{0,1\}, \tag{25.2}\] where \[ \pi_{j+} \equiv \mathbb{P}[x_1 = j] = \pi_{j0} + \pi_{j1}; \quad \pi_{+k} \equiv \mathbb{P}[x_2 = k] = \pi_{0k} + \pi_{1k}. \]
We can express this equivalently as:
\[ \pi_{00}\pi_{11} = \pi_{01}\pi_{10}. \]
In other words, we can express the independence hypothesis concisely as:
\[ H_0: \frac{\pi_{11}\pi_{00}}{\pi_{10}\pi_{01}} = 1. \tag{25.3}\]
As a mathematical device, let’s additionally assume that \(m \sim \text{Poi}(\mu_{++})\). Then, by the relationship between the Poisson and multinomial distributions, we have:
\[ y_{jk} \overset{\text{ind}} \sim \text{Poi}(\mu_{++}\pi_{jk}). \]
Let \(i \in \{1,2,3,4\}\) index the four pairs \[ (x_1, x_2) \in \{(0,0), (0,1), (1,0), (1,1)\}, \] so that for \(i = 1, \dots, 4\), we have \[ y_i \overset{\text{ind}}\sim \text{Poi}(\mu_i); \quad \log \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12}x_{i1} x_{i2}, \tag{25.4}\]
where:
\[ \begin{aligned} \beta_0 = \log \mu_{++} + \log \pi_{00}; \quad \beta_1 = \log \frac{\pi_{10}}{\pi_{00}}; \\ \beta_2 = \log \frac{\pi_{01}}{\pi_{00}}; \quad \beta_{12} = \log\frac{\pi_{11}\pi_{00}}{\pi_{10}\pi_{01}}. \end{aligned} \tag{25.5}\]
Note that the independence hypothesis (25.3) reduces to the hypothesis \(H_0: \beta_{12} = 0\):
\[ H_0: \frac{\pi_{11}\pi_{00}}{\pi_{10}\pi_{01}} = 1 \quad \Longleftrightarrow \quad H_0: \beta_{12} = 0. \]
So the presence of an interaction in the Poisson regression is equivalent to a lack of independence between \(x_1\) and \(x_2\). We can test the latter hypothesis using our standard tools for Poisson regression.
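One option is a Wald test of \(H_0: \beta_{12} = 0\) from the saturated Poisson (log-linear) model; here is a minimal sketch with made-up cell counts, assuming the statsmodels formula interface.

```python
# Sketch: Wald test of H0: beta_12 = 0 from the saturated Poisson (log-linear)
# model fit to a 2x2 table (the cell counts below are made up).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "x1": [0, 0, 1, 1],
    "x2": [0, 1, 0, 1],
    "y":  [30, 20, 25, 40],    # hypothetical cell counts y_{jk}
})

fit = smf.glm("y ~ x1 * x2", data=df, family=sm.families.Poisson()).fit()
print(fit.params["x1:x2"])     # estimated log odds ratio beta_12
print(fit.pvalues["x1:x2"])    # Wald p-value for H0: beta_12 = 0
```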
For example, we can use the Pearson \(X^2\) goodness-of-fit test. To apply this test, we must find the fitted means under the null hypothesis model:
\[ y_i \overset{\text{ind}}\sim \text{Poi}(\mu_i); \quad \log \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}, \quad i = 1, \dots, 4. \tag{25.6}\]
The normal equations give us the following:
\[ \begin{aligned} y_{++} &\equiv \sum_{j, k = 0}^1 y_{jk} = \sum_{j, k = 0}^1 \hat \mu_{jk} \equiv \hat \mu_{++}; \\ y_{+1} &\equiv \sum_{j = 0}^1 y_{j1} = \sum_{j = 0}^1 \hat \mu_{j1} \equiv \hat \mu_{+1}; \\ y_{1+} &\equiv \sum_{k = 0}^1 y_{1k} = \sum_{k = 0}^1 \hat \mu_{1k} \equiv \hat \mu_{1+}. \end{aligned} \]
By combining these equations, we arrive at:
\[ \hat \mu_{++} = y_{++}; \quad \hat \mu_{j+} = y_{j+} \text{ for all } j \in \{0, 1\}; \quad \hat \mu_{+k} = y_{+k} \text{ for all } k \in \{0, 1\}. \]
Therefore, the fitted means under the null hypothesis model (25.6) are:
\[ \hat \mu_{jk} = \hat \mu_{++}\hat \pi_{jk} = \hat \mu_{++}\hat \pi_{j+}\hat \pi_{+k} = y_{++}\frac{y_{j+}}{y_{++}}\frac{y_{+k}}{y_{++}} = \frac{y_{j+}y_{+k}}{y_{++}}. \]
Hence, we have:
\[ X^2 = \sum_{j,k = 0}^1 \frac{(y_{jk} - \widehat \mu_{jk})^2}{\widehat \mu_{jk}} = \sum_{j, k = 0}^1 \frac{(y_{jk} - y_{j+}y_{+k}/y_{++})^2}{y_{j+}y_{+k}/y_{++}}. \]
Alternatively, we can use the likelihood ratio test, which gives:
\[ G^2 = 2\sum_{j,k = 0}^1 y_{jk}\log\frac{y_{jk}}{\widehat \mu_{jk}} = 2\sum_{j, k = 0}^1 y_{jk}\log\frac{y_{jk}}{y_{j+}y_{+k}/y_{++}}. \]
We would compare both \(X^2\) and \(G^2\) to a \(\chi^2_1\) distribution.
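A sketch of computing \(X^2\) and \(G^2\) by hand for a made-up \(2 \times 2\) table, cross-checked against scipy's chi-squared test of independence.

```python
# Sketch: X^2 and G^2 for a made-up 2x2 table, computed by hand and
# cross-checked against scipy's chi-squared test of independence.
import numpy as np
from scipy.stats import chi2, chi2_contingency

table = np.array([[30, 20],
                  [25, 40]])                    # y_{jk}
row = table.sum(axis=1, keepdims=True)          # y_{j+}
col = table.sum(axis=0, keepdims=True)          # y_{+k}
total = table.sum()                             # y_{++}
expected = row @ col / total                    # y_{j+} y_{+k} / y_{++}

X2 = np.sum((table - expected) ** 2 / expected)
G2 = 2 * np.sum(table * np.log(table / expected))
print(X2, chi2.sf(X2, df=1))
print(G2, chi2.sf(G2, df=1))

# scipy agrees (correction=False turns off the continuity correction)
print(chi2_contingency(table, correction=False)[0])
print(chi2_contingency(table, correction=False, lambda_="log-likelihood")[0])
```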
25.5.2 Inference is the same regardless of conditioning on margins
Now, our data might not actually have been collected in such a way that \(m = y_{++} \sim \text{Poi}(\mu_{++})\); for example, maybe the total count \(y_{++} = m\) was fixed in advance. Is the Poisson inference proposed above still valid in the latter case? In fact, it is! To see this, let us factor the Poisson likelihood into the distribution of the total and the conditional distribution of the counts given the total:
\[ p_{\boldsymbol{\mu}}(\boldsymbol{y}) = p_{\mu_{++}}(y_{++})p_{\boldsymbol{\pi}}(\boldsymbol{y} | y_{++}), \]
so:
\[ \begin{split} \log p_{\boldsymbol{\mu}}(\boldsymbol{y}) &= \log p_{\mu_{++}}(y_{++}) + \log p_{\boldsymbol{\pi}}(\boldsymbol{y} | y_{++}) = C + \log p_{\boldsymbol{\pi}}(\boldsymbol{y} | y_{++}). \end{split} \]
In other words, viewed as functions of \(\boldsymbol{\pi}\), the log-likelihoods of the Poisson and multinomial models differ only by a constant. Therefore, any likelihood-based inference about \(\boldsymbol{\pi}\) is equivalent in the two models. The same argument shows that conditioning on the row or column totals (as opposed to the overall total) also yields exactly the same inference. Therefore, regardless of the sampling mechanism, we can always conduct an independence test in a \(2 \times 2\) table via Poisson regression.
25.5.3 Equivalence between Poisson and logistic regressions
We’ve talked about two ways to view a \(2 \times 2\) contingency table. In the logistic regression view, we thought about one variable as a predictor and the other as a response, seeking to test whether the predictor has an impact on the response. In the Poisson regression view, we thought about the two variables symmetrically, seeking to test independence. It turns out that these two perspectives are equivalent. Recall that we have derived in equations (25.4) and (25.5) that \(x_1 \perp \!\!\! \perp x_2\) if and only if \(\beta_{12} = 0\) in the Poisson regression:
\[ y_i \overset{\text{ind}}{\sim} \text{Poi}(\mu_i), \quad \log \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i1}x_{i2}, \quad i = 1, \dots, 4. \]
However, we have:
\[ \begin{split} \beta_{12} &= \log \frac{\pi_{11}\pi_{00}}{\pi_{01}\pi_{10}} \\ &= \log \frac{\pi_{11}/\pi_{10}}{\pi_{01}/\pi_{00}} = \log \frac{\mathbb{P}[x_2 = 1 \mid x_1 = 1] / \mathbb{P}[x_2 = 0 \mid x_1 = 1]}{\mathbb{P}[x_2 = 1 \mid x_1 = 0] / \mathbb{P}[x_2 = 0 \mid x_1 = 0]}. \end{split} \]
Recalling the logistic regression of \(x_2\) on \(x_1\):
\[ \text{logit } \mathbb{P}[x_2 = 1 \mid x_1] = \tilde \beta_0 + \tilde \beta_1 x_1, \tag{25.7}\]
and that \(\tilde \beta_1\) is the log odds ratio, we conclude that:
\[ \beta_{12} = \tilde \beta_1, \]
so \(x_1 \perp \!\!\! \perp x_2\) if and only if \(\tilde \beta_1 = 0\). Due to the equivalence between Poisson and multinomial distributions, the hypothesis tests and confidence intervals for the log odds ratio \(\beta_{12}\) (or \(\tilde \beta_1\)) obtained from Poisson and logistic regressions will be the same.
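The numerical equivalence of the two estimates is easy to check; below is a sketch using the same made-up counts as in the earlier sketch, fitting the logistic regression as grouped binomial data.

```python
# Sketch: the interaction coefficient from the Poisson (log-linear) model
# equals the slope from the logistic regression of x2 on x1 (same made-up
# counts as in the earlier sketch).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({"x1": [0, 0, 1, 1], "x2": [0, 1, 0, 1], "y": [30, 20, 25, 40]})

# Poisson regression with interaction
pois = smf.glm("y ~ x1 * x2", data=df, family=sm.families.Poisson()).fit()

# Logistic regression of x2 on x1 as grouped binomial data:
# for each level of x1, the counts of (x2 = 1, x2 = 0).
successes = df.loc[df.x2 == 1, "y"].to_numpy()   # x2 = 1 counts for x1 = 0, 1
failures = df.loc[df.x2 == 0, "y"].to_numpy()    # x2 = 0 counts for x1 = 0, 1
endog = np.column_stack([successes, failures])
exog = sm.add_constant(np.array([0.0, 1.0]))     # x1 levels

logit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(pois.params["x1:x2"], logit.params[1])     # both equal the log odds ratio
```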
25.6 Example: Poisson models for \(J \times K\) contingency tables
Suppose now that \(x_1 \in \{0, 1, \dots, J-1\}\) and \(x_2 \in \{0, 1, \dots, K-1\}\). Then, we denote \(\mathbb{P}[x_1 = j, x_2 = k] = \pi_{jk}\). We are still interested in testing for independence between \(x_1\) and \(x_2\), which amounts to a goodness-of-fit test for the Poisson model:
\[ y_{jk} \overset{\text{ind}}\sim \text{Poi}(\mu_{jk}); \quad \log \mu_{jk} = \beta_0 + \beta^1_j + \beta^2_k. \]
The score (Pearson) and deviance-based goodness-of-fit statistics for this test are:
\[ X^2 = \sum_{j = 0}^{J-1} \sum_{k = 0}^{K-1} \frac{(y_{jk} - \widehat \mu_{jk})^2}{\widehat \mu_{jk}} \quad \text{and} \quad G^2 = 2\sum_{j = 0}^{J-1} \sum_{k = 0}^{K-1} y_{jk}\log \frac{y_{jk}}{\widehat \mu_{jk}}, \]
where \(\widehat \mu_{jk} = y_{++}\frac{y_{j+}}{y_{++}}\frac{y_{+k}}{y_{++}} = \frac{y_{j+}y_{+k}}{y_{++}}\). As in the \(2 \times 2\) case, the test is the same whether we condition on the row sums, the column sums, the total count, or nothing at all. The full (saturated) model has \(JK\) degrees of freedom, while the additive model has \(J + K - 1\), so the goodness-of-fit test has \(JK - J - K + 1 = (J - 1)(K - 1)\) degrees of freedom. Pearson erroneously concluded that the test had \(JK - 1\) degrees of freedom, an error whose correction by Fisher created a great deal of animosity between the two statisticians.
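The following sketch verifies, for a made-up \(J \times K\) table, that the Pearson statistic from the additive Poisson model matches the classical chi-squared test of independence with \((J-1)(K-1)\) degrees of freedom.

```python
# Sketch: for a made-up J x K table, the Pearson statistic from the additive
# Poisson model matches the classical chi-squared test of independence.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2, chi2_contingency

table = np.array([[12, 7, 9],
                  [8, 15, 11],
                  [20, 10, 18]])                   # J = 3 rows, K = 3 columns
J, K = table.shape
df = pd.DataFrame([(j, k, table[j, k]) for j in range(J) for k in range(K)],
                  columns=["x1", "x2", "y"])

fit = smf.glm("y ~ C(x1) + C(x2)", data=df, family=sm.families.Poisson()).fit()
dof = (J - 1) * (K - 1)
print(fit.pearson_chi2, chi2.sf(fit.pearson_chi2, dof))  # X^2 from the GLM
print(chi2_contingency(table, correction=False)[0])       # classical X^2 agrees
```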
25.7 Example: Poisson models for \(J \times K \times L\) contingency tables
These ideas can be extended to multi-way tables, for example, three-way tables. If we have \(x_1 \in \{0, 1, \dots, J-1\}\), \(x_2 \in \{0, 1, \dots, K-1\}\), \(x_3 \in \{0, 1, \dots, L-1\}\), then we might be interested in testing several kinds of null hypotheses:
- Mutual independence: \(H_0: x_1 \perp \!\!\! \perp x_2 \perp \!\!\! \perp x_3\).
- Joint independence: \(H_0: x_1 \perp \!\!\! \perp (x_2, x_3)\).
- Conditional independence: \(H_0: x_1 \perp \!\!\! \perp x_2 \mid x_3\).
These three null hypotheses can be shown to be equivalent to the following Poisson regression models, respectively:
\[ y_{jkl} \overset{\text{ind}}\sim \text{Poi}(\mu_{jkl}), \]
where:
\[ \log \mu_{jkl} = \beta^0 + \beta^1_j + \beta^2_k + \beta^3_l \quad \text{(mutual independence)}; \]
\[ \log \mu_{jkl} = \beta^0 + \beta^1_j + \beta^2_k + \beta^3_l + \beta^{2,3}_{kl} \quad \text{(joint independence)}; \]
\[ \log \mu_{jkl} = \beta^0 + \beta^1_j + \beta^2_k + \beta^3_l + \beta^{1,3}_{jl} + \beta^{2,3}_{kl} \quad \text{(conditional independence)}. \] Here, the coefficients are indexed by \(j = 1, \dots, J-1\), \(k = 1, \dots, K-1\), and \(l = 1, \dots, L-1\), with the coefficients for the baseline levels \(j = 0\), \(k = 0\), and \(l = 0\) set to zero for identifiability.
For example, consider conditional independence. In this case, writing \(\pi_{jkl} \equiv \mathbb{P}[x_1 = j, x_2 = k, x_3 = l]\), we have \[ \begin{split} \pi_{jkl} &= \mathbb P[x_1 = j, x_2 = k \mid x_3 = l]\,\mathbb P[x_3 = l] \\ &= \mathbb P[x_1 = j \mid x_3 = l]\, \mathbb P[x_2 = k \mid x_3 = l]\, \mathbb P[x_3 = l] \\ &= \frac{\pi_{j+l}}{\pi_{++l}} \frac{\pi_{+kl}}{\pi_{++l}} \pi_{++l} \\ &= \frac{\pi_{j+l} \pi_{+kl}}{\pi_{++l}}. \end{split} \] Therefore, \[ \begin{split} \log \mu_{jkl} &= \log \mu_{+++} + \log \pi_{jkl} \\ &= \log \mu_{+++} + \log \pi_{j+l} + \log \pi_{+kl} - \log \pi_{++l}. \end{split} \] Written in terms of indicator predictors for observation \(i\), this becomes \[ \begin{split} \log \mu_i = \log \mu_{+++} &+ \sum_{j = 0}^{J-1}\sum_{l = 0}^{L-1}\log \pi_{j+l}\,1(x_{i1} = j)1(x_{i3} = l) \\ &+ \sum_{k = 0}^{K-1}\sum_{l = 0}^{L-1}\log \pi_{+kl}\,1(x_{i2} = k)1(x_{i3} = l) \\ &- \sum_{l = 0}^{L-1} \log \pi_{++l}\,1(x_{i3} = l). \end{split} \] In this representation, there is some multicollinearity among the predictors. For example, \(\{1(x_{i3} = l)\}_{l = 0}^{L-1}\) are multicollinear with the intercept, so we remove \(1(x_{i3} = 0)\). Next, \(\{1(x_{i1} = j)1(x_{i3} = l)\}_{j = 0}^{J-1}\) are multicollinear with \(1(x_{i3} = l)\) for each \(l\), so we remove \(1(x_{i1} = 0)1(x_{i3} = l)\) for each \(l\). Similarly, \(\{1(x_{i2} = k)1(x_{i3} = l)\}_{k = 0}^{K-1}\) are multicollinear with \(1(x_{i3} = l)\) for each \(l\), so we remove \(1(x_{i2} = 0)1(x_{i3} = l)\) for each \(l\). This leaves us with the model \[ \begin{split} \log \mu_i = \beta^0 &+ \sum_{j = 1}^{J-1}\sum_{l = 0}^{L-1} \beta^{1,3}_{j,l}1(x_{i1} = j)1(x_{i3} = l) \\ &+ \sum_{k = 1}^{K-1}\sum_{l = 0}^{L-1} \beta^{2,3}_{k,l}1(x_{i2} = k)1(x_{i3} = l) \\ &+ \sum_{l = 1}^{L-1} \beta^3_{l} 1(x_{i3} = l). \end{split} \] Alternatively, we can reparameterize to the following form: \[ \begin{split} \log \mu_i &= \beta^0 \\ &\quad + \sum_{j = 1}^{J-1} \beta^1_{j} 1(x_{i1} = j) + \sum_{k = 1}^{K-1} \beta^2_{k} 1(x_{i2} = k) + \sum_{l = 1}^{L-1} \beta^3_{l} 1(x_{i3} = l) \\ & \quad + \sum_{j = 1}^{J-1}\sum_{l = 1}^{L-1} \beta^{1,3}_{j,l}1(x_{i1} = j)1(x_{i3} = l) + \sum_{k = 1}^{K-1}\sum_{l = 1}^{L-1} \beta^{2,3}_{k,l}1(x_{i2} = k)1(x_{i3} = l). \end{split} \] Note that the coefficients here are not necessarily the same as the coefficients in the previous display. This formulation recovers the conditional independence model stated above.
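In practice, conditional independence can be tested by fitting the log-linear model with \(x_1\!:\!x_3\) and \(x_2\!:\!x_3\) interactions but no \(x_1\!:\!x_2\) terms and referring its residual deviance to \(\chi^2\) with \((J-1)(K-1)L\) degrees of freedom. The sketch below illustrates this on counts simulated from a conditionally independent model, assuming the statsmodels formula interface.

```python
# Sketch: test conditional independence of x1 and x2 given x3 by fitting the
# log-linear model with x1:x3 and x2:x3 interactions (but no x1:x2 terms) and
# referring its residual deviance to chi^2 with (J-1)(K-1)L degrees of freedom.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(6)
J, K, L = 2, 3, 2
cells = [(j, k, l) for j in range(J) for k in range(K) for l in range(L)]
df = pd.DataFrame(cells, columns=["x1", "x2", "x3"])

# true means have x1:x3 and x2:x3 interactions only (conditional independence)
log_mu = (3.0 + 0.3 * df.x1 + 0.2 * df.x2 + 0.4 * df.x3
          + 0.5 * df.x1 * df.x3 - 0.3 * df.x2 * df.x3)
df["y"] = rng.poisson(np.exp(log_mu))

fit = smf.glm("y ~ C(x1)*C(x3) + C(x2)*C(x3)", data=df,
              family=sm.families.Poisson()).fit()
dof = (J - 1) * (K - 1) * L
print(fit.deviance, chi2.sf(fit.deviance, dof))   # goodness-of-fit p-value
```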