27 R demo

27.1 Contingency table analysis

Let’s take a look at the UC Berkeley admissions data:

library(readr)
library(dplyr)
library(ggplot2)
library(tibble)
library(tidyr)

ucb_data <- UCBAdmissions |> as_tibble()
ucb_data

# A tibble: 24 × 4
   Admit    Gender Dept      n
   <chr>    <chr>  <chr> <dbl>
 1 Admitted Male   A       512
 2 Rejected Male   A       313
 3 Admitted Female A        89
 4 Rejected Female A        19
 5 Admitted Male   B       353
 6 Rejected Male   B       207
 7 Admitted Female B        17
 8 Rejected Female B         8
 9 Admitted Male   C       120
10 Rejected Male   C       205
# ℹ 14 more rows

It contains data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex. Let’s see whether there is an association between Gender and Admit. Let’s first aggregate over department:

ucb_data_agg <- ucb_data |>
  group_by(Admit, Gender) |>
  summarise(n = sum(n), .groups = "drop")
ucb_data_agg

# A tibble: 4 × 3
  Admit    Gender     n
  <chr>    <chr>  <dbl>
1 Admitted Female   557
2 Admitted Male    1198
3 Rejected Female  1278
4 Rejected Male    1493

Let’s see what the admissions rates are by gender:

ucb_data_agg |>
  group_by(Gender) |>
  summarise(`Admission rate` = sum(n*(Admit == "Admitted"))/sum(n))

# A tibble: 2 × 2
  Gender `Admission rate`
  <chr>             <dbl>
1 Female            0.304
2 Male              0.445

This suggests that men have substantially higher admission rates than women. Let’s see if we can confirm this using either a Fisher’s exact test or a Pearson chi-square test.

# first convert to 2x2 table format
admit_vs_gender <- ucb_data_agg |>
  pivot_wider(names_from = Gender, values_from = n) |>
  column_to_rownames(var = "Admit")
admit_vs_gender

         Female Male
Admitted    557 1198
Rejected   1278 1493

# Fisher exact test (note that the direction of the effect can be deduced)
fisher.test(admit_vs_gender)


    Fisher's Exact Test for Count Data

data:  admit_vs_gender
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.4781839 0.6167675
sample estimates:
odds ratio 
 0.5432254

# Chi-square test
chisq.test(admit_vs_gender)


    Pearson's Chi-squared test with Yates' continuity correction

data:  admit_vs_gender
X-squared = 91.61, df = 1, p-value < 2.2e-16

As a sanity check, let’s run the Poisson regression underlying the chi-square test above.

pois_fit <- glm(n ~ Admit + Gender + Admit*Gender,
                family = "poisson",
                data = ucb_data_agg)
summary(pois_fit)


Call:
glm(formula = n ~ Admit + Gender + Admit * Gender, family = "poisson", 
    data = ucb_data_agg)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)               6.32257    0.04237 149.218   <2e-16 ***
AdmitRejected             0.83049    0.05077  16.357   <2e-16 ***
GenderMale                0.76584    0.05128  14.933   <2e-16 ***
AdmitRejected:GenderMale -0.61035    0.06389  -9.553   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance:  4.8635e+02  on 3  degrees of freedom
Residual deviance: -3.4062e-13  on 0  degrees of freedom
AIC: 43.225

Number of Fisher Scoring iterations: 2

Based on all of these tests, there seems to be a very substantial difference in admissions rates based on gender. That is not good.

But perhaps, women tend to apply to more selective departments? Let’s look into this:

ucb_data |>
  group_by(Dept) |>
  summarise(admissions_rate = sum(n*(Admit == "Admitted"))/sum(n),
            prop_female_applicants = sum(n*(Gender == "Female"))/sum(n)) |>
  ggplot(aes(x = admissions_rate, y = prop_female_applicants)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(x = "Admissions rate",
       y = "Proportion female applicants")

Indeed, it does seem that female applicants typically applied to more selective departments! This suggests that it is very important to control for department when evaluating the association between admissions and gender. To do this, we can run a test for conditional independence in the \(J \times K \times L\) table:

pois_fit <- glm(n ~ Admit + Dept + Gender + Admit:Dept + Gender:Dept,
                family = "poisson",
                data = ucb_data)
pchisq(sum(resid(pois_fit, "pearson")^2),
  df = pois_fit$df.residual,
  lower.tail = FALSE
)

[1] 0.002840164

Still we find a significant effect! But what is the direction of the effect? The chi-square test does not tell us. We can simply compute the admissions rates by department and plot them:

ucb_data |>
  group_by(Dept, Gender) |>
  summarise(`Admission rate` = sum(n*(Admit == "Admitted"))/sum(n),
            .groups = "drop") |>
  pivot_wider(names_from = Gender, values_from = `Admission rate`) |>
  ggplot(aes(x = Female, y = Male, label = Dept)) +
  geom_point() +
  ggrepel::geom_text_repel() +
  geom_abline(color = "red", linetype = "dashed") +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(x = "Female admission rate",
       y = "Male admission rate")

Now the difference doesn’t seem so huge, with most departments close to even and with department A heavily skewed towards admitting women!

27.2 Revisiting the crime data, again

library(tidyverse)

Here we are again, face to face with the crime data, with one last chance to get the analysis right. Let’s load and preprocess it, as before.

# read crime data
crime_data <- read_tsv("data/Statewide_crime.dat")

# read and transform population data
population_data <- read_csv("data/state-populations.csv")
population_data <- population_data |>
  filter(State != "Puerto Rico") |>
  select(State, Pop) |>
  rename(state_name = State, state_pop = Pop)

# collate state abbreviations
state_abbreviations <- tibble(
  state_name = state.name,
  state_abbrev = state.abb
) |>
  add_row(state_name = "District of Columbia", state_abbrev = "DC")

# add CrimeRate to crime_data
crime_data <- crime_data |>
  mutate(STATE = ifelse(STATE == "IO", "IA", STATE)) |>
  rename(state_abbrev = STATE) |>
  filter(state_abbrev != "DC") |> # remove outlier
  left_join(state_abbreviations, by = "state_abbrev") |>
  left_join(population_data, by = "state_name") |>
  select(state_abbrev, Violent, Metro, HighSchool, Poverty, state_pop)

crime_data

# A tibble: 50 × 6
   state_abbrev Violent Metro HighSchool Poverty state_pop
   <chr>          <dbl> <dbl>      <dbl>   <dbl>     <dbl>
 1 AK               593  65.6       90.2     8      724357
 2 AL               430  55.4       82.4    13.7   4934193
 3 AR               456  52.5       79.2    12.1   3033946
 4 AZ               513  88.2       84.4    11.9   7520103
 5 CA               579  94.4       81.3    10.5  39613493
 6 CO               345  84.5       88.3     7.3   5893634
 7 CT               308  87.7       88.8     6.4   3552821
 8 DE               658  80.1       86.5     5.8    990334
 9 FL               730  89.3       85.9     9.7  21944577
10 GA               454  71.6       85.2    10.8  10830007
# ℹ 40 more rows

Let’s recall the logistic regression we ran on these data in Chapter 4:

bin_fit <- glm(Violent / state_pop ~ Metro + HighSchool + Poverty,
  weights = state_pop,
  family = "binomial",
  data = crime_data
)

We had found very poor results from the goodness of fit test for this model. We have therefore omitted some important variables and/or we have serious overdispersion on our hands.

We haven’t discussed in any detail how to deal with overdispersion in logistic regression models, so let’s try a Poisson model instead. The natural way to model rates using Poisson distributions is via offsets:

pois_fit <- glm(Violent ~ Metro + HighSchool + Poverty + offset(log(state_pop)),
  family = "poisson",
  data = crime_data
)
summary(pois_fit)


Call:
glm(formula = Violent ~ Metro + HighSchool + Poverty + offset(log(state_pop)), 
    family = "poisson", data = crime_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.609e+01  3.520e-01  -45.72   <2e-16 ***
Metro       -2.585e-02  5.727e-04  -45.15   <2e-16 ***
HighSchool   9.106e-02  3.450e-03   26.39   <2e-16 ***
Poverty      6.077e-02  4.852e-03   12.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 15589  on 49  degrees of freedom
Residual deviance: 11741  on 46  degrees of freedom
AIC: 12135

Number of Fisher Scoring iterations: 5

Again, everything is significant, and again, the regression summary shows that we have a huge residual deviance. This was to be expected, given that \(\text{Bin}(m, \pi) \approx \text{Poi}(m\pi)\) for large \(m\) and small \(\pi\). So, the natural thing to try is a negative binomial regression! Negative binomial regression is not implemented in the regular glm package, but glm.nb() from the MASS package is a dedicated function for this task. Let’s see what we get:

nb_fit <- MASS::glm.nb(Violent ~ Metro + HighSchool + Poverty + offset(log(state_pop)),
  data = crime_data
)
summary(nb_fit)


Call:
MASS::glm.nb(formula = Violent ~ Metro + HighSchool + Poverty + 
    offset(log(state_pop)), data = crime_data, init.theta = 1.467747388, 
    link = log)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -10.254088   5.273418  -1.944   0.0518 .
Metro        -0.012188   0.008518  -1.431   0.1525  
HighSchool    0.028052   0.052482   0.535   0.5930  
Poverty      -0.026852   0.068449  -0.392   0.6948  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.4677) family taken to be 1)

    Null deviance: 59.516  on 49  degrees of freedom
Residual deviance: 55.487  on 46  degrees of freedom
AIC: 732.58

Number of Fisher Scoring iterations: 1

              Theta:  1.468 
          Std. Err.:  0.268 

 2 x log-likelihood:  -722.575

Aha! Things are not looking so significant anymore! And the residual deviance is not as huge! Although, we must be careful! The residual deviance no longer has the usual \(\chi^2\) distribution because of the estimated dispersion parameter. So we don’t really have an easy goodness of fit test. The estimated value of \(\gamma\) (confusingly called \(\theta\) in the summary) is significantly different from zero, indicating overdispersion. Let’s formally test for overdispersion using the nonstandard likelihood ratio test discussed above:

T_LRT <- 2 * (as.numeric(logLik(nb_fit)) - as.numeric(logLik(pois_fit)))
p_LRT <- pchisq(T_LRT, df = 1, lower.tail = FALSE)/2
p_LRT

[1] 0

So at the very least the NB model fits much better than the Poisson model. Let’s do some inference based on this model. For example, we can get Wald confidence intervals:

confint.default(nb_fit)

                   2.5 %      97.5 %
(Intercept) -20.58979658 0.081620714
Metro        -0.02888413 0.004507747
HighSchool   -0.07481066 0.130915138
Poverty      -0.16100973 0.107305015

Or we can get LRT-based (i.e. profile) confidence intervals:

confint(nb_fit)

Waiting for profiling to be done...

                   2.5 %       97.5 %
(Intercept) -19.20209590 -0.860399348
Metro        -0.03153902  0.006365841
HighSchool   -0.06265118  0.115318303
Poverty      -0.13930110  0.085200541

Or we can get confidence intervals for the predicted means:

predict(nb_fit,
  newdata = crime_data |> column_to_rownames(var = "state_abbrev"),
  type = "response",
  se.fit = TRUE
)

$fit
       AK        AL        AR        AZ        CA        CO        CT        DE 
 116.1520  617.7064  375.4895  700.6931 3257.5300  725.1538  436.7863  127.2572 
       FL        GA        HI        ID        IL        IN        IA        KS 
2232.2308 1301.2937  157.1416  263.8572 1379.1847  954.3366  546.5503  439.0649 
       KY        LA        MA        MD        ME        MI        MN        MO 
 541.5706  391.6745  747.7454  737.0032  274.2879 1322.9956  970.4078  871.2829 
       MS        MT        NC        ND        NE        NH        NJ        NM 
 380.6756  199.4947 1313.0904  134.8128  305.0634  261.1975  966.9940  204.3311 
       NV        NY        OH        OK        OR        PA        RI        SC 
 327.7316 1926.3861 1477.1713  495.9711  517.8397 1600.0813   96.3565  684.9102 
       SD        TN        TX        UT        VA        VT        WA        WI 
 160.9225  867.0224 2423.0647  416.6648 1244.5168  148.1635 1012.1932  892.0644 
       WV        WY 
 226.4515  100.1906 

$se.fit
       AK        AL        AR        AZ        CA        CO        CT        DE 
 21.00552 143.65071 130.44272 165.08459 910.57769 121.34777  85.53768  32.15169 
       FL        GA        HI        ID        IL        IN        IA        KS 
427.89514 173.04544  31.73873  40.28262 239.43324 147.21049 104.05752  68.82044 
       KY        LA        MA        MD        ME        MI        MN        MO 
133.28938 129.40665 150.23524 158.93816  92.04222 171.28409 216.32477 110.88843 
       MS        MT        NC        ND        NE        NH        NJ        NM 
138.28105  65.60335 379.90855  26.74061  69.62560  66.73731 220.88371  59.26953 
       NV        NY        OH        OK        OR        PA        RI        SC 
 64.30971 387.25204 241.24541  95.44911  81.97419 220.42078  33.97964 119.45174 
       SD        TN        TX        UT        VA        VT        WA        WI 
 41.50215 169.68896 738.95321 107.62725 209.14651  51.32810 191.75629 137.35158 
       WV        WY 
 71.55328  22.79279 

$residual.scale
[1] 1

We can carry out some hypothesis tests as well, e.g. to test \(H_0: \beta_{\text{Metro}} = 0\):

nb_fit_partial <- MASS::glm.nb(Violent ~ HighSchool + Poverty + offset(log(state_pop)),
  data = crime_data
)
anova_fit <- anova(nb_fit_partial, nb_fit)
anova_fit

Likelihood ratio tests of Negative Binomial Models

Response: Violent
                                                  Model    theta Resid. df
1         HighSchool + Poverty + offset(log(state_pop)) 1.428675        47
2 Metro + HighSchool + Poverty + offset(log(state_pop)) 1.467747        46
     2 x log-lik.   Test    df LR stat.   Pr(Chi)
1       -724.1882                                
2       -722.5753 1 vs 2     1 1.612878 0.2040877