23  R demo

23.1 Crime data

Let’s revisit the crime data from Homework 2, this time fitting a logistic regression to it.

library(readr)
library(dplyr)
library(ggplot2)

# read crime data
crime_data <- read_tsv("data/Statewide_crime.dat")

# read and transform population data
population_data <- read_csv("data/state-populations.csv")
population_data <- population_data |>
  filter(State != "Puerto Rico") |>
  select(State, Pop) |>
  rename(state_name = State, state_pop = Pop)

# collate state abbreviations
state_abbreviations <- tibble(
  state_name = state.name,
  state_abbrev = state.abb
) |>
  add_row(state_name = "District of Columbia", state_abbrev = "DC")

# add CrimeRate to crime_data
crime_data <- crime_data |>
  mutate(STATE = ifelse(STATE == "IO", "IA", STATE)) |> # fix Iowa abbreviation
  rename(state_abbrev = STATE) |>
  filter(state_abbrev != "DC") |> # remove outlier
  left_join(state_abbreviations, by = "state_abbrev") |>
  left_join(population_data, by = "state_name") |>
  mutate(CrimeRate = Violent / state_pop) |>
  select(state_abbrev, CrimeRate, Metro, HighSchool, Poverty, state_pop)

crime_data
# A tibble: 50 × 6
   state_abbrev CrimeRate Metro HighSchool Poverty state_pop
   <chr>            <dbl> <dbl>      <dbl>   <dbl>     <dbl>
 1 AK           0.000819   65.6       90.2     8      724357
 2 AL           0.0000871  55.4       82.4    13.7   4934193
 3 AR           0.000150   52.5       79.2    12.1   3033946
 4 AZ           0.0000682  88.2       84.4    11.9   7520103
 5 CA           0.0000146  94.4       81.3    10.5  39613493
 6 CO           0.0000585  84.5       88.3     7.3   5893634
 7 CT           0.0000867  87.7       88.8     6.4   3552821
 8 DE           0.000664   80.1       86.5     5.8    990334
 9 FL           0.0000333  89.3       85.9     9.7  21944577
10 GA           0.0000419  71.6       85.2    10.8  10830007
# ℹ 40 more rows

We can fit a GLM using the glm() function, specifying as additional arguments the observation weights as well as the exponential dispersion model (via the family argument). In this case, the weights are the state populations and the family is binomial:

glm_fit <- glm(CrimeRate ~ Metro + HighSchool + Poverty,
  weights = state_pop,
  family = "binomial",
  data = crime_data
)

We can print the summary table as usual:

summary(glm_fit)

Call:
glm(formula = CrimeRate ~ Metro + HighSchool + Poverty, family = "binomial", 
    data = crime_data, weights = state_pop)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.609e+01  3.520e-01  -45.72   <2e-16 ***
Metro       -2.586e-02  5.727e-04  -45.15   <2e-16 ***
HighSchool   9.106e-02  3.450e-03   26.39   <2e-16 ***
Poverty      6.077e-02  4.852e-03   12.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 15590  on 49  degrees of freedom
Residual deviance: 11742  on 46  degrees of freedom
AIC: 12136

Number of Fisher Scoring iterations: 5

Amazingly, everything is highly significant! This is because the weights for each observation (the state populations) are very large, effectively making the sample size enormous. But frankly, this is a bit suspicious. Glancing at the bottom of the regression summary, we see a residual deviance of 11742 on 46 degrees of freedom. This part of the summary refers to the deviance-based goodness-of-fit test. Under the null hypothesis that the model fits well, the residual deviance approximately follows a \(\chi^2_{46}\) distribution, which has mean 46; a value of 11742 is wildly out of line with that.
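
For a sense of scale (a quick aside, not part of the original analysis), we can compare against quantiles of the null distribution; even extreme quantiles of \(\chi^2_{46}\) are nowhere near 11742:

# median, 95th, and 99.9th percentiles of the chi-squared null distribution
# (approximately 45.3, 62.8, and 81.5)
qchisq(c(0.5, 0.95, 0.999), df = 46)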

Let’s formally check the goodness of fit. We can extract the residual deviance and residual degrees of freedom from the GLM fit:

glm_fit$deviance
[1] 11742.28
glm_fit$df.residual
[1] 46

We can then compute the chi-square \(p\)-value:

# compute based on residual deviance from fit object
pchisq(glm_fit$deviance,
  df = glm_fit$df.residual,
  lower.tail = FALSE
)
[1] 0
# compute residual deviance as sum of squares of residuals
pchisq(sum(resid(glm_fit, "deviance")^2),
  df = glm_fit$df.residual,
  lower.tail = FALSE
)
[1] 0

Wow, we get a \(p\)-value of zero (more precisely, one too small to be represented in floating point)! Let's try doing a score-based (i.e., Pearson) goodness-of-fit test:

pchisq(sum(resid(glm_fit, "pearson")^2),
  df = glm_fit$df.residual,
  lower.tail = FALSE
)
[1] 0

Also zero. We need to stop using this model for inference about these data, since it fits them very poorly. We will discuss how to build a better model for the crime data in the next unit.
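
As a small preview (a sketch, not the next unit's solution), one common follow-up is to estimate the dispersion parameter instead of fixing it at 1, e.g., via a quasibinomial fit; an estimated dispersion far above 1 quantifies the lack of fit:

# quasibinomial refit: same mean model, but the dispersion is estimated
quasi_fit <- glm(CrimeRate ~ Metro + HighSchool + Poverty,
  weights = state_pop,
  family = "quasibinomial",
  data = crime_data
)
summary(quasi_fit)$dispersion

For now, we turn to analyzing a different dataset.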

23.2 Noisy miner data

Credit: the textbook Generalized Linear Models With Examples in R, by Dunn and Smyth.

Let’s consider the noisy miner dataset. The noisy miner is a small but aggressive native Australian bird. We want to know how the number of these birds observed in a patch of land depends on various characteristics of that patch.

library(GLMsData)
data("nminer")
nminer |> as_tibble()
# A tibble: 31 × 8
   Miners  Eucs  Area Grazed Shrubs Bulokes Timber Minerab
    <int> <int> <int>  <int>  <int>   <int>  <int>   <int>
 1      0     2    22      0      1     120     16       0
 2      0    10    11      0      1      67     25       0
 3      1    16    51      0      1      85     13       3
 4      1    20    22      0      1      45     12       2
 5      1    19     4      0      1     160     14       8
 6      1    18    61      0      1      75      6       1
 7      1    12    16      0      1     100     12       8
 8      1    16    14      0      1     321     15       5
 9      0     3     5      0      1     275      8       0
10      1    12     6      1      0     227     10       4
# ℹ 21 more rows

Since the response Minerab is a count, we can model it as a Poisson random variable. Let’s fit that GLM, excluding Miners, which is just a binarized version of the response variable:

glm_fit <- glm(Minerab ~ . - Miners, family = "poisson", data = nminer)
summary(glm_fit)

Call:
glm(formula = Minerab ~ . - Miners, family = "poisson", data = nminer)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.886345   0.875737  -1.012    0.311    
Eucs         0.129309   0.021757   5.943 2.79e-09 ***
Area        -0.028736   0.013241  -2.170    0.030 *  
Grazed       0.140831   0.364622   0.386    0.699    
Shrubs       0.335828   0.375059   0.895    0.371    
Bulokes      0.001469   0.001773   0.828    0.408    
Timber      -0.006781   0.009074  -0.747    0.455    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 150.545  on 30  degrees of freedom
Residual deviance:  54.254  on 24  degrees of freedom
AIC: 122.41

Number of Fisher Scoring iterations: 6

Things look a bit better on the goodness-of-fit front:

pchisq(sum(resid(glm_fit, "deviance")^2),
  df = glm_fit$df.residual,
  lower.tail = FALSE
)
[1] 0.000394186
pchisq(sum(resid(glm_fit, "pearson")^2),
  df = glm_fit$df.residual,
  lower.tail = FALSE
)
[1] 0.0001185197

There is still some evidence of model misspecification, but for now we proceed with the rest of the analysis.
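
We can quantify the misspecification (a quick sketch, using the usual method-of-moments estimator: the Pearson statistic divided by the residual degrees of freedom, which here comes out to around 2.4):

# method-of-moments dispersion estimate: Pearson X^2 / residual df;
# a value well above 1 indicates overdispersion relative to the Poisson model
sum(resid(glm_fit, "pearson")^2) / glm_fit$df.residual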

The standard errors shown in the summary are based on the Wald test. We can construct Wald confidence intervals from these standard errors using the formula \(\hat\beta_j \pm z_{0.975} \cdot \widehat{\textrm{SE}}(\hat\beta_j)\):

glm_fit |>
  summary() |>
  coef() |>
  as.data.frame() |>
  transmute(`2.5 %` = Estimate + qnorm(0.025)*`Std. Error`,
            `97.5 %` = Estimate + qnorm(0.975)*`Std. Error`)
                   2.5 %       97.5 %
(Intercept) -2.602757559  0.830066560
Eucs         0.086666177  0.171951888
Area        -0.054686818 -0.002784651
Grazed      -0.573814583  0.855476296
Shrubs      -0.399274191  1.070929206
Bulokes     -0.002007061  0.004944760
Timber      -0.024565751  0.011002885

Or, we can simply use confint.default(), which implements the same formula:

confint.default(glm_fit)
                   2.5 %       97.5 %
(Intercept) -2.602757559  0.830066560
Eucs         0.086666177  0.171951888
Area        -0.054686818 -0.002784651
Grazed      -0.573814583  0.855476296
Shrubs      -0.399274191  1.070929206
Bulokes     -0.002007061  0.004944760
Timber      -0.024565751  0.011002885

Or, we might want LRT-based (i.e., profile likelihood) confidence intervals, which are given by confint():

confint(glm_fit)
Waiting for profiling to be done...
                  2.5 %       97.5 %
(Intercept) -2.63176754  0.812111327
Eucs         0.08782624  0.173336323
Area        -0.05658079 -0.004456166
Grazed      -0.57858596  0.855903871
Shrubs      -0.38600748  1.090319407
Bulokes     -0.00214123  0.004838901
Timber      -0.02483241  0.010820749

In this case, the two sets of confidence intervals seem fairly similar.
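
As a quick numerical check on that claim, we can compute the largest discrepancy between corresponding endpoints (about 0.03, attained by the intercept):

# maximum absolute difference between Wald and profile-likelihood endpoints
max(abs(confint.default(glm_fit) - suppressMessages(confint(glm_fit))))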

Now, we can compute predictions and their standard errors, either on the linear predictor scale or on the mean scale:

pred_linear <- predict(glm_fit, newdata = nminer[31,], se.fit = TRUE)
pred_mean <- predict(glm_fit, newdata = nminer[31,], type = "response", se.fit = TRUE)

pred_linear
$fit
       31 
0.6556799 

$se.fit
[1] 0.2635664

$residual.scale
[1] 1
pred_mean
$fit
      31 
1.926452 

$se.fit
       31 
0.5077481 

$residual.scale
[1] 1
log(pred_mean$fit)
       31 
0.6556799 

We see that the prediction on the linear predictor scale is exactly the logarithm of the prediction on the mean scale. However, the standard error reported on the mean scale is computed via the delta method: for the log link, \(\widehat{\textrm{SE}}(\hat\mu) = \exp(\hat\eta) \cdot \widehat{\textrm{SE}}(\hat\eta)\). We prefer to directly transform the confidence interval from the linear predictor scale using the inverse link function (in this case, the exponential):

# using delta method
c(pred_mean$fit + qnorm(0.025)*pred_mean$se.fit,
  pred_mean$fit + qnorm(0.975)*pred_mean$se.fit)
       31        31 
0.9312839 2.9216197 
# using transformation
exp(c(pred_linear$fit + qnorm(0.025)*pred_linear$se.fit,
      pred_linear$fit + qnorm(0.975)*pred_linear$se.fit))
      31       31 
1.149238 3.229285 
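
As a sanity check on the delta method claim (using only quantities computed above), multiplying the linear-scale standard error by the derivative of the inverse link, \(\exp(\hat\eta)\), recovers the mean-scale standard error:

# delta method by hand: SE(mu_hat) = exp(eta_hat) * SE(eta_hat);
# this reproduces pred_mean$se.fit from above
exp(pred_linear$fit) * pred_linear$se.fit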

In this case, the intervals obtained are somewhat different. In the univariate case (e.g., regressing Minerab on Eucs), we can plot the fit together with its confidence band using geom_smooth():

nminer |>
  ggplot(aes(x = Eucs, y = Minerab)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "glm",
              method.args = list(family = "poisson"))
`geom_smooth()` using formula = 'y ~ x'

We can also test the coefficients in the model. The Wald tests for individual coefficients were already given by the regression summary above. We might want to carry out likelihood ratio tests for individual coefficients instead. For example, let’s do this for Eucs:

glm_fit_partial <- glm(Minerab ~ . - Miners - Eucs, family = "poisson", data = nminer)
anova(glm_fit_partial, glm_fit, test = "LRT")
Analysis of Deviance Table

Model 1: Minerab ~ (Miners + Eucs + Area + Grazed + Shrubs + Bulokes + 
    Timber) - Miners - Eucs
Model 2: Minerab ~ (Miners + Eucs + Area + Grazed + Shrubs + Bulokes + 
    Timber) - Miners
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        25     95.513                          
2        24     54.254  1   41.259 1.333e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Eucs variable is quite significant! We can manually carry out the LRT as a sanity check:

deviance_partial <- glm_fit_partial$deviance
deviance_full <- glm_fit$deviance
lrt_stat <- deviance_partial - deviance_full
p_value <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)
tibble(lrt_stat, p_value)
# A tibble: 1 × 2
  lrt_stat  p_value
     <dbl>    <dbl>
1     41.3 1.33e-10

We can test groups of variables using the likelihood ratio test as well.
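
For example (an illustrative sketch; the choice of Bulokes and Timber as the group to drop is arbitrary), we can fit a model excluding both variables and compare it to the full model, giving an LRT with two degrees of freedom:

# test H0: the coefficients on Bulokes and Timber are both zero
glm_fit_group <- glm(Minerab ~ . - Miners - Bulokes - Timber,
  family = "poisson", data = nminer
)
anova(glm_fit_group, glm_fit, test = "LRT")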