Generalised linear models

Estimating binary responses, counts, rates, overdispersion, and zero-inflation

Joseph Bulbulia https://josephbulbulia.netlify.app (Victoria University of Wellington) https://www.wgtn.ac.nz
2021-APRIL-27

Objectives

  1. To understand the concept of a “generalised linear model,” and why the standard (gaussian) linear model is a special case.
  2. To understand how to write four classes of generalised linear models: logistic regression for binary responses, poisson regression for counts, negative binomial regression for over-dispersed counts, and zero-inflated models.
  3. To understand how to interpret and report the results of a generalised linear model.

Introduction

Recall that a regression model combines data from a sample, on the one hand, with the tools of probability theory, on the other hand, to infer features of an unobserved population. We call these unobserved features “parameters.” The task in regression is to estimate uncertainty in these parameters, and where possible, to narrow the bandwidth of such uncertainty. In a nutshell, regression is educated guesswork.

Our focus has been on teaching workflows for estimation that are grounded in three imperatives:

  1. to clarify our question: what do we want to infer?
  2. to clarify our assumptions: what do we believe about the world and our data before estimation gets going?
  3. to clarify our decisions: inevitably, applied regression modellers must make choices: what have we decided, and why?

So far, our models have assumed that the response data are sampled from a “normal” or “gaussian” distribution. This assumption is useful because we live in an ordered universe in which many of our observations are well described as samples from a normal distribution. However, although the gaussian distribution is a sensible choice for estimating many parameters, it is not a sensible choice for estimating all parameters. The linear regression model that we have been using is really a special case of the generalised linear regression model. Today we introduce the generalised linear model.

A regression model implies a ‘link’ function

So far, we have written the gaussian linear model as:

\[ y_i = \alpha + \beta x_i + e_i\]

Where \(y_i\) is the response for the \(i^{th}\) data point, \(\alpha\) is the intercept, or the expected response for the population when all other regression coefficients are set to zero, and \(\beta\) is the adjustment in this expectation for the value of the predictor \(x_i\) associated with \(y_i\), where both \(y\) and \(x\) are random variables.

Often applied regression modellers use compact matrix notation, where a bolded symbol is shorthand for a matrix:

\[\boldsymbol{\eta} = \boldsymbol{\alpha} + \boldsymbol{X}\boldsymbol{\beta}\]

For example, here \(\boldsymbol{X}\boldsymbol{\beta}\) is shorthand for \(\beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \dots\)
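To make the shorthand concrete, we can ask R to show us the design matrix \(\boldsymbol{X}\) that it builds for a model formula. A minimal sketch, using the built-in mtcars data purely for illustration:

# model.matrix() returns the design matrix X for a formula:
# a column of 1s for the intercept, then one column per predictor
X <- model.matrix(~ wt + hp, data = mtcars)
head(X)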

We can write the gaussian linear model in a way that makes our assumptions and decisions explicit about what we think/hope it is doing when combining data with probability theory to estimate unknown (and inherently uncertain) parameters.1

Recall that there are two parameters that we need to estimate for gaussian linear model: the mean, and the variance.

\[ y_i \sim Normal(\mu_i, \sigma)\\ \mu_i = \alpha + \beta x_i \\ \sigma \sim Exponential(1) \]

We assume that our response variable is a random variable sampling from a normal distribution with mean \(\mu_i\) and variance \(\sigma^2\). For reasons that will become clear in the final lecture, we write the scale parameter as the square root of the variance, \(\sigma\), which is a standard deviation (see Appendix 3).
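A quick way to convince yourself that the familiar linear model is a special case of the GLM is to fit the same simulated data both ways (a minimal sketch; the seed and parameter values here are arbitrary):

set.seed(123)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)           # alpha = 1, beta = 2, sigma = 1
coef(lm(y ~ x))                       # ordinary linear model
coef(glm(y ~ x, family = gaussian())) # the same model as a GLM: identical estimates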

Every linear model implies a ‘link’ function:

\(g(\cdot)\) = link function

as well as an inverse link function:

\(h(\cdot) = g^{-1}(\cdot)\) = inverse link function

Where \((\cdot)\) is the linear predictor: \(\boldsymbol{\eta}\)

In the gaussian linear model we assume

\[g(\cdot) = h(\cdot)\] That is, the link function is the identity function. This is a special case. In the remainder of the lecture we will explore different mappings between the expected outcome of a model and its linear predictor.
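We can inspect these functions directly: R stores the link and its inverse inside each family object. A quick illustration:

gaussian()$linkfun(0.25)   # identity link: returns 0.25 unchanged
gaussian()$linkinv(0.25)   # the inverse is also the identity
binomial()$linkfun(0.25)   # logit link: log(0.25/0.75) ~ -1.099
binomial()$linkinv(-1.099) # inverse logit: back to ~0.25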

Logistic regression

We begin with outcomes that are distributed as binary data. Consider home ownership in the nz jitter dataset:

hist(nz$HomeOwner)

From the graph, we can see that the response distribution is concentrated entirely at 0 and 1 (excluding the NAs): the response is binary.

The binomial link function does two things: (1) it maps probabilities, which live on the interval \([0,1]\), onto the whole real line; and (2) through its inverse, it maps the linear predictor back onto the probability scale.

The logit function satisfies the first task:

\[logit(p) = \log_e\frac{p}{1-p}\]

The inverse logit function satisfies the second task:

\[logit^{-1}(x) = \frac{e^x}{1+e^x}\]

We can write this:

\[ \Pr(y_i = 1) = p_i \\ logit(p_i) = \alpha + \beta x_i \\ \Pr(y_i = 1) = logit^{-1}(\alpha + \beta x_i) \] In R these functions are written:

logit <- qlogis
invlogit <- plogis
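As a quick check that the inverse link undoes the link (using the home-ownership proportion we will meet below):

p <- 0.811
logit(p)           # ~ 1.46 on the log-odds scale
invlogit(logit(p)) # back to 0.811 on the probability scale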


Aside: later on we will place priors on the parameters of the linear predictor, something like:

\[ Ownership_i \sim Binomial(n, p_i)\\ logit(p_i) = \alpha + \beta x_i \\ \alpha \sim Normal(0,\sigma_{\alpha})\\ \sigma_{\alpha} \sim Normal(0,10) \\ \beta \sim Lognormal(0,1) \]

Where the values we assign correspond to a range of expectations about how the world might be. Let’s not get distracted by these details for now; I include them to give you a sense of how notation can help us to clarify our assumptions and decisions – which, recall, are essential navigation points for applied scientists.

Estimating the probability of home ownership in the population

We can write a generalised linear model that predicts home ownership (here, an intercept-only model) as follows:

home <- glm(HomeOwner ~ 1, data = nz, 
            family = "binomial")

Which gives us the following result:

parameters::model_parameters(home)
Parameter   | Log-Odds |   SE |       95% CI |     z |      p
-------------------------------------------------------------
(Intercept) |     1.46 | 0.05 | [1.36, 1.55] | 30.38 | < .001

Or plugging this into the equation:

# this is how to quickly generate the equation
equatiomatic::extract_eq(home,  use_coefs = TRUE)

\[ \log\left[ \frac { \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} } \right] = 1.46 \]

Interpretation

How do we interpret this result?

We can use the plogis function to obtain the probability:

plogis(coef(home)[[1]])
[1] 0.811068

The probability of an NZAVS participant owning their own home is 0.811068.

Notably, the intercept-only logistic regression recovers the sample mean to seven decimal places:

mean(nz$HomeOwner, na.rm=TRUE)
[1] 0.811068

Logistic regression with a single covariate

Workflow: center and scale continuous predictors

We do this as follows.

# work around for compatibility issue between ggeffects and dplyr. 
nz['Household.INC_s'] <- as.data.frame( scale(nz$Household.INC) )
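For clarity, scale() is just centring at the mean and dividing by the sample standard deviation. A hand-rolled equivalent (Household.INC_s2 is a throwaway name for this check):

inc <- nz$Household.INC
nz$Household.INC_s2 <- (inc - mean(inc, na.rm = TRUE)) / sd(inc, na.rm = TRUE)
all.equal(nz$Household.INC_s, nz$Household.INC_s2)  # should be TRUE (up to floating point)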

To get a reference point for a standard deviation of household income, let’s find the standard deviation for the sample:

# how much is a standard deviation -- which will represent a 1 unit change for the regression coefficient? 
sd( nz$Household.INC, na.rm = TRUE ) # 95,090
[1] 95090.79

And the mean household income in 2018:

mean( nz$Household.INC, na.rm = TRUE )
[1] 112877.1

Model syntax

We write a logistic regression model with a single covariate as follows:

home2 <- glm(HomeOwner ~ Household.INC_s, data = nz, 
            family = "binomial")

Results:

rs2<-parameters::model_parameters(home2)
rs2
Parameter       | Log-Odds |   SE |       95% CI |     z |      p
-----------------------------------------------------------------
(Intercept)     |     1.58 | 0.05 | [1.48, 1.69] | 29.37 | < .001
Household.INC_s |     0.72 | 0.08 | [0.57, 0.89] |  8.86 | < .001
plot(rs2)

# this is how to quickly generate the equation
equatiomatic::extract_eq(home2,  use_coefs = TRUE)

\[ \log\left[ \frac { \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} } \right] = 1.58 + 0.72(\operatorname{Household.INC\_s}) \]

Interpretation

How do we interpret this model?

Let’s use the report package

report::report(home2)
We fitted a logistic model (estimated using ML) to predict HomeOwner with Household.INC_s (formula: HomeOwner ~ Household.INC_s). The model's explanatory power is weak (Tjur's R2 = 0.04). The model's intercept, corresponding to Household.INC_s = 0, is at 1.58 (95% CI [1.48, 1.69], p < .001). Within this model:

  - The effect of Household.INC_s is significantly positive (beta = 0.72, 95% CI [0.57, 0.89], p < .001; Std. beta = 0.73, 95% CI [0.57, 0.89])

Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using 

There’s lots of mention of p-values and power claims in the automated report, but what do these words practically mean?
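One practical translation: exponentiating a logistic coefficient gives an odds ratio, and plogis converts fitted log-odds into probabilities. A quick sketch using the model above:

exp(coef(home2)[["Household.INC_s"]])        # ~ 2.06: each SD of income roughly doubles the odds of ownership
plogis(coef(home2)[[1]])                     # predicted probability at mean income (~ 0.83)
plogis(coef(home2)[[1]] + coef(home2)[[2]])  # predicted probability one SD above the mean (~ 0.91)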

Workflow: graph your results.

We are cultivating a habit of graphing our results. Let’s put this habit to virtuous use.

Here’s a shortcut for obtaining a quick, investigative plot:

plot(nz$Household.INC_s, nz$HomeOwner) 
curve(invlogit(coef(home2)[1] + coef(home2)[2]*x), add=TRUE)

The slightly longer syntax for a ggeffects graph is:

plot(ggeffects::ggpredict(home2, terms = "Household.INC_s"), add.data = TRUE, alpha =.1)

And here we find a curious feature of our data: income is diffuse. We have someone who makes over 25 standard deviations more than the average income.

hist(nz$Household.INC, breaks = 1000)

Although we graphed home ownership, we forgot to graph household income.

Before launching into a model, your workflow should include graphing your predictors as well as your responses.

The range of incomes is 0 to \(2.5 \times 10^6\).

Note that we also have 8 respondents who report a household income of less than 10,000.
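Both of these summaries can be computed directly (a quick check; the second line assumes the cutoff just mentioned):

range(nz$Household.INC, na.rm = TRUE)        # 0 to 2,500,000
sum(nz$Household.INC < 10000, na.rm = TRUE)  # the 8 low-income households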

Sensitivity of the data to outliers?

Let’s re-run the model while eliminating the extremely rich, restricting focus to 98% of the data. This restriction would appear to be justified: it would not be surprising if the very rich owned their own homes.

# Select 98 % of the range
nz2 <-  nz%>%
  dplyr::filter(Household.INC_s < 4)
##
nrow(nz2)/nrow(nz)
[1] 0.979438
home2.1 <- glm(HomeOwner ~ Household.INC_s, data = nz2, 
            family = "binomial")
parameters::model_parameters(home2.1)
Parameter       | Log-Odds |   SE |       95% CI |     z |      p
-----------------------------------------------------------------
(Intercept)     |     1.60 | 0.05 | [1.50, 1.71] | 29.21 | < .001
Household.INC_s |     0.78 | 0.08 | [0.62, 0.95] |  9.27 | < .001

Note that we need a sensible range. The lowest value is well within 4 SD of the mean, so only the upper tail needed trimming:

hist(nz2$Household.INC_s)

range(nz2$Household.INC_s, na.rm = T)
[1] -1.187046  3.755599

Does this matter? Not necessarily. Linear regression does not require that the predictors sample from a normal distribution. This should be obvious, as we have used categorical predictors (e.g. Male/Not_Male). However it is worth explicitly stating. I have heard people who should know better express confusion on this point. Of course, that there are no parametric assumptions for your predictor variables does not let you off the hook. In this case, the extreme values might be distorting our inference.

Let’s compare the model fitted to the restricted dataset with the original model, graphically.

#range(nz$Household.INC_s, na.rm = T)

mp2 <- plot(ggeffects::ggpredict(home2,
                                 terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(limits = c(-1.2, 4))
mp2.1 <- plot(ggeffects::ggpredict(home2.1,
                                   terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(limits = c(-1.2, 4))

library(patchwork)
mp2 + mp2.1 +
  plot_annotation(title = "Comparison of logistic regression models with transformations",
                  tag_levels = 'a') 

Notice the trick that I used here:

scale_x_continuous(limits = c(-1.2, 4))

This code allowed me to constrain both graphs to the same x-axis scale. If I left this out the graphs would have looked like this:

#range(nz$Household.INC_s, na.rm = T)
mp2 <- plot(ggeffects::ggpredict(home2,
                                 terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1))
mp2.1 <- plot(ggeffects::ggpredict(home2.1,
                                   terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1))

library(patchwork)
mp2 + mp2.1 + plot_annotation(title = "Comparison of logistic regression models with transformations",
                              tag_levels = 'a') +
  plot_layout(guides = 'collect')

The models are for all intents and purposes identical.

You can perform further checks using the following code; however, it should be apparent from the preceding graph that the two models do not vary.

check1 <- performance::check_model(home2)
check1
check2 <- performance::check_model(home2.1)
check2

Before leaving this example, what would an ordinary linear regression sampling from a normal distribution have returned?

home2LM <- lm(HomeOwner ~ Household.INC_s, data = nz)
home2LM_r <- parameters::model_parameters(home2LM)
home2LM_r
Parameter       | Coefficient |       SE |       95% CI | t(2796) |      p
--------------------------------------------------------------------------
(Intercept)     |        0.81 | 7.27e-03 | [0.80, 0.83] |  111.88 | < .001
Household.INC_s |        0.06 | 7.23e-03 | [0.04, 0.07] |    8.04 | < .001
plot(home2LM_r)

Let’s graph the results and compare the two models:

home2LM_p <- plot(ggeffects::ggpredict(home2LM,
                                       terms = "Household.INC_s[all]"),
                  add.data = TRUE) +
  scale_y_continuous(limits = c(0, 1.5))  +
  scale_x_continuous(limits = c(-1.2, 4)) + theme_classic()
home2LM_p

home2_p <- plot(ggeffects::ggpredict(home2,
                                     terms = "Household.INC_s[all]"),
                add.data = TRUE) +
  scale_y_continuous(limits = c(0, 1.5))  +
  scale_x_continuous(limits = c(-1.2, 4)) + theme_classic()

home2LM_p + home2_p + plot_annotation(subtitle = "Comparison of ordinary least squares regression \n(a) with logistic regression (b) reveals \nimpossible predictions for OLS") +
  plot_layout(guides = "collect")

Logistic regression with a categorical covariate

Let’s use the GenCohort variable.

mg1 <- glm(HomeOwner ~ GenCohort, data = nz, family = "binomial")
parameters::model_parameters(mg1)
Parameter                               | Log-Odds |   SE |         95% CI |      z |      p
--------------------------------------------------------------------------------------------
(Intercept)                             |     2.13 | 0.10 | [ 1.94,  2.34] |  20.68 | < .001
GenCohortGen_Silent *  born< 1946       |     0.26 | 0.28 | [-0.26,  0.85] |   0.94 | 0.347 
GenCohortGenX *  born >=1961 & b.< 1980 |    -0.62 | 0.13 | [-0.87, -0.38] |  -4.90 | < .001
GenCohortGenY *  born >=1980 & b.< 1996 |    -1.87 | 0.14 | [-2.15, -1.59] | -13.00 | < .001
GenCohortGenZ *  born >= 1996           |    -4.91 | 1.04 | [-7.80, -3.30] |  -4.74 | < .001

What does this mean? Let’s graph the results

p_mg1 <-plot(ggeffects::ggpredict(mg1, terms = "GenCohort[all]"))
p_mg1

Let’s now adjust for household income (Household.INC_s):

mg2 <- glm(HomeOwner ~ GenCohort + Household.INC_s, data = nz, family = "binomial")
parameters::model_parameters(mg2)
Parameter                               | Log-Odds |   SE |         95% CI |      z |      p
--------------------------------------------------------------------------------------------
(Intercept)                             |     2.60 | 0.12 | [ 2.37,  2.84] |  21.79 | < .001
GenCohortGen_Silent *  born< 1946       |     0.70 | 0.30 | [ 0.15,  1.33] |   2.35 | 0.019 
GenCohortGenX *  born >=1961 & b.< 1980 |    -0.99 | 0.14 | [-1.26, -0.73] |  -7.31 | < .001
GenCohortGenY *  born >=1980 & b.< 1996 |    -2.40 | 0.16 | [-2.71, -2.09] | -14.94 | < .001
GenCohortGenZ *  born >= 1996           |    -4.99 | 1.06 | [-7.91, -3.32] |  -4.72 | < .001
Household.INC_s                         |     1.19 | 0.10 | [ 0.99,  1.39] |  11.62 | < .001

We really don’t see income making a difference to home ownership for boomers. A little separation happens among those in Gen X.

p_mg2 <-plot(ggeffects::ggpredict(mg2, terms = c("GenCohort[all]", "Household.INC_s[c(-1,0,3)]")))
p_mg2

Notes

sjPlot::tab_model(home2)
  HomeOwner
Predictors Odds Ratios CI p
(Intercept) 4.87 4.39 – 5.43 <0.001
Household.INC_s 2.06 1.76 – 2.43 <0.001
Observations 2798
R2 Tjur 0.036
plot(ggeffects::ggpredict(home2, 
                     terms = "Household.INC_s[all]"), add.data = TRUE) + scale_x_continuous(limits= c(-1.2,4))


Uncertainty arises because we only have 17 people in this jittered nz dataset born after 1996:

table(nz$GenCohort)

Gen Boombers: born >= 1946 & b.< 1961 
                                 1024 
               Gen_Silent: born< 1946 
                                  208 
         GenX: born >=1961 & b.< 1980 
                                 1253 
         GenY: born >=1980 & b.< 1996 
                                  416 
                   GenZ: born >= 1996 
                                   17 
ggplot(nz, (aes(GenCohort, Household.INC))) + geom_jitter(alpha = .5)

Which model should we prefer?

The model mg2 greatly improves on the BIC performance, indicating that we should prefer this model.

per_home <-performance::compare_performance(home2,mg1,mg2)
per_home
# Comparison of Model Performance Indices

Name  | Model |      AIC |      BIC | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log |   PCP | Score_spherical
----------------------------------------------------------------------------------------------------------------
home2 |   glm | 2590.320 | 2602.193 |     0.036 | 0.381 | 0.962 |    0.462 |      -Inf | 0.708 |                
mg1   |   glm | 2516.922 | 2546.675 |     0.099 | 0.372 | 0.941 |    0.442 |      -Inf | 0.724 |       3.915e-04
mg2   |   glm | 2272.307 | 2307.927 |     0.168 | 0.354 | 0.900 |    0.404 |      -Inf | 0.748 |       5.159e-04

Here’s a graph:

plot(per_home)

Note that we do not include the model that used fewer cases because, from the vantage point of information theory, this would be comparing apples with oranges.

We can use the performance package to check the accuracy of our models.

Compare:

performance_accuracy(mg1)
# Accuracy of Model Predictions

Accuracy: 65.78%
      SE: 4.45%-points
  Method: Area under Curve

With:

performance_accuracy(mg2)
# Accuracy of Model Predictions

Accuracy: 76.94%
      SE: 2.54%-points
  Method: Area under Curve

And this reveals a substantial difference: adding income to the cohort model improves classification accuracy from about 66% to about 77%, consistent with the BIC comparison above.

Poisson regression (counts)

We use a poisson model for rates and counts

\[ y_i \sim Poisson(\lambda_i)\\ log(\lambda_i) = \alpha +\beta x_i\\ E(y_i \mid x_i) = \lambda_i = exp(\alpha +\beta x_i) \]

Let’s simulate some data (an example from Gelman, Hill, and Vehtari (2020)):

set.seed(999)
n <- 50
x <- runif(n, -2, 2)
a <- 1
b <- 2
out <- a + b  * x
set.seed(999)
y <- rpois(n,  exp(out)) # mean of poisson is equal to its variance
fake <- data.frame(x=x, y=y)
hist(fake$y)

Model

pois1 <- glm(y ~ x, data = fake, family = "poisson")
model_parameters(pois1)
Parameter   | Log-Mean |   SE |       95% CI |     z |      p
-------------------------------------------------------------
(Intercept) |     1.01 | 0.11 | [0.79, 1.21] |  9.40 | < .001
x           |     1.99 | 0.08 | [1.84, 2.14] | 26.24 | < .001

Note that in a poisson model, it is the log of the expected values (not the log of the raw data) that the model estimates:

# this is how to quickly generate the equation
equatiomatic::extract_eq(pois1,  use_coefs = TRUE)

\[ \log ({ \widehat{E( \operatorname{y} )} }) = 1.01 + 1.99(\operatorname{x}) \]

We can put this on the data scale

p_pois1 <- plot(ggeffects::ggpredict(pois1, terms = "x"), add.data = TRUE)
p_pois1

Let’s compare this to a model in which we assume a normal distribution

pois2 <- glm(y ~ x, data = fake) # no family argument, so gaussian by default
model_parameters(pois2)
Parameter   | Coefficient |   SE |         95% CI | t(48) |      p
------------------------------------------------------------------
(Intercept) |       15.19 | 2.35 | [10.58, 19.81] |  6.45 | < .001
x           |       14.73 | 2.04 | [10.73, 18.73] |  7.22 | < .001

Quick plot:

p_pois2 <- plot(ggeffects::ggpredict(pois2, terms = "x"), add.data = TRUE)
p_pois2

Model checks

performance::check_model(pois1)

We can see the linearity assumption is violated:

performance::check_model(pois2)

library(splines)

pois3 <- glm(y ~ bs(x), data = fake) # a cubic spline, still gaussian by default
model_parameters(pois3)
Parameter      | Coefficient |   SE |           95% CI |  t(46) |      p
------------------------------------------------------------------------
(Intercept)    |       -6.57 | 2.13 | [-10.74,  -2.39] |  -3.08 | 0.003 
x [1st degree] |       43.71 | 7.41 | [ 29.19,  58.23] |   5.90 | < .001
x [2nd degree] |      -67.11 | 5.30 | [-77.50, -56.72] | -12.66 | < .001
x [3rd degree] |      118.01 | 4.46 | [109.26, 126.76] |  26.44 | < .001
p_pois3<- plot(ggeffects::ggpredict(pois3, terms = "x"), add.data = TRUE)
p_pois3

library(patchwork)
p_pois1 + p_pois2 + p_pois3 + plot_annotation(title = "comparison of three assumed distributions", tag_levels = 'a',
                                              subtitle = "The Poisson model (a) fits \nThe gaussian model (b) underfits \nThe spline model (c) overfits") +
  plot_layout(guides = "collect")

The poisson fits best. This is no surprise: we simulated poisson outcomes.

per_pois <-performance::compare_performance(pois1,pois2,pois3)
per_pois
# Comparison of Model Performance Indices

Name  | Model |     AIC |     BIC |   RMSE |  Sigma |    R2 | Nagelkerke's R2 | Score_log | Score_spherical
-----------------------------------------------------------------------------------------------------------
pois1 |   glm | 188.481 | 192.305 |  3.361 |  0.985 |       |           1.000 |    -1.845 |           0.097
pois2 |   glm | 426.698 | 432.434 | 16.249 | 16.584 | 0.521 |                 |           |                
pois3 |   glm | 300.273 | 309.833 |  4.410 |  4.597 | 0.965 |                 |           |                
plot(per_pois)

Negative binomial models (over-dispersed Poissons)

\[ y_i \sim NegBinomial(\lambda_i,\phi)\\ log(\lambda_i) = \alpha +\beta x_i\\ \phi \sim gamma(.01,.01) \]

Over-dispersion in the rate parameter

The expected value and the variance of a poisson variable are the same: \(\lambda\). However, often (indeed typically) this assumption is violated, and the random variables in one’s dataset are over-dispersed.

To see this, let’s simulate data with overdispersion:

library(MASS)
n <- 100
x <- runif(n, -2, 2)
a <- 1
b <- 2
out <- a + b  * x

set.seed(999)

y <- rnegbin(n,  mu = exp(out), theta = 2) # theta is the overdispersion parameter
fake2 <- data.frame(x=x, y=y)
hist(fake2$y, breaks = 50)
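A quick check of the simulated data shows the overdispersion directly:

mean(fake2$y)  # the sample mean...
var(fake2$y)   # ...is far smaller than the sample variance; for a Poisson they'd be roughly equal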

We fit a poisson model:


nb1<- glm(y ~ x, family = poisson, fake2)
summary(nb1)

Call:
glm(formula = y ~ x, family = poisson, data = fake2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-9.7583  -1.3352  -0.4345   0.8231  11.6413  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.91028    0.07736   11.77   <2e-16 ***
x            2.07293    0.04596   45.10   <2e-16 ***
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6242.37  on 99  degrees of freedom
Residual deviance:  961.42  on 98  degrees of freedom
AIC: 1249.2

Number of Fisher Scoring iterations: 5
performance::check_overdispersion(nb1)
# Overdispersion test

       dispersion ratio =   9.769
  Pearson's Chi-Squared = 957.369
                p-value = < 0.001

Try a negative binomial model

nb2<- glm.nb(y ~ x,  data = fake2)
compare_models(nb1,nb2)
Parameter    |               nb1 |               nb2
----------------------------------------------------
(Intercept)  | 0.91 (0.76, 1.06) | 1.00 (0.76, 1.25)
x            | 2.07 (1.98, 2.16) | 2.00 (1.80, 2.20)
----------------------------------------------------
Observations |               100 |               100
plot(ggeffects::ggpredict(nb1,terms="x"), add.data = TRUE)

plot(ggeffects::ggpredict(nb2,terms="x"), add.data = TRUE)

It is clear, both from the test statistics and from the graphs, that the negative binomial model provides a better fit.

How would a normal linear model do here?

lmnb<- lm(y ~ x,  fake2)
plot(ggeffects::ggpredict(lmnb,terms="x"), add.data = TRUE)

The answer is: not very well.

Zero-inflated poisson / negative binomial regression

\[ y_i \sim ZIPoisson(p_i, \lambda_i)\\ logit(p_i) = \alpha_p + \beta_p x_i \\ log(\lambda_i) = \alpha_\lambda + \beta_\lambda x_i \]
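To build intuition for the two components — a coin flip that produces structural zeros, and a poisson count otherwise — here is a minimal simulation (the parameter values are arbitrary):

set.seed(42)
n <- 1000
p_zero <- 0.7                  # probability of a structural zero
lambda <- 2                    # poisson rate for the count component
zero <- rbinom(n, 1, p_zero)   # 1 = structural zero
y <- ifelse(zero == 1, 0, rpois(n, lambda))
hist(y, breaks = 20)           # a spike at zero plus a poisson hump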

In the nz dataset, volunteering (hours of charity) looks to be zero-inflated, and also over-dispersed.

hist(nz$HoursCharity, breaks = 100)

We can quickly check the proportion of people who report zero volunteering:

sum(nz$HoursCharity ==0, na.rm=TRUE)/nrow(nz)
[1] 0.6826594

We can get a sense of over-dispersion by looking at the ratio of the sample variation to the sample mean:

sd(nz$HoursCharity, na.rm=TRUE)/mean(nz$HoursCharity, na.rm=TRUE)^2
[1] 1.680429

There’s about 1.68 times more dispersion than a poisson model would expect.

Let’s use the performance package to formally check both zero-inflation and overdispersion.

z1 <-glm(HoursCharity ~ 1, family = "poisson", data = nz)

check_zeroinflation(z1)
# Check for zero-inflation

   Observed zeros: 1992
  Predicted zeros: 536
            Ratio: 0.27

Next check for over-dispersion:

check_overdispersion(z1)
# Overdispersion test

       dispersion ratio =    13.196
  Pearson's Chi-Squared = 37595.179
                p-value =   < 0.001

And indeed, we find both.

Introducing brms for estimating complex regression models

I use the brms package to estimate zero-inflated and/or negative binomial models

Zero_inflated_poisson model syntax

library(brms)

# Requires integer output 
nz$HoursCharity <- as.integer(nz$HoursCharity)

# Scale household income
nz['Household.INC_s'] = as.data.frame(scale(nz$Household.INC))

# Scale religious identification
nz['Relid_s'] = as.data.frame(scale(nz$Relid))

b0<- brms::brm(HoursCharity ~ Relid_s + Household.INC_s, 
                family = "zero_inflated_poisson",
                file = here::here("models", "zeroinflated_poisson_volunteer"), 
               data = nz)
summary(b0)
 Family: zero_inflated_poisson 
  Links: mu = log; zi = identity 
Formula: HoursCharity ~ Relid_s + Household.INC_s 
   Data: nz (Number of observations: 2796) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept           1.70      0.02     1.67     1.73 1.00
Relid_s             0.00      0.01    -0.02     0.03 1.00
Household.INC_s    -0.09      0.02    -0.12    -0.05 1.00
                Bulk_ESS Tail_ESS
Intercept           3785     3068
Relid_s             4501     3362
Household.INC_s     3704     3054

Family Specific Parameters: 
   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
zi     0.70      0.01     0.68     0.72 1.00     3849
   Tail_ESS
zi     2989

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

My preferred way of graphing the predicted effects of religious identification on volunteering is to generate prediction lines from the posterior samples. This captures uncertainty in an intuitive way. I’ll explain the mechanics when we get to Bayesian estimation, which, by the way, we are already doing here.

plot(
  conditional_effects(
    b0,
    spaghetti = TRUE,
    nsamples = 100,
    select_points = 0.1
  ),
  points = TRUE,
  point_args = list(alpha = 0.1,
                    width = .02)
) # points = TRUE overlays the raw data

Zero_inflated_negbinomial model syntax

b1 <- brms::brm(HoursCharity ~ Relid_s + Household.INC_s, 
                family = "zero_inflated_negbinomial",
                file = here::here("models", "zeroinflated_neg_bin_volunteer"),
                data = nz)
summary(b1)
 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = identity 
Formula: HoursCharity ~ Relid_s + Household.INC_s 
   Data: nz (Number of observations: 2796) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept           0.89      0.17     0.52     1.17 1.00
Relid_s             0.21      0.06     0.11     0.33 1.00
Household.INC_s    -0.07      0.05    -0.16     0.03 1.00
                Bulk_ESS Tail_ESS
Intercept            949      885
Relid_s             1456     1698
Household.INC_s     2061     2040

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
shape     0.28      0.07     0.16     0.42 1.00      938
zi        0.35      0.11     0.07     0.50 1.00      939
      Tail_ESS
shape      914
zi         697

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Compare fits of the two models using leave one out cross-validation.

Note that on this index, negative numbers are worse (sorry to confuse…)

b0 <- add_criterion(b0, "loo")
b1 <- add_criterion(b1, "loo")

w <-loo_compare(b0, b1, criterion = "loo")
w
   elpd_diff se_diff
b1     0.0       0.0
b0 -1514.8     207.5

Recall we had been using an AIC/BIC convention to estimate improvements in goodness of fit. We can generate an analogous index as follows:

cbind(waic_diff = w[, 1] * -2,
      se        = w[, 2] * 2)
   waic_diff       se
b1     0.000   0.0000
b0  3029.657 414.9648

And we can see that the negative binomial model fits much better.

A model for your zeros

The estimates in the graphs above are only for the positive (non-zero) component of the model. Let’s look at the results again:

summary(b1)
 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = identity 
Formula: HoursCharity ~ Relid_s + Household.INC_s 
   Data: nz (Number of observations: 2796) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept           0.89      0.17     0.52     1.17 1.00
Relid_s             0.21      0.06     0.11     0.33 1.00
Household.INC_s    -0.07      0.05    -0.16     0.03 1.00
                Bulk_ESS Tail_ESS
Intercept            949      885
Relid_s             1456     1698
Household.INC_s     2061     2040

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
shape     0.28      0.07     0.16     0.42 1.00      938
zi        0.35      0.11     0.07     0.50 1.00      939
      Tail_ESS
shape      914
zi         697

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The probability of non-volunteering in the preferred model, for people at the mean of Religious Identification and the mean of Household income in this population, is plogis(.35) or 0.5866176. More often than not, we should predict zeros in this population. What predicts the zero component of the model? We can use this syntax:

b2 <- brms::brm(
  bf(HoursCharity ~ Relid_s + Household.INC_s, # note: use `bf` when you have more than one model, as we do here
     zi ~ Relid_s + Household.INC_s),
  family = "zero_inflated_negbinomial",
  file = here::here("models", "zeroinflated_nb_2_volunteer"),
  data = nz)

Let’s look at the results:

sjPlot::tab_model(b2)
  HoursCharity
Predictors Incidence Rate Ratios CI (95%)
Count Model
Intercept 3.40 2.85 – 3.95
Relid_s 1.02 0.94 – 1.11
Household.INC_s 0.93 0.84 – 1.03
Zero-Inflated Model
Intercept 1.06 0.72 – 1.38
Relid_s 0.52 0.42 – 0.61
Household.INC_s 1.02 0.87 – 1.17
Observations 2796
R2 Bayes 0.015

Interpretation

We can graph the results using ggeffects::

Religious identification:

plot(ggeffects::ggpredict(b2, terms = c("Relid_s")), 
    add.data = TRUE,  # doesn't work
     dot.alpha = .2,  
     facet = TRUE)  + ylim(0, 5)

Household income:

plot(ggeffects::ggpredict(b2, terms = c("Household.INC_s")), 
    add.data = TRUE,  # doesn't work
     dot.alpha = .2,  
     facet = TRUE)  + ylim(0, 5) +  xlim(0, 5)

Or using my preferred method

Predicted effects of religious identification:

plot(
  conditional_effects(
    b2,
    spaghetti = TRUE,
    nsamples = 100,
    select_points = 0.1
  ),
  points = TRUE,
 point_args = list(alpha = 0.05,
                    width = .02),
 ask = FALSE
)  + # add a shared y-axis limit
  ylim(0,5)

NULL

Note:

Models of class brmsfit always condition on the zero-inflation component, if the model has such a component. Hence, there is no type = “zero_inflated” nor type = “zi_random” for brmsfit-models, because predictions are based on draws of the posterior distribution, which already account for the zero-inflation part of the model.

See package description

Comparing models, we find that adding the predictors improves the model:

b2 <- add_criterion(b2, "loo")

w <-loo_compare(b0, b1, b2,  criterion = "loo")
w
   elpd_diff se_diff
b2     0.0       0.0
b1   -50.0      10.1
b0 -1564.8     206.1

Again we can use a −2 × log-likelihood analogue:


cbind(waic_diff = w[, 1] * -2,
      se        = w[, 2] * 2)
   waic_diff        se
b2    0.0000   0.00000
b1  100.0416  20.14189
b0 3129.6983 412.13260

Posterior predictive checks

These help to show the implications of our modelling decisions

Zero-inflated poisson PP check

brms::pp_check(b0) + xlim(0, 5)

Zero-inflated negative binomial with no predictors for the zero-inflation PP check

brms::pp_check(b1) + xlim(0, 5)

Zero-inflated negative binomial with zero-inflation predictors PP check

brms::pp_check(b2) + xlim(0, 5)

summary(b2)
 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: HoursCharity ~ Relid_s + Household.INC_s 
         zi ~ Relid_s + Household.INC_s
   Data: nz (Number of observations: 2796) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
                   Estimate Est.Error l-95% CI u-95% CI
Intercept              1.22      0.08     1.05     1.37
zi_Intercept           0.05      0.16    -0.33     0.32
Relid_s                0.02      0.04    -0.06     0.10
Household.INC_s       -0.07      0.05    -0.17     0.03
zi_Relid_s            -0.66      0.09    -0.88    -0.50
zi_Household.INC_s     0.02      0.07    -0.14     0.16
                   Rhat Bulk_ESS Tail_ESS
Intercept          1.00     1657     2031
zi_Intercept       1.00     1466     1395
Relid_s            1.00     3142     2943
Household.INC_s    1.00     3404     2974
zi_Relid_s         1.00     2096     1704
zi_Household.INC_s 1.00     2964     2613

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
shape     0.45      0.06     0.33     0.58 1.00     1510
      Tail_ESS
shape     1500

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Reporting

You can start with the report package to generate reports. However, be careful about boilerplate reporting: it is up to you, as an applied scientist, to interpret your results.

For example the report package describes the \(R^2\) statistic. For reasons we can discuss in lab, this statistic is problematic, especially for observational studies.

A better option for expressing the practical interest of a result is to describe the expected values of an outcome across a range of predictors. We have been doing this all along with our graphical models. However, you might want to report the numbers for these expected values as well. For this purpose, the ggeffects package can be handy. You merely leave out the plot component of your syntax:

ggeffects::ggpredict(b2, terms = "Relid_s")
# Predicted counts of HoursCharity
# x = Relid_s

    x | Predicted |       95% CI
--------------------------------
-0.64 |      1.29 | [1.14, 1.46]
-0.25 |      1.51 | [1.36, 1.68]
 0.15 |      1.74 | [1.58, 1.92]
 0.54 |      1.97 | [1.78, 2.18]
 0.94 |      2.19 | [1.96, 2.44]
 1.33 |      2.40 | [2.13, 2.72]
 1.73 |      2.60 | [2.26, 2.99]
 2.12 |      2.78 | [2.36, 3.27]

Adjusted for:
* Household.INC_s = 0.01

Here, we have generated the predicted magnitudes of volunteering across the full range of the predictor (Religious Identification, as measured in standard deviation units). The expected volunteering rate of a secular person when other predictors are set to zero is 1.29 hours [95% CI 1.14, 1.46], and for a fully religiously identified person when all other predictors are set to zero it is 2.78 hours [95% CI 2.36, 3.27].

We might want to focus on different elements of the model. The expected volunteering among the secular population (when other predictors are set to zero) who volunteer is exp(1.23) = 3.421, and for a fully identified religious person (when all other predictors are set to zero) who volunteers it is exp(1.23 + .02 × 2.12) = 3.569. The expected zero-rate among the secular population is plogis(0.05) = 0.512; for a fully identified religious person it is plogis(0.05 − 0.65 × 2.12) = 0.209. This suggests that religious people are less zero-inflated. However, among those who volunteer, religion isn’t predicting much of a difference in the number of hours one volunteers (in the baseline population).
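A quick sanity check of this arithmetic, plugging in the rounded coefficients quoted above:

exp(1.23)                  # ~ 3.42 expected hours, secular volunteers
exp(1.23 + 0.02 * 2.12)    # ~ 3.57 expected hours, fully identified volunteers
plogis(0.05)               # ~ 0.51 zero-rate, secular
plogis(0.05 - 0.65 * 2.12) # ~ 0.21 zero-rate, fully identified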

Is this interesting? Rather than averaging across both the zero-inflated and volunteering populations, it might be more informative to separately describe behaviour among each of these distinct populations. Remember, as applied scientists, we are not merely trying to be explicit about our assumptions and decisions. We’re trying to improve our beliefs relating to some concrete question. Our reporting must reflect our scientific interests. It must clarify what we’ve learned about the world, as well as how we remain uncertain.

Summary

Today we saw how the gaussian linear model is a special case of the generalised linear model, and we learned how to write, interpret, and report four common extensions: logistic regression for binary responses, poisson regression for counts, negative binomial regression for over-dispersed counts, and zero-inflated models.

Appendix 1

If you want to graph each predictor separately, employ the [[1]] or [[2]] syntax as follows:

Predicted effects of religious identification

b2_p1 <- plot(
  conditional_effects(
    b2,
    spaghetti = TRUE,
    nsamples = 100,
    select_points = 0.1
  ),
  points = TRUE,
  ask = TRUE,
  point_args = list(alpha = 0.1,
                    width = .02)
)[[1]]  + # note this command controls which facet
  ylim(0, 5) + labs(title = "Better title",
                                 subtitle = "better subtitle") +
  xlab("Religious Identification (SD units)") +
  ylab("Hours volunteering in the previous week")

Predicted effects of income:
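Note that b2_p2 is not defined above; here is a minimal sketch, mirroring b2_p1 but selecting the second facet with [[2]] (the axis labels are my guesses):

b2_p2 <- plot(
  conditional_effects(
    b2,
    spaghetti = TRUE,
    nsamples = 100,
    select_points = 0.1
  ),
  points = TRUE,
  ask = TRUE,
  point_args = list(alpha = 0.1,
                    width = .02)
)[[2]] + # [[2]] selects the second predictor's facet
  ylim(0, 5) +
  xlab("Household income (SD units)") +
  ylab("Hours volunteering in the previous week")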

library("patchwork")
b2_p1 + b2_p2

Appendix 2

We have covered lots of ground today. The generalised linear models described here are the most commonplace; however, there are many more. For those who are curious about how to estimate zero-inflated binary data, I recommend Solomon Kurz’s work (Kurz 2020).

Appendix 3

Aside: In Bayesian estimation we often estimate the standard deviation, \(\sigma\); however, outside these circles people describe the same parameter as the variance, or \(\sigma^2\).

It doesn’t really matter: the parameters don’t care how we express them. We can say, “Johannes is 2 meters tall,” or we can say, “Johannes is the square root of 4 meters tall,” and his height will remain indifferent to our convention. So too, the world is indifferent to whether we write \(\sigma\) or \(\sigma^2\).

Acknowledgments and References

Gelman, Hill, and Vehtari (2020)

Kurz (2020)

Bürkner and Vuorre (2019)

Bürkner, Paul-Christian, and Matti Vuorre. 2019. “Ordinal Regression Models in Psychology: A Tutorial.” Advances in Methods and Practices in Psychological Science 2 (1): 77–101.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.
Kurz, A. Solomon. 2020. Zenodo. https://doi.org/10.5281/zenodo.4080013.

  1. It is worth cultivating this explicit habit because we will need it when clarifying our beliefs about the world prior to modelling the world (as we do in Bayesian modelling).↩︎


Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. Source code is available at https://go-bayes.github.io/psych-447/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Bulbulia (2021, April 27). Psych 447: Generalised linear models. Retrieved from https://vuw-psych-447.netlify.app/posts/8_1/

BibTeX citation

@misc{bulbulia2021generalised,
  author = {Bulbulia, Joseph},
  title = {Psych 447: Generalised linear models},
  url = {https://vuw-psych-447.netlify.app/posts/8_1/},
  year = {2021}
}