Estimating binary responses, counts, rates, overdispersion, and zero-inflation
Recall that a regression model combines data from a sample, on the one hand, with the tools of probability theory, on the other hand, to infer features of an unobserved population. We call these unobserved features “parameters.” The task in regression is to estimate uncertainty in these parameters, and where possible, to narrow the bandwidth of such uncertainty. In a nutshell, regression is educated guesswork.
Our focus has been on teaching workflows for estimation that are grounded in three imperatives:
So far, our models have assumed that the response data are sampled from a “normal” or “gaussian” distribution. This assumption is useful because we live in an ordered universe in which many observations are approximately normally distributed. However, although the gaussian distribution is a sensible choice for estimating many parameters, it is not a sensible choice for estimating all parameters. The linear regression model that we have been using is really a special case of the generalised linear model. Today we introduce the generalised linear model.

## A regression model implies a ‘link’ function
So far, we have written the gaussian linear model as:
\[ y_i = \alpha + \beta x_i + e_i\]
Where \(y_i\) is the response for the \(i^{th}\) data point, \(\alpha\) is the intercept, or the expected response for the population when all other regression coefficients are set to zero, and \(\beta\) is the adjustment in this expectation for the predictor \(x_i\) associated with \(y_i\), where both \(y\) and \(x\) are random variables.
Often applied regression models use compact matrix notation, where a bolded symbol is shorthand for a matrix:
\[\boldsymbol{\eta} = \boldsymbol{\alpha} + \boldsymbol{X}\boldsymbol{\beta}\]
For example, here \(\boldsymbol{X}\boldsymbol{\beta}\) is shorthand for \(\beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} \dots\)
We can write the gaussian linear model in a way that makes our assumptions and decisions explicit about what we think/hope it is doing when combining data with probability theory to estimate unknown (and inherently uncertain) parameters.¹
Recall that there are two parameters that we need to estimate for the gaussian linear model: the mean and the variance.
\[
y_i \sim \operatorname{Normal}(\mu_i, \sigma)\\
\mu_i = \alpha + \beta x_i \\
\sigma \sim \operatorname{Exponential}(1)
\]
We assume that our response variable is a random variable sampled from a normal distribution with mean \(\mu_i\) and variance \(\sigma^2\). For reasons that will become clear in the final lecture, we write the scale parameter as the square root of the variance, \(\sigma\), which is a standard deviation (see Appendix 3).
Every linear model implies a ‘link’ function:
\(g(\cdot)\) = link function, as well as an inverse link function:

\(h(\cdot) = g^{-1}(\cdot)\) = inverse link function,

where the argument \((\cdot)\) is the linear predictor, \(\boldsymbol{\eta}\).
In the gaussian linear model we assume
\[g(\cdot) = h(\cdot)\]
That is, the link function is the identity function. This is a special case. In the remainder of the lecture we will explore different mappings between the expected outcome of a model and its linear predictor.
We begin with outcomes that are distributed as binary data. Consider home ownership in the nz jitter dataset:
hist(nz$HomeOwner)
From the graph, we can see the response distribution is entirely bimodal (excluding the NAs): every response is either 0 or 1.
The binomial link function does two things:

- it maps probabilities, which live in the interval \((0,1)\), onto the entire real number line \((-\infty, \infty)\);
- it maps the linear predictor back onto the probability scale.
The logistic function satisfies the first task:
\[\operatorname{logit}(p) = \log_e\frac{p}{1-p}\]
The inverse logit function satisfies the second task:
\[\operatorname{logit}^{-1}(x) = \frac{e^{x}}{1+e^{x}}\]
We can write this:
\[
\Pr(y_i = 1) = p_i \\
\operatorname{logit}(p_i) = \alpha + \beta x_i \\
\Pr(y_i = 1) = \operatorname{logit}^{-1}(\alpha + \beta x_i)
\]

In R these functions are written:
logit <- qlogis     # quantile function of the logistic distribution = log-odds
invlogit <- plogis  # cumulative distribution function = inverse logit
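A quick sanity check (a minimal sketch) confirms that these are inverse functions:

p <- 0.8
logit(p)            # log-odds: log(.8/.2), approximately 1.386
invlogit(logit(p))  # returns the original probability: 0.8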
Aside: later on we will express priors for the parameters of the linear predictor, something like:
\[
\text{Ownership}_i \sim \operatorname{Binomial}(n, p_i)\\
\operatorname{logit}(p_i) = \alpha + \beta X_i \\
\alpha \sim \operatorname{Normal}(0, \sigma_{\alpha})\\
\sigma_{\alpha} \sim \operatorname{Half-Normal}(0, 10) \\
\beta \sim \operatorname{Lognormal}(0, 1)
\]
Where the values we assign correspond to a range of expectations about how the world might be. Let’s not get distracted with these details for now; I include them to give you a sense of how notation can help us to clarify our assumptions and decisions – which, recall, are essential navigation points for applied scientists.
We can write a generalised linear model that predicts home ownership as follows:
home <- glm(HomeOwner ~ 1, data = nz,
family = "binomial")
Which gives us the following result:
parameters::model_parameters(home)
Parameter | Log-Odds | SE | 95% CI | z | p
-------------------------------------------------------------
(Intercept) | 1.46 | 0.05 | [1.36, 1.55] | 30.38 | < .001
Or plugging this into the equation:
# this is how to quickly generate the equation
equatiomatic::extract_eq(home, use_coefs = TRUE)
\[ \log\left[ \frac { \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} } \right] = 1.46 \]
How do we interpret this result?
We can use the plogis function to obtain the probability:
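A minimal sketch, applying plogis to the intercept of the fitted model home above:

plogis(coef(home)[["(Intercept)"]])  # inverse-logit of the intercept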
The probability of an NZAVS participant owning their own home is 0.811068
Notably, the logistic regression has estimated the sample mean to seven decimal places.
mean(nz$HomeOwner, na.rm=TRUE)
[1] 0.811068
Next, let’s add household income as a predictor. We first standardise the income variable:
# work around for compatibility issue between ggeffects and dplyr.
nz['Household.INC_s'] <- as.data.frame( scale(nz$Household.INC) )
To get a reference point for a standard deviation of household income, let’s find the standard deviation for the sample:
# how much is a standard deviation -- which will represent a 1 unit change for the regression coefficient?
sd( nz$Household.INC, na.rm = TRUE ) # 95,090
[1] 95090.79
What was the mean household income in 2018?
mean( nz$Household.INC, na.rm = TRUE )
[1] 112877.1
We write a logistic regression model with a single covariate as follows:
home2 <- glm(HomeOwner ~ Household.INC_s, data = nz,
family = "binomial")
Results:
rs2<-parameters::model_parameters(home2)
rs2
Parameter | Log-Odds | SE | 95% CI | z | p
-----------------------------------------------------------------
(Intercept) | 1.58 | 0.05 | [1.48, 1.69] | 29.37 | < .001
Household.INC_s | 0.72 | 0.08 | [0.57, 0.89] | 8.86 | < .001
plot(rs2)
# this is how to quickly generate the equation
equatiomatic::extract_eq(home2, use_coefs = TRUE)
\[ \log\left[ \frac { \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} } \right] = 1.58 + 0.72(\operatorname{Household.INC\_s}) \]
How do we interpret this model?
Let’s use the report package:
report::report(home2)
We fitted a logistic model (estimated using ML) to predict HomeOwner with Household.INC_s (formula: HomeOwner ~ Household.INC_s). The model's explanatory power is weak (Tjur's R2 = 0.04). The model's intercept, corresponding to Household.INC_s = 0, is at 1.58 (95% CI [1.48, 1.69], p < .001). Within this model:
- The effect of Household.INC_s is significantly positive (beta = 0.72, 95% CI [0.57, 0.89], p < .001; Std. beta = 0.73, 95% CI [0.57, 0.89])
Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
There’s lots of mention of p-values and power claims in the automated report, but what do these words practically mean?
We are cultivating a habit of graphing our results. Let’s put this habit to virtuous use.
Here’s a shortcut for obtaining a quick, investigative plot; there is also a slightly longer, explicit syntax for a ggeffects graph.
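A minimal sketch of both approaches, assuming the fitted model home2 (the axis labels are illustrative):

# quick investigative plot
plot(ggeffects::ggpredict(home2, terms = "Household.INC_s"))

# slightly longer, explicit syntax
predicted <- ggeffects::ggpredict(home2, terms = "Household.INC_s[all]")
plot(predicted) +
  ggplot2::labs(x = "Household income (SD units)",
                y = "Pr(HomeOwner = 1)")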
And here we find a curious feature of our data. Income is diffuse: we have someone who makes over 25 standard deviations more than the average income.
hist(nz$Household.INC, breaks = 1000)
Although we graphed home ownership, we forgot to graph household income. Before launching into a model, your workflow should include graphing your predictors as well as your responses.
The range of incomes is 0 to \(2.5 \times 10^{6}\).
Note that we also have 8 people who report making less than $10,000.
Let’s re-run the model while eliminating the extremely rich, restricting focus to 98% of the data (a sketch of the restriction follows). This restriction would appear to be justified: it would not be surprising if very rich people owned their own homes.
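A sketch of the restriction; the exact cutoff used in the original is an assumption (here, the 98th percentile of household income):

# keep the lower 98% of household incomes (cutoff is an assumption)
cutoff <- quantile(nz$Household.INC, probs = .98, na.rm = TRUE)
nz2 <- subset(nz, Household.INC < cutoff)
# proportion of the original sample retained
nrow(nz2) / nrow(nz)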
[1] 0.979438
home2.1 <- glm(HomeOwner ~ Household.INC_s, data = nz2,
family = "binomial")
parameters::model_parameters(home2.1)
Parameter | Log-Odds | SE | 95% CI | z | p
-----------------------------------------------------------------
(Intercept) | 1.60 | 0.05 | [1.50, 1.71] | 29.21 | < .001
Household.INC_s | 0.78 | 0.08 | [0.62, 0.95] | 9.27 | < .001
Note that we now have a sensible range: no value lies more than 4 SD from the mean.
hist(nz2$Household.INC_s)
range(nz2$Household.INC_s, na.rm = T)
[1] -1.187046 3.755599
Does this matter? Not necessarily. Linear regression does not require that the predictors are sampled from a normal distribution. This should be obvious: we have used categorical predictors (e.g. Male/Not_Male). However, it is worth stating explicitly; I have heard people who should know better express confusion on this point. Of course, the absence of parametric assumptions for your predictor variables does not let you off the hook. In this case, the extreme values might be distorting our inference.
Let’s compare graphs of the models fitted to the full and the diminished datasets.
#range(nz$Household.INC_s, na.rm = T)
mp2 <- plot(ggeffects::ggpredict(home2,
terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1)) +
scale_x_continuous(limits = c(-1.2, 4))
mp2.1 <- plot(ggeffects::ggpredict(home2.1,
terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1)) +
scale_x_continuous(limits = c(-1.2, 4))
library(patchwork)
mp2 + mp2.1 +
plot_annotation(title = "Comparison of logistic regression models with transformations",
tag_levels = 'a')
Notice the trick that I used here:
scale_x_continuous(limits = c(-1.2, 4))
This code allowed me to constrain both graphs to the same x-axis scale. If I left this out the graphs would have looked like this:
#range(nz$Household.INC_s, na.rm = T)
mp2 <- plot(ggeffects::ggpredict(home2,
terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1))
mp2.1 <- plot(ggeffects::ggpredict(home2.1,
terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1))
library(patchwork)
mp2 + mp2.1 + plot_annotation(title = "Comparison of logistic regression models with transformations",
tag_levels = 'a') +
plot_layout(guides = 'collect')
The models are for all intents and purposes identical.
You can perform further checks using the following code; however, it should be apparent from the preceding graph that the two models do not vary.
check1 <- performance::check_model(home2)
check1
check2 <- performance::check_model(home2.1)
check2
Before leaving this example, what would an ordinary linear regression sampling from a normal distribution have returned?
home2LM <- lm(HomeOwner ~ Household.INC_s, data = nz)
home2LM_r <- parameters::model_parameters(home2LM)
home2LM_r
Parameter | Coefficient | SE | 95% CI | t(2796) | p
--------------------------------------------------------------------------
(Intercept) | 0.81 | 7.27e-03 | [0.80, 0.83] | 111.88 | < .001
Household.INC_s | 0.06 | 7.23e-03 | [0.04, 0.07] | 8.04 | < .001
plot(home2LM_r)
Let’s graph the results and compare the two models:
home2LM_p <- plot(ggeffects::ggpredict(home2LM,
terms = "Household.INC_s[all]"),
add.data = TRUE) +
scale_y_continuous(limits = c(0, 1.5)) +
scale_x_continuous(limits = c(-1.2, 4)) + theme_classic()
home2LM_p
home2_p <- plot(ggeffects::ggpredict(home2,
terms = "Household.INC_s[all]"),
add.data = TRUE) +
scale_y_continuous(limits = c(0, 1.5)) +
scale_x_continuous(limits = c(-1.2, 4)) + theme_classic()
home2LM_p + home2_p + plot_annotation(subtitle = "Comparison of ordinary least squares regression \n (a) with logistic regression (b) reveals \nimpossible predictions for OLS") +
plot_layout(guides = "collect")
Let’s use the GenCohort variable.
mg1 <- glm(HomeOwner ~ GenCohort, data = nz, family = "binomial")
parameters::model_parameters(mg1)
Parameter | Log-Odds | SE | 95% CI | z | p
--------------------------------------------------------------------------------------------
(Intercept) | 2.13 | 0.10 | [ 1.94, 2.34] | 20.68 | < .001
GenCohortGen_Silent * born< 1946 | 0.26 | 0.28 | [-0.26, 0.85] | 0.94 | 0.347
GenCohortGenX * born >=1961 & b.< 1980 | -0.62 | 0.13 | [-0.87, -0.38] | -4.90 | < .001
GenCohortGenY * born >=1980 & b.< 1996 | -1.87 | 0.14 | [-2.15, -1.59] | -13.00 | < .001
GenCohortGenZ * born >= 1996 | -4.91 | 1.04 | [-7.80, -3.30] | -4.74 | < .001
What does this mean? Let’s graph the results.
Let’s stratify by income (Household.INC_s):
mg2 <- glm(HomeOwner ~ GenCohort + Household.INC_s, data = nz, family = "binomial")
parameters::model_parameters(mg2)
Parameter | Log-Odds | SE | 95% CI | z | p
--------------------------------------------------------------------------------------------
(Intercept) | 2.60 | 0.12 | [ 2.37, 2.84] | 21.79 | < .001
GenCohortGen_Silent * born< 1946 | 0.70 | 0.30 | [ 0.15, 1.33] | 2.35 | 0.019
GenCohortGenX * born >=1961 & b.< 1980 | -0.99 | 0.14 | [-1.26, -0.73] | -7.31 | < .001
GenCohortGenY * born >=1980 & b.< 1996 | -2.40 | 0.16 | [-2.71, -2.09] | -14.94 | < .001
GenCohortGenZ * born >= 1996 | -4.99 | 1.06 | [-7.91, -3.32] | -4.72 | < .001
Household.INC_s | 1.19 | 0.10 | [ 0.99, 1.39] | 11.62 | < .001
We really don’t see income making a difference to home ownership for boomers. A little separation happens among those in Gen X.
Additive effects on the logit scale are non-linear on the data scale. We see this in the previous graph: there’s a curve. This is typical of generalised linear models.
Do not interpret the signs of the coefficients as if they were probabilities. For example, plogis(-3) is 0.0474259, which is a positive probability.
There is no error term (\(\sigma^2\)) in logistic regression. We only estimate the mean; the variance cannot be separately estimated:
sjPlot::tab_model(home2)
HomeOwner

Predictors      | Odds Ratios |          CI |      p
----------------|-------------|-------------|-------
(Intercept)     |        4.87 | 4.39 – 5.43 | <0.001
Household.INC_s |        2.06 | 1.76 – 2.43 | <0.001
Observations    |        2798 |             |
R2 Tjur         |       0.036 |             |
Or, to limit our plotted points to the meaningful range:
Uncertainty arises because we only have 17 people in this jittered nz dataset born after 1996:
table(nz$GenCohort)
Gen Boombers: born >= 1946 & b.< 1961
1024
Gen_Silent: born< 1946
208
GenX: born >=1961 & b.< 1980
1253
GenY: born >=1980 & b.< 1996
416
GenZ: born >= 1996
17
ggplot(nz, aes(GenCohort, Household.INC)) + geom_jitter(alpha = .5)
Which model should we prefer?
The model mg2 greatly improves on the BIC performance, indicating that we should prefer this model.
per_home <-performance::compare_performance(home2,mg1,mg2)
per_home
# Comparison of Model Performance Indices
Name | Model | AIC | BIC | Tjur's R2 | RMSE | Sigma | Log_loss | Score_log | PCP | Score_spherical
----------------------------------------------------------------------------------------------------------------
home2 | glm | 2590.320 | 2602.193 | 0.036 | 0.381 | 0.962 | 0.462 | -Inf | 0.708 |
mg1 | glm | 2516.922 | 2546.675 | 0.099 | 0.372 | 0.941 | 0.442 | -Inf | 0.724 | 3.915e-04
mg2 | glm | 2272.307 | 2307.927 | 0.168 | 0.354 | 0.900 | 0.404 | -Inf | 0.748 | 5.159e-04
Here’s a graph:
plot(per_home)
Note that we do not include the model that used fewer cases (home2.1) because, from the vantage point of information theory, this would be comparing apples with oranges.
We can use the performance package to check the accuracy of our models.
Compare:
performance_accuracy(mg1)
# Accuracy of Model Predictions
Accuracy: 65.78%
SE: 4.45%-points
Method: Area under Curve
With:
performance_accuracy(mg2)
# Accuracy of Model Predictions
Accuracy: 76.94%
SE: 2.54%-points
Method: Area under Curve
And this comparison reveals that adjusting for income improves the accuracy of the model’s predictions.
We use a Poisson model for rates and counts:
\[
y_i \sim \operatorname{Poisson}(\lambda_i)\\
\log(\lambda_i) = \alpha + \beta x_i\\
E(y_i \mid x_i) = \lambda_i = \exp(\alpha + \beta x_i)
\]
Let’s simulate some data (this example comes from Gelman, Hill, and Vehtari (2020); for more on simulating Poisson variables, see here).
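A sketch of the simulation; the seed and the true coefficients are assumptions chosen to match the estimates below:

set.seed(2021)  # hypothetical seed
n <- 50
x <- runif(n, -2, 2)
a <- 1  # true intercept
b <- 2  # true slope
y <- rpois(n, exp(a + b * x))  # Poisson mean specified on the log scale
fake <- data.frame(x = x, y = y)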
Model
pois1 <- glm(y ~ x, data = fake, family = "poisson")
model_parameters(pois1)
Parameter | Log-Mean | SE | 95% CI | z | p
-------------------------------------------------------------
(Intercept) | 1.01 | 0.11 | [0.79, 1.21] | 9.40 | < .001
x | 1.99 | 0.08 | [1.84, 2.14] | 26.24 | < .001
Note that in a poisson model, it is the log of the expected values (not the log of the raw data) that the model estimates:
# this is how to quickly generate the equation
equatiomatic::extract_eq(pois1, use_coefs = TRUE)
\[ \log ({ \widehat{E( \operatorname{y} )} }) = 1.01 + 1.99(\operatorname{x}) \]
We can put this on the data scale:
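For instance, exponentiating the coefficients returns them to the count (data) scale:

exp(coef(pois1))  # multiplicative effects on the expected count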
Let’s compare this to a model in which we assume a normal distribution
pois2 <- glm(y ~ x, data = fake) # no family argument: defaults to gaussian
model_parameters(pois2)
Parameter | Coefficient | SE | 95% CI | t(48) | p
------------------------------------------------------------------
(Intercept) | 15.19 | 2.35 | [10.58, 19.81] | 6.45 | < .001
x | 14.73 | 2.04 | [10.73, 18.73] | 7.22 | < .001
Quick plot:
performance::check_model(pois1)
We can see the linearity assumption is violated:
performance::check_model(pois2)
library(splines)
pois3 <- glm(y ~ bs(x), data = fake) # spline basis for x; again gaussian by default
model_parameters(pois3)
Parameter | Coefficient | SE | 95% CI | t(46) | p
------------------------------------------------------------------------
(Intercept) | -6.57 | 2.13 | [-10.74, -2.39] | -3.08 | 0.003
x [1st degree] | 43.71 | 7.41 | [ 29.19, 58.23] | 5.90 | < .001
x [2nd degree] | -67.11 | 5.30 | [-77.50, -56.72] | -12.66 | < .001
x [3rd degree] | 118.01 | 4.46 | [109.26, 126.76] | 26.44 | < .001
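The plot objects p_pois1, p_pois2, and p_pois3 used below are not defined in the text; a sketch, assuming ggeffects prediction plots for each model:

p_pois1 <- plot(ggeffects::ggpredict(pois1, terms = "x[all]"))  # Poisson fit
p_pois2 <- plot(ggeffects::ggpredict(pois2, terms = "x[all]"))  # gaussian fit
p_pois3 <- plot(ggeffects::ggpredict(pois3, terms = "x[all]"))  # spline fit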
library(patchwork)
p_pois1 + p_pois2 + p_pois3 + plot_annotation(title = "comparison of three assumed distributions", tag_levels = 'a',
subtitle = "The Poisson model (a) fits \nThe gaussian model (b) underfits \nThe spline model (c) overfits") +
plot_layout(guides = "collect")
The Poisson model fits best. This is no surprise: we simulated Poisson outcomes.
per_pois <-performance::compare_performance(pois1,pois2,pois3)
per_pois
# Comparison of Model Performance Indices
Name | Model | AIC | BIC | RMSE | Sigma | R2 | Nagelkerke's R2 | Score_log | Score_spherical
-----------------------------------------------------------------------------------------------------------
pois1 | glm | 188.481 | 192.305 | 3.361 | 0.985 | | 1.000 | -1.845 | 0.097
pois2 | glm | 426.698 | 432.434 | 16.249 | 16.584 | 0.521 | | |
pois3 | glm | 300.273 | 309.833 | 4.410 | 4.597 | 0.965 | | |
plot(per_pois)
\[
y_i \sim \operatorname{NegBinomial}(\lambda_i, \phi)\\
\log(\lambda_i) = \alpha + \beta x_i\\
\phi \sim \operatorname{Gamma}(0.01, 0.01)
\]
Overdispersion in the rate parameter.
The expected value of a Poisson variable and its variance are the same: \(\lambda\). However, this assumption is often (typically) violated: the random variables in one’s dataset are over-dispersed.
To see this, let’s simulate data with overdispersion.
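A sketch of one way to do this (the seed and theta are assumptions): draws from a negative binomial behave like Poisson counts with extra variance.

library(MASS)   # provides rnegbin
set.seed(2021)  # hypothetical seed
n <- 100
x <- runif(n, -2, 2)
y <- rnegbin(n, mu = exp(1 + 2 * x), theta = 2)  # smaller theta = more overdispersion
fake2 <- data.frame(x = x, y = y)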
We fit a Poisson model:
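Reconstructing the call from the summary output that follows:

nb1 <- glm(y ~ x, family = poisson, data = fake2)
summary(nb1)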
Call:
glm(formula = y ~ x, family = poisson, data = fake2)
Deviance Residuals:
Min 1Q Median 3Q Max
-9.7583 -1.3352 -0.4345 0.8231 11.6413
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.91028 0.07736 11.77 <2e-16 ***
x 2.07293 0.04596 45.10 <2e-16 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 6242.37 on 99 degrees of freedom
Residual deviance: 961.42 on 98 degrees of freedom
AIC: 1249.2
Number of Fisher Scoring iterations: 5
performance::check_overdispersion(nb1)
# Overdispersion test
dispersion ratio = 9.769
Pearson's Chi-Squared = 957.369
p-value = < 0.001
Try a negative binomial model:

library(MASS) # provides glm.nb
nb2 <- glm.nb(y ~ x, data = fake2)
compare_models(nb1,nb2)
Parameter | nb1 | nb2
----------------------------------------------------
(Intercept) | 0.91 (0.76, 1.06) | 1.00 (0.76, 1.25)
x | 2.07 (1.98, 2.16) | 2.00 (1.80, 2.20)
----------------------------------------------------
Observations | 100 | 100
It is clear both from the test statistics and the graphs, that the negative binomial model provides a better fit.
How would a normal linear model do here?
The answer is: not very well.
\[
y_i \sim \operatorname{ZIPoisson}(p_i, \lambda_i)\\
\operatorname{logit}(p_i) = \alpha_p + \beta_p x_i \\
\log(\lambda_i) = \alpha_\lambda + \beta_\lambda x_i
\]
In the nz dataset, volunteering (hours of charity) looks to be zero-inflated, and also over-dispersed.
hist(nz$HoursCharity, breaks = 100)
We can quickly check the proportion of people who report zero volunteering:
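A minimal sketch:

mean(nz$HoursCharity == 0, na.rm = TRUE)  # proportion reporting zero hours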
We can get a sense of overdispersion by looking at the ratio of the sample variance to the sample mean.
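A sketch of one common check (this may not be the exact statistic computed in the original); the variance-to-mean ratio equals 1 for a Poisson variable:

var(nz$HoursCharity, na.rm = TRUE) / mean(nz$HoursCharity, na.rm = TRUE)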
There’s about 1.68 times more dispersion than a Poisson model would expect.
Let’s use the performance package to formally check both zero-inflation and overdispersion.
z1 <-glm(HoursCharity ~ 1, family = "poisson", data = nz)
check_zeroinflation(z1)
# Check for zero-inflation
Observed zeros: 1992
Predicted zeros: 536
Ratio: 0.27
Next check for over-dispersion:
check_overdispersion(z1)
# Overdispersion test
dispersion ratio = 13.196
Pearson's Chi-Squared = 37595.179
p-value = < 0.001
And indeed, we find both.
## brms for estimating complex regression models

I use the brms package to estimate zero-inflated and/or negative binomial models:
library(brms)
# Requires integer output
nz$HoursCharity <- as.integer(nz$HoursCharity)
#
nz['Household.INC_s'] = as.data.frame(scale(nz$Household.INC))
# Scale religion variable
nz['Relid_s'] = as.data.frame(scale(nz$Relid))
b0<- brms::brm(HoursCharity ~ Relid_s + Household.INC_s,
family = "zero_inflated_poisson",
file = here::here("models", "zeroinflated_poisson_volunteer"),
data = nz)
summary(b0)
Family: zero_inflated_poisson
Links: mu = log; zi = identity
Formula: HoursCharity ~ Relid_s + Household.INC_s
Data: nz (Number of observations: 2796)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept 1.70 0.02 1.67 1.73 1.00
Relid_s 0.00 0.01 -0.02 0.03 1.00
Household.INC_s -0.09 0.02 -0.12 -0.05 1.00
Bulk_ESS Tail_ESS
Intercept 3785 3068
Relid_s 4501 3362
Household.INC_s 3704 3054
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
zi 0.70 0.01 0.68 0.72 1.00 3849
Tail_ESS
zi 2989
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
My preferred way of graphing the predicted effects of religious identification on volunteering is to generate prediction lines from the posterior samples. This captures uncertainty in an intuitive way. I’ll explain the mechanics when we get to Bayesian estimation, which, by the way, we are already doing here.
plot(
conditional_effects(
b0,
spaghetti = TRUE,
nsamples = 100,
select_points = 0.1
),
points = TRUE,
point_args = list(alpha = 0.1,
width = .02)
) # note this command controls which facet
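The model b1 summarised below is a zero-inflated negative binomial; its fitting code is not shown, but a sketch follows (the cache file name is hypothetical):

b1 <- brms::brm(HoursCharity ~ Relid_s + Household.INC_s,
                family = "zero_inflated_negbinomial",
                file = here::here("models", "zeroinflated_negbinomial_volunteer"),  # hypothetical file name
                data = nz)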
summary(b1)
Family: zero_inflated_negbinomial
Links: mu = log; shape = identity; zi = identity
Formula: HoursCharity ~ Relid_s + Household.INC_s
Data: nz (Number of observations: 2796)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept 0.89 0.17 0.52 1.17 1.00
Relid_s 0.21 0.06 0.11 0.33 1.00
Household.INC_s -0.07 0.05 -0.16 0.03 1.00
Bulk_ESS Tail_ESS
Intercept 949 885
Relid_s 1456 1698
Household.INC_s 2061 2040
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
shape 0.28 0.07 0.16 0.42 1.00 938
zi 0.35 0.11 0.07 0.50 1.00 939
Tail_ESS
shape 914
zi 697
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Compare fits of the two models using leave one out cross-validation.
Note that on this index, negative numbers are worse (sorry to confuse…)
b0 <- add_criterion(b0, "loo")
b1 <- add_criterion(b1, "loo")
w <-loo_compare(b0, b1, criterion = "loo")
w
elpd_diff se_diff
b1 0.0 0.0
b0 -1514.8 207.5
Recall we had been using an AIC/BIC convention to estimate improvements in goodness of fit. We can generate an analogous index as follows:
cbind(waic_diff = w[, 1] * -2,
se = w[, 2] * 2)
waic_diff se
b1 0.000 0.0000
b0 3029.657 414.9648
And we can see that the negative binomial model fits much better.
The estimates in the graphs above are only for the positive (non-zero) components of the model. Let’s look at the results again:
summary(b1)
Family: zero_inflated_negbinomial
Links: mu = log; shape = identity; zi = identity
Formula: HoursCharity ~ Relid_s + Household.INC_s
Data: nz (Number of observations: 2796)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept 0.89 0.17 0.52 1.17 1.00
Relid_s 0.21 0.06 0.11 0.33 1.00
Household.INC_s -0.07 0.05 -0.16 0.03 1.00
Bulk_ESS Tail_ESS
Intercept 949 885
Relid_s 1456 1698
Household.INC_s 2061 2040
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
shape 0.28 0.07 0.16 0.42 1.00 938
zi 0.35 0.11 0.07 0.50 1.00 939
Tail_ESS
shape 914
zi 697
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The probability of non-volunteering in the preferred model, for people at the mean of Religious Identification and of Household income in this population, is plogis(.35), or 0.5866176. More often than not, we should predict zeros in this population. What predicts the zero component of the model? We can use this syntax:
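A sketch, recoverable from the Formula in the summary further below (the cache file name is hypothetical):

b2 <- brms::brm(brms::bf(HoursCharity ~ Relid_s + Household.INC_s,
                         zi ~ Relid_s + Household.INC_s),  # predictors for the zero component
                family = "zero_inflated_negbinomial",
                file = here::here("models", "zeroinflated_negbinomial_volunteer_zi"),  # hypothetical file name
                data = nz)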
Let’s look at the results:
sjPlot::tab_model(b2)
HoursCharity

Predictors          | Incidence Rate Ratios |    CI (95%)
--------------------|-----------------------|------------
Count Model         |                       |
Intercept           |                  3.40 | 2.85 – 3.95
Relid_s             |                  1.02 | 0.94 – 1.11
Household.INC_s     |                  0.93 | 0.84 – 1.03
Zero-Inflated Model |                       |
Intercept           |                  1.06 | 0.72 – 1.38
Relid_s             |                  0.52 | 0.42 – 0.61
Household.INC_s     |                  1.02 | 0.87 – 1.17
Observations        |                  2796 |
R2 Bayes            |                 0.015 |
We can graph the results using ggeffects, for religious identification and for household income; a sketch follows.
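A sketch of the two plots (the original chunks are not shown):

plot(ggeffects::ggpredict(b2, terms = "Relid_s"))          # religious identification
plot(ggeffects::ggpredict(b2, terms = "Household.INC_s"))  # household income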
Or we can use my preferred method.
Predicted effects of religious identification:
plot(
conditional_effects(
b2,
spaghetti = TRUE,
nsamples = 100,
select_points = 0.1
),
points = TRUE,
point_args = list(alpha = 0.05,
width = .02),
ask = FALSE
) + # note this command controls which facet
ylim(0,5)
NULL
Note:
Models of class brmsfit always condition on the zero-inflation component, if the model has such a component. Hence, there is no type = “zero_inflated” nor type = “zi_random” for brmsfit-models, because predictions are based on draws of the posterior distribution, which already account for the zero-inflation part of the model.
Comparing models, we find that adding the predictors improves the model:
b2 <- add_criterion(b2, "loo")
w <-loo_compare(b0, b1, b2, criterion = "loo")
w
elpd_diff se_diff
b2 0.0 0.0
b1 -50.0 10.1
b0 -1564.8 206.1
Again we can use a -2 * log-likelihood analogue:
cbind(waic_diff = w[, 1] * -2,
se = w[, 2] * 2)
waic_diff se
b2 0.0000 0.00000
b1 100.0416 20.14189
b0 3129.6983 412.13260
These indices help to show the implications of our modelling decisions:
summary(b2)
Family: zero_inflated_negbinomial
Links: mu = log; shape = identity; zi = logit
Formula: HoursCharity ~ Relid_s + Household.INC_s
zi ~ Relid_s + Household.INC_s
Data: nz (Number of observations: 2796)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI
Intercept 1.22 0.08 1.05 1.37
zi_Intercept 0.05 0.16 -0.33 0.32
Relid_s 0.02 0.04 -0.06 0.10
Household.INC_s -0.07 0.05 -0.17 0.03
zi_Relid_s -0.66 0.09 -0.88 -0.50
zi_Household.INC_s 0.02 0.07 -0.14 0.16
Rhat Bulk_ESS Tail_ESS
Intercept 1.00 1657 2031
zi_Intercept 1.00 1466 1395
Relid_s 1.00 3142 2943
Household.INC_s 1.00 3404 2974
zi_Relid_s 1.00 2096 1704
zi_Household.INC_s 1.00 2964 2613
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
shape 0.45 0.06 0.33 0.58 1.00 1510
Tail_ESS
shape 1500
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
You can start with the report package to generate reports. However, be careful about reporting using boilerplate, because it is up to you, as an applied scientist, to interpret your results. For example, the report package describes the \(R^2\) statistic. For reasons we can discuss in lab, this statistic is problematic, especially for observational studies.
A better option for expressing the practical interest of a result is to describe the expected values of an outcome across a range of predictors. We have been doing this all along with our graphical models. However, you might want to report the numbers for these expected values as well. For this purpose, the ggeffects package can be handy. You merely leave out the plot component of your syntax:
ggeffects::ggpredict(b2, terms = "Relid_s")
# Predicted counts of HoursCharity
# x = Relid_s
x | Predicted | 95% CI
--------------------------------
-0.64 | 1.29 | [1.14, 1.46]
-0.25 | 1.51 | [1.36, 1.68]
0.15 | 1.74 | [1.58, 1.92]
0.54 | 1.97 | [1.78, 2.18]
0.94 | 2.19 | [1.96, 2.44]
1.33 | 2.40 | [2.13, 2.72]
1.73 | 2.60 | [2.26, 2.99]
2.12 | 2.78 | [2.36, 3.27]
Adjusted for:
* Household.INC_s = 0.01
Here, we have generated the predicted magnitudes of volunteering across the full range of the predictor (Religious Identification, as measured in standard deviation units). The expected volunteering rate of a secular person, when other predictors are set to zero, is 1.29 hours (95% CI [1.14, 1.46]); for a fully religiously identified person, when all other predictors are set to zero, it is 2.78 hours (95% CI [2.36, 3.27]).
We might want to focus on different elements of the model. The expected volunteering among the secular population who volunteer (when other predictors are set to zero) is exp(1.23): 3.421; for a religious person who volunteers (when other predictors are set to zero) it is exp(1.23 + .02 * 2.12): 3.569. The expected zero-rate among the secular population is plogis(0.05): 0.512; for a fully identified religious person it is plogis(0.05 + -0.65 * 2.12): 0.209. This suggests that religious people are less zero-inflated. However, among those who volunteer, religion isn’t predicting much of a difference in the number of hours one volunteers (in the baseline population).
Is this interesting? Rather than averaging across both the zero-inflated and volunteering populations, it might be more informative to separately describe behaviour among each of these distinct populations. Remember, as applied scientists, we are not merely trying to be explicit about our assumptions and decisions. We’re trying to improve our beliefs relating to some concrete question. Our reporting must reflect our scientific interests. It must clarify what we’ve learned about the world, as well as how we remain uncertain.
Today:

We have clarified that all linear regression models are “generalised linear models,” because every linear regression model implies a link function. Prior to this week, we had been working with the gaussian linear model, in which the link function is the identity function. However, the gaussian linear model is a special case.
We have demonstrated how to write four classes of generalised linear models: binomial (logistic) models for binary responses, Poisson models for counts and rates, negative binomial models for overdispersed counts, and zero-inflated models for responses with excess zeros.
We have clarified how to interpret and report the results of a generalised linear model using graphical strategies.
If you want to graph each predictor separately, employ the [[1]] or [[2]] syntax as follows:
Predicted effects of religious identification
b2_p1 <- plot(
conditional_effects(
b2,
spaghetti = TRUE,
nsamples = 100,
select_points = 0.1
),
points = TRUE,
ask = TRUE,
point_args = list(alpha = 0.1,
width = .02)
)[[1]] + # note this command controls which facet
ylim(0, 5) + labs(title = "Better title",
subtitle = "better subtitle") +
xlab("Religious Identification (SD units)") +
ylab("Hours volunteering in the previous week")
Predicted effects of income:
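The object b2_p2 used below is not defined in the text; a sketch mirroring the pattern above (axis labels are illustrative):

b2_p2 <- plot(
  conditional_effects(
    b2,
    spaghetti = TRUE,
    nsamples = 100,
    select_points = 0.1
  ),
  points = TRUE,
  ask = TRUE,
  point_args = list(alpha = 0.1,
                    width = .02)
)[[2]] + # the second facet: household income
  ylim(0, 5) +
  xlab("Household income (SD units)") +
  ylab("Hours volunteering in the previous week")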
library("patchwork")
b2_p1 + b2_p2
We have covered lots of ground today. The generalised linear models described here are the most commonplace; however, there are many more. For those who are curious about how to estimate zero-inflated binary data, I recommend Solomon Kurz’s work here.
Aside: In Bayesian estimation we often estimate the standard deviation of the mean, \(\sigma\); however, outside these circles, people describe the same parameter as the variance of the mean, or \(\sigma^2\).
It doesn’t really matter: parameters don’t care how we express them. We can say “Johannes is 2 meters tall,” or we can say “Johannes is the square root of 4 meters tall,” and his height will remain indifferent to our convention. So too, the world is indifferent to whether we write \(\sigma\) or \(\sigma^2\).
Gelman, A., Hill, J., and Vehtari, A. (2020). Regression and Other Stories. Cambridge University Press.
It is worth cultivating this explicit habit because we will need it when clarifying our beliefs about the world prior to modelling the world (as we do in Bayesian modelling).↩︎