Before the break, we introduced basic concepts in statistical regression, focusing on regression using a single covariate.
Today, we will introduce regression using multiple covariates. You will acquire the following skills:
Understand how to write a regression model with multiple covariates in R.
Understand how to interpret a regression model with multiple covariates in R.
Understand methods for comparing simple and complex models.
Understand how to create a table for your results.
Understand how to graph your results.
Understand how to write equivalent models in an ANOVA framework, in case someone forces you to do this.
In the first instance, linear regression is a method for predicting features of the world, or as we put it in lecture 5:
[R]egression … is a method for inferring the expected average features of a population, and the variance of a population, conditional on other features of the population as measured in a sample.
In week 9, we’ll discuss how to use regression for explaining features of the world. However, for the next two weeks, we’ll restrict our interests to prediction.
To refresh your memory about how the magic of prediction by regression works, let’s look at an example. Consider data collected in the 1920s on the relationship between the speed at which a car is travelling and the distance that it takes for that car to stop. The data are found in the cars
dataset, which is automatically loaded when you start R.
Here is our model:
model_simple <- lm(dist ~ speed, data = cars)
Let’s look at the output. You can create a table like this:
parameters::parameters(model_simple)
Parameter | Coefficient | SE | 95% CI | t(48) | p
-------------------------------------------------------------------
(Intercept) | -17.58 | 6.76 | [-31.17, -3.99] | -2.60 | 0.012
speed | 3.93 | 0.42 | [ 3.10, 4.77] | 9.46 | < .001
Recall that we can create an html table in this way:
parameters::parameters(model_simple)%>%
parameters::print_html(caption = "Braking distance as predicted by speed")
Braking distance as predicted by speed

| Parameter | Coefficient | SE | 95% CI | t(48) | p |
|---|---|---|---|---|---|
| (Intercept) | -17.58 | 6.76 | (-31.17, -3.99) | -2.60 | 0.012 |
| speed | 3.93 | 0.42 | (3.10, 4.77) | 9.46 | < .001 |
Recall that the report package makes your life easy:
report::report(model_simple)
We fitted a linear model (estimated using OLS) to predict dist with speed (formula: dist ~ speed). The model explains a significant and substantial proportion of variance (R2 = 0.65, F(1, 48) = 89.57, p < .001, adj. R2 = 0.64). The model's intercept, corresponding to speed = 0, is at -17.58 (95% CI [-31.17, -3.99], t(48) = -2.60, p < .05). Within this model:
- The effect of speed is significantly positive (beta = 3.93, 95% CI [3.10, 4.77], t(48) = 9.46, p < .001; Std. beta = 0.81, 95% CI [0.64, 0.98])
Standardized parameters were obtained by fitting the model on a standardized version of the dataset.
However, before you go rushing to report your “significant” p-value, how shall we interpret this output?
Recall that we can write the output of a model as an equation. The equatiomatic package in R makes this easy for you.
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_simple, use_coefs = TRUE)
\[ \operatorname{\widehat{dist}} = -17.58 + 3.93(\operatorname{speed}) \]
The generic form of a linear model is:
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_simple, use_coefs = FALSE)
\[ \operatorname{dist} = \alpha + \beta_{1}(\operatorname{speed}) + \epsilon \]
The model says that the expected stopping distance for a car is -17.58 feet when the car is travelling at zero mph. To this we add an additional 3.93 feet for each additional unit of speed (here, miles per hour).
What strikes you about this model?
If you are like me you probably feel confusion when you see the number “-17.58” as a predicted stopping distance 😲. Did car manufacturers of the 1920s invent a method for traversing space and time as we know it 🤔? Or is regression a hopeless tool for understanding the world and should you demand your money back for this course 😿?
Let’s look at the data more carefully:
cars%>%
dplyr::arrange(speed)%>%
tibble()
# A tibble: 50 x 2
speed dist
<dbl> <dbl>
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
# … with 40 more rows
We can see that the lowest speed measured in this dataset is 4 miles per hour, at which speed one car stopped in 2 feet. Another car travelling at 4 mph took 10 feet to stop. So there’s variability.
Let’s plug some speeds into the regression equation that we just estimated. We do not need to copy and paste coefficients from the output of the model. Rather, we can recover the intercept as follows:
coef(model_simple)[[1]]
[1] -17.57909
And for each additional mph we add the slope coefficient:
coef(model_simple)[[2]]
[1] 3.932409
So we can write the expected (or predicted) stopping distance for a car travelling at the maximum speed in the dataset (25 mph) as follows:
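A minimal sketch of that computation, using the coefficients recovered above (this comes to approximately 80.7 feet):
# predicted stopping distance at 25 mph
coef(model_simple)[[1]] + coef(model_simple)[[2]] * 25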
and for the minimum speed (4 mph) as follows:
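And again as a sketch (this comes to approximately -1.8 feet):
# predicted stopping distance at 4 mph
coef(model_simple)[[1]] + coef(model_simple)[[2]] * 4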
Is this any better? We’re still getting a negative stopping distance.
To avoid confusion, let’s turn to the most useful tool in your R toolkit, your graph.
We can plot the model as follows:
library(ggplot2)
ggplot2::ggplot(data = cars,
aes( x = speed, y = dist )) +
geom_smooth(method = "lm") +
geom_point() + theme_classic()
Here, our linear model is minimising the average (squared) distance between the observed stopping distances and the fitted line. The line passes through at most a few points in the dataset. Otherwise it estimates a response that is either too high or too low.
Note that because our model’s intercept corresponds to a speed of 0 mph, the intercept term (-17.58) in our model is uninterpretable.
There is no coherent concept we can attach to an expected stopping distance at a speed of 0 mph, and the value 0 mph never occurs in the dataset.
A solution: center (or center and scale) all your continuous predictors. In such a model, the intercept is interpretable as the expected response when all continuous predictors are set to their sample averages, which in this case is 15.4 mph.
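As a quick check (a sketch using base R):
# sample average speed in the cars data
mean(cars$speed) # 15.4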
Recall that we can center our variables using the following code
# center speed and create a new dataframe
schmars <- cars
# note: I am presently having trouble with dplyr output; the following base-R workaround (found by searching stackoverflow.com) lets me graph the results
schmars['speed_c'] <- as.data.frame(scale(schmars$speed, scale = FALSE)) # center speed (no scaling)
schmars$speed_c
[1] -11.4 -11.4 -8.4 -8.4 -7.4 -6.4 -5.4 -5.4 -5.4 -4.4 -4.4
[12] -3.4 -3.4 -3.4 -3.4 -2.4 -2.4 -2.4 -2.4 -1.4 -1.4 -1.4
[23] -1.4 -0.4 -0.4 -0.4 0.6 0.6 1.6 1.6 1.6 2.6 2.6
[34] 2.6 2.6 3.6 3.6 3.6 4.6 4.6 4.6 4.6 4.6 6.6
[45] 7.6 8.6 8.6 8.6 8.6 9.6
# model with centered speed
model_simple2 <- lm(dist ~ speed_c, data = schmars)
# table
parameters::parameters(model_simple2)%>%
parameters::print_html()
Model Summary

| Parameter | Coefficient | SE | 95% CI | t(48) | p |
|---|---|---|---|---|---|
| (Intercept) | 42.98 | 2.18 | (38.61, 47.35) | 19.76 | < .001 |
| speed_c | 3.93 | 0.42 | (3.10, 4.77) | 9.46 | < .001 |
Notice that the estimated coefficient for speed in this second model is the same as the estimated coefficient for speed in the first model:
# evaluate whether the two coefficients are the same at least to five decimal places
round(coef(model_simple)[[2]],5) == round(coef(model_simple2)[[2]],5)
[1] TRUE
And the graph of the fitted line is identical:
ggplot2::ggplot(data = cars,
aes( x = speed, y = dist )) +
geom_smooth(method = "lm") +
geom_point()
However, the intercept is now meaningful. It is the expected (or predicted) outcome when speed is set to its sample average (15.4 mph).
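We can verify this directly: with a centered predictor, the intercept equals the sample mean of the outcome (a sketch):
# sample average stopping distance; should match the intercept (42.98)
mean(cars$dist)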
We still interpret the coefficient for speed as the expected increase or decrease in stopping distance for a one-unit change in speed (which was measured in miles per hour in the population from which this sample was collected). For a car travelling 3 mph faster than the sample average we would expect an average stopping distance of:
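A sketch of that computation, using the centered model’s coefficients (approximately 54.8 feet):
# predicted stopping distance 3 mph above the sample average speed
coef(model_simple2)[[1]] + coef(model_simple2)[[2]] * 3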
And for a car travelling 10.33 mph slower than the sample average we would expect an average stopping distance of:
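And again as a sketch (approximately 2.4 feet):
# predicted stopping distance 10.33 mph below the sample average speed
coef(model_simple2)[[1]] + coef(model_simple2)[[2]] * -10.33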
Recall that you can create a nice prediction plot using the ggeffects
package:
# graph expected means
plot(ggeffects::ggpredict(model_simple2, terms = "speed_c [all]"),
add.data = TRUE,
dot.alpha = .8,
# jitter = .01,
ci.style = "errorbar") +
ylab("Distance in feet") +
xlab("Speed in MPH (centered at 15.4 mph)") +
labs(title = "Predicted average stopping distance by speed for cars in the 1920s",
subtitle = "Speed centered at 15.4 MPH") + theme_classic()
or another method:
# graph expected means
plot(ggeffects::ggpredict(model_simple2, terms = "speed_c [all]"),
add.data = TRUE,
dot.alpha = .8,
jitter = .01) +
ylab("Distance in feet") +
xlab("Speed in MPH (centered at 15.4 mph)") +
labs(title = "Predicted average stopping distance by speed for cars in the 1920s",
subtitle = "Speed centered at 15.4 MPH") + theme_classic()
The essential take-home messages about regression so far are: (1) regression predicts the expected (average) outcome conditional on the predictors; (2) centering continuous predictors makes the intercept interpretable as the expected outcome at the sample average; and (3) always graph your model.
We have already encountered multiple regression. Perhaps the most familiar use is prediction. Suppose we want to predict someone’s expected income from their gender identification. Suppose our theory says that there might be an added income advantage from height. We can include an additional term in our model and evaluate whether our theory holds. Let’s assess this question using the jittered NZAVS dataset.
Here is a regression model with two predictors. We have centered height at its sample average. The outcome is expected household income for the NZ population simultaneously stratified by height and by male identification. (Question: can you think of why household income is not an ideal indicator in a setting where we are looking at gender differences…? We’ll come back to this question in a few weeks.)
# model
model_theory <- lm(income ~ height_cm_c + not_male, data = df)
sjPlot::tab_model(model_theory)
income

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 134460.88 | 124471.39 – 144450.37 | <0.001 |
| height_cm_c | 720.23 | 42.63 – 1397.84 | 0.037 |
| not_male [Not_Male] | -7719.91 | -21457.23 – 6017.41 | 0.270 |
| Observations | 1370 | | |
| R2 / R2 adjusted | 0.011 / 0.009 | | |
How do we interpret this model? First we can write the equation:
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory, use_coefs = FALSE)
\[ \operatorname{income} = \alpha + \beta_{1}(\operatorname{height\_cm\_c}) + \beta_{2}(\operatorname{not\_male}_{\operatorname{Not\_Male}}) + \epsilon \]
And plugging in the values
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory, use_coefs = TRUE)
\[ \operatorname{\widehat{income}} = 134460.88 + 720.23(\operatorname{height\_cm\_c}) - 7719.91(\operatorname{not\_male}_{\operatorname{Not\_Male}}) \]
First we need the average sample height (across all gender identifications):
mean(nz_0$HLTH.Height, na.rm=TRUE)
[1] 1.692567
The model tells us the following. The expected household income for the population of men who are at the sample average height is $134,460.88. The expected increase in income for each additional centimetre of height, whether one is male- or not-male-identified, is $720.23. The expected reduction in income for people who do not identify as male is $7,719.91.
Are men expected to make $7,719.91 more than women?
No, because on average not-male-identified people are shorter, and we are stratifying across male and not-male gender identifications. If we focus only on the gender difference, we can include male-id and remove height, which gives us:
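A minimal sketch of that model (assuming the same df and the sjPlot::tab_model call used elsewhere in this post; its output is tabled below):
model_theory2 <- lm(income ~ not_male, data = df)
sjPlot::tab_model(model_theory2)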
income

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 140522.50 | 132310.79 – 148734.21 | <0.001 |
| not_male [Not_Male] | -17311.52 | -27681.96 – -6941.07 | 0.001 |
| Observations | 1370 | | |
| R2 / R2 adjusted | 0.008 / 0.007 | | |
We write the model equation here:
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory2, use_coefs = FALSE)
\[ \operatorname{income} = \alpha + \beta_{1}(\operatorname{not\_male}_{\operatorname{Not\_Male}}) + \epsilon \]
And plugging in the values
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory2, use_coefs = TRUE)
\[ \operatorname{\widehat{income}} = 140522.5 - 17311.52(\operatorname{not\_male}_{\operatorname{Not\_Male}}) \]
Thus the expected not-male household income is 140,522.50 - 17,311.52 = 123,210.98, or about $NZD 123,200.
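We can recover this directly from the coefficients (a sketch; the result should be approximately 123,211):
# expected household income for the not-male-id population
coef(model_theory2)[[1]] + coef(model_theory2)[[2]] * 1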
We can approximately recover this expectation in the original model by plugging in the deviation of the male (or not-male) average height from the overall sample average, as follows.
Find the approximate deviations of male and not-male heights from the sample average:
df%>%
group_by(not_male)%>%
summarise(genmean = mean(HLTH.Height, na.rm = TRUE))
# A tibble: 3 x 2
not_male genmean
<chr> <dbl>
1 Male 1.78
2 Not_Male 1.65
3 <NA> 1.69
# overall sample average height
df%>%
summarise(mean = mean(HLTH.Height, na.rm = TRUE))
mean
1 1.699764
Plug in expected values for male Id:
coef(model_theory)[[1]] + # intercept at average height
coef(model_theory)[[2]]* 8.3 + #increase in average male height over sample average in cm
coef(model_theory)[[3]]* 0 # code for male id = 0
[1] 140438.8
And for not-male Id:
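A sketch of the corresponding computation, assuming the not-male average is roughly 5 cm below the sample average height (this comes to approximately 123,140, close to the estimate from the gender-only model):
coef(model_theory)[[1]] + # intercept at the sample-average height
coef(model_theory)[[2]] * -5 + # not-male-ids are roughly 5 cm shorter than the sample average
coef(model_theory)[[3]] * 1 # not-male id is coded as 1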
These are approximately the same expectations that we obtained from the gender-only model.
Either way you cut it, we find that women (and others who do not identify as male) are expected to make about $NZD 17,312 less in household income than men.
Note that regression allows us to stratify across other segments of a population by adding even more covariates.
Let’s add education to the model. In the NZAVS, education (Edu) is a 10-level factor. However, for now, we’ll treat it as numeric. The standard deviation of Edu is about 2.56 on its 0 to 10 scale. Adding the scaled variable edu_s to the model we find:
model_theory3 <- lm(income ~ height_cm_c + not_male + edu_s, data = df)
sjPlot::tab_model(model_theory3)
income

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 139186.50 | 129259.19 – 149113.80 | <0.001 |
| height_cm_c | 512.62 | -157.61 – 1182.85 | 0.134 |
| not_male [Not_Male] | -14651.10 | -28288.27 – -1013.94 | 0.035 |
| edu_s | 20367.57 | 15408.77 – 25326.36 | <0.001 |
| Observations | 1348 | | |
| R2 / R2 adjusted | 0.057 / 0.055 | | |
We write the model equation as follows:
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory3, use_coefs = FALSE)
\[ \operatorname{income} = \alpha + \beta_{1}(\operatorname{height\_cm\_c}) + \beta_{2}(\operatorname{not\_male}_{\operatorname{Not\_Male}}) + \beta_{3}(\operatorname{edu\_s}) + \epsilon \]
And plugging in the values:
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory3, use_coefs = TRUE)
\[ \operatorname{\widehat{income}} = 139186.5 + 512.62(\operatorname{height\_cm\_c}) - 14651.1(\operatorname{not\_male}_{\operatorname{Not\_Male}}) + 20367.57(\operatorname{edu\_s}) \]
Do height differences predict income differently for males and not-males? To assess this we add an interaction term. NOTE: we should always center the continuous variables that enter an interaction, for the same reason that we should center polynomials: otherwise multi-collinearity renders the model unstable.
# model
model_theory4 <- lm(income ~ height_cm_c * not_male, data = df)
Note that this model is equivalent to:
model_theory4 <- lm(income ~ height_cm_c + not_male + not_male:height_cm_c, data = df)
We write the model equation as follows:
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory4, use_coefs = FALSE)
\[ \operatorname{income} = \alpha + \beta_{1}(\operatorname{height\_cm\_c}) + \beta_{2}(\operatorname{not\_male}_{\operatorname{Not\_Male}}) + \beta_{3}(\operatorname{height\_cm\_c} \times \operatorname{not\_male}_{\operatorname{Not\_Male}}) + \epsilon \]
And plugging in the values:
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory4, use_coefs = TRUE)
\[ \operatorname{\widehat{income}} = 129582.33 + 1299.9(\operatorname{height\_cm\_c}) - 4553.46(\operatorname{not\_male}_{\operatorname{Not\_Male}}) - 928.99(\operatorname{height\_cm\_c} \times \operatorname{not\_male}_{\operatorname{Not\_Male}}) \]
The * in the model is merely shorthand for the two main effects height_cm_c + not_male plus their interaction not_male:height_cm_c.
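As a quick check (a sketch with hypothetical object names m_star and m_plus), you can fit both specifications and confirm that the coefficients are identical:
m_star <- lm(income ~ height_cm_c * not_male, data = df)
m_plus <- lm(income ~ height_cm_c + not_male + height_cm_c:not_male, data = df)
all.equal(coef(m_star), coef(m_plus)) # should return TRUE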
Let’s graph the coefficients:
sjPlot::plot_model(model_theory4)
Whoooah what just happened?! There’s huge uncertainty about the main effect of male-id – it is no longer a reliable predictor in this model.
Is the interaction messing up the model? We can look for multi-collinearity using the following code from the performance package.
z<-performance::check_collinearity(model_theory4)
z ## looks ok
# Check for Multicollinearity
Low Correlation
Term VIF Increased SE Tolerance
height_cm_c 4.68 2.16 0.21
not_male 1.97 1.40 0.51
height_cm_c:not_male 3.10 1.76 0.32
plot(z) ## looks ok
As indicated in the graph, the problem isn’t multi-collinearity.
Does this mean that income is really predicted by height, and gender is a red herring?
I have seen regression models interpreted that way. However, this is confused. The confusion mounts when we compare the expected means of the population, stratifying on gender identification:
modelbased::estimate_means(model_theory4)
not_male | Mean | SE | 95% CI
----------------------------------------------------
Male | 1.30e+05 | 6291.49 | [1.17e+05, 1.42e+05]
Not_Male | 1.25e+05 | 3886.99 | [1.17e+05, 1.33e+05]
p-values are uncorrected.
Note, however, that our model has compared male-ids and not-male-ids at equivalent levels of height, and we know that not-male-ids tend to be shorter.
Wherever you have interactions, it is essential to graph your results. This is so important I’m going to write it again: wherever you have interactions, it is essential to graph the results. Tables are inscrutable. We need to probe our model by graphing its predictions. Recall that we can do this using the ggeffects package.
Here is a graph comparing male-ids and not-male-ids at different levels of height:
plot(ggeffects::ggpredict(model_theory4, terms = c("not_male","height_cm_c")))
The interpretation is now clear. Greater height predicts much greater income for males; however, an equivalent change in height predicts a much smaller change in income for not-males. There is a non-linearity in the predicted effects of height by gender. Or put another way, gender moderates the relationship between height and income.
Or equivalently, height moderates the relationship between gender and income. We can graph this equivalent interpretation by reversing the order of the terms in our syntax. This gives the same predictions but visualised with height on the x-axis:
plot(ggeffects::ggpredict(model_theory4, terms = c("height_cm_c","not_male")))
My preferred method for graphing is to include all the data points, by setting add.data = TRUE. This allows us to see what is happening in our dataset:
# store the prediction object (height on the x-axis, grouped by gender-id)
x <- ggeffects::ggpredict(model_theory4, terms = c("height_cm_c", "not_male"))
pp3 <- plot(x,
add.data = TRUE,
jitter = 0.2,
dot.alpha = .2) +
scale_y_continuous(limits = c(0, 5e+5)) +
theme_classic()
pp3
There is a non-linearity in our model. The model predicts that, on average, short men are expected to make much less than short women. Among not-male-ids, height differences are not a reliable predictor of household income. By contrast, among male-ids, height differences are a reliable predictor. Not-male-ids tend to be shorter than male-ids by about 13 cm.
Sample averages:
df%>%
summarise(mean(HLTH.Height, na.rm = TRUE))
mean(HLTH.Height, na.rm = TRUE)
1 1.699764
Sample differences by gender-id:
df%>%
group_by(not_male)%>%
summarise(mean(HLTH.Height, na.rm = TRUE))
# A tibble: 3 x 2
not_male `mean(HLTH.Height, na.rm = TRUE)`
<chr> <dbl>
1 Male 1.78
2 Not_Male 1.65
3 <NA> 1.69
Let’s plug in the values for male-id and not male-id at the population average for each population:
# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory4, use_coefs = TRUE)
\[ \operatorname{\widehat{income}} = 129582.33 + 1299.9(\operatorname{height\_cm\_c}) - 4553.46(\operatorname{not\_male}_{\operatorname{Not\_Male}}) - 928.99(\operatorname{height\_cm\_c} \times \operatorname{not\_male}_{\operatorname{Not\_Male}}) \]
To get the expected average for the male-id population we can compute the regression equation as follows:
coef(model_theory4)[[1]] + # intercept: expected income at the sample-average height (about 170 cm) for male-ids
coef(model_theory4)[[2]] * 8 + # male-ids are about 8 cm taller than the sample average (we include this term even though its coefficient is uncertain)
coef(model_theory4)[[3]] * 0 + # male-id is coded as 0
coef(model_theory4)[[4]] * 8 * 0 # interaction: 8 cm difference x male-id code of 0 (which zeroes this term out)
This comes to approximately 139,980 (using the rounded coefficients reported above).
For the not-male-id population we compute the regression equation as follows:
coef(model_theory4)[[1]] + # intercept: expected income at the sample-average height
coef(model_theory4)[[2]] * -5 + # not-male-ids are about 5 cm shorter than the sample average
coef(model_theory4)[[3]] * 1 + # not-male-id is coded as 1
coef(model_theory4)[[4]] * -5 * 1 # interaction: -5 cm difference x not-male-id code of 1
This comes to approximately 123,170 (using the rounded coefficients reported above).
Compare these estimates with those from the earlier model that included only male-id as a predictor (no height and no interaction).
We can do this quickly using the modelbased
package.
modelbased::estimate_means(model_theory2)
not_male | Mean | SE | 95% CI
----------------------------------------------------
Male | 1.41e+05 | 4186.02 | [1.32e+05, 1.49e+05]
Not_Male | 1.23e+05 | 3228.61 | [1.17e+05, 1.30e+05]
p-values are uncorrected.
We find that the estimates are similar.
Which model should we prefer? Answer: it depends on your question!
We’ll come back to this point when we focus on the uses of regression for causal inference (week 9). For now, you should know that there are no canned methods for selecting a model.
However, because reviewers will ask, you can use the AIC or BIC information criteria to compare models. Recall that a lower value of either statistic offers a reason to prefer one model over another.
performance::compare_performance(model_theory,
model_theory2,
model_theory3,
model_theory4)
# Comparison of Model Performance Indices
Name | Model | AIC | BIC | R2 | R2 (adj.) | RMSE | Sigma
-----------------------------------------------------------------------------------------
model_theory | lm | 35283.610 | 35304.501 | 0.011 | 0.009 | 94407.120 | 94510.656
model_theory2 | lm | 35285.961 | 35301.628 | 0.008 | 0.007 | 94557.131 | 94626.227
model_theory3 | lm | 34651.663 | 34677.695 | 0.057 | 0.055 | 92073.554 | 92210.466
model_theory4 | lm | 35283.908 | 35310.021 | 0.012 | 0.010 | 94348.498 | 94486.535
We can graph the models along different criteria:
xx <- performance::compare_performance(model_theory,
model_theory2,
model_theory3,
model_theory4)
plot(xx)
We find that model three performed the best. Recall that this is the model that included education:
model_theory3 <- lm(income ~ height_cm_c + not_male + edu_s, data = df)
What if we were to include an interaction with height and male-id?
model_theory6 <- lm(income ~ height_cm_c * not_male + edu_s, data = df)
There’s no improvement in this model. After adjusting for education, height, and male-id, we do not see an improvement from allowing the height effect to differ across the male-id categories:
performance::compare_performance(model_theory3,model_theory6)
# Comparison of Model Performance Indices
Name | Model | AIC | BIC | R2 | R2 (adj.) | RMSE | Sigma
-----------------------------------------------------------------------------------------
model_theory3 | lm | 34651.663 | 34677.695 | 0.057 | 0.055 | 92073.554 | 92210.466
model_theory6 | lm | 34652.271 | 34683.509 | 0.058 | 0.055 | 92026.013 | 92197.161
What if we were to include age? Recall that a linear model can (and often should) include non-linear terms. We encountered this point when we were discussing polynomials and splines. We can include a spline term for age using the bs command from the splines package.
library("splines")
model_theory7 <- lm(income ~ bs(age_10_c) + height_cm_c * not_male + edu_s, data = df)
parameters::parameters(model_theory7)
Parameter | Coefficient | SE | 95% CI | t(1340) | p
------------------------------------------------------------------------------------------------------
(Intercept) | 1.03e+05 | 18025.73 | [ 67729.01, 1.38e+05] | 5.72 | < .001
age_10_c [1st degree] | 87255.86 | 44968.06 | [ -959.60, 1.75e+05] | 1.94 | 0.053
age_10_c [2nd degree] | 42518.74 | 26275.83 | [ -9027.49, 94064.97] | 1.62 | 0.106
age_10_c [3rd degree] | -34574.37 | 36902.29 | [ -1.07e+05, 37818.18] | -0.94 | 0.349
height_cm_c | 858.27 | 553.98 | [ -228.49, 1945.02] | 1.55 | 0.122
not_male [Not_Male] | -16286.53 | 7399.89 | [ -30803.16, -1769.89] | -2.20 | 0.028
edu_s | 20025.41 | 2553.22 | [ 15016.66, 25034.15] | 7.84 | < .001
height_cm_c * not_male [Not_Male] | -653.21 | 699.66 | [ -2025.76, 719.34] | -0.93 | 0.351
As judged by the AIC, this model has improved fit. However, as judged by the BIC, this model has reduced fit. This happens because the BIC penalises the extra parameters more heavily.
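Recall the penalty terms (with \(k\) estimated parameters, \(n\) observations, and maximised log-likelihood \(\log\hat{L}\)):
\[ \mathrm{AIC} = 2k - 2\log\hat{L}, \qquad \mathrm{BIC} = k\log(n) - 2\log\hat{L} \]
Because \(\log(n) > 2\) whenever \(n > 7\), the BIC penalises each additional parameter more heavily than the AIC does.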
We can graph this outcome as follows:
za <- performance::compare_performance(model_theory3, model_theory7) # does adding a spline for age improve fit?
plot(za)
Why might we include age? We might be interested in the predicted effects of age, here for the population that is set to zero in our model (male-ids at the sample-average height, which is roughly 8 cm shorter than the average male-id height, with sample-average education). Age here has been converted to decade units and centred (age_10_c).
We can see here that the interaction term has shifted because we have added more variables. The population we are predicting for is slightly different from the population before… the change might have occurred merely because we have added more terms or, as we shall come to understand in Week 9, because adding indicators produces new understanding.
Can we model more than one outcome at the same time (multivariate regression)? Yes. I do this using the brms package. The general pattern looks like this (y1, y2, and x are placeholder variable names):
library(brms)
# one formula per outcome
bf1 <- bf(y1 ~ x)
bf2 <- bf(y2 ~ x)
# fit both outcomes jointly, allowing estimation of residual correlations
fit <- brm(bf1 + bf2 + set_rescor(TRUE),
data = data)
We will encounter multivariate regression again in future weeks.
Don’t use it; but if you must, the following code can be useful.
You can write these as linear models.
This is a one-way anova in which the grouping variable “not_male” is the condition and “income” is the outcome.
m2 <- aov(income ~ not_male, data = df)
parameters::parameters(m2)
Parameter | Sum_Squares | df | Mean_Square | F | p
------------------------------------------------------------
not_male | 9.60e+10 | 1 | 9.60e+10 | 10.72 | 0.001
Residuals | 1.22e+13 | 1368 | 8.95e+09 | |
This is how the report package says you should report. I’ve tweaked the wording because I cannot write “statistically significant” without gagging:
The ANOVA (formula: income ~ not_male) suggests that:
Effect sizes were labelled following Field’s (2013) recommendations.
I recommend the modelbased package, which comes as part of easystats, to explore your model.
These are the estimated means
modelbased::estimate_relation(m2)
not_male Predicted SE CI_low CI_high
1 Male 140522.5 4186.018 132310.8 148734.2
2 Not_Male 123211.0 3228.605 116877.4 129544.5
Compare this against the linear model:
m2l <- lm(income ~ not_male, data = df)
parameters::parameters(m2l)
Parameter | Coefficient | SE | 95% CI | t(1368) | p
--------------------------------------------------------------------------------------
(Intercept) | 1.41e+05 | 4186.02 | [ 1.32e+05, 1.49e+05] | 33.57 | < .001
not_male [Not_Male] | -17311.52 | 5286.46 | [-27681.96, -6941.07] | -3.27 | 0.001
This is how the report package says you should report it:
We fitted a linear model (estimated using OLS) to predict income with not_male (formula: income ~ not_male). The model explains a statistically significant proportion of variance (R2 = 0.01, F(1, 1368) = 10.72, p = 0.001, adj. R2 = 0.01). The model’s intercept, corresponding to not_male = 0, is at 1.41e+05 (95% CI [1.32e+05, 1.49e+05], t(1368) = 33.57, p < .001). Within this model:
Standardized parameters were obtained by fitting the model on a standardized version of the dataset.
These are the estimated means, which are identical to the ANOVA.
modelbased::estimate_relation(m2l)
not_male Predicted SE CI_low CI_high
1 Male 140522.5 4186.018 132310.8 148734.2
2 Not_Male 123211.0 3228.605 116877.4 129544.5
I’m not going to spend any more time with ANOVA. This isn’t because I’m dismissive of ANOVA. It can be a useful framework for certain questions. However, it is never going to produce different answers than you would obtain in a linear regression framework. Moreover, the linear regression framework is much more flexible, and can be extended. However, in case you are required to formulate your model in an ANOVA framework, I’ve included some R syntax for you here.
Syntax
aov(Y ~ Grp, data = data)
Assumptions: independent observations, normally distributed residuals, and homogeneity of variance across groups.
Syntax
aov(Y ~ Grp * Male, data = data)
Or this model can be written
aov(Y ~ Grp + Male + Grp:Male, data = data)
This model can be written:
For advanced users: if you wish to write in \(\LaTeX\), you can create a table in many ways, for example by copying and pasting the output of the following code into your \(\LaTeX\) document:
parameters::parameters(model_simple)%>%
knitr::kable(format = "latex", booktabs = F)%>%
print()
\begin{tabular}{l|r|r|r|r|r|r|r|r}
\hline
Parameter & Coefficient & SE & CI & CI\_low & CI\_high & t & df\_error & p\\
\hline
(Intercept) & -17.579095 & 6.7584402 & 0.95 & -31.167850 & -3.990340 & -2.601058 & 48 & 0.0123188\\
\hline
speed & 3.932409 & 0.4155128 & 0.95 & 3.096964 & 4.767853 & 9.463990 & 48 & 0.0000000\\
\hline
\end{tabular}
The package I tend to go to for \(\LaTeX\) is the texreg package:
texreg::texreg(list(model_simple),
custom.model.names = c("Cars ..."),
caption = "Braking distance as predicted by speed",
sideways = F,
scalebox = .5,
#fontsize= "footnotesize",
label = "tab:model_simple",
ci.force.level = 0.95, bold = 0.05,
booktabs = TRUE,
custom.note ="")
\usepackage{graphicx}
\usepackage{booktabs}
\begin{table}
\begin{center}
\scalebox{0.5}{
\begin{tabular}{l c}
\toprule
& Cars ... \\
\midrule
(Intercept) & $\mathbf{-17.58}^{*}$ \\
& $(6.76)$ \\
speed & $\mathbf{3.93}^{***}$ \\
& $(0.42)$ \\
\midrule
R$^2$ & $0.65$ \\
Adj. R$^2$ & $0.64$ \\
Num. obs. & $50$ \\
\bottomrule
\end{tabular}
}
\caption{Braking distance as predicted by speed}
\label{tab:model_simple}
\end{center}
\end{table}