Multiple regression (ANOVA, ANCOVA, MANOVA)

Joseph Bulbulia (https://josephbulbulia.netlify.app), Victoria University of Wellington (https://www.wgtn.ac.nz)
2021-APRIL-20

Learning objectives

Before the break, we introduced basic concepts in statistical regression, focusing on regression using a single covariate.

Today, we will introduce regression using multiple co-variates. You will acquire the following skills:

  1. Understand how to write a regression model with multiple co-variates in R.

  2. Understand how to interpret a regression model with multiple co-variates in R.

  3. Understand methods for comparing simple and complex models.

  4. Understand how to create a table for your results.

  5. Understand how to graph your results.

  6. Understand how to write equivalent models in an ANOVA framework, in case someone forces you to do this.

What is a regression model?

In the first instance, linear regression is a method for predicting features of the world, or as we put it in lecture 5:

[R]egression … is a method for inferring the expected average features of a population, and the variance of a population, conditional on other features of the population as measured in a sample.

In week 9, we’ll discuss how to use regression for explaining features of the world. However, for the next two weeks, we’ll restrict our interests to prediction.

To refresh your memory about how the magic of prediction by regression works, let’s look at an example. Consider data collected in the 1920s on the relationship between the speed at which a car is travelling and the distance it takes for that car to stop. The data are found in the cars dataset, which is automatically loaded when you start R.

Here is our model:

model_simple <- lm(dist ~ speed, data = cars)

Let’s look at the output. You can create a table like this:

parameters::parameters(model_simple)
Parameter   | Coefficient |   SE |          95% CI | t(48) |      p
-------------------------------------------------------------------
(Intercept) |      -17.58 | 6.76 | [-31.17, -3.99] | -2.60 | 0.012 
speed       |        3.93 | 0.42 | [  3.10,  4.77] |  9.46 | < .001

Recall that we can create an html table in this way:

parameters::parameters(model_simple)%>%
    parameters::print_html(caption = "Braking distance as predicted by speed")
Braking distance as predicted by speed
Parameter Coefficient SE 95% CI t(48) p
(Intercept) -17.58 6.76 (-31.17, -3.99) -2.60 0.012
speed 3.93 0.42 (3.10, 4.77) 9.46 < .001

How do we interpret the results of a regression model?

Recall that the report package makes your life easy:

report::report(model_simple)
We fitted a linear model (estimated using OLS) to predict dist with speed (formula: dist ~ speed). The model explains a significant and substantial proportion of variance (R2 = 0.65, F(1, 48) = 89.57, p < .001, adj. R2 = 0.64). The model's intercept, corresponding to speed = 0, is at -17.58 (95% CI [-31.17, -3.99], t(48) = -2.60, p < .05). Within this model:

  - The effect of speed is significantly positive (beta = 3.93, 95% CI [3.10, 4.77], t(48) = 9.46, p < .001; Std. beta = 0.81, 95% CI [0.64, 0.98])

Standardized parameters were obtained by fitting the model on a standardized version of the dataset.

However, before you go rushing to report your “significant” p-value, how shall we interpret this output?

Recall that can write the output of a model as an equation.

The equatiomatic package in R makes this work easy for you.

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_simple,  use_coefs = TRUE)

\[ \operatorname{\widehat{dist}} = -17.58 + 3.93(\operatorname{speed}) \]


The generic form of a linear model is:

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_simple,  use_coefs = FALSE)

\[ \operatorname{dist} = \alpha + \beta_{1}(\operatorname{speed}) + \epsilon \]

The model says that the expected stopping distance for a car is -17.58 feet when the car is travelling at zero mph. To this we add an additional 3.93 feet for each additional unit of speed (here, miles per hour).

What strikes you about this model?

If you are like me, you probably feel confusion when you see the number “-17.58” predicting stopping distance 😲. Did car manufacturers of the 1920s invent a method for traversing space and time as we know it 🤔? Or is regression a hopeless tool for understanding the world, and should you demand your money back for this course 😿?

Let’s look at the data more carefully:

cars%>%
  dplyr::arrange(speed)%>%
  tibble()
# A tibble: 50 x 2
   speed  dist
   <dbl> <dbl>
 1     4     2
 2     4    10
 3     7     4
 4     7    22
 5     8    16
 6     9    10
 7    10    18
 8    10    26
 9    10    34
10    11    17
# … with 40 more rows

We can see that the lowest speed measured in this dataset is 4 miles per hour, at which distance the car stopped in 2 feet. Another car travelling at 4 mph took 10 feet to stop. So there’s variability.

Let’s plug these two numbers into the regression equation that we just estimated. We do not need to copy and paste coefficients from the output of the model. Rather we can recover the intercept as follows:

coef(model_simple)[[1]]
[1] -17.57909

And for each additional mph we add this amount (the slope coefficient):

coef(model_simple)[[2]]
[1] 3.932409

So we can compute the expected (or predicted) stopping distance for a car travelling at the maximum speed in the dataset (25 mph) as follows:

coef(model_simple)[[1]] + coef(model_simple)[[2]] * max(cars$speed, na.rm = TRUE)
[1] 80.73112

and for the minimum speed (4 mph) as follows:

coef(model_simple)[[1]] + coef(model_simple)[[2]] * min(cars$speed, na.rm = TRUE)
[1] -1.84946

Is this any better? We’re still getting a negative stopping distance.

To avoid confusion, let’s turn to the most useful tool in your R toolkit, your graph.

We can plot the model as follows:

library(ggplot2)
ggplot2::ggplot(data = cars, 
                aes( x = speed,  y = dist )) +
  geom_smooth(method = "lm") + 
  geom_point() + theme_classic()

Here, our linear model is minimising the distance between the stopping distances observed in the sample and the stopping distances the model predicts from speed. The line passes through at most a few points in the dataset. Otherwise it predicts a response that is either too high or too low.
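Under the hood, “minimising” here means ordinary least squares: choose the intercept and slope that minimise the sum of squared vertical distances between the observed and predicted stopping distances. A minimal sketch (not how lm() is implemented, but it recovers approximately the same coefficients by direct numerical minimisation):

# sum of squared residuals for a candidate intercept (par[1]) and slope (par[2])
rss <- function(par) sum((cars$dist - (par[1] + par[2] * cars$speed))^2)

optim(c(0, 0), rss)$par  # approximately -17.58 and 3.93
coef(model_simple)       # lm()'s least-squares estimates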

Note that because the intercept is evaluated where the predictor equals 0, the intercept term (-17.58) in our model is uninterpretable.

There is no coherent concept we can attach to an expected stopping distance at a speed of 0 mph, and the value 0 mph never occurs in the dataset.

How to get an interpretable intercept?

Center (or center and scale) all your continuous predictors. In such a model, the intercept is interpretable as the expected response when all continuous predictors are set to their sample average, which in this case is 15.4 mph.

Recall that we can center our variables using the following code

# center and create new dataframe
schmars <- cars

# note that I am presently having trouble with dplyr output. The following code will allow me to graph results (which I found by searching stackoverflow.com)

schmars['speed_c'] = as.data.frame(scale(schmars$speed, scale = FALSE)) # work-around for a bug: center speed (do not scale)
schmars$speed_c
 [1] -11.4 -11.4  -8.4  -8.4  -7.4  -6.4  -5.4  -5.4  -5.4  -4.4  -4.4
[12]  -3.4  -3.4  -3.4  -3.4  -2.4  -2.4  -2.4  -2.4  -1.4  -1.4  -1.4
[23]  -1.4  -0.4  -0.4  -0.4   0.6   0.6   1.6   1.6   1.6   2.6   2.6
[34]   2.6   2.6   3.6   3.6   3.6   4.6   4.6   4.6   4.6   4.6   6.6
[45]   7.6   8.6   8.6   8.6   8.6   9.6
# model
model_simple2 <- lm(dist ~ speed_c, data = schmars)

# table
parameters::parameters(model_simple2)%>%
  parameters::print_html()
Model Summary
Parameter Coefficient SE 95% CI t(48) p
(Intercept) 42.98 2.18 (38.61, 47.35) 19.76 < .001
speed_c 3.93 0.42 (3.10, 4.77) 9.46 < .001

Notice that the estimated coefficient for speed in this second model is the same as the estimated coefficient for speed in the first model:

# evaluate whether the two coefficients are the same at least to five decimal places
round(coef(model_simple)[[2]],5) == round(coef(model_simple2)[[2]],5)
[1] TRUE

And the graph is identical:

ggplot2::ggplot(data = cars, 
                aes( x = speed,  y = dist )) +
  geom_smooth(method = "lm") + 
  geom_point()

However, the intercept is now meaningful. It is the expected (or predicted) outcome when speed is set to its sample average (15.4 mph):

round(coef(model_simple2)[[1]],2)
[1] 42.98

We still interpret the coefficient for speed as the expected increase or decrease in stopping distance for a one-unit change in speed (which was measured in miles per hour in the population from which this sample was collected). For a car travelling 3 mph faster than the sample average we would expect an average stopping distance of:

coef(model_simple2)[[1]] + coef(model_simple2)[[2]] * 3
[1] 54.77723

And for a car travelling 10.33 mph faster than the sample average we would expect an average stopping distance of:

coef(model_simple2)[[1]] + coef(model_simple2)[[2]] * 10.33
[1] 83.60178

Recall that you can create a nice prediction plot using the ggeffects package:

# graph expected means
plot(ggeffects::ggpredict(model_simple2, terms = "speed_c [all]"), 
     add.data = TRUE, 
     dot.alpha = .8, 
    # jitter = .01, 
     ci.style = "errorbar") + 
  ylab("Distance in feet") + 
  xlab("Speed in MPH (centered at 15.4 MPH)") + 
  labs(title = "Predicted average stopping distance by speed for cars in the 1920s",
       subtitle = "Speed centered at 15.4 MPH") + theme_classic()

or another method:

# graph expected means
plot(ggeffects::ggpredict(model_simple2, terms = "speed_c [all]"), 
     add.data = TRUE, 
     dot.alpha = .8, 
     jitter = .01) + 
  ylab("Distance in feet") + 
  xlab("Speed in MPH (centered at 15.4 MPH)") + 
  labs(title = "Predicted average stopping distance by speed for cars in the 1920s",
       subtitle = "Speed centered at 15.4 MPH") + theme_classic()

The essential take-home messages about regression so far are as follows:

  - Regression estimates the expected (average) outcome conditional on the predictors, not the outcome for any particular case.

  - Center (or center and scale) your continuous predictors so that the intercept is interpretable.

  - Interpret each coefficient as the expected change in the outcome for a one-unit change in its predictor.

  - Graph your model: the graph is the most useful tool in your R toolkit.

Multiple regression

We have already encountered multiple regression. Perhaps the most familiar case is prediction. Suppose we want to predict someone’s expected income based on their gender identification. Suppose our theory is that there might be an added income advantage from height. We can include an additional term in our model and evaluate whether our theory holds. Let’s assess this question using the jittered NZAVS dataset.

Here is a regression model with two predictors. We have centered height at the sample average. The outcome is expected household income for the NZ population simultaneously stratified by height and by male identification. (Question: can you think of why household income is not an ideal indicator in a setting where we are looking at gender differences…? We’ll come back to this question in a few weeks.)

# model
model_theory <- lm(income ~ height_cm_c + not_male, data = df)
sjPlot::tab_model(model_theory)
  income
Predictors Estimates CI p
(Intercept) 134460.88 124471.39 – 144450.37 <0.001
height_cm_c 720.23 42.63 – 1397.84 0.037
not_male [Not_Male] -7719.91 -21457.23 – 6017.41 0.270
Observations 1370
R2 / R2 adjusted 0.011 / 0.009

How do we interpret this model? First we can write the equation:

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory,  use_coefs = FALSE)

\[ \operatorname{income} = \alpha + \beta_{1}(\operatorname{height\_cm\_c}) + \beta_{2}(\operatorname{not\_male}_{\operatorname{Not\_Male}}) + \epsilon \]

And plugging in the values

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory,  use_coefs = TRUE)

\[ \operatorname{\widehat{income}} = 134460.88 + 720.23(\operatorname{height\_cm\_c}) - 7719.91(\operatorname{not\_male}_{\operatorname{Not\_Male}}) \]

First we need the average sample height (across all gender identifications):

mean(nz_0$HLTH.Height, na.rm=TRUE)
[1] 1.692567

The model tells us the following. The expected household income for the population of men who are at the sample average height is $134,460.88. The expected increase in income for each additional centimeter of height, whether one identifies as male or not-male, is $720.23. The expected reduction in income for people who do not identify as male is $7,719.91.
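We do not need to do this arithmetic by hand: predict() returns the same expected values from the fitted model. A minimal sketch (the newdata values are illustrative):

# expected household income at the sample average height (height_cm_c = 0),
# for male-identified and not-male-identified respondents
predict(model_theory,
        newdata = data.frame(height_cm_c = c(0, 0),
                             not_male = c("Male", "Not_Male")))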

Are men expected to make $7,719.91 more than women?

No, because people who do not identify as male are on average shorter, and our model stratifies across male and non-male gender identifications at the same height. If we focus only on the gender difference, we can include male-id and remove height, and this gives us:

model_theory2 <- lm(income ~ not_male, data = df)
sjPlot::tab_model(model_theory2)
  income
Predictors Estimates CI p
(Intercept) 140522.50 132310.79 – 148734.21 <0.001
not_male [Not_Male] -17311.52 -27681.96 – -6941.07 0.001
Observations 1370
R2 / R2 adjusted 0.008 / 0.007

We write the model equation here:

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory2,  use_coefs = FALSE)

\[ \operatorname{income} = \alpha + \beta_{1}(\operatorname{not\_male}_{\operatorname{Not\_Male}}) + \epsilon \]

And plugging in the values

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory2,  use_coefs = TRUE)

\[ \operatorname{\widehat{income}} = 140522.5 - 17311.52(\operatorname{not\_male}_{\operatorname{Not\_Male}}) \]

Thus the expected not-male income is 140,522.50 - 17,311.52 = 123,210.98, or about $NZD 123,200.

We can recover this expectation in the original model by adding the deviations of the male and not-male sample average heights from the overall sample average, as follows:

Find the approximate deviations of male and not-male heights from the sample average:

df%>%
  group_by(not_male)%>%
  summarise(genmean = mean(HLTH.Height, na.rm = TRUE))
# A tibble: 3 x 2
  not_male genmean
  <chr>      <dbl>
1 Male        1.78
2 Not_Male    1.65
3 <NA>        1.69
# difference 
df%>%
  summarise(mean = mean(HLTH.Height, na.rm = TRUE))
      mean
1 1.699764

Plug in expected values for male Id:

coef(model_theory)[[1]] + # intercept at average height
  coef(model_theory)[[2]]* 8.3 + #increase in average male height over sample average in cm
  coef(model_theory)[[3]]* 0 # code for male id = 0
[1] 140438.8

And for not-male Id:

coef(model_theory)[[1]] + coef(model_theory)[[2]]* - 4.7  + coef(model_theory)[[3]]*1
[1] 123355.9

These are the equivalent expectations to the stratified models.

Either way you cut it, we find that women (and others who do not identify as male) are expected to make about $17,311 less in household income than men.

Multiple regression with more than two covariates

Note that regression allows us to stratify across other segments of a population by adding even more co-variates.

Let’s add education to the model. In the NZAVS, education (Edu) is a 10-level factor. However, for now, we’ll treat it as numeric. One standard deviation of Edu is about 2.56 on the 0 to 10 scale. Adding the standardised variable edu_s to the model we find:

model_theory3 <- lm(income ~ height_cm_c + not_male + edu_s, data = df)
sjPlot::tab_model(model_theory3)
  income
Predictors Estimates CI p
(Intercept) 139186.50 129259.19 – 149113.80 <0.001
height_cm_c 512.62 -157.61 – 1182.85 0.134
not_male [Not_Male] -14651.10 -28288.27 – -1013.94 0.035
edu_s 20367.57 15408.77 – 25326.36 <0.001
Observations 1348
R2 / R2 adjusted 0.057 / 0.055
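In the model above, edu_s is a standardised version of the education variable. A minimal sketch of how such a variable might be constructed (the column name Edu is assumed; the actual NZAVS data preparation may differ):

# convert the 10-level factor to numeric, then standardise (mean 0, SD 1)
df$edu_s <- as.numeric(scale(as.numeric(df$Edu)))

# one unit of edu_s corresponds to one standard deviation of Edu (about 2.56 levels)
sd(as.numeric(df$Edu), na.rm = TRUE)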

We write the model equation as follows:

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory3,  use_coefs = FALSE)

\[ \operatorname{income} = \alpha + \beta_{1}(\operatorname{height\_cm\_c}) + \beta_{2}(\operatorname{not\_male}_{\operatorname{Not\_Male}}) + \beta_{3}(\operatorname{edu\_s}) + \epsilon \]

And plugging in the values:

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory3,  use_coefs = TRUE)

\[ \operatorname{\widehat{income}} = 139186.5 + 512.62(\operatorname{height\_cm\_c}) - 14651.1(\operatorname{not\_male}_{\operatorname{Not\_Male}}) + 20367.57(\operatorname{edu\_s}) \]

Do height differences predict income differences differently across gender identifications? To assess this we add an interaction term. NOTE: we should always center the continuous variables that enter our interaction terms, for the same reason that we should center polynomials: otherwise multi-collinearity renders the model unstable.
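To see why centering matters here, we can compare collinearity diagnostics for an uncentered versus a centered height-by-gender interaction. A minimal sketch (assuming df contains the raw height variable HLTH.Height in metres, as above; the derived column names are hypothetical):

# uncentered versus centered height (in cm)
df$height_cm_raw <- df$HLTH.Height * 100
df$height_cm_cen <- as.numeric(scale(df$height_cm_raw, scale = FALSE))

# variance inflation factors are typically far higher in the uncentered model
performance::check_collinearity(lm(income ~ height_cm_raw * not_male, data = df))
performance::check_collinearity(lm(income ~ height_cm_cen * not_male, data = df))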

Multiple regression with interactions

# model
model_theory4 <- lm(income ~ height_cm_c * not_male, data = df)

Note that this model is equivalent to:

model_theory4 <- lm(income ~ height_cm_c + not_male + not_male:height_cm_c, data = df)

We write the model equation as follows:

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory4,  use_coefs = FALSE)

\[ \operatorname{income} = \alpha + \beta_{1}(\operatorname{height\_cm\_c}) + \beta_{2}(\operatorname{not\_male}_{\operatorname{Not\_Male}}) + \beta_{3}(\operatorname{height\_cm\_c} \times \operatorname{not\_male}_{\operatorname{Not\_Male}}) + \epsilon \]

And plugging in the values:

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory4,  use_coefs = TRUE)

\[ \operatorname{\widehat{income}} = 129582.33 + 1299.9(\operatorname{height\_cm\_c}) - 4553.46(\operatorname{not\_male}_{\operatorname{Not\_Male}}) - 928.99(\operatorname{height\_cm\_c} \times \operatorname{not\_male}_{\operatorname{Not\_Male}}) \]

The * in the model is merely shorthand for the two main effects height_cm_c + not_male plus their interaction not_male:height_cm_c.

Let’s graph the coefficients:

sjPlot::plot_model(model_theory4)

Whoooah what just happened?! There’s huge uncertainty about the main effect of male-id – it is no longer a reliable predictor in this model.

Is the interaction messing up the model? We can look for multi-collinearity using the following code from the performance package.

z<-performance::check_collinearity(model_theory4)
z ## looks ok
# Check for Multicollinearity

Low Correlation

                 Term  VIF Increased SE Tolerance
          height_cm_c 4.68         2.16      0.21
             not_male 1.97         1.40      0.51
 height_cm_c:not_male 3.10         1.76      0.32
plot(z) ## looks ok

As indicated in the graph, the problem isn’t multi-collinearity.

Does this mean that income is really predicted by height, and gender is a red herring?

I have seen regression models interpreted that way. However, this is confused. The confusion mounts when we compare the expected means of the population, stratifying on gender identification:

modelbased::estimate_means(model_theory4)
not_male |     Mean |      SE |               95% CI
----------------------------------------------------
Male     | 1.30e+05 | 6291.49 | [1.17e+05, 1.42e+05]
Not_Male | 1.25e+05 | 3886.99 | [1.17e+05, 1.33e+05]

p-values are uncorrected.

However, our model has compared male-ids and not-male-ids at equivalent levels of height, and we know that non-male-ids tend to be shorter.

Wherever you have interactions, it is essential to graph your results. This is so important I’m going to write it again: wherever you have interactions, it is essential to graph the results. Tables are inscrutable. We need to probe our model by graphing its predictions. Recall that we can do this using the ggeffects package.

Here is a graph comparing men/non male at different levels of height.

pp1 <- plot(ggeffects::ggpredict(model_theory4, terms = c("not_male","height_cm_c")))
pp1

The interpretation is now clearer. Greater height predicts much greater income for male-ids; for non-male-ids, an equivalent change in height predicts a much smaller difference in income. There is a non-linearity in the predicted effects of height by gender. Or put another way, gender moderates the relationship between height and income.

terms = c("not_male","height_cm_c")

Or equivalently, height moderates income by gender. We can graph this equivalent interpretation by reversing the order of our syntax. This gives us the same model predictions but visualised with height on the x-axis:

terms = c("height_cm_c","not_male")
x <- ggeffects::ggpredict(model_theory4, terms = c("height_cm_c","not_male"))
pp2 <- plot( x )
pp2

My preferred method for graphing is to include all the data points:

add.data = TRUE,

This allows us to see what is happening in our dataset:

pp3 <- plot( x,
      add.data = TRUE, 
      jitter = 0.2, 
      dot.alpha = .2) + 
  scale_y_continuous(limits = c(0, 5e+5)) + 
  theme_classic()

pp3

There is a non-linearity in our model. The model predicts that, on average, short men are expected to make much less than short women. Among non-male-ids, height differences are not a reliable predictor of household income. By contrast, among male-ids, height differences are a reliable predictor. Non-male-ids tend to be shorter than men by about 13 cm.

Sample averages:

df%>%
  summarise(mean(HLTH.Height, na.rm = TRUE))
  mean(HLTH.Height, na.rm = TRUE)
1                        1.699764

Sample differences by gender-id:

df%>%
  group_by(not_male)%>%
  summarise(mean(HLTH.Height, na.rm = TRUE))
# A tibble: 3 x 2
  not_male `mean(HLTH.Height, na.rm = TRUE)`
  <chr>                                <dbl>
1 Male                                  1.78
2 Not_Male                              1.65
3 <NA>                                  1.69

Let’s plug in the values for male-id and not-male-id at the average height for each group:

# this is how to quickly generate the equation
equatiomatic::extract_eq(model_theory4,  use_coefs = TRUE)

\[ \operatorname{\widehat{income}} = 129582.33 + 1299.9(\operatorname{height\_cm\_c}) - 4553.46(\operatorname{not\_male}_{\operatorname{Not\_Male}}) - 928.99(\operatorname{height\_cm\_c} \times \operatorname{not\_male}_{\operatorname{Not\_Male}}) \] To get the expected average for the male-id population we can compute the regression equation as follows:

coef(model_theory4)[[1]] + # intercept: expected income at the sample average height (about 170 cm)
  coef(model_theory4)[[2]] * 8 + # male-ids are about 8 cm taller than the sample average (we include this term even though the coefficient is unreliable)
  coef(model_theory4)[[3]] * 0 + # male-id is coded as zero
  coef(model_theory4)[[4]] * 8 * 0 # interaction: 8 cm difference * the male-id code of zero (which zeros this term out)
# roughly $140,000

For the not-male-id population we compute the regression equation as follows:

coef(model_theory4)[[1]] + # intercept: expected income at the sample average height (about 170 cm)
  coef(model_theory4)[[2]] * -5 + # not-male-ids are about 5 cm shorter than the sample average
  coef(model_theory4)[[3]] * 1 + # not-male-id is coded as 1
  coef(model_theory4)[[4]] * -5 * 1 # interaction: 5 cm shorter * the not-male-id code of 1
# roughly $123,000

Compare these estimates with those from the simpler model that included only male-id (model_theory2), without height or the interaction.

We can do this quickly using the modelbased package.

modelbased::estimate_means(model_theory2)
not_male |     Mean |      SE |               95% CI
----------------------------------------------------
Male     | 1.41e+05 | 4186.02 | [1.32e+05, 1.49e+05]
Not_Male | 1.23e+05 | 3228.61 | [1.17e+05, 1.30e+05]

p-values are uncorrected.

We find that the estimates are similar.

How do we select a model?

Which model should we prefer? Answer: it depends on your question!

We’ll come back to this point when we focus on the uses of regression for causal inference (week 9). For now, you should know that there are no canned methods for selecting a model.

However, because reviewers will ask, you can use the AIC or BIC information criteria to compare models. Recall that a lower AIC or BIC offers a reason to prefer one model over another.

performance::compare_performance(model_theory, 
                                 model_theory2,
                                 model_theory3,
                                 model_theory4)
# Comparison of Model Performance Indices

Name          | Model |       AIC |       BIC |    R2 | R2 (adj.) |      RMSE |     Sigma
-----------------------------------------------------------------------------------------
model_theory  |    lm | 35283.610 | 35304.501 | 0.011 |     0.009 | 94407.120 | 94510.656
model_theory2 |    lm | 35285.961 | 35301.628 | 0.008 |     0.007 | 94557.131 | 94626.227
model_theory3 |    lm | 34651.663 | 34677.695 | 0.057 |     0.055 | 92073.554 | 92210.466
model_theory4 |    lm | 35283.908 | 35310.021 | 0.012 |     0.010 | 94348.498 | 94486.535

We can graph the models along different criteria:

xx <- performance::compare_performance(model_theory, 
                                 model_theory2,
                                 model_theory3,
                                 model_theory4)

plot(xx)

We find that model three performed the best. Recall that this is the model that included education:

model_theory3 <- lm(income ~ height_cm_c + not_male + edu_s, data = df)

What if we were to include an interaction between height and male-id?

model_theory6 <- lm(income ~ height_cm_c * not_male + edu_s, data = df)

There’s no improvement in this model. After adjusting for education, height, and male-id, we do not see an improvement from allowing the height effect to differ across the male-id categories:

performance::compare_performance(model_theory3,model_theory6)
# Comparison of Model Performance Indices

Name          | Model |       AIC |       BIC |    R2 | R2 (adj.) |      RMSE |     Sigma
-----------------------------------------------------------------------------------------
model_theory3 |    lm | 34651.663 | 34677.695 | 0.057 |     0.055 | 92073.554 | 92210.466
model_theory6 |    lm | 34652.271 | 34683.509 | 0.058 |     0.055 | 92026.013 | 92197.161

What if we were to include age? Recall that a linear model can (and often should) include non-linear terms. We encountered this point when we were discussing polynomials and splines. We can include a spline term for age using the bs function from the splines package.

library("splines")
model_theory7 <- lm(income ~ bs(age_10_c) + height_cm_c * not_male + edu_s, data = df)
parameters::parameters(model_theory7)
Parameter                         | Coefficient |       SE |                 95% CI | t(1340) |      p
------------------------------------------------------------------------------------------------------
(Intercept)                       |    1.03e+05 | 18025.73 | [  67729.01, 1.38e+05] |    5.72 | < .001
age_10_c [1st degree]             |    87255.86 | 44968.06 | [   -959.60, 1.75e+05] |    1.94 | 0.053 
age_10_c [2nd degree]             |    42518.74 | 26275.83 | [  -9027.49, 94064.97] |    1.62 | 0.106 
age_10_c [3rd degree]             |   -34574.37 | 36902.29 | [ -1.07e+05, 37818.18] |   -0.94 | 0.349 
height_cm_c                       |      858.27 |   553.98 | [   -228.49,  1945.02] |    1.55 | 0.122 
not_male [Not_Male]               |   -16286.53 |  7399.89 | [ -30803.16, -1769.89] |   -2.20 | 0.028 
edu_s                             |    20025.41 |  2553.22 | [  15016.66, 25034.15] |    7.84 | < .001
height_cm_c * not_male [Not_Male] |     -653.21 |   699.66 | [  -2025.76,   719.34] |   -0.93 | 0.351 

As judged by the AIC, this model has improved fit. However, as judged by the BIC, this model has reduced fit. This happens because the BIC penalises the extra parameters more heavily.
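To see why the two criteria can disagree, recall that both equal the model’s deviance (-2 × the log-likelihood) plus a penalty on the number of estimated parameters: the AIC adds 2k, whereas the BIC adds k × log(n), which is larger in any reasonably sized sample. A minimal sketch that recomputes both criteria by hand for model_theory7:

k <- attr(logLik(model_theory7), "df") # number of estimated parameters
n <- nobs(model_theory7)               # number of observations

-2 * as.numeric(logLik(model_theory7)) + 2 * k      # reproduces AIC(model_theory7)
-2 * as.numeric(logLik(model_theory7)) + log(n) * k # reproduces BIC(model_theory7)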

We can graph this outcome as follows:

za<-performance::compare_performance(model_theory3, model_theory7) # age, benefit but only if polynomial
plot(za)

Why might we include age? We might be interested in the predicted comparisons across age for the reference population in our model (male-ids at the sample average height, which is about 8 cm shorter than the average male height, with sample average education). Age here (age_10_c) has been converted to decade units and centered.

plot( ggeffects::ggpredict(model_theory7, terms = c("age_10_c")) ) 

We can see here that the interaction estimate has shifted because we have added more variables. This is because the population we are predicting for is slightly different from the population before … the change might have occurred merely because we have added more terms or, as we shall come to understand in Week 9, the change might have occurred because adding indicators produces new understanding.

plot( ggeffects::ggpredict(model_theory7, terms = c("height_cm_c","not_male")) ) 

Can we estimate multiple outcomes at the same time?

Yes. I do this using the brms package.

library(brms)

bf1 <- bf(y1 ~ x)
bf2 <- bf(y2 ~ x)

fit <- brm(bf1 + bf2 + set_rescor(TRUE), # allow estimation of residual correlations
           data = data)

We will encounter multivariate regression again in future weeks.

ANOVA

Don’t use it. But if you must, the following code can be useful.

ANOVA and T-tests

You can write these as linear models.
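For example, a classical two-sample t-test (assuming equal variances) is just a linear model with a single binary predictor; a minimal sketch using the same df:

# the p value (and the size of the t statistic) for the group comparison should match
t.test(income ~ not_male, data = df, var.equal = TRUE)
summary(lm(income ~ not_male, data = df))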

This is a one-way anova in which the grouping variable “not_male” is the condition and “income” is the outcome.

Anova
#anova
m2 <- aov(income ~ not_male, data = df)
parameters::parameters(m2)
Parameter | Sum_Squares |   df | Mean_Square |     F |     p
------------------------------------------------------------
not_male  |    9.60e+10 |    1 |    9.60e+10 | 10.72 | 0.001
Residuals |    1.22e+13 | 1368 |    8.95e+09 |       |      

This is how the report package says you should report. I’ve tweaked the wording because I cannot write “statistically significant” without gagging:

The ANOVA (formula: income ~ not_male) suggests that:

Effect sizes were labelled following Field’s (2013) recommendations.

I recommend the modelbased package, which comes as part of easystats, to explore your model.

These are the estimated means

modelbased::estimate_relation(m2)
  not_male Predicted       SE   CI_low  CI_high
1     Male  140522.5 4186.018 132310.8 148734.2
2 Not_Male  123211.0 3228.605 116877.4 129544.5

Compare this against the linear model:

m2l <- lm(income ~ not_male, data = df)
parameters::parameters(m2l)
Parameter           | Coefficient |      SE |                95% CI | t(1368) |      p
--------------------------------------------------------------------------------------
(Intercept)         |    1.41e+05 | 4186.02 | [ 1.32e+05, 1.49e+05] |   33.57 | < .001
not_male [Not_Male] |   -17311.52 | 5286.46 | [-27681.96, -6941.07] |   -3.27 | 0.001 

This is how the report package says you should report the equivalent linear model:

We fitted a linear model (estimated using OLS) to predict income with not_male (formula: income ~ not_male). The model explains a statistically significant proportion of variance (R2 = 0.01, F(1, 1368) = 10.72, p = 0.001, adj. R2 = 0.01). The model’s intercept, corresponding to not_male = 0, is at 1.41e+05 (95% CI [1.32e+05, 1.49e+05], t(1368) = 33.57, p < .001). Within this model:

Standardized parameters were obtained by fitting the model on a standardized version of the dataset.

These are the estimated means, which are identical to the ANOVA.

modelbased::estimate_relation(m2l)
  not_male Predicted       SE   CI_low  CI_high
1     Male  140522.5 4186.018 132310.8 148734.2
2 Not_Male  123211.0 3228.605 116877.4 129544.5

I’m not going to spend any more time with ANOVA. This isn’t because I’m dismissive of ANOVA. It can be a useful framework for certain questions. However, it is never going to produce different answers than you would obtain in a linear regression framework. Moreover, the linear regression framework is much more flexible and can be extended. However, in case you are required to formulate your model in an ANOVA framework, I’ve included some R syntax for you here.

One-way ANOVA

Syntax

aov(Y ~ Grp, data = data)

Assumptions: independent observations, approximately normally distributed residuals, and homogeneity of variance across groups.
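A minimal sketch of how these assumptions might be checked with the performance package (Y, Grp, and data are hypothetical, as in the syntax above):

m1 <- aov(Y ~ Grp, data = data)
performance::check_homogeneity(m1) # equality of variances across groups
performance::check_normality(m1)   # normality of residuals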

Two-way ANOVA

Syntax

aov(Y ~ Grp * Male, data = data)

Or this model can be written

aov(Y ~ Grp + Male + Grp:Male, data = data)

MANOVA

This model can be written:

# MANOVA test
model<- manova(cbind(Y1, Y2) ~ GRP, data = data)
summary(model)

Appendix A: LaTeX tables

For advanced users: if you wish to write in \(\LaTeX\) you can create a table in many ways, for example by copying and pasting the output of the following code into your \(\LaTeX\) document:

parameters::parameters(model_simple)%>%
  knitr::kable(format = "latex", booktabs = FALSE)%>%
  print()

\begin{tabular}{l|r|r|r|r|r|r|r|r}
\hline
Parameter & Coefficient & SE & CI & CI\_low & CI\_high & t & df\_error & p\\
\hline
(Intercept) & -17.579095 & 6.7584402 & 0.95 & -31.167850 & -3.990340 & -2.601058 & 48 & 0.0123188\\
\hline
speed & 3.932409 & 0.4155128 & 0.95 & 3.096964 & 4.767853 & 9.463990 & 48 & 0.0000000\\
\hline
\end{tabular}

The package I tend to go to for \(\LaTeX\) is the texreg package:

texreg::texreg(list(model_simple),
           custom.model.names = c("Cars ..."),
           caption = "Breaking distance as predicted by speed",
           sideways = F,
           scalebox = .5,
           #fontsize= "footnotesize",
           label = "tab:model_simple",
           ci.force.level = 0.95, bold = 0.05,
           settingstars = 0,
           booktabs = TRUE,
           custom.note ="")

\usepackage{graphicx}
\usepackage{booktabs}

\begin{table}
\begin{center}
\scalebox{0.5}{
\begin{tabular}{l c}
\toprule
 & Cars ... \\
\midrule
(Intercept) & $\mathbf{-17.58}^{*}$ \\
            & $(6.76)$              \\
speed       & $\mathbf{3.93}^{***}$ \\
            & $(0.42)$              \\
\midrule
R$^2$       & $0.65$                \\
Adj. R$^2$  & $0.64$                \\
Num. obs.   & $50$                  \\
\bottomrule
\end{tabular}
}
\caption{Braking distance as predicted by speed}
\label{tab:model_simple}
\end{center}
\end{table}

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. Source code is available at https://go-bayes.github.io/psych-447/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Bulbulia (2021, April 20). Psych 447: Multiple regression (ANOVA, ANCOVA, MANOVA). Retrieved from https://vuw-psych-447.netlify.app/posts/7_1/

BibTeX citation

@misc{bulbulia2021multiple,
  author = {Bulbulia, Joseph},
  title = {Psych 447: Multiple regression (ANOVA, ANCOVA, MANOVA)},
  url = {https://vuw-psych-447.netlify.app/posts/7_1/},
  year = {2021}
}