Lab 4: Regression and Confounding Bias
Download the R script for this lab (right-click → Save As)
Packages
# install and load packages
if (!require(devtools, quietly = TRUE)) {
install.packages("devtools")
library(devtools)
}
library("margot")
library("tidyverse")
library("splines")
library("report")
library("performance")
library("ggeffects")
library("sjPlot")
library("dplyr")
library("ggplot2")
library("skimr")
library("gtsummary")
library("table1")
library("kableExtra")
library("patchwork")
library("parameters")
Regression
This week we introduce regression.
Learning Outcomes
By learning regression, you will be better equipped to do psychological science and to evaluate psychological research.
What is Regression?
Broadly speaking, a regression model is a method for inferring the expected average of a feature of a population, and its variability, conditional on other features of the population, as measured in a sample.
We shall see that regression encompasses more than this definition; however, this definition makes a start.
To understand regression, we need to understand the following jargon words: population, sample, measurement, and inference.
What is a population?
In science, a population is a hypothetical construct: the set of all potential members of a set of things. In psychological science, that set is typically a collection of individuals. We might want to understand "the population of all human beings", "the New Zealand adult population", or "the population of undergraduates who may be recruited for IPRP in New Zealand".
What is a sample?
A sample is a randomly realised sub-population from the larger abstract population that a scientific community hopes to generalise about.
Think of selecting balls randomly from an urn. When pulled at random, the balls may inform us about the urn's contents. For example, if we select one white ball and one black ball, we may infer that the balls in the urn are not all white or all black.
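The urn logic can be sketched in a few lines of base R (the urn's contents here are invented for illustration):

```r
# a hypothetical urn: 70 white balls, 30 black balls
set.seed(2023)
urn <- c(rep("white", 70), rep("black", 30))

# draw 10 balls at random, without replacement
draw <- sample(urn, size = 10, replace = FALSE)
table(draw)

# the sample proportion is our educated guess at the urn's composition
mean(draw == "white")
```

If the draw contains both colours, we can rule out the "all white" and "all black" hypotheses about the urn, just as in the text.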
What is "measurement"?
A measure is a tool or method for obtaining numerical descriptions of a sample. We often call measures "scales."
A measurement is the numerical description we obtain from instruments such as statistical surveys, census data, Twitter feeds, and so forth.
In the course, we have encountered numerical scales, ordinal scales, and factors. The topic of measurement in psychology is very broad. As we shall see, the trail of the serpent of measurement runs across comparative psychological research.
It is essential to remember that measures can be prone to error. Error-prone scales may nevertheless be helpful. However, we need to investigate their utility against the backdrop of specific interests and purposes.
What is a parameter?
In regression, we combine measurements on samples with probability theory to guess about the properties of a population we will never observe. We call these properties "parameters."
What is statistical inference?
The bulk of statistical inference consists of educated guessing about population parameters.
Probability distributions and statistical guessing
Inference is possible because the parameters of naturally occurring populations are structured by data-generating processes that are approximated by probability distributions. A probability distribution is a mathematical function describing a random event's probability. Today we will be focusing on height.1
Today we will discuss the "normal" or "Gaussian distribution." A large number of data-generating processes in nature conform to the normal distribution.
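Before sampling from it, we can interrogate the Normal(170, 20) model directly with R's dnorm and pnorm functions (the mean and sd are the ones used in the simulations below):

```r
mu <- 170   # population mean height
sigma <- 20 # population standard deviation

# density at the mean equals 1 / (sigma * sqrt(2 * pi))
dnorm(mu, mean = mu, sd = sigma)
1 / (sigma * sqrt(2 * pi))

# roughly 95% of the population falls within 2 sd of the mean
pnorm(mu + 2 * sigma, mu, sigma) - pnorm(mu - 2 * sigma, mu, sigma)
```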
Let us consider some examples of randomly generated samples, which we will obtain using R's rnorm function.
10-person sample of heights
# seed
set.seed(123)
# generate 10 samples, average 170, sd = 20
draws_10 <- rnorm(10, mean = 170, sd = 20)
# ggplot quick histogram
ggplot2::qplot(draws_10, binwidth = 2)
100-person sample of heights
# reproducibility
set.seed(123)
# generate 100 samples, average 170, sd = 20
draws_100 <- rnorm(100, mean = 170, sd = 20)
# graph
ggplot2::qplot(draws_100, binwidth = 2)
10000-person sample of heights
# reproducibility
set.seed(123)
# n = 10,000
draws_10000 <- rnorm(1e4, mean = 170, sd = 20)
# plot
ggplot2::qplot(draws_10000, binwidth = 2)
How can I use regression to infer a population parameter?
We can use R to investigate the average height of our imaginary population from which the preceding samples were randomly drawn. We do this in R by writing an "intercept-only" model as follows:
# syntax for an intercept-only model
model <- lm(outcome ~ 1, data = data)
# base R summary
summary(model)
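An intercept-only model's single coefficient is exactly the sample mean, and its standard error is sd(y)/sqrt(n). A quick base-R check on simulated data (not the lab's draws):

```r
set.seed(123)
y <- rnorm(50, mean = 170, sd = 20)
fit <- lm(y ~ 1)

# the intercept is the sample mean
coef(fit)[["(Intercept)"]]
mean(y)

# its standard error is sd(y) / sqrt(n)
summary(fit)$coefficients[1, "Std. Error"]
sd(y) / sqrt(length(y))
```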
Using the previous simulations:
N = 10 random draws
# write the model and get a nice table for it
sjPlot::tab_model(lm(draws_10 ~ 1))
N = 100 random draws
sjPlot::tab_model(lm(draws_100 ~ 1))
N = 10,000 random draws
sjPlot::tab_model(lm(draws_10000 ~ 1))
What do we notice about the relationship between sample size and the estimated population average?
sjPlot::tab_model(lm(draws_10 ~ 1),
lm(draws_100 ~ 1),
lm(draws_10000 ~ 1))
Regression with a single covariate
Do mothers' heights predict daughter height? If so, what is the magnitude of the relationship?
Francis Galton is credited with inventing regression analysis. Galton observed that offspring's heights tend to fall between parental height and the population average, which Galton termed "regression to the mean." Galton sought a method for educated guessing about heights, and this led to fitting a line of regression by a method called "least squares." (For a history, see here.)
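Least squares has a closed form: the slope is cov(x, y)/var(x), and the fitted line passes through the point of means. A sketch with simulated heights (not the Pearson–Lee data):

```r
set.seed(42)
mother   <- rnorm(200, mean = 62.5, sd = 2.4)
daughter <- 30 + 0.5 * mother + rnorm(200, mean = 0, sd = 2)

# least-squares estimates by hand
b1 <- cov(mother, daughter) / var(mother)  # slope
b0 <- mean(daughter) - b1 * mean(mother)   # intercept (line passes through the means)

# identical to lm()'s coefficients
c(b0, b1)
coef(lm(daughter ~ mother))
```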
The following dataset is from "The heredity of height" by Karl Pearson and Alice Lee (1903) (Pearson & Lee, 1903). I obtained it from (Gelman et al., 2020). Let us use this dataset to investigate the relationship between mothers' and daughters' heights.
# import data
df_pearson_lee <-
data.frame(read.table(
url(
"https://raw.githubusercontent.com/avehtari/ROS-Examples/master/PearsonLee/data/MotherDaughterHeights.txt"
),
header = TRUE
))
# centre mother's height for later example
df_pearson_lee_centered <- df_pearson_lee |>
dplyr::mutate(mother_height_c = as.numeric(scale(
mother_height, center = TRUE, scale = FALSE
)))
skimr::skim(df_pearson_lee_centered)
Pearson and Lee collected 5,524 observations from mother/daughter height pairs. Let us examine the data, first by plotting the relationship.
What is happening here?
# explore pearson lee data
explore_md <-
ggplot2::ggplot(data = df_pearson_lee_centered,
aes(y = daughter_height, x = mother_height)) +
geom_jitter(alpha = .2) +
labs(title = "The relationship between mother's height and daughter's height") +
ylab("Daughter's height") +
xlab("Mother's height") +
theme_classic()
# print
print(explore_md)
Is there a linear predictive relationship between these two variables? In regression, we examine the line of best fit.
# regression
m1 <-
lm(daughter_height ~ mother_height, data = df_pearson_lee_centered)
# graph
sjPlot::tab_model(m1)
We can plot the coefficient, but in a model with one predictor, this is not very informative. However, as we continue in the course, we shall see that plotting coefficients can be easier than deciphering the numbers in tables. Here are two methods for plotting.
# get model parameters
t_m1 <- parameters::model_parameters(m1, ci = 0.95)
# plot
plot(t_m1) +
labs(title = "The relationship between mother's height and daughter's height") +
ylab("Daughter's height")
How do we interpret the regression model?
Let us write the equation out in mathematics. How do we read this?2

daughter_heightᵢ = β₀ + β₁ × mother_heightᵢ + εᵢ

The maths says that the expected daughter's height in a population is predicted by β₀, the average daughter's height when mothers' heights are set to zero units (note, this is impossible; we shall come back to this), plus β₁ units of daughter's height (inches) for each additional unit of mother's height (inches).
We can plug the output of the model directly into the equation. Using the rounded estimates from the table above, the fitted line is approximately:

daughter_height ≈ 30 + 0.54 × mother_height
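Plugging the estimates into the equation by hand returns the same value as predict(); a sketch using simulated data (the fitted m1 works identically):

```r
set.seed(7)
mother_height   <- rnorm(100, mean = 62.5, sd = 2.4)
daughter_height <- 30 + 0.5 * mother_height + rnorm(100, mean = 0, sd = 2)
fit <- lm(daughter_height ~ mother_height)

b <- coef(fit)
# expected daughter's height for a 63-inch mother, by hand
b[["(Intercept)"]] + b[["mother_height"]] * 63

# the same value via predict()
predict(fit, newdata = data.frame(mother_height = 63))
```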
Graph the relationship between mother's and daughter's heights
library(ggeffects)
# predicted values across the observed range of mother's height
predictions <- ggeffects::ggpredict(m1, terms = "mother_height")
# note: the raw-data display options belong to plot(), not to ggpredict()
plot_predictions <-
  plot(predictions,
       add.data = TRUE,
       dot.alpha = .1,
       jitter = TRUE) +
  theme_classic() +
  labs(title = "Predicted values of daughter's height from the Pearson/Lee 1903 dataset")
plot_predictions
Regression to predict beyond the range of a dataset
Jyoti Amge is the world's shortest woman at 25 inches. Sandy Allen was the world's tallest woman at 91 inches. What are the expected heights of their daughters, and of the daughters of women at every height in between?
# use the expand.grid command to create a sequence of points for mother's height
df_expand_grid <- expand.grid(mother_height = c(25:91))
# use the predict function to create a new response
df_predict <-
predict(m1,
type = "response",
interval = "confidence",
newdata = df_expand_grid)
# create a new dataframe for the new sequence of points for mother's height and the predicted data
newdata <- data.frame(df_expand_grid, df_predict)
head(newdata)
Graph the predicted results:
# graph the expected results
predplot <- ggplot(data = newdata,
aes(x = mother_height, y = fit)) +
geom_point() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = .1) +
expand_limits(x = c(20, 91), y = c(0, 81)) +
theme_classic() +
labs(title = "Predicted values for a broader population")
# plot the two graphs together (making the x and y axis at the same scale)
library("patchwork")
# re-plot the model predictions without the raw data points
plot_height <- plot(predictions, add.data = FALSE) + theme_classic()
plot_height_title <-
plot_height +
expand_limits(x = c(20, 91), y = c(0, 81)) +
labs(title = "Predicted values of daughter's height from the Pearson/Lee 1903 dataset")
# double graph
plot_height_title / predplot +
plot_annotation(title = "What do you notice about these relationships?",
tag_levels = "a")
A simple method for obtaining the predicted values from your fitted model is to obtain the effects output without producing a graph.
# predicted values of mother height on daughter height
library(ggeffects)
ggeffects::ggpredict(m1, terms = "mother_height")
Non-linear relationships
Linear regression assumes that the outcome is a linear function of the predictors, conditional on the model. Often your data will not be linear!
Consider the following example:
# simulate nonlinear relationship between x and y
b <- c(2, 0.75)
set.seed(12)
x <- rnorm(100)
set.seed(12)
y <- rnorm(100, mean = b[1] * exp(b[2] * x))
dat1 <- data.frame(x, y)
ot1 <- lm(y ~ x, data = dat1)
# plot linear effect
plot(ggeffects::ggpredict(ot1, terms = "x",
add.data = TRUE,
dot.alpha = .4))
Non-linear relationship as modelled by a polynomial regression:
# model: quadratic
fit_non_linear <- lm(y ~ x + I(x ^ 2), data = dat1)
# predictive plot
plot(ggeffects::ggpredict(fit_non_linear, terms = "x",
add.data = TRUE,
dot.alpha = .4))
Here is another approach:
# non-linear regression
library(splines)
# fit model: poly() already includes the linear term, so we do not add x again
fit_non_linear_b <- lm(y ~ poly(x, 2), data = dat1)
# graph model
plot(
ggeffects::ggpredict(fit_non_linear_b, terms = "x",
add.data = TRUE,
dot.alpha = .4
))
Non-linear relationship as modelled by a regression spline (a B-spline basis, via splines::bs()):
# fit spline: not specified
fit_non_linear_c <- lm(y ~ bs(x), data = dat1)
# model parameters: coefficients are not interpretable
parameters::model_parameters(fit_non_linear_c)
plot(
ggeffects::ggpredict(fit_non_linear_c, terms = "x",
add.data = TRUE,
dot.alpha = .4
))
Centring
Any linear transformation of a predictor is acceptable. Often we centre (or centre and scale) our predictors, which gives the intercept an interpretation: the expected outcome when the predictors are set to their sample averages.
library(ggeffects)
# fit raw data
fit_raw <- lm(daughter_height ~ mother_height, data = df_pearson_lee_centered)
# fit centred data
fit_centered <-
lm(daughter_height ~ mother_height_c, data = df_pearson_lee_centered)
# compare the models
sjPlot::tab_model(fit_raw, fit_centered)
Graph model:
# graph centred model
plot(
ggeffects::ggpredict(fit_centered, terms = "mother_height_c",
add.data = TRUE,
dot.alpha = .4
))
Note: when fitting a polynomial or any interaction, it is important to centre your indicators. We shall come back to this point in later lectures.
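Centring shifts the predictor, so only the intercept's meaning changes; the slope is untouched, and the centred intercept equals the mean outcome (the least-squares line passes through the point of means). A base-R check on simulated data:

```r
set.seed(1)
x <- rnorm(100, mean = 62.5, sd = 2.4)
y <- 30 + 0.5 * x + rnorm(100, mean = 0, sd = 2)
x_c <- x - mean(x)  # centred predictor

fit_raw     <- lm(y ~ x)
fit_centred <- lm(y ~ x_c)

# slopes are identical
c(coef(fit_raw)[["x"]], coef(fit_centred)[["x_c"]])

# the centred intercept is the expected outcome at the average x,
# which is just the sample mean of y
c(coef(fit_centred)[["(Intercept)"]], mean(y))
```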
Model evaluation
Reviewers will sometimes ask you to assess model fit.
A simple but flawed way to assess model fit is to compare a model with one covariate against an intercept-only model, assessing the improvement in either the AIC statistic or the BIC statistic. The BIC is similar to the AIC but imposes a larger penalty on extra predictors. By convention, an absolute improvement of more than 10 in either statistic is taken to indicate a "better" model.
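Both statistics start from the log-likelihood: AIC adds a penalty of 2 per estimated parameter, while BIC adds log(n) per parameter, which is why BIC punishes extra predictors more heavily in larger samples. A base-R check on simulated data:

```r
set.seed(10)
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

ll <- as.numeric(logLik(fit))
k  <- attr(logLik(fit), "df")  # parameters: intercept, slope, residual sd
n  <- length(y)

# hand-computed values match the built-ins
c(-2 * ll + 2 * k, AIC(fit))
c(-2 * ll + log(n) * k, BIC(fit))
```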
We can use the performance package to generate a table that compares model fits.
# load library
library(performance)
# intercept only
fig_intercept_only <- lm(daughter_height ~ 1, data = df_pearson_lee)
# covariate added
fig_covariate <- lm(daughter_height ~ mother_height, data = df_pearson_lee)
# evaluate
performance::compare_performance(fig_intercept_only, fig_covariate)
What was the model "improvement?"
# improved fit
BIC(fig_intercept_only) - BIC(fig_covariate)
Generate a report
This is easy with the report package.
For example:
report::report_statistics(fig_covariate)
Or, if you want a longer report:
report::report(fig_covariate)
Use "statistically significant" in place of "significant." This will avoid misleading your audience into thinking your result is important when what you intend to communicate is that it is reliable.
Assumptions of Regression
From Gelman and Hill (Gelman & Hill, 2006):
- Validity. The most important assumption is that the data you are analysing should map to the research question you are trying to answer (Gelman et al., 2020, p. 152).
- Representativeness. A regression model is fit to data and is used to make inferences about a larger population; hence the implicit assumption in interpreting regression coefficients is that the sample is representative of the population (Gelman et al., 2020, p. 153).
- Linearity. The most important mathematical assumption of the linear regression model is that its deterministic component is a linear function of the separate predictors: y = β₀ + β₁x₁ + β₂x₂ + ⋯. If additivity is violated, it might make sense to transform the data (for example, if y = abc, then log y = log a + log b + log c) or to add interactions. If linearity is violated, perhaps a predictor should be put in as log(x) or 1/x instead of simply x. Or a more complicated relationship could be expressed using a nonlinear function such as a spline or Gaussian process (Gelman et al., 2020, p. 153).
- Independence of errors. The simple regression model assumes that the errors from the prediction line are independent, an assumption that is violated in time-series, spatial, and multilevel settings (Gelman et al., 2020, p. 153).
- Equal variance of errors. Unequal variance does not affect the most important aspect of a regression model, which is the form of the predictors (Gelman et al., 2020, p. 153).
- Normality of errors. The regression assumption that is generally least important is that the errors are normally distributed. In fact, for the purpose of estimating the regression line (as compared to predicting individual data points), the assumption of normality is barely important at all. Thus, in contrast to many regression textbooks, we do not recommend diagnostics of the normality of regression residuals (Gelman & Hill, 2006, p. 46). A good way to diagnose violations of some of the assumptions (importantly, linearity) is to plot the residuals versus fitted values or individual predictors (Gelman & Hill, 2006, p. 46).
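The diagnostic Gelman and Hill recommend, residuals versus fitted values, takes two lines of base R; a sketch on simulated data:

```r
set.seed(99)
x <- rnorm(150)
y <- 2 + 0.75 * x + rnorm(150)
fit <- lm(y ~ x)

# look for curvature or fanning, not for normality
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# least squares guarantees the residuals average to (numerically) zero
mean(resid(fit))
```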
Common Confusions
Regressions do not automatically give us "effects"
People use the word "effect", but by default a regression coefficient describes an association, not a causal effect.
"Normality Assumption"
Gelman and Hill note that the "normality" assumption is the least important. The assumption pertains to the normality of residuals.
Statistical Independence
This is the motivation for doing multi-level modelling: to condition on dependencies in the data. However, multi-level modelling can produce new problems if the error terms in the model are not independent of the outcome.
External Validity (wrong population)
We sample from undergraduates, but infer about the human population.
The concept of "better model fit" is relative to our interests and purposes. A better (lower) AIC statistic does not tell us whether a model is better for causal inference. We must assess whether a model satisfies the assumptions necessary for valid causal inference.
Simulation Demonstration 1: How Regression Coefficients Mislead
Methodology
- Data generation: we simulate a dataset for 1,000 individuals, where religious service attendance (A) affects wealth (L), which in turn affects charitable donations (Y). The simulation is based on predefined parameters that establish L as a mediator between A and Y.
- Parameter definitions: the probability of religious service (A = 1) is set at 0.5. The effect of A on L (wealth) is given by beta = 2. The effect of L on Y (charity) is given by delta = 1.5. Standard deviations for L and Y are set at 1 and 1.5, respectively.
- Model specifications: Model 1 (incorrect assumption) is a linear regression that treats L as a confounder, including both A and L as regressors on Y. According to the rules of d-separation (last week): to identify the total effect of A on Y we must not include L. Model 2 (correct model) is a linear regression that includes only A as a regressor on Y and omits the mediator L; this model aligns with the data-generating process, which identifies L as a mediator.
- Analysis and comparison: the analysis compares the estimated effects of A on Y under both model specifications. By including L as a predictor in Model 1, we induce mediation bias. Model 2 correctly excludes L from the model.
- Presentation: the results are displayed in a comparative table contrasting the regression coefficients and significance levels obtained under each model.
# simulation seed
set.seed(123) # reproducibility
# define the parameters
n <- 1000 # number of observations
p <- 0.5 # probability of A = 1
alpha <- 0 # intercept for L
beta <- 2 # effect of A on L
gamma <- 1 # intercept for Y
delta <- 1.5 # effect of L on Y
sigma_L <- 1 # standard deviation of L
sigma_Y <- 1.5 # standard deviation of Y
# simulate the data: fully mediated effect by L
A <- rbinom(n, 1, p) # binary exposure variable
L <- alpha + beta * A + rnorm(n, 0, sigma_L) # mediator L affected by A
Y <- gamma + delta * L + rnorm(n, 0, sigma_Y) # Y affected only by L
# make the data frame
data <- data.frame(A = A, L = L, Y = Y)
# fit regression in which L is assumed to be a confounder
example_fit_1 <- lm(Y ~ A + L, data = data)
# fit regression in which L is assumed to be a mediator
example_fit_2 <- lm(Y ~ A, data = data)
# create gtsummary tables for each regression model
table1 <- gtsummary::tbl_regression(example_fit_1)
table2 <- gtsummary::tbl_regression(example_fit_2)
# merge the tables for comparison
table_comparison <- gtsummary::tbl_merge(
list(table1, table2),
tab_spanner = c("Model: Wealth assumed confounder",
"Model: Wealth assumed to be a mediator")
)
# make markdown table
markdown_table_0 <- as_kable_extra(table_comparison,
format = "markdown",
booktabs = TRUE)
# print markdown table
markdown_table_0
Compare model fits
By all metrics, Model 1 fits better but it is confounded
performance::compare_performance(example_fit_1, example_fit_2)
Model 1 exhibits mediator bias, but it has a considerably higher R² and a lower BIC.
Focussing on the BIC (lower is better): Model 1 fits better, but it is confounded.
BIC(example_fit_1) - BIC(example_fit_2)
In this simulation, we discovered that if we assess "effect" by "model fit" we will get the wrong estimate for the coefficient of interest: the better-fitting model reports an effect of A near zero, although the true total effect is substantial. The use of model fit perpetuates the causality crisis in psychological science (Bulbulia et al., 2022). Few are presently aware of this crisis. Causal diagrams show us how we can do better.
Simulation Demonstration 2: How Regression Coefficients Mislead
Methodology
- Data generation: we simulate a dataset for 1,000 individuals, where religious service attendance (A) affects wealth (L), and charitable donations (Y) also affect wealth (L), but A and Y are not causally related. Wealth is therefore a collider.
- Parameter definitions: the probability of religious service (A = 1) is set at 0.5. Y has a mean of 0 and sd = 1, and is independent of A. L is a linear function of A and Y.
- Model specifications: Model 1 (incorrect) controls for L. Model 2 (correct) does not control for L.
- Analysis and comparison: compare the two models.
# simulation seed
set.seed(123) # reproducibility
# define parameters
n <- 1000 # number of observations
p <- 0.5 # probability of A = 1
# simulate the data
A_1 <- rbinom(n, 1, p) # binary exposure variable
Y_1 <- rnorm(n, 0, 1) # Y independent of A
L_2 <- rnorm(n, mean = A_1 + Y_1, sd = 1) # wealth: a collider, caused by both A and Y
# make the data frame
data_collider <- data.frame(A = A_1, L = L_2, Y = Y_1)
# fit regression controlling for collider L
collider_example_fit_1 <- lm(Y ~ A + L, data = data_collider)
# fit regression without collider
collider_example_fit_2 <- lm(Y ~ A, data = data_collider)
summary(collider_example_fit_1)
# create gtsummary tables for each regression model
collider_table1 <- gtsummary::tbl_regression(collider_example_fit_1)
collider_table2 <- gtsummary::tbl_regression(collider_example_fit_2)
# merge the tables for comparison
collider_table_comparison <- gtsummary::tbl_merge(
list(collider_table1, collider_table2),
tab_spanner = c("Model: Wealth included (assumed confounder)",
"Model: Wealth excluded")
)
# make markdown table
markdown_table_1 <- as_kable_extra(collider_table_comparison,
format = "markdown",
booktabs = TRUE)
# print table
markdown_table_1
Compare model fits
By all metrics, Model 1 fits better but it is confounded.
A is not causally associated with Y. We generated the data, so we know. However, the confounded model shows a better fit, and in this model the coefficient for A is statistically significant (and negative).
performance::compare_performance(collider_example_fit_1, collider_example_fit_2)
Again, the BIC (lower is better) for Model 1 is better, but it is entirely confounded.
BIC(collider_example_fit_1) - BIC(collider_example_fit_2)
How does collider bias work? Wealth is caused by both religious attendance and charity. Once we condition on wealth by including it in the model, its two causes become informative about each other: among people of equal wealth, those who do not attend religious service must, on average, be more charitable (their wealth has to come from somewhere), and those who are charitable but not wealthy are unlikely to attend. Conditioning on the collider thereby induces a spurious negative association between attendance and charity.
However, in this simulation, we know that religious attendance and charitable donations are not causally associated, because we have simulated the data.
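We can check the mechanism numerically: attendance and charity are unconditionally independent, but within a narrow band of wealth they become negatively correlated. A base-R sketch mirroring the simulation above (the band width is an arbitrary illustration):

```r
set.seed(123)
n <- 10000
A <- rbinom(n, 1, 0.5)    # religious attendance
Y <- rnorm(n)             # charity, independent of A
L <- A + Y + rnorm(n)     # wealth: a collider, caused by both

# unconditionally, A and Y are (nearly) uncorrelated
cor(A, Y)

# "control for" wealth by looking within a narrow band of L
band <- abs(L - 1) < 0.25
cor(A[band], Y[band])     # clearly negative
```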
In this simulation, we discovered that if we assess "effect" by "model fit", we get the wrong scientific inference. The use of model fit perpetuates the causality crisis in psychological science (Bulbulia et al., 2022). Let us do better.
Resources on Regression that Do Not Perpetuate the Causality Crisis
- Statistical Rethinking (McElreath, 2020)
- Regression and Other Stories (Gelman et al., 2020)
Exercises
Setup
Libraries
library(tidyverse)
library(patchwork)
library(lubridate)
library(kableExtra)
library(gtsummary)
Get data
# package with data
library(margot)
# load data
data("df_nz")
Import Pearson and Lee mother's and daughters data
df_pearson_lee <-
data.frame(read.table(
url(
"https://raw.githubusercontent.com/avehtari/ROS-Examples/master/PearsonLee/data/MotherDaughterHeights.txt"
),
header = TRUE
))
# centre mother's height for later example
df_pearson_lee <- df_pearson_lee |>
dplyr::mutate(mother_height_c = as.numeric(scale(
mother_height, center = TRUE, scale = FALSE
)))
Q1. Create a descriptive table and a descriptive graph for the hlth_weight and hlth_height variables in the df_nz dataset
Select hlth_height, hlth_weight from the df_nz dataset. Filter only the 2019 wave. Create a descriptive table and graph these two variables. Annotate your workflow (at each step, describe what you are doing and why).
Q2. Regression height ~ weight and report results
Using the df_nz dataset, write a regression model for height as predicted by weight. Create a table for your results. Create a graph/graphs to clarify the results of your regression model. Briefly report your results.
Q3. Regress height ~ male and report results
Using the df_nz dataset, write a regression model for height as predicted by male. Create a table for your results. Create a graph/graphs to clarify the results of your regression model. Briefly report your results.
Q4. Regression to predict
Using the regression coefficients from the Pearson and Lee 1903 dataset, predict the heights of daughters of women in the df_nz dataset.
Q5. Bonus
On average, how much taller or shorter are women in New Zealand as sampled in the 2019 df_nz dataset compared with women in 1903 as sampled in the Pearson and Lee dataset? Clarify your inference.
Solutions
Solutions: Setup
## libraries
library("tidyverse")
library("patchwork")
library("lubridate")
library("kableExtra")
library("gtsummary")
# get data
library("margot")
data("df_nz")
# import pearson and lee mother's and daughters data
# in 1903, pearson and lee collected 5,524 observations from mother/daughter height pairs.
df_pearson_lee <- data.frame(read.table(
url(
"https://raw.githubusercontent.com/avehtari/ROS-Examples/master/PearsonLee/data/MotherDaughterHeights.txt"
),
header = TRUE
))
# centre mother's height for later example
df_pearson_lee <- df_pearson_lee |>
dplyr::mutate(mother_height_c = as.numeric(scale(
mother_height, center = TRUE, scale = FALSE
)))
Solution Q1: Descriptive table
# required libraries
library(gtsummary)
library(dplyr)
library(margot)
# select focal variables and rename them for clarity
df_nzdat <- df_nz |>
dplyr::filter(wave == 2019) |>
dplyr::select(hlth_weight, hlth_height, male) |>
dplyr::rename(weight = hlth_weight,
height = hlth_height) |>
dplyr::select(weight, height, male) |>
dplyr::mutate(weight_c = as.numeric(scale(
weight, scale = FALSE, center = TRUE
)))
# create table
df_nzdat |>
dplyr::select(weight, height) |>
gtsummary::tbl_summary(
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} / {N} ({p}%)"
),
digits = all_continuous() ~ 2,
missing_text = "(Missing)"
) |>
bold_labels()
Here is another approach:
# libs we need
library(table1)
library(tidyverse)
library(tidyr)
library(margot)
# filter 2019 wave
df_nz_1 <- df_nz |>
dplyr::filter(wave == 2019)
# nicer labels
table1::label(df_nz_1$hlth_weight) <- "Weight"
table1::label(df_nz_1$hlth_height) <- "Height"
# table
table1::table1(~ hlth_weight + hlth_height, data = df_nz_1)
Create graph:
library("patchwork")
# create data set where filtering na vals
df_nzdat1 <- df_nzdat |>
dplyr::filter(!is.na(weight),
!is.na(height),
!is.na(male))
weight_density <- ggplot2::ggplot(data = df_nzdat1, aes(x = weight)) +
geom_density(fill = "chocolate2") +
labs(title = "Density plot of weight of NZ sample years 2019/2020") +
theme_classic()
weight_density
height_density <- ggplot2::ggplot(data = df_nzdat1, aes(x = height)) +
geom_density(fill = "blue2") +
labs(title = "Density plot of height of NZ sample years 2019/2020") +
theme_classic()
height_density
weight_density / height_density + plot_annotation(tag_levels = "a")
Here is a density plot:
library(ggplot2)
ggplot2::ggplot(data = df_nzdat1, aes(x = weight, fill = as.factor(male))) +
geom_density() +
labs(title = "Density plot of weight of NZ sample years 2019/2020") +
theme_classic() +
scale_fill_viridis_d()
Solution Q2: Regress height ~ weight and report results
Model:
# regression of height ~ weight
fit_1 <- lm(height ~ weight_c, data = df_nzdat)
Table:
library(parameters)
# table of model
parameters::model_parameters(fit_1) |>
print_html(digits = 2,
select = "{coef}{stars}|({ci})",
column_labels = c("Estimate", "95% CI"))
Prediction:
library(ggeffects)
library(ggplot2)
# generate the plot
plot(ggeffects::ggpredict(fit_1, terms = "weight_c"))
Briefly report your results (note: replace "significant" with "statistically significant"):
report::report(fit_1)
Solution Q3: Regress height ~ male and report results
Model and table:
# regression of height ~ male
fit_2 <- lm(hlth_height ~ male, data = df_nz)
sjPlot::tab_model(fit_2)
Graph:
# plot over the range of the data
plot_fit_2 <- plot(
ggeffects::ggpredict(fit_2, terms = "male[all]",
jitter = .1,
add.data = TRUE,
dot.alpha = .2
))
# plot
plot_fit_2 +
scale_y_continuous(limits = c(1.2, 2.1))
Report:
report::report(fit_2)
Solution Q4: Regression to predict
Using the regression coefficients from the Pearson and Lee 1903 dataset, predict the heights of daughters of women in the df_nz dataset.
# model for daughter height from mother height
fit_3 <- lm(daughter_height ~ mother_height, data = df_pearson_lee)
# create data frame of not_male respondents in 2019
# notice problem: not_male != woman
# additionally, woman != mother!
df_nz_2 <- df_nz |>
filter(male == 0) |>
dplyr::select(hlth_height) |>
dplyr::mutate(mother_height = hlth_height * 39.37) |> # metres to inches
dplyr::select(mother_height) |>
dplyr::arrange(mother_height)
# find min and max heights, store as objects
min_mother_height <- min(df_nz_2$mother_height, na.rm = TRUE)
max_mother_height <- max(df_nz_2$mother_height, na.rm = TRUE)
# expand grid, use stored objects to define boundaries
df_expand_grid_nz_2 <-
expand.grid(mother_height = seq(
from = min_mother_height,
to = max_mother_height,
length.out = 200
))
# use the predict function to create a new response using the pearson and lee regression model
predict_fit_3 <-
predict(fit_3,
type = "response",
interval = "confidence",
newdata = df_expand_grid_nz_2)
# combine variables into a data frame
df_expand_grid_nz_2_predict_fit_3 <-
data.frame(df_expand_grid_nz_2, predict_fit_3)
# graph the expected average hypothetical heights of "daughters"
predplot2 <-
ggplot(data = df_expand_grid_nz_2_predict_fit_3,
aes(x = mother_height, y = fit)) +
geom_line(colour = "cadetblue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = .1) +
scale_x_continuous(limits = c(50, 75)) +
scale_y_continuous(limits = c(50, 75)) +
theme_classic() +
xlab("NZ 2019 female population") +
ylab("Predicted daughter heights in inches") +
labs(title = "Regression prediction for hypothetical daughter heights of NZ population in 2019")
# plot
predplot2
Solution Q5: Bonus
On average, how much taller or shorter are women in New Zealand as sampled in 2019 compared with women in 1903 as sampled in the Pearson and Lee dataset?
# create var for 1903 dataset
df_pearson_lee_2 <- df_pearson_lee |>
dplyr::select(mother_height, daughter_height) |>
tidyr::pivot_longer(everything(), names_to = "relation", values_to = "f_height") |>
dplyr::mutate(year_is = factor("1903"))
# create var for the 2019 dataset
df_nz_combine_pearson_lee <- df_nz_2 |>
dplyr::rename(f_height = mother_height) |>
dplyr::mutate(year_is = factor("2019"),
relation = factor("mother"))
head(df_nz_combine_pearson_lee)
# combine data frames row-wise
df_nz_combine_pearson_lee_1 <- rbind(df_pearson_lee_2, df_nz_combine_pearson_lee)
Look at heights in sample:
table(df_nz_combine_pearson_lee_1$year_is)
# box plot
ggplot2::ggplot(data = df_nz_combine_pearson_lee_1,
aes(x = year_is, y = f_height, fill = year_is)) +
geom_boxplot(notch = TRUE) +
labs(title = "Comparison of female height 1903/2019") +
theme_classic() +
scale_fill_viridis_d()
Predict heights out of sample:
# regression model
fit_4 <- lm(f_height ~ year_is, data = df_nz_combine_pearson_lee_1)
# table
sjPlot::tab_model(fit_4)
Women in 2019 are taller.
report::report(fit_4)
Graph:
# get meaningful values for the y-axis range
min_mother_height_for_range <- min(df_nz_combine_pearson_lee_1$f_height, na.rm = TRUE)
max_mother_height_for_range <- max(df_nz_combine_pearson_lee_1$f_height, na.rm = TRUE)
# make graph
plot(
ggeffects::ggpredict(fit_4, terms = "year_is",
add.data = TRUE,
dot.alpha = .2
)) +
labs(title = "Predicted difference in female heights 1903/2019") +
xlab("Study year") +
ylab("Predicted female height (inches)") +
scale_y_continuous(
limits = c(min_mother_height_for_range, max_mother_height_for_range))
Graph with predicted points:
ggeffects::ggpredict(fit_4, terms = "year_is")
# graph with predicted values based on the model
plot(
ggeffects::ggpredict(fit_4, terms = "year_is",
add.data = TRUE,
dot.alpha = .01,
colors = "us",
jitter = .5)) +
labs(title = "Predicted difference in female heights 1903/2019") +
xlab("Study year") +
ylab("Predicted female height (inches)") +
scale_y_continuous(
limits = c(min_mother_height_for_range, max_mother_height_for_range))
Appendix: Conceptual Background
Some preliminaries about science.
Science begins with a question
Science begins with a question about the world. The first step in science, then, is to clarify what you want to know.
Because science is a social practice, you will also need to clarify why your question is interesting: so what?
In short, know your question.
Scientific model (or theory)
Sometimes, scientists are interested in specific features of the world: how did virus x originate? Such a question might have a forensic interest: what constellation of events gave rise to a novel infectious disease?
Scientists typically seek generalisations. How do infectious diseases evolve? How do biological organisms evolve? Such questions have applied interests. How can we better prevent infectious diseases? How did life originate?
A scientific model proposes how nature is structured (and unstructured). For example, the theory of evolution by natural selection proposes that life emerges from variation, inheritance, and differential reproduction/survival.
To evaluate a scientific model, scientists must make generalisations beyond individual cases. This is where statistics shines.
What is statistics?
Mathematics is a logic of certainty.
Statistics is a logic of uncertainty.
A statistical model uses the logic of probability to make better guesses.
Applications of statistical models in science
Scientific models seek to explain how nature is structured. Where scientific models conflict, we can combine statistical models with data collection to evaluate the credibility of one theoretical model over others. To do this, a scientific model must make distinct, non-trivial predictions about the world.
If the predictions are not distinct, the observations will not enable a shift in credibility for one theory over another. Consider the theory that predicts any observation. Such a theory would be better classified as a conspiracy theory; it is compatible with any evidence.
- The relationship of probability distributions and data-generating processes is complex, intriguing, and both historically and philosophically rich. Because our interests are applied, we will hardly touch upon this richness in this course. ↩
- Later, we shall prefer a different way of writing regression equations in maths. (Note: writing maths is not maths; it is just encoding the model that we have written.) ↩