Lab 3: Regression and Bias Mechanisms

R script

Download the R script for this lab (right-click → Save As)

This lab introduces regression as a tool for describing associations and then shows why regression coefficients do not become causal effects simply because we fit a model. We begin with a simple simulated regression, and then we examine three common sources of bias: omitted variable bias, mediator bias, and collider bias.

What you will learn

  1. How to simulate a simple regression model and interpret its slope.
  2. Why population inference depends on assumptions, not on software output alone.
  3. How omitted variable bias arises when a common cause is left out.
  4. How mediator bias arises when we condition on a variable on the causal pathway.
  5. How collider bias arises when we condition on a common effect.

Packages

required_packages <- c("tidyverse", "parameters", "report", "ggeffects", "ggdag")
missing_packages <- required_packages[
  !vapply(required_packages, \(pkg) requireNamespace(pkg, quietly = TRUE), logical(1))
]

if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

library(tidyverse)
library(parameters)
library(report)
library(ggeffects)
library(ggdag)

Regression refresher

A regression model describes how the expected value of an outcome changes with a predictor. In this lab, the first simulation uses study hours as the predictor and exam score as the outcome.

set.seed(123)
n <- 200
study_hours <- rnorm(n, mean = 10, sd = 2)
exam_score <- 50 + 3 * study_hours + rnorm(n, mean = 0, sd = 6)

df_regression <- tibble(
  study_hours = study_hours,
  exam_score = exam_score
)

fit_regression <- lm(exam_score ~ study_hours, data = df_regression)
summary(fit_regression)
report::report(fit_regression)

The coefficient for study_hours tells us how much the expected exam score changes for each additional hour of study. In this simulation, the data-generating slope is 3, so the fitted estimate should land close to 3, with some sampling noise from the error term.

We can visualise the fitted line with raw data:

predicted_values <- ggeffects::ggpredict(fit_regression, terms = "study_hours")
plot(predicted_values, dot_alpha = 0.25, show_data = TRUE, jitter = 0.05)

A reminder about inference

Regression helps us summarise patterns in data, but it does not settle causal questions on its own. A coefficient can be statistically precise and still be causally misleading if the model conditions on the wrong variables, omits an important pre-treatment cause, or conditions on a post-treatment variable that should have been left alone.

This is why model fit is not the same as causal identification. A model with a lower BIC or higher $R^2$ can still answer the wrong causal question.
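To make that concrete, here is a small self-contained sketch (it previews the collider setup covered later in this lab): conditioning on a common effect of a and y raises $R^2$ substantially while biasing the coefficient of interest.

```r
# Self-contained sketch: better fit is not better identification.
set.seed(1)
n <- 2000
a <- rnorm(n)              # treatment, truly unrelated to the outcome
y <- rnorm(n)              # outcome
c_var <- a + y + rnorm(n)  # a common effect of both

fit_plain <- lm(y ~ a)
fit_overfit <- lm(y ~ a + c_var)

summary(fit_plain)$r.squared    # close to zero
summary(fit_overfit)$r.squared  # much larger
coef(fit_overfit)[["a"]]        # clearly negative, though the true effect is zero
```

The second model fits far better, yet its coefficient for a answers the wrong causal question.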

Omitted variable bias

Omitted variable bias occurs when a common cause of treatment and outcome is not included in the model. In the script, l causes both a and y, but a has no causal effect on y.

set.seed(434)
n <- 2000
l <- rnorm(n)
a <- 0.9 * l + rnorm(n)
y <- 1.2 * l + rnorm(n)

df_fork <- tibble(y = y, a = a, l = l)

fit_fork_naive <- lm(y ~ a, data = df_fork)
fit_fork_adjusted <- lm(y ~ a + l, data = df_fork)

parameters::model_parameters(fit_fork_naive)
parameters::model_parameters(fit_fork_adjusted)

The naive model reports an association between a and y because both are driven by l. Once we adjust for l, the coefficient for a should collapse toward zero. This is the logic of adjustment for observed pre-treatment confounding.

The script also draws a DAG for this pattern:

dag_fork <- dagify(
  y ~ l,
  a ~ l,
  exposure = "a",
  outcome = "y"
) |>
  tidy_dagitty(layout = "tree")

ggdag(dag_fork) +
  theme_dag_blank()

Mediator bias

Mediator bias occurs when we condition on a variable that lies on the pathway from treatment to outcome, even though our goal is to estimate the total effect.

set.seed(435)
n <- 2000
a <- rbinom(n, 1, 0.5)
m <- 1.5 * a + rnorm(n)
y <- 2 * m + rnorm(n)

df_pipe <- tibble(y = y, a = a, m = m)

fit_pipe_total <- lm(y ~ a, data = df_pipe)
fit_pipe_overcontrolled <- lm(y ~ a + m, data = df_pipe)

parameters::model_parameters(fit_pipe_total)
parameters::model_parameters(fit_pipe_overcontrolled)

Here the total effect of a on y operates through m. If we condition on m, we block the pathway that carries the treatment effect. As a result, the coefficient for a shrinks toward zero, even though the treatment really does matter.

In words, if you want the total effect, you do not control for the mediator.
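Because the simulation uses linear paths, the total effect is just the product of the path coefficients: a → m is 1.5 and m → y is 2, so the total effect of a on y is 3. A self-contained check:

```r
# Re-simulate the pipe and verify the product-of-paths arithmetic.
set.seed(435)
n <- 2000
a <- rbinom(n, 1, 0.5)
m <- 1.5 * a + rnorm(n)   # a -> m path: 1.5
y <- 2 * m + rnorm(n)     # m -> y path: 2

coef(lm(y ~ a))[["a"]]      # total effect, close to 1.5 * 2 = 3
coef(lm(y ~ a + m))[["a"]]  # close to 0 once the pathway is blocked
```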

Collider bias

Collider bias occurs when we condition on a variable that is caused by both the treatment and the outcome. Before conditioning, the treatment and outcome may be unrelated. After conditioning, we create a spurious association.

set.seed(436)
n <- 2000
a <- rnorm(n)
y <- rnorm(n)
c_var <- a + y + rnorm(n)

df_collider <- tibble(y = y, a = a, c_var = c_var)

fit_collider_unadjusted <- lm(y ~ a, data = df_collider)
fit_collider_adjusted <- lm(y ~ a + c_var, data = df_collider)

parameters::model_parameters(fit_collider_unadjusted)
parameters::model_parameters(fit_collider_adjusted)

The unadjusted model should show little or no relationship between a and y. The adjusted model should manufacture one: conditioning on c_var induces a spurious (here, negative) association between variables that are causally unrelated. That is collider bias.
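The script draws a DAG only for the fork; an analogous diagram for the collider is easy to sketch in the same style (this block is an addition, not part of the downloadable script):

```r
library(ggdag)

# c_var is a common effect: arrows point from a and y into the collider.
dag_collider <- dagify(
  c_var ~ a + y,
  exposure = "a",
  outcome = "y"
) |>
  tidy_dagitty(layout = "tree")

ggdag(dag_collider) +
  theme_dag_blank()
```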

Compare the three bias patterns

The script finishes by collecting the treatment coefficient from each simulation into a single table:

results <- tibble(
  scenario = c(
    "omitted variable bias",
    "omitted variable bias",
    "mediator bias",
    "mediator bias",
    "collider bias",
    "collider bias"
  ),
  model = c(
    "naive",
    "adjusted for l",
    "total effect",
    "overcontrolled",
    "do not condition",
    "condition on collider"
  ),
  estimate = c(
    coef(fit_fork_naive)[["a"]],
    coef(fit_fork_adjusted)[["a"]],
    coef(fit_pipe_total)[["a"]],
    coef(fit_pipe_overcontrolled)[["a"]],
    coef(fit_collider_unadjusted)[["a"]],
    coef(fit_collider_adjusted)[["a"]]
  )
) |>
  mutate(estimate = round(estimate, 3))

print(results)

This comparison is the point of the lab. Different forms of conditioning error create different forms of bias. Regression does not rescue us from those mistakes. We have to supply the right causal structure.

Exercise

  1. Increase the strength of the common cause l in the omitted variable simulation. What happens to the naive coefficient?
  2. Increase the effect of a on m in the mediator simulation. What happens to the difference between the total-effect and overcontrolled models?
  3. Increase the contribution of a and y to the collider c_var. What happens to the adjusted coefficient?
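One way to approach exercise 3 is to wrap the collider simulation in a function with an adjustable path strength (the function name and interface here are illustrative, not part of the lab script):

```r
# Illustrative helper: how strongly does conditioning on the collider
# bias the adjusted coefficient as the paths into c_var get stronger?
simulate_collider_bias <- function(strength, n = 2000, seed = 436) {
  set.seed(seed)
  a <- rnorm(n)
  y <- rnorm(n)
  c_var <- strength * a + strength * y + rnorm(n)
  coef(lm(y ~ a + c_var))[["a"]]
}

simulate_collider_bias(strength = 1)  # moderate negative bias
simulate_collider_bias(strength = 3)  # stronger negative bias
```

The same wrapper pattern works for the other two exercises: parameterise the coefficient you are asked to vary and compare the resulting estimates.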

Take-home message

Regression is useful, but causal interpretation depends on design and assumptions. The practical rule is simple. Control for common causes. Do not control for mediators when estimating total effects. Do not control for colliders.