Week 3: Causal Diagrams — The Structures of Confounding Bias
Date: 11 Mar 2026
- M-bias
- Regression
- Intercept
- Regression coefficient
- Model fit
- Why model fit is misleading for causality
- Create a new `.R` file called `03-lab.R` with your name, contact, date, and a title such as "Regression and confounding bias."
- Copy and paste the code chunks below during class.
- Save in a clearly defined project directory.
You may also download the lab here: Download the R script for Lab 03
Seminar
Learning outcomes
By the end of this week, you will be able to:
- Define confounding bias formally and identify it in a causal diagram.
- State and apply the backdoor criterion to determine which variables to condition on.
- Explain why good regression model fit does not rule out confounding.
- Distinguish confounding problems that longitudinal data can resolve from those it cannot.
- Define M-bias and explain why conditioning on a pre-treatment collider introduces spurious associations.
What is confounding?
Last week we introduced the five elementary causal structures and the four rules of confounding control. This week we apply those tools to the central problem of observational causal inference: confounding.
Confounding exists when a common cause of both the treatment (A) and the outcome (Y) creates a non-causal (backdoor) path between them. If this path is left open, the observed association between A and Y conflates the true causal effect with a spurious correlation induced by the shared cause.
Confounding bias exists when there is at least one open backdoor path between the treatment A and the outcome Y. A backdoor path is any path from A to Y that begins with an arrow pointing into A.
Consider a simple example. Suppose we want to know whether exercise (A) causes lower blood pressure (Y). Health consciousness (L) is a common cause: health-conscious people exercise more and eat better, which independently lowers blood pressure. The DAG is A ← L → Y, with a direct arrow A → Y. The path A ← L → Y is a backdoor path. If we do not condition on L, our estimate of the effect of A on Y will be inflated because it includes the spurious association transmitted through L.
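This example is easy to simulate. The sketch below uses invented effect sizes (0.8, -0.5, and -0.7 are hypothetical, chosen for illustration): the true effect of exercise on blood pressure is -0.5, and we compare the unadjusted and adjusted regression estimates.

```r
# Simulated confounding: L = health consciousness, A = exercise,
# Y = blood pressure. All coefficients are hypothetical.
set.seed(123)
n <- 1e5
L <- rnorm(n)                       # common cause of A and Y
A <- 0.8 * L + rnorm(n)             # health consciousness increases exercise
Y <- -0.5 * A - 0.7 * L + rnorm(n)  # true causal effect of A on Y is -0.5

# Unadjusted: the backdoor path A <- L -> Y is open, so the estimate
# is more negative than -0.5 (spurious association included).
coef(lm(Y ~ A))["A"]

# Adjusted: conditioning on L blocks the backdoor path,
# recovering a value close to the true -0.5.
coef(lm(Y ~ A + L))["A"]
```

Comparing the two coefficients shows exactly how much of the raw association was transmitted through the shared cause rather than the causal arrow.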
The backdoor criterion
How do we know which variables to condition on? Pearl's backdoor criterion provides the answer.
A set of variables L satisfies the backdoor criterion relative to treatment A and outcome Y if:
- No element of L is a descendant of A (we do not condition on anything caused by the treatment).
- L blocks every backdoor path from A to Y.
When L satisfies the backdoor criterion, conditioning on L yields conditional exchangeability: Y(a) ⊥⊥ A | L (the potential outcomes are independent of treatment within levels of L).
The backdoor criterion translates the causal problem into a graphical one. Instead of reasoning about unobservable counterfactuals, we draw a DAG, list all backdoor paths, and check whether our measured covariates block them. If they do, the causal effect is identified. If they do not, we have unresolved confounding, and our estimate will be biased regardless of sample size or statistical sophistication.
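Software can check the criterion for us. The sketch below assumes the dagitty package is installed; it encodes the exercise DAG from above and asks for all minimal backdoor adjustment sets.

```r
# Assumes the dagitty package is installed: install.packages("dagitty")
library(dagitty)

# The exercise example: L is a common cause of A and Y.
g <- dagitty("dag {
  L -> A
  L -> Y
  A -> Y
}")

# List minimal sets satisfying the backdoor criterion for A -> Y.
# For this DAG the single minimal adjustment set is { L }.
adjustmentSets(g, exposure = "A", outcome = "Y")
```

Listing the backdoor paths by hand is good practice for small DAGs, but for the larger diagrams later in the course an automated check like this guards against missed paths.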
Confounding and regression
Regression is the standard tool for "controlling for" confounders. When we write Y = β₀ + β₁A + β₂L + ε, we are conditioning on L and interpreting β₁ as the effect of A on Y holding L constant.
- Intercept (β₀): the expected value of Y when all predictors are zero.
- Regression coefficient (β₁, β₂): the expected change in Y for a one-unit change in the predictor, holding other predictors constant.
- Model fit (R²): the proportion of variance in Y explained by the predictors.
A critical point for this course: good model fit does not indicate absence of confounding. A regression can explain 90% of the variance in Y (high R²) while conditioning on the wrong variables. If you condition on a mediator instead of a confounder, the model may fit well but the coefficient for A will be biased. If you condition on a collider, the model may again fit well but will introduce a spurious association that was not there before. Model fit is a statistical property of the relationship between predictors and outcome. Confounding is a structural (causal) property of the data-generating process. The two are logically independent.
A model can fit the data well while conditioning on the wrong variables. Conditioning on a mediator blocks the causal path and attenuates the estimate. Conditioning on a collider opens a spurious path and inflates or reverses the estimate. Neither error is detectable from R², residual plots, or any other goodness-of-fit diagnostic. Only a correctly specified causal diagram can distinguish confounders (condition on them) from mediators and colliders (do not).
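A few lines of simulation make the point concrete. In the sketch below (all effect sizes invented), A has no effect on Y at all, yet adding a collider C to the model dramatically improves R² while manufacturing a strong spurious coefficient.

```r
# Collider conditioning: better fit, worse estimate.
# True effect of A on Y is zero; A and Y both cause C.
set.seed(2026)
n <- 1e5
A <- rnorm(n)
Y <- rnorm(n)                    # A has no effect on Y
C <- A + Y + rnorm(n, sd = 0.5)  # collider of A and Y

fit_good <- lm(Y ~ A)      # correct model: coefficient near 0, low R^2
fit_bad  <- lm(Y ~ A + C)  # conditions on the collider

coef(fit_good)["A"]              # approximately 0 (correct)
coef(fit_bad)["A"]               # strongly negative: spurious association
summary(fit_good)$r.squared      # near 0
summary(fit_bad)$r.squared       # high: the "better-fitting" model is biased
```

The model that fits far better gives the wrong causal answer, and nothing in the regression output warns you.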
Confounding problems resolved by time-series data
Many confounding problems arise because cross-sectional data cannot distinguish cause from effect. When we measure A and Y at the same time, we cannot tell whether A caused Y, Y caused A, or a shared cause produced both. Longitudinal data collection, in which variables are measured at distinct time points, resolves several of these ambiguities.
The following figure presents seven confounding scenarios. In each case, the structural problem can be addressed by measuring the relevant variables in the correct temporal order: confounders before treatment, treatment before outcome.
The seven scenarios include reverse causation (where the apparent cause is actually the effect), confounding by a measured common cause (which can be conditioned on once it is measured before treatment), and cases where the temporal ordering of measurement clarifies which variables are mediators versus confounders. In each case, the solution is the same: measure confounders at baseline (L0), treatment at wave 1 (A1), and outcome at wave 2 (Y2). This three-wave panel design is the minimum structure for most observational causal questions in psychology.
Confounding problems not resolved by time-series data alone
Not all confounding problems can be solved by longitudinal data collection. The next figure presents six examples where time-series data are insufficient. Study these carefully before seminar.
M-bias arises when a pre-treatment variable M is a collider of two unmeasured confounders: one (U1) that affects the treatment and one (U2) that affects the outcome. The DAG forms an "M" shape: U1 → M ← U2, with U1 → A and U2 → Y. Without conditioning on M, the path A ← U1 → M ← U2 → Y is blocked (because M is a collider). Conditioning on M opens this path, creating a spurious association between A and Y that did not exist before.
M-bias is counterintuitive because M is measured before treatment and appears to be a plausible confounder. The instinct to "control for everything measured at baseline" can introduce bias rather than remove it. The only way to diagnose M-bias is through a causal diagram: it is invisible in the data.
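A short simulation makes M-bias concrete. In the sketch below (effect sizes hypothetical), U1 and U2 play the role of the unmeasured confounders and A has no causal effect on Y, yet adjusting for the pre-treatment collider M produces a nonzero coefficient.

```r
# M-bias: U1 and U2 are unmeasured; M is a pre-treatment collider.
set.seed(42)
n <- 1e5
U1 <- rnorm(n)
U2 <- rnorm(n)
M  <- U1 + U2 + rnorm(n)  # collider of the two unmeasured causes
A  <- U1 + rnorm(n)       # U1 affects treatment
Y  <- U2 + rnorm(n)       # U2 affects outcome; A has no effect on Y

coef(lm(Y ~ A))["A"]      # approximately 0: the path is blocked at M
coef(lm(Y ~ A + M))["A"]  # nonzero: conditioning on M opens the path
```

Note that the unadjusted model is the correct one here, even though M is a baseline covariate that "looks like" a confounder.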
Other scenarios in the figure include residual confounding (where unmeasured variables affect changes in confounders over time even after baseline adjustment), treatment-confounder feedback (where past treatments affect future confounders, creating cycles that standard regression cannot handle), and intermediary confounding in mediation analysis (where a variable that confounds the mediator-outcome relationship is itself affected by treatment, making natural indirect effects unidentifiable).
Worked example: the assumptions in causal mediation
The following figure illustrates the assumptions required for causal mediation analysis. When we decompose the total effect of A on Y into a direct effect and an indirect effect through the mediator M, we need not only the standard three assumptions (causal consistency, conditional exchangeability, positivity) but also the absence of treatment-induced confounding of the mediator-outcome relationship.
If a confounder L of the M → Y path is itself affected by treatment A, we face a dilemma: conditioning on it blocks part of the treatment effect (mediator bias), while failing to condition on it leaves the mediator-outcome path confounded. This dilemma has no solution within the standard regression framework and is one reason why causal mediation analysis requires stronger assumptions than total-effect estimation.
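The dilemma can be seen in a simulation (a sketch; all coefficients are invented). Here L confounds the M → Y path but is itself caused by A, so Y depends on A only through M and L.

```r
# Treatment-induced confounding of the mediator-outcome path.
# Structure: A -> L, A -> M, L -> M, L -> Y, M -> Y (true M effect = 0.4).
set.seed(7)
n <- 1e5
A <- rbinom(n, 1, 0.5)
L <- 0.6 * A + rnorm(n)             # confounder of M -> Y, caused by A
M <- 0.5 * A + 0.8 * L + rnorm(n)
Y <- 0.4 * M + 0.9 * L + rnorm(n)   # Y depends on M and L only

# Omit L: the coefficient on M is confounded (well above the true 0.4).
coef(lm(Y ~ A + M))["M"]

# Include L: the M coefficient is correct, but the A coefficient is now
# near zero because the A -> L -> Y pathway (part of the effect of A
# that does not run through M) has been blocked.
coef(lm(Y ~ A + M + L))[c("A", "M")]
```

Neither regression gives both pieces correctly, which is the point: no choice of covariates in a single regression resolves treatment-induced confounding.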
Lab materials: Lab 3: Regression, Graphing, and Simulation