Week 3: Causal Diagrams: The Structures of Confounding Bias

Slides

Date: 11 Mar 2026

Readings

Required

  • Hernán & Robins (2025), chapter 6.

Optional

Key concepts for the test(s)

  • Confounding bias
  • Backdoor path
  • Backdoor criterion
  • Valid adjustment set
  • M-bias
  • Regression
  • Intercept
  • Regression coefficient
  • Model fit
  • Why model fit is misleading for causality

Lab 3 setup

Use Lab 3: Regression, Graphing, and Simulation for this week's practical work. The optional script is here: Download the R script for Lab 03.


Week 2 introduced five elementary causal structures and the rules of d-separation. This week we use those structures to diagnose confounding bias: when does conditioning on a variable remove bias, and when does it create bias?

Seminar

Motivating example: higher $R^2$, worse identification

Suppose investigators want the total effect of an exercise programme ($A$) on cardiovascular risk ($Y$). They begin with a regression that adjusts for baseline confounders. Then they add body composition measured after the programme ($M$). The model $R^2$ rises, because body composition is a strong correlate of cardiovascular risk.

Did the higher $R^2$ improve the causal estimate? No. If the programme changes body composition and body composition changes cardiovascular risk, then $M$ is a mediator on the path $A \to M \to Y$. Conditioning on $M$ blocks part of the very effect the investigators wanted to estimate. The model fits the observed data better, but it answers the wrong causal question.
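The mediator problem above can be reproduced in a few lines of base R (the coefficients and variable names here are illustrative, not taken from the lab script):

```r
set.seed(1)
n <- 1e4
A <- rbinom(n, 1, 0.5)              # exercise programme (randomised here for simplicity)
M <- 0.8 * A + rnorm(n)             # body composition, measured after the programme
Y <- 0.5 * A + 1.0 * M + rnorm(n)   # cardiovascular risk; total effect of A = 0.5 + 0.8 * 1.0 = 1.3

fit_total <- lm(Y ~ A)       # targets the total effect: A coefficient near 1.3
fit_med   <- lm(Y ~ A + M)   # conditions on the mediator: A coefficient near 0.5

summary(fit_total)$r.squared  # lower R^2
summary(fit_med)$r.squared    # higher R^2, yet A's coefficient no longer estimates the total effect
```

Adding $M$ improves fit because $M$ strongly predicts $Y$, but the coefficient on $A$ now estimates only the part of the effect that bypasses body composition.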

A simple map for week 3

Week 3 asks one repeated question: which variables should we condition on, and why?

A practical workflow

  1. Draw the DAG and trace every backdoor path from $A$ to $Y$.
  2. Decide which variables would block those paths.
  3. Check that you are not conditioning on a mediator, a collider, or a descendant of treatment.

That is the logic behind the backdoor criterion. Regression is only one way of carrying out the conditioning decision.

Learning outcomes

By the end of this week, you will be able to:

  1. Define confounding bias and identify it in a DAG.
  2. Apply the backdoor criterion.
  3. Explain why good model fit does not rule out confounding.
  4. Distinguish confounding problems that time ordering can solve from those it cannot.
  5. Define M-bias and explain why conditioning on a pre-treatment collider can introduce bias.

What is confounding?

Confounding exists when a common cause of treatment $A$ and outcome $Y$ opens a non-causal backdoor path.

Definition: confounding bias

Confounding bias exists when at least one backdoor path from $A$ to $Y$ is open.

A backdoor path starts with an arrow into $A$.

Example: exercise and blood pressure. Health consciousness ($L$) may affect exercise ($A$). Health consciousness ($L$) may also affect blood pressure ($Y$). Then $A \leftarrow L \to Y$ is a backdoor path. If we do not condition on $L$, we mix causal association and spurious association.

Quick test

If a path from $A$ to $Y$ starts with an arrow into $A$, it is a backdoor path. It introduces bias only if it is open.
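The exercise-and-blood-pressure example can be simulated directly. A base-R sketch with illustrative coefficients:

```r
set.seed(2)
n <- 1e4
L <- rnorm(n)                        # health consciousness
A <- rbinom(n, 1, plogis(L))         # health-conscious people exercise more
Y <- -0.3 * A + 0.5 * L + rnorm(n)   # true effect of exercise on blood pressure: -0.3

coef(lm(Y ~ A))["A"]       # biased: the open path A <- L -> Y contaminates the estimate
coef(lm(Y ~ A + L))["A"]   # conditioning on L blocks the backdoor path: near -0.3
```

With these particular coefficients the unadjusted estimate is positive: confounding can reverse the apparent sign of an effect, not merely change its size.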

The backdoor criterion

Pearl's backdoor criterion tells us when an adjustment set is valid.

Definition: backdoor criterion

A set $L$ satisfies the backdoor criterion for $A$ and $Y$ if:

  1. No variable in $L$ is a descendant of $A$.
  2. $L$ blocks every backdoor path from $A$ to $Y$.

If both conditions hold, conditioning on $L$ yields conditional exchangeability: $Y(a) \coprod A \mid L$.

A short memory aid

  • Block all backdoor paths.
  • Do not adjust for descendants of treatment.

Pair exercise: applying the backdoor criterion

  1. Draw a DAG for the effect of exercise ($A$) on cardiovascular risk ($Y$) with three additional variables: health consciousness ($L_1$), diet ($L_2$), and body composition ($M$, a mediator on the $A \to Y$ path). Include arrows: $L_1 \to A$, $L_1 \to L_2$, $L_2 \to Y$, $L_1 \to Y$, $A \to M \to Y$.
  2. List all paths from $A$ to $Y$.
  3. Check whether $\{L_1\}$ satisfies the backdoor criterion. Does it block every backdoor path without conditioning on a descendant of $A$?
  4. Explain why adding $M$ to the adjustment set violates the backdoor criterion (which part of the causal path does it block?).

Confounding and regression

Regression is one way to condition on measured variables. For example,

$$ Y = \beta_0 + \beta_1A + \beta_2L + \varepsilon $$

Definition: key regression terms

  • Intercept ($\beta_0$): expected outcome when all covariates equal zero.
  • Coefficient ($\beta_1$): expected difference in outcome per unit difference in $A$, holding the other model terms fixed.
  • Model fit ($R^2$): proportion of outcome variance explained by the fitted model.

High $R^2$ does not imply no confounding. Fit is a statistical property. Confounding is a causal-structure property. A model can fit the observed data very well and still answer the wrong causal question.

Why model fit is misleading for causality

A model can fit very well and still be causally wrong:

  • Conditioning on a mediator blocks part of the target effect.
  • Conditioning on a collider opens a spurious path.

Neither problem is diagnosed by $R^2$.
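Both failure modes can be checked numerically. A base-R sketch of the collider case (coefficients are illustrative):

```r
set.seed(3)
n <- 1e4
A <- rnorm(n)
Y <- 0.4 * A + rnorm(n)   # true effect of A on Y: 0.4
C <- A + Y + rnorm(n)     # collider: a common effect of treatment and outcome

summary(lm(Y ~ A))$r.squared      # modest fit; A coefficient near 0.4
summary(lm(Y ~ A + C))$r.squared  # much better fit; A coefficient now biased
coef(lm(Y ~ A + C))["A"]          # conditioning on C opens A -> [C] <- Y
```

Adding the collider raises $R^2$ substantially while pushing the coefficient on $A$ far from the true value; in this simulation it even becomes negative.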

Confounding problems that time ordering can resolve

Cross-sectional measurements blur temporal order. Longitudinal design can resolve several ambiguities. A common strategy is:

  1. Measure confounders at baseline ($t_0$).
  2. Measure treatment at $t_1$.
  3. Measure outcome at $t_2$.

Confounding problems resolved by time-series data (adapted from Bulbulia, 2023)

Confounding problems that time ordering alone cannot resolve

Some problems persist even with multiple waves.

Confounding problems not resolved by time-series data (adapted from Bulbulia, 2023)

Definition: M-bias

M-bias occurs when investigators condition on a pre-treatment collider.

In the structure $U_1 \to L \leftarrow U_2$, with $U_1 \to A$ and $U_2 \to Y$, conditioning on $L$ opens a previously blocked path between $A$ and $Y$.

M-bias is important because "control for everything" is not a safe rule.

Pair exercise: M-bias in practice

  1. Consider the question "does religious attendance increase charitable giving?" Suppose neighbourhood social capital ($L$) is a collider of two unmeasured causes: one that affects attendance ($U_1$) and one that affects giving ($U_2$).
  2. Draw the DAG with $U_1 \to L \leftarrow U_2$, $U_1 \to A$, and $U_2 \to Y$.
  3. Trace what conditioning on $L$ does: which path opens?
  4. State in one sentence why "adjust for all pre-treatment variables" fails here.
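You can check your answer to the exercise by simulation. A base-R sketch in which the true effect of attendance on giving is zero (names and coefficients are illustrative):

```r
set.seed(4)
n <- 1e5
U1 <- rnorm(n)                    # unmeasured cause of attendance
U2 <- rnorm(n)                    # unmeasured cause of giving
L  <- U1 + U2 + rnorm(n)          # neighbourhood social capital: a pre-treatment collider
A  <- rbinom(n, 1, plogis(U1))    # religious attendance
Y  <- 0 * A + U2 + rnorm(n)       # charitable giving; true effect of A is zero

coef(lm(Y ~ A))["A"]       # near 0: the path through L is blocked when L is left alone
coef(lm(Y ~ A + L))["A"]   # conditioning on L opens A <- U1 -> [L] <- U2 -> Y: biased
```

Leaving the pre-treatment collider out of the model gives the right answer; "adjusting for everything measured before treatment" manufactures an association where none exists.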

Worked example: mediation assumptions

The assumptions in causal mediation (adapted from Bulbulia, 2023)

Mediation analysis needs stronger assumptions than total-effect analysis. Treatment-induced confounding of the mediator-outcome relation can make standard regression unsuitable.

Return to the opening example

Back to the exercise programme example. Higher $R^2$ did not answer the causal question, because adding post-treatment body composition changed the estimand. To estimate the total effect of the programme on cardiovascular risk, we need a defended DAG and a valid adjustment set, not just a better-fitting regression. This is why we separate modelling from causal identification.

What to remember for the test

  • Confounding is about open non-causal backdoor paths.
  • The backdoor criterion tells us when an adjustment set is valid.
  • Regression can implement conditioning, but it cannot tell us which variables should be conditioned on.
  • Better fit is not the same as better identification.

Confounding is one structural threat to causal identification. Week 4 adds two others: selection bias and measurement bias.

Pair exercise: $R^2$ versus identification

  1. Investigator A adjusts for age, income, and education ($R^2 = 0.42$). Investigator B adjusts for age and conscientiousness ($R^2 = 0.31$).
  2. Explain to your partner why higher $R^2$ does not imply less confounding.
  3. Propose a DAG where Investigator A's larger adjustment set introduces bias (hint: include a collider or mediator).
  4. State what would need to be true for Investigator B's smaller set to satisfy the backdoor criterion.

Further reading

All open access: Bulbulia (2024); Hernán & Robins (2025, chapter 6).


Lab materials: Lab 3: Regression, Graphing, and Simulation

Bulbulia, J. A. (2024). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35

Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (Draft). https://www.bradyneal.com/Introduction_to_Causal_Inference-Dec17_2020-Neal.pdf

Suzuki, E., Shinozaki, T., & Yamamoto, E. (2020). Causal Diagrams: Pitfalls and Tips. Journal of Epidemiology, 30(4), 153–162. https://doi.org/10.2188/jea.JE20190192