Self-Inflicted Injuries in Causal Inference

Common Errors in Observational Studies

Author: Joseph A. Bulbulia

Affiliation: Victoria University of Wellington, New Zealand

Published: August 12, 2025

Part 1: Motivating Example

1990s observational studies indicated a 30% reduction in all-cause mortality from estrogen therapy

Note: In the 1980s and 1990s, estrogen treatments appeared to benefit postmenopausal women.

Hazard ratio for all-cause mortality: 0.68 for current users vs. never users [@grodstein2006hormone]

Standard Medical Advice

Hormone therapy was endorsed for postmenopausal women by:

  • 1992 American College of Obstetricians and Gynecologists

  • 1992 American College of Physicians

  • 1993 National Cholesterol Education Program

  • 1996 American Heart Association

Women’s Health Initiative: Evaluate Estrogens Experimentally

  • Massive randomised, double-blind, placebo-controlled trial

  • 16,000 U.S. women aged 50-79 years

  • Randomly assigned to estrogen plus progestin therapy or placebo

  • Women followed approximately every year for up to 8 years.

Findings: Clear Discrepancy

Warning: Experimental findings were the opposite of the observational findings.

All-cause mortality Hazard Ratio: 1.23 for initiators vs non-initiators. [@manson2003estrogen]

Medical community response: Reject all observational studies

  • Can observational studies ever be trusted?
  • Should observational studies ever be funded again?

What is Causality?

David Hume’s Two Definitions in Enquiries (1751)

Definition 1:

We may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second…

Definition 2:

Or, in other words, where, if the first object had not been, the second never had existed.

Enquiries Concerning Human Understanding, and Concerning the Principles of Morals 1751

Our lives are filled with “What If?” questions

[Image: Sliding Doors movie poster - illustrating the “what if” concept]

[Image: Robert Frost poem - “Two roads diverged in a yellow wood”]

The Fundamental Problem of Causal Inference

To quantify a causal effect requires a counterfactual contrast:

\[\tau_{you} = \Big[Y_{\text{you}}(a =1) - Y_{\text{you}}(a=0)\Big]\]

Where \(Y(a)\) denotes the potential outcome under an intervention \(A = a\). Here, we assume a binary intervention. At any time, we may observe the outcome of only one of these two interventions.

However, we only observe facts, not counterfactuals

\[ \underbrace{Y_i(1) \mid A_i = 1}_{\text{observed}} \qquad \textcolor{red}{\underbrace{Y_i(0) \mid A_i = 1}_{\text{counterfactual}}} \]

Average treatment effects in randomised controlled experiments rest on assumptions

\[ \text{Average Treatment Effect} = \left[ \begin{aligned} &\left( \underbrace{\mathbb{E}[Y(1)\mid A = 1]}_{\text{observed}}\Pr(A=1) + \textcolor{cyan}{\underbrace{\mathbb{E}[Y(1)\mid A = 0]}_{\text{unobserved}}}\Pr(A=0) \right) \\ &- \left( \underbrace{\mathbb{E}[Y(0)\mid A = 0]}_{\text{observed}}\Pr(A=0) + \textcolor{cyan}{\underbrace{\mathbb{E}[Y(0)\mid A = 1]}_{\text{unobserved}}}\Pr(A=1) \right) \end{aligned} \right] \]
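Randomisation makes \(A\) independent of the potential outcomes, so the unobserved (cyan) terms equal their observed counterparts in expectation. A minimal simulation sketch (all names and values invented) illustrates why the simple difference in means then recovers the ATE:

Code
set.seed(1)
n  <- 1e5
Y1 <- rnorm(n, mean = 1)            # potential outcome under treatment
Y0 <- rnorm(n, mean = 0)            # potential outcome under control
A  <- rbinom(n, 1, 0.5)             # randomised assignment, independent of (Y1, Y0)
Y  <- ifelse(A == 1, Y1, Y0)        # we observe only one potential outcome per person
mean(Y[A == 1]) - mean(Y[A == 0])   # approximately 1, the true ATE mean(Y1 - Y0)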

Under identifying assumptions, we may infer causal effects from associations

\[ \text{ATE} = \sum_{l} \left( \mathbb{E}[Y|A=1, \textcolor{cyan}{L=l}] - \mathbb{E}[Y|A=0, \textcolor{cyan}{L=l}] \right) \times \textcolor{cyan}{\Pr(L=l)} \]

Where \(L\) is a set of measured covariates and \(A \coprod Y(a) \mid L\) (conditional exchangeability: treatment is independent of the potential outcomes given \(L\))
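A minimal sketch of this standardisation formula in R; the data frame `df`, with binary exposure `A`, outcome `Y`, and discrete covariate `L`, is a hypothetical stand-in:

Code
library(dplyr)

ate <- df |>
  group_by(L) |>
  summarise(
    gap = mean(Y[A == 1]) - mean(Y[A == 0]),  # E[Y | A = 1, L = l] - E[Y | A = 0, L = l]
    pr  = n() / nrow(df)                      # Pr(L = l)
  ) |>
  summarise(ATE = sum(gap * pr)) |>           # weighted sum over strata of L
  pull(ATE)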

Part 2: The First Self-Inflicted Error: The What Error

Paradigmatic Concern: Confounding by Common Cause

[DAG: Common Cause] - A confounding variable L causes both treatment A and outcome Y

Where assumptions justify, we may condition on measured confounders to obtain balance.

[DAG: Common Cause] - A confounding variable L causes both treatment A and outcome Y

[DAG: Adjusted for L] - By conditioning on L, we block the confounding path

Error 1 – What (over-adjustment)

“What Error #1”: Mediator Bias

[DAG: Mediator Bias] - Variable L mediates the effect of A on Y (A → L → Y)

Data Generating Process

Code
library(tidyverse)

set.seed(123)                           # reproducibility
n        <- 1000
service  <- rbinom(n, 1, 0.5)          # A: exposure
wealth   <- 2 * service + rnorm(n)     # L: mediator (A -> L)
charity  <- 1.5 * wealth + rnorm(n, sd = 1.5)  # Y: outcome (L -> Y)

sim1 <- tibble(service, wealth, charity)

# fit the two models compared below
fit_omit   <- lm(charity ~ service, data = sim1)           # Model A: omit L
fit_adjust <- lm(charity ~ service + wealth, data = sim1)  # Model B: control for L

Model comparison

Controlling for the mediator reverses the sign of the service coefficient.
| Characteristic | Model A: Beta | Model A: 95% CI | Model A: p-value | Model B: Beta | Model B: 95% CI | Model B: p-value |
|---|---|---|---|---|---|---|
| service | 2.9 | 2.6, 3.2 | <0.001 | -0.27 | -0.53, -0.01 | 0.043 |
| wealth |  |  |  | 1.6 | 1.5, 1.7 | <0.001 |

Model A omits L; Model B controls for L. Abbreviation: CI = Confidence Interval.

Which model looks better?

Code
library(performance)

compare_performance(fit_adjust, fit_omit, rank = TRUE)
# Comparison of Model Performance Indices

Name       | Model |    R2 | R2 (adj.) |  RMSE | Sigma | AIC weights
--------------------------------------------------------------------
fit_adjust |    lm | 0.682 |     0.682 | 1.473 | 1.475 |        1.00
fit_omit   |    lm | 0.312 |     0.311 | 2.169 | 2.171 |   2.88e-168

Name       | AICc weights | BIC weights | Performance-Score
-----------------------------------------------------------
fit_adjust |         1.00 |        1.00 |           100.00%
fit_omit   |    2.91e-168 |   3.35e-167 |             0.00%
Code
# BIC(fit_adjust) - BIC(fit_omit)  # negative → "better" fit

“What Error #2”: Collider Bias

[DAG: Collider Bias] - Variable L is caused by both A and Y (A → L ← Y)

Data Generating Process

Code
set.seed(2025)

n          <- 1000
service    <- rbinom(n, 1, 0.5)                            # A: exposure
donations  <- rnorm(n)                                     # Y: outcome (independent of A)
wealth     <- rnorm(n, mean = service + donations, sd = 1) # L: collider (A -> L <- Y)

sim2 <- tibble(service, wealth, donations)

# fit the two models compared below
fit_correct <- lm(donations ~ service, data = sim2)           # Model A: omit L
fit_biased  <- lm(donations ~ service + wealth, data = sim2)  # Model B: control for L

Model Comparison

Adding the collider creates a spurious, significant effect of A.
| Characteristic | Model A: Beta | Model A: 95% CI | Model A: p-value | Model B: Beta | Model B: 95% CI | Model B: p-value |
|---|---|---|---|---|---|---|
| service | 0.00 | -0.12, 0.13 | >0.9 | -0.53 | -0.63, -0.44 | <0.001 |
| wealth |  |  |  | 0.51 | 0.48, 0.54 | <0.001 |

Model A omits L; Model B controls for L (the collider). Abbreviation: CI = Confidence Interval.

Which model looks better?

Code
compare_performance(fit_biased, fit_correct, rank = TRUE)
# Comparison of Model Performance Indices

Name        | Model |        R2 |  R2 (adj.) |  RMSE | Sigma | AIC weights
--------------------------------------------------------------------------
fit_biased  |    lm |     0.526 |      0.525 | 0.707 | 0.708 |        1.00
fit_correct |    lm | 3.722e-06 | -9.983e-04 | 1.028 | 1.029 |   1.42e-162

Name        | AICc weights | BIC weights | Performance-Score
------------------------------------------------------------
fit_biased  |         1.00 |        1.00 |           100.00%
fit_correct |    1.43e-162 |   1.65e-161 |             0.00%
Code
BIC(fit_biased) - BIC(fit_correct)
[1] -740.425

Take-Home

Important

Relying on model fit perpetuates the causality crisis in psychology [@bulbulia2022].
Draw the DAG first; decide what belongs in the model before looking at numbers.
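One way to put this into practice: encode the DAG and query it for valid adjustment sets before fitting any model. A minimal sketch with the dagitty package, using the mediator example above:

Code
library(dagitty)

# service -> wealth -> charity: wealth is a mediator, not a confounder
g <- dagitty("dag { service -> wealth -> charity }")
adjustmentSets(g, exposure = "service", outcome = "charity")
# returns the empty set {}: adjust for nothing; conditioning on wealth is over-adjustment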

The What Error is widespread in experimental studies in the social sciences

  • “Overall, we find that 46.7% of the experimental studies published in APSR, AJPS, and JOP from 2012 to 2014 engaged in posttreatment conditioning (35 of 75 studies) …”

  • “About 1 in 4 drop cases or subset the data based on post-treatment criteria, and nearly a third include post-treatment variables as covariates.”

  • “Most tellingly, nearly 1 in 8 articles directly conditions on variables that the authors themselves show as being an outcome of the experiment – an unambiguous indicator of a fundamental lack of understanding … that conditioning on posttreatment variables can invalidate results from randomized experiments.”

  • “Empirically, then, the answer to the question of whether the discipline already understands posttreatment bias is clear: It does not.” [@montgomery2018]

Mediator Bias control strategy: Longitudinal Hygiene

[DAG: mediator bias, with variables indexed by time]

[DAG: time-ordered measurement: confounders measured before the exposure, exposure before the outcome]

How to Tame The What Error? Hide your future

Use Repeated Measures on the Same Individuals

Collider bias control strategy: Hide your future

[DAG: collider bias, with variables indexed by time]

[DAG: time-ordered measurement: confounders measured before the exposure, exposure before the outcome]

Collider bias by proxy control strategy: Hide your future

[DAG: collider bias by proxy: conditioning on a child (proxy) of a common cause]

[DAG: proxy-collider bias removed by time-ordered measurement]

Post-exposure collider bias control strategy: Hide your future

[DAG: post-exposure collider bias]

[DAG: time-ordered measurement: confounders measured before the exposure, exposure before the outcome]

Finally, our biggest worry: Unmeasured Common Causes

[DAG: unmeasured common cause of downstream variables]

Unmeasured common cause strategy: Hide your future

[DAG: unmeasured common cause of downstream variables]

[DAG: time-ordered measurement: confounders measured before the exposure, exposure before the outcome]

And because unmeasured confounding can never be fully ruled out, we should consistently report sensitivity analyses.
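For instance, an E-value quantifies how strong unmeasured confounding would need to be to explain away an estimate. A minimal sketch with the EValue package, applied to the WHI hazard ratio quoted earlier; `rare = FALSE` is an assumption that the outcome is not rare over follow-up:

Code
library(EValue)

# E-value for a hazard ratio of 1.23 (95% CI: 0.99, 1.53);
# rare = FALSE approximates the HR by a risk ratio for a common outcome (assumption)
evalues.HR(est = 1.23, lo = 0.99, hi = 1.53, rare = FALSE)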

Are longitudinal data + sensitivity analysis enough?

If the data we collect were like this: \[Y_{\text{time 0}} ~~...~~ A_{\text{time 1}}\]

We should not be tempted to model this.

[DAG: outcome at time 0 modelled as an effect of exposure at time 1 (reverse-causal model)]

Are longitudinal data + sensitivity analysis enough?

Why is this wrong? An exposure measured at time 1 cannot cause an outcome that was already measured at time 0.

[DAG: outcome at time 0 modelled as an effect of exposure at time 1 (reverse-causal model)]

Error 2 – When (time-zero)

Longitudinal data are not enough

Temporal ordering was not the flaw in the observational hormone studies of the 80s/90s: those studies modelled prevalent users (women already on therapy) rather than new initiators.

How?

Researchers failed to emulate an experiment in their data (Target Trial)

  • Women’s Health Initiative: overall hazard ratio 1.23 (0.99, 1.53)

  • Women’s Health Initiative: when broken down by years to follow-up:

    • 0-2 years 1.51 (1.06, 2.14)
    • 2-5 years 1.31 (0.93, 1.83)
    • 5 or more years 0.67 (0.41, 1.09)
Note: Survivor bias. Women harmed early leave the risk set, so longer-term users appear protected.

The observational results can be reconciled with the experimental results.

Note: Emulating a target trial with observational data recovers experimental effects.

Re-modelling recovers experimental findings [@hernan2016].

Visualising the When Error

Figure 1: Three ways to start the clock. Only one is right.

Target Trial Check List

Table 1: Target-trial specification for the hormone-therapy example.
| Element | Definition |
|---|---|
| Eligibility | Post-menopausal women aged 50–79, no prior CHD |
| Treatment | Initiate oestrogen + progestin on day of Rx pickup |
| Comparison | No hormone-therapy initiation on that day |
| Outcome | All-cause mortality |
| Follow-up | 8 years or until death / loss to follow-up |
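A minimal sketch of the emulation logic; the long-format data frame `d` and its columns (`id`, `visit`, `eligible`, `on_ht`, `prior_ht`) are hypothetical, not the published pipeline:

Code
library(dplyr)

trial <- d |>
  filter(eligible) |>                 # apply the eligibility criteria
  group_by(id) |>
  slice_min(visit, n = 1) |>          # time zero: each woman's first eligible visit
  ungroup() |>
  mutate(arm = if_else(on_ht == 1 & prior_ht == 0,
                       "initiator", "non-initiator"))
# follow-up for both arms starts at this visit, aligning time zero with assignment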

Error 3 – Who (population and heterogeneity)

Illustration

New Zealand Attitudes and Values Study (NZAVS): A step-by-step workflow for causal inference

The New Zealand Attitudes and Values Study:

  • Planned 20-year longitudinal study, currently in its 14th year, with more than 77,000 participants measured

  • Sample frame is drawn randomly from the NZ Electoral Roll.

Causal Questions

  1. Average Treatment Effect: difference in average social outcomes (a) if everyone in New Zealand were to attend religious service at least once per month versus (b) if everyone were not to attend religious service at all.

  2. Heterogeneous Treatment Effects: use machine learning to discover groups that differ in these effects. Focus here on the Qini curve at 20% and 30% of the full treatment “budget”.

  3. Use the Qini results to select Policy Trees: clarify who responds strongly/weakly (see the code sketch after this list).

Note: all results use “honest” learning – sample is split into training/test sets (here 70/30).
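A minimal sketch of this workflow with the grf and policytree packages; the simulated `X`, `W`, `Y` stand in for the NZAVS covariates, exposure, and outcomes, and this illustrates the technique rather than the study's exact pipeline:

Code
library(grf)
library(policytree)

set.seed(2025)
n <- 2000; p <- 5
X <- matrix(rnorm(n * p), n, p)      # baseline covariates
W <- rbinom(n, 1, 0.5)               # exposure: monthly religious attendance (simulated)
Y <- pmax(X[, 1], 0) * W + rnorm(n)  # outcome with heterogeneous treatment effects

# "honest" causal forest (grf splits samples for honest estimation by default)
cf <- causal_forest(X, Y, W)
average_treatment_effect(cf)         # ATE estimate with standard error

# Qini-style evaluation: rank units by predicted CATE
rank_average_treatment_effect(cf, predict(cf)$predictions, target = "QINI")

# doubly robust scores feed a shallow policy tree: who should be treated first?
gamma <- double_robust_scores(cf)
tree  <- policy_tree(X, gamma, depth = 2)
print(tree)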

Outcomes

  1. Sense of Belonging
  2. Social Support
  3. Charitable Donations (annual)
  4. Volunteering (weekly)

All outcomes measured one year after the intervention

Three-wave Panel Design: control for baseline exposure and outcome among confounder set.

[DAG: three-wave panel design: baseline confounders, exposure, and outcome measured in successive waves]

Target Population: Residents of New Zealand in 2018, had they not been censored through 2021.

Census Weights (Age × Gender × Ethnicity)

(IPCW Stabilised & Trimmed)
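A minimal sketch of constructing stabilised, trimmed inverse-probability-of-censoring weights; the data frame `d` and its columns (`uncensored`, `age`, `gender`, `ethnicity`) are hypothetical:

Code
# denominator: Pr(remaining uncensored | baseline covariates)
denom <- glm(uncensored ~ age + gender + ethnicity,
             family = binomial, data = d)
p_unc <- predict(denom, type = "response")

# stabilised weight: marginal probability over conditional probability
w <- mean(d$uncensored) / p_unc

# trim extreme weights (here at the 99th percentile)
d$ipcw <- pmin(w, quantile(w, 0.99))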

Religious Service Intervention

\[f(A) = \begin{cases} 1 & \text{if } a \ge 1 \text{ (attends religious service at least once per month)} \\ 0 & \text{if } a = 0 \text{ (no religious service attendance)} \end{cases}\]
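In code, the dichotomisation might look like the line below; the data frame `dat` and the ordinal item `attendance_per_month` are hypothetical stand-ins for the NZAVS measure:

Code
# 1 = attends religious service at least once per month; 0 = never attends
dat$attend_monthly <- as.integer(dat$attendance_per_month >= 1)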

Histogram Showing Split

Love Plot of Propensity Scores

Average Treatment Effect Results:

Workflow

  • Target population: New Zealand Population by 2018 census weights
  • Outcomes: all NZAVS outcomes relating to pro-sociality (outcome-wide study)
  • Missing data: inverse-probability-of-censoring weights, non-parametrically estimated; causal forests permit missing data at baseline
  • Eligibility: participated in NZAVS wave 10 (2018) and wave 11 (2019, the exposure wave); participants may have been lost to follow-up at wave 12 (N = 46,377)
  • Bonferroni correct for multiple comparisons / E-value threshold = 1.2

Transition table: Religious Attendance

| From / To | State 0 | State 1 | Total |
|---|---|---|---|
| State 0 | 28,545 | 681 | 29,226 |
| State 1 | 887 | 3,593 | 4,480 |

Average Treatment Effect

The following outcomes showed reliable causal evidence (E-value lower bound > 1.2):

  • Volunteering (weekly, log scale): 0.154 (0.056, 0.252); on the original scale, 11.264 minutes (3.943, 19.158). E-value bound = 1.287
  • Charitable donations (log scale): 0.13 (0.045, 0.215); on the original scale, 39.935 (12.247, 74.883). E-value bound = 1.25

Graph ATE

How Much Better if We Treated by CATE?

Who Should We Treat First? Belonging

Who Should We Treat First? Charitable Donations

Closing

| Error | Need | Remedy |
|---|---|---|
| What | Control the right variables | Draw a DAG first |
| When | Start the clock on time | Emulate a target trial |
| Who | Treat the right people | Weight to target + investigate heterogeneity |

Thanks

  • TRT Grant 0418 & Max Planck Institute for Evolutionary Anthropology for support.
  • Chris G. Sibley (NZAVS lead Investigator) for the heavy data lifting
  • 77,490 NZAVS participants for their time
  • Get in touch with questions or if you would like to become involved:

NZAVS