Simulation Guide

Simulations are pedagogical tools that let us see what causal inference methods do when we know the truth. In observational research, we never know the true causal effect. In a simulation, we build the data-generating process ourselves, so we can compare each method's estimate against the ground-truth parameter. The four simulations in this course illustrate distinct threats to valid causal inference and distinct strategies for addressing them.

Download the R script

Required R packages

tidyverse, stdReg, gtsummary, clarify, grf

Generalisability and transportability

Connects to Week 4: external validity and selection bias.

This simulation creates two populations that differ in the prevalence of an effect modifier $Z$. In the sample, $Z = 1$ is rare ($p = 0.1$); in the target population, $Z = 1$ is common ($p = 0.5$). The treatment effect depends on $Z$: individuals with $Z = 1$ benefit more from treatment. Because the sample under-represents these high-benefit individuals, the naive sample Average Treatment Effect (ATE) underestimates the population ATE.
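As a worked illustration with assumed numbers (they are not taken from the course script), suppose the treatment effect is $\tau_{0} = 0.5$ when $Z = 0$ and $\tau_{1} = 1.5$ when $Z = 1$. The ATE in a population where $Z = 1$ has prevalence $p$ is the prevalence-weighted average of the two stratum-specific effects:

$$
\text{ATE}(p) = (1 - p)\,\tau_{0} + p\,\tau_{1}
$$

$$
\text{ATE}(0.1) = 0.9 \times 0.5 + 0.1 \times 1.5 = 0.60, \qquad \text{ATE}(0.5) = 0.5 \times 0.5 + 0.5 \times 1.5 = 1.00
$$

Under these assumed values the sample ATE (0.60) falls well short of the target-population ATE (1.00), even though the stratum-specific effects are identical in both populations.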

The simulation fits three models: an unweighted model on the sample, a weighted model on the sample (using inverse-probability-of-sampling weights), and an oracle model on the full population. The regression coefficients are nearly identical across all three models, yet the marginal ATEs differ. This dissociation is the central lesson: model coefficients describe conditional associations, but the ATE is a marginal quantity that depends on the distribution of effect modifiers in the target population. Weighting the sample to match the population distribution of $Z$ recovers the correct ATE.
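The dissociation between coefficients and marginal ATEs can be sketched in base R. This is a minimal illustration, not the course script: the effect sizes, the sample size, the seed, and the names `sample_df`, `p_target`, and `w` are all assumptions made for the example, and treatment is assumed to be randomised.

```r
set.seed(434)  # arbitrary seed, only so this sketch is reproducible

n <- 20000
z <- rbinom(n, 1, 0.1)   # effect modifier is rare in the sample (p = 0.1)
a <- rbinom(n, 1, 0.5)   # randomised treatment (assumption)
# Assumed effects: 0.5 when Z = 0 and 1.5 when Z = 1
y <- 0.5 * a + 0.2 * z + 1.0 * a * z + rnorm(n)
sample_df <- data.frame(y, a, z)

# Inverse-probability-of-sampling style weights that rebalance Z
# from its sample prevalence to the target prevalence (p = 0.5)
p_sample <- mean(sample_df$z)
p_target <- 0.5
sample_df$w <- ifelse(sample_df$z == 1,
                      p_target / p_sample,
                      (1 - p_target) / (1 - p_sample))

fit_unweighted <- lm(y ~ a * z, data = sample_df)
fit_weighted   <- lm(y ~ a * z, data = sample_df, weights = w)

# The conditional coefficients agree across the two fits
coef_table <- cbind(unweighted = coef(fit_unweighted),
                    weighted   = coef(fit_weighted))
print(round(coef_table, 3))

# The marginal ATE averages the conditional effect over the distribution of Z
b_u <- coef(fit_unweighted)
b_w <- coef(fit_weighted)
ate_sample <- unname(b_u["a"] + b_u["a:z"] * mean(sample_df$z))
ate_target <- unname(b_w["a"] + b_w["a:z"] *
                       weighted.mean(sample_df$z, sample_df$w))
print(c(ate_sample = ate_sample, ate_target = ate_target))
```

Because the weights depend only on $Z$ and the model is saturated in $A$ and $Z$, the weighted and unweighted fits estimate the same cell means; only the averaging step over $Z$ changes.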

The script also includes a manual calculation section that shows exactly what stdReg does under the hood: create counterfactual datasets in which everyone receives $A = 0$ and everyone receives $A = 1$, predict outcomes under each scenario, and take the mean difference. This "g-computation" procedure makes the marginalisation step explicit.
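The three steps can be written out in a few lines of base R. The sketch below does not call stdReg and is not the script's own code; the data-generating values and the names `pop_df`, `d0`, and `d1` are assumptions chosen for illustration.

```r
set.seed(434)  # arbitrary seed for this sketch

n <- 10000
z <- rbinom(n, 1, 0.5)   # Z prevalence in the target population (p = 0.5)
a <- rbinom(n, 1, 0.5)   # randomised treatment (assumption)
# Assumed effects, so the true ATE is 0.5 + 1.0 * 0.5 = 1.0
y <- 0.5 * a + 0.2 * z + 1.0 * a * z + rnorm(n)
pop_df <- data.frame(y, a, z)

fit <- lm(y ~ a * z, data = pop_df)

# Step 1: build two counterfactual copies of the data
d0 <- transform(pop_df, a = 0)   # everyone untreated
d1 <- transform(pop_df, a = 1)   # everyone treated

# Step 2: predict each person's outcome under each scenario
y0_hat <- predict(fit, newdata = d0)
y1_hat <- predict(fit, newdata = d1)

# Step 3: average the individual contrasts to obtain the marginal ATE
ate_gcomp <- mean(y1_hat - y0_hat)
print(ate_gcomp)
```

Averaging over the rows of `pop_df` is what ties the estimate to this particular distribution of $Z$; the same fitted model averaged over a different population would give a different ATE.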

Key takeaway

Regression coefficients can be correct and yet the ATE can still be wrong for the target population. External validity requires that the distribution of effect modifiers in the sample matches the target, or that we adjust for the mismatch.

Cross-sectional data ambiguity

Connects to Week 3: confounding versus mediation.

This simulation generates data in which $A$ causes $L$, and $L$ causes $Y$. The variable $L$ is therefore a mediator, not a confounder. Two models are fit: one that conditions on $L$ (treating it as a confounder) and one that omits $L$ (treating it as a mediator). The model that conditions on $L$ returns a near-zero estimate for the effect of $A$ on $Y$ because it blocks the very path through which $A$ operates. The model that omits $L$ correctly recovers the total effect.
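A minimal base R sketch of this chain is given below. The path coefficients (0.8 and 0.5), the absence of a direct $A \to Y$ path, and the object names are assumptions for illustration rather than values from the course script.

```r
set.seed(2024)  # arbitrary seed for this sketch

n <- 10000
a <- rnorm(n)
l <- 0.8 * a + rnorm(n)   # A causes L
y <- 0.5 * l + rnorm(n)   # L causes Y; no direct A -> Y path (assumption)
# Implied total effect of A on Y: 0.8 * 0.5 = 0.4

fit_total   <- lm(y ~ a)       # omits the mediator
fit_blocked <- lm(y ~ a + l)   # conditions on the mediator

est_total   <- unname(coef(fit_total)["a"])
est_blocked <- unname(coef(fit_blocked)["a"])
print(c(total = est_total, blocked = est_blocked))

# The model that blocks the causal path nonetheless fits better
r2 <- c(total   = summary(fit_total)$r.squared,
        blocked = summary(fit_blocked)$r.squared)
print(round(r2, 3))
```

In this sketch the model that conditions on $L$ has the higher $R^{2}$ yet returns an estimate near zero for $A$, which is why fit statistics cannot be used to choose between the two specifications.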

The crux of the problem is that with cross-sectional data alone, the investigator cannot distinguish a confounder from a mediator. Both the fork $A \leftarrow L \to Y$ and the chain $A \to L \to Y$ imply the same pattern of observed associations among $A$, $L$, and $Y$: in each structure, $A$ and $Y$ are associated marginally and independent given $L$. The correct modelling decision depends on the assumed causal structure, which the data themselves do not reveal.

Warning

Good model fit does not resolve this ambiguity. A model that conditions on a mediator can fit the data well while returning a biased causal estimate. Model fit is a statistical property; confounding is a structural (causal) property.

Confounding control strategies

Connects to Weeks 3–4: conditioning choices and the backdoor criterion.

This simulation builds a three-wave panel structure with a baseline covariate $L_{0}$, a prior outcome $Y_{0}$, a prior exposure $A_{0}$, an unmeasured confounder $U$, a treatment $A_{1}$, and an outcome $Y_{2}$. The true treatment effect is $\delta_{A_{1}} = 0.3$, and the outcome also depends on $Y_{0}$, $A_{0}$, $L_{0}$, their interactions, and $U$.

Three models are compared. The "no control" model regresses $Y_{2}$ on $A_{1}$ alone and overestimates the effect because it leaves all confounding paths open. The "standard" model adds $L_{0}$ but still omits $Y_{0}$ and $A_{0}$, leaving residual confounding. The "interaction" model conditions on $L_{0}$, $A_{0}$, $Y_{0}$, and their interactions with $A_{1}$, recovering an estimate close to the true value.
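The progression across the three models can be reproduced with a simplified base R sketch. Only the true effect of 0.3 comes from the text. Every other coefficient is an assumption, and the sketch deliberately omits the interaction terms of the course's data-generating process, so a main-effects model plays the role of the "interaction" model here. $U$ is assumed to affect $Y_{0}$ and $Y_{2}$ but to reach $A_{1}$ only through $Y_{0}$.

```r
set.seed(434)  # arbitrary seed for this sketch

n  <- 20000
l0 <- rnorm(n)                                   # baseline covariate
u  <- rnorm(n)                                   # unmeasured variable
a0 <- rbinom(n, 1, plogis(0.5 * l0))             # prior exposure
y0 <- 0.5 * l0 + 0.3 * a0 + 0.5 * u + rnorm(n)   # prior outcome
a1 <- rbinom(n, 1, plogis(0.5 * l0 + 0.8 * a0 + 0.5 * y0))  # treatment
# True effect of A1 is 0.3; the remaining coefficients are assumed
y2 <- 0.3 * a1 + 0.4 * l0 + 0.3 * a0 + 0.5 * y0 + 0.4 * u + rnorm(n)
d  <- data.frame(l0, a0, y0, a1, y2)             # U is left out: unmeasured

fit_none     <- lm(y2 ~ a1, data = d)                  # no control
fit_standard <- lm(y2 ~ a1 + l0, data = d)             # baseline covariate only
fit_full     <- lm(y2 ~ a1 + l0 + a0 + y0, data = d)   # adds prior exposure and outcome

est <- c(none     = unname(coef(fit_none)["a1"]),
         standard = unname(coef(fit_standard)["a1"]),
         full     = unname(coef(fit_full)["a1"]))
print(round(est, 3))
```

In this simplified setting the estimates should fall from the "no control" model to the "standard" model to the full model, with only the last landing near 0.3.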

The simulation uses the clarify package to obtain simulation-based confidence intervals for each ATE. The progressive improvement from no control to standard to interaction control illustrates that closing more backdoor paths moves the estimate closer to the truth, but only conditioning on the right set of variables eliminates confounding entirely.
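The general idea behind simulation-based intervals can be sketched in base R. The code below does not call the clarify package and should not be read as its implementation: it draws coefficient vectors from a multivariate normal distribution centred on the fitted coefficients, recomputes the ATE for each draw, and takes percentile limits. The data-generating values and all object names are assumptions for illustration.

```r
set.seed(434)  # arbitrary seed for this sketch

n <- 5000
x <- rnorm(n, mean = 1)
a <- rbinom(n, 1, plogis(0.5 * x))
# Assumed outcome model, so the true ATE is 0.3 + 0.2 * E[x] = 0.5
y <- 0.3 * a + 0.5 * x + 0.2 * a * x + rnorm(n)
fit <- lm(y ~ a * x)

# Draw coefficient vectors from N(beta_hat, V_hat) using a Cholesky factor
n_sims <- 2000
k <- length(coef(fit))
std_normal <- matrix(rnorm(n_sims * k), nrow = n_sims, ncol = k)
draws <- sweep(std_normal %*% chol(vcov(fit)), 2, coef(fit), "+")
colnames(draws) <- names(coef(fit))

# Recompute the marginal ATE for every simulated coefficient vector
ate_draws <- draws[, "a"] + draws[, "a:x"] * mean(x)
ate_hat   <- unname(coef(fit)["a"] + coef(fit)["a:x"] * mean(x))
ci        <- quantile(ate_draws, c(0.025, 0.975))

print(ate_hat)
print(ci)
```

The appeal of this approach is that the interval is computed for the quantity of interest (the ATE) directly, without a separate analytic variance formula for each derived estimand.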

Key takeaway

In a three-wave panel, conditioning on the prior exposure, prior outcome, and baseline covariates (along with their interactions) is ordinarily necessary to satisfy the backdoor criterion. Omitting any of these leaves residual confounding.

Causal forest estimation

Connects to Week 8: machine learning for heterogeneous treatment effects.

This simulation uses the same data-generating process as the confounding control simulation, then fits a causal forest (from the grf package) with $L_{0}$, $A_{0}$, and $Y_{0}$ as covariates. The causal forest is a non-parametric method that estimates conditional average treatment effects $\hat{\tau}(x)$, that is, the expected treatment effect for individuals with covariate values $x$, by partitioning the covariate space adaptively. Unlike the parametric models above, the causal forest does not require the investigator to specify interaction terms; it discovers them from the data.

The simulation reports the causal forest's ATE estimate and its standard error. Comparing this estimate to the parametric interaction model illustrates two points. First, the causal forest can recover the ATE without requiring the analyst to guess the correct functional form. Second, the causal forest provides a standard error that accounts for the adaptive splitting, making valid inference possible even in the non-parametric setting.
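A minimal sketch of the workflow with grf is shown below, assuming the package is installed (the block is skipped otherwise). The data-generating values are assumptions and are simpler than the course script's process; only the covariate set ($L_{0}$, $A_{0}$, $Y_{0}$) and the true effect of 0.3 follow the text.

```r
if (requireNamespace("grf", quietly = TRUE)) {
  set.seed(434)  # arbitrary seed for this sketch

  n  <- 2000
  l0 <- rnorm(n)
  a0 <- rbinom(n, 1, plogis(0.5 * l0))
  y0 <- 0.5 * l0 + 0.3 * a0 + rnorm(n)
  a1 <- rbinom(n, 1, plogis(0.5 * l0 + 0.8 * a0 + 0.5 * y0))
  # True effect of A1 is 0.3; the other coefficients are assumed
  y2 <- 0.3 * a1 + 0.4 * l0 + 0.3 * a0 + 0.5 * y0 + rnorm(n)

  # Covariate matrix supplied by the analyst: the forest cannot choose it
  X  <- cbind(l0 = l0, a0 = a0, y0 = y0)
  cf <- grf::causal_forest(X = X, Y = y2, W = a1,
                           num.trees = 500, seed = 434)

  # Out-of-bag estimates of the conditional average treatment effect
  tau_hat <- predict(cf)$predictions
  print(summary(tau_hat))

  # ATE with its standard error
  ate <- grf::average_treatment_effect(cf, target.sample = "all")
  print(ate)
}
```

Note that the analyst still chooses the columns of `X`; leaving out a confounder would bias the forest's estimate just as it biases a regression.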

Key takeaway

Causal forests automate the discovery of heterogeneous treatment effects but still require the investigator to supply the correct set of confounders. Machine learning solves the functional-form problem, not the identification problem.

Running the simulations

To run the full script, open R and execute:

source("scripts/simulations.R")

Each section prints its results to the console. You can also run individual sections by selecting and executing the relevant code blocks. The script sets random seeds so that results are reproducible.

PSYC 434: Conducting Research Across Cultures
