Simulation Guide

Simulations are pedagogical tools that let us see what causal inference methods do when we know the truth. In observational research, we never know the true causal effect. In a simulation, we build the data-generating process ourselves, so we can compare each method's estimate against the ground-truth parameter. The four simulations in this guide illustrate distinct threats to valid causal inference and distinct strategies for addressing them.

The data-generating helpers live in the causalworkshop package. Each section below shows the helper call followed by the analysis we apply to it, so you can see the reasoning step by step.

Required R packages

causalworkshop, tidyverse, stdReg, gtsummary, clarify, grf. Follow Lab Setup: R Packages and Build Tools, then install causalworkshop from GitHub with pak::pak("go-bayes/causalworkshop") (version $\geq$ 0.6.0).


Generalisability and transportability

Connects to Week 4: external validity and selection bias.

This simulation creates two populations that differ in the prevalence of an effect modifier $Z$. In the sample, $Z = 1$ is rare ($p = 0.1$); in the target population, $Z = 1$ is common ($p = 0.5$). The treatment effect depends on $Z$: individuals with $Z = 1$ benefit more from treatment. Because the sample under-represents these high-benefit individuals, the naive sample Average Treatment Effect (ATE) underestimates the population ATE.

library(causalworkshop)

data <- simulate_ate_data_with_weights(
  n_sample      = 10000,
  n_population  = 100000,
  p_z_sample    = 0.1,
  p_z_population = 0.5,
  beta_a        = 1,
  beta_z        = 2.5,
  beta_az       = 0.5,
  noise_sd      = 0.5,
  seed          = 123
)

sample_data     <- data$sample_data       # y_sample, a_sample, z_sample, weights
population_data <- data$population_data   # y_population, a_population, z_population

The simulation fits three models: an unweighted model on the sample, a weighted model on the sample (using inverse-probability-of-sampling weights), and an oracle model on the full population. The regression coefficients are nearly identical across all three models, yet the marginal ATEs differ. This dissociation is the central lesson: model coefficients describe conditional associations, but the ATE is a marginal quantity that depends on the distribution of effect modifiers in the target population. Weighting the sample to match the population distribution of $Z$ recovers the correct ATE.

The legacy script also includes a manual calculation that shows exactly what stdReg does under the hood: create counterfactual datasets in which everyone receives $A = 0$ and everyone receives $A = 1$, predict outcomes under each scenario, and take the mean difference. This "g-computation" procedure makes the marginalisation step explicit.

Key takeaway

Regression coefficients can be correct and yet the ATE can still be wrong for the target population. External validity requires that the distribution of effect modifiers in the sample matches the target, or that we adjust for the mismatch.


Cross-sectional data ambiguity

Connects to Week 3: confounding versus mediation.

This simulation generates data in which $A$ causes $L$, and $L$ causes $Y$. The variable $L$ is therefore a mediator, not a confounder.

data_cross <- simulate_mediation_example(
  n     = 1000,
  beta  = 2,      # effect of A on L
  delta = 1.5,    # effect of L on Y
  seed  = 123
)

fit_blocked <- lm(Y ~ A + L, data = data_cross)   # incorrectly conditions on the mediator
fit_total   <- lm(Y ~ A,     data = data_cross)   # correctly recovers the total effect

The model that conditions on $L$ returns a near-zero estimate for the effect of $A$ on $Y$ because it blocks the very path through which $A$ operates. The model that omits $L$ correctly recovers the total effect (true value: beta * delta = 3).

The crux of the problem is that with cross-sectional data alone, the investigator cannot distinguish a confounder from a mediator. Both the fork $A \leftarrow L \rightarrow Y$ and the chain $A \rightarrow L \rightarrow Y$ produce the same observed association between $A$, $L$, and $Y$. The correct modelling decision depends on the assumed causal structure, which the data themselves do not reveal.

Warning

Good model fit does not resolve this ambiguity. A model that conditions on a mediator can fit the data well while returning a biased causal estimate. Model fit is a statistical property; confounding is a structural (causal) property.


Confounding control strategies

Connects to Weeks 3–4: conditioning choices and the backdoor criterion.

This simulation builds a three-wave panel structure with a baseline covariate $L_0$, a prior outcome $Y_0$, a prior exposure $A_0$, an unmeasured confounder $U$, a treatment $A_1$, and an outcome $Y_2$. The true treatment effect is $\delta_{A_1} = 0.3$, and the outcome also depends on $Y_0$, $A_0$, $L_0$, their interaction, and $U$.

data_panel <- simulate_three_wave_panel(
  n        = 10000,
  delta_A1 = 0.3,
  seed     = 123
)

fit_no_control  <- lm(Y_2 ~ A_1,                          data = data_panel)
fit_standard    <- lm(Y_2 ~ A_1 + L_0,                    data = data_panel)
fit_interaction <- lm(Y_2 ~ A_1 * (L_0 + A_0 + Y_0),      data = data_panel)

Three models are compared. The "no control" model regresses $Y_2$ on $A_1$ alone and overestimates the effect because it leaves all backdoor paths open. The "standard" model adds $L_0$ but still omits $Y_0$ and $A_0$, leaving residual confounding. The "interaction" model conditions on $L_0$, $A_0$, $Y_0$, and their interactions with $A_1$, recovering an estimate close to the true value.

The simulation uses the clarify package to obtain simulation-based confidence intervals for each ATE. The progressive improvement from no control to standard to interaction control illustrates that closing more backdoor paths moves the estimate closer to the truth, but only conditioning on the right set of variables eliminates confounding entirely.

Key takeaway

In a three-wave panel, conditioning on the prior exposure, prior outcome, and baseline covariates (along with their interactions) is ordinarily necessary to satisfy the backdoor criterion. Omitting any of these leaves residual confounding.


Causal forest estimation

Connects to Week 8: machine learning for heterogeneous treatment effects.

This simulation reuses simulate_three_wave_panel(), then fits a causal forest (from the grf package) with $L_0$, $A_0$, and $Y_0$ as covariates.

library(grf)

data_panel <- simulate_three_wave_panel(seed = 123)

W <- as.matrix(data_panel$A_1)
Y <- as.matrix(data_panel$Y_2)
X <- as.matrix(data_panel[, c("L_0", "A_0", "Y_0")])

fit_cf <- causal_forest(X, Y, W)
average_treatment_effect(fit_cf)

If grf warns that estimated treatment propensities are low for some observations, read that as a positivity warning, not as a broken simulation. The example still recovers the true effect closely, but the warning is a useful reminder that causal forests need overlap between treated and untreated observations.

The causal forest is a non-parametric method that estimates individual-level treatment effects $\hat{\tau}(x)$ by partitioning the covariate space adaptively. Unlike the parametric models above, the causal forest does not require the investigator to specify interaction terms; it discovers them from the data.

Comparing the causal forest estimate to the parametric interaction model illustrates two points. First, the causal forest can recover the ATE without requiring the analyst to guess the correct functional form. Second, the causal forest provides a standard error that accounts for the adaptive splitting, making valid inference possible even in the non-parametric setting.

Key takeaway

Causal forests automate the discovery of heterogeneous treatment effects but still require the investigator to supply the correct set of confounders. Machine learning solves the functional-form problem, not the identification problem.