Simulation Guide
Simulations are pedagogical tools that let us see what causal inference methods do when we know the truth. In observational research, we never know the true causal effect. In a simulation, we build the data-generating process ourselves, so we can compare each method's estimate against the ground-truth parameter. The four simulations in this guide illustrate distinct threats to valid causal inference and distinct strategies for addressing them.
The data-generating helpers live in the causalworkshop package. Each
section below shows the helper call followed by the analysis we apply to
it, so you can see the reasoning step by step.
Required R packages
causalworkshop,tidyverse,stdReg,gtsummary,clarify,grf. Follow Lab Setup: R Packages and Build Tools, then installcausalworkshopfrom GitHub withpak::pak("go-bayes/causalworkshop")(version $\geq$ 0.6.0).
Generalisability and transportability
Connects to Week 4: external validity and selection bias.
This simulation creates two populations that differ in the prevalence of an effect modifier $Z$. In the sample, $Z = 1$ is rare ($p = 0.1$); in the target population, $Z = 1$ is common ($p = 0.5$). The treatment effect depends on $Z$: individuals with $Z = 1$ benefit more from treatment. Because the sample under-represents these high-benefit individuals, the naive sample Average Treatment Effect (ATE) underestimates the population ATE.
library(causalworkshop)
data <- simulate_ate_data_with_weights(
n_sample = 10000,
n_population = 100000,
p_z_sample = 0.1,
p_z_population = 0.5,
beta_a = 1,
beta_z = 2.5,
beta_az = 0.5,
noise_sd = 0.5,
seed = 123
)
sample_data <- data$sample_data # y_sample, a_sample, z_sample, weights
population_data <- data$population_data # y_population, a_population, z_population
The simulation fits three models: an unweighted model on the sample, a weighted model on the sample (using inverse-probability-of-sampling weights), and an oracle model on the full population. The regression coefficients are nearly identical across all three models, yet the marginal ATEs differ. This dissociation is the central lesson: model coefficients describe conditional associations, but the ATE is a marginal quantity that depends on the distribution of effect modifiers in the target population. Weighting the sample to match the population distribution of $Z$ recovers the correct ATE.
The legacy script also includes a manual calculation that shows exactly
what stdReg does under the hood: create counterfactual datasets in
which everyone receives $A = 0$ and everyone receives $A = 1$,
predict outcomes under each scenario, and take the mean difference. This
"g-computation" procedure makes the marginalisation step explicit.
Key takeaway
Regression coefficients can be correct and yet the ATE can still be wrong for the target population. External validity requires that the distribution of effect modifiers in the sample matches the target, or that we adjust for the mismatch.
Cross-sectional data ambiguity
Connects to Week 3: confounding versus mediation.
This simulation generates data in which $A$ causes $L$, and $L$ causes $Y$. The variable $L$ is therefore a mediator, not a confounder.
data_cross <- simulate_mediation_example(
n = 1000,
beta = 2, # effect of A on L
delta = 1.5, # effect of L on Y
seed = 123
)
fit_blocked <- lm(Y ~ A + L, data = data_cross) # incorrectly conditions on the mediator
fit_total <- lm(Y ~ A, data = data_cross) # correctly recovers the total effect
The model that conditions on $L$ returns a near-zero estimate for the
effect of $A$ on $Y$ because it blocks the very path through which
$A$ operates. The model that omits $L$ correctly recovers the total
effect (true value: beta * delta = 3).
The crux of the problem is that with cross-sectional data alone, the investigator cannot distinguish a confounder from a mediator. Both the fork $A \leftarrow L \rightarrow Y$ and the chain $A \rightarrow L \rightarrow Y$ produce the same observed association between $A$, $L$, and $Y$. The correct modelling decision depends on the assumed causal structure, which the data themselves do not reveal.
Warning
Good model fit does not resolve this ambiguity. A model that conditions on a mediator can fit the data well while returning a biased causal estimate. Model fit is a statistical property; confounding is a structural (causal) property.
Confounding control strategies
Connects to Weeks 3–4: conditioning choices and the backdoor criterion.
This simulation builds a three-wave panel structure with a baseline covariate $L_0$, a prior outcome $Y_0$, a prior exposure $A_0$, an unmeasured confounder $U$, a treatment $A_1$, and an outcome $Y_2$. The true treatment effect is $\delta_{A_1} = 0.3$, and the outcome also depends on $Y_0$, $A_0$, $L_0$, their interaction, and $U$.
data_panel <- simulate_three_wave_panel(
n = 10000,
delta_A1 = 0.3,
seed = 123
)
fit_no_control <- lm(Y_2 ~ A_1, data = data_panel)
fit_standard <- lm(Y_2 ~ A_1 + L_0, data = data_panel)
fit_interaction <- lm(Y_2 ~ A_1 * (L_0 + A_0 + Y_0), data = data_panel)
Three models are compared. The "no control" model regresses $Y_2$ on $A_1$ alone and overestimates the effect because it leaves all backdoor paths open. The "standard" model adds $L_0$ but still omits $Y_0$ and $A_0$, leaving residual confounding. The "interaction" model conditions on $L_0$, $A_0$, $Y_0$, and their interactions with $A_1$, recovering an estimate close to the true value.
The simulation uses the clarify package to obtain simulation-based
confidence intervals for each ATE. The progressive improvement from no
control to standard to interaction control illustrates that closing more
backdoor paths moves the estimate closer to the truth, but only
conditioning on the right set of variables eliminates confounding
entirely.
Key takeaway
In a three-wave panel, conditioning on the prior exposure, prior outcome, and baseline covariates (along with their interactions) is ordinarily necessary to satisfy the backdoor criterion. Omitting any of these leaves residual confounding.
Causal forest estimation
Connects to Week 8: machine learning for heterogeneous treatment effects.
This simulation reuses simulate_three_wave_panel(), then fits a causal
forest (from the grf package) with $L_0$, $A_0$, and $Y_0$ as
covariates.
library(grf)
data_panel <- simulate_three_wave_panel(seed = 123)
W <- as.matrix(data_panel$A_1)
Y <- as.matrix(data_panel$Y_2)
X <- as.matrix(data_panel[, c("L_0", "A_0", "Y_0")])
fit_cf <- causal_forest(X, Y, W)
average_treatment_effect(fit_cf)
If grf warns that estimated treatment propensities are low for some
observations, read that as a positivity warning, not as a broken
simulation. The example still recovers the true effect closely, but the
warning is a useful reminder that causal forests need overlap between
treated and untreated observations.
The causal forest is a non-parametric method that estimates individual-level treatment effects $\hat{\tau}(x)$ by partitioning the covariate space adaptively. Unlike the parametric models above, the causal forest does not require the investigator to specify interaction terms; it discovers them from the data.
Comparing the causal forest estimate to the parametric interaction model illustrates two points. First, the causal forest can recover the ATE without requiring the analyst to guess the correct functional form. Second, the causal forest provides a standard error that accounts for the adaptive splitting, making valid inference possible even in the non-parametric setting.
Key takeaway
Causal forests automate the discovery of heterogeneous treatment effects but still require the investigator to supply the correct set of confounders. Machine learning solves the functional-form problem, not the identification problem.