Lab 6: Conditional Average Treatment Effects

This lab explores why functional form matters for estimating heterogeneous treatment effects. You will compare parametric and non-parametric estimators, examine individual-level predictions from causal forests, and test whether treatment effects genuinely vary across individuals.

What you will learn

Why OLS can miss treatment effect heterogeneity
How to extract individual treatment effect predictions from a causal forest
How to test for significant heterogeneity using test_calibration()
How to identify which covariates drive effect modification

Why functional form matters

When treatment effects vary across individuals, the method we use to estimate them matters. A linear model assumes effects change at a constant rate with each covariate; a causal forest can capture non-linear and interactive patterns.
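To make the linearity assumption concrete, consider a sketch with a single covariate $x$ and a binary treatment $w$. The coefficients $\beta_{0}, \dots, \beta_{3}$ are illustrative symbols, not output from any function in this lab:

$$
E[Y \mid w, x] = \beta_{0} + \beta_{1} w + \beta_{2} x + \beta_{3} w x
\quad \Longrightarrow \quad
\tau(x) = E[Y \mid w = 1, x] - E[Y \mid w = 0, x] = \beta_{1} + \beta_{3} x
$$

Under this model the effect is a straight line in $x$ with slope $\beta_{3}$. If the true $\tau(x)$ bends or depends on combinations of covariates, no choice of $\beta_{1}$ and $\beta_{3}$ can reproduce it.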

library(causalworkshop)
library(grf)
library(tidyverse)

The simulate_nonlinear_data() function generates data where the true treatment effect surface is deliberately non-linear, so that flexible methods outperform rigid ones:

# simulate data with non-linear treatment effects
d_nl <- simulate_nonlinear_data(n = 2000, seed = 2026)

# compare four estimation methods
result <- compare_ate_methods(d_nl)

All four methods (OLS, polynomial, GAM, causal forest) recover the overall ATE reasonably well. But their ability to predict individual effects differs dramatically:

# compare RMSE for individual-level predictions
print(result$summary)

RMSE tells the story

RMSE (root mean squared error) measures how well each method predicts the true individual treatment effect $\tau(x_{i})$. A lower RMSE means the method captures the heterogeneity pattern more accurately. OLS assumes a linear effect surface and typically has the highest RMSE.
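
To illustrate the comparison, here is a minimal base R sketch that does not use compare_ate_methods(). It simulates a toy effect surface and scores two parametric estimators against the known truth. The formula for tau, the variable names, and the rmse() helper are all hypothetical choices made for this example:

```r
# toy data with a known, non-linear treatment effect (hypothetical formula)
set.seed(2026)
n <- 2000
x <- rnorm(n)
w <- rbinom(n, 1, 0.5)                 # randomised binary treatment
tau <- 0.2 + 0.5 * x - 0.3 * x^2       # true effect for each unit
y <- x + w * tau + rnorm(n)

# hypothetical helper: root mean squared error against the truth
rmse <- function(estimate, truth) sqrt(mean((estimate - truth)^2))

# model 1: one constant effect for everyone
fit_const <- lm(y ~ w + x)
tau_hat_const <- rep(coef(fit_const)[["w"]], n)

# model 2: effect allowed to change linearly with x
fit_lin <- lm(y ~ w * x)
tau_hat_lin <- coef(fit_lin)[["w"]] + coef(fit_lin)[["w:x"]] * x

rmse_const <- rmse(tau_hat_const, tau)
rmse_lin <- rmse(tau_hat_lin, tau)
print(c(constant = rmse_const, linear = rmse_lin))
```

The linear-interaction model should score better than the constant-effect model because it captures the slope in x, but its RMSE stays well above zero because it has no way to represent the curvature.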

Individual treatment effects from the causal forest

Now we return to the NZAVS data from Lab 5. The causal forest estimates $\tau(x_{i})$ for each individual: the expected difference in their outcome under treatment versus no treatment, given their covariates $x_{i}$.

# simulate NZAVS data (same as Lab 5)
d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)
d1 <- d |> filter(wave == 1)
d2 <- d |> filter(wave == 2)

# construct matrices
covariate_cols <- c(
  "age", "male", "nz_european", "education", "partner", "employed",
  "log_income", "nz_dep", "agreeableness", "conscientiousness",
  "extraversion", "neuroticism", "openness",
  "community_group", "wellbeing"
)

X <- as.matrix(d0[, covariate_cols])
Y <- d2$wellbeing
W <- d1$community_group

# fit causal forest
cf <- causal_forest(
  X, Y, W,
  num.trees = 1000,
  honesty = TRUE,
  tune.parameters = "all",
  seed = 2026
)

Extract predicted individual treatment effects:

# predicted treatment effects for each individual
tau_hat <- predict(cf)$predictions

# summary statistics
cat("Mean tau_hat:  ", round(mean(tau_hat), 3), "\n")
cat("SD tau_hat:    ", round(sd(tau_hat), 3), "\n")
cat("Range tau_hat: ", round(range(tau_hat), 3), "\n")

Compare with the true individual effects:

# true individual effects from the data-generating process
tau_true <- d0$tau_community_wellbeing

# how well does the forest recover individual effects?
cat("Correlation(tau_hat, tau_true):", round(cor(tau_hat, tau_true), 3), "\n")
cat("RMSE:", round(sqrt(mean((tau_hat - tau_true)^2)), 3), "\n")

Visualise the distribution of predicted effects:

# histogram of predicted treatment effects
ggplot(data.frame(tau_hat = tau_hat), aes(x = tau_hat)) +
  geom_histogram(bins = 40, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = mean(tau_hat), colour = "red", linetype = "dashed") +
  labs(
    title = "Distribution of predicted treatment effects",
    x = expression(hat(tau)(x)),
    y = "Count"
  ) +
  theme_minimal()

Interpreting the histogram

If treatment effects were homogeneous, this histogram would be tightly concentrated around the ATE. A wide spread suggests heterogeneity: some people benefit more from community group participation than others. Estimation noise also spreads the predictions, so the histogram alone is not conclusive.
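
One caveat when reading such a histogram is that noisy estimates vary even when the true effect does not. The following base R sketch illustrates this with a deliberately crude estimator (a difference in means within bins of a covariate) rather than a causal forest. The data, the bin width, and all variable names are hypothetical:

```r
# toy data where the true effect is the same (0.2) for everyone
set.seed(2026)
n <- 4000
x <- runif(n)
w <- rbinom(n, 1, 0.5)
y <- x + 0.2 * w + rnorm(n)

# crude subgroup estimates: difference in means within 20 bins of x
bin <- cut(x, breaks = seq(0, 1, length.out = 21), include.lowest = TRUE)
tau_hat_bin <- as.numeric(tapply(seq_len(n), bin, function(i) {
  mean(y[i][w[i] == 1]) - mean(y[i][w[i] == 0])
}))

# the estimates are spread out although the true effect has zero variance
print(round(c(mean = mean(tau_hat_bin), sd = sd(tau_hat_bin)), 3))
```

The bin estimates scatter around 0.2 purely because of sampling error, which is why a formal test is needed before interpreting spread as heterogeneity.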

Test for heterogeneity

The test_calibration() function tests whether the forest has detected genuine heterogeneity, or whether the variation in the predictions $\hat{\tau}(x)$ is just noise.

# test for heterogeneity
cal_test <- test_calibration(cf)
print(cal_test)

Reading the calibration test

The key row is differential.forest.prediction. If its coefficient is significantly greater than zero (p < 0.05), the forest has detected meaningful variation in treatment effects beyond the overall mean; a coefficient near 1 suggests that variation is also well calibrated. The mean.forest.prediction row checks the average prediction: a coefficient near 1 suggests the forest's mean effect estimate is accurate.
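
The idea behind the two coefficients can be sketched in base R. This is not grf's implementation: it plugs in the known outcome mean and treatment probability of a toy simulation where grf would use forest estimates, and it uses made-up prediction vectors in place of forest output. All names here are hypothetical:

```r
# toy randomised experiment with a known effect surface
set.seed(2026)
n <- 5000
x <- rnorm(n)
w <- rbinom(n, 1, 0.5)                   # treatment probability e(x) = 0.5
tau <- 0.2 + 0.5 * x                     # true effect (hypothetical)
y <- 0.5 * x + w * tau + rnorm(n)

y_resid <- y - (0.5 * x + 0.5 * tau)     # Y minus E[Y | X], known here
w_resid <- w - 0.5                       # W minus e(X)

# regress residualised outcome on the mean and the centred predictions
calibration_coefs <- function(tau_hat) {
  mean_term <- w_resid * mean(tau_hat)
  diff_term <- w_resid * (tau_hat - mean(tau_hat))
  coef(lm(y_resid ~ 0 + mean_term + diff_term))
}

# predictions that track the truth versus predictions that are pure noise
coefs_signal <- calibration_coefs(tau + rnorm(n, sd = 0.2))
coefs_noise <- calibration_coefs(rnorm(n, mean = 0.2, sd = 0.5))
print(round(rbind(signal = coefs_signal, noise = coefs_noise), 2))
```

When the predictions carry real information about who benefits more, the diff_term coefficient is clearly positive; when they are noise, it sits near zero. That contrast is what the differential.forest.prediction row is designed to pick up.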

Variable importance

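Which covariates drive the heterogeneity? The variable_importance() function returns a depth-weighted measure of how frequently each variable is used for splitting in the forest. Frequent splitting suggests, but does not prove, that a variable modifies the effect: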

# variable importance
var_imp <- variable_importance(cf)
importance_df <- data.frame(
  variable = colnames(X),
  importance = as.numeric(var_imp)
) |>
  arrange(desc(importance))

print(importance_df)

Cross-reference with ground truth

The true treatment effect formula for community group participation on wellbeing is:

$\tau(x) = 0.20 + 0.10 \times \text{extraversion} + 0.05 \times \text{partner} - 0.03 \times \text{neuroticism}^{2}$

So extraversion, partner status, and neuroticism should appear as important variables. Does the forest recover this pattern?
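
The formula can be evaluated directly to see what sizes of effect to expect. This is a minimal base R sketch: tau_fn is a hypothetical helper name, not a function from causalworkshop, and the three covariate profiles are illustrative values that assume extraversion and neuroticism are on a standardised scale:

```r
# hypothetical helper: the ground-truth formula written as an R function
tau_fn <- function(extraversion, partner, neuroticism) {
  0.20 + 0.10 * extraversion + 0.05 * partner - 0.03 * neuroticism^2
}

# three illustrative covariate profiles
tau_baseline <- tau_fn(extraversion = 0, partner = 0, neuroticism = 0)  # 0.20
tau_outgoing <- tau_fn(extraversion = 1, partner = 1, neuroticism = 0)  # 0.35
tau_neurotic <- tau_fn(extraversion = 0, partner = 0, neuroticism = 2)  # 0.08

print(c(baseline = tau_baseline, outgoing = tau_outgoing, neurotic = tau_neurotic))
```

Because neuroticism enters squared, the effect drops by the same amount at neuroticism = -2 as at neuroticism = 2.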

Subgroup analysis

We can examine whether predicted effects differ across subgroups defined by the important covariates:

# compare effects by extraversion
high_extra <- tau_hat[d0$extraversion > 0]
low_extra <- tau_hat[d0$extraversion <= 0]

cat("Mean tau_hat (high extraversion):", round(mean(high_extra), 3), "\n")
cat("Mean tau_hat (low extraversion): ", round(mean(low_extra), 3), "\n")
cat("Difference:                      ", round(mean(high_extra) - mean(low_extra), 3), "\n")

# compare effects by partner status
partnered <- tau_hat[d0$partner == 1]
unpartnered <- tau_hat[d0$partner == 0]

cat("\nMean tau_hat (partnered):  ", round(mean(partnered), 3), "\n")
cat("Mean tau_hat (unpartnered):", round(mean(unpartnered), 3), "\n")
cat("Difference:                ", round(mean(partnered) - mean(unpartnered), 3), "\n")

Do the subgroup differences match the ground truth?

The tau formula adds $+ 0.10 \times \text{extraversion}$ and $+ 0.05 \times \text{partner}$. Highly extraverted and partnered individuals should show larger predicted treatment effects. Check whether this matches what you observe.
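
As a rough benchmark for these comparisons, the expected gaps in the true effect can be worked out from the formula. This sketch assumes that extraversion (written $e$ below) is approximately standard normal and that the other effect modifiers are balanced across the groups being compared. Neither assumption is stated in the lab, so check them against the simulated data:

$$
\begin{aligned}
\Delta_{\text{partner}} &= 0.05 \times (1 - 0) = 0.05, \\
\Delta_{\text{extraversion}} &= 0.10 \times \left( E[e \mid e > 0] - E[e \mid e \le 0] \right)
 = 0.10 \times 2\sqrt{2/\pi} \approx 0.16 .
\end{aligned}
$$

Forest predictions may be pulled towards the overall mean, so observed gaps somewhat smaller than these benchmarks would not be surprising.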

Predicted vs true effects scatter plot

# scatter plot of predicted vs true individual effects
ggplot(data.frame(true = tau_true, predicted = tau_hat),
       aes(x = true, y = predicted)) +
  geom_point(alpha = 0.1, colour = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "red") +
  labs(
    title = "Predicted vs true individual treatment effects",
    x = expression(tau(x)),
    y = expression(hat(tau)(x))
  ) +
  theme_minimal()

Key takeaway

Causal forests can detect meaningful heterogeneity in treatment effects without requiring the analyst to specify the functional form in advance. The test_calibration() function provides a formal test for heterogeneity, and variable_importance() indicates which covariates the forest splits on most often when modelling it. In Lab 8, we will use these individual predictions to evaluate targeting strategies.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

Different seed. Regenerate the data with simulate_nonlinear_data(n = 2000, seed = 42) instead of seed = 2026, then rerun compare_ate_methods(). Do the relative RMSE rankings change? Why or why not?
Different exposure-outcome pair. Fit a causal forest for volunteer_work on self_esteem. Run test_calibration() and variable_importance(). Which covariates drive heterogeneity? Does this match the ground-truth tau formula? (Hint: check the simulate_nzavs_data documentation.)
Why does OLS miss heterogeneity? In one paragraph, explain why a linear model that includes only main effects cannot capture the $- 0.03 \times \text{neuroticism}^{2}$ term in the treatment effect formula. What would you need to add to the linear model to capture this non-linearity?

PSYC 434: Conducting Research Across Cultures