Estimating ATE and CATE using Causal Forests

Introduction

The following sections briefly walk through ATE and CATE estimation and reporting using causal forests.

Method - Example Report

cat(boilerplate_generate_text(
  category = "template",
  sections = c("grf.simple"),
  # global_vars = list(
  #   name_outcomes_lower = name_outcomes_lower,
  #   name_exposure_lower = name_exposure_lower,
  #   name_exposure_capfirst = name_exposure_variable
  # ),
  db = unified_db
))

Results

Average Treatment Effects

ate_results$plot
Figure 1: Average Treatment Effects on Multi-dimensional Wellbeing
| Outcome            |   ATE | 2.5 % | 97.5 % | E-value | E-value bound |
|--------------------|------:|------:|-------:|--------:|--------------:|
| Volunteering Hours | 0.299 | 0.270 |  0.328 |   1.953 |         1.876 |
| Charitable Giving  | 0.182 | 0.163 |  0.201 |   1.641 |         1.589 |

Table 1: Average Treatment Effects on Multi-dimensional Wellbeing

Confidence intervals and E-values were adjusted for multiple comparisons using a Bonferroni correction (α = 0.05).
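As a concrete sketch (not the analysis code; the standard error below is inferred from the reported interval, and the approximate risk-ratio conversion follows VanderWeele and Ding), the adjusted interval and E-values for Volunteering Hours can be reproduced in a few lines:

```r
# Bonferroni adjustment across m = 2 outcomes widens the critical value;
# the E-value is the minimum strength of unmeasured confounding (on the
# risk-ratio scale) needed to explain the estimate away.
m         <- 2
alpha_adj <- 0.05 / m
z         <- qnorm(1 - alpha_adj / 2)         # about 2.24 rather than 1.96

ate <- 0.299                                  # Volunteering Hours, Table 1
se  <- 0.013                                  # inferred from the reported CI
ci  <- ate + c(-1, 1) * z * se                # approx. (0.270, 0.328)

# convert a standardised mean difference to an approximate risk ratio,
# then E = RR + sqrt(RR * (RR - 1))
rr      <- exp(0.91 * ate)
e_value <- rr + sqrt(rr * (rr - 1))           # approx. 1.95

# the E-value bound uses the CI limit closer to the null
rr_lo   <- exp(0.91 * ci[1])
e_bound <- rr_lo + sqrt(rr_lo * (rr_lo - 1))  # approx. 1.88
```

These reproduce the Table 1 entries for Volunteering Hours to rounding error.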

The following outcomes present reliable causal evidence for average treatment effects (E‑value lower bound > 1.2):

  • Volunteering Hours: 0.299 (0.270, 0.328); on the original scale, 0.388 (0.350, 0.426). E‑value bound = 1.88
  • Charitable Giving: 0.182 (0.163, 0.201); on the original scale, 0.228 (0.204, 0.252). E‑value bound = 1.59

Heterogeneous Treatment Effects

We begin by examining the distribution of individual treatment effects (τᵢ) across our sample. Figure 2 presents the estimated treatment effects for each individual, revealing substantial variability in how people respond to {name_exposure_lower}.

Estimated τᵢ values range from -0.351 to 0.748 across the two outcomes (Charitable Giving and Volunteering Hours).
Figure 2: Distribution of Individual Treatment Effects (τᵢ) Across Outcomes

The histograms above show considerable heterogeneity in treatment effects across individuals in the charitable giving condition. To determine whether this variability is systematic (i.e., predictable based on individual characteristics) rather than random noise, we employ two complementary approaches: Qini curves to assess the reliability of heterogeneous effects, and policy trees to identify subgroups with differential treatment responses.


Qini Curves

The Qini curve shows the cumulative gain as we expand a targeting rule down the CATE ranking.

  • Beneficial exposure: we add individuals from the top positive CATEs downward; the baseline is 'expose everyone'.
  • Detrimental exposure: we first flip the outcome direction (so higher values represent more harm), then add the exposure starting with individuals whose CATEs show the greatest harm, gradually including those predicted to be more resistant to harm; the baseline is 'expose everyone'. The curve therefore quantifies the harm incurred when those most susceptible to harm are exposed first.

If the Qini curve stays above its baseline, a targeted policy increases the outcome more than a one-size-fits-all alternative.
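The construction can be sketched in base R (an illustrative toy, not grf's doubly robust implementation; `tau_hat` stands in for the forest's CATE predictions):

```r
# Rank units by predicted CATE and track the expected per-capita gain from
# treating only the top q fraction; the comparison line treats everyone.
qini_sketch <- function(tau_hat, spend = seq(0.1, 1, by = 0.1)) {
  ord <- order(tau_hat, decreasing = TRUE)      # best responders first
  sapply(spend, function(q) {
    k <- ceiling(q * length(tau_hat))
    q * mean(tau_hat[ord][1:k])                 # gain per capita at spend q
  })
}

set.seed(1)
tau_hat  <- rnorm(1000, mean = 0.2, sd = 0.3)   # hypothetical CATE estimates
spend    <- seq(0.1, 1, by = 0.1)
curve    <- qini_sketch(tau_hat, spend)
baseline <- spend * mean(tau_hat)               # 'expose everyone' allocation
# With real heterogeneity the curve sits above the baseline at every partial
# spend level and meets it once everyone is treated (q = 1).
```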

RATE (AUTOC) Results
# only show if there are autoc results
if (length(model_groups$autoc) > 0) {
  cat(rate_interp$autoc_results)
} else {
  cat("No significant RATE AUTOC results were found.")
}

Evidence for heterogeneous treatment effects (policy = treat best responders) using AUTOC

AUTOC uses logarithmic weighting to focus treatment on top responders.

Positive RATE estimates for: Charitable Giving.

Estimates (Charitable Giving: 0.259 (95% CI 0.245, 0.273)) show robust heterogeneity.

Negative RATE estimates for: Volunteering Hours.

Estimates (Volunteering Hours: -0.156 (95% CI -0.178, -0.134)) caution against CATE prioritisation.
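The weighting idea can be illustrated with a base-R toy (a sketch only; grf's `rank_average_treatment_effect()` uses doubly robust scores on held-out data, and `tau` below is simulated): TOC(q) is the mean effect among the top q fraction ranked by CATE minus the overall mean, and AUTOC averages TOC(q) over q, which places comparatively heavy weight on the top of the ranking.

```r
# TOC(q): average CATE among the top q fraction minus the overall average.
toc <- function(tau, q) {
  ord <- order(tau, decreasing = TRUE)
  k   <- ceiling(q * length(tau))
  mean(tau[ord][1:k]) - mean(tau)
}

set.seed(42)
tau <- c(rnorm(900, 0, 0.1), rnorm(100, 0.5, 0.1))  # 10% strong responders
qs  <- seq(0.01, 1, by = 0.01)
toc_curve <- sapply(qs, toc, tau = tau)
autoc <- mean(toc_curve)  # simple Riemann approximation of the AUTOC integral
# A positive AUTOC signals that CATE-based prioritisation of top responders
# beats uniform assignment; TOC(1) is zero by construction.
```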

# only use if you have reliable qini results
if (length(reliable_ids) > 0) {
  cat(qini_gain$qini_explanation)
} else {
  cat("No significant heterogeneous treatment effects were detected using Qini curve analysis.")
}

The Qini curve compares targeted treatment allocation (based on individual treatment effects) with uniform allocation (based on the average treatment effect). Small differences in the expected value of treatment once the entire population is treated are expected because of out-of-sample evaluation (all estimates are tested on data the model has not seen). We computed expected policy effects from prioritising individuals by CATE at 10% and 40% spend levels.

  • Charitable Giving: at 10% spend, CATE prioritisation is beneficial (difference: 0.05 [95% CI 0.05, 0.06]); at 40% spend, CATE prioritisation is beneficial (difference: 0.12 [95% CI 0.10, 0.13]).
  • Volunteering Hours: at 10% spend, CATE prioritisation worsens outcomes compared to the ATE baseline; at 40% spend, there are no reliable benefits from CATE prioritisation.

# only use if you have multiple qini results
if (length(reliable_ids) > 0) {
  knitr::kable(
    qini_gain$summary_table |> 
      mutate(across(where(is.numeric), ~ round(., 2))),
    format = "markdown",
    caption = "Qini Curve Results"
  )
} else {
  cat("*Note: Qini curve table only displayed when multiple significant results are found.*")
}
| Model              | Spend 10%            | Spend 40%           |
|--------------------|----------------------|---------------------|
| Charitable Giving  | 0.05 [0.05, 0.06]    | 0.12 [0.10, 0.13]   |
| Volunteering Hours | -0.01 [-0.02, -0.00] | -0.01 [-0.03, 0.00] |

Table 2: Qini Curve Results
Figure 3: Qini Curves for Heterogeneous Treatment Effects

Decision Rules (Who is Most Sensitive to Treatment?)

cat(
  boilerplate::boilerplate_generate_text(
    category     = "results",
    sections     = c("grf.interpretation_policy_tree"),
    global_vars  = global_vars,
    db           = unified_db
  )
)

Policy Trees

We used policy trees (Sverdrup et al. 2024; Athey and Wager 2021a, 2021b) to find straightforward 'if-then' rules for who benefits most from treatment, based on participant characteristics. Because we flipped some measures, a higher predicted effect always means greater improvement. Policy trees can uncover small but important subgroups whose treatment responses stand out, even when the overall differences might be modest.

The following pages present policy trees for each outcome with reliable heterogeneous effects. Each tree shows: (1) the decision rules for treatment assignment, (2) the distribution of treatment effects across subgroups, and (3) visual representation of how covariates split the population into groups with differential treatment responses.
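As a rough base-R illustration of what a depth-1 rule does (a toy sketch only; the analysis itself uses the policytree package with doubly robust scores, and `age` and `tau_hat` below are simulated, not study data):

```r
# Search one covariate for the cutpoint that maximises total estimated
# benefit when each side of the split gets its own treat/control decision.
depth1_policy <- function(x, tau_hat) {
  best <- list(cut = NA, value = -Inf)
  for (cut in sort(unique(x))) {
    left  <- x <= cut
    # a leaf is assigned 'treated' only if its summed CATE is positive
    value <- max(sum(tau_hat[left]), 0) + max(sum(tau_hat[!left]), 0)
    if (value > best$value) best <- list(cut = cut, value = value)
  }
  best
}

set.seed(7)
age     <- rnorm(500)                                    # standardised age
tau_hat <- ifelse(age > -0.7, 0.2, -0.1) + rnorm(500, sd = 0.05)
rule    <- depth1_policy(age, tau_hat)
# rule$cut lands near the simulated threshold of -0.7, with the upper
# branch treated and the lower branch assigned to control.
```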



Policy Tree 1: Charitable Giving
Figure 4: Policy Trees for Treatment Assignment
# use this text below your decision tree graphs
if (length(reliable_ids) > 0) {
  cat(policy_text, "\n")
}

Policy Tree Interpretations (depth 1)

Findings for Charitable Giving at the end of study

The policy-tree analysis divided the sample on baseline Age. Respondents who scored ≤ -0.726 (original: -0.726) formed one branch and were assigned to the Control policy. Those with baseline Age > -0.726 (original: -0.726) formed the second branch and were assigned to Treated.

Treatment-effect heterogeneity

The policy tree produced two terminal leaves. Conditional average treatment effects (CATEs) were estimated within each leaf:

• Leaf 1 (baseline age ≤ -0.726; n = 2,357; 23.6% of the test set): 68.1% were recommended control. The mean outcome under control was -0.101, the mean under treatment was 0.090, yielding a CATE of 0.191.

• Leaf 2 (baseline age > -0.726; n = 7,643; 76.4% of the test set): 97.1% were recommended treated. The mean outcome under control was -0.108, the mean under treatment was 0.088, yielding a CATE of 0.196.

Overall policy performance

Across the full test set (N = 10,000), the policy prescribes control for 1,826 participants (18.3%) and treated for 8,174 participants (81.7%).
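The leaf-level CATEs above are simply within-leaf differences in mean outcomes; as a quick arithmetic check of the reported figures:

```r
# CATE per leaf = mean outcome under treatment minus mean under control,
# using the values reported in the policy-tree output above.
leaf1_cate <- 0.090 - (-0.101)   # Leaf 1: baseline age <= -0.726
leaf2_cate <- 0.088 - (-0.108)   # Leaf 2: baseline age >  -0.726
# these equal the reported CATEs of 0.191 and 0.196
```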

Discussion

S1: Estimating and Interpreting Heterogeneous Treatment Effects with GRF

Qini Curves

The Qini curve shows the cumulative gain as we expand a targeting rule down the CATE ranking.

  • Beneficial exposure: we add individuals from the top positive CATEs downward; the baseline is 'expose everyone'.
  • Detrimental exposure: we first flip the outcome direction (so higher values represent more harm), then add the exposure starting with individuals whose CATEs show the greatest harm, gradually including those predicted to be more resistant to harm; the baseline is 'expose everyone'. The curve therefore quantifies the harm incurred when those most susceptible to harm are exposed first.

S2: Strengths and Limitations of Causal Forests

cat(
  boilerplate::boilerplate_generate_text(
    category     = "discussion",
    sections     = c("strengths.strengths_grf_short"),
    global_vars  = global_vars,
    db           = unified_db
  )
)

We used causal forests to uncover how the treatment effect changes across people who differ in age, gender, baseline scores, and other measured characteristics (Tibshirani et al. 2024). This flexible, non-parametric method avoids the rigid functional‐form assumptions of linear or logistic regression and can capture complex, higher-order interactions that would otherwise be missed.

Tibshirani, Julie, Susan Athey, Erik Sverdrup, and Stefan Wager. 2024. grf: Generalized Random Forests. https://github.com/grf-labs/grf.
Sverdrup, Erik, Ayush Kanodia, Zhengyuan Zhou, Susan Athey, and Stefan Wager. 2024. policytree: Policy Learning via Doubly Robust Empirical Welfare Maximization over Trees. https://CRAN.R-project.org/package=policytree.
Athey, Susan, and Stefan Wager. 2021a. "Policy Learning With Observational Data." Econometrica 89 (1): 133–61. https://doi.org/10.3982/ECTA15732.
———. 2021b. "Policy Learning with Observational Data." Econometrica 89 (1): 133–61. https://doi.org/10.3982/ECTA15732.
Wager, Stefan, and Susan Athey. 2018. "Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests." Journal of the American Statistical Association 113 (523): 1228–42. https://doi.org/10.1080/01621459.2017.1319839.

To guard against over-fitting, we split the data: one portion trained the forest and the other evaluated its predictions (train/test ratio: 50/50). On the evaluation set we computed three complementary metrics: (i) RATE statistics, specifically the area under the targeting operator characteristic curve (AUTOC), which summarises how well the model ranks individuals from 'most likely to benefit' to 'least likely to benefit' based on their baseline characteristics; (ii) Qini curves, which show the cumulative gain achieved by treating successively larger fractions of the population, and so speak to the gains available at different spend levels; and (iii) policy trees, which convert the forest's complex predictions into a short set of human-readable if-then rules that can guide targeting in practice (Sverdrup et al. 2024; Athey and Wager 2021a, 2021b). Collectively, these tools clarify whether heterogeneous effects exist and how much extra benefit a data-driven targeting policy might yield over random allocation (Wager and Athey 2018).

Although causal forests improve on traditional parametric methods, every observational approach carries risks.

First, causal inference in observational settings inevitably relies on untestable ignorability assumptions (treatments are 'as good as random' conditional on measured covariates). Whether we have measured all factors that may jointly influence treatment assignment and the outcomes cannot be evaluated by statistical tests. If strong common causes of both the exposure and outcome are unobserved or poorly measured, our estimates will be biased. Interpreting subgroup findings can also be challenging: statistically significant differences are not always large enough to matter in practice. More basically, methods for detecting real differences rely on measures that are, in survey research, inherently noisy, and noise often, but not always, attenuates effects. The twin dangers of mistaking noise for signal, and of losing signal in noise, remain. Such limitations must be kept firmly in mind when interpreting results.

References