Week 8: Heterogeneous Treatment Effects and Machine Learning

Date: 29 Apr 2026

Readings

Required

Optional

  • VanderWeele et al. (2020)
  • Suzuki et al. (2020)
  • Bulbulia (2024)
  • Hoffman et al. (2023)

Key concepts

  1. ATE and CATE answer different causal questions.
  2. Causal forests estimate heterogeneity, not just average effects.
  3. Honest splitting and out-of-bag estimation reduce overfitting.
  4. RATE and Qini assess whether targeting has practical value.

Week 6 introduced the conditional average treatment effect (CATE): the contrast for a subgroup defined by baseline covariates at time zero. Estimating CATE with parametric models requires the analyst to guess the functional form. This week introduces causal forests, which learn the heterogeneity surface from data. The machine-learning step changes the estimator, not the identification logic: treatment must still be well-defined, subgroup variables must still precede treatment, and exchangeability and positivity still do the causal work.

Seminar

Motivating example

Suppose a university can fund a community-socialising programme for only 30% of students.

If we treat everyone, the budget is exceeded. If we choose badly, the impact is small.

So we need a defensible ranking of expected benefit.

From ATE to CATE

The average treatment effect is

$$ \text{ATE}=\mathbb{E}[Y(1)-Y(0)]. $$

The conditional average treatment effect is

$$ \tau(x)=\mathbb{E}[Y(1)-Y(0)\mid X=x]. $$

Here $X$ must be a baseline profile measured before treatment begins. ATE answers "does it work on average?" CATE answers "for whom does it work more or less?"
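As a toy illustration, the two estimands can be computed directly in a simulation where we generate both potential outcomes (never possible with real data; the numbers here are made up purely to make the contrast concrete):

```python
import numpy as np

# Hypothetical simulation: both potential outcomes are observable only
# because we generate them ourselves.
rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, n)               # baseline covariate, measured pre-treatment
y0 = rng.normal(0.0, 1.0, n)            # potential outcome under control
y1 = y0 + np.where(x == 1, 0.4, 0.0)    # treatment helps only the x = 1 subgroup

ate = np.mean(y1 - y0)                  # ATE: averages over both subgroups
tau_0 = np.mean((y1 - y0)[x == 0])      # CATE at x = 0: exactly 0
tau_1 = np.mean((y1 - y0)[x == 1])      # CATE at x = 1: approximately 0.4
```

The ATE here is about 0.2, which masks the fact that one subgroup gains nothing: "does it work on average?" and "for whom does it work?" have different answers.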

Why linear interaction models are often not enough

A model with one or two pre-specified interaction terms assumes the effect surface has a simple, known shape.

Real treatment-response surfaces are often non-linear and high-dimensional. In that setting, pre-specified terms can miss important structure.
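One way to see the limitation (a hypothetical simulation, not data from any study): make the true effect non-monotone in $x$, then fit the standard linear interaction model $y \sim x + w + w \cdot x$ by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(-1, 1, n)                     # baseline covariate
w = rng.integers(0, 2, n)                     # randomised treatment
tau = (np.abs(x) < 0.3).astype(float)         # true effect: 1 near x = 0, else 0
y = tau * w + rng.normal(0, 0.3, n)

# Linear interaction model: y = b0 + b1*x + b2*w + b3*(w*x)
X = np.column_stack([np.ones(n), x, w, w * x])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

tau_hat_at_0 = b2 + b3 * 0.0                  # model's implied effect at x = 0
# The true effect at x = 0 is 1.0, but the symmetric, non-monotone surface
# forces b3 toward 0, so the model reports roughly the average effect (~0.3)
# everywhere and misses the group that benefits most.
```

The pre-specified interaction term is not wrong so much as blind: it can only represent effects that change linearly in $x$.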

From regression trees to causal forests

Understanding causal forests requires three steps.

Regression tree. A regression tree splits the covariate space with yes/no questions ("Age $\le$ 20?", "Baseline wellbeing $> 0.3$?"). Each terminal leaf predicts the sample mean of the outcome for units that land there. The result is a piecewise-constant surface, not a global line. A single tree is interpretable but unstable: small changes in the data can shift splits and predictions.
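A minimal sketch of the splitting idea, assuming a single covariate and a depth-one tree (a "stump"); real implementations recurse over many covariates:

```python
import numpy as np

def best_split(x, y):
    """Choose the threshold minimising total within-leaf squared error."""
    best_t, best_sse = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t

def stump_predict(x_train, y_train, x_new):
    """Depth-one tree: each leaf predicts its training-sample mean."""
    t = best_split(x_train, y_train)
    left_mean = y_train[x_train <= t].mean()
    right_mean = y_train[x_train > t].mean()
    return np.where(x_new <= t, left_mean, right_mean)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 200)
y = np.where(x > 0, 1.0, 0.0) + rng.normal(0, 0.1, 200)  # noisy step function
t = best_split(x, y)                  # recovers a threshold near the true break at 0
pred = stump_predict(x, y, x)         # piecewise-constant, not a global line
```

The split lands near the true break at $x = 0$ because that is where separating the data most reduces within-leaf error; perturbing the sample can move the split, which is the instability that averaging over a forest addresses.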

Regression forest. A random forest grows many trees on bootstrap samples and averages their outputs. Averaging cancels much of the noise that makes any one tree unreliable (Breiman, 2001).

Causal forest. To estimate treatment contrasts rather than outcomes, each tree plays an "honest" two-step game (Wager & Athey, 2018). One subsample chooses splits that separate treated from control units. A different subsample estimates treatment-control contrasts within each leaf. The forest uses those leaf-level contrasts to estimate the CATE surface:

$$ \hat{\tau}(x) \approx \tau(x)=\mathbb{E}[Y(1) - Y(0) \mid X = x]. $$

Because individual leaf estimates are noisy, and their errors are only weakly correlated across trees, their average is far less variable. The progression matters: students cannot reason about causal forests without first understanding what a tree does and why averaging helps.
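Under strong simplifying assumptions (one covariate, randomised treatment, and a single depth-one split standing in for a full forest), the honest two-step can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.uniform(-1, 1, n)
w = rng.integers(0, 2, n)                      # randomised treatment
tau_true = np.where(x > 0, 0.8, 0.0)           # true CATE: effect only for x > 0
y = 0.5 * x + tau_true * w + rng.normal(0, 0.5, n)

# Honesty: one half of the data chooses the split, the other half estimates.
perm = rng.permutation(n)
split_idx, est_idx = perm[: n // 2], perm[n // 2:]

def leaf_contrast(idx, mask):
    """Treated-minus-control mean outcome within a leaf."""
    yi, wi = y[idx][mask], w[idx][mask]
    return yi[wi == 1].mean() - yi[wi == 0].mean()

def gap(t, idx):
    """How strongly threshold t separates treatment effects."""
    xm = x[idx]
    return abs(leaf_contrast(idx, xm <= t) - leaf_contrast(idx, xm > t))

# Step 1 (split sample): pick the threshold separating effects most sharply.
grid = np.linspace(-0.8, 0.8, 33)
t_star = grid[np.argmax([gap(t, split_idx) for t in grid])]

# Step 2 (estimation sample): leaf-level contrasts on held-out data.
tau_left = leaf_contrast(est_idx, x[est_idx] <= t_star)
tau_right = leaf_contrast(est_idx, x[est_idx] > t_star)
```

The split sample never contributes outcomes to the leaf contrasts, so a split that merely chased noise would not be rewarded at the estimation step; a causal forest repeats this game over many subsampled trees and averages.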

Key intuition

A regression tree tiles the covariate space into locally flat regions. A forest averages many such tiles to smooth away noise. A causal forest adds honest splitting so the averaged contrasts estimate treatment effects, not just predictions.

Pair exercise: from tree to forest to causal forest

  1. Explain the three-step progression to your partner in your own words.
  2. Name one strength and one weakness of a single regression tree.
  3. Explain why averaging many trees (a forest) helps with the weakness you identified.
  4. State the two differences between a regression forest and a causal forest: (a) what is the target quantity? (b) what does honest splitting add? Explain in one sentence why honest splitting is necessary when we estimate treatment contrasts rather than predictions.

Honest splitting and out-of-bag prediction

Honest splitting separates model selection from estimation. This separation matters because we estimate parameters for an entire population under two exposures, at most one of which is observed for any individual. If the same data chose the splits and estimated the contrasts, the forest would overfit to noise in the training sample.

The forest adds a second safeguard: out-of-bag (OOB) prediction. Each $\hat{\tau}(x_i)$ is averaged only over trees that never used observation $i$ in their split phase. OOB prediction is analogous to leave-one-out cross-validation but comes for free from the bootstrap structure.

Together, honesty and OOB estimation support consistent point estimates and asymptotically valid confidence intervals, even in high-dimensional settings with many covariates.
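The OOB bookkeeping itself is simple. In the sketch below, a stand-in "tree" just predicts its bootstrap-sample mean (purely to show the mechanics); each unit's prediction is averaged only over trees whose bootstrap sample missed it:

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_trees = 100, 500
y = rng.normal(0, 1, n)

oob_sum = np.zeros(n)
oob_count = np.zeros(n)
for _ in range(n_trees):
    boot = rng.integers(0, n, n)          # bootstrap sample for one "tree"
    in_bag = np.zeros(n, dtype=bool)
    in_bag[boot] = True
    fit = y[boot].mean()                  # stand-in for the tree's prediction
    oob_sum[~in_bag] += fit               # score only units this tree never saw
    oob_count[~in_bag] += 1

oob_pred = oob_sum / oob_count
# Each unit is out of bag for roughly exp(-1) ~ 37% of trees, so honest
# held-out averages come "for free", with no separate cross-validation loop.
```

The same accounting underlies $\hat{\tau}(x_i)$ in a causal forest: observation $i$ is scored only by trees that never used it in their split phase.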

Missing data in grf

grf can incorporate missingness directly into its splitting rules (the "missing incorporated in attributes", or MIA, approach) rather than deleting incomplete rows by default.

That can preserve sample size, but it does not remove identification concerns. We still need a causal missingness argument, and we still need covariates defined before treatment if they are to index $\tau(x)$.

Is heterogeneity actionable?

After estimating $\hat{\tau}(x)$, we rank units from largest to smallest estimated effect and ask: does treating high-ranked units first yield meaningfully larger gains than treating at random?

The Targeting Operator Characteristic (TOC) curve answers this question. It plots cumulative gain against treatment coverage:

$$ G(q)=\frac{1}{n}\sum_{i=1}^{\lfloor qn\rfloor}\hat{\tau}_{(i)},\qquad 0\le q\le 1, $$

where $\hat{\tau}_{(1)} \ge \hat{\tau}_{(2)} \ge \cdots$ are the sorted estimated effects. The horizontal axis $q$ is the fraction of the population we would treat. The vertical axis $G(q)$ is the cumulative gain from treating that top-$q$ slice.
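The cumulative-gain curve is straightforward to compute from a vector of estimated effects (the numbers below are made up for illustration):

```python
import numpy as np

def cumulative_gain(tau_hat):
    """G(q) at q = k/n: cumulative gain from treating the top-k ranked units."""
    ranked = np.sort(tau_hat)[::-1]       # largest estimated effect first
    return np.cumsum(ranked) / len(tau_hat)

tau_hat = np.array([0.9, 0.1, 0.5, -0.2, 0.3])   # hypothetical CATE estimates
g = cumulative_gain(tau_hat)
# g = [0.18, 0.28, 0.34, 0.36, 0.32]: the curve rises while the ranked effects
# are positive, then falls once we start "treating" units with negative
# estimated effects; G(1) equals the mean estimated effect.
```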

Two integrals of the TOC curve summarise targeting value:

RATE AUTOC (area under the TOC) puts equal weight on every value of $q$. It answers: if benefits concentrate among the best prospects, how much can we gain by selecting them? A steep initial rise followed by flattening indicates concentrated heterogeneity.

RATE Qini weights the mid-range of $q$ more heavily. It answers: at a realistic, moderate budget (say, treating 20–50% of individuals), does targeting improve on random allocation? Qini is the practical metric when investigators face a fixed budget constraint.
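A simplified discretisation of the two weightings can make the contrast concrete. The `autoc_like` and `qini_like` quantities below follow the descriptions above but are illustrative sketches, not grf's exact `rank_average_treatment_effect` implementation; both compare targeting against random allocation:

```python
import numpy as np

rng = np.random.default_rng(5)
tau_hat = np.sort(rng.normal(0.2, 0.3, 1000))[::-1]  # estimates, best first
n = len(tau_hat)
q = np.arange(1, n + 1) / n                # fraction of the population treated
gain = np.cumsum(tau_hat) / n              # G(q) as defined in the text
excess = gain - q * tau_hat.mean()         # targeting minus random allocation

autoc_like = excess.mean()                 # equal weight on every budget q
qini_like = (q * excess).mean()            # weight q downweights tiny budgets;
# since excess shrinks to 0 at q = 1, mid-range budgets dominate this summary.
# Both are positive here because the estimates genuinely vary; if tau_hat were
# constant, excess would be identically zero and targeting would add nothing.
```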

Both summaries must be computed on held-out or cross-fitted data, not in-sample rankings. The forest reuses information across trees through its bootstrap structure. Evaluating RATE or Qini on the training fold produces optimistic estimates. Computing them on a separate fold blocks this bias and yields honest confidence intervals.

Pair exercise: reading a TOC curve

  1. A university socialising programme produces a TOC curve that rises steeply for the top 20% of ranked students, then flattens.
  2. Interpret the shape: what does the steep rise mean about where treatment gains are concentrated?
  3. Suppose the AUTOC is large but the Qini at $q = 0.3$ (a 30% budget) is small. Explain what this combination means for a decision-maker with a fixed budget.
  4. Why would computing the TOC curve on the same data used to train the causal forest produce misleading results? State the problem in one sentence.

Workflow for this week

  1. Specify the causal estimand, treatment, time zero, and assumptions.
  2. Estimate $\hat{\tau}(x)$ with a causal forest.
  3. Evaluate targeting value with RATE/Qini.
  4. Decide whether heterogeneity is large enough to inform allocation.

Return to the opening example

Back to the university budget.

The question is not only whether the programme works. The question is whether gains are concentrated enough that targeting improves outcomes under a real budget constraint.

Week 9 turns that ranking into transparent policy rules.

Pair exercise: should we target?

  1. The ATE is 0.15 SD. The Qini curve is statistically significant at a budget of $q = 0.3$ (treating 30% of the population).
  2. State the causal estimand that the Qini addresses (what question does it answer beyond the ATE?).
  3. List two non-statistical reasons a decision-maker might choose not to target (consider ethics, logistics, or governance).
  4. A colleague argues "targeting lonely students for a socialising programme is stigmatising." Draft a two-sentence response that takes the concern seriously while explaining what the evidence does and does not show.

Lab materials: Lab 8: RATE and QINI Curves

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324

Bulbulia, J. A. (2024). A practical guide to causal inference in three-wave panel studies. PsyArXiv Preprints. https://doi.org/10.31234/osf.io/uyg3d

Hoffman, K. L., Salazar-Barreto, D., Rudolph, K. E., & Díaz, I. (2023). Introducing longitudinal modified treatment policies: A unified framework for studying complex exposures. https://doi.org/10.48550/arXiv.2304.09460

Suzuki, E., Shinozaki, T., & Yamamoto, E. (2020). Causal Diagrams: Pitfalls and Tips. Journal of Epidemiology, 30(4), 153–162. https://doi.org/10.2188/jea.JE20190192

VanderWeele, T. J., Mathur, M. B., & Chen, Y. (2020). Outcome-wide longitudinal designs for causal inference: A new template for empirical studies. Statistical Science, 35(3), 437–466.

Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242. https://doi.org/10.1080/01621459.2017.1319839