Week 8: Heterogeneous Treatment Effects and Machine Learning

Date: 29 Apr 2026

Key idea

Machine learning changes the estimator, not the identification logic. A causal forest can rank who is predicted to benefit most, but that ranking is worth acting on only when held-out evidence shows it beats treating at random.

Readings

Required

Optional

  • VanderWeele et al. (2020)
  • Suzuki et al. (2020)
  • Bulbulia (2024)
  • Hoffman et al. (2023)

Key concepts

  1. ATE and CATE answer different causal questions.
  2. Causal forests estimate heterogeneity, not just average effects.
  3. Honest splitting, sample splitting, and out-of-bag prediction answer different overfitting problems.
  4. Doubly-robust scores are the bridge from unobserved potential outcomes to evaluating rankings.
  5. RATE, AUTOC, and Qini assess whether a CATE ranking has practical value.

Week 6 introduced the conditional average treatment effect (CATE): the contrast for a subgroup defined by baseline covariates at time zero. Investigators can estimate CATEs for theory-driven subgroups by stratifying first, then estimating effects separately within each subgroup. They can also use interaction models, which require a chosen functional form. This week introduces causal forests, which learn the heterogeneity surface from data. The machine-learning step changes the estimator, not the identification logic: treatment must still be well-defined, subgroup variables must still precede treatment, and exchangeability and positivity still do the causal work. What machine learning adds is the ability to estimate $\tau(x)$ at high resolution without committing in advance to a particular shape for the heterogeneity.

Where we are in the heterogeneity sequence

  • Week 6: define effect modification and CATE.
  • Week 8: estimate CATE rankings and ask whether rankings contain useful targeting information.
  • Week 9: turn modelled heterogeneity into interpretable policy trees.
  • Week 10: ask whether the measured outcomes and covariates support the interpretation we place on the tree.
  • Final assessment: report outcome-wide ATEs, then interpret policy trees as readable targeting summaries.

Seminar

Motivating example

Suppose a university can fund a community-socialising programme for only thirty percent of students. Treating everyone is infeasible. Choosing badly wastes scarce places. We need a defensible ranking of expected benefit, and we need to know whether the ranking carries enough information to be worth the administrative cost of targeting at all.

This budget example motivates ranking. In the assignment, policy trees identify interpretable treatment regions using the policytree reward objective; the treated share is whatever the selected rule implies. The right question this week is: does the CATE ranking contain enough real information to make targeting worth considering? If the answer is no, the simplest defensible scarce-budget policy may be random allocation among eligible students. The week's tools help investigators answer that question with held-out evidence rather than asserted intuition.

From ATE to CATE

Relying on the average treatment effect alone can hide large differences in who benefits. Today's question is how large that variation is, how to estimate it from data, and whether the variation is useful enough to justify tailoring an exposure to individual circumstances.

The average treatment effect is

$$ \text{ATE}=\mathbb{E}[Y(1)-Y(0)]. $$

The conditional average treatment effect is

$$ \tau(x)=\mathbb{E}[Y(1)-Y(0)\mid X=x]. $$

Here $X$ must be a baseline profile measured before treatment begins. ATE answers "does it work on average?" CATE answers "for whom does it work more or less?"

Recall that the individual treatment effect $Y_i(1) - Y_i(0)$ is not identified, because for any single person we observe at most one of the potential outcomes. CATE is the most granular contrast we can identify: the average of individual effects across people who share covariate profile $x$. Two people with identical baseline covariates may differ in their realised potential outcomes, so $\tau(x)$ smooths over within-profile variation that is, by definition, beyond reach.
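A small simulation can make the ATE/CATE distinction concrete. The sketch below is an illustration only: the binary covariate, the effect sizes (0.4 and 0.0), and the sample size are all hypothetical, and the code sees both potential outcomes only because it generates them.

```python
import random
from statistics import mean

rng = random.Random(8)

# Hypothetical population: binary baseline covariate x.
# The effect is 0.4 when x = 1 and 0.0 when x = 0.
units = []
for _ in range(4000):
    x = rng.randint(0, 1)
    y0 = rng.gauss(0.0, 1.0)              # potential outcome under control
    y1 = y0 + (0.4 if x == 1 else 0.0)    # potential outcome under treatment
    units.append((x, y0, y1))

# Only a simulation exposes both potential outcomes for the same unit.
ate = mean(y1 - y0 for _, y0, y1 in units)
cate = {
    level: mean(y1 - y0 for x, y0, y1 in units if x == level)
    for level in (0, 1)
}
print(f"ATE = {ate:.2f}; CATE(x=0) = {cate[0]:.2f}; CATE(x=1) = {cate[1]:.2f}")
```

The average sits between the two conditional effects, so reporting the ATE alone would hide the fact that, in this toy population, only one profile benefits.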

Outcome-wide effects before heterogeneity

Before asking whether effects vary across people, ask whether the treatment appears to affect the outcome at all, and whether that evidence is robust across the outcomes investigators care about. Investigators rarely look at one outcome in isolation. A programme that improves sense of purpose but worsens belonging, self-esteem, or life satisfaction raises a different decision problem from a programme whose benefits point in the same direction across outcomes.

For the final assignment, the core sequence is outcome-wide ATEs followed by policy trees. Students choose one exposure, estimate average treatment effects across the prespecified outcome family, and then report policy trees as interpretable targeting rules. RATE and Qini belong to the Week 8 teaching sequence: they help explain whether a CATE ranking contains targeting information. They are not the assignment's main heterogeneity output. The assignment asks for policy trees because a tree turns modelled heterogeneity into a rule a non-specialist can read, question, and apply.

Keep the objects separate:

| Object | Question | Output |
|---|---|---|
| Outcome-wide ATEs | Which outcomes show credible average evidence? | Four-outcome ATE table or plot |
| CATE | For whom does the effect vary? | $\hat{\tau}(X)$ estimates |
| RATE / Qini | Does the ranking carry targeting value? | Diagnostic curves and summaries |
| Policy tree | Can we state a readable rule? | Depth-1 or depth-2 allocation rule |
| Measurement checks | Do the variables mean what the rule assumes? | Cautions about construct meaning and comparability |

Outcome-wide average treatment effect (ATE) forest plot for community-group participation across four outcomes (sense of purpose, belonging, self-esteem, life satisfaction). Each row shows the estimated average treatment effect on the risk-difference scale with a 95% confidence interval, accompanied by an E-value bound that quantifies how strong an unmeasured confounder would need to be on the relative-risk scale to explain the result away. Estimates whose E-value bound is close to 1 are fragile: relatively modest unmeasured confounding could explain those rows away. The final assignment uses this outcome-wide ATE view before interpreting policy trees.

A single-outcome CATE example

The CATE machinery below sits inside this larger assignment logic. We pick one outcome at a time to make the heterogeneity story concrete. The final report applies the outcome-wide ATE plus policy-tree sequence across the prespecified outcomes.

Histogram of estimated subgroup-level treatment effects from a small causal forest fit on the simulated NZAVS panel (outcome: sense of purpose). Each bar counts how many people received an estimate $\hat{\tau}(x)$ in that bin. The orange line marks the average treatment effect; the dashed line marks zero. A wide spread is the empirical signature of heterogeneity, not of individual-level estimates: each bar is a count of people whose covariate profile produced an estimate near that value.

Causal estimand, statistical estimand, estimator

The same workflow that organised Weeks 5 and 6 still organises this week. The full ten-step version lives on the causal workflow reference page; the five-step digest below is the version specific to heterogeneous treatment effects.

Workflow for heterogeneous treatment effects

  1. State the causal estimand. The target is $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$: the conditional average treatment effect at baseline profile $X = x$, where $X$ is measured at time zero.
  2. Defend the identifying assumptions. Conditional exchangeability $Y(a) \coprod A \mid X$, consistency, and positivity must hold within every covariate profile that indexes $\tau(x)$. The machine-learning step does not weaken these requirements.
  3. Construct a statistical estimand that targets the causal one. Under the assumptions in step 2, the causal contrast equals an observable contrast: $\tau(x) = \mathbb{E}[Y \mid A = 1, X = x] - \mathbb{E}[Y \mid A = 0, X = x]$. This is the quantity we estimate from data.
  4. Estimate it. A causal forest learns the two conditional expectations non-parametrically, with honest splitting and out-of-bag prediction (below). The forest returns $\hat{\tau}(x)$ for each unit without committing in advance to a functional form for the heterogeneity.
  5. Evaluate before deciding. Report the ATE, the distribution of $\hat{\tau}(x)$, and targeting metrics (RATE, Qini) on held-out data, accompanied by the standard sensitivity analyses (E-values, missing-data diagnostics). Treat RATE and Qini as evidence about whether the CATE ranking is useful. Week 9 asks the separate policy-learning question: whether a short allocation rule can recover enough of that value to be worth using.

The bridge that does the causal work is step 3. Without it, the forest is a flexible regression and nothing more. With it, the forest's two conditional means inherit causal meaning from the identification assumptions, and their difference $\hat{\tau}(x)$ targets $\tau(x)$. Machine learning replaces the guessed functional form of an interaction model with a learned one; it does not replace the identification argument.
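Written out, the step-3 bridge is a two-line argument. The derivation below is a sketch that uses only the assumptions named in step 2; positivity is what guarantees that both conditional means on the last line are defined at profile $x$.

$$ \begin{aligned} \tau(x) &= \mathbb{E}[Y(1)\mid X=x]-\mathbb{E}[Y(0)\mid X=x] \\ &= \mathbb{E}[Y(1)\mid A=1, X=x]-\mathbb{E}[Y(0)\mid A=0, X=x] && \text{(conditional exchangeability)} \\ &= \mathbb{E}[Y\mid A=1, X=x]-\mathbb{E}[Y\mid A=0, X=x] && \text{(consistency)}. \end{aligned} $$

If exchangeability fails, the second equality fails, and no amount of flexibility in the estimator repairs it.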

Three-stage pipeline showing how heterogeneous-treatment-effect analysis bridges what we want to know to what the forest produces. Left: the causal estimand $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$ is the unobservable contrast we care about. Centre: under exchangeability, consistency, and positivity, the causal estimand equals an observable contrast of conditional means $\mathbb{E}[Y \mid A = 1, X = x] - \mathbb{E}[Y \mid A = 0, X = x]$, which is the statistical estimand the forest can target. Right: the causal forest estimates the two conditional means non-parametrically and returns $\hat{\tau}(x)$. Identification is what carries causal meaning across the first arrow; estimation is what carries it across the second.

Why pre-specified subgroup checks are often not enough

If theory names a subgroup in advance, a simple non-parametric check is to split the data into those strata and estimate the ATE separately in each subgroup. For example, investigators might estimate the effect among younger and older adults separately, then compare the two subgroup estimates. This is descriptive heterogeneity analysis, not policy optimisation: it reports how $\hat{\tau}$ varies across investigator-defined groups.

The strength of stratification is clarity. The weakness is that the answer is only as broad as the strata chosen. Subgroup estimates can also be noisy when strata are small, and repeated subgroup searches create the same multiplicity and over-interpretation problems that outcome-wide designs try to discipline. Use stratified subgroup comparisons when there is a theoretical reason to expect a difference; be cautious when the subgroup list is exploratory.
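The stratify-then-estimate check can be sketched in a few lines. The example assumes a hypothetical randomised study in which "older" marks a baseline stratum holding about a fifth of the sample; the effect sizes (0.1 and 0.5) are invented for illustration.

```python
import random
from statistics import mean, variance

rng = random.Random(6)

# Hypothetical randomised study: true effect is 0.1 for younger adults, 0.5 for older adults.
rows = []
for _ in range(3000):
    older = rng.random() < 0.2            # older adults form the smaller stratum
    a = rng.randint(0, 1)                 # treatment assigned by coin flip
    y = rng.gauss(0.0, 1.0) + a * (0.5 if older else 0.1)
    rows.append((older, a, y))

def stratum_estimate(rows, older):
    """Difference in means and its standard error within one stratum."""
    treated = [y for o, a, y in rows if o == older and a == 1]
    control = [y for o, a, y in rows if o == older and a == 0]
    diff = mean(treated) - mean(control)
    se = (variance(treated) / len(treated) + variance(control) / len(control)) ** 0.5
    return diff, se

for label, flag in (("younger", False), ("older", True)):
    diff, se = stratum_estimate(rows, flag)
    print(f"{label}: estimate = {diff:.2f}, SE = {se:.2f}")
```

The smaller stratum carries the larger standard error, which is the noise problem described above: the more finely the sample is cut, the less each subgroup estimate can be trusted.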

Linear interaction models provide another pre-specified check, but they add modelling assumptions about shape.

A small interaction model assumes a simple shape. If the analyst writes $\tau(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ and stops there, every covariate must enter linearly and the heterogeneity along $x_1$ must be the same regardless of $x_2$. Real treatment-response surfaces are often non-linear, contain interactions among more than two variables, and carry sharp regime changes (the effect of a programme on purpose may climb steeply with extraversion up to a saturation point, then flatten). A pre-specified interaction model can miss all three features.

A pre-specified linear interaction model can still be useful when the scientific question is a specific, pre-declared contrast. For discovery, it is a weak test of heterogeneity because its answer is limited to the variables, interactions, and functional form chosen in advance.

A non-parametric estimator like a causal forest sidesteps the functional-form problem by letting the data suggest where to draw splits and how deep to go. The benefit is reach: instead of testing a few named subgroups, the forest searches over the whole baseline feature space $X$. The cost is interpretability: the forest is no longer a small set of coefficients you can read off, and a high-dimensional $\hat{\tau}(X)$ surface is hard to translate into a defensible policy decision. Week 9 handles that cost by fitting shallow policy trees on top of the forest.

Single-covariate sketch contrasting a parametric linear interaction model with a causal forest under a non-linear truth. The blue curve is the true treatment-effect function $\tau(x)$: the effect climbs through an intermediate range of extraversion and saturates at high values. The dashed orange line is the best-fitting linear interaction model; because it is committed in advance to a constant slope, it overshoots in the flat region at low extraversion, undershoots through the steep climb, and undershoots again on the plateau. The green step function is a causal-forest approximation; its piecewise-constant shape hugs the truth without committing to a functional form. Shaded orange regions mark where the linear model's structural error is largest. With more trees and more data, the forest's steps grow finer and the green curve smooths toward the blue.

From regression trees to causal forests

Understanding causal forests requires three steps.

Regression tree. A regression tree splits the covariate space with yes/no questions ("Age $\le$ 20?", "Baseline purpose $> 0.3$?"). Each terminal leaf predicts the sample mean of the outcome for units that land there. The result is a piecewise-constant surface, not a global line. A single tree is interpretable but unstable: small changes in the data can shift splits and predictions. The instability is the price of letting the data choose the splits.
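One split of a regression tree can be sketched as a search over candidate cutpoints. The data, the variable names, and the `min_leaf` guard below are hypothetical; real tree software adds recursion, several covariates, and stopping rules.

```python
import random

rng = random.Random(1)

# Hypothetical data: the outcome jumps by 1 when age exceeds 20.
ages = [rng.uniform(15, 30) for _ in range(400)]
outcomes = [(1.0 if age > 20 else 0.0) + rng.gauss(0.0, 0.3) for age in ages]

def sse(values):
    """Sum of squared deviations from the leaf mean."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values)

def best_split(x, y, min_leaf=20):
    """Try each observed value as a cutpoint; keep the one with the lowest total SSE."""
    best_cut, best_loss = None, float("inf")
    for cut in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue                       # skip splits that leave a tiny leaf
        loss = sse(left) + sse(right)
        if loss < best_loss:
            best_cut, best_loss = cut, loss
    return best_cut

cut = best_split(ages, outcomes)
print(f"chosen cutpoint: age <= {cut:.2f}")
```

Rerunning with a different seed moves the chosen cutpoint slightly, which is a small-scale view of the instability that forests are designed to average away.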

A two-covariate sketch of how a single causal tree partitions the covariate space into rectangular leaves. Filled circles mark treated units; open triangles mark control units. The tree's three splits (one along $X_1$, two along $X_2$) carve the plane into four leaves, and each leaf returns its own treated-minus-control contrast $\hat{\tau}$. The value of $\hat{\tau}$ jumps at every split boundary and is constant inside each leaf, which is what makes the surface piecewise-constant. A causal forest averages many such tiled surfaces, each cut differently, to produce a smoother estimate of $\tau(x)$.

Regression forest. A random forest grows many trees on bootstrap samples and averages their outputs. Averaging cancels much of the noise that makes any one tree unreliable (Breiman, 2001). The price is interpretability: the forest's prediction surface is the average of hundreds of jagged tile patterns, with no single tree responsible for any particular prediction.

Causal forest. Each tree still asks "where should I split the covariate space?" — but the question it answers is different. A regression tree looks for splits that group units with similar outcomes: it finds cutpoints where the average value of $Y$ jumps. A causal tree looks for splits that group units with similar treatment effects: it finds cutpoints where the gap between treated and control units jumps. So the tree's splits land on covariates that change the effect of treatment, not covariates that merely predict the outcome.

Each tree then plays an honest two-step game on its training subsample (Wager & Athey, 2018). The first half decides where the splits go. The second, non-overlapping half estimates the treated-minus-control gap inside each resulting leaf. This is called honesty: the data that choose a promising subgroup are separate from the data used to estimate that subgroup's treatment contrast. The forest averages those leaf-level gaps across hundreds of trees to estimate the CATE surface:

$$ \hat{\tau}(x) \approx \tau(x)=\mathbb{E}[Y(1) - Y(0) \mid X = x]. $$

The leaf-level contrast for an individual leaf is just the mean outcome among treated units in that leaf minus the mean among controls. Any single leaf estimate is noisy, but the errors differ from tree to tree and partly cancel, so the average across hundreds of trees is far less variable. The progression matters: students cannot reason about causal forests without first understanding what a tree does and why averaging helps.
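The honest two-step game can be sketched for a single split. This is a simplified stand-in, not grf's algorithm: the splitting criterion (squared difference between the two leaves' treated-minus-control gaps), the candidate cutpoints, and the simulated effect of 0.8 above $x = 0.5$ are all assumptions made for illustration.

```python
import random
from statistics import mean

rng = random.Random(3)

# Hypothetical randomised data: the treatment effect is 0.8 when x > 0.5 and 0 otherwise.
def simulate(n):
    rows = []
    for _ in range(n):
        x = rng.random()
        a = rng.randint(0, 1)
        y = rng.gauss(0.0, 1.0) + a * (0.8 if x > 0.5 else 0.0)
        rows.append((x, a, y))
    return rows

def gap(rows):
    """Treated-minus-control difference in mean outcomes."""
    treated = [y for _, a, y in rows if a == 1]
    control = [y for _, a, y in rows if a == 0]
    return mean(treated) - mean(control)

def choose_cut(rows, candidates):
    """Pick the cutpoint whose two leaves have the most different gaps (simplified criterion)."""
    def contrast(cut):
        left = [r for r in rows if r[0] <= cut]
        right = [r for r in rows if r[0] > cut]
        return (gap(right) - gap(left)) ** 2
    return max(candidates, key=contrast)

rows = simulate(4000)
half_a, half_b = rows[:2000], rows[2000:]                  # disjoint halves

cut = choose_cut(half_a, [c / 10 for c in range(2, 9)])    # half A places the split
left_b = [r for r in half_b if r[0] <= cut]                # half B estimates the leaf gaps
right_b = [r for r in half_b if r[0] > cut]
print(f"cut = {cut}, left gap = {gap(left_b):.2f}, right gap = {gap(right_b):.2f}")
```

Because half B played no part in choosing the cutpoint, its leaf gaps are not inflated by the search that found the split.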

Key intuition

A regression tree tiles the covariate space into locally flat regions. A forest averages many such tiles to smooth away noise. A causal forest adds honest splitting so the averaged contrasts estimate treatment effects, not just predictions.

Pair exercise: from tree to forest to causal forest

  1. Explain the three-step progression to your partner in your own words.
  2. Name one strength and one weakness of a single regression tree.
  3. Explain why averaging many trees (a forest) helps with the weakness you identified.
  4. State the two differences between a regression forest and a causal forest: (a) what is the target quantity? (b) what does honest splitting add? Explain in one sentence why honest splitting is necessary when we estimate treatment contrasts rather than predictions.

Honest splitting and out-of-bag prediction

Three related ideas appear in this lecture: honesty, out-of-bag prediction, and held-out evaluation. They all reduce overfitting, but they act at different points in the workflow.

Honesty happens inside each tree. It separates model selection from estimation. The first half of a tree's training subsample decides where to split; the second half estimates the leaf-level treatment-control gap. The two halves do not share information. This separation matters because we estimate parameters under two exposures, at most one of which is observed for any individual. If the same data chose the splits and estimated the contrasts, the forest would chase lucky treatment-control gaps. The leaves that look most informative on the training sample would often be the leaves whose gaps were inflated by chance. The CATE estimates would inherit that inflation.

Schematic of honest splitting inside a single tree. The bootstrap sample for tree $t$ is partitioned into two disjoint halves. Half A (left) chooses cutpoints for the splits using a criterion that looks for places where the treated-minus-control gap changes; the resulting tree skeleton has empty leaves. Half B (right) routes its rows down that fixed skeleton and computes the treated-minus-control contrast $\hat{\tau}$ in each leaf. Because no row is used by both halves, the data that chose a leaf cannot inflate the contrast measured within that leaf. The forest averages many such trees, each grown on a fresh bootstrap sample with a fresh A/B split.

Out-of-bag (OOB) prediction happens after the trees are grown. Each tree is trained on a subsample, so some observations are left out of that tree. An OOB prediction for observation $i$ averages only trees that did not train on observation $i$. OOB is close in spirit to leave-one-out prediction: the case being scored was not used to fit the trees that score it. In a causal forest, OOB predictions help keep $\hat{\tau}(x_i)$ from looking too good simply because observation $i$ helped build the model.
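The bookkeeping behind OOB prediction is short enough to sketch. The "trees" here are toy one-split predictors with a fixed cutpoint, and the data are hypothetical; the point is only which trees are allowed to score which unit.

```python
import random
from statistics import mean

rng = random.Random(4)

n, n_trees = 200, 50
x = [rng.random() for _ in range(n)]
y = [2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]   # hypothetical outcome

def fit_stump(idx):
    """A toy 'tree': one split at 0.5, leaf means computed from the subsample only."""
    left = [y[i] for i in idx if x[i] <= 0.5]
    right = [y[i] for i in idx if x[i] > 0.5]
    return mean(left), mean(right)

trees = []
for _ in range(n_trees):
    idx = set(rng.sample(range(n), n // 2))        # subsample without replacement
    trees.append((idx, fit_stump(idx)))

def oob_predict(i):
    """Average only the trees whose subsample did not contain unit i."""
    preds = [
        (leaf_left if x[i] <= 0.5 else leaf_right)
        for idx, (leaf_left, leaf_right) in trees
        if i not in idx
    ]
    return mean(preds) if preds else None          # None if every tree used unit i

oob = [oob_predict(i) for i in range(n)]
print(f"OOB prediction for unit 0: {oob[0]:.2f}")
```

Each unit is scored by roughly half of the trees, and never by a tree that saw its own outcome.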

Held-out evaluation is broader sample splitting at the analysis level. We may fit the forest on one fold and evaluate a targeting curve on another fold. Honesty protects leaf-level effect estimation; OOB protects unit-level forest prediction; held-out evaluation protects the claim that a targeting rule would work beyond the data used to learn it.

Together, these safeguards support more credible heterogeneity estimates in high-dimensional settings with many covariates. They do not create exchangeability, fix poor measurement, or make post-treatment covariates safe.

Doubly-robust scores: the evaluation bridge

The next problem is evaluation. A CATE ranking says who is expected to benefit most, but to evaluate that ranking we need to ask counterfactual questions such as: what would the mean outcome be if the top 30% by $\hat{\tau}(X)$ received treatment? We do not observe both $Y_i(1)$ and $Y_i(0)$ for any person, so raw outcomes cannot answer that question.

The workaround is to create action-specific pseudo-outcomes. For each treatment action $a$, the augmented inverse-propensity-weighted (AIPW) score combines two pieces of information:

  1. an outcome-model prediction for what would happen under action $a$;
  2. a correction from the people who actually received action $a$, weighted by how probable that action was for their covariates.

The score is

$$ \Gamma_i(a) = \underbrace{\mu_a(X_i)}_{\textcolor{blue}{\text{outcome model prediction}}} + \underbrace{\frac{\mathbb{1}\lbrace A_i = a\rbrace}{\Pr(A = a \mid X_i)}}_{\textcolor{teal}{\text{IPW weight}}}\thinspace\underbrace{\bigl(Y_i - \mu_a(X_i)\bigr)}_{\textcolor{red}{\text{residual correction}}}, $$

where $\mu_a(x)$ is the causal forest's estimate of the outcome surface under treatment $a$, and $\Pr(A = a \mid x)$ is the propensity for treatment $a$ given covariates.

The score is called doubly robust because averaging $\Gamma_i(a)$ can recover the mean potential outcome $\mathbb{E}[Y(a)]$ if either the outcome model or the propensity model is well specified, along with the usual causal assumptions. Both models can still be wrong, and unmeasured confounding still matters. "Doubly robust" means there are two statistical routes to the same estimand, not that the analysis is protected against every threat.

Targeting metrics and the policy-tree evaluation in Week 9 use AIPW scores rather than raw outcomes. The scores let us evaluate rules that assign different actions to different people, even though each person was observed under only one action.
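The AIPW score can be computed by hand on simulated data. The sketch below is an illustration under loud assumptions: the propensity is treated as known, the outcome model is deliberately wrong (a constant), and the data-generating values, including the true ATE of 0.3, are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(5)

# Hypothetical confounded data: larger x raises both the chance of treatment and the outcome.
rows = []
for _ in range(5000):
    x = rng.random()
    e = 0.2 + 0.6 * x                       # true propensity Pr(A = 1 | x), assumed known here
    a = 1 if rng.random() < e else 0
    y = x + 0.3 * a + rng.gauss(0.0, 0.5)   # true ATE is 0.3
    rows.append((x, e, a, y))

def mu(action, x):
    """Deliberately misspecified outcome model: ignores both the action and x."""
    return 0.5

def aipw_score(row, action):
    """Gamma_i(action): model prediction plus an inverse-propensity-weighted residual."""
    x, e, a, y = row
    prob = e if action == 1 else 1.0 - e
    indicator = 1.0 if a == action else 0.0
    return mu(action, x) + indicator / prob * (y - mu(action, x))

ate_aipw = mean(aipw_score(r, 1) - aipw_score(r, 0) for r in rows)
treated = [y for _, _, a, y in rows if a == 1]
control = [y for _, _, a, y in rows if a == 0]
ate_naive = mean(treated) - mean(control)
print(f"naive difference = {ate_naive:.2f}; AIPW estimate = {ate_aipw:.2f}")
```

Here the propensity route rescues a bad outcome model, which is one of the two routes the term "doubly robust" refers to. With a wrong propensity as well, the estimate would have no such protection.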

Is heterogeneity actionable?

After estimating $\hat{\tau}(x)$, we rank units from largest to smallest estimated effect. That ranking is not yet a policy. It is a diagnostic object: a list from "most expected benefit" to "least expected benefit". The evaluation question is whether this list contains enough real information to improve allocation.

The first question is: does treating high-ranked units first yield meaningfully larger gains than treating at random?

The Targeting Operator Characteristic (TOC) curve answers this question. For each treatment coverage $q$, it plots how much larger the average treatment effect is among the top-$q$ ranked individuals than the average effect across the whole population:

$$ \text{TOC}(q)=\frac{1}{\lfloor qn\rfloor}\sum_{i=1}^{\lfloor qn\rfloor}\hat{\tau}_{(i)}\;-\;\frac{1}{n}\sum_{i=1}^{n}\hat{\tau}_i,\qquad 0 < q \le 1, $$

where $\hat{\tau}_{(1)} \ge \hat{\tau}_{(2)} \ge \cdots$ are the sorted estimated effects. The first term is the mean estimated effect in the top-$q$ slice; the second is the overall ATE. The horizontal axis $q$ is the fraction of the population we would treat. The vertical axis is the gain from selecting that top-$q$ slice by the CATE ranking rather than choosing a slice of the same size at random: a random slice has the overall ATE as its average effect, so the TOC measures how far the ranked slice beats it. At $q = 1$ the top slice is the whole population, so the TOC returns to zero.
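The TOC definition translates directly into code. The function name `toc` and the ten estimated effects are hypothetical; the sketch applies the formula to a list of $\hat{\tau}$ values exactly as written, whereas applied work evaluates the curve with held-out doubly-robust scores.

```python
import math
from statistics import mean

def toc(tau_hat, q):
    """Mean estimated effect among the top-q share minus the mean over everyone."""
    ranked = sorted(tau_hat, reverse=True)
    k = math.floor(q * len(ranked))
    if k < 1:
        raise ValueError("q is too small to select any unit")
    return mean(ranked[:k]) - mean(ranked)

# Hypothetical estimated effects for ten units.
tau_hat = [0.9, 0.7, 0.4, 0.3, 0.2, 0.2, 0.1, 0.0, -0.1, -0.2]
for q in (0.2, 0.5, 1.0):
    print(q, round(toc(tau_hat, q), 3))
```

With these numbers the top fifth averages 0.8 against an overall mean of 0.25, so the gain at $q = 0.2$ is 0.55, and the gain shrinks to zero as coverage reaches the whole population.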

TOC curve from the same small causal forest used above. Targeting rate $q$ on the horizontal axis; gain over random assignment on the vertical axis. A steep early rise indicates concentrated heterogeneity: the highest-ranked individuals carry most of the treatment benefit. A flat curve indicates none. The curve must return to zero at $q = 1$, since at full coverage every targeting rule is identical to "treat everyone".

The TOC curve is the most direct answer to "does targeting beat random?" but it carries no single-number summary. Two summaries of the curve are useful. Both are forms of the RATE (rank-weighted average treatment effect), a family of summaries that average targeting gains over parts of the ranked population:

RATE AUTOC means the area under the TOC curve. It puts equal weight on every value of $q$. It answers: across all possible treatment coverages, how much can we gain by selecting people using the CATE ranking rather than selecting at random? A large AUTOC indicates concentrated heterogeneity. A small AUTOC indicates that the ranking carries little practical information.

RATE Qini weights the middle of the ranked population more heavily. It answers: at realistic, moderate coverages, does targeting improve on random allocation? Qini is the practical metric when investigators face a fixed budget constraint, for example treating 20-50% of eligible people.

These metrics tell us whether the ranking has value. They do not tell us whether the ranking is understandable, fair, or administratively usable.
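Both summaries can be sketched as weighted averages of the TOC curve. The code assumes one common convention, in which AUTOC weights every coverage $q$ equally and Qini weights $\text{TOC}(q)$ by $q$; it approximates each integral by an average over the grid $q = 1/n, \dots, 1$. The function names and both example rankings are hypothetical.

```python
from statistics import mean

def toc_curve(tau_hat):
    """TOC at q = 1/n, 2/n, ..., 1 for effects ranked from largest to smallest."""
    ranked = sorted(tau_hat, reverse=True)
    overall = mean(ranked)
    n = len(ranked)
    running, curve = 0.0, []
    for k, value in enumerate(ranked, start=1):
        running += value
        curve.append((k / n, running / k - overall))
    return curve

def rate(tau_hat, target="AUTOC"):
    """Average TOC over the grid: uniform weight for AUTOC, weight q for Qini."""
    curve = toc_curve(tau_hat)
    if target == "AUTOC":
        return mean(gain for _, gain in curve)
    if target == "QINI":
        return mean(q * gain for q, gain in curve)
    raise ValueError("target must be 'AUTOC' or 'QINI'")

concentrated = [1.0] * 10 + [0.0] * 90     # hypothetical: all benefit sits in the top 10%
flat = [0.1] * 100                          # hypothetical: nothing to rank on

for name, effects in (("concentrated", concentrated), ("flat", flat)):
    print(name, round(rate(effects, "AUTOC"), 3), round(rate(effects, "QINI"), 3))
```

The two rankings share the same average effect of 0.1, yet only the concentrated one yields a summary clearly above zero. That is the sense in which these metrics measure the value of the ranking and not the size of the ATE.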

Reading a Qini curve

Qini curve produced by margot::margot_plot_qini_batch() for a single-outcome causal forest fit on the simulated NZAVS panel (n = 5 000, 1 000 trees, exposure community-group participation, outcome purpose). The horizontal axis is the spend rate (proportion treated); the vertical axis is the average gain in outcome over a no-treatment baseline. The dashed grey line traces the policy "treat everyone equally", which by construction reaches the average treatment effect at $q = 1$. The solid orange line traces the CATE-targeted policy: at each spend level, the rule treats the top-$q$ share of individuals by estimated benefit. Vertical markers at 10% and 40% spend mark common policy budgets. The gap between the orange and grey lines at any $q$ is the lift from targeting at that budget.

Read the curve in three passes.

First, look at the right edge. Both lines must meet at $q = 1$, because a hundred-percent treatment rate is treat-everyone whatever the rule. The shared end-point is the ATE.

Second, look at the gap between the orange and grey lines across the middle range. In this fit the orange line sits roughly $0.05$ above the grey line at the 40% marker and roughly $0.02$ above at the 10% marker. Those gaps are the lift from targeting at those budgets, expressed in the outcome's standard-deviation units. The 0.05 lift at 40% spend says that a CATE-targeted rule covering forty percent of the population delivers, on average, about $0.05$ more standard-deviation units of purpose per eligible person, averaged across the whole eligible population, than a treat-everyone-equally policy at the same coverage.

Third, look at the shape near $q = 0$. A curve that rises steeply over the leftmost few percent indicates that the very top of the ranking is heavily concentrated. A curve that hugs the diagonal everywhere indicates that the ranking carries no useful targeting information. Under a fixed budget, random allocation may then be more defensible than a complex targeting system. Without a fixed budget, the simpler choice may be to offer treatment uniformly.

Both summaries, AUTOC and Qini, must be computed on held-out data, not on in-sample rankings. In-sample, the same observations both shape the ranking and score it, so chance patterns the forest picked up are counted as targeting gains. Evaluating RATE or Qini on the training fold therefore produces optimistic estimates. Computing them on a separate fold blocks this bias and yields more credible confidence intervals.
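The optimism of in-sample evaluation shows up even when there is no heterogeneity at all. In the hypothetical simulation below every unit has the same effect (0.2), the "CATE estimator" is a toy that computes the treated-minus-control gap within each of 50 covariate levels, and the targeting gain is a single point of the TOC at 30% coverage. None of this is grf's procedure; it only isolates the train-versus-test contrast.

```python
import random
from statistics import mean

rng = random.Random(7)
N_GROUPS = 50

# Hypothetical randomised data with NO true heterogeneity: the effect is 0.2 for everyone.
def simulate(n):
    rows = []
    for _ in range(n):
        group = rng.randrange(N_GROUPS)    # a baseline covariate with 50 levels
        a = rng.randint(0, 1)
        rows.append((group, a, rng.gauss(0.0, 1.0) + 0.2 * a))
    return rows

def gap(rows):
    """Treated-minus-control difference in mean outcomes."""
    treated = [y for _, a, y in rows if a == 1]
    control = [y for _, a, y in rows if a == 0]
    return mean(treated) - mean(control)

def group_gaps(rows):
    """Toy CATE estimator: the gap within each covariate level."""
    return {g: gap([r for r in rows if r[0] == g]) for g in range(N_GROUPS)}

def top_share_gain(rows, tau_hat, share=0.3):
    """Gap among the top-ranked levels minus the gap in everyone."""
    ranked = sorted(tau_hat, key=tau_hat.get, reverse=True)
    top = set(ranked[: round(share * len(ranked))])
    return gap([r for r in rows if r[0] in top]) - gap(rows)

train, test = simulate(2000), simulate(2000)
tau_hat = group_gaps(train)                      # ranking learned on the training fold only
in_sample = top_share_gain(train, tau_hat)       # scored on the data that built the ranking
held_out = top_share_gain(test, tau_hat)         # scored on data the ranking never saw
print(f"in-sample gain = {in_sample:.2f}; held-out gain = {held_out:.2f}")
```

The in-sample gain is positive purely because the ranking selected the levels that happened to look good in the training fold. The held-out figure is the honest one.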

Pair exercise: reading a TOC curve

  1. A university socialising programme produces a TOC curve that rises steeply for the top 20% of ranked students, then flattens.
  2. Interpret the shape: what does the steep rise mean about where treatment gains are concentrated?
  3. Suppose the AUTOC is large but the Qini at $q = 0.3$ (a 30% budget) is small. Explain what this combination means for a decision-maker with a fixed budget.
  4. Why would computing the TOC curve on the same data used to train the causal forest produce misleading results? State the problem in one sentence.

Workflow for this week

The week's tools fit together as a short pipeline:

  1. Specify the causal estimand: treatment, time zero, target population, outcome, and the identification assumptions.
  2. Fit a causal forest with honest splitting on baseline covariates measured at time zero.
  3. Check that heterogeneity is real. The causal-forest calibration test asks whether $\hat{\tau}(x)$ varies across people more than chance alone would produce; the coefficient to read is the one grf labels differential.forest.prediction, where a value near 1 means the heterogeneity estimates track real variation and a value near 0 means there is little to use. Run this early, to avoid spending time on a flat forest.
  4. Estimate targeting value with RATE-AUTOC and RATE-Qini on held-out data, and inspect the slope of the Qini curve around plausible budgets.
  5. Run sensitivity analysis on the underlying ATE before staking a policy on it.
  6. Decide whether heterogeneity is large enough to inform allocation. If yes, Week 9 turns the ranking into a transparent rule. If no, use a simpler allocation rule: random allocation under a fixed budget, or uniform provision when treatment can be offered to everyone.

Treat these steps as diagnostics, not as a verdict. A positive calibration test or RATE is evidence that effects vary, but variation improves on a blanket policy only when the better action differs across covariate regions, or a budget forces a choice. Whether a particular rule actually wins is settled by the policy tree's held-out policy value, compared against the blanket policies of treating everyone or treating no-one (Week 9).

Return to the opening example

Back to the university budget. The question is not only whether the programme works. The question is whether gains are concentrated enough that targeting improves outcomes under a real budget constraint, and whether the targeting evidence survives the standard sensitivity checks. A causal forest with a Qini curve that hovers near the diagonal is also a finding: it tells the university that the CATE ranking may not justify selective allocation. The university could then use a transparent lottery among eligible students, expand capacity, or spend the targeting budget on a different question.

The hard problem at the end of this lecture is opacity. A causal forest predicts over the entire feature space $X$. That is a major gain over conventional subgroup analysis, because it can discover heterogeneity we did not specify in advance. It is also a decision problem: a high-dimensional ranking is difficult to explain, contest, audit, or implement. Week 9 begins by returning to the outcome-wide screen, because policy work should start from outcomes with credible average evidence. It then uses policy trees as a workaround for the ranking problem. A shallow policy tree gives up some targeting precision in exchange for a rule that a decision-maker can read, defend, and apply.

Pair exercise: should we target?

  1. The ATE is 0.15 SD. The Qini gain over random allocation is statistically significant at a budget of $q = 0.3$ (treating 30% of the population).
  2. State the causal estimand that the Qini addresses (what question does it answer beyond the ATE?).
  3. List two non-statistical reasons a decision-maker might choose not to target (for example, stigma, logistics, cost, or who gets to decide).
  4. A colleague argues "targeting lonely students for a socialising programme is stigmatising." Draft a two-sentence response that takes the concern seriously while explaining what the evidence does and does not show.

Missing data in grf

grf can treat missingness in a covariate as a splitting attribute (the MIA approach, missingness incorporated in attributes) instead of deleting incomplete rows. A missing value can itself be informative: it can correlate with treatment, with the outcome, or with both. MIA lets the tree send "missing" units down whichever branch produces the cleaner treatment-control contrast.

That can preserve sample size, but it does not remove identification concerns. We still need a causal missingness argument, and we still need covariates defined before treatment if they are to index $\tau(x)$. MIA is a convenience for estimation; it does not absolve the analyst of the duty to model why data are missing.
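The routing idea behind MIA can be sketched for one fixed cutpoint. This is a simplification and not grf's code: it uses a regression criterion (sum of squared errors) where a causal forest would use a treatment-contrast criterion, it ignores the further option of splitting on missing versus observed, and the data are hypothetical.

```python
import random

rng = random.Random(9)

# Hypothetical data: x goes missing (None) only among units whose true x is high.
rows = []
for _ in range(600):
    x = rng.random()
    y = (1.0 if x > 0.5 else 0.0) + rng.gauss(0.0, 0.3)
    if x > 0.5 and rng.random() < 0.4:
        x = None                      # missingness depends on the unseen value
    rows.append((x, y))

def sse(values):
    """Sum of squared deviations from the leaf mean."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values)

def mia_direction(rows, cut):
    """For a fixed cutpoint, route missing values left or right, whichever fits better."""
    observed_left = [y for x, y in rows if x is not None and x <= cut]
    observed_right = [y for x, y in rows if x is not None and x > cut]
    missing = [y for x, y in rows if x is None]
    loss_if_left = sse(observed_left + missing) + sse(observed_right)
    loss_if_right = sse(observed_left) + sse(observed_right + missing)
    return "left" if loss_if_left < loss_if_right else "right"

direction = mia_direction(rows, cut=0.5)
print(f"missing values are routed {direction}")
```

The tree recovers where the missing units belong from their outcomes, which helps prediction. It says nothing about why the values are missing, which is the question the causal argument still has to answer.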

Further material

Susan Athey's lecture on causal forests gives a deeper account of the statistical machinery introduced above. The relevant material on causal forests starts around the eighteen-minute mark.


Lab materials: Lab 8: RATE and QINI Curves

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324

Bulbulia, J. A. (2024). A practical guide to causal inference in three-wave panel studies. PsyArXiv Preprints. https://doi.org/10.31234/osf.io/uyg3d

Hoffman, K. L., Salazar-Barreto, D., Rudolph, K. E., & Díaz, I. (2023). Introducing longitudinal modified treatment policies: A unified framework for studying complex exposures. arXiv. https://doi.org/10.48550/arXiv.2304.09460

Suzuki, E., Shinozaki, T., & Yamamoto, E. (2020). Causal diagrams: Pitfalls and tips. Journal of Epidemiology, 30(4), 153–162. https://doi.org/10.2188/jea.JE20190192

VanderWeele, T. J., Mathur, M. B., & Chen, Y. (2020). Outcome-wide longitudinal designs for causal inference: A new template for empirical studies. Statistical Science, 35(3), 437–466.

Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242. https://doi.org/10.1080/01621459.2017.1319839