Lab 9: Policy Trees (multi-outcome workflow with margot)

R script

Download the R script for this lab

The script checks that the required packages are already installed and downloads the ~80 MB cache on first run. Run the setup block below before class, restart R, then run the script.

This lab moves from a CATE ranking to an explicit allocation rule. Lab 8 used RATE and QINI curves to ask whether targeting has value. Lab 9 asks how to turn that information into a short rule someone could read, explain, and contest.

What you will learn

  1. How to read depth-1 and depth-2 policy trees.
  2. How to select tree depth using a stated parsimony threshold.
  3. How to interpret policy coverage: the share of people the rule treats.
  4. How to translate one tree into plain language.
  5. How to state the limits of a policy-tree rule.

Connection to previous labs

Lab 5 introduced the average treatment effect. Lab 6 introduced the conditional average treatment effect (CATE). Lab 8 evaluated whether targeting based on a CATE ranking has practical value. This lab adds the next idea: turning a CATE ranking into a transparent allocation rule.

How this lab is different

Earlier labs taught the pieces separately: average treatment effects, conditional average treatment effects (CATEs), causal forests, RATE, and QINI. Today's lab uses a cached margot workflow so we can put those pieces together and spend the session reading policy trees. The cache is a teaching object: it gives us fitted forests, depth comparisons, policy-tree plots, and prose summaries without asking every laptop to refit the models during class.

Treat the course workflow as the standard for this course. The full manuscript workflows are more complex because they answer different questions, use real data, and add extra diagnostics. This lab teaches the sequence of decisions students need for the course: estimate effects, inspect heterogeneity, state a parsimony rule, read the allocation rule, and explain its limits.

Cached fits

The cache holds three artefacts:

  • models_binary — the batch causal forest (one per outcome) with augmented inverse-propensity-weighted (AIPW) scores, out-of-bag predictions, and the combined ATE table.
  • policy_tree_stability — bootstrap-based stability output for each outcome's depth-1 and depth-2 policy trees.
  • policy_workflow — the interpretive layer: depth recommendations, plots, and auto-generated prose summaries.
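To make the AIPW scores stored in models_binary less abstract, here is a minimal base-R sketch on simulated data. It assumes a randomised exposure with a known propensity of 0.5 and uses arm means as a deliberately crude outcome model. All variable names are hypothetical, and this is not how margot or grf fit their nuisance models.

```r
# Hypothetical simulated data: randomised binary exposure, true ATE = 0.3.
set.seed(2026)
n <- 2000
w <- rbinom(n, 1, 0.5)
y <- 1 + 0.3 * w + rnorm(n)

e_hat <- 0.5                 # known propensity in this toy experiment
m1_hat <- mean(y[w == 1])    # crude outcome model: treated-arm mean
m0_hat <- mean(y[w == 0])    # crude outcome model: control-arm mean

# AIPW (doubly robust) score for each person:
# model-based contrast plus inverse-propensity-weighted residuals.
aipw_score <- (m1_hat - m0_hat) +
  w * (y - m1_hat) / e_hat -
  (1 - w) * (y - m0_hat) / (1 - e_hat)

# The ATE estimate is the mean score; its standard error comes from the
# spread of the scores.
ate_hat <- mean(aipw_score)
se_hat <- sd(aipw_score) / sqrt(n)
ci_95 <- ate_hat + c(-1, 1) * qnorm(0.975) * se_hat

round(c(estimate = ate_hat, lower = ci_95[1], upper = ci_95[2]), 3)
```

With arm means as the outcome model, the mean score reduces to the simple difference in means. The correction terms start to matter once the outcome model and propensity depend on covariates.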

The fit script that produced the cache is at scripts/fit-lab-09-cache.R. You do not need to run it during the lab; it is there so you can see exactly what was fitted.

Teaching simplification

The cache is deliberately smaller than the lab's research scripts. It is designed to teach the sequence of decisions, not to reproduce the full employer-gratitude analysis workflow.

Setup

Install packages before class, then restart R. Follow Lab Setup: R Packages and Build Tools first. The lab script no longer tries to build GitHub packages while the lab is running, because this failed on some student laptops and can take a long time even when it works.

cran_packages <- c(
  "ggplot2", "dplyr", "tibble", "arrow", "qs2", "googledrive"
)
missing_cran <- cran_packages[
  !vapply(cran_packages, \(p) requireNamespace(p, quietly = TRUE), logical(1))
]
if (length(missing_cran) > 0) install.packages(missing_cran)

if (!requireNamespace("pak", quietly = TRUE)) install.packages("pak")
if (!requireNamespace("margot", quietly = TRUE)) {
  pak::pak("go-bayes/margot")
}
if (!requireNamespace("causalworkshop", quietly = TRUE) ||
    packageVersion("causalworkshop") < "0.6.2") {
  pak::pak("go-bayes/causalworkshop")
  if ("causalworkshop" %in% loadedNamespaces()) {
    stop("causalworkshop was upgraded; please restart R and re-run.", call. = FALSE)
  }
}

suppressPackageStartupMessages({
  library(causalworkshop)
  library(margot)
  library(ggplot2)
  library(dplyr)
})

If the GitHub package installation still fails after installing build tools, stop there and use the course lab machine or the pre-installed lab environment. Do not try to debug compilers during the lab.

Load the cached fits. The first call downloads roughly 80 MB into a per-user cache directory; subsequent calls read from disk.

cache <- causalworkshop::load_policy_learning_cache()

models_binary <- cache$models_binary
policy_tree_stability <- cache$policy_tree_stability
wf <- cache$policy_workflow

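For reference, the cache was produced by the following sequence. Do not run it in the lab: it would overwrite the cached objects, and it relies on label_mapping (defined below) and on df_grf, which is not created in this lab.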

# models_binary <- margot::margot_causal_forest(...)
# full call shown in the R script; it is abbreviated here because it is long

policy_tree_stability <- margot::margot_policy_tree_stability(
  model_results = models_binary,
  depth = 2,
  n_iterations = 100,
  vary_type = "split_only",
  label_mapping = label_mapping,
  seed = 2026
)

wf <- margot::margot_policy_workflow(
  stability = policy_tree_stability,
  original_df = df_grf,
  label_mapping = label_mapping,
  audience = "policy",
  interpret_models = "recommended",
  plot_models = "recommended"
)

The first line is abbreviated because fitting the four causal forests takes too long for the lab. The full call is commented in the R script.

The four outcomes are purpose, belonging, self-esteem, and life satisfaction at wave 2. The exposure is community-group participation at wave 1. We pass a label mapping to every plot and table call so the figures are legible.

label_mapping <- list(
  model_t2_purpose = "Sense of Purpose",
  model_t2_belonging = "Belonging",
  model_t2_self_esteem = "Self-esteem",
  model_t2_life_satisfaction = "Life satisfaction"
)

Step 1: quick evidence check

The combined ATE table holds one row per outcome with the risk-difference effect, a 95% confidence interval, and two E-values. This is a quick check before reading policy trees. Do not spend the lab reinterpreting RATE or QINI; that was Lab 8.

print(models_binary$combined_table)

Convert the table into a forest plot:

ate_plot <- margot_plot(
  models_binary$combined_table,
  type = "RD",
  order = "magnitude_desc",
  e_val_bound_threshold = 1.2,
  label_mapping = label_mapping,
  save_path = tempdir()
)
print(ate_plot$plot)

Reading the forest plot

The dashed line at zero is the null. Estimates whose E-value bound is close to 1 are fragile: a relatively modest unmeasured confounder could explain the effect away. Treat those rows with caution when you describe the result. The plot's $interpretation slot contains a one-sentence draft naming the outcomes most worth treating as causal. Read it as a draft, not a verdict.
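As a rough guide to why an E-value near 1 signals fragility, the sketch below implements the standard E-value formula for a risk ratio. It is an illustration only: the lab's outcomes are continuous, the conversion margot applies to them is not shown here, and the risk ratios are hypothetical rather than taken from the lab's table.

```r
# E-value for a risk ratio: RR + sqrt(RR * (RR - 1)).
# Ratios below 1 are inverted first so the same formula applies.
e_value_rr <- function(rr) {
  stopifnot(is.numeric(rr), all(rr > 0))
  rr <- ifelse(rr < 1, 1 / rr, rr)
  rr + sqrt(rr * (rr - 1))
}

# Hypothetical risk ratios, chosen only to show the shape of the function.
rr_examples <- c(1.05, 1.2, 2)
e_value_table <- data.frame(
  rr = rr_examples,
  e_value = round(e_value_rr(rr_examples), 2)
)
print(e_value_table)
```

A ratio of exactly 1 gives an E-value of 1, which is why bounds close to 1 mean that even a weak unmeasured confounder could account for the estimate.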

Step 2: quick heterogeneity check

The ATE asks "does it work on average?". The next question is "does it work the same for everyone?". margot_omnibus_hetero_test() wraps grf::test_calibration(). The output's "Differential prediction" coefficient is what matters: a positive coefficient with a small p-value is evidence that the forest has picked up heterogeneity beyond sampling noise.

omnibus <- margot_omnibus_hetero_test(
  models_binary,
  label_mapping = label_mapping
)
print(omnibus)

What if heterogeneity is absent?

If an outcome shows reliable mean prediction but no differential prediction, the forest has found no evidence that effects vary across people, so targeting is unlikely to beat random allocation for that outcome. In that case, reporting only the ATE is the honest move.
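The idea behind the calibration test can be sketched as a two-term regression: one term for the mean prediction and one for the deviation of each prediction from that mean. The example below is a simplified oracle version on simulated data. It assumes a randomised exposure, uses the true effects as a stand-in for forest predictions, and plugs in known nuisance functions, none of which holds in the real test. All names are hypothetical.

```r
# Hypothetical oracle setting: randomised exposure, known nuisance functions.
set.seed(2026)
n <- 2000
x <- rnorm(n)
w <- rbinom(n, 1, 0.5)
tau <- 1 + x                       # true effects vary with x
y <- x + w * tau + rnorm(n)

e_hat <- 0.5                       # propensity score
m_hat <- x + e_hat * tau           # E[Y | X], averaging over the exposure
tau_hat <- tau                     # stand-in for out-of-bag forest predictions

# Two regressors: the mean prediction and each person's deviation from it,
# both scaled by the centred exposure.
mean_pred <- mean(tau_hat) * (w - e_hat)
diff_pred <- (tau_hat - mean(tau_hat)) * (w - e_hat)
y_resid <- y - m_hat

calibration <- lm(y_resid ~ 0 + mean_pred + diff_pred)
round(coef(summary(calibration)), 3)
```

Both coefficients should sit near 1 here because the predictions are the truth. If tau_hat carried no information about who benefits more, the diff_pred coefficient would be indistinguishable from zero.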

Step 3: policy-tree summary tables

Start with the policy-tree summary. Coverage is the share of participants the learned rule recommends for treatment. This is an output of the tree, not a fixed budget chosen by the analyst.

cat("\n=== policy-tree one-page summary ===\n")
print(wf$policy_brief_df)

Then compare depth-1 and depth-2. A deeper tree is useful only if the gain is worth the extra complexity.

cat("\n=== depth comparison ===\n")
print(wf$best$depth_summary_df)

The depth choice is not mechanical. Investigators must state how much extra policy value is needed before a depth-2 tree is worth using. margot exposes that choice through min_gain_for_depth_switch.

depth_thresholds <- c(0.005, 0.01, 0.03)
depth_sensitivity <- dplyr::bind_rows(lapply(depth_thresholds, \(threshold) {
  best_at_threshold <- suppressMessages(suppressWarnings(
    margot::margot_policy_summary_compare_depths(
      policy_tree_stability,
      label_mapping = label_mapping,
      min_gain_for_depth_switch = threshold,
      verbose = FALSE
    )
  ))
  best_at_threshold$depth_summary_df |>
    dplyr::transmute(
      threshold = threshold,
      outcome = outcome_label,
      selected_depth = depth_selected,
      pv_depth1 = pv_depth1,
      pv_depth2 = pv_depth2,
      gain_depth2_minus_depth1 = pv_depth2 - pv_depth1
    )
}))

cat("\n=== depth selection sensitivity ===\n")
print(depth_sensitivity)

Parsimony threshold

A threshold of 0.005 says a depth-2 tree is worth using only if it adds at least 0.005 outcome units of policy value over the depth-1 tree. A threshold of 0.03 says the extra depth must buy at least 0.03 units. The threshold is an investigator judgement about interpretability, implementation cost, and expected value.
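The parsimony rule can be written in a few lines. This sketch assumes the switch to depth 2 happens when the gain is at least the threshold; margot's exact comparison may differ, for example in how it handles ties or uncertainty. The policy values and the helper select_depth() are hypothetical.

```r
# Hypothetical policy values for one outcome (not taken from the cache).
pv_depth1 <- 0.10
pv_depth2 <- 0.12

# Prefer the simpler tree unless depth 2 gains at least `min_gain`.
select_depth <- function(pv1, pv2, min_gain) {
  ifelse(pv2 - pv1 >= min_gain, 2L, 1L)
}

thresholds <- c(0.005, 0.01, 0.03)
depth_choice <- data.frame(
  threshold = thresholds,
  gain = pv_depth2 - pv_depth1,
  selected_depth = select_depth(pv_depth1, pv_depth2, thresholds)
)
print(depth_choice)
```

The same gain of 0.02 selects depth 2 under the two smaller thresholds and depth 1 under the largest, which is why the threshold has to be stated before reading the trees.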

Coverage matters

If a policy tree recommends treatment for 90-97% of people, it is mostly saying "treat nearly everyone". That can be a valid learned rule, but it is not a scarce-budget allocation policy. If a real programme can only treat 20% of people, the budget constraint must be added separately.

In this lab the policy-tree objective contains no treatment-cost term. A "do not treat" leaf means the outcome-only rule assigns the no-treatment action for that profile; it does not report money saved or resources freed. A cost-sensitive policy would need a different objective, for example subtracting a cost c from the treatment reward and refitting the tree across plausible values of c.
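To see how a cost term changes coverage, the sketch below subtracts a per-person cost from a handful of hypothetical benefit estimates and treats only when the net benefit is positive. It uses a simple threshold on individual scores instead of a refitted tree, so it illustrates the direction of the effect and nothing more. The values and the helper coverage_at_cost() are invented for illustration.

```r
# Hypothetical individual benefits of treatment, in outcome units.
tau_hat <- c(-0.10, 0.02, 0.05, 0.10, 0.20)

# Treat only when the benefit exceeds the per-person cost.
coverage_at_cost <- function(tau, cost) mean(tau - cost > 0)

costs <- c(0, 0.04, 0.15)
cost_table <- data.frame(
  cost = costs,
  coverage = vapply(costs, \(k) coverage_at_cost(tau_hat, k), numeric(1))
)
print(cost_table)
```

Coverage falls as the cost rises, so a rule that treats nearly everyone at zero cost can become selective once the cost is expressed in outcome units.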

Step 4: render policy trees

A policy tree converts the personalised CATE into a transparent allocation rule: "treat people in this leaf, do not treat in this leaf". The lab caps depth at two, so each person passes through at most two yes/no questions before the rule commits (a depth-2 tree has at most three split nodes in total).
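A depth-1 rule can be found by brute force, which helps show what a policy tree optimises. The sketch below tries every cut point on a single hypothetical covariate and keeps the rule with the highest average gain relative to treating no one. It is a toy version of tree search on simulated scores, not the algorithm margot calls.

```r
# Hypothetical data: one covariate and a noisy per-person benefit score.
set.seed(2026)
n <- 500
x <- runif(n)
tau <- ifelse(x > 0.6, 0.3, -0.3)     # treatment helps only when x > 0.6
score <- tau + rnorm(n, sd = 0.2)     # stand-in for a doubly robust score

# Try every cut point and both leaf labellings; keep the best rule.
fit_depth1_rule <- function(x, score) {
  best <- list(value = -Inf)
  for (cut_point in sort(unique(x))) {
    for (treat_left in c(TRUE, FALSE)) {
      treated <- if (treat_left) x <= cut_point else x > cut_point
      value <- mean(score * treated)  # average gain relative to treating no one
      if (value > best$value) {
        best <- list(
          cut_point = cut_point,
          treat_left = treat_left,
          value = value,
          coverage = mean(treated)
        )
      }
    }
  }
  best
}

rule <- fit_depth1_rule(x, score)
str(rule)
```

The returned coverage is an output of the search: nothing in the objective fixes how many people the rule treats.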

Two functions render each outcome's policy tree. margot_plot_decision_tree() returns the tree diagram alone, the rule as a slide-ready flowchart. margot_plot_policy_tree() returns the prediction-points scatter, the same rule shown as a partition of the underlying covariate space. The script saves every plot to lab-09-policy-tree-plots/ so you can inspect the trees outside the RStudio plot pane.

model_ids <- names(label_mapping)

plot_dir <- file.path(getwd(), "lab-09-policy-tree-plots")
dir.create(plot_dir, showWarnings = FALSE, recursive = TRUE)

policy_tree_plots <- list()
for (m in model_ids) {
  cat("\n=== ", label_mapping[[m]], " ===\n", sep = "")
  depth1_tree <- margot_plot_decision_tree(
    policy_tree_stability,
    model_name = m,
    max_depth = 1,
    label_mapping = label_mapping
  )
  depth2_tree <- margot_plot_decision_tree(
    policy_tree_stability,
    model_name = m,
    max_depth = 2,
    label_mapping = label_mapping
  )
  depth2_scatter <- margot_plot_policy_tree(
    policy_tree_stability,
    model_name = m,
    max_depth = 2,
    label_mapping = label_mapping
  )

  suppressWarnings(print(depth1_tree))
  suppressWarnings(print(depth2_tree))
  suppressWarnings(print(depth2_scatter))

  suppressWarnings(ggplot2::ggsave(file.path(plot_dir, paste0(m, "-depth1-tree.png")),
    depth1_tree, width = 8, height = 5, dpi = 150
  ))
  suppressWarnings(ggplot2::ggsave(file.path(plot_dir, paste0(m, "-depth2-tree.png")),
    depth2_tree, width = 8, height = 5, dpi = 150
  ))
  suppressWarnings(ggplot2::ggsave(file.path(plot_dir, paste0(m, "-depth2-scatter.png")),
    depth2_scatter, width = 8, height = 5, dpi = 150
  ))

  policy_tree_plots[[m]] <- list(
    depth1_tree = depth1_tree,
    depth2_tree = depth2_tree,
    depth2_scatter = depth2_scatter
  )
}

saved_policy_plots <- list.files(plot_dir, pattern = "\\.png$", full.names = TRUE)
print(saved_policy_plots)

Reading a policy tree

Each non-leaf node names a covariate and a threshold. Branches go left if the condition is true and right otherwise. Each leaf says "treat" or "do not treat". A useful policy tree can be repeated by a clinician, teacher, or community organiser without opening the model.

Read "do not treat" as "the rule assigns the no-treatment action under the current outcome-only objective." The current tree compares expected wellbeing outcomes under treatment and no treatment; it does not attach a resource saving to the no-treatment action.
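One way to check that you have read a tree correctly is to write it as nested conditions and apply it to a few rows. The rule below is invented for illustration: the covariates, thresholds, and the helper assign_action() are hypothetical and do not come from the cached trees.

```r
# Hypothetical depth-2 rule: the root splits on age, and each branch then
# splits on weekly hours of community activity.
assign_action <- function(age, hours) {
  ifelse(
    age <= 40,
    ifelse(hours <= 2, "treat", "do not treat"),  # left branch of the root
    ifelse(hours <= 6, "do not treat", "treat")   # right branch of the root
  )
}

# Five hypothetical people.
people <- data.frame(
  age = c(25, 35, 45, 60, 30),
  hours = c(1, 5, 3, 8, 0)
)
people$action <- assign_action(people$age, people$hours)

# Coverage is the share of people the rule assigns to treatment.
coverage <- mean(people$action == "treat")
print(people)
print(coverage)
```

Writing the rule this way also makes it easy to spot cases that sit right on a threshold, which are the ones most likely to flip under a refit.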

Inspect every graph

Open every file printed in saved_policy_plots. For each outcome, compare the depth-1 tree, the depth-2 tree, and the depth-2 scatter plot. Ask three questions: what rule is being learned, whether the extra depth changes the rule in a useful way, and whether the scatter plot suggests a stable separation or a fragile threshold.

Step 5: translate one rule

After inspecting all graphs, use the Purpose tree as the worked example:

worked_model <- "model_t2_purpose"
file.path(plot_dir, paste0(worked_model, "-depth2-tree.png"))

Open that file. Write the rule as a sentence:

If [condition], recommend treatment; otherwise, [action].

Then add one sentence about coverage: what share of people would be treated under the rule?

Finally, add one limitation. For example: the splitters are not causes, the rule may rely on proxy variables, the coverage may be too broad for a scarce programme, or the threshold cases may be fragile.

Step 6: read the auto-generated prose

margot synthesises a draft narrative from the policy-workflow object. Read it after you have inspected the trees yourself. The prose is generated from the same numbers you saw above; it does not bring new information.

cat(wf$report_prose)

Read the prose critically

Auto-generated text is a draft. Check the numbers against the tables you printed earlier. Replace any phrasing that overstates causal certainty. The prose is a starting point for your own writing, not a final answer.

Step 7: audit against simulator ground truth

The analyses above never see the truth. The simulator stores the true individual treatment effects in tau_community_<outcome> columns. Ranking outcomes by their true population mean tau lets you audit margot's recommendations without circularity.

d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)

true_tau_table <- tibble::tibble(
  outcome = c("Sense of Purpose", "Belonging", "Self-esteem", "Life satisfaction"),
  true_mean_tau = c(
    mean(d0$tau_community_purpose),
    mean(d0$tau_community_belonging),
    mean(d0$tau_community_self_esteem),
    mean(d0$tau_community_life_satisfaction)
  ),
  true_sd_tau = c(
    sd(d0$tau_community_purpose),
    sd(d0$tau_community_belonging),
    sd(d0$tau_community_self_esteem),
    sd(d0$tau_community_life_satisfaction)
  )
) |>
  arrange(desc(true_mean_tau))

print(true_tau_table)

The true_sd_tau column is the headroom for targeting: it tells you whether even a perfect ranking would deliver appreciable extra benefit beyond the average. Use this table to keep the policy trees honest. A tree cannot recover heterogeneity that is weak, noisy, or absent.
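The link between spread and headroom can be checked directly. The sketch below compares two hypothetical effect distributions with the same mean and different standard deviations, and asks how much an oracle that treats the top 20% gains over treating a random 20%. The 20% share is an assumed budget chosen for illustration; the lab's policy trees impose no budget. The helper targeting_headroom() is hypothetical.

```r
# Extra benefit per treated person from treating the top `share` by true
# effect, compared with a random subset of the same size (expected benefit
# mean(tau)).
targeting_headroom <- function(tau, share = 0.2) {
  n_top <- ceiling(share * length(tau))
  top <- order(tau, decreasing = TRUE)[seq_len(n_top)]
  mean(tau[top]) - mean(tau)
}

# Hypothetical effect distributions: same mean, different spread.
z <- qnorm(ppoints(1000))          # deterministic standard-normal quantiles
tau_narrow <- 0.2 + 0.02 * z
tau_wide <- 0.2 + 0.20 * z

headroom_summary <- c(
  sd_narrow = sd(tau_narrow),
  headroom_narrow = targeting_headroom(tau_narrow),
  sd_wide = sd(tau_wide),
  headroom_wide = targeting_headroom(tau_wide)
)
round(headroom_summary, 3)
```

Multiplying the spread by ten multiplies the oracle's headroom by ten, while the average effect stays the same. A small true_sd_tau therefore caps what any ranking or tree could add.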

Ethical considerations

A high-value rule still needs judgement

A policy tree maximises expected treatment benefit under the objective you give it. Before anyone acts on it, ask:

  • Objective. Does the rule optimise wellbeing gain only, or has a treatment cost been specified in the same units as the outcome?
  • Proxy variables. Does the tree split on variables correlated with protected characteristics such as ethnicity, gender, or socioeconomic status?
  • Fairness. Does targeting those who benefit most leave out people who would still benefit?
  • Transparency. Can a community organiser, teacher, clinician, or member of the public understand the rule?
  • Override. When should someone override the rule because they know relevant context the model cannot see?

When presenting policy tree results, discuss these trade-offs plainly.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

  1. Translate a policy tree. Pick one outcome. Translate its depth-2 policy tree into a rule a non-technical reader could follow.

  2. Compare depth. For the same outcome, compare the depth-1 and depth-2 rules. Does the extra depth seem worth the extra complexity? Use the depth summary table and the plots.

  3. Coverage and budget. Use wf$policy_brief_df. If the rule treats more than 80% of people, explain why that is not a scarce-budget policy.

  4. Discuss override. In one paragraph, describe a scenario in which a community-group co-ordinator should override a policy tree recommendation. What information would they have that the model does not?