Welcome to PSYC 434
Conducting Research Across Cultures | Trimester 1, 2026
Prof Joseph Bulbulia | Victoria University of Wellington
Assessments
| Assessment | CLOs | Weight | Due |
|---|---|---|---|
| Lab diaries (8 × 1.25%) | 1, 2, 3 | 10% | Weekly (satisfactory/not) |
| In-class test 1 | 2 | 20% | 22 April (w7) |
| In-class test 2 | 2, 3 | 20% | 20 May (w11) |
| In-class presentation | 1, 2, 3 | 10% | 27 May (w12) |
| Research report (Option A or B) | 1, 2, 3 | 40% | Friday 29 May |
Accessing Lectures and Readings
- Seminar: Wednesdays, 14:10–17:00, Easterfield Building EA120
- Schedule: see the Schedule page for topics, readings, and assignments
- Lectures: weekly content pages contain slides, recordings, and lab materials
- Tests: in the same room as the seminar (bring a pen/pencil and one A4 sheet of notes; no devices)
Contact
| Role | Details |
|---|---|
| Course coordinator | Prof Joseph Bulbulia, joseph.bulbulia@vuw.ac.nz |
| Office | EA313 |
| Office hours | Tuesday 14:00–15:00 or by appointment |
| R help | Boyang Cao, caoboya@myvuw.ac.nz |
Course Description
From the VUW course catalogue:
This course focuses on theoretical and practical challenges for conducting research involving individuals from more than one cultural background or ethnicity. Topics include defining and measuring culture; developing culture-sensitive studies; choice of language and translation; communication styles and bias; questionnaire and interview design; qualitative and quantitative data analysis for cultural and cross-cultural research; minorities, power, and ethics in cross-cultural research; and ethno-methodologies and indigenous research methodologies. Appropriate background for this course: PSYC 338.
Course Learning Objectives
- Understanding causal inference. Students will develop a clear understanding of causal inference concepts and workflows, with emphasis on how they address common pitfalls in cross-cultural research. We focus first on how to ask causal questions in comparative psychology, and only then on how to answer them: designing studies, analysing data, and drawing appropriately confident conclusions about cause and effect.
- Understanding measurement in comparative settings. A substantial portion of this course is devoted to measurement in psychological research. We cover classical techniques for constructing and psychometrically validating measures across cultures, and clarify why statistical tests alone cannot ensure we are measuring what we intend to measure.
- Statistical programming in R. Students will learn the basics of programming in the statistical language R, gaining computational tools for applying causal inference methods to real data.
- Computing fundamentals: the command line, Git, and GitHub. Students will learn to navigate their computer through the terminal, manage projects with Git, and collaborate through GitHub. These skills matter because the most powerful tools available to researchers today, from LLMs to cloud computing, operate through text-based interfaces. Students who understand their machines will get far more out of them.
Licence
© 2026 Joseph Bulbulia. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence.
Schedule
Weekly Schedule (2026)
| Week | Date (Wed) | Content | Lab | Main Readings |
|---|---|---|---|---|
| w1 | 25 Feb | How to ask a question in psychological science? | Git, GitHub, and required R/RStudio setup | H&R Ch 1 |
| w2 | 4 Mar | Causal diagrams: elementary structures | RStudio project workflow and R basics | H&R Ch 1–2 |
| w3 | 11 Mar | Causal diagrams: confounding bias | Regression, graphing, and simulation | H&R Ch 6 |
| w4 | 18 Mar | Selection bias and measurement bias | Regression and confounding bias | H&R Ch 6–9 |
| w5 | 25 Mar | Causal inference: average treatment effects | Average treatment effects | H&R Ch 1–3 (review) |
| w6 | 1 Apr | Effect modification / CATE | CATE and effect modification | H&R Ch 4–5; GRF guide |
| — | 8 Apr | Mid-trimester break | — | — |
| — | 15 Apr | Mid-trimester break | — | — |
| w7 | 22 Apr | In-class test 1 (20%) | — | — |
| w8 | 29 Apr | Heterogeneous treatment effects and machine learning | RATE and QINI curves | GRF guide; MAQ |
| w9 | 6 May | Resource allocation and policy trees | Policy trees | Policy learning |
| w10 | 13 May | Classical measurement theory from a causal perspective | End-to-End Research Report | VanderWeele (2022) |
| w11 | 20 May | In-class test 2 (20%) | — | — |
| w12 | 27 May | Student presentations (10%) | — | — |
H&R = Hernán, M. A. & Robins, J. M. (2025). Causal Inference: What If. Free PDF · Book website
Labs
Labs run in the final 60–90 minutes of the seminar. There are nine labs, in weeks 1–6 and 8–10. Your best eight lab diaries count toward the 10% assessment. See Assessments for due dates.
Assessments
Overview
| Assessment | CLOs | Weight | Due |
|---|---|---|---|
| Lab diaries (8 × 1.25%) | 1, 2, 3 | 10% | Weekly |
| In-class test 1 | 2 | 20% | 22 April (w7) |
| In-class test 2 | 2, 3 | 20% | 20 May (w11) |
| In-class presentation | 1, 2, 3 | 10% | 27 May (w12) |
| Research report | 1, 2, 3 | 40% | Friday 29 May |
Assessment 1: Lab Diaries (10%)
Nine weekly diaries, one per lab (weeks 1–6 and 8–10). There are no labs in week 7 (test 1), week 11 (test 2), or week 12 (presentations). Your best eight diaries count (8 × 1.25%), so you may miss one without penalty. Each diary is graded satisfactory/not satisfactory. You receive full credit for submitting a satisfactory entry. Diaries are due by the end of the lab session.
| Diary | Week | Due date |
|---|---|---|
| lab-01.md | w1 | Wed 25 Feb |
| lab-02.md | w2 | Wed 4 Mar |
| lab-03.md | w3 | Wed 11 Mar |
| lab-04.md | w4 | Wed 18 Mar |
| lab-05.md | w5 | Wed 25 Mar |
| lab-06.md | w6 | Wed 1 Apr |
| lab-08.md | w8 | Wed 29 Apr |
| lab-09.md | w9 | Wed 6 May |
| lab-10.md | w10 | Wed 13 May |
What to write
Each diary is a short reflection (~150 words) covering:
- What the lab covered and what you did.
- A connection to the week's readings or lecture content.
- One thing you found useful, surprising, or challenging.
Several labs have focussed exercises.
Diaries are marked full-credit/no-credit. If your diary shows that you engaged with the lab, you will get full credit, even if some exercises are wrong.
Format
Write each diary as a plain markdown (.md) file named by week number:
lab-01.md, lab-02.md, …, lab-10.md (there is no lab-07.md). Use
GitHub-flavoured markdown formatting: headings, paragraphs, bold,
italics, and lists. Because you push diaries to GitHub, your files
will render there automatically. These submissions build your markdown
fluency; later in the course you will use Quarto to extend markdown to
PDF and Word.
Submission
Push your diary files to your private GitHub Classroom repository set up in Lab 1. The commit timestamp is your submission record. Your repository is private and visible only to you and the course coordinator; no additional sharing step is needed.
Markdown example
Here is a minimal diary entry showing basic markdown formatting:
```markdown
# Lab 01: Introduction to R

This week we installed R and RStudio, then ran our first script. The exercise connected to the lecture on **causal questions** by showing how we structure data for analysis.

I found the following steps useful:

- Creating an RStudio project
- Writing a short R script
- Pushing changes to GitHub
```
Assessment 2: In-Class Test 1 (20%) — 22 April
Covers material from weeks 1–6 (causal questions, causal diagrams, confounding, average treatment effect (ATE), effect modification). The test is written to take about 50 minutes; you have the full allocated time of 1 hour 50 minutes to complete it. Required: pen/pencil and one A4 sheet of notes. No devices permitted.
Test Location
The test is in class. Come to the seminar room (EA120) with a writing instrument.
Assessment 3: In-Class Test 2 (20%) — 20 May
Covers material from weeks 8–10 (heterogeneous treatment effects, machine learning, resource allocation, policy trees, and classical measurement theory). Same format and conditions as test 1: in class, on paper, with one A4 sheet of notes, no devices, and no AI tools.
Use the Assessment Self-Checks while preparing. They list the things you should be able to state clearly: the causal question, heterogeneous treatment effects, policy trees, and measurement assumptions.
Test Location
The test is in class. Come to the seminar room (EA120) with a writing instrument.
Assessment 4: In-Class Presentation (10%) — 27 May
You will present your research report (Option A) or your Marsden EOI concept (Option B). Your job is to answer two questions for a non-specialist audience: what is it, and why should anyone care?
The presentation is 10 minutes, followed by one question from the panel. You must answer the question after your talk. You may ask one brief clarifying question before answering.
You may use the whiteboard and paper notes. Do not use slides, handouts, devices, or other materials.
Your talk should cover the following points, in this order.
- Title and motivation (what is it, so what).
- Causal question, target population, exposure, and outcomes.
- A simple causal diagram showing your identification strategy.
- Causal estimand and analysis plan (what you will estimate, and how).
- One key limitation or risk, and how you will address it.
The full grade-banded rubric is in the Presentation Rubric.
Assessment 5: Research Report (40%) — Due Friday 29 May
You choose your format
Students choose one of two formats for the research report:
- Option A: Research Report — quantify an average treatment effect using the synthetic dataset produced by the course simulator.
- Option B: Marsden Fund EOI — write a first-round Marsden Fund Expression of Interest using the causal inference framework.
You must declare your choice by submitting the option form on Nuku by Friday 3 April (end of w6). If no declaration is received by this date, Option A is assumed.
Generate your data using the causalworkshop package:

```r
# install (once)
install.packages("pak")
pak::pak("go-bayes/causalworkshop")

# generate data
library(causalworkshop)
d <- simulate_nzavs_data(n = 5000, seed = 2026)
```

Choose one exposure (religious_service or volunteer_work) and report effects outcome-wide on all four wellbeing outcomes (purpose, belonging, self_esteem, life_satisfaction). The third exposure in the simulator, community_group, is reserved for the worked example in Lab 9 and may not be used for the report. Lab sessions support you in this assignment. We assume no statistical background.
Current template and historical examples
- Start from the current research-report template. If the course-site download fails, use the Google Drive mirror. It is the 2026 scaffold for Option A.
- A historical report example is available here: Example PSYC434 Report. Use it to see tone and structure, not as a checklist for this year's narrower simulated-data assignment. Do not treat its topic, subgroup choices, or ethical framing as a model for this year's report.
- A historical Marsden one-pager is available here: Marsden one-page example. It predates the current criteria. For Option B, follow the 2026 requirements below and the official RSNZ templates.
Use the Assessment Self-Checks before drafting your report or presentation. In particular, check that policy tree splits are interpreted as descriptive targeting rules, not as identified causes of differential response.
Late Penalty
Late assignments, and assignments with extensions, may be subject to delays in marking and may not receive comprehensive feedback.
Assignments submitted late without an approved extension will incur a grade penalty of 5% of the total marks available for the assignment per day late (i.e., in 24-hr increments), up to a maximum of 5 days (up to 24 hrs late = −5%; up to 48 hrs late = −10%, etc.).
Assignments submitted more than five days late without an approved extension will not be graded unless exceptional circumstances are accepted by the Course Coordinator.
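The penalty arithmetic can be sketched in R. This is an illustration only: late_penalty() is a hypothetical helper, not part of any course package, and it assumes lateness is rounded up to whole 24-hour blocks. It returns NA for work more than five days late, since such work is not graded by default.

```r
# Hypothetical helper: late penalty as a percentage of the total marks available.
late_penalty <- function(hours_late, has_extension = FALSE) {
  # no penalty with an approved extension or an on-time submission
  if (has_extension || hours_late <= 0) return(0)
  days_late <- ceiling(hours_late / 24)  # 24-hour increments, rounded up
  if (days_late > 5) return(NA_real_)    # beyond five days: not graded by default
  5 * days_late                          # 5% per day late
}

late_penalty(3)    # up to 24 hrs late
late_penalty(30)   # up to 48 hrs late
late_penalty(130)  # more than five days late
```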
Option A: Research Report
Estimate the Average Treatment Effect of a single exposure on all four
wellbeing outcomes using the synthetic three-wave panel generated by
causalworkshop::simulate_nzavs_data(). The outcome-wide design follows
VanderWeele et al. (2020) and the workflow
practised in Labs 8–10. You choose one exposure (religious_service or
volunteer_work); the four outcomes (purpose, belonging,
self_esteem, life_satisfaction) are fixed.
- Introduction: 800-word hard limit.
- Discussion: 1,000-word hard limit (extended to accommodate outcome-wide interpretation).
- Methods/Results: concise, no specific word limit.
- American Psychological Association (APA) style. Submit as a single PDF with R code appendix.
Assessment Criteria (Option A)
The criteria are listed in the order in which they should appear in the
report and follow the ten-step causal workflow:
first define the target population, then the contrast and outcomes,
then check the identification assumptions, then state the causal
estimand, then estimate, then report, then ethics. The same order is
reflected in the manuscript.qmd scaffold inside the research-report
template.
Introduction. Set up the scientific interest of the question. Explain why this population and topic matter, what is already known, what remains uncertain, and why a causal answer would be useful. The Introduction should read like the beginning of a scientific paper, not like a numbered workflow checklist.
Define the target population. State the target population: for whom would the answer apply if the assumptions held? Would the exposure apply to all New Zealanders, or only to a subset? Identify subpopulations for whom the contrast is ill-defined or for whom positivity is unlikely to hold. Explain how the synthetic sample relates to your target population. Comment on transportability.
Eligibility criteria. Force yourself to think concretely. Name the people for whom the answer is relevant ("would a Kiwi reader say this applies to me?"), and the people for whom the intervention would be incoherent (for example, "increase religious-service attendance" is not a coherent intervention for a self-described atheist with no intention of attending). Excluding the second group up front protects positivity and clarifies what the estimate is about.
State the causal question. State your question clearly after naming
the target population. Frame it as a causal question: among this
population, what would later wellbeing look like if everyone received
the exposure compared with if no one received it? Note that the question
is outcome-wide: you ask how the exposure affects four wellbeing
outcomes jointly, rather than picking one favourite outcome after
looking at the results. Describe any practical or ethical relevance.
Confirm data are from the course simulator distributed in
causalworkshop and that you use three waves.
Determine the exposure. Define the exposure $A$. Explain the contrast (binary or modified treatment policy). Address the consistency assumption — would two participants with the same recorded exposure value share the same potential outcomes? Confirm the exposure precedes the outcomes in time.
Determine the outcomes. Define each outcome $Y_k$ for $k = 1, \dots, 4$. State the scale (continuous, z-scored), the timing (wave 2, post-exposure), and how each outcome relates to the question. Following VanderWeele's outcome-wide approach (VanderWeele et al., 2020), the four outcomes together index wellbeing as a composite construct rather than four independent endpoints. Explain why this design reduces selective reporting (selecting only the outcome that gave the strongest result after seeing the data) and lets the reader read the pattern across the family rather than over-interpreting any single outcome.
Sample characteristics. Provide descriptive statistics for baseline demographics, exposure prevalence, and outcome distributions. Make clear the data are simulated.
Account for confounders. Define the baseline confounder set $L$. Justify each variable as a plausible cause of both $A$ and the $Y_k$. Confirm temporal order. Include baseline values of the exposure and of every outcome in $L$; this is lagged-self adjustment (treating each variable's own past value as a covariate). Use z-scores for continuous covariates and one-hot encoding for categorical covariates with three or more levels. Describe how the adjustment set supports conditional exchangeability: that for two people who match on $L$, exposure status is, on average, as good as random with respect to potential outcomes. Note that positivity (every exposure level occurs with non-zero probability within each stratum of $L$) is also required.
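The covariate preparation described above can be sketched in base R. The data frame and its column names (age, region, purpose_0) are made up for illustration and are not the simulator's variable names; the research-report template may prepare covariates differently.

```r
# Toy baseline data; column names are illustrative, not the simulator's.
baseline <- data.frame(
  age       = c(23, 35, 47, 59, 31, 42),
  region    = factor(c("north", "south", "central", "north", "central", "south")),
  purpose_0 = c(4.1, 5.0, 3.2, 6.3, 4.8, 5.5)  # baseline value of an outcome
)

# z-score the continuous covariates: (x - mean) / sd
continuous <- c("age", "purpose_0")
z <- as.data.frame(scale(baseline[continuous]))
names(z) <- paste0(continuous, "_z")

# one-hot encode the three-level factor: one 0/1 column per level, no intercept
one_hot <- as.data.frame(model.matrix(~ region - 1, data = baseline))

# combined adjustment set L
L <- cbind(z, one_hot)
print(L)
```

After scaling, each continuous column has mean 0 and standard deviation 1, and each row has exactly one 1 among the region columns.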
Draw a causal diagram. Include measured confounders, unmeasured confounders, and time indices. The diagram should make the identification strategy legible at a glance. Recall that directed acyclic graph (DAG) arrows are qualitative claims about which variables can affect which others, not commitments to specific causal mechanisms; the diagram is for checking d-separation (a graphical rule for reading off conditional independencies).
Identify the causal estimand. State the causal contrast for each outcome:

$$
\mathrm{ATE}_k = \mathbb{E}[Y_k(1) - Y_k(0)], \quad k = 1, \dots, 4.
$$

State the joint causal estimand of interest (the vector of four ATEs) and how you will summarise it.
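To see what the estimand refers to, it can help to simulate both potential outcomes for a single outcome, which is never possible with real data. The sketch below is a toy simulation under assumed values (one confounder l, a true effect of 0.3); it is not the course simulator. It compares the true ATE with a naive difference in means and with a regression that adjusts for the confounder.

```r
set.seed(2026)
n  <- 10000
l  <- rnorm(n)                       # a single baseline confounder
a  <- rbinom(n, 1, plogis(0.8 * l))  # exposure depends on the confounder
y0 <- 0.5 * l + rnorm(n)             # potential outcome under no exposure, Y(0)
y1 <- y0 + 0.3                       # potential outcome under exposure, Y(1)
y  <- ifelse(a == 1, y1, y0)         # consistency: only one potential outcome is observed

true_ate <- mean(y1 - y0)                              # E[Y(1) - Y(0)]
naive    <- mean(y[a == 1]) - mean(y[a == 0])          # confounded comparison
adjusted <- unname(coef(lm(y ~ a + l))["a"])           # adjusts for l

round(c(true_ate = true_ate, naive = naive, adjusted = adjusted), 3)
```

Because l raises both the chance of exposure and the outcome, the naive contrast overstates the effect, while the adjusted estimate lands near the true value.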
Missing data. Describe how missing data are handled. Use inverse-probability-of-censoring weights (IPCW) — a method that reweights observed cases so they stand in for those who dropped out — for attrition, or state why the synthetic panel does not require them.
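A minimal sketch of the reweighting idea, assuming a toy dataset in which drop-out depends on a baseline covariate l and the exposure a. The variable names and the logistic censoring model are illustrative assumptions; the course workflow may estimate censoring weights differently.

```r
set.seed(434)
n <- 5000
l <- rnorm(n)           # baseline covariate
a <- rbinom(n, 1, 0.5)  # exposure
# probability of remaining in the study depends on l and a
observed <- rbinom(n, 1, plogis(1 + 0.7 * l - 0.4 * a))

# model the probability of being observed (not censored)
fit_c <- glm(observed ~ l + a, family = binomial())
p_obs <- fitted(fit_c)

# weights: 1 / P(observed | l, a) for retained cases, 0 for drop-outs
w <- ifelse(observed == 1, 1 / p_obs, 0)

# compare the covariate mean in the full sample, the retained sample,
# and the reweighted retained sample
mean_full     <- mean(l)
mean_retained <- mean(l[observed == 1])
mean_weighted <- weighted.mean(l[observed == 1], w[observed == 1])
round(c(full = mean_full, retained = mean_retained, weighted = mean_weighted), 3)
```

The retained sample over-represents people with high l; after weighting, its covariate mean moves back toward the full-sample mean.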
Model approach. Estimate the four ATEs in one batch using
margot::margot_causal_forest() (a wrapper around
grf::causal_forest() with sensible defaults: honest splitting,
doubly-robust ATEs, and a shared adjustment set). Briefly explain how
machine learning is used and what "doubly robust" means (the estimator
is consistent if either the outcome model or the propensity-score model
is well specified). State the tuning choices (number of trees, honesty,
seed). The research-report-template scaffold (Lab 10) configures these
choices in setup.R; report the values you used.
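The doubly robust idea can be illustrated without forests. The sketch below builds augmented inverse-probability-weighted (AIPW) scores from a parametric outcome model and a parametric propensity-score model on toy data with an assumed true ATE of 0.3. It is a stand-in for intuition only: it does not reproduce margot::margot_causal_forest() or grf::causal_forest(), which use forests and honest splitting.

```r
set.seed(10)
n <- 5000
l <- rnorm(n)                        # baseline confounder
a <- rbinom(n, 1, plogis(0.6 * l))   # exposure
y <- 0.3 * a + 0.5 * l + rnorm(n)    # outcome; true ATE = 0.3
d <- data.frame(y, a, l)

# propensity-score model: P(A = 1 | L)
e_hat <- fitted(glm(a ~ l, family = binomial(), data = d))

# outcome model, then predictions with everyone set to exposed / unexposed
fit_y <- lm(y ~ a + l, data = d)
m1 <- predict(fit_y, newdata = transform(d, a = 1))
m0 <- predict(fit_y, newdata = transform(d, a = 0))

# AIPW scores: outcome-model contrast plus inverse-probability-weighted residuals
scores <- (m1 - m0) +
  a * (y - m1) / e_hat -
  (1 - a) * (y - m0) / (1 - e_hat)

ate_hat <- mean(scores)
se_hat  <- sd(scores) / sqrt(n)
round(c(ate = ate_hat, se = se_hat), 3)
```

The residual terms correct the outcome model where it is wrong, and the outcome model stabilises the weighting where the propensity model is wrong, which is why either model being well specified is enough for consistency.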
Multiple-testing correction. Because you report four outcomes, apply
a Bonferroni correction at $\alpha_{FW} = 0.05$ and report the
multiplicity-adjusted confidence intervals.
margot::margot_plot(adjust = "bonferroni") does this automatically.
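The arithmetic behind a Bonferroni-adjusted interval can be sketched as follows, assuming normal-approximation (Wald) intervals. The estimates and standard errors are made-up numbers, not course results, and margot's internal computation may differ in detail.

```r
# Illustrative estimates and standard errors (made-up numbers)
est <- c(purpose = 0.12, belonging = 0.08, self_esteem = 0.05, life_satisfaction = 0.10)
se  <- c(0.03, 0.03, 0.03, 0.03)

k        <- length(est)                    # four outcomes
alpha_fw <- 0.05                           # family-wise error rate
z_unadj  <- qnorm(1 - alpha_fw / 2)        # critical value for one outcome
z_bonf   <- qnorm(1 - alpha_fw / (2 * k))  # critical value with alpha split across k outcomes

ci <- data.frame(
  estimate = est,
  lower    = est - z_bonf * se,
  upper    = est + z_bonf * se
)
print(round(ci, 3))
```

Splitting the family-wise error rate across four outcomes raises the critical value from about 1.96 to about 2.50, so each adjusted interval is wider than its unadjusted counterpart.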
Sensitivity analysis. Report an E-value (VanderWeele & Ding,
2017) for every outcome's point estimate and for
the lower bound of its multiplicity-adjusted (Bonferroni) confidence
interval. margot::margot_plot() recomputes the bound E-value at the
multiplicity-adjusted limit and reports it in the transformed table.
Interpret each E-value in plain language.
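For intuition, the E-value formula from VanderWeele & Ding (2017) can be written in a few lines of R. The helper names below (evalue_rr, evalue_bound, d_to_rr) are hypothetical and are not margot functions; the conversion from a standardised mean difference $d$ to an approximate risk ratio, $\exp(0.91 \times d)$, is the approximation described in that paper, and the input numbers are made up.

```r
# E-value for a risk ratio: RR + sqrt(RR * (RR - 1)), taking 1 / RR when RR < 1
evalue_rr <- function(rr) {
  if (rr < 1) rr <- 1 / rr
  rr + sqrt(rr * (rr - 1))
}

# E-value for the confidence limit closest to the null;
# it is 1 when the interval includes the null value of 1
evalue_bound <- function(rr_lower, rr_upper) {
  if (rr_lower <= 1 && rr_upper >= 1) return(1)
  if (rr_lower > 1) evalue_rr(rr_lower) else evalue_rr(rr_upper)
}

# approximate risk ratio from a standardised mean difference
d_to_rr <- function(d) exp(0.91 * d)

# illustrative standardised estimate with multiplicity-adjusted limits (made-up numbers)
evalue_rr(d_to_rr(0.12))
evalue_bound(d_to_rr(0.045), d_to_rr(0.195))
```

In plain language, the E-value is the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both exposure and outcome to fully explain away the estimate.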
Report results. Present the four ATE estimates,
multiplicity-adjusted 95% confidence intervals, and E-values in a single
forest plot and an accompanying table. Order outcomes by effect
magnitude. Interpret the pattern as a whole, not outcome by outcome. Use
the auto-drafted prose returned by margot::margot_plot() as a starting
point; revise it for clarity and your audience.
Identify optimal policy trees. A policy tree finds the partition of
the population that maximises a utility function (here, the benefit of
treating people the rule recommends to treat, relative to a
no-one-treated baseline). The tree therefore identifies an optimal
targeting rule, not a set of subgroup contrasts; each leaf is a
partition, not a comparison group with its own causal effect estimate.
After estimating the four ATEs, fit depth-1 and depth-2 policy trees per
outcome using the margot policy-tree workflow (Lab 9 walks through the
steps; Lab 10 wires it into the report). Use the parsimony rule from the
Reporting Guide to choose between depth-1 and
depth-2 (the course default is min_gain_for_depth_switch = 0.01
outcome units). Translate the leaves of the chosen tree into plain
language a non-specialist could repeat. Report the policy value with its
95% confidence interval, and the proportion of the sample assigned to
treatment. Read the split variables as descriptors of where the rule
sends people, not as identified causes of differential response. The
current margot workflow does not force a fixed percentage treated; if
you discuss a budget cap, state it as an extra decision problem.
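The logic of a depth-1 policy tree can be sketched with a brute-force search in base R. This is a simplified illustration, not the margot or policytree implementation: gamma stands in for doubly robust benefit scores (treat versus no-one-treated), the covariate names are invented, and candidate splits are limited to deciles.

```r
set.seed(9)
n <- 2000
x <- data.frame(age_z = rnorm(n), income_z = rnorm(n))
# Stand-in benefit scores: benefit is positive on average only when age_z > 0.5
gamma <- ifelse(x$age_z > 0.5, 0.4, -0.2) + rnorm(n, sd = 0.5)

best <- list(value = -Inf)
for (v in names(x)) {
  for (threshold in quantile(x[[v]], probs = seq(0.1, 0.9, by = 0.1))) {
    left <- x[[v]] <= threshold
    # treat a leaf only if its total estimated benefit is positive
    gain_left  <- max(0, sum(gamma[left]))
    gain_right <- max(0, sum(gamma[!left]))
    value <- (gain_left + gain_right) / n  # policy value relative to treating no one
    if (value > best$value) {
      best <- list(variable = v, threshold = unname(threshold), value = value,
                   treat_left = gain_left > 0, treat_right = gain_right > 0)
    }
  }
}
str(best)
```

The search returns a rule of the form "treat if the variable is above the threshold". The split variable describes where the rule sends people; it does not by itself identify a cause of differential response.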
Note the graphing rule for policy trees. The template's setup.R
already calls margot::margot_select_grf_policy_trees() to decide which
trees to plot; you do not have to call it yourself. The rule keeps a
tree only when both the policy-value and the treated-uplift lower
confidence limits exceed zero (the course default thresholds are zero
for both). What you do need to do: (i) state the thresholds in your
methods so a reader can see the rule, and (ii) report what the rule
decided. Outcomes that fail the rule remain in your tables and prose;
their trees do not appear as figures. If no outcome passes the rule, say
so and discuss what an unidentified targeting story implies.
Discussion. Pull the report together in five short subsections (the template provides the scaffold; the whole discussion is capped at 1,000 words):
- Summary of results. Restate the causal question and summarise the pattern across the four outcomes (direction, rough magnitude, multiplicity-adjusted uncertainty, and the policy-tree targeting story). Describe the pattern, not each outcome in turn.
- Limitations. Acknowledge the identification assumptions and name the one most likely to fail in a real-data study. Note that the data are simulated, and state what that implies for external validity.
- Importance of methods. Briefly motivate the outcome-wide + Bonferroni + policy-tree workflow against simpler alternatives. Two or three sentences.
- Importance of findings (theory and policy). State what the result would mean for theory and for practice if it held in real data. Name a decision-maker who would care and the smallest change to their practice the evidence could support.
- Ethics. Three to five sentences naming the equity, proxy-variable, and oversight considerations that would need attention before anyone acted on the policy tree. State one value judgement the analysis depends on. The aim is to flag the considerations, not to resolve them; the model cannot settle public values.
Contributor Roles Taxonomy (CRediT) and artificial-intelligence (AI) disclosure. Append a CRediT contributor statement (a National Information Standards Organization (NISO) standard) and an AI disclosure statement (per the course's AI use policy) after the Discussion. The template provides headings and short placeholder text for both.
Option B: Marsden Fund Expression of Interest (EOI)
Write a first-round Marsden Fund Expression of Interest (EOI) following the RSNZ 2026 guidelines. Your research question must use the causal inference framework taught in this course. Assume an Economics and Human and Behavioural Sciences (EHB) panel.
Templates and Guidelines
Download the official 2026 RSNZ templates before you begin.
Formatting: 12-point Times New Roman, single spacing, 2 cm margins. Submit as a single PDF.
Required Sections
Section numbers follow the 2026 RSNZ EOI form.
1a. Research Title (max 25 words). Plain language, no jargon. The title should be accessible to a scientifically literate non-specialist.
1d. Research Summary (max 200 words). This summary must be standalone: assessors outside your discipline will read it. Answer four questions in this order:
- What is the current state of the field? (1–2 sentences establishing the gap or problem.)
- What do you aim to do? (State the causal question plainly.)
- How will you do it? (Name the data source, design, and analytic approach.)
- What do you expect to find? (One sentence on anticipated results and their significance.)
2. Vision Mātauranga (max 200 words). Describe how the proposed research relates to the four Vision Mātauranga (VM) themes: (i) indigenous innovation, drawing on Māori knowledge, resources, and people; (ii) taiao, achieving environmental sustainability through iwi and hapū relationships with land and sea; (iii) hauora/oranga, improving health and social wellbeing; (iv) mātauranga, exploring indigenous knowledge and its contribution to NZ research. If none of the themes apply, you may state "not applicable" with a considered justification.
3a. Abstract (max 1 page). Cover the following: aims of the research; importance of the research area; novelty, originality, insight, and ambition of the proposed work; potential impact; methodology; and your capacity to deliver.
3b. Benefit Statement (max 400 words, 1 page). Describe the economic, environmental, or health benefit of the research to New Zealand. Explain why NZ is the right place for this research and describe potential impacts for Māori. In a student context the benefit case may be aspirational, but it must be concrete.
3c. References (max 3 pages). Bold your own name. Include article titles and full author lists (up to 12 authors; use "et al." thereafter).
3d. Roles and Resources (max 1 page). Describe the contributions of each team member, the resources required, and any ethical considerations. Use the Roles and Resources form.
Assessment Criteria (Option B)
Research. Quantifiable impact potential through novelty, originality, insight, and ambition. Rigorous methods grounded in prior research. Ability and capacity to deliver.
Benefit. Economic, environmental, or health benefit to New Zealand. Rationale for NZ-based research. In a student context the benefit case may be aspirational but must be concrete.
Vision Mātauranga. Relation to VM themes; where relevant, engagement with Māori. "Not applicable" is acceptable with considered justification.
Causal reasoning (course-specific). Well-defined causal question, clearly stated causal estimand, appropriate identification strategy. This criterion carries substantial weight.
For the full Marsden Fund assessment criteria, see the RSNZ 2026 EOI Guidelines (pdf).
AI use in this course
You may use AI tools in this course. You do not have to.
AI use policy
- You may use AI for coding help, brainstorming, and clearer writing.
- Use it as a critical reader: ask it to find holes in your argument, push you to clearer thinking, and point out vague wording. Push back when it gives bland or wrong advice.
- You are responsible for all submitted work. Verify all claims, code, and references.
- You must be able to explain your work in your own words.
- For lab diaries and the final report, add a short note if AI use is substantial (tool, date, and how it was used).
- If AI output shaped your submission in an important way, acknowledge it as a source.
- AI tools are not permitted in in-class tests.
- Do not upload confidential, identifiable, or sensitive information.
Useful AI uses while studying:
- Ask for a plain-language explanation of a concept after you have checked the lecture notes.
- Ask whether your answer states the target population, causal contrast, outcome, estimand, and assumptions.
- Ask where your wording sounds associational when you mean causal.
- Ask for a similar practice question, then answer that question yourself.
Poor AI uses:
- Asking it to write an answer you have not attempted.
- Trusting its answer without checking the course materials.
- Letting it swap the causal steps for vague statistics language.
- Using it to memorise phrases rather than understand the logic.
VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268–274. https://doi.org/10.7326/M16-2607
VanderWeele, T. J., Mathur, M. B., & Chen, Y. (2020). Outcome-wide longitudinal designs for causal inference: A new template for empirical studies. Statistical Science, 35(3), 437–466.
Extensions and Materials
Extensions
Extension requests for the final report (before the mid-trimester break)
Request a new due date by emailing the course coordinator before 3 April 2026. Reasonable requests will be considered, including periods where major assessments cluster in the same week.
Extensions for laboratory diaries
Laboratory diaries are due by the end of the lab session. To receive credit for a diary, you must attend the lab unless an absence is approved. There are nine labs, but only your best eight diaries are counted, so one missed diary can occur without penalty.
If you are physically unwell, or you are caring for an unwell dependant, do not attend class. Email the course coordinator as soon as possible so the absence can be recorded. When you are able, upload the diary for that week. If the diary shows serious engagement with the lab content, no late points will be deducted.
If you are unable to attend class or lab for personal or work-related reasons unrelated to health, email the course coordinator before class where possible. Requests are considered case by case, and any alternative submission arrangement is determined by the coordinator.
Extensions for Presentations and Class Tests
Presentations and class tests are in-class assessments. If you or a dependant is unwell, email the course coordinator and we will arrange a rescheduled in-class assessment.
Late Penalty
This is the standard School of Psychological Sciences policy on lateness penalties.
Late assignments, and assignments with extensions, may be subject to delays in marking and may not receive comprehensive feedback.
Assignments submitted late without an approved extension will incur a grade penalty of 5% of the total marks available for the assignment per day late (i.e., in 24-hr increments), up to a maximum of 5 days (up to 24 hrs late = −5%; up to 48 hrs late = −10%, etc.).
Assignments submitted more than five days late without an approved extension will not be graded unless exceptional circumstances are accepted by the Course Coordinator.
Materials and Equipment
Bring a laptop from week 1. Install R and RStudio in week 1 for data analysis sessions. You may use RStudio or any other IDE/editor you prefer. Contact the instructor if you lack computer access.
For in-class tests, bring a writing utensil. Electronic devices are not permitted during tests.
Course Readings
Primary text
Hernán & Robins (2025)
Chapters 1–9 are the required reading for this course. The book is freely available from the link above. Abbreviated H&R on the schedule and in lecture notes.
Reading strategy
Read each chapter before the corresponding lecture week. The chapters are short (roughly 10–15 pages each) and written in accessible prose with worked examples. Focus on understanding the concepts rather than memorising notation.
Week-by-week readings
Week 1: How to ask a question in psychological science
Required: Hernán & Robins (2025), chapter 1. PDF
Optional: Briggs (2021) (history of measurement in psychology); Bandalos (2018) (psychometrics); Pearl & Mackenzie (2018) (accessible introduction to causal inference); Bulbulia (2024a) (causal questions in psychology).
Week 2: Causal diagrams — five elementary structures
Required: Hernán & Robins (2025), chapters 1–2. PDF
Optional: Bulbulia (2024a); Bulbulia (2024d) (experimental design and causal diagrams).
Week 3: Causal diagrams — the structures of confounding bias
Required: Hernán & Robins (2025), chapter 6. PDF
Optional: Bulbulia (2024a).
Week 4: Selection bias and measurement bias
Required: Hernán & Robins (2025), chapters 6–9. PDF
Optional: Bulbulia (2024c) (WEIRD samples and external validity); Bulbulia (2024b) (SWIGs and time-varying confounding).
Week 5: Average treatment effects
Required: Hernán & Robins (2025), chapters 1–3 (review identification assumptions). PDF; Cashin et al. (2025), the TARGET (Transparent Reporting of Observational Studies Emulating a Target Trial) statement, a reporting checklist for studies that emulate a target trial.
Week 6: Effect modification and CATE
Required: Hernán & Robins (2025), chapters 4–5. PDF
Optional: GRF documentation (causal forests, used in labs 6, 8, and 9).
Week 7: In-class test 1
No new reading. Review Hernán & Robins (2025), chapters 1–9, the Causal Workflow, and the Test 1 Study Sheet.
Week 8: Heterogeneous treatment effects and machine learning
Required: GRF documentation — the Causal Forest article, the RATE article, and the Application: heterogeneity in clinical trials article. These are the operational references for the analysis workflow used in Labs 8–10 and the research report.
Optional: Wager & Athey (2018) (causal forests theory); VanderWeele et al. (2020) (measurement error and the potential outcomes framework, useful for the report).
Week 9: Resource allocation and policy trees
Required: GRF documentation — the Policy Learning article and the Qini curve article. Read alongside VanderWeele et al. (2020) and the lab's worked example.
Optional: Background on equity audits and governance considerations as named in the Week 9 lecture.
Week 10: Classical measurement theory from a causal perspective
Required: VanderWeele (2022) — Constructed Measures and Causal Inference: Towards a New Model of Measurement for Psychosocial Constructs. The paper sets out the structural-causal critique of classical measurement theory and motivates the multiple-versions-of-treatment perspective used in the lecture and the lab.
Optional: VanderWeele et al. (2020) (measurement error in the potential-outcomes framework); Hernán & Cole (2009) (causal diagrams and measurement bias).
Week 11: In-class test 2
No new reading. Review the Week 8, Week 9, and Week 10 lectures and the GRF documentation listed under Weeks 8–9.
Week 12: Student presentations
No new reading. Review the Presentation Rubric and the Reporting Guide.
General supplementary references
These are not required but provide additional depth on topics covered in the course.
- Neal (2020), chapters 1–2. Covers the same foundations as Hernán & Robins (2025) with a machine-learning orientation.
- Pearl et al. (2016). Compact introduction to the graphical (structural) approach to causation.
- Generalised Random Forests (GRF) website. Documentation, guides, and vignettes used in weeks 6, 8, and 9.
Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. Guilford Publications.
Briggs, D. C. (2021). Historical and conceptual foundations of measurement in the human sciences: Credos and controversies. Routledge.
Bulbulia, J. A. (2024a). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35
Bulbulia, J. A. (2024b). Methods in causal inference part 2: Interaction, mediation, and time-varying treatments. Evolutionary Human Sciences, 6, e41. https://doi.org/10.1017/ehs.2024.32
Bulbulia, J. A. (2024c). Methods in causal inference part 3: Measurement error and external validity threats. Evolutionary Human Sciences, 6, e42. https://doi.org/10.1017/ehs.2024.33
Bulbulia, J. A. (2024d). Methods in causal inference part 4: Confounding in experiments. Evolutionary Human Sciences, 6, e43. https://doi.org/10.1017/ehs.2024.34
Cashin, A. G., Hansford, H. J., Hernán, M. A., Swanson, S. A., Lee, H., Jones, M. D., Dahabreh, I. J., Dickerman, B. A., Egger, M., Garcia-Albeniz, X., et al. (2025). Transparent reporting of observational studies emulating a target trial—the TARGET statement. JAMA, 334(12), 1084–1093. https://doi.org/10.1001/jama.2025.13350
Hernán, M. A., & Cole, S. R. (2009). Invited commentary: Causal diagrams and measurement bias. American Journal of Epidemiology, 170(8), 959–962. https://doi.org/10.1093/aje/kwp293
Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (Draft). https://www.bradyneal.com/Introduction_to_Causal_Inference-Dec17_2020-Neal.pdf
Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal inference in statistics: A primer. John Wiley & Sons.
Pearl, J., & Mackenzie, D. (2018). The book of why: The new science of cause and effect. Basic Books.
VanderWeele, T. J. (2022). Constructed measures and causal inference: Towards a new model of measurement for psychosocial constructs. Epidemiology, 33(1), 141–151. https://doi.org/10.1097/EDE.0000000000001434
VanderWeele, T. J., Mathur, M. B., & Chen, Y. (2020). Outcome-wide longitudinal designs for causal inference: A new template for empirical studies. Statistical Science, 35(3), 437–466.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242. https://doi.org/10.1080/01621459.2017.1319839
Week 1: How to Ask a Question in Psychological Science?
Lab: Git and GitHub
Key idea
A claim becomes a causal question only when it names two states to compare, a target population, an outcome, and a time horizon. Even then one hard limit remains: for any individual, only one side of the contrast is ever observed.
Readings
Required: Hernán & Robins (2025), chapter 1. Recommended readings are listed at the end of this page.
Key concepts for the test(s)
Today we introduce three problems that recur throughout the course:
- Defining the question: a causal question requires a clear contrast
- Specifying the target population: the answer depends on who the question is about
- Unobservability of causal effects: we never observe both sides of the contrast for one person
Before next week
Bring your laptop in Week 1. Install R and RStudio in Week 1. Instructions are in Lab 2: Install R and Set Up Your IDE.
Motivating example: does social media harm adolescent wellbeing?
Orben & Przybylski (2019) report a negative association between time spent on social media and wellbeing among British teenagers. The observed correlation was 0.04, comparable in magnitude to the association between wearing glasses and wellbeing in the same dataset.
The finding was widely reported as evidence that social media harms young people. Some investigators argued that the original conclusions understated the harm: the teenagers who used social media most frequently had the lowest wellbeing scores, and the negative association was non-linear (Twenge et al., 2020).
Questions about whether social media use harms young people are in the news. On 18 February 2026, CNN reported testimony in ongoing litigation about adolescent social media use. Courts, legislators, and parents are making decisions right now on the basis of findings reported in scientific journals.
Yet what do the associational findings really tell us? Can we move from associations to policy-relevant causal conclusions about whether social media use harms young people? If so, what steps would we need to take? And for whom would our conclusions generalise?
These questions will occupy us over the coming weeks. The aim of this course is to provide you with a set of skills that enable you to ask and answer causal questions using observational data, and to identify variability in response across subgroups in the population of interest.
A simple map for week 1
This week gives you a checklist for deciding whether a claim is even asking a causal question yet.
Three questions to ask of any causal claim
- What are the two states of the world being compared?
- For which population is the comparison meant to hold?
- What part of the contrast is necessarily missing from observation?
If a study claim does not answer the first two questions, it is still too vague. If it forgets the third question, it will treat an observed association as though it were the causal contrast itself.
Psychology begins with a question
Before we can answer whether social media harms teenagers, we must ask a question that is clear enough to be answered. Wellbeing is broad, so for a worked example we narrow it to one specific outcome: a person's sense of purpose. Even "Does social media harm purpose?" is not yet a causal question, because it does not specify what is being compared with what. A causal question compares two states of the world. This question names only one.
We will use two words that are easy to confuse. An association asks whether two variables co-occur. A causal effect, on the other hand, asks what would happen if we changed something about the world.
Consider the difference. "Is time on social media associated with lower purpose?" asks whether two variables co-occur. "Would adolescent purpose improve if we replaced two hours of nightly doom-scrolling with two hours of study?" asks what would happen under a specific comparison. The second question states a contrast (scrolling versus studying), a population (adolescents), an outcome (purpose), and a time horizon (nightly, over some stated period). The first question does not.
The comparison between two states is what we call a causal contrast, or contrast for short. A contrast is the simplest structure a causal question can have: state A versus state B, for a defined group, measured on a defined outcome, over a defined time horizon.
A practical template is: for [population], what is the effect of [intervention] versus [control] on [outcome], measured by [measure] after an exposure period of [time horizon]? The arrow of time is built in: the intervention comes first, then we measure the outcome.
Five parts of a usable causal question
- Population: who is the question about?
- Intervention: what state of the world are we interested in?
- Control: what is the comparison condition?
- Outcome: what do we measure?
- Timing: when do we measure it?
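One way to make the checklist concrete is to write the five parts down as a small record and ask which parts are still blank. The sketch below is illustrative only: the class name `CausalQuestion`, its two methods, and the three-month exposure period are assumptions introduced here, not course materials.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CausalQuestion:
    """Hypothetical record holding the five parts of a causal question."""
    population: str
    intervention: str
    control: str
    outcome: str
    timing: str

    def missing_parts(self):
        """Names of the parts that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self):
        """Fill a sentence template, refusing if any part is missing."""
        missing = self.missing_parts()
        if missing:
            raise ValueError(f"not yet a causal question; missing: {missing}")
        return (
            f"For {self.population}, what is the effect of {self.intervention} "
            f"versus {self.control} on {self.outcome}, measured {self.timing}?"
        )


# "Does social media harm purpose?" names only one side of the contrast.
vague = CausalQuestion(
    population="adolescents",
    intervention="social media use",
    control="",
    outcome="sense of purpose",
    timing="",
)
print(vague.missing_parts())

# The doom-scrolling example, with an assumed three-month exposure period.
clear = CausalQuestion(
    population="adolescents",
    intervention="two hours of nightly doom-scrolling",
    control="two hours of nightly study",
    outcome="sense of purpose (1 to 7 scale)",
    timing="after three months",
)
print(clear.render())
```

The vague version fails on exactly the parts the text flags: it names no comparison condition and no timing.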
Everything in this course follows from the demand that psychological questions specify their contrasts. Every causal question in this course must satisfy this five-part template before we turn to the data. This lecture introduces three problems that make specifying a contrast harder than it first appears.
Problem 1: both sides of the contrast must be precisely defined
When investigators evaluate "time on social media," what do they mean? Passive scrolling, direct messaging, and creative content production may differ in their consequences. We need an interval over which the behaviour occurs: one week, one month, one year. We need to specify what the comparison condition is: passive scrolling versus studying, versus socialising in person, versus something else. Without precise specification, the question has no answer because it has not yet been asked.
The two sides of a contrast have names. The condition whose consequences we want to evaluate is the intervention. The state we compare it against is the control. "Intervention" and "control" are placeholders: neither implies a medical procedure or a laboratory setting. They simply label the two states of the world that define our comparison.
Precision extends to what we measure. Purpose is one specific outcome: a person's sense that their life has direction and meaning. It is distinct from related wellbeing outcomes such as life satisfaction, self-esteem, anxiety, and depressive symptoms, which need not move together. We must define the outcome, its measure (for example, a sense-of-purpose measure on a 1 to 7 scale), and the time frame over which we assess it. Later weeks examine several wellbeing outcomes together in an outcome-wide design; this week we keep one outcome, purpose, in view. The consequences of scrolling for a teenager's purpose in five hundred years are zero because life ends.
Notice that specifying interventions and outcomes forces us to order events along a timeline. For one state to influence another, it must precede it. There must be a contrast condition and a stated time horizon, because the size of an effect depends on how long the exposure lasts and when the outcome is measured. The effects of scrolling for five minutes a day for three weeks (contrasted with no social media) might differ from the effects of scrolling for five hours every day for five years.
In later weeks we extend this idea to more complex questions with more than two states of the world, or with sequences of actions over time. The same demand applies: name the states, the population, the outcome, and the time horizon.
Problem 2: the answer to a causal question depends on the population
The teenagers that Orben & Przybylski (2019) studied were a convenience sample from one country at one moment in time. Would the association of 0.04 hold in other countries? Would it hold today? The concept of "teenager" is itself vague. It lumps thirteen-year-olds, essentially children, with nineteen-year-olds, essentially adults. The answer to a causal question may systematically differ by age, gender, socioeconomic background, or parental attention.
Before we can evaluate whether social media influences purpose, we must specify the target population. The answer for one population may differ from the answer for another, and may even have the opposite sign. There is no abstract answer to a causal question without reference to both the contrast conditions and a population.
The distinction between the sample population (who you studied) and the target population (who you want to learn about) is central to external validity, which we formalise in Week 4. We return to population specificity when we discuss variation in responses across subgroups (Weeks 6, 8, and 9) and transportability (Week 4).
Problem 3: no more than one side of the contrast can be measured for each individual
Consider Alice, who takes up two hours of doom-scrolling each night before bed. Suppose she enrolled in an experiment and was randomised to the doom-scrolling condition. The contrast is studying mathematics for two hours each night. At the end of the trial Alice reports a high sense of purpose. Can we say that doom-scrolling caused Alice's sense of purpose to be higher than it would have been under the mathematics condition?
We cannot, because no more than one side of the contrast can be measured for Alice in a given period. Alice followed the doom-scrolling protocol. She did not follow the mathematics protocol. We observe only one state of the world for Alice, never both.
This is the central logical problem in causal inference. A causal effect compares two possible futures for the same person, but we only ever observe one future.
We formalise this with potential outcomes notation. Let $Y_i(1)$ denote the outcome that person $i$ would experience under the intervention ($A = 1$), and $Y_i(0)$ the outcome under the control condition ($A = 0$). The individual causal effect is:
$$\delta_i = Y_i(1) - Y_i(0)$$
This quantity, $\delta_i$, is the contrast at the level of a single person: the difference between what would happen to person $i$ under treatment and what would happen under control. We observe only one of $Y_i(1)$ or $Y_i(0)$ for any individual. The individual causal effect $\delta_i$ is therefore never directly observable. The obstacle is logical: no amount of data collection, no statistical technique, and no machine learning algorithm can reveal both potential outcomes for the same person at the same time.
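A simulation can hold both potential outcomes for each person, which real data never can. The sketch below invents five hypothetical people on an assumed 1 to 7 purpose scale, then shows what a data set would contain once each person receives only one treatment level. All names (`observe`, `individual_effect`) and numbers are assumptions made for illustration.

```python
import random

random.seed(434)

# The simulator's view: both potential outcomes exist for everyone.
people = []
for i in range(5):
    people.append({
        "id": i,
        "A": random.randint(0, 1),    # treatment level actually received
        "Y1": random.randint(1, 7),   # purpose under the intervention
        "Y0": random.randint(1, 7),   # purpose under the control condition
    })


def observe(person):
    """The analyst's view: the received treatment and one outcome only."""
    y = person["Y1"] if person["A"] == 1 else person["Y0"]
    return {
        "id": person["id"],
        "A": person["A"],
        "Y": y,
        "Y1": y if person["A"] == 1 else None,   # None marks the unobserved side
        "Y0": y if person["A"] == 0 else None,
    }


def individual_effect(row):
    """Return Y1 - Y0 when both are available, otherwise None."""
    if row["Y1"] is None or row["Y0"] is None:
        return None
    return row["Y1"] - row["Y0"]


observed = [observe(p) for p in people]
for truth, row in zip(people, observed):
    print("true delta:", individual_effect(truth), "| from data:", individual_effect(row))
```

Every row of `observed` has exactly one missing potential outcome, so the individual effect cannot be computed from the data for anyone.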
Pair exercise: formulating a contrast
- Take the headline "Screen time linked to poor sleep in teenagers."
- Write a causal question that specifies both sides of the contrast (screen time versus what?), a defined outcome (which aspect of sleep?), a target population, and a time horizon.
- Swap with your partner and critique: is the other side of the contrast well defined? Is the population specific enough?
The individual causal effect is unobservable, so causal inference shifts to populations: what can we learn about the average effect across a defined group?
From individuals to populations
If individual causal effects are unobservable, what can we learn? We can learn about average effects across a population. The average treatment effect (ATE) is:
$$\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$$
This is the expected difference in the outcome if everyone in the target population experienced the intervention versus if everyone experienced the control condition. The ATE is a population-level quantity. It tells us what would happen on average, not what would happen to any particular person.
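To see how one average can sit on top of very different individual effects, the sketch below simulates a hypothetical population in which we pretend to know both potential outcomes. The data-generating numbers (a baseline of 4.0, individual effects centred on -0.3) are assumptions chosen for illustration, not estimates from any study.

```python
import random
from statistics import mean

random.seed(434)
N = 10_000

# Assumed data-generating process (illustrative numbers only).
y0 = [random.gauss(4.0, 1.0) for _ in range(N)]        # outcome under control
delta = [random.gauss(-0.3, 1.0) for _ in range(N)]    # individual effects vary
y1 = [base + d for base, d in zip(y0, delta)]          # outcome under intervention

ate = mean(delta)                       # E[Y(1) - Y(0)]
ate_from_means = mean(y1) - mean(y0)    # E[Y(1)] - E[Y(0)], equal by linearity
share_negative = mean([1 if d < 0 else 0 for d in delta])

print(f"ATE as mean of individual effects: {ate:.3f}")
print(f"ATE as difference of means:        {ate_from_means:.3f}")
print(f"share with a negative effect:      {share_negative:.2f}")
```

The two ways of computing the ATE agree, yet a large share of simulated people have effects of the opposite sign to the average. The ATE describes the population, not any one member of it.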
Causal inference contrasts counterfactual states at the population or subpopulation level. When we say "social media influences purpose," we mean that on average, across a defined population, one pattern of use changes purpose relative to the counterfactual of another pattern. We must specify the contrast conditions and the population for this statement to have content.
A short memory aid
- A causal question needs a contrast.
- A causal answer is always population-specific.
- Individual causal effects are not directly observed.
What we learned
Return to the social media question. Orben & Przybylski (2019) found an association of 0.04 between social media use and lower wellbeing. Courts and legislators are treating this as evidence of harm. We now see three reasons why the leap from association to causation fails.
First, "the influence of social media on wellbeing" is undefined until we specify the interventions (scrolling versus what?), the outcomes (which dimension of wellbeing?), and the time frame. Second, the answer to a causal question depends on the target population, and the populations that matter (thirteen-year-olds in Aotearoa New Zealand today) may differ from the population studied (British teenagers before 2019). Third, the individual causal effect is never observable; we can only recover average effects under assumptions we have not yet stated.
The lesson is that before answering a question we must ask it. Psychology begins with a clearly defined question. A well-defined causal question requires a contrast between at least two interventions, a specified outcome and time horizon, and a target population. These are the five parts of the template: population, intervention, control, outcome, timing. In later weeks we add the further question of whether the observed data can identify that contrast.
Most psychological research cannot randomise the variables we care about. We cannot randomly assign people to experience trauma, adopt a religion, or lose a job. Week 2 introduces the randomised experiment as the benchmark for causal inference and the graphical tools (causal diagrams) that allow us to reason about causation when randomisation is impossible.
Pair exercise: three problems in one claim
- Take the claim "Religion improves mental health."
- Specify the contrast by naming a concrete intervention and control condition (religion versus what, exactly?).
- Specify the population (for whom, where, and when?).
- Specify the outcome, the measure, and the timing (what do we measure, and when do we measure it after the exposure period?).
- Rewrite the claim using the course template: for [population], what is the effect of [intervention] versus [control] on [outcome], measured by [measure] after an exposure period of [time horizon]?
- Swap with your partner and critique: is the contrast precise, is the population defensible, and does the timing make sense (intervention first, outcome later)?
Further reading
For an accessible introduction to causal inference and its history, see Pearl & Mackenzie (2018). The two core causal questions and the formal treatment of causal inference appear in Bulbulia (2024).
Lab materials: Lab 1: Git and GitHub
Bulbulia, J. A. (2024). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35
Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Orben, A., & Przybylski, A. K. (2019). The association between adolescent well-being and digital technology use. Nature Human Behaviour, 3(2), 173–182. https://doi.org/10.1038/s41562-018-0506-1
Pearl, J., & Mackenzie, D. (2018). The book of why: The new science of cause and effect. Basic Books.
Twenge, J. M., Haidt, J., Joiner, T. E., et al. (2020). Underestimating digital media harm. Nature Human Behaviour, 4, 346–348. https://doi.org/10.1038/s41562-020-0839-4
Week 2: Causal Diagrams: Five Elementary Structures
Key idea
A causal directed acyclic graph (DAG) does not create causal knowledge. It makes the assumptions behind a causal claim explicit, so they can be stated, checked, and challenged.
Readings
Required
- Hernán & Robins (2025), chapters 1–2.
Optional
- Barrett (2023). See also: Introduction to DAGs and Common Structures of Bias.
Key concepts for the test(s)
- Internal validity
- External validity
- Causal directed acyclic graph (causal DAG)
- Five elementary causal structures
- Confounding
- d-separation
- Backdoor path
- Conditioning
- Fork
- Chain
- Collider bias
- Mediator bias
- Four rules of confounding control
Lab 2 setup
Use Lab 2: Install R and Set Up Your IDE for this week's practical work. The optional script is here: Download the R script for Lab 02.
Seminar
Motivating example: the Salk vaccine
The 1954 field trial of the Salk polio vaccine was a multi-site study conducted across many communities in the United States (Dublin, 1955; Francis Jr., 1955). Two protocols were used in different participating areas. In the observed-control protocol, second-grade children whose parents consented received the vaccine. Children in the first and third grades served as controls. In the placebo-controlled protocol, children were randomised to vaccine or placebo under double-blind conditions.
The placebo-controlled protocol supported a causal conclusion. The vaccine reduced paralytic polio (Francis Jr., 1955). The observed-control protocol did not support the same conclusion, because parental consent shaped vaccine assignment.
Consent and polio susceptibility shared a common cause—socioeconomic status—which opened a backdoor path between vaccine receipt and the outcome. Vaccination status and polio risk differed for reasons other than the vaccine itself. That is the structure of confounding: a fork in which a common cause affects both treatment and outcome.
Same question. Different assignment mechanism. Different estimate, different reliability.
Why this week matters
In Week 1 we defined causal questions. This week we learn how to represent structural assumptions using a causal directed acyclic graph (causal DAG): a diagram in which nodes represent variables and arrows represent assumed causal directions, with no cycles. A causal DAG does not create causal knowledge. It makes assumptions explicit, checkable, and discussable.
A simple map for week 2
For test purposes, most week 2 questions reduce to three steps.
How to read a DAG
- Identify the path structure: fork, chain, or collider.
- Ask whether the path is open or blocked as drawn.
- Ask what conditioning would do: close the path or open it.
If you can do those three things, you can usually explain the bias logic in words.
Independence language
To say precisely what randomisation buys us, we need a compact way to write that two quantities are independent. The notation below is exactly that shorthand. It carries no causal force on its own.
We never observe $Y(a)$ for more than one treatment level per person (Week 1's unobservability problem). The notation $Y(a)$ refers to "the outcome this person would have under treatment level $a$." The observed $Y$ equals $Y(a)$ only for the level of $A$ that person actually received.
- $A \coprod Y(a)$: $A$ and $Y(a)$ are independent.
- $A \cancel\coprod Y(a)$: $A$ and $Y(a)$ are dependent.
- $A \coprod Y(a)\mid L$: $A$ and $Y(a)$ are independent once we condition on $L$.
Conditioning means restricting attention to observations that share the same value of a variable (or, equivalently, adjusting for that variable in an analysis). In a causal DAG, a conditioned variable is drawn inside a box.
Randomisation and exchangeability
Under random assignment,
$$ Y(a) \coprod A. $$
We call this state unconditional exchangeability. Under this state, a difference in means identifies the average treatment effect (ATE):
$$ \widehat{\text{ATE}}=\hat{\mathbb{E}}[Y\mid A=1]-\hat{\mathbb{E}}[Y\mid A=0]. $$
In observational studies, unconditional exchangeability usually fails.
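A small simulation can illustrate the difference the assignment mechanism makes. In the sketch below the true effect is fixed at 1.0 for everyone, a binary background variable `l` raises the outcome, and treatment is either randomised or made to depend on `l`. The function names and every number are assumptions for illustration. Under these assumed numbers, the confounded difference in means targets 2.2 in expectation rather than 1.0.

```python
import random
from statistics import mean

random.seed(2026)
N = 20_000
TRUE_ATE = 1.0


def simulate(randomised):
    """Return (a, y) pairs under randomised or confounded assignment."""
    rows = []
    for _ in range(N):
        l = random.random() < 0.5                  # background cause of the outcome
        y0 = 2.0 * l + random.gauss(0, 1)          # potential outcome under control
        y1 = y0 + TRUE_ATE                         # potential outcome under treatment
        if randomised:
            p_treat = 0.5                          # assignment ignores l
        else:
            p_treat = 0.8 if l else 0.2            # assignment depends on l
        a = random.random() < p_treat
        rows.append((a, y1 if a else y0))          # only one outcome is revealed
    return rows


def diff_in_means(rows):
    """Mean observed outcome among treated minus mean among untreated."""
    treated = [y for a, y in rows if a]
    control = [y for a, y in rows if not a]
    return mean(treated) - mean(control)


rct_estimate = diff_in_means(simulate(randomised=True))
confounded_estimate = diff_in_means(simulate(randomised=False))
print(f"randomised assignment: {rct_estimate:.2f}")
print(f"confounded assignment: {confounded_estimate:.2f}")
```

The estimator is identical in both runs. Only the assignment mechanism changes, and with it whether the difference in means recovers the effect.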
Working definitions
Internal validity concerns whether the study contrast estimates the target causal contrast in the study population. A threat to internal validity corresponds to an open non-causal path between $A$ and $Y$.
External validity concerns whether that causal contrast transports to the target population. We will say more about failures of transportability in Week 4.
Confounding bias occurs when treatment groups differ systematically in ways that affect the outcome, so that the observed association between $A$ and $Y$ does not equal the causal effect. (Confounding bias is a failure of internal validity.)
Confounding arises because there is a common cause of the treatment and the outcome.
In causal graph terminology, confounding corresponds to an open non-causal path from $A$ to $Y$. Such a path is called a backdoor path: a path between $A$ and $Y$ that begins with an arrow pointing into $A$. The four rules of confounding control below show how to close these paths, and Week 3 develops the backdoor criterion in full.
Causal DAG notation and elements
- $A$: treatment or exposure.
- $Y$: outcome.
- $Y(a)$: potential outcome under intervention level $a$.
- $L$: measured confounder set.
- $U$: unmeasured cause.
- $M$: mediator.
- $\bar{X}$: sequence of variables.
- $\mathcal{R}$: chance mechanism, including randomisation.
- Arrows encode assumed causal direction.
- Boxes indicate conditioning.
- Open red paths indicate biasing pathways.
Five elementary structures
The five structures below correspond to the labelled panels in the figure.
- No causal relation: $A \coprod B$. The variables are statistically independent.
- Direct causation: $A\to B$. The variables are statistically dependent: $A \cancel\coprod B$.
- Fork: $A\to B$ and $A\to C$. Because $A$ causes both $B$ and $C$, $B$ and $C$ are associated. Conditioning on $A$ removes that association: $B \coprod C \mid A$.
- Chain: $A\to B\to C$. Because $B$ mediates the effect of $A$ on $C$, $A$ and $C$ are associated. Conditioning on $B$ blocks the path: $A \coprod C \mid B$.
- Collider: $A\to C\leftarrow B$. Because $A$ and $B$ both cause $C$ but do not cause each other, they are marginally independent. Conditioning on $C$ opens a spurious association: $A \cancel\coprod B \mid C$. The intuition: when two independent causes can both produce the same effect, knowledge of the effect makes the two causes informative about each other. Given that the effect occurred, learning that one cause was absent makes the other more probable. Conditioning on $C$ therefore induces a non-causal association between its causes.
These five structures generate all patterns of conditional independence and dependence in a causal DAG. Understanding which structures block and which transmit association is the basis for confounding control.
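The fork, chain, and collider can be checked by simulation. The sketch below generates binary variables for each structure and compares the marginal correlation with the correlation inside one stratum of the middle variable, which is what conditioning means here. The probabilities (0.8, 0.2, 0.9, 0.1) and helper names are assumptions for illustration.

```python
import random
from statistics import mean

random.seed(434)
N = 20_000


def bern(p):
    """Draw 1 with probability p, else 0."""
    return 1 if random.random() < p else 0


def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def corr_given(xs, ys, zs, value):
    """Correlation of xs and ys among rows where zs equals value."""
    kept = [(x, y) for x, y, z in zip(xs, ys, zs) if z == value]
    return corr([x for x, _ in kept], [y for _, y in kept])


# Fork: A -> B and A -> C
fa = [bern(0.5) for _ in range(N)]
fb = [bern(0.8 if a else 0.2) for a in fa]
fc = [bern(0.8 if a else 0.2) for a in fa]

# Chain: A -> B -> C
ca = [bern(0.5) for _ in range(N)]
cb = [bern(0.8 if a else 0.2) for a in ca]
cc = [bern(0.8 if b else 0.2) for b in cb]

# Collider: A -> C <- B, with A and B drawn independently
ka = [bern(0.5) for _ in range(N)]
kb = [bern(0.5) for _ in range(N)]
kc = [bern(0.9 if (a or b) else 0.1) for a, b in zip(ka, kb)]

results = {
    "fork: B~C": corr(fb, fc),
    "fork: B~C | A=1": corr_given(fb, fc, fa, 1),
    "chain: A~C": corr(ca, cc),
    "chain: A~C | B=1": corr_given(ca, cc, cb, 1),
    "collider: A~B": corr(ka, kb),
    "collider: A~B | C=1": corr_given(ka, kb, kc, 1),
}
for label, value in results.items():
    print(f"{label:22s} {value:+.2f}")
```

For the fork and the chain, conditioning should shrink a clear association to roughly zero. For the collider the pattern reverses: roughly zero marginally, clearly non-zero once we restrict to `C = 1`.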
Three questions for any path
- Is this path causal or non-causal?
- Is it open or blocked right now?
- What would conditioning on the middle variable do?
Pair exercise: naming the structure
For each scenario, name the elementary structure, state whether the two end variables are marginally associated, and predict what conditioning does.
- Socioeconomic status (SES) causes both neighbourhood quality and health outcomes. What structure links neighbourhood and health? What happens if you condition on SES?
- A drug reduces inflammation, and inflammation causes pain. What structure links the drug to pain? What happens if you condition on inflammation?
- Genetics affects blood pressure and diet affects blood pressure, but genetics and diet do not cause each other. What structure links genetics and diet through blood pressure? What happens if you condition on blood pressure?
Where do causal assumptions come from?
A causal DAG encodes assumptions. Those assumptions do not come from the data. They come from prior knowledge: theory, mechanism, previous studies, and domain expertise. This dependence on existing knowledge might seem circular. If we need knowledge to draw a causal DAG, and a causal DAG is required for causal inference, where do we start?
Otto Neurath's metaphor of the ship at sea captures the answer:
We are like sailors who on the open sea must reconstruct their ship but are never able to start afresh from the bottom. Where a beam is taken away a new one must at once be put there, and for this the rest of the ship is used as support. In this way, by using the old beams and driftwood, the ship can be shaped entirely anew, but only by gradual reconstruction. (Neurath, 1973, p. 199)
Causal diagrams are planks in Neurath's boat. We build them from the best available knowledge, test their implications, and revise when evidence warrants. The alternative, letting data alone determine causal structure, is not available. Data reveal associations. Associations are compatible with many causal structures. Without assumptions, the data do not tell us which structure generated them.
Pair exercise: Neurath's ship and your own causal DAG
- Draw a causal DAG from your own research interest with at least four variables.
- Identify one fork and one chain in your causal DAG.
- Swap with your partner. Your partner plays sceptic: challenge one arrow by proposing an alternative causal direction or an omitted common cause.
- Revise your causal DAG in response. State what changed and why.
With this grounding in the five elementary structures and the sources of our assumptions, we can now state the three identification conditions that any causal analysis requires.
Three identification assumptions
Assumption 1: Causal consistency
If person $i$ receives $A_i=a$, then $Y_i=Y_i(a)$.
Assumption 2: Conditional exchangeability
After conditioning on an adequate set $L$,
$$ Y(a) \coprod A \mid L. $$
Assumption 3: Positivity
For all treatment levels and covariate strata used for inference,
$$ P(A=a\mid L=l)>0. $$
In DAG terms, conditional exchangeability requires that, after conditioning on $L$, every backdoor path between $A$ and $Y$ is closed. Positivity requires that the data contain observations at every treatment level within the covariate strata we intend to use—a property that is itself a finding, not an assumption we can simply assert.
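The sketch below shows how the three assumptions might be put to work on simulated data: tabulate treatment by stratum to check positivity, then average the stratum-specific contrasts weighted by stratum size (standardisation). It assumes a single binary confounder `l` that is sufficient for conditional exchangeability. The numbers and names are illustrative assumptions, and standardisation is one adjustment method among several.

```python
import random
from collections import Counter
from statistics import mean

random.seed(7)
N = 20_000
TRUE_ATE = 1.0

# Confounded assignment: l affects both treatment and outcome.
rows = []
for _ in range(N):
    l = 1 if random.random() < 0.5 else 0
    a = 1 if random.random() < (0.8 if l else 0.2) else 0
    y = 2.0 * l + TRUE_ATE * a + random.gauss(0, 1)   # consistency: Y equals Y(a) for the a received
    rows.append((l, a, y))

# Positivity check: every (stratum, treatment) cell must contain observations.
cells = Counter((l, a) for l, a, _ in rows)
positivity_ok = all(cells[(l, a)] > 0 for l in (0, 1) for a in (0, 1))


def stratum_contrast(level):
    """Treated minus untreated mean outcome within one stratum of l."""
    treated = [y for l, a, y in rows if l == level and a == 1]
    control = [y for l, a, y in rows if l == level and a == 0]
    return mean(treated) - mean(control)


# Standardisation: weight each stratum contrast by the stratum's share.
share = {level: sum(1 for l, _, _ in rows if l == level) / N for level in (0, 1)}
standardised = sum(share[level] * stratum_contrast(level) for level in (0, 1))

print("cell counts:", dict(cells))
print("positivity holds in sample:", positivity_ok)
print(f"standardised estimate: {standardised:.2f}")
```

If any cell count were zero, `stratum_contrast` would have nothing to compare in that stratum, which is the practical face of a positivity violation.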
Pair exercise: checking assumptions against a causal DAG
- Draw a causal DAG for the Salk vaccine example. Include: parental consent ($L$), vaccine assignment ($A$), polio outcome ($Y$), and socioeconomic status ($U$) as an unmeasured common cause of $L$ and $Y$.
- In the observational design (assignment by parental consent), check exchangeability: is $Y(a) \coprod A$? Trace the open backdoor path.
- In the randomised design, check exchangeability: is $Y(a) \coprod A$? Explain why the path is now blocked.
- Check positivity in each design. In which design is a positivity violation more probable, and why?
Four rules of confounding control
- Condition on common causes (rule 1: close the fork). If $L$ causes both $A$ and $Y$, the fork $A \leftarrow L \to Y$ opens a backdoor path. Conditioning on $L$ closes it. When $L$ is unmeasured, conditioning on a measured proxy can reduce, though not eliminate, confounding.
- Do not condition on mediators when estimating total effects (rule 2: preserve the chain). If $A \to M \to Y$, conditioning on $M$ blocks the part of the effect of $A$ on $Y$ that runs through $M$.
- Do not condition on colliders (rule 3: avoid opening the collider). If $A \to C \leftarrow Y$, conditioning on $C$ opens a spurious path between $A$ and $Y$. The "control for everything" instinct is unsafe for this reason.
- Treat descendants carefully (rule 4: trace the structure first). Conditioning on a descendant of a variable is akin to partially conditioning on that variable. Conditioning on a descendant of a collider can induce collider bias; conditioning on a descendant of a confounder can partially reduce confounding.
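Rules 2 and 3 can be illustrated with two small simulations in which treatment is randomised, so any distortion comes from what we condition on. In the first, the whole effect of `A` runs through a mediator `M`. In the second, `A` has no effect on `Y` at all, but both influence a collider `C`. The data-generating numbers and the helper `contrast` are assumptions for illustration.

```python
import random
from statistics import mean

random.seed(434)
N = 20_000


def contrast(rows, keep=lambda row: True):
    """Treated minus untreated mean of Y among the rows that pass keep."""
    treated = [r["Y"] for r in rows if keep(r) and r["A"] == 1]
    control = [r["Y"] for r in rows if keep(r) and r["A"] == 0]
    return mean(treated) - mean(control)


# Rule 2. Chain A -> M -> Y: the total effect is 2.0 * (0.8 - 0.2) = 1.2.
chain = []
for _ in range(N):
    a = 1 if random.random() < 0.5 else 0
    m = 1 if random.random() < (0.8 if a else 0.2) else 0
    y = 2.0 * m + random.gauss(0, 1)
    chain.append({"A": a, "M": m, "Y": y})

total_effect = contrast(chain)
within_m1 = contrast(chain, keep=lambda r: r["M"] == 1)   # conditions on the mediator

# Rule 3. Collider A -> C <- Y: A has no effect on Y by construction.
collider = []
for _ in range(N):
    a = 1 if random.random() < 0.5 else 0
    y = random.gauss(0, 1)
    c = 1 if random.random() < (0.9 if (a == 1 or y > 0) else 0.1) else 0
    collider.append({"A": a, "C": c, "Y": y})

null_effect = contrast(collider)
within_c1 = contrast(collider, keep=lambda r: r["C"] == 1)   # conditions on the collider

print(f"chain, unadjusted:       {total_effect:+.2f}")
print(f"chain, within M = 1:     {within_m1:+.2f}")
print(f"collider, unadjusted:    {null_effect:+.2f}")
print(f"collider, within C = 1:  {within_c1:+.2f}")
```

Conditioning on the mediator hides an effect that exists, and conditioning on the collider manufactures an association where no effect exists. In both cases the unadjusted contrast is the one that matches the truth.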
A short rulebook for the test
- Common cause: usually condition.
- Mediator: do not condition if you want the total effect.
- Collider: do not condition.
- Descendant: ask what it is downstream of before you adjust for it.
A note on the generality of d-separation
Two variables are d-separated ("directionally separated") in a causal DAG when every path between them is blocked. In practice, this means the DAG implies conditional independence once the relevant conditioning set is stated. The rules above focus on confounding, but d-separation is more general than confounding control. It is the reason the same DAG logic works for collider bias, mediator bias, and measurement problems. For example, in a collider $A \to C \leftarrow Y$, $A$ and $Y$ are d-separated when we do not condition on $C$, but conditioning on $C$ opens the path and they become d-connected. The same blocking logic—ask whether every path is open or closed—applies in all three bias settings.
Return to the opening example
The Salk example is a structural lesson about assignment. The observed-control design produced a biased effect estimate because socioeconomic status confounded the comparison. The randomised design blocked that path. Causal DAGs help us state this lesson before analysis. First we define the question. Then we draw assumptions. Then we choose an adjustment set. Then we estimate.
The five structures, three assumptions, and four rules form a single framework: the four rules apply the five structures to the problem of bias, and the three assumptions specify the conditions under which those rules produce an unbiased estimate.
Epilogue: avoid "within-person" and "between-person"
Students often describe designs as "within-person" or "between-person". These labels feel intuitive, but they hide the causal object. "Between-person" in particular can mislead because it sounds like we compare two different populations. In an experiment we have one population, which we project into two potential states under two intervention levels. Randomisation lets two groups stand in for those two projected states.
In this course we instead name a target population, two intervention regimes, an outcome, and the time that we measure that outcome. This framing works for any target population, including a single person studied over time (Week 1's Alice example). The key lesson from Week 2 is that even when we target a population-level average, we still need a defensible assignment story. Causal DAGs let us state, and critique, the assumptions that connect the observed data to the causal contrast. They do not rescue imprecise language. They force us to say what we compare, for whom, and why.
Further reading
The identification assumptions and randomisation framework are treated in Hernán & Robins (2025) (chapters 1-2) and Bulbulia (2024a). See also Bulbulia (2024b).
Lab materials: Lab 2: Install R and Set Up Your IDE
Barrett, M. (2023). ggdag: Analyze and create elegant directed acyclic graphs. https://github.com/malcolmbarrett/ggdag
Bulbulia, J. A. (2024a). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35
Bulbulia, J. A. (2024b). Methods in causal inference part 4: Confounding in experiments. Evolutionary Human Sciences, 6, e43. https://doi.org/10.1017/ehs.2024.34
Dublin, T. D. (1955). 1954 poliomyelitis vaccine field trial: Plan, field operations, and follow-up observations. JAMA, 158(14), 1258–1265. https://doi.org/10.1001/jama.1955.02960140020003
Francis Jr., T. (1955). Evaluation of the 1954 poliomyelitis vaccine field trial: Further studies of results determining the effectiveness of poliomyelitis vaccine (Salk) in preventing paralytic poliomyelitis. JAMA, 158(14), 1266–1270. https://doi.org/10.1001/jama.1955.02960140028004
Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Neurath, O. (1973). Anti-Spengler. In M. Neurath & R. S. Cohen (Eds.), Empiricism and sociology (pp. 158–213). Springer Netherlands. https://doi.org/10.1007/978-94-010-2525-6_6
Week 3: Causal Diagrams: The Structures of Confounding Bias
Date: 11 Mar 2026
Key idea
Which variables to adjust for is a question about causal structure, not about model fit. A regression can fit the data better and still answer the wrong causal question, because $R^2$ cannot detect confounding.
Readings
Required
Optional
Key concepts for the test(s)
- Confounding bias
- Backdoor path
- Backdoor criterion
- Valid adjustment set
- M-bias
- Regression
- Intercept
- Regression coefficient
- Model fit
- Why model fit is misleading for causality
Lab 3 setup
Use Lab 3: Regression, Graphing, and Simulation for this week's practical work. The optional script is here: Download the R script for Lab 03.
Week 2 introduced five elementary causal structures and the rules of d-separation. This week we use those structures to diagnose confounding bias and ask: when does conditioning on a variable remove bias, and when does it create bias?
Seminar
Motivating example: higher $R^2$, worse identification
Suppose investigators want the total effect of an exercise programme ($A$) on cardiovascular risk ($Y$). They begin with a regression that adjusts for baseline confounders. Then they add body composition measured after the programme ($M$). The model $R^2$ rises, because body composition is a strong correlate of cardiovascular risk.
Did the higher $R^2$ improve the causal estimate? No. If the programme changes body composition and body composition changes cardiovascular risk, then $M$ is a mediator on the path $A \to M \to Y$. Conditioning on $M$ blocks part of the very effect the investigators wanted to estimate. The model fits the observed data better, but it answers the wrong causal question.
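A small simulation makes the point concrete. This is a sketch under stated assumptions, not an analysis of real data: the programme is randomised (so there is no confounding to worry about), all effect sizes are invented, and the helper `fit` is a hand-rolled least-squares routine for one or two predictors.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def fit(y, predictors):
    """Least squares with one or two predictors; returns (coefficients, R^2)."""
    if len(predictors) == 1:
        x = predictors[0]
        b = [cov(x, y) / cov(x, x)]
    else:
        x1, x2 = predictors
        s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
        s1y, s2y = cov(x1, y), cov(x2, y)
        det = s11 * s22 - s12 ** 2
        b = [(s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det]
    explained = sum(bj * cov(xj, y) for bj, xj in zip(b, predictors))
    return b, explained / cov(y, y)


random.seed(434)
n = 20_000

# Assumed structure: A -> M -> Y plus a direct path A -> Y.
# Invented total effect of A on Y: -0.5 + (-1.0)(1.0) = -1.5.
A = [float(random.random() < 0.5) for _ in range(n)]
M = [-1.0 * a + random.gauss(0, 1) for a in A]
Y = [-0.5 * a + 1.0 * m + random.gauss(0, 1) for a, m in zip(A, M)]

(b_total,), r2_total = fit(Y, [A])
(b_adj, _), r2_adj = fit(Y, [A, M])

print(f"Y ~ A:     coefficient on A = {b_total:.2f}, R^2 = {r2_total:.2f}")
print(f"Y ~ A + M: coefficient on A = {b_adj:.2f}, R^2 = {r2_adj:.2f}")
```

In this setup the first model should recover a coefficient near the total effect of -1.5. The second model fits better, yet its coefficient on $A$ should sit near -0.5, which is only the part of the effect that does not pass through $M$.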
A simple map for week 3
Week 3 asks one repeated question: which variables should we condition on, and why?
A practical workflow
- Draw the relevant backdoor paths from $A$ to $Y$.
- Decide which variables would block those paths.
- Check that you are not conditioning on a mediator, a collider, or a descendant of treatment.
That is the logic behind the backdoor criterion. Regression is only one way of carrying out the conditioning decision.
Learning outcomes
By the end of this week, you will be able to:
- Define confounding bias and identify it in a DAG.
- Apply the backdoor criterion.
- Explain why good model fit does not rule out confounding.
- Distinguish confounding problems that time ordering can solve from those it cannot.
- Define M-bias and explain why conditioning on a pre-treatment collider can introduce bias.
What is confounding?
Confounding exists when a common cause of treatment $A$ and outcome $Y$ opens a non-causal backdoor path.
Definition: confounding bias
Confounding bias exists when at least one backdoor path from $A$ to $Y$ is open.
A backdoor path is a path between $A$ and $Y$ that starts with an arrow into $A$.
Example: exercise and blood pressure. Health consciousness ($L$) may affect exercise ($A$). Health consciousness ($L$) may also affect blood pressure ($Y$). Then $A \leftarrow L \to Y$ is a backdoor path. If we do not condition on $L$, we mix causal association and spurious association.
Quick test
If a path from $A$ to $Y$ starts with an arrow into $A$, treat it as a candidate backdoor path.
The backdoor criterion
Pearl's backdoor criterion tells us when an adjustment set is valid.
Definition: backdoor criterion
A set $L$ satisfies the backdoor criterion for $A$ and $Y$ if:
- No variable in $L$ is a descendant of $A$.
- $L$ blocks every backdoor path from $A$ to $Y$.
If both conditions hold, conditioning on $L$ supports exchangeability: $Y(a) \coprod A \mid L$.
A short memory aid
- Block all backdoor paths.
- Do not adjust for descendants of treatment.
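The two conditions can be checked mechanically once a DAG is written down. The sketch below is a hand-rolled illustration in Python, assuming a small hypothetical DAG ($L \to A$, $L \to Y$, $A \to M \to Y$) stored as a dictionary from each node to its children. The function names are ours and are not part of any package used in the labs.

```python
def descendants(dag, node):
    """All nodes reachable from `node` by following arrows forward."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def paths(dag, start, end):
    """All simple paths between start and end, ignoring arrow direction."""
    neighbours = {}
    for parent, children in dag.items():
        for child in children:
            neighbours.setdefault(parent, set()).add(child)
            neighbours.setdefault(child, set()).add(parent)

    def walk(path):
        if path[-1] == end:
            yield path
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])


def is_blocked(dag, path, z):
    """Apply the d-separation rules to one path, given conditioning set z."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = node in dag.get(prev, ()) and node in dag.get(nxt, ())
        if collider:
            # A collider blocks unless it, or one of its descendants, is in z.
            if node not in z and not (descendants(dag, node) & z):
                return True
        elif node in z:
            # A fork or chain node blocks when we condition on it.
            return True
    return False


def satisfies_backdoor(dag, a, y, z):
    z = set(z)
    if z & descendants(dag, a):  # condition 1: no descendants of treatment
        return False
    backdoor = [p for p in paths(dag, a, y) if a in dag.get(p[1], ())]
    return all(is_blocked(dag, p, z) for p in backdoor)  # condition 2


# Hypothetical DAG: L -> A, L -> Y, A -> M -> Y.
dag = {"L": ["A", "Y"], "A": ["M"], "M": ["Y"]}

print(satisfies_backdoor(dag, "A", "Y", {"L"}))       # True
print(satisfies_backdoor(dag, "A", "Y", set()))       # False
print(satisfies_backdoor(dag, "A", "Y", {"L", "M"}))  # False
```

The empty set fails because the backdoor path through $L$ stays open. The set containing $M$ fails for a different reason: $M$ is a descendant of $A$.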
Pair exercise: applying the backdoor criterion
- Draw a DAG for the effect of exercise ($A$) on cardiovascular risk ($Y$) with three additional variables: health consciousness ($L_1$), diet ($L_2$), and body composition ($M$, a mediator on the $A \to Y$ path). Include arrows: $L_1 \to A$, $L_1 \to L_2$, $L_2 \to Y$, $L_1 \to Y$, $A \to M \to Y$.
- List all paths from $A$ to $Y$.
- Check whether $\{L_1\}$ satisfies the backdoor criterion. Does it block every backdoor path without conditioning on a descendant of $A$?
- Explain why adding $M$ to the adjustment set violates the backdoor criterion (which part of the causal path does it block?).
Confounding and regression
Regression is one way to condition on measured variables. For example,
$$ Y = \beta_0 + \beta_1A + \beta_2L + \varepsilon $$
Definition: key regression terms
- Intercept ($\beta_0$): expected outcome when $A$ and $L$ are both zero.
- Coefficient ($\beta_1$): expected difference in the outcome per one-unit difference in $A$, holding the other model terms fixed.
- Model fit ($R^2$): proportion of the outcome's variance explained by the fitted model.
High $R^2$ does not imply no confounding. Fit is a statistical property. Confounding is a causal-structure property. A model can fit the observed data very well and still answer the wrong causal question.
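A short derivation shows what goes wrong when a confounder is left out. As a sketch, assume the linear model above is structurally correct and that $\varepsilon$ is uncorrelated with both $A$ and $L$ (assumptions added here for illustration). If we regress $Y$ on $A$ alone, the slope we recover in large samples is

$$ \frac{\operatorname{Cov}(A, Y)}{\operatorname{Var}(A)} = \beta_1 + \beta_2 \frac{\operatorname{Cov}(A, L)}{\operatorname{Var}(A)} $$

The second term is the confounding bias. It vanishes only when $L$ has no effect on $Y$ ($\beta_2 = 0$) or when $L$ is unassociated with $A$ ($\operatorname{Cov}(A, L) = 0$). Nothing in this expression depends on $R^2$.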
Why model fit is misleading for causality
A model can fit very well and still be causally wrong.
If investigators condition on a mediator, they can block part of the target effect.
If investigators condition on a collider, they can open a spurious path.
Neither problem is diagnosed by $R^2$.
Confounding problems that time ordering can resolve
Cross-sectional measurements blur temporal order. Longitudinal design can resolve several ambiguities. A common strategy is:
- Measure confounders at baseline ($t_0$).
- Measure treatment at $t_1$.
- Measure outcome at $t_2$.
Measuring variables in this order rules out two ambiguities at once. The outcome cannot have caused the treatment, because it is measured later. A measured confounder cannot be a consequence of the treatment, because it is measured earlier. Time stamps make the assumed arrow directions checkable rather than merely asserted.
Confounding problems that time ordering alone cannot resolve
Time ordering fixes the direction of arrows. It does not supply missing variables. If a common cause of treatment and outcome is never measured, for example a stable disposition that drives both, then adding waves does not close that backdoor path: repeated measurement of the wrong variables cannot substitute for measuring the right ones. Time ordering also cannot resolve M-bias, where the offending structure is a pre-treatment collider rather than an open backdoor path.
Definition: M-bias
M-bias occurs when investigators condition on a pre-treatment collider.
In the structure $U_1 \to L \leftarrow U_2$, with $U_1 \to A$ and $U_2 \to Y$, conditioning on $L$ opens a previously blocked path between $A$ and $Y$.
M-bias is important because "control for everything" is not a safe rule.
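M-bias is easy to reproduce in simulated data. The sketch below assumes the structure just described, with standard normal noise, invented unit coefficients, and no effect of $A$ on $Y$ at all. The helper functions are hand-rolled for illustration.

```python
import random


def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def slope(y, a):
    """Coefficient on a in the regression y ~ a."""
    return cov(a, y) / cov(a, a)


def slope_adjusting_for(y, a, l):
    """Coefficient on a in the regression y ~ a + l."""
    det = cov(a, a) * cov(l, l) - cov(a, l) ** 2
    return (cov(a, y) * cov(l, l) - cov(l, y) * cov(a, l)) / det


random.seed(3)
n = 20_000

U1 = [random.gauss(0, 1) for _ in range(n)]  # unmeasured cause of A and L
U2 = [random.gauss(0, 1) for _ in range(n)]  # unmeasured cause of Y and L
L = [u1 + u2 + random.gauss(0, 1) for u1, u2 in zip(U1, U2)]  # pre-treatment collider
A = [u1 + random.gauss(0, 1) for u1 in U1]
Y = [u2 + random.gauss(0, 1) for u2 in U2]  # A does not appear: the true effect is zero

naive = slope(Y, A)
adjusted = slope_adjusting_for(Y, A, L)

print(f"Y ~ A:     coefficient on A = {naive:.3f}")
print(f"Y ~ A + L: coefficient on A = {adjusted:.3f}")
```

Under these assumptions the unadjusted coefficient should be close to zero, which is the truth, while the coefficient that adjusts for $L$ should be clearly negative (about -0.2 with these invented variances). Adjustment created the bias.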
Pair exercise: M-bias in practice
- Consider the question "does religious attendance increase charitable giving?" Suppose neighbourhood social capital ($L$) is a common effect (collider) of two unmeasured causes: one that affects attendance ($U_1$) and one that affects giving ($U_2$).
- Draw the DAG with $U_1 \to L \leftarrow U_2$, $U_1 \to A$, and $U_2 \to Y$.
- Trace what conditioning on $L$ does: which path opens?
- State in one sentence why "adjust for all pre-treatment variables" fails here.
Note on mediation assumptions
Mediation analysis needs stronger assumptions than total-effect analysis. One reason is treatment-induced confounding: the treatment $A$ can affect a variable that then confounds the mediator-outcome relation, so no fixed adjustment set both blocks the confounding and leaves the mediated path intact. When this happens, standard regression cannot recover the direct and indirect effects. We do not estimate mediation in this course. The point here is only that splitting a total effect into separate paths carries assumptions beyond those needed for the total effect itself.
Return to the opening example
Back to the exercise programme example. Higher $R^2$ did not answer the causal question, because adding post-treatment body composition changed the estimand. To estimate the total effect of the programme on cardiovascular risk, we need a defended DAG and a valid adjustment set, not just a better-fitting regression. This is why we separate modelling from causal identification.
What to remember for the test
- Confounding is about open non-causal backdoor paths.
- The backdoor criterion tells us when an adjustment set is valid.
- Regression can implement conditioning, but it cannot tell us which variables should be conditioned on.
- Better fit is not the same as better identification.
Confounding is one structural threat to causal identification. Week 4 adds two others: selection bias and measurement bias.
Pair exercise: $R^2$ versus identification
- Investigator A adjusts for age, income, and education ($R^2 = 0.42$). Investigator B adjusts for age and conscientiousness ($R^2 = 0.31$).
- Explain to your partner why higher $R^2$ does not imply less confounding.
- Propose a DAG where Investigator A's larger adjustment set introduces bias (hint: include a collider or mediator).
- State what would need to be true for Investigator B's smaller set to satisfy the backdoor criterion.
Further reading
All open access: Bulbulia (2024); Hernán & Robins (2025, chapter 6).
Lab materials: Lab 3: Regression, Graphing, and Simulation
Bulbulia, J. A. (2024). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35
Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (Draft). https://www.bradyneal.com/Introduction_to_Causal_Inference-Dec17_2020-Neal.pdf
Suzuki, E., Shinozaki, T., & Yamamoto, E. (2020). Causal Diagrams: Pitfalls and Tips. Journal of Epidemiology, 30(4), 153–162. https://doi.org/10.2188/jea.JE20190192
Week 4: Selection Bias and Measurement Bias
Key idea
Confounding, selection bias, and measurement error are three forms of one underlying problem: the data we would need are missing, and what we observe stands in for them imperfectly. Causal diagrams capture much of this shared structure, though some aspects of selection bias and measurement error resist a clean graphical treatment.
Readings
Required
Optional
Key concepts for the test(s)
- Independent/uncorrelated measurement error bias
- Independent/correlated measurement error bias
- Dependent/uncorrelated measurement error bias
- Dependent/correlated measurement error bias
- Collider bias as a distinct mechanism
- Selection bias and transportability
Lab 4 setup
Use Lab 4: Writing Regression Models for this week's practical work. The lab page links the student practice script first and the instructor script second.
Seminar
Motivating example: one study, two failure modes
Suppose investigators recruit participants for a bilingualism study through university mailing lists. Recruitment is selective. People with high academic motivation and strong language confidence are more likely to enrol.
Now add measurement error. Suppose the cognitive task is validated only in English. Then non-English-dominant participants may be mismeasured.
One study can fail in two ways. It can fail through biased sampling. It can fail through biased measurement.
Learning outcomes
By the end of this week, you should be able to:
- Use fork, chain, and collider structures to recognise how bias enters a study.
- Classify four structural types of measurement error.
- Explain how conditioning on selection can bias causal contrasts.
- Distinguish target, source, and analytic populations for transport claims.
The same graph logic still applies
Weeks 2-3 used causal diagrams to find confounders. That can make a DAG look like a tool for one job. It is more general. A small number of graph structures determine which paths carry association and which are blocked, and once you can recognise a fork, a chain, and a collider, you can diagnose bias in almost any study. Selection problems, attrition problems, and measurement problems look harder only because they involve more nodes; the logic is the same.
Five elementary structures
- No causal relation: no arrow connects the variables.
- Direct causation: one variable causes another.
- Fork: one variable is a common cause of two others.
- Chain: one variable sits on the path between two others.
- Collider: one variable is a common effect of two others.
When you see $A \coprod Y$, read this as "A is statistically independent of Y". When you see $A \cancel\coprod Y$, read this as "A is statistically dependent on Y".
Four practical rules
- Condition on common causes or their good proxies when you want to block a non-causal backdoor path.
- Do not condition on mediators when you want the total effect, because you would block part of the causal pathway.
- Do not condition on colliders, because conditioning opens a path that was previously blocked.
- Be careful with descendants and proxies, because conditioning on them can behave like conditioning on their parent.
The new work this week is to find those same structures inside two problems that do not at first look like confounding. Selection bias often appears because study entry, study retention, or analytic restriction acts like a collider, or a descendant of a collider. Measurement bias often appears because the variable we record is a noisy proxy, or a downstream consequence, of the variable we actually care about.
Pair check: what does conditioning do?
- In a fork, $A \leftarrow L \to Y$, what does conditioning on $L$ do?
- In a chain, $A \to M \to Y$, what happens if we condition on $M$ when we want the total effect of $A$ on $Y$?
- In a collider, $A \to C \leftarrow Y$, what happens if we condition on $C$?
You only need one sentence for each answer.
Common causal questions as graphs
Different questions need different graphs, different estimands, and different assumptions, but they share one picture language: the same diagrams represent confounding, selection, measurement, mediation, and transport problems. The effect-modification graph below (the open circle on an arrow marks effect modification, not a standard causal arrow) returns when we distinguish two mechanisms of selection bias.
A typology of measurement error bias
A causal diagram also brings structure to measurement error. Four cases cover most of what investigators meet.
Four structural types of measurement error
Measurement error is classified along two dimensions.
Dimension 1: independent vs dependent
- Independent (undirected): one true variable does not causally affect another variable's measurement error.
- Dependent (directed): one true variable causally affects another variable's measurement error. Epidemiology usually calls this differential measurement error, and the independent case non-differential.
Dimension 2: uncorrelated vs correlated
- Uncorrelated: errors do not share a common cause.
- Correlated: errors share a common cause.
Combinations:
- Independent, uncorrelated (type 1): often attenuates effects toward the null. Example: a self-report anxiety scale adds random noise to the true score. The noise is unrelated to treatment status, so it blurs the signal without creating a false one.
- Independent, correlated (type 2): can create spurious associations even when no causal effect exists. Example: societies with advanced record-keeping produce more precise records of both religious beliefs and social complexity. The shared cause (record-keeping quality) induces a non-causal association between treatment and outcome measures.
- Dependent, uncorrelated (type 3): can open non-causal paths from treatment to measured outcome. Example: participants who receive an intervention report their outcomes more favourably because the treatment itself changes how they interpret survey items. The exposure causally affects measurement of the outcome.
- Dependent, correlated (type 4): can bias in either direction, and the direction is hard to predict analytically. Example: social complexity shapes how historical archives record both religious beliefs and governance structures, and the errors in both records share a common cause in elite patronage of scribes.
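The attenuation in the independent, uncorrelated case can be seen in a short simulation. This sketch assumes classical additive noise in the recorded exposure, an invented true slope of 0.5, and noise with the same variance as the true exposure. It illustrates the first combination only.

```python
import random


def slope(y, x):
    """Coefficient on x in the simple regression y ~ x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


random.seed(4)
n = 20_000

a_true = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * a + random.gauss(0, 1) for a in a_true]

# Recorded exposure = truth + noise unrelated to anything else in the system.
a_recorded = [a + random.gauss(0, 1) for a in a_true]

slope_true = slope(y, a_true)
slope_recorded = slope(y, a_recorded)

print(f"slope using the true exposure:     {slope_true:.2f}")
print(f"slope using the recorded exposure: {slope_recorded:.2f}")
```

With equal signal and noise variance, the recorded-exposure slope should be roughly half the true slope. The estimate shrinks toward zero but does not change sign.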
Pair exercise: classifying measurement error
For each scenario, classify the measurement error using the two dimensions (independent/dependent, correlated/uncorrelated) and name the type number (1-4).
- A self-report screen-time measure adds noise because participants guess rather than track. Social desirability also inflates purpose reports. The errors are unrelated to each other and unrelated to treatment status.
- A cognitive test for bilingualism effects is validated only in English. Non-English-dominant participants are systematically mismeasured. The treatment (bilingualism) causally affects measurement of the outcome.
- A cross-cultural study uses the same translation team for exposure and outcome instruments. Shared translation quality introduces correlated errors in both measures.
For scenario 2, draw a short DAG showing how the treatment ($A$) creates a path through the measurement node to the recorded outcome.
Selection bias and transportability
Selection bias occurs when inclusion in the analytic sample depends on variables related to treatment, outcome, or effect modifiers. Selection bias threatens validity in two structurally distinct ways.
Mechanism 1: collider conditioning (internal validity). When both treatment and outcome affect who enters the sample, selection acts as a collider. Conditioning on it opens a non-causal path between $A$ and $Y$. The estimate is biased for the population it claims to describe.
Mechanism 2: effect modifier imbalance (external validity). Even without confounding, a sample can fail to generalise. Suppose treatment $A$ is randomised, so no backdoor paths are open. If a variable $Z$ modifies the effect of $A$ on $Y$, and $Z$ is distributed differently in the analytic sample than in the target population, the sample ATE does not equal the population ATE. No non-causal path is opened; the internal validity is intact. The problem is that the average treatment effect is a weighted average of subgroup effects, and the weights differ between populations.
The open circle on the arrow from $Z$ to $Y$ denotes effect modification: $Z$ changes the size of $A$'s effect on $Y$. This is not a standard causal arrow. No confounding is present. Yet if $Z$ is distributed differently in the sample than in the target population, the sample ATE does not transport.
This second mechanism does not require a collider. A study of exercise and blood pressure conducted entirely in young adults may correctly estimate the ATE for young adults. If older adults benefit more (effect modification by age), the sample ATE underestimates the population ATE. The design is unconfounded but the conclusion does not transport.
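The weighted-average point can be written out. Take a binary effect modifier $Z$ (younger versus older adults) and write $\tau_z$ for the ATE within stratum $Z = z$ (notation introduced here for illustration):

$$ \text{ATE} = \tau_{\text{young}} \Pr(Z = \text{young}) + \tau_{\text{old}} \Pr(Z = \text{old}) $$

Suppose, purely hypothetically, that exercise changes systolic blood pressure by $-2$ mmHg in younger adults and by $-6$ mmHg in older adults, and that the target population is 60% younger and 40% older. A sample of younger adults estimates $-2$, whereas the target-population ATE is

$$ (-2)(0.6) + (-6)(0.4) = -3.6 $$

Each number is correct for its own population. The gap comes entirely from the weights.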
Transportability asks whether effect-relevant structure is compatible between analytic and target populations. This requires knowing where effect modifiers differ, not just whether the sample is "representative" in some demographic sense.
Target, source, and analytic populations
- Target population: where we want the causal claim to apply.
- Source population: where recruitment occurs.
- Analytic sample: who is actually analysed.
Transportability requires that effect-relevant structure is compatible between analytic and target populations.
Collider bias: a distinct mechanism
Collider bias can feel new because the earlier weeks mostly taught you how to close open backdoor paths. Here the warning runs in the opposite direction: some conditioning decisions create bias rather than remove it.
Why collider bias is not confounding. Confounding arises from an open backdoor path through a common cause: $A \leftarrow L \to Y$. We usually reduce confounding by conditioning on $L$. Collider bias works in the opposite direction. In the structure $A \to C \leftarrow Y$, the path is blocked at first ($A \coprod Y$). Conditioning on $C$ opens a spurious association ($A \cancel\coprod Y \mid C$).
Why collider bias is not identical to selection bias. When collider conditioning happens through sample restriction, it appears as selection bias because the sample is truncated. Berkson's bias is the classic example. But collider bias can also appear when we stratify or adjust for a common effect inside a complete dataset. In that case the problem comes from the analytic decision, not from who entered the sample.
Why the same DAG rules still work. Pearl's d-separation criterion tells us which paths are opened and closed by conditioning. That is why DAGs help with more than confounding. The same framework lets us reason about collider bias, mediator bias, measurement error, and selection problems.
For this course, the practical upshot is simple: never condition on common effects, whether through sample restriction, stratification, or statistical adjustment. Conditioning on a collider opens a non-causal path.
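A simulation shows how restriction on a common effect manufactures an association. The sketch assumes $A$ and $Y$ are generated independently, that $C$ is their sum plus noise, and that only cases with $C$ above an arbitrary threshold enter the analysis. All numbers are invented.

```python
import random


def corr(xs, ys):
    """Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(7)
n = 20_000

A = [random.gauss(0, 1) for _ in range(n)]
Y = [random.gauss(0, 1) for _ in range(n)]                 # independent of A
C = [a + y + random.gauss(0, 1) for a, y in zip(A, Y)]     # common effect of A and Y

# Conditioning by restriction: keep only cases with C above a threshold.
selected = [(a, y) for a, y, c in zip(A, Y, C) if c > 1.0]
A_sel = [a for a, _ in selected]
Y_sel = [y for _, y in selected]

r_full = corr(A, Y)
r_selected = corr(A_sel, Y_sel)

print(f"correlation in the full data:       {r_full:.2f}")
print(f"correlation in the selected sample: {r_selected:.2f}")
```

In the full data the correlation should be near zero. In the selected sample it should be clearly negative, even though neither variable affects the other.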
Pair exercise: collider bias versus confounding
- A hospital study investigates whether depression ($A$) slows recovery ($Y$). Ward admission ($C$) depends on both depression severity and injury severity. Only admitted patients are analysed.
- Draw a DAG with $A \to C \leftarrow Y$ (ward admission as a collider of depression and recovery-related injury severity).
- Explain the non-causal path that opens when the study conditions on $C$ by restricting to admitted patients.
- Your partner argues "this is just confounding by injury severity." Counter by explaining the structural difference: confounding is an open backdoor path through a common cause, whereas collider bias opens a previously blocked path by conditioning on a common effect.
- Propose one design change that avoids this bias.
Attrition and the measurement-error analogy
Right-censoring (attrition) can bias causal estimates through two distinct mechanisms. The first is distortion: if both the exposure and the outcome affect who drops out, then restricting the analysis to the end-of-study sample conditions on a common effect of exposure and outcome. This opens a non-causal path. The bias is an internal validity problem: the estimate is wrong for the population it claims to describe.
The second mechanism is restriction: if effect modifiers are distributed differently among survivors than in the baseline population, the average treatment effect (ATE) estimated from the end-of-study sample may not match the ATE for the target population. No non-causal path is opened, but the sample no longer represents the population of interest. This is an external validity problem; the estimate may be correct for survivors but does not transport.
The structural parallel to measurement error is direct. Distortion through attrition mirrors dependent measurement error (type 3 above): the outcome causally affects what is recorded. Restriction through attrition mirrors independent, uncorrelated measurement error when the true effect is not null (type 1): the signal is diluted because the analytic sample differs from the target in composition. Investigators should diagnose which mechanism is operating, because the remedies differ.
Inverse-probability-of-censoring weights (IPCW) address distortion: each participant still observed at the end of the study is up-weighted by the inverse of their estimated probability of remaining, so the analysed cases stand in for the similar participants who dropped out. The weights remove the bias only when the measured covariates capture why people left. Restriction is handled differently, by reweighting the sample to match the target population's distribution of effect modifiers.
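The weighting logic can be sketched in a deliberately simple setting. We assume that dropout depends only on one measured binary covariate $L$, that $L$ also predicts the outcome, and that the target is a mean outcome rather than a treatment contrast. All numbers are invented, and the probability of remaining is estimated by simple stratum proportions.

```python
import random

random.seed(11)
n = 20_000

rows = []  # each row: (L, Y, stayed)
for _ in range(n):
    l = int(random.random() < 0.5)
    y = 1.0 + 2.0 * l + random.gauss(0, 1)
    p_stay = 0.3 if l == 1 else 0.9       # people with L = 1 drop out more often
    stayed = random.random() < p_stay
    rows.append((l, y, stayed))

# Benchmark that is unavailable in practice: the mean if nobody had dropped out.
full_mean = sum(y for _, y, _ in rows) / n

# Complete-case analysis: use whoever is left.
stayers = [(l, y) for l, y, stayed in rows if stayed]
complete_case_mean = sum(y for _, y in stayers) / len(stayers)

# Estimate the probability of remaining within each level of L.
p_hat = {}
for level in (0, 1):
    in_level = [stayed for l, _, stayed in rows if l == level]
    p_hat[level] = sum(in_level) / len(in_level)

# Weight each remaining participant by the inverse of that probability.
weights = [1.0 / p_hat[l] for l, _ in stayers]
ipcw_mean = sum(w * y for w, (_, y) in zip(weights, stayers)) / sum(weights)

print(f"full-cohort mean:   {full_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"IPCW mean:          {ipcw_mean:.2f}")
```

The complete-case mean should fall below the full-cohort mean because participants with $L = 1$ are under-represented among those who remain. The weighted mean should land close to the full-cohort value. If dropout also depended on something unmeasured, the weights would not repair the estimate.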
WEIRD samples and effect heterogeneity
A WEIRD (Western, Educated, Industrialised, Rich, Democratic) sample is not automatically invalid. The problem is Mechanism 2: if effect modifiers are distributed differently between the analytic and target populations, and treatment effects vary by those modifiers, the sample ATE does not transport. A perfectly unconfounded study in a WEIRD sample can produce a correct estimate for that sample and a wrong estimate for the population of interest.
Preview of Week 10
This week treats measurement from the vantage point of causal inference: what happens when the recorded variable differs from the variable in the causal question? Week 10 focuses on the problems with classical test theory, including measurement invariance and constructed measures. For now, the main point is narrower: if an instrument behaves differently across groups, between-group contrasts can reflect measurement artefact.
Return to the opening example
Back to the bilingualism study. Two design checks are non-negotiable. First, why did these participants enter the analytic sample? Second, do the instruments measure the same constructs across participants? If either check fails, causal interpretation weakens.
With the structural threat landscape mapped (confounding, selection, measurement), Week 5 shows how the three identification assumptions introduced in Week 2 connect a causal question to a population-level causal contrast.
Pair exercise: auditing a study for two failure modes
- Return to the bilingualism example from the start of this lecture.
- Name the selection bias mechanism (what variable is acting as a collider or filter?).
- Name the measurement bias type from the four-type classification (independent/dependent, correlated/uncorrelated).
- Write a two-sentence design critique stating both problems and how each distorts the causal contrast.
Further reading
All open access: Bulbulia (2024c); Bulbulia (2024a); Bulbulia (2024b).
Lab materials: Lab 4: Writing Regression Models
Bulbulia, J. A. (2024a). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35
Bulbulia, J. A. (2024b). Methods in causal inference part 2: Interaction, mediation, and time-varying treatments. Evolutionary Human Sciences, 6, e41. https://doi.org/10.1017/ehs.2024.32
Bulbulia, J. A. (2024c). Methods in causal inference part 3: Measurement error and external validity threats. Evolutionary Human Sciences, 6, e42. https://doi.org/10.1017/ehs.2024.33
Bulbulia, J., & Hine, D. W. (2024). Causal inference in environmental psychology. PsyArXiv. https://osf.io/preprints/psyarxiv/tbjx8
Hernán, M. A. (2017). Invited commentary: Selection bias without colliders. American Journal of Epidemiology, 185(11), 1048–1050. https://doi.org/10.1093/aje/kwx077
Hernán, M. A., & Cole, S. R. (2009). Invited commentary: Causal diagrams and measurement bias. American Journal of Epidemiology, 170(8), 959–962. https://doi.org/10.1093/aje/kwp293
Hernán, M. A., Hernández-Díaz, S., & Robins, J. M. (2004). A structural approach to selection bias. Epidemiology, 15(5), 615–625. https://www.jstor.org/stable/20485961
Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
VanderWeele, T. J., & Hernán, M. A. (2012). Results on differential and dependent measurement error of the exposure and the outcome using signed directed acyclic graphs. American Journal of Epidemiology, 175(12), 1303–1310. https://doi.org/10.1093/aje/kwr458
Week 5: Causal Inference: Average Treatment Effects
Date: 25 Mar 2026
Key idea
Causal inference replaces unobservable counterfactual averages with observed averages. Three identification assumptions, consistency, exchangeability, and positivity, license that substitution. They are design commitments to defend before estimation, not boxes to tick afterwards.
Readings
Required
- Hernán & Robins (2025), chapters 1-3.
- Cashin et al. (2025). TARGET (Transparent Reporting of Observational Studies Emulating a Target Trial) statement.
Optional
- Neal (2020), chapters 1-2.
Key concepts for the test(s)
- The fundamental problem of causal inference
- Average (marginal) treatment effects
- Causal consistency
- Exchangeability
- Positivity
Lab
Use Lab 5: Average Treatment Effects for this week's hands-on work.
Terminology
In these notes, we use "potential outcomes" and "counterfactual outcomes" interchangeably.
Weeks 1 through 4 built a framework for asking causal questions and identifying the structural threats that obstruct answers: confounding, selection bias, and measurement error. Week 2 stated the three identification assumptions. This week shows how those assumptions, together with a well-specified target trial, connect a causal question to an estimable population contrast. Those weeks asked, "where can bias enter?" Week 5 asks, "what causal contrast do we want, and what assumptions let observed data stand in for the missing counterfactuals?"
Seminar
Opening example: one question, two different answers
Observational studies often suggest that students who choose to use a mindfulness app have lower anxiety than students who do not. Randomised trials usually find a smaller and less consistent average benefit once investigators standardise when the intervention begins, what counts as treatment, and when outcomes are measured.
Same substantive question. Different design. Different answer.
This week explains why.
A simple map for this week
This lecture has three moves.
Three moves
- Write the causal contrast we want.
- See why we cannot observe that contrast for one person.
- Use identification assumptions to recover a population average from observed data.
Potential outcomes and DAGs do different jobs. Potential outcomes define the causal contrast. DAGs help us judge whether exchangeability is plausible, which variables belong in $L$, and whether the design has a coherent time zero.
Move 1: state the causal question
Return to the mindfulness example. Let $A=1$ denote starting a guided mindfulness app at the beginning of semester and completing one 10-minute session per day for eight weeks. Let $A=0$ denote not starting the app. Let $Y$ be anxiety symptoms at week 8.
For student $i$, $Y_i(1)$ is the outcome under the app-based intervention, and $Y_i(0)$ the outcome without it. The individual causal effect is:
$$ Y_i(1)-Y_i(0). $$
We never observe both terms for one student at one time.
The fundamental problem of causal inference
The individual causal effect requires two quantities, but we can observe at most one. If student $i$ starts the app ($A_i = 1$), we observe $Y_i(1)$ but not $Y_i(0)$. If the student does not start it ($A_i = 0$), we observe $Y_i(0)$ but not $Y_i(1)$. The unobserved term is the counterfactual.
This is a structural feature of the physical world: a person cannot simultaneously exist under two incompatible conditions. No dataset, however large, contains both potential outcomes for one individual at one time. Individual causal effects are missing by necessity.
Pair exercise: building a potential outcomes table
- Consider four students in the mindfulness example. Construct a table with columns: person ($i$), $Y_i(1)$, $Y_i(0)$, $\delta_i = Y_i(1) - Y_i(0)$, treatment received ($A_i$), and observed outcome ($Y_i^{\mathrm{obs}}$).
- Assign plausible values: let two students receive $A = 1$ and two receive $A = 0$. Make the true individual effects vary (e.g., one positive, one negative, two zero).
- Compute the true ATE from all four $\delta_i$ values.
- Compute the naive difference in means: $\bar{Y}_{A=1}^{\mathrm{obs}} - \bar{Y}_{A=0}^{\mathrm{obs}}$.
- Do the two quantities match? Explain why the discrepancy arises (or why it does not).
Move 2: from individuals to populations
Because individual effects are unobservable, we target a population causal estimand: the quantity we want our analysis to estimate. The average treatment effect (ATE) is:
$$ \text{ATE}=\mathbb{E}[Y(1)-Y(0)]. $$
This is the mean contrast if everyone in the target population received $A=1$ versus $A=0$.
Move 3: connect the causal estimand to observed data
The three identification assumptions are easier to remember if you ask what each one contributes to the argument.
Assumption 1: causal consistency
If $A_i=a$, then $Y_i=Y_i(a)$. The observed outcome equals the potential outcome corresponding to the treatment actually received.
Data scientists estimate parameters for observed data. Causal inference goes further: we estimate contrasts involving counterfactual parameters. We compute the average response when the entire target population is exposed, then when the entire population is unexposed, then contrast these averages. Consistency is what allows us to bridge from counterfactual to observed. Without it, potential outcomes remain purely theoretical.
The general switching equation expresses the observed outcome as a function of treatment and both potential outcomes:
$$ Y_i^{obs} = A_i \cdot Y_i(1) + (1 - A_i) \cdot Y_i(0). $$
Each person carries two potential outcomes, but we observe only the one selected by their actual treatment. For treated individuals ($A_i = 1$), the switching equation reduces to:
$$ Y_i^{obs} = 1 \cdot Y_i(1) + 0 \cdot Y_i(0) = Y_i(1). $$
For untreated individuals ($A_i = 0$):
$$ Y_i^{obs} = 0 \cdot Y_i(1) + 1 \cdot Y_i(0) = Y_i(0). $$
In short:
$$ Y_i = Y_i(1) \quad \text{if } A_i = 1; \qquad Y_i = Y_i(0) \quad \text{if } A_i = 0. $$
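The switching equation can be sketched in a few lines of R. The five students and their values below are invented for illustration (they are assumptions, not data). Both potential outcomes are visible only because we made them up; a real dataset would contain just `a` and `y_obs`.

```r
# hypothetical potential outcomes for five students (illustrative values only)
d <- data.frame(
  id = 1:5,
  y1 = c(10, 12, 8, 15, 11),  # anxiety if the student starts the app, Y_i(1)
  y0 = c(13, 12, 9, 14, 15),  # anxiety if the student does not, Y_i(0)
  a  = c(1, 0, 1, 0, 1)       # treatment actually received, A_i
)

# switching equation: the observed outcome is the potential outcome
# selected by the treatment actually received
d$y_obs <- d$a * d$y1 + (1 - d$a) * d$y0

# individual effects are computable here only because we invented both columns
d$delta <- d$y1 - d$y0

# what an analyst would actually see: one outcome per student
observed <- d[, c("id", "a", "y_obs")]
print(observed)
```

The `observed` table has no column from which `delta` could be reconstructed, which is the fundamental problem of causal inference in miniature.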
Consistency subsumes two conditions that are sometimes stated separately. No interference requires that one person's treatment does not affect another person's outcome. Treatment-version irrelevance requires that any versions of a treatment level grouped under one label lead to the same outcome. Both are needed for $Y(a)$ to be well defined: if versions differ in their effects, or if interference exists, the label $a$ does not pick out a single potential outcome. Consistency fails when treatment versions are mixed under one label. If "mindfulness practice" includes different apps, dosages, and start dates under the same label, $Y(1)$ does not refer to one intervention.
Assumption 2: exchangeability
Within levels of the measured covariates $L$, treatment is independent of the potential outcomes: once we know a person's $L$, their treatment status carries no further information about how they would respond under either exposure (Chatton et al. (2020); Hernán & Robins (2025)).
In a randomised trial, exchangeability holds unconditionally:
$$ Y(a) \coprod A. $$
In observational data, we require conditional exchangeability. For each $a$:
$$ Y(a) \coprod A \mid L, $$
where $L$ is the set of covariates sufficient to ensure the independence of the counterfactual outcomes and the exposure. Equivalently, $A \coprod Y(a) \mid L$. When this condition holds, counterfactual outcomes are independent of actual exposures received, conditional on $L$.
Conditional exchangeability is often called the no-unmeasured-confounding assumption. It cannot be verified from observed data; it can only be defended by subject-matter knowledge and a plausible DAG.
Assumption 3: positivity
The probability of receiving every value of the exposure within all strata of covariates is greater than zero (Hernán & Robins (2025); Westreich & Cole (2010)):
$$ 0 < P(A=a \mid L=l) < 1, \quad \forall\, a \in \mathcal{A},\; \forall\, l \in \mathcal{L}. $$
There are two types of positivity violation.
Random non-positivity occurs when the sample data do not contain all levels of exposure within strata for which exposures are defined. For example, if no participants aged 22–24 received treatment, investigators must extrapolate from other ages. Random non-positivity can be addressed by modelling assumptions, but those assumptions carry their own risks.
Deterministic non-positivity occurs when it is scientifically impossible for certain strata to receive specific levels of exposure. For example, biological males cannot receive a hysterectomy. Deterministic violations require restricting the analysis to strata in which every exposure level is possible.
Positivity is the one identification assumption we can partially check empirically. The propensity score is the conditional probability of receiving treatment given covariates, $P(A=1 \mid L)$. Plot propensity score distributions and look for gaps or near-zero densities.
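A minimal sketch of such an overlap check in base R, on simulated data. The covariate `l` (standardised baseline distress), the treatment mechanism, and the 0.05/0.95 cut-offs are all illustrative assumptions, not part of the course data or a fixed rule.

```r
set.seed(2026)
n <- 2000

# hypothetical baseline covariate L: standardised baseline distress
l <- rnorm(n)

# assumed treatment mechanism: more distressed students are more likely to start the app
a <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * l))

# estimate the propensity score P(A = 1 | L) with logistic regression
ps_model <- glm(a ~ l, family = binomial())
ps <- fitted(ps_model)

# range of estimated propensity scores within each treatment group
ps_range <- tapply(ps, a, range)
print(ps_range)

# counts of treated and untreated students in propensity-score bins;
# bins with a zero or very small count on one side signal limited overlap
ps_bins <- table(bin = cut(ps, breaks = seq(0, 1, by = 0.1)), a = a)
print(ps_bins)

# share of students with estimated scores near 0 or 1
# (the 0.05 and 0.95 cut-offs are illustrative, not a rule)
prop_extreme <- mean(ps < 0.05 | ps > 0.95)
prop_extreme
```

The binned table is a numeric stand-in for overlaid histograms; calling `hist()` on `ps[a == 1]` and `ps[a == 0]` would give the graphical version. A clean result here only speaks to the covariates in the model, so it is a partial check at best.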
What each assumption buys us
- Consistency links one observed outcome to one potential outcome. It is what connects counterfactual quantities to data.
- Exchangeability lets observed outcomes in one group stand in for the missing counterfactual outcomes in the other.
- Positivity ensures that the needed comparison exists in every relevant subgroup.
How the assumptions recover population contrasts
Start with the easiest case: a randomised trial, where exchangeability holds without conditioning. The assumptions work in sequence:
$$ \begin{aligned} \underbrace{\mathbb{E}[Y(1)]}_{\textcolor{blue}{\text{everyone treated}}} &\overset{\text{exchangeability}}{=} \underbrace{\mathbb{E}[Y(1)\mid A=1]}_{\textcolor{blue}{\text{treated arm}}} \overset{\text{consistency}}{=} \underbrace{\mathbb{E}[Y \mid A=1]}_{\textcolor{teal}{\text{observed treated mean}}}, \newline \underbrace{\mathbb{E}[Y(0)]}_{\textcolor{red}{\text{everyone untreated}}} &\overset{\text{exchangeability}}{=} \underbrace{\mathbb{E}[Y(0)\mid A=0]}_{\textcolor{red}{\text{control arm}}} \overset{\text{consistency}}{=} \underbrace{\mathbb{E}[Y \mid A=0]}_{\textcolor{orange}{\text{observed untreated mean}}}. \end{aligned} $$
So the ATE becomes
$$ \begin{aligned} \text{ATE} &= \underbrace{\mathbb{E}[Y(1)]}_{\textcolor{blue}{\text{counterfactual treated mean}}} {}- \underbrace{\mathbb{E}[Y(0)]}_{\textcolor{red}{\text{counterfactual untreated mean}}} \newline &= \underbrace{\mathbb{E}[Y \mid A=1]}_{\textcolor{teal}{\text{observed treated mean}}} {}- \underbrace{\mathbb{E}[Y \mid A=0]}_{\textcolor{orange}{\text{observed untreated mean}}}. \end{aligned} $$
This is the key identification move. We replace missing counterfactual averages with observed group averages.
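A small simulation can make the substitution concrete. The sketch below invents both potential outcomes for a large hypothetical population, then compares the difference in observed means under two assignment mechanisms. The variable `u` (baseline motivation), the effect size of $-1$, and the selection model are assumptions chosen for illustration.

```r
set.seed(2026)
n <- 100000

# hypothetical unmeasured trait: baseline motivation
u <- rnorm(n)

# potential outcomes (anxiety at week 8); the true individual effect is -1 for everyone
y0 <- 10 - 2 * u + rnorm(n)
y1 <- y0 - 1
true_ate <- mean(y1 - y0)

# design 1: randomised assignment, so A is independent of Y(1) and Y(0)
a_rand <- rbinom(n, 1, 0.5)
y_rand <- a_rand * y1 + (1 - a_rand) * y0   # consistency (switching equation)
diff_rand <- mean(y_rand[a_rand == 1]) - mean(y_rand[a_rand == 0])

# design 2: self-selection, motivated students are more likely to start the app
a_self <- rbinom(n, 1, plogis(u))
y_self <- a_self * y1 + (1 - a_self) * y0
diff_self <- mean(y_self[a_self == 1]) - mean(y_self[a_self == 0])

round(c(true_ate = true_ate, randomised = diff_rand, self_selected = diff_self), 2)
```

Under randomisation the observed contrast should sit close to the true ATE. Under self-selection it should look noticeably more favourable to the app, because motivated students have lower anxiety with or without it, so exchangeability fails.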
In observational data, the same logic works only after conditioning on a sufficient set $L$. Positivity then ensures that each relevant stratum contains both treated and untreated individuals, so those adjusted comparisons are estimable.
Pair exercise: tracing the identification logic
- Your partner claims "students who chose the mindfulness app had lower anxiety, therefore the app works."
- Walk through each identification assumption in turn. Where does the reasoning break?
- Check consistency: were all students labelled $A = 1$ receiving the same intervention?
- Check exchangeability: is $Y(a) \coprod A$, or could the students who chose the app differ systematically from those who did not?
- Check positivity: are there covariate strata where no students used (or declined) the app?
- State which assumption is most plausibly violated and why.
The observational-data version
Assume consistency, exchangeability given $L$, and positivity. Then:
$$ \mathbb{E}[Y(a)] = \sum_l \mathbb{E}[Y \mid A=a, L=l]P(L=l). $$
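The identity can be unpacked in three steps, each licensed by one ingredient (a sketch for discrete $L$):

$$ \begin{aligned} \mathbb{E}[Y(a)] &= \sum_l \mathbb{E}[Y(a) \mid L=l]\,P(L=l) && \text{(law of total expectation)} \newline &= \sum_l \mathbb{E}[Y(a) \mid A=a, L=l]\,P(L=l) && \text{(conditional exchangeability)} \newline &= \sum_l \mathbb{E}[Y \mid A=a, L=l]\,P(L=l) && \text{(consistency)}. \end{aligned} $$

Positivity guarantees that $\mathbb{E}[Y \mid A=a, L=l]$ is defined in every stratum with $P(L=l) > 0$, so no term in the sum conditions on an empty cell.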
So the ATE is identified by standardisation:
$$ \text{ATE}=\sum_l \underbrace{\Big(\mathbb{E}[Y \mid A=1,L=l]-\mathbb{E}[Y \mid A=0,L=l]\Big)}_{\textcolor{teal}{\text{within-stratum observed contrast}}} \underbrace{P(L=l)}_{\textcolor{blue}{\text{stratum weight}}}. $$
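The standardisation formula can be sketched in base R with one binary covariate. Everything in the simulation is an assumption made for illustration: `l` stands for high baseline distress, uptake depends on `l`, and the true effect of the app is $-1$ in both strata.

```r
set.seed(2026)
n <- 100000

# hypothetical binary confounder: high baseline distress (1) versus low (0)
l <- rbinom(n, 1, 0.4)

# treatment uptake depends on L, so positivity holds but treatment is not randomised
a <- rbinom(n, 1, ifelse(l == 1, 0.7, 0.2))

# outcome: L raises anxiety by 4 points; the app lowers it by 1 point in both strata
y <- 10 + 4 * l - 1 * a + rnorm(n)

# naive contrast ignores L
naive <- mean(y[a == 1]) - mean(y[a == 0])

# within-stratum observed contrasts, E[Y | A=1, L=l] - E[Y | A=0, L=l]
contrast_l <- sapply(c(0, 1), function(lv) {
  mean(y[a == 1 & l == lv]) - mean(y[a == 0 & l == lv])
})

# stratum weights, P(L = l)
weight_l <- c(mean(l == 0), mean(l == 1))

# standardised estimate of the ATE
ate_std <- sum(contrast_l * weight_l)

round(c(naive = naive, standardised = ate_std), 2)
```

In this set-up the naive contrast should have the wrong sign, because distressed students are both more likely to use the app and more anxious at week 8. The standardised estimate should land near the simulated effect of $-1$, but only because `l` is, by construction, a sufficient adjustment set.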
What we can check, and what we cannot
Positivity is the only assumption we can directly inspect with data. If some covariate strata contain no treated (or no untreated) individuals, the contrast for those strata relies on model extrapolation rather than observed comparisons.
Consistency requires that "treatment" means the same thing for everyone labelled $A = 1$. In the mindfulness example, beginning a guided app at semester start, trying one unguided breathing exercise in week 5, and attending a group class irregularly are different interventions. A well-specified target trial defines treatment precisely enough that consistency is defensible.
Exchangeability cannot be verified from observed data. We can check whether measured covariates are balanced after adjustment, but we cannot test whether unmeasured common causes remain. This is why the DAG matters: it forces investigators to state which variables they believe are sufficient and to defend that belief with subject-matter knowledge.
Design and subject-matter knowledge are not optional extras. They are what makes identification assumptions assessable.
Quick diagnostic
- If treatment is vague, worry about consistency.
- If treated and untreated people differ in causes of the outcome, worry about exchangeability.
- If one treatment level barely occurs in some subgroup, worry about positivity.
Pair exercise: designing a target trial
- Your intervention is daily meditation (20 minutes). Your outcome is anxiety symptoms at 6 months. Your target population is university students.
- State the causal estimand precisely: what are the two contrast conditions?
- Define time zero (when does follow-up begin?).
- Name two baseline covariates you would adjust for, and give a causal rationale for each (draw a short DAG if it helps).
- Describe one plausible positivity failure in this setting (a subgroup where one side of the contrast is effectively empty).
The causal workflow
The identification assumptions are not items on a checklist to be ticked off after analysis. They are design commitments that must be defended before estimation begins. The following workflow organises these commitments into a sequence. Each step depends on the ones before it. Also see the course causal workflow reference page.
Step 0: define the target population. Say exactly who the answer is meant to inform before choosing the exposure or outcomes. "University students" may be too broad if the intervention is a guided app that only some students could realistically use. The population choice shapes which treatment versions are coherent, which outcomes matter, and where positivity may fail.
Step 1: state a well-defined treatment. Specify the hypothetical intervention precisely enough that every member of the target population could, in principle, receive it. "Mindfulness" is too vague: people meditate with different apps, in groups, alone, once, or every day. A clearer intervention is: "start a guided mindfulness app at the beginning of semester and complete one 10-minute session per day for eight weeks." Precision here underwrites consistency and makes the timeline visible.
Step 2: establish time zero. Define the point at which treatment assignment begins and follow-up starts. We cannot do this until the treatment is specified, because time zero is the moment that intervention becomes assigned or initiated. In the mindfulness example, time zero is the beginning of semester when students start the app or are assigned not to start it. Without a clear time zero, consistency is undermined and exchangeability is hard to assess.
Step 3: state a well-defined outcome. Define the outcome so the causal contrast is meaningful and temporally anchored. "Sense of Purpose" is underspecified; "psychological distress one year post-intervention measured with the Kessler-6" is interpretable and reproducible. Include timing, scale, and instrument.
Step 4: evaluate exchangeability. Make the case that potential outcomes are independent of treatment conditional on covariates: $Y(a) \coprod A \mid L$ (Hernán & Robins (2025)). Use design and diagnostics: DAGs, subject-matter arguments, pre-treatment covariate balance, and overlap checks. If exchangeability is doubtful, redesign rather than rely solely on modelling.
Step 5: ensure causal consistency. Consistency requires that, for units receiving a treatment version compatible with level $a$, the observed outcome equals $Y(a)$. It also presumes well-defined versions and no interference between units (VanderWeele & Hernan (2013); Hernán & Robins (2025)). When multiple versions exist, either refine the intervention so versions are irrelevant to $Y(a)$, or condition on version-defining covariates.
Step 6: check positivity. Each treatment level must occur with non-zero probability at every covariate profile needed for exchangeability (Westreich & Cole (2010)). Diagnose limited overlap using propensity score distributions and extreme weights. Consider design-stage remedies (trimming, restriction) before estimation.
Step 7: ensure measurement aligns with the scientific question. Be explicit about probable forms of measurement error (classical, Berkson, differential, misclassification) and their structural implications for bias (Hernán & Robins (2025); Bulbulia (2024)). Where feasible, incorporate validation studies or calibration models.
Step 8: preserve representativeness from start to finish. Differential attrition, non-response, or measurement processes tied to treatment and outcomes can induce selection bias in the presence of true effects (Hernán (2017); Bulbulia (2024)). Plan strategies such as inverse probability weighting for censoring, multiple imputation under defensible mechanisms, and sensitivity analyses for data missing not at random.
Step 9: document the reasoning that supports steps 0–8. Make assumptions, disagreements, and judgement calls legible. Register or time-stamp the analytic plan. Include identification arguments, code, and data where possible. Report robustness and sensitivity analyses. Transparent reasoning is a scientific result in its own right (Ogburn & Shpitser (2021)).
A note on reporting standards
When a study that emulates a target trial is published, journals in epidemiology and medicine increasingly expect a thorough, standardised account of its design and assumptions. The TARGET statement (Transparent Reporting of Observational Studies Emulating a Target Trial; Cashin et al. (2025)) is one such reporting format. It asks authors to state, point by point, the target population, the treatment strategies, time zero, the outcomes, the identification assumptions, and the sensitivity analyses. Each of those points corresponds to a step of the causal workflow above.
Read the TARGET statement once, to see the level of detail a professional causal analysis is held to. You are not asked to apply its full checklist in this course. For the final assessment, Option A reports follow the course research-report template and reporting guide, which ask for the same reasoning in a shorter, course-specific form.
Return to the opening example
The mindfulness discrepancy illustrates what happens when investigators fail to emulate a target trial. If "users" are defined as students who have already adopted the app, the design mixes recent starters with persistent users who may differ in motivation, baseline distress, and help-seeking. That makes the exposed group look healthier than a true start-of-intervention comparison would justify.
The lesson is that design comes before estimation. If the hypothetical trial is not specified, the identifying assumptions are hard to interpret and even harder to defend. We now have the tools to identify and estimate an average causal contrast for a defined population. Week 6 asks the next question: does that contrast vary across subgroups?
Lab materials: Lab 5: Average Treatment Effects
Appendix A: notation variants
Equivalent notations for the individual contrast include
$$ Y_i^{1} - Y_i^{0} $$
and
$$ Y_i(a=1) - Y_i(a=0). $$
Bulbulia, J. A. (2024). Methods in causal inference part 3: Measurement error and external validity threats. Evolutionary Human Sciences, 6, e42. https://doi.org/10.1017/ehs.2024.33
Cashin, A. G., Hansford, H. J., Hernán, M. A., Swanson, S. A., Lee, H., Jones, M. D., Dahabreh, I. J., Dickerman, B. A., Egger, M., Garcia-Albeniz, X., et al. (2025). Transparent reporting of observational studies emulating a target trial—the TARGET statement. JAMA, 334(12), 1084–1093. https://doi.org/10.1001/jama.2025.13350
Chatton, A., Le Borgne, F., Leyrat, C., Gillaizeau, F., Rousseau, C., Barbin, L., Laplaud, D., Léger, M., Giraudeau, B., & Foucher, Y. (2020). G-computation, propensity score-based methods, and targeted maximum likelihood estimator for causal inference with different covariates sets: a comparative simulation study. Scientific Reports, 10(1), 9219. https://doi.org/10.1038/s41598-020-65917-x
Hernán, M. A. (2017). Invited commentary: Selection bias without colliders. American Journal of Epidemiology, 185(11), 1048–1050. https://doi.org/10.1093/aje/kwx077
Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (Draft). https://www.bradyneal.com/Introduction_to_Causal_Inference-Dec17_2020-Neal.pdf
Ogburn, E. L., & Shpitser, I. (2021). Causal modelling: The two cultures. Observational Studies, 7(1), 179–183. https://doi.org/10.1353/obs.2021.0006
VanderWeele, T. J., & Hernan, M. A. (2013). Causal inference under multiple versions of treatment. Journal of Causal Inference, 1(1), 1–20.
Westreich, D., & Cole, S. R. (2010). Invited commentary: positivity in practice. American Journal of Epidemiology, 171(6). https://doi.org/10.1093/aje/kwp436
Week 6: Effect Modification and CATE
Date: 1 Apr 2026
Key idea
Interaction, effect modification, a regression product term, and the conditional average treatment effect (CATE) are four distinct things. Keeping them apart is what makes a claim about who benefits more defensible.
Required reading
Optional reading
Key concepts for assessment
- Causal estimand versus statistical estimand
- Interaction (joint interventions)
- Effect modification (one intervention, subgroup contrasts)
- CATE $\tau(x)$ and estimated CATE $\hat{\tau}(x)$
- Why statistical interaction terms do not automatically imply causal effect modification
Week 5 defined the average treatment effect (ATE) and the assumptions required to estimate it from a well-defined intervention at a clear time zero. An average, though, can hide meaningful variation. This week extends the framework from "does the intervention work on average?" to "for whom does it work more, or less?"
The main difficulty this week is vocabulary. Psychology often uses "interaction", "moderation", "heterogeneity", and "personalised effects" as if they were interchangeable. They are not.
Seminar
Motivating example
A randomised exercise programme lowers blood pressure by 3 mmHg on average. That average can hide meaningful variation. Some participants improve a lot, while others barely change. If we only report the ATE, we can miss the information needed for treatment and policy decisions.
A simple map for this week
Keep these four ideas separate from the start.
Four ideas to keep separate
- Interaction: the joint effect of two interventions.
- Effect modification: variation in the effect of one intervention across subgroups.
- Regression product term: a feature of a statistical model, such as $A \times G$.
- CATE: the subgroup-level causal contrast, $\tau(x)$.
Rule of thumb
If you cannot write the estimand as $\mathbb{E}[Y(1) - Y(0) \mid X = x]$ for baseline $X$ measured at time zero, you are not estimating a CATE.
First distinction: interaction versus effect modification
Start with the scientific question, not the software output. If the design involves one intervention and subgroup contrasts, the question is about effect modification. If the design involves two interventions taken together, the question is about interaction.
Interaction
Interaction concerns two interventions, not one. Let $A$ and $B$ be interventions and let $Y$ be the outcome. On the additive scale, interaction is
$$ \mathbb{E}[Y(1,1)] - \mathbb{E}[Y(1,0)] - \mathbb{E}[Y(0,1)] + \mathbb{E}[Y(0,0)]. $$
If this contrast is non-zero, the joint effect is not additive on this scale.
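As a worked instance, the sketch below plugs four invented counterfactual means into the additive interaction contrast. The values are hypothetical and chosen only so the arithmetic is easy to follow.

```r
# hypothetical mean outcomes under the four joint interventions (illustrative values)
# names follow E[Y(a, b)]
ey <- c("11" = 12, "10" = 5, "01" = 4, "00" = 0)

# additive interaction contrast:
# E[Y(1,1)] - E[Y(1,0)] - E[Y(0,1)] + E[Y(0,0)]
interaction_additive <- ey[["11"]] - ey[["10"]] - ey[["01"]] + ey[["00"]]

# separate effects of each intervention relative to neither
effect_a_alone <- ey[["10"]] - ey[["00"]]
effect_b_alone <- ey[["01"]] - ey[["00"]]
joint_effect   <- ey[["11"]] - ey[["00"]]

c(interaction = interaction_additive,
  joint = joint_effect,
  sum_of_separate = effect_a_alone + effect_b_alone)
```

The interaction contrast equals the joint effect minus the sum of the two separate effects, so a non-zero value means the interventions together do more (or less) than their separate effects added up.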
Effect modification
Effect modification concerns one intervention across subgroups. For a subgroup variable $G$, effect modification exists when
$$ \mathbb{E}[Y(1) - Y(0) \mid G = g_1] \neq \mathbb{E}[Y(1) - Y(0) \mid G = g_2]. $$
This is still the effect of $A$ on $Y$. It is not the causal effect of $G$ on $Y$.
Scale matters
Interaction and effect-modification claims are scale-specific. A difference-scale result need not match a ratio-scale result.
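A short numerical sketch shows how the same counterfactual risks can display effect modification on one scale and none on the other. The subgroup risks below are invented for illustration.

```r
# hypothetical risks of the outcome under each exposure, by subgroup (illustrative values)
risk <- data.frame(
  g  = c("g1", "g2"),
  r0 = c(0.10, 0.30),  # E[Y(0) | G = g]
  r1 = c(0.20, 0.60)   # E[Y(1) | G = g]
)

# difference scale: E[Y(1) | G] - E[Y(0) | G]
risk$rd <- risk$r1 - risk$r0

# ratio scale: E[Y(1) | G] / E[Y(0) | G]
risk$rr <- risk$r1 / risk$r0

print(risk)
```

Here the risk difference varies across subgroups while the risk ratio does not, so a claim that "$G$ modifies the effect of $A$" is incomplete until the scale is named.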
Second distinction: causal modification versus model terms
This is where many psychology papers go wrong.
A regression interaction term ($A\times G$) is a model parameter. Causal effect modification is a property of potential-outcome contrasts under identification assumptions. A model term can be non-zero because of misspecification or bias, so it is not causal evidence by itself.
Pair exercise: interaction versus effect modification
- A study reports a "significant exercise-by-age interaction" in a regression of blood pressure on exercise, age, and their product term.
- State the causal estimand for interaction (hint: it requires four potential outcomes under joint interventions on exercise and age, which is conceptually odd because we cannot intervene on age).
- State the causal estimand for effect modification (hint: it involves one intervention on exercise, with subgroup contrasts across age groups).
- Which concept, interaction or effect modification, matches the study design? Give a reason the regression interaction term could be non-zero without any causal modification (e.g., model misspecification or collider bias).
CATE as the operational target
For a measured baseline profile $X=x$ defined at time zero,
$$ \tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]. $$
$\tau(x)$ is a subgroup average causal contrast. For person $i$, $\hat{\tau}(X_i)$ is an estimate of that subgroup contrast. $\hat{\tau}(X_i)$ is not the unobservable individual contrast $Y_i(1)-Y_i(0)$.
Personalised effects versus true individual effects
Students sometimes read $\hat{\tau}(X_i)$ as the effect of treatment on person $i$. It estimates something coarser. Person $i$'s true effect, $Y_i(1) - Y_i(0)$, requires both potential outcomes, and we observe at most one. What $\hat{\tau}(X_i)$ estimates is the average effect across all people who share person $i$'s measured profile $X_i$. The estimate is "personalised" in the sense that it uses person $i$'s covariates, but it remains a subgroup average. Two people with identical $X_i$ can have different true effects if they differ on unmeasured variables.
When the literature refers to "individualised treatment effects," the intended meaning is almost always $\hat{\tau}(X_i)$: an estimated subgroup average, not the unknowable individual contrast.
Identification reminders
Week 4's graph logic still matters here. Week 5's design logic still matters too: the treatment must remain well-defined, covariates must precede treatment, and subgroup contrasts are causal only if the same identification conditions still hold. Effect-modification questions are still causal questions, so confounding does not disappear just because we are now interested in subgroup differences.
For interaction with two interventions, we need identification of the joint intervention contrast. A common condition is conditional exchangeability for joint treatment assignment:
$$ Y(a, b) \coprod (A, B) \mid L. $$
Here $L$ must block all relevant backdoor paths from $A$ and $B$ to $Y$.
For effect modification of $A$ by $G$, we still need valid control of confounding for the $A \to Y$ relation, typically within strata of $G$.
For a larger handout version of these effect-modification graphs, see Effect modification using causal graphs.
Effect modification by proxy
A variable can modify the treatment effect without directly causing the outcome. In the graph below, $Z$ is the direct effect modifier (open circle: it changes the size of $A$'s effect on $Y$). $G$ inherits this modification through its association with $Z$.
Whether $G$ remains an effect modifier depends on what else is in the model. If investigators condition on $Z$, then $G$ becomes independent of $Y$ and is no longer an effect modifier. Effect modification is relative to the adjustment set, not an intrinsic property of $G$ (VanderWeele & Robins (2007); VanderWeele (2012)).
d-separation does not imply absence of effect modification
The graph below poses a subtler problem. To identify the effect of $A$ on $Y$, we condition on $L$. But $G$ causes $L$, and conditioning on $L$ d-separates $G$ from $Y$. Does this mean $G$ is not an effect modifier?
No. Even when $G \coprod Y \mid L$, the CATE for a given level of $G$ is a weighted average of the $L$-specific treatment effects, where the weights come from the distribution of $L$ given $G$:
$$ \tau(g) = \mathbb{E}\left[\mathbb{E}[Y(1) - Y(0) \mid L] \middle| G = g\right]. $$
Two conditions are sufficient for effect modification by $G$. First, the effect of $A$ on $Y$ varies across levels of $L$. Second, the distribution of $L$ differs across levels of $G$ (which it does, because $G \to L$). When both hold, $\tau(g)$ varies with $g$. Effect modification by $G$ is present even though $G$ has no direct structural path to $Y$ after conditioning.
A concrete instance shows this. Suppose $L$ has two levels. The effect of $A$ on $Y$ is $+2$ when $L$ is low and $+6$ when $L$ is high, so the effect varies across $L$. Among younger people, $80\%$ are low-$L$ and $20\%$ high-$L$, giving $\tau(\text{young}) = 0.8(2) + 0.2(6) = 2.8$. Among older people, $30\%$ are low-$L$ and $70\%$ high-$L$, giving $\tau(\text{old}) = 0.3(2) + 0.7(6) = 4.8$. The CATE differs by age even though age has no direct arrow to $Y$: age changes the mix of $L$, and $L$ is where the effect varies.
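The same instance can be checked by simulation. The sketch below assumes randomised treatment, an age group `g` that affects only the mix of `l`, and the effect sizes given above; the baseline outcome model and sample size are additional illustrative assumptions.

```r
set.seed(2026)
n <- 200000

# G: age group (0 = younger, 1 = older); G affects L but has no direct path to Y
g <- rbinom(n, 1, 0.5)

# L: high (1) versus low (0); P(L = high) is 0.2 among younger and 0.7 among older
l <- rbinom(n, 1, ifelse(g == 1, 0.7, 0.2))

# randomised treatment, so within-group differences in means are unconfounded
a <- rbinom(n, 1, 0.5)

# the effect of A is +2 when L is low and +6 when L is high
y <- 1 + 3 * l + ifelse(l == 1, 6, 2) * a + rnorm(n)

# analytic CATE by age: weighted average of the L-specific effects
tau_l <- c(low = 2, high = 6)
tau_young <- sum(c(0.8, 0.2) * tau_l)
tau_old   <- sum(c(0.3, 0.7) * tau_l)

# simulated CATE by age: difference in means within each age group
cate_hat <- sapply(c(young = 0, old = 1), function(gv) {
  mean(y[a == 1 & g == gv]) - mean(y[a == 0 & g == gv])
})

round(rbind(analytic = c(tau_young, tau_old), simulated = cate_hat), 2)
```

The simulated subgroup contrasts should track the analytic weighted averages, even though `g` never appears in the outcome equation.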
This result has practical consequences. Investigators who equate d-separation with absence of effect modification will miss genuine heterogeneity. A non-significant regression interaction term between $A$ and $G$, after adjusting for $L$, does not prove that $G$ is irrelevant to treatment targeting. The CATE can still vary by $G$ because $G$ shifts the covariate distribution over which the $L$-specific effects are averaged.
Two rules of thumb
- A variable can modify the treatment effect even if it has no direct arrow to the outcome in the adjusted DAG.
- Whether a variable is an effect modifier depends on what other variables are in the conditioning set. Effect modification is relative, not absolute.
Pair exercise: why conditioning changes effect modification
- An exercise programme ($A$) targets blood pressure ($Y$). Age ($G$) affects fitness ($L$), and $L$ affects $Y$. There is no direct $G \to Y$ path.
- Explain to your partner why the CATE still varies by age, even without a direct $G \to Y$ arrow (hint: the distribution of $L$ differs across age groups).
- A colleague fits a regression with an $A \times G$ interaction term and finds it non-significant. They conclude "age does not modify the treatment effect." Evaluate this conclusion.
- Describe a scenario where two apparent effect modifiers ($G_1$ and $G_2$) both show significant CATE variation individually, but the variation disappears when you condition on both simultaneously.
Why flexible estimators matter
With many covariates, hand-built interaction models are fragile for four reasons. First, the number of possible interactions grows combinatorially: $k$ covariates generate $\binom{k}{2}$ pairwise interactions and far more higher-order terms. Second, each interaction subgroup contains fewer observations, so estimates become noisy. Third, searching across many interactions inflates false-positive rates unless corrected. Fourth, the analyst must specify the functional form in advance, and real treatment-response surfaces are rarely linear.
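The combinatorial growth is easy to tabulate. The sketch below counts pairwise interactions and interactions of every order for a few illustrative values of $k$; the count of all orders, $2^k - k - 1$, is the number of covariate subsets with at least two members.

```r
# number of candidate covariates
k <- c(5, 10, 20, 50)

# pairwise interactions among covariates: choose(k, 2)
pairwise <- choose(k, 2)

# interactions of every order (two-way and above): 2^k - k - 1
all_orders <- 2^k - k - 1

data.frame(k = k, pairwise = pairwise, all_orders = all_orders)
```

Even a modest covariate set produces more candidate interaction terms than most samples can support, which is the practical motivation for the estimators discussed next.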
Flexible estimators such as causal forests learn the heterogeneity surface from data. They can recover non-linear and high-dimensional patterns without requiring the analyst to guess the correct specification. These estimators help with functional form, but they do not remove confounding by design.
Demo: functional form matters
This simulation randomises treatment, so there is no confounding. The only challenge is functional form.
```r
# install once
# install.packages("pak")
# pak::pak("go-bayes/causalworkshop")
library(causalworkshop)

# simulate data with non-linear heterogeneous effects
d <- simulate_nonlinear_data(n = 2000, seed = 2026)

# compare four estimation methods
results <- compare_ate_methods(d)

# summary table: ATE and individual-level RMSE
results$summary

# plot: estimated vs true treatment effects
results$plot_comparison

# plot: estimated effect as a function of x1
results$plot_by_x1
```
All methods recover the ATE in this simulation. They differ in how well they recover heterogeneity.
Return to the opening example
Back to exercise and blood pressure. The ATE tells us whether the programme helps on average. The CATE tells us where gains are concentrated. For policy and clinical decisions, we usually need both.
After the mid-trimester break and Test 1 (Week 7), Week 8 introduces machine-learning methods that estimate these subgroup contrasts in high dimensions, without requiring the analyst to specify the functional form in advance.
Pair exercise: from average to subgroup
- An exercise programme has ATE = 3 mmHg reduction in blood pressure.
- Construct a scenario where the conditional average treatment effect (CATE) is 8 mmHg for one subgroup and $-2$ mmHg for another, consistent with this ATE (specify group sizes).
- Explain what a policy-maker reading only the ATE is missing.
- Your partner claims "$\hat{\tau}(X_i) = 8$ means the programme will reduce my blood pressure by 8 mmHg." Correct this claim using the distinction between estimated subgroup averages and unobservable individual effects.
Lab materials: Lab 6: CATE and Effect Modification
Appendix A: additive interaction simplification
Starting from
$$ \begin{aligned} &\big(\mathbb{E}[Y(1,1)] - \mathbb{E}[Y(0,0)]\big) \newline &\quad {}- \big(\mathbb{E}[Y(1,0)] - \mathbb{E}[Y(0,0)]\big) \newline &\quad {}- \big(\mathbb{E}[Y(0,1)] - \mathbb{E}[Y(0,0)]\big) \end{aligned} $$
we collect terms to obtain
$$ \mathbb{E}[Y(1,1)] - \mathbb{E}[Y(1,0)] - \mathbb{E}[Y(0,1)] + \mathbb{E}[Y(0,0)]. $$
Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
VanderWeele, T. J. (2009). On the distinction between interaction and effect modification. Epidemiology, 863–871.
VanderWeele, T. J. (2012). Confounding and Effect Modification: Distribution and Measure. Epidemiologic Methods, 1(1), 55–82. https://doi.org/10.1515/2161-962X.1004
VanderWeele, T. J., & Robins, J. M. (2007). Four types of effect modification: a classification based on directed acyclic graphs. Epidemiology (Cambridge, Mass.), 18(5), 561–568. https://doi.org/10.1097/EDE.0b013e318127181b
Week 7: In-Class Test 1 (20%)
22 April 2026
This week is the first in-class test, covering material from weeks 1–6.
Test 1 answers (released)
Suggested answers to the 2026 in-class test are now available: Test 1 Answers (PDF). These are model answers, not the only acceptable phrasing; credit was awarded for any defensible reasoning that addressed each part of the question.
What is covered
- Causal questions, contrasts, target populations, and unobservability (week 1)
- Causal diagrams: elementary structures (week 2)
- Confounding bias structures (week 3)
- Selection bias and measurement bias (week 4)
- Average treatment effects and the three fundamental assumptions (week 5)
- Effect modification and CATE (week 6)
Format
- Duration: 1 hour 50 minutes
- Closed book, one A4 sheet of notes permitted, no devices
- Bring a pen or pencil
- Location: EA120 (the usual seminar room)
Reminders
- Bring a pen or pencil
- You may bring in one A4 sheet with notes.
- No electronic devices permitted during the test.
- Arrive on time. The test begins at 14:10.
- Write clearly; illegible answers cannot be marked.
Week 8: Heterogeneous Treatment Effects and Machine Learning
Date: 29 Apr 2026
Key idea
Machine learning changes the estimator, not the identification logic. A causal forest can rank who is predicted to benefit most, but that ranking is worth acting on only when held-out evidence shows it beats treating at random.
Readings
Required
Optional
Key concepts
- ATE and CATE answer different causal questions.
- Causal forests estimate heterogeneity, not just average effects.
- Honest splitting, sample splitting, and out-of-bag prediction answer different overfitting problems.
- Doubly-robust scores are the bridge from unobserved potential outcomes to evaluating rankings.
- RATE, AUTOC, and Qini assess whether a CATE ranking has practical value.
Week 6 introduced the conditional average treatment effect (CATE): the contrast for a subgroup defined by baseline covariates at time zero. Investigators can estimate CATEs for theory-driven subgroups by stratifying first, then estimating effects separately within each subgroup. They can also use interaction models, which require a chosen functional form. This week introduces causal forests, which learn the heterogeneity surface from data. The machine-learning step changes the estimator, not the identification logic: treatment must still be well-defined, subgroup variables must still precede treatment, and exchangeability and positivity still do the causal work. What machine learning adds is the ability to estimate $\tau(x)$ at high resolution without committing in advance to a particular shape for the heterogeneity.
Where we are in the heterogeneity sequence
- Week 6: define effect modification and CATE.
- Week 8: estimate CATE rankings and ask whether rankings contain useful targeting information.
- Week 9: turn modelled heterogeneity into interpretable policy trees.
- Week 10: ask whether the measured outcomes and covariates support the interpretation we place on the tree.
- Final assessment: report outcome-wide ATEs, then interpret policy trees as readable targeting summaries.
Seminar
Motivating example
Suppose a university can fund a community-socialising programme for only thirty percent of students. Treating everyone is infeasible. Choosing badly wastes scarce places. We need a defensible ranking of expected benefit, and we need to know whether the ranking carries enough information to be worth the administrative cost of targeting at all.
This budget example motivates ranking. In the assignment, policy trees identify interpretable treatment regions using the policytree reward objective; the treated share is whatever the selected rule implies. The right question this week is: does the CATE ranking contain enough real information to make targeting worth considering? If the answer is no, the simplest defensible scarce-budget policy may be random allocation among eligible students. The week's tools help investigators answer that question with held-out evidence rather than asserted intuition.
From ATE to CATE
Relying on the average treatment effect alone can hide large differences in who benefits. Today's question is how large that variation is, how to estimate it from data, and whether the variation is useful enough to justify tailoring an exposure to individual circumstances.
The average treatment effect is
$$ \text{ATE}=\mathbb{E}[Y(1)-Y(0)]. $$
The conditional average treatment effect is
$$ \tau(x)=\mathbb{E}[Y(1)-Y(0)\mid X=x]. $$
Here $X$ must be a baseline profile measured before treatment begins. ATE answers "does it work on average?" CATE answers "for whom does it work more or less?"
Recall that the individual treatment effect $Y_i(1) - Y_i(0)$ is unidentified, because for any single person we observe at most one of the potential outcomes. CATE is the most granular contrast we can identify: the average of individual effects across people who share covariate profile $x$. Two people with identical baseline covariates may differ in their realised potential outcomes, so $\tau(x)$ smooths over within-profile variation that is, by definition, beyond reach.
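A short worked example may help connect the two estimands. For a discrete baseline profile, the law of total expectation makes the ATE the population-weighted average of the CATEs. The two profiles, their population shares (0.3 and 0.7), and the effect sizes (0.4 and 0.1) below are invented for illustration.

$$ \text{ATE} = \sum_{x} \tau(x)\Pr(X = x) = 0.4 \times 0.3 + 0.1 \times 0.7 = 0.19. $$

An ATE of 0.19 is therefore consistent with one profile gaining four times as much as the other. The average alone does not reveal this.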
Outcome-wide effects before heterogeneity
Before asking whether effects vary across people, ask whether the treatment appears to affect the outcome at all, and whether that evidence is robust across the outcomes investigators care about. Investigators rarely look at one outcome in isolation. A programme that improves sense of purpose but worsens belonging, self-esteem, or life satisfaction raises a different decision problem from a programme whose benefits point in the same direction across outcomes.
For the final assignment, the core sequence is outcome-wide ATEs followed by policy trees. Students choose one exposure, estimate average treatment effects across the prespecified outcome family, and then report policy trees as interpretable targeting rules. RATE and Qini belong to the Week 8 teaching sequence: they help explain whether a CATE ranking contains targeting information. They are not the assignment's main heterogeneity output. The assignment asks for policy trees because a tree turns modelled heterogeneity into a rule a non-specialist can read, question, and apply.
Keep the objects separate:
| Object | Question | Output |
|---|---|---|
| Outcome-wide ATEs | Which outcomes show credible average evidence? | Four-outcome ATE table or plot |
| CATE | For whom does the effect vary? | $\hat{\tau}(X)$ estimates |
| RATE / Qini | Does the ranking carry targeting value? | Diagnostic curves and summaries |
| Policy tree | Can we state a readable rule? | Depth-1 or depth-2 allocation rule |
| Measurement checks | Do the variables mean what the rule assumes? | Cautions about construct meaning and comparability |
A single-outcome CATE example
The CATE machinery below sits inside this larger assignment logic. We pick one outcome at a time to make the heterogeneity story concrete. The final report applies the outcome-wide ATE plus policy-tree sequence across the prespecified outcomes.
Causal estimand, statistical estimand, estimator
The same workflow that organised weeks 5 and 6 still organises this week. The full ten-step version lives on the causal workflow reference page; the five-step digest below is the version specific to heterogeneous treatment effects.
Workflow for heterogeneous treatment effects
1. State the causal estimand. The target is $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$: the conditional average treatment effect at baseline profile $X = x$, where $X$ is measured at time zero.
2. Defend the identifying assumptions. Conditional exchangeability $Y(a) \coprod A \mid X$, consistency, and positivity must hold within every covariate profile that indexes $\tau(x)$. The machine-learning step does not weaken these requirements.
3. Construct a statistical estimand that targets the causal one. Under the assumptions in step 2, the causal contrast equals an observable contrast: $\tau(x) = \mathbb{E}[Y \mid A = 1, X = x] - \mathbb{E}[Y \mid A = 0, X = x]$. This is the quantity we estimate from data.
4. Estimate it. A causal forest learns the two conditional expectations non-parametrically, with honest splitting and out-of-bag prediction (below). The forest returns $\hat{\tau}(x)$ for each unit without committing in advance to a functional form for the heterogeneity.
5. Evaluate before deciding. Report the ATE, the distribution of $\hat{\tau}(x)$, and targeting metrics (RATE, Qini) on held-out data, accompanied by the standard sensitivity analyses (E-values, missing-data diagnostics). Treat RATE and Qini as evidence about whether the CATE ranking is useful. Week 9 asks the separate policy-learning question: whether a short allocation rule can recover enough of that value to be worth using.
The bridge that does the causal work is step 3. Without it, the forest is a flexible regression and nothing more. With it, the forest's two conditional means inherit causal meaning from the identification assumptions, and their difference $\hat{\tau}(x)$ targets $\tau(x)$. Machine learning replaces the guessed functional form of an interaction model with a learned one; it does not replace the identification argument.
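The identification bridge can be checked in a simulation where both potential outcomes are known. The sketch below is base R only and assumes a randomised treatment and a single binary baseline covariate; the effect sizes (0.2 and 0.8) and all object names are illustrative choices, not course data.

```r
set.seed(2026)
n <- 20000

# baseline covariate measured before treatment (binary for simplicity)
x <- rbinom(n, 1, 0.5)

# potential outcomes: the true CATE is 0.2 when x = 0 and 0.8 when x = 1
y0 <- rnorm(n, mean = 0.5 * x)
y1 <- y0 + ifelse(x == 1, 0.8, 0.2)

# randomised treatment, so exchangeability and positivity hold by design
a <- rbinom(n, 1, 0.5)

# consistency: we observe only the potential outcome under the treatment received
y <- ifelse(a == 1, y1, y0)

# statistical estimand: difference in observed means within each stratum of x
cate_hat <- sapply(c(0, 1), function(v) {
  mean(y[a == 1 & x == v]) - mean(y[a == 0 & x == v])
})
names(cate_hat) <- c("x = 0", "x = 1")

# the truth is known here only because we simulated both potential outcomes
cate_true <- c("x = 0" = 0.2, "x = 1" = 0.8)
print(round(rbind(cate_true, cate_hat), 2))
```

The observed within-stratum contrasts land close to the simulated truth because the assumptions hold by construction. With real data the same arithmetic runs just as easily, which is why the identification argument has to be defended separately.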
Why pre-specified subgroup checks are often not enough
If theory names a subgroup in advance, a simple non-parametric check is to split the data into those strata and estimate the ATE separately in each subgroup. For example, investigators might estimate the effect among younger and older adults separately, then compare the two subgroup estimates. This is descriptive heterogeneity analysis, not policy optimisation: it reports how $\hat{\tau}$ varies across investigator-defined groups.
The strength of stratification is clarity. The weakness is that the answer is only as broad as the strata chosen. Subgroup estimates can also be noisy when strata are small, and repeated subgroup searches create the same multiplicity and over-interpretation problems that outcome-wide designs try to discipline. Use stratified subgroup comparisons when there is a theoretical reason to expect a difference; be cautious when the subgroup list is exploratory.
Linear interaction models provide another pre-specified check, but they add modelling assumptions about shape.
A small interaction model assumes a simple shape. If the analyst writes $\tau(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ and stops there, every covariate must enter linearly and the heterogeneity along $x_1$ must be the same regardless of $x_2$. Real treatment-response surfaces are often non-linear, contain interactions among more than two variables, and carry sharp regime changes (the effect of a programme on purpose may climb steeply with extraversion up to a saturation point, then flatten). A pre-specified interaction model can miss all three features.
A pre-specified linear interaction model can still be useful when the scientific question is a specific, pre-declared contrast. For discovery, it is a weak test of heterogeneity because its answer is limited to the variables, interactions, and functional form chosen in advance.
A non-parametric estimator like a causal forest sidesteps the functional-form problem by letting the data suggest where to draw splits and how deep to go. The benefit is reach: instead of testing a few named subgroups, the forest searches over the whole baseline feature space $X$. The cost is interpretability: the forest is no longer a small set of coefficients you can read off, and a high-dimensional $\hat{\tau}(X)$ surface is hard to translate into a defensible policy decision. Week 9 handles that cost by fitting shallow policy trees on top of the forest.
From regression trees to causal forests
Understanding causal forests requires three steps.
Regression tree. A regression tree splits the covariate space with yes/no questions ("Age $\le$ 20?", "Baseline purpose $> 0.3$?"). Each terminal leaf predicts the sample mean of the outcome for units that land there. The result is a piecewise-constant surface, not a global line. A single tree is interpretable but unstable: small changes in the data can shift splits and predictions. The instability is the price of letting the data choose the splits.
Regression forest. A random forest grows many trees on bootstrap samples and averages their outputs. Averaging cancels much of the noise that makes any one tree unreliable (Breiman, 2001). The price is interpretability: the forest's prediction surface is the average of hundreds of jagged tile patterns, with no single tree responsible for any particular prediction.
Causal forest. Each tree still asks "where should I split the covariate space?", but the question it answers is different. A regression tree looks for splits that group units with similar outcomes: it finds cutpoints where the average value of $Y$ jumps. A causal tree looks for splits that group units with similar treatment effects: it finds cutpoints where the gap between treated and control units jumps. So the tree's splits land on covariates that change the effect of treatment, not covariates that merely predict the outcome.
Each tree then plays an honest two-step game on its training subsample (Wager & Athey, 2018). The first half decides where the splits go. The second, non-overlapping half estimates the treated-minus-control gap inside each resulting leaf. This is called honesty: the data that choose a promising subgroup are separate from the data used to estimate that subgroup's treatment contrast. The forest averages those leaf-level gaps across hundreds of trees to estimate the CATE surface:
$$ \hat{\tau}(x) \approx \tau(x)=\mathbb{E}[Y(1) - Y(0) \mid X = x]. $$
The leaf-level contrast for an individual leaf is just the mean outcome among treated units in that leaf minus the mean among controls. Because individual leaf estimates are noisy and point in many directions, their average across hundreds of trees is far less variable. The progression matters: students cannot reason about causal forests without first understanding what a tree does and why averaging helps.
Key intuition
A regression tree tiles the covariate space into locally flat regions. A forest averages many such tiles to smooth away noise. A causal forest adds honest splitting so the averaged contrasts estimate treatment effects, not just predictions.
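The claim that averaging stabilises a tree can be illustrated with the smallest possible tree: a single split (a "stump"). This is a base-R sketch under simplifying assumptions (one covariate, a linear signal, plain bootstrap averaging); `stump_predict`, `bagged_predict`, and `one_draw` are hypothetical helper names, and real forests add random feature selection and much deeper trees.

```r
set.seed(2026)

# fit a one-split regression tree (a "stump") and predict at a new point x0
stump_predict <- function(x, y, x0) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]  # a split must leave units on both sides
  sse <- sapply(cuts, function(cut) {
    left <- y[x <= cut]
    right <- y[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  if (x0 <= best) mean(y[x <= best]) else mean(y[x > best])
}

# average the predictions of stumps grown on bootstrap resamples
bagged_predict <- function(x, y, x0, n_trees = 25) {
  mean(replicate(n_trees, {
    idx <- sample(length(x), replace = TRUE)
    stump_predict(x[idx], y[idx], x0)
  }))
}

# repeat over fresh datasets to see how much each prediction moves around
one_draw <- function(n = 100, x0 = 0.5) {
  x <- runif(n)
  y <- x + rnorm(n, sd = 0.3)
  c(single = stump_predict(x, y, x0), bagged = bagged_predict(x, y, x0))
}
draws <- replicate(60, one_draw())

# variance of the prediction at x0 across datasets
prediction_variance <- apply(draws, 1, var)
print(round(prediction_variance, 4))
```

The single stump's prediction at `x0` jumps depending on which side of the chosen split the point falls. The bagged prediction averages over many plausible split locations, so it moves far less from one dataset to the next.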
Pair exercise: from tree to forest to causal forest
- Explain the three-step progression to your partner in your own words.
- Name one strength and one weakness of a single regression tree.
- Explain why averaging many trees (a forest) helps with the weakness you identified.
- State the two differences between a regression forest and a causal forest: (a) what is the target quantity? (b) what does honest splitting add? Explain in one sentence why honest splitting is necessary when we estimate treatment contrasts rather than predictions.
Honest splitting and out-of-bag prediction
Three related ideas appear in this lecture: honesty, out-of-bag prediction, and held-out evaluation. They all reduce overfitting, but they act at different points in the workflow.
Honesty happens inside each tree. It separates model selection from estimation. The first half of a tree's training subsample decides where to split; the second half estimates the leaf-level treatment-control gap. The two halves do not share information. This separation matters because we estimate parameters under two exposures, at most one of which is observed for any individual. If the same data chose the splits and estimated the contrasts, the forest would chase lucky treatment-control gaps. The leaves that look most informative on the training sample would often be the leaves whose gaps were inflated by chance. The CATE estimates would inherit that inflation.
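The inflation that honesty guards against is easy to reproduce. The sketch below uses base R and a deliberately simple set-up: the true treatment effect is zero for everyone, a search over a few candidate leaves picks the one with the largest apparent gap, and that gap is then re-estimated on data the search never saw. The helper names and the grid of cutpoints are illustrative, and this is a one-leaf caricature rather than grf's actual algorithm.

```r
set.seed(2026)

# treated-minus-control mean difference among the units flagged by `in_leaf`
leaf_gap <- function(y, a, in_leaf) {
  mean(y[in_leaf & a == 1]) - mean(y[in_leaf & a == 0])
}

one_run <- function(n = 400) {
  # the true treatment effect is zero for everyone
  x <- runif(n)
  a <- rbinom(n, 1, 0.5)
  y <- rnorm(n)

  # first half chooses the leaf; second half is kept back for honest estimation
  half <- seq_len(n) <= n / 2
  cuts <- seq(0.2, 0.8, by = 0.1)

  # search over candidate leaves "x <= cut" for the largest apparent gap
  gaps <- sapply(cuts, function(cut) leaf_gap(y[half], a[half], x[half] <= cut))
  best_cut <- cuts[which.max(gaps)]

  c(
    same_data = max(gaps),
    honest = leaf_gap(y[!half], a[!half], x[!half] <= best_cut)
  )
}

runs <- replicate(500, one_run())
mean_gap <- rowMeans(runs)
print(round(mean_gap, 3))
```

Averaged over repetitions, the gap estimated on the same data that chose the leaf is clearly positive even though no effect exists, while the honest estimate sits near zero. The cost of honesty is that each half has fewer observations.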
Out-of-bag (OOB) prediction happens after the trees are grown. Each tree is trained on a subsample, so some observations are left out of that tree. An OOB prediction for observation $i$ averages only trees that did not train on observation $i$. OOB is close in spirit to leave-one-out prediction: the case being scored was not used to fit the trees that score it. In a causal forest, OOB predictions help keep $\hat{\tau}(x_i)$ from looking too good simply because observation $i$ helped build the model.
Held-out evaluation is broader sample splitting at the analysis level. We may fit the forest on one fold and evaluate a targeting curve on another fold. Honesty protects leaf-level effect estimation; OOB protects unit-level forest prediction; held-out evaluation protects the claim that a targeting rule would work beyond the data used to learn it.
Together, these safeguards support more credible heterogeneity estimates in high-dimensional settings with many covariates. They do not create exchangeability, fix poor measurement, or make post-treatment covariates safe.
Doubly-robust scores: the evaluation bridge
The next problem is evaluation. A CATE ranking says who is expected to benefit most, but to evaluate that ranking we need to ask counterfactual questions such as: what would the mean outcome be if the top 30% by $\hat{\tau}(X)$ received treatment? We do not observe both $Y_i(1)$ and $Y_i(0)$ for any person, so raw outcomes cannot answer that question.
The workaround is to create action-specific pseudo-outcomes. For each treatment action $a$, the augmented inverse-propensity-weighted (AIPW) score combines two pieces of information:
- an outcome-model prediction for what would happen under action $a$;
- a correction from the people who actually received action $a$, weighted by how probable that action was for their covariates.
The score is
$$ \Gamma_i(a) = \underbrace{\mu_a(X_i)}_{\textcolor{blue}{\text{outcome model prediction}}} + \underbrace{\frac{\mathbb{1}\lbrace A_i = a\rbrace}{\Pr(A = a \mid X_i)}}_{\textcolor{teal}{\text{IPW weight}}}\thinspace\underbrace{\bigl(Y_i - \mu_a(X_i)\bigr)}_{\textcolor{red}{\text{residual correction}}}, $$
where $\mu_a(x)$ is the causal forest's estimate of the outcome surface under treatment $a$, and $\Pr(A = a \mid x)$ is the propensity for treatment $a$ given covariates.
The score is called doubly robust because averaging $\Gamma_i(a)$ can recover the mean potential outcome $\mathbb{E}[Y(a)]$ if either the outcome model or the propensity model is well specified, along with the usual causal assumptions. Both models can still be wrong, and unmeasured confounding still matters. "Doubly robust" means there are two statistical routes to the same estimand, not that the analysis is protected against every threat.
Targeting metrics and the policy-tree evaluation in Week 9 use AIPW scores rather than raw outcomes. The scores let us evaluate rules that assign different actions to different people, even though each person was observed under only one action.
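The AIPW score can be computed by hand in a few lines. This base-R sketch makes several assumptions that differ from the course workflow: treatment is randomised with a known propensity of 0.5, the outcome models are simple linear regressions rather than forest estimates, and there is no cross-fitting. The simulated effect sizes are illustrative.

```r
set.seed(2026)
n <- 5000

# baseline covariate and a randomised treatment with known propensity 0.5
x <- rnorm(n)
a <- rbinom(n, 1, 0.5)

# outcome with a treatment effect of 0.3 + 0.5 * x, so the true ATE is 0.3
y <- 0.5 * x + a * (0.3 + 0.5 * x) + rnorm(n)

# outcome models mu_a(x), fitted separately within each arm
d <- data.frame(y = y, x = x, a = a)
fit1 <- lm(y ~ x, data = d[d$a == 1, ])
fit0 <- lm(y ~ x, data = d[d$a == 0, ])
mu1 <- predict(fit1, newdata = d)
mu0 <- predict(fit0, newdata = d)

# propensity Pr(A = 1 | X): known by design here, estimated in observational data
e <- rep(0.5, n)

# AIPW scores: model prediction plus an inverse-propensity-weighted residual
gamma1 <- mu1 + (a == 1) / e * (y - mu1)
gamma0 <- mu0 + (a == 0) / (1 - e) * (y - mu0)

# averaging the scores estimates E[Y(1)], E[Y(0)], and their difference
ate_aipw <- mean(gamma1 - gamma0)
ate_se <- sd(gamma1 - gamma0) / sqrt(n)
print(round(c(estimate = ate_aipw, std_error = ate_se), 3))
```

Every unit receives a score under both actions, including the action it did not receive. That is what lets later steps average scores over any rule-defined subset of people.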
Is heterogeneity actionable?
After estimating $\hat{\tau}(x)$, we rank units from largest to smallest estimated effect. That ranking is not yet a policy. It is a diagnostic object: a list from "most expected benefit" to "least expected benefit". The evaluation question is whether this list contains enough real information to improve allocation.
The first question is: does treating high-ranked units first yield meaningfully larger gains than treating at random?
The Targeting Operator Characteristic (TOC) curve answers this question. For each treatment coverage $q$, it plots how much larger the average treatment effect is among the top-$q$ ranked individuals than the average effect across the whole population:
$$ \text{TOC}(q)=\frac{1}{\lfloor qn\rfloor}\sum_{i=1}^{\lfloor qn\rfloor}\hat{\tau}_{(i)}\;-\;\frac{1}{n}\sum_{i=1}^{n}\hat{\tau}_i,\qquad 0 < q \le 1, $$
where $\hat{\tau}_{(1)} \ge \hat{\tau}_{(2)} \ge \cdots$ are the sorted estimated effects. The first term is the mean estimated effect in the top-$q$ slice; the second is the overall ATE. The horizontal axis $q$ is the fraction of the population we would treat. The vertical axis is the gain from selecting that top-$q$ slice by the CATE ranking rather than choosing a slice of the same size at random: a random slice has the overall ATE as its average effect, so the TOC measures how far the ranked slice beats it. At $q = 1$ the top slice is the whole population, so the TOC returns to zero.
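The TOC formula translates directly into a few lines of base R. The sketch below applies it to ten invented estimated effects, purely to show the arithmetic; `toc` is a hypothetical helper, and a real evaluation would use doubly-robust scores on held-out data rather than the estimates themselves.

```r
# TOC(q): mean estimated effect in the top-q slice minus the overall mean
toc <- function(tau_hat, q) {
  sorted <- sort(tau_hat, decreasing = TRUE)
  n <- length(sorted)
  top_n <- floor(q * n)
  stopifnot(top_n >= 1)
  mean(sorted[seq_len(top_n)]) - mean(sorted)
}

# ten hypothetical estimated effects (invented for illustration)
tau_hat <- c(0.9, 0.7, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05)

coverage <- c(0.1, 0.5, 1)
toc_values <- sapply(coverage, function(q) toc(tau_hat, q))
print(round(data.frame(q = coverage, toc = toc_values), 3))
```

The gain is largest for the smallest slice and shrinks to zero at full coverage, matching the shape described above.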
The TOC curve is the most direct answer to "does targeting beat random?" but it carries no single-number summary. Two summaries of the curve are useful. Both are forms of the RATE (rank-weighted average treatment effect), a family of summaries that average targeting gains over parts of the ranked population:
RATE AUTOC means the area under the TOC curve. It puts equal weight on every value of $q$. It answers: across all possible treatment coverages, how much can we gain by selecting people using the CATE ranking rather than selecting at random? A large AUTOC indicates concentrated heterogeneity. A small AUTOC indicates that the ranking carries little practical information.
RATE Qini weights the middle of the ranked population more heavily. It answers: at realistic, moderate coverages, does targeting improve on random allocation? Qini is the practical metric when investigators face a fixed budget constraint, for example treating 20-50% of eligible people.
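For readers who want the weighting written out, the two summaries are commonly expressed as weighted areas under the TOC curve. This is a sketch assuming the integral definitions given in the RATE literature and in the grf documentation; check the package documentation for the exact estimator it implements.

$$ \text{AUTOC}=\int_0^1 \text{TOC}(q)\,dq, \qquad \text{Qini}=\int_0^1 q\,\text{TOC}(q)\,dq. $$

Under these definitions AUTOC counts every coverage level equally, so it rewards rankings whose gains sit in a small top slice. Qini multiplies each gain by the share treated, which discounts the very top of the ranking. Because $\text{TOC}(q)$ returns to zero at $q = 1$, the weighted gain $q\,\text{TOC}(q)$ is often largest at moderate coverages.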
These metrics tell us whether the ranking has value. They do not tell us whether the ranking is understandable, fair, or administratively usable.
Reading a Qini curve
Read the curve in three passes.
First, look at the right edge. Both lines must meet at $q = 1$, because treating one hundred percent of the population is treat-everyone, whatever the rule. The shared end-point is the ATE.
Second, look at the gap between the orange and grey lines across the middle range. In this fit the orange line sits roughly $0.05$ above the grey line at the 40% marker and roughly $0.02$ above at the 10% marker. Those gaps are the lift from targeting at those budgets, expressed in the outcome's standard-deviation units. The $0.05$ lift at 40% coverage says that a CATE-targeted rule covering forty percent of the population delivers about $0.05$ more standard-deviation units of purpose, averaged across the whole eligible population, than allocating the same forty percent of places without regard to the ranking.
Third, look at the shape near $q = 0$. A curve that rises steeply over the leftmost few percent indicates that the very top of the ranking is heavily concentrated. A curve that hugs the diagonal everywhere indicates that the ranking carries no useful targeting information. Under a fixed budget, random allocation may then be more defensible than a complex targeting system. Without a fixed budget, the simpler choice may be to offer treatment uniformly.
Both summaries, AUTOC and Qini, must be computed on held-out data, not in-sample rankings. The forest reuses information across trees through its bootstrap structure. Evaluating RATE or Qini on the training fold produces optimistic estimates. Computing them on a separate fold blocks this bias and yields more credible confidence intervals.
Pair exercise: reading a TOC curve
- A university socialising programme produces a TOC curve that rises steeply for the top 20% of ranked students, then flattens.
- Interpret the shape: what does the steep rise mean about where treatment gains are concentrated?
- Suppose the AUTOC is large but the Qini at $q = 0.3$ (a 30% budget) is small. Explain what this combination means for a decision-maker with a fixed budget.
- Why would computing the TOC curve on the same data used to train the causal forest produce misleading results? State the problem in one sentence.
Workflow for this week
The week's tools fit together as a short pipeline:
- Specify the causal estimand: treatment, time zero, target population, outcome, and the identification assumptions.
- Fit a causal forest with honest splitting on baseline covariates measured at time zero.
- Check that heterogeneity is real. The causal-forest calibration test asks whether $\hat{\tau}(x)$ varies across people more than chance alone would produce; the coefficient to read is the one grf labels differential.forest.prediction, where a value near 1 means the heterogeneity estimates track real variation and a value near 0 means there is little to use. Run this early, to avoid spending time on a flat forest.
- Estimate targeting value with RATE-AUTOC and RATE-Qini on held-out data, and inspect the slope of the Qini curve around plausible budgets.
- Run sensitivity analysis on the underlying ATE before staking a policy on it.
- Decide whether heterogeneity is large enough to inform allocation. If yes, Week 9 turns the ranking into a transparent rule. If no, use a simpler allocation rule: random allocation under a fixed budget, or uniform provision when treatment can be offered to everyone.
Treat these steps as diagnostics, not as a verdict. A positive calibration test or RATE is evidence that effects vary, but variation improves on a blanket policy only when the better action differs across covariate regions, or a budget forces a choice. Whether a particular rule actually wins is settled by the policy tree's held-out policy value, compared against the blanket policies of treating everyone or treating no-one (Week 9).
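The middle of the pipeline can be sketched with grf on simulated data. This is an illustration under stated assumptions, not the lab's code: the data are simulated with a randomised treatment, the single train/evaluation split and sample sizes are arbitrary, and the estimand and sensitivity-analysis steps are omitted. The block runs only if grf is installed.

```r
# requires the grf package; the simulated data and sample sizes are illustrative
if (requireNamespace("grf", quietly = TRUE)) {
  set.seed(2026)
  n <- 2000
  p <- 5

  # baseline covariates, randomised treatment, and an effect that varies with X1
  X <- matrix(rnorm(n * p), n, p)
  W <- rbinom(n, 1, 0.5)
  tau <- pmax(X[, 1], 0)
  Y <- X[, 2] + W * tau + rnorm(n)

  # split once: one fold learns the ranking, the other evaluates it
  train <- sample(n, n / 2)
  test <- setdiff(seq_len(n), train)

  # causal forest on the training fold (honest splitting is the grf default)
  forest_train <- grf::causal_forest(X[train, ], Y[train], W[train])

  # calibration test: read the differential.forest.prediction row
  calibration <- grf::test_calibration(forest_train)
  print(calibration)

  # rank held-out units by the CATE predicted from the training forest
  priorities <- predict(forest_train, X[test, ])$predictions

  # a second forest on the held-out fold supplies the doubly-robust scores
  forest_eval <- grf::causal_forest(X[test, ], Y[test], W[test])
  rate_autoc <- grf::rank_average_treatment_effect(forest_eval, priorities, target = "AUTOC")
  rate_qini <- grf::rank_average_treatment_effect(forest_eval, priorities, target = "QINI")

  print(c(autoc = rate_autoc$estimate, autoc_se = rate_autoc$std.err))
  print(c(qini = rate_qini$estimate, qini_se = rate_qini$std.err))
}
```

The ranking comes from one fold and is scored on the other, so the RATE estimates are not flattered by the data that produced the ranking. An estimate that is small relative to its standard error would point towards the simpler allocation rules described in the final step.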
Return to the opening example
Back to the university budget. The question is not only whether the programme works. The question is whether gains are concentrated enough that targeting improves outcomes under a real budget constraint, and whether the targeting evidence survives the standard sensitivity checks. A causal forest with a Qini curve that hovers near the diagonal is also a finding: it tells the university that the CATE ranking may not justify selective allocation. The university could then use a transparent lottery among eligible students, expand capacity, or spend the targeting budget on a different question.
The hard problem at the end of this lecture is opacity. A causal forest predicts over the entire feature space $X$. That is a major gain over conventional subgroup analysis, because it can discover heterogeneity we did not specify in advance. It is also a decision problem: a high-dimensional ranking is difficult to explain, contest, audit, or implement. Week 9 begins by returning to the outcome-wide screen, because policy work should start from outcomes with credible average evidence. It then uses policy trees as a workaround for the ranking problem. A shallow policy tree gives up some targeting precision in exchange for a rule that a decision-maker can read, defend, and apply.
Pair exercise: should we target?
- The ATE is 0.15 SD. The Qini curve is statistically significant at a budget of $q = 0.3$ (treating 30% of the population).
- State the causal estimand that the Qini addresses (what question does it answer beyond the ATE?).
- List two non-statistical reasons a decision-maker might choose not to target (for example, stigma, logistics, cost, or who gets to decide).
- A colleague argues "targeting lonely students for a socialising programme is stigmatising." Draft a two-sentence response that takes the concern seriously while explaining what the evidence does and does not show.
Missing data in grf
By default, grf treats missingness in a covariate as a splitting attribute (MIA) rather than deleting the affected rows. A missing value can itself be informative: it can correlate with treatment, with the outcome, or with both. MIA lets the tree send "missing" units down whichever branch produces the cleaner treatment-control contrast.
That can preserve sample size, but it does not remove identification concerns. We still need a causal missingness argument, and we still need covariates defined before treatment if they are to index $\tau(x)$. MIA is a convenience for estimation; it does not absolve the analyst of the duty to model why data are missing.
Further material
Susan Athey's lecture on causal forests gives a deeper account of the statistical machinery introduced above. The relevant material on causal forests starts around the eighteen-minute mark.
Lab materials: Lab 8: RATE and QINI Curves
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Bulbulia, J. A. (2024). A practical guide to causal inference in three-wave panel studies. PsyArXiv Preprints. https://doi.org/10.31234/osf.io/uyg3d
Hoffman, K. L., Salazar-Barreto, D., Rudolph, K. E., & Díaz, I. (2023). Introducing longitudinal modified treatment policies: A unified framework for studying complex exposures. https://doi.org/10.48550/arXiv.2304.09460
Suzuki, E., Shinozaki, T., & Yamamoto, E. (2020). Causal Diagrams: Pitfalls and Tips. Journal of Epidemiology, 30(4), 153–162. https://doi.org/10.2188/jea.JE20190192
VanderWeele, T. J., Mathur, M. B., & Chen, Y. (2020). Outcome-wide longitudinal designs for causal inference: A new template for empirical studies. Statistical Science, 35(3), 437–466.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242. https://doi.org/10.1080/01621459.2017.1319839
Week 9: Resource Allocation and Policy Trees
Date: 6 May 2026
Key idea
A policy tree turns an opaque heterogeneity ranking into a short allocation rule that a non-specialist can read, apply, and contest. The variables the tree splits on are useful for targeting, but they are not thereby causes.
Readings
Required
Optional
Key concepts
- Outcome-wide evidence asks whether a prespecified exposure has a credible pattern across several outcomes.
- Policy learning estimates utility over allocation rules.
- Policy value must be evaluated out of sample.
- Shallow policy trees trade a little value for a lot of interpretability.
- Fairness constraints are design choices, not automatic outputs.
Week 8 introduced causal forests and ranking diagnostics one outcome at a time so the CATE machinery was visible. Week 9 connects that machinery to the final assignment's decision sequence. First, estimate outcome-wide ATEs for the prespecified outcome family. Second, for each outcome with enough evidence to discuss, use policy trees to summarise where treatment appears most valuable. Rankings alone are not policy. This week asks how much of the underlying heterogeneity a short, publicly defensible policy tree can recover.
Where we are in the heterogeneity sequence
- Week 6: define effect modification and CATE.
- Week 8: estimate CATE rankings and ask whether rankings contain useful targeting information.
- Week 9: turn modelled heterogeneity into interpretable policy trees.
- Week 10: ask whether the measured outcomes and covariates support the interpretation we place on the tree.
- Final assessment: report outcome-wide ATEs, then interpret policy trees as readable targeting summaries.
Seminar
Motivating example
A district health board is considering a community-group intervention. The analysis begins outcome-wide: purpose, belonging, self-esteem, and life satisfaction are all possible targets, and investigators should not pick whichever outcome looks most convenient after the fact. Suppose the outcome-wide screen suggests that sense of purpose is the clearest outcome to inspect further. Week 8 then asks whether effects on purpose vary across residents and whether a CATE ranking has targeting value. Week 9 asks the next question: can that ranking be turned into a rule the board could describe in public?
We work through purpose for concreteness. The final assignment applies the same broad logic across the four wellbeing outcomes: report the outcome-wide ATEs, then use policy trees as the interpretable heterogeneity output. RATE and Qini remain useful for understanding rankings, but they are not the report scaffold.
Where Week 9 fits
A policy tree is not the first analysis to run. It is a late-stage summary, used after the outcome-wide ATEs have established which outcomes need interpretation. Week 8 estimated and evaluated single-outcome CATE rankings as teaching diagnostics; Week 9 reopens the outcome-wide screen as the gate before policy work, then fits, evaluates, and interprets the policy tree.
Outcome-wide evidence before targeting
Before fitting a policy tree, ask whether the underlying average effects are credible enough to justify a targeting exercise. In the final assignment, students estimate one exposure across four outcomes: purpose, belonging, self-esteem, and life satisfaction. This is the same outcome-wide screen introduced in Week 8, now used as the context for policy-tree interpretation. The point is to read the pattern across outcomes, not to hunt for whichever row happens to look best.
VanderWeele's outcome-wide design starts from a common problem in social and psychological science: exposures often plausibly affect several outcomes, and investigators can tell a convincing story after the fact if they focus only on whichever row looks strongest. An outcome-wide design disciplines that choice. It asks about a prespecified family of outcomes under the same exposure, target population, time zero, and adjustment set (VanderWeele et al., 2020).
The motivation is disciplined comparison. A broad pattern across outcomes can support a stronger substantive interpretation than one isolated estimate. A specific pattern can show that the exposure seems more relevant for some outcomes than others. A null pattern can prevent investigators from over-selling a single noisy association. Because the same exposure is being evaluated repeatedly, outcome-wide designs also require multiplicity control and sensitivity analysis.
Read the forest plot in four passes:
- Is the pattern broad, outcome-specific, or mostly null?
- Which outcomes remain statistically reliable after multiplicity correction?
- Which estimates are robust enough to unmeasured confounding to discuss seriously?
- Which outcome, if any, has enough evidence to justify asking a Week 8 targeting question and a Week 9 policy-tree question?
When a single exposure is tested against several outcomes, the probability that at least one confidence interval excludes zero by chance alone exceeds the nominal per-test error rate. A simple correction is Bonferroni:
$$ \alpha_{\text{per outcome}} = \frac{\alpha_{\text{family-wise}}}{K}. $$
With $K = 4$ outcomes and $\alpha_{\text{family-wise}} = 0.05$, each outcome is tested at $\alpha = 0.0125$. Equivalently, report a 98.75% confidence interval for each outcome. This is conservative, but it is transparent and easy to explain.
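The arithmetic can be checked with a minimal Python sketch (the labs use R; this is only an illustration). The function name `bonferroni` is hypothetical, and the critical value assumes a two-sided normal-approximation interval.

```python
from statistics import NormalDist


def bonferroni(alpha_family: float, n_outcomes: int) -> dict:
    """Per-outcome alpha, the matching CI level, and the two-sided z critical value."""
    alpha_each = alpha_family / n_outcomes
    ci_level = 1.0 - alpha_each
    # Two-sided interval: put alpha_each / 2 in each tail.
    z_crit = NormalDist().inv_cdf(1.0 - alpha_each / 2.0)
    return {"alpha_each": alpha_each, "ci_level": ci_level, "z_crit": z_crit}


result = bonferroni(alpha_family=0.05, n_outcomes=4)
print(result)
```

Under these assumptions the critical value is roughly 2.5 rather than the familiar 1.96, so each of the four intervals is noticeably wider than an unadjusted 95% interval.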
The second question is sensitivity to unmeasured confounding. The E-value is the minimum association strength, on the risk-ratio scale, that an unmeasured confounder would need to have with both the exposure and the outcome, above and beyond the measured covariates, to explain away the estimated effect (VanderWeele & Ding, 2017). Larger E-values mean that a stronger unmeasured confounder would be needed.
Do not treat E-values as a universal pass/fail rule. Whether an E-value is large enough depends on the study design, measured covariates, outcome, and the kinds of unmeasured confounding that remain plausible. In your report, state the E-value for the point estimate and for the confidence limit closest to the null, then interpret it in context. The confidence-limit E-value is usually the more cautious summary because it asks how strong confounding would need to be to move the uncertainty interval to include no effect. VanderWeele and Mathur also recommend reporting E-values with enough context that readers can compare them with known covariate-outcome and covariate-exposure associations rather than reading them as standalone thresholds (VanderWeele & Mathur, 2020).
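For a risk ratio $RR > 1$, the E-value formula in VanderWeele and Ding (2017) is $RR + \sqrt{RR \times (RR - 1)}$; estimates below 1 are inverted first. The Python sketch below applies that formula, assuming the estimate and its interval are already on the risk-ratio scale (continuous outcomes need a conversion step that is not shown). The function names and the example numbers are hypothetical.

```python
import math


def e_value(rr: float) -> float:
    """E-value for a risk ratio (formula from VanderWeele & Ding, 2017)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr  # protective estimates are inverted before applying the formula
    return rr + math.sqrt(rr * (rr - 1.0))


def e_value_for_limit(rr_estimate: float, ci_lower: float, ci_upper: float) -> float:
    """E-value for the confidence limit closest to the null (RR = 1)."""
    if ci_lower <= 1.0 <= ci_upper:
        return 1.0  # the interval already includes the null
    limit = ci_lower if rr_estimate > 1.0 else ci_upper
    return e_value(limit)


# Hypothetical estimate: RR = 1.5 with interval (1.2, 1.9).
point = e_value(1.5)
limit = e_value_for_limit(1.5, 1.2, 1.9)
print(round(point, 2), round(limit, 2))
```

The limit E-value is smaller than the point E-value, which is why it is the more cautious of the two summaries.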
From ranking to policy
Week 8 produced a ranked list: who is predicted to benefit most. A ranking is informative, but it is not a rule. A rule maps any covariate profile $x$ to an action $d(x) \in \{0, 1\}$, "treat" or "do not treat". Two profiles that are nearly identical should receive nearly identical decisions, and a ranking gives no such guarantee, because rank position drifts with sample size.
The causal estimand has changed. Conditional average treatment effect (CATE) estimation asks for a covariate-conditional contrast, $\tau(x) = \mathbb{E}[Y(1)-Y(0)\mid X=x]$: the average effect among people who share the profile $x$. Policy learning asks which allocation rule has the highest expected utility if applied to the target population. We need a way to score competing rules: how much utility each would deliver, on average, if applied to the whole population. Call that number the rule's policy value, written $V(d)$:
$$ V(d) = \mathbb{E}[Y(d(X))]. $$
In the course lab, utility is the outcome $Y$ itself. That assumes the treatment and control actions have equal cost, or that costs are being ignored for the teaching example. If treatment costs money, time, risk, or staff capacity, the utility should be a net utility such as $U(a) = Y(a) - c_a(X)$. If costs change, the best policy can change too: a rule that maximises wellbeing gain when treatment is cheap may no longer maximise net utility when treatment is expensive or scarce. With costs included, the rule should treat when the expected benefit is large enough to justify the cost, not simply when the expected benefit is positive.
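A minimal Python sketch of the cost-aware decision follows. It assumes the control action costs nothing and that the treatment cost is expressed in the outcome's own units, so "treat" wins when the expected benefit exceeds the cost. The function name, the benefit values, and the cost of 0.08 are all hypothetical.

```python
def treat_if_worth_it(expected_benefit: float, cost: float = 0.0) -> int:
    """Return 1 (treat) when the expected benefit exceeds the cost, else 0."""
    return 1 if expected_benefit > cost else 0


# Hypothetical estimated benefits for four covariate profiles.
benefits = [0.30, 0.10, 0.02, -0.05]

free = [treat_if_worth_it(b) for b in benefits]               # cost ignored
costly = [treat_if_worth_it(b, cost=0.08) for b in benefits]  # hypothetical cost

print(free)    # [1, 1, 1, 0]
print(costly)  # [1, 1, 0, 0]
```

The third profile has a positive expected benefit but is no longer treated once the cost is counted, which is the sense in which the best policy changes when costs change.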
The default policytree problem does not set a fixed percentage of
people to treat. Instead, it estimates the utility of candidate
allocation rules and asks which action has the highest estimated reward
in each leaf. The share treated is an output of the fitted rule, not an
input.
If a project has a fixed budget, the policy problem can be written with a treatment-share cap:
$$ \mathbb{E}[d(X)] \le q. $$
That is useful to discuss, but it is not what the course assignment enforces. In the assignment scaffold, policy trees identify regions where the estimated reward for treatment is higher than control, then report the expected mean differences and the size of those regions. Production scripts add further safeguards and interpretation layers, including more outcomes, sample weights, adverse-outcome flipping (reorienting harmful outcomes so a higher value always means a better result), cross-validated heterogeneity interpretation, depth comparison, and larger stability runs.
Two practical complications remain. First, $V(d)$ is a counterfactual quantity, since for any individual we observe at most one of $Y(0)$ or $Y(1)$. We need to evaluate candidate rules without assigning everyone to them. Second, the rule must be small enough to defend in front of a non-statistical audience. Both constraints push the analysis toward shallow, transparent decision trees evaluated with doubly-robust scores.
Why policy trees
A causal forest maps a high-dimensional covariate vector $X$ to a
personalised CATE score $\hat{\tau}(X)$. The score says how much
treatment is expected to change the outcome for people with covariates
$X=x$. It does not decide how to allocate a programme. The
policytree algorithm bridges that gap by comparing the utility of
allocation rules. It collapses the forest's many $\hat{\tau}(X)$
values into a single shallow decision tree, where each split is chosen
to maximise expected policy value subject to the depth budget (Sverdrup
et al., 2024).
The algorithm proceeds greedily, top down. At the root, it searches every covariate and every threshold for the split that delivers the largest gain in policy value when each child receives the locally-optimal action. It then recurses into each child until the depth limit is reached. Because the rule must hold for the entire population, the search uses doubly-robust scores rather than raw outcomes.
Define $\Gamma_i(a)$ as person $i$'s estimated outcome contribution if an allocation rule assigned action $a$. In the lab, outcome contribution and utility contribution are the same because we are treating $Y$ as the utility. The augmented inverse-propensity-weighted (AIPW) version is:
$$ \Gamma_i(a) = \mu_a(X_i) + \frac{\mathbb{1}\{A_i = a\}}{\Pr(A = a \mid X_i)}\,\bigl(Y_i - \mu_a(X_i)\bigr), $$
where $\mu_a(x)$ is the outcome model and $\Pr(A = a \mid x)$ is the propensity score. The propensity score is not a potential outcome. It is the estimated probability of receiving action $a$ given covariates. The first term is the outcome model's best guess at $Y_i(a)$ for everyone. The second term uses the people who actually received action $a$ to correct that guess. The propensity score appears in the denominator because only a fraction of people with covariates like $X_i$ received action $a$; inverse-propensity weighting scales their residuals so they represent the whole covariate stratum, not only the observed treated or untreated cases.
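The score can be computed directly once the nuisance estimates are in hand. The Python sketch below is an illustration only: `aipw_score` is a hypothetical helper, and the outcome-model predictions and propensities are made-up numbers standing in for fitted values.

```python
def aipw_score(y: float, a_observed: int, action: int,
               mu_hat: float, propensity: float) -> float:
    """Gamma_i(action) = mu_hat + 1{A_i = action} / propensity * (y - mu_hat).

    mu_hat is the outcome-model prediction under `action`;
    propensity is the estimated Pr(A = action | X_i).
    """
    if not 0.0 < propensity <= 1.0:
        raise ValueError("propensity must lie in (0, 1] (positivity)")
    indicator = 1.0 if a_observed == action else 0.0
    return mu_hat + indicator / propensity * (y - mu_hat)


# Hypothetical person who was treated (A = 1) and scored Y = 2.0.
gamma_treat = aipw_score(y=2.0, a_observed=1, action=1, mu_hat=1.5, propensity=0.25)
gamma_control = aipw_score(y=2.0, a_observed=1, action=0, mu_hat=1.0, propensity=0.75)

print(gamma_treat)    # 1.5 + (1 / 0.25) * 0.5 = 3.5
print(gamma_control)  # no correction: person did not receive control, so 1.0
```

For the action the person did not receive, the indicator is zero and the score falls back on the outcome model alone.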
Under consistency, exchangeability, positivity, and correct specification of either the outcome model or the propensity model, averaging $\Gamma_i(a)$ estimates the mean potential outcome $\mathbb{E}[Y(a)]$. That is why $\Gamma_i(a)$ is useful for policy learning: it is an action-specific pseudo-outcome. In the simple lab case, that pseudo-outcome is also a pseudo-utility. If costs are included, the same logic applies after replacing $Y(a)$ with net utility $U(a)$.
This is the bridge from individual scores to policy value. For any candidate rule $d$, evaluate the score for the action that rule assigns to each person, then average:
$$ \widehat{V}(d) = \frac{1}{n}\sum_{i=1}^{n}\Gamma_i(d(X_i)). $$
The policy tree search compares candidate rules by this average. A split is useful when assigning different actions in the two child nodes raises $\widehat{V}(d)$ relative to a simpler rule. The tree therefore searches for a simple allocation rule with high estimated utility, and each leaf names the action — treat or control — that maximises that utility.
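To make the comparison concrete, the Python sketch below scores rules by their average assigned score and searches thresholds on a single covariate for a depth-1 rule. It is a toy stand-in under stated assumptions, not the policytree implementation: one covariate, six hypothetical people, and made-up score pairs $(\Gamma_i(0), \Gamma_i(1))$.

```python
def policy_value(rule, xs, scores):
    """Average the score of the action the rule assigns to each person."""
    return sum(scores[i][rule(x)] for i, x in enumerate(xs)) / len(xs)


def best_depth1_split(xs, scores):
    """Try each threshold on one covariate; each side takes its better action.

    Returns (value, threshold, action_left, action_right), where the left
    side is x <= threshold and the right side is x > threshold.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [s for x, s in zip(xs, scores) if x <= threshold]
        right = [s for x, s in zip(xs, scores) if x > threshold]
        total, actions = 0.0, []
        for side in (left, right):
            sums = [sum(s[0] for s in side), sum(s[1] for s in side)]
            action = 1 if sums[1] > sums[0] else 0  # ties go to control
            actions.append(action)
            total += sums[action]
        value = total / len(xs)
        if best is None or value > best[0]:
            best = (value, threshold, actions[0], actions[1])
    return best


# Hypothetical covariate values and (Gamma_i(0), Gamma_i(1)) pairs.
xs = [1, 2, 3, 4, 5, 6]
scores = [(1.0, 0.8), (1.0, 0.9), (1.0, 0.9), (1.0, 1.4), (1.0, 1.6), (1.0, 1.5)]

value, threshold, act_left, act_right = best_depth1_split(xs, scores)
treat_all = policy_value(lambda x: 1, xs, scores)
treat_none = policy_value(lambda x: 0, xs, scores)
print(value, threshold, act_left, act_right, treat_all, treat_none)
```

In this toy data the selected rule (control at or below the threshold, treat above it) has a higher estimated value than treating everyone or treating no one, which is what makes the split worth keeping.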
In this course we cap tree depth at two. Three reasons motivate the cap. First, at most three yes/no questions per rule means the logic fits on a slide for policy-makers or clinicians. Second, each leaf retains enough observations to yield a stable effect estimate, and stability matters more than precision when the audience is non-technical. Third, deeper trees increase computational complexity faster than they improve payoffs; the gain from a depth-3 tree over a depth-2 tree is usually small relative to the loss in clarity.
Pair exercise: reading a simple policy rule
Suppose a depth-2 policy tree identifies a strong-response region:
- First split on deprivation: high versus low.
- Among the high-deprivation group, split on baseline loneliness: high versus low.
- The largest estimated mean difference is in the high-deprivation, high-loneliness leaf.
In pairs:
- Draw this tree and label the high-response leaf.
- Write the high-response region as one sentence a community organiser could repeat.
- Suppose 40% of residents are high-deprivation, and half of those are high-loneliness. What share of all eligible residents falls in the high-response region? Show the multiplication.
- The lab reports strong-response regions and expected mean differences; it does not force a 20% treatment cap. Why is it still useful to know the approximate size of the region?
Reading a policy tree
Two visualisations work in tandem. The decision-tree diagram shows the rule abstractly: which covariate, which threshold, which leaf. The prediction-points scatter shows the rule as a partition of the underlying covariate space, with each individual coloured by the assigned action.
Read the two together. Where the scatter has many control-coloured points clustered against a treat boundary, the rule is treating individuals whose own predicted benefit is small but whose neighbourhood benefit is large. That is the cost of forcing a sharp threshold onto a smooth surface, and it is one reason a policy tree should never be the only output of an analysis. The underlying CATE estimates and the calibration test from week 8 carry information the tree discards.
Choosing tree depth
The depth budget is a design choice, not a property of the data. A depth-1 rule asks one question; a depth-2 rule asks two or three. Whether the extra depth pays off depends on whether the second-level splits carve out subregions in which the locally optimal action differs from the parent's recommendation.
Compare the two depths for community-group participation on purpose.
On the same held-out sample, the depth-1 policy delivers an estimated policy value of $0.215$ (the rule's expected outcome on the held-out fold, in the outcome's own units); the depth-2 policy delivers $0.243$, a point gain of $0.028$ (about thirteen percent above depth-1). The gain is real in the fitted comparison, but small relative to the additional complexity.
The course rule is prespecified. Under this parsimony rule, prefer the shallower tree unless the depth-2 point gain in held-out policy value clears the stated gain threshold. Uncertainty intervals, stability checks, equity considerations, and implementation burden then inform how confidently investigators should act on the selected rule. For a high-stakes clinical intervention, a threshold-clearing gain may justify the extra complexity only with strong validation. For a community programme that must be administered by volunteers, the simpler depth-1 rule may be preferable even when depth-2 clears the point-gain threshold, because the rule is more likely to be applied correctly in the field. A rule a worker mis-remembers is not the rule that was evaluated.
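The parsimony rule can be written as a short check. In the Python sketch below, the two policy values are the held-out figures reported above; the function name and both `gain_threshold` values are hypothetical, since the course threshold is not stated here.

```python
def choose_depth(value_depth1: float, value_depth2: float, gain_threshold: float) -> dict:
    """Prefer depth 1 unless the depth-2 point gain clears the prespecified threshold."""
    gain = value_depth2 - value_depth1
    return {
        "gain": gain,
        "relative_gain": gain / value_depth1,
        "depth": 2 if gain > gain_threshold else 1,
    }


# Held-out values from the comparison above; thresholds are illustrative only.
strict = choose_depth(0.215, 0.243, gain_threshold=0.05)
lenient = choose_depth(0.215, 0.243, gain_threshold=0.01)
print(strict["depth"], lenient["depth"], round(strict["relative_gain"], 2))
```

The same point gain of about 0.028 selects different depths under the two thresholds, which is why the threshold has to be fixed before the trees are compared.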
The full research workflow compares both depths automatically. Lab 9 shows depth-1 and depth-2 outputs so you can practise the same judgement: examine both trees, then justify the chosen depth in the methods section.
Pair exercise: depth-1 versus depth-2
- Look at the depth-1 and depth-2 policy trees above. State each rule in one sentence of plain language.
- The depth-2 rule lifts policy value by $0.028$ over the depth-1 rule. For $10,000$ eligible people, describe what a small average lift can mean at population scale.
- Name one reason to use the depth-1 rule despite the lift.
- Name one reason to use the depth-2 rule.
- State what extra evidence would make you more comfortable choosing the depth-2 rule.
Off-policy evaluation
A policy fitted to a sample will always look good on that sample. The serious test is held-out evaluation. The research workflow uses AIPW scores computed on a held-out fold, then estimates policy value and its sampling distribution by averaging over individuals' scores under the proposed rule. The estimator is doubly robust in the sense above; the standard errors come from a plug-in or bootstrap calculation that respects the sampling design.
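A plug-in version of that calculation is sketched below in Python. It assumes independent observations, no sample weights, and a normal approximation, so it is simpler than a calculation that respects a complex sampling design. The function name and the five held-out scores are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def heldout_policy_value(assigned_scores, level=0.95):
    """Mean of Gamma_i(d(X_i)) on a held-out fold, with a plug-in normal interval."""
    n = len(assigned_scores)
    estimate = mean(assigned_scores)
    se = stdev(assigned_scores) / math.sqrt(n)  # sample SD over root n
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate, se, (estimate - z * se, estimate + z * se)


# Hypothetical held-out scores for the action the rule assigns to each person.
scores_under_rule = [0.1, 0.4, 0.3, 0.2, 0.5]
estimate, se, (lower, upper) = heldout_policy_value(scores_under_rule)
print(round(estimate, 3), round(se, 3), round(lower, 3), round(upper, 3))
```

To compare a rule with a baseline such as treating no one, one option is to pass the per-person differences $\Gamma_i(d(X_i)) - \Gamma_i(0)$ instead, so the interval describes the gain over the baseline.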
Two consequences follow. First, the same rule can have different value
estimates depending on which fold is used; small samples make this drift
visible. The lab's stability analysis (margot_policy_tree_stability())
repeats the fit across many train-test splits to surface this
variability. Rules that change splits across resamples are flagged as
unstable, even if any single fit looks reasonable. Second, the policy
value's confidence interval reflects only sampling noise. Model
misspecification, residual confounding, and measurement error sit
outside the interval. Sensitivity analysis on the underlying causal
estimates and checks on the rule itself (below) carry the rest of the
inferential burden.
If the held-out value is statistically reliably above the no-treatment
baseline, the rule is worth considering. If the held-out value is
indistinguishable from random allocation, the rule is not. The
policy_value_audit that margot_policy_workflow() returns flags
both cases automatically.
Practical interpretation
A policy tree's splits are selected to maximise policy value, not to
identify causally privileged variables. If a depth-2 tree splits on
openness and neuroticism, that does not mean these variables are deep
causes of purpose; it means they help separate high-value and low-value
treatment regions under the reward objective supplied to policytree.
The same data with a different outcome could pick out an entirely
different splitter.
Conversely, a variable that does not appear in the tree is not
necessarily causally irrelevant. The greedy search may have found a
single composite that explains most of the heterogeneity, leaving
secondary variables unused. Variable-importance summaries from the
underlying causal forest (e.g. grf::variable_importance()) give a
complementary picture of what the forest used to estimate
$\hat{\tau}(x)$, even if those variables do not appear at the top of
any policy tree.
When you write up a policy-tree analysis, name the splitters, state the policy value with its confidence interval, and explicitly disclaim the causal interpretation of the splits. Readers who skim past those clarifications will otherwise treat the splitters as causes.
Heterogeneity as scientific discovery
CATE machinery maps treatment effects across a high-dimensional covariate space. It helps test whether our conventional categories (gender, age group, clinical severity) capture the differences that matter. Sometimes they do; often they do not. Discovering where the forest finds meaningful splits can generate fresh hypotheses about who responds and why, even when no policy decision is on the table. A forest that splits on loneliness rather than age, for example, suggests a hypothesis about social connection; it does not prove that loneliness is the causal source of the heterogeneity. This use of heterogeneity is exploratory, not confirmatory, and a follow-up study designed around the discovered subgroup is the proper test.
Why effect modifiers are descriptive
Heterogeneous treatment effect (HET) estimates are descriptions of how treatment contrasts vary across measured covariates. They are not causal effects of the covariates that appear in the forest or policy tree. A variable can be a useful effect modifier because it is correlated with a deeper cause, because it is downstream of a deeper cause, or because it is the best measured proxy available.
Consider a simple randomised experiment. The exposure $A$ affects the outcome $Y$, and $Z$ is the variable that directly helps explain why the treatment effect differs. If $Z$ also affects $G$, then $G$ can look like an effect modifier by proxy. The modifier describes where effects vary; it does not identify the root cause of that variation.
If investigators condition on $Z$, $G$ may no longer help describe effect heterogeneity. The apparent importance of a modifier is therefore relative to the other variables in the model.
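A small constructed population shows the pattern. In the Python sketch below only $Z$ changes the treatment effect, and $G$ is simply more common when $Z = 1$, as it would be if $Z$ affected $G$. The cell counts and effect sizes are hypothetical.

```python
# Each cell: (Z, G, number of people, true treatment effect).
# Only Z changes the effect; G is just more common when Z = 1.
cells = [
    (1, 1, 40, 1.0),
    (1, 0, 10, 1.0),
    (0, 1, 10, 0.0),
    (0, 0, 40, 0.0),
]


def average_effect(rows):
    """Population-weighted mean effect over the given cells."""
    total_n = sum(n for _, _, n, _ in rows)
    return sum(n * effect for _, _, n, effect in rows) / total_n


def effect_gap_by_g(rows):
    """Average effect among G = 1 minus average effect among G = 0."""
    g1 = [r for r in rows if r[1] == 1]
    g0 = [r for r in rows if r[1] == 0]
    return average_effect(g1) - average_effect(g0)


marginal_gap = effect_gap_by_g(cells)
gap_within_z1 = effect_gap_by_g([r for r in cells if r[0] == 1])
gap_within_z0 = effect_gap_by_g([r for r in cells if r[0] == 0])
print(round(marginal_gap, 2), gap_within_z1, gap_within_z0)
```

Ignoring $Z$, the effect looks much larger when $G = 1$; within each level of $Z$, the gap by $G$ is zero.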
The same point matters for allocation rules. Suppose a policy tree splits on deprivation. That split may be useful for prediction and allocation, but it does not prove that deprivation is the causal source of the different treatment response. If ethnicity or other upstream causes affect deprivation, and deprivation is the strongest measured marker in the data, the tree may split on deprivation while the deeper causal explanation lies upstream.
The practical rule is: use HETs for description, targeting, and hypothesis generation. Do not read a splitter as a cause without a separate design that identifies the causal effect of that splitter.
Fairness and public judgement
Efficiency is not the only consideration. A rule that maximises policy value can still be hard to explain, hard to apply, or unacceptable to the people affected by it. Three concerns recur.
Proxy variables can affect social groups differently. A split on deprivation, income, postcode, age, or baseline wellbeing may reproduce group differences even when group membership is not used explicitly. Removing an explicit variable does not remove its proxies.
Targeting concentrates resources on those who benefit most, by design. People who would benefit somewhat from the intervention receive nothing under a tight budget. That trade-off may be defensible, but it is a value choice. State it plainly rather than hiding it inside an objective function.
Statistical evidence can inform public judgement. It cannot make the judgement for us. Statisticians and psychologists can estimate benefits, harms, uncertainty, and subgroup patterns. Public allocation also depends on values that citizens and institutions debate and decide. The analyst's job is to make the trade-offs visible.
Before recommending a rule, investigators should check:
- Who gains and who loses?
- Are protected groups differentially affected through proxies?
- Does the rule reduce or worsen disparities?
- Can affected communities understand and contest the rule?
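One way to start on the proxy question is to tabulate the share treated within each group under the proposed rule. The Python sketch below does this for a rule that treats high-deprivation residents and never looks at group membership. The helper name, the group labels, and the eight residents are hypothetical.

```python
from collections import defaultdict


def treated_share_by_group(groups, decisions):
    """Share of each group assigned to treatment under a rule."""
    counts = defaultdict(lambda: [0, 0])  # group -> [treated, total]
    for group, decision in zip(groups, decisions):
        counts[group][0] += decision
        counts[group][1] += 1
    return {group: treated / total for group, (treated, total) in counts.items()}


# Hypothetical residents: (group label, high-deprivation flag).
residents = [
    ("A", 1), ("A", 0), ("A", 0), ("A", 0),
    ("B", 1), ("B", 1), ("B", 1), ("B", 0),
]
groups = [group for group, _ in residents]
decisions = [high_dep for _, high_dep in residents]  # rule: treat if high deprivation

shares = treated_share_by_group(groups, decisions)
print(shares)  # {'A': 0.25, 'B': 0.75}
```

The rule never uses the group label, yet the treated shares differ threefold because deprivation is distributed unevenly across the groups. Whether that difference is acceptable is a value judgement the table cannot settle.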
Pair exercise: fairness check
A policy tree treats residents who are high-deprivation and under 40.
- Translate the rule into plain language.
- Explain how the deprivation split can affect social groups differently even when group membership is not used.
- Name the table you would compute to check this empirically.
- Apply two checks from the list above.
- Name one value judgement the model cannot settle.
- Your partner says "the algorithm is objective because it only uses data." Counter this claim in two sentences.
Workflow for this week
- Estimate heterogeneity and targeting value (week 8 outputs).
- Fit shallow policy trees at multiple depths using the policytree default reward objective.
- Evaluate each policy out of sample with AIPW scores; report policy value with a confidence interval.
- Examine stability across train-test splits and reject unstable rules even when any single fit looks reasonable.
- Conduct a fairness check before recommending a rule.
- Report trade-offs: value, fairness, transparency, and feasibility.
Which question are we answering?
The methods in weeks 5-10 answer related but different questions. The sequence matters because each step changes the causal estimand, the statistical summary, and the decision that the evidence can support.
| Tool | Question | Main quantity | What it can support | What it does not settle |
|---|---|---|---|---|
| Outcome-wide ATE | Does one exposure appear to improve outcomes on average across a prespecified outcome family? | $\mathbb{E}[Y_k(1)-Y_k(0)]$ for each outcome $k$ | Whether the exposure has credible average evidence for one or more outcomes | Who should receive treatment |
| Prespecified group CATE | Do average effects differ across a group chosen before looking at results? | $\mathbb{E}[Y(1)-Y(0)\mid G=g]$ | Whether an interpretable group comparison deserves attention | Individual assignment or a complete allocation rule |
| Forest CATE ranking | Who is predicted to benefit more, conditional on baseline covariates? | $\hat{\tau}(X)$ | A ranking for possible targeting | A transparent public rule, or a fixed treatment share by itself |
| RATE / Qini | Does targeting by the CATE ranking improve outcomes over random or uniform allocation? | Targeting gain over the ranked population | Whether heterogeneity has practical value | Which simple rule should be used |
| Policy tree | What short allocation rule has high expected policy value? | $V(d)=\mathbb{E}[Y(d(X))]$ | A defensible, interpretable rule to consider | Whether the split variables are causes, or whether the rule is fair |
| Measurement checks | Do the outcomes and covariates mean what the analysis assumes? | Fit, invariance, construct, and measurement-error diagnostics | Whether interpretation needs qualification before policy use | The causal effect, or the fairness of the rule |
The table is also a writing guide. In a report, avoid presenting policy trees as though they answer the same question as an average treatment effect. The ATE asks whether the exposure helps on average. CATE summaries ask whether effects vary. RATE and Qini ask whether a ranking is useful for targeting. Policy trees ask whether a short rule can recover enough value to be worth considering. Measurement checks ask whether the variables inside the analysis support the interpretation being placed on them.
Return to the opening example
Back to the district health board. The question is not only "what rule
maximises sample gain?" The question is also whether the rule works out
of sample, can be explained, and is acceptable to the people who would
live with it. If there is a fixed budget, that constraint must be added
explicitly; it is not supplied by the default policytree call. That is
what moves a tree from a slide in a methods talk to a rule a clinic
might actually use.
The workflow from question to policy rule is now in place. One assumption has been present throughout but never examined: that our instruments measure the same construct across the groups we compare. Week 10 asks whether that assumption holds.
Pair exercise: policy tree versus ranking
- Strategy A ranks individuals by $\hat{\tau}(X_i)$ and, if a budget is fixed, could treat the top 20%. Strategy B fits depth-1 and depth-2 policy-tree candidates using the default reward objective, then selects the simpler rule unless depth-2 clears the prespecified gain threshold. The treated share is whatever the selected rule implies.
- Compare the two strategies on: (a) estimated policy value, (b) explainability to a non-technical audience, and (c) ability to answer "why was I selected?"
- State one scenario where Strategy A (pure ranking) is preferable.
- State one scenario where Strategy B (policy tree) is preferable.
- State one fairness question both strategies must answer before anyone uses them.
Lab materials: Lab 9: Policy Trees
Assessment checkpoint
We will reserve about 25 minutes today for the assessments.
- Test 2, 5 minutes. The test covers weeks 8-10: heterogeneous treatment effects, policy trees, and measurement. Bring one A4 sheet of notes. No devices.
- Test preparation, 10 minutes. The public study sheet and practice questions are linked from the resources section of the course book. Use them to practise short answers, not just definitions.
- Research report and Marsden option, 10 minutes. Option A should follow the current research-report template or the Google Drive mirror if already downloaded: choose religious_service or volunteer_work, then report effects on all four wellbeing outcomes. Historical examples can help you see tone and structure, but this year's report is narrower and uses simulated data. For Option B, the old Marsden example is a model of compact grant style only; follow the current criteria in Assessments.
Bulbulia, J. A. (2024). A practical guide to causal inference in three-wave panel studies. PsyArXiv Preprints. https://doi.org/10.31234/osf.io/uyg3d
Hoffman, K. L., Salazar-Barreto, D., Rudolph, K. E., & Díaz, I. (2023). Introducing longitudinal modified treatment policies: A unified framework for studying complex exposures. https://doi.org/10.48550/arXiv.2304.09460
Suzuki, E., Shinozaki, T., & Yamamoto, E. (2020). Causal Diagrams: Pitfalls and Tips. Journal of Epidemiology, 30(4), 153–162. https://doi.org/10.2188/jea.JE20190192
Sverdrup, E., Kanodia, A., Zhou, Z., Athey, S., & Wager, S. (2024). Policytree: Policy learning via doubly robust empirical welfare maximization over trees. https://CRAN.R-project.org/package=policytree
VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268–274. https://doi.org/10.7326/M16-2607
VanderWeele, T. J., & Mathur, M. B. (2020). Commentary: Developing best-practice guidelines for the reporting of e-values. International Journal of Epidemiology, 49(5), 1495–1497. https://doi.org/10.1093/ije/dyaa094
VanderWeele, T. J., Mathur, M. B., & Chen, Y. (2020). Outcome-wide longitudinal designs for causal inference: A new template for empirical studies. Statistical Science, 35(3), 437–466.
Week 10: Classical Measurement Theory from a Causal Perspective
Date: 13 May 2026
Key idea
Standard psychometric theories rest on strong, usually unstated, causal assumptions about how a construct and its indicators relate. These assumptions are especially likely to fail when measures are compared across cultures. Measurement is therefore part of causal identification, not a separate preliminary step.
Readings
Required
Optional
Key concepts
- EFA and CFA are model-building tools, not causal proofs.
- Invariance tests are associational diagnostics under a chosen measurement model, not tests of causal comparability.
- Reflective and formative equations need explicit causal interpretation.
- Measurement assumptions can open or close bias paths in DAGs.
Weeks 1 through 9 built a workflow from causal question to policy recommendation. Every step assumed that the outcomes we measure mean the same thing for every group in the target population. If they do not, contrasts between groups can reflect measurement artefact, not causal differences. This week examines that assumption.
Where we are in the heterogeneity sequence
- Week 6: define effect modification and CATE.
- Week 8: estimate CATE rankings and ask whether rankings contain useful targeting information.
- Week 9: turn modelled heterogeneity into interpretable policy trees.
- Week 10: ask whether the measured outcomes and covariates support the interpretation we place on the tree.
- Final assessment: report outcome-wide ATEs, then interpret policy trees as readable targeting summaries.
Week 10 closes the heterogeneity sequence by asking what the policy tree is made of. Its splits use measured covariates. Its value is judged on measured outcomes. Its fairness depends on whether those measurements behave comparably across groups. A clear tree can still mislead if the variables inside it do not mean what the analysis says they mean.
| Object | Question | Output |
|---|---|---|
| Outcome-wide ATEs | Which outcomes show credible average evidence? | Four-outcome ATE table or plot |
| CATE | For whom does the effect vary? | $\hat{\tau}(X)$ estimates |
| RATE / Qini | Does the ranking carry targeting value? | Diagnostic curves and summaries |
| Policy tree | Can we state a readable rule? | Depth-1 or depth-2 allocation rule |
| Measurement checks | Do the variables mean what the rule assumes? | Cautions about construct meaning and comparability |
Seminar
Motivating example
The Kessler-6 (K6) is widely used to screen psychological distress.
Two questions must be addressed before we try to compare scores causally across groups.
- Do the six items map to the same latent structure?
- Is that structure invariant across groups?
These questions are necessary, but they are not sufficient. Even if a measurement model fits well and invariance tests pass, causal interpretation still depends on a defended causal story about what the construct is and how it is measured.
This is why measurement belongs inside causal inference, not beside it. Measurement is part of identification: if a measure is unstable across the groups we compare, effect estimates and group contrasts can be distorted even when the adjustment set is otherwise defensible.
Classical validity and its limits
Psychology textbooks organise measurement quality around four types of validity.
Four classical validity types
- Content validity: the degree to which an instrument covers the intended domain.
- Construct validity: whether the construct is accurately defined and operationalised.
- Criterion validity: whether an instrument accurately predicts performance on an external criterion.
- Ecological validity: whether an instrument reflects real-world situations and behaviour.
These categories organise useful intuitions. From a causal perspective, each conflates problems that need to be kept separate.
Content validity asks whether items span the construct's domain. It does not specify the causal direction between construct and indicators. Does the construct cause the items (a reflective model), or do the items constitute the construct (a formative model)? Without stating the causal structure, "measures what it's intended to measure" has no formal content.
Construct validity bundles two separate questions. "Accurately defined" concerns the target quantity: what state of the world are we trying to capture? This is analogous to defining a causal estimand (which intervention, in which population, compared with what alternative). "Operationalised" concerns whether the same score means the same thing across individuals. This is a consistency question. Lumping both under one label obscures where a measurement fails.
Criterion validity is purely associational. An instrument can predict an outcome well for non-causal reasons: shared confounders, reverse causation, collider bias. "Predicts performance" tells you nothing about whether the instrument captures the construct that causally affects the criterion. Weeks 2 through 4 showed why prediction and causation are different questions. The same distinction applies here.
Ecological validity gestures at transportability without specifying what changes across settings. A causal framework asks: does the construct-to-indicator relationship hold in the target population? That question is testable through measurement invariance. "Reflects real-world situations" is not testable.
These four categories are qualitative checklists, not properties of a formal model. A causal approach specifies the directed graph connecting constructs to indicators, states the assumptions under which observed scores recover latent quantities, and tests those assumptions. The rest of this lecture shows what that looks like in practice.
Learning outcomes
By the end of this week, you should be able to:
- State what each classical validity type (content, construct, criterion, ecological) claims, and identify the causal assumptions each leaves implicit.
- Run exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) with clear model-comparison logic.
- Run configural, metric, and scalar/threshold invariance tests, and state what they do not establish.
- Explain why good fit does not prove a causal latent model.
- Explain why invariance tests do not deliver causal structure.
- Link measurement assumptions to DAG-based bias reasoning.
Part 1: practical workflow with K6
This lecture has two parts. Part 1 is the practical workflow you may be asked to run: exploratory and confirmatory factor analysis, then invariance testing. Part 2 is the causal reasoning that says what those tools can and cannot establish. Work the tools first, then read Part 2 before trusting them.
Step 1: prepare data and inspect factorability
library(margot)
library(tidyverse)
library(performance)
k6 <- df_nz |>
  filter(wave == 2018) |>
  select(
    kessler_depressed,
    kessler_effort,
    kessler_hopeless,
    kessler_worthless,
    kessler_nervous,
    kessler_restless
  )
check_factorstructure(k6)
Bartlett and KMO are entry checks. They do not validate causal interpretation.
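To see what these two entry checks compute, here is a minimal base-R sketch on simulated data. The six simulated items, their loadings, and the sample size are assumptions for illustration only; they are not the K6 data. Bartlett's test asks whether the correlation matrix differs from an identity matrix, and the overall KMO index compares squared correlations with squared partial correlations.

```r
# Minimal sketch: Bartlett's test of sphericity and the overall KMO index,
# computed by hand on simulated (hypothetical) data.
set.seed(434)
n <- 1000
p <- 6
eta <- rnorm(n)                              # one shared latent source
loadings <- c(0.8, 0.7, 0.7, 0.6, 0.6, 0.5)  # assumed values, for illustration
items <- sapply(loadings, function(l) l * eta + rnorm(n, sd = sqrt(1 - l^2)))
colnames(items) <- paste0("item", 1:p)

R <- cor(items)

# Bartlett: H0 is that R is an identity matrix (nothing shared to factor).
bartlett_chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
bartlett_df <- p * (p - 1) / 2
bartlett_p <- pchisq(bartlett_chisq, df = bartlett_df, lower.tail = FALSE)

# KMO: squared correlations relative to squared correlations plus
# squared partial correlations (off-diagonal elements only).
R_inv <- solve(R)
partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
off <- row(R) != col(R)
kmo <- sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))

c(chisq = bartlett_chisq, df = bartlett_df, p = bartlett_p, kmo = kmo)
```

Both quantities are functions of the correlation matrix alone, which is why they cannot speak to causal direction.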
Step 2: run EFA
library(psych)
library(parameters)
efa <- psych::fa(k6, nfactors = 3, rotate = "oblimin") |>
  model_parameters(sort = TRUE, threshold = "max")
efa
Oblique rotation is usually appropriate because psychological dimensions often co-vary.
Step 3: compare CFA candidates
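library(datawizard)
library(lavaan)
library(parameters) # efa_to_cfa()
library(performance)
set.seed(123)
# hold out 30% of rows: explore structure on train, confirm on test
parts <- data_partition(k6, proportion = 0.7, seed = 123)
# keep only the six items (data_partition() can add a row-id column)
train <- parts$p_0.7[names(k6)]
test <- parts$test[names(k6)]
# turn each EFA solution into lavaan model syntax
m1_syntax <- psych::fa(train, nfactors = 1) |> efa_to_cfa()
m2_syntax <- psych::fa(train, nfactors = 2) |> efa_to_cfa()
m3_syntax <- psych::fa(train, nfactors = 3) |> efa_to_cfa()
# fit each candidate to the held-out rows
m1 <- cfa(m1_syntax, data = test)
m2 <- cfa(m2_syntax, data = test)
m3 <- cfa(m3_syntax, data = test)
compare_performance(m1, m2, m3, verbose = FALSE)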
Read CFI, RMSEA, AIC, and BIC together. Prefer simpler models when fit
is comparable. Do not suppress convergence or Heywood warnings here. A
three-factor model on six items is near the limit of what the data can
identify, so any warnings m3 produces are part of the model-comparison
evidence, not noise to hide.
Step 4: test invariance across groups
We teach invariance testing because it is widely used and because you may be asked to use it in comparative work. Treat it as a descriptive diagnostic, not as a generator of causal insight. A multi-group CFA invariance test is conditional on a specified measurement model (usually reflective), a chosen parameterisation, and assumptions such as local independence (no causal relations among items once the latent variable is held fixed). Passing an invariance test does not show that the same causally relevant construct exists in both groups. It only shows that a particular statistical measurement model, with particular equality constraints, is compatible with the observed covariance structure. Failures and successes are both compatible with many causal stories. Causal structure is underdetermined by associations in the data.
In this lecture, causal assumptions come first: statistical tests cannot replace thinking about our assumptions. We must define the construct and state a causal measurement story. Only then can invariance tests play a role, by checking some statistical implications of that story.
For ordinal items, threshold invariance is the analogue of scalar invariance. In practice, fit an ordinal estimator (for example WLSMV) when items are Likert-type.
library(semTools)
k6_eth <- df_nz |>
  filter(wave == 2018, eth_cat %in% c("euro", "maori")) |>
  select(
    kessler_depressed,
    kessler_effort,
    kessler_hopeless,
    kessler_worthless,
    kessler_nervous,
    kessler_restless,
    eth_cat
  )
# invariance testing is run end-to-end in Lab 10.
# current semTools fits the configural / metric / scalar sequence with
# measEq.syntax() and compares the steps with compareFit(). the older
# measurementInvariance() wrapper has been removed and no longer runs.
- Configural invariance: same loading pattern.
- Metric invariance: same loadings.
- Scalar/threshold invariance: same intercepts (continuous) or thresholds (ordinal).
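The three levels can be sketched with lavaan's `group.equal` argument on simulated data. This is an illustrative alternative to the semTools route used in Lab 10, not the lab's code. The data, loadings, and the 0.5 intercept shift on `x4` are assumptions, and the indicators are treated as continuous for simplicity (ordinal items would call for thresholds and an ordinal estimator).

```r
library(lavaan)

set.seed(434)

# Simulate one group from a reflective model with chosen intercepts.
# All numeric values here are hypothetical.
simulate_group <- function(n, intercepts, group) {
  eta <- rnorm(n)
  lambda <- c(0.8, 0.7, 0.7, 0.6)
  x <- sapply(seq_along(lambda), function(i) {
    intercepts[i] + lambda[i] * eta + rnorm(n, sd = 0.6)
  })
  colnames(x) <- paste0("x", seq_along(lambda))
  data.frame(x, group = group)
}

# Same loadings in both groups; group "b" has a shifted intercept on x4.
dat <- rbind(
  simulate_group(1000, intercepts = c(0, 0, 0, 0), group = "a"),
  simulate_group(1000, intercepts = c(0, 0, 0, 0.5), group = "b")
)

model <- "distress =~ x1 + x2 + x3 + x4"

# Configural: same pattern, all parameters free across groups.
fit_configural <- cfa(model, data = dat, group = "group")
# Metric: loadings constrained equal.
fit_metric <- cfa(model, data = dat, group = "group",
                  group.equal = "loadings")
# Scalar: loadings and intercepts constrained equal.
fit_scalar <- cfa(model, data = dat, group = "group",
                  group.equal = c("loadings", "intercepts"))

# Nested likelihood-ratio comparisons, step by step.
lrt <- lavTestLRT(fit_configural, fit_metric, fit_scalar)
lrt
```

Because the simulation builds in an intercept shift, the scalar step should be rejected. In real data we do not know the generating process, so a rejection at that step describes the model's misfit and does not say why it occurs.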
How are these tests useful?
- They describe whether a chosen measurement model assigns similar roles to items across groups (within that model class). This can be a compact summary of group differences in the covariance structure.
- They can suggest where to look. If constraints fail, you learn which items or thresholds are most responsible, which can motivate substantive work (translation review, response-process interviews, item-by-item analysis, or redesign).
- They keep you honest about what is not identified. Even if every invariance test "passes", the causal meaning of the construct is not certified. If a test "fails", it does not tell you whether the problem is measurement bias or a real causal difference in the underlying state (for example, different causes of distress producing different item dynamics). To decide that, you need a causal story and often new data, not a better fit index.
Also note that item means can differ for many reasons that have nothing to do with biased measurement. If group A is more distressed than group B, we should expect different item means even under perfect measurement. Invariance testing describes whether a particular psychometric model is stable across groups. A failed test, on its own, does not establish whether a group difference is measurement artefact or a real difference in the underlying state.
Under the standard invariance interpretation, if scalar/threshold invariance fails then latent mean comparisons are not identified within that measurement model. In this course, treat this as a warning about interpretability under the assumed model, not as evidence that any observed group difference is "measurement bias" rather than a real difference in underlying causal reality.
Pair exercise: interpreting invariance results
- The K6 is tested across two ethnic groups. Configural invariance holds. Metric invariance holds. Scalar/threshold invariance fails on two items: "felt hopeless" and "felt worthless."
- State what each level of invariance means in plain language (same structure, same loadings, same intercepts/thresholds).
- State the standard invariance interpretation of scalar/threshold failure for group mean comparisons, then state the causal critique: why a cross-sectional associational test cannot decide whether the difference is measurement bias or a real difference in the underlying causal reality.
- Propose a hypothesis for why "felt hopeless" and "felt worthless" might function differently across groups (consider cultural norms, translation, different anchoring of response categories, or different causes of distress).
- Suppose all invariance tests passed. State one reason this would still not settle the causal question of whether "distress" is the same outcome across groups.
Part 2: how traditional measurement theory fails (for causal inference)
Part 1 showed the tools; Part 2 asks what they establish. Causal inference operates under assumptions, and for measurement it is causality all the way down. This follows from the potential outcomes framework: causal contrasts require well-defined outcomes under interventions, and constructed measures are themselves outcomes of causal processes (question wording, translation, response styles, incentives, and the world that generates the experiences being reported).
Classical psychometric checks (internal consistency, model fit, invariance tests) can organise associations. They do not, by themselves, evaluate the causal assumptions about direction and causal efficacy that are often imported into practice when we move from a measurement model to a causal claim. The causal question is not "does the model fit?" The causal question is "under which causal assumptions does this measured quantity behave like the variable in our DAG?"
Two ways of thinking about measurement in psychometric research
In psychometric research, formative and reflective models describe the relationship between latent variables and indicators. VanderWeele discusses this distinction, and its implications for causal inference with constructed measures, in the required reading (VanderWeele, 2022).
Reflective model (factor analysis)
In a reflective measurement model (an effect-indicator model), the latent variable is taken to cause the observed indicators. Each indicator is a reflection of the latent variable.
The reflective model is often written:
$$ X_i = \lambda_i\eta + \varepsilon_i. $$
Here, $X_i$ is an observed indicator, $\lambda_i$ is its loading, $\eta$ is a latent variable, and $\varepsilon_i$ is an error term. The equation is a statistical description. The stronger claim enters when we interpret it structurally: we treat $\eta$ as causally efficacious, and we treat the indicators as interchangeable reflections of it.
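As a worked step, and assuming the errors $\varepsilon_i$ are uncorrelated with $\eta$ and with each other (an assumption of the standard model, not something the data guarantee), the reflective equation implies the following covariance structure for the indicators:

$$ \operatorname{Var}(X_i) = \lambda_i^2\operatorname{Var}(\eta) + \operatorname{Var}(\varepsilon_i), \qquad \operatorname{Cov}(X_i, X_j) = \lambda_i\lambda_j\operatorname{Var}(\eta) \quad (i \neq j). $$

Factor analysis checks whether the observed covariances are close to this pattern. Any data-generating process that produces the same covariances will fit equally well, whatever its causal direction.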
Formative model (factor analysis)
In a formative measurement model (a cause-indicator model), the indicators are taken to give rise to, or determine, a (univariate) latent variable. Correlation or interchangeability between indicators is not required. Each indicator can contribute distinctively to the latent variable.
The formative model is often written:
$$ \eta = \sum_i \lambda_iX_i + \varepsilon. $$
Again, the equation is a statistical description. It is not an automatic statement about causal direction.
Statistical models versus structural interpretations
VanderWeele distinguishes a statistical model from a structural interpretation (VanderWeele, 2022). A statistical model describes patterns in the observed covariance structure. A structural interpretation adds causal claims about how the world generates those patterns.
The two diagrams below show structural assumptions that are often taken for granted when scale scores are then used as exposures, outcomes, or confounders in causal analyses.
Why fit is not enough
A well-fitting factor model can be compatible with multiple causal structures. Fit indices alone cannot establish that one latent variable causes all indicators.
This is the central discipline point for this lecture. Fit is about what the statistical model can represent. Identification is about whether, under stated assumptions, the data identify the causal contrast we want.
Problems with the structural interpretations of reflective and formative factor models
Even if we grant the reflective or formative equations as useful statistical summaries, cross-sectional data do not, by themselves, decide the direction of causation among latent variables and indicators (VanderWeele, 2022). This creates a problem because the standard structural interpretations of reflective and formative models are used implicitly across psychology.
The same statistical forms can be compatible with alternative causal stories in which indicators (or the realities they partially reflect) are causally efficacious for the outcome. The compatibility examples below illustrate the issue. The point is not that one of these diagrams is "true". The point is that fit alone does not decide among them.
There are other compatible structural interpretations as well. For example, the "latent" reality may be multivariate, with different constituents giving rise to different indicators, and only some constituents being causally efficacious for the outcome.
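A minimal simulation sketch in base R makes the point concrete. The loadings, effect sizes, and the two "stories" are assumptions chosen for illustration. In story A the latent state causes the outcome; in story B only the reality behind the first indicator does. The items are identical in both stories, so any one-factor fit statistic is identical too.

```r
set.seed(434)
n <- 20000
eta <- rnorm(n)
lambda <- c(0.8, 0.7, 0.7, 0.6, 0.6, 0.5)  # assumed loadings
items <- sapply(lambda, function(l) l * eta + rnorm(n, sd = sqrt(1 - l^2)))
colnames(items) <- paste0("x", 1:6)
dat <- data.frame(items)
dat$score <- rowMeans(items)

# Story A: the latent state is causally efficacious for the outcome.
dat$y_a <- 0.5 * eta + rnorm(n)
# Story B: only the first indicator is causally efficacious (hypothetical).
dat$y_b <- 0.5 * dat$x1 + rnorm(n)

# The measurement model sees only the items, so its fit cannot
# distinguish the two stories.
fit <- factanal(items, factors = 1)
fit$PVAL

# The scale score is associated with the outcome under both stories.
score_slopes <- c(
  story_a = coef(lm(y_a ~ score, data = dat))[["score"]],
  story_b = coef(lm(y_b ~ score, data = dat))[["score"]]
)

# Item-by-item regressions separate the stories in this simulation.
items_a <- coef(lm(reformulate(paste0("x", 1:6), "y_a"), data = dat))[-1]
items_b <- coef(lm(reformulate(paste0("x", 1:6), "y_b"), data = dat))[-1]

round(score_slopes, 2)
round(rbind(items_a, items_b), 2)
```

The item-level contrast is clean here only because the simulation has no confounding; with observational data it remains an association that needs its own causal defence.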
VanderWeele's key observation is that cross-sectional data can describe relationships, but they cannot conclusively determine causal direction. This is worrying because it means that many psychometric checks do not explicitly evaluate the causal assumptions that later causal claims rely upon (VanderWeele, 2022). VanderWeele also discusses longitudinal tests for structural interpretations of univariate latent variables that often do not support the simple causal stories that are presumed. We might describe the uncritical reliance on factor-model structural interpretations as one component of a wider "causal crisis" in the social sciences (Bulbulia, 2023).
Multiple versions perspective
A coarse score may combine multiple underlying states. This is a multiple-versions problem. We can still estimate useful associations, but interpretation must state what is being averaged and what is unidentified.
Review: multiple versions of treatment
The theory of multiple versions of treatment addresses the fact that real interventions are rarely uniform. Let $K$ denote the "true" versioned treatment and let $A$ be a coarsened indicator of $K$.
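Recall that a causal effect is defined as the difference in expected potential outcomes if everyone were exposed to one level of a treatment versus another. Under conditional exchangeability given covariates $L$, this contrast is identified by standardising over the distribution of $L$: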
$$ \delta = \sum_l \left( \mathbb{E}[Y\mid A=a,l] - \mathbb{E}[Y\mid A=a^*,l] \right) P(l). $$
Under the multiple-versions interpretation, we can express a consistent estimand in terms of $K$:
$$ \delta = \sum_{k,l} \mathbb{E}[Y_k\mid l] P(k\mid a,l) P(l) - \sum_{k,l} \mathbb{E}[Y_k\mid l] P(k\mid a^*,l) P(l). $$
This corresponds to a hypothetical randomised trial in which, within strata of $L$, the treated group receives versions $K$ drawn from the version distribution among those with $A=a$ and the control group receives versions drawn from the version distribution among those with $A=a^*$ (VanderWeele & Hernán, 2013).
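The estimand can be illustrated with a small simulation, offered as a sketch under strong simplifying assumptions: the coarse exposure $A$ is randomised, there are no covariates, and the version effects (0, 1, and 3) are invented for illustration. The contrast for $A$ then depends on which mix of versions the exposed group receives.

```r
set.seed(434)
n <- 100000

simulate_contrast <- function(p_version_2) {
  a <- rbinom(n, 1, 0.5)  # randomised coarse exposure
  # Version received: 0 if unexposed; 1 or 2 if exposed.
  k <- ifelse(a == 1, 1 + rbinom(n, 1, p_version_2), 0)
  effect_of_version <- c(0, 1, 3)  # assumed mean outcome shift for k = 0, 1, 2
  y <- effect_of_version[k + 1] + rnorm(n)
  mean(y[a == 1]) - mean(y[a == 0])
}

# Even mix of versions: analytic value is 0.5 * 1 + 0.5 * 3 = 2.
contrast_even <- simulate_contrast(p_version_2 = 0.5)
# Mostly version 1: analytic value is 0.9 * 1 + 0.1 * 3 = 1.2.
contrast_skewed <- simulate_contrast(p_version_2 = 0.1)

round(c(even = contrast_even, skewed = contrast_skewed), 2)
```

The same coarse exposure has different average effects when the version distribution differs, which is one reason contrasts may not transport across populations.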
Reflective and formative measurement models as multiple versions
VanderWeele suggests using this framework to interpret constructed measures of psychosocial constructs (VanderWeele, 2022). Roughly, if $A$ is a constructed measure from indicators $(X_1,\dots,X_n)$, then $A$ can be treated as a coarsened indicator of an underlying reality, and the multiple-versions logic can preserve causal interpretability under strong assumptions.
One way to express this is to replace $K$ with an underlying (possibly multivariate) reality $\eta$, and to treat changes in a constructed measure as shifting the distribution of $\eta$ versions:
$$ \delta = \sum_{\eta,l} \mathbb{E}[Y_\eta\mid l] P(\eta\mid A=a+1,l) P(l) - \sum_{\eta,l} \mathbb{E}[Y_\eta\mid l] P(\eta\mid A=a,l) P(l). $$
This offers a reason not to despair. But it is not a free pass. The interpretation remains obscure when we do not have a clear definition of what the causally relevant constituents of the construct are, and when we have not explicitly stated which causal assumptions connect indicators, measures, and outcomes.
VanderWeele's model of reality
VanderWeele concludes by arguing that traditional univariate reflective and formative models do not adequately capture the relations between underlying causally relevant phenomena and our indicators and measures. He argues that the causally relevant constituents of reality are almost always multidimensional, that measure construction should start from construct definition, and that structural interpretations should be tested rather than presumed (VanderWeele, 2022).
VanderWeele's argument can be summarised as the following propositions (VanderWeele, 2022).
- Traditional univariate reflective and formative models do not adequately capture the relations between causally relevant phenomena and indicators and measures.
- The causally relevant constituents of reality related to psychosocial constructs are almost always multidimensional, giving rise to indicators and to our language and concepts.
- Construct measurement should start from an explicit construct definition, from which items are derived and justified.
- The presumption of a structural univariate reflective model can impair measure construction, evaluation, and use.
- If a structural interpretation of a univariate reflective factor model is proposed, it should be tested rather than presumed; factor analysis alone is not sufficient evidence.
- Even when causally relevant constituents are multidimensional but a univariate measure is used, associations with outcomes can be interpreted using multiple versions of treatment theory, though interpretation is obscured without clarity about constituents.
- When data permit, examining associations item-by-item, or in conceptually related item sets, can provide insight into facets of the construct.
This is a compelling sketch. It is not yet a complete causal recipe. In particular, it is not a causal DAG in the sense we have used throughout the course, because the arrows are not yet a clear set of causal claims that we can test with d-separation. It motivates the question we care about in causal inference: what assumptions do we need to connect our constructed measures to the causal contrasts we want to estimate?
A pragmatic causal response: measurement error as a structural threat
We can bring this discussion back to the causal workflow by using causal diagrams to represent measurement dynamics. Let $\eta_A$ denote a "true" exposure state, $\eta_Y$ a "true" outcome state, and $\eta_L$ a "true" confounder state. Let $A_{f(X_1,\dots,X_n)}$, $Y_{f(X_1,\dots,X_n)}$, and $L_{f(X_1,\dots,X_n)}$ denote constructed measures (functions of indicators). Allow unmeasured sources of measurement error, $U_A$, $U_Y$, and $U_L$, to influence the constructed measures.
Read the diagram as a measurement-augmented causal model.
- The $\eta$ nodes are latent realities ($\eta_L$, $\eta_A$, $\eta_Y$). They are the states we would ideally intervene on and measure without error.
- The $var_{f(X_1,\dots,X_n)}$ nodes are constructed measures: functions of multiple indicators. Simpler measurement DAGs compress this notation, writing the true outcome state as $Y^\ast$ and the recorded measure as $Y$; in that shorthand $\eta_Y$ is $Y^\ast$ and $Y_{f(X_1,\dots,X_n)}$ is $Y$.
- The $U$ nodes are unmeasured sources of error in those constructed measures. They include stable reporting tendencies, transient mood at the time of survey completion, social desirability, and culturally patterned response styles.
The key edges have the following interpretations.
- $\eta_L \rightarrow L_{f(X_1,\dots,X_n)}$: the true confounder state affects its measured realisation.
- $\eta_A \rightarrow A_{f(X_1,\dots,X_n)}$: the true exposure state affects its measured realisation.
- $\eta_Y \rightarrow Y_{f(X_1,\dots,X_n)}$: the true outcome state affects its measured realisation.
- $U_L \rightarrow L_{f(X_1,\dots,X_n)}$, $U_A \rightarrow A_{f(X_1,\dots,X_n)}$, $U_Y \rightarrow Y_{f(X_1,\dots,X_n)}$: unmeasured error sources distort each constructed measure. In the strongest language, our measures "see as through a mirror, in darkness" relative to the underlying reality they hope to capture.
- Correlated errors: $U$ nodes may share common causes, so error in one domain can correlate with error in another (for example, a general tendency to present oneself favourably affects multiple self-reports).
- Directed errors: true states can affect how other variables are measured (for example, exercise might change how people interpret distress items), creating pathways from $\eta_A$ into $U_Y$.
The utility of describing measurement dynamics using causal graphs is that we can see when measurement itself creates new paths. The act of conditioning on measured variables can introduce collider bias when both true states and measurement errors feed into the measured nodes. When unmeasured (multivariate) psycho-physical states are related to unmeasured sources of error in the measurement of those states, measurement can open pathways to confounding.
One key warning is that measurement error opens additional pathways to confounding either when errors are correlated or when the exposure causally affects the error in the measured outcome.
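Both mechanisms can be sketched in a few lines of base R. In this simulation the true exposure state has no effect on the true outcome state, yet the measured variables are associated. The variable names, the shared "style" term, and all coefficients are assumptions for illustration.

```r
set.seed(434)
n <- 20000
eta_a <- rnorm(n)  # true exposure state
eta_y <- rnorm(n)  # true outcome state, independent of eta_a: true effect is zero

# Correlated errors: a shared response style enters both measures.
style <- rnorm(n)
a_corr <- eta_a + style + rnorm(n, sd = 0.5)
y_corr <- eta_y + style + rnorm(n, sd = 0.5)

# Directed error: the true exposure state shifts how the outcome is reported.
a_dir <- eta_a + rnorm(n, sd = 0.5)
y_dir <- eta_y + 0.5 * eta_a + rnorm(n, sd = 0.5)

slopes <- c(
  true_states = coef(lm(eta_y ~ eta_a))[["eta_a"]],
  correlated_error = coef(lm(y_corr ~ a_corr))[["a_corr"]],
  directed_error = coef(lm(y_dir ~ a_dir))[["a_dir"]]
)
round(slopes, 2)
```

Only the first slope reflects the causal structure among the true states; the other two arise entirely from how the variables are measured.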
Confounding control by baseline measures in three-wave panels
One pragmatic design response is to measure baseline values of exposure and outcome (and key confounders), then estimate effects forward in time using a three-wave panel. Adjusting for each variable's own baseline value is called lagged-self adjustment.
- This design adjusts for baseline measurements of both exposure and outcome.
- Understanding this approach in the context of potential directed and correlated measurement errors clarifies its strengths and limitations.
- Baseline measures can reduce the chance that unmeasured sources of measurement error are correlated with later changes in exposure and outcome.
- For example, if individuals have a stable social desirability bias, adjusting for baseline measures absorbs much of it. To create new bias, the reporting tendency would need to change after baseline in a way that the baseline values do not predict.
- However, we cannot eliminate the possibility of new bias development, nor directed effects of exposure on outcome reporting.
- Attrition and non-response can create additional directed measurement structures.
- Despite these challenges, including baseline exposure and outcome measures should be standard practice in multi-wave studies because it reduces the likelihood of novel confounding.
- Because we can never be certain the assumptions hold, we should still perform sensitivity analyses.
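The points above can be sketched with a deliberately stripped-down simulation. Every measure is driven by one stable reporting tendency plus noise, and the true effect of the wave-1 exposure on the wave-2 outcome is zero. The variable names and the data-generating process are assumptions for illustration, not a model of any real panel.

```r
set.seed(434)
n <- 20000
style <- rnorm(n)        # stable reporting tendency (hypothetical)
a0 <- style + rnorm(n)   # baseline (wave 0) exposure measure
y0 <- style + rnorm(n)   # baseline (wave 0) outcome measure
a1 <- style + rnorm(n)   # wave-1 exposure measure
y2 <- 0 * a1 + style + rnorm(n)  # wave-2 outcome measure; true effect is zero

# Without baseline adjustment, the shared tendency looks like an effect.
unadjusted <- coef(lm(y2 ~ a1))[["a1"]]
# Lagged-self adjustment: condition on baseline exposure and outcome.
adjusted <- coef(lm(y2 ~ a1 + a0 + y0))[["a1"]]

round(c(unadjusted = unadjusted, adjusted = adjusted), 2)
```

In this setup the adjusted slope is closer to zero but not zero, because the baseline measures are noisy proxies for the reporting tendency. That matches the list above: baseline adjustment reduces bias, and sensitivity analysis is still needed.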
Pair exercise: fit is not identification
- A one-factor confirmatory factor analysis (CFA) of six K6 items yields CFI = 0.98 and RMSEA = 0.03.
- A colleague claims "the good fit confirms that distress causes all six responses." Evaluate this claim with reference to VanderWeele (2022).
- Choose two of the diagrams above that are compatible with the same statistical factor model, and state what causal assumption differs between them.
- Explain why the choice matters for causal inference downstream (hint: consider what happens when you use the scale score as a confounder or outcome in a DAG).
Return to the opening example
Back to K6.
A total score can still be useful for screening. But without defended structure and invariance, we should avoid strong causal claims about cross-group latent differences.
Our job as investigators is to separate what the model fits from what the design identifies. That discipline applies directly to the final assignment. Outcome-wide ATEs require outcomes that mean what the report says they mean. Policy trees require covariates whose split points can be interpreted without turning proxies into causes. Fairness checks require attention to whether measurement differs across groups before a rule is treated as publicly defensible.
Pair exercise: measurement as an identification problem
- Explain to your partner how scalar invariance failure distorts conditional average treatment effect (CATE) estimates even when exchangeability and positivity hold.
- A colleague says "the K6 has been validated in hundreds of studies, so measurement is not a concern." Counter this claim in two sentences, distinguishing internal consistency from cross-group invariance.
- Propose a workflow step that belongs between drawing the DAG and running estimation, specifically to check measurement assumptions. State what it tests and what a failure would change about the analysis.
Lab materials: Lab 10: End-to-End Research Report
Appendix: VanderWeele measurement lectures
These two lectures are optional supplements for students who want the measurement argument in seminar form. They develop the same core point as the required reading: constructed measures need explicit definitions, item choices, and causal interpretation before they can be used safely in causal analyses.
- Tyler VanderWeele, Constructed Measures and Causal Inference: Towards a New Model of Measurement, Johns Hopkins Causal Inference Working Group. The linked timestamp begins in the measurement-model discussion.
- Tyler VanderWeele, Causal Inference and Measure Construction: Towards a New Model of Measurement, Online Causal Inference Seminar. The linked timestamp begins near the practical implications for measure construction.
Bulbulia, J. A. (2023). A workflow for causal inference in cross-cultural psychology. Religion, Brain & Behavior, 13(3), 291–306. https://doi.org/10.1080/2153599X.2022.2070245
Fischer, R., & Karl, J. A. (2019). A primer to (cross-cultural) multi-group invariance testing possibilities in R. Frontiers in Psychology, 10, 1507.
Harkness, J. A., Van de Vijver, F. J., & Johnson, T. P. (2003). Questionnaire design in comparative research. Cross-Cultural Survey Methods, 19–34.
Harkness, J. [et. al]. (2003). Questionnaire translation. In Cross-cultural survey methods (pp. 35–56). Wiley.
He, J., & van de Vijver, F. (2012). Bias and equivalence in cross-cultural research. Online Readings in Psychology and Culture, 2(2). https://doi.org/10.9707/2307-0919.1111
VanderWeele, T. J. (2022). Constructed measures and causal inference: Towards a new model of measurement for psychosocial constructs. Epidemiology, 33(1), 141–151. https://doi.org/10.1097/EDE.0000000000001434
VanderWeele, T. J., & Hernán, M. A. (2013). Causal inference under multiple versions of treatment. Journal of Causal Inference, 1(1), 1–20.
Week 11: In-Class Test 2 (20%)
20 May 2026 (w11)
This week is the second in-class test, covering material from weeks 8–10.
What is covered
- Heterogeneous treatment effects and machine learning (week 8)
- Resource allocation and policy trees (week 9)
- Classical measurement theory from a causal perspective (week 10)
Format
- Duration: 50 minutes (allocated time: 1 hour 50 minutes)
- Closed book, one A4 sheet of notes permitted, no devices
- Bring a pen or pencil
- Location: EA120 (the usual seminar room)
Reminders
- You may bring in one A4 sheet with notes.
- No electronic devices permitted during the test.
- Arrive on time. The test begins at 14:10.
- Write clearly; illegible answers cannot be marked.
Week 12: Student Presentations (10%)
27 May 2026
This week
Each student presents a concise summary of their research report.
Format
- 10 minutes per presentation.
- One "panel" question afterwards. You may ask one brief clarifying question before answering.
- Whiteboard and paper notes only. No slides, handouts, or devices.
- Location: EA120.
Required structure
- Title and motivation (what is it, so what).
- Causal question, target population, exposure, and outcomes.
- A simple causal diagram showing your identification strategy.
- Causal estimand and analysis plan (what you will estimate, and how).
- One focal limitation or risk, and how you will address it.
Grading
The talk is graded against the Presentation Rubric, which sets out four equally weighted criteria — clarity and structure, causal reasoning, "so what", and response to the panel question — each scored in one of four bands.
Practical checklist
Preparation
- Plan the whiteboard layout before the talk; treat the board as three columns (DAG, estimand, result).
- Define every acronym at first use.
- Rehearse timing. Ten minutes is short. Aim for eight to leave breathing room.
- Prepare a one-sentence "so what" you can deliver from memory.
Research report due
The research report is due Friday 29 May (end of Week 12). Submit one PDF with an R code appendix via Nuku.
Lab Setup: R Packages and Build Tools
Set up your computer before the lab. Package installation can take time, and some errors require system tools that cannot be fixed quickly during class.
What You Need
You need five things, in this order:
- R and RStudio.
- System build tools for your operating system.
- Quarto's TinyTeX tool for PDF rendering.
- Core course packages from CRAN.
- Course packages from GitHub.
Restart RStudio after installing system build tools or GitHub packages.
Step 1: Install R and RStudio
Install R first, then install RStudio.
- R: https://cran.r-project.org/
- RStudio Desktop: https://posit.co/download/rstudio-desktop/
Use the current release unless you have been told otherwise.
Step 2: Install System Build Tools
Some R packages need to compile code. If your error mentions make,
gcc, g++, clang, or compilation, this step is probably missing.
macOS
Install Xcode Command Line Tools (Xcode CLT). In Terminal, run:
xcode-select --install
Follow the prompts, then restart RStudio. You do not need the full Xcode app.
Windows
Install Rtools for your version of R:
https://cran.r-project.org/bin/windows/Rtools/
Restart RStudio, then check:
Sys.which("make")
If this returns a path, R can find the build tools.
Linux
Install the usual build tools. On Ubuntu or Debian:
sudo apt install build-essential
Restart RStudio afterwards.
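Before installing anything, it can help to see which build tools your terminal can already find. The following is a minimal sketch, assuming a POSIX-style shell (Terminal on macOS or Linux, Git Bash on Windows); check_tools is a hypothetical helper name, not part of any course package.

```shell
# check_tools (hypothetical helper): report whether each named tool is on the PATH.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s: found at %s\n' "$tool" "$(command -v "$tool")"
    else
      printf '%s: missing\n' "$tool"
    fi
  done
}

# The tools named in the compilation errors described in this step.
check_tools make gcc g++ clang
```

R typically needs make plus one working compiler, so not every line has to say "found". If make is missing, return to the instructions for your operating system above.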
Step 3: Install TinyTeX for Quarto PDFs
The final report must render to PDF. Installing the R package called
tinytex is not enough: Quarto also needs a TeX distribution. Quarto
recommends TinyTeX for PDF documents, installed from the system command
line.
Run this in a terminal, not in the R console:
quarto install tinytex
Then check that Quarto can see it:
quarto list tools
If TinyTeX is already installed but Quarto says it is out of date, run:
quarto update tinytex
macOS
Open Terminal (or the RStudio Terminal tab) and run:
quarto install tinytex
quarto list tools
If quarto is not found, install or update Quarto from https://quarto.org/docs/download/.
Windows
Open Command Prompt, PowerShell, or the Terminal tab in RStudio. Run:
quarto install tinytex
quarto list tools
Do not run these commands in the R console. If Windows says quarto is not recognised, install Quarto from https://quarto.org/docs/download/, close and reopen RStudio, then try again. If the PDF render later says lualatex or a LaTeX package is missing, rerun quarto install tinytex and render again.
Linux
Run the same command in your terminal:
quarto install tinytex
A system TeX Live installation can also work, but TinyTeX is the supported route for this course.
Step 4: Install CRAN Packages
Run this in R:
cran_packages <- c(
"ggplot2", "dplyr", "tibble", "tidyr", "purrr", "grf",
"arrow", "qs2", "googledrive", "pak", "rmarkdown", "knitr",
"kableExtra", "tinytex", "ggdag", "dagitty"
)
missing_cran <- cran_packages[
!vapply(cran_packages, \(p) requireNamespace(p, quietly = TRUE), logical(1))
]
if (length(missing_cran) > 0) {
install.packages(missing_cran)
}
Step 5: Install GitHub Packages
Use pak first. It is the recommended installer because it resolves
dependencies well and gives useful diagnostics. It still needs the
system build tools above when packages must be built from source.
Recommended: pak
pak::pak("go-bayes/margot")
pak::pak("go-bayes/causalworkshop")
pak::pak("go-bayes/boilerplate")
Fallback: remotes
If pak itself fails, try remotes:
install.packages("remotes")
remotes::install_github("go-bayes/margot", upgrade = "never")
remotes::install_github("go-bayes/causalworkshop", upgrade = "never")
remotes::install_github("go-bayes/boilerplate", upgrade = "never")
If the error mentions make, gcc, g++, clang, or compilation,
changing installers will not solve the problem. Install the system build
tools in Step 2, restart RStudio, then rerun the pak commands.
Step 6: Check the Installation
Restart RStudio, then run:
packages <- c(
"ggplot2", "dplyr", "tibble", "arrow", "qs2", "googledrive",
"grf", "margot", "causalworkshop", "boilerplate"
)
status <- vapply(packages, requireNamespace, quietly = TRUE, logical(1))
print(status)
Every entry should be TRUE.
Check the course package version:
packageVersion("causalworkshop")
You need version 0.6.0 or later for Labs 8-10.
If Installation Fails
Read the first real error message, not only the final line. Common meanings:
- quarto: command not found or 'quarto' is not recognized: install Quarto, restart RStudio, and run quarto install tinytex from a terminal.
- lualatex: command not found, pdflatex: command not found, or missing LaTeX packages while rendering PDF: run quarto install tinytex from a terminal, then render again.
- make: command not found: install Xcode CLT on macOS or Rtools on Windows.
- gcc, g++, or clang missing: install system build tools.
- cannot open URL or timeout: check the internet connection, then try again.
- causalworkshop was upgraded while loaded: restart RStudio, then rerun the lab.
If the GitHub package installation still fails, use the course lab machine or the pre-installed lab environment. Do not spend class time debugging compilers.
Lab 1: Git and GitHub
This session introduces version control with Git and GitHub. Setting up these tools first means you can track your work from day one.
Week 1 software requirement
Bring your laptop in week 1 and confirm Git/GitHub access. Install both R and RStudio in week 1, and use RStudio as the standard editor for this course. Instructions are in Lab 2: Install R and Set Up Your IDE.
What is version control?
Version control tracks every change you make to your files. Instead of
saving e.g. report_v2_final_FINAL.docx, you save a single file and Git
remembers its entire history. You can go back to any previous version,
see exactly what changed, and collaborate without overwriting each
other's work.
GitHub is a website that hosts your Git repositories online. It serves as a backup and lets you share your work. There are other services like GitLab and Bitbucket, but GitHub is the most popular.
Why bother?
- Your lab diary and final report will be easier to manage.
- You will never lose work (every version is saved).
- Employers value version control skills.
- It is the standard tool for reproducible research -- much more powerful than OSF presentations because every change is tracked and time-stamped.
Step 1: Create a GitHub account
- Go to https://github.com.
- Click Sign up and follow the prompts. (Get the student version -- see below).
- Choose a username you are happy to use professionally (e.g., jsmith-nz, not xXx_gamer_xXx).
- Verify your email address.
Student benefits
Apply for the GitHub Student Developer Pack with your university email. It includes free access to GitHub Copilot, cloud credits, and other developer tools.
Step 2: Install Git
macOS
Open Terminal (search for it in Spotlight) and type:
git --version
If Git is not installed, macOS will prompt you to install the Xcode Command Line Tools. Follow the prompts.
Windows
Download Git from https://git-scm.com/download/win. Run the installer and accept the defaults.
Verify installation
Open a terminal (Terminal on macOS, Git Bash on Windows) and type:
git --version
You should see something like git version 2.44.0.
Step 3: Configure Git
Tell Git your name and email (use the same email as your GitHub account):
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
Step 4: Set up SSH authentication
Before you can push code, GitHub needs to verify who you are. SSH keys let your computer prove its identity without a password each time. You do this once and it works from then on.
Check whether you already have a key
Many students already have an SSH key from a previous course,
internship, or Git setup. Before creating a new one, check what is
already in your ~/.ssh/ folder:
ls -la ~/.ssh
If you see files such as id_ed25519 and id_ed25519.pub, you may
already have a usable key. If you are unsure, it is fine to create a
fresh key for this course. GitHub allows more than one SSH key on an
account.
Generate an SSH key
Open a terminal and run:
ssh-keygen -t ed25519 -C "your.email@example.com"
Use the email address attached to your GitHub account.
You will see prompts like these:
Generating public/private ed25519 key pair.
Enter file in which to save the key (/Users/yourname/.ssh/id_ed25519):
Press Enter to accept the default location unless you already have a
key there and want to keep it. If you already have an id_ed25519 key
and want a separate course key, you can type a different file name such
as ~/.ssh/id_ed25519_psyc434.
Next you will be asked about a passphrase:
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
You may press Enter twice to skip it. A passphrase adds security, but it also means you may need to unlock the key when you restart your computer.
This creates two files in ~/.ssh/:
| File | What it is | Share it? |
|---|---|---|
| id_ed25519 | Your private key | Never. Do not copy, email, upload, or commit this file. Anyone who has it can impersonate you. |
| id_ed25519.pub | Your public key | Yes. This is what you give to GitHub. |
Protect your private key
The file ~/.ssh/id_ed25519 (no .pub) is your private key. Treat it like a password. Never paste it into a chat, never commit it to a repository, never upload it anywhere. If you suspect it has been exposed, delete the key from your GitHub account immediately at github.com/settings/keys and generate a new one.
Confirm that the files were created
Run:
ls -l ~/.ssh
You should see both the private key and the public key. If you saved the
key under a custom name, look for that name instead of id_ed25519.
Add the key to the SSH agent
The SSH agent remembers your key so Git can use it automatically.
Start the agent:
eval "$(ssh-agent -s)"
Then add your key:
ssh-add ~/.ssh/id_ed25519
If you used a custom file name, replace id_ed25519 with that name.
On some macOS systems, the following command is preferred because it stores the passphrase in the keychain:
ssh-add --apple-use-keychain ~/.ssh/id_ed25519
If that command gives an error, use the plain
ssh-add ~/.ssh/id_ed25519 command instead.
Add the public key to your GitHub account
Copy the public key (the file ending in .pub) to your clipboard.
macOS:
pbcopy < ~/.ssh/id_ed25519.pub
Windows (Git Bash):
cat ~/.ssh/id_ed25519.pub
Select and copy the output manually.
If pbcopy does not work on your system, you can also print the key
manually:
cat ~/.ssh/id_ed25519.pub
Copy the entire line beginning with ssh-ed25519.
Then add it to GitHub:
- Go to github.com/settings/keys.
- Click New SSH key.
- Give it a title (e.g., "My laptop").
- Paste the key into the Key field.
- Click Add SSH key.
Test the connection
ssh -T git@github.com
The first time, you will usually see a message like:
The authenticity of host 'github.com (IP ADDRESS)' can't be established.
ED25519 key fingerprint is ...
Are you sure you want to continue connecting (yes/no/[fingerprint])?
Type yes and press Enter. You should then see:
Hi your-username! You've successfully authenticated, but GitHub does not provide shell access.
That message means it worked. All future git push and git pull
commands will authenticate automatically.
Common problems
If you get Permission denied (publickey)
This usually means one of four things:
- You copied the wrong key to GitHub.
- You copied the private key instead of the public key.
- Your key was not added to the SSH agent.
- You are using a different GitHub account from the one attached to the key.
Work through these checks:
ls -l ~/.ssh
ssh-add -l
cat ~/.ssh/id_ed25519.pub
Then compare the printed public key with the one shown at github.com/settings/keys.
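One of the causes listed above, pasting the private key instead of the public key, can be checked without printing the whole file. The following is a rough sketch, assuming OpenSSH's usual file layout: a public key is a single line that starts with its key type, and a private key starts with a BEGIN ... PRIVATE KEY line. key_kind is a hypothetical helper name.

```shell
# key_kind (hypothetical helper): read only the first line of a file and
# report which half of a key pair it looks like.
key_kind() {
  first_line=$(head -n 1 "$1")
  case "$first_line" in
    ssh-ed25519\ *|ssh-rsa\ *|ecdsa-sha2-*) echo "public" ;;
    "-----BEGIN "*"PRIVATE KEY-----")       echo "private" ;;
    *)                                      echo "unknown" ;;
  esac
}

# Example: check the default public key file, if it exists.
if [ -f "$HOME/.ssh/id_ed25519.pub" ]; then
  key_kind "$HOME/.ssh/id_ed25519.pub"
fi
```

Only a file reported as public should ever be pasted into github.com/settings/keys.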
If ~/.ssh/id_ed25519.pub does not exist
The key pair was not created where you think it was. Run ls -la ~/.ssh
and look for the file name you chose during ssh-keygen.
If you see Agent admitted failure to sign
Run:
ssh-add ~/.ssh/id_ed25519
and then test again with ssh -T git@github.com.
If ssh -T hangs or fails
Some university networks block SSH. If you are on campus Wi-Fi and the command hangs, try from a different network (e.g., your phone hotspot). If it still fails, contact the course coordinator.
Step 5: Create a private repository on GitHub
- Go to github.com/new.
- Name your repository psy434-labs (or similar).
- Set visibility to Private.
- Check Add a README file.
- Click Create repository.
Privacy and submission
Your GitHub repository must remain private for the duration of the course. Do not change its visibility to public. Lab diaries are submitted as .md files via NUKU, not through GitHub. GitHub is your version-control tool; NUKU is where marking happens. Submission instructions for each lab diary appear on the NUKU assignment page.
Step 6: Clone the repository to your computer
Cloning downloads a copy of the repository to your machine and links it to GitHub.
- On your repository page, click the green Code button.
- Select SSH and copy the URL (it starts with git@github.com:).
- Open a terminal and navigate to where you want to store your work:
mkdir -p ~/Documents/psy434
cd ~/Documents/psy434
- Clone the repository:
git clone git@github.com:YOUR-USERNAME/psy434-labs.git
Replace YOUR-USERNAME with your GitHub username.
- Move into the repository folder:
cd psy434-labs
You now have a local copy linked to GitHub.
Sanity check that you are in the right place:
pwd
ls
You should see your location end with psy434-labs, and you should see
the files in your repository (it may be empty at first).
If you are new to Terminal
These commands help you check where you are and what files you have:
pwd # print the current folder (your location)
ls # list files in the current folder
cd .. # go up one folder
cd ~ # go to your home folder
If you ever see an error like "No such file or directory", run pwd and ls to check your location and spelling.
Checkpoint
If you have cloned your repository successfully, you are on track. Everything below can be finished before next week's lab if you run out of time.
Step 7: The basic Git workflow
The everyday workflow has three steps: stage, commit, push.
Recommended editor
RStudio is the recommended editor for creating and editing .md, .R, and .qmd files. It reduces file-extension errors (for example, accidentally saving README.md.txt) and all lab instructions assume RStudio. Install instructions are in Lab 2. You may use another editor, but you will need to adapt instructions yourself.
1. Create your first file
Your repository already has a README.md from Step 5. Open it and
replace its contents with:
# PSYC 434 lab diary
This is my lab diary for PSYC 434 (2026).
Save the file in the root of your repository folder (psy434-labs/).
Use RStudio if available (File > New File > Text File), then save as
README.md.
Windows file extensions
If you create the file in File Explorer, make sure it is named README.md (not README.md.txt). If you cannot see extensions, turn on "File name extensions" in File Explorer.
Creating a file from the command line
If you are already in your repository folder, you can create an empty file with:
touch README.md
Then open it in a text editor and paste in the text above. If you are not sure you are in the right place, run pwd and check that the folder name ends with psy434-labs.
2. Check what changed
Before staging, check what Git sees:
git status
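You should see README.md in red. If you edited the README.md that GitHub created in Step 5, it appears under "Changes not staged for commit"; if you created the file from scratch, it appears under "Untracked files". This command is your best friend: run it whenever you are unsure what state your repository is in.
For a newly created file, the output looks something like this: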
On branch main
Untracked files:
(use "git add <file>..." to include in what will be committed)
README.md
nothing added to commit but untracked files present
Read this output slowly:
- On branch main tells you which branch you are working on.
- Untracked files means Git sees the file, but is not yet tracking it.
- The suggested git add command tells you the next step.
If you see "not a git repository"
You are not in your repository folder. Run pwd and ls, then cd psy434-labs and try git status again.
3. Stage the change
Staging tells Git which changes you want to include in your next snapshot:
git add README.md
Run git status again immediately afterward:
git status
Now README.md should appear in green under a heading such as
Changes to be committed. That means the file is staged and ready for
the next commit.
You can inspect exactly what is staged with:
git diff --staged
This command shows the line-by-line changes that will go into the commit. It is a good habit to check this before every commit.
Safety: always name the files you are staging
git add README.md stages one file. You know exactly what will be committed. Commands like git add . or git add -A stage everything Git can see in the current directory and below. If you run one of these from the wrong folder, you can accidentally stage thousands of files, including private data, SSH keys, or your entire home directory. One student ran git add .. (note the two dots, meaning the parent folder) and began uploading gigabytes of data.
Rules:
- Always run pwd first. Confirm you are inside your repository folder.
- Always run git status before committing. Check that only the files you expect appear in green.
- Prefer naming files explicitly (e.g., git add README.md diaries/lab-01.md). Use git add . only when you have just checked git status and every listed file belongs in the commit.
- If you see hundreds of files listed in git status, stop. You are probably in the wrong directory. Do not commit. Run pwd, then cd to the right folder.
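The rules above can be turned into a quick pre-staging check. This is a sketch only: pending_count and check_before_staging are hypothetical helper names, and the default limit of 20 paths is an arbitrary assumption, not a course requirement.

```shell
# pending_count (hypothetical helper): print how many paths Git reports as
# changed or untracked, counting every untracked file individually.
pending_count() {
  git status --porcelain --untracked-files=all | wc -l | tr -d ' '
}

# check_before_staging (hypothetical helper): refuse to continue outside a
# repository, or when more paths are pending than the given limit.
check_before_staging() {
  limit=${1:-20}   # assumed threshold; choose one that suits your repository
  if ! git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    echo "stop: not inside a Git repository (run pwd)"
    return 1
  fi
  count=$(pending_count)
  if [ "$count" -gt "$limit" ]; then
    echo "stop: $count pending paths; check pwd before staging"
    return 1
  fi
  echo "ok: $count pending paths"
}
```

Run check_before_staging from your repository folder before git add. A "stop" message usually means you are in the wrong directory.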
4. Commit the change
A commit is a snapshot with a short message describing what you did:
git commit -m "add readme with course details"
Think of the commit message as an instruction to your future self. Good commit messages are short and specific.
Good examples:
git commit -m "add readme with course details"
git commit -m "start lab 01 diary"
git commit -m "fix typo in README"
Poor examples:
git commit -m "stuff"
git commit -m "update"
git commit -m "final final really final"
Sanity check that Git recorded your commit:
git log -1 --oneline
You should see your commit message listed.
Common mistake
There must be a space after commit, and you should keep a space between -m and the message:
git commit -m "update readme" # correct
git commit-m "update readme" # wrong: git does not recognise commit-m
git commit -m"update readme" # works, but is easy to misread; keep the space
If Git says "nothing to commit"
Either you forgot to save the file in your editor, or you forgot to stage it. Run:
git status
If the file appears under Changes not staged for commit, run git add README.md again. If Git shows nothing changed, save the file and try again.
5. Push to GitHub
Push sends your commits to GitHub so they are backed up online:
git push
If you set up authentication in Step 4, this should work without a password prompt.
If you see an error about an upstream branch, run:
git push -u origin HEAD
That command tells Git which remote branch your local branch should track. You usually only need to do it once.
If the push succeeds, Git will print a summary showing which branch was updated on GitHub.
If git push is rejected
Sometimes GitHub has changes that your computer does not yet have, for example if you edited the README on the GitHub website. In that case, run:
git pull --rebase
git push
If Git reports a conflict, stop and ask for help rather than guessing.
Step 8: Check your work
Go to your repository page on GitHub. You should see the updated README file with your changes.
You should also see:
- The latest commit message near the top of the file list.
- The commit count link, which you can click to view history.
- Your updated README rendered as formatted text on the repository front page.
If the page has not changed, refresh the browser and check that your
git push command succeeded.
Quick reference
| Command | What it does |
|---|---|
| git status | Show which files have changed |
| git add <file> | Stage a file for the next commit |
| git add -A | Stage all changes |
| git commit -m "message" | Save a snapshot with a message |
| git push | Upload commits to GitHub |
| git pull | Download changes from GitHub |
| git log --oneline | Show commit history |
Workflow summary
Edit files → git add → git commit -m "message" → git push. Repeat.
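To rehearse the whole cycle without touching GitHub or your real repository, you can push to a local "bare" repository that stands in for the remote. This is a sketch under stated assumptions: the scratch paths come from mktemp, the name and email are placeholders, and the bare repository is an illustrative substitute for GitHub rather than part of the course setup.

```shell
# A throwaway rehearsal: nothing here touches GitHub or your real work.
work=$(mktemp -d)                       # scratch area, safe to delete later
git init -q --bare "$work/remote.git"   # local stand-in for GitHub
git clone -q "$work/remote.git" "$work/psy434-labs" 2>/dev/null

(
  cd "$work/psy434-labs" || exit 1
  # Identity for this scratch repository only (placeholder values).
  git config user.name "Your Name"
  git config user.email "your.email@example.com"

  printf '# PSYC 434 lab diary\n' > README.md         # edit
  git add README.md                                   # stage
  git commit -q -m "add readme with course details"   # commit
  git push -q -u origin HEAD                          # push, setting the upstream once
  git log -1 --oneline                                # confirm the commit exists
)
```

The final line prints the short hash and message of the commit you just made. Delete the scratch folder when you are done.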
Emergency: stuck in a rebase
If you accidentally start a rebase and want to get back to where you were:
git rebase --abort
If you are resolving a rebase conflict, the usual flow is:
git add <file>
git rebase --continue
If you are unsure, run git status and ask for help before you try random commands.
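If you are not sure whether a rebase is actually under way, a small check can tell you before you run either command. This sketch assumes that Git keeps its rebase state in a rebase-merge or rebase-apply directory inside .git while a rebase is in progress; rebase_in_progress is a hypothetical helper name.

```shell
# rebase_in_progress (hypothetical helper): succeed if Git's rebase state
# directory exists in the current repository.
rebase_in_progress() {
  for state in rebase-merge rebase-apply; do
    if [ -d "$(git rev-parse --git-path "$state" 2>/dev/null)" ]; then
      return 0
    fi
  done
  return 1
}

# Report the state only when run from inside a repository.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  if rebase_in_progress; then
    echo "rebase in progress: run git status, then continue or abort"
  else
    echo "no rebase in progress"
  fi
fi
```

git status reports the same information in words; the helper is only a shortcut.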
What never belongs in a repository
Git remembers everything you commit, even after you delete the file. If you push a secret to GitHub, assume it is compromised. Removing it from later commits does not remove it from the history.
Never commit any of the following:
| Category | Examples |
|---|---|
| SSH or API keys | ~/.ssh/id_ed25519, .env files, API tokens |
| Passwords and credentials | database connection strings, login details |
| Personal data | NZAVS datasets, participant records, anything with names or ID numbers |
| Large binary files | .zip, .rds, .csv, .mp4 (your .gitignore already excludes these) |
Your .gitignore blocks most data and output files automatically, but
it cannot protect you if you stage files from outside your repository or
override it with a force flag.
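As a second line of defence, you can scan file names for the categories in the table above before committing. This is a minimal sketch: flag_risky_names is a hypothetical helper, and its patterns cover only a few of the examples listed, so it is not a complete secret scanner.

```shell
# flag_risky_names (hypothetical helper): read file names on standard input
# and print the ones that match a few risky patterns.
flag_risky_names() {
  while IFS= read -r name; do
    case "$name" in
      *id_ed25519|*.env) echo "secret? $name" ;;   # private key or credentials file
      *.csv|*.rds)       echo "data?   $name" ;;   # data files
      *.zip|*.mp4)       echo "large?  $name" ;;   # large binaries
    esac
  done
}

# Demonstration with made-up file names.
printf '%s\n' README.md .env data/survey.csv | flag_risky_names

# Typical use inside a repository, checking what is staged:
# git diff --cached --name-only | flag_risky_names
```

The public key id_ed25519.pub is deliberately not flagged, because the pattern matches only names that end in id_ed25519.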
If you accidentally push something private
- Do not panic, but act quickly.
- Revoke the credential. If it is an SSH key, delete it from github.com/settings/keys and generate a new one (repeat Step 4). If it is an API token or password, revoke it on the service that issued it. Once revoked, the exposed copy is useless.
- Delete and recreate the repository. Removing the file from a later commit does not remove it from the history. The simplest fix is to delete the repository on GitHub, create a new one, and push your current local folder to it. Your local files are unaffected.
- If personal data was exposed (e.g., participant records with names or ID numbers), this is a privacy breach. Report it to the course coordinator immediately so it can be escalated to the university.
Terminal basics
You have already used a few terminal commands (cd, git clone). The
terminal is a text interface for your computer. Every command you type
runs a small program. Here are the commands you will use most often.
Where am I?
pwd
pwd (print working directory) shows the folder you are currently in.
List files
ls
ls lists the files and folders in the current directory. To see hidden
files (names starting with .), use:
ls -a
Git stores its data in a hidden folder called .git. Try ls -a inside
your repository to see it.
Change directory
cd ~/Documents/psy434/psy434-labs
cd moves you into a folder. A few shortcuts:
| Shortcut | Meaning |
|---|---|
| ~ | Your home folder |
| .. | One level up |
| . | The current folder |
So cd .. moves up one level, and cd ~ takes you home.
Create a folder
mkdir diaries
mkdir (make directory) creates a new folder.
Create an empty file
touch lab-01.md
touch creates an empty file if it does not already exist. (Windows
users: touch works in Git Bash but not in PowerShell or Command
Prompt. Make sure you are using Git Bash.)
Terminal quick reference
| Command | What it does |
|---|---|
| pwd | Print the current directory |
| ls | List files |
| ls -a | List files including hidden ones |
| cd <folder> | Change directory |
| cd .. | Go up one level |
| mkdir <name> | Create a folder |
| touch <name> | Create an empty file |
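To practise these commands safely, you can work in a scratch folder so nothing in your real files changes. The sketch below assumes a POSIX-style shell (Terminal or Git Bash); mktemp is not covered above and is used here only to create a temporary folder.

```shell
# Practise in a scratch folder so your real files stay untouched.
scratch=$(mktemp -d)       # create a temporary folder and remember its path
(
  cd "$scratch" || exit 1  # move into it
  pwd                      # where am I?
  mkdir diaries            # create a folder
  touch diaries/lab-01.md  # create an empty file inside it
  ls -a                    # list everything, including hidden entries
  cd diaries               # move down one level
  ls                       # list the file just created
  cd ..                    # back up one level
)
```

The parentheses run the commands in a subshell, so your own terminal stays in the folder where you started.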
Organise your repository
Set up a simple folder structure for the course. From inside your repository folder:
mkdir diaries data R
Git does not track empty folders. To make sure diaries/, data/, and
R/ appear on GitHub, add a placeholder file to each:
touch diaries/.gitkeep data/.gitkeep R/.gitkeep
(data/.gitkeep is the only file in data/ that should be tracked.)
Each folder has a purpose:
| Folder | Contents |
|---|---|
| diaries/ | Weekly lab diary entries (.md files) |
| data/ | Datasets you generate or download (ignored by Git because of .gitignore, but useful for keeping your project organised locally) |
| R/ | R scripts and Quarto documents (.R, .qmd) |
A tidy repository separates source files (code you write) from generated output (plots, PDFs, HTML). Source files go into Git; output does not. If your code is correct, anyone can regenerate the output by running it. This principle, that results follow from code, is the basis of reproducible research.
Naming conventions
Good file names are lowercase, use hyphens instead of spaces, and sort naturally:
| Good | Bad | Why |
|---|---|---|
| lab-01.md | Lab 1.md | spaces break terminal commands |
| lab-02.md | lab2.md | the hyphen and zero-padding (01, 02, …) keep files in order |
| clean-data.R | CleanData_FINAL(2).R | one clear name, no version suffixes |
| fig-ate-by-age.png | Figure 3.png | describes content, not position |
Three rules of thumb:
- No spaces. Use hyphens (-) or underscores (_). Spaces require quoting in the terminal and cause errors in scripts.
- Zero-pad numbers. 01, 02, ... 10 sort correctly; 1, 2, ... 10 do not (your system puts 10 before 2).
- Name for content, not sequence. A file called analysis-ate.R is still meaningful six months later; script3.R is not.
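The zero-padding rule is easy to see in the terminal. The sketch below sorts the same three file names with and without padding; LC_ALL=C is set so the sort uses plain byte order, and pad_name is a hypothetical helper for building padded names.

```shell
# Unpadded names sort as text, so lab-10 lands before lab-2.
printf 'lab-%d.md\n' 1 2 10 | LC_ALL=C sort

# Zero-padded names sort in the order you expect.
printf 'lab-%02d.md\n' 1 2 10 | LC_ALL=C sort

# pad_name (hypothetical helper): build a padded diary file name.
# Pass the plain number (3, not 03).
pad_name() {
  printf 'lab-%02d.md\n' "$1"
}
pad_name 3
```

The first command prints lab-1.md, lab-10.md, lab-2.md; the second prints lab-01.md, lab-02.md, lab-10.md; the last prints lab-03.md.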
Next, create a .gitignore file. This tells Git to ignore files that
should not be tracked (system files, R temporary files, datasets, large
binaries, etc.).
If you are using RStudio:
- Go to File > New File > Text File.
- Paste the contents below.
- Go to File > Save As....
- Save the file as .gitignore in the root of your repository (psy434-labs/).
- In the Files pane, click More > Show Hidden Files so dotfiles are visible.
- Open .gitignore from the Files pane whenever you want to edit it.
Paste the following contents:
# system files
.DS_Store
Thumbs.db
# R
.Rhistory
.RData
.Rproj.user
*.Rproj
# data files
data/**
!data/.gitkeep
*.rds
*.qs
*.parquet
*.arrow
*.csv
*.xlsx
*.sav
*.dta
# large binary and archive files
*.zip
*.7z
*.tar
*.gz
*.bz2
*.xz
*.dmg
*.iso
*.mp4
*.mov
*.avi
*.mp3
*.wav
# output files
*.pdf
*.html
*.png
*.jpg
*.jpeg
*.svg
*.gif
*.tiff
*_files/
*_cache/
.quarto/
Save the file in the root of your repository (the same folder as
README.md). The filename must start with a dot: .gitignore, not
gitignore.
Your repository should look like this:
psy434-labs/
├── .gitignore
├── README.md
├── R/
│ └── .gitkeep
├── data/
│ └── .gitkeep
└── diaries/
└── .gitkeep
Notice that data/ exists on your computer but Git ignores its contents
by default. This is intentional: data files can be large, and anyone
with your code can regenerate them. Keep data local; keep code in Git.
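You can ask Git directly whether a path would be ignored, which is a quick way to test a .gitignore rule. The sketch below rehearses this in a scratch repository using three rules copied from the file above; is_ignored is a hypothetical wrapper around git check-ignore, and the file names are made up for illustration.

```shell
# is_ignored (hypothetical helper): git check-ignore -q exits with status 0
# when the given path matches an ignore rule.
is_ignored() {
  if git check-ignore -q "$1"; then
    echo "ignored:     $1"
  else
    echo "not ignored: $1"
  fi
}

demo=$(mktemp -d)   # scratch repository, safe to delete later
(
  cd "$demo" || exit 1
  git init -q . 2>/dev/null
  # Three rules copied from the course .gitignore.
  printf '%s\n' 'data/**' '!data/.gitkeep' '*.csv' > .gitignore

  is_ignored data/survey.parquet   # inside data/, so ignored
  is_ignored R/lab-02.R            # source file, so not ignored
)
```

The paths do not need to exist for the check to work. In your real repository, git check-ignore -v <file> also shows which rule matched.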
Before every commit, check that you are staging only source files
(.md, .R, .qmd, and small config files):
git status
git diff --cached --name-only
If you accidentally stage a data file or large object, unstage it:
git restore --staged <file>
If you already committed it, remove it from tracking while keeping the local copy:
git rm --cached <file>
git commit -m "stop tracking data file"
git push
Lab diary files go in the diaries/ folder, named by week number:
diaries/
├── lab-01.md
├── lab-02.md
├── lab-03.md
├── ...
└── lab-10.md
There is no lab-07.md (week 7 is test 1). Create your first diary file
now:
touch diaries/lab-01.md
Markdown basics
Markdown is a plain-text format that converts to formatted documents.
You write in a .md file using simple symbols for headings, bold,
lists, and so on. GitHub renders markdown automatically, so your diary
will look formatted when you view it online.
Headings
Use # symbols. More # signs mean smaller headings:
# Heading 1
## Heading 2
### Heading 3
Paragraphs
Separate paragraphs with a blank line. A single line break without a blank line will not start a new paragraph.
Bold and italics
This is **bold** and this is *italic*.
Lists
Unordered lists use -:
- first item
- second item
- third item
Numbered lists use 1., 2., etc.:
1. first step
2. second step
3. third step
Inline code
Wrap code in single backticks:
Use the `git push` command to upload your work.
Links
[GitHub](https://github.com)
Markdown reference
GitHub has a concise guide to GitHub-flavoured markdown: Basic writing and formatting syntax.
Write your first lab diary
Create your week 1 diary entry now. Open diaries/lab-01.md in RStudio
and write ~150 words covering:
- What this lab covered and what you did.
- A connection to the week's lecture content.
- One thing you found useful, surprising, or challenging.
Use at least one heading, one bold or italic word, and one list. When you are done, stage these files, commit, and push:
git add .gitignore diaries/lab-01.md diaries/.gitkeep data/.gitkeep R/.gitkeep
git commit -m "add lab 01 diary and repo structure"
git push
Check your repository on GitHub to confirm the file appears and the markdown renders correctly.
Editors
RStudio is the recommended editor for this course. All examples and in-class instructions assume RStudio. You are welcome to use any editor you prefer (VS Code, Zed, Neovim, etc.), but you will need to translate instructions on your own. Avoid rich-text editors (Word, Pages, TextEdit) that can silently change file format or extensions.
Alternative: GitHub Desktop
If you prefer a graphical interface, download GitHub Desktop. It provides the same stage/commit/push workflow with buttons instead of terminal commands. Either approach is fine for this course.
Lab 2: Install R and Set Up Your IDE
Today's workflow
Complete this lab in a local RStudio project first. You can do all core exercises without git/GitHub, then connect to git/GitHub at the end.
This session introduces R and RStudio, then builds your core R skills.
Why learn R?
- You will need it for your final report (if you choose the report option).
- It supports your psychology coursework.
- It enhances your coding skills, which will help you in many domains of work, including utilising AI (!).
Installing R
- Visit CRAN at https://cran.r-project.org/.
- Select the version for your operating system (Windows, Mac, or Linux).
- Download and install by following the on-screen instructions.
Installing RStudio
Step 1: Install RStudio
- Go to https://posit.co/download/rstudio-desktop/.
- Choose the free version of RStudio Desktop and download it for your operating system.
- Install RStudio Desktop.
- Open RStudio to begin setting up your project environment.
Step 2: Choose your working folder and create lab folders
Use any folder you like for this lab. If you already have
psy434-labs from Lab 1, you can use that. If not, create a new
folder:
mkdir -p ~/Documents/psy434/lab-02
cd ~/Documents/psy434/lab-02
mkdir -p diaries data R
pwd
ls -a
If you chose a different location, use that path instead.
Step 3: Open your folder as an RStudio project
- In RStudio, go to File > New Project.
- Choose Existing Directory.
- Browse to your folder (e.g., ~/Documents/psy434/lab-02).
- Click Create Project.
RStudio creates a .Rproj file in your folder.
File naming
Use clear labels that anyone could understand. That "anyone" will be your future self. Prefer lowercase with hyphens:
lab-02-intro.R, notLab 2 Intro.R.
Step 4: Create your first R script
Now that RStudio is installed, download the starter script:
- Download the R script for this lab.
- Save it as R/lab-02.R inside your project folder.
- Open R/lab-02.R in RStudio.
If downloading is inconvenient, create your own script via
File > New File > R Script and save it as R/lab-02.R.
Step 5: Working with R scripts
- Write your R code in the script editor. Execute code by selecting lines and pressing Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).
- Use comments (preceded by #) to document your code.
- Save your scripts regularly (Ctrl + S or Cmd + S).
Step 6: When you exit RStudio
Before concluding your work, restart R (Session > Restart R) and
choose not to save the workspace image when prompted.
Workflow habits
- Use clearly defined script names.
- Annotate your code.
- Save your scripts often (Ctrl + S or Cmd + S).
Running R from the terminal
You can run R without opening RStudio.
Interactive console
Open a terminal and type:
R
You will see the R prompt (>). Try a quick calculation:
1 + 1
Type q() to quit. When asked to save the workspace, type n.
Running a script
If you have an R script saved in your project folder, run it directly:
Rscript R/lab-02.R
Output prints to the terminal. This is useful for running code without opening RStudio, and is how R scripts are run on servers and in automated pipelines.
RStudio vs terminal
RStudio is easier for interactive exploration (viewing plots, inspecting data). The terminal is useful for running finished scripts and for working on remote machines. Both use the same R installation. Use whichever suits the task.
Basic R Commands
How to copy code from this page
- Open File > New File > R Script in RStudio.
- Name and save your new R script.
- Copy the code blocks below into your script.
- Save: Ctrl + S or Cmd + S.
Assignment (<-)
Assignment in R uses the <- operator:
x <- 10 # assigns the value 10 to x
y <- 5 # assigns the value 5 to y
RStudio shortcut for <-
- macOS: Option + - (minus key)
- Windows/Linux: Alt + - (minus key)
Concatenation (c())
The c() function combines multiple elements into a vector:
numbers <- c(1, 2, 3, 4, 5)
print(numbers)
Operations (+, -)
x <- 10
y <- 5
total <- x + y
print(total)
difference <- x - y
difference
Executing code
Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).
Multiplication (*) and Division (/)
product <- x * y
product
quotient <- x / y
quotient
# element-wise operations on vectors
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
vector_product <- vector1 * vector2
vector_product
vector_division <- vector1 / vector2
vector_division
Be mindful of division by zero: 10 / 0 returns Inf, and 0 / 0
returns NaN.
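The note above can be checked directly. This is a small sketch, using only base R, of how these special values appear and how to screen them out before summarising; the object names are illustrative.

```r
# special values produced by division
results <- c(10 / 2, 10 / 0, -10 / 0, 0 / 0)
results             # 5 Inf -Inf NaN
is.finite(results)  # TRUE FALSE FALSE FALSE
is.nan(results)     # FALSE FALSE FALSE TRUE

# summarise only the finite values
finite_mean <- mean(results[is.finite(results)])
finite_mean         # 5
```

Without the is.finite() filter, mean(results) would return NaN, so it is worth checking for these values before computing summaries.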
# integer division and modulo
integer_division <- 10 %/% 3 # 3
remainder <- 10 %% 3 # 1
rm() Remove Object
devil_number <- 666
devil_number
rm(devil_number)
Logic (!, !=, ==)
x_not_y <- x != y # TRUE
x_equal_10 <- x == 10 # TRUE
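The heading lists the negation operator (!), which the lines above do not use. A short sketch of how it behaves, with x and y redefined so the block runs on its own:

```r
x <- 10
y <- 5
x_is_ten <- x == 10        # TRUE
x_is_not_ten <- !x_is_ten  # FALSE

# ! flips every element of a logical vector
flags <- c(TRUE, FALSE, TRUE)
flipped <- !flags          # FALSE TRUE FALSE

# !(x == y) gives the same answer as x != y
same_answer <- identical(!(x == y), x != y)  # TRUE
```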
OR (| and ||)
# element-wise OR
vector_or <- c(TRUE, FALSE) | c(FALSE, TRUE) # c(TRUE, TRUE)
# scalar OR: each side must be a single TRUE/FALSE (longer vectors are an error in R >= 4.3)
single_or <- TRUE || FALSE # TRUE
AND (& and &&)
# element-wise AND
vector_and <- c(TRUE, FALSE) & c(FALSE, TRUE) # c(FALSE, FALSE)
# scalar AND: each side must be a single TRUE/FALSE (longer vectors are an error in R >= 4.3)
single_and <- TRUE && FALSE # FALSE
RStudio workflow shortcuts
- Execute code line: Cmd + Return (Mac) or Ctrl + Enter (Win/Linux)
- Insert section heading: Cmd + Shift + R (Mac) or Ctrl + Shift + R
- Align code: Cmd + Shift + A (Mac) or Ctrl + Shift + A
- Comment/uncomment: Cmd/Ctrl + Shift + C
- Save all: Cmd + Option + S (Mac) or Ctrl + Alt + S
- Find/replace: Cmd/Ctrl + F
- New file: Cmd/Ctrl + Shift + N
- Auto-complete: Tab

For more, explore Tools > Command Palette or Shift + Cmd/Ctrl + P.
Data Types in R
Integers
x <- 42L
str(x) # int 42
y <- as.numeric(x)
str(y) # num 42
Integers are useful for counts or indices that do not require fractional values.
Characters
name <- "Alice"
Characters represent text: names, labels, descriptions.
Factors
colours <- factor(c("red", "blue", "green"))
Factors represent categorical data with a limited set of levels.
Ordered factors
education_levels <- c("high school", "bachelor", "master", "ph.d.")
# unordered
education_factor_no_order <- factor(education_levels, ordered = FALSE)
# ordered: pass the levels explicitly, otherwise R sorts them alphabetically
education_factor <- factor(education_levels, levels = education_levels, ordered = TRUE)
education_factor
Ordered factors support logical comparisons based on level order:
edu1 <- ordered("bachelor", levels = education_levels)
edu2 <- ordered("master", levels = education_levels)
edu2 > edu1 # TRUE
Strings
you <- "world!"
greeting <- paste("hello,", you)
greeting # "hello, world!"
Vectors
Vectors are homogeneous: all elements must be of the same type.
numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "cherry")
logical_vector <- c(TRUE, FALSE, TRUE, FALSE)
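Because vectors are homogeneous, R silently converts mixed inputs to a single type. A brief sketch of that coercion (object names are illustrative):

```r
# mixing numbers, text, and logicals: everything becomes character
mixed <- c(1, "two", TRUE)
class(mixed)         # "character"
mixed                # "1" "two" "TRUE"

# mixing numbers and logicals: TRUE becomes 1, FALSE becomes 0
numbers_and_logicals <- c(1, TRUE, FALSE)
class(numbers_and_logicals)  # "numeric"
numbers_and_logicals         # 1 1 0
```

If a numeric column unexpectedly shows up as character, a stray text value in the data is a common cause.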
Manipulating vectors
vector_sum <- numeric_vector + 10
vector_multiplication <- numeric_vector * 2
vector_greater_than_three <- numeric_vector > 3
Accessing elements:
first_element <- numeric_vector[1]
some_elements <- numeric_vector[c(2, 4)]
Functions with vectors
mean(numeric_vector)
sum(numeric_vector)
sort(numeric_vector)
unique(character_vector)
Data Frames
Creating data frames
df <- data.frame(
name = c("alice", "bob", "charlie"),
age = c(25, 30, 35),
gender = c("female", "male", "male")
)
head(df)
str(df)
Accessing elements
# by column name
name_column <- df$name
# by row and column
second_person <- df[2, ]
age_column <- df[, "age"]
# by condition
very_old_people <- subset(df, age > 25)
mean(very_old_people$age)
Exploring data frames
head(df) # first six rows
tail(df) # last six rows
str(df) # structure
summary(df) # summary statistics
Manipulating data frames
# adding columns
df$employed <- c(TRUE, TRUE, FALSE)
# adding rows
new_person <- data.frame(name = "diana", age = 28, gender = "female", employed = TRUE)
df <- rbind(df, new_person)
# modifying values
df[4, "age"] <- 26
# removing columns
df$employed <- NULL
# removing rows
df <- df[-4, ]
rbind() and cbind()
rbind() combines data frames by rows; cbind() combines by columns.
When using these functions, column names (for rbind) or row counts
(for cbind) must match. We will use dplyr for more flexible joining
in later weeks.
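The matching rules above are easiest to see in a small example. This sketch uses made-up data frames (people, more_people, extra, and mismatch are illustrative names) and shows one successful case for each function and one failure.

```r
people <- data.frame(name = c("alice", "bob"), age = c(25, 30))
more_people <- data.frame(name = "charlie", age = 35)

# rbind(): column names match, so rows are stacked
all_people <- rbind(people, more_people)
nrow(all_people)   # 3

# cbind(): row counts match, so columns are placed side by side
extra <- data.frame(employed = c(TRUE, TRUE, FALSE))
combined <- cbind(all_people, extra)
names(combined)    # "name" "age" "employed"

# mismatched column names: rbind() stops with an error
mismatch <- data.frame(name = "diana", years = 28)
rbind_result <- try(rbind(people, mismatch), silent = TRUE)
inherits(rbind_result, "try-error")   # TRUE
```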
Summary statistics
set.seed(12345)
vector <- rnorm(n = 40, mean = 0, sd = 1)
mean(vector)
sd(vector)
min(vector)
max(vector)
table() for categorical data
set.seed(12345)
gender <- sample(c("male", "female"), size = 100, replace = TRUE, prob = c(0.5, 0.5))
education_level <- sample(c("high school", "bachelor", "master"), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2))
df_table <- data.frame(gender, education_level)
table(df_table)
table(df_table$gender, df_table$education_level) # cross-tabulation
First Data Visualisation with ggplot2
ggplot2 is based on the Grammar of Graphics: you build plots layer by
layer. Install it once if needed: install.packages("ggplot2").
library(ggplot2)
set.seed(12345)
student_data <- data.frame(
name = c("alice", "bob", "charlie", "diana", "ethan", "fiona", "george", "hannah"),
score = sample(80:100, 8, replace = TRUE),
stringsAsFactors = FALSE
)
student_data$passed <- ifelse(student_data$score >= 90, "passed", "failed")
student_data$passed <- factor(student_data$passed, levels = c("failed", "passed"))
student_data$study_hours <- sample(5:15, 8, replace = TRUE)
Bar plot
ggplot(student_data, aes(x = name, y = score, fill = passed)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
labs(title = "student scores", x = "student name", y = "score") +
theme_minimal()
Scatter plot
ggplot(student_data, aes(x = study_hours, y = score, color = passed)) +
geom_point(size = 4) +
scale_color_manual(values = c("failed" = "red", "passed" = "blue")) +
labs(title = "scores vs. study hours", x = "study hours", y = "score") +
theme_minimal()
Box plot
ggplot(student_data, aes(x = passed, y = score, fill = passed)) +
geom_boxplot() +
scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
labs(title = "score distribution by pass/fail status", x = "status", y = "score") +
theme_minimal()
Histogram
ggplot(student_data, aes(x = score, fill = passed)) +
geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
labs(title = "histogram of scores", x = "score", y = "count") +
theme_minimal()
Line plot (time series)
months <- factor(month.abb[1:8], levels = month.abb[1:8])
study_hours <- c(0, 3, 15, 30, 35, 120, 18, 15)
study_data <- data.frame(month = months, study_hours = study_hours)
ggplot(study_data, aes(x = month, y = study_hours, group = 1)) +
geom_line(linewidth = 1, color = "blue") +
geom_point(color = "red", size = 1) +
labs(title = "monthly study hours", x = "month", y = "study hours") +
theme_minimal()
Summary of today's lab
This session covered:
- Installing and setting up R and RStudio
- Basic arithmetic operations
- Data structures: vectors, factors, data frames
- Data visualisation with ggplot2
End-of-lab git/GitHub check-in
After you finish the R tasks above, you can save this work to GitHub.
If you already completed Lab 1: Git and GitHub, run:
git status
git add R/lab-02.R
git commit -m "add lab 2 r script and setup"
git push
If git/GitHub is still new for you, stop here and return to this section after your Lab 1 workflow is working.
Where to get help
- Large language models: LLMs are effective coding tutors. Help from LLMs for coding does not constitute a breach of academic integrity in this course. For your final report, cite all sources including LLMs.
- Stack Overflow: outstanding community resource.
- Cross Validated: best place for statistics advice.
- Developer websites: Tidyverse.
- Your tutors and course coordinator.
Recommended reading
- Wickham, H., & Grolemund, G. (2016). R for Data Science. O'Reilly Media. Available online
- Meghan Hall's lecture: https://meghan.rbind.io/talk/neair/
- RStudio learning materials: https://posit.cloud/learn/primers
Appendix A: At-home exercises
Exercise 1: Install the tidyverse package
- Open RStudio.
- Go to Tools > Install Packages....
- Type tidyverse and check Install dependencies.
- Click Install.
- Load with library(tidyverse).
Exercise 2: Install the parameters and report packages
- Go to Tools > Install Packages....
- Type parameters, report.
- Check Install dependencies and click Install.
Exercise 2b: Install the causalworkshop package
The causalworkshop package provides simulated data for your research report. Run the following in your R console:
install.packages("pak")
pak::pak("go-bayes/causalworkshop")
If this fails with an error about make or compilation, follow Lab Setup: R Packages and Build Tools, then try again.
Verify the installation:
library(causalworkshop)
d <- simulate_nzavs_data(n = 100, seed = 2026)
head(d)
Exercise 3: Basic operations
- Create vector_a <- c(2, 4, 6, 8) and vector_b <- c(1, 3, 5, 7).
- Add them, subtract them, multiply vector_a by 2, divide vector_b by 2.
- Calculate the mean and standard deviation of both.
Exercise 4: Working with data frames
- Create a data frame with columns id (1–4), name, score (88, 92, 85, 95).
- Add a passed column (pass mark = 90).
- Extract name and score of students who passed.
- Explore with summary(), head(), str().
Exercise 5: Logical operations and subsetting
- Subset student_data to find students who scored above the mean.
- Create an attendance vector and add it as a column.
- Subset to select only rows where students were present.
Exercise 6: Cross-tabulation
- Create factor variables fruit and colour.
- Make a data frame and use table() for cross-tabulation.
- Which fruit has the most colour variety?
Exercise 7: Visualisation with ggplot2
- Using student_data, create a bar plot of scores by name.
- Add a title, axis labels, and colour by pass/fail status.
Appendix B: Solutions
Solution 3: Basic operations
vector_a <- c(2, 4, 6, 8)
vector_b <- c(1, 3, 5, 7)
sum_vector <- vector_a + vector_b
diff_vector <- vector_a - vector_b
double_vector_a <- vector_a * 2
half_vector_b <- vector_b / 2
mean(vector_a); sd(vector_a)
mean(vector_b); sd(vector_b)
Solution 4: Working with data frames
student_data <- data.frame(
id = 1:4,
name = c("alice", "bob", "charlie", "diana"),
score = c(88, 92, 85, 95),
stringsAsFactors = FALSE
)
student_data$passed <- student_data$score >= 90
# keep only name and score for students who passed
passed_students <- student_data[student_data$passed, c("name", "score")]
summary(student_data)
head(student_data)
str(student_data)
Solution 5: Logical operations and subsetting
mean_score <- mean(student_data$score)
students_above_mean <- student_data[student_data$score > mean_score, ]
attendance <- c("present", "absent", "present", "present")
student_data$attendance <- attendance
present_students <- student_data[student_data$attendance == "present", ]
Solution 6: Cross-tabulation
fruit <- factor(c("apple", "banana", "apple", "orange", "banana"))
colour <- factor(c("red", "yellow", "green", "orange", "green"))
fruit_data <- data.frame(fruit, colour)
table(fruit_data$fruit, fruit_data$colour)
# apple (red, green) and banana (yellow, green) tie for the most colour variety
Solution 7: Visualisation
library(ggplot2)
ggplot(student_data, aes(x = name, y = score, fill = passed)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red")) +
labs(title = "student scores", x = "name", y = "score") +
theme_minimal()
Lab 3: Regression and Bias Mechanisms
R script
This lab introduces regression as a tool for describing associations and then shows why regression coefficients do not become causal effects simply because we fit a model. We begin with a simple simulated regression, and then we examine three common sources of bias: omitted variable bias, mediator bias, and collider bias.
What you will learn
- How to simulate a simple regression model and interpret its slope.
- Why population inference depends on assumptions, not on software output alone.
- How omitted variable bias arises when a common cause is left out.
- How mediator bias arises when we condition on a variable on the causal pathway.
- How collider bias arises when we condition on a common effect.
Packages
required_packages <- c("tidyverse", "parameters", "report", "ggeffects", "ggdag")
missing_packages <- required_packages[
!vapply(required_packages, \(pkg) requireNamespace(pkg, quietly = TRUE), logical(1))
]
if (length(missing_packages) > 0) {
install.packages(missing_packages)
}
library(tidyverse)
library(parameters)
library(report)
library(ggeffects)
library(ggdag)
Regression refresher
A regression model describes how the expected value of an outcome changes with a predictor. In this lab, the first simulation uses study hours as the predictor and exam score as the outcome.
set.seed(123)
n <- 200
study_hours <- rnorm(n, mean = 10, sd = 2)
exam_score <- 50 + 3 * study_hours + rnorm(n, mean = 0, sd = 6)
df_regression <- tibble(
study_hours = study_hours,
exam_score = exam_score
)
fit_regression <- lm(exam_score ~ study_hours, data = df_regression)
summary(fit_regression)
report::report(fit_regression)
The coefficient for study_hours tells us how much the expected exam
score changes for a one-unit increase in study hours. In this
simulation, the data-generating slope is positive, so the fitted line
should recover a positive relationship.
We can visualise the fitted line with raw data:
predicted_values <- ggeffects::ggpredict(fit_regression, terms = "study_hours")
plot(predicted_values, dot_alpha = 0.25, show_data = TRUE, jitter = 0.05)
A reminder about inference
Regression helps us summarise patterns in data, but it does not settle causal questions on its own. A coefficient can be statistically precise and still be causally misleading if the model conditions on the wrong variables, omits an important pre-treatment cause, or conditions on a post-treatment variable that should have been left alone.
This is why model fit is not the same as causal identification. A model with a lower BIC or higher $R^2$ can still answer the wrong causal question.
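To make this concrete, here is a minimal simulated sketch (not part of the lab script) in which the better-fitting model gives the worse causal answer. It previews the collider structure covered later in this lab; the seed, sample size, and object names are assumptions chosen for illustration.

```r
set.seed(2026)
n <- 5000
a <- rnorm(n)               # treatment: no effect on y by construction
y <- rnorm(n)               # outcome
c_var <- a + y + rnorm(n)   # common effect of a and y

fit_correct <- lm(y ~ a)           # answers the causal question
fit_collider <- lm(y ~ a + c_var)  # fits better, answers the wrong question

r2_correct <- summary(fit_correct)$r.squared
r2_collider <- summary(fit_collider)$r.squared
c(r2_correct = r2_correct, r2_collider = r2_collider)
c(bic_correct = BIC(fit_correct), bic_collider = BIC(fit_collider))

coef(fit_correct)[["a"]]    # close to the true effect of 0
coef(fit_collider)[["a"]]   # clearly negative, although the true effect is 0
```

The second model wins on both $R^2$ and BIC, yet its coefficient for a is far from the true value of zero. Fit statistics reward prediction of y, not recovery of the effect of a.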
Omitted variable bias
Omitted variable bias occurs when a common cause of treatment and
outcome is not included in the model. In the script, l causes both a
and y, but a has no causal effect on y.
set.seed(434)
n <- 2000
l <- rnorm(n)
a <- 0.9 * l + rnorm(n)
y <- 1.2 * l + rnorm(n)
df_fork <- tibble(y = y, a = a, l = l)
fit_fork_naive <- lm(y ~ a, data = df_fork)
fit_fork_adjusted <- lm(y ~ a + l, data = df_fork)
parameters::model_parameters(fit_fork_naive)
parameters::model_parameters(fit_fork_adjusted)
The naive model reports an association between a and y because both
are driven by l. Once we adjust for l, the coefficient for a
should collapse toward zero. This is the logic of adjustment for
observed pre-treatment confounding.
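The size of the naive coefficient is not arbitrary. In this simulation it should sit near cov(a, y) / var(a), which the data-generating values imply is (0.9 × 1.2) / (0.9² + 1). The sketch below repeats the simulation so that it runs on its own and compares the fitted slope with that implied value; implied_naive_slope is an illustrative name.

```r
set.seed(434)
n <- 2000
l <- rnorm(n)
a <- 0.9 * l + rnorm(n)
y <- 1.2 * l + rnorm(n)

# slope implied by the data-generating process: cov(a, y) / var(a)
implied_naive_slope <- (0.9 * 1.2) / (0.9^2 + 1)
implied_naive_slope               # about 0.597

naive_slope <- coef(lm(y ~ a))[["a"]]
adjusted_slope <- coef(lm(y ~ a + l))[["a"]]
c(naive = naive_slope, adjusted = adjusted_slope)
```

The naive slope should land close to the implied value, and the adjusted slope close to zero. Both numbers in the formula come from l, which is why a stronger common cause produces a larger bias.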
The script also draws a DAG for this pattern:
dag_fork <- dagify(
y ~ l,
a ~ l,
exposure = "a",
outcome = "y"
) |>
tidy_dagitty(layout = "tree")
ggdag(dag_fork) +
theme_dag_blank()
Mediator bias
Mediator bias occurs when we condition on a variable that lies on the pathway from treatment to outcome, even though our goal is to estimate the total effect.
set.seed(435)
n <- 2000
a <- rbinom(n, 1, 0.5)
m <- 1.5 * a + rnorm(n)
y <- 2 * m + rnorm(n)
df_pipe <- tibble(y = y, a = a, m = m)
fit_pipe_total <- lm(y ~ a, data = df_pipe)
fit_pipe_overcontrolled <- lm(y ~ a + m, data = df_pipe)
parameters::model_parameters(fit_pipe_total)
parameters::model_parameters(fit_pipe_overcontrolled)
Here the total effect of a on y operates through m. If we
condition on m, we block the pathway that carries the treatment
effect. As a result, the coefficient for a shrinks toward zero, even
though the treatment really does matter.
In words, if you want the total effect, you do not control for the mediator.
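In this simulation the total effect is the product of the two path coefficients, 1.5 × 2 = 3. The sketch below repeats the simulation so that it runs on its own and checks that arithmetic; the object names are illustrative, and the decomposition shown holds for linear models fitted by least squares.

```r
set.seed(435)
n <- 2000
a <- rbinom(n, 1, 0.5)
m <- 1.5 * a + rnorm(n)
y <- 2 * m + rnorm(n)

a_to_m <- coef(lm(m ~ a))[["a"]]           # path a -> m, about 1.5
fit_over <- lm(y ~ a + m)
m_to_y <- coef(fit_over)[["m"]]            # path m -> y, about 2
direct_effect <- coef(fit_over)[["a"]]     # about 0: nothing is left once m is held fixed
total_effect <- coef(lm(y ~ a))[["a"]]     # about 3

# total effect = direct effect + (a -> m) * (m -> y)
c(total = total_effect, rebuilt = direct_effect + a_to_m * m_to_y)
```

The overcontrolled coefficient is not wrong as arithmetic. It answers a different question (the effect of a with m held fixed), which is not the total effect we set out to estimate.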
Collider bias
Collider bias occurs when we condition on a variable that is caused by both the treatment and the outcome. Before conditioning, the treatment and outcome may be unrelated. After conditioning, we create a spurious association.
set.seed(436)
n <- 2000
a <- rnorm(n)
y <- rnorm(n)
c_var <- a + y + rnorm(n)
df_collider <- tibble(y = y, a = a, c_var = c_var)
fit_collider_unadjusted <- lm(y ~ a, data = df_collider)
fit_collider_adjusted <- lm(y ~ a + c_var, data = df_collider)
parameters::model_parameters(fit_collider_unadjusted)
parameters::model_parameters(fit_collider_adjusted)
The unadjusted model should show little or no relationship between a
and y. The adjusted model should create one. That is collider bias.
Compare the three bias patterns
The script finishes by collecting the treatment coefficient from each simulation into a single table:
results <- tibble(
scenario = c(
"omitted variable bias",
"omitted variable bias",
"mediator bias",
"mediator bias",
"collider bias",
"collider bias"
),
model = c(
"naive",
"adjusted for l",
"total effect",
"overcontrolled",
"do not condition",
"condition on collider"
),
estimate = c(
coef(fit_fork_naive)[["a"]],
coef(fit_fork_adjusted)[["a"]],
coef(fit_pipe_total)[["a"]],
coef(fit_pipe_overcontrolled)[["a"]],
coef(fit_collider_unadjusted)[["a"]],
coef(fit_collider_adjusted)[["a"]]
)
) |>
mutate(estimate = round(estimate, 3))
print(results)
This comparison is the point of the lab. Different forms of conditioning error create different forms of bias. Regression does not rescue us from those mistakes. We have to supply the right causal structure.
Exercise
- Increase the strength of the common cause l in the omitted variable simulation. What happens to the naive coefficient?
- Increase the effect of a on m in the mediator simulation. What happens to the difference between the total-effect and overcontrolled models?
- Increase the contribution of a and y to the collider c_var. What happens to the adjusted coefficient?
Take-home message
Regression is useful, but causal interpretation depends on design and assumptions. The practical rule is simple. Control for common causes. Do not control for mediators when estimating total effects. Do not control for colliders.
Lab 4: Writing Regression Models
R scripts
Last week asked you to learn R and causal inference at the same time. That is a heavy lift. This week slows down and focuses on one skill: writing regression models in R and seeing how the results change when you change the formula.
Start with the student practice script. It is shorter, repeats the same workflow, and gives you clear places to edit the model. The instructor script comes second. It adds extra annotation, more examples, and an optional extension that returns to the Week 4 question about samples and populations.
What you will learn
- How to write a simple regression formula in R.
- How adding a second predictor can change a coefficient.
- How an interaction changes fitted lines.
- How to rerun a model after changing the formula.
- Optional: how factor terms, curved relationships, and sample-to-population differences extend the same ideas.
Packages
The student script uses tidyverse. The instructor script also uses
parameters.
required_packages <- c("tidyverse")
missing_packages <- required_packages[
!vapply(required_packages, \(pkg) requireNamespace(pkg, quietly = TRUE), logical(1))
]
if (length(missing_packages) > 0) {
install.packages(missing_packages)
}
library(tidyverse)
How to use this lab
- Open the student practice script first.
- Run one exercise at a time.
- Change only the formula after ~.
- Rerun the model and the lines immediately below it.
- Write down what changed before moving to the next exercise.
Student practice script
The student script has three core exercises.
- One predictor. Fit exam_score ~ study_hours, then change it to exam_score ~ 1 and see what the fitted line becomes.
- Add a second predictor. Start with exam_score ~ study_hours, then add motivation and see how the coefficient for study_hours changes.
- Add an interaction. Start with exam_score ~ study_hours + workshop, then change it to exam_score ~ study_hours * workshop and compare the fitted lines.
The aim is not to memorise syntax. The aim is to notice what each change in the formula does.
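If you want to try the three formula changes before opening the script, the sketch below simulates a small dataset with the same variable names. The data-generating values, the seed, and the name df_practice are assumptions made for illustration; the student script may simulate its data differently.

```r
set.seed(2026)
n <- 500
motivation <- rnorm(n)
workshop <- rbinom(n, 1, 0.5)
study_hours <- 10 + 1.5 * motivation + rnorm(n)
exam_score <- 50 + 2 * study_hours + 4 * motivation +
  3 * workshop + 1 * workshop * study_hours + rnorm(n, sd = 5)
df_practice <- data.frame(exam_score, study_hours, motivation, workshop)

# exercise 1: an intercept-only model fits a flat line at the mean
fit_intercept <- lm(exam_score ~ 1, data = df_practice)
coef(fit_intercept)[["(Intercept)"]]
mean(df_practice$exam_score)

# exercise 2: adding motivation changes the study_hours coefficient
fit_one <- lm(exam_score ~ study_hours, data = df_practice)
fit_two <- lm(exam_score ~ study_hours + motivation, data = df_practice)
c(one_predictor = coef(fit_one)[["study_hours"]],
  two_predictors = coef(fit_two)[["study_hours"]])

# exercise 3: * adds an interaction term, so the two workshop groups get different slopes
fit_add <- lm(exam_score ~ study_hours + workshop, data = df_practice)
fit_int <- lm(exam_score ~ study_hours * workshop, data = df_practice)
names(coef(fit_add))
names(coef(fit_int))
```

In this simulated data, motivation raises both study hours and exam scores, so the one-predictor coefficient for study_hours is larger than the two-predictor coefficient.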
Instructor script with extensions
Use the instructor script after you have worked through the student version.
It includes:
- More annotation around the simulation code.
- Extra exercises with a factor predictor and a curved relationship.
- Cleaner comparison tables.
- An optional extension on sample versus population estimands.
That final section reconnects the lab to the Week 4 theme. It is useful, but it is not the place to start if you are still getting comfortable with R syntax.
Questions to answer
- In exercise 1, what happens to the fitted line when you change exam_score ~ study_hours to exam_score ~ 1?
- In exercise 2, does the coefficient for study_hours get larger or smaller after adding motivation?
- In exercise 3, what changes when you replace + with * in the formula?
- Which formula felt easiest to interpret, and which felt hardest?
Optional extension
If you finish early, open the instructor script and run the optional section at the end. In one short paragraph, explain why the conditional coefficients can look similar even when the average treatment effect changes across populations.
Lab 5: Average Treatment Effects
R script
This lab introduces several ways to estimate average treatment effects (ATEs). You will compare naive, regression-adjusted, g-computation, and causal forest estimates against known ground-truth effects, then finish with one short illustration of inverse probability of treatment weighting (IPTW).
What you will learn
- Why naive estimates of causal effects are biased when confounding is present
- How covariate adjustment and g-computation reduce this bias
- How confounding control can also come from an exposure model through IPTW
- How to fit a causal forest and extract the ATE
- How to validate estimates against ground truth
New packages
This lab uses the causalworkshop and grf packages. Install them before proceeding if you haven't already.
Setup and data
Install and load the required packages:
# install packages if needed
# install.packages("grf")
# if (!requireNamespace("pak", quietly = TRUE)) install.packages("pak")
# pak::pak("go-bayes/causalworkshop")
library(causalworkshop)
library(grf)
library(tidyverse)
Generate a simulated three-wave panel dataset. The data are modelled on
the New Zealand Attitudes and Values Study (NZAVS), with baseline
confounders (wave 0), binary exposures (wave 1), and continuous outcomes
(wave 2). Crucially, the data contain known ground-truth treatment
effects in the tau_* columns.
# simulate data
d <- simulate_nzavs_data(n = 5000, seed = 2026)
# check structure
dim(d)
names(d)
The data are in long format (three rows per individual). We need to separate the waves:
# separate waves
d0 <- d |> filter(wave == 0) # baseline confounders
d1 <- d |> filter(wave == 1) # exposure assignment
d2 <- d |> filter(wave == 2) # outcomes
# verify alignment
stopifnot(all(d0$id == d1$id), all(d0$id == d2$id))
We will estimate the effect of community group participation
(community_group) at wave 1 on purpose (purpose) at wave 2.
# ground truth: the true ATE
true_ate <- mean(d0$tau_community_purpose)
cat("True ATE:", round(true_ate, 3), "\n")
Naive ATE (biased)
A naive estimate ignores confounders. We simply regress the outcome on the exposure:
fit_naive <- lm(d2$purpose ~ d1$community_group)
naive_ate <- coef(fit_naive)[2]
cat("Naive ATE:", round(naive_ate, 3), "\n")
cat("True ATE: ", round(true_ate, 3), "\n")
cat("Bias: ", round(naive_ate - true_ate, 3), "\n")
Why is the naive estimate biased?
People who join community groups differ systematically from those who don't. They tend to be more extraverted, more agreeable, and less neurotic. These same traits also affect purpose directly. The naive estimate captures both the causal effect and the confounding.
Adjusted ATE (regression)
We can reduce bias by conditioning on baseline confounders:
# construct analysis dataframe
df <- data.frame(
y = d2$purpose,
a = d1$community_group,
age = d0$age,
male = d0$male,
nz_european = d0$nz_european,
education = d0$education,
partner = d0$partner,
employed = d0$employed,
log_income = d0$log_income,
nz_dep = d0$nz_dep,
agreeableness = d0$agreeableness,
conscientiousness = d0$conscientiousness,
extraversion = d0$extraversion,
neuroticism = d0$neuroticism,
openness = d0$openness,
community_t0 = d0$community_group,
purpose_t0 = d0$purpose
)
# regression with covariates
fit_adj <- lm(y ~ a + age + male + nz_european + education + partner +
employed + log_income + nz_dep + agreeableness +
conscientiousness + extraversion + neuroticism + openness +
community_t0 + purpose_t0, data = df)
adj_ate <- coef(fit_adj)["a"]
cat("Adjusted ATE:", round(adj_ate, 3), "\n")
cat("True ATE: ", round(true_ate, 3), "\n")
cat("Bias: ", round(adj_ate - true_ate, 3), "\n")
What changed?
The adjusted estimate should be much closer to the true ATE. Conditioning on confounders breaks the spurious association between exposure and outcome (recall the fork structure from the ggdag tutorial).
G-computation by hand
G-computation estimates the ATE by predicting outcomes under counterfactual treatment assignments. We create two copies of the data, one where everyone is treated and one where everyone is untreated, predict outcomes for each, and take the average difference.
# create counterfactual datasets
df_treated <- df
df_treated$a <- 1
df_control <- df
df_control$a <- 0
# predict outcomes under each scenario
y_hat_treated <- predict(fit_adj, newdata = df_treated)
y_hat_control <- predict(fit_adj, newdata = df_control)
# ATE via g-computation
gcomp_ate <- mean(y_hat_treated - y_hat_control)
cat("G-computation ATE:", round(gcomp_ate, 3), "\n")
cat("True ATE: ", round(true_ate, 3), "\n")
G-computation vs regression coefficient
When the treatment is binary and the model has no interactions, the g-computation ATE equals the regression coefficient on the treatment variable. They diverge when interactions are present, because g-computation averages over the empirical distribution of covariates.
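The claim above can be verified on a small simulated example. This is a sketch with made-up data (one confounder l and a treatment effect that grows with l); the helper gcomp() and the name df_demo are illustrative, not part of the lab script or the causalworkshop package.

```r
set.seed(2026)
n <- 2000
l <- rnorm(n, mean = 1)
a <- rbinom(n, 1, plogis(0.5 * l))
y <- 1 + 0.5 * a + 0.8 * l + 0.6 * a * l + rnorm(n)
df_demo <- data.frame(y, a, l)

# g-computation: predict for everyone under a = 1 and a = 0, then average the difference
gcomp <- function(fit, data) {
  treated <- transform(data, a = 1)
  control <- transform(data, a = 0)
  mean(predict(fit, newdata = treated) - predict(fit, newdata = control))
}

fit_main <- lm(y ~ a + l, data = df_demo)    # no interaction
fit_inter <- lm(y ~ a * l, data = df_demo)   # treatment-by-covariate interaction

# without an interaction the two numbers agree
c(coefficient = coef(fit_main)[["a"]], gcomp = gcomp(fit_main, df_demo))

# with an interaction the coefficient is the effect at l = 0,
# while g-computation averages the effect over the observed values of l
c(coefficient = coef(fit_inter)[["a"]], gcomp = gcomp(fit_inter, df_demo))
```

In the interaction model the g-computation estimate equals the coefficient for a plus the interaction coefficient times mean(l), which is why the two diverge whenever the average of l is not zero.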
ATE via causal forest
A causal forest estimates individual-level treatment effects $\widehat{\tau}(x_i)$ non-parametrically. The ATE is the average of these individual effects, with a valid standard error that accounts for the estimation uncertainty.
# construct matrices for the causal forest
covariate_cols <- c(
"age", "male", "nz_european", "education", "partner", "employed",
"log_income", "nz_dep", "agreeableness", "conscientiousness",
"extraversion", "neuroticism", "openness",
"community_t0", "purpose_t0"
)
X <- as.matrix(df[, covariate_cols])
Y <- df$y
W <- df$a
# fit causal forest
cf <- causal_forest(
X, Y, W,
num.trees = 1000,
honesty = TRUE,
tune.parameters = "all",
seed = 2026
)
# extract ATE with standard error
ate_cf <- average_treatment_effect(cf)
cat("Causal forest ATE:", round(ate_cf["estimate"], 3),
"(SE:", round(ate_cf["std.err"], 3), ")\n")
cat("True ATE: ", round(true_ate, 3), "\n")
What is honesty?
Setting honesty = TRUE splits each tree's training sample in two: one part builds the tree structure, the other estimates the treatment effects within each leaf. Because no observation is used for both jobs, this reduces overfitting and supports valid confidence intervals.
Compare all estimates
results <- data.frame(
method = c("Naive", "Adjusted regression", "G-computation", "Causal forest"),
estimate = c(naive_ate, adj_ate, gcomp_ate, ate_cf["estimate"]),
bias = c(naive_ate - true_ate, adj_ate - true_ate,
gcomp_ate - true_ate, ate_cf["estimate"] - true_ate)
)
results$estimate <- round(results$estimate, 3)
results$bias <- round(results$bias, 3)
print(results)
cat("\nTrue ATE:", round(true_ate, 3), "\n")
Key takeaway
All three adjusted methods (regression, g-computation, causal forest) should recover the true ATE reasonably well. The naive estimate is substantially biased because it does not account for confounding. The causal forest additionally provides valid standard errors and, as we will see in Lab 6, individual-level treatment effect predictions.
Optional extension: the same ATE from an exposure model
So far we have controlled confounding through an outcome model.
G-computation works by modelling $Y \mid A, L$ and then using
predict() to compare the treated and untreated worlds.
IPTW takes the other route. It models treatment assignment, $A \mid L$, then gives more weight to people who received an unexpectedly rare treatment for their covariate pattern. This creates a pseudo-population in which treatment is less confounded by $L$.
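The pseudo-population idea can be seen by hand on a toy dataset before fitting any model. This sketch uses one binary confounder and made-up counts; toy, ps, and w are illustrative names, and the weights here are the simple unstabilised form, 1 / P(A = a | L).

```r
# toy data: 200 people, binary confounder l, binary treatment a
toy <- data.frame(
  l = c(rep(0, 100), rep(1, 100)),
  a = c(rep(1, 20), rep(0, 80), rep(1, 80), rep(0, 20))
)

# P(A = 1 | L): 0.2 when l = 0, 0.8 when l = 1
ps <- ave(toy$a, toy$l)   # mean of a within each level of l
toy$w <- ifelse(toy$a == 1, 1 / ps, 1 / (1 - ps))

# before weighting, the treated are mostly l = 1
mean(toy$l[toy$a == 1])   # 0.8
mean(toy$l[toy$a == 0])   # 0.2

# after weighting, l is balanced across treatment groups
balance_treated <- weighted.mean(toy$l[toy$a == 1], toy$w[toy$a == 1])  # 0.5
balance_control <- weighted.mean(toy$l[toy$a == 0], toy$w[toy$a == 0])  # 0.5
```

The 20 treated people with l = 0 each count five times, and the 80 treated people with l = 1 each count 1.25 times, so both groups contribute equally. With l balanced across treatment groups, it can no longer explain a difference in outcomes between them.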
# model the probability of treatment
ps_model <- glm(
a ~ age + male + nz_european + education + partner +
employed + log_income + nz_dep + agreeableness +
conscientiousness + extraversion + neuroticism + openness +
community_t0 + purpose_t0,
data = df,
family = binomial()
)
ps_hat <- predict(ps_model, type = "response")
# stabilised IPTW weights
p_treated <- mean(df$a)
iptw <- ifelse(
df$a == 1,
p_treated / ps_hat,
(1 - p_treated) / (1 - ps_hat)
)
# quick weight check
tibble(
statistic = c("min", "median", "max"),
value = c(min(iptw), median(iptw), max(iptw))
)
# weighted ATE model
fit_iptw <- lm(y ~ a, data = df, weights = iptw)
iptw_ate <- coef(fit_iptw)[["a"]]
tibble(
method = c("G-computation", "IPTW"),
estimate = c(gcomp_ate, iptw_ate),
bias = c(gcomp_ate - true_ate, iptw_ate - true_ate)
)
What to notice
IPTW is aiming at the same ATE as g-computation, but it gets there through an exposure model rather than an outcome model.
This is why IPTW is useful to see now. Later, doubly robust estimators combine both ideas: an outcome model and an exposure model.
Exercises
Lab diary
Complete at least two of the following exercises for your lab diary.
- Different exposure-outcome pair. Repeat the analysis using religious_service as the exposure and belonging as the outcome. How does the bias of the naive estimate compare? Check the true ATE using mean(d0$tau_religious_belonging).
- Omit baseline adjustment. Re-fit the causal forest without including community_t0 and purpose_t0 in the covariate matrix. How much does the ATE estimate change? Why might baseline values of the exposure and outcome be important confounders?
- Sample size comparison. Generate data with n = 1000 and n = 10000. How do the causal forest ATE estimates and standard errors change? What does this tell you about the precision of causal forest estimates?
Lab 6: Conditional Average Treatment Effects
R script
This lab explores why functional form matters for estimating heterogeneous treatment effects. You will compare parametric and non-parametric estimators, examine individual-level predictions from causal forests, and test whether treatment effects genuinely vary across individuals.
What you will learn
- Why OLS can miss treatment effect heterogeneity
- How to extract individual treatment effect predictions from a causal forest
- How to test for significant heterogeneity using test_calibration()
- How to identify which covariates drive effect modification
Why functional form matters
When treatment effects vary across individuals, the method we use to estimate them matters. A linear model assumes effects change at a constant rate with each covariate; a causal forest can capture non-linear and interactive patterns.
library(causalworkshop)
library(grf)
library(tidyverse)
The simulate_nonlinear_data() function generates data where the true
treatment effect surface is deliberately non-linear, so that flexible
methods outperform rigid ones:
# simulate data with non-linear treatment effects
d_nl <- simulate_nonlinear_data(n = 2000, seed = 2026)
# compare four estimation methods
result <- compare_ate_methods(d_nl)
All four methods (OLS, polynomial, GAM, causal forest) recover the overall ATE reasonably well. But their ability to predict individual effects differs dramatically:
# compare RMSE for individual-level predictions
print(result$summary)
RMSE tells the story
RMSE (root mean squared error) measures how well each method predicts the true individual treatment effect $\tau(x_i)$. A lower RMSE means the method captures the heterogeneity pattern more accurately. OLS assumes a linear effect surface and typically has the highest RMSE.
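To see where the RMSE gap comes from without relying on compare_ate_methods(), the sketch below builds a small base-R version of the same comparison. The simulated data, the curved effect x², and the helpers predicted_effect() and rmse() are assumptions made for illustration; they are not the causalworkshop implementation.

```r
set.seed(2026)
n <- 2000
x <- runif(n, min = -2, max = 2)
a <- rbinom(n, 1, 0.5)        # randomised treatment
tau_true <- x^2               # true effect is curved in x
y <- x + tau_true * a + rnorm(n)
df_curve <- data.frame(y, a, x)

fit_linear <- lm(y ~ a * x, data = df_curve)                # effect linear in x
fit_quadratic <- lm(y ~ a * (x + I(x^2)), data = df_curve)  # effect may curve in x

# predicted effect for each person: treated prediction minus untreated prediction
predicted_effect <- function(fit, data) {
  predict(fit, newdata = transform(data, a = 1)) -
    predict(fit, newdata = transform(data, a = 0))
}
rmse <- function(estimate, truth) sqrt(mean((estimate - truth)^2))

tau_linear <- predicted_effect(fit_linear, df_curve)
tau_quadratic <- predicted_effect(fit_quadratic, df_curve)

# both models give a similar average effect
c(truth = mean(tau_true), linear = mean(tau_linear), quadratic = mean(tau_quadratic))

# only the flexible model tracks the individual effects
rmse_linear <- rmse(tau_linear, tau_true)
rmse_quadratic <- rmse(tau_quadratic, tau_true)
c(linear = rmse_linear, quadratic = rmse_quadratic)
```

The averages agree because treatment is randomised here, but the linear model cannot bend to follow x², so its individual-level predictions are far off. A causal forest plays the role of the flexible model when the shape of the effect is unknown.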
Individual treatment effects from the causal forest
Now we return to the simulated NZAVS data from Lab 5. The causal forest estimates $\widehat{\tau}(x_i)$ for each individual: how much would we expect their outcome to change if they were treated rather than untreated?
# simulate NZAVS data (same as Lab 5)
d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)
d1 <- d |> filter(wave == 1)
d2 <- d |> filter(wave == 2)
# construct matrices
covariate_cols <- c(
"age", "male", "nz_european", "education", "partner", "employed",
"log_income", "nz_dep", "agreeableness", "conscientiousness",
"extraversion", "neuroticism", "openness",
"community_group", "purpose"
)
X <- as.matrix(d0[, covariate_cols])
Y <- d2$purpose
W <- d1$community_group
# fit causal forest
cf <- causal_forest(
X, Y, W,
num.trees = 1000,
honesty = TRUE,
tune.parameters = "all",
seed = 2026
)
Extract predicted individual treatment effects:
# predicted treatment effects for each individual
tau_hat <- predict(cf)$predictions
# summary statistics
cat("Mean tau_hat: ", round(mean(tau_hat), 3), "\n")
cat("SD tau_hat: ", round(sd(tau_hat), 3), "\n")
cat("Range tau_hat: ", round(range(tau_hat), 3), "\n")
Compare with the true individual effects:
# true individual effects from the data-generating process
tau_true <- d0$tau_community_purpose
# how well does the forest recover individual effects?
cat("Correlation(tau_hat, tau_true):", round(cor(tau_hat, tau_true), 3), "\n")
cat("RMSE:", round(sqrt(mean((tau_hat - tau_true)^2)), 3), "\n")
Visualise the distribution of predicted effects:
# histogram of predicted treatment effects
ggplot(data.frame(tau_hat = tau_hat), aes(x = tau_hat)) +
geom_histogram(bins = 40, fill = "steelblue", alpha = 0.7) +
geom_vline(xintercept = mean(tau_hat), colour = "red", linetype = "dashed") +
labs(
title = "Distribution of predicted treatment effects",
x = expression(hat(tau)(x)),
y = "Count"
) +
theme_minimal()
Interpreting the histogram
If treatment effects were homogeneous, this histogram would be tightly concentrated around the ATE. A wide spread indicates heterogeneity: some people benefit more from community group participation than others.
Test for heterogeneity
The test_calibration() function tests whether the forest has detected
genuine heterogeneity, or whether the variation in $\widehat{\tau}(x)$
is just noise.
# test for heterogeneity
cal_test <- test_calibration(cf)
print(cal_test)
Reading the calibration test
The key row is differential.forest.prediction. If its coefficient is significantly greater than zero (p < 0.05), the forest has detected meaningful variation in treatment effects beyond the overall mean. The mean.forest.prediction row tests whether the average effect is non-zero.
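The logic behind the two rows can be sketched without grf. The sketch below assumes a simplified version of the calibration regression: the residualised outcome is regressed, without an intercept, on the residualised treatment multiplied by (a) the mean prediction and (b) each prediction's deviation from that mean. The data, the known propensity of 0.5, and the noisy stand-in for forest predictions (`tau_hat_demo`) are all made up. The real function works from the forest's own out-of-bag estimates and reports robust standard errors, which this sketch omits.

```r
# Illustrative only: a hand-rolled calibration regression on simulated data
set.seed(2026)
n <- 5000
x <- rnorm(n)
w <- rbinom(n, 1, 0.5)
tau <- 0.5 + 0.4 * x                     # assumed true effect (hypothetical)
y <- x + tau * w + rnorm(n)

# Stand-ins for nuisance estimates, known here by construction
w_hat <- rep(0.5, n)                     # propensity score
y_hat <- x + 0.5 * tau                   # E[Y | X]
tau_hat_demo <- tau + rnorm(n, sd = 0.1) # pretend forest predictions

tau_bar <- mean(tau_hat_demo)
calibration_df <- data.frame(
  target = y - y_hat,
  mean_prediction = (w - w_hat) * tau_bar,
  differential_prediction = (w - w_hat) * (tau_hat_demo - tau_bar)
)

# No intercept: both coefficients should sit near 1 when predictions are good
calibration_fit <- lm(
  target ~ 0 + mean_prediction + differential_prediction,
  data = calibration_df
)
print(round(coef(summary(calibration_fit)), 3))
```

A coefficient near 1 on the mean term suggests the average prediction is well calibrated; a coefficient near 1 on the differential term suggests the predicted variation tracks real variation. If the predictions were pure noise, the differential coefficient would hover around zero.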
Variable importance
Which covariates drive the heterogeneity? The variable_importance()
function measures how frequently each variable is used for splitting in
the forest:
# variable importance
var_imp <- variable_importance(cf)
importance_df <- data.frame(
variable = colnames(X),
importance = as.numeric(var_imp)
) |>
arrange(desc(importance))
print(importance_df)
Cross-reference with ground truth
The true treatment effect formula for community group participation on purpose is:
$$\tau = 0.20 + 0.10 \times \text{extraversion} + 0.05 \times \text{partner} - 0.03 \times \text{neuroticism}^2$$
So extraversion, partner status, and neuroticism should appear as important variables. Does the forest recover this pattern?
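As a quick worked example, the formula can be evaluated for a few profiles. The three profiles below are made up, and the sketch assumes the personality scores are roughly centred at zero (the subgroup code in this lab splits extraversion at 0, which suggests this but does not confirm it); `tau_formula()` is a hypothetical helper name.

```r
# True effect formula from the lab text, written as an R function
tau_formula <- function(extraversion, partner, neuroticism) {
  0.20 + 0.10 * extraversion + 0.05 * partner - 0.03 * neuroticism^2
}

# Three hypothetical profiles (values are made up for illustration)
profiles <- data.frame(
  extraversion = c(1, 0, -1),
  partner      = c(1, 0, 0),
  neuroticism  = c(0, 0, 2)
)
profiles$tau <- with(profiles, tau_formula(extraversion, partner, neuroticism))
print(profiles)
```

The first profile (extraverted, partnered, average neuroticism) has an effect of 0.35; the third (introverted, unpartnered, high neuroticism) has 0.20 - 0.10 - 0.12 = -0.02, so the formula allows slightly negative effects. Because neuroticism enters as a square, high and low scores reduce the effect equally.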
Subgroup analysis
We can examine whether predicted effects differ across subgroups defined by the important covariates:
# compare effects by extraversion
high_extra <- tau_hat[d0$extraversion > 0]
low_extra <- tau_hat[d0$extraversion <= 0]
cat("Mean tau_hat (high extraversion):", round(mean(high_extra), 3), "\n")
cat("Mean tau_hat (low extraversion): ", round(mean(low_extra), 3), "\n")
cat("Difference: ", round(mean(high_extra) - mean(low_extra), 3), "\n")
# compare effects by partner status
partnered <- tau_hat[d0$partner == 1]
unpartnered <- tau_hat[d0$partner == 0]
cat("\nMean tau_hat (partnered): ", round(mean(partnered), 3), "\n")
cat("Mean tau_hat (unpartnered):", round(mean(unpartnered), 3), "\n")
cat("Difference: ", round(mean(partnered) - mean(unpartnered), 3), "\n")
Do the subgroup differences match the ground truth?
The tau formula adds $+0.10 \times \text{extraversion}$ and $+0.05 \times \text{partner}$. Highly extraverted and partnered individuals should show larger predicted treatment effects. Check whether this matches what you observe.
Predicted vs true effects scatter plot
# scatter plot of predicted vs true individual effects
ggplot(data.frame(true = tau_true, predicted = tau_hat),
aes(x = true, y = predicted)) +
geom_point(alpha = 0.1, colour = "steelblue") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "red") +
labs(
title = "Predicted vs true individual treatment effects",
x = expression(tau(x)),
y = expression(hat(tau)(x))
) +
theme_minimal()
Key takeaway
Causal forests can detect meaningful heterogeneity in treatment effects without requiring the analyst to specify the functional form in advance. The test_calibration() function provides a formal test for heterogeneity, and variable_importance() identifies which covariates drive it. In Lab 8, we will use these individual predictions to evaluate targeting strategies.
Exercises
Lab diary
Complete at least two of the following exercises for your lab diary.
- Different seed. Run compare_ate_methods() with seed = 42 instead of seed = 2026. Do the relative RMSE rankings change? Why or why not?
- Different exposure-outcome pair. Fit a causal forest for volunteer_work on self_esteem. Run test_calibration() and variable_importance(). Which covariates drive heterogeneity? Does this match the ground-truth tau formula? (Hint: check the simulate_nzavs_data documentation.)
- Why does OLS miss heterogeneity? In one paragraph, explain why a linear model that includes only main effects cannot capture the $-0.03 \times \text{neuroticism}^2$ term in the treatment effect formula. What would you need to add to the linear model to capture this non-linearity?
Lab 8: RATE and QINI Curves
R script
This lab evaluates whether targeting treatment to those predicted to benefit most improves outcomes compared with treating everyone. You will compute RATE curves and QINI curves from causal forest predictions and assess targeting efficiency at different population percentiles.
What you will learn
- How to rank individuals by predicted treatment benefit
- How to compute and interpret RATE curves (gain over random assignment)
- How to compute and interpret QINI curves (cumulative targeting gain)
- How to characterise the covariate profile of high-benefit individuals
Connection to previous labs
This lab builds directly on Labs 5 and 6. You will use the causal forest fitted in those labs to evaluate whether targeting resources to the most responsive individuals is worthwhile.
Setup
Install packages before class, then restart R. Follow Lab Setup: R
Packages and Build Tools first. The block below mirrors
what scripts/lab-08.R runs at the top: it installs missing CRAN
packages and requires causalworkshop >= 0.6.0, stopping with a clear
restart-R message if a stale namespace is already loaded.
required_packages <- c("grf", "tidyverse")
missing <- required_packages[
!vapply(required_packages, \(p) requireNamespace(p, quietly = TRUE), logical(1))
]
if (length(missing) > 0) install.packages(missing)
if (!requireNamespace("pak", quietly = TRUE)) install.packages("pak")
if (!requireNamespace("causalworkshop", quietly = TRUE) ||
packageVersion("causalworkshop") < "0.6.0") {
pak::pak("go-bayes/causalworkshop")
if ("causalworkshop" %in% loadedNamespaces()) {
stop("causalworkshop was upgraded; please restart R and re-run.", call. = FALSE)
}
}
suppressPackageStartupMessages({
library(causalworkshop)
library(grf)
library(tidyverse)
})
Re-fit the causal forest from Labs 5-6 (or copy the code from Lab 5):
# simulate data
d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)
d1 <- d |> filter(wave == 1)
d2 <- d |> filter(wave == 2)
# construct matrices
covariate_cols <- c(
"age", "male", "nz_european", "education", "partner", "employed",
"log_income", "nz_dep", "agreeableness", "conscientiousness",
"extraversion", "neuroticism", "openness",
"community_group", "purpose"
)
X <- as.matrix(d0[, covariate_cols])
Y <- d2$purpose
W <- d1$community_group
# fit causal forest
cf <- causal_forest(
X, Y, W,
num.trees = 1000,
honesty = TRUE,
tune.parameters = "all",
seed = 2026
)
# extract predicted individual treatment effects
tau_hat <- predict(cf)$predictions
Rank individuals by predicted benefit
The first step in any targeting analysis is to sort individuals from highest to lowest predicted treatment effect:
# sort by predicted benefit (descending)
n <- length(tau_hat)
tau_order <- order(tau_hat, decreasing = TRUE)
tau_sorted <- tau_hat[tau_order]
# what does the top of the distribution look like?
cat("Top 5 predicted effects: ", round(head(tau_sorted, 5), 3), "\n")
cat("Bottom 5 predicted effects:", round(tail(tau_sorted, 5), 3), "\n")
cat("Overall mean: ", round(mean(tau_hat), 3), "\n")
RATE curve
The RATE (Rank-Weighted Average Treatment Effect) curve shows how much we gain by targeting treatment to the top $q\%$ of predicted beneficiaries, compared with random assignment.
For each targeting rate $q$, we compute the average predicted effect among the top $q\%$ of individuals, minus the overall average:
# compute RATE curve
rates <- seq(0.05, 1.00, by = 0.05)
rate_results <- tibble(
rate = numeric(),
avg_tau_targeted = numeric(),
gain_over_random = numeric()
)
for (r in rates) {
n_targeted <- floor(r * n)
targeted_idx <- tau_order[seq_len(n_targeted)]
avg_targeted <- mean(tau_hat[targeted_idx])
gain <- avg_targeted - mean(tau_hat)
rate_results <- bind_rows(
rate_results,
tibble(rate = r, avg_tau_targeted = avg_targeted, gain_over_random = gain)
)
}
print(rate_results |> mutate(across(where(is.numeric), \(x) round(x, 3))))
Plot the RATE curve:
ggplot(rate_results, aes(x = rate, y = gain_over_random)) +
geom_line(colour = "#E69F00", linewidth = 1) +
geom_point(colour = "#E69F00", size = 2) +
scale_x_continuous(labels = scales::percent_format()) +
labs(
title = "RATE curve: gain from targeting",
x = "Targeting rate (proportion treated)",
y = "Gain over random assignment"
) +
theme_minimal()
Reading the RATE curve
A steep curve at low targeting rates means a small group benefits substantially more than average. A flat curve means everyone benefits similarly, and targeting adds no value. The curve always reaches zero at 100% (treating everyone is the same as random).
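A toy example small enough to check by hand may help here. The ten values below are made up (their mean is 0.4), and the object names are hypothetical. The sketch uses a cumulative mean to get the top-k average for every k in one step, which is an alternative to a loop, not the method the lab script uses.

```r
# Ten made-up predicted effects (hypothetical values, mean = 0.4)
tau_toy <- c(0.9, 0.1, 0.4, 0.2, 0.6, 0.3, 0.8, 0.0, 0.5, 0.2)
tau_sorted_toy <- sort(tau_toy, decreasing = TRUE)
k <- seq_along(tau_sorted_toy)

# Average predicted effect among the top k, for every k at once
avg_top_k <- cumsum(tau_sorted_toy) / k

# Gain over random assignment: top-k average minus the overall average
gain_toy <- avg_top_k - mean(tau_toy)

print(data.frame(
  share_treated = k / length(tau_toy),
  avg_top_k = round(avg_top_k, 3),
  gain_over_random = round(gain_toy, 3)
))
```

Treating only the top person gives a gain of 0.9 - 0.4 = 0.5. The gain shrinks as more people are treated and is exactly zero once everyone is included.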
QINI curve
The QINI curve measures the cumulative gain from targeting. For each percentile $p$, it computes the total benefit from targeting the top $p\%$, minus the proportional share they would get under random assignment:
# compute QINI curve
qini_results <- tibble(
percentile = numeric(),
cumulative_gain = numeric()
)
for (p in rates) {
n_top <- floor(p * n)
top_idx <- tau_order[seq_len(n_top)]
# cumulative gain: total effect for targeted minus proportional share
cum_gain <- sum(tau_hat[top_idx]) - p * sum(tau_hat)
qini_results <- bind_rows(
qini_results,
tibble(percentile = p, cumulative_gain = cum_gain)
)
}
print(qini_results |> mutate(across(where(is.numeric), \(x) round(x, 3))))
Plot the QINI curve:
ggplot(qini_results, aes(x = percentile, y = cumulative_gain)) +
geom_line(colour = "#56B4E9", linewidth = 1) +
geom_point(colour = "#56B4E9", size = 2) +
scale_x_continuous(labels = scales::percent_format()) +
labs(
title = "QINI curve: cumulative targeting gain",
x = "Population percentile",
y = "Cumulative gain over random"
) +
theme_minimal()
Compute the area under the QINI curve (AUQC) via trapezoidal approximation:
# area under QINI curve via trapezoidal rule
qini_for_area <- bind_rows(
tibble(percentile = 0, cumulative_gain = 0),
qini_results
)
auqc <- sum(
diff(qini_for_area$percentile) *
(head(qini_for_area$cumulative_gain, -1) +
tail(qini_for_area$cumulative_gain, -1)) / 2
)
cat("Area Under QINI Curve (AUQC):", round(auqc, 3), "\n")
AUQC interpretation
A larger AUQC means targeting is more valuable. An AUQC near zero means there is little heterogeneity to exploit, and random assignment performs nearly as well as targeted assignment.
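The cumulative-gain and trapezoid calculations can be checked by hand on a toy vector. The ten predicted effects below are made up (they sum to 4.0), and the object names are hypothetical.

```r
# Ten made-up predicted effects (hypothetical values, sum = 4.0)
tau_toy <- c(0.9, 0.1, 0.4, 0.2, 0.6, 0.3, 0.8, 0.0, 0.5, 0.2)
n_toy <- length(tau_toy)
tau_sorted_toy <- sort(tau_toy, decreasing = TRUE)
share <- seq_len(n_toy) / n_toy

# Cumulative gain: total effect among the top share, minus the share of the
# total that a randomly chosen group of the same size would receive
qini_toy <- cumsum(tau_sorted_toy) - share * sum(tau_toy)

# Trapezoidal area, anchoring the curve at (0, 0)
share0 <- c(0, share)
qini0 <- c(0, qini_toy)
auqc_toy <- sum(diff(share0) * (head(qini0, -1) + tail(qini0, -1)) / 2)

print(round(qini_toy, 2))
cat("Toy AUQC:", round(auqc_toy, 3), "\n")
```

The curve rises to 1.2 at the 40% and 50% marks, then falls back to zero at 100%; the area under it is 0.8. If all ten values were equal, every point on the curve, and therefore the area, would be zero.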
Targeting efficiency
Create a summary table comparing the top 10%, 20%, and 50% of predicted beneficiaries:
# targeting efficiency at key percentiles
top_10_idx <- tau_order[seq_len(floor(0.10 * n))]
top_20_idx <- tau_order[seq_len(floor(0.20 * n))]
top_50_idx <- tau_order[seq_len(floor(0.50 * n))]
overall_mean <- mean(tau_hat)
efficiency <- tibble(
group = c("Top 10%", "Top 20%", "Top 50%", "Everyone"),
avg_effect = c(
mean(tau_hat[top_10_idx]),
mean(tau_hat[top_20_idx]),
mean(tau_hat[top_50_idx]),
mean(tau_hat)
)
) |>
mutate(
gain_vs_random = avg_effect - overall_mean,
lift_vs_random = if_else(
abs(overall_mean) > 1e-8,
avg_effect / overall_mean,
NA_real_
),
efficiency_gain_pct = if_else(
abs(overall_mean) > 1e-8,
round((lift_vs_random - 1) * 100, 1),
NA_real_
)
)
print(efficiency |> mutate(across(where(is.numeric), \(x) round(x, 3))))
If mean(tau_hat) is close to zero, prefer gain_vs_random over
lift_vs_random because ratios become unstable.
Characterise the covariate profile of high-benefit individuals:
# who are the top 10%?
top_10_data <- d0[tau_order[seq_len(floor(0.10 * n))], ]
everyone <- d0
cat("Top 10% vs everyone:\n")
cat(" Extraversion: ", round(mean(top_10_data$extraversion), 2),
"vs", round(mean(everyone$extraversion), 2), "\n")
cat(" Neuroticism: ", round(mean(top_10_data$neuroticism), 2),
"vs", round(mean(everyone$neuroticism), 2), "\n")
cat(" Partner (prop): ", round(mean(top_10_data$partner), 2),
"vs", round(mean(everyone$partner), 2), "\n")
cat(" Agreeableness: ", round(mean(top_10_data$agreeableness), 2),
"vs", round(mean(everyone$agreeableness), 2), "\n")
Key takeaway
RATE and QINI curves translate heterogeneous treatment effects into actionable targeting decisions. If the curves are steep, concentrating resources on high-benefit individuals improves overall outcomes. If the curves are flat, treating everyone equally is just as effective. In Lab 9, we will learn how to express these targeting decisions as simple, interpretable rules using policy trees.
Exercises
Lab diary
Complete at least two of the following exercises for your lab diary.
- Different outcome. Compute RATE and QINI curves for a different outcome (e.g., belonging or life_satisfaction). Is the AUQC larger or smaller? What does this imply about targeting?
- Negative effects. Some individuals may have $\widehat{\tau}(x) < 0$, meaning the treatment is predicted to harm them. How many individuals in your sample have negative predicted effects? What are the implications for resource allocation?
- AUTOC vs QINI weighting. The RATE curve (AUTOC weighting) emphasises the top of the ranking, while the QINI curve weights all percentiles equally. In one paragraph, explain when each metric would be more useful for a policy-maker.
Lab 9: Policy Trees (multi-outcome workflow with margot)
R script
Download the R script for this lab
The script checks that the required packages are already installed and downloads the ~80 MB cache on first run. Run the setup block below before class, restart R, then run the script.
This lab moves from a CATE ranking to an explicit allocation rule. Lab 8 asked whether targeting has value using RATE and QINI curves. Lab 9 asks how to turn that information into a short rule someone could read, explain, and contest.
What you will learn
- How to read depth-1 and depth-2 policy trees.
- How to select tree depth using a stated parsimony threshold.
- How to interpret policy coverage: the share of people the rule treats.
- How to translate one tree into plain language.
- How to state the limits of a policy-tree rule.
Connection to previous labs
Lab 5 introduced the average treatment effect. Lab 6 introduced the conditional average treatment effect (CATE). Lab 8 evaluated whether targeting based on a CATE ranking has practical value. This lab adds the next idea: turning a CATE ranking into a transparent allocation rule.
How this lab is different
Earlier labs taught the pieces separately: average treatment effects,
conditional average treatment effects (CATEs), causal forests, RATE, and
QINI. Today's lab uses a cached margot workflow so we can put those
pieces together and spend the session reading policy trees. The cache is
a teaching object: it gives us fitted forests, depth comparisons,
policy-tree plots, and prose summaries without asking every laptop to
refit the models during class.
Use the course workflow as normative. The full manuscript workflows are more complex because they answer different questions, use real data, and add extra diagnostics. This lab teaches the sequence of decisions students need for the course: estimate effects, inspect heterogeneity, state a parsimony rule, read the allocation rule, and explain its limits.
Cached fits
The cache holds three artefacts:
- models_binary: the batch causal forest (one per outcome) with augmented inverse-propensity-weighted (AIPW) scores, out-of-bag predictions, and the combined ATE table.
- policy_tree_stability: bootstrap-based stability output for each outcome's depth-1 and depth-2 policy trees.
- policy_workflow: the interpretive layer, with depth recommendations, plots, and auto-generated prose summaries.
The fit script that produced the cache is at scripts/fit-lab-09-cache.R. You do not need to run it during the lab; it is there so you can see exactly what was fitted.
Teaching simplification
The cache is deliberately smaller than the lab's research scripts. It is designed to teach the sequence of decisions, not to reproduce the full employer-gratitude analysis workflow.
Setup
Install packages before class, then restart R. Follow Lab Setup: R Packages and Build Tools first. The lab script no longer tries to build GitHub packages while the lab is running, because this failed on some student laptops and can take a long time even when it works.
cran_packages <- c(
"ggplot2", "dplyr", "tibble", "arrow", "qs2", "googledrive"
)
missing_cran <- cran_packages[
!vapply(cran_packages, \(p) requireNamespace(p, quietly = TRUE), logical(1))
]
if (length(missing_cran) > 0) install.packages(missing_cran)
if (!requireNamespace("pak", quietly = TRUE)) install.packages("pak")
if (!requireNamespace("margot", quietly = TRUE)) {
pak::pak("go-bayes/margot")
}
if (!requireNamespace("causalworkshop", quietly = TRUE) ||
packageVersion("causalworkshop") < "0.6.2") {
pak::pak("go-bayes/causalworkshop")
if ("causalworkshop" %in% loadedNamespaces()) {
stop("causalworkshop was upgraded; please restart R and re-run.", call. = FALSE)
}
}
suppressPackageStartupMessages({
library(causalworkshop)
library(margot)
library(ggplot2)
library(dplyr)
})
If the GitHub package installation still fails after installing build tools, stop there and use the course lab machine or the pre-installed lab environment. Do not try to debug compilers during the lab.
Load the cached fits. The first call downloads roughly 80 MB into a per-user cache directory; subsequent calls read from disk.
cache <- causalworkshop::load_policy_learning_cache()
models_binary <- cache$models_binary
policy_tree_stability <- cache$policy_tree_stability
wf <- cache$policy_workflow
For reference, the cache was produced by the following sequence:
# models_binary <- margot::margot_causal_forest(...)
# full call shown in the R script; it is abbreviated here because it is long
policy_tree_stability <- margot::margot_policy_tree_stability(
model_results = models_binary,
depth = 2,
n_iterations = 100,
vary_type = "split_only",
label_mapping = label_mapping,
seed = 2026
)
wf <- margot::margot_policy_workflow(
stability = policy_tree_stability,
original_df = df_grf,
label_mapping = label_mapping,
audience = "policy",
interpret_models = "recommended",
plot_models = "recommended"
)
The first line is abbreviated because fitting the four causal forests takes too long for the lab. The full call is commented in the R script.
The four outcomes are purpose, belonging, self-esteem, and life satisfaction at wave 2. The exposure is community-group participation at wave 1. We pass a label mapping to every plot and table call so the figures are legible.
label_mapping <- list(
model_t2_purpose = "Sense of Purpose",
model_t2_belonging = "Belonging",
model_t2_self_esteem = "Self-esteem",
model_t2_life_satisfaction = "Life satisfaction"
)
Step 1: quick evidence check
The combined ATE table holds one row per outcome with the risk-difference effect, a 95% confidence interval, and two E-values. This is a quick check before reading policy trees. Do not spend the lab reinterpreting RATE or QINI; that was Lab 8.
print(models_binary$combined_table)
Convert the table into a forest plot:
ate_plot <- margot_plot(
models_binary$combined_table,
type = "RD",
order = "magnitude_desc",
e_val_bound_threshold = 1.2,
label_mapping = label_mapping,
save_path = tempdir()
)
print(ate_plot$plot)
Reading the forest plot
The dashed line at zero is the null. Estimates whose E-value bound is close to 1 are fragile: a relatively modest unmeasured confounder could explain the effect away. Treat those rows with caution when you describe the result. The plot's $interpretation slot contains a one-sentence draft naming the outcomes most worth treating as causal. Read it as a draft, not a verdict.
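For intuition about why an E-value near 1 signals fragility, the standard E-value for a risk ratio RR of at least 1 is RR + sqrt(RR × (RR - 1)), with ratios below 1 inverted first. The sketch below is a hand-rolled helper, not margot's code: `e_value()` is a hypothetical name, and it takes a risk ratio as given rather than reproducing however the lab's table derives E-values from risk-difference estimates.

```r
# E-value for a risk ratio (hypothetical helper, for intuition only)
e_value <- function(rr) {
  rr <- ifelse(rr < 1, 1 / rr, rr)   # protective effects: invert first
  rr + sqrt(rr * (rr - 1))
}

# Made-up risk ratios, from no effect to a doubling of risk
example_rr <- c(1.0, 1.1, 1.5, 2.0)
print(data.frame(
  risk_ratio = example_rr,
  e_value = round(e_value(example_rr), 2)
))
```

A risk ratio of exactly 1 gives an E-value of 1: no unmeasured confounding is needed to explain a null result. The further the estimate (or the confidence limit nearest the null) sits from 1, the stronger a confounder would have to be to explain it away.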
Step 2: quick heterogeneity check
The ATE asks "does it work on average?". The next question is "does it
work the same for everyone?". margot_omnibus_hetero_test() wraps
grf::test_calibration(). The output's "Differential prediction"
coefficient is what matters: a positive coefficient with a small p-value
means the forest sees genuine heterogeneity, not just sampling noise.
omnibus <- margot_omnibus_hetero_test(
models_binary,
label_mapping = label_mapping
)
print(omnibus)
What if heterogeneity is absent?
If an outcome shows reliable mean prediction but no differential prediction, the forest is telling you that targeting will not beat random allocation for that outcome. Reporting only the ATE is the honest move.
Step 3: policy-tree summary tables
Start with the policy-tree summary. Coverage is the share of
participants the learned rule recommends for treatment. This is an
output of the tree, not a fixed budget chosen by the analyst.
cat("\n=== policy-tree one-page summary ===\n")
print(wf$policy_brief_df)
Then compare depth-1 and depth-2. A deeper tree is useful only if the gain is worth the extra complexity.
cat("\n=== depth comparison ===\n")
print(wf$best$depth_summary_df)
The depth choice is not mechanical. Investigators must state how much
extra policy value is needed before a depth-2 tree is worth using.
margot exposes that choice through min_gain_for_depth_switch.
depth_thresholds <- c(0.005, 0.01, 0.03)
depth_sensitivity <- dplyr::bind_rows(lapply(depth_thresholds, \(threshold) {
best_at_threshold <- suppressMessages(suppressWarnings(
margot::margot_policy_summary_compare_depths(
policy_tree_stability,
label_mapping = label_mapping,
min_gain_for_depth_switch = threshold,
verbose = FALSE
)
))
best_at_threshold$depth_summary_df |>
dplyr::transmute(
threshold = threshold,
outcome = outcome_label,
selected_depth = depth_selected,
pv_depth1 = pv_depth1,
pv_depth2 = pv_depth2,
gain_depth2_minus_depth1 = pv_depth2 - pv_depth1
)
}))
cat("\n=== depth selection sensitivity ===\n")
print(depth_sensitivity)
Parsimony threshold
A threshold of 0.005 says a depth-2 tree is worth using for a gain above 0.005 outcome units. A threshold of 0.03 says the extra split must buy at least 0.03 units. The threshold is an investigator judgement about interpretability, implementation cost, and expected value.
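The decision rule itself is simple enough to sketch in a few lines. This is an illustration of the idea, not margot's implementation: `select_depth()` is a hypothetical helper, the policy values are made up, and whether the real comparison is strict or inclusive at the boundary is an assumption here.

```r
# Hypothetical helper: prefer depth 2 only when its extra policy value
# reaches the stated threshold (boundary handling is an assumption)
select_depth <- function(pv_depth1, pv_depth2, min_gain) {
  ifelse(pv_depth2 - pv_depth1 >= min_gain, 2L, 1L)
}

# Made-up policy values: depth 2 adds 0.02 outcome units over depth 1
pv1 <- 0.10
pv2 <- 0.12
thresholds <- c(0.005, 0.01, 0.03)

print(data.frame(
  threshold = thresholds,
  gain = pv2 - pv1,
  selected_depth = select_depth(pv1, pv2, thresholds)
))
```

With a gain of 0.02, the two looser thresholds select depth 2 and the strictest selects depth 1. The same estimates can therefore support different rules depending on a threshold the investigator must state and defend.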
Coverage matters
If a policy tree recommends treatment for 90-97% of people, it is mostly saying "treat nearly everyone". That can be a valid learned rule, but it is not a scarce-budget allocation policy. If a real programme can only treat 20% of people, the budget constraint must be added separately.
In this lab the policy-tree objective contains no treatment-cost term. A do not treat leaf means the outcome-only rule assigns the no-treatment action for that profile; it does not report money saved or resources freed. A cost-sensitive policy would need a different objective, for example subtracting a cost c from the treatment reward and refitting the tree across plausible values of c.
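To see how a cost term or a budget changes coverage, consider a plug-in sketch on made-up predictions. It is deliberately simpler than refitting a policy tree: it thresholds hypothetical predicted effects (`tau_hat_demo`) directly, with the cost expressed in outcome units. All names and values are assumptions for illustration.

```r
# Illustrative only: made-up predicted effects, not output from the lab cache
set.seed(2026)
tau_hat_demo <- rnorm(1000, mean = 0.2, sd = 0.15)

# Cost-sensitive rule: treat only when predicted benefit exceeds the cost
costs <- c(0, 0.1, 0.2, 0.3)
coverage_by_cost <- sapply(costs, function(cost) mean(tau_hat_demo - cost > 0))
names(coverage_by_cost) <- paste0("c = ", costs)
print(round(coverage_by_cost, 2))

# Budget rule: treat only the top 20% of predicted effects
budget_share <- 0.20
cutoff <- quantile(tau_hat_demo, probs = 1 - budget_share)
treat_under_budget <- tau_hat_demo > cutoff
cat("Share treated under the 20% budget:", mean(treat_under_budget), "\n")
```

With no cost, most people clear the bar; as the cost rises, coverage falls. The budget rule fixes coverage in advance and lets the cutoff adjust instead, which is a different question from the one the outcome-only tree answers.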
Step 4: render policy trees
A policy tree converts the personalised CATE into a transparent allocation rule: "treat people in this leaf, do not treat in this leaf". The lab caps depth at two, so each rule contains at most three yes/no questions, and any one person answers at most two of them before the rule commits.
Two functions render each outcome's policy tree.
margot_plot_decision_tree() returns the tree diagram alone, the rule
as a slide-ready flowchart. margot_plot_policy_tree() returns the
prediction-points scatter, the same rule shown as a partition of the
underlying covariate space. The script saves every plot to
lab-09-policy-tree-plots/ so you can inspect the trees outside the
RStudio plot pane.
model_ids <- names(label_mapping)
plot_dir <- file.path(getwd(), "lab-09-policy-tree-plots")
dir.create(plot_dir, showWarnings = FALSE, recursive = TRUE)
policy_tree_plots <- list()
for (m in model_ids) {
cat("\n=== ", label_mapping[[m]], " ===\n", sep = "")
depth1_tree <- margot_plot_decision_tree(
policy_tree_stability,
model_name = m,
max_depth = 1,
label_mapping = label_mapping
)
depth2_tree <- margot_plot_decision_tree(
policy_tree_stability,
model_name = m,
max_depth = 2,
label_mapping = label_mapping
)
depth2_scatter <- margot_plot_policy_tree(
policy_tree_stability,
model_name = m,
max_depth = 2,
label_mapping = label_mapping
)
suppressWarnings(print(depth1_tree))
suppressWarnings(print(depth2_tree))
suppressWarnings(print(depth2_scatter))
suppressWarnings(ggplot2::ggsave(file.path(plot_dir, paste0(m, "-depth1-tree.png")),
depth1_tree, width = 8, height = 5, dpi = 150
))
suppressWarnings(ggplot2::ggsave(file.path(plot_dir, paste0(m, "-depth2-tree.png")),
depth2_tree, width = 8, height = 5, dpi = 150
))
suppressWarnings(ggplot2::ggsave(file.path(plot_dir, paste0(m, "-depth2-scatter.png")),
depth2_scatter, width = 8, height = 5, dpi = 150
))
policy_tree_plots[[m]] <- list(
depth1_tree = depth1_tree,
depth2_tree = depth2_tree,
depth2_scatter = depth2_scatter
)
}
saved_policy_plots <- list.files(plot_dir, pattern = "\\.png$", full.names = TRUE)
print(saved_policy_plots)
Reading a policy tree
Each non-leaf node names a covariate and a threshold. Branches go left if the condition is true and right otherwise. Each leaf says "treat" or "do not treat". A useful policy tree can be repeated by a clinician, teacher, or community organiser without opening the model.
Read "do not treat" as "the rule assigns the no-treatment action under the current outcome-only objective." The current tree compares expected wellbeing outcomes under treatment and no treatment; it does not attach a resource saving to the no-treatment action.
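A depth-2 tree is equivalent to a short nested if/else. The sketch below writes one out as an R function; the splitting variables and thresholds are invented for illustration and are not the trees stored in the lab cache. `policy_rule()` is a hypothetical name.

```r
# Hypothetical depth-2 rule: one root question, then one follow-up per branch
policy_rule <- function(extraversion, neuroticism) {
  ifelse(
    extraversion <= 0,
    # left branch (condition true)
    ifelse(neuroticism <= 0.5, "treat", "do not treat"),
    # right branch (condition false)
    ifelse(neuroticism <= 1.5, "treat", "do not treat")
  )
}

# Four made-up people, one per leaf
people <- data.frame(
  extraversion = c(-1, -1, 1, 1),
  neuroticism  = c(0, 1, 1, 2)
)
people$action <- policy_rule(people$extraversion, people$neuroticism)
print(people)

# Coverage: the share of people the rule recommends for treatment
coverage <- mean(people$action == "treat")
cat("Coverage:", coverage, "\n")
```

Writing the rule this way makes two things visible: each person passes through exactly two questions, and coverage falls out of the rule rather than being set in advance.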
Inspect every graph
Open every file printed in saved_policy_plots. For each outcome, compare the depth-1 tree, the depth-2 tree, and the depth-2 scatter plot. Ask three questions: what rule is being learned, whether the extra depth changes the rule in a useful way, and whether the scatter plot suggests a stable separation or a fragile threshold.
Step 5: translate one rule
After inspecting all graphs, use the Purpose tree as the worked example:
worked_model <- "model_t2_purpose"
file.path(plot_dir, paste0(worked_model, "-depth2-tree.png"))
Open that file. Write the rule as a sentence:
If [condition], recommend treatment; otherwise [condition].
Then add one sentence about coverage: what share of people would be treated under the rule?
Finally, add one limitation. For example: the splitters are not causes, the rule may rely on proxy variables, the coverage may be too broad for a scarce programme, or the threshold cases may be fragile.
Step 6: read the auto-generated prose
margot synthesises a draft narrative from the policy-workflow object.
Read it after you have inspected the trees yourself. The prose is
generated from the same numbers you saw above; it does not bring new
information.
cat(wf$report_prose)
Read the prose critically
Auto-generated text is a draft. Check the numbers against the tables you printed earlier. Replace any phrasing that overstates causal certainty. The prose is a starting point for your own writing, not a final answer.
Step 7: audit against simulator ground truth
The analyses above never see the truth. The simulator stores the true
individual treatment effects in tau_community_<outcome> columns.
Ranking outcomes by their true population mean tau lets you audit
margot's recommendations without circularity.
d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)
true_tau_table <- tibble::tibble(
outcome = c("Sense of Purpose", "Belonging", "Self-esteem", "Life satisfaction"),
true_mean_tau = c(
mean(d0$tau_community_purpose),
mean(d0$tau_community_belonging),
mean(d0$tau_community_self_esteem),
mean(d0$tau_community_life_satisfaction)
),
true_sd_tau = c(
sd(d0$tau_community_purpose),
sd(d0$tau_community_belonging),
sd(d0$tau_community_self_esteem),
sd(d0$tau_community_life_satisfaction)
)
) |>
arrange(desc(true_mean_tau))
print(true_tau_table)
The true_sd_tau column is the headroom for targeting: it tells you
whether even a perfect ranking would deliver appreciable extra benefit
beyond the average. Use this table to keep the policy trees honest. A
tree cannot recover heterogeneity that is weak, noisy, or absent.
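The link between the spread of true effects and targeting headroom can be illustrated with an oracle calculation. The sketch below assumes two made-up populations with the same mean effect but different standard deviations, and asks how much a perfect ranking would gain by treating the top 20%. `oracle_gain()` is a hypothetical helper, and the normal distributions are an assumption, not the simulator's data-generating process.

```r
# Illustrative only: gain from a perfect ranking under two assumed spreads
set.seed(2026)
n_demo <- 5000

oracle_gain <- function(tau, share = 0.20) {
  n_top <- floor(share * length(tau))
  top <- sort(tau, decreasing = TRUE)[seq_len(n_top)]
  mean(top) - mean(tau)   # average effect in the top share, minus the overall mean
}

tau_narrow <- rnorm(n_demo, mean = 0.2, sd = 0.02)  # nearly homogeneous
tau_wide   <- rnorm(n_demo, mean = 0.2, sd = 0.20)  # strongly heterogeneous

gain_narrow <- oracle_gain(tau_narrow)
gain_wide <- oracle_gain(tau_wide)
cat("Oracle gain, SD = 0.02:", round(gain_narrow, 3), "\n")
cat("Oracle gain, SD = 0.20:", round(gain_wide, 3), "\n")
```

Even with perfect knowledge of every individual effect, the narrow population offers little to gain from targeting. Any estimated rule, which ranks people imperfectly, can only do worse than this ceiling.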
Ethical considerations
A high-value rule still needs judgement
A policy tree maximises expected treatment benefit under the objective you give it. Before anyone acts on it, ask:
- Objective. Does the rule optimise wellbeing gain only, or has a treatment cost been specified in the same units as the outcome?
- Proxy variables. Does the tree split on variables correlated with protected characteristics such as ethnicity, gender, or socioeconomic status?
- Fairness. Does targeting those who benefit most leave out people who would still benefit?
- Transparency. Can a community organiser, teacher, clinician, or member of the public understand the rule?
- Override. When should someone override the rule because they know relevant context the model cannot see?
When presenting policy tree results, discuss these trade-offs plainly.
Exercises
Lab diary
Complete at least two of the following exercises for your lab diary.
- Translate a policy tree. Pick one outcome. Translate its depth-2 policy tree into a rule a non-technical reader could follow.
- Compare depth. For the same outcome, compare the depth-1 and depth-2 rules. Does the extra depth seem worth the extra complexity? Use the depth summary table and the plots.
- Coverage and budget. Use wf$policy_brief_df. If the rule treats more than 80% of people, explain why that is not a scarce-budget policy.
- Discuss override. In one paragraph, describe a scenario in which a community-group co-ordinator should override a policy tree recommendation. What information would they have that the model does not?
Lab 10: End-to-End Research Report
Lab materials
Download the final assessment Option A report template. If the course-site download fails, use the Google Drive mirror.
Run the setup block in Lab Setup, including quarto install tinytex from a terminal, then restart R. Work from the project the report-template unzips into. Before attempting the manuscript, open setup.R and run the whole script. The first run downloads the required materials and packages, which may take about 5 to 10 minutes. When setup.R finishes without an error, open and render manuscript.qmd.
This lab puts the whole course together in one workflow. By the end of the session you will have a draft Option A research report rendered to PDF: an outcome-wide average treatment effect table, a forest plot, per-outcome policy trees, and an ethics paragraph.
What you will learn
- How to set up a single-source-of-truth setup.R for a research report.
- How to estimate four average treatment effects in one batch with margot::margot_causal_forest().
- How to apply a Bonferroni correction and report E-values for an outcome-wide design.
- How to fit policy trees and choose between depth-1 and depth-2 with a stated parsimony threshold.
- How to apply a transparent graphing rule so you graph only policies that survive a stated reporting test.
- How to build a research report in Quarto Markdown with citations.
Setup
Install packages before class, then restart R. Follow Lab Setup: R Packages and Build Tools first.
PDF rendering has one extra system step. Run this in a terminal, not in the R console:
quarto install tinytex
quarto list tools
On macOS, use Terminal or the RStudio Terminal tab. On Windows, use
Command Prompt, PowerShell, or the RStudio Terminal tab. If Windows says
quarto is not recognised, install Quarto from
https://quarto.org/docs/download/, close and reopen RStudio, then
rerun the commands above.
The R block below mirrors what the report template's setup.R requires
at the top: it installs missing CRAN packages, including the tinytex R
interface package. The R package is useful, but it does not replace
quarto install tinytex, which installs the TeX distribution Quarto
needs to build PDFs. The block also requires causalworkshop >= 0.6.0
and margot >= 1.0.322, stopping with a clear restart-R message if a
stale namespace is loaded.
cran_packages <- c(
  "ggplot2", "dplyr", "tibble", "tidyr", "ggdag", "grf",
  "knitr", "kableExtra", "rmarkdown", "tinytex"
)
missing_cran <- cran_packages[
  !vapply(cran_packages, \(p) requireNamespace(p, quietly = TRUE), logical(1))
]
if (length(missing_cran) > 0) install.packages(missing_cran)
if (!requireNamespace("pak", quietly = TRUE)) install.packages("pak")
if (!requireNamespace("margot", quietly = TRUE) ||
    packageVersion("margot") < "1.0.322") {
  pak::pak("go-bayes/margot")
}
if (!requireNamespace("causalworkshop", quietly = TRUE) ||
    packageVersion("causalworkshop") < "0.6.0") {
  pak::pak("go-bayes/causalworkshop")
  if ("causalworkshop" %in% loadedNamespaces()) {
    stop("causalworkshop was upgraded; please restart R and re-run.", call. = FALSE)
  }
}
suppressPackageStartupMessages({
  library(causalworkshop)
  library(margot)
  library(grf)
  library(ggdag)
  library(ggplot2)
  library(dplyr)
  library(tibble)
  library(knitr)
  library(kableExtra)
})
If the GitHub package installation fails after installing build tools, stop and use the course lab machine. Do not debug compilers during class.
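The version checks in the setup block rely on R comparing version numbers component by component rather than as text. The sketch below illustrates that behaviour in base R; the "installed" version numbers are hypothetical and chosen only to show where a plain string comparison would go wrong.

```r
# Version objects compare component by component (major, minor, patch).
required <- package_version("1.0.322")
installed_old <- package_version("1.0.99")   # hypothetical stale install
installed_new <- package_version("1.0.400")  # hypothetical fresh install

needs_upgrade_old <- installed_old < required  # 99 < 322, so an upgrade is needed
needs_upgrade_new <- installed_new < required  # 400 > 322, so no upgrade is needed

# A plain character comparison gets the first case wrong,
# because "9" sorts after "3" when strings are compared.
string_says_old_is_newer <- "1.0.99" > "1.0.322"

print(c(old = needs_upgrade_old, new = needs_upgrade_new))
print(string_says_old_is_newer)
```

This is why the setup block calls packageVersion() instead of comparing strings directly.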
Step 1: Start from the report template
Download
research-report-template.zip
and unzip it. If you already downloaded the template from the Google
Drive mirror, you can keep using that copy. Open the resulting folder in
RStudio (or another editor), then open manuscript.qmd.
In RStudio:
- Choose File > Open Project... if the folder contains an .Rproj file; otherwise choose File > Open... and select manuscript.qmd.
- Keep the Files pane pointed at the unzipped research-report-template folder.
- Click manuscript.qmd to open the report source.
If you use another editor, open the whole unzipped folder first, then
open manuscript.qmd from inside that folder. Do not edit the zipped
file itself.
Windows note
Keep the template in a simple folder such as Documents\PSYC434\research-report-template, not inside the zipped archive and not in a synced folder with a very long path. Render from the RStudio Terminal with:
quarto render manuscript.qmd
If the render fails before running any R code and mentions LaTeX, run quarto install tinytex from Command Prompt or PowerShell, restart RStudio, and render again.
Your folder should look like this:
research-report/
  setup.R          # study decisions, helpers, single source of truth
  manuscript.qmd   # prose, headings, tables, figures
  _quarto.yml      # render configuration
  references.bib   # BibTeX references
  README.md        # short orientation
setup.R holds decisions and reusable code. manuscript.qmd is the
file you will write in: it holds the prose, headings, tables, figures,
and code chunks that read from setup.R. The split lets you change one
decision (the exposure, the parsimony threshold, the seed) without
scattering edits across the manuscript.
The main decision controls live near the top of setup.R:
- use_fit_cache: whether a successful model fit is reused on later renders. The template leaves this FALSE so a changed exposure or model reruns cleanly.
- min_gain_for_depth_switch: how much held-out policy-value point gain depth-2 needs before it replaces depth-1.
- policy_value_lower_threshold and treated_uplift_lower_threshold: the graphing-rule thresholds that decide whether a selected tree is strong enough to appear as a figure.
- alpha_family_wise: the family-wise error rate for the four-outcome Bonferroni correction.
- num_trees and n_iterations_stability: fitting controls that trade speed against Monte Carlo noise and stability information.
These are not cosmetic choices. They decide what counts as enough improvement, what counts as enough evidence to graph a tree, and how cautious the report should be.
Quarto Markdown survival guide
A Quarto report is a plain-text .qmd file that mixes prose, citations,
tables, figures, and executable R code. When you render it, Quarto runs
the code chunks, inserts their output, and turns the result into PDF and
HTML. Treat manuscript.qmd as the source of truth for the written
report.
Use this minimal syntax while drafting:
## Methods
This report estimates the causal effect of `r exposure_label` on four
wellbeing outcomes.
We follow the outcome-wide reporting logic in VanderWeele [@vanderweele2020].
```{r}
#| label: tbl-ate
#| echo: false
#| tbl-cap: "Average treatment effects for four wellbeing outcomes."
ate$transformed_table
```
The parts to recognise are:
- ## starts a section heading. Use the template headings unless you have a good reason to change them.
- Ordinary paragraphs are just text. Leave a blank line between paragraphs.
- Inline R code uses `r object_name`. Use it when a number or label should update automatically.
- R code chunks start with ```{r} and end with ```. Put chunk options such as echo: false or fig-cap: directly under the chunk header.
- Citations use [@citekey], where the key must appear in references.bib. For example, [@vanderweele2020].
- Figures and tables need captions. A reader should understand what the display shows without searching the prose.
Quarto code chunks: what runs and what appears
Quarto code chunks have three layers: the fence, the options, and the code. The fence says what language the block uses. The options say whether the code runs and whether the code or output appears in the report. The code does the work.
For R chunks, options begin with #| because # is an R comment:
```{r}
#| label: fig-example
#| echo: false
#| warning: false
#| fig-cap: "Example figure caption."
plot(1:10)
```
For LaTeX/TikZ chunks, options begin with %| because % is a LaTeX
comment:
```{tikz}
%| label: fig-dag-tikz
%| eval: false
%| fig-cap: "TikZ DAG, switched off while using the ggdag version."
\begin{tikzpicture}
\node {$A \to Y$};
\end{tikzpicture}
```
The most useful options are:
- eval: false: do not run this chunk. Use this when you want to keep alternative code in the file, such as the TikZ DAG, without rendering it.
- echo: false: run the code but hide the code from the PDF or HTML. Use this for most tables and figures.
- include: false: run the code but hide the code and all its output. Use this for setup chunks that create objects used later.
- warning: false and message: false: hide routine warnings or package messages from the report. Do not use these to hide real errors.
- tbl-cap and fig-cap: add table and figure captions. Use tbl-cap for tables and fig-cap for figures.
- label: name the chunk. Labels must be unique, short, and contain no spaces. A good pattern is tbl-results, fig-forest, or setup-fit.
Use the right option for the job:
| Goal | Option |
|---|---|
| Keep code in the file but stop it running | eval: false |
| Run code but hide the code itself | echo: false |
| Run setup code silently | include: false |
| Show output as raw Markdown | output: asis |
| Temporarily let render continue after an error | error: true |
In the template, the ggdag DAG chunk has #| eval: true and the TikZ
chunk has %| eval: false. To switch to the TikZ version, turn the
ggdag chunk to #| eval: false and the TikZ chunk to %| eval: true.
Keep only one DAG version active at a time.
Do not put options in the middle of a chunk. Quarto reads them at the
top of the chunk, before the code. Also check the comment marker: #|
is for R chunks, %| is for TikZ/LaTeX chunks.
For this report, write the prose in manuscript.qmd, put study
decisions and helper code in setup.R, and keep references in
references.bib. Do not paste screenshots of tables or figures into the
report. Let the code chunks create them when the document renders.
Useful Quarto references:
- Hello, Quarto for RStudio: the basic render workflow for .qmd files.
- Markdown basics: headings, links, lists, equations, and images.
- Tutorial: Computations: executable R code chunks and inline code.
- Execution options: eval, echo, include, warning, message, output, and related chunk controls.
- Citations: how [@citekey], references.bib, and citation styles work.
- Render command: terminal options for quarto render.
Adding citations with BibTeX
The template already tells Quarto to use references.bib. You only need
to add BibTeX entries to that file and cite them from manuscript.qmd.
Use this workflow for a source you find on Google Scholar:
- Search for the exact article title in Google Scholar.
- Find the correct record. Prefer the publisher version or a record with a Digital Object Identifier (DOI) when there are duplicates.
- Click the quotation-mark cite icon, or the Cite link if that is what your browser shows.
- Click BibTeX.
- Copy the full entry, from @article{... or @book{... through the closing }.
- Paste it at the bottom of references.bib.
- Give the entry a readable key, such as vanderweele2020, ding2016, or smith2024wellbeing.
- In manuscript.qmd, cite it with [@vanderweele2020] or use an in-text citation such as @vanderweele2020.
- Render the manuscript. If Quarto says a citation is missing, check that the key in the .qmd exactly matches the key in references.bib.
A BibTeX entry looks like this:
@article{vanderweele2020,
title = {Outcome-wide Epidemiology},
author = {VanderWeele, Tyler J.},
journal = {Epidemiology},
year = {2020},
volume = {31},
number = {1},
pages = {6--9},
doi = {10.1097/EDE.0000000000001141}
}
Check Google Scholar entries before trusting them. Fix obvious errors in
capitalisation, missing journal names, missing DOIs, page ranges, or
author names. For title capitalisation that must be preserved in APA
style, protect proper nouns with an extra pair of braces, for example
title = {{Te Tiriti o Waitangi} and Wellbeing}.
For several sources, you can also set Google Scholar to show BibTeX
links directly: open the menu, choose Settings, find
Bibliography manager, select Show links to import citations into,
choose BibTeX, and save. Some library guides also describe saving
sources to My Library and exporting several records at once, but for
this assessment it is usually safer to add and check one source at a
time.
Step 2: Make your study decisions
Open setup.R. The main study decisions and reporting thresholds live
near the top. Start with the target population, exposure, seed, and
simulated sample size:
target_population <- "..."
# for the lab walkthrough, set this to "community_group"
# for your submitted report, set this to "religious_service" or "volunteer_work"
name_exposure <- "community_group"
study_seed <- 2026
study_n <- 2000
Lab demo vs report
In this session we walk through the same template using community_group as the exposure. We use community_group for the lab so we can work end-to-end without spoiling either of the two report exposures (religious_service or volunteer_work). When you start your report, change name_exposure back to one of those two. The setup script accepts all three values so the lab can render, but only religious_service and volunteer_work are valid choices for submission.
Choose the people first. The target population tells the reader whose wellbeing the report is about and which interventions could sensibly apply to them. Only then define the exposure contrast and outcomes.
The four wellbeing outcomes are fixed by the assignment: sense of purpose, belonging, self-esteem, and life satisfaction at wave 2. The adjustment set $L$ contains baseline demographics, socioeconomic and personality variables, the baseline exposure, and the baseline value of every outcome (lagged-self adjustment — using each variable's own past value as a covariate).
Step 3: Simulate the panel
The simulator returns a long panel with three waves. The
simulate_panel() helper in setup.R reshapes it into a wide tibble
with one row per person: baseline covariates, exposure at wave 1, and
the four outcomes at wave 2.
panel <- simulate_panel()
nrow(panel)
mean(panel$exposure_t1)
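The long-to-wide reshape can be pictured with a small base R sketch. This is not the simulate_panel() helper itself: the column names (id, wave, exposure, purpose), the sample size, and the _t suffix are hypothetical stand-ins, used only to show how three rows per person become one row per person.

```r
set.seed(2026)
n_people <- 5

# Hypothetical long panel: three waves (0, 1, 2) per person.
long <- data.frame(
  id = rep(seq_len(n_people), each = 3),
  wave = rep(0:2, times = n_people),
  exposure = rbinom(n_people * 3, 1, 0.4),
  purpose = round(rnorm(n_people * 3), 2)
)

# One row per person; each time-varying column gets a wave suffix.
wide <- reshape(long, idvar = "id", timevar = "wave",
                direction = "wide", sep = "_t")

# Keep baseline values, exposure at wave 1, and the outcome at wave 2.
panel_wide <- wide[, c("id", "purpose_t0", "exposure_t0",
                       "exposure_t1", "purpose_t2")]
print(panel_wide)
```

The temporal ordering is the point: covariates are measured before the exposure, and the exposure before the outcome.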
Synthetic data
The simulator is the same one used in Lab 9. It supports the eight exposure-by-outcome combinations students may pick. Numerical results do not generalise to any real population; they let you practise the workflow against a known truth.
Step 4: Fit four causal forests in one call
margot::margot_causal_forest() fits one honest causal forest per
outcome with a shared adjustment set, and returns a single object with
combined results. This is more concise than looping over
grf::causal_forest() yourself, and it produces the inputs the
policy-tree pipeline expects.
X <- as.matrix(panel[, covariate_cols])
W <- panel$exposure_t1
weights <- rep(1, nrow(panel))
models_binary <- margot::margot_causal_forest(
  data = panel,
  outcome_vars = c("t2_purpose", "t2_belonging",
                   "t2_self_esteem", "t2_life_satisfaction"),
  covariates = X,
  W = W,
  weights = weights,
  grf_defaults = list(num.trees = 500, honesty = TRUE),
  top_n_vars = 12,
  save_models = TRUE,
  save_data = TRUE,
  compute_conditional_means = TRUE,
  train_proportion = 0.5,
  use_train_test_split = TRUE,
  seed = study_seed
)
print(models_binary$combined_table)
Reading combined_table
One row per outcome. Columns are the risk-difference estimate E[Y(1)]-E[Y(0)], the 95% confidence interval (2.5 %, 97.5 %), the E-value for the point estimate (E_Value), and the E-value for the bound nearest the null (E_Val_bound). The CI is unadjusted; you will apply Bonferroni in Step 5.
Step 5: Outcome-wide reporting with margot_plot()
Reporting four outcomes in one table requires correcting for
multiplicity. margot::margot_plot() does this in one call: pass the
four-outcome combined table, set adjust = "bonferroni", and it returns
a plot, a transformed table with the multiplicity-adjusted confidence
intervals, and a short prose interpretation.
ate <- ate_plot_objects(models_binary) # wrapper from setup.R
ate$plot # forest plot
ate$transformed_table # ATE, Bonferroni 95% CI, E-values
cat(ate$interpretation) # auto-drafted summary
ate_plot_objects() wraps margot::margot_plot() with a
plot_defaults list (in setup.R) so you can restyle (colours,
ordering, text size, x-axis limits) without editing the manuscript
chunks. The Bonferroni correction widens each CI by replacing the usual
critical value with the family-wise $z$ factor
($z_{\alpha_{FW}/(2k)}$ with $k = 4$), and the bound E-value is
recomputed at the multiplicity-adjusted confidence limit nearest the
null. If the adjusted interval includes zero, the bound E-value is
reported as 1: no unmeasured confounding is needed to move the bound to
the null.
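The arithmetic behind the adjusted interval and the bound E-value can be sketched in base R. This is an illustration under stated assumptions, not the margot implementation: the estimate and standard error are hypothetical, the outcome is assumed to be standardised, and the E-value uses the common approximation that converts a standardised mean difference $d$ to a risk ratio via $\exp(0.91 d)$.

```r
# E-value for a risk ratio: RR + sqrt(RR * (RR - 1)), taken on the side away from the null.
evalue_rr <- function(rr) {
  rr <- ifelse(rr < 1, 1 / rr, rr)
  rr + sqrt(rr * (rr - 1))
}

estimate <- 0.12   # hypothetical standardised ATE
se <- 0.04         # hypothetical standard error
k <- 4             # number of outcomes in the family
alpha_fw <- 0.05   # family-wise error rate

# Bonferroni replaces 1.96 with a larger critical value.
z_bonf <- qnorm(1 - alpha_fw / (2 * k))
ci <- estimate + c(-1, 1) * z_bonf * se

# Confidence limit nearest the null (zero on the difference scale).
bound_near_null <- if (estimate > 0) ci[1] else ci[2]

e_point <- evalue_rr(exp(0.91 * estimate))
# If the adjusted interval includes zero, the bound E-value is 1.
e_bound <- if (ci[1] <= 0 && ci[2] >= 0) 1 else evalue_rr(exp(0.91 * bound_near_null))

print(round(c(lower = ci[1], upper = ci[2], e_point = e_point, e_bound = e_bound), 3))
```

Rerunning the sketch with a larger standard error shows how quickly the bound E-value collapses to 1 once the widened interval reaches zero.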
Why Bonferroni
The outcome-wide design asks how the exposure affects all four outcomes jointly. If you tested each outcome at $\alpha = 0.05$ and the four tests were independent, the family-wise error rate would be approximately $1 - (1 - 0.05)^4 \approx 0.19$. Bonferroni tests each outcome at $0.05 / 4 = 0.0125$, which keeps the family-wise rate at or below 0.05. Other corrections (Holm, Benjamini-Hochberg) are reasonable; pick one and state which.
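The family-wise arithmetic can be checked directly in base R. The sketch assumes four independent tests, which is the simplifying assumption behind the 0.19 figure; correlated outcomes would give a somewhat different unadjusted rate.

```r
alpha <- 0.05
k <- 4

# Chance of at least one false positive across k independent tests at alpha.
fwer_unadjusted <- 1 - (1 - alpha)^k

# Bonferroni tests each outcome at alpha / k instead.
alpha_per_test <- alpha / k
fwer_bonferroni <- 1 - (1 - alpha_per_test)^k

print(round(c(unadjusted = fwer_unadjusted,
              per_test = alpha_per_test,
              bonferroni = fwer_bonferroni), 4))
```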
Step 6: Policy trees
After estimating the four ATEs, the question becomes whether the same
exposure benefits everyone equally. Lab 9 introduced policy trees as an
interpretable allocation rule. Lab 10 uses the same margot workflow,
then applies a transparent graphing rule before deciding which trees
to put in the report.
The course workflow uses an outcome-only policy objective. A
do not treat leaf means the fitted rule assigns the no-treatment
action for that covariate profile; it does not compute a financial
saving. If a decision-maker needs a scarce-budget or cost-sensitive
rule, the cost must be specified explicitly, for example by subtracting
a treatment cost c from the treatment reward before refitting the
policy tree.
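A minimal sketch of that cost adjustment, in base R, may help. It assumes the rewards are held in a two-column matrix of doubly robust scores (control, treated), which is the form a policy-tree fitter such as policytree::policy_tree() takes; the scores, the cost value, and the object names here are hypothetical, and how the course workflow exposes this step is not shown in this lab.

```r
set.seed(2026)
n <- 1000

# Hypothetical reward matrix: one row per person, one column per action.
gamma <- cbind(
  control = rnorm(n, mean = 0, sd = 1),
  treated = rnorm(n, mean = 0.15, sd = 1)
)

treatment_cost <- 0.10  # hypothetical cost c, expressed in outcome units

# Subtract the cost from the treated-action reward only.
gamma_cost <- gamma
gamma_cost[, "treated"] <- gamma[, "treated"] - treatment_cost

mean_gain_outcome_only <- mean(gamma[, "treated"] - gamma[, "control"])
mean_gain_net_of_cost <- mean(gamma_cost[, "treated"] - gamma_cost[, "control"])

print(round(c(outcome_only = mean_gain_outcome_only,
              net_of_cost = mean_gain_net_of_cost), 3))
```

A tree refitted on the cost-adjusted rewards recommends treatment only where the expected benefit exceeds the cost, so coverage will usually fall. Choosing c is a value judgement and belongs in the methods section.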
Step 6a: Stability and the workflow object
policy_tree_stability <- margot::margot_policy_tree_stability(
  model_results = models_binary,
  depth = 2,
  n_iterations = 50,
  vary_type = "split_only",
  parallel = FALSE,
  label_mapping = label_mapping,
  seed = study_seed
)
wf <- margot::margot_policy_workflow(
  stability = policy_tree_stability,
  original_df = panel,
  label_mapping = label_mapping,
  audience = "policy",
  prefer_stability = TRUE,
  min_gain_for_depth_switch = 0.01,
  signal_score = "pv_snr",
  signals_k = 3,
  interpret_models = "wins_borderline",
  plot_models = "wins_borderline"
)
print(wf$policy_brief_df)
print(wf$best$depth_summary_df)
policy_brief_df lists, for each outcome, the selected depth, the
policy value with its 95% confidence interval, the treated-uplift with
its 95% confidence interval, and coverage (the share of the sample the
rule recommends for treatment).
best$depth_summary_df contains the depth-1 versus depth-2 comparison
and the parsimony decision. With min_gain_for_depth_switch = 0.01, the
workflow selects depth-2 only when the held-out policy-value point gain
over depth-1 is at least 0.01 outcome units. Use uncertainty intervals,
stability, equity, and implementation burden to judge how cautiously to
interpret the selected rule; do not treat interval overlap as the
depth-selection rule.
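The parsimony decision reduces to a simple comparison, sketched here in base R. The policy values and column names are hypothetical; the real comparison lives in best$depth_summary_df, whose exact columns are not reproduced here.

```r
# Select depth-2 only when its point gain over depth-1 meets the threshold.
choose_depth <- function(pv_depth1, pv_depth2, min_gain = 0.01) {
  gain <- pv_depth2 - pv_depth1
  ifelse(gain >= min_gain, 2L, 1L)
}

# Hypothetical held-out policy values for the four outcomes.
depth_summary <- data.frame(
  outcome = c("Purpose", "Belonging", "Self-esteem", "Life satisfaction"),
  pv_depth1 = c(0.030, 0.041, 0.012, 0.025),
  pv_depth2 = c(0.034, 0.060, 0.030, 0.025)
)

depth_summary$gain <- depth_summary$pv_depth2 - depth_summary$pv_depth1
depth_summary$selected_depth <- choose_depth(
  depth_summary$pv_depth1,
  depth_summary$pv_depth2,
  min_gain = 0.01
)

print(depth_summary)
```

The rule uses the point gain only. Interval width and stability inform how cautiously you describe the selected tree, not which depth is selected.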
In the report template, these decisions appear in two tables:
- tbl-depth reads from fit$wf$best$depth_summary_df and shows the depth-1 value, depth-2 value, point gain, and selected depth.
- tbl-selection reads from fit$selection and shows whether each selected tree passes the graphing rule.
The corresponding controls are in setup.R: min_gain_for_depth_switch
controls tbl-depth; policy_value_lower_threshold and
treated_uplift_lower_threshold control tbl-selection.
Coverage is an output of the fitted rule, not a fixed treatment budget. If the rule treats 80% of people, the tree is mostly saying "treat broadly" under the outcome-only objective. It is not saying a programme with capacity for 20% should use the same rule.
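The difference between coverage and a budget can be seen in a toy base R example. Everything here is hypothetical: the simulated effect estimates stand in for forest output, and the "treat if the estimated effect is positive" rule is a simple stand-in for an outcome-only rule, not a fitted policy tree.

```r
set.seed(2026)
n <- 500

# Hypothetical individual effect estimates, mostly positive.
tau_hat <- rnorm(n, mean = 0.1, sd = 0.1)

# Outcome-only stand-in rule: treat whenever the estimated effect is positive.
rule_treat <- tau_hat > 0
coverage <- mean(rule_treat)

# A fixed budget instead caps the number treated, here at 20% of the sample.
budget_share <- 0.20
n_budget <- floor(budget_share * n)
budget_treat <- rank(-tau_hat, ties.method = "first") <= n_budget

print(c(coverage = coverage, budget_coverage = mean(budget_treat)))
```

The first rule's coverage is whatever the data imply; the second fixes coverage in advance and asks who should fill the places.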
Step 6b: The graphing rule
margot::margot_select_grf_policy_trees() applies a transparent
reporting test. It keeps a policy tree only when both the
policy-value lower confidence limit and the treated-uplift lower
confidence limit exceed zero. Outcomes that fail the test still appear
in your tables; their trees do not get graphed.
selection <- margot::margot_select_grf_policy_trees(
  policy_brief = wf$policy_brief_df,
  policy_value_lower_threshold = 0,
  treated_uplift_lower_threshold = 0
)
print(selection)
The graph_policy_tree column is TRUE for outcomes whose targeting
story passes the test. State the thresholds in your methods. If you
raise them (for example, policy_value_lower_threshold = 0.05), justify
the choice.
Why a graphing rule
The policy-tree algorithm always returns a tree. The interesting question is whether that tree is reliable enough to put in front of a reader. Reporting a tree with a policy value that includes zero invites overclaiming. The graphing rule is a precommitment: state the test, then graph only what passes.
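The logic of the graphing rule can be written out in a few lines of base R. This is a sketch of the rule as described above, not the margot function: the lower confidence limits and the column names pv_lower and uplift_lower are hypothetical, and only graph_policy_tree matches a column named in this lab.

```r
# Hypothetical lower confidence limits for the four outcomes.
policy_brief <- data.frame(
  outcome = c("Purpose", "Belonging", "Self-esteem", "Life satisfaction"),
  pv_lower = c(0.012, -0.004, 0.020, 0.003),
  uplift_lower = c(0.008, 0.010, -0.002, 0.001)
)

pv_threshold <- 0
uplift_threshold <- 0

# Graph a tree only when BOTH lower limits exceed their thresholds.
policy_brief$graph_policy_tree <-
  policy_brief$pv_lower > pv_threshold &
  policy_brief$uplift_lower > uplift_threshold

print(policy_brief)
```

Outcomes that fail on either criterion stay in the table with graph_policy_tree set to FALSE, so the reader can see what was tested and what was dropped.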
Step 6c: Plotting the selected trees
margot_policy_workflow() builds combined plots (decision tree +
prediction-points scatter) for the models flagged for interpretation.
Pull them out for the manuscript by mapping selected outcome labels back
to their model names.
label_to_model <- setNames(
  wf$best$depth_summary_df$model,
  wf$best$depth_summary_df$outcome_label
)
graphed_labels <- selection$Outcome[selection$graph_policy_tree]
graphed_models <- unname(label_to_model[graphed_labels])
for (mn in graphed_models) {
  print(wf$plots[[mn]]$combined_plot)
}
If wf$plots does not contain a model you want to graph, change
interpret_models and plot_models in the workflow call. The default
wins_borderline covers most reports; recommended is more permissive.
Step 7: Audit against simulator ground truth
The simulator stores the true individual treatment effects in tau_*
columns. A side-by-side comparison of the estimated ATE and the
population mean tau is a sanity check the analyses themselves cannot
perform.
truth <- ground_truth_audit()
results |>
  select(outcome_label, estimate) |>
  left_join(truth, by = "outcome_label") |>
  mutate(diff = estimate - true_mean_tau)
In a real research report you do not have a tau_* column. The audit is
a teaching scaffold: in the lab, it shows you how close the estimates
come to the truth. Do not include the audit in the submitted report.
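The idea of a ground-truth audit can be shown with a toy simulation in base R. This is not the course simulator or ground_truth_audit(): the data-generating process is hypothetical, the exposure is randomised for simplicity, and a difference in means stands in for the causal forest estimate.

```r
set.seed(2026)
n <- 5000

x <- rnorm(n)                 # one baseline covariate
tau <- 0.2 + 0.1 * x          # true individual treatment effects (known only in simulation)
a <- rbinom(n, 1, 0.5)        # randomised exposure, so no confounding
y <- 0.5 * x + a * tau + rnorm(n)

# Simple ATE estimate: difference in mean outcomes between exposure groups.
estimate <- mean(y[a == 1]) - mean(y[a == 0])
true_mean_tau <- mean(tau)

audit <- data.frame(
  estimate = estimate,
  true_mean_tau = true_mean_tau,
  diff = estimate - true_mean_tau
)
print(round(audit, 3))
```

The diff column is the audit. With real data the tau column does not exist, which is why sensitivity analysis replaces it in the submitted report.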
Step 8: Render the manuscript
Open manuscript.qmd in RStudio (or your editor) before rendering. This
is the report source file. The preamble sources setup.R, simulates the
panel, fits the four causal forests, builds the policy-tree workflow,
and applies the graphing rule. Every table and figure reads from the
resulting objects.
In the terminal (not the R console), run:
cd research-report
quarto render manuscript.qmd
On Windows, first open the unzipped template folder in RStudio, then run
the render command from the RStudio Terminal. If you are using Command
Prompt or PowerShell, use cd to move into the folder that contains
manuscript.qmd; paths with spaces are easier to handle if you keep the
folder under Documents\PSYC434.
Quarto writes a PDF and an HTML next to manuscript.qmd. Read both
alongside the .qmd source. Check that every table and figure updates
from setup.R; do not hard-code numerical results in the prose.
Render early and often
Render after every meaningful change. A render that fails an hour after the last successful render is easier to repair than one that fails after a day. While drafting, lower study_n, num_trees, and n_iterations_stability in setup.R for faster iteration; raise them for the final render.
If the render fails, read the first error message in the terminal. Most failures come from a missing package, an object name that does not exist, an unclosed code chunk, or a citation key that is absent from references.bib.
Step 9: Write the report
The template gives you headings, code chunks, tables, and figures. What it does not give you is the prose. Replace each italicised placeholder paragraph with your own writing, using the Introduction, Methods, Results, and Discussion (IMRAD) structure. IMRAD is the default for psychology, epidemiology, and most life-sciences journals.
The ten steps in the Causal Workflow tell you what a defensible causal study has to establish. IMRAD tells you where in the manuscript each piece goes. The two are complementary: every IMRAD slot is anchored to specific workflow steps.
IMRAD checklist, mapped to the causal workflow
- Title: concise and informative. Name the exposure, the outcome family, and the population. Highlight the result, not only the method.
- Abstract: a stand-alone summary with background, causal question, key methods, principal results, and conclusions. Anchored in workflow Steps 0, 1, 2, 3 (target population, exposure, time zero, outcomes).
- Introduction: set the scientific context, identify the gap, and state why the question matters for the target population. Do not turn the Introduction into the ten-step workflow and do not summarise results here.
- Methods: detailed enough for another investigator to reproduce the study. Anchored in Steps 0-8: target population, time zero, exposure, outcomes, the DAG and adjustment set for exchangeability, the causal-consistency argument, positivity diagnostics, measurement choices, and the strategy for preserving representativeness (attrition, missingness).
- Results: an objective presentation of the findings, namely the four ATEs with Bonferroni-adjusted intervals, E-values, and any policy trees that pass the graphing rule. No interpretation, no citations. These are outputs of estimation, conditional on the identification arguments stated in Methods.
- Discussion: interpret results, compare with existing literature, address limitations, state implications. Anchored in Step 9 (transparent documentation): name the assumptions you made, where each could fail, and what the sensitivity analyses (E-values) imply. Acknowledge limitations honestly.
- In summaries and abstracts, keep to one tense.
- Use flowing prose in the main text. Bullet lists belong in methods checklists and supplementary material.
The template's section ordering follows the workflow after the Introduction: target population first, then the causal contrast and outcomes, then identification, then estimation. The Introduction, Methods, Results, and Discussion blocks map cleanly onto the IMRAD slots above; use the checklist as your editing pass.
Ethics paragraph (subsection of the Discussion)
The Option A criteria require a short statement on what would need to be considered before acting on a policy rule. Cover fairness, proxy variables, governance, and one value judgement the analysis depends on. In the template this lives as the Ethics subsection of the Discussion (three to five sentences). A policy tree that splits on income or deprivation can be defensible, or it can be a proxy for ethnicity or migration history; the ethics paragraph names the trade-off. Also state that the current rule does not include treatment cost unless you have explicitly specified one. The model cannot settle public values; you can describe what evidence the model can and cannot provide.
Pointers
- The Option A assessment criteria are in Assessments.
- The reporting checklist is in the Reporting Guide.
- Self-checks for the workflow are in Assessment Self-Checks.
- The simulator is documented in the Simulation Guide.
- Quarto's official Markdown basics, R computations tutorial, and citation guide cover the report-writing mechanics.
Test 1: Study Sheet
Current study sheet for Test 1, covering lectures 1–6.
Practice Test (2025 Test 1)
The 2025 PSYC 434 Test 1. Part 1 is multiple choice (20 questions, some with more than one correct answer). Part 2 is short answer (choose 4 of 6). Use this as practice for the 2026 test.
Suggested answers are available here.
Practice Test 2025 Answers
Suggested answers for the 2025 PSYC 434 Test 1.
Test 2 Study Sheet
Current study sheet for Test 2, covering weeks 8-10.
Test 2 Practice Questions
Use these questions to practise for Test 2. They cover the same topic blocks as the test: heterogeneous treatment effects, policy trees and judgement, outcome-wide reporting, and measurement.
Write short answers without notes first. Then check your answer against the relevant lecture, lab, Reporting Guide, and Test 2 Study Sheet.
Heterogeneous Treatment Effects (Week 8)
- State the difference between the average treatment effect and the conditional average treatment effect (CATE). When is the distinction substantively important?
- A regression model with treatment-by-age and treatment-by-gender interactions reports both interactions as non-significant. Does this rule out heterogeneous treatment effects? Why or why not?
- Explain honest splitting in a causal forest. What problem is it designed to prevent?
- What does it mean for an estimator to be doubly robust? Give one practical advantage.
- A causal forest reports RATE-AUTOC = 0.04 with a 95% CI of $[0.01, 0.07]$. State, in plain language, what this tells you about heterogeneity. Compare with RATE-AUTOC = 0.04 with CI $[-0.02, 0.10]$.
- Sketch a Qini curve. Label the axes. Describe the shape of the curve when targeting helps and when it does not.
- Why do we expect causal forest estimates of $\hat{\tau}(x)$ to be noisy in regions of the covariate space with little data?
- A reviewer says "If the heterogeneity is real, a regression with the right interaction terms will find it." Reply in two sentences.
Policy Trees, Fairness, and Judgement (Week 9)
- State, in one or two sentences, what a policy tree does that a conditional average treatment effect estimate alone does not. Your answer should use the idea of utility or policy value over allocation rules.
- Explain the parsimony rule for choosing tree depth. Why is the simpler rule often preferable even when the more complex rule has a slightly larger estimated utility or policy value?
- A depth-2 policy tree treats the leaf "deprivation index $> 1.2$ and age $\leq 40$". Translate this into a sentence a community organiser could repeat.
- Name three things you would check before using a policy tree to allocate a programme.
- Explain how a split on deprivation can affect social groups differently even when group membership does not appear in the tree.
- State, in one or two sentences, why statistical evidence cannot by itself decide whether an allocation rule is just.
- Describe one scenario in which a community organiser should override a policy tree recommendation. What information would they have that the model does not?
- State the difference between a ranking rule (for example, treat the top 20% by $\hat{\tau}$ when a budget is fixed) and the course policy-tree workflow, which fits depth-1 and depth-2 candidates and chooses depth-2 only if it clears the prespecified gain threshold. In your answer, distinguish estimating CATEs from estimating the utility of allocation rules. Give one scenario in which the ranking is preferable.
Outcome-Wide Reporting
- State the four causal estimands an outcome-wide design implies for one exposure and four outcomes.
- Why is multiple-testing correction necessary in an outcome-wide design? Describe the Bonferroni correction at $\alpha_{FW} = 0.05$ for four outcomes.
- Explain, in plain language, what an E-value of 2.0 means.
- A forest plot orders four outcomes by effect magnitude. The largest two effects survive Bonferroni; the smaller two do not. Write a one-paragraph interpretation suitable for a non-specialist audience.
Measurement (Week 10)
- Briefly distinguish reflective and formative measurement models. State one reason each is awkward for causal inference.
- Define measurement invariance. Describe how scalar invariance can fail across cultures, and why this threatens cross-cultural causal comparisons.
- Why is "mindfulness intervention" vulnerable to the multiple-versions-of-treatment problem? What does this imply for consistency?
- Draw a causal directed acyclic graph with $A$, $Y^\ast$, $Y$, and $U_Y$ representing differential measurement error. Explain how the path $A \to U_Y \to Y$ threatens identification.
- Explain why including the baseline measurement of the outcome ($Y_0$) in the adjustment set provides strong confounding control in a three-wave panel.
- State one design response and one analytic response when measurement invariance fails across groups.
Synthesis Questions
- An investigator estimates the effect of weekly volunteer work on four wellbeing outcomes using a causal forest. Set out the full causal workflow: estimands, identification assumptions, estimator, multiple-testing correction, sensitivity analysis, presentation.
- A community wellbeing programme wants to use heterogeneous treatment effect estimates. Walk through the steps from conditional average treatment effect estimation to a readable policy-tree summary. Explain that policy learning estimates utility over allocation rules, name the fairness check, and state one value judgement the model cannot settle. State how the answer would change if the programme also had a fixed budget.
- A cross-cultural study uses the K6 to compare the effect of a school-based mindfulness intervention in two countries. Identify the measurement, treatment-version, and identification threats, and propose one response to each.
- A reviewer challenges your outcome-wide forest plot with: "Three of these confidence intervals cross zero after Bonferroni. Are you sure your story holds?" Reply in one paragraph.
Test 2 Practice Question Answers
Use these answers to check your own work after you have tried the practice questions. Strong answers can be phrased in different ways. The key is to show the causal logic clearly.
Heterogeneous Treatment Effects
-
The average treatment effect (ATE) is the mean contrast $\mathbb{E}[Y(1)-Y(0)]$ in the target population. The conditional average treatment effect (CATE) is $\mathbb{E}[Y(1)-Y(0)\mid X=x]$, the average contrast within people who share covariates $x$. The distinction matters when the average hides large differences in who benefits.
-
No. Non-significant age and gender interactions only rule out those two pre-specified linear interactions with the available power and model form. Heterogeneity may involve other variables, non-linear patterns, or high-dimensional combinations.
-
Honest splitting separates the data used to choose tree splits from the data used to estimate effects within leaves. It reduces overfitting and prevents the forest from exaggerating heterogeneity by using the same noise twice.
-
A doubly robust estimator remains consistent if either the outcome model or the treatment/propensity model is correctly specified. The practical advantage is protection against one form of model misspecification.
-
RATE-AUTOC = 0.04 with CI $[0.01,0.07]$ suggests the forest's ranking describes useful treatment-effect heterogeneity: people ranked higher appear to benefit more than people ranked lower. The same estimate with CI $[-0.02,0.10]$ is inconclusive: the point estimate is positive, but the data are compatible with no useful targeting signal. Neither result identifies the causal source of the heterogeneity or shows that any splitter is itself a cause.
-
A Qini curve plots the treatment share targeted on the $x$-axis and cumulative gain over a baseline allocation on the $y$-axis. When targeting helps, the curve rises steeply above the diagonal among the first people targeted. When targeting does not help, it stays close to the diagonal.
-
Sparse regions have few comparable treated and untreated observations. The forest must extrapolate more, leaves contain less information, and the CATE estimate has higher variance.
-
Regression only tests the interactions the investigator specifies. A causal forest can search for non-linear and high-dimensional heterogeneity that a small regression interaction set would miss, though the forest still needs validation.
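The ATE/CATE distinction in the first answer can be illustrated with a small simulation. This is a minimal sketch under invented assumptions: the binary covariate `x`, the effect sizes, and the sample size are hypothetical, and treatment is randomised so that simple differences in means estimate each contrast.

```r
# Simulated illustration (all numbers hypothetical): a randomised treatment
# whose effect is +0.5 when x = 1 and -0.5 when x = 0.
set.seed(434)
n <- 20000
x <- rbinom(n, 1, 0.5)              # binary effect modifier
a <- rbinom(n, 1, 0.5)              # randomised treatment
tau <- ifelse(x == 1, 0.5, -0.5)    # true individual-level contrast
y <- rnorm(n) + tau * a             # observed outcome

# Difference in observed means between treated and untreated
diff_means <- function(y, a) mean(y[a == 1]) - mean(y[a == 0])

ate_hat <- diff_means(y, a)                    # averages over both subgroups
cate_x1 <- diff_means(y[x == 1], a[x == 1])    # contrast within x = 1
cate_x0 <- diff_means(y[x == 0], a[x == 0])    # contrast within x = 0

round(c(ATE = ate_hat, CATE_x1 = cate_x1, CATE_x0 = cate_x0), 2)
```

In this toy setting the ATE is close to zero while the two CATEs have opposite signs, which is the sense in which an average can hide who benefits.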
Policy Trees, Fairness, and Judgement
-
A policy tree converts estimated treatment-effect heterogeneity into a simple allocation rule: for these covariates, treat; for those covariates, do not treat. A CATE estimate describes the expected treatment contrast at $X=x$. Policy learning compares whole allocation rules by their expected utility or policy value if the rule were applied to the target population.
-
Fit the shallower and deeper trees as candidates, then prefer the simpler tree unless the deeper tree clears the prespecified held-out policy-value point-gain threshold. Uncertainty and stability guide how cautiously to interpret a threshold-clearing depth-2 rule; interval overlap is not the selection rule. Simpler trees are easier to explain, more stable, and less likely to be misapplied.
-
Offer the programme to residents whose deprivation index is above 1.2 and who are aged 40 or younger.
-
A strong fairness check could ask: what share of each relevant social group receives the programme; whether split variables are proxies for sensitive characteristics; whether people with similar need receive similar access; whether the rule is stable across resamples; and whether there is an override or appeal pathway.
-
Deprivation is correlated with many social conditions, such as income, housing, neighbourhood resources, age, health, and sometimes ethnicity. A rule that uses deprivation may therefore allocate treatment unevenly across groups even if group membership is not in the tree. The split is descriptive: deprivation may be the strongest measured marker even if the root causes of the heterogeneity lie upstream, for example in housing policy, labour-market exclusion, discrimination, or other causes of deprivation.
-
Statistical evidence can estimate expected benefits, uncertainty, and subgroup patterns. It cannot decide which public value should govern allocation: maximising total benefit, equal access, need, individual choice, cost control, or another principle.
-
An organiser might override the rule for someone assigned "do not treat" who is in acute crisis or whose situation changed after baseline measurement. The organiser has current, relational, and contextual information that the model does not contain.
-
A ranking rule can treat the top 20% by $\hat{\tau}(x)$ when a budget fixes the treatment share. It uses the full continuous CATE score, so the first task is estimating person-level treatment contrasts. The course policy-tree workflow solves a different problem: it estimates the utility or policy value of simple if-then allocation rules, fits depth-1 and depth-2 candidates, and chooses depth-2 only if the held-out policy-value point gain clears the prespecified threshold. The treated share is whatever the selected rule implies. Ranking may be preferable when trained staff can use a score safely, a fixed budget must be met exactly, and the extra policy value is large enough to justify lower transparency.
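The contrast between a budgeted ranking rule and a tree-style rule can be sketched in a few lines. The data here are simulated, `tau_hat` is a stand-in for CATE estimates from a forest, and the tree rule simply reuses the plain-language rule from the third answer in this section; nothing below reproduces the course workflow itself.

```r
# Hypothetical data: tau_hat stands in for forest-based CATE estimates.
set.seed(434)
n <- 1000
deprivation <- rnorm(n, mean = 1, sd = 1)   # hypothetical deprivation index
age <- sample(18:80, n, replace = TRUE)
tau_hat <- rnorm(n, mean = 0.1, sd = 0.05)

# Ranking rule: the budget fixes the treated share at 20%.
cutoff <- quantile(tau_hat, probs = 0.80)
treat_rank <- tau_hat > cutoff

# Tree-style rule: the treated share is whatever the rule implies.
treat_tree <- deprivation > 1.2 & age <= 40

c(share_ranked = mean(treat_rank), share_tree = mean(treat_tree))
```

The ranking rule hits the budget by construction, whereas the if-then rule yields a treated share that depends on how deprivation and age are distributed in the population.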
Outcome-Wide Reporting
-
The four causal estimands are the effects of the same exposure on each outcome: purpose, belonging, self-esteem, and life satisfaction. Formally, $\mathrm{ATE}_k=\mathbb{E}[Y_k(1)-Y_k(0)]$ for $k=1,\ldots,4$.
-
With four outcomes each tested at $\alpha=0.05$, the probability of at least one false-positive result exceeds 5% (about 18.5% if the four tests were independent, since $1-0.95^4\approx 0.185$). Bonferroni controls the family-wise error rate by testing each outcome at $\alpha=0.05/4=0.0125$, equivalent to reporting 98.75% confidence intervals for each outcome.
-
An E-value of 2.0 means that an unmeasured confounder would need to be associated with both the exposure and the outcome by a risk ratio of at least 2.0 each, above and beyond measured covariates, to explain away the estimated effect. It is a sensitivity summary, not proof that confounding is absent.
-
The exposure shows its clearest evidence for the two outcomes whose Bonferroni-adjusted intervals exclude zero. The other two estimates point in the same direction but remain uncertain after correcting for the four-outcome family. A non-specialist summary should emphasise the overall pattern, the stronger outcomes, and the fact that some outcomes remain compatible with no effect.
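The Bonferroni arithmetic in the answers above can be checked directly. This sketch assumes a normal approximation for the intervals; the estimate and standard error are hypothetical values chosen only to show the calculation.

```r
k <- 4                                   # number of outcomes in the family
alpha <- 0.05

# Chance of at least one false positive if the four tests were independent
fwer_uncorrected <- 1 - (1 - alpha)^k

alpha_bonf <- alpha / k                  # per-outcome alpha
ci_level <- 1 - alpha_bonf               # per-outcome confidence level
z_unadj <- qnorm(1 - alpha / 2)          # usual 95% multiplier
z_bonf <- qnorm(1 - alpha_bonf / 2)      # wider Bonferroni multiplier

# Hypothetical estimate and standard error in SD units
est <- 0.15
se <- 0.03
ci_bonf <- est + c(-1, 1) * z_bonf * se

round(c(fwer = fwer_uncorrected, level = ci_level, z = z_bonf), 4)
round(ci_bonf, 3)
```

The adjusted multiplier is larger than the unadjusted one, so each interval widens; an outcome "survives" the correction only if the wider interval still excludes zero.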
Measurement
-
In a reflective model, the latent construct causes the indicators. This is awkward for causal inference if the latent variable is treated as a real causal object without a clear intervention. In a formative model, the indicators compose the construct. This is awkward because intervening on the construct is not the same as intervening on each component item.
-
Measurement invariance means that a scale relates to the underlying construct in the same way across groups. Scalar invariance fails when item intercepts or thresholds differ across cultures, so the same observed score can imply different latent levels. Mean comparisons may then mix real differences with measurement artefacts.
-
"Mindfulness intervention" can bundle different versions: breathing exercises, meditation apps, classroom lessons, retreats, or teacher-led practice. Consistency is threatened because $Y(1)$ is not a single well-defined potential outcome if different people receive different versions.
-
A suitable graph is $A \to Y^\ast \to Y$, with $U_Y \to Y$, and differential measurement error represented by $A \to U_Y \to Y$. The path $A \to U_Y \to Y$ means the exposure changes how the outcome is measured, so the observed effect on $Y$ may partly reflect measurement processes rather than the true outcome $Y^\ast$.
-
Baseline outcome adjustment controls much of the stable prior difference between exposed and unexposed people. In a three-wave panel, a remaining unmeasured confounder would need to affect both later exposure and later outcome independently of earlier exposure and earlier outcome, which is a more demanding requirement for bias to arise.
-
A design response is to adapt, translate, pilot, or replace items before data collection, ideally with cultural consultation. An analytic response is to test measurement invariance and use group-specific models, partial invariance, sensitivity analysis, or avoid mean comparisons where invariance fails.
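The differential measurement error path $A \to U_Y \to Y$ from the fourth answer can be simulated. This is a sketch under invented assumptions: the exposure has no effect on the true outcome, and the size of the reporting shift (0.3) is hypothetical.

```r
set.seed(434)
n <- 10000
a <- rbinom(n, 1, 0.5)                 # randomised exposure
y_star <- rnorm(n)                     # true outcome: no effect of a
u_y <- 0.3 * a + rnorm(n, sd = 0.5)    # measurement error shifted by exposure (A -> U_Y)
y <- y_star + u_y                      # measured outcome (Y* -> Y <- U_Y)

effect_true <- unname(coef(lm(y_star ~ a))["a"])    # contrast on the true outcome
effect_measured <- unname(coef(lm(y ~ a))["a"])     # contrast on the measured outcome

round(c(true = effect_true, measured = effect_measured), 2)
```

Even with randomisation, the measured contrast picks up the shift in reporting, so an apparent effect on $Y$ need not reflect any effect on $Y^\ast$.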
Synthesis Questions
-
A strong workflow states the four causal estimands for weekly volunteer work versus no weekly volunteer work; defines the target population and time horizon; states consistency, conditional exchangeability, and positivity; adjusts for baseline covariates, baseline exposure, and baseline outcomes; fits a causal forest; reports four ATEs with Bonferroni-adjusted intervals; reports E-values for point estimates and confidence limits; tests heterogeneity with RATE or calibration; and presents a forest plot plus table.
-
Estimate CATEs with a causal forest, check whether heterogeneity is real, then move from person-level treatment contrasts to policy learning. Policy learning estimates the expected utility or policy value of candidate allocation rules if they were applied to the target population. Fit shallow policy trees, evaluate held-out policy value and stability, and translate the chosen rule into plain language. The fairness check should consider group treatment shares implied by the rule, proxy variables, need, and override procedures. The model cannot decide whether the allocation should prioritise total benefit, equal access, greatest need, cost control, or another public value. If a fixed budget is added, the analysis must also compare rules or rankings under that budget; the current default policy-tree workflow does not impose the percentage treated by itself.
-
Measurement threat: the K6 may not be invariant across countries; test invariance, adapt items, or avoid direct mean comparisons if scalar invariance fails. Treatment-version threat: "school-based mindfulness" may differ across teachers, schools, and countries; define the intervention more precisely or analyse versions separately. Identification threat: treated and untreated students may differ in baseline distress, family background, school resources, or selection into the programme; use a clear DAG, baseline adjustment, and sensitivity analysis.
-
A good reply is: "The corrected intervals mean we should not claim four separate effects. The evidence is strongest for the outcomes whose Bonferroni-adjusted intervals exclude zero. For the other outcomes, the estimates may still contribute to the overall pattern, but they remain compatible with no effect after family-wise correction. I would present the result as an outcome-wide pattern with graded uncertainty, not as four confirmed findings."
Potential Outcomes and Causal Inference
This page introduces the fundamental problem of causal inference, the potential outcomes framework, and the three identification assumptions needed to estimate causal effects from data. It draws on the Women's Health Initiative (WHI) hormone replacement therapy case study as a motivating example.
Motivating example: hormone replacement therapy
Observational evidence (1980s–1990s)
Throughout the 1980s and 1990s, observational studies suggested that oestrogen therapy reduced the risk of coronary heart disease in postmenopausal women by roughly 30% (hazard ratio $\approx 0.68$ for current users vs. never users). Professional bodies endorsed hormone replacement therapy (HRT) on this basis:
- 1992, American College of Physicians: "Women who have coronary heart disease or who are at increased risk ... are likely to benefit from hormone therapy."
- 1996, American Heart Association: "ERT does look promising as a long-term protection against heart attack."
The experiment disagreed
The Women's Health Initiative (WHI) was a large randomised, double-blind, placebo-controlled trial enrolling more than 16,000 women aged 50–79. Participants were randomly assigned to oestrogen plus progestin therapy or placebo and followed for up to eight years.
The experimental hazard ratio for coronary heart disease was 1.23 (initiators vs. non-initiators), the opposite direction from the observational finding.
What went wrong?
The discrepancy was not a failure of causal assumptions. It was a failure of study design: the observational studies did not correctly emulate a target trial. Specifically, they failed to align "time zero" (the start of follow-up) with the moment of treatment initiation, introducing survivor bias. When investigators re-analysed the observational data using a target trial emulation framework that matched treatment initiation to the start of follow-up, the observational estimates aligned with the experimental findings.
Lesson
If you want causal inferences from observational data, design the analysis as though you were running an experiment. Specify the target trial first.
The fundamental problem of causal inference
Causality is never directly observed. To quantify a causal effect, we need to compare two states of the world for the same individual, but each individual can experience only one.
Notation
Let $A$ denote a binary exposure ($A = 1$: treated, $A = 0$: untreated) and $Y$ denote the outcome.
- $Y_i(1)$: the potential outcome for individual $i$ under treatment.
- $Y_i(0)$: the potential outcome for individual $i$ under control.
The individual causal effect is:
$$\tau_i = Y_i(1) - Y_i(0)$$
We say there is a causal effect when $Y_i(1) - Y_i(0) \neq 0$.
The missing-data problem
At most one potential outcome is observed for each individual. The unobserved outcome is the counterfactual:
- If $A_i = 1$ is observed, then $Y_i(0)$ is counterfactual.
- If $A_i = 0$ is observed, then $Y_i(1)$ is counterfactual.
Individual-level causal effects are therefore generally unidentifiable. However, under certain assumptions, we can identify average causal effects at the population level.
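The missing-data structure can be made concrete with a toy table. The values are invented; in real data the full table of potential outcomes is never available, which is the point of the illustration.

```r
# Hypothetical "full" table of potential outcomes for four people.
po <- data.frame(
  id = 1:4,
  a  = c(1, 0, 1, 0),     # treatment actually received
  y1 = c(7, 5, 6, 4),     # outcome under treatment
  y0 = c(5, 5, 3, 4)      # outcome under control
)
po$tau <- po$y1 - po$y0   # individual effects, unobservable in practice

# What the investigator sees: one potential outcome per person, the other is NA.
observed <- data.frame(
  id = po$id,
  a  = po$a,
  y1 = ifelse(po$a == 1, po$y1, NA),
  y0 = ifelse(po$a == 0, po$y0, NA),
  y  = ifelse(po$a == 1, po$y1, po$y0)
)
observed
```

Every row of `observed` has exactly one missing potential outcome, so `tau` cannot be computed for any individual from the observed data alone.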
Three identification assumptions
1. Causal consistency
The potential outcome corresponding to the exposure an individual actually receives equals their observed outcome:
$$Y_i = Y_i(a) \quad \text{if } A_i = a$$
This assumption requires that treatment is well-defined (no hidden versions of treatment) and that there is no interference between units (one person's treatment does not affect another's outcome).
2. Exchangeability
The potential outcomes are independent of treatment assignment. In a randomised experiment, this holds by design. In observational studies, we require conditional exchangeability: after conditioning on a set of measured covariates $L$, treatment assignment is independent of potential outcomes:
$$Y(a) \coprod A \mid L$$
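When exchangeability holds unconditionally, as in a randomised experiment, and consistency also holds, the Average Treatment Effect (ATE) is identified by the difference in observed group means:
$$\text{ATE} = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \mathbb{E}(Y \mid A=1) - \mathbb{E}(Y \mid A=0)$$
In observational settings, under conditional exchangeability given discrete confounders $L$ (together with consistency and positivity), the ATE is identified by standardising the stratum-specific contrasts over the distribution of $L$: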
$$\text{ATE} = \sum_{l} \Big[\mathbb{E}(Y \mid A=1, L=l) - \mathbb{E}(Y \mid A=0, L=l)\Big] \Pr(L=l)$$
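The standardisation formula can be computed by hand on simulated data. This is a minimal sketch assuming a single binary confounder and a true ATE of 1; all parameter values are invented for illustration.

```r
set.seed(434)
n <- 50000
l <- rbinom(n, 1, 0.4)                        # binary confounder
a <- rbinom(n, 1, ifelse(l == 1, 0.7, 0.2))   # L raises the chance of treatment
y <- 1 * a + 2 * l + rnorm(n)                 # true ATE = 1

# Naive contrast: confounded, because treated people have higher L on average
naive <- mean(y[a == 1]) - mean(y[a == 0])

# Standardisation: stratum-specific contrasts weighted by Pr(L = l)
strata <- c(0, 1)
contrast <- sapply(strata, function(s) {
  mean(y[a == 1 & l == s]) - mean(y[a == 0 & l == s])
})
pr_l <- sapply(strata, function(s) mean(l == s))
ate_std <- sum(contrast * pr_l)

round(c(naive = naive, standardised = ate_std), 2)
```

The naive difference overstates the effect because $L$ is more common among the treated; weighting each stratum's contrast by $\Pr(L=l)$ recovers a value close to the simulated truth.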
3. Positivity
Every individual has a non-zero probability of receiving each treatment level, conditional on their covariates:
$$P(A = a \mid L = l) > 0 \quad \text{for all } a \text{ and } l$$
Positivity is the only one of the three assumptions that can be checked directly with data, and only with respect to measured covariates. Violations occur when certain subgroups never receive a particular treatment level, making causal effect estimates for those subgroups extrapolations rather than identifiable quantities.
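For discrete covariates, a first positivity check is a cross-tabulation of treatment by stratum. The data frame below is hypothetical and deliberately tiny; the stratum labels are invented.

```r
# Hypothetical data: nobody in the "rural" stratum receives treatment.
d <- data.frame(
  a = c(1, 0, 1, 0, 1, 0, 0, 0),
  l = c("urban", "urban", "suburban", "suburban",
        "urban", "suburban", "rural", "rural")
)

# Rows are strata of L, columns are treatment levels
cell_counts <- table(d$l, d$a)
cell_counts

# Strata with an empty treatment cell have no empirical support for a contrast
strata_without_overlap <- rownames(cell_counts)[rowSums(cell_counts == 0) > 0]
strata_without_overlap
```

Any stratum flagged here would need to be excluded from the target population or handled by explicit extrapolation, since the data contain no treated and untreated people to compare within it.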
Observational challenges
In observational settings, all three assumptions face threats. Causal consistency may fail when treatment varies across individuals (e.g., different forms of "religious service attendance"). Exchangeability is violated when unmeasured confounders exist. Positivity fails when certain subgroups have no access to treatment. These threats motivate the careful study designs and sensitivity analyses covered in later weeks.
From experiments to observational data
Randomised experiments address the fundamental problem by balancing confounders across treatment groups in expectation. Random assignment satisfies exchangeability by design, and controlled treatment administration satisfies consistency. Although individual causal effects remain unobservable, random assignment allows inference about average (marginal) causal effects.
In observational data, we must satisfy these three assumptions through study design, covariate adjustment, and sensitivity analysis. The remainder of this course develops the tools for doing so: causal diagrams (weeks 2–4), estimation methods (weeks 5–6, 8–9), and measurement considerations (week 10).
Discussion questions
- Where in your own research would an average treatment effect be the right causal estimand, and when would it mask disparities that stakeholders care about?
- Which of the three identification assumptions is most fragile in your field, and what designs or measurements could strengthen it?
- What are examples of post-treatment variables you have been tempted to adjust for, and how would doing so bias the total effect?
Further reading
- Hernán MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC, 2025. Chapters 1–3. Book site
- See the Course Readings page for a chapter-by-week guide.
The Causal Workflow: Ten Steps
This page summarises a ten-step framework for conducting causal inference with observational data. Each step addresses a potential threat to valid causal interpretation. The framework draws on Hernán and Robins (What If, 2024), VanderWeele (2022), and Bulbulia (2024).
How to use this page
This is a reference resource. Use it as a checklist when planning your research report. Each step links back to the lecture where it was introduced.
Step 0: Define the target population
Say exactly who the answer is meant to inform before choosing the exposure or outcomes. A causal question about "New Zealanders" differs from one about "adults for whom weekly religious-service attendance is a meaningful option" or "working-age adults for whom volunteering is feasible." The population choice shapes which interventions are coherent, which outcomes matter, and where positivity may fail.
Eligibility rules define the source population, but sampling and participation can yield a study population with a different distribution of effect modifiers. If you intend to generalise beyond the source population (transportability), articulate the additional conditions required.
See Week 6 for how effect modification interacts with population composition.
Step 1: State a well-defined treatment
Specify the hypothetical intervention precisely enough that every member of the target population could, in principle, receive it. "Mindfulness" is too vague because people meditate with different apps, in groups, alone, once, or every day. A clearer intervention is: "start a guided mindfulness app at the beginning of semester and complete one 10-minute session per day for eight weeks."
Precision here underwrites the causal consistency assumption (Step 5). If the treatment is vaguely defined, different people effectively receive different interventions, and the potential outcome $Y(a)$ is not well-defined. It also makes the relevant time origin visible.
See Week 5 for the formal definition of causal consistency.
Step 2: Establish time zero
Define the point at which treatment assignment begins and follow-up starts. We cannot do this until the treatment is specified, because time zero is defined relative to treatment assignment or initiation.
See Week 5 for why ambiguous time zero distorts target trial emulation.
Step 3: State a well-defined outcome
Define the outcome so the causal contrast is meaningful and temporally anchored. "Sense of Purpose" is underspecified; "psychological distress one year post-intervention, measured with the Kessler-6" is interpretable and reproducible. Include timing, scale, and instrument.
See Week 4 for measurement from the vantage point of causal inference, and Week 10 for the problems with classical test theory.
Step 4: Evaluate exchangeability
Make the case that potential outcomes are independent of treatment conditional on measured covariates $L$: $Y(a) \coprod A \mid L$. Use DAGs, subject-matter arguments, pre-treatment covariate balance checks, and overlap diagnostics. If exchangeability is doubtful, redesign (e.g., stronger measurement, alternative identification strategies) rather than rely solely on modelling.
See Week 5 for the formal definition of exchangeability.
Step 5: Ensure causal consistency
Consistency requires that, for individuals receiving a treatment version compatible with level $a$, the observed outcome equals $Y(a)$. It also presumes well-defined versions and no interference between units. When multiple versions exist, either refine the intervention so versions are irrelevant to $Y(a)$, or condition on version-defining covariates.
See Week 5 for examples of consistency violations.
Step 6: Check positivity (overlap)
Each treatment level must occur with non-zero probability at every covariate profile needed for exchangeability:
$$P(A = a \mid L = l) > 0$$
Diagnose limited overlap using propensity score distributions or extreme weights. Consider design-stage remedies (trimming, restriction, adaptive sampling) before estimation.
See Week 5 for the formal positivity assumption.
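With continuous covariates, overlap is usually examined through estimated propensity scores. The sketch below uses simulated data and a plain logistic regression; the trimming bounds of 0.05 and 0.95 are a hypothetical choice, not a course default.

```r
set.seed(434)
n <- 5000
l <- rnorm(n)                              # continuous baseline covariate
a <- rbinom(n, 1, plogis(2 * l))           # strong dependence on L limits overlap

# Estimated propensity scores from a logistic regression
ps_fit <- glm(a ~ l, family = binomial())
ps <- fitted(ps_fit)

# Range of propensity scores within each treatment arm
tapply(ps, a, range)

# Inverse probability weights; very large values flag limited overlap
w <- ifelse(a == 1, 1 / ps, 1 / (1 - ps))
summary(w)

# One design-stage remedy: restrict to a region of common support
keep <- ps > 0.05 & ps < 0.95
mean(keep)                                 # share of the sample retained
```

Restricting the sample changes the population the estimate describes, so any trimming rule should be stated alongside the target population in Step 0.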
Step 7: Ensure measurement aligns with the scientific question
Verify that constructs are captured by instruments whose error structures do not distort the causal contrast of interest. Be explicit about forms of measurement error (classical, Berkson, differential, misclassification) and their structural implications for bias.
See Week 4 for measurement error in causal DAGs, and Week 10 for classical measurement theory from a causal perspective.
Step 8: Preserve representativeness
End-of-study analyses should reflect the target population's distribution of effect modifiers. Differential attrition, non-response, or measurement processes tied to treatment and outcomes can induce selection bias. Plan strategies such as inverse probability weighting for censoring, multiple imputation, and sensitivity analyses for missing-not-at-random data.
See Week 4 for selection bias structures.
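Inverse probability of censoring weighting can be sketched for the simple case of estimating a population mean when dropout depends on a measured baseline covariate. The data-generating values are invented, and the example assumes dropout depends only on `l`; it does not address missing-not-at-random data.

```r
set.seed(434)
n <- 20000
l <- rbinom(n, 1, 0.5)                  # baseline covariate
y <- 1 + 2 * l + rnorm(n)               # end-of-study outcome; population mean = 2
p_stay <- ifelse(l == 1, 0.4, 0.9)      # dropout is more common when l = 1
stay <- rbinom(n, 1, p_stay)            # 1 = remained in the study
observed <- stay == 1

# Complete-case mean under-represents people with l = 1
mean_cc <- mean(y[observed])

# Model the probability of remaining, then weight observed cases by its inverse
fit <- glm(stay ~ l, family = binomial())
p_hat <- fitted(fit)
w <- 1 / p_hat[observed]
mean_ipcw <- weighted.mean(y[observed], w)

round(c(complete_case = mean_cc, ipcw = mean_ipcw), 2)
```

The weights up-weight the kinds of people who tended to drop out, restoring the baseline distribution of `l` among those who remain.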
Step 9: Document transparently
Make assumptions, disagreements, and judgement calls legible. Register or timestamp your analytic plan. Include identification arguments (DAGs), code, and data where possible. Report robustness and sensitivity analyses. Transparent reasoning is a scientific result in its own right.
Summary table
| Step | Requirement | Core assumption |
|---|---|---|
| 0 | Target population | Generalisability |
| 1 | Well-defined treatment | Consistency |
| 2 | Time zero aligned to assignment | Temporal coherence |
| 3 | Well-defined outcome | Interpretability |
| 4 | Exchangeability | Conditional independence |
| 5 | Causal consistency | No interference, well-defined versions |
| 6 | Positivity | Overlap |
| 7 | Measurement validity | No differential error |
| 8 | Representativeness | No selection bias |
| 9 | Transparent documentation | Reproducibility |
Assessment Self-Checks
Use these checks when preparing for Test 2, the presentation, and the research report. They keep you focused on the causal framework taught in the course.
Test 2 is in class, on paper, with one A4 sheet of notes, no devices, and no AI tools. The important work is yours: state the target population, causal contrast, outcome, estimand, assumptions, and limits.
Causal Question Check
Before estimating or interpreting anything, state:
- Target population: who the answer is meant to apply to.
- Causal contrast: the intervention and comparator, with timing.
- Outcome: what is measured, when it is measured, and how the score is constructed.
- Causal estimand: average treatment effect (ATE), conditional average treatment effect (CATE), or another estimand.
- Identification assumptions: consistency, exchangeability, and positivity.
- Measurement assumptions: whether the measure represents the same outcome across people, groups, treatment states, and time.
If you cannot state these pieces, the analysis is not yet interpretable.
HTE and Policy Tree Check
Heterogeneous treatment effect (HTE) methods describe how estimated treatment effects vary across measured covariates. Policy trees turn those estimates into simple rules or summaries of high-response regions. They are useful, and easy to over-read.
Before reporting HTE or policy tree results, check:
- Outcome: which outcome the rule is for. A rule for one outcome may not be a rule for another.
- Estimand: whether you are reporting CATEs, ranked heterogeneity, policy value, a high-response region, or an allocation rule.
- Modifier status: whether a split variable is a measured descriptor, proxy, or downstream marker. Do not read a split variable as the causal source of effect modification.
- Support: whether both treatment levels are represented in the relevant covariate regions. A policy tree cannot rescue poor positivity.
- Parsimony: whether the depth-2 tree clears the prespecified held-out policy-value point-gain threshold. Prefer the simpler rule by default, then use uncertainty, stability, equity, and implementation burden to decide how cautiously to interpret any threshold-clearing depth-2 rule.
- Uncertainty: whether the policy value and subgroup effects have confidence intervals.
- Oversight: what value judgement would authorise acting on the rule. The model can estimate who appears to benefit more; it cannot decide which public value should win.
Useful wording:
The policy tree describes where estimated treatment benefits are largest in the measured data. The split variables should be interpreted as descriptors for targeting and hypothesis generation, not as identified causes of differential response.
Avoid:
The tree shows that deprivation causes the treatment to work better.
Better:
The tree split on deprivation, which may be a useful marker of differential response. The causal source of that heterogeneity could lie upstream, including in variables not directly measured in the tree.
Measurement Check
For measurement questions, do not treat associational model output as evidence of causal structure. Regression coefficients, factor loadings, structural equation models, invariance tests, fit indices, and reliability statistics summarise patterns in the data. They do not identify the causal structure that produced those patterns.
Ask:
- Does the measured score represent the same outcome under the intervention and control conditions?
- Does the measure work the same way across the groups being compared?
- Could exposure, intervention, or group membership affect how people interpret or answer the items?
- Could missingness, translation, response style, or social desirability affect the measured outcome?
- If a statistic is used to support "validity", what causal or measurement assumption is it meant to support, and what alternative explanation does it fail to rule out?
Reporting Guide
This guide shows what to report in the research report. Use it as a checklist for average treatment effects, heterogeneous effects, policy trees, and sensitivity analyses.
The ten-step causal inference checklist
Before reporting results, check that each piece is in place.
Steps 0–3: problem definition
- Target population. Define who the results apply to, including any weighting for population representativeness.
- Well-defined treatment. Specify the exposure precisely, including the contrast (e.g., "weekly religious service attendance vs. less than weekly").
- Time zero. State when treatment assignment or initiation begins and when follow-up starts.
- Well-defined outcome. State the outcome measure, its scale, and when it was assessed.
Steps 4–6: identification strategy
- Exchangeability. Describe the baseline covariates you adjust for and why they are enough, under your causal directed acyclic graph (causal DAG).
- Consistency. Explain why the treatment is well-defined and uniform across individuals.
- Positivity. Report checks showing that both exposure levels appear across the relevant covariate patterns.
Steps 7–9: implementation
- Measurement. Explain how the outcome is measured and any assumptions needed for the measure to mean the same thing across people, groups, and time.
- Attrition handling. The simulated data do not have missing responses, so follow the Lab 10 template: state how panel dropout would be handled if present, for example with inverse probability of censoring weights.
- Transparent reporting. Document the analysis choices and assumptions.
Target trial emulation
Frame your causal question as: "How would outcomes change if we intervened to set everyone's exposure to level $a=1$ rather than $a=0$, conditional on baseline characteristics?"
Reporting average treatment effects
Standard ATE table format
Include these elements for each outcome:
| Outcome | Estimate (SD units) | Bonferroni 95% CI | E-value (point) | E-value (Bonferroni bound) |
|---|---|---|---|---|
| Outcome A | 0.12 | [0.05, 0.19] | 1.8 | 1.3 |
| Outcome B | 0.15 | [0.08, 0.22] | 2.1 | 1.4 |
Key reporting elements
- Effect sizes: report in standard deviation units. The simulator's four wellbeing outcomes are pre-z-scored, so the standardised scale is the only scale available; there is no original 1–7 scale to recover.
- Confidence intervals: report the multiplicity-adjusted (Bonferroni) interval, since the design is outcome-wide across four outcomes. With four outcomes, each interval is computed at the 98.75% level so that the family-wise level is 95%.
- E-values: report two, one for the point estimate and one for the lower end of the multiplicity-adjusted confidence interval (the bound nearest the null). Both are produced by the ate_table() helper in the research-report template's setup.R.
- Sample size: total analysed after exclusions and weighting.
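To see roughly where the two E-values come from, the sketch below applies the standard E-value formula for a risk ratio together with the common approximation that converts a standardised mean difference $d$ to a risk ratio, $\exp(0.91\,d)$. The helper names `evalue_rr()` and `smd_to_rr()` are hypothetical, and the course's ate_table() helper may compute E-values differently; the inputs are the belonging estimate (0.18) and its lower limit (0.11) quoted in the example results text.

```r
# E-value for a risk ratio: rr + sqrt(rr * (rr - 1)), after flipping rr < 1
evalue_rr <- function(rr) {
  rr <- ifelse(rr < 1, 1 / rr, rr)
  rr + sqrt(rr * (rr - 1))
}

# Approximate conversion from a standardised mean difference to a risk ratio
smd_to_rr <- function(d) exp(0.91 * d)

est <- 0.18      # point estimate in SD units
lower <- 0.11    # multiplicity-adjusted lower confidence limit

e_point <- evalue_rr(smd_to_rr(est))
# If the interval included zero, the E-value for the bound would be 1
e_bound <- if (lower > 0) evalue_rr(smd_to_rr(lower)) else 1

round(c(point = e_point, bound = e_bound), 2)
```

The bound E-value is always closer to 1 than the point E-value, which is why it is the more conservative of the two summaries.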
Example results text
"Weekly religious service attendance was estimated to improve all four wellbeing outcomes. The largest estimates were for sense of belonging ($\beta = 0.18$ SD units, Bonferroni 95% CI: 0.11–0.25) and life satisfaction ($\beta = 0.15$ SD units, Bonferroni 95% CI: 0.08–0.22). The point E-values were above 1.6 and the Bonferroni-bound E-values above 1.3, meaning an unmeasured confounder would need to be associated with both attendance and these outcomes by a risk ratio of at least 1.3 each, above and beyond measured covariates, to push the multiplicity-adjusted lower bound to the null."
Reporting heterogeneous treatment effects
For Option A, the policy-tree workflow is the only heterogeneity output. The rank-weighted average treatment effect (RATE) and Qini diagnostics introduced in Lab 8 are not part of this report's scaffold; the template does not ship code for them, and you do not need to include them.
Reporting policy tree results
Present each policy tree on the standardised outcome scale used by the simulator (the four wellbeing outcomes are pre-z-scored, so SD units are the only scale available). The tree describes the action or high-response region implied by the supplied rewards; it does not force a fixed percentage treated.
The course policy-tree workflow uses an outcome-only objective. A "do not treat" leaf means the rule assigns the no-treatment action for that covariate profile. It does not compute money saved, staff time saved, or other resource savings. To make savings part of the analysis, investigators would need to specify a treatment cost in outcome units, subtract that cost from the treatment reward, and refit or compare trees across plausible cost values.
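The effect of subtracting a treatment cost can be sketched on a toy reward matrix. The rewards, the cost of 0.05 SD units, and the helper `best_action()` are all hypothetical, and the example picks the best action per person rather than fitting a tree; a real analysis would refit or compare policy trees on the cost-adjusted rewards.

```r
# Hypothetical per-person rewards in outcome (SD) units
rewards <- cbind(
  control = c(0, 0, 0, 0),
  treat   = c(0.30, 0.08, 0.02, -0.05)
)

# Best action for each person: the column with the larger reward
best_action <- function(r) colnames(r)[max.col(r, ties.method = "first")]

# Outcome-only objective
action_outcome_only <- best_action(rewards)

# Cost-adjusted objective: subtract an assumed cost from the treat column
cost <- 0.05
rewards_cost <- rewards
rewards_cost[, "treat"] <- rewards_cost[, "treat"] - cost
action_with_cost <- best_action(rewards_cost)

data.frame(action_outcome_only, action_with_cost)
```

People with small positive estimated benefits switch to "control" once the cost is charged, which is how a cost assumption changes the treated share.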
Parsimony rule
Fit both depth-1 and depth-2 policy trees. Prefer the depth-1 tree
unless the depth-2 tree improves held-out policy value, using the point
estimate, by at least min_gain_for_depth_switch. The course default
is:
min_gain_for_depth_switch <- 0.01
State the threshold before reporting the selected tree. If depth-2 does
not clear the threshold, report the simpler tree in the main text and
place the full depth comparison in the appendix. If depth-2 clears the
threshold but uncertainty is wide, describe the rule as promising or
fragile and use stability, equity, and implementation burden to temper
the conclusion. The margot_policy_workflow() function applies the
point-gain rule automatically and exposes the comparison in
wf$best$depth_summary_df.
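The point-gain rule can be sketched as a small base-R helper. The function choose_policy_depth() and the policy values passed to it are hypothetical; this is not the margot implementation, only the comparison that the rule describes.

```r
# Hypothetical helper illustrating the point-gain parsimony rule.
choose_policy_depth <- function(value_depth1, value_depth2,
                                min_gain_for_depth_switch = 0.01) {
  # gain in held-out policy value from moving to the deeper tree
  gain <- value_depth2 - value_depth1
  if (gain >= min_gain_for_depth_switch) 2L else 1L
}

# gain of 0.005 does not clear the threshold: keep depth 1
depth_small_gain <- choose_policy_depth(value_depth1 = 0.100, value_depth2 = 0.105)

# gain of 0.030 clears the threshold: switch to depth 2
depth_large_gain <- choose_policy_depth(value_depth1 = 0.100, value_depth2 = 0.130)
```

The helper uses point estimates only, which is why the surrounding guidance asks you to temper a depth-2 choice when its uncertainty is wide.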
Graphing rule
Apply margot::margot_select_grf_policy_trees() after the policy-tree
workflow runs. The graphing rule keeps a policy tree only when both
the policy-value lower confidence limit and the treated-uplift lower
confidence limit exceed zero. Course defaults set both lower-CI
thresholds at zero:
policy_value_lower_threshold <- 0
treated_uplift_lower_threshold <- 0
State the thresholds in your methods. Outcomes that fail the rule remain in your tables and prose; their trees do not appear as figures. The rule is a precommitment device: state the test, then graph only what passes.
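The graphing rule amounts to a two-condition filter. The sketch below applies it to a made-up summary table; the column names and values are hypothetical and do not reproduce the output of margot_select_grf_policy_trees().

```r
# Hypothetical summary of policy-tree results, one row per outcome.
tree_summary <- data.frame(
  outcome = c("outcome_1", "outcome_2", "outcome_3", "outcome_4"),
  policy_value_lower = c(0.04, 0.02, -0.01, 0.03),
  treated_uplift_lower = c(0.06, 0.01, 0.02, -0.02)
)

policy_value_lower_threshold <- 0
treated_uplift_lower_threshold <- 0

# keep a tree for graphing only when both lower limits exceed their thresholds
tree_summary$graph_tree <-
  tree_summary$policy_value_lower > policy_value_lower_threshold &
  tree_summary$treated_uplift_lower > treated_uplift_lower_threshold

tree_summary
```

Rows with graph_tree equal to FALSE stay in the table; only their figures are withheld.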
Subgroup reporting
For each graphed tree, report the high-response subgroups with their estimated effects, uncertainty, and sample proportions. Use the standardised scale; do not invent an original-scale interpretation, because the simulator outcomes are not on a 1–7 (or any other) raw scale.
Report coverage as the treated share implied by the selected rule, not as a budget constraint. If a programme can treat only a fixed share, say so explicitly. The default policy tree estimates the value of a shallow allocation rule; it does not solve a fixed-capacity allocation problem.
Example: high-response subgroups for life satisfaction
- Older adults with high baseline belonging (age > 45, baseline
belonging > +1 SD)
- Standardised effect: $\beta = 0.28$ SD units (95% CI: 0.21–0.35)
- Sample proportion: 23%
- Younger adults with lower baseline purpose (age <= 45, baseline
purpose < 0)
- Standardised effect: $\beta = 0.22$ SD units (95% CI: 0.16–0.28)
- Sample proportion: 31%
- All others
- Standardised effect: $\beta = 0.08$ SD units (95% CI: 0.04–0.12)
- Sample proportion: 46%
Example policy tree text
"The policy tree estimates the expected value of assigning the single treatment according to the fitted rule. In this example, the rule assigns treatment to two profiles with larger estimated gains in life satisfaction. Older adults (over 45) with high baseline belonging had the largest estimated gain ($\beta = 0.28$ SD units), representing 23% of the sample. Interpret these leaves as parts of an allocation rule, with age and baseline belonging used as splitting variables."
Avoid phrasing such as "the do-not-treat leaf saves resources" unless a treatment cost has been built into the objective. Under the course workflow, the accurate wording is: "the outcome-only rule assigns these profiles to no treatment because the no-treatment action has the higher estimated value for that leaf."
Sensitivity analysis: E-values
Interpretation
An E-value says how strongly an unmeasured confounder would need to be associated, on the risk ratio scale, with both the treatment and the outcome to explain away the observed effect.
There is no universal threshold at which an E-value becomes "safe". Interpret it against the study design, the covariates already measured, and plausible omitted causes in the setting. Report the E-value for the point estimate and for the confidence limit closest to the null, and explain what kind of unmeasured confounder would be needed for the result to disappear.
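For a risk ratio $RR > 1$, the standard E-value formula is $RR + \sqrt{RR \times (RR - 1)}$. The sketch below implements it in base R. The conversion from a standardised effect $d$ to an approximate risk ratio, $RR \approx \exp(0.91 \times d)$, is a common approximation for continuous outcomes; we assume it here for illustration and have not confirmed that the report template uses it. The function names are hypothetical.

```r
# E-value for a risk ratio (risk ratios below 1 are inverted first).
evalue_rr <- function(rr) {
  rr <- ifelse(rr < 1, 1 / rr, rr)
  rr + sqrt(rr * (rr - 1))
}

# Approximate conversion from a standardised mean difference to a risk ratio.
# Assumption: RR is approximately exp(0.91 * d).
smd_to_rr <- function(d) exp(0.91 * d)

evalue_rr(1.8)              # a risk ratio of 1.8 gives an E-value of 3
evalue_rr(smd_to_rr(0.18))  # E-value for a 0.18 SD effect under the approximation
```

For the confidence limit closest to the null, apply the same function to that limit. If the interval includes the null, the E-value for the limit is 1.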
Example sensitivity text
"For sense of belonging, the E-value was 2.4. An unmeasured confounder would need to be associated with both religious service attendance and belonging by a risk ratio of 2.4 each, above and beyond the measured covariates, to explain away the estimate."
Methods section template
A complete methods section should include:
- Treatment definition: what the exposure is, how it is coded, and the contrast of interest.
- Time zero and follow-up: when assignment occurs, when follow-up starts, and why that timing matches the intervention.
- Outcome definition: measures used, timing of assessment, any transformations applied.
- Target population: sampling frame, weighting strategy, eligibility criteria.
- Causal identification: covariates adjusted for, with a justification for conditional exchangeability.
- Statistical analysis: estimation method, key tuning parameters (e.g., number of trees, minimum node size).
- Attrition handling: censoring weights, stages of dropout addressed.
- Heterogeneity assessment: policy-tree depth (with parsimony decision), policy value, treated uplift, and the graphing-rule decision.
- Sensitivity analysis: E-values for all primary estimates.
Reporting checklist
Do report
- Effect sizes in SD units (the simulator's outcome scale) with multiplicity-adjusted (Bonferroni) confidence intervals
- Sample sizes after exclusions and weighting
- E-values for both the point estimate and the Bonferroni-adjusted lower bound
- Clear practical interpretation of effect sizes
- Subgroup sizes and subgroup estimates for graphed policy trees
- The parsimony threshold and the graphing-rule thresholds you used
- Target trial framework and causal question
- Explicit treatment and outcome definitions
Do not report
- Model coefficients without interpretation
- p-values alone without effect sizes
- Original-scale (1–7) effects for the simulated outcomes; the simulator returns z-scored outcomes only
- Technical details that obscure main findings
- Causal claims beyond your identification strategy
Figure presentation
ATE plots
- Use forest plots with confidence intervals.
- Order by effect magnitude or E-value.
- Include sample sizes.
Policy tree plots
- Show decision rules clearly.
- Include sample proportions in each node.
- Provide plain-language interpretation alongside the tree.
Presentation Advice (10%)
This advice is for the in-class presentation in Week 12. You have 10 minutes, followed by one panel question. You may use the whiteboard and paper notes. No slides, handouts, or devices.
The talk is marked on four criteria, each worth 2.5%. Each criterion is scored in one of four bands:
- A range — 80–100% of the criterion mark.
- B range — 65–79%.
- C range — 50–64% (the pass band).
- D / E — below 50% (fail).
These bands map to Victoria Uni's grade letters. For the full schedule, see the VUW grades and grade point average page.
The four criteria are marked separately. A talk can be strong on causal reasoning and still need work on answering the "so what" question.
What we look for
A good presentation leaves the audience clear on three things:
- What causal question did you ask?
- How would you answer it?
- Why would the answer matter?
That is your job.
Rubric
1. Clarity and structure (2.5%)
Can we follow the talk? This criterion covers pacing, whiteboard work, and signposting.
| Band | Descriptor |
|---|---|
| A range | Opens with a plain one-sentence motivation. The whiteboard is tidy, legible, and useful. The talk fits the 10-minute slot with a little room to breathe. Each part follows naturally from the last. |
| B range | The structure is clear enough to follow. The whiteboard helps, though it has some clutter or unclear notation. Pacing slips a little, usually by rushing the end. |
| C range | The audience can mostly follow, but has to do some of the organising. The whiteboard is hard to read or not used enough. Timing causes important parts to be skipped or crammed. |
| D / E | The talk is hard to follow. The whiteboard is a wall of text, or almost empty. The speaker runs out of time before reaching the answer. |
2. Causal reasoning (2.5%)
Does the talk get the causal logic right? This is the course-specific part of the mark.
| Band | Descriptor |
|---|---|
| A range | States the causal question clearly: target population, exposure contrast, and outcome(s). Names the causal estimand, such as the average treatment effect, and keeps it separate from the statistical estimate. Uses the directed acyclic graph (DAG) to justify the adjustment set. Names and defends consistency, positivity, and conditional exchangeability. |
| B range | The question, exposure, and outcome are clear. The DAG is present and mostly right. Identification assumptions are mentioned, though one is a bit thin or vague. The estimand is stated but sometimes blurred with the estimate. |
| C range | The causal question is implied rather than stated. The DAG is there, but misses important arrows or confounders. Causal language slips, for example using "predicts" when the claim is causal. |
| D / E | The causal question is treated as if it were just a statistical model. The DAG is missing or wrong. Causal language is loose throughout. |
3. So what — including subgroup and ethics (2.5%)
Why should anyone outside the room care? This is the bit students most often leave too late. Because this course studies differences across people and groups, your "so what" should name the subgroup with the strongest estimated treatment effect (from your policy tree) and one practical or ethical issue that would matter before anyone acted on the result.
| Band | Descriptor |
|---|---|
| A range | Names the population, the size of effect that would matter in practice, who might act on the result, and the assumption that would change the conclusion if it failed. Describes the strongest-response subgroup in plain language. Names one issue such as fairness, proxy variables, consent, cost, or who gets to decide. |
| B range | Names a plausible audience and decision. Mentions the size of the effect, but does not anchor it well. Names the strongest-response subgroup. The practical or ethical issue is present, but needs more detail. |
| C range | Says the result is important without saying who would use it. Subgroup or ethics is tacked on at the end. The "so what" is generic, for example "this matters for wellbeing". |
| D / E | The "so what" is missing, or just says the topic is interesting. No subgroup content and no practical or ethical issue. |
4. Response to the panel question (2.5%)
How well does the speaker handle one question after the talk? You may ask one brief clarifying question before answering. This is not a trap; it is a chance to show you understand your own argument.
| Band | Descriptor |
|---|---|
| A range | Listens to the question, checks understanding briefly, and answers directly. If the answer is uncertain, says so and names what evidence would help. The response deals with the question rather than replaying the talk. |
| B range | Answers the question asked. Some prepared material comes back in, but the answer still lands. Acknowledges uncertainty where needed. |
| C range | Partly answers, then drifts into nearby prepared material. Uncertainty is hidden or talked around. |
| D / E | Does not really answer the question. Repeats earlier material or answers a different question. |
A worked "so what" answer
For a talk on religious_service → outcome-wide wellbeing:
"If our estimate is right, encouraging weekly attendance among adults who currently attend less often would lift four wellbeing outcomes by roughly a tenth of a standard deviation each. For a public-health body deciding whether to fund community programmes that lower the cost of regular religious participation, that magnitude is in the range that has justified investment in similar programmes elsewhere. The recommendation depends on conditional exchangeability after adjusting for thirteen baseline covariates plus baseline outcomes; if a strong unmeasured confounder remained — say, a stable disposition we have not measured — the result could be reversed for some outcomes. The E-values say a confounder would need to be associated with both attendance and the outcomes by a risk ratio of at least 1.6 to do that."
This answer names the population (adults who currently attend less often), the magnitude (a tenth of a standard deviation, four outcomes), the decision-maker (a public-health body), the decision (whether to fund community programmes), and the assumption that would change the answer (conditional exchangeability, with an E-value benchmark).
Common pitfalls
- Spending the first three minutes on background and never reaching the estimand.
- Drawing the DAG without naming the adjustment set it implies.
- Reporting only standardised effects, with no sense of whether the effect is big enough to matter.
- Reading from notes word for word. Notes are allowed; karaoke is not the goal.
- Treating the panel question as a personal attack. It is just part of the assessment.
Practical checklist
Preparation
- Rehearse the opening sentence until it is short and clear.
- Plan the whiteboard layout in advance (left third for the DAG, centre for the estimand, right for the result).
- Time yourself: aim for eight minutes, leaving two for breathing room.
- Anticipate one likely panel question and prepare a one-sentence answer.
- Re-read the Reporting Guide for terminology consistent with the course.
Simulation Guide
Simulations are pedagogical tools that let us see what causal inference methods do when we know the truth. In observational research, we never know the true causal effect. In a simulation, we build the data-generating process ourselves, so we can compare each method's estimate against the ground-truth parameter. The four simulations in this guide illustrate distinct threats to valid causal inference and distinct strategies for addressing them.
The data-generating helpers live in the causalworkshop package. Each
section below shows the helper call followed by the analysis we apply to
it, so you can see the reasoning step by step.
Required R packages
causalworkshop, tidyverse, stdReg, gtsummary, clarify, grf. Follow Lab Setup: R Packages and Build Tools, then install causalworkshop from GitHub with pak::pak("go-bayes/causalworkshop") (version $\geq$ 0.6.0).
Generalisability and transportability
Connects to Week 4: external validity and selection bias.
This simulation creates two populations that differ in the prevalence of an effect modifier $Z$. In the sample, $Z = 1$ is rare ($p = 0.1$); in the target population, $Z = 1$ is common ($p = 0.5$). The treatment effect depends on $Z$: individuals with $Z = 1$ benefit more from treatment. Because the sample under-represents these high-benefit individuals, the naive sample Average Treatment Effect (ATE) underestimates the population ATE.
library(causalworkshop)
data <- simulate_ate_data_with_weights(
n_sample = 10000,
n_population = 100000,
p_z_sample = 0.1,
p_z_population = 0.5,
beta_a = 1,
beta_z = 2.5,
beta_az = 0.5,
noise_sd = 0.5,
seed = 123
)
sample_data <- data$sample_data # y_sample, a_sample, z_sample, weights
population_data <- data$population_data # y_population, a_population, z_population
The simulation fits three models: an unweighted model on the sample, a weighted model on the sample (using inverse-probability-of-sampling weights), and an oracle model on the full population. The regression coefficients are nearly identical across all three models, yet the marginal ATEs differ. This dissociation is the central lesson: model coefficients describe conditional associations, but the ATE is a marginal quantity that depends on the distribution of effect modifiers in the target population. Weighting the sample to match the population distribution of $Z$ recovers the correct ATE.
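A worked calculation makes the gap concrete. Assuming the helper generates outcomes from a linear model with an $A \times Z$ interaction (an assumption about its internals; the parameter names are taken from the call above), the ATE in a population where $P(Z = 1) = p$ is:

$$
\text{ATE}(p) = \beta_a + \beta_{az}\, p, \qquad
\text{ATE}_{\text{sample}} = 1 + 0.5 \times 0.1 = 1.05, \qquad
\text{ATE}_{\text{population}} = 1 + 0.5 \times 0.5 = 1.25
$$

Under that assumption, $\beta_a$ and $\beta_{az}$ are the same in both populations; only $p$ differs, which is why identical coefficients can yield different marginal effects.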
The legacy script also includes a manual calculation that shows exactly
what stdReg does under the hood: create counterfactual datasets in
which everyone receives $A = 0$ and everyone receives $A = 1$,
predict outcomes under each scenario, and take the mean difference. This
"g-computation" procedure makes the marginalisation step explicit.
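The g-computation steps can be sketched in base R. This is not the legacy script: the data-generating model, the randomised treatment, and the sampling weights below are assumptions chosen to mirror the parameter values used earlier in this section.

```r
set.seed(2026)
n <- 5000
z <- rbinom(n, 1, 0.1)   # effect modifier, rare in the sample
a <- rbinom(n, 1, 0.5)   # randomised treatment (assumed)
y <- 1 * a + 2.5 * z + 0.5 * a * z + rnorm(n, sd = 0.5)
d <- data.frame(y = y, a = a, z = z)

fit <- lm(y ~ a * z, data = d)

# counterfactual datasets: everyone treated, everyone untreated
d_treated <- transform(d, a = 1)
d_control <- transform(d, a = 0)

# predicted outcome under each scenario, then the unit-level contrast
contrast <- predict(fit, newdata = d_treated) - predict(fit, newdata = d_control)

# marginal ATE for the sample distribution of z
ate_sample <- mean(contrast)

# reweight to a target population with P(Z = 1) = 0.5
w <- ifelse(d$z == 1, 0.5 / 0.1, 0.5 / 0.9)
ate_target <- weighted.mean(contrast, w)

c(sample = ate_sample, target = ate_target)
```

The fitted model is the same in both calculations; only the averaging step changes, which is the marginalisation the paragraph above describes.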
Key takeaway
Regression coefficients can be correct and yet the ATE can still be wrong for the target population. External validity requires that the distribution of effect modifiers in the sample matches the target, or that we adjust for the mismatch.
Cross-sectional data ambiguity
Connects to Week 3: confounding versus mediation.
This simulation generates data in which $A$ causes $L$, and $L$ causes $Y$. The variable $L$ is therefore a mediator, not a confounder.
data_cross <- simulate_mediation_example(
n = 1000,
beta = 2, # effect of A on L
delta = 1.5, # effect of L on Y
seed = 123
)
fit_blocked <- lm(Y ~ A + L, data = data_cross) # incorrectly conditions on the mediator
fit_total <- lm(Y ~ A, data = data_cross) # correctly recovers the total effect
The model that conditions on $L$ returns a near-zero estimate for the
effect of $A$ on $Y$ because it blocks the very path through which
$A$ operates. The model that omits $L$ correctly recovers the total
effect (true value: beta * delta = 3).
The crux of the problem is that with cross-sectional data alone, the investigator cannot distinguish a confounder from a mediator. Both the fork $A \leftarrow L \rightarrow Y$ and the chain $A \rightarrow L \rightarrow Y$ produce the same observed association between $A$, $L$, and $Y$. The correct modelling decision depends on the assumed causal structure, which the data themselves do not reveal.
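A short simulation illustrates the ambiguity. Both structures imply the same testable pattern: $A$ and $Y$ are associated marginally and independent given $L$. The coefficients and sample size below are arbitrary choices for illustration.

```r
set.seed(1)
n <- 20000

# chain: A -> L -> Y (L is a mediator)
a_chain <- rnorm(n)
l_chain <- 2 * a_chain + rnorm(n)
y_chain <- 1.5 * l_chain + rnorm(n)

# fork: A <- L -> Y (L is a confounder)
l_fork <- rnorm(n)
a_fork <- 2 * l_fork + rnorm(n)
y_fork <- 1.5 * l_fork + rnorm(n)

# marginal association is non-zero in both structures
b_chain_marginal <- coef(lm(y_chain ~ a_chain))["a_chain"]
b_fork_marginal <- coef(lm(y_fork ~ a_fork))["a_fork"]

# given L, the coefficient on A is near zero in both structures
b_chain_given_l <- coef(lm(y_chain ~ a_chain + l_chain))["a_chain"]
b_fork_given_l <- coef(lm(y_fork ~ a_fork + l_fork))["a_fork"]
```

In the chain the marginal coefficient is the causal effect and the adjusted one is biased; in the fork the reverse holds. Nothing in the fitted models indicates which case applies.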
Warning
Good model fit does not resolve this ambiguity. A model that conditions on a mediator can fit the data well while returning a biased causal estimate. Model fit is a statistical property; confounding is a structural (causal) property.
Confounding control strategies
Connects to Weeks 3–4: conditioning choices and the backdoor criterion.
This simulation builds a three-wave panel structure with a baseline covariate $L_0$, a prior outcome $Y_0$, a prior exposure $A_0$, an unmeasured confounder $U$, a treatment $A_1$, and an outcome $Y_2$. The true treatment effect is $\delta_{A_1} = 0.3$, and the outcome also depends on $Y_0$, $A_0$, $L_0$, their interaction, and $U$.
data_panel <- simulate_three_wave_panel(
n = 10000,
delta_A1 = 0.3,
seed = 123
)
fit_no_control <- lm(Y_2 ~ A_1, data = data_panel)
fit_standard <- lm(Y_2 ~ A_1 + L_0, data = data_panel)
fit_interaction <- lm(Y_2 ~ A_1 * (L_0 + A_0 + Y_0), data = data_panel)
Three models are compared. The "no control" model regresses $Y_2$ on $A_1$ alone and overestimates the effect because it leaves all backdoor paths open. The "standard" model adds $L_0$ but still omits $Y_0$ and $A_0$, leaving residual confounding. The "interaction" model conditions on $L_0$, $A_0$, $Y_0$, and their interactions with $A_1$, recovering an estimate close to the true value.
The simulation uses the clarify package to obtain simulation-based
confidence intervals for each ATE. The progressive improvement from no
control to standard to interaction control illustrates that closing more
backdoor paths moves the estimate closer to the truth, but only
conditioning on the right set of variables eliminates confounding
entirely.
Key takeaway
In a three-wave panel, conditioning on the prior exposure, prior outcome, and baseline covariates (along with their interactions) is ordinarily necessary to satisfy the backdoor criterion. Omitting any of these leaves residual confounding.
Causal forest estimation
Connects to Week 8: machine learning for heterogeneous treatment effects.
This simulation reuses simulate_three_wave_panel(), then fits a causal
forest (from the grf package) with $L_0$, $A_0$, and $Y_0$ as
covariates.
library(grf)
data_panel <- simulate_three_wave_panel(seed = 123)
W <- as.matrix(data_panel$A_1)
Y <- as.matrix(data_panel$Y_2)
X <- as.matrix(data_panel[, c("L_0", "A_0", "Y_0")])
fit_cf <- causal_forest(X, Y, W)
average_treatment_effect(fit_cf)
If grf warns that estimated treatment propensities are low for some
observations, read that as a positivity warning, not as a broken
simulation. The example still recovers the true effect closely, but the
warning is a useful reminder that causal forests need overlap between
treated and untreated observations.
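A quick overlap check can make a positivity warning concrete. The sketch below uses simulated data and a logistic propensity model in base R rather than grf's own propensity estimates; the 0.05 and 0.95 cut-offs are illustrative assumptions, not course thresholds.

```r
set.seed(2026)
n <- 5000
x <- rnorm(n)
a <- rbinom(n, 1, plogis(2 * x))  # treatment strongly driven by x

# estimated propensity scores from a logistic regression
ps <- fitted(glm(a ~ x, family = binomial))

summary(ps)

# share of observations with extreme estimated propensities
share_extreme <- mean(ps < 0.05 | ps > 0.95)
share_extreme
```

A large share of extreme propensities means that some covariate profiles contain almost only treated or almost only untreated people, so estimates for those profiles rest on extrapolation.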
The causal forest is a non-parametric method that estimates conditional average treatment effects $\hat{\tau}(x)$ by partitioning the covariate space adaptively. Unlike the parametric models above, the causal forest does not require the investigator to specify interaction terms; it discovers them from the data.
Comparing the causal forest estimate to the parametric interaction model illustrates two points. First, the causal forest can recover the ATE without requiring the analyst to guess the correct functional form. Second, the causal forest provides a standard error that accounts for the adaptive splitting, making valid inference possible even in the non-parametric setting.
Key takeaway
Causal forests automate the discovery of heterogeneous treatment effects but still require the investigator to supply the correct set of confounders. Machine learning solves the functional-form problem, not the identification problem.
Glossary and DAG Hand-outs
A reference page covering causal inference terminology and links to hand-out PDFs on directed acyclic graphs (DAGs), confounding, selection bias, and measurement error.
Causal inference glossary
Causal inference rests on mathematical foundations that enjoy broad agreement, but the terminology varies across disciplines. The same word sometimes carries different (even opposite) meanings in different literatures. Terms to watch include "selection", "fixed effects", "standardisation", "moderator", "structural equation model", and "identification".
Core concepts
| Term | Definition |
|---|---|
| Average Treatment Effect (ATE) | The expected difference in potential outcomes across the entire population: $\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$. Also called the marginal effect. |
| Conditional Average Treatment Effect (CATE) | The ATE within a subgroup defined by covariates $X = x$: $\text{CATE}(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$. |
| Potential outcomes | The outcomes that would be observed under each possible treatment level. For individual $i$: $Y_i(1)$ under treatment, $Y_i(0)$ under control. Also called counterfactual outcomes. |
| Counterfactual | The potential outcome corresponding to the treatment level not actually received. Unobservable for any given individual. |
| Causal consistency | $Y_i(a) = Y_i$ when $A_i = a$. The observed outcome equals the potential outcome under the treatment actually received. Requires well-defined treatment and no interference. |
| Exchangeability | Potential outcomes are independent of treatment assignment: $Y(a) \coprod A$. In observational studies, we require conditional exchangeability: $Y(a) \coprod A \mid L$. |
| Positivity | Every subgroup has a non-zero probability of receiving each treatment level: $P(A = a \mid L = l) > 0$. |
Graphical concepts
| Term | Definition |
|---|---|
| DAG (directed acyclic graph) | A graph with directed edges (arrows) and no cycles. Used to encode causal assumptions about which variables influence which. |
| Confounder | A common cause of both the exposure and the outcome. Creates a non-causal (backdoor) path that must be blocked for valid causal inference. |
| Collider | A variable caused by two or more other variables on a path. Conditioning on a collider opens a spurious association. |
| Mediator | A variable on the causal pathway between exposure and outcome ($A \to M \to Y$). Conditioning on a mediator blocks the indirect effect. |
| Backdoor path | A non-causal path from exposure to outcome that passes through a common cause. Blocking all backdoor paths satisfies the backdoor criterion. |
| d-separation | A graphical criterion for determining conditional independence. Two variables are d-separated given a set $Z$ if every path between them is blocked by $Z$. |
Estimation and sensitivity
| Term | Definition |
|---|---|
| Propensity score | The probability of receiving treatment given covariates: $e(L) = P(A = 1 \mid L)$. Used for weighting, matching, or stratification. |
| E-value | The minimum strength of association an unmeasured confounder would need with both treatment and outcome (on the risk ratio scale) to explain away an observed effect. |
| RATE (Rank-Weighted Average Treatment Effect) | A metric for assessing treatment effect heterogeneity. Measures how well a prioritisation rule identifies individuals with larger effects. |
| Qini curve | A cumulative gain curve showing the benefit of treating individuals in order of predicted treatment effect. Area under the Qini curve summarises heterogeneity. |
| Policy tree | A decision tree that assigns treatment based on covariates to maximise a welfare criterion. Used for identifying high-response subgroups. |
| Doubly robust estimation | An estimation strategy that yields consistent causal estimates if either the outcome model or the propensity score model (but not necessarily both) is correctly specified. |
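Doubly robust estimation can be illustrated with an augmented inverse-probability-weighted (AIPW) estimator, one common doubly robust estimator (the glossary does not name a specific one, so the choice is ours). In this base-R sketch the outcome model is correctly specified and the propensity model is deliberately wrong, yet the estimate stays close to the simulated true effect of 0.3. All values are simulated.

```r
set.seed(42)
n <- 20000
l <- rnorm(n)                          # measured confounder
a <- rbinom(n, 1, plogis(0.8 * l))     # treatment depends on l
y <- 0.3 * a + l + rnorm(n)            # true ATE is 0.3

# outcome model (correctly specified)
fit_y <- lm(y ~ a + l)
m1 <- predict(fit_y, newdata = data.frame(a = 1, l = l))
m0 <- predict(fit_y, newdata = data.frame(a = 0, l = l))

# propensity model, deliberately misspecified (ignores l)
e_hat <- fitted(glm(a ~ 1, family = binomial))

# AIPW: outcome-model contrast plus weighted residual corrections
ate_aipw <- mean(
  m1 - m0 +
    a * (y - m1) / e_hat -
    (1 - a) * (y - m0) / (1 - e_hat)
)

# unadjusted difference in means, for comparison
ate_naive <- mean(y[a == 1]) - mean(y[a == 0])

c(aipw = ate_aipw, naive = ate_naive)
```

The unadjusted difference is badly biased because l drives both treatment and outcome; the AIPW estimate is not, because one of its two working models is right.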
DAG hand-outs
The following hand-outs cover DAG conventions, common structures, and specific forms of bias. All PDFs are available for download from the hand-outs folder (Dropbox).
Foundations
| Hand-out | Topic |
|---|---|
| 1a. Local conventions | Conventions for causal diagram construction and interpretation |
| 1b. Directed graph terminology | Core terminology for directed acyclic graphs |
| S1. Graphical key | Visual reference guide for DAG symbols and notation |
| S2. Glossary | Comprehensive glossary of causal inference terminology |
Common applications
| Hand-out | Topic |
|---|---|
| 2. Common causal questions | Frequently encountered causal questions and how to set them up |
| 6. Effect modification | When and how treatment effects vary across subgroups |
| 9. External validity | Generalising causal findings across populations and contexts |
Time series and confounding
| Hand-out | Topic |
|---|---|
| 3. Time series approaches | How longitudinal data help address confounding bias |
| 4. Three-wave panel methods | Using three-wave panel data for causal inference |
| 5. Time series limitations | When time series approaches may not resolve confounding |
| S3. Time-resolved confounding | Advanced approaches to time-varying confounding |
Advanced topics
| Hand-out | Topic |
|---|---|
| 7. Selection bias | Selection bias in longitudinal studies |
| 8. Measurement error | Structural approaches to representing and addressing measurement error |
| 10. Experimental design | How experiments address confounding and selection bias |
Supplementary materials
| Hand-out | Topic |
|---|---|
| S5. Timing examples | Practical examples of confounding and timing issues |
| S6. Detailed panel examples | What can go wrong in a three-wave panel |
| S7. Cross-sectional approaches | When to report multiple DAGs in cross-sectional studies |
| S8. Bias correction | Quantitative approaches to bias correction |
| S9. Mediator bias | Confounding bias in mediation analysis |
| S10. Misclassification bias | Examples of misclassification bias and bias towards the null |
Accessing hand-outs
All PDFs are in the hand-outs folder on Dropbox. File names match the numbering in the tables above (e.g., 1a-terminologylocalconventions.pdf, S2-glossary.pdf).
Causal Diagrams with ggdag
This tutorial introduces the ggdag package for drawing and analysing
directed acyclic graphs (DAGs). DAGs encode causal assumptions and help
identify which variables to condition on (and which to leave alone) when
estimating causal effects.
How to use this page
This is a reference resource, not a graded lab. Work through the examples at your own pace and return when you need to draw or analyse a DAG for your research report.
Load libraries
# data wrangling
library(tidyverse)
# graphing
library(ggplot2)
# automated causal diagrams
library(ggdag)
The fork: omitted variable bias
Let's use ggdag to identify confounding arising from omitting L in our
regression of Y on A.
First we write out the DAG:
# code for creating a DAG
graph_fork <- dagify(Y ~ L,
A ~ L,
exposure = "A",
outcome = "Y") |>
tidy_dagitty(layout = "tree")
# plot the DAG
graph_fork |>
ggdag() + theme_dag_blank() + labs(title = "L is a common cause of A and Y")
Next we ask ggdag which variables we need to include for an unbiased
estimate:
ggdag::ggdag_adjustment_set(graph_fork) +
theme_dag_blank() +
labs(title = "{L} is the exclusive member of the confounder set for A and Y")
The causal graph tells us that, to obtain an unbiased estimate of the effect of A on Y, we must condition on L.
When we include the omitted variable L in our simulated dataset, it breaks the spurious association between A and Y:
set.seed(123)
N <- 1000
L <- rnorm(N)
A <- rnorm(N, L)
Y <- rnorm(N, L)
# without control: A appears associated with Y
fit_fork <- lm(Y ~ A)
parameters::model_parameters(fit_fork)
# with control: association vanishes
fit_fork_controlled <- lm(Y ~ A + L)
parameters::model_parameters(fit_fork_controlled)
Mediation and the pipe
Suppose we are interested in the causal effect of A on Y, where the effect operates through a mediator M.
graph_mediation <- dagify(Y ~ M,
M ~ A,
exposure = "A",
outcome = "Y") |>
ggdag::tidy_dagitty(layout = "tree")
graph_mediation |>
ggdag() +
theme_dag_blank() +
labs(title = "Mediation Graph")
What should we condition on?
ggdag::ggdag_adjustment_set(graph_mediation)
"Backdoor Paths Unconditionally Closed" means we may obtain an unbiased estimate of A on Y without including additional variables, assuming our DAG is correct. There is no "backdoor path" from A to Y that would bias our estimate.
Two variables are d-connected if information flows between them (conditional on the graph), and d-separated if they are conditionally independent.
ggdag::ggdag_dconnected(graph_mediation)
Here d-connection is desirable because it means we can estimate A's effect on Y.
Simulation: pipe confounding
Suppose we want to know whether a ritual action condition (A) influences charity (Y), and the effect operates entirely through perceived social cohesion (M):
$A \to M \to Y$
set.seed(123)
N <- 100
c0 <- rnorm(N, 10, 2)
ritual <- rep(0:1, each = N / 2)
cohesion <- ritual * rnorm(N, .5, .2)
c1 <- c0 + ritual * cohesion
d <- data.frame(c0 = c0, c1 = c1, ritual = ritual, cohesion = cohesion)
# total effect of ritual on charity
parameters::model_parameters(lm(c1 ~ c0 + ritual, data = d))
# conditioning on the mediator blocks the causal path
parameters::model_parameters(lm(c1 ~ c0 + ritual + cohesion, data = d))
The direct effect of ritual drops out when we include cohesion. Once the model knows M, it gets no new information from A. Conditioning on a post-treatment variable creates pipe confounding.
Masked relationships
When two correlated variables have opposing effects on the outcome, their individual effects can be "masked" in simple regressions.
dag_m1 <- dagify(K ~ C + R,
R ~ C,
exposure = "C",
outcome = "K") |>
tidy_dagitty(layout = "tree")
dag_m1 |> ggdag()
set.seed(123)
n <- 100
C <- rnorm(n)
R <- rnorm(n, C)
K <- rnorm(n, R - C)
d_sim <- data.frame(K = K, R = R, C = C)
# C alone: weak or null
parameters::model_parameters(lm(K ~ C, data = d_sim))
# both: opposing effects "pop"
parameters::model_parameters(lm(K ~ C + R, data = d_sim))
Note that ggdag correctly identifies that you do not need to condition
on R to estimate C's total effect on K:
dag_m1 |> ggdag_adjustment_set()
The total effect of C on K combines the direct path ($C \to K$) and the indirect path ($C \to R \to K$); these work in opposite directions.
Collider confounding
The selection-distortion effect (Berkson's paradox). Imagine there is no relationship between the newsworthiness and trustworthiness of science. Selection committees make decisions on the basis of both:
dag_sd <- dagify(S ~ N,
S ~ T,
labels = c("S" = "Selection",
"N" = "Newsworthy",
"T" = "Trustworthy")) |>
tidy_dagitty(layout = "nicely")
dag_sd |>
ggdag(text = FALSE, use_labels = "label") + theme_dag_blank()
When two arrows enter a variable, conditioning on it opens a path of information between its causes:
ggdag_dseparated(
dag_sd,
from = "T",
to = "N",
controlling_for = "S",
text = FALSE,
use_labels = "label"
) + theme_dag_blank()
# simulation of selection-distortion effect
set.seed(123)
n <- 1000
p <- 0.05
d <- tibble(
newsworthiness = rnorm(n, mean = 0, sd = 1),
trustworthiness = rnorm(n, mean = 0, sd = 1)
) |>
mutate(total_score = newsworthiness + trustworthiness) |>
mutate(selected = total_score >= quantile(total_score, 1 - p))
# correlation among selected proposals
d |>
filter(selected == TRUE) |>
select(newsworthiness, trustworthiness) |>
cor()
Selection induces a spurious correlation. Among selected proposals, newsworthy ones appear less trustworthy, even though the two are independent in the population.
Collider bias in experiments
Conditioning on a post-treatment variable can open a spurious path even when no experimental effect exists:
dag_ex2 <- dagify(
C1 ~ C0 + U,
Ch ~ U + R,
labels = c(
"R" = "Ritual",
"C1" = "Charity-post",
"C0" = "Charity-pre",
"Ch" = "Cohesion",
"U" = "Religiousness (Unmeasured)"
),
exposure = "R",
outcome = "C1",
latent = "U"
) |>
control_for(c("Ch", "C0"))
dag_ex2 |>
ggdag_collider(text = FALSE, use_labels = "label") +
ggtitle("Cohesion is a collider that opens a path from ritual to charity")
Taxonomy of confounding
There are four basic structures:
The fork (omitted variable bias)
confounder_triangle(x = "Coffee", y = "Lung Cancer", z = "Smoking") |>
ggdag_dconnected(text = FALSE, use_labels = "label")
The pipe (fully mediated effects)
mediation_triangle(x = NULL, y = NULL, m = NULL, x_y_associated = FALSE) |>
tidy_dagitty(layout = "nicely") |>
ggdag()
The collider
collider_triangle() |>
ggdag_dseparated(controlling_for = "m")
Confounding by proxy
Controlling for a descendant of a collider introduces collider bias:
dag_sd <- dagify(
Z ~ X,
Z ~ Y,
D ~ Z,
labels = c("Z" = "Collider", "D" = "Descendant", "X" = "X", "Y" = "Y"),
exposure = "X",
outcome = "Y"
) |>
control_for("D")
dag_sd |>
ggdag_dseparated(
from = "X", to = "Y",
controlling_for = "D",
text = FALSE, use_labels = "label"
) +
ggtitle("D induces collider bias")
Rules for avoiding confounding
From Statistical Rethinking (p. 286):
- List all paths connecting the exposure and outcome
- Classify each path as open or closed (a path is open unless it contains a collider)
- Classify each path as backdoor or not (a backdoor path has an arrow entering the exposure)
- If there are open backdoor paths, decide which variable(s) to condition on to close them
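As a minimal sketch of these steps, assuming the dagitty package (which ggdag builds on) is installed, the code below applies them to a hypothetical three-node fork. The DAG and the node names `X`, `Y`, and `Z` are illustrative and not taken from the text.

```r
library(dagitty)

# Hypothetical DAG: Z is a common cause of exposure X and outcome Y
g <- dagitty("dag { X -> Y ; Z -> X ; Z -> Y }")

# List every path from X to Y and record whether each is open
p <- paths(g, from = "X", to = "Y")
data.frame(path = p$paths, open = p$open)

# A backdoor path has an arrow entering the exposure, so it starts "X <-"
is_backdoor <- grepl("^X <-", p$paths)

# Ask which variables close the open backdoor paths
adj <- adjustmentSets(g, exposure = "X", outcome = "Y")
adj
```

With larger DAGs the path list grows quickly, so it can help to inspect `p$paths[is_backdoor & p$open]` before deciding what to condition on.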
Selection bias in sampling and longitudinal research
Selection bias arises when the sample is not representative of the target population due to conditioning on a collider or its descendant:
coords_mine <- tibble::tribble(
~name, ~x, ~y,
"glioma", 1, 2,
"hospitalized", 2, 3,
"broken_bone", 3, 2,
"reckless", 4, 1,
"smoking", 5, 2
)
dagify(hospitalized ~ broken_bone + glioma,
broken_bone ~ reckless,
smoking ~ reckless,
labels = c(hospitalized = "Hospitalization",
broken_bone = "Broken Bone",
glioma = "Glioma",
reckless = "Reckless \nBehaviour",
smoking = "Smoking"),
coords = coords_mine) |>
ggdag_dconnected("glioma", "smoking", controlling_for = "hospitalized",
text = FALSE, use_labels = "label", collider_lines = FALSE)
In longitudinal research, retention can act as a descendant of a collider, introducing bias when the sample is conditioned on being retained.
Summary
We control for variables to avoid omitted variable bias. But included
variable bias is also commonplace. It arises from "pipes", "colliders",
and conditioning on descendants of colliders. The ggdag package can
help identify adjustment sets, but the results depend on assumptions
encoded in your DAG that are not part of the data. Clarify your
assumptions.
Suggested Answers: Pair Exercises
These are brief suggested answers for the pair exercises embedded in weekly lectures. They are intended as discussion guides, not definitive solutions. Many exercises are deliberately open-ended.
Week 1
Formulating a contrast
A well-formed causal question might be: "Would replacing two hours of nightly screen time with two hours of reading reduce sleep onset latency (in minutes) over four weeks among 14-to-16-year-olds in Aotearoa New Zealand?" Both sides of the contrast are specified (screen time versus reading), the outcome is defined (sleep onset latency), the time horizon is stated (four weeks), and the target population is named (14-to-16-year-olds in Aotearoa New Zealand). Common critique points for the original claim: "screen time" is vague (passive scrolling? gaming? messaging?), "poor sleep" needs a measurable operationalisation, and "teenagers" lumps heterogeneous age groups.
Three problems in one claim
- Definitional clarity: "religion" could mean attendance, belief, prayer, or community membership. "Mental health" could mean depression, life satisfaction, anxiety, or a composite. Neither side of the contrast is specified.
- Population specificity: the answer may differ between adolescents and older adults, between countries with majority-religion norms and secular societies, or between denominations.
- Unobservability: we cannot observe the same person both practising and not practising religion. The individual causal effect is missing by construction.
A rewrite: "Among adults aged 40-65 in Aotearoa New Zealand, would initiating weekly religious service attendance (versus maintaining no attendance) reduce depressive symptoms (PHQ-9 score) over 12 months?"
Week 2
Naming the structure
- Fork. SES causes both neighbourhood quality and health outcomes: neighbourhood $\leftarrow$ SES $\to$ health. Neighbourhood and health are marginally associated (through SES). Conditioning on SES removes the association.
- Chain. Drug $\to$ inflammation $\to$ pain. Drug and pain are marginally associated (through the mediating path). Conditioning on inflammation blocks the path and removes the association between drug and pain.
- Collider. Genetics $\to$ blood pressure (BP) $\leftarrow$ diet. Genetics and diet are marginally independent (neither causes the other). Conditioning on blood pressure opens a spurious association: among people with the same BP, knowing genetic risk tells you something about diet (to reach the same BP, lower genetic risk must be offset by a worse diet).
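The three patterns can be checked with a small simulation. This is a sketch under assumed linear relationships with unit slopes and standard normal noise; the effect sizes are hypothetical and chosen only to make the contrasts visible.

```r
set.seed(434)
n <- 5000
slope <- function(formula, term) coef(lm(formula))[[term]]

# Fork: neighbourhood <- SES -> health
ses <- rnorm(n)
neighbourhood <- ses + rnorm(n)
health <- ses + rnorm(n)
fork <- c(marginal    = slope(health ~ neighbourhood, "neighbourhood"),
          conditional = slope(health ~ neighbourhood + ses, "neighbourhood"))

# Chain: drug -> inflammation -> pain
drug <- rbinom(n, 1, 0.5)
inflammation <- -1 * drug + rnorm(n)
pain <- inflammation + rnorm(n)
chain <- c(marginal    = slope(pain ~ drug, "drug"),
           conditional = slope(pain ~ drug + inflammation, "drug"))

# Collider: genetics -> bp <- diet
genetics <- rnorm(n)
diet <- rnorm(n)
bp <- genetics + diet + rnorm(n)
collider <- c(marginal    = slope(diet ~ genetics, "genetics"),
              conditional = slope(diet ~ genetics + bp, "genetics"))

round(rbind(fork, chain, collider), 2)
```

For the fork and the chain, the association is present marginally and vanishes after conditioning. For the collider, the pattern reverses: the association is absent marginally and appears only after conditioning on blood pressure.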
Checking assumptions against a causal DAG
In the observational design, parental consent ($L$) is driven by SES ($U$), and $U$ also affects polio risk ($Y$). The backdoor path $A \leftarrow L \leftarrow U \to Y$ is open. Exchangeability fails: $Y(a) \cancel\coprod A$.
In the randomised design, $A$ is assigned by a chance mechanism ($\mathcal{R}$) that is independent of $U$ and $L$. The backdoor path through $L$ and $U$ is severed because $A$ no longer depends on $L$. Exchangeability holds: $Y(a) \coprod A$.
Positivity is more likely to fail in the observational design: some SES strata may have near-universal consent or refusal, leaving no comparison group.
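A small simulation can make the contrast between the two designs concrete. The sketch below uses hypothetical risk and consent probabilities (not historical polio data) and assumes that, in the observational design, every child with consent is vaccinated.

```r
set.seed(434)
n <- 100000

u <- rbinom(n, 1, 0.5)                       # SES (1 = high), unmeasured
l <- rbinom(n, 1, ifelse(u == 1, 0.8, 0.2))  # parental consent depends on SES

# Potential outcomes: risk rises with SES and falls with vaccination
y0 <- rbinom(n, 1, 0.10 + 0.10 * u)  # outcome if unvaccinated
y1 <- rbinom(n, 1, 0.05 + 0.10 * u)  # outcome if vaccinated
true_ate <- mean(y1 - y0)

diff_in_means <- function(a) {
  y <- ifelse(a == 1, y1, y0)  # consistency: observe the outcome under a
  mean(y[a == 1]) - mean(y[a == 0])
}

a_obs <- l                   # observational: treatment follows consent
a_rct <- rbinom(n, 1, 0.5)   # randomised: treatment ignores U and L

c(true_ate = true_ate,
  observational = diff_in_means(a_obs),
  randomised = diff_in_means(a_rct))
```

Under these assumptions the observational contrast is pulled away from the true effect because vaccinated children are disproportionately high-SES, while the randomised contrast recovers it up to sampling error.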
Neurath's ship and your own causal DAG
Answers vary by discipline. The key check is whether the partner can identify a fork (common cause generating spurious association) and a chain (mediating path). The sceptic's challenge should propose either a reversed arrow or a missing common cause that would change the adjustment strategy.
Week 3
Applying the backdoor criterion
Paths from $A$ to $Y$: (1) $A \to M \to Y$ (causal, through mediator); (2) $A \leftarrow L_1 \to Y$ (backdoor through health consciousness); (3) $A \leftarrow L_1 \to L_2 \to Y$ (backdoor through health consciousness and diet).
$\{L_1\}$ satisfies the backdoor criterion: it blocks both backdoor paths (paths 2 and 3) and $L_1$ is not a descendant of $A$. Conditioning on $\{L_1\}$ supports exchangeability.
Adding $M$ violates the criterion because $M$ is a descendant of $A$ (it lies on the causal path $A \to M \to Y$). Conditioning on $M$ blocks part of the total effect we want to estimate.
M-bias in practice
The DAG: $U_1 \to A$ (attendance), $U_2 \to Y$ (giving), $U_1 \to L \leftarrow U_2$ (neighbourhood social capital is a collider of two unmeasured causes). Without conditioning on $L$, the path $A \leftarrow U_1 \to L \leftarrow U_2 \to Y$ is blocked at the collider $L$.
Conditioning on $L$ opens this path, creating a spurious association between $A$ and $Y$ through the unmeasured causes. "Adjust for all pre-treatment variables" fails because $L$ is pre-treatment but is a collider: conditioning on it opens, rather than closes, a biasing path.
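The following sketch simulates this M-structure with no effect of $A$ on $Y$. The linear forms and unit coefficients are assumptions made for illustration; lower-case names stand in for the DAG nodes.

```r
set.seed(434)
n <- 5000

u1 <- rnorm(n)            # unmeasured cause of attendance
u2 <- rnorm(n)            # unmeasured cause of giving
a  <- u1 + rnorm(n)       # attendance (A)
y  <- u2 + rnorm(n)       # giving (Y); A has no effect on Y
l  <- u1 + u2 + rnorm(n)  # neighbourhood social capital (collider L)

unadjusted <- coef(lm(y ~ a))[["a"]]      # path blocked at L
adjusted   <- coef(lm(y ~ a + l))[["a"]]  # conditioning on L opens the path

round(c(unadjusted = unadjusted, adjusted = adjusted), 3)
```

The unadjusted slope is close to zero, which is the truth in this simulation. Adding the pre-treatment variable `l` produces a non-zero slope for `a` even though attendance has no effect on giving.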
$R^2$ versus identification
$R^2$ measures variance explained, a statistical property. Confounding is a structural property of the DAG. A model with high $R^2$ can still be biased if the adjustment set includes a collider (opening a spurious path) or a mediator (blocking part of the causal path).
Example DAG where the larger set introduces bias: if Investigator A's set includes a variable $C$ that is a collider ($A \to C \leftarrow Y$), conditioning on $C$ opens a non-causal path and biases the estimate, despite improving $R^2$. Investigator B's smaller set $\{\text{age}, \text{conscientiousness}\}$ would satisfy the backdoor criterion if age and conscientiousness together block all backdoor paths and neither is a descendant of $A$.
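A short simulation illustrates how fit and bias can move in opposite directions. This is a sketch with an assumed true effect of 0.5 and a hypothetical collider; none of the numbers come from the exercise.

```r
set.seed(434)
n <- 5000

a <- rnorm(n)                 # treatment
y <- 0.5 * a + rnorm(n)       # outcome; true effect of a is 0.5
collider <- a + y + rnorm(n)  # common effect of treatment and outcome

fit_small <- lm(y ~ a)             # no collider in the adjustment set
fit_large <- lm(y ~ a + collider)  # collider included

data.frame(
  model = c("small", "large"),
  estimate_a = c(coef(fit_small)[["a"]], coef(fit_large)[["a"]]),
  r_squared = c(summary(fit_small)$r.squared, summary(fit_large)$r.squared)
)
```

The larger model explains more variance in `y`, yet its coefficient for `a` is far from 0.5 and has the wrong sign. The smaller model has the lower $R^2$ and the unbiased estimate.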
Week 4
Classifying measurement error
- Type 1: independent, uncorrelated. The screen-time noise and the purpose noise do not share a common cause and neither is causally affected by the other variable. This typically attenuates toward the null.
- Type 3: dependent, uncorrelated. The treatment (bilingualism) causally affects how the outcome (cognitive performance) is measured, because the test instrument is language-dependent. The DAG shows $A \to$ measurement error node $\to Y^*$ (recorded outcome), opening a non-causal path from $A$ to $Y^*$.
- Type 2: independent, correlated. The shared translation team creates a common cause of errors in both measures. Neither variable's true value causes the other's measurement error, but the errors co-vary through the shared cause.
Collider bias versus confounding
The DAG: depression ($A$) $\to$ ward admission ($C$) $\leftarrow$ injury severity $\to$ recovery ($Y$). Marginally, $A$ and $Y$ may be independent (or associated only through a causal path). Restricting to admitted patients conditions on $C$, opening the path $A \to C \leftarrow$ injury severity $\to Y$.
This is not confounding. Confounding requires an open backdoor path through a common cause (e.g., $A \leftarrow L \to Y$). Here, the path was blocked before conditioning. Conditioning on the collider $C$ actively opens a previously blocked path. Among admitted patients, less depressed individuals tend to have more severe injuries (otherwise they would not have been admitted), creating a spurious negative association between depression and recovery.
Design fix: analyse all eligible patients regardless of admission status, or use inverse probability weighting to account for selection into the hospital sample.
Auditing a study for two failure modes
Selection bias: university mailing list recruitment acts as a filter. Academic motivation and language confidence jointly affect enrolment, making the analytic sample unrepresentative. If motivation or confidence also relates to bilingualism or cognitive outcomes, conditioning on sample membership distorts the contrast.
Measurement bias: type 3 (dependent, uncorrelated). The treatment (bilingualism) causally affects how the English-only cognitive test measures the outcome. Non-English-dominant bilinguals are systematically mismeasured, and this mismeasurement depends on treatment status.
Week 5
Building a potential outcomes table
The key distinction is between the hidden science and the observed data. In the hidden science, each student has both potential outcomes and an individual effect. In the observed data, one potential outcome and hence $\delta_i$ are missing for every student.
One possible hidden science table is:
| $i$ | $Y_i(1)$ | $Y_i(0)$ | $\delta_i$ |
|---|---|---|---|
| 1 | 0 | 1 | $-1$ |
| 2 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 |
| 4 | 1 | 1 | 0 |
If treatment assignment is $A_1=A_2=1$ and $A_3=A_4=0$, the observed-data table is:
| $i$ | $Y_i(1)$ | $Y_i(0)$ | $\delta_i$ | $A_i$ | $Y_i^{\text{obs}}$ |
|---|---|---|---|---|---|
| 1 | 0 | NA | NA | 1 | 0 |
| 2 | 0 | NA | NA | 1 | 0 |
| 3 | NA | 0 | NA | 0 | 0 |
| 4 | NA | 1 | NA | 0 | 1 |
The true ATE in the hidden science is $(-1 + 0 + 1 + 0)/4 = 0$. The naive observed difference in means is $\bar{Y}_{A=1}^{\mathrm{obs}} - \bar{Y}_{A=0}^{\mathrm{obs}} = 0 - 0.5 = -0.5$. The discrepancy arises because treatment assignment is not random with respect to the potential outcomes: students 1 and 2, who received $A=1$, differ in their hidden counterfactual outcomes from students 3 and 4, who received $A=0$. Exchangeability does not hold.
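The two tables and both quantities can be reproduced in a few lines of base R. The column names (`y1`, `y0`, `delta`, `y_obs`) are illustrative labels for the symbols in the tables above.

```r
# Hidden science: both potential outcomes for every student
science <- data.frame(
  id = 1:4,
  y1 = c(0, 0, 1, 1),
  y0 = c(1, 0, 0, 1)
)
science$delta <- science$y1 - science$y0
true_ate <- mean(science$delta)

# Observed data: assignment reveals only one potential outcome per student
a <- c(1, 1, 0, 0)
observed <- data.frame(
  id = science$id,
  y1 = ifelse(a == 1, science$y1, NA),
  y0 = ifelse(a == 0, science$y0, NA),
  a = a,
  y_obs = ifelse(a == 1, science$y1, science$y0)
)

naive_diff <- mean(observed$y_obs[a == 1]) - mean(observed$y_obs[a == 0])
c(true_ate = true_ate, naive_diff = naive_diff)
```

Changing the assignment vector `a` shows how the naive difference depends on who happens to be treated, while `true_ate` stays fixed.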
Tracing the identification logic
The claim "students who chose the mindfulness app had lower anxiety, therefore the app works" compares $\mathbb{E}[Y \mid A=1]$ with $\mathbb{E}[Y \mid A=0]$ and interprets the difference causally.
Consistency is questionable if "used the app" pools different versions of treatment under one label: different apps, different session lengths, different start dates, or irregular adherence. Multiple versions undermine the link between $A_i = 1$ and a well-defined $Y_i(1)$.
Exchangeability is the assumption most plausibly violated: students who chose the app may have differed from non-users in baseline anxiety, motivation, help-seeking, or available time. The treated group may therefore have had different counterfactual outcomes even without the app, so $Y(0) \cancel\coprod A$.
Positivity may also fail: in some covariate strata, such as students with very high workload or students already receiving intensive therapy, almost no one may choose one side of the contrast, leaving no meaningful comparison group in those strata.
Designing a target trial
Causal estimand: the average difference in anxiety symptoms (e.g., GAD-7 score) at 6 months if all university students practised 20 minutes of daily meditation versus if all maintained their current routine (no meditation).
Time zero: the date of programme enrolment (or randomisation in the target trial).
Two baseline covariates with causal rationale: (1) baseline anxiety (GAD-7 at enrolment), because prior anxiety affects both the decision to meditate and the outcome; (2) academic workload (full-time vs part-time enrolment), because workload affects adherence to meditation and anxiety levels.
Positivity failure: students with severe clinical anxiety may be referred to treatment rather than a meditation programme, so the stratum "severe baseline anxiety" may contain no one in the meditation arm.
Week 6
Interaction versus effect modification
The causal estimand for interaction requires four potential outcomes: $Y(a=1,g=\text{young})$, $Y(a=1,g=\text{old})$, $Y(a=0,g=\text{young})$, $Y(a=0,g=\text{old})$. This is conceptually odd because we cannot intervene on age.
The causal estimand for effect modification involves one intervention (exercise) with subgroup contrasts: $\mathbb{E}[Y(1)-Y(0) \mid G=\text{young}]$ versus $\mathbb{E}[Y(1)-Y(0) \mid G=\text{old}]$. This is the design that matches the study.
The regression interaction term could be non-zero without causal modification if, for example, the linear specification is wrong (the true effect varies non-linearly with a confounder correlated with age), or if age is a collider or descendant of a collider in the DAG.
Why conditioning changes effect modification
Even without a direct $G \to Y$ path, the CATE varies by age because $G$ (age) affects $L$ (fitness), and if the treatment effect varies with $L$, then the distribution of $L$ within age strata determines the subgroup average. Different age groups have different fitness distributions, so $\tau(g)$ differs.
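A simulation sketch of this mechanism follows. It assumes a randomised treatment, an individual effect equal to fitness, and a hypothetical one-unit fitness gap between age groups; age never enters the outcome equation directly.

```r
set.seed(434)
n <- 20000

g <- rbinom(n, 1, 0.5)           # age group: 0 = young, 1 = old
l <- 1 - g + rnorm(n, sd = 0.5)  # fitness depends on age (G -> L)
a <- rbinom(n, 1, 0.5)           # randomised exercise programme
y <- a * l + rnorm(n)            # effect of a equals l; g is absent

# Difference in means within an age stratum
cate <- function(in_stratum) {
  mean(y[in_stratum & a == 1]) - mean(y[in_stratum & a == 0])
}

cate_young <- cate(g == 0)
cate_old <- cate(g == 1)
round(c(young = cate_young, old = cate_old), 2)
```

The subgroup averages differ only because the two age groups have different fitness distributions, which is the sense in which $G$ modifies the effect without acting on $Y$ directly.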
The colleague's null interaction conclusion is premature: the regression test depends on the conditioning set and the functional form. A non-significant $A \times G$ term in a linear model does not rule out effect modification visible with a richer specification or different conditioning set.
Two apparent modifiers could vanish together if both $G_1$ and $G_2$ are proxies for the same underlying variable $L$. Each captures part of the variation in $L$; conditioning on both accounts for $L$ fully, and the residual variation in treatment effect disappears.
From average to subgroup
If 60% of participants have CATE = 8 and 40% have CATE = $-2$, then ATE = $0.6 \times 8 + 0.4 \times (-2) = 4.8 - 0.8 = 4.0$ mmHg. Adjusting the proportions to 50% with CATE = 8 and 50% with CATE = $-2$ gives ATE = $0.5 \times 8 + 0.5 \times (-2) = 4 - 1 = 3$ mmHg. A policy-maker who looks only at the ATE misses that 50% of participants are harmed (their blood pressure increases by 2 mmHg).
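The weighted averages above can be checked with a small helper. The function name `ate_from_cates` is hypothetical and written only for this illustration.

```r
# ATE as the share-weighted average of subgroup CATEs
ate_from_cates <- function(share, cate) {
  stopifnot(isTRUE(all.equal(sum(share), 1)))  # shares must sum to one
  sum(share * cate)
}

ate_60_40 <- ate_from_cates(share = c(0.6, 0.4), cate = c(8, -2))
ate_50_50 <- ate_from_cates(share = c(0.5, 0.5), cate = c(8, -2))

c(ate_60_40 = ate_60_40, ate_50_50 = ate_50_50)
```

The same positive ATE can arise from many different mixtures of helped and harmed subgroups, so the average alone does not reveal who is harmed.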
The claim "$\hat{\tau}(X_i) = 8$ means the programme will reduce my blood pressure by 8" confuses an estimated subgroup average with an individual effect. $\hat{\tau}(X_i) = 8$ estimates the average effect for everyone sharing person $i$'s measured profile. Person $i$'s true individual effect $Y_i(1) - Y_i(0)$ is unobservable and could be larger, smaller, or opposite in sign.
Week 8
From tree to forest to causal forest
A single regression tree is interpretable (you can read the decision path), but unstable: small changes in the data shift splits and predictions (high variance).
Averaging many trees (a forest) reduces variance. Each tree's idiosyncratic splits cancel out, producing smoother, more reliable predictions.
Two differences in a causal forest: (a) the target is $\tau(x) = \mathbb{E}[Y(1)-Y(0) \mid X=x]$, a treatment contrast, not a prediction of $Y$; (b) honest splitting uses one subsample to choose splits and a separate subsample to estimate contrasts within leaves. Honest splitting is necessary because treatment contrasts require estimating quantities under two exposures, only one of which is observed per person. Using the same data for splitting and estimation would overfit to noise in the individual-level contrasts.
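As a minimal sketch, assuming the grf package is installed, the code below fits a causal forest with honest splitting to simulated data. The data-generating process (a treatment effect of 2 when the first covariate is positive, 0 otherwise) is hypothetical.

```r
library(grf)

set.seed(434)
n <- 2000
X <- matrix(rnorm(n * 3), n, 3)
W <- rbinom(n, 1, 0.5)           # randomised binary treatment
tau <- ifelse(X[, 1] > 0, 2, 0)  # true effect varies with X1
Y <- X[, 2] + tau * W + rnorm(n)

# honesty = TRUE: one subsample chooses splits, another estimates leaf contrasts
cf <- causal_forest(X, Y, W, num.trees = 500, honesty = TRUE, seed = 434)

tau_hat <- predict(cf)$predictions   # out-of-bag estimates of tau(x)
ate <- average_treatment_effect(cf)  # named vector: estimate, std.err
ate
```

Here the target is the contrast `tau`, not the level of `Y`: the forest is judged by how well `tau_hat` tracks the treatment effect, which is never observed for any single unit.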
Reading a TOC curve
A steep initial rise means treatment gains are heavily concentrated among the top-ranked individuals. The programme helps some people a lot but most people only a little.
Large AUTOC but small Qini at $q=0.3$ means that gains concentrate in a very narrow top slice (perhaps the top 5-10%), and by the time you expand to 30% coverage, the additional individuals contribute little. For a decision-maker with a 30% budget, the targeting advantage over random allocation is small.
Computing the TOC curve on training data overfits: the forest's rankings are optimised for the training sample, so in-sample evaluation inflates the apparent targeting value. Honest evaluation requires held-out or cross-fitted data.
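One way to keep the evaluation honest is a simple train and held-out split. The sketch below assumes the grf package and simulated data with a hypothetical effect pattern; the rankings come from a forest that never sees the evaluation half.

```r
library(grf)

set.seed(434)
n <- 2000
X <- matrix(rnorm(n * 3), n, 3)
W <- rbinom(n, 1, 0.5)
Y <- X[, 2] + ifelse(X[, 1] > 0, 2, 0) * W + rnorm(n)

train <- sample(n, n / 2)
holdout <- setdiff(seq_len(n), train)

# Rankings come from a forest fitted on the training half only
cf_train <- causal_forest(X[train, ], Y[train], W[train], num.trees = 500)
priority <- predict(cf_train, X[holdout, ])$predictions

# A separate forest on the held-out half supplies the evaluation scores
cf_eval <- causal_forest(X[holdout, ], Y[holdout], W[holdout], num.trees = 500)
autoc <- rank_average_treatment_effect(cf_eval, priority, target = "AUTOC")
qini <- rank_average_treatment_effect(cf_eval, priority, target = "QINI")

c(AUTOC = autoc$estimate, QINI = qini$estimate)
```

Swapping `priority` for predictions made on the same data used to fit the ranking forest would reproduce the in-sample optimism described above.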
Should we target?
The Qini addresses the causal estimand: "does treating the top $q$ fraction (ranked by estimated treatment effect) yield greater total benefit than treating a random $q$ fraction?" It goes beyond the ATE by asking whether benefits are concentrated enough to justify selective allocation.
Two non-statistical reasons not to target: (1) logistical feasibility (screening and scoring may cost more than universal provision); (2) stigma or fairness concerns (singling out individuals with high loneliness scores may be perceived as labelling).
Response to the stigma concern: "The concern about stigmatisation is legitimate and must shape how targeting is implemented. The evidence shows that some students benefit substantially more than others, but it does not mandate that selection criteria be disclosed or that participation be compulsory. A self-referral design using the targeting criteria as capacity planning could capture most of the benefit without labelling individuals."
Week 9
Reading a simple policy rule
The supplied tree first splits on deprivation index, then splits the high-deprivation branch on baseline loneliness. The high-response leaf is "high deprivation, high loneliness".
Plain-language summary: the strongest estimated response is among residents in high-deprivation areas who also report high baseline loneliness.
If roughly 40% of the population is high-deprivation and half of that group is high-loneliness, the high-response region contains approximately:
$$ 0.40 \times 0.50 = 0.20 $$
That is about 20% of residents.
This does not mean the analysis must treat exactly 20%. In our lab workflow, the policy tree helps describe where estimated responses are strongest, and the report should give the expected mean difference for that region. The size of the region is still useful because it tells readers whether the finding describes a small niche group or a sizeable part of the eligible population.
Depth-1 versus depth-2
The depth-1 rule can be stated as: treat residents above the log-income threshold and do not treat residents at or below it. The depth-2 rule can be stated as: first split on openness; for those below the first threshold, split again on openness, and for those above it, split on neuroticism; treat only the leaves labelled "treat."
A lift of $0.028$ standard-deviation units is small for any one person. Across 10,000 eligible people, however, it is an average improvement applied many times. If the rule is implemented correctly and the estimate transports, the population-level gain could be meaningful even though the individual-level gain sounds modest.
Deploy the depth-1 rule when implementation must be simple, when staff will apply the rule under time pressure, or when the depth-2 splits are unstable across resamples. Deploy the depth-2 rule when the intervention is high-stakes, the extra gain is meaningful, the confidence interval supports the improvement, and the implementing organisation can apply the rule reliably.
The extra evidence that would support depth-2 is a held-out policy-value estimate whose uncertainty clearly favours depth-2, stable splits across resamples, and an equity audit showing that the added split does not worsen disparities.
Equity audit
Plain-language rule: offer the programme to high-deprivation residents under 40.
A deprivation split can affect social groups differently because deprivation is correlated with many background conditions: income, housing, neighbourhood resources, family structure, age, health, and sometimes ethnicity. A rule that targets high deprivation may therefore produce uneven treatment shares even when group membership is not an explicit splitter. This does not make deprivation the causal root of the heterogeneity. Deprivation may be the strongest measured marker of deeper causes, including ethnic injustice or other upstream social processes.
The first empirical check is a cross-tabulation of assigned action by relevant social groups, ideally with uncertainty intervals for the treated share in each group. The analyst should also check the distribution of need and estimated benefit across those groups, because equal treatment shares can still hide unequal need.
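A first-pass version of this check can be written in base R. The data below are fabricated for illustration only (two hypothetical groups of 100 residents each), and the exact binomial interval is one reasonable choice among several.

```r
# Hypothetical audit data: assigned action (1 = treat) by social group
audit <- data.frame(
  group = rep(c("group_1", "group_2"), each = 100),
  treat = c(rep(c(1, 0), times = c(30, 70)),
            rep(c(1, 0), times = c(10, 90)))
)

# Cross-tabulation of assigned action by group
xtab <- table(audit$group, audit$treat, dnn = c("group", "treat"))
xtab

# Treated share with an exact 95% binomial interval for each group
treated_share <- function(x) {
  ci <- binom.test(sum(x), length(x))$conf.int
  c(share = mean(x), lower = ci[1], upper = ci[2])
}
shares <- t(sapply(split(audit$treat, audit$group), treated_share))
round(shares, 3)
```

The same pattern extends to need and estimated benefit: replace `treat` with the relevant column and compare its distribution across groups.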
Applying governance checks: (1) "Who gains and who loses?" The high-deprivation-under-40 group gains access; everyone else is excluded. If the excluded group includes high-deprivation people over 40 who also benefit, the rule creates an age-based inequity within disadvantaged communities. (2) "Can affected communities understand and contest the rule?" A depth-2 tree is transparent enough to explain, but communities need a mechanism to challenge the split variables and thresholds.
The model cannot decide whether maximising expected benefit, equal access, need, individual choice, fiscal constraint, or another principle should govern the allocation. That judgement belongs to democratic and institutional decision-making. The analyst can clarify the consequences of different choices and report whether the rule behaves as advertised.
"The algorithm is objective because it only uses data" is too quick. The algorithm optimises a chosen objective function on data produced by social institutions and past decisions. Mechanical consistency in computation does not settle whether the rule is fair, legitimate, or publicly acceptable.
Policy tree versus ranking
(a) Estimated policy value: Strategy A (pure ranking) typically achieves equal or slightly higher policy value because it uses the full granularity of $\hat{\tau}(X_i)$. Strategy B loses some value by collapsing to a few leaves.
(b) Explainability: Strategy B is far more explainable. A depth-2 tree is a short set of if-then rules. Strategy A requires explaining a continuous score derived from thousands of overlapping tree splits.
(c) Ability to answer "why was I selected?": Strategy B can give a clear answer ("because your deprivation index is above 8 and your loneliness score is above the median"). Strategy A can only say "because your estimated benefit score was in the top 20%" (assuming a 20% budget was imposed), which is opaque.
Strategy A is preferable when the decision is internal (e.g., a research
team allocating limited follow-up resources), a fixed treatment share
must be met exactly, and public justification is less central. Strategy
B is preferable when the rule must be defended publicly, contested by
affected communities, or implemented by non-technical staff. Under the
default policytree workflow, the percentage treated is an output of
the fitted rule, not a fixed input.
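The difference between a fixed share and an output share can be seen in a toy comparison. In this sketch the scores `tau_hat` are simulated stand-ins for forest estimates, and the Strategy B rule is hand-written from the thresholds quoted in (c) rather than fitted with policytree.

```r
set.seed(434)
n <- 1000
deprivation <- sample(1:10, n, replace = TRUE)
loneliness <- rnorm(n)

# Hypothetical estimated benefit scores standing in for tau-hat
tau_hat <- 0.05 * deprivation + 0.2 * loneliness + rnorm(n, sd = 0.1)

# Strategy A: treat the top 20% by score; the share is fixed in advance
budget <- 0.20
treat_a <- rank(-tau_hat, ties.method = "first") <= budget * n

# Strategy B: a two-split rule; the treated share is whatever the rule yields
treat_b <- deprivation > 8 & loneliness > median(loneliness)

c(share_a = mean(treat_a), share_b = mean(treat_b))
```

Strategy A hits the budget exactly by construction. Strategy B's share depends on how many people fall in the selected leaf, so it has to be reported rather than assumed.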
Both strategies must answer the same equity question: who is excluded under the rule, and does that exclusion worsen disparities for protected or structurally disadvantaged groups?
Week 10
Interpreting invariance results
Configural invariance means the same items load on the same factors in both groups (same pattern of zero and non-zero loadings). Metric invariance means the factor loadings are equal across groups (a one-unit increase in the latent variable produces the same change in item responses). Scalar/threshold invariance means the intercepts (or thresholds for ordinal items) are equal, so the same latent level produces the same expected response.
Failing scalar/threshold invariance means that even at the same latent distress level, the two groups endorse "felt hopeless" and "felt worthless" differently. A one-unit difference in total score does not correspond to the same latent difference across groups. Group mean comparisons on the total score therefore confound true latent differences with measurement artefact.
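A simulation sketch shows how intercept differences alone can generate a gap in total scores. It treats items as continuous for simplicity (the K6 items are ordinal), and the loadings, noise, and the 0.5 intercept shift on two items are assumptions chosen for illustration.

```r
set.seed(434)
n <- 20000  # per group
group <- rep(c("A", "B"), each = n)
eta <- rnorm(2 * n)  # latent distress: same distribution in both groups

# Six items with equal loadings; group B under-endorses items 5 and 6
# by 0.5 at any given latent level (intercept non-invariance)
intercept_shift <- c(0, 0, 0, 0, -0.5, -0.5)
items <- sapply(1:6, function(j) {
  0.8 * eta + ifelse(group == "B", intercept_shift[j], 0) + rnorm(2 * n, sd = 0.6)
})
total <- rowSums(items)

latent_gap <- mean(eta[group == "A"]) - mean(eta[group == "B"])
observed_gap <- mean(total[group == "A"]) - mean(total[group == "B"])

round(c(latent_gap = latent_gap, observed_gap = observed_gap), 2)
```

The groups do not differ in latent distress, yet the total scores differ by about the sum of the intercept shifts. A comparison of total-score means would report this artefact as a group difference.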
Hypothesis for differential functioning: cultural norms about expressing hopelessness or worthlessness may differ. In some cultural contexts, endorsing "felt worthless" may carry greater stigma, leading to systematically lower endorsement at the same latent distress level. Alternatively, translation may anchor response categories differently.
Fit is not identification
Good fit means the model reproduces the observed covariance matrix. It does not establish causal direction. Multiple causal structures can generate the same covariance pattern.
Reflective DAG: $\eta \to X_1, \eta \to X_2, \ldots, \eta \to X_6$. The latent variable causes the indicators. Formative DAG: $X_1 \to \eta, X_2 \to \eta, \ldots, X_6 \to \eta$. The indicators cause the composite. Both can produce identical fit statistics for a single-factor solution.
The choice matters for downstream causal inference. If the construct is reflective and we use it as a confounder, we assume the latent variable is the true common cause. If the construct is actually formative (a composite of independent causes), conditioning on the composite may not block the backdoor paths we intend to close, because each component may have a different causal relationship with treatment and outcome.
Measurement as an identification problem
Scalar/threshold non-invariance means the same response pattern corresponds to different latent levels across groups. Put differently, the mapping from the latent outcome $Y^*$ to the measured outcome $Y$ depends on group membership. This matters for CATE because CATE is defined by a group contrast. If measurement differs by group, the estimated heterogeneity can be measurement artefact. This can happen even when exchangeability and positivity hold for $Y^*$, because the analysis uses $Y$, not $Y^*$.
Example intuition: population A is distressed and population B is not. In A, "worthlessness" may be caused by unemployment. In B, it may be rare and have different causes. The same K6 item can have different causal parents across groups. The factor structure and item means can therefore differ without any change in "true distress".
Counter to "validated in hundreds of studies": most validation evidence is about internal consistency, short-term stability, and associations with other variables. Those are associational properties. They do not establish that the items have the same meaning, the same causes, or the same measurement function across the particular groups you want to compare. A scale can be reliable within each group and still be non-comparable across groups.
Proposed workflow step (between DAG and estimation): write down a measurement submodel as causal assumptions. State whether your estimand is the effect on reported K6 ($Y$) or the effect on the underlying state ($Y^*$). Then draw a measurement DAG that makes explicit what causes item responses in each group, including stigma, translation, and response norms. Decide what design or data would support those assumptions. If you choose to run measurement invariance tests, treat them as descriptive stress tests of a specific reflective model, not as evidence that measurement is causally comparable. If the stress test fails, the honest conclusion is that your group comparison is not identified without stronger measurement assumptions or better data.