Welcome to PSYC 434

Conducting Research Across Cultures | Trimester 1, 2026

Prof Joseph Bulbulia | Victoria University of Wellington

Assessments

Assessment	CLOs	Weight	Due
Lab diaries (8 × 1.25%)	1, 2, 3	10%	Weekly (satisfactory/not)
In-class test 1	2	20%	22 April (w7)
In-class test 2	2, 3	20%	20 May (w11)
In-class presentation	1, 2, 3	10%	27 May (w12)
Research report (Option A or B)	1, 2, 3	40%	30 May (Fri)

Accessing Lectures and Readings

Seminar: Wednesdays, 14:10–17:00, Easterfield Building EA120
Schedule: see the Schedule page for topics, readings, and assignments
Lectures: weekly content pages contain slides, recordings, and lab materials
Tests: in the same room as the seminar (bring a pen/pencil, no devices)

Contact


Course coordinator	Prof Joseph Bulbulia, joseph.bulbulia@vuw.ac.nz
Office	EA324
R help	Boyang Cao, caoboya@myvuw.ac.nz

Course Description

From the VUW course catalogue:

This course focuses on theoretical and practical challenges for conducting research involving individuals from more than one cultural background or ethnicity. Topics include defining and measuring culture; developing culture-sensitive studies; choice of language and translation; communication styles and bias; questionnaire and interview design; qualitative and quantitative data analysis for cultural and cross-cultural research; minorities, power, and ethics in cross-cultural research; and ethno-methodologies and indigenous research methodologies. Appropriate background for this course: PSYC 338.

Course Learning Objectives

In this advanced course, students develop foundational skills in cross-cultural psychological research with a strong emphasis on causal inference.

Programming in R. Students will learn the basics of programming in the statistical language R, gaining essential computational tools for psychological research. These skills lay the foundation for applying data analysis techniques in a causal inference framework and beyond. This year we will try something different, introducing you to coding with LLM agents.
Understanding causal inference. Students will develop a robust understanding of causal inference concepts and approaches, with particular emphasis on how they mitigate common pitfalls in cross-cultural research. We focus, first, on how to ask questions in comparative psychology, and (only then), how to answer them. This takes to the design of studies, analysing data, and drawing appropriately confident conclusions about cause-and-effect relationships in comparative research.
Understanding measurement in comparative settings. Quite a lot of this course is devoted to concepts of measurement in psychological research. Although we will cover classical techniques for constructing and validating psychometrically sound measures across cultures, we will also clarify why statistical tests alone are insufficient to ensure we are measuring as we hope.

Schedule

Weekly Schedule (2026)

Week	Date (Wed)	Content	Lab	Main Readings
w1	25 Feb	How to ask a question in psychological science?	Git and GitHub	—
w2	4 Mar	Causal diagrams: elementary structures	Install R and RStudio	H&R Ch 1–2
w3	11 Mar	Causal diagrams: confounding bias	Regression, graphing, and simulation	H&R Ch 3, 6
w4	18 Mar	Interaction, measurement bias, selection bias	Regression and confounding bias	H&R Ch 5, 7–8
w5	25 Mar	Causal inference: average treatment effects	Average treatment effects	H&R Ch 1–3 (review)
w6	2 Apr	Effect modification / CATE	CATE and effect modification	H&R Ch 4
—	8 Apr	Mid-trimester break	—	—
—	15 Apr	Mid-trimester break	—	—
w7	22 Apr	In-class test 1 (20%)	—	—
w8	29 Apr	Heterogeneous treatment effects and machine learning	RATE and QINI curves	—
w9	6 May	Resource allocation and policy trees	Policy trees	—
w10	13 May	Classical measurement theory from a causal perspective	Measurement invariance	H&R Ch 9
w11	20 May	In-class test 2 (20%)	—	—
w12	27 May	Student presentations (10%)	—	—

Labs

Labs run in the final 60–90 minutes of the seminar. Nine labs across weeks 1–6 and 8–10. Your best eight lab diaries count toward the 10% assessment. See Assessments for due dates.

Assessments

Overview

Assessment	CLOs	Weight	Due
Lab diaries (8 × 1.25%)	1, 2, 3	10%	Weekly
In-class test 1	2	20%	22 April (w7)
In-class test 2	2, 3	20%	20 May (w11)
In-class presentation	1, 2, 3	10%	27 May (w12)
Research report	1, 2, 3	40%	30 May (Fri)

Assessment 1: Lab Diaries (10%)

Nine weekly diaries, one per lab (weeks 1–6 and 8–10). There are no labs in week 7 (test 1), week 11 (test 2), or week 12 (presentations). Your best eight diaries count (8 × 1.25%), so you may miss one without penalty. Each diary is graded satisfactory/not satisfactory. You receive full credit for submitting a satisfactory entry. Diaries are due by the end of the lab session.

Diary	Week	Due date
`lab-01.md`	w1	Wed 25 Feb
`lab-02.md`	w2	Wed 4 Mar
`lab-03.md`	w3	Wed 11 Mar
`lab-04.md`	w4	Wed 18 Mar
`lab-05.md`	w5	Wed 25 Mar
`lab-06.md`	w6	Wed 2 Apr
`lab-08.md`	w8	Wed 29 Apr
`lab-09.md`	w9	Wed 6 May
`lab-10.md`	w10	Wed 13 May

What to write

Each diary is a short reflection (~150 words) covering:

What the lab covered and what you did.
A connection to the week's readings or lecture content.
One thing you found useful, surprising, or challenging.

Format

Write each diary as a plain markdown (.md) file named by week number: lab-01.md, lab-02.md, …, lab-10.md (there is no lab-07.md). Use GitHub-flavoured markdown formatting: headings, paragraphs, bold, italics, and lists. Because you push diaries to GitHub, your files will render there automatically. These submissions build your markdown fluency; later in the course you will use Quarto to extend markdown to PDF and Word.

Submission

Push your diary files to the lab diary GitHub repository created in Lab 1. The commit timestamp is your submission record.

Markdown example

Here is a minimal diary entry showing basic markdown formatting:

# Lab 01: Introduction to R

This week we installed R and RStudio, then ran our first script.
The exercise connected to the lecture on **causal questions** by
showing how we structure data for analysis.

I found the following steps useful:

- Creating an RStudio project
- Writing a short R script
- Pushing changes to GitHub

Assessment 2: In-Class Test 1 (20%) — 22 April

Covers material from weeks 1–6 (causal diagrams, confounding, ATE, effect modification). Test duration is 50 minutes. The allocated time is 1 hour 50 minutes. Required: pen/pencil. No devices permitted.

Test Location

The test is in class. Come to the seminar room (EA120) with a writing instrument.

Assessment 3: In-Class Test 2 (20%) — 20 May

Covers material from weeks 8–10 (heterogeneous treatment effects, machine learning, resource allocation, policy trees, and classical measurement theory). Same format and conditions as test 1.

Assessment 4: In-Class Presentation (10%) — 27 May

10-minute presentation summarising the student's study. Assessment criteria: clarity, efficiency, and quality of presentation.

Assessment 5: Research Report (40%) — Due 30 May

You choose your format

Students choose one of two formats for the research report:

Option A: Research Report — quantify an average treatment effect using the NZAVS synthetic dataset.
Option B: Marsden Fund EOI — write a first-round Marsden Fund Expression of Interest using the causal inference framework.

You must declare your choice by submitting the option form on Nuku by Friday 3 April (end of w6). If no declaration is received by this date, Option A is assumed.

Generate your data using the causalworkshop package:

# install (once)
install.packages("remotes")
remotes::install_github("go-bayes/causalworkshop@v0.2.1")

# generate data
library(causalworkshop)
d <- simulate_nzavs_data(n = 5000, seed = 2026)

Choose one exposure (community_group, religious_service, or volunteer_work) and one outcome (wellbeing, belonging, self_esteem, or life_satisfaction). Lab sessions support you in this assignment. We assume no statistical background.

Late Penalty

Late assignments, and assignments with extensions, may be subject to delays in marking and may not receive comprehensive feedback.

Assignments submitted late without an approved extension will incur a grade penalty of 5% of the total marks available for the assignment per day late (i.e., in 24-hr increments), up to a maximum of 5 days (up to 24 hrs late = −5%; up to 48 hrs late = −10%, etc.).

Assignments submitted more than five days late without an approved extension will not be graded unless exceptional circumstances are accepted by the Course Coordinator.

Option A: Research Report

Quantify the Average Treatment Effect of a specific exposure on a specific outcome using the synthetic NZAVS panel dataset generated by causalworkshop::simulate_nzavs_data(). You choose one of three exposures and one of four outcomes (see the data generation instructions above).

Introduction: 1,500-word limit.
Conclusion: 1,500-word limit.
Methods/Results: concise, no specific word limit.
APA style. Submit as a single PDF with R code appendix.

Assessment Criteria (Option A)

Stating the problem. State your question clearly. Explain its scientific importance. Frame it as a causal question. Identify any subgroup analysis (e.g. cultural group). Explain the causal inference framework for non-specialists. Describe ethics/policy relevance. Confirm data are from the NZAVS simulated dataset using three waves.

Determining the outcome. Define outcome variable $Y$ . Assess multiple outcomes if applicable. Explain how outcomes relate to the question. Address outcome type (binary and rare?) and timing (after exposure).

Determining the exposure. Define exposure variable $A$ . Explain relevance, positivity, consistency, and exchangeability. Specify exposure type (binary or continuous) and whether you contrast static interventions or modified treatment policies. Confirm exposure precedes outcome.

Accounting for confounders. Define baseline confounders $L$ . Justify how they could affect both $A$ and $Y$ . Confirm they are measured before exposure. Include baseline measures of exposure and outcome in confounder set. Assess sufficiency; explain sensitivity analysis (E-values) if needed. Address confounder type (z-scores for continuous; one-hot encoding for categorical with three or more levels).

Drawing a causal diagram. Include both measured and unmeasured confounders. Describe potential measurement error biases. Add time indicators ensuring correct temporal order.

Identifying the estimand. State your causal contrast clearly.

Source and target populations. Explain how your sample relates to your target populations.

Eligibility criteria. State the eligibility criteria for the study.

Sample characteristics. Provide descriptive statistics for baseline demographics. Describe magnitude of exposure change from baseline. Include references for more information about the sample. Make clear the data are simulated.

Missing data. Check for missing data. Describe how you will address the problem (IPCW).

Model approach. Decide between G-computation, IPTW, or doubly-robust estimation. Specify model. Explain how machine learning works. Address outcome specifics (logistic regression for rare binary outcomes; z-scores for continuous). Describe sensitivity analysis (E-values).

Option B: Marsden Fund Expression of Interest (EOI)

Write a first-round Marsden Fund Expression of Interest (EOI) following the RSNZ 2026 guidelines. Your research question must use the causal inference framework taught in this course. Assume an Ecology, Human Behaviour, and Evolution (EHB) panel.

Templates and Guidelines

Download the official 2026 RSNZ templates before you begin:

Formatting: 12-point Times New Roman, single spacing, 2 cm margins. Submit as a single PDF.

Required Sections

Section numbers follow the 2026 RSNZ EOI form.

1a. Research Title (max 25 words). Plain language, no jargon. The title should be accessible to a scientifically literate non-specialist.

1d. Research Summary (max 200 words). State the current state of the field, the aims of your research, the methods you will use, and the expected outcome. This summary must be standalone: assessors outside your discipline will read it.

2. Vision Mātauranga (max 200 words). Describe how the proposed research relates to the four Vision Mātauranga (VM) themes: (i) indigenous innovation, drawing on Māori knowledge, resources, and people; (ii) taiao, achieving environmental sustainability through iwi and hapū relationships with land and sea; (iii) hauora/oranga, improving health and social wellbeing; (iv) mātauranga, exploring indigenous knowledge and its contribution to NZ research. If none of the themes apply, you may state "not applicable" with a considered justification.

3a. Abstract (max 1 page). Cover the following: aims of the research; importance of the research area; novelty, originality, insight, and ambition of the proposed work; potential impact; methodology; and your capacity to deliver.

3b. Benefit Statement (max 400 words, 1 page). Describe the economic, environmental, or health benefit of the research to New Zealand. Explain why NZ is the right place for this research and describe potential impacts for Māori. In a student context the benefit case may be aspirational, but it must be concrete.

3c. References (max 3 pages). Bold your own name. Include article titles and full author lists (up to 12 authors; use "et al." thereafter).

3d. Roles and Resources (max 1 page). Describe the contributions of each team member, the resources required, and any ethical considerations. Use the Roles and Resources form.

Assessment Criteria (Option B)

Research. Quantifiable impact potential through novelty, originality, insight, and ambition. Rigorous methods grounded in prior research. Ability and capacity to deliver.

Benefit. Economic, environmental, or health benefit to New Zealand. Rationale for NZ-based research. In a student context the benefit case may be aspirational but must be concrete.

Vision Mātauranga. Relation to VM themes; where relevant, engagement with Māori. "Not applicable" is acceptable with considered justification.

Causal reasoning (course-specific). Well-defined causal question, clearly stated causal estimand, appropriate identification strategy. This criterion carries substantial weight.

For the full Marsden Fund assessment criteria, see the RSNZ 2026 EOI Guidelines (pdf).

Extensions and Materials

Extensions

Negotiate a new due date by writing (email) before the mid-term test. Every reasonable request will be accepted (e.g. too many assignments falling in the same week, you want another week to complete).

Penalties

Late submissions incur a one full grade-per-week penalty: e.g. if late by one day, B becomes C; one week later, C becomes D. Over-length assignments will be penalised.

Unforeseeable Events

Extensions will require evidence (e.g. medical certificate).

Materials and Equipment

Bring a laptop with R and RStudio installed for data analysis sessions. Contact the instructor if you lack computer access. For in-class tests, bring a writing utensil. Electronic devices are not permitted during tests.

Test 1: Study Sheet

Comprehensive study sheet covering lectures 1–6 (weeks 1–6).

Download the study sheet

Practice Quiz (2024, with Answers)

Last year's in-class quiz covering key concepts and causal diagrams. Includes answers.

Download the quiz

Practice Quiz (2025, Workshop Diagnostic)

A diagnostic quiz from the 2025 SASP Causal Inference Workshop covering causal diagrams, confounding, and identification assumptions.

Download the quiz

Simulation Guide

Simulations are pedagogical tools that let us see what causal inference methods do when we know the truth. In observational research, we never know the true causal effect. In a simulation, we build the data-generating process ourselves, so we can compare each method's estimate against the ground-truth parameter. The four simulations in this course illustrate distinct threats to valid causal inference and distinct strategies for addressing them.

Download the R script

Required R packages

tidyverse, stdReg, gtsummary, clarify, grf

Generalisability and transportability

Connects to Week 4: external validity and selection bias.

This simulation creates two populations that differ in the prevalence of an effect modifier $Z$ . In the sample, $Z = 1$ is rare ( $p = 0.1$ ); in the target population, $Z = 1$ is common ( $p = 0.5$ ). The treatment effect depends on $Z$ : individuals with $Z = 1$ benefit more from treatment. Because the sample under-represents these high-benefit individuals, the naive sample Average Treatment Effect (ATE) underestimates the population ATE.

The simulation fits three models: an unweighted model on the sample, a weighted model on the sample (using inverse-probability-of-sampling weights), and an oracle model on the full population. The regression coefficients are nearly identical across all three models, yet the marginal ATEs differ. This dissociation is the central lesson: model coefficients describe conditional associations, but the ATE is a marginal quantity that depends on the distribution of effect modifiers in the target population. Weighting the sample to match the population distribution of $Z$ recovers the correct ATE.

The script also includes a manual calculation section that shows exactly what stdReg does under the hood: create counterfactual datasets in which everyone receives $A = 0$ and everyone receives $A = 1$ , predict outcomes under each scenario, and take the mean difference. This "g-computation" procedure makes the marginalisation step explicit.

Key takeaway

Regression coefficients can be correct and yet the ATE can still be wrong for the target population. External validity requires that the distribution of effect modifiers in the sample matches the target, or that we adjust for the mismatch.

Cross-sectional data ambiguity

Connects to Week 3: confounding versus mediation.

This simulation generates data in which $A$ causes $L$ , and $L$ causes $Y$ . The variable $L$ is therefore a mediator, not a confounder. Two models are fit: one that conditions on $L$ (treating it as a confounder) and one that omits $L$ (treating it as a mediator). The model that conditions on $L$ returns a near-zero estimate for the effect of $A$ on $Y$ because it blocks the very path through which $A$ operates. The model that omits $L$ correctly recovers the total effect.

The crux of the problem is that with cross-sectional data alone, the investigator cannot distinguish a confounder from a mediator. Both the fork $A \leftarrow L \to Y$ and the chain $A \to L \to Y$ produce the same observed association between $A$ , $L$ , and $Y$ . The correct modelling decision depends on the assumed causal structure, which the data themselves do not reveal.

Warning

Good model fit does not resolve this ambiguity. A model that conditions on a mediator can fit the data well while returning a biased causal estimate. Model fit is a statistical property; confounding is a structural (causal) property.

Confounding control strategies

Connects to Weeks 3–4: conditioning choices and the backdoor criterion.

This simulation builds a three-wave panel structure with a baseline covariate $L_{0}$ , a prior outcome $Y_{0}$ , a prior exposure $A_{0}$ , an unmeasured confounder $U$ , a treatment $A_{1}$ , and an outcome $Y_{2}$ . The true treatment effect is $δ_{A_{1}} = 0.3$ , and the outcome also depends on $Y_{0}$ , $A_{0}$ , $L_{0}$ , their interactions, and $U$ .

Three models are compared. The "no control" model regresses $Y_{2}$ on $A_{1}$ alone and overestimates the effect because it leaves all confounding paths open. The "standard" model adds $L_{0}$ but still omits $Y_{0}$ and $A_{0}$ , leaving residual confounding. The "interaction" model conditions on $L_{0}$ , $A_{0}$ , $Y_{0}$ , and their interactions with $A_{1}$ , recovering an estimate close to the true value.

The simulation uses the clarify package to obtain simulation-based confidence intervals for each ATE. The progressive improvement from no control to standard to interaction control illustrates that closing more backdoor paths moves the estimate closer to the truth, but only conditioning on the right set of variables eliminates confounding entirely.

Key takeaway

In a three-wave panel, conditioning on the prior exposure, prior outcome, and baseline covariates (along with their interactions) is ordinarily necessary to satisfy the backdoor criterion. Omitting any of these leaves residual confounding.

Causal forest estimation

Connects to Week 8: machine learning for heterogeneous treatment effects.

This simulation uses the same data-generating process as the confounding control simulation, then fits a causal forest (from the grf package) with $L_{0}$ , $A_{0}$ , and $Y_{0}$ as covariates. The causal forest is a non-parametric method that estimates individual-level treatment effects $\overset{τ}{^} (x)$ by partitioning the covariate space adaptively. Unlike the parametric models above, the causal forest does not require the investigator to specify interaction terms; it discovers them from the data.

The simulation reports the causal forest's ATE estimate and its standard error. Comparing this estimate to the parametric interaction model illustrates two points. First, the causal forest can recover the ATE without requiring the analyst to guess the correct functional form. Second, the causal forest provides a standard error that accounts for the adaptive splitting, making valid inference possible even in the non-parametric setting.

Key takeaway

Causal forests automate the discovery of heterogeneous treatment effects but still require the investigator to supply the correct set of confounders. Machine learning solves the functional-form problem, not the identification problem.

Running the simulations

To run the full script, open R and execute:

source("scripts/simulations.R")

Each section prints its results to the console. You can also run individual sections by selecting and executing the relevant code blocks. The script sets random seeds so that results are reproducible.

The Causal Workflow: Nine Steps

This page summarises a nine-step framework for conducting causal inference with observational data. Each step addresses a potential threat to valid causal interpretation. The framework draws on Hernan and Robins (What If, 2024), VanderWeele (2022), and Bulbulia (2024).

How to use this page

This is a reference resource. Use it as a checklist when planning your research report. Each step links back to the lecture where it was introduced.

Step 1: State a well-defined treatment

Specify the hypothetical intervention precisely enough that every member of the target population could, in principle, receive it. "Weight loss" is too vague because people lose weight via exercise, diet, depression, surgery, and more. A clearer intervention is: "engage in vigorous physical activity for at least 30 minutes per day."

Precision here underwrites the causal consistency assumption (Step 5). If the treatment is vaguely defined, different people effectively receive different interventions, and the potential outcome $Y (a)$ is not well-defined.

See Week 5 for the formal definition of causal consistency.

Step 2: State a well-defined outcome

Define the outcome so the causal contrast is meaningful and temporally anchored. "Wellbeing" is underspecified; "psychological distress one year post-intervention, measured with the Kessler-6" is interpretable and reproducible. Include timing, scale, and instrument.

See Week 10 for how measurement choices affect causal identification.

Step 3: Clarify the target population

Say exactly who you aim to inform. Eligibility rules define the source population, but sampling and participation can yield a study population with a different distribution of effect modifiers. If you intend to generalise beyond the source population (transportability), articulate the additional conditions required.

See Week 6 for how effect modification interacts with population composition.

Step 4: Evaluate exchangeability

Make the case that potential outcomes are independent of treatment conditional on covariates: $Y (a) ⊥ ⊥ A ∣ X$ . Use DAGs, subject-matter arguments, pre-treatment covariate balance checks, and overlap diagnostics. If exchangeability is doubtful, redesign (e.g., stronger measurement, alternative identification strategies) rather than rely solely on modelling.

See Week 5 for the formal definition of exchangeability.

Step 5: Ensure causal consistency

Consistency requires that, for individuals receiving a treatment version compatible with level $a$ , the observed outcome equals $Y (a)$ . It also presumes well-defined versions and no interference between units. When multiple versions exist, either refine the intervention so versions are irrelevant to $Y (a)$ , or condition on version-defining covariates.

See Week 5 for examples of consistency violations.

Step 6: Check positivity (overlap)

Each treatment level must occur with non-zero probability at every covariate profile needed for exchangeability:

$P (A = a ∣ L = l) > 0$

Diagnose limited overlap using propensity score distributions or extreme weights. Consider design-stage remedies (trimming, restriction, adaptive sampling) before estimation.

See Week 5 for the formal positivity assumption.

Step 7: Ensure measurement aligns with the scientific question

Verify that constructs are captured by instruments whose error structures do not distort the causal contrast of interest. Be explicit about forms of measurement error (classical, Berkson, differential, misclassification) and their structural implications for bias.

See Week 10 for how measurement error creates collider bias in DAGs.

Step 8: Preserve representativeness

End-of-study analyses should reflect the target population's distribution of effect modifiers. Differential attrition, non-response, or measurement processes tied to treatment and outcomes can induce selection bias. Plan strategies such as inverse probability weighting for censoring, multiple imputation, and sensitivity analyses for missing-not-at-random data.

See Week 4 for selection bias structures.

Step 9: Document transparently

Make assumptions, disagreements, and judgement calls legible. Register or timestamp your analytic plan. Include identification arguments (DAGs), code, and data where possible. Report robustness and sensitivity analyses. Transparent reasoning is a scientific result in its own right.

Summary table

Step	Requirement	Core assumption
1	Well-defined treatment	Consistency
2	Well-defined outcome	Interpretability
3	Target population	Generalisability
4	Exchangeability	Conditional independence
5	Causal consistency	No interference, well-defined versions
6	Positivity	Overlap
7	Measurement validity	No differential error
8	Representativeness	No selection bias
9	Transparent documentation	Reproducibility

Potential Outcomes and Causal Inference

This page introduces the fundamental problem of causal inference, the potential outcomes framework, and the three identification assumptions needed to estimate causal effects from data. It draws on the Women's Health Initiative (WHI) hormone replacement therapy case study as a motivating example.

Motivating example: hormone replacement therapy

Observational evidence (1980s–1990s)

Throughout the 1980s and 1990s, observational studies suggested that oestrogen therapy reduced all-cause mortality in postmenopausal women by roughly 30% (hazard ratio $\approx 0.68$ for current users vs. never users). Professional bodies endorsed hormone replacement therapy (HRT) on this basis:

1992, American College of Physicians: "Women who have coronary heart disease or who are at increased risk ... are likely to benefit from hormone therapy."
1996, American Heart Association: "ERT does look promising as a long-term protection against heart attack."

The experiment disagreed

The Women's Health Initiative (WHI) was a large randomised, double-blind, placebo-controlled trial enrolling 16,000 women aged 50–79. Participants were assigned to oestrogen plus progestin therapy and followed for up to eight years.

The experimental hazard ratio for all-cause mortality was 1.23 (initiators vs. non-initiators), the opposite direction from the observational finding.

What went wrong?

The discrepancy was not a failure of causal assumptions. It was a failure of study design: the observational studies did not correctly emulate a target trial. Specifically, they failed to align "time zero" (the start of follow-up) with the moment of treatment initiation, introducing survivor bias. When investigators re-analysed the observational data using a target trial emulation framework that matched treatment initiation to the start of follow-up, the observational estimates aligned with the experimental findings.

Lesson

If you want causal inferences from observational data, design the analysis as though you were running an experiment. Specify the target trial first.

The fundamental problem of causal inference

Causality is never directly observed. To quantify a causal effect, we need to compare two states of the world for the same individual, but each individual can experience only one.

Notation

Let $A$ denote a binary exposure ( $A = 1$ : treated, $A = 0$ : untreated) and $Y$ denote the outcome.

$Y_{i} (1)$ : the potential outcome for individual $i$ under treatment.
$Y_{i} (0)$ : the potential outcome for individual $i$ under control.

The individual causal effect is:

$τ_{i} = Y_{i} (1) - Y_{i} (0)$

We say there is a causal effect when $Y_{i} (1) - Y_{i} (0) \neq = 0$ .

The missing-data problem

At most one potential outcome is observed for each individual. The unobserved outcome is the counterfactual:

If $A_{i} = 1$ is observed, then $Y_{i} (0)$ is counterfactual.
If $A_{i} = 0$ is observed, then $Y_{i} (1)$ is counterfactual.

Individual-level causal effects are therefore generally unidentifiable. However, under certain assumptions, we can identify average causal effects at the population level.

Three identification assumptions

1. Causal consistency

The potential outcome corresponding to the exposure an individual actually receives equals their observed outcome:

$Y_{i} (a) = Y_{i} ∣ A_{i} = a$

This assumption requires that treatment is well-defined (no hidden versions of treatment) and that there is no interference between units (one person's treatment does not affect another's outcome).

2. Exchangeability

The potential outcomes are independent of treatment assignment. In a randomised experiment, this holds by design. In observational studies, we require conditional exchangeability: after conditioning on a set of measured covariates $L$ , treatment assignment is independent of potential outcomes:

$Y (a) ⊥ ⊥ A ∣ L$

When exchangeability holds, the Average Treatment Effect (ATE) is identified:

$ATE = E [Y (1)] - E [Y (0)] = E (Y ∣ A = 1) - E (Y ∣ A = 0)$

In observational settings with confounders $L$ :

$ATE = l \sum [E (Y ∣ A = 1, L = l) - E (Y ∣ A = 0, L = l)] Pr (L = l)$

3. Positivity

Every individual has a non-zero probability of receiving each treatment level, conditional on their covariates:

$P (A = a ∣ L = l) > 0 for all a and l$

Positivity is the only assumption that can be verified with data. Violations occur when certain subgroups never receive a particular treatment level, making causal effect estimates for those subgroups extrapolations rather than identifiable quantities.

Observational challenges

In observational settings, all three assumptions face threats. Causal consistency may fail when treatment varies across individuals (e.g., different forms of "religious service attendance"). Exchangeability is violated when unmeasured confounders exist. Positivity fails when certain subgroups have no access to treatment. These threats motivate the careful study designs and sensitivity analyses covered in later weeks.

From experiments to observational data

Randomised experiments address the fundamental problem by balancing confounders across treatment groups. Random assignment satisfies exchangeability by design, and controlled treatment administration satisfies consistency. Although individual causal effects remain unobservable, random assignment allows inference about average (marginal) causal effects.

In observational data, we must satisfy these three assumptions through study design, covariate adjustment, and sensitivity analysis. The remainder of this course develops the tools for doing so: causal diagrams (weeks 2–4), estimation methods (weeks 5–6, 8–9), and measurement considerations (week 10).

Discussion questions

Where in your own research would an average treatment effect be the right causal estimand, and when would it mask disparities that stakeholders care about?
Which of the three identification assumptions is most fragile in your field, and what designs or measurements could strengthen it?
What are examples of post-treatment variables you have been tempted to adjust for, and how would doing so bias the total effect?

Reporting Guide

This guide covers how to report results from causal inference analyses. It follows the nine-step causal workflow and demonstrates best practices for communicating average treatment effects, heterogeneous effects, and sensitivity analyses. This resource directly supports the research report assessment (40%).

The nine-step causal inference workflow

Before reporting results, ensure each step has been addressed.

Steps 1–3: problem definition

Well-defined treatment. Specify the exposure precisely, including the contrast (e.g., "weekly religious service attendance vs. less than weekly").
Well-defined outcome. State the outcome measure, its scale, and when it was assessed.
Target population. Define who the results apply to, including any weighting for population representativeness.

Steps 4–6: identification strategy

Exchangeability. Describe how conditional independence was achieved (e.g., "rich baseline covariate control including 32 covariates").
Consistency. Explain why the treatment is well-defined and uniform across individuals.
Positivity. Report verification through transition tables showing exposure variation across covariate strata.

Steps 7–9: implementation

Measurement validity. Note the psychometric properties of outcome scales.
Attrition handling. Describe how panel dropout was addressed (e.g., inverse probability of censoring weights).
Transparent reporting. Document all analytical decisions and assumptions.

Target trial emulation

Frame your causal question as: "How would outcomes change if we intervened to set everyone's exposure to level $a = 1$ rather than $a = 0$ , conditional on baseline characteristics?"

Reporting average treatment effects

Standard ATE table format

Include these elements for each outcome:

Outcome	Estimate (SD units)	95% CI	E-value	Interpretation
Outcome A	0.12	[0.08, 0.16]	1.8	Small positive effect
Outcome B	0.15	[0.11, 0.19]	2.1	Moderate positive effect

Key reporting elements

Effect sizes: report in standard deviation units for continuous outcomes.
Confidence intervals: show uncertainty around each estimate.
E-values: indicate robustness to unmeasured confounding.
Sample size: total analysed after exclusions and weighting.

Example results text

"Weekly religious service attendance showed positive causal effects across all cooperation measures. The largest effects were observed for social outcomes: sense of belonging ( $β = 0.18$ , 95% CI: 0.14–0.22) and social support ( $β = 0.15$ , 95% CI: 0.11–0.19). All effects were robust to moderate unmeasured confounding (E-values > 1.6)."

Reporting heterogeneous treatment effects

RATE results summary

When heterogeneity is detected, report it systematically:

Outcome	RATE-AUTOC	p-value	RATE-Qini	p-value	Evidence
Social support	0.12	0.003	0.08	0.012	Strong
Belonging	0.15	0.001	0.11	0.004	Strong
Charitable donations	0.06	0.089	0.04	0.156	Moderate
Volunteering	0.03	0.234	0.02	0.445	Weak

Example heterogeneity text

"We found substantial heterogeneity in treatment effects for social outcomes (RATE-Qini > 0.08, $p < 0.05$ ), but limited heterogeneity for behavioural outcomes. This suggests that while religious service benefits most people socially, individual responses vary considerably in magnitude."

Reporting policy tree results

Present policy trees with both standardised and original-scale interpretations.

Subgroup identification with data-scale effects

High-response subgroups for charitable donations:

Older adults with high agreeableness (age > 45, agreeableness > +1 SD)
- Standardised effect: $β = 0.28$ (95% CI: 0.21–0.35)
- Data-scale effect: NZ$847 additional annual donations (95% CI: NZ$635–1,058)
- Sample proportion: 23%
Parents with medium conscientiousness (parent = yes, conscientiousness > 0)
- Standardised effect: $β = 0.22$ (95% CI: 0.16–0.28)
- Data-scale effect: NZ$665 additional annual donations (95% CI: NZ$484–846)
- Sample proportion: 31%
All others
- Standardised effect: $β = 0.08$ (95% CI: 0.04–0.12)
- Data-scale effect: NZ$242 additional annual donations (95% CI: NZ$121–363)
- Sample proportion: 46%

Example policy tree text

"Policy tree analysis identified two subgroups with enhanced treatment response for charitable donations. Older adults (45+) with high agreeableness showed the largest increase ( $β = 0.28$ SD, equivalent to NZ$847 additional annual donations), representing 23% of the sample. In practical terms, targeted interventions toward these subgroups could generate 2.8–3.5 times more charitable giving than population-wide approaches."

Sensitivity analysis: E-values

Interpretation

E-values quantify robustness to unmeasured confounding. The E-value is the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need with both the treatment and the outcome to explain away the observed effect.

E-value range	Interpretation
$\geq 2.0$	Robust to strong confounding
1.5–2.0	Robust to moderate confounding
$< 1.5$	Vulnerable to weak confounding

Example sensitivity text

"To assess robustness to unmeasured confounding, we calculated E-values for all estimates. The observed effect on sense of belonging (E-value = 2.4) would require an unmeasured confounder associated with both religious service and belonging by a risk ratio of 2.4 each to explain away the result."

Methods section template

A complete methods section following the nine-step workflow should include:

Treatment definition: what the exposure is, how it is coded, and the contrast of interest.
Outcome definition: measures used, timing of assessment, any transformations applied.
Target population: sampling frame, weighting strategy, eligibility criteria.
Causal identification: covariates conditioned on, justification for conditional exchangeability.
Statistical analysis: estimation method, key tuning parameters (e.g., number of trees, minimum node size).
Attrition handling: censoring weights, stages of dropout addressed.
Heterogeneity assessment: RATE metrics, false discovery rate correction, policy tree depth.
Sensitivity analysis: E-values for all primary estimates.

Reporting checklist

Do report

Both standardised and data-scale effects
Effect sizes with confidence intervals
Sample sizes after exclusions and weighting
E-values for sensitivity analysis
Clear practical interpretation of effect magnitudes
Subgroup sizes and effect magnifications
Target trial framework and causal question
Explicit treatment and outcome definitions

Do not report

Model coefficients without interpretation
p-values alone without effect sizes
Only standardised effects for policy-relevant outcomes
Technical details that obscure main findings
Causal claims beyond your identification strategy

Figure presentation

ATE plots

Use forest plots with confidence intervals.
Order by effect magnitude or E-value.
Include sample sizes.

Policy tree plots

Show decision rules clearly.
Include sample proportions in each node.
Provide plain-language interpretation alongside the tree.

Glossary and DAG Hand-outs

A reference page covering causal inference terminology and links to hand-out PDFs on directed acyclic graphs (DAGs), confounding, selection bias, and measurement error.

Causal inference glossary

Causal inference rests on mathematical foundations that enjoy broad agreement, but the terminology varies across disciplines. The same word sometimes carries different (even opposite) meanings in different literatures. Terms to watch include "selection", "fixed effects", "standardisation", "moderator", "structural equation model", and "identification".

Core concepts

Term	Definition
Average Treatment Effect (ATE)	The expected difference in potential outcomes across the entire population: $ATE = E [Y (1) - Y (0)]$ . Also called the marginal effect.
Conditional Average Treatment Effect (CATE)	The ATE within a subgroup defined by covariates $X = x$ : $CATE (x) = E [Y (1) - Y (0) ∣ X = x]$ .
Potential outcomes	The outcomes that would be observed under each possible treatment level. For individual $i$ : $Y_{i} (1)$ under treatment, $Y_{i} (0)$ under control. Also called counterfactual outcomes.
Counterfactual	The potential outcome corresponding to the treatment level not actually received. Unobservable for any given individual.
Causal consistency	$Y_{i} (a) = Y_{i}$ when $A_{i} = a$ . The observed outcome equals the potential outcome under the treatment actually received. Requires well-defined treatment and no interference.
Exchangeability	Potential outcomes are independent of treatment assignment: $Y (a) ⊥ ⊥ A$ . In observational studies, we require conditional exchangeability: $Y (a) ⊥ ⊥ A ∣ L$ .
Positivity	Every subgroup has a non-zero probability of receiving each treatment level: $P (A = a ∣ L = l) > 0$ .

Graphical concepts

Term	Definition
DAG (directed acyclic graph)	A graph with directed edges (arrows) and no cycles. Used to encode causal assumptions about which variables influence which.
Confounder	A common cause of both the exposure and the outcome. Creates a non-causal (backdoor) path that must be blocked for valid causal inference.
Collider	A variable caused by two or more other variables on a path. Conditioning on a collider opens a spurious association.
Mediator	A variable on the causal pathway between exposure and outcome ( $A \to M \to Y$ ). Conditioning on a mediator blocks the indirect effect.
Backdoor path	A non-causal path from exposure to outcome that passes through a common cause. Blocking all backdoor paths satisfies the backdoor criterion.
d-separation	A graphical criterion for determining conditional independence. Two variables are d-separated given a set $Z$ if every path between them is blocked by $Z$ .

Estimation and sensitivity

Term	Definition
Propensity score	The probability of receiving treatment given covariates: $e (L) = P (A = 1 ∣ L)$ . Used for weighting, matching, or stratification.
E-value	The minimum strength of association an unmeasured confounder would need with both treatment and outcome (on the risk ratio scale) to explain away an observed effect.
RATE (Rank Average Treatment Effect)	A metric for assessing treatment effect heterogeneity. Measures how well a prioritisation rule identifies individuals with larger effects.
QINI curve	A cumulative gain curve showing the benefit of treating individuals in order of predicted treatment effect. Area under the QINI curve summarises heterogeneity.
Policy tree	A decision tree that assigns treatment based on covariates to maximise a welfare criterion. Used for identifying high-response subgroups.
Doubly robust estimation	An estimation strategy that yields consistent causal estimates if either the outcome model or the propensity score model (but not necessarily both) is correctly specified.

DAG hand-outs

The following hand-outs cover DAG conventions, common structures, and specific forms of bias. All PDFs are available for download from the hand-outs folder (Dropbox).

Foundations

Hand-out	Topic
1a. Local conventions	Conventions for causal diagram construction and interpretation
1b. Directed graph terminology	Core terminology for directed acyclic graphs
S1. Graphical key	Visual reference guide for DAG symbols and notation
S2. Glossary	Comprehensive glossary of causal inference terminology

Common applications

Hand-out	Topic
2. Common causal questions	Frequently encountered causal questions and how to set them up
6. Effect modification	When and how treatment effects vary across subgroups
9. External validity	Generalising causal findings across populations and contexts

Time series and confounding

Hand-out	Topic
3. Time series approaches	How longitudinal data help address confounding bias
4. Three-wave panel methods	Using three-wave panel data for causal inference
5. Time series limitations	When time series approaches may not resolve confounding
S3. Time-resolved confounding	Advanced approaches to time-varying confounding

Advanced topics

Hand-out	Topic
7. Selection bias	Selection bias in longitudinal studies
8. Measurement error	Structural approaches to representing and addressing measurement error
10. Experimental design	How experiments address confounding and selection bias

Supplementary materials

Hand-out	Topic
S5. Timing examples	Practical examples of confounding and timing issues
S6. Detailed panel examples	What can go wrong in a three-wave panel
S7. Cross-sectional approaches	When to report multiple DAGs in cross-sectional studies
S8. Bias correction	Quantitative approaches to bias correction
S9. Mediator bias	Confounding bias in mediation analysis
S10. Misclassification bias	Examples of misclassification bias and bias towards the null

Accessing hand-outs

All PDFs are in the hand-outs folder on Dropbox. File names match the numbering in the tables above (e.g., 1a-terminologylocalconventions.pdf, S2-glossary.pdf).

Course Readings

Primary text

Hernan MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC, 2025.

Chapters 1–9 are the required reading for this course. The book is freely available from the link above.

Chapter guide

Chapter	Topic	Course week(s)	What it covers
1	A definition of causal effect	w1–2	Individual and average causal effects using potential outcomes notation.
2	Randomised experiments	w2	How randomisation identifies causal effects and why experiments are the benchmark.
3	Observational studies	w3	Conditions under which observational data can support causal inference.
4	Effect modification	w6	How treatment effects vary across subgroups defined by baseline covariates.
5	Interaction	w4	Distinguishing interaction from effect modification; additive vs. multiplicative scales.
6	Graphical representation of causal effects	w2–3	Directed acyclic graphs (DAGs) and how they encode causal assumptions.
7	Confounding	w3–4	Formal definition of confounding, the backdoor criterion, and adjustment strategies.
8	Selection bias	w4	How conditioning on common effects (colliders) creates spurious associations.
9	Measurement bias	w10	How measurement error distorts causal effect estimates.

Reading strategy

Read each chapter before the corresponding lecture week. The chapters are short (roughly 10–15 pages each) and written in accessible prose with worked examples. Focus on understanding the concepts rather than memorising notation.

Supplementary references

These are not required but provide additional depth on specific topics covered in the course.

Neal B. Introduction to Causal Inference. 2020. Chapters 1–2. Covers the same foundations as Hernan and Robins with a machine-learning orientation.
Pearl J, Glymour M, Jewell NP. Causal Inference in Statistics: A Primer. Wiley, 2016. Compact introduction to the graphical (structural) approach to causation.

Week 1: How to Ask a Question in Psychological Science?

Lab: Git and GitHub

Causal graph: we refer to this image in the lecture and begin reviewing causal graphs in Week 2

Background readings

None today. Recommended readings are listed at the end of this page.

Key concepts for the test(s)

Today we introduce the following topics relevant to the test(s):

Confounding (introduced in week 2)
Internal validity (introduced in week 2)
External validity (introduced in week 3)

Today we discuss these concepts informally. We define them formally in week 2.

Before next week

Install R and RStudio before week 2. Instructions are in Lab 2: Install R and RStudio.

Lecture: Introduction to the Course

Two foundational questions

Every causal inquiry in psychological science begins with two questions. First: what do I want to know? Second: for which population does this knowledge generalise? These questions sound simple, yet most published studies answer neither one clearly. A regression table reports associations, but associations are not causes. A convenience sample of undergraduates describes that sample, not the population we care about. This course teaches you to state your question precisely before you touch any data.

The first question forces you to specify an intervention. "Does green space affect happiness?" is vague. A sharper version: "What is the causal effect of moving from a neighbourhood with no accessible park to one with an accessible park on self-reported life satisfaction?" Here the treatment is well-defined ( $A = 1$ for park access, $A = 0$ for no park access), the outcome is specified ( $Y$ = life satisfaction), and someone could, in principle, run an experiment to test it.

The second question forces you to specify a target population. Results obtained from a sample of university students in Wellington may not generalise to elderly residents in rural Southland. The distinction between the sample population (who you studied) and the target population (who you want to learn about) is central to external validity, which we revisit formally in Week 4.

The fundamental problem of causal inference

Consider a person who moves to a neighbourhood with a park. She reports high life satisfaction. Would she have reported the same life satisfaction had she stayed in her old neighbourhood without a park? We cannot know, because she experienced only one of the two conditions. This is the fundamental problem of causal inference: for any individual, we can observe at most one potential outcome.

We formalise this with potential outcomes notation. Let $Y_{i} (1)$ denote the outcome that person $i$ would experience under treatment ( $A = 1$ ), and $Y_{i} (0)$ the outcome under control ( $A = 0$ ). The individual causal effect is:

$δ_{i} = Y_{i} (1) - Y_{i} (0)$

Because we observe only one of $Y_{i} (1)$ or $Y_{i} (0)$ , the individual causal effect $δ_{i}$ is never directly observable. This is not a limitation of our methods; it is a logical constraint. No amount of data collection, no statistical technique, and no machine learning algorithm can reveal both potential outcomes for the same person at the same time.

From individuals to populations

If individual effects are unobservable, what can we learn? We can learn about average effects across a population. The Average Treatment Effect (ATE) is defined as:

$ATE = E [Y (1) - Y (0)]$

This is the expected difference in the outcome if everyone in the target population experienced treatment versus if everyone experienced control. The ATE is a population-level quantity. It tells us what would happen on average, not what would happen to any particular person.

The shift from individual to population is not just a concession to practicality. Causal inference contrasts counterfactual states at the population or subpopulation level. When we say "green space causes happiness," we mean that on average, across a defined population, access to green space raises life satisfaction relative to the counterfactual of no access.

How randomisation solves the problem

If we randomly assign people to neighbourhoods with or without parks, treatment assignment is independent of all other characteristics, both observed and unobserved. The healthy, the unhealthy, the wealthy, the poor, the optimists, and the pessimists are all equally probable in both groups. Under randomisation, the average outcome among the treated group estimates $E [Y (1)]$ , and the average outcome among the control group estimates $E [Y (0)]$ . Their difference estimates the ATE.

Randomisation works because it satisfies a condition called exchangeability: the potential outcomes are independent of treatment assignment. We write this as $Y (a) ⊥ ⊥ A$ . In words, the treatment and control groups are exchangeable: had the treated group received control instead, their average outcome would have matched the control group's observed average, and vice versa.

We develop this idea formally in Week 2, where we also introduce the three assumptions that underpin all causal inference (causal consistency, conditional exchangeability, and positivity). For now, the key insight is that randomisation transforms the unobservable population-level counterfactual contrast into an observable comparison between groups.

What comes next

Most psychological research cannot randomise the variables we care about. We cannot randomly assign people to experience trauma, adopt a religion, or lose a job. This course is about what to do when randomisation is impossible. The answer involves stating our causal assumptions explicitly (using causal diagrams), checking whether those assumptions are sufficient for identification (using the backdoor criterion), and estimating the causal effect using methods that respect the structure of the problem.

Further reading

The green-space example and counterfactual framework are adapted from:

Bulbulia JA (2024). "A causal inference framework for cross-cultural research." link

The two foundational questions and the formal treatment of causal inference appear in:

Bulbulia JA (2024). "Methods in causal inference. Part 1: causal diagrams and confounding." Evolutionary Human Sciences. link

Lab materials: Lab 1: Git and GitHub

Week 2: Causal Diagrams — Five Elementary Structures

Readings

Barrett M (2023). ggdag: Analyze and Create Elegant Directed Acyclic Graphs. R package version 0.2.7.9000. https://github.com/malcolmbarrett/ggdag
"An Introduction to Directed Acyclic Graphs": https://r-causal.github.io/ggdag/articles/intro-to-dags.html
"Common Structures of Bias": https://r-causal.github.io/ggdag/articles/bias-structures.html

Key concepts for the test(s)

Confounding
Causal Directed Acyclic Graph
Five elementary causal structures
d-separation
Back door path
Conditioning
Fork bias
Collider bias
Mediator bias
Four rules of confounding control

Download your lab R script

Create a new .R file called 02-lab.R with your name, contact, date, and a title such as "Simulating the five basic causal structures in R."
Copy and paste the code chunks below during class.
Save in a clearly defined project directory.

You may also download the lab here: Download the R script for Lab 02

Seminar

Overview

Understand basic features of causal diagrams: definitions and applications
Introduction to the five elementary causal structures
Lab: gentle introduction to simulation and regression

Review

Psychological research begins with two questions:

What do I want to know?

For which population does this knowledge generalise?

This course considers how to ask psychological questions that pertain to populations with different characteristics.

In psychological research, we typically ask questions about the causes and consequences of thought and behaviour: "What if?" questions (Hernán & Robins, 2025).
The following concepts help us describe two distinct failure modes:

External validity: the extent to which findings generalise to other situations, people, settings, and time periods. We want to know if our findings carry beyond the sample population to the target population. We fail when our results do not generalise as we think, or when we have not clearly defined our question or target population.
Internal validity: the extent to which the associations we obtain from data reflect causality. When asking "What if?" questions, we want to understand what would happen if we intervened. In this course, we use "treatment" or "exposure" to denote the intervention, and "outcome" to denote the effect of an intervention.

During the first part of the course, our primary focus is on challenges to internal validity from confounding bias.

How randomisation identifies causal effects

Last week we introduced the fundamental problem of causal inference: we cannot observe both potential outcomes $Y_{i} (1)$ and $Y_{i} (0)$ for the same individual. Randomisation solves this problem at the population level. When treatment is randomly assigned ( $R \to A$ ), treatment assignment is independent of all potential outcomes:

$Y (a) ⊥ ⊥ A$

This condition is called unconditional exchangeability. It means that the treated and untreated groups are, on average, identical in every respect except the treatment they received. Any difference in average outcomes between groups can therefore be attributed to the treatment itself. The ATE is then identified by a simple difference in group means:

$ATE = \hat{E} [Y ∣ A = 1] - \hat{E} [Y ∣ A = 0]$

Experiments are the benchmark for causal inference because randomisation eliminates confounding without requiring the investigator to know or measure the confounders. However, most psychological questions cannot be answered by experiment alone: we cannot randomly assign people to experience grief, adopt a religion, or grow up in poverty. For these questions, we need observational methods that achieve conditional exchangeability through careful adjustment. The causal diagrams we learn today are the tools for deciding what to adjust for.

Definitions

Internal validity is compromised if the association between the treatment and outcome in a study does not consistently reflect causality in the sample population as defined at baseline.

External validity is compromised if the association between the treatment and outcome in a study does not consistently reflect causality in the target population as defined at baseline.

Confounding bias exists if there is an open back-door path between the treatment and outcome, or if the path between the treatment and outcome is blocked.

Today, our purpose is to clarify the meaning of each term in this definition. To that end, we introduce the five elementary graphical structures employed in causal diagrams, then explain the four elementary rules that allow investigators to identify causal effects from the asserted relations in a causal diagram.

Introduction to Causal Diagrams

Causal diagrams (also called causal graphs, Directed Acyclic Graphs, or Causal DAGs) are graphical tools whose primary purpose is to enable investigators to detect confounding biases.

Remarkably, causal diagrams are rarely used in psychology!

The meaning of our symbols

Our variable naming conventions (adapted from Bulbulia, 2023)

For us:

$X$ denotes a variable without reference to its role.
$A$ denotes the "treatment" or "exposure" variable: the variable for which we seek to understand the effect of intervening on it. It is the "cause."
$Y$ denotes the outcome or response of an intervention. It is the "effect."
$Y (a)$ denotes the counterfactual or potential state of $Y$ in response to setting the level of the exposure to $A = a$ . To consistently estimate causal effects we need to evaluate counterfactual states of the world. This might seem like science fiction, but we are already familiar with methods for obtaining such contrasts: randomised controlled experiments.
$L$ denotes a measured confounder or set of confounders: a variable which, if conditioned upon, closes an open back-door path between $A$ and $Y$ .
$U$ denotes an unmeasured confounder: a variable that may affect both the treatment and the outcome but for which we have no direct measurement.
$M$ denotes a mediator: a variable along the path from exposure to outcome. Conditioning on a mediator when estimating the total effect of $A$ on $Y$ will bias that estimate.
$\overset{ˉ}{X}$ denotes a sequence of variables, for example a sequence of treatments.
$R$ denotes a randomisation or a chance event.

Elements of our causal graphs

Nodes, edges, and conditioning conventions (adapted from Bulbulia, 2023)

Time indexing

In our causal diagrams, we implement two conventions for temporal order. First, the layout is structured left to right to reflect the sequence of causality. Second, we index nodes according to the relative timing of events. If $X_{0}$ precedes $X_{1}$ , the indexing indicates this chronological order.

Representing uncertainty in timing

When the sequence of events is ambiguous (particularly in cross-sectional data), we use $X_{ϕt}$ to propose a temporal order without clear time-specific measurements. We denote an event presumed to occur first as $X_{ϕ 0}$ and a subsequent event as $X_{ϕ 1}$ .

Arrows

Black arrows denote causality.
Red arrows reveal an open backdoor path.
Dashed black arrows denote attenuation.
Red dashed arrows denote bias in a true causal association.
A blue arrow with a circle point denotes effect-measure modification.
$R \to A$ denotes random treatment assignment.

Boxes

A black box denotes conditioning that reduces confounding or is inert.
A red box describes conditioning that introduces confounding bias.
A dashed circle denotes a latent variable (not measured or not conditioned upon).

Terminology for conditional independence

Statistical independence ( $∐$ ): $A ∐ Y (a)$ means treatment assignment is independent of the potential outcomes.
Statistical dependence ( $∐$ ): $A ∐ Y (a)$ means treatment assignment is related to potential outcomes, potentially introducing bias.
Conditioning ( $∣$ ): specifies contexts under which independence or dependence holds.
- Conditional independence: $A ∐ Y (a) ∣ L$ means that once we account for $L$ , treatment and potential outcomes are independent.
- Conditional dependence: $A ∐ Y (a) ∣ L$ means potential outcomes and treatments are not independent after conditioning on $L$ .

The Five Elementary Structures of Causality

Judea Pearl proved that all elementary structures of causality can be represented graphically (Pearl, 2009).

Five elementary structures (adapted from Bulbulia, 2023)

Two variables:

Causality absent: no causal effect between $A$ and $B$ . They are statistically independent: $A ∐ B$ .
Causality: $A$ causally affects $B$ . They are statistically dependent: $A ∐ B$ .

Three variables:

Fork: $A$ causally affects both $B$ and $C$ . Variables $B$ and $C$ are conditionally independent given $A$ : $B ∐ C ∣ A$ .
Chain: $C$ is affected by $B$ , which is affected by $A$ . Variables $A$ and $C$ are conditionally independent given $B$ : $A ∐ C ∣ B$ . Here $B$ mediates the effect of $A$ on $C$ .
Collider: $C$ is affected by both $A$ and $B$ , which are independent. Conditioning on $C$ induces an association: $A ∐ B ∣ C$ .

Once we understand these basic relationships, we can build more complex causal diagrams. These structures help us see how statistical independences and dependencies emerge from the data, allowing us to clarify confounders so that $Y (a) ∐ A ∣ L$ .

The Three Identification Assumptions

Every causal inference, whether from an experiment or an observational study, rests on three assumptions. When all three hold, the causal effect of $A$ on $Y$ is identified: it can be estimated from the observed data.

Assumption 1: Causal consistency

The observed outcome for an individual equals the potential outcome under the treatment that individual actually received. Formally: if $A_{i} = a$ , then $Y_{i} = Y_{i} (a)$ . This assumption requires that the treatment is well-defined and that there are no hidden versions of treatment. For example, if "exercise" is the treatment, the assumption requires that exercising means roughly the same thing for everyone in the study.

Assumption 2: Conditional exchangeability (no unmeasured confounding)

Within levels of the measured covariates $L$ , treatment assignment is independent of the potential outcomes:

$Y (a) ⊥ ⊥ A ∣ L$

This is the assumption that, after conditioning on $L$ , the treated and untreated groups are exchangeable. It is satisfied automatically by randomisation (where $L$ is empty and the condition is unconditional). In observational studies, it requires that we have measured and conditioned on all common causes of treatment and outcome. No statistical test can confirm this assumption; it is justified by subject-matter knowledge encoded in a causal diagram.

Assumption 3: Positivity

Every individual has a non-zero probability of receiving each treatment level, within every stratum of the covariates:

$P (A = a ∣ L = l) > 0 for all a and l$

In plain language, there must be both treated and untreated individuals at every combination of covariate values. If some subgroup never receives treatment (for example, if no one over age 90 is prescribed the drug), we cannot estimate the causal effect for that subgroup because we have no counterfactual comparison.

These three assumptions form the foundation for everything that follows in this course. Causal diagrams help us evaluate whether the second assumption (conditional exchangeability) holds: if we can identify a set $L$ that blocks all backdoor paths between $A$ and $Y$ , and if we condition on $L$ , then conditional exchangeability is satisfied (given that our diagram is correct).

You might wonder: "If not from the data, where do our assumptions about causality come from?" Our assumptions are based on existing knowledge. Otto Neurath, an Austrian philosopher, used the metaphor of a ship rebuilt at sea:

We are like sailors who on the open sea must reconstruct their ship but are never able to start afresh from the bottom. Where a beam is taken away a new one must at once be put there, and for this the rest of the ship is used as support. In this way, by using the old beams and driftwood, the ship can be shaped entirely anew, but only by gradual reconstruction. (Neurath, 1973, p. 199)

The Four Rules of Confounding Control

Four rules of confounding control

Condition on common cause or its proxy. When $A$ and $Y$ share common causes, conditioning on these common causes blocks the open backdoor paths.
Do not condition on a mediator. When $L$ mediates $A \to Y$ , conditioning on $L$ biases the total causal effect estimate.
Do not condition on a collider. When $L$ is a common effect of $A$ and $Y$ , conditioning on $L$ induces a spurious association. For example, if marriage causes wealth and happiness causes wealth, conditioning on wealth induces an association between marriage and happiness even without a causal connection.
Proxy rule: conditioning on a descendant is akin to conditioning on its parent. When $L^{'}$ is an effect of $L$ , conditioning on $L^{'}$ is approximately equivalent to conditioning on $L$ . This can introduce bias (if $L$ is a collider) or reduce bias (if $L$ is an unmeasured common cause and $L^{'}$ is a measured proxy).

Further reading

The three identification assumptions and randomisation framework are covered in:

Hernán MA, Robins JM (2025). Causal Inference: What If. Chapters 1–2. link
Bulbulia JA (2024). "Methods in causal inference. Part 1: causal diagrams and confounding." Evolutionary Human Sciences. link

Why experiments are the benchmark for causal inference:

Bulbulia JA (2024). "Methods in causal inference. Part 4: confounding in experiments." Evolutionary Human Sciences. link

Lab materials: Lab 2: Install R and RStudio

Week 3: Causal Diagrams — The Structures of Confounding Bias

Date: 11 Mar 2026

Readings

Required

Hernán MA, Robins JM (2025). Causal Inference: What If. Chapter 6. link

Optional

Suzuki E et al. (2020). Causal diagrams. link
Bulbulia JA (2023). A causal inference framework for cross-cultural research. link
Neal B (2020). Introduction to Causal Inference. Chapter 3.

Key concepts for the test(s)

M-bias
Regression
Intercept
Regression coefficient
Model fit
Why model fit is misleading for causality

Download your lab R script

Create a new .R file called 03-lab.R with your name, contact, date, and a title such as "Regression and confounding bias."
Copy and paste the code chunks below during class.
Save in a clearly defined project directory.

You may also download the lab here: Download the R script for Lab 03

Seminar

Learning outcomes

By the end of this week, you will be able to:

Define confounding bias formally and identify it in a causal diagram.
State and apply the backdoor criterion to determine which variables to condition on.
Explain why good regression model fit does not rule out confounding.
Distinguish confounding problems that longitudinal data can resolve from those it cannot.
Define M-bias and explain why conditioning on a pre-treatment collider introduces spurious associations.

What is confounding?

Last week we introduced the five elementary causal structures and the four rules of confounding control. This week we apply those tools to the central problem of observational causal inference: confounding.

Confounding exists when a common cause of both the treatment $A$ and the outcome $Y$ creates a non-causal (backdoor) path between them. If this path is left open, the observed association between $A$ and $Y$ conflates the true causal effect with a spurious correlation induced by the shared cause.

Definition: Confounding bias

Confounding bias exists when there is at least one open backdoor path between the treatment $A$ and the outcome $Y$ . A backdoor path is any path from $A$ to $Y$ that begins with an arrow pointing into $A$ .

Consider a simple example. Suppose we want to know whether exercise ( $A$ ) causes lower blood pressure ( $Y$ ). Health consciousness ( $L$ ) is a common cause: health-conscious people exercise more and eat better, which independently lowers blood pressure. The DAG is $A \leftarrow L \to Y$ , with a direct arrow $A \to Y$ . The path $A \leftarrow L \to Y$ is a backdoor path. If we do not condition on $L$ , our estimate of the effect of $A$ on $Y$ will be inflated because it includes the spurious association transmitted through $L$ .

The backdoor criterion

How do we know which variables to condition on? Pearl's backdoor criterion provides the answer.

Definition: Backdoor criterion

A set of variables $L$ satisfies the backdoor criterion relative to treatment $A$ and outcome $Y$ if:

No element of $L$ is a descendant of $A$ (we do not condition on anything caused by the treatment).
$L$ blocks every backdoor path from $A$ to $Y$ .

When $L$ satisfies the backdoor criterion, conditioning on $L$ yields conditional exchangeability: $Y (a) ⊥ ⊥ A ∣ L$ .

The backdoor criterion translates the causal problem into a graphical one. Instead of reasoning about unobservable counterfactuals, we draw a DAG, list all backdoor paths, and check whether our measured covariates block them. If they do, the causal effect is identified. If they do not, we have unresolved confounding, and our estimate will be biased regardless of sample size or statistical sophistication.

Confounding and regression

Regression is the standard tool for "controlling for" confounders. When we write $Y = β_{0} + β_{1} A + β_{2} L + ε$ , we are conditioning on $L$ and interpreting $β_{1}$ as the effect of $A$ on $Y$ holding $L$ constant.

Definition: Key regression concepts

Intercept ( $β_{0}$ ): the expected value of $Y$ when all predictors are zero.
Regression coefficient ( $β_{1}$ ): the expected change in $Y$ for a one-unit change in the predictor, holding other predictors constant.
Model fit ( $R^{2}$ ): the proportion of variance in $Y$ explained by the predictors.

A critical point for this course: good model fit does not indicate absence of confounding. A regression can explain 90% of the variance in $Y$ (high $R^{2}$ ) while conditioning on the wrong variables. If you condition on a mediator instead of a confounder, the model may fit well but the coefficient for $A$ will be biased. If you condition on a collider, the model may again fit well but will introduce a spurious association that was not there before. Model fit is a statistical property of the relationship between predictors and outcome. Confounding is a structural (causal) property of the data-generating process. The two are logically independent.

Why model fit is misleading for causality

A model can fit the data well while conditioning on the wrong variables. Conditioning on a mediator blocks the causal path and attenuates the estimate. Conditioning on a collider opens a spurious path and inflates or reverses the estimate. Neither error is detectable from $R^{2}$ , residual plots, or any other goodness-of-fit diagnostic. Only a correctly specified causal diagram can distinguish confounders (condition on them) from mediators and colliders (do not).

Confounding problems resolved by time-series data

Many confounding problems arise because cross-sectional data cannot distinguish cause from effect. When we measure $A$ and $Y$ at the same time, we cannot tell whether $A$ caused $Y$ , $Y$ caused $A$ , or a shared cause $L$ produced both. Longitudinal data collection, in which variables are measured at distinct time points, resolves several of these ambiguities.

The following figure presents seven confounding scenarios. In each case, the structural problem can be addressed by measuring the relevant variables in the correct temporal order: confounders before treatment, treatment before outcome.

Confounding problems resolved by time-series data (adapted from Bulbulia, 2023)

The seven scenarios include reverse causation (where the apparent cause is actually the effect), confounding by a measured common cause (which can be conditioned on once it is measured before treatment), and cases where the temporal ordering of measurement clarifies which variables are mediators versus confounders. In each case, the solution is the same: measure confounders at baseline ( $t_{0}$ ), treatment at wave 1 ( $t_{1}$ ), and outcome at wave 2 ( $t_{2}$ ). This three-wave panel design is the minimum structure for most observational causal questions in psychology.

Confounding problems not resolved by time-series data alone

Not all confounding problems can be solved by longitudinal data collection. The next figure presents six examples where time-series data are insufficient. Study these carefully before seminar.

Confounding problems not resolved by time-series data (adapted from Bulbulia, 2023)

Definition: M-bias

M-bias arises when a pre-treatment variable $L$ is a collider of two unmeasured confounders: one that affects the treatment and one that affects the outcome. The DAG forms an "M" shape: $U_{1} \to L \leftarrow U_{2}$ , with $U_{1} \to A$ and $U_{2} \to Y$ . Without conditioning on $L$ , the path $A \leftarrow U_{1} \to L \leftarrow U_{2} \to Y$ is blocked (because $L$ is a collider). Conditioning on $L$ opens this path, creating a spurious association between $A$ and $Y$ that did not exist before.

M-bias is counterintuitive because $L$ is measured before treatment and appears to be a plausible confounder. The instinct to "control for everything measured at baseline" can introduce bias rather than remove it. The only way to diagnose M-bias is through a causal diagram: it is invisible in the data.

Other scenarios in the figure include residual confounding (where unmeasured variables affect changes in confounders over time even after baseline adjustment), treatment-confounder feedback (where past treatments affect future confounders, creating cycles that standard regression cannot handle), and intermediary confounding in mediation analysis (where a variable that confounds the mediator-outcome relationship is itself affected by treatment, making natural indirect effects unidentifiable).

Worked example: the assumptions in causal mediation

The following figure illustrates the assumptions required for causal mediation analysis. When we decompose the total effect of $A$ on $Y$ into a direct effect and an indirect effect through $M$ , we need not only the standard three assumptions (causal consistency, conditional exchangeability, positivity) but also the absence of treatment-induced confounding of the mediator-outcome relationship.

The assumptions in causal mediation (adapted from Bulbulia, 2023)

If a confounder of the $M \to Y$ path is itself affected by treatment $A$ , conditioning on it blocks part of the treatment effect (mediator bias) while failing to condition on it leaves confounding open. This dilemma has no solution within the standard regression framework and is one reason why causal mediation analysis requires stronger assumptions than total-effect estimation.

Further reading

All open access:

Bulbulia JA (2024). "Methods in causal inference. Part 1: causal diagrams and confounding." Evolutionary Human Sciences. link
Bulbulia JA (2024). "A causal inference framework for cross-cultural research." link
Hernán MA, Robins JM (2025). Causal Inference: What If. Chapter 6. link

Lab materials: Lab 3: Regression, Graphing, and Simulation

Week 4: Interaction, Measurement Bias, and Selection Bias

Readings

Required

Bulbulia (2023) "Methods in Causal Inference Part 1: Causal Diagrams and Confounding." link
See simplified reading

Optional

Hernan & Robins (2025) What If, Chapters 6--9. link
Hernan (2004) "A Structural Approach to Selection Bias." link
Hernan (2017) "Selection Without Colliders." link
Hernan & Cole (2009) "Causal Diagrams and Measurement Bias." link
VanderWeele & Hernan (2012) "Results on Differential and Dependent Measurement Error." link

Key concepts for the test(s)

Effect (measure) modification
Undirected/uncorrelated measurement error bias
Undirected/correlated measurement error bias
Directed/uncorrelated measurement error bias
Directed/correlated measurement error bias
Selection bias and transportability

Download your lab R script

Create a new .R file called 04-lab.R with your name, contact, date, and a title such as "Regression and confounding bias."
Copy and paste the code chunks below during class.
Save in a clearly defined project directory.

You may also download the lab here: Download the R script for Lab 04

Seminar

Learning outcomes

By the end of this week, you will be able to:

Distinguish effect modification from confounding and from interaction.
Identify the four types of measurement error bias in a causal diagram.
Explain how selection bias arises from conditioning on a collider at baseline.
Define the distinction between a target population and a source population, and explain why it matters for transportability.

Effect modification

Effect modification occurs when the causal effect of $A$ on $Y$ differs across strata of a third variable $V$ . In our DAG conventions, effect modification is indicated by a blue arrow with a circle point from $V$ to the $A \to Y$ path. Importantly, effect modification is not a source of bias. It is a feature of the world: some treatments work differently for different people.

Effect modification differs from confounding. A confounder is a common cause of $A$ and $Y$ that must be conditioned on to eliminate bias. An effect modifier is a variable that changes the magnitude (or direction) of the causal effect. We do not adjust for effect modifiers to remove bias; we examine them to understand for whom the treatment works.

Effect modification also differs from statistical interaction, although the two are related. Interaction is a property of a statistical model (e.g., a product term $A \times V$ in a regression). Effect modification is a causal property of the data-generating process. A statistical interaction may reflect effect modification, confounding, or an artefact of the scale on which the outcome is measured. Distinguishing these requires a causal diagram.

Common causal questions presented as causal graphs

The following figure shows how several standard research designs can be represented as causal DAGs. Each design answers a different causal question, and the DAG makes explicit what is being estimated and what assumptions are required.

Common causal questions presented as causal graphs, adapted from Bulbulia (2024)

A typology of measurement error bias

Measurement error is pervasive in psychology. We rarely observe the true value of a variable; instead, we observe a measured version that may differ from the truth. When measurement error is systematic rather than purely random, it can introduce bias into our causal estimates. The following figure presents four structural types of measurement error, classified along two dimensions.

A typology of measurement error bias, adapted from Bulbulia (2024)

Four types of measurement error bias

Dimension 1: Independent (undirected) versus dependent (directed). Independent measurement error means that the true value of one variable does not affect the measurement error in another. Dependent (directed) error means that one variable's true value causally influences how another variable is measured.

Dimension 2: Uncorrelated versus correlated. Uncorrelated errors have no shared cause. Correlated errors share a common cause that affects the measurement of both variables simultaneously.

The four types:

Independent, uncorrelated. Errors in $A$ and $Y$ are unrelated. Under certain conditions (notably when the true effect is zero), this type does not introduce bias, but it typically attenuates effect estimates toward zero.
Independent, correlated. Errors in $A$ and $Y$ share a common cause (for example, a survey method effect that inflates both responses). This can create a spurious association even when the true causal effect is zero.
Dependent, uncorrelated. The true treatment causally affects how the outcome is measured (or vice versa). For example, receiving a diagnosis ( $A$ ) changes how a patient reports symptoms ( $Y$ ). This creates a non-causal path from $A$ to the measured version of $Y$ .
Dependent, correlated. Both dependent and correlated errors operate simultaneously. This is the most complex scenario and can produce bias in either direction.

Understanding these types matters because the standard advice to "use reliable measures" addresses only the simplest case (independent, uncorrelated error). The other three types require structural reasoning: you need a DAG to determine whether measurement error in your study is a source of bias and, if so, in which direction.

Selection bias and transportability

Selection bias arises when the composition of the study sample differs systematically from the target population in ways that affect the treatment-outcome relationship. The most common structural source of selection bias is conditioning on a collider at baseline.

Threats to external validity, adapted from Bulbulia (2024)

Consider a study of the effect of exercise on blood pressure. Suppose health-conscious people are more probable to both exercise and volunteer for health studies, and suppose high socioeconomic status (SES) independently predicts both better health and study participation. Participation is a collider of health consciousness and SES. By restricting analysis to study participants, we condition on this collider and induce a spurious association between health consciousness and SES, which can distort the estimated effect of exercise on blood pressure.

Definition: Target, source, and analytic sample populations

The target population is the population to which investigators seek to generalise their findings.
The source population is the population from which the study sample is drawn.
The analytic sample population comprises the actual participants at baseline.

External validity requires that the causal effect estimated in the analytic sample applies to the target population. When the distribution of effect modifiers differs between sample and target, the sample ATE may not generalise, even if internal validity is perfect.

The acronym "WEIRD" (Western, Educated, Industrialised, Rich, Democratic) has drawn attention to the narrow demographic base of much psychological research. From a causal perspective, the problem is not that WEIRD samples are unrepresentative per se, but that effect modifiers may be distributed differently in the target population. If the treatment effect varies across levels of an effect modifier, and that modifier's distribution differs between sample and target, the sample ATE will not transport to the target. The simulation guide for this course demonstrates this point concretely using a transportability simulation.

Measurement invariance (which we cover in Week 10) connects to measurement error bias in cross-cultural research. If the same questionnaire measures a construct differently in different populations, then what appears to be a cross-population difference in the outcome may instead be a difference in how the outcome is measured. This is a form of correlated, directed measurement error, and it cannot be detected by standard reliability statistics.

Further reading

All open access:

Bulbulia JA (2024). "Methods in causal inference. Part 3: measurement error and external validity threats." Evolutionary Human Sciences. link
Bulbulia JA (2024). "Methods in causal inference. Part 1: causal diagrams and confounding." Evolutionary Human Sciences. link
Bulbulia JA (2024). "Methods in causal inference. Part 2: interaction, mediation, and time-varying treatments." Evolutionary Human Sciences. link

Lab materials: Lab 4: Regression and Confounding Bias

Week 5: Causal Inference: Average Treatment Effects

Date: 25 Mar 2026

Readings

Required

Hernan, M. A. & Robins, J. M. (2024). Causal Inference: What If. Chapters 1--3. link

Optional

Neal, B. (2020). Introduction to Causal Inference. Chapters 1--2. link

Key concepts for the test(s)

The Fundamental Problem of Causal Inference
Causal Inference in Randomised Experiments
Causal Inference in Observational Studies: Average (Marginal) Treatment Effects
Three Fundamental Assumptions for Causal Inference

Lab

For the lab, copy and paste code chunks following the "Part 2: Lab" section.

Terminology

In these notes, we use the terms "counterfactual outcomes" and "potential outcomes" interchangeably.

Seminar

Learning Outcomes

You will understand why causation is never directly observed.
You will understand how experiments address this "causal gap."
You will understand how applying three principles from experimental research allows human scientists to close this "causal gap" when making inferences about a population as a whole, that is, inferences about "marginal effects."

Opening

Robert Frost

Two roads diverged in a yellow wood, And sorry I could not travel both And be one traveler, long I stood And looked down one as far as I could To where it bent in the undergrowth;

Then took the other, as just as fair, And having perhaps the better claim, Because it was grassy and wanted wear; Though as for that the passing there Had worn them really about the same,

And both that morning equally lay In leaves no step had trodden black. Oh, I kept the first for another day! Yet knowing how way leads on to way, I doubted if I should ever come back.

I shall be telling this with a sigh Somewhere ages and ages hence: Two roads diverged in a wood, and I -- I took the one less traveled by, And that has made all the difference.

-- The Road Not Taken

Introduction: The Estrogen Therapy Paradox

Why observational data can mislead

In the 1980s and 1990s, observational studies found that postmenopausal women using estrogen replacement therapy had substantially lower all-cause mortality than non-users, with a hazard ratio of 0.68 (current users vs never-users). Major medical organisations endorsed estrogen therapy:

1992 American College of Physicians: "Women who are at increased risk of coronary heart disease are probable to benefit from hormone therapy."
1996 American Heart Association: "ERT does look promising as a long-term protection against heart attack."

The Women's Health Initiative (WHI) then ran a massive randomised, double-blind, placebo-controlled trial with 16,000 women. The result: a hazard ratio of 1.23 for initiators versus non-initiators. The treatment increased risk.

What went wrong? The observational studies were confounded. Women who chose estrogen therapy differed systematically from those who did not, in ways that also affected their health outcomes. The story illustrates why the three assumptions covered in this lecture (causal consistency, exchangeability, positivity) are not mere formalities.

Motivating Example

Consider the following cross-cultural question:

Does bilingualism improve cognitive abilities in children?

There is evidence that bilingual children perform better at cognitive tasks, but is this improvement truly caused by learning more than one language, or is it confounded by other factors (e.g., cultural environment, parental influence)? How can we know? Each child might answer, like the traveller in Frost's poem:

"And sorry I could not travel both. And be one traveler $\dots$ "

Part 1: The Fundamental Problem of Causal Inference as a Missing Data Problem

The fundamental problem of causal inference is that causality is never directly observed.

Let $Y$ and $A$ denote random variables.

We formulate a causal question by asking whether experiencing an exposure $A$ , when this exposure is set to level $A = a$ , would lead to a difference in $Y$ , compared to what would have occurred had the exposure been set to a different level, say $A = a^{'}$ will lead to a difference in outcome $Y$ . For simplicity, we imagine binary exposure such that $A = 1$ denotes receiving the "bilingual" exposure and $A = 0$ denotes receiving the "monolingual" exposure. Assume these are the only two exposures of interest:

Let:

$Y_{i} (a = 1)$ denote the cognitive ability of child $i$ if the child were bilingual (potential outcome when $A_{i} = 1$ ).
$Y_{i} (a = 0)$ denote the cognitive ability of child $i$ if the child were monolingual (potential outcome when $A_{i} = 0$ ).

What does it mean to quantify a causal effect? We may define the individual-level causal effect of bilingualism on cognitive ability for child $i$ as the difference between two states of the world: one for which the child experiences a bilingual exposure and the other for which the child does not. We write this contrast by referring to the potential outcomes under different levels of exposure:

$Causal Effect_{i} = Y_{i} (1) - Y_{i} (0) .$

We say there is a causal effect of the bilingual exposure if

$Y_{i} (1) - Y_{i} (0) \neq = 0.$

Because each child experiences only one exposure condition in reality, we cannot directly compute this difference from any dataset. The missing observation is called the counterfactual:

If $Y_{i} ∣ A_{i} = 1$ is observed, then $Y_{i} (0) ∣ A_{i} = 1$ is counterfactual.
If $Y_{i} ∣ A_{i} = 0$ is observed, then $Y_{i} (1) ∣ A_{i} = 1$ is counterfactual.

"And sorry I could not travel both / And be one traveler, long I stood $\dots$ "

In short, individuals cannot simultaneously experience both exposure conditions, so one outcome is inevitably missing.

How can we make contrasts between counterfactual (potential) outcomes?

Challenging material ahead

The next three sections introduce the formal assumptions that underpin all causal inference. The notation is dense, but the core ideas are intuitive. Focus on the plain-language interpretation, not the formalism. Each assumption is followed by a "key intuition" box.

Fundamental Assumption 1: Causal Consistency

Causal consistency means that the potential outcome corresponding to the exposure an individual actually receives is exactly what we observe. In other words, if individual $i$ receives exposure $a$ , then the potential outcome (or equivalently the counterfactual outcome under a given level of exposure $A = a$ , that is $Y_{i} (a)$ ) is equivalent to the observed outcome: $Y_{i} ∣ A_{i} \equiv a$ . Where the symbol $\equiv$ means "equivalent to", when we assume that the causal consistency assumption is satisfied, we assume that:

$counterfactual Y_{i} (1) counterfactual Y_{i} (0) \equiv observable (Y_{i} ∣ A_{i} = 1), \equiv observable (Y_{i} ∣ A_{i} = 0) .$

Notice however that we cannot generally obtain individual causal effects because at any given time, each individual may only receive at most one level of an exposure. Where the symbol $⟹$ means "implies," at any given time, receiving one level of an exposure precludes receiving any other level of that exposure:

$Y_{i} ∣ A_{i} = 1 ⟹ Y_{i} (0) ∣ A_{i} = 1 is counterfactual$

Likewise:

$Y_{i} ∣ A_{i} = 0 ⟹ Y_{i} (1) ∣ A_{i} = 1 is counterfactual$

Because of the laws of physics (above the atomic scale), an individual can experience only one exposure level at any moment. Consequently, we can observe only one of the two counterfactual outcomes needed to quantify a causal effect. This is the fundamental problem of causal inference. Counterfactual contrasts cannot be individually observed.

However, because of the causal consistency assumption, we can nevertheless recover half of the missing counterfactual (or "potential") outcomes needed to estimate average treatment effects. We may do this if two other assumptions are satisfied.

Key intuition: causal consistency

Causal consistency says: "what you see is what you get." If a person actually received the treatment, their observed outcome equals their potential outcome under treatment. This seems obvious, but it fails when treatments are vaguely defined (e.g., "exercise more") because different versions of the treatment could produce different outcomes.

Fundamental Assumption 2: Exchangeability

Exchangeability justifies recovering unobserved counterfactuals from observed outcomes and averaging them. By accepting that $Y_{i} (a) = Y_{i}$ if $A_{i} = a$ , we can estimate population-level average potential outcomes. In an experiment where exposure groups are comparable, we define the Average Treatment Effect (ATE) as:

$ATE = E [Y (1)] - E [Y (0)] = E (Y ∣ A = 1) - E (Y ∣ A = 0) .$

Because randomisation ensures that missing counterfactuals are exchangeable with those observed, we can still estimate $E [Y (a)]$ . For instance, assume:

$counterfactual E [Y (1) ∣ A = 1] = unobservable E [Y (1) ∣ A = 0] = observed (Y_{i} ∣ A_{i} = 1)$

which lets us infer the average outcome if everyone were treated. Likewise, if

$counterfactual E [Y (0) ∣ A = 0] = unobservable E [Y (1) ∣ A = 0] = observed E (Y ∣ A_{i} = 0)$

then we can infer the average outcome if everyone were given the control. The difference between these two quantities gives the ATE:

$ATE = [E [Y (1) ∣ A = 1] by consistency: \equiv observed E [Y ∣ A = 1] + E [Y (1) ∣ A = 0] by exchangeability: unobservable, yet \equiv E [Y ∣ A = 1]] - [E [Y (0) ∣ A = 0] by consistency: \equiv observed E [Y ∣ A = 0] + E [Y (0) ∣ A = 1] by exchangeability: unobservable, yet \equiv E [Y ∣ A = 0]]$

We have it that $E [Y ∣ A = 1]$ and $E [Y ∣ A = 0]$ and $E [Y (1) ∣ A = 0]$ are observed. If both consistency and exchangeability are satisfied then we may use these observed quantities to identify contrasts of counterfactual quantities.

Thus, although individual-level counterfactuals are missing, the consistency assumption and the exchangeability assumption allow us to identify the average effect of treatment using observed data. Randomised controlled experiments allow us to meet these assumptions. Randomisation warrants the exchangeability assumption. Control warrants the consistency assumption.

Key intuition: exchangeability

Exchangeability says: "the treated group is a valid stand-in for the untreated group, and vice versa." In a randomised experiment, this holds by design. In observational data, we try to achieve it by conditioning on all common causes of treatment and outcome (the confounders identified in your DAG).

Fundamental Assumption 3: Positivity

There is one further assumption, called positivity. It states that treatment assignments cannot be deterministic. That is, for every covariate pattern $L = l$ , each individual has a non-zero probability of receiving every treatment level to be compared:

$P (A = a ∣ L = l) > 0.$

Randomised experiments achieve positivity by design, at least for the sample that is selected into the study. In observational settings violations occur if some subgroups never receive a particular treatment. If treatments occur but are rare, we may have sufficient data from which to obtain convincing causal inferences.

Positivity is the only assumption that can be verified with data. We will consider how to assess this assumption using data when we develop our data analytic workflows in the second half of this course.

Key intuition: positivity

Positivity says: "everyone could plausibly receive either treatment." If some people would never receive the treatment (e.g., a drug contraindicated for their condition), we cannot estimate what would happen if they did. Positivity is the only one of the three assumptions we can check directly with data.

Section summary

The three assumptions work together. Causal consistency links potential outcomes to observed data. Exchangeability lets us use the treated group's outcomes to infer what would have happened to the untreated group (and vice versa). Positivity ensures that both groups exist at every covariate combination. When all three hold, the ATE is identified from observed data.

Challenges with Observational Data

1. Satisfying Causal Consistency is Difficult in Observational Settings

Below are some ways in which real-world complexities can violate causal consistency in observational studies. Causal consistency requires there is no interference between units (also called "SUTVA" or "Stable Unit Treatment Value"). Causal consistency also requires that each treatment level is well-defined and applied uniformly. If these conditions fail, then $Y (a)$ may not reflect a consistent exposure across individuals. We are then comparing apples with oranges. Consider some examples:

Cultural differences: one group's "second-language exposure" may differ qualitatively from another's if cultural norms shape how, when, or by whom the second language is taught. A child in a bilingual community may receive diverse and immersive language experiences all of which are coded as $A = 1$ . Yet the treatments might be quite different. We might be comparing apples with oranges.
Age of acquisition: the developmental effect of learning a second language may vary by when the child is exposed. Comparing acquisition at, say, age two with acquisition at, say, age twelve might be comparing apples with oranges.
Language variation: sign languages, highly tonal languages, or unwritten languages may demand different cognitive tasks than spoken, nontonal, or widely documented languages. Lumping them together as "learning a second language" can obscure the fact that these distinct exposures might produce fundamentally different outcomes. Again, comparisons here would be of apples with oranges.

These sources of heterogeneity reveal why careful delineation of treatments is crucial. If the actual exposures differ across individuals, then consistency $(Y_{i} (a) = Y_{i} ∣ A_{i} = a)$ may fail, because $A = a$ is not the same phenomenon for everyone.

2. Conditional Exchangeability (No Unmeasured Confounding) Is Difficult to Achieve

In theory, we can identify a causal effect from observational data if all confounders $L$ are measured. Formally, we need the potential outcomes to be independent of treatment once we condition on $L$ . One way to express this assumption is: $Y (a) ∐ A ∣ L$ . If the potential outcomes are independent of treatment assignment, we can identify the Average Treatment Effect (ATE) as:

$ATE = l \sum [E (Y ∣ A = 1, L = l) - E (Y ∣ A = 0, L = l)] Pr (L = l) .$

In randomised experiments, conditioning is automatic because $A$ is unrelated to potential outcomes by design. In observational studies, ensuring or approximating such conditional exchangeability is often difficult. For example, bilingualism research would need to consider:

Cultural histories: cultures that value language acquisition might also value knowledge acquisition. Associations might arise from culture, not causation.
Personal values: families who place a high priority on bilingualism may also promote other developmental resources.

If important confounders go unmeasured or are poorly measured, these differences can bias causal effect estimates.

3. The Positivity Assumption May Fail: Treatments Might Not Exist for All

Positivity requires that each individual could, in principle, receive any exposure level. In real-world observational settings, some groups have no access to bilingual education (or no reason to be monolingual), making certain treatment levels impossible for them. If a treatment level does not appear in the data for a given subgroup, any causal effect estimate for that subgroup is purely an extrapolation (Westreich & Cole, 2010; Hernan & Robins, 2023).

Summary

We introduced the fundamental problem of causal inference by distinguishing correlation (associations in the data) from causation (contrasts between potential outcomes, of which only one can be observed for each individual).

Randomised experiments address this problem by balancing confounding variables across treatment levels. Although individual causal effects are unobservable, random assignment allows us to infer average causal effects, also called marginal effects.

In observational data, inferring average treatment effects demands that we satisfy three assumptions that are automatically satisfied in (well-conducted) experiments: causal consistency, exchangeability, and positivity. These assumptions ensure that we can compare like-with-like (that the population-level treatment effect is consistent across individuals), that there are no unmeasured common causes of the exposure and outcomes that may lead to associations in the absence of causality, and that every exposure level is a real possibility for each subgroup.

Lab materials: Lab 5: Average Treatment Effects

Appendix A: Notation Variants

Consider that in the causal inference literature, we may write the contrast of two potential outcomes under treatment as:

$Causal Effect_{i} = Y_{i}^{a = 1} - Y_{i}^{a = 0}$

$Causal Effect_{i} = Y_{i}^{1} - Y_{i}^{0}$

or:

$Causal Effect_{i} = Y_{1} - Y_{0}$

Where subscripts are dropped. You will soon encounter many notational variants across sources.

Week 6: Effect Modification and CATE

Date: 2 Apr 2026

Required Reading

Hernan and Robins (2024) What If, Chapters 4--5. link

Optional Reading

VanderWeele (2007) "Four types of effect modification." link
VanderWeele (2009) "On the distinction between interaction and effect modification." link

Key Concepts for Assessment

Causal Estimand: The specific causal question we want to answer (e.g., the average effect in the whole population).
Statistical Estimand: The calculation we perform on our data to try and answer the causal question.
Interaction: The combined effect of two or more interventions.
Effect Modification: When the effect of one intervention changes depending on a person's characteristics.
Heterogeneous Treatment Effects (HTE): The general idea that treatment effects vary across people.
Conditional Average Treatment Effect (CATE) $τ (x)$ : The average treatment effect for a specific subgroup defined by characteristics $x$ .
Estimated Conditional Average Treatment Effect $\overset{τ}{^} (X)$ : Our estimate (from data) of the average treatment effect for a subgroup with characteristics $X$ .

What this lecture covers

This lecture covers four related concepts: (1) the distinction between interaction and effect modification, (2) the conditional average treatment effect (CATE), (3) why treatment effects vary across individuals, and (4) how to move from "does it work on average?" to "for whom does it work best?". The appendices contain formal derivations; the main text focuses on intuition.

If you learn nothing else from this course...

To answer psychological questions properly, we first need to state them very clearly. Causal inference gives us the tools to do this.

Causal inference asks: "What if?"

The core idea is to compare what actually happened with what could have happened under different conditions (these "what ifs" are called counterfactuals).

Imagine an outcome we care about, like student test scores ( $Y$ ). We might compare the score if everyone got a new teaching method (let's call this condition $a^{*}$ ) versus if everyone got the old method (condition $a$ ).

The difference in the potential outcome ( $Y$ ) under these two scenarios is the causal effect for one person: $Y (a^{*}) - Y (a)$ .

Since we can't see both scenarios for the same person, we often look at the average effect across a group. The Average Treatment Effect (ATE) is the average difference in potential outcomes across the whole population:

$ATE = E [Y (a^{*}) - Y (a)]$

This asks: "On average, how much would scores change if we switched everyone from the old method ( $a$ ) to the new method ( $a^{*}$ )?".

A big challenge is dealing with confounders, other factors that mix up the relationship between the treatment ( $A$ ) and the outcome ( $Y$ ), potentially misleading us about the true causal effect. We need to account for these.

What do 'Interaction' and 'Effect Modification' mean in Causal Inference?

Words like 'moderation' and 'interaction' are often used loosely. Causal inference needs precise terms.

We'll focus on two specific ideas:

Interaction: About the effect of combining interventions.
Effect Modification: About how the effect of one intervention changes for different types of people.

Interaction: The Effect of Teamwork (or Lack Thereof)

Interaction in causal inference is about joint interventions. We look at what happens when we apply two or more different treatments at the same time.

Let's say we have two treatments, $A$ and $B$ , and an outcome $Y$ .

$Y (a)$ is the potential outcome if we set treatment $A$ to level $a$ .
$Y (b)$ is the potential outcome if we set treatment $B$ to level $b$ .
$Y (a, b)$ is the potential outcome if we set $A$ to $a$ and $B$ to $b$ simultaneously.

To figure out these effects from observational data, we usually need assumptions like:

Consistency: The outcome we see for someone who got treatment $a$ is the same as their potential outcome $Y (a)$ .
Conditional Exchangeability (No Unmeasured Confounding): We can make the groups receiving different treatments comparable by adjusting for measured confounders ( $L$ for $A \to Y$ , and $Q$ for $B \to Y$ ). The sets $L$ and $Q$ might overlap.
Positivity: The exposures to be compared occur in all subgroups.

To study the interaction between $A$ and $B$ , we need to be able to estimate the effect of $A$ and the effect of $B$ , which means we need to adjust for all confounders in both $L$ and $Q$ (i.e., their union $L \cup Q$ ).

Defining Interaction: Does 1 + 1 = 2?

Let's use an education example:

$A$ : New teaching method (1=New, 0=Old)
$B$ : Extra tutoring (1=Yes, 0=No)
$Y$ : Test score

Is the boost in scores from getting both the new method and tutoring simply the sum of the boost from only the new method and the boost from only tutoring?

We define causal interaction on the additive scale (looking at differences) by comparing the effect of the joint intervention to the sum of the individual effects (all compared to getting neither):

$Effect of Both (E [Y (1, 1)] - E [Y (0, 0)]) vs Effect of A only (E [Y (1, 0)] - E [Y (0, 0)]) + Effect of B only (E [Y (0, 1)] - E [Y (0, 0)])$

Interaction exists if these are not equal. This simplifies to checking if the following is non-zero (see Appendix A):

$Both E [Y (1, 1)] - A only E [Y (1, 0)] - B only E [Y (0, 1)] + Neither E [Y (0, 0)] \neq = 0$

If this is positive: Synergy (the combination is better than expected).
If this is negative: Antagonism (the combination is worse than expected).

(We could also look at interaction on other scales, like ratios, which might give different answers. Always state the scale you're using; we'll come back to this in later lectures.)

Finding Causal Interaction in Data

To estimate this interaction, we need valid estimates for all four average potential outcomes:

$E [Y (0, 0)], E [Y (1, 0)], E [Y (0, 1)], E [Y (1, 1)]$

This means we must control for confounders of both the $A \to Y$ link and the $B \to Y$ link.

The following figure shows this. $L_{A}$ are confounders for $A \to Y$ , and $L_{B}$ are confounders for $B \to Y$ . We need to block the backdoor paths (red arrows).

Diagram illustrating causal interaction. Assessing the joint effect of two interventions, A (e.g., teaching method) and B (e.g., tutoring), on outcome Y (e.g., test score). L_A represents confounders of the A-Y relationship, and L_B represents confounders of the B-Y relationship. Red arrows indicate biasing backdoor paths requiring adjustment.

The next figure shows we need to condition on (adjust for) both $L_{A}$ and $L_{B}$ .

Identification of causal interaction requires adjusting for all confounders of A-Y (L_A) and B-Y (L_B). Boxes around L_A and L_B indicate conditioning, closing backdoor paths.

In our education example:

$L_{A}$ (Confounders for Teaching Method $\to$ Score): Prior achievement, motivation, family background (SES), school quality, teacher differences (if not randomly assigned).
$L_{B}$ (Confounders for Tutoring $\to$ Score): Prior achievement, motivation, family background (SES, paying for tutoring), student availability, specific learning needs.

Notice that prior achievement and motivation are in both $L_{A}$ and $L_{B}$ . We need to measure and adjust for all important factors in $L_{A}$ and $L_{B}$ to get a reliable estimate for interaction.

Effect Modification: Different Effects for Different People

Unlike interaction (about combining treatments), effect modification is about whether the causal effect of a single intervention ( $A$ ) on an outcome ( $Y$ ) is different for different subgroups in the population. These subgroups are defined by baseline characteristics (like age, sex, prior history; let's call these $G$ or $X$ ).

Effect modification helps us understand who benefits most (or least) from an intervention. We explore this using ideas like Heterogeneous Treatment Effects (HTE) and Conditional Average Treatment Effects (CATE).

Heterogeneous Treatment Effects (HTE): The Idea of Variation

Heterogeneous Treatment Effects (HTE) just means that the effect of a treatment ( $A$ on $Y$ ) isn't the same for everyone. The effect varies. This variation is effect modification.

Why does it vary?

Differences in things we can measure (like age, sex, baseline health, our $X$ variables).
Differences in things we can't easily measure (like genetics, unmeasured background factors).

HTE is the reality; treatments rarely work identically for all.

Conditional Average Treatment Effect (CATE): Measuring Variation with Data $τ (x)$

To study HTE using data, we focus on the Conditional Average Treatment Effect (CATE). CATE is a specific causal question (estimand): What is the average treatment effect for the subgroup of people who share specific measured characteristics $X = x$ ?

$τ (x) = CATE (x) = E [Y (1) - Y (0) ∣ X = x]$

Here, $Y (1)$ is the potential outcome with treatment, $Y (0)$ without. $τ (x)$ tells us the average effect specifically for people with characteristics $X = x$ . By looking at how $τ (x)$ changes for different $x$ , we quantify effect modification by the characteristics we measured in X.

Comparing Effects Across Defined Groups

A simple way to check for effect modification by a category $G$ (like comparing males vs females, or different locations) is to estimate the Average Treatment Effect (ATE) separately within each group. This is like comparing CATEs where $X$ is just the group variable $G$ .

Let's say $A$ is the treatment (0=control, 1=treated) and $G$ is the potential modifier (e.g., $g =$ female, $g^{'} =$ male).

We compare:

The average effect for females ( $G = g_{1}$ ): $δ_{g_{1}} = E [Y (1) ∣ G = g_{1}] - E [Y (0) ∣ G = g_{1}]$
The average effect for males ( $G = g_{2}$ ): $δ_{g_{2}} = E [Y (1) ∣ G = g_{2}] - E [Y (0) ∣ G = g_{2}]$

Effect modification by $G$ exists if these are different: $γ = δ_{g_{1}} - δ_{g_{2}} \neq = 0$ .

If our estimate $\overset{γ}{^}$ is far from zero, it suggests the treatment effect differs between males and females.

Finding Effect Modification in Data

To estimate these group-specific effects ( $δ_{g}$ ) and their difference ( $γ$ ) correctly, we need to control for confounders ( $L$ ) of the $A \to Y$ relationship within each group defined by $G$ . Note that we are not estimating the causal effect of $G$ . As such, we do not need to control for things that cause $G$ itself, unless they also confound the $A \to Y$ relationship (i.e., are also in $L$ ).

Look at the following figure. To estimate the $A \to Y$ effect within levels of $G$ , we need to adjust for the confounders $L_{0}$ . However, this will partially block the effect-modification of $G$ on $Y$ because $L_{0}$ is a mediator for that path. Moreover, if we were identifying the causal effect of $G$ on $Y$ , after conditioning on $L$ , we would find that a backdoor path opens from $G \to Y$ because $L$ is a collider. $G$ does not have a causal interpretation in this model. However we would be wrong to think that $G$ is not an effect modifier of the effect of $A$ on $Y$ . (See Appendix C.)

How shall we investigate effect modification of A on Y by G? Can you see the problem?

Thus it is essential to understand that when we control for confounding along the $A \to Y$ path, we do not identify the causal effects of effect-modifiers.

To clarify:

If the statistical model correctly identifies causal effect modification (by appropriately handling confounders and avoiding collider bias), then it has prognostic value regarding the differential outcomes expected under intervention $A = 1$ vs $A = 0$ depending on $G = g$ .
If the statistical model contains interaction terms that are artefacts of bias (like conditioning on a collider) or reflect a different target (like conditioning on a mediator when the total effect was intended), its causal prognostic value is compromised or needs careful interpretation. It might still predict $Y$ well given $A$ , $G$ , and $L$ in an observational setting, but it wouldn't accurately predict the results of intervening on A differently for different $G$ groups. (See Appendix B.)

The choice of variables fundamentally determines which causal question (if any) the statistical model is estimating. As with the average treatment effect, we interpret evidence for effect modification in the context of our assumptions about the causal relationships that obtain in the world. This is because the statistical interaction we observe is highly sensitive to model choices. Any interpretation as causal effect modification, and therefore any reliable prognostic value for intervention effects within different segments of the population, depends entirely on whether and how our statistical model accounts for the causal structure.

Estimating How Effects Vary: Getting $\overset{τ}{^} (x)$ from Data

We defined the Conditional Average Treatment Effect (CATE), $τ (x)$ , as the true average effect for a subgroup with specific features $X = x$ :

$τ (x) = E [Y (1) - Y (0) ∣ X = x]$

Now, we want to estimate this from our actual data. We call our estimate $\overset{τ}{^} (x)$ . For any person $i$ in our study with features $X_{i}$ , the value $\overset{τ}{^} (X_{i})$ is our data-based prediction of the average treatment effect for people like person i.

"Personalised" Effects vs. True Individual Effects

Wait: didn't we say we can't know the true effect for one specific person, $Y_{i} (1) - Y_{i} (0)$ ? Yes, that's still true.

So what does $\overset{τ}{^} (X_{i})$ mean?

Individual Causal Effect (Unknowable): $Y_{i} (1) - Y_{i} (0)$ . This is the true effect for person $i$ . We can't observe both $Y_{i} (1)$ and $Y_{i} (0)$ .
Estimated CATE ( $\overset{τ}{^} (X_{i})$ ) (What we calculate): This is our estimate of the average effect, $E [Y (1) - Y (0)]$ , for the subgroup of people who share the same measured characteristics $X_{i}$ as person $i$ .

When people talk about "personalised" or "individualised" treatment effects in this context, they usually mean $\overset{τ}{^} (x)$ . It's "personalised" because the prediction uses person $i$ 's specific characteristics $X_{i} = x$ . But remember, it's an estimated average effect for a group, not the unique effect for that single individual.

People Have Many Characteristics

People aren't just in one group; they have many features at once. A student might be:

Female
21 years old
From a low-income family
Did well on previous tests
Goes to a rural school
Highly motivated

All these factors ( $X_{i}$ ) together might influence how they respond to a new teaching method.

Trying to figure this out with traditional regression by manually adding interaction terms (like A*gender*age*income*...) becomes impossible very quickly:

Too many combinations, not enough data in each specific combo.
High risk of finding "effects" just by chance (false positives).
Hard to know which interactions to even include.
Can't easily discover unexpected patterns.

Thus, while simple linear regression with interaction terms (lm(Y ~ A * X1 + A * X2)) can estimate CATEs if the model is simple and correct, it often fails when things get complex (many $X$ variables, non-linear effects).

Causal forests (using the grf package in R; Athey, Tibshirani, and Wager, 2024) are a powerful, flexible alternative designed for this task. They build decision trees that specifically aim to find groups with different treatment effects.

Demo: Why Functional Form Matters (Even Without Confounding)

The following simulation makes this concrete. Treatment is randomised, so there is no confounding at all. The only question is whether each method can recover the true effect surface $τ (x)$ , which is deliberately non-linear (it includes sinusoidal, threshold, and interaction components that regression cannot capture).

# install once
# remotes::install_github("go-bayes/causalworkshop@v0.2.1")
library(causalworkshop)

# simulate data with non-linear heterogeneous effects
d <- simulate_nonlinear_data(n = 2000, seed = 2026)

# compare four estimation methods
results <- compare_ate_methods(d)

# summary table: ATE and individual-level RMSE
results$summary

# plot: predicted vs true treatment effects
results$plot_comparison

# plot: estimated effect as a function of x1
results$plot_by_x1

All four methods recover the correct ATE (because treatment is randomised, a simple difference in means suffices). The RMSE column tells a different story: it measures how well each method recovers the true effect for each individual. OLS with linear interactions compresses heterogeneity to a flat plane. Polynomial regression adds curvature but remains constrained by its parametric form. The GAM and causal forest learn the effect surface non-parametrically and achieve substantially lower RMSE.

The lesson: when treatment effects vary non-linearly across individuals, the functional form you assume determines whether you can detect that variation, even in a randomised experiment. Regression does not fail because of confounding here; it fails because it assumes a shape that the data do not follow.

We'll learn how to use grf after the mid-term break. It will allow us to get the $\overset{τ}{^} (x)$ predictions and then think about how to use them, for instance, to prioritise who gets a treatment if resources are limited.

Summary

Let's revisit the core concepts.

Interaction:

Think: Teamwork effect.
What: Effect of two or more different interventions ( $A$ and $B$ ) applied together.
Question: Is the joint effect $E [Y (a, b)]$ different from the sum of individual effects?
Needs: Control confounders for all interventions involved ( $L \cup Q$ ).

Effect Modification / HTE / CATE:

Think: Different effects for different groups.
What: Effect of a single intervention ( $A$ ) varies depending on people's baseline characteristics ( $G$ or $X$ ).
Question (HTE): Does the effect vary? (The phenomenon.)
Question (CATE $τ (x)$ ): What is the average effect for a specific subgroup with features $X = x$ ? (The measure.)
Needs: Control confounders for the single intervention ( $L$ ) within subgroups.

Estimated "Individualised" Treatment Effects ( $\overset{τ}{^} (x)$ ):

Think: Personal profile prediction.
What: Our estimate of the average treatment effect for the subgroup of people sharing characteristics $X_{i}$ .
How: Calculated using models (like causal forests) that use the person's full profile $X_{i}$ .
Important: This is not the true effect for that single person (which is unknowable). It's an average for people like them.
Use: Explore HTE, identify subgroups, potentially inform targeted treatment strategies.

Keeping these concepts distinct helps us ask clear research questions and choose the right methods.

Course Review So Far: A Quick Recap

Let's quickly review the main ideas of causal inference we've covered.

The Big Question: Does A cause Y?

Causal inference helps us answer if something (like a teaching method, $A$ ) causes a change in something else (like test scores, $Y$ ).

Core Idea: "What If?" (Counterfactuals)

We compare what actually happened to what would have happened in a different scenario.

$Y (1)$ : Score if the student had received the new method.
$Y (0)$ : Score if the student had received the old method.

The Average Treatment Effect (ATE) = $E [Y (1) - Y (0)]$ is the average difference across the whole group.

This Lecture Clarified Concepts of Interaction vs. Effect Modification vs. Individual Predictions

Interaction (Think: Teamwork Effects)

About: Combining two different interventions (A and B).
Question: Does using both A and B together give a result different from just adding up their separate effects? (e.g., new teaching method + tutoring).
Needs: Analyse effects of A alone, B alone, and A+B together. Control confounders for both A and B.

Effect Modification (Think: Different Effects for Different Groups)

About: How the effect of one intervention (A) changes based on people's characteristics (X, like prior grades).
Question: Does the teaching method (A) work better for high-achieving students (X=high) than low-achieving students (X=low)?
- HTE: The idea that effects differ.
- CATE $τ (x)$ : The average effect for the specific group with characteristics $X = x$ .
Needs: Analyse effect of A within different groups (levels of X). Control confounders for A.

Estimated Individualised Effects ( $\overset{τ}{^} (X_{i})$ ) (Think: Personal Profile Prediction)

About: Using a person's whole profile of characteristics ( $X_{i}$ , age, gender, background, etc.) to predict their probable response to treatment A.
How: Modern methods (like causal forests) take all of $X_{i}$ and estimate $\overset{τ}{^} (X_{i})$ .
Result: this $\overset{τ}{^} (X_{i})$ is not the true unknowable effect for person $i$ . It is the estimated average effect for people similar to person i (sharing characteristics $X_{i}$ ).
Use: helps explore if tailoring treatment based on these profiles ( $X_{i}$ ) could be beneficial.

Simple Summary

Interaction: Do A and B work together well/badly?
Effect Modification: Does A's effect depend on who you are (based on X)?
$\overset{τ}{^} (X_{i})$ : Can we predict A's average effect for someone based on their specific profile $X_{i}$ ?

Understanding these differences is key to doing good causal research.

Lab materials: Lab 6: CATE and Effect Modification

Supplementary material

The appendices below contain formal derivations. They are included for completeness but are not required for the test. Focus on the main text above.

Appendix A: Simplification of Additive Interaction Formula

We start with the definition of additive interaction based on comparing the joint effect relative to baseline versus the sum of individual effects relative to baseline:

$(E [Y (1, 1)] - E [Y (0, 0)]) - [(E [Y (1, 0)] - E [Y (0, 0)]) + (E [Y (0, 1)] - E [Y (0, 0)])]$

First, distribute the negative sign across the terms within the square brackets:

$E [Y (1, 1)] - E [Y (0, 0)] - (E [Y (1, 0)] - E [Y (0, 0)]) - (E [Y (0, 1)] - E [Y (0, 0)])$

Now remove the parentheses, flipping the signs inside them where preceded by a minus sign:

$E [Y (1, 1)] - E [Y (0, 0)] - E [Y (1, 0)] + E [Y (0, 0)] - E [Y (0, 1)] + E [Y (0, 0)]$

Next, combine the $E [Y (0, 0)]$ terms:

We have $- E [Y (0, 0)]$
Then $+ E [Y (0, 0)]$ (these two cancel each other out)
And another $+ E [Y (0, 0)]$ remains.

The expression simplifies to:

$E [Y (1, 1)] - E [Y (1, 0)] - E [Y (0, 1)] + E [Y (0, 0)]$

This is the standard definition of additive interaction, often called the interaction contrast. If this expression equals zero, there is no additive interaction; a non-zero value indicates an interaction effect.

This shows clearly that interaction is the deviation of the joint effect from the sum of the separate effects, adjusted for the baseline.

Appendix B: Evidence for Effect-Modification is Relative to Inclusion of Other Variables in the Model

The 'sharp-null hypothesis' states there is no effect of the exposure on the outcome for any unit in the target population. Unless the sharp-null hypothesis is false, there may be effect-modification. For any study worth conducting, we cannot evaluate whether the sharp-null hypothesis is false. If we could, the experiment would be otiose. Therefore, we must assume the possibility of effect-modification. Whether a variable is an effect-modifier also depends on which other variables are included in the model. That is, just as for the concept of a 'confounder', where a variable is an 'effect-modifier' cannot be stated without reference to an assumed causal order and an explicit statement about which other variables will be included in the model (VanderWeele, 2012).

As illustrated in the following figure, the marginal association between $A$ and $Y$ is unbiased. Here, exposure $A$ is unconditionally associated with $Y$ . We use the convention that a solid blue box (e.g., $G$ ) denotes effect-modification with conditioning and a dashed blue circle (e.g., Z in a dashed circle) indicates effect-modification without conditioning.

Consider a randomised experiment. There is no confounding. The marginal association between A and Y provides an unbiased estimate for the causal effect of A on Y. G modifies A's effect on Y by proxy, in virtue of G's relationship to Z, which is a direct effect modifier for the effect of A on Y.

The next figure presents the same randomised experiment as in the previous causal diagram. We again assume that there is no confounding of the marginal association between the exposure, $A$ , and the outcome, $Y$ . However, suppose we were to adjust for $Z$ and ask: does the conditional association of $A$ on $Y$ vary within levels of $G$ , after adjusting for $Z$ ? That is, does $G$ remain an effect-modifier of the exposure on the outcome? VanderWeele (2007) proved that for effect-modification to occur, at least one other arrow besides the treatment must enter into the outcome. According to the diagram, the only arrow into $Y$ other than $A$ arrives from $Z$ . Because $Y$ is independent of $G$ conditional on $Z$ , we may infer that $G$ is no longer an effect modifier for the effect of $A$ on $Y$ . Viewed another way, $G$ no longer co-varies with $Y$ conditional on $Z$ and so cannot act as an effect-modifier.

Conditioning on Z renders G independent of Y. G is no longer an effect modifier after conditioning on Z because G is independent of Y. Although Z is an unconditional effect modifier, G is not.

The following figure presents the same randomised experiment as in the previous graph. We assume a true effect of $A \to Y$ . If we do not condition on $B$ , then $G$ will not modify the effect of $A \to Y$ because $G$ will not be associated with $Y$ . However, if we were to condition on $B$ , then both $B$ (an effect modifier by proxy) and $G$ may become effect-modifiers for the causal effect of $A$ on $Y$ . In this setting, both $B$ and $G$ are conditional effect-modifiers.

Note that causal graphs help us to evaluate classifications of conditional and unconditional effect modifiers. They may also help to clarify conditions in which conditioning on unconditional effect-modifiers may remove conditional effect-modification. However, we cannot tell from a causal diagram whether the ancestors of an unconditional effect-modifier will be conditional effect-modifiers for the effect of the exposure on the outcome; see VanderWeele (2007), also Suzuki et al. (2013). Causal diagrams express non-parametric relations. I have adopted an off-label colouring convention to denote instances of effect-modification to highlight possible pathways for effect-modification, which may be relative to other variables in a model.

Blue path denotes effect-modification for G by conditioning on B. Both B and G are conditional effect modifiers.

The next figure reveals the relativity of effect-modification. If investigators do not condition on $B$ , then $G$ cannot be a conditional effect-modifier because $G$ would then be independent of $Z$ because $B$ is a collider. However, as we observed in the previous figure, conditioning on $B$ (a collider) may open a path for effect-modification of $G$ by $Z$ . Both $B$ and $G$ are conditional effect modifiers.

Blue path denotes effect-modification. Here G is not an effect modifier because B, a common effect (collider) of G and Z, is not conditioned on. Any conditional effect modification for G would require conditioning on B, and not conditioning on G. Otherwise G will be d-separated from Y.

The final figure considers the implications of conditioning on $Z$ , which is the only unconditional effect-modifier on the graph. If $Z$ is measured, conditioning on $Z$ will remove effect-modification for $B$ and $G$ because $B, G ⊥ ⊥ Y ∣ Z$ . This example again reveals the context dependency of effect-modification. Here, causal diagrams are useful for clarifying features of dependent and independent effect modification. For further discussion, see Suzuki et al. (2013) and VanderWeele (2009).

Blue path denotes effect-modification. Neither G nor B are unconditional effect-modifiers for the effect of A on Y after Z is conditioned upon. If investigators condition on Z, the causal diagram implies they will not find evidence for effect-modification by B or G, which are conditionally independent of Y once Z is conditioned upon.

Appendix C: Further Clarification on Effect Modification Without Statistical Evidence for It

Again, look at the effect modification DAG from the seminar. Suppose the investigator models the effect of $A$ on $Y$ . Suppose $G$ is not associated with $Y$ conditional on $L$ . The DAG provides structural clarification for why concluding there is no effect modification by $G$ is incorrect.

To clarify:

We have a DAG structure: $G \to L$ ; $L \to A$ , $L \to Y$
We're interested in the conditional average treatment effect (CATE): $τ (g) = E [Y (1) - Y (0) ∣ G = g]$
To identify the effect of $A$ on $Y$ , we must condition on $L$
$G$ is (partially or fully) d-separated from $Y$ conditional on $L$

Even though $G$ is d-separated from $Y$ given $L$ , this doesn't mean $G$ can't modify the effect of $A$ on $Y$ . The CATE for a specific value of $G$ can be expressed as:

$τ (g) = E [Y (1) - Y (0) ∣ G = g] = E [E [Y (1) - Y (0) ∣ L, G = g] ∣ G = g]$

Since $G$ is d-separated from $Y$ given $L$ :

$τ (g) = E [E [Y (1) - Y (0) ∣ L] ∣ G = g]$

This means $τ (g)$ is a weighted average of the L-specific treatment effects, where the weights are determined by the distribution of L given $G = g$ .

If two conditions are met:

The effect of $A$ on $Y$ varies across levels of $L$
The distribution of $L$ varies with $G$ (which it does, since $G \to L$ )

Then $τ (g)$ will vary with $g$ , indicating effect modification by $G$ .

The investigators would be wrong to equate d-separation with absence of effect modification. Although $G$ doesn't directly affect $Y$ after conditioning on $L$ , $G$ can still modify the effect of $A$ on $Y$ through its influence on the distribution of $L$ .

Week 7: In-Class Test 1 (20%)

22 April 2026

This week is the first in-class test, covering material from weeks 1–6.

What is covered

Causal diagrams: elementary structures (week 2)
Confounding bias structures (week 3)
Interaction, measurement bias, and selection bias (week 4)
Average treatment effects and the three fundamental assumptions (week 5)
Effect modification and CATE (week 6)

Format

Duration: 50 minutes (allocated time: 1 hour 50 minutes)
Closed book, no devices
Bring a pen or pencil
Location: EA120 (the usual seminar room)

Reminders

No electronic devices permitted during the test.
Arrive on time. The test begins at 14:10.
Write clearly; illegible answers cannot be marked.

Week 8: Heterogeneous Treatment Effects and Machine Learning

Date: 29 Apr 2026

Readings

Required

grf: Generalised Random Forests

Optional

VanderWeele et al. (2020), "Outcome-Wide Longitudinal Designs for Causal Inference"
Suzuki et al. (2020), "Causal Diagrams"
Bulbulia (2024), "A Practical Guide to Causal Inference with Time-Series Cross-Sectional Data"
Hoffman et al. (2023), "Introducing suringer"

Key Concepts

The workflow below introduces heterogeneous-treatment-effect (HTE) analysis with causal forests. By the end of the lecture you should understand six technical ideas:

ATE (lectures 1--5)
CATE (lecture 6)
The estimator $τ (x)$ (lecture 6)
The RATE statistics drawn from a Targeting-Operator Characteristic (TOC) curve (new)
Qini curves (new)
Policy trees, and how each fits into an applied research pipeline (new)

Why worry about heterogeneity?

Relying on the average treatment effect (ATE) is a bit like handing out size-nine shoes to an entire student body: on average they might fit, but watch the tall students hobble and the small ones trip.

Today we focus on the causal question: "What would be the effects on multi-dimensional well-being if everyone spent at least one hour a week socialising with their community?"

A one-hour boost in weekly community socialising could send some students' sense of belonging soaring while leaving others unmoved. Spotting that spread, measuring how large it really is, and deciding whether it is worth tailoring an exposure to individual "shoe sizes" are the three practical goals of heterogeneous treatment effects analysis.

1. Start with estimating the average treatment effect (ATE)

Assume the three fundamental assumptions of causal inference (consistency, exchangeability, positivity) are met. Suppose we wish to estimate the average treatment effect for socialising with one's community.

We begin with the most straightforward (and secretly impossible) counterfactual: run two parallel universes, one where everyone gets the treatment, another where no one does, and compare the final scores. The resulting difference is the average treatment effect:

$ATE = E [Y (1) - Y (0)] .$

This gives us the average response. You have seen this before.

2. Do effects differ across people?

Variation is captured by the conditional average treatment effect (CATE),

$τ (x) = E [Y (1) - Y (0) ∣ X = x],$

where $X$ gathers pre-treatment covariates: age, baseline wellbeing, personality, and whatever else we have measured and included in our model. Normally these will be our baseline confounders.

If $τ (x)$ turns out to be flat, we say there is no evidence for heterogeneity worth targeting.

People differ in countless, overlapping ways. Think of age, baseline wellbeing, personality traits, study habits, and more. A linear interaction model tests whether the treatment works differently along one straight dimension, such as gender, by fitting a straight line. But real-world data often twist and turn. If the true relationship bends like a garden hose, a straight line will miss the curve.

Regression forests fix this by letting the data place splits wherever the shape changes, so they can follow any bends that appear (Wager and Athey, 2018). Straight-line models are fine for simple patterns, but regression forests can trace the curves that simple lines overlook.

Causal forests are based on regression forests, where the splitting attempts to maximise differences in causal effect estimates. What this means will soon be clear.

3. From straight lines to trees

Challenging material

The next sections introduce the mechanics of regression trees, forests, and causal forests. You do not need to memorise the algorithms. Focus on understanding why a causal forest can detect treatment effect heterogeneity that a linear model misses.

Traditional "parametric" models (like simple regression) guess a single functional shape, often a straight line, before seeing the data. A non-parametric model, by contrast, lets the data decide the shape. A regression tree is the simplest non-parametric learner we will use.

Regression tree

The idea is to split the covariate space by asking yes/no questions ("Age $\leq$ 20?", "Baseline wellbeing $> 0.3$ ?") until each terminal leaf is fairly homogeneous. Inside a leaf the predicted outcome is just the sample mean, so the tree builds a piecewise-constant surface instead of a global line. Think of tiling a garden with stepping stones: each stone is flat, but taken together they follow the ground's contours.

Regression forest

A single tree is quick and interpretable but unstable: small changes in the data can move the splits and shift predictions. A random forest grows many trees on bootstrap samples and averages their outputs. Averaging cancels much of the noise (Breiman, 2001).

Causal forest

To estimate treatment effects rather than outcomes, each tree plays a two-step "honest" game (Wager and Athey, 2018). First, one half of its sample is used to choose splits that separate treated from control units. Second, the other half is used to compute treatment-control differences within every leaf.

For a new individual with covariates $x_{i}$ , each tree supplies a noisy leaf-level effect; the forest reports the average, written

$τ (x) = E [Y (1) - Y (0) ∣ X = x] .$

Because the noisy estimates point in many directions, their average is markedly less variable. The wisdom of trees is a wisdom of crowds.

Key intuition: honest splitting

"Honesty" means the forest never uses the same data to find the splits and estimate the effects within them. Think of it as a student who writes the exam questions (finds the splits) and a different student who answers them (estimates the effects). This separation prevents overfitting and ensures that the confidence intervals are valid.

Straight-line models suit simple patterns; regression forests flex to any bends; causal forests add a third dimension, namely variation in treatment responses.

For more about causal forests, see the lecture by Susan Athey (starting around 18 minutes in): https://www.youtube.com/watch?v=YBbnCDRCcAI

4. Building honest trees: avoiding overfitting

Sample splitting means partitioning your data into training and testing sets. This avoids overfitting the model to observations. Remember, we seek to estimate parameters for an entire population under two different exposures, at most only one of which is observed on any individual. Sample splitting is a feature of estimation in causal forests: we separate model selection from estimation.

Moreover, the forest adds a second safeguard: out-of-bag (OOB) prediction. Each $τ (x_{i})$ is averaged only over trees that never used $i$ in their split phase. Together, honesty and OOB prediction deliver reliable uncertainty estimates even in high-dimensional settings (i.e. settings with many covariates).

5. Handling missing data

The grf package adopts Missing Incorporated in Attributes (MIA) splitting. "Missing" can itself become a branch, so cases are neither discarded nor randomly imputed. This pragmatic approach keeps all observations in play while preserving the forest's interpretability.

6. Is the heterogeneity actionable? RATE statistics

Once we have a personalised score $τ (x)$ for every unit, the practical question is whether targeting high scorers delivers a benefit large enough to justify the extra effort. The tool of choice is the Targeting-Operator Characteristic (TOC) curve:

$G (q) = \frac{1}{n} i = 1 \sum ⌊ q n ⌋ τ_{(i)}, 0 \leq q \leq 1,$

where $τ_{(1)} \geq τ_{(2)} \geq \dots$ are the estimated effects sorted from largest to smallest. The horizontal axis $q$ is the fraction of the population we would treat; the vertical axis $G (q)$ is the cumulative gain we expect from treating that top slice.

Two integrals of the TOC curve summarise how lucrative targeting could be.

RATE AUTOC (Area Under the TOC) puts equal weight on every $q$ . This answers: if benefits are concentrated among the very best prospects, how much can we harvest by cherry-picking them?

RATE Qini applies heavier weight to the mid-range of $q$ . This is the go-to metric when investigators face a fixed, moderate-sized budget, say, "we can afford to treat 40% of individuals; will targeting help?" (Yadlowsky et al., 2021). We will evaluate the curve at treatment of 20% and 50% of the population.

To quantify the economic or policy value of heterogeneity, rank units by $τ (x)$ and draw a TOC curve that plots cumulative gain against the fraction $q$ of the population treated.

Summary

Our workflow in this lecture answers three questions in sequence:

Is there substantial heterogeneity? Reject $H_{0} : τ (x)$ constant if RATE AUTOC or RATE Qini is positive and statistically reliable.
Does targeting pay at realistic budgets? Inspect the slope of the Qini curve around plausible coverage levels.
Can we express the targeting rule in a few defensible steps? Fit and validate a shallow policy tree.

In the lab section you will reproduce each stage on a simulated dataset. Next week we turn to the applied side of RATE statistics: interpreting AUTOC curves, Qini curves, and building policy trees from causal forest output.

Lab materials: Lab 8: RATE and QINI Curves

Week 9: Resource Allocation and Policy Trees

Date: 6 May 2026

Readings

Required

GRF documentation: https://grf-labs.github.io/grf/

Optional

VanderWeele et al. (2020) Outcome-Wide Longitudinal Designs for Causal Inference
Suzuki et al. (2020) Causal Diagrams
Bulbulia et al. (2024) A Practical Guide to Causal Inference with Panel Data
Hoffman et al. (2023) Causal Forests

Key Concepts

This week builds on the heterogeneous-treatment-effect (HTE) framework introduced in Week 8. By the end of this lecture you should understand six technical ideas:

ATE (Weeks 1--5)
CATE (Week 6)
The estimator $τ (x)$ (Week 6)
RATE statistics derived from a Targeting-Operator Characteristic (TOC) curve (Week 8, applied here)
Qini curves (new)
Policy trees and how each fits into an applied research pipeline (new)

Recap: RATE Statistics and the TOC Curve

In Week 8 we introduced the machinery for deciding whether heterogeneity in treatment effects is large enough to act on. The central tool is the Targeting-Operator Characteristic (TOC) curve, which plots cumulative gain on the vertical axis against the fraction $q$ of the population treated on the horizontal axis, after ranking individuals by their estimated CATE $τ (x)$ from largest to smallest:

$G (q) = \frac{1}{n} i = 1 \sum ⌊ q n ⌋ τ_{(i)}, 0 \leq q \leq 1.$

Two summary statistics integrate the TOC curve into a single number:

RATE AUTOC (Area Under the TOC) weights every fraction $q$ equally, answering: if benefits concentrate among the best prospects, how much can we harvest by selecting them?
RATE Qini weights the mid-range of $q$ more heavily, suiting investigators who face a fixed, moderate-sized budget (Yadlowsky et al. 2021). We evaluate the curve at treatment of 20% and 50% of the population.

A positive, statistically reliable RATE statistic tells us that targeting outperforms blanket treatment. This week we apply these statistics to worked examples and extend the pipeline with Qini curves and policy trees.

Section 7: RATE AUTOC Example

Although out-of-bag (OOB) predictions are "out-of-sample" for individual trees, the full forest still reuses information. A simple remedy when estimating the RATE AUTOC and Qini is to split the data, training the forest on one fold and testing RATE/Qini on the other. This explicit splitting blocks optimistic bias and yields honest test statistics, including confidence intervals (GRF Labs 2024).

The RATE AUTOC curve for the effect of hours socialising on sense of meaning illustrates the typical pattern. The curve rises steeply at first: a small, correctly targeted programme could deliver large gains. Past about 30% of the sample the curve dips below zero, meaning that beyond that threshold targeting the CATE may perform worse than simply applying the ATE to everyone. The initial steepness reflects concentration of benefit among a minority of individuals, while the subsequent decline shows that extending the programme to those with small or negative estimated effects dilutes its value.

A Difficult Problem

Figuring out who will benefit from a treatment is a difficult statistical problem (GRF Labs 2024). The RATE AUTOC curve summarises the potential for targeting, not a guarantee.

Section 8: Visualising Policy Value with the Qini Curve

A Qini curve displays cumulative benefit on the vertical axis and treatment coverage (percentage of the population treated) on the horizontal axis. As with the AUTOC curve, we use a held-out test fold to validate the response curve.

For the effect of hours socialising on social belonging, the Qini curve shows that focussing on the top 20% of individuals nets a gain of roughly 0.08 units (95% CI 0.04--0.12). Widening coverage to 50% increases the cumulative gain to approximately 0.13 units (95% CI 0.07--0.19). Beyond that point the curve flattens: once we have treated everyone who offers a meaningful return, additional coverage adds little.

The Qini curve therefore translates CATE estimates into a practical budgeting tool. A policy-maker can read off the expected gain at any feasible coverage level and weigh it against cost.

Section 9: From a Black Box to Simple Rules --- Policy Trees

The causal forest produces a personalised CATE for every individual, mapping a high-dimensional covariate vector $X$ to a number $τ (X)$ . Useful as that forecast may be, it stops short of telling us what to do. The function itself is too tangled (thousands of overlapping splits) to translate directly into a policy.

The policytree algorithm bridges that gap by collapsing the forest's $τ (X)$ values into a single, shallow decision tree whose depth you choose. Each split is chosen to maximise expected benefit (Sverdrup et al. 2024). In this course we cap the depth at two, for three reasons:

At most three yes/no questions per rule, so the logic fits on a slide you can present to policy-makers.
Each leaf still contains enough observations to yield a stable effect estimate.
Deeper trees increase computational complexity faster than they improve payoffs.

The decision tree for social belonging first splits participants by self-esteem at $- 0.925$ (original scale: 3.958). For those with self-esteem at or below this threshold, the next split is by neuroticism at $0.642$ (original scale: 4.228). Within that subgroup, individuals with neuroticism at or below the threshold are recommended control, while those above are recommended treated.

For participants with self-esteem above $- 0.925$ , the second split is by social belonging at $0.776$ (original scale: 5.972). In this subgroup, individuals with social belonging at or below the threshold are recommended treated, while those above are recommended control.

The corresponding policy map plots each individual in the two-dimensional covariate space defined by the tree's splitting variables, with colour indicating the recommended treatment assignment. The vertical and horizontal lines mark the split thresholds, carving the space into four rectangular regions. This visualisation makes the tree's logic immediately legible: one can see at a glance which combinations of covariates lead to a treatment recommendation and which do not. Because predictions are generated out of training sample, the map also serves as an informal check on how well the rule generalises.

Section 10: Ethical and Practical Considerations

There is no guarantee that statistical optimality will align with social optimality. A rule that maximises expected health gains might still be unaffordable for a public agency, unfair to a protected group, or opaque to those asked to trust it. We all have our notions of fairness, and we cannot be expected to ignore them. Moreover, the estimation of CATE is always sensitive to which variables we include in our model (see the caveats in Week 6 regarding evidence for effect modification relative to the inclusion of other variables).

We should not consider CATE an absolute guide to practice. Caution is warranted.

Yet the very same CATE machinery that powers targeting also helps science move past a one-size-fits-all mindset. By mapping treatment effects across a high-dimensional covariate space, we can test whether our favourite categories (gender, age group, clinical severity) actually capture the differences that matter. Sometimes they do; often they do not, revealing that nature is not carved at the joints of our folk classifications. Discovering where the forest finds meaningful splits can generate fresh psychological hypotheses about who responds, why, and under what circumstances, even when no policy decision is on the table. Over the coming weeks we shall return to this point with examples.

Summary and Next Steps

Our workflow answers three questions in sequence:

Is there substantial heterogeneity? Reject $H_{0} : τ (x)$ constant if RATE AUTOC or RATE Qini is positive and statistically reliable.
Does targeting pay at realistic budgets? Inspect the slope of the Qini curve around plausible coverage levels.
Can we express the targeting rule in a few defensible steps? Fit and validate a shallow policy tree.

In the lab section you will reproduce each stage on a simulated dataset.

Lab materials: Lab 9: Policy Trees

Week 10: Classical Measurement Theory from a Causal Perspective

Date: 13 May 2026

Readings

Required

Fischer & Karl (2019). A primer on measurement invariance testing. link

Optional

Van de Vijver (2021). Culture and Psychology. link
He & van de Vijver (2012). Bias and equivalence in cross-cultural research. link
Harkness (2003). Questionnaire translation. link

Key Concepts

Exploratory Factor Analysis (EFA)
Confirmatory Factor Analysis (CFA)
Multigroup CFA
Invariance Testing (configural, metric, scalar)

Overview

By the conclusion of this session, you will gain proficiency in:

Exploratory Factor Analysis (EFA),
Confirmatory Factor Analysis (CFA),
Multigroup Confirmatory Factor Analysis,
Partial Invariance (configural, metric, and scalar equivalence).

We will learn these concepts by working through an analysis.

Part 1: Factor Analysis and Measurement Invariance

Focus on Kessler-6 Anxiety

The code below will:

Load required packages.
Select the Kessler 6 items.
Check whether there is sufficient correlation among the variables to support factor analysis.

Select a scale to validate: Kessler-6 Distress

# get synthetic data
library(margot)
library(tidyverse)
library(performance)

# select the columns of the kessler-6
dt_only_k6 <- df_nz |>
  filter(wave == 2018) |>
  select(
    kessler_depressed,
    kessler_effort,
    kessler_hopeless,
    kessler_worthless,
    kessler_nervous,
    kessler_restless
  )

# check factor structure
performance::check_factorstructure(dt_only_k6)

Practical definitions of the Kessler-6 items

The df_nz dataset is loaded with the margot package. It is a synthetic dataset. We take items from the Kessler-6 (K6) scale: depressed, effort, hopeless, worthless, nervous, and restless (Kessler et al. 2002; Kessler et al. 2010). The Kessler-6 is used as a diagnostic screening tool for depression by physicians in New Zealand.

The Kessler-6 (K6) is a widely used diagnostic screening tool designed to identify levels of psychological distress that may indicate mental health disorders such as depression and anxiety. Physicians in New Zealand use it to screen patients quickly (Krynen et al. 2013). Each item on the Kessler-6 asks respondents to reflect on their feelings and behaviours over the past 30 days, with responses provided on a five-point ordinal scale:

"...you feel hopeless": This item measures the frequency of feelings of hopelessness. It assesses a core symptom of depression, where the individual perceives little or no optimism about the future.
"...you feel so depressed that nothing could cheer you up": This statement gauges the depth of depressive feelings and the inability to gain pleasure from normally enjoyable activities, a condition known as anhedonia.
"...you feel restless or fidgety": This item evaluates agitation and physical restlessness, which are common in anxiety disorders but can also be present in depressive states.
"...you feel that everything was an effort": This query assesses feelings of fatigue or exhaustion with everyday tasks, reflecting the loss of energy that is frequently a component of depression.
"...you feel worthless": This item measures feelings of low self-esteem or self-worth, which are critical indicators of depressive disorders.
"...you feel nervous": This question is aimed at identifying symptoms of nervousness or anxiety, helping to pinpoint anxiety disorders.

Response options

The ordinal response options provided for the Kessler-6 are designed to capture the frequency of these symptoms, which is crucial for assessing the severity and persistence of psychological distress:

1. "None of the time": The symptom was not experienced at all.
2. "A little of the time": The symptom was experienced infrequently.
3. "Some of the time": The symptom was experienced occasionally.
4. "Most of the time": The symptom was experienced frequently.
5. "All of the time": The symptom was constantly experienced.

In clinical practice, higher scores on the Kessler-6 are indicative of greater distress and a higher probability of a mental health disorder. Physicians use a sum score of 13 to decide on further diagnostic evaluations or immediate therapeutic interventions.

The simplicity and quick administration of the Kessler-6 make it an effective tool for primary care settings in New Zealand, allowing for the early detection and management of mental health issues. Do the items in this scale cohere? Do they all relate to depression? Might we quantitatively evaluate "coherence" in this scale?

Exploratory Factor Analysis

We employ performance::check_factorstructure() to evaluate the data's suitability for factor analysis. Two tests are reported:

Bartlett's test of sphericity

Bartlett's Test of Sphericity is used in psychometrics to assess the appropriateness of factor analysis for a dataset. It tests the hypothesis that the observed correlation matrix is an identity matrix, which would suggest that all variables are orthogonal (i.e., uncorrelated) and therefore factor analysis would be inappropriate.

The outcome:

Chi-square (Chisq): 50402.32 with degrees of freedom (15) and a p-value < .001

This result (p < .001) confirms that the observed correlation matrix is not an identity matrix, substantiating the factorability of the dataset.

Kaiser-Meyer-Olkin (KMO) measure

The KMO test assesses sampling adequacy by comparing the magnitudes of observed correlation coefficients to those of partial correlation coefficients. A KMO value nearing 1 indicates appropriateness for factor analysis. The results are:

Overall KMO: 0.87. This value suggests good sampling adequacy, indicating that the sum of partial correlations is relatively low compared to the sum of correlations, thus supporting the potential for distinct and reliable factors.

Each item's KMO value exceeds the acceptable threshold of 0.5, so the data are suitable for factor analysis.

Explore factor structure

The following R code allows us to perform EFA on the Kessler-6 data, assuming three latent factors.

# exploratory factor analysis
# explore a factor structure made of 3 latent variables
library("psych")
library("parameters")

# do efa
efa <- psych::fa(dt_only_k6, nfactors = 3) |>
  model_parameters(sort = TRUE, threshold = "max")

print(efa)

What is "rotation"?

In factor analysis, rotation is a mathematical technique applied to the factor solution to make it more interpretable. Rotation is analogous to adjusting the angle of a camera to get a better view. When we rotate the factors, we are not changing the underlying data, just how we are looking at it, to make the relationships between variables and factors clearer and more meaningful.

The main goal of rotation is to achieve a simpler and more interpretable factor structure. This simplification is achieved by making the factors as distinct as possible, by aligning them closer with specific variables, which makes it easier to understand what each factor represents.

There are two types:

Orthogonal rotations (such as Varimax) assume that the factors are uncorrelated and keep the axes at 90 degrees to each other. This is useful when we assume that the underlying factors are independent.

Oblique rotations (such as Oblimin) allow the factors to correlate. This is more realistic in psychological and social sciences because constructs such as stress and anxiety may naturally correlate with each other, so oblique rotation is often a better option.

Results

Using oblimin rotation, the items loaded as follows on the three factors:

MR3: Strongly associated with kessler_hopeless (0.79) and kessler_worthless (0.79). This factor might capture aspects related to feelings of hopelessness and worthlessness, often linked with depressive affect.
MR1: Mostly linked with kessler_depressed (0.99), suggesting this factor represents core depressive symptoms.
MR2: Includes kessler_restless (0.72), kessler_nervous (0.43), and kessler_effort (0.38). This factor seems to encompass symptoms related to anxiety and agitation.

The complexity values indicate the number of factors each item loads on "significantly." A complexity near 1.00 suggests that the item predominantly loads on a single factor, which is seen with most items except for kessler_nervous and kessler_effort, which show higher complexity and thus share variance with more than one factor.

Uniqueness values represent the variance in each item not explained by the common factors. Lower uniqueness values for items like kessler_depressed indicate that the factor explains most of the variance for that item.

Variance explained

The three factors together account for 62.94% of the total variance in the data, distributed as follows:

MR3: 28.20%
MR1: 17.56%
MR2: 17.18%

This indicates a substantial explanation of the data's variance by the model, with the highest contribution from the factor associated with hopelessness and worthlessness.

Consensus view?

There are different algorithms for assessing the factor structure. The performance package allows us to consider a "consensus" view.

n <- n_factors(dt_only_k6)
n
# plot
plot(n) + theme_classic()

Output:

The choice of 1 dimensions is supported by 8 (50.00%) methods out of 16 (Optimal coordinates, Acceleration factor, Parallel analysis, Kaiser criterion, Scree (SE), Scree (R2), VSS complexity 1, Velicer's MAP).

The result indicates that a single dimension is supported by half of the methods used (8 out of 16). However, science is not a matter of voting. Also, does it make sense that there is one latent factor here?

Confirmatory Factor Analysis (ignoring groups)

CFA validates the hypothesised factor structures derived from EFA.

One-factor model: assumes all items measure a single underlying construct.
Two-factor model: assumes two distinct constructs measured by the items.
Three-factor model: assumes three distinct constructs measured by the items.

1. Data partition

First, we take the dataset (dt_only_k6) and partition it into training and testing sets. This division helps in validating the model built on the training data against an unseen test set, enhancing robustness of the factor analysis findings.

set.seed(123)
part_data <- datawizard::data_partition(dt_only_k6, training_proportion = .7, seed = 123)
training <- part_data$p_0.7
test <- part_data$test

2. Model setup for CFA

Based on the EFA results, we consider three different factor structures. We fit each model to the training data:

# one factor model
structure_k6_one <- psych::fa(training, nfactors = 1) |>
  efa_to_cfa()

# two factor model
structure_k6_two <- psych::fa(training, nfactors = 2) |>
  efa_to_cfa()

# three factor model
structure_k6_three <- psych::fa(training, nfactors = 3) |>
  efa_to_cfa()

# inspect models
structure_k6_one
structure_k6_two
structure_k6_three

One-factor model: All items are linked to a single factor (MR1).
Two-factor model:
- MR1 is linked with kessler_depressed, kessler_hopeless, and kessler_worthless, suggesting these items might represent a more depressive aspect of distress.
- MR2 is associated with kessler_effort, kessler_nervous, and kessler_restless, which could indicate a different aspect, perhaps related to anxiety or agitation.
Three-factor model:
- MR1 includes kessler_depressed, kessler_effort, kessler_hopeless, and kessler_worthless, indicating a broad factor possibly encompassing overall distress.
- MR2 consists solely of kessler_effort.
- MR3 includes kessler_nervous and kessler_restless, which might imply these are distinctive from other distress components.

Do these results make sense? Note they are different from the EFA. Why might that be?

Next we perform the confirmatory factor analysis itself:

# fit and compare models
library(lavaan)

# one latent model
one_latent <-
  suppressWarnings(lavaan::cfa(structure_k6_one, data = test))

# two latents model
two_latents <-
  suppressWarnings(lavaan::cfa(structure_k6_two, data = test))

# three latents model
three_latents <-
  suppressWarnings(lavaan::cfa(structure_k6_three, data = test))

# compare models
compare <-
  performance::compare_performance(one_latent, two_latents, three_latents, verbose = FALSE)

# select cols we want
key_columns_df <- compare[, c("Model", "Chi2", "Chi2_df", "CFI", "RMSEA",
                               "RMSEA_CI_low", "RMSEA_CI_high", "AIC", "BIC")]

# view as table
as.data.frame(key_columns_df) |>
  kbl(format = "markdown")

Metrics

Chi2 (Chi-Square Test): A lower Chi2 value indicates a better fit of the model to the data.
df (Degrees of Freedom): Reflects the model complexity.
CFI (Comparative Fit Index): Values closer to 1 indicate a better fit. A value above 0.95 is generally considered good.
RMSEA (Root Mean Square Error of Approximation): Values less than 0.05 indicate a good fit, and values up to 0.08 are acceptable.
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): Lower values are better, indicating a more parsimonious model, with BIC imposing a penalty for model complexity.

Model selection

The three latents model shows the best fit across all indicators, having the lowest Chi2, RMSEA, and the highest CFI. It also has the lowest AIC and BIC scores.

However:

The CFI for the two-factor model is 0.990, which is close to 1 and suggests a very good fit. This is superior to the one-factor model (CFI = 0.947) and slightly less than the three-factor model (CFI = 0.995).
The RMSEA for the two-factor model is 0.044, which is within the excellent fit range (below 0.05). It improves upon the one-factor model's RMSEA of 0.099 and is only slightly worse than the three-factor model's 0.031.
BIC is not very different.

The two-factor model strikes a balance between simplicity and model fit. It has fewer factors than the three-factor model, making it potentially easier to interpret while still capturing the variance in the data.

Does anxiety appear to differ from depression?

Measurement Invariance in Multigroup Confirmatory Factor Analysis

When we use tools like surveys or tests to measure psychological constructs (such as distress, intelligence, satisfaction), we often need to ensure that these tools work similarly across different groups of people. This is crucial for fair comparisons.

Levels of measurement invariance

Measurement invariance tests how consistently a measure operates across different groups, such as ethnic or gender groups. Consider how we can understand it through the K6 Distress Scale applied to different demographic groups in New Zealand:

1. Configural invariance

The measure's structure is the same across groups. For the K6 Distress Scale, both Maori and New Zealand Europeans use the same six questions to gauge distress. However, how strongly each question predicts distress can vary between the groups. (Note we are using "prediction" here; we are only assessing associations.)

2. Metric invariance

The strength of the relationship between each question and the construct (distress) is consistent across groups. If metric invariance holds, a unit change in the latent distress factor affects scores on questions (such as feeling nervous or hopeless) equally for Maori and New Zealand Europeans.

3. Scalar invariance

Beyond having the same structure and relationships, everyone starts from the same baseline. If scalar invariance holds, the actual scores on the distress scale mean the same thing across different groups. If a Maori participant scores 15 and a New Zealander of European descent scores 15, both are experiencing a comparable level of distress.

Concept of "partial invariance"

Sometimes, not all conditions for full metric or scalar invariance are met, which could hinder meaningful comparisons across groups. This is where the concept of "partial invariance" comes into play.

Partial invariance occurs when invariance holds for some but not all items of the scale across groups.

Metric partial invariance: Most items relate similarly to the underlying factor across groups, but one or two do not. Researchers might decide that there is enough similarity to proceed with comparisons of relationships (correlations) but should proceed with caution.
Scalar partial invariance: Most but not all items show the same intercepts across groups. It suggests that while comparisons of the construct means can be made, some scores might need adjustments or nuanced interpretation.

In practical terms, achieving partial invariance allows for some comparisons but signals a need for careful interpretation. For instance, if partial scalar invariance is found on the K6 Distress Scale, researchers might compare overall distress levels between Maori and New Zealand Europeans but should acknowledge that differences in certain item responses might reflect measurement bias rather than true differences in distress.

Understanding these levels of invariance helps ensure that when we compare mental health or other constructs across different groups, we are making fair and meaningful comparisons.

Running multigroup CFA

The following script runs multigroup confirmatory factor analysis (MG-CFA) to assess the invariance of the K6 distress scale across two ethnic groups: European New Zealanders and Maori.

# select needed columns plus 'ethnicity'
# filter dataset for only 'euro' and 'maori' ethnic categories
dt_eth_k6_eth <- df_nz |>
  filter(wave == 2018) |>
  filter(eth_cat == "euro" | eth_cat == "maori") |>
  select(kessler_depressed, kessler_effort, kessler_hopeless,
         kessler_worthless, kessler_nervous, kessler_restless, eth_cat)

# partition the dataset into training and test subsets
# stratify by ethnic category to ensure balanced representation
part_data_eth <- datawizard::data_partition(dt_eth_k6_eth,
  training_proportion = .7, seed = 123, group = "eth_cat")

training_eth <- part_data_eth$p_0.7
test_eth <- part_data_eth$test

# configural invariance models
# run cfa models specifying one, two, and three latent variables
# without constraining across groups
one_latent_eth_configural <- suppressWarnings(
  lavaan::cfa(structure_k6_one, group = "eth_cat", data = test_eth))
two_latents_eth_configural <- suppressWarnings(
  lavaan::cfa(structure_k6_two, group = "eth_cat", data = test_eth))
three_latents_eth_configural <- suppressWarnings(
  lavaan::cfa(structure_k6_three, group = "eth_cat", data = test_eth))

# compare model performances for configural invariance
compare_eth_configural <- performance::compare_performance(
  one_latent_eth_configural, two_latents_eth_configural,
  three_latents_eth_configural, verbose = FALSE)

# metric invariance models
# run cfa models holding factor loadings equal across groups
one_latent_eth_metric <- suppressWarnings(
  lavaan::cfa(structure_k6_one, group = "eth_cat",
              group.equal = "loadings", data = test_eth))
two_latents_eth_metric <- suppressWarnings(
  lavaan::cfa(structure_k6_two, group = "eth_cat",
              group.equal = "loadings", data = test_eth))
three_latents_eth_metric <- suppressWarnings(
  lavaan::cfa(structure_k6_three, group = "eth_cat",
              group.equal = "loadings", data = test_eth))

# compare model performances for metric invariance
compare_eth_metric <- performance::compare_performance(
  one_latent_eth_metric, two_latents_eth_metric,
  three_latents_eth_metric, verbose = FALSE)

# scalar invariance models
# run cfa models holding factor loadings and intercepts equal across groups
one_latent_eth_scalar <- suppressWarnings(
  lavaan::cfa(structure_k6_one, group = "eth_cat",
              group.equal = c("loadings", "intercepts"), data = test_eth))
two_latents_eth_scalar <- suppressWarnings(
  lavaan::cfa(structure_k6_two, group = "eth_cat",
              group.equal = c("loadings", "intercepts"), data = test_eth))
three_latents_eth_scalar <- suppressWarnings(
  lavaan::cfa(structure_k6_three, group = "eth_cat",
              group.equal = c("loadings", "intercepts"), data = test_eth))

# compare model performances for scalar invariance
compare_eth_scalar <- performance::compare_performance(
  one_latent_eth_scalar, two_latents_eth_scalar,
  three_latents_eth_scalar, verbose = FALSE)

Configural invariance results

The table represents the comparison of three MG-CFA models conducted to test for configural invariance across different ethnic categories. Configural invariance refers to whether the pattern of factor loadings is the same across groups. It is the most basic form of measurement invariance.

Looking at the results:

Chi2 (Chi-square): A lower value suggests a better model fit. The three-factor model performs best.
GFI (Goodness of Fit Index) and AGFI (Adjusted Goodness of Fit Index): Values range from 0 to 1, with values closer to 1 suggesting a better fit. All models are close.
NFI, NNFI (TLI), CFI: Range from 0 to 1, with values closer to 1 suggesting a better fit. The two- and three-factor models have the highest values.
RMSEA: Lower values are better, with values below 0.05 considered good and up to 0.08 considered acceptable. Only the two- and three-factor models meet this threshold.
AIC and BIC: The three-factor model has the lowest values.

Analysis of the results:

One latent model: Shows the poorest fit among the models with a high RMSEA and the lowest CFI. The Chi-squared value is significantly high, indicating a substantial misfit with the observed data.
Two latents model: Displays a much improved fit compared to the one-latent model, as evident from its much lower Chi-squared value, lower RMSEA, and higher CFI.
Three latents model: Provides the best fit metrics among the three configurations.

Metric equivalence

Metric equivalence (also known as weak measurement invariance) tests whether factor loadings are equal across groups. The three-factor model again performs best.

Scalar equivalence

Scalar equivalence tests whether both factor loadings and intercepts are equal across groups. This is the strongest form of invariance and is required for meaningful mean comparisons across groups.

Overall, there is good evidence for the three-factor model of Kessler-6, but the two-factor model is close.

Consider: when might we prefer a two-factor model? When might we prefer a three-factor model? When might we prefer a one-factor model?

Conclusion: understanding the limits of association in factor models

This discussion of measurement invariance across different demographic groups underscores the reliance of factor models on the underlying associations in the data. These models are fundamentally descriptive, not prescriptive; they organise the observed data into a coherent structure based on correlations and assumed relationships among variables.

Factor analysis assumes that the latent variables (factors) causally influence the observed indicators. This is a strong assumption that can profoundly affect the interpretation of results. Understanding and critically evaluating our assumptions is essential when applying factor analysis to real-world scenarios.

The assumption that latent variables cause the observed indicators, rather than merely being associated with them, suggests a directional relationship that can affect decisions made based on the analysis. For instance, if we believe that a latent construct like psychological distress causally influences responses on the K6 scale, interventions might be designed to target this distress directly. However, if the relationship is more complex or bidirectional, such straightforward interventions might not be as effective.

Transition

Part 1 covered the practical tools: EFA, CFA, and multigroup CFA for invariance testing. Part 2 asks a deeper question: what do factor models actually assume about the causal relationship between latent constructs and observed indicators? This section is more theoretical and connects measurement to the DAG framework from earlier lectures.

Part 2: How Traditional Measurement Theory Fails

Readings for Part 2

Required

VanderWeele (2022). Constructs and measures. link

Suggested

Harkness & van de Vijver (2003). Questionnaire design in comparative research. link

Key Concepts for Part 2

The Formative Model in Factor Analysis
The Reflective Model in Factor Analysis
How to use causal diagrams to evaluate assumptions

Overview

By the end of this section you will:

Understand the causal assumptions implied by the factor analytic interpretation of the formative and reflective models.
Be able to distinguish between statistical and structural interpretations of these models.
Understand why VanderWeele argues that consistent causal estimation is possible using the theory of multiple versions of treatments for constructs with multiple indicators.

Two ways of thinking about measurement in psychometric research

In psychometric research, formative and reflective models describe the relationship between latent variables and their respective indicators. VanderWeele discusses this in the assigned reading (VanderWeele 2022).

Reflective model (factor analysis)

In a reflective measurement model, also known as an effect indicator model, the latent variable is understood to cause the observed variables. In this model, changes in the latent variable cause changes in the observed variables. Each indicator (observed variable) is a "reflection" of the latent variable: they are effects or manifestations of the latent variable.

The reflective model may be expressed:

$X_{i} = λ_{i} η + ε_{i}$

Here, $X_{i}$ is an observed variable (indicator), $λ_{i}$ is the factor loading for $X_{i}$ , $η$ is the latent variable, and $ε_{i}$ is the error term associated with $X_{i}$ . It is assumed that all the indicators are interchangeable and have a common cause, which is the latent variable $η$ .

In the conventional approach of factor analysis, the assumption is that a common latent variable is responsible for the correlation seen among the indicators. Any fluctuation in the latent variable should lead to similar changes in the indicators.

Diagram: Reflective model. The latent variable $η$ gives rise to indicators $X_{1} \dots X_{n}$ .

    η ──→ X₁
    η ──→ X₂
    η ──→  ⋮
    η ──→ Xₙ

Formative model (factor analysis)

In a formative measurement model, the observed variables are seen as causing or determining the latent variable. Here again there is a single latent variable, but this latent variable is taken to be an effect of the underlying indicators.

The formative model may be expressed:

$η = i \sum λ_{i} X_{i} + ε$

In this equation, $η$ is the latent variable, $λ_{i}$ is the weight for $X_{i}$ (the observed variable), and $ε$ is the error term. The latent variable $η$ is a composite of the observed variables $X_{i}$ .

In the context of a formative model, correlation or interchangeability between indicators is not required. Each indicator contributes distinctively to the latent variable. A modification in one indicator does not automatically imply a corresponding change in the other indicators.

Diagram: Formative model. The indicators $X_{1} \dots X_{n}$ give rise to the latent variable $η$ .

    X₁ ──→ η
    X₂ ──→ η
     ⋮ ──→ η
    Xₙ ──→ η

Structural interpretation of the formative and reflective models

VanderWeele distinguishes between statistical and structural interpretations of the equations presented above.

Statistical model: A mathematical construct that shows how observable variables (indicators) are related to latent or unseen variables.
Structural model: The causal assumptions or hypotheses about the relationships among variables in a statistical model. The assumptions of the factor analytic tradition (that the latent variable is causally efficacious) are structural claims.

The reflective model statistically implies that the observed variables are reflections of the latent variable, expressed as $X_{i} = λ_{i} η + ε_{i}$ . However, the factor analytic tradition makes the additional structural assumption that a univariate latent variable causally influences the observed variables.

The formative model statistically implies that the latent variable is formed by the observed variables, expressed as $η = \sum_{i} λ_{i} X_{i} + ε$ . However, the factor analytic tradition makes the additional assumption that the observed variables give rise to a univariate latent variable.

Problems with the structural interpretations

While the statistical model $X_{i} = λ_{i} η + ε_{i}$ aligns with the standard reflective structural interpretation, it is also compatible with a model where indicators themselves cause the outcome. Cross-sectional data do not provide enough information to discern between these different structural interpretations.

Similarly, the statistical model $η = \sum_{i} λ_{i} X_{i} + ε$ agrees with the standard formative structural interpretation, but it also agrees with a model in which the indicators, not the latent composite, exert causal effects on the outcome.

The formative and reflective conceptions of factor analysis are compatible with:

Indicators having direct causal effects on the outcome (rather than operating through the latent variable).
A multivariate reality ( $η_{1} \dots η_{n}$ ) giving rise to multiple indicators, each of which may have distinct causal relevance.
Only one or several of the indicators being causally efficacious, even though we construct our measure from all of them.

VanderWeele's key observation

While cross-sectional data can provide insights into the relationships between variables, they cannot conclusively determine the causal direction of these relationships.

This result is concerning. The structural assumptions of factor analysis underpin nearly all psychological research. If the cross-sectional data used to derive factor structures cannot decide whether the structural interpretations of factor models are accurate, where does that leave us?

More concerning still, VanderWeele discusses several longitudinal tests for structural interpretations of univariate latent variables that do not pass.

Where does that leave us? In psychology we have heard about a replication crisis. The reliance on factor models may be an aspect of a much larger "causal crisis" (Bulbulia et al. 2022).

Review of the theory of multiple versions of treatment

Perhaps not all is lost. VanderWeele looks to the theory of multiple versions of treatment.

Recall, a causal effect is defined as the difference in the expected potential outcome when everyone is exposed (perhaps contrary to fact) to one level of a treatment, conditional on their levels of a confounder, with the expected potential outcome when everyone is exposed to a different level of a treatment (perhaps contrary to fact), conditional on their levels of a confounder.

$δ = l \sum (E [Y ∣ A = a, l] - E [Y ∣ A = a^{*}, l]) P (l)$

where $δ$ is the causal estimand on the difference scale ( $E [Y^{1} - Y^{0}]$ ).

In causal inference, the multiple versions of treatment theory allows us to handle situations where the treatment is not uniform but instead has several variations. Each variation of the treatment, or "version," can have a different impact on the outcome. Consistency is not violated because it is redefined: for each version of the treatment, the outcome under that version is equal to the observed outcome when that version is received. We may think of the indicator $A$ as corresponding to many versions of the true treatment $K$ . Where conditional independence holds such that there is an absence of confounding for the effect of $K$ on $Y$ given $L$ , we have: $Y_{k} ⊥ ⊥ A ∣ K, L$ . This states that conditional on $L$ , $A$ gives no information about $Y$ once $K$ and $L$ are accounted for. When $Y = Y_{k}$ if $K = k$ and $Y_{k}$ is independent of $K$ conditional on $L$ , then $A$ may be thought of as a coarsened indicator of $K$ . We may estimate consistent causal effects where:

$δ = k, l \sum E [Y_{k} ∣ l] P (k ∣ a, l) P (l) - k, l \sum E [Y_{k} ∣ l] P (k ∣ a^{*}, l) P (l)$

The scenario represents a hypothetical randomised trial where within strata of covariates $L$ , individuals in one group receive a treatment $K$ version randomly assigned from the distribution of $K$ among the $(A = 1, L = l)$ sub-population. Meanwhile, individuals in the other group receive a randomly assigned $K$ version from the $(A = 0, L = l)$ sub-population.

Reflective and formative measurement models may be approached as multiple versions of treatment

VanderWeele applies the following substitution:

$δ = η, l \sum E [Y_{η} ∣ l] P (η ∣ A = a + 1, l) P (l) - η, l \sum E [Y_{η} ∣ l] P (η ∣ A = a, l) P (l)$

Specifically, we substitute $K$ with $η$ and compare the measurement response $A = a + 1$ with $A = a$ . We discover that if the influence of $η$ on $Y$ is not confounded given $L$ , then the multiple versions of reality consistent with the reflective and formative statistical models will not lead to biased estimation. The parameter $δ$ retains its interpretability as a comparison in a hypothetical randomised trial in which the distribution of coarsened measures of $η_{A}$ are balanced within levels of the treatment, conditional on $η_{L}$ .

This connection between measurement and the multiple versions of treatment framework provides hope for consistent causal inference under varying reliabilities of measurement.

However, as with the theory of multiple treatments, we might not know how to interpret our results because we do not know the true relationships between our measured indicators and underlying reality.

Challenging material

VanderWeele's argument below is abstract and philosophically dense. The core insight is that standard factor analysis cannot determine whether indicators cause or reflect the latent variable. Focus on understanding why this matters for causal inference, not on the formal derivation.

VanderWeele's model of reality

VanderWeele's article concludes as follows:

A preliminary outline of a more adequate approach to the construction and use of psychosocial measures might thus be summarised by the following propositions: (1) Traditional univariate reflective and formative models do not adequately capture the relations between the underlying causally relevant phenomena and our indicators and measures. (2) The causally relevant constituents of reality related to our constructs are almost always multidimensional, giving rise both to our indicators from which we construct measures, and also to our language and concepts, from which we can more precisely define constructs. (3) In measure construction, we ought always to specify a definition of the underlying construct, from which items are derived, and by which analytic relations of the items to the definition are made clear. (4) The presumption of a structural univariate reflective model impairs measure construction, evaluation, and use. (5) If a structural interpretation of a univariate reflective factor model is being proposed this should be formally tested, not presumed; factor analysis is not sufficient for assessing the relevant evidence. (6) Even when the causally relevant constituents of reality are multidimensional, and a univariate measure is used, we can still interpret associations with outcomes using theory for multiple versions of treatment, though the interpretation is obscured when we do not have a clear sense of what the causally relevant constituents are. (7) When data permit, examining associations item-by-item, or with conceptually related item sets, may give insight into the various facets of the construct.

A new integrated theory of measurement for psychosocial constructs is needed in light of these points, one that better respects the relations between our constructs, items, indicators, measures, and the underlying causally relevant phenomena. (VanderWeele 2022)

This seems sensible. However, VanderWeele's diagram of multivariate reality ( $R$ ) giving rise to concepts, constructs, latent variables ( $η$ ), indicators ( $X_{1} \dots X_{n}$ ), measures ( $A$ ), and outcomes ( $Y$ ) is not strictly a causal graph. The arrows do not clearly represent causal relations. It leaves unclear what to do in practice.

How measurement error can be incorporated into causal diagrams

By now you are familiar with the New Zealand Attitudes and Values Study (NZAVS), which is a national probability survey that collects a wide range of information, including data on distress, exercise habits, and cultural backgrounds.

Consider a study that seeks to use this dataset to investigate the effect of regular exercise on psychological distress. In contrast to previous graphs, let us allow for latent reality to affect our measurements, as well as the discrepancies between our measurements and true underlying reality.

We represent true exercise by $η_{A}$ . We represent true psychological distress by $η_{Y}$ . Let $η_{L}$ denote a person's true workload, and assume that this state of work affects both levels of exercise and psychological distress.

To bring the model into contact with measurement theory, let us describe measurements of these latent true underlying realities as functions of multiple indicators: $L_{f (X_{1} \dots X_{n})}$ , $A_{f (X_{1} \dots X_{n})}$ , and $Y_{f (X_{1} \dots X_{n})}$ . These constructs are measured realisations of the underlying true states. We assume that the true states of these variables affect their corresponding measured states, and so draw arrows from $η_{L} \to L_{f (X_{1} \dots X_{n})}$ , $η_{A} \to A_{f (X_{1} \dots X_{n})}$ , $η_{Y} \to Y_{f (X_{1} \dots X_{n})}$ .

We also assume unmeasured sources of error that affect the measurements: $U_{L} \to L_{f (X_{1} \dots X_{n})}$ , $U_{A} \to A_{f (X_{1} \dots X_{n})}$ , and $U_{Y} \to Y_{f (X_{1} \dots X_{n})}$ . That is, we allow that our measured indicators may imperfectly reflect the underlying true reality they hope to capture. We use $U_{L}$ , $U_{A}$ and $U_{Y}$ to denote the unmeasured sources of error in the measured indicators.

Allow that the true underlying reality represented by $η_{v a r}$ may be multivariate. Similarly, allow the true underlying reality represented by $U_{v a r}$ to be multivariate.

We now have a causal diagram that more precisely captures VanderWeele's thinking. We have fleshed out $R$ in a way that may include natural language concepts and scientific constructs as latent realities and latent unmeasured sources of error in our constructs.

The utility of describing the measurement dynamics using causal graphs is apparent. We can understand that the measured states, once conditioned upon, create collider biases which open paths between the unmeasured sources of error and the true underlying state that gives rise to our measurements.

Key Insight

Where true unmeasured (multivariate) psycho-physical states are related to true unmeasured (multivariate) sources of error in the measurement of those states, the very act of measurement opens pathways to confounding.

If for each measured construct $v a r_{f (X_{1}, X_{2}, \dots X_{n})}$ , the sources of error $U_{v a r}$ and the unmeasured constituents of reality that give rise to our measures $η_{v a r}$ are uncorrelated with other variables $U_{v a r}^{'}$ , $η_{v a r}^{'}$ , and $v a r_{f (X_{1}, X_{2}, \dots X_{n})}^{'}$ , our estimates may be downwardly biased toward the null. However, d-separation is preserved. Where errors are uncorrelated with true latent realities, there is no new pathway that opens information between our exposure and outcome.

Directed and correlated measurement error

When measurement errors are not merely random but are correlated or directed, additional pathways to confounding can open:

$η_{L} \to L$ : We assume that the true workload state affects its measurement. This measurement, however, may be affected by an unmeasured error source, $U_{L}$ . Personal perceptions of workload can introduce this error. Cultural influences like societal expectations of productivity could shape responses independently of the true workload state.
$η_{A} \to A$ : When it comes to exercise, the true state may affect the measured frequency (questions about exercise are not totally uninformative). However, this measurement is also affected by an unmeasured source of error, $U_{A}$ . For example, a cultural shift towards valuing physical health might prompt participants to report higher activity levels.
$η_{Y} \to Y$ : Actual distress affects the measured distress. However, this measurement is subject to unmeasured error, $U_{Y}$ . For instance, an increased societal acceptance of mental health might change how distress is reported.
$U_{L} \to U_{A}$ and $U_{L} \to U_{Y}$ : The error in the stress measurement can correlate with those in the exercise and mood measurements. This could stem from a common cultural bias affecting how a participant self-reports across these areas.
$η_{A} \to U_{Y}$ and $η_{L} \to U_{A}$ : The actual state of one variable can affect the error in another variable's measurement. For example, a cultural emphasis on physical health leading to increased exercise might, in turn, affect the reporting of distress levels.

Confounding control by baseline measures: dependent directed measurement error in three-wave panels

We propose a three-wave panel design to control confounding. This design adjusts for baseline measurements of both exposure and the outcome.
Understanding this approach in the context of potential directed and correlated measurement errors gives us a clearer picture of its strengths and limitations.
This three-wave panel design incorporates baseline measurements of both exposure and confounders. As a result, any bias that could come from unmeasured sources of measurement errors should be uncorrelated with their baseline effects.
For instance, if individuals have a social desirability bias at baseline, they would have to develop a different bias unrelated to the initial one for new bias to occur due to correlated unmeasured sources of measurement errors.
However, we cannot completely eliminate the possibility of such new bias development. There could also be potential new sources of bias from directed effects of the exposure on the error term of the outcome, which can often occur due to panel attrition.
To mitigate this risk, we adjust for panel attrition/non-response using methods like multiple imputation. We also consistently perform sensitivity analyses to detect any unanticipated bias.
Despite these potential challenges, by including measures of both exposure and outcome at baseline, the chances of new confounding are significantly reduced.
Therefore, adopting this practice should be a standard procedure in multi-wave studies as it substantially minimises the probability of introducing novel confounding factors.

Comment on slow changes

Over long periods of time we can expect additional sources of confounding. Changes in cultural norms and attitudes can occur over the duration of a longitudinal study like the NZAVS, leading to residual confounding. For example, if there is a cultural shift towards increased acceptance of mental health issues, this might change how psychological distress is reported over time, irrespective of baseline responses.

Key Takeaway

We must always perform sensitivity analyses because we can never be certain that our confounding control strategy has worked. The three-wave panel design with baseline measures of exposure and outcome substantially reduces, but does not eliminate, the risk of bias from measurement error.

Appendix A: What is a Correlation Matrix?

A correlation matrix is a square matrix that contains the Pearson correlation coefficients between each pair of variables within a dataset. Each element in the matrix represents the correlation between two variables.

Structure

Dimensions: The matrix is $p \times p$ where $p$ is the number of variables.
Diagonal elements: All diagonal elements are 1, because each variable has a perfect correlation with itself.
Off-diagonal elements: These elements, denoted as $r_{ij}$ , are the Pearson correlation coefficients between the $i^{t h}$ and $j^{t h}$ variables, ranging from -1 to +1.
- $r_{ij} = 1$ indicates a perfect positive linear relationship.
- $r_{ij} = - 1$ indicates a perfect negative linear relationship.
- $r_{ij} = 0$ indicates no linear relationship.

Properties

Symmetry: The matrix is symmetric around the diagonal, meaning $r_{ij} = r_{ji}$ .
Real values: All entries are real numbers.
Bounded values: Values are constrained between -1 and 1, inclusive.

Use

Exploring relationships between variables.
Conducting factor analysis to identify latent factors.

# compute the correlation matrix
library(margot)
library(tidyverse)

dt_only_k6 <- df_nz |>
  dplyr::filter(wave == 2018) |>
  dplyr::select(
    kessler_depressed,
    kessler_effort,
    kessler_hopeless,
    kessler_worthless,
    kessler_nervous,
    kessler_restless
  )

cor_matrix <- cor(dt_only_k6, use = "pairwise.complete.obs", method = "pearson")
print(round(cor_matrix, 2))

Lab materials: Lab 10: Measurement Invariance

Week 11: In-Class Test 2 (20%)

20 May 2026 (w11)

This week is the second in-class test, covering material from weeks 8–10.

What is covered

Heterogeneous treatment effects and machine learning (week 8)
Resource allocation and policy trees (week 9)
Classical measurement theory from a causal perspective (week 10)

Format

Duration: 50 minutes (allocated time: 1 hour 50 minutes)
Closed book, no devices
Bring a pen or pencil
Location: EA120 (the usual seminar room)

Reminders

No electronic devices permitted during the test.
Arrive on time. The test begins at 14:10.
Write clearly; illegible answers cannot be marked.

Week 12: Student Presentations (10%)

27 May 2026

This week, students present summaries of their research reports.

Format

10-minute presentation per student
Summarise your study: question, method, key findings
Assessment criteria: clarity, efficiency, and quality of presentation

Preparation

Prepare slides (or equivalent visual aid)
Practice timing: 10 minutes is short; focus on the essentials
Be ready to answer brief questions from the class

Research report due

The research report is due Friday 30 May (end of week 12). Submit as a single PDF with R code appendix via Nuku.

Lab 1: Git and GitHub

This session introduces version control with Git and GitHub. Setting up these tools first means you can track your work from day one.

Before week 2

Install R and RStudio before next week's lab. Instructions are in Lab 2: Install R and RStudio.

What is version control?

Version control tracks every change you make to your files. Instead of saving report_v2_final_FINAL.docx, you save a single file and Git remembers its entire history. You can go back to any previous version, see exactly what changed, and collaborate without overwriting each other's work.

GitHub is a website that hosts your Git repositories online. It serves as a backup and lets you share your work.

Why bother?

Your lab diary and final report will be easier to manage.
You will never lose work (every version is saved).
Employers value version control skills.
It is the standard tool for reproducible research.

Step 1: Create a GitHub account

Go to https://github.com.
Click Sign up and follow the prompts.
Choose a username you are happy to use professionally (e.g., jsmith-nz, not xXx_gamer_xXx).
Verify your email address.

Student benefits

Apply for the GitHub Student Developer Pack with your university email. It includes free private repositories and other tools.

Step 2: Install Git

macOS

Open Terminal (search for it in Spotlight) and type:

git --version

If Git is not installed, macOS will prompt you to install the Xcode Command Line Tools. Follow the prompts.

Windows

Download Git from https://git-scm.com/download/win. Run the installer and accept the defaults.

Verify installation

Open a terminal (Terminal on macOS, Git Bash on Windows) and type:

git --version

You should see something like git version 2.44.0.

Step 3: Configure Git

Tell Git your name and email (use the same email as your GitHub account):

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

Step 4: Create your first repository

Go to https://github.com and click the + button (top right), then New repository.
Name it psych-434-lab (or similar).
Add a description (e.g., "Lab diary for PSYCH 434").
Select Private.
Tick Add a README file.
Click Create repository.

Step 5: Clone the repository to your computer

Cloning downloads a copy of the repository to your machine and links it to GitHub.

On your repository page, click the green Code button.
Copy the HTTPS URL.
Open a terminal and navigate to where you want to store your work:

cd ~/Documents

Clone the repository:

git clone https://github.com/YOUR-USERNAME/psych-434-lab.git

Move into the repository folder:

cd psych-434-lab

You now have a local copy linked to GitHub.

Step 6: The basic Git workflow

The everyday workflow has three steps: stage, commit, push.

1. Make a change

Open the README.md file in any text editor and add a line, such as:

This is my lab diary for PSYCH 434 (2026).

Save the file.

2. Stage the change

Staging tells Git which changes you want to include in your next snapshot:

git add README.md

To stage everything at once, use git add -A.

3. Commit the change

A commit is a snapshot with a short message describing what you did:

git commit -m "update readme with course details"

4. Push to GitHub

Push sends your commits to GitHub so they are backed up online:

git push

Authentication

The first time you push, GitHub will ask you to authenticate. Follow the prompts to sign in through your browser.

Step 7: Check your work

Go to your repository page on GitHub. You should see the updated README file with your changes.

Quick reference

Command	What it does
`git status`	Show which files have changed
`git add <file>`	Stage a file for the next commit
`git add -A`	Stage all changes
`git commit -m "message"`	Save a snapshot with a message
`git push`	Upload commits to GitHub
`git pull`	Download changes from GitHub
`git log --oneline`	Show commit history

Workflow summary

Edit files → git add → git commit -m "message" → git push. Repeat.

Terminal basics

You have already used a few terminal commands (cd, git clone). The terminal is a text interface for your computer. Every command you type runs a small program. Here are the commands you will use most often.

Where am I?

pwd

pwd (print working directory) shows the folder you are currently in.

List files

ls

ls lists the files and folders in the current directory. To see hidden files (names starting with .), use:

ls -a

Git stores its data in a hidden folder called .git. Try ls -a inside your repository to see it.

Change directory

cd ~/Documents/psych-434-lab

cd moves you into a folder. A few shortcuts:

Shortcut	Meaning
`~`	Your home folder
`..`	One level up
`.`	The current folder

So cd .. moves up one level, and cd ~ takes you home.

Create a folder

mkdir diaries

mkdir (make directory) creates a new folder.

Create an empty file

touch lab-01.md

touch creates an empty file if it does not already exist.

Terminal quick reference

Command	What it does
`pwd`	Print the current directory
`ls`	List files
`ls -a`	List files including hidden ones
`cd <folder>`	Change directory
`cd ..`	Go up one level
`mkdir <name>`	Create a folder
`touch <name>`	Create an empty file

Organise your repository

Set up a simple folder structure for the course. From inside your psych-434-lab folder:

mkdir diaries

Your repository should look like this:

psych-434-lab/
├── README.md
└── diaries/

Lab diary files go in the diaries/ folder, named by week number:

diaries/
├── lab-01.md
├── lab-02.md
├── lab-03.md
├── ...
└── lab-10.md

There is no lab-07.md (week 7 is test 1). Create your first diary file now:

touch diaries/lab-01.md

Markdown basics

Markdown is a plain-text format that converts to formatted documents. You write in a .md file using simple symbols for headings, bold, lists, and so on. GitHub renders markdown automatically, so your diary will look formatted when you view it online.

Headings

Use # symbols. More # signs mean smaller headings:

# Heading 1
## Heading 2
### Heading 3

Paragraphs

Separate paragraphs with a blank line. A single line break without a blank line will not start a new paragraph.

Bold and italics

This is **bold** and this is *italic*.

Lists

Unordered lists use -:

- first item
- second item
- third item

Numbered lists use 1., 2., etc.:

1. first step
2. second step
3. third step

Inline code

Wrap code in single backticks:

Use the `git push` command to upload your work.

Links

[GitHub](https://github.com)

Markdown reference

GitHub has a concise guide to GitHub-flavoured markdown: Basic writing and formatting syntax.

Write your first lab diary

Create your week 1 diary entry now. Open diaries/lab-01.md in any text editor and write ~150 words covering:

What this lab covered and what you did.
A connection to the week's lecture content.
One thing you found useful, surprising, or challenging.

Use at least one heading, one bold or italic word, and one list. When you are done, stage, commit, and push:

git add diaries/lab-01.md
git commit -m "add lab 01 diary"
git push

Check your repository on GitHub to confirm the file appears and the markdown renders correctly.

Editors

RStudio is your primary editor for this course. For general-purpose code editing (markdown files, scripts, configuration), VS Code and Zed are good options.

Alternative: GitHub Desktop

If you prefer a graphical interface, download GitHub Desktop. It provides the same stage/commit/push workflow with buttons instead of terminal commands. Either approach is fine for this course.

Lab 2: Install R and RStudio

R script

Download the R script for this lab (right-click → Save As)

This session introduces R and RStudio.

Why learn R?

You will need it for your final report (if you choose the report option).
It supports your psychology coursework.
It enhances your coding skills, which will help you in many domains of work, including utilising AI (!).

Installing R

Visit CRAN at https://cran.r-project.org/.
Select the version for your operating system (Windows, Mac, or Linux).
Download and install by following the on-screen instructions.

Installing RStudio

See Johannes Karl's video for a walkthrough.

Step 1: Install RStudio

Go to https://www.rstudio.com/products/rstudio/download/.
Choose the free version of RStudio Desktop and download it for your operating system.
Install RStudio Desktop.
Open RStudio to begin setting up your project environment.

Step 2: Create a new project

In RStudio, go to File > New Project.
Choose New Directory for a new project or Existing Directory if you have a folder ready.
For a new project, select New Project, then provide a directory name.
Specify the location where the project folder will be created.
Click Create Project.

Order your workflow

Use a clear folder structure.
If you are using GitHub, create a location on your machine (not Dropbox).
If you are not using GitHub, choose the cloud (Dropbox or similar).
When creating new files and scripts, use clear labels that anyone could understand. That "anyone" will be your future self.

Step 3: Give the project structure

Within your project, create folders to organise scripts and data. Common folder names include R/ for R scripts, data/ for datasets, and doc/ for documentation.
To create a new R script, go to File > New File > R Script. Save the script in your R/ folder with a meaningful file name.
If you set up Git in week 1, you can initialise a Git repository when creating a new project to track your R work.

Step 4: Working with R scripts

Write your R code in the script editor. Execute code by selecting lines and pressing Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).
Use comments (preceded by #) to document your code.
Save your scripts regularly (Ctrl + S or Cmd + S).

Step 5: When you exit RStudio

Before concluding your work, save your workspace or clear it to start fresh (Session > Restart R).

Workflow habits

Use clearly defined script names.
Annotate your code.
Save your scripts often (Ctrl + S or Cmd + S).

Basic R Commands

How to copy code from this page

Open File > New File > R Script in RStudio.
Name and save your new R script.
Copy the code blocks below into your script.
Save: Ctrl + S or Cmd + S.

Assignment (`<-`)

Assignment in R uses the <- operator:

x <- 10 # assigns the value 10 to x
y <- 5  # assigns the value 5 to y

RStudio shortcut for <-

macOS: Option + - (minus key)
Windows/Linux: Alt + - (minus key)

Concatenation (`c()`)

The c() function combines multiple elements into a vector:

numbers <- c(1, 2, 3, 4, 5)
print(numbers)

Operations (`+`, `-`)

x <- 10
y <- 5

sum <- x + y
print(sum)

difference <- x - y
difference

Executing code

Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).

Multiplication (`*`) and Division (`/`)

product <- x * y
product

quotient <- x / y
quotient

# element-wise operations on vectors
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
vector_product <- vector1 * vector2
vector_product

vector_division <- vector1 / vector2
vector_division

Be mindful of division by zero: 10 / 0 returns Inf, and 0 / 0 returns NaN.

# integer division and modulo
integer_division <- 10 %/% 3  # 3
remainder <- 10 %% 3          # 1

`rm()` Remove Object

devil_number <- 666
devil_number
rm(devil_number)

Logic (`!`, `!=`, `==`)

x_not_y <- x != y     # TRUE
x_equal_10 <- x == 10 # TRUE

OR (`|` and `||`)

# element-wise OR
vector_or <- c(TRUE, FALSE) | c(FALSE, TRUE) # c(TRUE, TRUE)

# single OR (first element only)
single_or <- TRUE || FALSE # TRUE

AND (`&` and `&&`)

# element-wise AND
vector_and <- c(TRUE, FALSE) & c(FALSE, TRUE) # c(FALSE, FALSE)

# single AND (first element only)
single_and <- TRUE && FALSE # FALSE

RStudio workflow shortcuts

Execute code line: Cmd + Return (Mac) or Ctrl + Enter (Win/Linux)
Insert section heading: Cmd + Shift + R (Mac) or Ctrl + Shift + R
Align code: Cmd + Shift + A (Mac) or Ctrl + Shift + A
Comment/uncomment: Cmd/Ctrl + Shift + C
Save all: Cmd/Ctrl + Shift + S
Find/replace: Cmd/Ctrl + F
New file: Cmd/Ctrl + Shift + N
Auto-complete: Tab

For more, explore Tools > Command Palette or Shift + Cmd/Ctrl + P.

Data Types in R

Integers

x <- 42L
str(x)         # int 42

y <- as.numeric(x)
str(y)         # num 42

Integers are useful for counts or indices that do not require fractional values.

Characters

name <- "Alice"

Characters represent text: names, labels, descriptions.

Factors

colors <- factor(c("red", "blue", "green"))

Factors represent categorical data with a limited set of levels.

Ordered factors

education_levels <- c("high school", "bachelor", "master", "ph.d.")

# unordered
education_factor_no_order <- factor(education_levels, ordered = FALSE)

# ordered
education_factor <- factor(education_levels, ordered = TRUE)
education_factor

Ordered factors support logical comparisons based on level order:

edu1 <- ordered("bachelor", levels = education_levels)
edu2 <- ordered("master", levels = education_levels)
edu2 > edu1  # TRUE

Strings

you <- "world!"
greeting <- paste("hello,", you)
greeting  # "hello, world!"

Vectors

Vectors are homogeneous: all elements must be of the same type.

numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "cherry")
logical_vector <- c(TRUE, FALSE, TRUE, FALSE)

Manipulating vectors

vector_sum <- numeric_vector + 10
vector_multiplication <- numeric_vector * 2
vector_greater_than_three <- numeric_vector > 3

Accessing elements:

first_element <- numeric_vector[1]
some_elements <- numeric_vector[c(2, 4)]

Functions with vectors

mean(numeric_vector)
sum(numeric_vector)
sort(numeric_vector)
unique(character_vector)

Data Frames

Creating data frames

df <- data.frame(
  name = c("alice", "bob", "charlie"),
  age = c(25, 30, 35),
  gender = c("female", "male", "male")
)
head(df)
str(df)

Accessing elements

# by column name
names <- df$name

# by row and column
second_person <- df[2, ]
age_column <- df[, "age"]

# by condition
very_old_people <- subset(df, age > 25)
mean(very_old_people$age)

Exploring data frames

head(df)    # first six rows
tail(df)    # last six rows
str(df)     # structure
summary(df) # summary statistics

Manipulating data frames

# adding columns
df$employed <- c(TRUE, TRUE, FALSE)

# adding rows
new_person <- data.frame(name = "diana", age = 28, gender = "female", employed = TRUE)
df <- rbind(df, new_person)

# modifying values
df[4, "age"] <- 26

# removing columns
df$employed <- NULL

# removing rows
df <- df[-4, ]

`rbind()` and `cbind()`

rbind() combines data frames by rows; cbind() combines by columns. When using these functions, column names (for rbind) or row counts (for cbind) must match. We will use dplyr for more flexible joining in later weeks.

Summary statistics

set.seed(12345)
vector <- rnorm(n = 40, mean = 0, sd = 1)
mean(vector)
sd(vector)
min(vector)
max(vector)

`table()` for categorical data

set.seed(12345)
gender <- sample(c("male", "female"), size = 100, replace = TRUE, prob = c(0.5, 0.5))
education_level <- sample(c("high school", "bachelor", "master"), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2))
df_table <- data.frame(gender, education_level)
table(df_table)
table(df_table$gender, df_table$education_level)  # cross-tabulation

First Data Visualisation with `ggplot2`

ggplot2 is based on the Grammar of Graphics: you build plots layer by layer.

library(ggplot2)

set.seed(12345)
student_data <- data.frame(
  name = c("alice", "bob", "charlie", "diana", "ethan", "fiona", "george", "hannah"),
  score = sample(80:100, 8, replace = TRUE),
  stringsAsFactors = FALSE
)
student_data$passed <- ifelse(student_data$score >= 90, "passed", "failed")
student_data$passed <- factor(student_data$passed, levels = c("failed", "passed"))
student_data$study_hours <- sample(5:15, 8, replace = TRUE)

Bar plot

ggplot(student_data, aes(x = name, y = score, fill = passed)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
  labs(title = "student scores", x = "student name", y = "score") +
  theme_minimal()

Scatter plot

ggplot(student_data, aes(x = study_hours, y = score, color = passed)) +
  geom_point(size = 4) +
  scale_color_manual(values = c("failed" = "red", "passed" = "blue")) +
  labs(title = "scores vs. study hours", x = "study hours", y = "score") +
  theme_minimal()

Box plot

ggplot(student_data, aes(x = passed, y = score, fill = passed)) +
  geom_boxplot() +
  scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
  labs(title = "score distribution by pass/fail status", x = "status", y = "score") +
  theme_minimal()

Histogram

ggplot(student_data, aes(x = score, fill = passed)) +
  geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
  scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
  labs(title = "histogram of scores", x = "score", y = "count") +
  theme_minimal()

Line plot (time series)

months <- factor(month.abb[1:8], levels = month.abb[1:8])
study_hours <- c(0, 3, 15, 30, 35, 120, 18, 15)
study_data <- data.frame(month = months, study_hours = study_hours)

ggplot(study_data, aes(x = month, y = study_hours, group = 1)) +
  geom_line(linewidth = 1, color = "blue") +
  geom_point(color = "red", size = 1) +
  labs(title = "monthly study hours", x = "month", y = "study hours") +
  theme_minimal()

Base R Graphs

Base R provides plotting functions without additional packages.

# scatter plot
plot(student_data$study_hours, student_data$score,
     main = "scores vs. study hours", xlab = "study hours", ylab = "score",
     pch = 19, col = ifelse(student_data$passed == "passed", "blue", "red"))

# histogram
hist(student_data$score, breaks = 5, col = "skyblue",
     main = "histogram of student scores", xlab = "scores", border = "white")

# box plot
boxplot(score ~ passed, data = student_data,
        main = "score distribution by pass/fail status",
        xlab = "status", ylab = "scores", col = c("red", "blue"))

Summary of Today's Lab

This session covered:

Installing and setting up R and RStudio
Basic arithmetic operations
Data structures: vectors, factors, data frames
Data visualisation with ggplot2 and base R

Where to get help

Large language models: LLMs are effective coding tutors. Help from LLMs for coding does not constitute a breach of academic integrity in this course. For your final report, cite all sources including LLMs.
Stack Overflow: outstanding community resource.
Cross Validated: best place for statistics advice.
Developer websites: Tidyverse.
Your tutors and course coordinator.

Appendix A: At-Home Exercises

Exercise 1: Install the tidyverse package

Open RStudio.
Go to Tools > Install Packages....
Type tidyverse and check Install dependencies.
Click Install.
Load with library(tidyverse).

Exercise 2: Install the parameters and report packages

Go to Tools > Install Packages....
Type parameters, report.
Check Install dependencies and click Install.

Exercise 2b: Install the causalworkshop package

The causalworkshop package provides simulated data for your research report. Run the following in your R console:

install.packages("remotes")
remotes::install_github("go-bayes/causalworkshop@v0.2.1")

Verify the installation:

library(causalworkshop)
d <- simulate_nzavs_data(n = 100, seed = 2026)
head(d)

Exercise 3: Basic operations

Create vector_a <- c(2, 4, 6, 8) and vector_b <- c(1, 3, 5, 7).
Add them, subtract them, multiply vector_a by 2, divide vector_b by 2.
Calculate the mean and standard deviation of both.

Exercise 4: Working with data frames

Create a data frame with columns id (1–4), name, score (88, 92, 85, 95).
Add a passed column (pass mark = 90).
Extract name and score of students who passed.
Explore with summary(), head(), str().

Exercise 5: Logical operations and subsetting

Subset student_data to find students who scored above the mean.
Create an attendance vector and add it as a column.
Subset to select only rows where students were present.

Exercise 6: Cross-tabulation

Create factor variables fruit and colour.
Make a data frame and use table() for cross-tabulation.
Which fruit has the most colour variety?

Exercise 7: Visualisation with ggplot2

Using student_data, create a bar plot of scores by name.
Add a title, axis labels, and colour by pass/fail status.

Appendix B: Solutions

Solution 3: Basic operations

vector_a <- c(2, 4, 6, 8)
vector_b <- c(1, 3, 5, 7)

sum_vector <- vector_a + vector_b
diff_vector <- vector_a - vector_b
double_vector_a <- vector_a * 2
half_vector_b <- vector_b / 2

mean(vector_a); sd(vector_a)
mean(vector_b); sd(vector_b)

Solution 4: Working with data frames

student_data <- data.frame(
  id = 1:4,
  name = c("alice", "bob", "charlie", "diana"),
  score = c(88, 92, 85, 95),
  stringsAsFactors = FALSE
)
student_data$passed <- student_data$score >= 90
passed_students <- student_data[student_data$passed == TRUE, ]
summary(student_data)

Solution 5: Logical operations and subsetting

mean_score <- mean(student_data$score)
students_above_mean <- student_data[student_data$score > mean_score, ]

attendance <- c("present", "absent", "present", "present")
student_data$attendance <- attendance
present_students <- student_data[student_data$attendance == "present", ]

Solution 6: Cross-tabulation

fruit <- factor(c("apple", "banana", "apple", "orange", "banana"))
colour <- factor(c("red", "yellow", "green", "orange", "green"))
fruit_data <- data.frame(fruit, colour)
table(fruit_data$fruit, fruit_data$colour)
# apple has the most colour variety (red, green)

Solution 7: Visualisation

library(ggplot2)
ggplot(student_data, aes(x = name, y = score, fill = passed)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red")) +
  labs(title = "student scores", x = "name", y = "score") +
  theme_minimal()

Appendix C: Other Data Types

Arrays and matrices

matrix_1 <- matrix(1:9, nrow = 3)
array_1 <- array(1:12, dim = c(2, 3, 2))

# convert matrix to data frame
df_matrix_1 <- data.frame(matrix_1)
colnames(df_matrix_1) <- c("col_1", "col_2", "col_3")

Lists

my_list <- list(name = "John Doe", age = 30, scores = c(90, 80, 70))

# access elements
my_list$name
my_list[["scores"]]

# modify
my_list$gender <- "Male"
my_list$age <- 31
my_list$scores <- NULL

# lists as function return values
calculate_stats <- function(numbers) {
  list(mean = mean(numbers), sum = sum(numbers))
}
results <- calculate_stats(c(1, 2, 3, 4, 5))

Lab 3: Regression, Graphing, and Simulation

R script

Download the R script for this lab (right-click → Save As)

Simulating data: Outcome ~ Treatment

Step 1: Set up your R environment

Ensure R or RStudio is installed and open.

Step 2: Set a seed for reproducibility

set.seed(123)

Step 3: Simulate continuous data

n <- 100
mean <- 50
sd <- 10
data_continuous <- rnorm(n, mean, sd)
head(data_continuous)
hist(data_continuous)

Step 4: Simulate categorical data

levels <- c("Male", "Female")
data_categorical <- sample(levels, n, replace = TRUE)
table(data_categorical)

# with unequal probabilities
data_categorical_unequal <- sample(levels, n, replace = TRUE, prob = c(0.3, 0.7))
table(data_categorical_unequal)

Simulating outcomes from treatments

library(tidyverse)
library(ggplot2)
library(patchwork)
library(parameters)
library(report)
library(ggeffects)

set.seed(123)
groupA_scores <- rnorm(100, mean = 100, sd = 15)
groupB_scores <- rnorm(100, mean = 105, sd = 15)

df_scores <- data.frame(
  group = rep(c("A", "B"), each = 100),
  scores = c(groupA_scores, groupB_scores)
)

df_scores_1 <- df_scores |>
  mutate(group = as.factor(group))

Visualising simulated data

ggplot(df_scores_1, aes(x = group, y = scores, fill = group)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "score distribution by group", x = "group", y = "scores")

Histogram

ggplot(df_scores_1, aes(x = scores, fill = group)) +
  geom_histogram(binwidth = 5, color = "black") +
  labs(title = "distribution of scores by group", x = "scores", y = "frequency") +
  facet_wrap(~group, ncol = 1) +
  theme_minimal()

Exercise 1

Modify the simulation parameters to change each group's mean and standard deviation. Observe how these changes affect the distribution.
Experiment with different bin widths in the histogram. How do large and small bin widths speak differently to the data?

Simulating data for statistical tests

# one-sample t-test
set.seed(123)
data <- rnorm(100, mean = 5, sd = 1)
mod_first_test <- t.test(data, mu = 4)
parameters::parameters(mod_first_test)
report::report(mod_first_test)

# two-sample t-test
group1 <- rnorm(50, mean = 5, sd = 1)
group2 <- rnorm(50, mean = 5.5, sd = 1)
mod_t_test_result <- t.test(group1, group2)
report::report(mod_t_test_result)

# paired t-test
pre_test <- rnorm(30, mean = 80, sd = 10)
post_test <- rnorm(30, mean = pre_test + 5, sd = 5)
mod_pre_post <- t.test(pre_test, post_test, paired = TRUE)
report::report(mod_pre_post)

Understanding regression using simulation

set.seed(123)
n <- 100

treatment <- rnorm(n, mean = 50, sd = 10)
beta_a <- 2
outcome <- 5 + beta_a * treatment + rnorm(n, mean = 0, sd = 20)

df <- data.frame(treatment = treatment, outcome = outcome)

fit <- lm(outcome ~ treatment, data = df)
summary(fit)

predicted_values <- ggeffects::ggemmeans(fit, terms = c("treatment"))
plot(predicted_values, dot_alpha = 0.35, show_data = TRUE, jitter = .1)

Equivalence of ANOVA and regression

set.seed(123)
n <- 90
k <- 3

group <- factor(rep(1:k, each = n/k))
means <- c(100, 100, 220)
sd <- 15
y <- rnorm(n, mean = rep(means, each = n/k), sd = sd)
df_1 <- cbind.data.frame(y, group)

# ANOVA
anova_model <- aov(y ~ group, data = df_1)
report::report(anova_model)

# regression (equivalent)
fit_regression <- lm(y ~ group, data = df_1)
parameters::model_parameters(fit_regression)

Combination plots

library(patchwork)

coefficient_plot <- plot(parameters::model_parameters(fit_regression))

predictive_plot <- plot(
  ggeffects::ggpredict(fit_regression, terms = "group"),
  dot_alpha = 0.35, show_data = TRUE, jitter = .1, colors = "reefs"
) +
  scale_y_continuous(limits = c(0, 260)) +
  labs(title = "predictive graph", x = "treatment group", y = "response")

my_first_combination_plot <- coefficient_plot / predictive_plot +
  plot_annotation(title = "coefficient and predictive plots", tag_levels = "A")

my_first_combination_plot

Upshot

ANOVA partitions variance into between-group and within-group components. Regression estimates the mean of the dependent variable as a linear function of independent variables. For many questions ANOVA is appropriate, but when comparing groups we often want finer-grained comparisons. Note: comparisons do not establish causality.

Exercise 2

Simulate two continuous variables Y and A with n = 100. Variable A should have mean 50, sd 10.
Define the effect of A on Y as beta_A <- 2.
Include an error term with sd 20.
Run lm(Y ~ A) and report results.

Template:

library(parameters)
set.seed()  # pick a number

n <- 100
A <- rnorm(n, mean = 50, sd = 10)
beta_A <- 2

df_3 <- data.frame(
  A = A,
  Y = 5 + beta_A * A + rnorm(n, mean = 0, sd = 20)
)

fit_3 <- lm(Y ~ A, data = df_3)
parameters::model_parameters(fit_3)
report::report(fit_3)

What you have learned

Data simulation: simulating datasets in R for exploring statistical concepts and causal inference.
Data visualisation: expanding your capacity for visualisation with ggplot2.
Statistical tests: t-tests, ANOVA, and regression.
ANOVA and regression: their equivalence and when to prefer regression.

Package references

ggplot2: declarative graphics based on the Grammar of Graphics.
Parameters: utilities for processing model parameters.
Report: automated report generation from statistical models.

Appendix A: Solution to Exercise 2

library(parameters)
set.seed(12345)

n <- 100
A <- rnorm(n, mean = 50, sd = 10)
beta_A <- 2

df_3 <- data.frame(
  A = rnorm(n, mean = 50, sd = 10),
  Y = 5 + beta_A * A + rnorm(n, mean = 0, sd = 20)
)

fit_3 <- lm(Y ~ A, data = df_3)
parameters::model_parameters(fit_3)
report::report(fit_3)

Appendix B: ANOVA post hoc comparisons

tukey_post_hoc <- TukeyHSD(anova_model)
print(tukey_post_hoc)
plot(tukey_post_hoc)

Appendix C: Adding complexity in simulation

data_frame <- data.frame(
  ID = 1:n,
  Gender = sample(c("Male", "Female"), n, replace = TRUE),
  Age = rnorm(n, mean = 30, sd = 5),
  Income = rnorm(n, mean = 50000, sd = 10000)
)

# income as a function of age
intercept <- 20000
beta_age <- 1500
error_sd <- 10000
Age <- rnorm(n, mean = 30, sd = 5)
Income <- intercept + beta_age * Age + rnorm(n, mean = 0, sd = error_sd)
data_complex <- data.frame(Age, Income)

ggplot(data_complex, aes(x = Age, y = Income)) +
  geom_point() +
  theme_minimal() +
  labs(title = "simulated age vs. income", x = "age", y = "income")

Appendix D: Causal Inference Glossary

Download the glossary of key terms in causal inference (PDF)

Lab 4: Regression and Confounding Bias

R script

Download the R script for this lab (right-click → Save As)

Packages

# install and load packages
if (!require(devtools, quietly = TRUE)) {
  install.packages("devtools")
  library(devtools)
}

library("margot")
library("tidyverse")
library("splines")
library("report")
library("performance")
library("ggeffects")
library("sjPlot")
library("dplyr")
library("ggplot2")
library("skimr")
library("gtsummary")
library("table1")
library("kableExtra")
library("patchwork")
library("parameters")

Regression

This week we introduce regression.

Learning Outcomes

By learning regression, you will be better equipped to do psychological science and to evaluate psychological research.

What is Regression?

Broadly speaking, a regression model is a method for inferring the expected average features of a population and its variance conditional on other features of the population as measured in a sample.

We shall see that regression encompasses more than this definition; however, this definition makes a start.

To understand regression, we need to understand the following jargon words: population, sample, measurement, and inference.

What is a population?

In science, a population is a hypothetical construct. It is the set of all potential members of a set of things. In psychological science, that set is typically a collection of individuals. We want to understand "The population of all human beings?" or "The New Zealand adult population"; or "The population of undergraduates who may be recruited for IPRP in New Zealand."

What is a sample?

A sample is a randomly realised sub-population from the larger abstract population that a scientific community hopes to generalise about.

Think of selecting balls randomly from an urn. When pulled at random, the balls may inform us about the urn's contents. For example, if we select one white ball and one black ball, we may infer that the balls in the urn are not all white or all black.

What is "measurement"?

A measure is a tool or method for obtaining numerical descriptions of a sample. We often call measures "scales."

A measurement is the numerical description we obtain from sensors such as statistical surveys, census data, twitter feeds, and so forth.

In the course, we have encountered numerical scales, ordinal scales, and factors. The topic of measurement in psychology is very broad. As we shall see, the trail of the serpent of measurement runs across comparative psychological research.

It is essential to remember that measures can be prone to error. Error-prone scales may nevertheless be helpful. However, we need to investigate their utility against the backdrop of specific interests and purposes.

What is a parameter?

In regression, we combine measurements on samples with probability theory to guess about the properties of a population we will never observe. We call these properties "parameters."

What is statistical inference?

The bulk of statistical inference consists of educated guessing about population parameters.

Probability distributions and statistical guessing

Inference is possible because the parameters of naturally occurring populations are structured by data-generating processes that are approximated by probability distributions. A probability distribution is a mathematical function describing a random event's probability. Today we will be focusing on height.¹

Today we will discuss the "normal" or "Gaussian distribution." A large number of data-generating processes in nature conform to the normal distribution.

Let us consider some examples of randomly generated samples, which we will obtain using R's rnorm function.

10-person sample of heights

# seed
set.seed(123)

# generate 10 samples, average 170, sd = 20
draws_10 <- rnorm(10, mean = 170, sd = 20)

# ggplot quick histogram
ggplot2::qplot(draws_10, binwidth = 2)

100-person sample of heights

# reproducibility
set.seed(123)

# generate 100 samples, average 170, sd = 20
draws_100 <- rnorm(100, mean = 170, sd = 20)

# graph
ggplot2::qplot(draws_100, binwidth = 2)

10000-person sample of heights

# reproducibility
set.seed(123)

# n = 10,000
draws_10000 <- rnorm(1e5, mean = 170, sd = 20)

# plot
ggplot2::qplot(draws_10000, binwidth = 2)

How can I use regression to infer a population parameter?

We can use R to investigate the average height of our imaginary population from which the preceding samples were randomly drawn. We do this in R by writing an "intercept-only" model as follows:

# syntax for an intercept-only model
model <- lm(outcome ~ 1, data = data)

# base R summary
summary(model)

Using the previous simulations:

N = 10 random draws

# write the model and get a nice table for it
sjPlot::tab_model(lm(draws_10 ~ 1))

N = 100 random draws

sjPlot::tab_model(lm(draws_100 ~ 1))

N = 10,000 random draws

sjPlot::tab_model(lm(draws_10000 ~ 1))

What do we notice about the relationship between sample size and the estimated population average?

sjPlot::tab_model(lm(draws_10 ~ 1),
                  lm(draws_100 ~ 1),
                  lm(draws_10000 ~ 1))

Regression with a single covariate

Do mothers' heights predict daughter height? If so, what is the magnitude of the relationship?

Francis Galton is credited with inventing regression analysis. Galton observed that offspring's heights tend to fall between parental height and the population average, which Galton termed "regression to the mean." Galton sought a method for educated guessing about heights, and this led to fitting a line of regression by a method called "least squares." (For a history, see here.)

The following dataset is from "The heredity of height" by Karl Pearson and Alice Lee (1903) (Pearson & Lee, 1903). I obtained it from (Gelman et al., 2020). Let us use this dataset to investigate the relationship between mothers' and daughters' heights.

# import data
df_pearson_lee <-
  data.frame(read.table(
    url(
      "https://raw.githubusercontent.com/avehtari/ROS-Examples/master/PearsonLee/data/MotherDaughterHeights.txt"
    ),
    header = TRUE
  ))

# centre mother's height for later example
df_pearson_lee_centered <- df_pearson_lee |>
  dplyr::mutate(mother_height_c = as.numeric(scale(
    mother_height, center = TRUE, scale = FALSE
  )))

skimr::skim(df_pearson_lee_centered)

Pearson and Lee collected 5,524 observations from mother/daughter height pairs. Let us examine the data, first by plotting the relationship.

What is happening here?

# explore pearson lee data
explore_md <-
  ggplot2::ggplot(data = df_pearson_lee_centered,
                  aes(y = daughter_height, x = mother_height)) +
  geom_jitter(alpha = .2) +
  labs(title = "The relationship between mother's height and daughter's height") +
  ylab("Daughter's height") +
  xlab("Mother's height") +
  theme_classic()

# print
print(explore_md)

Is there a linear predictive relationship between these two parameters? In regression we examine the line of best fit.

# regression
m1 <-
  lm(daughter_height ~ mother_height, data = df_pearson_lee_centered)

# graph
sjPlot::tab_model(m1)

We can plot the coefficient, but in a model with one predictor, this is not very informative. However, as we continue in the course, we shall see that plotting coefficients can be easier than deciphering the numbers in tables. Here are two methods for plotting.

# get model parameters
t_m1 <- parameters::model_parameters(m1, ci = 0.95)

# plot
plot(t_m1) +
  labs(title = "The relationship between mother's height and daughter's height") +
  ylab("Daughter's height")

How do we interpret the regression model?

Let us write the equation out in mathematics. How do we read this?²

$daughter_height = α + β_{1} (mother_height) + ϵ$

The maths says that the expected daughter's height in a population is predicted by the average height of the population when mothers' heights are set to zero units (note, this is impossible; we shall come back to this) plus $β \times$ units of daughter's height (inches) for each additional unit of mother's height (inches).

We can plug the output of the model directly into the equation as follows:

$daughter_height = 29.8 + 0.54 (mother_height)$

Graph the relationship between mother's and daughter's heights

library(ggeffects)
predictions <- ggeffects::ggpredict(m1, terms = "mother_height",
    add.data = TRUE,
    dot.alpha = .1,
    jitter = TRUE)

plot_predictions <-
  plot(predictions) +
  theme_classic() +
  labs(title = "Predicted values of daughter's height from the Pearson/Fox 1903 dataset")

plot_predictions

Regression to predict beyond the range of a dataset

Jyoti Amge is the world's shortest woman at 25 inches. Sandy Allen was the world's tallest woman at 91 inches. What are the expected heights of their daughters and of every intermediary woman in between?

# use the expand.grid command to create a sequence of points for mother's height
df_expand_grid <- expand.grid(mother_height = c(25:91))

# use the predict function to create a new response
df_predict <-
  predict(m1,
          type = "response",
          interval = "confidence",
          newdata = df_expand_grid)

# create a new dataframe for the new sequence of points for mother's height and the predicted data
newdata <- data.frame(df_expand_grid, df_predict)
head(newdata)

Graph the predicted results:

# graph the expected results
predplot <- ggplot(data = newdata,
                   aes(x = mother_height, y = fit)) +
  geom_point() +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = .1) +
  expand_limits(x = c(20, 91), y = c(0, 81)) +
  theme_classic() +
  labs(title = "Predicted values for a broader population")

# plot the two graphs together (making the x and y axis at the same scale)
library("patchwork")

# old plot with the new axis and y-axis scales, and remove points
plot_height <- plot(predplot, add.data = FALSE) + theme_classic()

plot_height_title <-
  plot_height +
  expand_limits(x = c(20, 91), y = c(0, 81)) +
  labs(title = "Predicted values of daughter's height from the Pearson/Fox 1903 dataset")

# double graph
plot_height_title / predplot +
  plot_annotation(title = "What do you notice about these relationships?",
                  tag_levels = "a")

A simple method for obtaining the predicted values from your fitted model is to obtain the effects output without producing a graph.

# predicted values of mother height on daughter height
library(ggeffects)
ggeffects::ggpredict(m1, terms = "mother_height")

Non-linear relationships

Linear regression assumes linearity conditional on a model. Often your data will not be linear!

Consider the following example:

# simulate nonlinear relationship between x and y
b <- c(2, 0.75)
set.seed(12)
x <- rnorm(100)
set.seed(12)
y <- rnorm(100, mean = b[1] * exp(b[2] * x))
dat1 <- data.frame(x, y)

ot1 <- lm(y ~ x, data = dat1)

# plot linear effect
plot(ggeffects::ggpredict(ot1, terms = "x",
     add.data = TRUE,
     dot.alpha = .4))

Non-linear relationship as modelled by a polynomial regression:

# model: quadratic
fit_non_linear <- lm(y ~ x + I(x ^ 2), data = dat1)

# predictive plot
plot(ggeffects::ggpredict(fit_non_linear, terms = "x",
     add.data = TRUE,
     dot.alpha = .4))

Here is another approach:

# non-linear regression
library(splines)

# fit model
fit_non_linear_b <- lm(y ~ x + poly(x, 2), data = dat1)

# graph model
plot(
  ggeffects::ggpredict(fit_non_linear_b, terms = "x",
  add.data = TRUE,
  dot.alpha = .4
))

Non-linear relationship as modelled by a general additive model (spline):

# fit spline: not specified
fit_non_linear_c <- lm(y ~ bs(x), data = dat1)

# model parameters: coefficients are not interpretable
parameters::model_parameters(fit_non_linear_c)

plot(
  ggeffects::ggpredict(fit_non_linear_c, terms = "x",
  add.data = TRUE,
  dot.alpha = .4
))

Centring

Any linear transformation of a predictor is acceptable. Often we centre (or centre and scale) all indicators, which gives us an interpretable intercept (the expected population average when the other indicators are set to their average).

library(ggeffects)

# fit raw data
fit_raw <- lm(daughter_height ~ mother_height, data = df_pearson_lee_centered)

# fit centred data
fit_centered <-
  lm(daughter_height ~ mother_height_c, data = df_pearson_lee_centered)

# compare the models
sjPlot::tab_model(fit_raw, fit_centered)

Graph model:

# graph centred model
plot(
  ggeffects::ggpredict(fit_centered, terms = "mother_height_c",
  add.data = TRUE,
  dot.alpha = .4
))

Note: when fitting a polynomial or any interaction, it is important to centre your indicators. We shall come back to this point in later lectures.

Model evaluation

Reviewers will sometimes ask you to assess model fit.

A simple, but flawed way to assess accuracy of your model fit is to compare a model with one covariate with a simple intercept-only model and to assess improvement in either the AIC statistic or the BIC statistic. The BIC is similar to the AIC but adds a penalty for extra predictors. An absolute improvement in either statistic of n > 10 is considered to be a "better" model.

We can use the performance package to generate a table that compares model fits.

# load library
library(performance)

# intercept only
fig_intercept_only <- lm(daughter_height ~ 1, data = df_pearson_lee)

# covariate added
fig_covariate <- lm(daughter_height ~ mother_height, data = df_pearson_lee)

# evaluate
performance::compare_performance(fig_intercept_only, fig_covariate)

What was the model "improvement?"

# improved fit
BIC(fig_intercept_only) - BIC(fig_covariate)

Generate a report

This is easy with the report package.

For example:

report::report_statistics(fig_covariate)

Or, if you want a longer report:

report::report(fig_covariate)

Use "statistically significant" in place of "significant." This will avoid misleading your audience into thinking your result is important when what you intend to communicate is that it is reliable.

Assumptions of Regression

From Gelman and Hill (Gelman & Hill, 2006):

Validity. The most important assumption is that the data you are analysing should map to the research question you are trying to answer (Gelman et al., 2020, p. 152).
Representativeness. A regression model is fit to data and is used to make inferences about a larger population, hence the implicit assumption in interpreting regression coefficients is that the sample is representative of the population (Gelman et al., 2020, p. 153).
Linearity. The most important mathematical assumption of the linear regression model is that its deterministic component is a linear function of the separate predictors: $y = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots$ . If additivity is violated, it might make sense to transform the data (for example, if $y = ab c$ , then $lo g y = lo g a + lo g b + lo g c$ ) or to add interactions. If linearity is violated, perhaps a predictor should be put in as $1/ x$ or $lo g (x)$ instead of simply linearly. Or a more complicated relationship could be expressed using a nonlinear function such as a spline or Gaussian process (Gelman et al., 2020, p. 153).
Independence of errors. The simple regression model assumes that the errors from the prediction line are independent, an assumption that is violated in time-series, spatial, and multilevel settings (Gelman et al., 2020, p. 153).
Equal variance of errors. Unequal variance does not affect the most important aspect of a regression model, which is the form of the predictors (Gelman et al., 2020, p. 153).
Normality of errors. The regression assumption that is generally least important is that the errors are normally distributed. In fact, for the purpose of estimating the regression line (as compared to predicting individual data points), the assumption of normality is barely important at all. Thus, in contrast to many regression textbooks, we do not recommend diagnostics of the normality of regression residuals (Gelman & Hill, 2006, p. 46). A good way to diagnose violations of some of the assumptions (importantly, linearity) is to plot the residuals versus fitted values or individual predictors (Gelman & Hill, 2006, p. 46).

Common Confusions

Regressions do not automatically give us "effects"

People use the word "effect", but that is not what regression gives us by default.

"Normality Assumption"

Gelman and Hill note that the "normality" assumption is the least important. The assumption pertains to the normality of residuals.

Statistical Independence

This is the motivation for doing multi-level modelling: to condition on dependencies in the data. However, multi-level modelling can produce new problems if the error terms in the model are not independent of the outcome.

External Validity (wrong population)

We sample from undergraduates, but infer about the human population.

Model fit and causal inference

The concept of "better model fit" is relative to our interests and purposes. A better (lower) AIC statistic does not tell us whether a model is better for causal inference. We must assess whether a model satisfies the assumptions necessary for valid causal inference.

Simulation Demonstration 1: How Regression Coefficients Mislead

Methodology

Data generation: we simulate a dataset for 1,000 individuals, where religious service attendance ( $A$ ) affects wealth ( $L$ ), which in turn affects charitable donations ( $Y$ ). The simulation is based on predefined parameters that establish $L$ as a mediator between $A$ and $Y$ .
Parameter definitions: the probability of religious service ( $A$ ) is set at 0.5. The effect of $A$ on $L$ (wealth) is given by $β = 2$ . The effect of $L$ on $Y$ (charity) is given by $δ = 1.5$ . Standard deviations for $L$ and $Y$ are set at 1 and 1.5, respectively.
Model specifications: Model 1 (incorrect assumption) considers a linear regression model assuming $L$ as a confounder, including $A$ and $L$ as regressors on $Y$ . This model aligns with the data-generating process and correctly identifies $L$ as a mediator. According to the rules of d-separation (last week): to identify the total effect of $A$ on $Y$ we must not include $L$ . Model 2 (correct model) fits a linear regression model that includes only $A$ as a regressor on $Y$ and omits the mediator $L$ .
Analysis and comparison: the analysis compares the estimated effects of $A$ on $Y$ under both model specifications. By including $L$ as a predictor in Model 1, we induce mediation bias. Model 2 correctly excludes $L$ from the model.
Presentation: the results are displayed in a comparative table. The table contrasts the regression coefficients and significance levels obtained under each model.

# simulation seed
set.seed(123) # reproducibility

# define the parameters
n <- 1000 # number of observations
p <- 0.5  # probability of A = 1
alpha <- 0 # intercept for L
beta <- 2  # effect of A on L
gamma <- 1 # intercept for Y
delta <- 1.5 # effect of L on Y
sigma_L <- 1 # standard deviation of L
sigma_Y <- 1.5 # standard deviation of Y

# simulate the data: fully mediated effect by L
A <- rbinom(n, 1, p) # binary exposure variable
L <- alpha + beta * A + rnorm(n, 0, sigma_L) # mediator L affected by A
Y <- gamma + delta * L + rnorm(n, 0, sigma_Y) # Y affected only by L

# make the data frame
data <- data.frame(A = A, L = L, Y = Y)

# fit regression in which L is assumed to be a confounder
example_fit_1 <- lm(Y ~ A + L, data = data)

# fit regression in which L is assumed to be a mediator
example_fit_2 <- lm(Y ~ A, data = data)

# create gtsummary tables for each regression model
table1 <- gtsummary::tbl_regression(example_fit_1)
table2 <- gtsummary::tbl_regression(example_fit_2)

# merge the tables for comparison
table_comparison <- gtsummary::tbl_merge(
  list(table1, table2),
  tab_spanner = c("Model: Wealth assumed confounder",
                  "Model: Wealth assumed to be a mediator")
)

# make markdown table
markdown_table_0 <- as_kable_extra(table_comparison,
                                   format = "markdown",
                                   booktabs = TRUE)
# print markdown table
markdown_table_0

Compare model fits

By all metrics, Model 1 fits better but it is confounded

performance::compare_performance(example_fit_1, example_fit_2)

Model 1 exhibits mediator bias, but it has a considerably higher $R^{2}$ , and a lower BIC.

Focussing on the BIC (lower is better): Model 1 fits better, but it is confounded.

BIC(example_fit_1) - BIC(example_fit_2)

Simulation 1 takeaway

In this simulation, we discovered that if we assess "effect" by "model fit" we will get the wrong sign for the coefficient of interest. The use of model fit perpetuates the causality crisis in psychological science (Bulbulia et al., 2022). Few are presently aware of this crisis. Causal diagrams show us how we can do better.

Simulation Demonstration 2: How Regression Coefficients Mislead

Methodology

Data generation: we simulate a dataset for 1,000 individuals, where religious service attendance ( $A$ ) affects wealth ( $L$ ), and charitable donations ( $Y$ ) also affect wealth $L$ , but $A$ and $Y$ are not causally related.
Parameter definitions: the probability of religious service ( $A$ ) is set at 0.5. $Y$ has a mean of 0 and sd = 1, and is independent of $A$ . $L$ is a linear function of $A$ and $Y$ .
Model specifications: Model 1 (incorrect) controls for $L$ . Model 2 (correct) does not control for $L$ .
Analysis and comparison: compare models.

# simulation seed
set.seed(123) # reproducibility

# define parameters
n <- 1000 # number of observations
p <- 0.5  # probability of A = 1

# simulate the data
A_1 <- rbinom(n, 1, p) # binary exposure variable
Y_1 <- rnorm(n, 0, 1) # Y independent of A
L_2 <- rnorm(n, A_1 + Y_1)

# make the data frame
data_collider <- data.frame(A = A_1, L = L_2, Y = Y_1)

# fit regression controlling for collider L
collider_example_fit_1 <- lm(Y ~ A + L, data = data_collider)

# fit regression without collider
collider_example_fit_2 <- lm(Y ~ A, data = data_collider)

summary(collider_example_fit_1)

# create gtsummary tables for each regression model
collider_table1 <- gtsummary::tbl_regression(collider_example_fit_1)
collider_table2 <- gtsummary::tbl_regression(collider_example_fit_2)

# merge the tables for comparison
collider_table_comparison <- gtsummary::tbl_merge(
  list(collider_table1, collider_table2),
  tab_spanner = c("Model: Wealth assumed confounder",
                  "Model: Wealth not assumed to be a mediator")
)

# make markdown table
markdown_table_1 <- as_kable_extra(collider_table_comparison,
                                   format = "markdown",
                                   booktabs = TRUE)
# print table
collider_table_comparison

Compare model fits

By all metrics, Model 1 fits better but it is confounded.

$A$ is not causally associated with $Y$ . We generated the data so we know. However, the confounded model shows a better fit, and in this model the coefficient for $A$ is statistically significant (and negative).

performance::compare_performance(collider_example_fit_1, collider_example_fit_2)

Again, the BIC (lower is better) for Model 1 is better, but it is entirely confounded.

BIC(collider_example_fit_1) - BIC(collider_example_fit_2)

How does collider bias work? If we know that someone is not attending church, if they are charitable then we can predict they are wealthy. Similarly if we know someone is not wealthy but charitable, we can predict they attend religious service.

However, in this simulation, we know that religion and wealth are not causally associated because we have simulated the data.

Simulation 2 takeaway

In this simulation, we discovered that if we assess "effect" by "model fit", we get the wrong scientific inference. The use of model fit perpetuates the causality crisis in psychological science (Bulbulia et al., 2022). Let us do better.

Resources on Regression that Do Not Perpetuate the Causality Crisis

Statistical Rethinking (McElreath, 2020)
Regression and Other Stories (Gelman et al., 2020)

Exercises

Setup

Libraries

library(tidyverse)
library(patchwork)
library(lubridate)
library(kableExtra)
library(gtsummary)

Get data

# package with data
library(margot)

# load data
data("df_nz")

Import Pearson and Lee mother's and daughters data

df_pearson_lee <-
  data.frame(read.table(
    url(
      "https://raw.githubusercontent.com/avehtari/ROS-Examples/master/PearsonLee/data/MotherDaughterHeights.txt"
    ),
    header = TRUE
  ))

# centre mother's height for later example
df_pearson_lee <- df_pearson_lee |>
  dplyr::mutate(mother_height_c = as.numeric(scale(
    mother_height, center = TRUE, scale = FALSE
  )))

Note

For all exercises below, use only the 2018 wave of the df_nz dataset.

Q1. Create a descriptive table and a descriptive graph for the `hlth_weight` and `hlth_height` variables in the `df_nz` dataset

Select hlth_height, hlth_weight from the nz dataset. Filter only the 2018 wave. Create a descriptive table and graph these two variables. Annotate your workflow (at each step, describe what you are doing and why).

Q2. Regression height ~ weight and report results

Using the df_nz dataset, write a regression model for height as predicted by weight. Create a table for your results. Create a graph/graphs to clarify the results of your regression model. Briefly report your results.

Q3. Regress height ~ male and report results

Using the df_nz dataset, write a regression model for height as predicted by male. Create a table for your results. Create a graph/graphs to clarify the results of your regression model. Briefly report your results.

Q4. Regression to predict

Using the regression coefficients from the Pearson and Lee 1903 dataset, predict the heights of daughters of women in the df_nz dataset.

Q5. Bonus

On average, how much taller or shorter are women in New Zealand as sampled in the 2019 df_nz dataset compared with women in 1903 as sampled in the Pearson and Lee dataset? Clarify your inference.

Solutions

Solutions: Setup

## libraries
library("tidyverse")
library("patchwork")
library("lubridate")
library("kableExtra")
library("gtsummary")

# get data
library("margot")
data("df_nz")

# import pearson and lee mother's and daughters data
# in 1903, pearson and lee collected 5,524 observations from mother/daughter height pairs.

df_pearson_lee <- data.frame(read.table(
  url(
    "https://raw.githubusercontent.com/avehtari/ROS-Examples/master/PearsonLee/data/MotherDaughterHeights.txt"
  ),
  header = TRUE
))

# centre mother's height for later example
df_pearson_lee <- df_pearson_lee |>
  dplyr::mutate(mother_height_c = as.numeric(scale(
    mother_height, center = TRUE, scale = FALSE
  )))

Solution Q1: Descriptive table

# required libraries
library(gtsummary)
library(dplyr)
library(margot)

# select focal variables and rename them for clarity
df_nzdat <- df_nz |>
  dplyr::filter(wave == 2019) |>
  dplyr::select(hlth_weight, hlth_height, male) |>
  dplyr::rename(weight = hlth_weight,
                height = hlth_height) |>
  dplyr::select(weight, height, male) |>
  dplyr::mutate(weight_c = as.numeric(scale(
    weight, scale = FALSE, center = TRUE
  )))

# create table
df_nzdat |>
  dplyr::select(weight, height) |>
  gtsummary::tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    missing_text = "(Missing)"
  ) |>
  bold_labels()

Here is another approach:

# libs we need
library(table1)
library(tidyverse)
library(tidyr)
library(margot)

# filter 2019 wave
df_nz_1 <- df_nz |>
  dplyr::filter(wave == 2019)

# nicer labels
table1::label(df_nz_1$hlth_weight) <- "Weight"
table1::label(df_nz_1$hlth_height) <- "Height"

# table
table1::table1(~ hlth_weight + hlth_height, data = df_nz_1)

Create graph:

library("patchwork")

# create data set where filtering na vals
df_nzdat1 <- df_nzdat |>
  dplyr::filter(!is.na(weight),
                !is.na(height),
                !is.na(male))

weight_density <- ggplot2::ggplot(data = df_nzdat1, aes(x = weight)) +
  geom_density(fill = "chocolate2") +
  labs(title = "Density plot of weight of NZ sample years 2019/2020") +
  theme_classic()

weight_density

height_density <- ggplot2::ggplot(data = df_nzdat1, aes(x = height)) +
  geom_density(fill = "blue2") +
  labs(title = "Density plot of height of NZ sample years 2019/2020") +
  theme_classic()

height_density
weight_density / height_density + plot_annotation(tag_levels = "a")

Here is a density plot:

library(ggplot2)

ggplot2::ggplot(data = df_nzdat1, aes(x = weight, fill = as.factor(male))) +
  geom_density() +
  labs(title = "Density plot of weight of NZ sample years 2019/2020") +
  theme_classic() +
  scale_fill_viridis_d()

Solution Q2: Regress height ~ weight and report results

Model:

# regression of height ~ weight
fit_1 <- lm(height ~ weight_c, data = df_nzdat)

Table:

library(parameters)

# table of model
parameters::model_parameters(fit_1) |>
  print_html(digits = 2,
             select = "{coef}{stars}|({ci})",
             column_labels = c("Estimate", "95% CI"))

Prediction:

library(ggeffects)
library(ggplot2)

# generate the plot
plot(ggeffects::ggpredict(fit_1, terms = "weight_c"))

Briefly report your results (note: replace "significant" with "statistically significant"):

report::report(fit_1)

Solution Q3: Regress height ~ male and report results

Model and table:

# regression of height ~ male
fit_2 <- lm(hlth_height ~ male, data = df_nz)
sjPlot::tab_model(fit_2)

Graph:

# plot over the range of the data
plot_fit_2 <- plot(
  ggeffects::ggpredict(fit_2, terms = "male[all]",
  jitter = .1,
  add.data = TRUE,
  dot.alpha = .2
))

# plot
plot_fit_2 +
  scale_y_continuous(limits = c(1.2, 2.1))

Report:

report::report(fit_2)

Solution Q4: Regression to predict

Using the regression coefficients from the Pearson and Lee 1903 dataset, predict the heights of daughters of women in the df_nz dataset.

# model for daughter height from mother height
fit_3 <- lm(daughter_height ~ mother_height, data = df_pearson_lee)

# create data frame of not_male's in 2019
# notice problem: not_male != woman
# additionally, woman != mother!

df_nz_2 <- df_nz |>
  filter(male == 1) |>
  dplyr::select(hlth_height) |>
  dplyr::mutate(mother_height = hlth_height * 39.36) |>
  dplyr::select(mother_height) |>
  dplyr::arrange((mother_height))

# find min and max heights, store as objects
min_mother_height <- min(df_nz_2$mother_height, na.rm = TRUE)
max_mother_height <- max(df_nz_2$mother_height, na.rm = TRUE)

# expand grid, use stored objects to define boundaries
df_expand_grid_nz_2 <-
  expand.grid(mother_height = seq(
    from = min_mother_height,
    to = max_mother_height,
    length.out = 200
  ))

# use the predict function to create a new response using the pearson and lee regression model
predict_fit_3 <-
  predict(fit_3,
          type = "response",
          interval = "confidence",
          newdata = df_expand_grid_nz_2)

# combine variables into a data frame
df_expand_grid_nz_2_predict_fit_3 <-
  data.frame(df_expand_grid_nz_2, predict_fit_3)

# graph the expected average hypothetical heights of "daughters"
predplot2 <-
  ggplot(data = df_expand_grid_nz_2_predict_fit_3,
         aes(x = mother_height, y = fit)) +
  geom_line(colour = "cadetblue") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = .1) +
  scale_x_continuous(limits = c(50, 75)) +
  scale_y_continuous(limits = c(50, 75)) +
  theme_classic() +
  xlab("NZ 2019 female population") +
  ylab("Predicted daughter heights in inches") +
  labs(title = "Regression prediction for hypothetical daughter heights of NZ population in 2019")

# plot
predplot2

Solution Q5: Bonus

On average, how much taller or shorter are women in New Zealand as sampled in 2019 compared with women in 1903 as sampled in the Pearson and Lee dataset?

# create var for 1903 dataset
df_pearson_lee_2 <- df_pearson_lee |>
  dplyr::select(mother_height, daughter_height) |>
  tidyr::pivot_longer(everything(), names_to = "relation", values_to = "f_height") |>
  dplyr::mutate(year_is = factor("1903"))

# create var for the 2019 dataset
df_nz_combine_pearson_lee <- df_nz_2 |>
  dplyr::rename(f_height = mother_height) |>
  dplyr::mutate(year_is = factor("2019"),
                relation = factor("mother"))

head(df_nz_combine_pearson_lee)

# combine data frames row-wise
df_nz_combine_pearson_lee_1 <- rbind(df_pearson_lee_2, df_nz_combine_pearson_lee)

Look at heights in sample:

table(df_nz_combine_pearson_lee_1$year_is)

# box plot
ggplot2::ggplot(data = df_nz_combine_pearson_lee_1,
                aes(x = year_is, y = f_height, fill = year_is)) +
  geom_boxplot(notch = TRUE) +
  labs(title = "Comparison of female height 1903/2019") +
  theme_classic() +
  scale_fill_viridis_d()

Predict heights out of sample:

# regression model
fit_4 <- lm(f_height ~ year_is, data = df_nz_combine_pearson_lee_1)

# table
sjPlot::tab_model(fit_4)

Women in 2019 are taller.

report::report(fit_4)

Graph:

# get meaningful values for the y-axis range
min_mother_height_for_range <- min(df_nz_combine_pearson_lee_1$f_height, na.rm = TRUE)
max_mother_height_for_range <- max(df_nz_combine_pearson_lee_1$f_height, na.rm = TRUE)

# make graph
plot(
  ggeffects::ggpredict(fit_4, terms = "year_is",
  add.data = TRUE,
  dot.alpha = .2
)) +
  labs(title = "Predicted difference in female heights 1903/2019") +
  xlab("Study year") +
  ylab("Predicted female height (inches)") +
  scale_y_continuous(
    limits = c(min_mother_height_for_range, max_mother_height_for_range))

Graph with predicted points:

ggeffects::ggpredict(fit_4, terms = "year_is")

# graph with predicted values based on the model
plot(
  ggeffects::ggpredict(fit_4, terms = "year_is",
  add.data = TRUE,
  dot.alpha = .01,
  colors = "us",
  jitter = .5)) +
  labs(title = "Predicted difference in female heights 1903/2019") +
  xlab("Study year") +
  ylab("Predicted female height (inches)") +
  scale_y_continuous(
    limits = c(min_mother_height_for_range, max_mother_height_for_range))

Appendix: Conceptual Background

Some preliminaries about science.

Science begins with a question

Science begins with a question about the world. The first step in science, then, is to clarify what you want to know.

Because science is a social practice, you will also need to clarify why your question is interesting: so what?

In short, know your question.

Scientific model (or theory)

Sometimes, scientists are interested in specific features of the world: how did virus x originate? Such a question might have a forensic interest: what constellation of events gave rise to a novel infectious disease?

Scientists typically seek generalisations. How do infectious diseases evolve? How do biological organisms evolve? Such questions have applied interests. How can we better prevent infectious diseases? How did life originate?

A scientific model proposes how nature is structured (and unstructured). For example, the theory of evolution by natural selection proposes that life emerges from variation, inheritance, and differential reproduction/survival.

To evaluate a scientific model, scientists must make generalisations beyond individual cases. This is where statistics shines.

What is statistics?

Mathematics is a logic of certainty.

Statistics is a logic of uncertainty.

A statistical model uses the logic of probability to make better guesses.

Applications of statistical models in science

Scientific models seek to explain how nature is structured. Where scientific models conflict, we can combine statistical models with data collection to evaluate the credibility of one theoretical model over others. To do this, a scientific model must make distinct, non-trivial predictions about the world.

If the predictions are not distinct, the observations will not enable a shift in credibility for one theory over another. Consider the theory that predicts any observation. Such a theory would be better classified as a conspiracy theory; it is compatible with any evidence.

The relationship of probability distributions and data-generating processes is complex, intriguing, and both historically and philosophically rich $\dots$ . Because our interests are applied, we will hardly touch upon this richness in this course. ↩
Later, we shall prefer a different way of writing regression equations in maths. (Note: writing maths is not maths; it is just encoding the model that we have written.) ↩

Lab 5: Average Treatment Effects

R script

Download the R script for this lab (right-click → Save As)

This lab introduces causal forests as a tool for estimating average treatment effects (ATEs). You will compare naive, regression-adjusted, g-computation, and causal forest estimates against known ground-truth effects.

What you will learn

Why naive estimates of causal effects are biased when confounding is present
How covariate adjustment and g-computation reduce this bias
How to fit a causal forest and extract the ATE
How to validate estimates against ground truth

New packages

This lab uses the causalworkshop and grf packages. Install them before proceeding if you haven't already.

Setup and data

Install and load the required packages:

# install causalworkshop if needed
# install.packages("remotes")
# remotes::install_github("go-bayes/causalworkshop")

library(causalworkshop)
library(grf)
library(tidyverse)

Generate a simulated three-wave panel dataset. The data are modelled on the New Zealand Attitudes and Values Study (NZAVS), with baseline confounders (wave 0), binary exposures (wave 1), and continuous outcomes (wave 2). Crucially, the data contain known ground-truth treatment effects in the tau_* columns.

# simulate data
d <- simulate_nzavs_data(n = 5000, seed = 2026)

# check structure
dim(d)
names(d)

The data are in long format (three rows per individual). We need to separate the waves:

# separate waves
d0 <- d |> filter(wave == 0)  # baseline confounders
d1 <- d |> filter(wave == 1)  # exposure assignment
d2 <- d |> filter(wave == 2)  # outcomes

# verify alignment
stopifnot(all(d0$id == d1$id), all(d0$id == d2$id))

We will estimate the effect of community group participation (community_group) at wave 1 on wellbeing (wellbeing) at wave 2.

# ground truth: the true ATE
true_ate <- mean(d0$tau_community_wellbeing)
cat("True ATE:", round(true_ate, 3), "\n")

Naive ATE (biased)

A naive estimate ignores confounders. We simply regress the outcome on the exposure:

fit_naive <- lm(d2$wellbeing ~ d1$community_group)
naive_ate <- coef(fit_naive)[2]
cat("Naive ATE:", round(naive_ate, 3), "\n")
cat("True ATE: ", round(true_ate, 3), "\n")
cat("Bias:     ", round(naive_ate - true_ate, 3), "\n")

Why is the naive estimate biased?

People who join community groups differ systematically from those who don't. They tend to be more extraverted, more agreeable, and less neurotic. These same traits also affect wellbeing directly. The naive estimate captures both the causal effect and the confounding.

Adjusted ATE (regression)

We can reduce bias by conditioning on baseline confounders:

# construct analysis dataframe
df <- data.frame(
  y = d2$wellbeing,
  a = d1$community_group,
  age = d0$age,
  male = d0$male,
  nz_european = d0$nz_european,
  education = d0$education,
  partner = d0$partner,
  employed = d0$employed,
  log_income = d0$log_income,
  nz_dep = d0$nz_dep,
  agreeableness = d0$agreeableness,
  conscientiousness = d0$conscientiousness,
  extraversion = d0$extraversion,
  neuroticism = d0$neuroticism,
  openness = d0$openness,
  community_t0 = d0$community_group,
  wellbeing_t0 = d0$wellbeing
)

# regression with covariates
fit_adj <- lm(y ~ a + age + male + nz_european + education + partner +
                employed + log_income + nz_dep + agreeableness +
                conscientiousness + extraversion + neuroticism + openness +
                community_t0 + wellbeing_t0, data = df)

adj_ate <- coef(fit_adj)["a"]
cat("Adjusted ATE:", round(adj_ate, 3), "\n")
cat("True ATE:    ", round(true_ate, 3), "\n")
cat("Bias:        ", round(adj_ate - true_ate, 3), "\n")

What changed?

The adjusted estimate should be much closer to the true ATE. Conditioning on confounders breaks the spurious association between exposure and outcome (recall the fork structure from the ggdag tutorial).

G-computation by hand

G-computation estimates the ATE by predicting outcomes under counterfactual treatment assignments. We create two copies of the data, one where everyone is treated and one where everyone is untreated, predict outcomes for each, and take the average difference.

# create counterfactual datasets
df_treated <- df
df_treated$a <- 1

df_control <- df
df_control$a <- 0

# predict outcomes under each scenario
y_hat_treated <- predict(fit_adj, newdata = df_treated)
y_hat_control <- predict(fit_adj, newdata = df_control)

# ATE via g-computation
gcomp_ate <- mean(y_hat_treated - y_hat_control)
cat("G-computation ATE:", round(gcomp_ate, 3), "\n")
cat("True ATE:         ", round(true_ate, 3), "\n")

G-computation vs regression coefficient

When the treatment is binary and the model has no interactions, the g-computation ATE equals the regression coefficient on the treatment variable. They diverge when interactions are present, because g-computation averages over the empirical distribution of covariates.

ATE via causal forest

A causal forest estimates individual-level treatment effects $τ (x_{i})$ non-parametrically. The ATE is the average of these individual effects, with a valid standard error that accounts for the estimation uncertainty.

# construct matrices for the causal forest
covariate_cols <- c(
  "age", "male", "nz_european", "education", "partner", "employed",
  "log_income", "nz_dep", "agreeableness", "conscientiousness",
  "extraversion", "neuroticism", "openness",
  "community_t0", "wellbeing_t0"
)

X <- as.matrix(df[, covariate_cols])
Y <- df$y
W <- df$a

# fit causal forest
cf <- causal_forest(
  X, Y, W,
  num.trees = 1000,
  honesty = TRUE,
  tune.parameters = "all",
  seed = 2026
)

# extract ATE with standard error
ate_cf <- average_treatment_effect(cf)
cat("Causal forest ATE:", round(ate_cf["estimate"], 3),
    "(SE:", round(ate_cf["std.err"], 3), ")\n")
cat("True ATE:         ", round(true_ate, 3), "\n")

What is honesty?

Setting honesty = TRUE splits the training data in half: one half builds the tree structure, the other estimates the treatment effects within each leaf. This prevents overfitting and ensures valid confidence intervals.

Compare all estimates

results <- data.frame(
  method = c("Naive", "Adjusted regression", "G-computation", "Causal forest"),
  estimate = c(naive_ate, adj_ate, gcomp_ate, ate_cf["estimate"]),
  bias = c(naive_ate - true_ate, adj_ate - true_ate,
           gcomp_ate - true_ate, ate_cf["estimate"] - true_ate)
)
results$estimate <- round(results$estimate, 3)
results$bias <- round(results$bias, 3)
print(results)
cat("\nTrue ATE:", round(true_ate, 3), "\n")

Key takeaway

All three adjusted methods (regression, g-computation, causal forest) should recover the true ATE reasonably well. The naive estimate is substantially biased because it does not account for confounding. The causal forest additionally provides valid standard errors and, as we will see in Lab 6, individual-level treatment effect predictions.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

Different exposure-outcome pair. Repeat the analysis using religious_service as the exposure and belonging as the outcome. How does the bias of the naive estimate compare? Check the true ATE using mean(d0$tau_religious_belonging).
Omit baseline adjustment. Re-fit the causal forest without including community_t0 and wellbeing_t0 in the covariate matrix. How much does the ATE estimate change? Why might baseline values of the exposure and outcome be important confounders?
Sample size comparison. Generate data with n = 1000 and n = 10000. How do the causal forest ATE estimates and standard errors change? What does this tell you about the precision of causal forest estimates?

Lab 6: Conditional Average Treatment Effects

R script

Download the R script for this lab (right-click → Save As)

This lab explores why functional form matters for estimating heterogeneous treatment effects. You will compare parametric and non-parametric estimators, examine individual-level predictions from causal forests, and test whether treatment effects genuinely vary across individuals.

What you will learn

Why OLS can miss treatment effect heterogeneity
How to extract individual treatment effect predictions from a causal forest
How to test for significant heterogeneity using test_calibration()
How to identify which covariates drive effect modification

Why functional form matters

When treatment effects vary across individuals, the method we use to estimate them matters. A linear model assumes effects change at a constant rate with each covariate; a causal forest can capture non-linear and interactive patterns.

library(causalworkshop)
library(grf)
library(tidyverse)

The simulate_nonlinear_data() function generates data where the true treatment effect surface is deliberately non-linear, so that flexible methods outperform rigid ones:

# simulate data with non-linear treatment effects
d_nl <- simulate_nonlinear_data(n = 2000, seed = 2026)

# compare four estimation methods
result <- compare_ate_methods(d_nl)

All four methods (OLS, polynomial, GAM, causal forest) recover the overall ATE reasonably well. But their ability to predict individual effects differs dramatically:

# compare RMSE for individual-level predictions
print(result$summary)

RMSE tells the story

RMSE (root mean squared error) measures how well each method predicts the true individual treatment effect $τ (x_{i})$ . A lower RMSE means the method captures the heterogeneity pattern more accurately. OLS assumes a linear effect surface and typically has the highest RMSE.

Individual treatment effects from the causal forest

Now we return to the NZAVS data from Lab 5. The causal forest estimates $τ (x_{i})$ for each individual: what would their outcome change be if they were treated versus untreated?

# simulate NZAVS data (same as Lab 5)
d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)
d1 <- d |> filter(wave == 1)
d2 <- d |> filter(wave == 2)

# construct matrices
covariate_cols <- c(
  "age", "male", "nz_european", "education", "partner", "employed",
  "log_income", "nz_dep", "agreeableness", "conscientiousness",
  "extraversion", "neuroticism", "openness",
  "community_group", "wellbeing"
)

X <- as.matrix(d0[, covariate_cols])
Y <- d2$wellbeing
W <- d1$community_group

# fit causal forest
cf <- causal_forest(
  X, Y, W,
  num.trees = 1000,
  honesty = TRUE,
  tune.parameters = "all",
  seed = 2026
)

Extract predicted individual treatment effects:

# predicted treatment effects for each individual
tau_hat <- predict(cf)$predictions

# summary statistics
cat("Mean tau_hat:  ", round(mean(tau_hat), 3), "\n")
cat("SD tau_hat:    ", round(sd(tau_hat), 3), "\n")
cat("Range tau_hat: ", round(range(tau_hat), 3), "\n")

Compare with the true individual effects:

# true individual effects from the data-generating process
tau_true <- d0$tau_community_wellbeing

# how well does the forest recover individual effects?
cat("Correlation(tau_hat, tau_true):", round(cor(tau_hat, tau_true), 3), "\n")
cat("RMSE:", round(sqrt(mean((tau_hat - tau_true)^2)), 3), "\n")

Visualise the distribution of predicted effects:

# histogram of predicted treatment effects
ggplot(data.frame(tau_hat = tau_hat), aes(x = tau_hat)) +
  geom_histogram(bins = 40, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = mean(tau_hat), colour = "red", linetype = "dashed") +
  labs(
    title = "Distribution of predicted treatment effects",
    x = expression(hat(tau)(x)),
    y = "Count"
  ) +
  theme_minimal()

Interpreting the histogram

If treatment effects were homogeneous, this histogram would be tightly concentrated around the ATE. A wide spread indicates heterogeneity: some people benefit more from community group participation than others.

Test for heterogeneity

The test_calibration() function tests whether the forest has detected genuine heterogeneity, or whether the variation in $τ (x)$ is just noise.

# test for heterogeneity
cal_test <- test_calibration(cf)
print(cal_test)

Reading the calibration test

The key row is differential.forest.prediction. If its coefficient is significantly greater than zero (p < 0.05), the forest has detected meaningful variation in treatment effects beyond the overall mean. The mean.forest.prediction row tests whether the average effect is non-zero.

Variable importance

Which covariates drive the heterogeneity? The variable_importance() function measures how frequently each variable is used for splitting in the forest:

# variable importance
var_imp <- variable_importance(cf)
importance_df <- data.frame(
  variable = colnames(X),
  importance = as.numeric(var_imp)
) |>
  arrange(desc(importance))

print(importance_df)

Cross-reference with ground truth

The true treatment effect formula for community group participation on wellbeing is:

$τ = 0.20 + 0.10 \times extraversion + 0.05 \times partner - 0.03 \times neuroticism^{2}$

So extraversion, partner status, and neuroticism should appear as important variables. Does the forest recover this pattern?

Subgroup analysis

We can examine whether predicted effects differ across subgroups defined by the important covariates:

# compare effects by extraversion
high_extra <- tau_hat[d0$extraversion > 0]
low_extra <- tau_hat[d0$extraversion <= 0]

cat("Mean tau_hat (high extraversion):", round(mean(high_extra), 3), "\n")
cat("Mean tau_hat (low extraversion): ", round(mean(low_extra), 3), "\n")
cat("Difference:                      ", round(mean(high_extra) - mean(low_extra), 3), "\n")

# compare effects by partner status
partnered <- tau_hat[d0$partner == 1]
unpartnered <- tau_hat[d0$partner == 0]

cat("\nMean tau_hat (partnered):  ", round(mean(partnered), 3), "\n")
cat("Mean tau_hat (unpartnered):", round(mean(unpartnered), 3), "\n")
cat("Difference:                ", round(mean(partnered) - mean(unpartnered), 3), "\n")

Do the subgroup differences match the ground truth?

The tau formula adds $+ 0.10 \times extraversion$ and $+ 0.05 \times partner$ . Highly extraverted and partnered individuals should show larger predicted treatment effects. Check whether this matches what you observe.

Predicted vs true effects scatter plot

# scatter plot of predicted vs true individual effects
ggplot(data.frame(true = tau_true, predicted = tau_hat),
       aes(x = true, y = predicted)) +
  geom_point(alpha = 0.1, colour = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "red") +
  labs(
    title = "Predicted vs true individual treatment effects",
    x = expression(tau(x)),
    y = expression(hat(tau)(x))
  ) +
  theme_minimal()

Key takeaway

Causal forests can detect meaningful heterogeneity in treatment effects without requiring the analyst to specify the functional form in advance. The test_calibration() function provides a formal test for heterogeneity, and variable_importance() identifies which covariates drive it. In Lab 8, we will use these individual predictions to evaluate targeting strategies.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

Different seed. Run compare_ate_methods() with seed = 42 instead of seed = 2026. Do the relative RMSE rankings change? Why or why not?
Different exposure-outcome pair. Fit a causal forest for volunteer_work on self_esteem. Run test_calibration() and variable_importance(). Which covariates drive heterogeneity? Does this match the ground-truth tau formula? (Hint: check the simulate_nzavs_data documentation.)
Why does OLS miss heterogeneity? In one paragraph, explain why a linear model that includes only main effects cannot capture the $- 0.03 \times neuroticism^{2}$ term in the treatment effect formula. What would you need to add to the linear model to capture this non-linearity?

Lab 8: RATE and QINI Curves

R script

Download the R script for this lab (right-click → Save As)

This lab evaluates whether targeting treatment to those predicted to benefit most improves outcomes compared with treating everyone. You will compute RATE curves and QINI curves from causal forest predictions and assess targeting efficiency at different population percentiles.

What you will learn

How to rank individuals by predicted treatment benefit
How to compute and interpret RATE curves (gain over random assignment)
How to compute and interpret QINI curves (cumulative targeting gain)
How to characterise the covariate profile of high-benefit individuals

Connection to previous labs

This lab builds directly on Labs 5 and 6. You will use the causal forest fitted in those labs to evaluate whether targeting resources to the most responsive individuals is worthwhile.

Setup

library(causalworkshop)
library(grf)
library(tidyverse)

Re-fit the causal forest from Labs 5-6 (or copy the code from Lab 5):

# simulate data
d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)
d1 <- d |> filter(wave == 1)
d2 <- d |> filter(wave == 2)

# construct matrices
covariate_cols <- c(
  "age", "male", "nz_european", "education", "partner", "employed",
  "log_income", "nz_dep", "agreeableness", "conscientiousness",
  "extraversion", "neuroticism", "openness",
  "community_group", "wellbeing"
)

X <- as.matrix(d0[, covariate_cols])
Y <- d2$wellbeing
W <- d1$community_group

# fit causal forest
cf <- causal_forest(
  X, Y, W,
  num.trees = 1000,
  honesty = TRUE,
  tune.parameters = "all",
  seed = 2026
)

# extract predicted individual treatment effects
tau_hat <- predict(cf)$predictions

Rank individuals by predicted benefit

The first step in any targeting analysis is to sort individuals from highest to lowest predicted treatment effect:

# sort by predicted benefit (descending)
n <- length(tau_hat)
tau_order <- order(tau_hat, decreasing = TRUE)
tau_sorted <- tau_hat[tau_order]

# what does the top of the distribution look like?
cat("Top 5 predicted effects:   ", round(head(tau_sorted, 5), 3), "\n")
cat("Bottom 5 predicted effects:", round(tail(tau_sorted, 5), 3), "\n")
cat("Overall mean:              ", round(mean(tau_hat), 3), "\n")

RATE curve

The RATE (Rank-Weighted Average Treatment Effect) curve shows how much we gain by targeting treatment to the top $q %$ of predicted beneficiaries, compared with random assignment.

For each targeting rate $q$ , we compute the average predicted effect among the top $q %$ of individuals, minus the overall average:

# compute RATE curve
rates <- seq(0.05, 1.00, by = 0.05)
rate_results <- tibble(
  rate = numeric(),
  avg_tau_targeted = numeric(),
  gain_over_random = numeric()
)

for (r in rates) {
  n_targeted <- floor(r * n)
  targeted_idx <- tau_order[seq_len(n_targeted)]

  avg_targeted <- mean(tau_hat[targeted_idx])
  gain <- avg_targeted - mean(tau_hat)

  rate_results <- bind_rows(
    rate_results,
    tibble(rate = r, avg_tau_targeted = avg_targeted, gain_over_random = gain)
  )
}

print(rate_results |> mutate(across(where(is.numeric), \(x) round(x, 3))))

Plot the RATE curve:

ggplot(rate_results, aes(x = rate, y = gain_over_random)) +
  geom_line(colour = "#E69F00", linewidth = 1) +
  geom_point(colour = "#E69F00", size = 2) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(
    title = "RATE curve: gain from targeting",
    x = "Targeting rate (proportion treated)",
    y = "Gain over random assignment"
  ) +
  theme_minimal()

Reading the RATE curve

A steep curve at low targeting rates means a small group benefits substantially more than average. A flat curve means everyone benefits similarly, and targeting adds no value. The curve always reaches zero at 100% (treating everyone is the same as random).

QINI curve

The QINI curve measures the cumulative gain from targeting. For each percentile $p$ , it computes the total benefit from targeting the top $p %$ , minus the proportional share they would get under random assignment:

# compute QINI curve
qini_results <- tibble(
  percentile = numeric(),
  cumulative_gain = numeric()
)

for (p in rates) {
  n_top <- floor(p * n)
  top_idx <- tau_order[seq_len(n_top)]

  # cumulative gain: total effect for targeted minus proportional share

  cum_gain <- sum(tau_hat[top_idx]) - p * sum(tau_hat)

  qini_results <- bind_rows(
    qini_results,
    tibble(percentile = p, cumulative_gain = cum_gain)
  )
}

print(qini_results |> mutate(across(where(is.numeric), \(x) round(x, 3))))

Plot the QINI curve:

ggplot(qini_results, aes(x = percentile, y = cumulative_gain)) +
  geom_line(colour = "#56B4E9", linewidth = 1) +
  geom_point(colour = "#56B4E9", size = 2) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(
    title = "QINI curve: cumulative targeting gain",
    x = "Population percentile",
    y = "Cumulative gain over random"
  ) +
  theme_minimal()

Compute the area under the QINI curve (AUQC) via trapezoidal approximation:

# area under QINI curve
auqc <- sum(qini_results$cumulative_gain * 0.05)
cat("Area Under QINI Curve (AUQC):", round(auqc, 3), "\n")

AUQC interpretation

A larger AUQC means targeting is more valuable. An AUQC near zero means there is little heterogeneity to exploit, and random assignment performs nearly as well as targeted assignment.

Targeting efficiency

Create a summary table comparing the top 10%, 20%, and 50% of predicted beneficiaries:

# targeting efficiency at key percentiles
top_10_idx <- tau_order[seq_len(floor(0.10 * n))]
top_20_idx <- tau_order[seq_len(floor(0.20 * n))]
top_50_idx <- tau_order[seq_len(floor(0.50 * n))]

efficiency <- tibble(
  group = c("Top 10%", "Top 20%", "Top 50%", "Everyone"),
  avg_effect = c(
    mean(tau_hat[top_10_idx]),
    mean(tau_hat[top_20_idx]),
    mean(tau_hat[top_50_idx]),
    mean(tau_hat)
  ),
  lift_vs_random = c(
    mean(tau_hat[top_10_idx]) / mean(tau_hat),
    mean(tau_hat[top_20_idx]) / mean(tau_hat),
    mean(tau_hat[top_50_idx]) / mean(tau_hat),
    1
  )
) |>
  mutate(efficiency_gain_pct = round((lift_vs_random - 1) * 100, 1))

print(efficiency |> mutate(across(where(is.numeric), \(x) round(x, 3))))

Characterise the covariate profile of high-benefit individuals:

# who are the top 10%?
top_10_data <- d0[tau_order[seq_len(floor(0.10 * n))], ]
everyone <- d0

cat("Top 10% vs everyone:\n")
cat("  Extraversion:    ", round(mean(top_10_data$extraversion), 2),
    "vs", round(mean(everyone$extraversion), 2), "\n")
cat("  Neuroticism:     ", round(mean(top_10_data$neuroticism), 2),
    "vs", round(mean(everyone$neuroticism), 2), "\n")
cat("  Partner (prop):  ", round(mean(top_10_data$partner), 2),
    "vs", round(mean(everyone$partner), 2), "\n")
cat("  Agreeableness:   ", round(mean(top_10_data$agreeableness), 2),
    "vs", round(mean(everyone$agreeableness), 2), "\n")

Key takeaway

RATE and QINI curves translate heterogeneous treatment effects into actionable targeting decisions. If the curves are steep, concentrating resources on high-benefit individuals improves overall outcomes. If the curves are flat, treating everyone equally is just as effective. In Lab 9, we will learn how to express these targeting decisions as simple, interpretable rules using policy trees.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

Different outcome. Compute RATE and QINI curves for a different outcome (e.g., belonging or life_satisfaction). Is the AUQC larger or smaller? What does this imply about targeting?
Negative effects. Some individuals may have $τ (x) < 0$ , meaning the treatment is predicted to harm them. How many individuals in your sample have negative predicted effects? What are the implications for resource allocation?
AUTOC vs QINI weighting. The RATE curve (AUTOC weighting) emphasises the top of the ranking, while the QINI curve weights all percentiles equally. In one paragraph, explain when each metric would be more useful for a policy-maker.

Lab 9: Policy Trees

R script

Download the R script for this lab (right-click → Save As)

This lab moves from evaluating heterogeneity (Lab 8) to making decisions. Policy trees learn simple, interpretable treatment assignment rules from causal forest predictions. You will fit policy trees of different depths, evaluate their performance, and discuss the ethical implications of algorithmic treatment assignment.

What you will learn

How to construct a reward matrix from causal forest predictions
How to fit and interpret depth-1 and depth-2 policy trees
How to evaluate policy performance against random assignment
How to express learned rules in plain language

Connection to previous labs

This lab uses the causal forest and individual treatment effect predictions from Labs 5-8. The progression is: estimate effects (Lab 5) $\to$ discover heterogeneity (Lab 6) $\to$ evaluate targeting (Lab 8) $\to$ learn assignment rules (this lab).

Setup

library(causalworkshop)
library(grf)
library(policytree)
library(tidyverse)

Install policytree

If you haven't installed policytree, run install.packages("policytree") first.

Re-fit the causal forest (or re-use from previous labs):

# simulate data
d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)
d1 <- d |> filter(wave == 1)
d2 <- d |> filter(wave == 2)

# construct matrices
covariate_cols <- c(
  "age", "male", "nz_european", "education", "partner", "employed",
  "log_income", "nz_dep", "agreeableness", "conscientiousness",
  "extraversion", "neuroticism", "openness",
  "community_group", "wellbeing"
)

X <- as.matrix(d0[, covariate_cols])
Y <- d2$wellbeing
W <- d1$community_group

cf <- causal_forest(
  X, Y, W,
  num.trees = 1000,
  honesty = TRUE,
  tune.parameters = "all",
  seed = 2026
)

tau_hat <- predict(cf)$predictions

The gamma matrix

A policy tree needs a reward matrix (called the "gamma matrix"). Each row is an individual; each column is an action. The entry gives the expected reward for assigning that individual to that action.

With two actions (treat vs not treat), the gamma matrix has two columns:

# construct gamma matrix
# column 1: reward if not treated (control) = 0 (baseline)
# column 2: reward if treated = predicted treatment effect
gamma_matrix <- cbind(
  control = rep(0, length(tau_hat)),
  treatment = tau_hat
)

head(gamma_matrix)

Why is the control reward zero?

We normalise the control reward to zero so that the treatment column represents the gain from treating. A positive value means treatment helps; a negative value means treatment harms. The policy tree then simply needs to decide: for which individuals is the gain positive enough to justify treatment?

Fit a depth-1 policy tree

A depth-1 tree makes a single split, dividing the population into two groups based on one covariate:

# subsample for speed (policy tree fitting can be slow on large datasets)
set.seed(2026)
n_sample <- 500
idx <- sample(seq_len(nrow(X)), n_sample)

X_sample <- as.data.frame(X[idx, ])
gamma_sample <- gamma_matrix[idx, ]

# fit depth-1 policy tree
pt_depth1 <- policy_tree(X_sample, gamma_sample, depth = 1)

# print the tree
print(pt_depth1)

Visualise the tree:

plot(pt_depth1)

Reading the tree

The tree shows one splitting variable and a threshold. Individuals above the threshold go one way; individuals below go the other. The leaf labels (1 or 2) correspond to the columns of the gamma matrix: 1 = control, 2 = treatment.

Fit a depth-2 policy tree

A depth-2 tree makes two sequential splits, creating four groups:

# fit depth-2 policy tree
pt_depth2 <- policy_tree(X_sample, gamma_sample, depth = 2)

# print and plot
print(pt_depth2)
plot(pt_depth2)

Evaluate policies

Predict treatment assignments on the full dataset and compare with random assignment:

# predict actions for full dataset
X_full <- as.data.frame(X)
actions_depth1 <- predict(pt_depth1, X_full)
actions_depth2 <- predict(pt_depth2, X_full)

# compute expected reward under each policy
# action = 1 means control, action = 2 means treatment
reward_depth1 <- ifelse(actions_depth1 == 1, gamma_matrix[, 1], gamma_matrix[, 2])
reward_depth2 <- ifelse(actions_depth2 == 1, gamma_matrix[, 1], gamma_matrix[, 2])
reward_random <- 0.5 * gamma_matrix[, 1] + 0.5 * gamma_matrix[, 2]

# compare policies
policy_comparison <- tibble(
  policy = c("Random assignment", "Depth-1 tree", "Depth-2 tree", "Treat everyone"),
  expected_reward = c(
    mean(reward_random),
    mean(reward_depth1),
    mean(reward_depth2),
    mean(tau_hat)
  ),
  treat_rate = c(
    0.50,
    mean(actions_depth1 == 2),
    mean(actions_depth2 == 2),
    1.00
  )
)

print(policy_comparison |> mutate(across(where(is.numeric), \(x) round(x, 3))))

Is the improvement worth the complexity?

A depth-2 tree is harder to explain than a depth-1 tree but may assign treatments more efficiently. Compare the expected rewards: if depth-2 is only marginally better, the simpler depth-1 rule may be preferable for transparency.

Interpret the rules in plain language

Read the tree output and translate it into a decision rule:

# what variables and thresholds does the depth-2 tree use?
print(pt_depth2)

# example interpretation (your values will differ):
# "treat individuals with extraversion > 0.3 and baseline wellbeing < 0.1"

Exercise

Write out the depth-2 policy tree as a set of plain-language if-then rules. For example: "If extraversion is above X, then treat. Otherwise, if neuroticism is below Y, treat; otherwise, do not treat."

Compare policy assignments with actual treatment

How do the policy-recommended assignments compare with who actually received treatment in the data?

# agreement between policy and observed treatment
agreement <- tibble(
  actual = W,
  policy_depth2 = ifelse(actions_depth2 == 2, 1, 0)
) |>
  mutate(agree = actual == policy_depth2)

cat("Agreement rate:", round(mean(agreement$agree), 3), "\n")
cat("Policy treats: ", round(mean(agreement$policy_depth2), 3), "\n")
cat("Actual treated:", round(mean(agreement$actual), 3), "\n")

Ethical considerations

Statistical optimality is not social optimality

A policy tree maximises expected treatment benefit, but it does not account for:

Fairness. The tree may split on variables correlated with protected characteristics (ethnicity, gender, socioeconomic status). Even if a variable is not in the covariate set, proxy variables may reproduce discriminatory patterns.
Equity. Targeting resources to those who benefit most may mean those who benefit somewhat receive nothing. A policy that is statistically optimal may be socially unjust.
Transparency. A depth-2 tree is interpretable; a depth-5 tree is not. Policy-makers and the public need to understand the rule.
Override. A clinician or social worker should always be able to override an algorithmic recommendation based on individual context the model cannot see.

When presenting policy tree results, always discuss these trade-offs.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

Different outcome. Fit a policy tree for religious_service on belonging. Do the splitting variables change? What does this suggest about which covariates drive effect modification for different outcomes?
Depth-3 tree. Fit a depth = 3 policy tree. Does the expected reward improve substantially over depth-2? Is the tree still interpretable?
Discuss override. In one paragraph, describe a scenario where a clinician should override a policy tree recommendation. What information would the clinician have that the model does not?

Lab 10: Measurement Invariance

R script

Download the R script for this lab (right-click → Save As)

This lab introduces measurement invariance testing, a prerequisite for meaningful cross-group comparisons. You will conduct exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and multigroup CFA on simulated distress scale data with known non-invariance built in.

What you will learn

How to assess factorability (KMO, Bartlett's test) and extract factors using EFA
How to fit a CFA in lavaan and evaluate model fit (CFI, RMSEA, SRMR)
How to test configural, metric, and scalar invariance across groups
How to discover and release non-invariant items (partial invariance)
Why measurement invariance matters for causal inference across groups

Connection to the lecture

This lab aligns with Week 10's lecture on classical measurement theory from a causal perspective. The lecture covers measurement error in DAGs, EFA, CFA, and VanderWeele's model linking measurement to causal identification. This lab focuses on the practical skills: fitting and comparing models.

Setup

library(causalworkshop)
library(psych)
library(lavaan)
library(tidyverse)

Install packages

If you haven't installed psych or lavaan, run install.packages(c("psych", "lavaan")) first.

Generate data

The simulate_measurement_items() function generates a 6-item psychological distress scale (modelled on the Kessler-6) with a known factor structure and built-in measurement non-invariance across two groups:

# simulate measurement data
d <- simulate_measurement_items(n = 2000, seed = 2026)

# check structure
dim(d)
names(d)

The six items correspond to:

item_1: nervous
item_2: hopeless
item_3: restless
item_4: depressed
item_5: effort (everything was an effort)
item_6: worthless

# check the true factor loadings
attr(d, "true_loadings")

# check the true intercepts (they differ for items 3 and 5 across groups)
attr(d, "true_intercepts_group0")
attr(d, "true_intercepts_group1")

What is non-invariance?

Items 3 (restless) and 5 (effort) have different intercepts across groups. This means that at the same level of true distress, group 1 members score higher on these two items. If we ignore this, cross-group comparisons of mean distress scores will be biased.

Exploratory factor analysis (EFA)

Before fitting a CFA, we check whether the data are factorable and how many factors to extract.

Factorability

# select just the items
items <- d |> select(item_1:item_6)

# Kaiser-Meyer-Olkin measure of sampling adequacy
psych::KMO(items)

# Bartlett's test of sphericity
psych::cortest.bartlett(cor(items), n = nrow(items))

Interpreting KMO

KMO values above 0.60 are considered adequate for factor analysis. Values above 0.80 are good. Bartlett's test should be significant (p < 0.05), indicating that correlations between items are sufficiently large for factor analysis.

Extract factors

# one-factor solution
fa_1 <- psych::fa(items, nfactors = 1, fm = "ml", rotate = "none")
print(fa_1$loadings, cutoff = 0.3)

# two-factor solution (for comparison)
fa_2 <- psych::fa(items, nfactors = 2, fm = "ml", rotate = "oblimin")
print(fa_2$loadings, cutoff = 0.3)

How many factors?

The one-factor solution should show all six items loading substantially on a single factor (consistent with the data-generating process). The two-factor solution should not improve fit meaningfully. Compare the proportion of variance explained and check whether the two-factor loadings make theoretical sense.

Confirmatory factor analysis (CFA)

Now we specify the one-factor model and fit it using lavaan:

# specify one-factor CFA model
model <- "
  distress =~ item_1 + item_2 + item_3 + item_4 + item_5 + item_6
"

# fit CFA on full sample
fit_cfa <- cfa(model, data = d)

# summary with fit measures
summary(fit_cfa, fit.measures = TRUE, standardized = TRUE)

Evaluate model fit:

# extract key fit indices
fit_indices <- fitmeasures(fit_cfa, c("cfi", "rmsea", "srmr"))
print(round(fit_indices, 3))

Fit index guidelines

CFI (Comparative Fit Index): > 0.95 is excellent, > 0.90 is acceptable
RMSEA (Root Mean Square Error of Approximation): < 0.06 is excellent, < 0.08 is acceptable
SRMR (Standardised Root Mean Square Residual): < 0.08 is acceptable

Multigroup CFA: invariance testing

We now test whether the factor structure is equivalent across the two groups. Invariance testing proceeds through a hierarchy of progressively stricter constraints:

Step 1: Configural invariance

The same factor structure holds in both groups, but loadings and intercepts are freely estimated:

fit_configural <- cfa(model, data = d, group = "group")
summary(fit_configural, fit.measures = TRUE)

Step 2: Metric invariance

Factor loadings are constrained to be equal across groups:

fit_metric <- cfa(model, data = d, group = "group",
                  group.equal = "loadings")
summary(fit_metric, fit.measures = TRUE)

Compare configural and metric models:

lavTestLRT(fit_configural, fit_metric)

Interpreting the comparison

A non-significant chi-square difference (p > 0.05) means the metric model fits no worse than the configural model, so equal loadings across groups are supported. This is expected because the data-generating process uses the same loadings for both groups.

Step 3: Scalar invariance

Both loadings and intercepts are constrained to be equal:

fit_scalar <- cfa(model, data = d, group = "group",
                  group.equal = c("loadings", "intercepts"))
summary(fit_scalar, fit.measures = TRUE)

Compare metric and scalar models:

lavTestLRT(fit_metric, fit_scalar)

Expected result

Full scalar invariance should fail (significant chi-square difference, p < 0.05). This is by design: items 3 and 5 have different intercepts across groups in the data-generating process. The model is telling us that something is wrong with assuming equal intercepts for all items.

Discover partial non-invariance

When full scalar invariance fails, we can release constraints on specific items to achieve partial scalar invariance. Based on modification indices or theory, we free the intercepts of items 3 and 5:

# partial scalar invariance: free intercepts for items 3 and 5
model_partial <- "
  distress =~ item_1 + item_2 + item_3 + item_4 + item_5 + item_6
  item_3 ~ c(i3a, i3b) * 1
  item_5 ~ c(i5a, i5b) * 1
"

fit_partial <- cfa(model_partial, data = d, group = "group",
                   group.equal = c("loadings", "intercepts"))
summary(fit_partial, fit.measures = TRUE)

Compare partial scalar with metric invariance:

lavTestLRT(fit_metric, fit_partial)

Expected result

Partial scalar invariance should hold (non-significant chi-square difference). By releasing the intercepts for items 3 and 5, we account for the known non-invariance in the data. This means we can compare latent factor means across groups, but only after accounting for differential item functioning on items 3 and 5.

Compare all models

# summary table of fit indices
models <- list(
  Configural = fit_configural,
  Metric = fit_metric,
  Scalar = fit_scalar,
  "Partial Scalar" = fit_partial
)

fit_table <- map_dfr(names(models), function(name) {
  fm <- fitmeasures(models[[name]], c("cfi", "rmsea", "srmr", "chisq", "df"))
  tibble(
    model = name,
    cfi = round(fm["cfi"], 3),
    rmsea = round(fm["rmsea"], 3),
    srmr = round(fm["srmr"], 3),
    chisq = round(fm["chisq"], 1),
    df = fm["df"]
  )
})

print(fit_table)

Connection to causal inference

Why measurement invariance matters for causal inference

If a scale measures the same construct differently across groups (non-invariance), then cross-group comparisons of treatment effects may be biased. In DAG terms, the measured outcome $Y^{*}$ is a function of both the true outcome $Y$ and group membership $G$ :

$Y^{*} = f (Y, G)$

If $f$ differs by group (non-invariance), then even if the treatment has the same causal effect on $Y$ in both groups, the observed effect on $Y^{*}$ will differ. This is measurement bias in the causal inference framework.

Establishing measurement invariance before estimating treatment effects is therefore a prerequisite for valid cross-group comparisons.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

Group by sex. Repeat the invariance testing using male as the grouping variable instead of group. Does the invariance pattern change? (Since the data-generating process only introduces non-invariance by group, invariance by male should hold.)
Two-factor model. Fit a two-factor CFA where items 1-3 load on factor 1 and items 4-6 load on factor 2. Compare fit with the one-factor model. Does the data support a two-factor structure?
Interpretation. In one paragraph, explain what you would conclude about using a single distress score to compare psychological wellbeing across two demographic groups, given that items 3 and 5 show non-invariance. What practical steps would you take before reporting cross-group differences?

Causal Diagrams with ggdag

This tutorial introduces the ggdag package for drawing and analysing directed acyclic graphs (DAGs). DAGs encode causal assumptions and help identify which variables to condition on (and which to leave alone) when estimating causal effects.

How to use this page

This is a reference resource, not a graded lab. Work through the examples at your own pace and return when you need to draw or analyse a DAG for your research report.

Load libraries

# data wrangling
library(tidyverse)

# graphing
library(ggplot2)

# automated causal diagrams
library(ggdag)

The fork: omitted variable bias

Let's use ggdag to identify confounding arising from omitting L in our regression of A on Y.

First we write out the DAG:

# code for creating a DAG
graph_fork <- dagify(Y ~ L,
                   A ~ L,
                   exposure = "A",
                   outcome = "Y") |>
  tidy_dagitty(layout = "tree")

# plot the DAG
graph_fork |>
  ggdag() + theme_dag_blank() + labs(title = "L is a common cause of A and Y")

Next we ask ggdag which variables we need to include for an unbiased estimate:

ggdag::ggdag_adjustment_set(graph_fork) +
  theme_dag_blank() +
  labs(title = "{L} is the exclusive member of the confounder set for A and Y")

The causal graph tells us to obtain an unbiased estimate of A on Y we must condition on L.

When we include the omitted variable L in our simulated dataset, it breaks the spurious association between A and Y:

set.seed(123)
N <- 1000
L <- rnorm(N)
A <- rnorm(N, L)
Y <- rnorm(N, L)

# without control: A appears associated with Y
fit_fork <- lm(Y ~ A)
parameters::model_parameters(fit_fork)

# with control: association vanishes
fit_fork_controlled <- lm(Y ~ A + L)
parameters::model_parameters(fit_fork_controlled)

Mediation and the pipe

Suppose we are interested in the causal effect of A on Y, where the effect operates through a mediator M.

graph_mediation <- dagify(Y ~ M,
                 M ~ A,
                exposure = "A",
                outcome = "Y") |>
  ggdag::tidy_dagitty(layout = "tree")

graph_mediation |>
  ggdag() +
  theme_dag_blank() +
  labs(title = "Mediation Graph")

What should we condition on?

ggdag::ggdag_adjustment_set(graph_fork)

"Backdoor Paths Unconditionally Closed" means we may obtain an unbiased estimate of A on Y without including additional variables, assuming our DAG is correct. There is no "backdoor path" from A to Y that would bias our estimate.

Two variables are d-connected if information flows between them (conditional on the graph), and d-separated if they are conditionally independent.

ggdag::ggdag_dconnected(graph_mediation)

Here d-connection is desirable because it means we can estimate A's effect on Y.

Simulation: pipe confounding

Suppose we want to know whether a ritual action condition (A) influences charity (Y), and the effect operates entirely through perceived social cohesion (M):

$A \to M \to Y$

set.seed(123)
N <- 100
c0 <- rnorm(N, 10, 2)
ritual <- rep(0:1, each = N / 2)
cohesion <- ritual * rnorm(N, .5, .2)
c1 <- c0 + ritual * cohesion
d <- data.frame(c0 = c0, c1 = c1, ritual = ritual, cohesion = cohesion)

# total effect of ritual on charity
parameters::model_parameters(lm(c1 ~ c0 + ritual, data = d))

# conditioning on the mediator blocks the causal path
parameters::model_parameters(lm(c1 ~ c0 + ritual + cohesion, data = d))

The direct effect of ritual drops out when we include cohesion. Once the model knows M, it gets no new information from A. Conditioning on a post-treatment variable creates pipe confounding.

Masked relationships

When two correlated variables have opposing effects on the outcome, their individual effects can be "masked" in simple regressions.

dag_m1 <- dagify(K ~ C + R,
                 R ~ C,
                 exposure = "C",
                 outcome = "K") |>
  tidy_dagitty(layout = "tree")

dag_m1 |> ggdag()

set.seed(123)
n <- 100
C <- rnorm(n)
R <- rnorm(n, C)
K <- rnorm(n, R - C)
d_sim <- data.frame(K = K, R = R, C = C)

# C alone: weak or null
parameters::model_parameters(lm(K ~ C, data = d_sim))

# both: opposing effects "pop"
parameters::model_parameters(lm(K ~ C + R, data = d_sim))

Note that ggdag correctly identifies that you do not need to condition on R to estimate C's total effect on K:

dag_m1 |> ggdag_adjustment_set()

The total effect of C on K combines the direct path ( $C \to K$ ) and the indirect path ( $C \to R \to K$ ); these work in opposite directions.

Collider confounding

The selection-distortion effect (Berkson's paradox). Imagine there is no relationship between the newsworthiness and trustworthiness of science. Selection committees make decisions on the basis of both:

dag_sd <- dagify(S ~ N,
                 S ~ T,
                 labels = c("S" = "Selection",
                            "N" = "Newsworthy",
                            "T" = "Trustworthy")) |>
  tidy_dagitty(layout = "nicely")

dag_sd |>
  ggdag(text = FALSE, use_labels = "label") + theme_dag_blank()

When two arrows enter a variable, conditioning on it opens a path of information between its causes:

ggdag_dseparated(
  dag_sd,
  from = "T",
  to = "N",
  controlling_for = "S",
  text = FALSE,
  use_labels = "label"
) + theme_dag_blank()

# simulation of selection-distortion effect
set.seed(123)
n <- 1000
p <- 0.05

d <- tibble(
    newsworthiness  = rnorm(n, mean = 0, sd = 1),
    trustworthiness = rnorm(n, mean = 0, sd = 1)
  ) |>
  mutate(total_score = newsworthiness + trustworthiness) |>
  mutate(selected = ifelse(total_score >= quantile(total_score, 1 - p), TRUE, FALSE))

# correlation among selected proposals
d |>
  filter(selected == TRUE) |>
  select(newsworthiness, trustworthiness) |>
  cor()

Selection induces a spurious correlation. Among selected proposals, newsworthy ones appear less trustworthy, even though the two are independent in the population.

Collider bias in experiments

Conditioning on a post-treatment variable can open a spurious path even when no experimental effect exists:

dag_ex2 <- dagify(
  C1 ~ C0 + U,
  Ch ~ U + R,
  labels = c(
    "R" = "Ritual",
    "C1" = "Charity-post",
    "C0" = "Charity-pre",
    "Ch" = "Cohesion",
    "U" = "Religiousness (Unmeasured)"
  ),
  exposure = "R",
  outcome = "C1",
  latent = "U"
) |>
  control_for(c("Ch", "C0"))

dag_ex2 |>
  ggdag_collider(text = FALSE, use_labels = "label") +
  ggtitle("Cohesion is a collider that opens a path from ritual to charity")

Taxonomy of confounding

There are four basic structures:

The fork (omitted variable bias)

confounder_triangle(x = "Coffee", y = "Lung Cancer", z = "Smoking") |>
  ggdag_dconnected(text = FALSE, use_labels = "label")

The pipe (fully mediated effects)

mediation_triangle(x = NULL, y = NULL, m = NULL, x_y_associated = FALSE) |>
  tidy_dagitty(layout = "nicely") |>
  ggdag()

The collider

collider_triangle() |>
  ggdag_dseparated(controlling_for = "m")

Confounding by proxy

Controlling for a descendant of a collider introduces collider bias:

dag_sd <- dagify(
  Z ~ X,
  Z ~ Y,
  D ~ Z,
  labels = c("Z" = "Collider", "D" = "Descendant", "X" = "X", "Y" = "Y"),
  exposure = "X",
  outcome = "Y"
) |>
  control_for("D")

dag_sd |>
  ggdag_dseparated(
    from = "X", to = "Y",
    controlling_for = "D",
    text = FALSE, use_labels = "label"
  ) +
  ggtitle("D induces collider bias")

Rules for avoiding confounding

From Statistical Rethinking (p. 286):

List all paths connecting the exposure and outcome
Classify each path as open or closed (a path is open unless it contains a collider)
Classify each path as a backdoor path (has an arrow entering the exposure)
If there are open backdoor paths, decide which variable(s) to condition on to close them

Selection bias in sampling and longitudinal research

Selection bias arises when the sample is not representative of the target population due to conditioning on a collider or its descendant:

coords_mine <- tibble::tribble(
  ~name,           ~x,  ~y,
  "glioma",         1,   2,
  "hospitalized",   2,   3,
  "broken_bone",    3,   2,
  "reckless",       4,   1,
  "smoking",        5,   2
)

dagify(hospitalized ~ broken_bone + glioma,
       broken_bone ~ reckless,
       smoking ~ reckless,
       labels = c(hospitalized = "Hospitalization",
                  broken_bone = "Broken Bone",
                  glioma = "Glioma",
                  reckless = "Reckless \nBehaviour",
                  smoking = "Smoking"),
       coords = coords_mine) |>
  ggdag_dconnected("glioma", "smoking", controlling_for = "hospitalized",
                   text = FALSE, use_labels = "label", collider_lines = FALSE)

In longitudinal research, retention can act as a descendant of a collider, introducing bias when the sample is conditioned on being retained.

Summary

We control for variables to avoid omitted variable bias. But included variable bias is also commonplace. It arises from "pipes", "colliders", and conditioning on descendants of colliders. The ggdag package can help identify adjustment sets, but the results depend on assumptions encoded in your DAG that are not part of the data. Clarify your assumptions.

Keyboard shortcuts

PSYC 434: Conducting Research Across Cultures