Welcome to PSYC 434

Conducting Research Across Cultures | Trimester 1, 2026

Prof Joseph Bulbulia | Victoria University of Wellington


Assessments

| Assessment | CLOs | Weight | Due |
|---|---|---|---|
| Lab diaries (8 × 1.25%) | 1, 2, 3 | 10% | Weekly (satisfactory/not) |
| In-class test 1 | 2 | 20% | 22 April (w7) |
| In-class test 2 | 2, 3 | 20% | 20 May (w11) |
| In-class presentation | 1, 2, 3 | 10% | 27 May (w12) |
| Research report (Option A or B) | 1, 2, 3 | 40% | 30 May (Fri) |

Accessing Lectures and Readings

  • Seminar: Wednesdays, 14:10–17:00, Easterfield Building EA120
  • Schedule: see the Schedule page for topics, readings, and assignments
  • Lectures: weekly content pages contain slides, recordings, and lab materials
  • Tests: in the same room as the seminar (bring a pen/pencil, no devices)

Contact

  • Course coordinator: Prof Joseph Bulbulia, joseph.bulbulia@vuw.ac.nz
  • Office: EA313
  • Office hours: Tuesday 14:00–15:00 or by appointment
  • R help: Boyang Cao, caoboya@myvuw.ac.nz

Course Description

From the VUW course catalogue:

This course focuses on theoretical and practical challenges for conducting research involving individuals from more than one cultural background or ethnicity. Topics include defining and measuring culture; developing culture-sensitive studies; choice of language and translation; communication styles and bias; questionnaire and interview design; qualitative and quantitative data analysis for cultural and cross-cultural research; minorities, power, and ethics in cross-cultural research; and ethno-methodologies and indigenous research methodologies. Appropriate background for this course: PSYC 338.

Course Learning Objectives

  1. Understanding causal inference. Students will develop a clear understanding of causal inference concepts and workflows, with emphasis on how they address common pitfalls in cross-cultural research. We focus first on how to ask causal questions in comparative psychology, and only then on how to answer them: designing studies, analysing data, and drawing appropriately confident conclusions about cause and effect.

  2. Understanding measurement in comparative settings. A substantial portion of this course is devoted to measurement in psychological research. We cover classical techniques for constructing and psychometrically validating measures across cultures, and clarify why statistical tests alone cannot ensure we are measuring what we intend to measure.

  3. Statistical programming in R. Students will learn the basics of programming in the statistical language R, gaining computational tools for applying causal inference methods to real data.

  4. Computing fundamentals: the command line, Git, and GitHub. Students will learn to navigate their computer through the terminal, manage projects with Git, and collaborate through GitHub. These skills matter because the most powerful tools available to researchers today, from LLMs to cloud computing, operate through text-based interfaces. Students who understand their machines will get far more out of them.


Licence

© 2026 Joseph Bulbulia. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence.

Schedule

Weekly Schedule (2026)

| Week | Date (Wed) | Content | Lab | Main Readings |
|---|---|---|---|---|
| w1 | 25 Feb | How to ask a question in psychological science? | Git and GitHub | — |
| w2 | 4 Mar | Causal diagrams: elementary structures | Install R and set up your IDE | H&R Ch 1–2 |
| w3 | 11 Mar | Causal diagrams: confounding bias | Regression, graphing, and simulation | H&R Ch 3, 6 |
| w4 | 18 Mar | Selection bias and measurement bias | Regression and confounding bias | H&R Ch 5, 7–8 |
| w5 | 25 Mar | Causal inference: average treatment effects | Average treatment effects | H&R Ch 1–3 (review) |
| w6 | 1 Apr | Effect modification / CATE | CATE and effect modification | H&R Ch 4–5; GRF guide |
| — | 8 Apr | Mid-trimester break | — | — |
| — | 15 Apr | Mid-trimester break | — | — |
| w7 | 22 Apr | In-class test 1 (20%) | — | — |
| w8 | 29 Apr | Heterogeneous treatment effects and machine learning | RATE and QINI curves | GRF guide; MAQ |
| w9 | 6 May | Resource allocation and policy trees | Policy trees | Policy learning |
| w10 | 13 May | Classical measurement theory from a causal perspective | Measurement invariance | VanderWeele (2022) |
| w11 | 20 May | In-class test 2 (20%) | — | — |
| w12 | 27 May | Student presentations (10%) | — | — |

H&R = Hernán, M. A. & Robins, J. M. (2025). Causal Inference: What If. Free PDF · Book website

Labs

Labs run in the final 60–90 minutes of the seminar. There are nine labs, across weeks 1–6 and 8–10. Your best eight lab diaries count toward the 10% assessment. See Assessments for due dates.

Assessments

Overview

| Assessment | CLOs | Weight | Due |
|---|---|---|---|
| Lab diaries (8 × 1.25%) | 1, 2, 3 | 10% | Weekly |
| In-class test 1 | 2 | 20% | 22 April (w7) |
| In-class test 2 | 2, 3 | 20% | 20 May (w11) |
| In-class presentation | 1, 2, 3 | 10% | 27 May (w12) |
| Research report | 1, 2, 3 | 40% | 30 May (Fri) |

Assessment 1: Lab Diaries (10%)

Nine weekly diaries, one per lab (weeks 1–6 and 8–10). There are no labs in week 7 (test 1), week 11 (test 2), or week 12 (presentations). Your best eight diaries count (8 × 1.25%), so you may miss one without penalty. Each diary is graded satisfactory/not satisfactory. You receive full credit for submitting a satisfactory entry. Diaries are due by the end of the lab session.

| Diary | Week | Due date |
|---|---|---|
| lab-01.md | w1 | Wed 25 Feb |
| lab-02.md | w2 | Wed 4 Mar |
| lab-03.md | w3 | Wed 11 Mar |
| lab-04.md | w4 | Wed 18 Mar |
| lab-05.md | w5 | Wed 25 Mar |
| lab-06.md | w6 | Wed 1 Apr |
| lab-08.md | w8 | Wed 29 Apr |
| lab-09.md | w9 | Wed 6 May |
| lab-10.md | w10 | Wed 13 May |

What to write

Each diary is a short reflection (~150 words) covering:

  1. What the lab covered and what you did.
  2. A connection to the week's readings or lecture content.
  3. One thing you found useful, surprising, or challenging.

Several labs include focussed exercises.

Diaries are marked full credit or no credit. A reflection that shows genuine engagement receives full credit, even if you do not answer the exercises correctly.

Format

Write each diary as a plain markdown (.md) file named by week number: lab-01.md, lab-02.md, …, lab-10.md (there is no lab-07.md). Use GitHub-flavoured markdown formatting: headings, paragraphs, bold, italics, and lists. Because you push diaries to GitHub, your files will render there automatically. These submissions build your markdown fluency; later in the course you will use Quarto to render markdown to PDF and Word.

Submission

Push your diary files to your private GitHub Classroom repository set up in Lab 1. The commit timestamp is your submission record. Your repository is private and visible only to you and the course coordinator; no additional sharing step is needed.

Markdown example

Here is a minimal diary entry showing basic markdown formatting:

# Lab 01: Introduction to R

This week we installed R and RStudio, then ran our first script.
The exercise connected to the lecture on **causal questions** by
showing how we structure data for analysis.

I found the following steps useful:

- Creating an RStudio project
- Writing a short R script
- Pushing changes to GitHub

Assessment 2: In-Class Test 1 (20%) — 22 April

Covers material from weeks 1–6 (causal diagrams, confounding, ATE, effect modification). The test itself takes 50 minutes; the session is allocated 1 hour 50 minutes. Required: pen/pencil. No devices permitted.

Test Location

The test is in class. Come to the seminar room (EA120) with a writing instrument.

Assessment 3: In-Class Test 2 (20%) — 20 May

Covers material from weeks 8–10 (heterogeneous treatment effects, machine learning, resource allocation, policy trees, and classical measurement theory). Same format and conditions as test 1.

Assessment 4: In-Class Presentation (10%) — 27 May

You will present your proposed project for the research report. You may present either (i) your Marsden EOI concept or (ii) your research report concept. Your job is to answer two questions for a non-specialist audience: what is it, and so what.

The presentation is 10 minutes, followed by one question from the audience. You must answer the question after your talk, and you may ask one brief clarifying question before answering.

You may use the whiteboard and paper notes. Do not use slides, handouts, devices, or other materials.

Your talk should cover the following points, in this order.

  1. Title and motivation (what is it, so what).
  2. Causal question, target population, exposure, and outcome.
  3. A simple causal diagram showing your identification strategy.
  4. Estimand and analysis plan (what you will estimate, and how).
  5. One key limitation or risk, and how you will address it.

Assessment criteria are clarity and structure, causal reasoning, feasibility, and your response to the question.

Assessment 5: Research Report (40%) — Due 30 May

You choose your format

Students choose one of two formats for the research report:

  • Option A: Research Report — quantify an average treatment effect using the NZAVS synthetic dataset.
  • Option B: Marsden Fund EOI — write a first-round Marsden Fund Expression of Interest using the causal inference framework.

You must declare your choice by submitting the option form on Nuku by Friday 3 April (end of w6). If no declaration is received by this date, Option A is assumed.

Generate your data using the causalworkshop package:

# install (once)
install.packages("remotes")
remotes::install_github("go-bayes/causalworkshop@v0.2.1")

# generate data
library(causalworkshop)
d <- simulate_nzavs_data(n = 5000, seed = 2026)

Choose one exposure (community_group, religious_service, or volunteer_work) and one outcome (wellbeing, belonging, self_esteem, or life_satisfaction). Lab sessions support you in this assignment. We assume no statistical background.
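Before choosing, you may find it helpful to inspect the simulated data. A minimal sketch, continuing from the code block above (the exact output depends on how the package codes each variable):

# continuing from the block above, with d already generated
str(d)                      # variables and their types
summary(d$wellbeing)        # distribution of one candidate outcome
table(d$community_group)    # distribution of one candidate exposure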

Late Penalty

The standard School of Psychological Sciences late penalty policy applies; see the Late Penalty section under Extensions and Materials.

Option A: Research Report

Quantify the Average Treatment Effect of a specific exposure on a specific outcome using the synthetic NZAVS panel dataset generated by causalworkshop::simulate_nzavs_data(). You choose one of three exposures and one of four outcomes (see the data generation instructions above).

  • Introduction: 800-word hard limit.
  • Conclusion: 800-word hard limit.
  • Methods/Results: concise, no specific word limit.
  • APA style. Submit as a single PDF with R code appendix.

Assessment Criteria (Option A)

Stating the problem. State your question clearly. Explain its scientific importance. Frame it as a causal question. Identify any subgroup analysis (e.g. cultural group). Explain the causal inference framework for non-specialists. Describe ethics/policy relevance. Confirm data are from the NZAVS simulated dataset using three waves.

Determining the outcome. Define outcome variable $Y$. Assess multiple outcomes if applicable. Explain how outcomes relate to the question. Address outcome type (binary and rare?) and timing (after exposure).

Determining the exposure. Define exposure variable $A$. Explain relevance, positivity, consistency, and exchangeability. Specify exposure type (binary or continuous) and whether you contrast static interventions or modified treatment policies. Confirm exposure precedes outcome.

Accounting for confounders. Define baseline confounders $L$. Justify how they could affect both $A$ and $Y$. Confirm they are measured before exposure. Include baseline measures of exposure and outcome in confounder set. Assess sufficiency; explain sensitivity analysis (E-values) if needed. Address confounder type (z-scores for continuous; one-hot encoding for categorical with three or more levels).

Drawing a causal diagram. Include both measured and unmeasured confounders. Describe potential measurement error biases. Add time indicators ensuring correct temporal order.

Identifying the estimand. State your causal contrast clearly.

Source and target populations. Explain how your sample relates to your target populations.

Eligibility criteria. State the eligibility criteria for the study.

Sample characteristics. Provide descriptive statistics for baseline demographics. Describe magnitude of exposure change from baseline. Include references for more information about the sample. Make clear the data are simulated.

Missing data. Check for missing data. Describe how you will address the problem (IPCW).

Model approach. Decide between G-computation, IPTW, or doubly-robust estimation. Specify your model. If your estimator uses machine learning, explain how it is used. Address outcome specifics (logistic regression for rare binary outcomes; z-scores for continuous outcomes). Describe sensitivity analysis (E-values).

Option B: Marsden Fund Expression of Interest (EOI)

Write a first-round Marsden Fund Expression of Interest (EOI) following the RSNZ 2026 guidelines. Your research question must use the causal inference framework taught in this course. Assume an Ecology, Human Behaviour, and Evolution (EHB) panel.

Templates and Guidelines

Download the official 2026 RSNZ templates before you begin.

Formatting: 12-point Times New Roman, single spacing, 2 cm margins. Submit as a single PDF.

Required Sections

Section numbers follow the 2026 RSNZ EOI form.

1a. Research Title (max 25 words). Plain language, no jargon. The title should be accessible to a scientifically literate non-specialist.

1d. Research Summary (max 200 words). This summary must be standalone: assessors outside your discipline will read it. Answer four questions in this order:

  1. What is the current state of the field? (1–2 sentences establishing the gap or problem.)
  2. What do you aim to do? (State the causal question plainly.)
  3. How will you do it? (Name the data source, design, and analytic approach.)
  4. What do you expect to find? (One sentence on anticipated results and their significance.)

2. Vision Mātauranga (max 200 words). Describe how the proposed research relates to the four Vision Mātauranga (VM) themes: (i) indigenous innovation, drawing on Māori knowledge, resources, and people; (ii) taiao, achieving environmental sustainability through iwi and hapū relationships with land and sea; (iii) hauora/oranga, improving health and social wellbeing; (iv) mātauranga, exploring indigenous knowledge and its contribution to NZ research. If none of the themes apply, you may state "not applicable" with a considered justification.

3a. Abstract (max 1 page). Cover the following: aims of the research; importance of the research area; novelty, originality, insight, and ambition of the proposed work; potential impact; methodology; and your capacity to deliver.

3b. Benefit Statement (max 400 words, 1 page). Describe the economic, environmental, or health benefit of the research to New Zealand. Explain why NZ is the right place for this research and describe potential impacts for Māori. In a student context the benefit case may be aspirational, but it must be concrete.

3c. References (max 3 pages). Bold your own name. Include article titles and full author lists (up to 12 authors; use "et al." thereafter).

3d. Roles and Resources (max 1 page). Describe the contributions of each team member, the resources required, and any ethical considerations. Use the Roles and Resources form.

Assessment Criteria (Option B)

Research. Quantifiable impact potential through novelty, originality, insight, and ambition. Rigorous methods grounded in prior research. Ability and capacity to deliver.

Benefit. Economic, environmental, or health benefit to New Zealand. Rationale for NZ-based research. In a student context the benefit case may be aspirational but must be concrete.

Vision Mātauranga. Relation to VM themes; where relevant, engagement with Māori. "Not applicable" is acceptable with considered justification.

Causal reasoning (course-specific). Well-defined causal question, clearly stated causal estimand, appropriate identification strategy. This criterion carries substantial weight.

For the full Marsden Fund assessment criteria, see the RSNZ 2026 EOI Guidelines (pdf).

AI use in this course

Students may use AI tools in this course. AI use is permitted, but not required.

AI use policy

  • You may use AI for coding help, brainstorming, and editing for clarity.
  • You are responsible for all submitted work. Verify all claims, code, and references.
  • You must be able to explain your work in your own words.
  • For lab diaries and the final report, add a short note if AI use is substantial (tool, date, and how it was used).
  • If AI output materially shaped your submission, acknowledge it as a source.
  • AI tools are not permitted in in-class tests.
  • Do not upload confidential, identifiable, or sensitive information.

Extensions and Materials

Extensions

Extensions for the final report (requests received before the mid-trimester break)

Request a new due date by emailing the course coordinator before 3 April 2026. Reasonable requests will be considered, including periods where major assessments cluster in the same week.

Extensions for laboratory diaries

Laboratory diaries are due by the end of the lab session. To receive credit for a diary, you must attend the lab unless an absence is approved. There are nine labs, but only your best eight diaries count, so you may miss one without penalty.

If you are physically unwell, or you are caring for an unwell dependant, do not attend class. Email the course coordinator as soon as possible so the absence can be recorded. When you are able, upload the diary for that week. If the diary shows serious engagement with the lab content, no late penalty will be applied.

If you are unable to attend class or lab for personal or work-related reasons unrelated to health, email the course coordinator before class where possible. Requests are considered case by case, and any alternative submission arrangement is determined by the coordinator.

Extensions for Presentations and Class-Tests

Presentations and class tests are in-class assessments. If you or a dependant is unwell, email the course coordinator and we will arrange a rescheduled in-class assessment.

Late Penalty

This is the standard School of Psychological Sciences policy on lateness penalties.

Late assignments, and assignments with extensions, may be subject to delays in marking and may not receive comprehensive feedback.

Assignments submitted late without an approved extension will incur a grade penalty of 5% of the total marks available for the assignment per day late (i.e., in 24-hr increments), up to a maximum of 5 days (up to 24 hrs late = −5%; up to 48 hrs late = −10%, etc.).

Assignments submitted more than five days late without an approved extension will not be graded unless exceptional circumstances are accepted by the Course Coordinator.

Materials and Equipment

Bring a laptop from week 1. Install R by week 2 for data analysis sessions. You may use RStudio or any other IDE/editor you prefer. Contact the instructor if you lack computer access.

For in-class tests, bring a writing utensil. Electronic devices are not permitted during tests.

Course Readings

Primary text

Hernán & Robins (2025)

Chapters 1–9 are the required reading for this course. The book is freely available from the link above and is abbreviated H&R on the schedule and in lecture notes.

Reading strategy

Read each chapter before the corresponding lecture week. The chapters are short (roughly 10–15 pages each) and written in accessible prose with worked examples. Focus on understanding the concepts rather than memorising notation.


Week-by-week readings

Week 1: How to ask a question in psychological science

Required: Hernán & Robins (2025), chapter 1. PDF

Optional: Briggs (2021) (history of measurement in psychology); Bandalos (2018) (psychometrics); Pearl & Mackenzie (2018) (accessible introduction to causal inference); Bulbulia (2024a) (causal questions in psychology).

Week 2: Causal diagrams — five elementary structures

Required: Hernán & Robins (2025), chapters 1–2 and 6. PDF

Optional: Bulbulia (2024a); Bulbulia (2024d) (experimental design and causal diagrams).

Week 3: Causal diagrams — the structures of confounding bias

Required: Hernán & Robins (2025), chapters 3 and 7. PDF

Optional: Bulbulia (2024a).

Week 4: Selection bias and measurement bias

Required: Hernán & Robins (2025), chapters 8 and 9. PDF

Optional: Bulbulia (2024c) (WEIRD samples and external validity); Bulbulia (2024b) (SWIGs and time-varying confounding).

Week 5: Average treatment effects

Required: Hernán & Robins (2025), chapters 1–2 (review identification assumptions). PDF

Week 6: Effect modification and CATE

Required: Hernán & Robins (2025), chapters 4–5. PDF

Optional: GRF documentation (causal forests, used in labs 6, 8, and 9).


General supplementary references

These are not required but provide additional depth on topics covered in the course.

  • Neal (2020), chapters 1–2. Covers the same foundations as Hernán & Robins (2025) with a machine-learning orientation.
  • Pearl et al. (2016). Compact introduction to the graphical (structural) approach to causation.
  • Generalised Random Forests (GRF) website. Documentation, guides, and vignettes used in weeks 6, 8, and 9.

Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. Guilford Publications.

Briggs, D. C. (2021). Historical and conceptual foundations of measurement in the human sciences: Credos and controversies. Routledge.

Bulbulia, J. A. (2024a). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35

Bulbulia, J. A. (2024b). Methods in causal inference part 2: Interaction, mediation, and time-varying treatments. Evolutionary Human Sciences, 6, e41. https://doi.org/10.1017/ehs.2024.32

Bulbulia, J. A. (2024c). Methods in causal inference part 3: Measurement error and external validity threats. Evolutionary Human Sciences, 6, e42. https://doi.org/10.1017/ehs.2024.33

Bulbulia, J. A. (2024d). Methods in causal inference part 4: Confounding in experiments. Evolutionary Human Sciences, 6, e43. https://doi.org/10.1017/ehs.2024.34

Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (Draft). https://www.bradyneal.com/Introduction_to_Causal_Inference-Dec17_2020-Neal.pdf

Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal inference in statistics: A primer. John Wiley & Sons.

Pearl, J., & Mackenzie, D. (2018). The book of why: The new science of cause and effect. Basic Books.

Week 1: How to Ask a Question in Psychological Science?

Slides

Lab: Git and GitHub

Causal graph: we refer to this image in the lecture and begin reviewing causal graphs in Week 2

Background readings

None today. Recommended readings are listed at the end of this page.

Key concepts for the test(s)

Today we introduce three problems that recur throughout the course:

  • Defining the question: a causal question requires a clear contrast
  • Specifying the target population: the answer depends on who the question is about
  • Unobservability of causal effects: we never observe both sides of the contrast for one person

Before next week

Bring your laptop in Week 1. Install R before Week 2. Instructions are in Lab 2: Install R and Set Up Your IDE.


Motivating example: does social media harm adolescent wellbeing?

Orben & Przybylski (2019) report a negative association between time spent on social media and wellbeing among British teenagers. The observed correlation was 0.04 in magnitude, comparable to the association between wearing glasses and wellbeing in the same dataset.

The finding was widely reported as evidence that social media harms young people. Some investigators argued the conclusions understated the harm: the teenagers who engaged most frequently with social media exhibited the lowest wellbeing scores, and the negative association is non-linear (Twenge et al., 2020).

Questions about whether social media use harms young people remain live. On 18 February 2026, CNN reported testimony in ongoing litigation about adolescent social media use (link). Courts, legislators, and parents are making decisions right now on the basis of findings reported in scientific journals.

Yet what do the associational findings really tell us? Can we move from associations to policy-relevant causal conclusions about whether social media use harms young people? If so, what steps would we need to take? And for whom would our conclusions generalise?

These questions will occupy us over the coming weeks. The aim of this course is to provide you with a set of skills that enable you to ask and answer causal questions using observational data, and to identify variability in response across subgroups in the population of interest.

A simple map for week 1

This week gives you a checklist for deciding whether a claim is even asking a causal question yet.

Three questions to ask of any causal claim

  1. What are the two states of the world being compared?
  2. For which population is the comparison meant to hold?
  3. What part of the contrast is necessarily missing from observation?

If a study claim does not answer the first two questions, it is still too vague. If it forgets the third question, it will slip from causal language into loose talk about associations.

Psychology begins with a question

Before we can answer whether social media harms teenagers, we must ask a question that is clear enough to be answered. "Does social media harm wellbeing?" is not yet a causal question because it does not specify what is being compared with what. A causal question compares two states of the world. This question names only one.

We will use two words that are easy to confuse. An association asks whether two variables co-occur. A causal effect, on the other hand, asks what would happen if we changed something about the world.

Consider the difference. "Is time on social media associated with lower wellbeing?" asks whether two variables co-occur. "Would adolescent wellbeing improve if we replaced two hours of nightly doom-scrolling with two hours of study?" asks what would happen under a specific comparison. The second question states a contrast (scrolling versus studying), a population (adolescents), an outcome (wellbeing), and a time horizon (nightly, over some stated period). The first question does not.

The comparison between two states is what we call a causal contrast, or contrast for short. A contrast is the simplest structure a causal question can have: state A versus state B, for a defined group, measured on a defined outcome, over a defined time horizon.

A practical template is: for population, what is the effect of intervention versus control on outcome, measured by measure after an exposure period of time horizon? The arrow of time is built in: the intervention comes first, then we measure the outcome.

Five parts of a usable causal question

  • Population: who is the question about?
  • Intervention: what state of the world are we interested in?
  • Control: what is the comparison condition?
  • Outcome: what do we measure?
  • Timing: when do we measure it?

Everything in this course follows from the demand that psychological questions specify their contrasts. This lecture introduces three problems that make specifying a contrast harder than it first appears.

Problem 1: both sides of the contrast must be precisely defined

When investigators evaluate "time on social media," what do they mean? Passive scrolling, direct messaging, and creative content production may differ in their consequences. We need an interval over which the behaviour occurs: one week, one month, one year. We need to specify what the comparison condition is: passive scrolling versus studying, versus socialising in person, versus something else. Without precise specification, the question has no answer because it has not yet been asked.

The two sides of a contrast have names. The condition whose consequences we want to evaluate is the intervention. The state we compare it against is the control. "Intervention" and "control" are placeholders: neither implies a medical procedure or a laboratory setting. They simply label the two states of the world that define our comparison.

Precision extends to what we measure. "Wellbeing" aggregates self-esteem, life satisfaction, anxiety symptoms, and depressive symptoms. Each is a distinct construct, and they do not always move together. We must define the outcome, its measure (for example, life satisfaction on a 0 to 10 scale), and the time frame over which we assess it. The consequences of scrolling for a teenager's wellbeing in five hundred years are zero because life ends.

Notice that specifying interventions and outcomes forces us to order events along a timeline. For one state to influence another, it must precede it. There must be a contrast condition and a stated time horizon, because timing affects the magnitude of interest. The effects of scrolling for five minutes for three weeks (contrasted with no social media) might differ from the effects of scrolling for five hours every day for five years.

In later weeks we extend this idea to more complex questions with more than two states of the world, or with sequences of actions over time. The same demand applies: name the states, the population, the outcome, and the time horizon.

Problem 2: the answer to a causal question depends on the population

The teenagers that Orben & Przybylski (2019) studied were a convenience sample from one country at one moment in time. Would the association of 0.04 hold in other countries? Would it hold today? The concept of "teenager" is itself vague. It lumps thirteen-year-olds, essentially children, with nineteen-year-olds, essentially adults. The answer to a causal question may systematically differ by age, gender, socioeconomic background, or parental attention.

Before we can evaluate whether social media influences wellbeing, we must specify the target population. The answer to a contrast for one population may differ from, or reverse for, another. There is no abstract answer to a causal question without reference to both the contrast conditions and a population.

The distinction between the sample population (who you studied) and the target population (who you want to learn about) is central to external validity, which we formalise in Week 4. We return to population specificity when we discuss variation in responses across subgroups (Weeks 6, 8, and 9) and transportability (Week 4).

Problem 3: no more than one side of the contrast can be measured for each individual

Consider Alice, who spends two hours doom-scrolling each night before bed. Suppose she enrolled in an experiment and was randomised to the doom-scrolling condition; the contrast condition is studying mathematics for two hours each night. At the end of the trial Alice reports high life satisfaction. Can we say that doom-scrolling caused Alice's life satisfaction?

We cannot, because no more than one side of the contrast can be measured for Alice in a given period. Alice followed the doom-scrolling protocol. She did not follow the mathematics protocol. We observe only one state of the world for Alice, never both.

This is the central logical problem in causal inference. A causal effect compares two possible futures for the same person, but we only ever observe one future.

We formalise this with potential outcomes notation. Let $Y_i(1)$ denote the outcome that person $i$ would experience under the intervention ($A = 1$), and $Y_i(0)$ the outcome under the control condition ($A = 0$). The individual causal effect is:

$$\delta_i = Y_i(1) - Y_i(0)$$

This quantity, $\delta_i$, is the contrast at the level of a single person: the difference between what would happen to person $i$ under treatment and what would happen under control. We observe only one of $Y_i(1)$ or $Y_i(0)$ for any individual. The individual causal effect $\delta_i$ is therefore never directly observable. This is not a limitation of our methods; it is a logical constraint. No amount of data collection, no statistical technique, and no machine learning algorithm can reveal both potential outcomes for the same person at the same time.
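To make the constraint concrete, we can simulate a small world in which both potential outcomes exist, then hide one, just as observation does. This is an illustrative sketch only; every name and number below is invented for the example:

# simulate both potential outcomes for 5 people (illustration only)
set.seed(1)
n  <- 5
y0 <- rnorm(n, mean = 6)       # Y(0): life satisfaction under control
y1 <- y0 + 0.5                 # Y(1): under intervention (true effect 0.5)
a  <- rbinom(n, 1, 0.5)        # randomised assignment
y  <- ifelse(a == 1, y1, y0)   # consistency: we observe only Y(A)
data.frame(a, y, y1, y0, delta = y1 - y0)
# outside a simulation, the y1, y0, and delta columns are never jointly observed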

Pair exercise: formulating a contrast

  1. Take the headline "Screen time linked to poor sleep in teenagers."
  2. Write a causal question that specifies both sides of the contrast (screen time versus what?), a defined outcome (which aspect of sleep?), a target population, and a time horizon.
  3. Swap with your partner and critique: is the other side of the contrast well defined? Is the population specific enough?

From individuals to populations

If individual causal effects are unobservable, what can we learn? We can learn about average effects across a population. The average treatment effect (ATE) is:

$$\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$$

This is the expected difference in the outcome if everyone in the target population experienced the intervention versus if everyone experienced the control condition. The ATE is a population-level quantity. It tells us what would happen on average, not what would happen to any particular person.
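Continuing the illustrative simulation from above with a larger sample: although no individual $\delta_i$ is ever observed, under random assignment the difference in observed group means approximates the ATE.

set.seed(1)
n  <- 10000
y0 <- rnorm(n, mean = 6)
y1 <- y0 + 0.5                        # every delta_i = 0.5, so the true ATE = 0.5
a  <- rbinom(n, 1, 0.5)
y  <- ifelse(a == 1, y1, y0)
mean(y1 - y0)                         # true ATE, knowable only in simulation
mean(y[a == 1]) - mean(y[a == 0])     # recoverable from the observed data alone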

Causal inference contrasts counterfactual states at the population or subpopulation level. When we say "social media influences wellbeing," we mean that on average, across a defined population, one pattern of use changes life satisfaction relative to the counterfactual of another pattern. We must specify the contrast conditions and the population for this statement to have content.

A short memory aid

  • A causal question needs a contrast.
  • A causal answer is always population-specific.
  • Individual causal effects are not directly observed.

What we learned

Return to the social media question. Orben & Przybylski (2019) found an association of 0.04 between social media use and lower wellbeing. Courts and legislators are treating this as evidence of harm. We now see three reasons why the leap from association to causation fails.

First, "the influence of social media on wellbeing" is undefined until we specify the interventions (scrolling versus what?), the outcomes (which dimension of wellbeing?), and the time frame. Second, the answer to a causal question depends on the target population, and the populations that matter (thirteen-year-olds in Aotearoa New Zealand today) may differ from the population studied (British teenagers before 2019). Third, the individual causal effect is never observable; we can only recover average effects under assumptions we have not yet stated.

The lesson is that before answering a question we must ask it. Psychology begins with a clearly defined question. A well-defined causal question requires a contrast between at least two interventions, a specified outcome and time horizon, and a target population. In later weeks we add the further question of whether the observed data can identify that contrast.

Most psychological research cannot randomise the variables we care about. We cannot randomly assign people to experience trauma, adopt a religion, or lose a job. Week 2 introduces the randomised experiment as the benchmark for causal inference and the graphical tools (causal diagrams) that allow us to reason about causation when randomisation is impossible.

Pair exercise: three problems in one claim

  1. Take the claim "Religion improves mental health."
  2. Specify the contrast by naming a concrete intervention and control condition (religion versus what, exactly?).
  3. Specify the population (for whom, where, and when?).
  4. Specify the outcome, the measure, and the timing (what do we measure, and when do we measure it after the exposure period?).
  5. Rewrite the claim using the course template: for population, what is the effect of intervention versus control on outcome, measured by measure after an exposure period of time horizon?
  6. Swap with your partner and critique: is the contrast precise, is the population defensible, and does the timing make sense (intervention first, outcome later)?

Further reading

For an accessible introduction to causal inference and its history, see Pearl & Mackenzie (2018). The two core causal questions and the formal treatment of causal inference appear in Bulbulia (2024).


Lab materials: Lab 1: Git and GitHub

Bulbulia, J. A. (2024). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35

Orben, A., & Przybylski, A. K. (2019). The association between adolescent well-being and digital technology use. Nature Human Behaviour, 3(2), 173–182. https://doi.org/10.1038/s41562-018-0506-1

Pearl, J., & Mackenzie, D. (2018). The book of why: The new science of cause and effect. Basic books.

Twenge, J. M., Haidt, J., Joiner, T. E., et al. (2020). Underestimating digital media harm. Nature Human Behaviour, 4, 346–348. https://doi.org/10.1038/s41562-020-0839-4

Week 2: Causal Diagrams: Five Elementary Structures

Slides

Optional Readings

Key concepts for the test(s)

  • Internal validity
  • External validity
  • Causal directed acyclic graph (causal DAG)
  • Five elementary causal structures
  • Confounding
  • d-separation
  • Backdoor path
  • Conditioning
  • Fork
  • Chain
  • Collider bias
  • Mediator bias
  • Four rules of confounding control

Lab 2 setup

Use Lab 2: Install R and Set Up Your IDE for this week's practical work. The optional script is here: Download the R script for Lab 02.


Seminar

Motivating example: the Salk vaccine

The 1954 field trial of the Salk polio vaccine was a multi-site study conducted across many communities in the United States (Dublin, 1955; Francis Jr., 1955). Two protocols were used in different participating areas. In the observed-control protocol, second-grade children whose parents consented received the vaccine. Children in the first and third grades served as controls. In the placebo-controlled protocol, children were randomised to vaccine or placebo under double-blind conditions.

The placebo-controlled protocol supported a causal conclusion. The vaccine reduced paralytic polio (Francis Jr., 1955). The observed-control protocol did not support the same conclusion, because parental consent shaped vaccine assignment.

Parents who consented tended to have higher socioeconomic status, and socioeconomic status also predicted baseline susceptibility to paralytic polio. The observed-control comparison was therefore confounded: vaccination status and polio risk differed for reasons other than the vaccine itself.

Same question. Different assignment mechanism. Different estimate, different reliability.

Salk later reported on the 1956 vaccination campaign (Salk, 1957).

Why this week matters

In Week 1 we defined causal questions. This week we learn how to represent structural assumptions using a causal directed acyclic graph (causal DAG): a diagram in which nodes represent variables and arrows represent assumed causal directions, with no cycles. A causal DAG does not create causal knowledge. It makes assumptions explicit, checkable, and discussable.

A simple map for week 2

For test purposes, most week 2 questions reduce to three steps.

How to read a DAG

  1. Identify the path structure: fork, chain, or collider.
  2. Ask whether the path is open or blocked as drawn.
  3. Ask what conditioning would do: close the path or open it.

If you can do those three things, you can usually explain the bias logic in words.

Independence language

Do not let the notation do more work than it should. It is just shorthand.

  • $A \coprod Y(a)$: $A$ and $Y(a)$ are independent.
  • $A \cancel\coprod Y(a)$: $A$ and $Y(a)$ are dependent.
  • $A \coprod Y(a)\mid L$: $A$ and $Y(a)$ are independent once we condition on $L$.

Conditioning means restricting attention to observations that share the same value of a variable (or, equivalently, adjusting for that variable in an analysis). In a causal DAG, a conditioned variable is drawn inside a box.

Randomisation and exchangeability

Under random assignment,

$$ Y(a) \coprod A. $$

This condition is unconditional exchangeability. Under this condition, a difference in means identifies the average treatment effect (ATE):

$$ \widehat{\text{ATE}}=\hat{\mathbb{E}}[Y\mid A=1]-\hat{\mathbb{E}}[Y\mid A=0]. $$

In observational studies, this condition usually fails without adjustment.
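A short simulation (illustrative only; names and effect sizes are invented) shows the contrast: under randomisation the difference in means recovers the true effect, but when a common cause $U$ drives both $A$ and $Y$, it does not.

set.seed(2026)
n <- 10000
u <- rnorm(n)                          # a common cause
a_rand <- rbinom(n, 1, 0.5)            # randomised assignment
a_obs  <- rbinom(n, 1, plogis(u))      # assignment driven by u
y_rand <- a_rand + u + rnorm(n)        # true effect of treatment is 1
y_obs  <- a_obs + u + rnorm(n)
mean(y_rand[a_rand == 1]) - mean(y_rand[a_rand == 0])   # close to 1
mean(y_obs[a_obs == 1]) - mean(y_obs[a_obs == 0])       # biased upward: u raises both a and y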

Working definitions

Internal validity concerns whether the study contrast estimates the target causal contrast in the study population.

External validity concerns whether that causal contrast transports to the target population.

These are design properties, not instrument properties. Measurement validity (Week 1) is a precondition, not a synonym.

Confounding bias exists when treatment groups differ systematically in ways that affect the outcome, so that the observed association between $A$ and $Y$ does not equal the causal effect. In graph terms, confounding corresponds to an open non-causal path from $A$ to $Y$ (a "backdoor path," formalised below in the four rules of confounding control).

Causal DAG notation and elements

Our variable naming conventions (adapted from Bulbulia, 2023)

  • $A$: treatment or exposure.
  • $Y$: outcome.
  • $Y(a)$: potential outcome under intervention level $a$.
  • $L$: measured confounder set.
  • $U$: unmeasured cause.
  • $M$: mediator.
  • $\bar{X}$: sequence of variables.
  • $\mathcal{R}$: chance mechanism, including randomisation.

Nodes, edges, and conditioning conventions (adapted from Bulbulia, 2023)

  • Arrows encode assumed causal direction.
  • Boxes indicate conditioning.
  • Open red paths indicate biasing pathways.

Five elementary structures

Five elementary structures (adapted from Bulbulia, 2023)

  1. No causal relation: $A \coprod B$. The variables are statistically independent.
  2. Direct causation: $A\to B$. The variables are statistically dependent: $A \cancel\coprod B$.
  3. Fork: $A\to B$ and $A\to C$. Because $A$ causes both $B$ and $C$, they are associated. Conditioning on $A$ removes that association: $B \coprod C \mid A$.
  4. Chain: $A\to B\to C$. Because $B$ mediates the effect of $A$ on $C$, they are associated. Conditioning on $B$ blocks the path: $A \coprod C \mid B$.
  5. Collider: $A\to C\leftarrow B$. Because $A$ and $B$ both cause $C$ but do not cause each other, they are marginally independent. Conditioning on $C$ opens a spurious association: $A \cancel\coprod B \mid C$.

These five structures generate all patterns of conditional independence and dependence in a causal DAG. Understanding which structures block and which transmit association is the basis for confounding control.
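The collider is the least intuitive of the five structures, so it rewards simulation. In this illustrative sketch (all names invented), $A$ and $B$ are independent causes of $C$; conditioning on $C$ manufactures an association between them:

set.seed(3)
n <- 100000
a <- rnorm(n)
b <- rnorm(n)                        # a and b share no cause
c <- a + b + rnorm(n)                # collider: a -> c <- b
round(cor(a, b), 3)                  # near 0: marginal independence
round(cor(a[c > 1], b[c > 1]), 3)    # negative: conditioning on c opens the path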

Three questions for any path

  • Is this path causal or non-causal?
  • Is it open or blocked right now?
  • What would conditioning on the middle variable do?

Pair exercise: naming the structure

For each scenario, name the elementary structure, state whether the two end variables are marginally associated, and predict what conditioning does.

  1. Socioeconomic status (SES) causes both neighbourhood quality and health outcomes. What structure links neighbourhood and health? What happens if you condition on SES?
  2. A drug reduces inflammation, and inflammation causes pain. What structure links the drug to pain? What happens if you condition on inflammation?
  3. Genetics affects blood pressure and diet affects blood pressure, but genetics and diet do not cause each other. What structure links genetics and diet through blood pressure? What happens if you condition on blood pressure?

Three identification assumptions

Assumption 1: Causal consistency

If person $i$ receives $A_i=a$, then $Y_i=Y_i(a)$.

Assumption 2: Conditional exchangeability

After conditioning on an adequate set $L$,

$$ Y(a) \coprod A \mid L. $$

Assumption 3: Positivity

For all treatment levels and covariate strata used for inference,

$$ P(A=a\mid L=l)>0. $$
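In practice, a first positivity check can be as simple as cross-tabulating treatment against covariate strata and looking for empty cells. An illustrative sketch with invented data:

set.seed(4)
l <- sample(c("low", "mid", "high"), 500, replace = TRUE)
p <- ifelse(l == "high", 0.02, 0.5)    # treatment nearly never given when l is "high"
a <- rbinom(500, 1, p)
table(l, a)                            # (near-)empty cells flag positivity problems
prop.table(table(l, a), margin = 1)    # estimated P(A = a | L = l) by stratum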

Pair exercise: checking assumptions against a causal DAG

  1. Draw a causal DAG for the Salk vaccine example. Include: parental consent ($L$), vaccine assignment ($A$), polio outcome ($Y$), and socioeconomic status ($U$) as an unmeasured common cause of $L$ and $Y$.
  2. In the observational design (assignment by parental consent), check exchangeability: is $Y(a) \coprod A$? Trace the open backdoor path.
  3. In the randomised design, check exchangeability: is $Y(a) \coprod A$? Explain why the path is now blocked.
  4. Check positivity in each design. In which design is a positivity violation more probable, and why?

Four rules of confounding control

Four rules of confounding control

  1. Condition on common causes (or defensible proxies). If $L$ causes both $A$ and $Y$, the fork $A \leftarrow L \to Y$ opens a backdoor path. Conditioning on $L$ closes it. When $L$ is unmeasured, conditioning on a measured proxy can reduce, though not eliminate, confounding.
  2. Do not condition on mediators when estimating total effects. If $A \to M \to Y$, conditioning on $M$ blocks part of the causal path we want to estimate.
  3. Do not condition on colliders. If $A \to C \leftarrow Y$, conditioning on $C$ opens a spurious path between $A$ and $Y$. The "control for everything" instinct is unsafe for this reason.
  4. Treat descendants carefully. Conditioning on a descendant of a variable is akin to partially conditioning on that variable. A descendant of a collider can transmit collider bias; a descendant of a confounder can partially reduce confounding.

A short rulebook for the test

  • Common cause: usually condition.
  • Mediator: do not condition if you want the total effect.
  • Collider: do not condition.
  • Descendant: ask what it is downstream of before you adjust for it.
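The ggdag package (Barrett, 2023), listed in this week's references, can apply these rules mechanically. A minimal sketch, assuming ggdag is installed; the DAG and variable names are invented for illustration:

library(ggdag)

# fork a <- l -> y together with the causal path a -> y
dag <- dagify(y ~ a + l,
              a ~ l,
              exposure = "a",
              outcome  = "y")
ggdag_adjustment_set(dag)    # plots the DAG with {l} marked as the valid adjustment set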

A note on the generality of d-separation

Two variables are d-separated ("directionally separated") in a causal DAG when every path between them is blocked. In practice, this means that the DAG implies conditional independence once the relevant conditioning set is stated. The rules above focus on confounding, but d-separation is more general than confounding control. It is the reason the same DAG logic can later be used for collider bias, mediator bias, and measurement problems.

Return to the opening example

The Salk example is a structural lesson about assignment. The observed-control design produced a biased effect estimate because socioeconomic status confounded the comparison. The randomised design blocked that path. Causal DAGs help us state this lesson before analysis. First we define the question. Then we draw assumptions. Then we choose an adjustment set. Then we estimate.

Where do causal assumptions come from?

A causal DAG encodes assumptions. Those assumptions do not come from the data. They come from prior knowledge: theory, mechanism, previous studies, and domain expertise. This dependence on existing knowledge might seem circular. If we need knowledge to draw a causal DAG, and a causal DAG is required for causal inference, where do we start?

Otto Neurath's metaphor of the ship at sea captures the answer:

We are like sailors who on the open sea must reconstruct their ship but are never able to start afresh from the bottom. Where a beam is taken away a new one must at once be put there, and for this the rest of the ship is used as support. In this way, by using the old beams and driftwood, the ship can be shaped entirely anew, but only by gradual reconstruction. (Neurath, 1973, p. 199)

Causal diagrams are planks in Neurath's boat. We build them from the best available knowledge, test their implications, and revise when evidence warrants. The alternative, letting data alone determine causal structure, is not available. Data reveal associations. Associations are compatible with many causal structures. Without assumptions, the data do not tell us which structure generated them.

Next week we apply these structures to the specific problem of confounding bias: the patterns that open backdoor paths and the strategies for closing them.

Epilogue: avoid "within-person" and "between-person"

  1. Students often describe designs as "within-person" or "between-person". These labels feel intuitive, but they hide the causal object. "Between-person" in particular can mislead because it sounds like we compare two different populations. In an experiment we have one population, which we project into two potential states under two intervention levels. Randomisation lets two groups stand in for those two projected states.

In this course we instead name a target population, two intervention regimes, an outcome, and the time that we measure that outcome.

  2. This framing works even when the target population contains one unit. Let the population be Alice. Define two regimes over time: $a=1$ means Alice doom-scrolls for two hours nightly for four weeks, and $a=0$ means she studies for two hours nightly for four weeks. Let the outcome be her life satisfaction, measured after each four-week period. The causal contrast for Alice is $\delta_{\text{Alice}}=Y_{\text{Alice}}(1)-Y_{\text{Alice}}(0)$.

  3. Week 1 shows why this contrast is inaccessible from observation alone. Alice can follow regime $a=1$ or regime $a=0$, but not both in the same period. We therefore observe only one of $Y_{\text{Alice}}(1)$ or $Y_{\text{Alice}}(0)$. This missing counterfactual is not a statistical inconvenience. It is a logical constraint.

  4. Week 2 adds a second lesson. Even when we target a population-level average, we still need a defensible assignment story. Causal DAGs let us state, and critique, the assumptions that connect the observed data to the causal contrast. They do not rescue imprecise language. They force us to say what we compare, for whom, and why.

Pair exercise: Neurath's ship and your own causal DAG

  1. Draw a causal DAG from your own discipline or research interest with at least four variables.
  2. Identify one fork and one chain in your causal DAG.
  3. Swap with your partner. Your partner plays sceptic: challenge one arrow by proposing an alternative causal direction or an omitted common cause.
  4. Revise your causal DAG in response. State what changed and why.

Further reading

The identification assumptions and randomisation framework are treated in Hernán & Robins (2025) (chapters 1–2) and Bulbulia (2024a). See also Bulbulia (2024b).


Lab materials: Lab 2: Install R and Set Up Your IDE

Barrett, M. (2023). Ggdag: Analyze and create elegant directed acyclic graphs. https://github.com/malcolmbarrett/ggdag

Bulbulia, J. A. (2024a). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35

Bulbulia, J. A. (2024b). Methods in causal inference part 4: Confounding in experiments. Evolutionary Human Sciences, 6, e43. https://doi.org/10.1017/ehs.2024.34

Dublin, T. D. (1955). 1954 poliomyelitis vaccine field trial: Plan, field operations, and follow-up observations. JAMA, 158(14), 1258–1265. https://doi.org/10.1001/jama.1955.02960140020003

Francis Jr., T. (1955). Evaluation of the 1954 poliomyelitis vaccine field trial: Further studies of results determining the effectiveness of poliomyelitis vaccine (salk) in preventing paralytic poliomyelitis. JAMA, 158(14), 1266–1270. https://doi.org/10.1001/jama.1955.02960140028004

Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Neurath, O. (1973). Anti-spengler. In M. Neurath & R. S. Cohen (Eds.), Empiricism and sociology (pp. 158–213). Springer Netherlands. https://doi.org/10.1007/978-94-010-2525-6_6

Salk, J. E. (1957). Poliomyelitis vaccination in the fall of 1956. American Journal of Public Health and the Nation’s Health, 47(1), 1–18. https://doi.org/10.2105/AJPH.47.1.1

Week 3: Causal Diagrams: The Structures of Confounding Bias

Slides

Date: 11 Mar 2026

Readings

Required

  • Hernán & Robins (2025), chapter 6. link

Optional

Key concepts for the test(s)

  • Confounding bias
  • Backdoor path
  • Backdoor criterion
  • Valid adjustment set
  • M-bias
  • Regression
  • Intercept
  • Regression coefficient
  • Model fit
  • Why model fit is misleading for causality

Lab 3 setup

Use Lab 3: Regression, Graphing, and Simulation for this week's practical work. The optional script is here: Download the R script for Lab 03.


Week 2 introduced five elementary causal structures and the rules of d-separation. This week we use those structures to diagnose confounding bias: when does conditioning on a variable remove bias, and when does it create bias?

Seminar

Motivating example: higher $R^2$, worse identification

Suppose investigators want to estimate the total effect of an exercise programme ($A$) on cardiovascular risk ($Y$). They begin with a regression that adjusts for baseline confounders. Then they add body composition measured after the programme ($M$). The model $R^2$ rises, because body composition is a strong correlate of cardiovascular risk.

Did the higher $R^2$ improve the causal estimate? No. If the programme changes body composition and body composition changes cardiovascular risk, then $M$ is a mediator on the path $A \to M \to Y$. Conditioning on $M$ blocks part of the very effect the investigators wanted to estimate. The model fits the observed data better, but it answers the wrong causal question.

A simple map for week 3

Week 3 asks one repeated question: which variables should we condition on, and why?

A practical workflow

  1. Draw the relevant backdoor paths from $A$ to $Y$.
  2. Decide which variables would block those paths.
  3. Check that you are not conditioning on a mediator, a collider, or a descendant of treatment.

That is the logic behind the backdoor criterion. Regression is only one way of carrying out the conditioning decision.

Learning outcomes

By the end of this week, you will be able to:

  1. Define confounding bias and identify it in a DAG.
  2. Apply the backdoor criterion.
  3. Explain why good model fit does not rule out confounding.
  4. Distinguish confounding problems that time ordering can solve from those it cannot.
  5. Define M-bias and explain why conditioning on a pre-treatment collider can introduce bias.

What is confounding?

Confounding exists when a common cause of treatment $A$ and outcome $Y$ opens a non-causal backdoor path.

Definition: confounding bias

Confounding bias exists when at least one backdoor path from $A$ to $Y$ is open.

A backdoor path starts with an arrow into $A$.

Example: exercise and blood pressure. Health consciousness ($L$) may affect exercise ($A$). Health consciousness ($L$) may also affect blood pressure ($Y$). Then $A \leftarrow L \to Y$ is a backdoor path. If we do not condition on $L$, we mix causal association and spurious association.
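A quick simulation of this example (illustrative only; names and effect sizes invented) shows the backdoor path at work: the unadjusted estimate is biased, and conditioning on $L$ recovers the true effect.

set.seed(5)
n <- 10000
l <- rnorm(n)                    # health consciousness
a <- rbinom(n, 1, plogis(l))     # exercise, more likely when l is high
y <- -2 * a + 3 * l + rnorm(n)   # blood pressure; true effect of a is -2
coef(lm(y ~ a))["a"]             # confounded estimate, biased upward
coef(lm(y ~ a + l))["a"]         # near -2 once the backdoor path is blocked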

Quick test

If a path from $A$ to $Y$ starts with an arrow into $A$, treat it as a candidate backdoor path.

The backdoor criterion

Pearl's backdoor criterion tells us when an adjustment set is valid.

Definition: backdoor criterion

A set $L$ satisfies the backdoor criterion for $A$ and $Y$ if:

  1. No variable in $L$ is a descendant of $A$.
  2. $L$ blocks every backdoor path from $A$ to $Y$.

If both conditions hold, conditioning on $L$ supports exchangeability: $Y(a) \coprod A \mid L$.

A short memory aid

  • Block all backdoor paths.
  • Do not adjust for descendants of treatment.

Pair exercise: applying the backdoor criterion

  1. Draw a DAG for the effect of exercise ($A$) on cardiovascular risk ($Y$) with three additional variables: health consciousness ($L_1$), diet ($L_2$), and body composition ($M$, a mediator on the $A \to Y$ path). Include arrows: $L_1 \to A$, $L_1 \to L_2$, $L_2 \to Y$, $L_1 \to Y$, $A \to M \to Y$.
  2. List all paths from $A$ to $Y$.
  3. Check whether ${L_1}$ satisfies the backdoor criterion. Does it block every backdoor path without conditioning on a descendant of $A$?
  4. Explain why adding $M$ to the adjustment set violates the backdoor criterion (which part of the causal path does it block?).

Confounding and regression

Regression is one way to condition on measured variables. For example,

$$ Y = \beta_0 + \beta_1A + \beta_2L + \varepsilon $$

Definition: key regression terms

  • Intercept ($\beta_0$): expected outcome when covariates are zero.
  • Coefficient ($\beta_1$): expected outcome difference per unit change in $A$, conditional on model terms.
  • Model fit ($R^2$): variance explained by the fitted model.

High $R^2$ does not imply no confounding. Fit is a statistical property. Confounding is a causal-structure property. A model can fit the observed data very well and still answer the wrong causal question.
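A minimal simulation makes the point concrete (a sketch in R; the variable names and effect sizes are hypothetical, not course data). Conditioning on the post-treatment mediator raises $R^2$ while biasing the estimate of the total effect:

# Sketch: higher R^2, worse causal answer.
# True total effect of A on Y is 1 (0.5 direct + 0.5 via the mediator M).
set.seed(2026)
n <- 5000
A <- rbinom(n, 1, 0.5)             # randomised treatment
M <- 0.5 * A + rnorm(n)            # mediator: affected by A
Y <- 0.5 * A + 1.0 * M + rnorm(n)  # total effect of A = 0.5 + 0.5 * 1 = 1

fit_total    <- lm(Y ~ A)          # targets the total effect
fit_mediator <- lm(Y ~ A + M)      # conditions on a post-treatment mediator

coef(fit_total)["A"]               # close to 1 (the total effect)
coef(fit_mediator)["A"]            # close to 0.5 (only the direct path)
summary(fit_total)$r.squared       # lower R^2
summary(fit_mediator)$r.squared    # higher R^2, wrong causal answer

The mediator-adjusted model explains more variance, yet its coefficient on $A$ no longer answers the total-effect question.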

Why model fit is misleading for causality

A model can fit very well and still be causally wrong.

If investigators condition on a mediator, they can block part of the target effect.

If investigators condition on a collider, they can open a spurious path.

Neither problem is diagnosed by $R^2$.

Confounding problems that time ordering can resolve

Cross-sectional measurements blur temporal order. Longitudinal design can resolve several ambiguities. A common strategy is:

  1. Measure confounders at baseline ($t_0$).
  2. Measure treatment at $t_1$.
  3. Measure outcome at $t_2$.

Confounding problems resolved by time-series data (adapted from Bulbulia, 2023)

Confounding problems that time ordering alone cannot resolve

Some problems persist even with multiple waves.

Confounding problems not resolved by time-series data (adapted from Bulbulia, 2023)

Definition: M-bias

M-bias occurs when investigators condition on a pre-treatment collider.

In the structure $U_1 \to L \leftarrow U_2$, with $U_1 \to A$ and $U_2 \to Y$, conditioning on $L$ opens a previously blocked path between $A$ and $Y$.

M-bias is important because "control for everything" is not a safe rule.
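A short simulation shows the mechanism (a sketch; names and numbers hypothetical). With no effect of $A$ on $Y$, the unadjusted estimate is unbiased, and adjusting for the pre-treatment collider $L$ manufactures an association:

# M-bias sketch: conditioning on a pre-treatment collider L opens a path.
# Structure: U1 -> L <- U2, U1 -> A, U2 -> Y; A has no effect on Y.
set.seed(2026)
n  <- 10000
U1 <- rnorm(n)
U2 <- rnorm(n)
L  <- U1 + U2 + rnorm(n)   # collider of the two unmeasured causes
A  <- U1 + rnorm(n)
Y  <- U2 + rnorm(n)        # true effect of A on Y is zero

coef(lm(Y ~ A))["A"]       # approximately 0: the path through L is blocked
coef(lm(Y ~ A + L))["A"]   # non-zero (negative here): adjusting for L opens it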

Pair exercise: M-bias in practice

  1. Consider the question "does religious attendance increase charitable giving?" Suppose neighbourhood social capital ($L$) is a collider of two unmeasured causes: one that affects attendance ($U_1$) and one that affects giving ($U_2$).
  2. Draw the DAG with $U_1 \to L \leftarrow U_2$, $U_1 \to A$, and $U_2 \to Y$.
  3. Trace what conditioning on $L$ does: which path opens?
  4. State in one sentence why "adjust for all pre-treatment variables" fails here.

Worked example: mediation assumptions

The assumptions in causal mediation (adapted from Bulbulia, 2023)

Mediation analysis needs stronger assumptions than total-effect analysis. Treatment-induced confounding of the mediator-outcome relation can make standard regression unsuitable.

Return to the opening example

Back to the exercise programme example. Higher $R^2$ did not answer the causal question, because adding post-treatment body composition changed the estimand. To estimate the total effect of the programme on cardiovascular risk, we need a defended DAG and a valid adjustment set, not just a better-fitting regression. This is why we separate modelling from causal identification.

What to remember for the test

  • Confounding is about open non-causal backdoor paths.
  • The backdoor criterion tells us when an adjustment set is valid.
  • Regression can implement conditioning, but it cannot tell us which variables should be conditioned on.
  • Better fit is not the same as better identification.

Confounding is one structural threat to causal identification. Week 4 adds two others: selection bias and measurement bias.

Pair exercise: $R^2$ versus identification

  1. Investigator A adjusts for age, income, and education ($R^2 = 0.42$). Investigator B adjusts for age and conscientiousness ($R^2 = 0.31$).
  2. Explain to your partner why higher $R^2$ does not imply less confounding.
  3. Propose a DAG where Investigator A's larger adjustment set introduces bias (hint: include a collider or mediator).
  4. State what would need to be true for Investigator B's smaller set to satisfy the backdoor criterion.

Further reading

All open access: Bulbulia (2024); Hernán & Robins (2025), chapter 6.


Lab materials: Lab 3: Regression, Graphing, and Simulation

Bulbulia, J. A. (2024). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35

Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (Draft). https://www.bradyneal.com/Introduction_to_Causal_Inference-Dec17_2020-Neal.pdf

Suzuki, E., Shinozaki, T., & Yamamoto, E. (2020). Causal diagrams: Pitfalls and tips. Journal of Epidemiology, 30(4), 153–162. https://doi.org/10.2188/jea.JE20190192

Week 4: Selection Bias and Measurement Bias

Slides

Readings

Required

Optional

Key concepts for the test(s)

  • Independent/uncorrelated measurement error bias
  • Independent/correlated measurement error bias
  • Dependent/uncorrelated measurement error bias
  • Dependent/correlated measurement error bias
  • Collider bias as a distinct mechanism
  • Selection bias and transportability

Lab 4 setup

Use Lab 4: Writing Regression Models for this week's practical work. The lab page links the student practice script first and the instructor script second.


Seminar

Motivating example: one study, two failure modes

Suppose investigators recruit participants for a bilingualism study through university mailing lists. Recruitment is selective. People with high academic motivation and strong language confidence are more likely to enrol.

Now add measurement error. Suppose the cognitive task is validated only in English. Then non-English-dominant participants may be mismeasured.

One study can fail in two ways. It can fail through biased sampling. It can fail through biased measurement.

Learning outcomes

By the end of this week, you should be able to:

  1. Use fork, chain, and collider structures to recognise how bias enters a study.
  2. Classify four structural types of measurement error.
  3. Explain how conditioning on selection can bias causal contrasts.
  4. Distinguish target, source, and analytic populations for transport claims.

Why this week extends Weeks 2-3

Weeks 2-3 focused on confounding paths between treatment and outcome. That may have made DAGs look like a tool just for "finding confounders". They are more general than that.

Week 4 adds two structural threats. The first threat is selection bias. The second threat is measurement bias. Both can be analysed with the same elementary graph logic you already know.

The same graph logic still applies

The key idea from the earlier weeks is not just "adjust for confounders". The key idea is that a small number of graph structures determine which paths are open and which are blocked. Once you can recognise a fork, a chain, and a collider, you can start to diagnose bias in almost any study.

More complicated studies are built from simpler pieces. Selection problems, attrition problems, and measurement problems usually look confusing at first because they involve more nodes. But the underlying logic is still the same.

Five elementary graph structures, adapted from Bulbulia (2023)

Five elementary structures

  1. No causal relation: no arrow connects the variables.
  2. Direct causation: one variable causes another.
  3. Fork: one variable is a common cause of two others.
  4. Chain: one variable sits on the path between two others.
  5. Collider: one variable is a common effect of two others.

When you see $A \coprod Y$, read this as "A is statistically independent of Y". When you see $A \cancel\coprod Y$, read this as "A is statistically dependent on Y".

A short rulebook for week 4

Four practical rules for conditioning on DAGs, adapted from Bulbulia (2023)

Four practical rules

  • Condition on common causes or their good proxies when you want to block a non-causal backdoor path.
  • Do not condition on mediators when you want the total effect, because you would block part of the causal pathway.
  • Do not condition on colliders, because conditioning opens a path that was previously blocked.
  • Be careful with descendants and proxies, because conditioning on them can behave like conditioning on their parent.

This is the bridge to this week. Selection bias often appears because study entry, study retention, or analytic restriction acts like a collider or a descendant of a collider. Measurement bias often appears because the variable we record is a noisy proxy or downstream consequence of the variable we actually care about.

Pair check: what does conditioning do?

  1. In a fork, $A \leftarrow L \to Y$, what does conditioning on $L$ do?
  2. In a chain, $A \to M \to Y$, what happens if we condition on $M$ when we want the total effect of $A$ on $Y$?
  3. In a collider, $A \to C \leftarrow Y$, what happens if we condition on $C$?

You only need one sentence for each answer.

Common causal questions as graphs

Common causal questions presented as causal graphs, adapted from Bulbulia (2024)

Different questions require different graphs. Different questions require different causal estimands. Different questions require different assumptions. The important point for this week is that the same picture language can represent confounding, selection, measurement, mediation, and transport problems. The effect modification graph (open circle into the outcome) reappears below when we distinguish two mechanisms of selection bias.

A typology of measurement error bias

Measurement error is not a separate universe from the DAGs you have already seen. It becomes easier to understand once we separate the true variable from the recorded variable. Then the same questions return: what causes what, what is shared, and what happens if we condition on the recorded value?

A typology of measurement error bias, adapted from Bulbulia (2024)

Four structural types of measurement error

Measurement error is classified along two dimensions.

Dimension 1: independent vs dependent

  • Independent (undirected): one true variable does not causally affect another variable's measurement error.
  • Dependent (directed): one true variable causally affects another variable's measurement error.

Dimension 2: uncorrelated vs correlated

  • Uncorrelated: errors do not share a common cause.
  • Correlated: errors share a common cause.

Combinations:

  1. Independent, uncorrelated: often attenuates effects toward the null. Example: a self-report anxiety scale adds random noise to the true score. The noise is unrelated to treatment status, so it blurs the signal without creating a false one.
  2. Independent, correlated: can create spurious associations even when no causal effect exists. Example: societies with advanced record-keeping produce more precise records of both religious beliefs and social complexity. The shared cause (record-keeping quality) induces a non-causal association between treatment and outcome measures.
  3. Dependent, uncorrelated: can open non-causal paths from treatment to measured outcome. Example: participants who receive an intervention report their outcomes more favourably because the treatment itself changes how they interpret survey items. The exposure causally affects measurement of the outcome.
  4. Dependent, correlated: can bias in either direction, and the direction is hard to predict analytically. Example: social complexity shapes how historical archives record both religious beliefs and governance structures, and the errors in both records share a common cause in elite patronage of scribes.
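The first type can be simulated in a few lines (a sketch, not course code). Classical attenuation is easiest to see with random error in a continuous exposure; purely random error in the outcome instead widens confidence intervals without shifting the point estimate:

# Sketch: independent, uncorrelated error in a continuous exposure
# attenuates the estimated effect toward the null (classical attenuation).
set.seed(2026)
n <- 5000
A_true <- rnorm(n)                  # true exposure
Y <- 1.0 * A_true + rnorm(n)        # true effect = 1
A_obs <- A_true + rnorm(n, sd = 1)  # recorded exposure with random noise

coef(lm(Y ~ A_true))["A_true"]  # close to 1
coef(lm(Y ~ A_obs))["A_obs"]    # close to 0.5: attenuation factor
                                # var(A_true) / (var(A_true) + var(error)) = 1/2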

Pair exercise: classifying measurement error

For each scenario, classify the measurement error using the two dimensions (independent/dependent, correlated/uncorrelated) and name the type number (1-4).

  1. A self-report screen-time measure adds noise because participants guess rather than track. Social desirability also inflates wellbeing reports. The errors are unrelated to each other and unrelated to treatment status.
  2. A cognitive test for bilingualism effects is validated only in English. Non-English-dominant participants are systematically mismeasured. The treatment (bilingualism) causally affects measurement of the outcome.
  3. A cross-cultural study uses the same translation team for exposure and outcome instruments. Shared translation quality introduces correlated errors in both measures.

For scenario 2, draw a short DAG showing how the treatment ($A$) creates a path through the measurement node to the recorded outcome.

Selection bias and transportability

Selection bias occurs when inclusion in the analytic sample depends on variables related to treatment, outcome, or effect modifiers. It threatens validity in two structurally distinct ways.

Mechanism 1: collider conditioning (internal validity). When both treatment and outcome affect who enters the sample, selection acts as a collider. Conditioning on it opens a non-causal path between $A$ and $Y$. The estimate is biased for the population it claims to describe.

Mechanism 2: effect modifier imbalance (external validity). Even without confounding, a sample can fail to generalise. Suppose treatment $A$ is randomised, so no backdoor paths are open. If a variable $Z$ modifies the effect of $A$ on $Y$, and $Z$ is distributed differently in the analytic sample than in the target population, the sample ATE does not equal the population ATE. No non-causal path is opened; internal validity is intact. The problem is that the average treatment effect is a weighted average of subgroup effects, and the weights differ between populations.

Effect modification without confounding: $Z$ modifies the effect of $A$ on $Y$ (open circle) but no backdoor path is open

The open circle on the arrow from $Z$ to $Y$ denotes effect modification: $Z$ changes the size of $A$'s effect on $Y$. This is not a standard causal arrow. No confounding is present. Yet if $Z$ is distributed differently in the sample than in the target population, the sample ATE does not transport.

This second mechanism does not require a collider. A study of exercise and blood pressure conducted entirely in young adults may correctly estimate the ATE for young adults. If older adults benefit more (effect modification by age), the sample ATE underestimates the population ATE. The design is unconfounded but the conclusion does not transport.
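A sketch of this second mechanism in R (all numbers hypothetical): treatment is randomised, so internal validity holds, yet a sample with a different distribution of the effect modifier $Z$ returns a different ATE:

# Sketch: randomised treatment, effect modification by Z, no confounding.
# The sample ATE differs from the population ATE when Z's distribution differs.
set.seed(2026)
n <- 100000
Z <- rbinom(n, 1, 0.5)           # effect modifier; population is 50% Z = 1
A <- rbinom(n, 1, 0.5)           # randomised treatment
Y <- (1 + 2 * Z) * A + rnorm(n)  # effect is 1 if Z = 0, 3 if Z = 1

ate_pop <- mean(Y[A == 1]) - mean(Y[A == 0])  # about 2 (average of 1 and 3)

s <- Z == 0 | runif(n) < 0.1     # selective sample: Z = 1 rarely enrols
ate_sample <- mean(Y[s & A == 1]) - mean(Y[s & A == 0])  # much closer to 1

c(population = ate_pop, sample = ate_sample)

No backdoor path is opened; the sample estimate is internally valid but does not transport.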

Threats to external validity, adapted from Bulbulia (2024)

Transportability asks whether effect-relevant structure is compatible between analytic and target populations. This requires knowing where effect modifiers differ, not just whether the sample is "representative" in some demographic sense.

Target, source, and analytic populations

  • Target population: where we want the causal claim to apply.
  • Source population: where recruitment occurs.
  • Analytic sample: who is actually analysed.

Transportability requires that effect-relevant structure is compatible between analytic and target populations.

Collider bias: a distinct mechanism

Collider bias can feel new because the earlier weeks mostly taught you how to close open backdoor paths. Here the warning runs in the opposite direction: some conditioning decisions create bias rather than remove it.

Why collider bias is not confounding. Confounding arises from an open backdoor path through a common cause: $A \leftarrow L \to Y$. We usually reduce confounding by conditioning on $L$. Collider bias works in the opposite direction. In the structure $A \to C \leftarrow Y$, the path is blocked at first ($A \coprod Y$). Conditioning on $C$ opens a spurious association ($A \cancel\coprod Y \mid C$).

Why collider bias is not identical to selection bias. When collider conditioning happens through sample restriction, it appears as selection bias because the sample is truncated. Berkson's bias is the classic example. But collider bias can also appear when we stratify or adjust for a common effect inside a complete dataset. In that case the problem comes from the analytic decision, not from who entered the sample.

Why the same DAG rules still work. Pearl's d-separation criterion tells us which paths are opened and closed by conditioning. That is why DAGs help with more than confounding. The same framework lets us reason about collider bias, mediator bias, measurement error, and selection problems.

For this course, the practical upshot is simple: never condition on common effects, whether through sample restriction, stratification, or statistical adjustment. Conditioning on a collider opens a non-causal path.
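The hospital scenario in the pair exercise below can be simulated directly (a sketch; variable names hypothetical). $A$ and $Y$ are independent, but restricting to admitted patients conditions on their common effect:

# Sketch of Berkson-style collider bias: A and Y are independent,
# but restricting to admitted patients (C = 1) induces an association.
set.seed(2026)
n <- 10000
A <- rnorm(n)                     # e.g., depression severity
Y <- rnorm(n)                     # e.g., injury severity (affects recovery)
C <- as.integer(A + Y + rnorm(n, sd = 0.5) > 1)  # admission: common effect

cor(A, Y)                   # approximately 0 in the full population
cor(A[C == 1], Y[C == 1])   # negative: conditioning on the collider opens a path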

Pair exercise: collider bias versus confounding

  1. A hospital study investigates whether depression ($A$) slows recovery ($Y$). Ward admission ($C$) depends on both depression severity and injury severity. Only admitted patients are analysed.
  2. Draw a DAG with $A \to C \leftarrow Y$ (ward admission as a collider of depression and recovery-related injury severity).
  3. Explain the non-causal path that opens when the study conditions on $C$ by restricting to admitted patients.
  4. Your partner argues "this is just confounding by injury severity." Counter by explaining the structural difference: confounding is an open backdoor path through a common cause, whereas collider bias opens a previously blocked path by conditioning on a common effect.
  5. Propose one design change that avoids this bias.

Attrition as a measurement error structure

Right-censoring (attrition) can bias causal estimates through two distinct mechanisms. The first is distortion: if the outcome affects who drops out, conditioning on the end-of-study sample conditions on a common effect of exposure and outcome. This opens a non-causal path. The bias is an internal validity problem; the estimate is wrong for the population it claims to describe.

The second mechanism is restriction: if effect modifiers are distributed differently among survivors than in the baseline population, the average treatment effect (ATE) estimated from the end-of-study sample may not match the ATE for the target population. No non-causal path is opened, but the sample no longer represents the population of interest. This is an external validity problem; the estimate may be correct for survivors but does not transport.

The structural parallel to measurement error is direct. Distortion through attrition mirrors dependent measurement error (type 3 above): the outcome causally affects what is recorded. Restriction through attrition mirrors independent, uncorrelated measurement error when a true effect is off the null (type 1): the signal is diluted because the analytic sample differs from the target in composition. Investigators should diagnose which mechanism is operating, because the remedies differ: inverse-probability-of-censoring weights address distortion, whereas reweighting to the target population addresses restriction.

WEIRD samples and effect heterogeneity

A WEIRD sample is not automatically invalid. The problem is Mechanism 2: if effect modifiers are distributed differently between the analytic and target populations, and treatment effects vary by those modifiers, the sample ATE does not transport. A perfectly unconfounded study in a WEIRD sample can produce a correct estimate for that sample and a wrong estimate for the population of interest.

Measurement invariance is a transport problem for constructs. If a scale measures different constructs across groups, between-group contrasts can reflect measurement artefact.

Return to the opening example

Back to the bilingualism study. Two design checks are non-negotiable. First, why did these participants enter the analytic sample? Second, do the instruments measure the same constructs across participants? If either check fails, causal interpretation weakens.

With the structural threat landscape mapped (confounding, selection, measurement), Week 5 shows how the three identification assumptions introduced in Week 2 connect a causal question to a population-level causal contrast.

Pair exercise: auditing a study for two failure modes

  1. Return to the bilingualism example from the start of this lecture.
  2. Name the selection bias mechanism (what variable is acting as a collider or filter?).
  3. Name the measurement bias type from the four-type classification (independent/dependent, correlated/uncorrelated).
  4. Write a two-sentence design critique stating both problems and how each distorts the causal contrast.

Further reading

All open access: Bulbulia (2024a); Bulbulia (2024b); Bulbulia (2024c).


Lab materials: Lab 4: Writing Regression Models

Bulbulia, J. A. (2024a). Methods in causal inference part 1: Causal diagrams and confounding. Evolutionary Human Sciences, 6, e40. https://doi.org/10.1017/ehs.2024.35

Bulbulia, J. A. (2024b). Methods in causal inference part 2: Interaction, mediation, and time-varying treatments. Evolutionary Human Sciences, 6, e41. https://doi.org/10.1017/ehs.2024.32

Bulbulia, J. A. (2024c). Methods in causal inference part 3: Measurement error and external validity threats. Evolutionary Human Sciences, 6, e42. https://doi.org/10.1017/ehs.2024.33

Bulbulia, J., & Hine, D. W. (2024). Causal inference in environmental psychology. PsyArXiv. https://osf.io/preprints/psyarxiv/tbjx8

Hernán, M. A. (2017). Invited commentary: Selection bias without colliders. American Journal of Epidemiology, 185(11), 1048–1050. https://doi.org/10.1093/aje/kwx077

Hernán, M. A., & Cole, S. R. (2009). Invited commentary: Causal diagrams and measurement bias. American Journal of Epidemiology, 170(8), 959–962. https://doi.org/10.1093/aje/kwp293

Hernán, M. A., Hernández-Díaz, S., & Robins, J. M. (2004). A structural approach to selection bias. Epidemiology, 15(5), 615–625. https://www.jstor.org/stable/20485961

Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

VanderWeele, T. J., & Hernán, M. A. (2012). Results on differential and dependent measurement error of the exposure and the outcome using signed directed acyclic graphs. American Journal of Epidemiology, 175(12), 1303–1310. https://doi.org/10.1093/aje/kwr458

Week 5: Causal Inference: Average Treatment Effects

Slides

Date: 25 Mar 2026

Readings

Required

  • Hernán & Robins (2025), chapters 1-3. link
  • Cashin et al. (2025). TARGET (Transparent Reporting of Observational Studies Emulating a Target Trial) statement.

Optional

  • Neal (2020), chapters 1-2.

Key concepts for the test(s)

  • The fundamental problem of causal inference
  • Average (marginal) treatment effects
  • Causal consistency
  • Exchangeability
  • Positivity

Lab

Use Lab 5: Average Treatment Effects for this week's hands-on work.

Terminology

In these notes, we use "potential outcomes" and "counterfactual outcomes" interchangeably.


Weeks 1 through 4 built a framework for asking causal questions and identifying the structural threats that obstruct answers: confounding, selection bias, and measurement error. Week 2 stated the three identification assumptions. This week shows how those assumptions, together with a well-specified target trial, connect a causal question to an estimable population contrast.

The shift in emphasis matters. Week 4 asked, "where can bias enter?" Week 5 asks, "what causal contrast do we want, and what assumptions let observed data stand in for the missing counterfactuals?"

Seminar

Opening example: one question, two different answers

Observational studies often suggest that students who choose to use a mindfulness app have lower anxiety than students who do not. Randomised trials usually find a smaller and less consistent average benefit once investigators standardise when the intervention begins, what counts as treatment, and when outcomes are measured.

Same substantive question. Different design. Different answer.

This week explains why.

A simple map for this week

This lecture has three moves.

Three moves

  1. Write the causal contrast we want.
  2. See why we cannot observe that contrast for one person.
  3. Use identification assumptions to recover a population average from observed data.

Potential outcomes and DAGs do different jobs. Potential outcomes define the causal contrast. DAGs help us judge whether exchangeability is plausible, which variables belong in $L$, and whether the design has a coherent time zero.

Step 1: state the causal question

Return to the mindfulness example. Let $A=1$ denote starting a guided mindfulness app at the beginning of semester and completing one 10-minute session per day for eight weeks. Let $A=0$ denote not starting the app. Let $Y$ be anxiety symptoms at week 8.

For student $i$, $Y_i(1)$ is the outcome under the app-based intervention, and $Y_i(0)$ the outcome without it. The individual causal effect is:

$$ Y_i(1)-Y_i(0). $$

We never observe both terms for one student at one time.

The fundamental problem of causal inference

The individual causal effect requires two quantities, but we can observe at most one. If student $i$ starts the app ($A_i = 1$), we observe $Y_i(1)$ but not $Y_i(0)$. If the student does not start it ($A_i = 0$), we observe $Y_i(0)$ but not $Y_i(1)$. The unobserved term is the counterfactual.

This is not a sample-size problem or a measurement problem. It is a structural feature of the physical world: a person cannot simultaneously exist under two incompatible conditions. No dataset, however large, contains both potential outcomes for one individual at one time. Individual causal effects are missing by necessity, not by accident.

Pair exercise: building a potential outcomes table

  1. Consider four students in the mindfulness example. Construct a table with columns: person ($i$), $Y_i(1)$, $Y_i(0)$, $\delta_i = Y_i(1) - Y_i(0)$, treatment received ($A_i$), and observed outcome ($Y_i^{\mathrm{obs}}$).
  2. Assign plausible values: let two students receive $A = 1$ and two receive $A = 0$. Make the true individual effects vary (e.g., one positive, one negative, two zero).
  3. Compute the true ATE from all four $\delta_i$ values.
  4. Compute the naive difference in means: $\bar{Y}_{A=1}^{\mathrm{obs}} - \bar{Y}_{A=0}^{\mathrm{obs}}$.
  5. Do the two quantities match? Explain why the discrepancy arises (or why it does not).
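A sketch of such a table in R (the values are hypothetical, chosen only to make the individual effects vary; lower anxiety scores are better):

# A minimal potential-outcomes table for four students.
po <- data.frame(
  i  = 1:4,
  Y1 = c(4, 6, 5, 7),  # anxiety at week 8 under the app
  Y0 = c(6, 5, 5, 7),  # anxiety at week 8 without the app
  A  = c(1, 1, 0, 0)   # treatment actually received
)
po$delta <- po$Y1 - po$Y0                    # true individual effects
po$Y_obs <- ifelse(po$A == 1, po$Y1, po$Y0)  # the switching equation

mean(po$delta)                               # true ATE: -0.25
mean(po$Y_obs[po$A == 1]) - mean(po$Y_obs[po$A == 0])  # naive difference: -1

The naive difference need not equal the true ATE, because each group's observed mean stands in for a counterfactual mean only under exchangeability.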

Step 2: move from individuals to populations

Because individual effects are unobservable, we target a population causal estimand. The average treatment effect (ATE) is:

$$ \text{ATE}=\mathbb{E}[Y(1)-Y(0)]. $$

This is the mean contrast if everyone in the target population received $A=1$ versus $A=0$.

Step 3: connect the causal estimand to observed data

The three identification assumptions are easier to remember if you ask what each one contributes to the argument.

Assumption 1: causal consistency

If $A_i=a$, then $Y_i=Y_i(a)$. The observed outcome equals the potential outcome corresponding to the treatment actually received.

Data scientists estimate parameters for observed data. Causal inference goes further: we estimate contrasts involving counterfactual parameters. We compute the average response when the entire target population is exposed, then when the entire population is unexposed, then contrast these averages. Consistency is what allows us to bridge from counterfactual to observed. Without it, potential outcomes remain purely theoretical.

The general switching equation expresses the observed outcome as a function of treatment and both potential outcomes:

$$ Y_i^{obs} = A_i \cdot Y_i(1) + (1 - A_i) \cdot Y_i(0). $$

Each person carries two potential outcomes, but we observe only the one selected by their actual treatment. For treated individuals ($A_i = 1$), the switching equation reduces to:

$$ Y_i^{obs} = 1 \cdot Y_i(1) + 0 \cdot Y_i(0) = Y_i(1). $$

For untreated individuals ($A_i = 0$):

$$ Y_i^{obs} = 0 \cdot Y_i(1) + 1 \cdot Y_i(0) = Y_i(0). $$

In short:

$$ Y_i = Y_i(1) \quad \text{if } A_i = 1; \qquad Y_i = Y_i(0) \quad \text{if } A_i = 0. $$

Consistency subsumes two conditions that are sometimes stated separately. No interference requires that one person's treatment does not affect another person's outcome. Treatment-version irrelevance requires that there is only one version of each treatment level. Both are special cases: if treatments are heterogeneous or if interference exists, the potential outcome $Y(a)$ is ill-defined. Consistency fails when treatment versions are mixed under one label. If "mindfulness practice" includes different apps, dosages, and start dates under the same label, $Y(1)$ does not refer to one intervention.

Assumption 2: exchangeability

The conditional probability of receiving each level of the exposure, though not decided by investigators, depends only on the measured covariates (Chatton et al., 2020; Hernán & Robins, 2025).

In a randomised trial, exchangeability holds unconditionally:

$$ Y(a) \coprod A. $$

In observational data, we require conditional exchangeability. For each $a$:

$$ Y(a) \coprod A \mid L, $$

where $L$ is the set of covariates sufficient to ensure the independence of the counterfactual outcomes and the exposure. Equivalently, $A \coprod Y(a) \mid L$. When this condition holds, counterfactual outcomes are independent of actual exposures received, conditional on $L$.

Exchangeability cannot be verified from observed data. It can only be defended by subject-matter knowledge and a plausible DAG. This is the no-unmeasured-confounding assumption.

Assumption 3: positivity

The probability of receiving every value of the exposure within all strata of covariates is greater than zero (Hernán & Robins, 2025; Westreich & Cole, 2010):

$$ 0 < P(A=a \mid L=l) < 1, \quad \forall\, a \in \mathcal{A},\; \forall\, l \in \mathcal{L}. $$

There are two types of positivity violation.

Random non-positivity occurs when the sample data do not contain all levels of exposure within strata for which exposures are defined. For example, if no participants aged 22–24 received treatment, investigators must extrapolate from other ages. Random non-positivity can be addressed by modelling assumptions, but those assumptions carry their own risks.

Deterministic non-positivity occurs when it is scientifically impossible for certain strata to receive specific levels of exposure. For example, biological males cannot receive hysterectomy. Deterministic violations require restricting the analysis to scientifically plausible cases.

Positivity is the one identification assumption we can partially check empirically. Plot propensity score distributions and look for gaps or near-zero densities.
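A minimal version of this check in R (a sketch with simulated data; in practice, replace the simulated frame with your own treatment and baseline covariates):

# Sketch: checking positivity empirically with propensity score overlap.
set.seed(2026)
d <- data.frame(L1 = rnorm(500), L2 = rnorm(500))
d$A <- rbinom(500, 1, plogis(1.5 * d$L1))   # treatment depends on L1

ps_model <- glm(A ~ L1 + L2, family = binomial, data = d)
d$ps <- fitted(ps_model)                    # estimated P(A = 1 | L)

# Overlaid histograms: look for regions where one arm has near-zero density.
hist(d$ps[d$A == 1], breaks = 30, col = rgb(0, 0, 1, 0.4), xlim = c(0, 1),
     main = "Propensity score overlap", xlab = "P(A = 1 | L)")
hist(d$ps[d$A == 0], breaks = 30, col = rgb(1, 0, 0, 0.4), add = TRUE)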

What each assumption buys us

  • Consistency links one observed outcome to one potential outcome. It is what connects counterfactual quantities to data.
  • Exchangeability lets observed outcomes in one group stand in for the missing counterfactual outcomes in the other.
  • Positivity ensures that the needed comparison exists in every relevant subgroup.

How the assumptions recover population contrasts

Start with the easiest case: a randomised trial, where exchangeability holds without conditioning. The assumptions work in sequence:

$$ \begin{aligned} \underbrace{\mathbb{E}[Y(1)]}_{\color{blue}{\text{everyone treated}}} &\overset{\text{exchangeability}}{=} \underbrace{\mathbb{E}[Y(1)\mid A=1]}_{\color{blue}{\text{treated arm}}} \overset{\text{consistency}}{=} \underbrace{\mathbb{E}[Y \mid A=1]}_{\color{teal}{\text{observed treated mean}}}, \\ \underbrace{\mathbb{E}[Y(0)]}_{\color{red}{\text{everyone untreated}}} &\overset{\text{exchangeability}}{=} \underbrace{\mathbb{E}[Y(0)\mid A=0]}_{\color{red}{\text{control arm}}} \overset{\text{consistency}}{=} \underbrace{\mathbb{E}[Y \mid A=0]}_{\color{orange}{\text{observed untreated mean}}}. \end{aligned} $$

So the ATE becomes

$$ \text{ATE} = \underbrace{\mathbb{E}[Y(1)]}_{\color{blue}{\text{counterfactual treated mean}}} - \underbrace{\mathbb{E}[Y(0)]}_{\color{red}{\text{counterfactual untreated mean}}} = \underbrace{\mathbb{E}[Y \mid A=1]}_{\color{teal}{\text{observed treated mean}}} - \underbrace{\mathbb{E}[Y \mid A=0]}_{\color{orange}{\text{observed untreated mean}}}. $$

This is the key identification move. We replace missing counterfactual averages with observed group averages.

In observational data, the same logic works only after conditioning on a sufficient set $L$. Positivity then ensures that each relevant stratum contains both treated and untreated individuals, so those adjusted comparisons are estimable.

Pair exercise: tracing the identification logic

  1. Your partner claims "students who chose the mindfulness app had lower anxiety, therefore the app works."
  2. Walk through each identification assumption in turn. Where does the reasoning break?
  3. Check consistency: were all students labelled $A = 1$ receiving the same intervention?
  4. Check exchangeability: is $Y(a) \coprod A$, or could the students who chose the app differ systematically from those who did not?
  5. Check positivity: are there covariate strata where no students used (or declined) the app?
  6. State which assumption is most plausibly violated and why.

The observational-data version

Assume consistency, exchangeability given $L$, and positivity. Then:

$$ \mathbb{E}[Y(a)] = \sum_l \mathbb{E}[Y \mid A=a, L=l]P(L=l). $$

So the ATE is identified by standardisation:

$$ \text{ATE}=\sum_l \underbrace{\Big(\mathbb{E}[Y \mid A=1,L=l]-\mathbb{E}[Y \mid A=0,L=l]\Big)}_{\color{teal}{\text{within-stratum observed contrast}}}\, \underbrace{P(L=l)}_{\color{blue}{\text{stratum weight}}}. $$
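A sketch of this standardisation formula in R with a single binary covariate (all numbers hypothetical). The within-stratum contrasts are weighted by the distribution of $L$:

# Sketch of standardisation (g-computation) with one binary covariate L.
set.seed(2026)
n <- 10000
L <- rbinom(n, 1, 0.4)                     # baseline confounder
A <- rbinom(n, 1, plogis(-0.5 + 1.5 * L))  # treatment depends on L
Y <- 1.0 * A + 2.0 * L + rnorm(n)          # true effect of A is 1

# Within-stratum contrasts, weighted by P(L = l)
strata <- sapply(0:1, function(l) {
  mean(Y[A == 1 & L == l]) - mean(Y[A == 0 & L == l])
})
ate_std <- sum(strata * c(mean(L == 0), mean(L == 1)))

naive <- mean(Y[A == 1]) - mean(Y[A == 0])  # confounded: larger than 1
c(naive = naive, standardised = ate_std)    # standardised estimate near 1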

What we can check, and what we cannot

Positivity is the only assumption we can directly inspect with data. If some covariate strata contain no treated (or no untreated) individuals, the contrast for those strata relies on model extrapolation rather than observed comparisons.

Consistency requires that "treatment" means the same thing for everyone labelled $A = 1$. In the mindfulness example, beginning a guided app at semester start, trying one unguided breathing exercise in week 5, and attending a group class irregularly are different interventions. A well-specified target trial defines treatment precisely enough that consistency is defensible.

Exchangeability cannot be verified from observed data. We can check whether measured covariates are balanced after adjustment, but we cannot test whether unmeasured common causes remain. This is why the DAG matters: it forces investigators to state which variables they believe are sufficient and to defend that belief with subject-matter knowledge.

Design and subject-matter knowledge are not optional extras. They are what makes identification assumptions assessable.

Quick diagnostic

  • If treatment is vague, worry about consistency.
  • If treated and untreated people differ in causes of the outcome, worry about exchangeability.
  • If one treatment level barely occurs in some subgroup, worry about positivity.

Return to the opening example

The mindfulness discrepancy illustrates what happens when investigators fail to emulate a target trial. If "users" are defined as students who have already adopted the app, the design mixes recent starters with persistent users who may differ in motivation, baseline distress, and help-seeking. That makes the exposed group look healthier than a true start-of-intervention comparison would justify.

The lesson is that design comes before estimation. If the hypothetical trial is not specified, the identifying assumptions are hard to interpret and even harder to defend. We now have the tools to identify and estimate an average causal contrast for a defined population. Week 6 asks the next question: does that contrast vary across subgroups?

Pair exercise: designing a target trial

  1. Your intervention is daily meditation (20 minutes). Your outcome is anxiety symptoms at 6 months. Your target population is university students.
  2. State the causal estimand precisely: what are the two contrast conditions?
  3. Define time zero (when does follow-up begin?).
  4. Name two baseline covariates you would adjust for, and give a causal rationale for each (draw a short DAG if it helps).
  5. Describe one plausible positivity failure in this setting (a subgroup where one side of the contrast is effectively empty).


The causal workflow

The identification assumptions are not items on a checklist to be ticked off after analysis. They are design commitments that must be defended before estimation begins. The following workflow organises these commitments into a sequence. Each step depends on the ones before it. Also see the course causal workflow reference page.

Step 0: state a well-defined treatment. Specify the hypothetical intervention precisely enough that every member of the target population could, in principle, receive it. "Mindfulness" is too vague: people meditate with different apps, in groups, alone, once, or every day. A clearer intervention is: "start a guided mindfulness app at the beginning of semester and complete one 10-minute session per day for eight weeks." Precision here underwrites consistency and makes the timeline visible.

Step 1: establish time zero. Define the point at which treatment assignment begins and follow-up starts. We cannot do this until the treatment is specified, because time zero is the moment that intervention becomes assigned or initiated. In the mindfulness example, time zero is the beginning of semester when students start the app or are assigned not to start it. Without a clear time zero, consistency is undermined and exchangeability is hard to assess.

Step 2: state a well-defined outcome. Define the outcome so the causal contrast is meaningful and temporally anchored. "Wellbeing" is underspecified; "psychological distress one year post-intervention measured with the Kessler-6" is interpretable and reproducible. Include timing, scale, and instrument.

Step 3: clarify the target population. Say exactly who you aim to inform. Eligibility rules define the source population, but sampling and participation can yield an analytic sample with a different distribution of effect modifiers (Bulbulia, 2024). If you intend to generalise beyond the source population (transport), articulate the additional conditions required.

Step 4: evaluate exchangeability. Make the case that potential outcomes are independent of treatment conditional on covariates: $Y(a) \coprod A \mid L$ (Hernán & Robins, 2020). Use design and diagnostics: DAGs, subject-matter arguments, pre-treatment covariate balance, and overlap checks. If exchangeability is doubtful, redesign rather than rely solely on modelling.

Step 5: ensure causal consistency. Consistency requires that, for units receiving a treatment version compatible with level $a$, the observed outcome equals $Y(a)$. It also presumes well-defined versions and no interference between units (VanderWeele & Hernán, 2013; Hernán & Robins, 2020). When multiple versions exist, either refine the intervention so versions are irrelevant to $Y(a)$, or condition on version-defining covariates.

Step 6: check positivity. Each treatment level must occur with non-zero probability at every covariate profile needed for exchangeability (Westreich & Cole, 2010). Diagnose limited overlap using propensity score distributions and extreme weights. Consider design-stage remedies (trimming, restriction) before estimation.

Step 7: ensure measurement aligns with the scientific question. Be explicit about probable forms of measurement error (classical, Berkson, differential, misclassification) and their structural implications for bias (Hernán & Robins, 2020; Bulbulia, 2024). Where feasible, incorporate validation studies or calibration models.

Step 8: preserve representativeness from start to finish. Differential attrition, non-response, or measurement processes tied to treatment and outcomes can induce selection bias in the presence of true effects (Hernán, 2017; Bulbulia, 2024). Plan strategies such as inverse probability weighting for censoring, multiple imputation under defensible mechanisms, and sensitivity analyses for data missing not at random.

Step 9: document the reasoning that supports steps 0–8. Make assumptions, disagreements, and judgement calls legible. Register or time-stamp the analytic plan. Include identification arguments, code, and data where possible. Report robustness and sensitivity analyses. Transparent reasoning is a scientific result in its own right (Ogburn & Shpitser, 2021).

Reporting: the TARGET checklist

The TARGET (Transparent Reporting of Observational Studies Emulating a Target Trial) statement (Cashin et al., 2025) provides a structured checklist for reporting studies that emulate a target trial. It maps directly onto the causal workflow above. Use it when writing up results.

| No. | Item | Workflow step |
| --- | --- | --- |
| | **Title and Abstract** | |
| 1 | Identify that the study emulates a target trial; state objectives | Steps 0–1 |
| 2 | Report the data source(s) | Step 3 |
| 3 | Summarise assumptions, methods, findings, conclusions | Steps 4–9 |
| | **Introduction** | |
| 4 | Describe scientific background and gap | Motivation |
| 5 | Summarise the causal question | Steps 0–2 |
| 6 | Describe rationale for emulating a target trial | Steps 0–1 |
| | **Methods** | |
| 7 | Cite data sources; describe purpose, type, setting, time period | Step 3 |
| 8a | Eligibility criteria and operationalisation | Step 3 |
| 8b | Treatment strategies and operationalisation | Step 0 |
| 8c | Assignment procedures and operationalisation | Step 4 |
| 8d | Follow-up: starts at assignment; describe operationalisation | Step 1 |
| 8e | Outcomes and operationalisation | Step 2 |
| 8f | Causal contrasts and operationalisation | Steps 0–2 |
| 8g | Identifying assumptions; describe related variables | Steps 4–6 |
| 8h | Data analysis procedures for each causal estimand | Steps 4–8 |
| 8i | Additional analyses for each causal estimand | Step 9 |
| | **Results** | |
| 9 | Numbers assessed, eligible, and assigned | Step 3 |
| 10 | Baseline characteristics by treatment strategy | Step 4 |
| 11 | Follow-up length and reasons for end of follow-up | Steps 1, 8 |
| 12 | Missing data frequency by variable | Step 8 |
| 13 | Outcome frequency or distribution at each wave | Step 2 |
| 14 | Effect estimates with measures of precision | Steps 4–6 |
| 15 | Sensitivity of estimates to choices and assumptions | Step 9 |
| | **Discussion** | |
| 16 | Interpretation of key findings | |
| 17 | Limitations, including differences between target trial and emulation | Steps 4–8 |
| | **Other Information** | |
| 18 | Ethics approval | |
| 19 | Registration | Step 9 |
| 20 | Data and code availability | Step 9 |
| 21 | Funding sources | |
| 22 | Conflicts of interest | |

Lab materials: Lab 5: Average Treatment Effects


Appendix A: notation variants

Equivalent notations for the individual contrast include

$$ Y_i^{1} - Y_i^{0} $$

and

$$ Y_i(a=1) - Y_i(a=0). $$

Bulbulia, J. A. (2024). Methods in causal inference part 3: Measurement error and external validity threats. Evolutionary Human Sciences, 6, e42. https://doi.org/10.1017/ehs.2024.33

Cashin, A. G., Hansford, H. J., Hernán, M. A., et al. (2025). Transparent reporting of observational studies emulating a target trial—the TARGET statement. JAMA, 334(12), 1084–1093. https://doi.org/10.1001/jama.2025.13350

Hernán, M. A., & Robins, J. M. (2020). Causal inference: What if. Taylor & Francis. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Hernán, M. A. (2017). Invited commentary: Selection bias without colliders. American Journal of Epidemiology, 185(11), 1048–1050. https://doi.org/10.1093/aje/kwx077

Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (Draft). https://www.bradyneal.com/Introduction_to_Causal_Inference-Dec17_2020-Neal.pdf

Ogburn, E. L., & Shpitser, I. (2021). Causal modelling: The two cultures. Observational Studies, 7(1), 179–183. https://doi.org/10.1353/obs.2021.0006

VanderWeele, T. J., & Hernán, M. A. (2013). Causal inference under multiple versions of treatment. Journal of Causal Inference, 1(1), 1–20.

Week 6: Effect Modification and CATE

Slides

Date: 1 Apr 2026

Required reading

  • Hernán & Robins (2025), chapters 4-5. link

Optional reading

Key concepts for assessment

  • Causal estimand versus statistical estimand
  • Interaction (joint interventions)
  • Effect modification (one intervention, subgroup contrasts)
  • CATE $\tau(x)$ and estimated CATE $\hat{\tau}(x)$
  • Why statistical interaction terms do not automatically imply causal effect modification

Week 5 defined the average treatment effect (ATE) and the assumptions required to estimate it from a well-defined intervention at a clear time zero. An average, though, can hide meaningful variation. This week extends the framework from "does the intervention work on average?" to "for whom does it work more, or less?"

The main difficulty this week is vocabulary. Psychology often uses "interaction", "moderation", "heterogeneity", and "personalised effects" as if they were interchangeable. They are not.

Seminar

Motivating example

A randomised exercise programme lowers blood pressure by 3 mmHg on average. That average can hide meaningful variation. Some participants improve a lot, while others barely change. If we only report the ATE, we can miss the information needed for treatment and policy decisions.

A simple map for this week

Keep these four ideas separate from the start.

Four ideas to keep separate

  • Interaction: the joint effect of two interventions.
  • Effect modification: variation in the effect of one intervention across subgroups.
  • Regression product term: a feature of a statistical model, such as $A \times G$.
  • CATE: the subgroup-level causal contrast, $\tau(x)$.

Rule of thumb

If you cannot write the estimand as $\mathbb{E}[Y(1) - Y(0) \mid X = x]$ for baseline $X$ measured at time zero, you are not estimating a CATE.

First distinction: interaction versus effect modification

Start with the scientific question, not the software output. If the design involves one intervention and subgroup contrasts, the question is about effect modification. If the design involves two interventions taken together, the question is about interaction.

Interaction

Interaction concerns two interventions, not one. Let $A$ and $B$ be interventions and let $Y$ be the outcome. On the additive scale, interaction is

$$ \mathbb{E}[Y(1,1)] - \mathbb{E}[Y(1,0)] - \mathbb{E}[Y(0,1)] + \mathbb{E}[Y(0,0)]. $$

If this contrast is non-zero, the joint effect is not additive on this scale.

Effect modification

Effect modification concerns one intervention across subgroups. For a subgroup variable $G$, effect modification exists when

$$ \mathbb{E}[Y(1) - Y(0) \mid G = g_1] \neq \mathbb{E}[Y(1) - Y(0) \mid G = g_2]. $$

This is still the effect of $A$ on $Y$. It is not the causal effect of $G$ on $Y$.

Scale matters

Interaction and effect-modification claims are scale-specific. A difference-scale result need not match a ratio-scale result.

Second distinction: causal modification versus model terms

This is where many psychology papers go wrong.

A regression interaction term ($A\times G$) is a model parameter. Causal effect modification is a property of potential-outcome contrasts under identification assumptions. A model term can be non-zero because of misspecification or bias, so it is not causal evidence by itself.

Pair exercise: interaction versus effect modification

  1. A study reports a "significant exercise-by-age interaction" in a regression of blood pressure on exercise, age, and their product term.
  2. State the causal estimand for interaction (hint: it requires four potential outcomes under joint interventions on exercise and age, which is conceptually odd because we cannot intervene on age).
  3. State the causal estimand for effect modification (hint: it involves one intervention on exercise, with subgroup contrasts across age groups).
  4. Which concept, interaction or effect modification, matches the study design? Give a reason the regression interaction term could be non-zero without any causal modification (e.g., model misspecification or collider bias).

CATE as the operational target

For a measured baseline profile $X=x$ defined at time zero,

$$ \tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]. $$

$\tau(x)$ is a subgroup average causal contrast. For person $i$, $\hat{\tau}(X_i)$ is an estimate of that subgroup contrast. $\hat{\tau}(X_i)$ is not the unobservable individual contrast $Y_i(1)-Y_i(0)$.

Personalised effects versus true individual effects

Students sometimes read $\hat{\tau}(X_i)$ as the effect of treatment on person $i$. It is not. Person $i$'s true effect, $Y_i(1) - Y_i(0)$, requires both potential outcomes, and we observe at most one. What $\hat{\tau}(X_i)$ estimates is the average effect across all people who share person $i$'s measured profile $X_i$. The estimate is "personalised" in the sense that it uses person $i$'s covariates, but it remains a subgroup average. Two people with identical $X_i$ can have different true effects if they differ on unmeasured variables.

When the literature refers to "individualised treatment effects," the intended meaning is almost always $\hat{\tau}(X_i)$: an estimated subgroup average, not the unknowable individual contrast.

Identification reminders

Week 4's graph logic still matters here. Week 5's design logic still matters too: the treatment must remain well-defined, covariates must precede treatment, and subgroup contrasts are causal only if the same identification conditions still hold. Effect-modification questions are still causal questions, so confounding does not disappear just because we are now interested in subgroup differences.

For interaction with two interventions, we need identification of the joint intervention contrast. A common condition is conditional exchangeability for joint treatment assignment:

$$ Y(a, b) \coprod (A, B) \mid L. $$

Here $L$ must block all relevant backdoor paths from $A$ and $B$ to $Y$.

Diagram illustrating causal interaction. Assessing the joint effect of two interventions, A (e.g., teaching method) and B (e.g., tutoring), on outcome Y (e.g., test score). L_A represents confounders of the A-Y relationship, and L_B represents confounders of the B-Y relationship. Red arrows indicate biasing backdoor paths requiring adjustment.

Identification of causal interaction requires adjusting for all confounders of A-Y (L_A) and B-Y (L_B). Boxes around L_A and L_B indicate conditioning, closing backdoor paths.

For effect modification of $A$ by $G$, we still need valid control of confounding for the $A \to Y$ relation, typically within strata of $G$.

How shall we investigate effect modification of A on Y by G? Can you see the problem?

For a larger handout version of these effect-modification graphs, see Effect modification using causal graphs.

Effect modification by proxy

A variable can modify the treatment effect without directly causing the outcome. In the graph below, $Z$ is the direct effect modifier (open circle: it changes the size of $A$'s effect on $Y$). $G$ inherits this modification through its association with $Z$.

Effect modification by proxy: $G$ modifies $A$'s effect on $Y$ through its relationship to the direct effect modifier $Z$. Open circle denotes effect modification, not a standard causal arrow.

Whether $G$ remains an effect modifier depends on what else is in the model. If investigators condition on $Z$, then $G$ becomes independent of $Y$ and is no longer an effect modifier. Effect modification is relative to the adjustment set, not an intrinsic property of $G$ (VanderWeele & Robins, 2007; VanderWeele, 2012).

d-separation does not imply absence of effect modification

The graph below poses a subtler problem. To identify the effect of $A$ on $Y$, we condition on $L$. But $G$ causes $L$, and conditioning on $L$ d-separates $G$ from $Y$. Does this mean $G$ is not an effect modifier?

d-separation $\neq$ no effect modification: $G$ is d-separated from $Y$ conditional on $L$, yet $G$ can still modify the effect of $A$ on $Y$ because $G$ shifts the distribution of $L$.

No. Even when $G \coprod Y \mid L$, the CATE for a given level of $G$ is a weighted average of the $L$-specific treatment effects, where the weights come from the distribution of $L$ given $G$:

$$ \tau(g) = \mathbb{E}\left[\mathbb{E}[Y(1) - Y(0) \mid L] \middle| G = g\right]. $$

Two conditions are sufficient for effect modification by $G$. First, the effect of $A$ on $Y$ varies across levels of $L$. Second, the distribution of $L$ differs across levels of $G$ (which it does, because $G \to L$). When both hold, $\tau(g)$ varies with $g$. Effect modification by $G$ is present even though $G$ has no direct structural path to $Y$ after conditioning.

This result has practical consequences. Investigators who equate d-separation with absence of effect modification will miss genuine heterogeneity. A non-significant regression interaction term between $A$ and $G$, after adjusting for $L$, does not prove that $G$ is irrelevant to treatment targeting. The CATE can still vary by $G$ because $G$ shifts the covariate distribution over which the $L$-specific effects are averaged.
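A simulation illustrates the point (a sketch; names and numbers hypothetical). $G$ has no direct path to $Y$, yet the CATE differs across levels of $G$ because the distribution of $L$ does:

# Sketch: G -> L, effect of A varies with L; G remains an effect modifier
# even though G is d-separated from Y given L and A.
set.seed(2026)
n <- 100000
G <- rbinom(n, 1, 0.5)             # e.g., age group
L <- rbinom(n, 1, 0.2 + 0.6 * G)   # fitness distribution shifts with G
A <- rbinom(n, 1, 0.5)             # randomised treatment
Y <- (1 + 2 * L) * A + L + rnorm(n)  # effect is 1 if L = 0, 3 if L = 1

cate_g <- function(g) {
  mean(Y[A == 1 & G == g]) - mean(Y[A == 0 & G == g])
}
c(tau_G0 = cate_g(0), tau_G1 = cate_g(1))  # about 1.4 versus 2.6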

Two rules of thumb

  • A variable can modify the treatment effect even if it has no direct arrow to the outcome in the adjusted DAG.
  • Whether a variable is an effect modifier depends on what other variables are in the conditioning set. Effect modification is relative, not absolute.

Pair exercise: why conditioning changes effect modification

  1. An exercise programme ($A$) targets blood pressure ($Y$). Age ($G$) affects fitness ($L$), and $L$ affects $Y$. There is no direct $G \to Y$ path.
  2. Explain to your partner why the CATE still varies by age, even without a direct $G \to Y$ arrow (hint: the distribution of $L$ differs across age groups).
  3. A colleague fits a regression with an $A \times G$ interaction term and finds it non-significant. They conclude "age does not modify the treatment effect." Evaluate this conclusion.
  4. Describe a scenario where two apparent effect modifiers ($G_1$ and $G_2$) both show significant CATE variation individually, but the variation disappears when you condition on both simultaneously.

Why flexible estimators matter

With many covariates, hand-built interaction models are fragile for four reasons. First, the number of possible interactions grows combinatorially: $k$ covariates generate $\binom{k}{2}$ pairwise interactions and far more higher-order terms. Second, each interaction subgroup contains fewer observations, so estimates become noisy. Third, searching across many interactions inflates false-positive rates unless corrected. Fourth, the analyst must specify the functional form in advance, and real treatment-response surfaces are rarely linear.

Flexible estimators such as causal forests learn the heterogeneity surface from data. They can recover non-linear and high-dimensional patterns without requiring the analyst to guess the correct specification. These estimators help with functional form, but they do not remove confounding by design.

Demo: functional form matters

This simulation has randomised treatment.

There is no confounding.

The challenge is functional form.

# install once
# remotes::install_github("go-bayes/causalworkshop@v0.2.1")
library(causalworkshop)

# simulate data with non-linear heterogeneous effects
d <- simulate_nonlinear_data(n = 2000, seed = 2026)

# compare four estimation methods
results <- compare_ate_methods(d)

# summary table: ATE and individual-level RMSE
results$summary

# plot: estimated vs true treatment effects
results$plot_comparison

# plot: estimated effect as a function of x1
results$plot_by_x1

All methods recover the ATE in this simulation. They differ in how well they recover heterogeneity.

Return to the opening example

Back to exercise and blood pressure: the ATE tells us whether the programme helps on average. The CATE tells us where gains are concentrated. For policy and clinical decisions, we usually need both.

After the mid-trimester break and Test 1 (Week 7), Week 8 introduces machine-learning methods that estimate these subgroup contrasts in high dimensions, without requiring the analyst to specify the functional form in advance.

Pair exercise: from average to subgroup

  1. An exercise programme has ATE = 3 mmHg reduction in blood pressure.
  2. Construct a scenario where the conditional average treatment effect (CATE) is 8 mmHg for one subgroup and $-2$ mmHg for another, consistent with this ATE (specify group sizes).
  3. Explain what a policy-maker reading only the ATE is missing.
  4. Your partner claims "$\hat{\tau}(X_i) = 8$ means the programme will reduce my blood pressure by 8 mmHg." Correct this claim using the distinction between estimated subgroup averages and unobservable individual effects.

Lab materials: Lab 6: CATE and Effect Modification


Appendix A: additive interaction simplification

Starting from

$$ \big(\mathbb{E}[Y(1,1)] - \mathbb{E}[Y(0,0)]\big) - \big(\mathbb{E}[Y(1,0)] - \mathbb{E}[Y(0,0)]\big) - \big(\mathbb{E}[Y(0,1)] - \mathbb{E}[Y(0,0)]\big) $$

we collect terms to obtain

$$ \mathbb{E}[Y(1,1)] - \mathbb{E}[Y(1,0)] - \mathbb{E}[Y(0,1)] + \mathbb{E}[Y(0,0)]. $$

Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

VanderWeele, T. J. (2009). On the distinction between interaction and effect modification. Epidemiology, 20(6), 863–871.

VanderWeele, T. J. (2012). Confounding and Effect Modification: Distribution and Measure. Epidemiologic Methods, 1(1), 55–82. https://doi.org/10.1515/2161-962X.1004

VanderWeele, T. J., & Robins, J. M. (2007). Four types of effect modification: a classification based on directed acyclic graphs. Epidemiology (Cambridge, Mass.), 18(5), 561–568. https://doi.org/10.1097/EDE.0b013e318127181b

Week 7: In-Class Test 1 (20%)

22 April 2026

This week is the first in-class test, covering material from weeks 1–6.

What is covered

  • Causal diagrams: elementary structures (week 2)
  • Confounding bias structures (week 3)
  • Selection bias and measurement bias (week 4)
  • Average treatment effects and the three fundamental assumptions (week 5)
  • Effect modification and CATE (week 6)

Format

  • Duration: 50 minutes (allocated time: 1 hour 50 minutes)
  • Closed book, no devices
  • Bring a pen or pencil
  • Location: EA120 (the usual seminar room)

Reminders

  • You may bring one A4 sheet of notes.
  • No electronic devices permitted during the test.
  • Arrive on time. The test begins at 14:10.
  • Write clearly; illegible answers cannot be marked.

Week 8: Heterogeneous Treatment Effects and Machine Learning

Date: 29 Apr 2026

Readings

Required

Optional

  • VanderWeele et al. (2020)
  • Suzuki et al. (2020)
  • Bulbulia (2024)
  • Hoffman et al. (2023)

Key concepts

  1. ATE and CATE answer different causal questions.
  2. Causal forests estimate heterogeneity, not just average effects.
  3. Honest splitting and out-of-bag estimation reduce overfitting.
  4. RATE and Qini assess whether targeting has practical value.

Week 6 introduced the conditional average treatment effect (CATE): the contrast for a subgroup defined by baseline covariates at time zero. Estimating CATE with parametric models requires the analyst to guess the functional form. This week introduces causal forests, which learn the heterogeneity surface from data. The machine-learning step changes the estimator, not the identification logic: treatment must still be well-defined, subgroup variables must still precede treatment, and exchangeability and positivity still do the causal work.

Seminar

Motivating example

Suppose a university can fund a community-socialising programme for only 30% of students.

If we treat everyone, we exceed the budget. If we choose badly, the impact is small.

So we need a defensible ranking of expected benefit.

From ATE to CATE

The average treatment effect is

$$ \text{ATE}=\mathbb{E}[Y(1)-Y(0)]. $$

The conditional average treatment effect is

$$ \tau(x)=\mathbb{E}[Y(1)-Y(0)\mid X=x]. $$

Here $X$ must be a baseline profile measured before treatment begins. ATE answers "does it work on average?" CATE answers "for whom does it work more or less?"

Why linear interaction models are often not enough

A small interaction model assumes a simple shape.

Real treatment-response surfaces are often non-linear and high-dimensional. In that setting, pre-specified terms can miss important structure.

From regression trees to causal forests

Understanding causal forests requires three steps.

Regression tree. A regression tree splits the covariate space with yes/no questions ("Age $\le$ 20?", "Baseline wellbeing $> 0.3$?"). Each terminal leaf predicts the sample mean of the outcome for units that land there. The result is a piecewise-constant surface, not a global line. A single tree is interpretable but unstable: small changes in the data can shift splits and predictions.

Regression forest. A random forest grows many trees on bootstrap samples and averages their outputs. Averaging cancels much of the noise that makes any one tree unreliable (Breiman (2001)).

Causal forest. To estimate treatment contrasts rather than outcomes, each tree plays an "honest" two-step game (Wager & Athey (2018)). One subsample chooses splits that separate treated from control units. A different subsample estimates treatment-control contrasts within each leaf. The forest uses those leaf-level contrasts to estimate the CATE surface:

$$ \hat{\tau}(x) \approx \tau(x)=\mathbb{E}[Y(1) - Y(0) \mid X = x]. $$

Because individual leaf estimates are noisy and point in many directions, their average is far less variable. The progression matters: students cannot reason about causal forests without first understanding what a tree does and why averaging helps.

Key intuition

A regression tree tiles the covariate space into locally flat regions. A forest averages many such tiles to smooth away noise. A causal forest adds honest splitting so the averaged contrasts estimate treatment effects, not just predictions.
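To make the first step concrete, here is a minimal sketch with simulated data, using the rpart package (an assumption of this example, not a course requirement). A single regression tree produces a step-function fit: visibly piecewise constant, not smooth.

library(rpart)

set.seed(1)
d <- data.frame(x = runif(500, 0, 10))
d$y <- sin(d$x) + rnorm(500, sd = 0.3)      # smooth signal plus noise

tree <- rpart(y ~ x, data = d)              # splits on yes/no questions about x
d$pred <- predict(tree, d)                  # one constant prediction per leaf

plot(d$x, d$y, col = "grey")                # raw data
points(d$x, d$pred, col = "red", pch = 16)  # piecewise-constant tree fit

Rerunning with a slightly perturbed sample shifts the split points; that instability is exactly what averaging over many trees is designed to cancel.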

Pair exercise: from tree to forest to causal forest

  1. Explain the three-step progression to your partner in your own words.
  2. Name one strength and one weakness of a single regression tree.
  3. Explain why averaging many trees (a forest) helps with the weakness you identified.
  4. State the two differences between a regression forest and a causal forest: (a) what is the target quantity? (b) what does honest splitting add? Explain in one sentence why honest splitting is necessary when we estimate treatment contrasts rather than predictions.

Honest splitting and out-of-bag prediction

Honest splitting separates model selection from estimation. This separation matters because we estimate parameters for an entire population under two exposures, at most one of which is observed for any individual. If the same data chose the splits and estimated the contrasts, the forest would overfit to noise in the training sample.

The forest adds a second safeguard: out-of-bag (OOB) prediction. Each $\hat{\tau}(x_i)$ is averaged only over trees that never used observation $i$ in their split phase. OOB prediction is analogous to leave-one-out cross-validation but comes for free from the bootstrap structure.

Together, honesty and OOB deliver reliable point estimates and uncertainty intervals even in high-dimensional settings with many covariates.
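The sketch below shows both safeguards in the grf package, on simulated data with a known non-linear CATE (all names here are illustrative). Honest splitting is grf's default, and calling predict() without new data returns out-of-bag estimates.

library(grf)

set.seed(2026)
n <- 2000
X <- matrix(rnorm(n * 5), n, 5)      # baseline covariates
W <- rbinom(n, 1, 0.5)               # randomised treatment
tau <- pmax(X[, 1], 0)               # true non-linear CATE
Y <- X[, 2] + tau * W + rnorm(n)     # outcome

# honesty is on by default: separate subsamples choose splits
# and estimate the leaf-level treatment-control contrasts
cf <- causal_forest(X, Y, W)

# predict() with no newdata returns out-of-bag CATEs: each unit is
# scored only by trees that never used it during splitting
tau_hat <- predict(cf)$predictions

# doubly robust ATE estimate with a standard error
average_treatment_effect(cf)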

Missing data in grf

grf can treat missingness as a splitting attribute (MIA) rather than deleting rows by default.

That can preserve sample size, but it does not remove identification concerns. We still need a causal missingness argument, and we still need covariates defined before treatment if they are to index $\tau(x)$.

Is heterogeneity actionable?

After estimating $\hat{\tau}(x)$, we rank units from largest to smallest estimated effect and ask: does treating high-ranked units first yield meaningfully larger gains than treating at random?

The Targeting Operator Characteristic (TOC) curve answers this question. It plots cumulative gain against treatment coverage:

$$ G(q)=\frac{1}{n}\sum_{i=1}^{\lfloor qn\rfloor}\hat{\tau}_{(i)},\qquad 0\le q\le 1, $$

where $\hat{\tau}_{(1)} \ge \hat{\tau}_{(2)} \ge \cdots$ are the sorted estimated effects. The horizontal axis $q$ is the fraction of the population we would treat. The vertical axis $G(q)$ is the cumulative gain from treating that top-$q$ slice.

Two integrals of the TOC curve summarise targeting value:

RATE AUTOC (area under the TOC) puts equal weight on every value of $q$. It answers: if benefits concentrate among the best prospects, how much can we gain by selecting them? A steep initial rise followed by flattening indicates concentrated heterogeneity.

RATE Qini weights the mid-range of $q$ more heavily. It answers: at a realistic, moderate budget (say, treating 20–50% of individuals), does targeting improve on random allocation? Qini is the practical metric when investigators face a fixed budget constraint.

Both summaries must be computed on held-out or cross-fitted data, not in-sample rankings. The forest reuses information across trees through its bootstrap structure. Evaluating RATE or Qini on the training fold produces optimistic estimates. Computing them on a separate fold blocks this bias and yields honest confidence intervals.
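Continuing the simulated example above, the sketch below follows the usual grf pattern for out-of-sample evaluation: one forest learns the priority ranking, and a second forest fit on held-out data evaluates it.

train <- sample(n, n / 2)
cf_rank <- causal_forest(X[train, ], Y[train], W[train])     # learns the ranking
cf_eval <- causal_forest(X[-train, ], Y[-train], W[-train])  # evaluates it

# rank held-out units by CATEs predicted from the ranking forest
priorities <- predict(cf_rank, X[-train, ])$predictions

# AUTOC weights all budgets q equally; QINI emphasises moderate budgets
rate_autoc <- rank_average_treatment_effect(cf_eval, priorities, target = "AUTOC")
rate_qini <- rank_average_treatment_effect(cf_eval, priorities, target = "QINI")

plot(rate_autoc)   # the TOC curve
rate_qini          # point estimate and standard error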

Pair exercise: reading a TOC curve

  1. A university socialising programme produces a TOC curve that rises steeply for the top 20% of ranked students, then flattens.
  2. Interpret the shape: what does the steep rise mean about where treatment gains are concentrated?
  3. Suppose the AUTOC is large but the Qini at $q = 0.3$ (a 30% budget) is small. Explain what this combination means for a decision-maker with a fixed budget.
  4. Why would computing the TOC curve on the same data used to train the causal forest produce misleading results? State the problem in one sentence.

Workflow for this week

  1. Specify the causal estimand, treatment, time zero, and assumptions.
  2. Estimate $\hat{\tau}(x)$ with a causal forest.
  3. Evaluate targeting value with RATE/Qini.
  4. Decide whether heterogeneity is large enough to inform allocation.

Return to the opening example

Back to the university budget.

The question is not only whether the programme works. The question is whether gains are concentrated enough that targeting improves outcomes under a real budget constraint.

Week 9 turns that ranking into transparent policy rules.

Pair exercise: should we target?

  1. The ATE is 0.15 SD. The Qini curve is statistically significant at a budget of $q = 0.3$ (treating 30% of the population).
  2. State the causal estimand that the Qini addresses (what question does it answer beyond the ATE?).
  3. List two non-statistical reasons a decision-maker might choose not to target (consider ethics, logistics, or governance).
  4. A colleague argues "targeting lonely students for a socialising programme is stigmatising." Draft a two-sentence response that takes the concern seriously while explaining what the evidence does and does not show.

Lab materials: Lab 8: RATE and QINI Curves

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324

Bulbulia, J. A. (2024). A practical guide to causal inference in three-wave panel studies. PsyArXiv Preprints. https://doi.org/10.31234/osf.io/uyg3d

Hoffman, K. L., Salazar-Barreto, D., Rudolph, K. E., & Díaz, I. (2023). Introducing longitudinal modified treatment policies: A unified framework for studying complex exposures. https://doi.org/10.48550/arXiv.2304.09460

Suzuki, E., Shinozaki, T., & Yamamoto, E. (2020). Causal Diagrams: Pitfalls and Tips. Journal of Epidemiology, 30(4), 153–162. https://doi.org/10.2188/jea.JE20190192

VanderWeele, T. J., Mathur, M. B., & Chen, Y. (2020). Outcome-wide longitudinal designs for causal inference: A new template for empirical studies. Statistical Science, 35(3), 437–466.

Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242. https://doi.org/10.1080/01621459.2017.1319839

Week 9: Resource Allocation and Policy Trees

Date: 6 May 2026

Readings

Required

Optional

  • VanderWeele et al. (2020)
  • Suzuki et al. (2020)
  • Bulbulia (2024)
  • Hoffman et al. (2023)

Key concepts

  1. Policy learning turns CATE scores into treatment rules.
  2. Policy value must be evaluated out of sample.
  3. Shallow policy trees trade a little value for a lot of interpretability.
  4. Fairness constraints are design choices, not automatic outputs.

Week 8 estimated individual-level treatment contrasts and assessed whether targeting has practical value. Rankings alone are not policy. This week translates those rankings into interpretable, publicly defensible allocation rules.

Seminar

Motivating example

A district health board can fund a wellbeing intervention for only 20% of eligible residents.

A causal forest suggests large heterogeneity. The top-ranked group appears to benefit most.

Now we need a rule that can be defended in public, not just a ranking in code.

From ranking to policy

Let $d(x)\in\{0,1\}$ denote a treatment rule.

Its policy value is

$$ V(d)=\mathbb{E}[Y(d(X))]. $$

With a budget cap $q$, we impose

$$ \mathbb{E}[d(X)]\le q. $$

Policy learning seeks a rule with high value under that constraint.

Why policy trees

A causal forest maps a high-dimensional covariate vector $X$ to a personalised score $\hat{\tau}(X)$. The function itself is too tangled (thousands of overlapping splits) to hand to a decision-maker. The policytree algorithm bridges that gap. It collapses the forest's many $\hat{\tau}(X)$ values into a single shallow decision tree. Each split maximises expected benefit under the budget constraint (Sverdrup et al. (2024)).

In this course we cap tree depth at two, for three reasons. First, at most two yes/no questions per rule means the logic fits on a slide for policy-makers or clinicians. Second, each leaf retains enough observations to yield a stable effect estimate. Third, deeper trees increase computational complexity faster than they improve payoffs; the gain from a depth-3 tree over a depth-2 tree is usually small relative to the added opacity.

Pair exercise: designing a depth-2 policy rule

  1. You have a 20% treatment budget. From this list, choose two splitting variables: age, deprivation index, baseline loneliness, self-esteem, neuroticism.
  2. Sketch a depth-2 tree with your two variables. Label each leaf "treat" or "do not treat."
  3. Verify that approximately 20% of the population falls in the "treat" leaves (make plausible assumptions about the variable distributions).
  4. Give two reasons to prefer a depth-2 tree over a depth-4 tree for a public-health decision.

The result is a transparent allocation rule: a short set of if-then conditions that approximates what the full forest would recommend, at the cost of some lost precision.
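As a sketch of how this looks in code, the policytree package (assumed installed) can be driven by doubly robust scores from a causal forest. The data are simulated, and the cost constant is a hypothetical tuning value used to approximate a budget; it is not a policytree argument.

library(grf)
library(policytree)

set.seed(2026)
n <- 2000
X <- matrix(rnorm(n * 5), n, 5)
W <- rbinom(n, 1, 0.5)
Y <- X[, 2] + pmax(X[, 1], 0) * W + rnorm(n)

cf <- causal_forest(X, Y, W)

# doubly robust scores: one column per action (control, treat)
Gamma <- double_robust_scores(cf)

# approximate a 20% budget by charging a cost for treating;
# raise the cost until roughly 20% of units land in "treat" leaves
Gamma[, 2] <- Gamma[, 2] - 0.4

# a shallow, interpretable rule: at most two questions per path
tree2 <- policy_tree(X, Gamma, depth = 2)
print(tree2)

# implied treatment share (actions: 1 = control, 2 = treat)
mean(predict(tree2, X) == 2)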

Evaluating policy performance

We should not evaluate a policy on the same data used to train the ranking.

Use held-out or cross-fitted evaluation to estimate policy value and uncertainty.

If gains disappear out of sample, we do not deploy.
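A minimal sketch of that discipline, continuing the simulated example above: score the learned rule on a fresh sample, with doubly robust scores computed from a forest fit to that sample, and compare against a treat-no-one benchmark.

X_new <- matrix(rnorm(n * 5), n, 5)
W_new <- rbinom(n, 1, 0.5)
Y_new <- X_new[, 2] + pmax(X_new[, 1], 0) * W_new + rnorm(n)

cf_new <- causal_forest(X_new, Y_new, W_new)
Gamma_new <- double_robust_scores(cf_new)   # evaluation-fold DR scores

action <- predict(tree2, X_new)             # 1 = control, 2 = treat
value_rule <- mean(Gamma_new[cbind(1:n, action)])
value_none <- mean(Gamma_new[, 1])          # treat no one

c(rule = value_rule, none = value_none)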

Practical interpretation

If a tree splits on self-esteem and neuroticism, that does not mean these variables are morally privileged causes.

It means they helped separate high-value and low-value treatment regions under the chosen objective.

Equity and governance

Efficiency is not enough.

Proxy variables can encode historical inequity. A split on deprivation may indirectly stratify by ethnicity or structural disadvantage.

Under Te Tiriti o Waitangi, allocation rules in health settings need explicit equity consideration.

Before deployment, investigators should check:

  1. Who gains and who loses?
  2. Are protected groups differentially affected through proxies?
  3. Does the rule reduce or worsen disparities?
  4. Can affected communities understand and contest the rule?

Pair exercise: equity audit

  1. A policy tree splits first on deprivation index, then on age. The rule treats high-deprivation residents under 40.
  2. Explain how a deprivation split indirectly stratifies by ethnicity in an Aotearoa New Zealand context (consider the correlation between deprivation and Māori/Pasifika populations).
  3. Apply two of the four governance checks above to this rule.
  4. Propose one modification informed by Te Tiriti o Waitangi obligations (e.g., guaranteed allocation floors, community consultation, or co-governance of the decision rule).
  5. Your partner says "the algorithm is objective because it only uses data." Counter this claim in two sentences.

Workflow for this week

  1. Estimate heterogeneity and targeting value.
  2. Fit a shallow policy tree under explicit constraints.
  3. Evaluate policy value out of sample.
  4. Report trade-offs: value, fairness, transparency, and feasibility.

Heterogeneity as scientific discovery

Conditional average treatment effect (CATE) machinery does more than power allocation decisions. It helps science move past a one-size-fits-all mindset. Mapping treatment effects across a high-dimensional covariate space tests whether our conventional categories (gender, age group, clinical severity) capture the differences that matter. Sometimes they do; often they do not. Discovering where the forest finds meaningful splits can generate fresh hypotheses about who responds and why, even when no policy decision is on the table. A forest that splits on loneliness rather than age, for example, suggests the psychological mechanism operates through social connection, not biological ageing.

Return to the opening example

Back to the district health board.

The right question is not "what rule maximises sample gain?" The right question is "what rule performs robustly and remains acceptable under equity and governance standards?"

The workflow from question to policy rule is now in place. One assumption has been present throughout but never examined: that our instruments measure the same construct across the groups we compare. Week 10 asks whether that assumption holds.

Pair exercise: policy tree versus ranking

  1. Strategy A ranks individuals by $\hat{\tau}(X_i)$ and treats the top 20%. Strategy B fits a depth-2 policy tree with a 20% budget constraint.
  2. Compare the two strategies on: (a) estimated policy value, (b) explainability to a non-technical audience, and (c) ability to answer "why was I selected?"
  3. State one scenario where Strategy A (pure ranking) is preferable to Strategy B (policy tree).
  4. State one scenario where Strategy B is preferable.

Lab materials: Lab 9: Policy Trees

Bulbulia, J. A. (2024). A practical guide to causal inference in three-wave panel studies. PsyArXiv Preprints. https://doi.org/10.31234/osf.io/uyg3d

Hoffman, K. L., Salazar-Barreto, D., Rudolph, K. E., & Díaz, I. (2023). Introducing longitudinal modified treatment policies: A unified framework for studying complex exposures. https://doi.org/10.48550/arXiv.2304.09460

Suzuki, E., Shinozaki, T., & Yamamoto, E. (2020). Causal Diagrams: Pitfalls and Tips. Journal of Epidemiology, 30(4), 153–162. https://doi.org/10.2188/jea.JE20190192

Sverdrup, E., Kanodia, A., Zhou, Z., Athey, S., & Wager, S. (2024). Policytree: Policy learning via doubly robust empirical welfare maximization over trees. https://CRAN.R-project.org/package=policytree

VanderWeele, T. J., Mathur, M. B., & Chen, Y. (2020). Outcome-wide longitudinal designs for causal inference: A new template for empirical studies. Statistical Science, 35(3), 437–466.

Week 10: Classical Measurement Theory from a Causal Perspective

Date: 13 May 2026

Readings

Required

  • VanderWeele (2022)

Optional

Key concepts

  • EFA and CFA are model-building tools, not causal proofs.
  • Invariance tests are associational diagnostics under a chosen measurement model, not tests of causal comparability.
  • Reflective and formative equations need explicit causal interpretation.
  • Measurement assumptions can open or close bias paths in DAGs.

Weeks 1 through 9 built a workflow from causal question to policy recommendation. Every step assumed that the outcomes we measure mean the same thing for every group in the target population. If they do not, contrasts between groups can reflect measurement artefact, not causal differences. This week examines that assumption.

Seminar

Classical validity and its limits

Psychology textbooks organise measurement quality around four types of validity.

Four classical validity types

  • Content validity: the degree to which an instrument covers the intended domain.
  • Construct validity: whether the construct is accurately defined and operationalised.
  • Criterion validity: whether an instrument accurately predicts performance on an external criterion.
  • Ecological validity: whether an instrument reflects real-world situations and behaviour.

These categories organise useful intuitions. From a causal perspective, each conflates problems that need to be kept separate.

Content validity asks whether items span the construct's domain. It does not specify the causal direction between construct and indicators. Does the construct cause the items (a reflective model), or do the items constitute the construct (a formative model)? Without stating the causal structure, "measures what it's intended to measure" has no formal content.

Construct validity bundles two separate questions. "Accurately defined" concerns the target quantity: what state of the world are we trying to capture? This is analogous to defining a causal estimand (which intervention, in which population, compared with what alternative). "Operationalised" concerns whether the same score means the same thing across individuals. This is a consistency question. Lumping both under one label obscures where a measurement fails.

Criterion validity is purely associational. An instrument can predict an outcome well for non-causal reasons: shared confounders, reverse causation, collider bias. "Predicts performance" tells you nothing about whether the instrument captures the construct that causally affects the criterion. Weeks 2 through 4 showed why prediction and causation are different questions. The same distinction applies here.

Ecological validity gestures at transportability without specifying what changes across settings. A causal framework asks: does the construct-to-indicator relationship hold in the target population? That question is testable through measurement invariance. "Reflects real-world situations" is not testable.

These four categories are qualitative checklists, not properties of a formal model. A causal approach specifies the directed graph connecting constructs to indicators, states the assumptions under which observed scores recover latent quantities, and tests those assumptions. The rest of this lecture shows what that looks like in practice.

Motivating example

The Kessler-6 (K6) is widely used to screen psychological distress.

Two questions must be addressed before we try to compare scores causally across groups.

  1. Do the six items map to the same latent structure?
  2. Is that structure invariant across groups?

These questions are necessary, but they are not sufficient. Even if a measurement model fits well and invariance tests pass, causal interpretation still depends on a defended causal story about what the construct is and how it is measured.

Why this is a causal lecture

Measurement is part of identification. If measurement is unstable, effect estimates and group contrasts can be distorted even when adjustment sets are otherwise defensible.

Learning outcomes

By the end of this week, you should be able to:

  1. State what each classical validity type (content, construct, criterion, ecological) claims, and identify the causal assumptions each leaves implicit.
  2. Run EFA and CFA with clear model-comparison logic.
  3. Run configural, metric, and scalar/threshold invariance tests, and state what they do not establish.
  4. Explain why good fit does not prove a causal latent model.
  5. Explain why invariance tests do not deliver causal structure.
  6. Link measurement assumptions to DAG-based bias reasoning.

Part 1: practical workflow with K6

Step 1: prepare data and inspect factorability

library(margot)
library(tidyverse)
library(performance)

k6 <- df_nz |>
  filter(wave == 2018) |>
  select(
    kessler_depressed,
    kessler_effort,
    kessler_hopeless,
    kessler_worthless,
    kessler_nervous,
    kessler_restless
  )

check_factorstructure(k6)

Bartlett and KMO are entry checks. They do not validate causal interpretation.

Step 2: run EFA

library(psych)
library(parameters)

efa <- psych::fa(k6, nfactors = 3, rotate = "oblimin") |>
  model_parameters(sort = TRUE, threshold = "max")

efa

Oblique rotation is usually appropriate because psychological dimensions often co-vary.

Step 3: compare CFA candidates

library(datawizard)
library(lavaan)
library(performance)

set.seed(123)
parts <- data_partition(k6, training_proportion = 0.7, seed = 123)
train <- parts$p_0.7
test <- parts$test

m1_syntax <- psych::fa(train, nfactors = 1) |> efa_to_cfa()
m2_syntax <- psych::fa(train, nfactors = 2) |> efa_to_cfa()
m3_syntax <- psych::fa(train, nfactors = 3) |> efa_to_cfa()

m1 <- suppressWarnings(cfa(m1_syntax, data = test))
m2 <- suppressWarnings(cfa(m2_syntax, data = test))
m3 <- suppressWarnings(cfa(m3_syntax, data = test))

compare_performance(m1, m2, m3, verbose = FALSE)

Read CFI, RMSEA, AIC, and BIC together. Prefer simpler models when fit is comparable.

Step 4: test invariance across groups

We teach invariance testing because it is widely used and because you may be asked to use it in comparative work. Treat it as a descriptive diagnostic, not as a generator of causal insight. A multi-group CFA invariance test is conditional on a specified measurement model (usually reflective), a chosen parameterisation, and assumptions such as local independence (no causal relations among items once the latent variable is held fixed). Passing an invariance test does not show that the same causally relevant construct exists in both groups. It only shows that a particular statistical measurement model, with particular equality constraints, is compatible with the observed covariance structure. Failures and successes are both compatible with many causal stories. Causal structure is underdetermined by associations in the data.

In this lecture, causal assumptions come first, and this means that statistical tests cannot replace thinking about our assumptions. We must define the construct and state a causal measurement story. Only then can invariance tests play a role, by checking some statistical implications of that story.

For ordinal items, threshold invariance is the analogue of scalar invariance. In practice, fit an ordinal estimator (for example WLSMV) when items are Likert-type.

library(semTools)

k6_eth <- df_nz |>
  filter(wave == 2018, eth_cat %in% c("euro", "maori")) |>
  select(
    kessler_depressed,
    kessler_effort,
    kessler_hopeless,
    kessler_worthless,
    kessler_nervous,
    kessler_restless,
    eth_cat
  )

# example template (note: measurementInvariance() is deprecated in
# recent versions of semTools; measEq.syntax() is the current interface)
# measurementInvariance(
#   model = model_syntax,
#   data = k6_eth,
#   group = "eth_cat"
# )

  • Configural invariance: same loading pattern.
  • Metric invariance: same loadings.
  • Scalar/threshold invariance: same intercepts (continuous) or thresholds (ordinal).
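For concreteness, here is a minimal sketch of the three levels using lavaan's group.equal argument, with the one-factor syntax (m1_syntax) and two-group data (k6_eth) defined above. This is an illustration, not a full workflow: for Likert-type items you would declare the items ordered and constrain thresholds rather than intercepts.

# configural: same pattern, parameters free across groups
fit_config <- cfa(m1_syntax, data = k6_eth, group = "eth_cat")

# metric: loadings constrained equal across groups
fit_metric <- cfa(m1_syntax, data = k6_eth, group = "eth_cat",
                  group.equal = "loadings")

# scalar: loadings and intercepts constrained equal
fit_scalar <- cfa(m1_syntax, data = k6_eth, group = "eth_cat",
                  group.equal = c("loadings", "intercepts"))

# likelihood-ratio comparison of successively constrained models
lavTestLRT(fit_config, fit_metric, fit_scalar)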

How are these tests useful?

  1. They describe whether a chosen measurement model assigns similar roles to items across groups (within that model class). This can be a compact summary of group differences in the covariance structure.
  2. They can suggest where to look. If constraints fail, you learn which items or thresholds are most responsible, which can motivate substantive work (translation review, response-process interviews, item-by-item analysis, or redesign).
  3. They keep you honest about what is not identified. Even if every invariance test "passes", the causal meaning of the construct is not certified. If a test "fails", it does not tell you whether the problem is measurement bias or a real causal difference in the underlying state (for example, different causes of distress producing different item dynamics). To decide that, you need a causal story and often new data, not a better fit index.

Also note that item means can differ for many reasons that have nothing to do with biased measurement. If group A is more distressed than group B, we should expect different item means even under perfect measurement. Invariance testing is not a way to label differences as artefact. It is a way to describe whether a particular psychometric model is stable across groups.

Under the standard invariance interpretation, if scalar/threshold invariance fails then latent mean comparisons are not identified within that measurement model. In this course, treat this as a warning about interpretability under the assumed model, not as evidence that any observed group difference is "measurement bias" rather than a real difference in underlying causal reality.

Pair exercise: interpreting invariance results

  1. The K6 is tested across two ethnic groups. Configural invariance holds. Metric invariance holds. Scalar/threshold invariance fails on two items: "felt hopeless" and "felt worthless."
  2. State what each level of invariance means in plain language (same structure, same loadings, same intercepts/thresholds).
  3. State the standard invariance interpretation of scalar/threshold failure for group mean comparisons, then state the causal critique: why a cross-sectional associational test cannot decide whether the difference is measurement bias or a real difference in the underlying causal reality.
  4. Propose a hypothesis for why "felt hopeless" and "felt worthless" might function differently across groups (consider cultural norms, translation, different anchoring of response categories, or different causes of distress).
  5. Suppose all invariance tests passed. State one reason this would still not settle the causal question of whether "distress" is the same outcome across groups.

Part 2: how traditional measurement theory fails (for causal inference)

This part restores the core point of the lecture. Causal inference operates under assumptions. For measurement, it is causality all the way down. This is not a matter of attitude. It is a consequence of the potential outcomes framework: causal contrasts require well-defined outcomes under interventions, and constructed measures are outcomes of causal processes (question wording, translation, response styles, incentives, and the world that generates the experiences being reported).

Classical psychometric checks (internal consistency, model fit, invariance tests) can organise associations. They do not, by themselves, evaluate the causal assumptions about direction and causal efficacy that are often imported into practice when we move from a measurement model to a causal claim. The causal question is not "does the model fit?" The causal question is "under which causal assumptions does this measured quantity behave like the variable in our DAG?"

Two ways of thinking about measurement in psychometric research

In psychometric research, formative and reflective models describe the relationship between latent variables and indicators. VanderWeele discusses this distinction, and its implications for causal inference with constructed measures, in the required reading (VanderWeele, 2022).

Reflective model (factor analysis)

In a reflective measurement model (an effect-indicator model), the latent variable is taken to cause the observed indicators. Each indicator is a reflection of the latent variable.

Reflective model: assume univariate latent variable $\eta$ giving rise to indicators $X_1 \dots X_n$. Figure adapted from VanderWeele (2022).

The reflective model is often written:

$$ X_i = \lambda_i\eta + \varepsilon_i. $$

Here, $X_i$ is an observed indicator, $\lambda_i$ is its loading, $\eta$ is a latent variable, and $\varepsilon_i$ is an error term. The equation is a statistical description. The stronger claim enters when we interpret it structurally: we treat $\eta$ as causally efficacious, and we treat the indicators as interchangeable reflections of it.

Formative model (factor analysis)

In a formative measurement model (a cause-indicator model), the indicators are taken to give rise to, or determine, a (univariate) latent variable. Correlation or interchangeability between indicators is not required. Each indicator can contribute distinctively to the latent variable.

Formative model: assume indicators $X_1 \dots X_n$ giving rise to a univariate latent variable $\eta$. Figure adapted from VanderWeele (2022).

The formative model is often written:

$$ \eta = \sum_i \lambda_iX_i + \varepsilon. $$

Again, the equation is a statistical description. It is not an automatic statement about causal direction.

Statistical models versus structural interpretations

VanderWeele distinguishes a statistical model from a structural interpretation (VanderWeele, 2022). A statistical model describes patterns in the observed covariance structure. A structural interpretation adds causal claims about how the world generates those patterns.

The two diagrams below show structural assumptions that are often taken for granted when scale scores are then used as exposures, outcomes, or confounders in causal analyses.

Reflective model: causal assumptions. Figure adapted from VanderWeele (2022).

Formative model: causal assumptions. Figure adapted from VanderWeele (2022).

Why fit is not enough

A well-fitting factor model can be compatible with multiple causal structures. Fit indices alone cannot establish that one latent variable causes all indicators.

This is the central discipline point for this lecture. Fit is about what the statistical model can represent. Identification is about whether, under stated assumptions, the data identify the causal contrast we want.

Problems with the structural interpretations of reflective and formative factor models

Even if we grant the reflective or formative equations as useful statistical summaries, cross-sectional data do not, by themselves, decide the direction of causation among latent variables and indicators (VanderWeele, 2022). This creates a problem because the standard structural interpretations of reflective and formative models are used implicitly across psychology.

The same statistical forms can be compatible with alternative causal stories in which indicators (or the realities they partially reflect) are causally efficacious for the outcome. The compatibility examples below illustrate the issue. The point is not that one of these diagrams is "true". The point is that fit alone does not decide among them.

Formative model is compatible with indicators causing the outcome. Figure adapted from VanderWeele (2022).

Reflective model is compatible with indicators causing the outcome. Figure adapted from VanderWeele (2022).

There are other compatible structural interpretations as well. For example, the "latent" reality may be multivariate, with different constituents giving rise to different indicators, and only some constituents being causally efficacious for the outcome.

Multivariate reality gives rise to indicators, from which we construct measures. Figure adapted from VanderWeele (2022).

Only some constituents of multivariate reality may be causally relevant for $Y$. Figure adapted from VanderWeele (2022).

VanderWeele's key observation is that cross-sectional data can describe relationships, but they cannot conclusively determine causal direction. This is worrying because it means that many psychometric checks do not explicitly evaluate the causal assumptions that later causal claims rely upon (VanderWeele, 2022). VanderWeele also discusses longitudinal tests for structural interpretations of univariate latent variables that often do not support the simple causal stories that are presumed. We might describe the uncritical reliance on factor-model structural interpretations as one component of a wider "causal crisis" in the social sciences (Bulbulia, 2023).

Multiple versions perspective

A coarse score may combine multiple underlying states. This is a multiple-versions problem. We can still estimate useful associations, but interpretation must state what is being averaged and what is unidentified.

Review: multiple versions of treatment

The theory of multiple versions of treatment addresses the fact that real interventions are rarely uniform. Let $K$ denote the "true" versioned treatment and let $A$ be a coarsened indicator of $K$.

Multiple versions of treatment: $A$ as a coarsened indicator of $K$. Figure adapted from VanderWeele (2022).

Recall that a causal effect is defined as the difference in expected potential outcomes if everyone were exposed to one level of a treatment versus another, conditional on covariates $L$:

$$ \delta = \sum_l \left( \mathbb{E}[Y\mid A=a,l] - \mathbb{E}[Y\mid A=a^*,l] \right) P(l). $$

Under the multiple-versions interpretation, we can express a consistent estimand in terms of $K$:

$$ \delta = \sum_{k,l} \mathbb{E}[Y_k\mid l] P(k\mid a,l) P(l) - \sum_{k,l} \mathbb{E}[Y_k\mid l] P(k\mid a^*,l) P(l). $$

This corresponds to a hypothetical randomised trial in which, within strata of $L$, the treated group receives versions $K$ drawn from the version distribution among those with $A=a$ and the control group receives versions drawn from the version distribution among those with $A=a^*$ (VanderWeele & Hernán, 2013).

Reflective and formative measurement models as multiple versions

VanderWeele suggests using this framework to interpret constructed measures of psychosocial constructs (VanderWeele, 2022). Roughly, if $A$ is a constructed measure from indicators $(X_1,\dots,X_n)$, then $A$ can be treated as a coarsened indicator of an underlying reality, and the multiple-versions logic can preserve causal interpretability under strong assumptions.

Multiple versions of treatment applied to measurement. Figure adapted from VanderWeele (2022).

One way to express this is to replace $K$ with an underlying (possibly multivariate) reality $\eta$, and to treat changes in a constructed measure as shifting the distribution of $\eta$ versions:

$$ \delta = \sum_{\eta,l} \mathbb{E}[Y_\eta\mid l] P(\eta\mid A=a+1,l) P(l) - \sum_{\eta,l} \mathbb{E}[Y_\eta\mid l] P(\eta\mid A=a,l) P(l). $$

This offers a reason not to despair. But it is not a free pass. The interpretation remains obscure when we do not have a clear definition of what the causally relevant constituents of the construct are, and when we have not explicitly stated which causal assumptions connect indicators, measures, and outcomes.

VanderWeele's model of reality

VanderWeele concludes by arguing that traditional univariate reflective and formative models do not adequately capture the relations between underlying causally relevant phenomena and our indicators and measures. He argues that the causally relevant constituents of reality are almost always multidimensional, that measure construction should start from construct definition, and that structural interpretations should be tested rather than presumed (VanderWeele, 2022).

VanderWeele's argument can be summarised as the following propositions (VanderWeele, 2022).

  1. Traditional univariate reflective and formative models do not adequately capture the relations between causally relevant phenomena and indicators and measures.
  2. The causally relevant constituents of reality related to psychosocial constructs are almost always multidimensional, giving rise to indicators and to our language and concepts.
  3. Construct measurement should start from an explicit construct definition, from which items are derived and justified.
  4. The presumption of a structural univariate reflective model can impair measure construction, evaluation, and use.
  5. If a structural interpretation of a univariate reflective factor model is proposed, it should be tested rather than presumed; factor analysis alone is not sufficient evidence.
  6. Even when causally relevant constituents are multidimensional but a univariate measure is used, associations with outcomes can be interpreted using multiple versions of treatment theory, though interpretation is obscured without clarity about constituents.
  7. When data permit, examining associations item-by-item, or in conceptually related item sets, can provide insight into facets of the construct.

Multivariate reality gives rise to latent variables, indicators, and constructed measures. Figure adapted from VanderWeele (2022).

This is a compelling sketch. It is not yet a complete causal recipe. In particular, it is not a causal DAG in the sense we have used throughout the course, because the arrows are not yet a clear set of causal claims that we can test with d-separation. It motivates the question we care about in causal inference: what assumptions do we need to connect our constructed measures to the causal contrasts we want to estimate?

A pragmatic causal response: measurement error as a structural threat

We can bring this discussion back to the causal workflow by using causal diagrams to represent measurement dynamics. Let $\eta_A$ denote a "true" exposure state, $\eta_Y$ a "true" outcome state, and $\eta_L$ a "true" confounder state. Let $A_{f(X_1,\dots,X_n)}$, $Y_{f(X_1,\dots,X_n)}$, and $L_{f(X_1,\dots,X_n)}$ denote constructed measures (functions of indicators). Allow unmeasured sources of measurement error, $U_A$, $U_Y$, and $U_L$, to influence the constructed measures.

Uncorrelated non-differential measurement error does not bias estimates under the null (but can still attenuate). Figure adapted from the 2025 lecture materials.

Read the diagram as a measurement-augmented causal model.

  1. The $\eta$ nodes are latent realities ($\eta_L$, $\eta_A$, $\eta_Y$). They are the states we would ideally intervene on and measure without error.
  2. The $var_{f(X_1,\dots,X_n)}$ nodes are constructed measures: functions of multiple indicators.
  3. The $U$ nodes are unmeasured sources of error in those constructed measures. They include stable reporting tendencies, transient mood at the time of survey completion, social desirability, and culturally patterned response styles.

The key edges have the following interpretations.

  1. $\eta_L \rightarrow L_{f(X_1,\dots,X_n)}$: the true confounder state affects its measured realisation.
  2. $\eta_A \rightarrow A_{f(X_1,\dots,X_n)}$: the true exposure state affects its measured realisation.
  3. $\eta_Y \rightarrow Y_{f(X_1,\dots,X_n)}$: the true outcome state affects its measured realisation.
  4. $U_L \rightarrow L_{f(X_1,\dots,X_n)}$, $U_A \rightarrow A_{f(X_1,\dots,X_n)}$, $U_Y \rightarrow Y_{f(X_1,\dots,X_n)}$: unmeasured error sources distort each constructed measure. In the strongest language, our measures "see as through a mirror, in darkness" relative to the underlying reality they hope to capture.
  5. Correlated errors: $U$ nodes may share common causes, so error in one domain can correlate with error in another (for example, a general tendency to present oneself favourably affects multiple self-reports).
  6. Directed errors: true states can affect how other variables are measured (for example, exercise might change how people interpret distress items), creating pathways from $\eta_A$ into $U_Y$.

The utility of describing measurement dynamics using causal graphs is that we can see when measurement itself creates new paths. The act of conditioning on measured variables can introduce collider bias when both true states and measurement errors feed into the measured nodes. When unmeasured (multivariate) psycho-physical states are related to unmeasured sources of error in the measurement of those states, measurement can open pathways to confounding.

One key warning is that measurement error opens additional pathways to confounding when either errors are correlated or when the exposure causally affects the error in the measured outcome.

Measurement error opens an additional pathway to confounding if there are correlated errors or directed effects of the exposure on outcome measurement error. Figure adapted from the 2025 lecture materials.
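A short simulation makes the attenuation point concrete. This is a minimal sketch with simulated data and illustrative names: classical (uncorrelated, non-differential) error in the exposure biases the estimated coefficient toward zero whenever the true effect is non-null.

set.seed(2026)
n <- 5000
eta_a <- rnorm(n)                    # true exposure state
y <- 0.5 * eta_a + rnorm(n)          # true effect = 0.5
a_meas <- eta_a + rnorm(n, sd = 1)   # measure contaminated by error (U_A)

coef(lm(y ~ eta_a))["eta_a"]         # close to 0.5
coef(lm(y ~ a_meas))["a_meas"]       # attenuated, roughly 0.25 here

With error variance equal to the true-state variance, the reliability is 0.5, so the expected estimate is about half the true effect.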

Confounding control by baseline measures in three-wave panels

One pragmatic design response is to measure baseline values of exposure and outcome (and key confounders), then estimate effects forward in time using a three-wave panel.

  1. This design adjusts for baseline measurements of both exposure and outcome.
  2. Understanding this approach in the context of potential directed and correlated measurement errors clarifies its strengths and limitations.
  3. Baseline measures can reduce the chance that unmeasured sources of measurement error are correlated with later changes in exposure and outcome.
  4. For example, if individuals carry a stable social desirability bias, adjusting for baseline measures absorbs much of it; new bias arises only if that reporting tendency changes in a way unrelated to its baseline effects.
  5. However, we cannot eliminate the possibility of new bias development, nor directed effects of exposure on outcome reporting.
  6. Attrition and non-response can create additional directed measurement structures.
  7. Despite these challenges, including baseline exposure and outcome measures should be standard practice in multi-wave studies because it reduces the likelihood of novel confounding.
  8. Because we can never be certain the assumptions hold, we should still perform sensitivity analyses.

Three-wave panel design with measurement error considerations. Figure adapted from the 2025 lecture materials.

Pair exercise: fit is not identification

  1. A one-factor confirmatory factor analysis (CFA) of six K6 items yields CFI = 0.98 and RMSEA = 0.03.
  2. A colleague claims "the good fit confirms that distress causes all six responses." Evaluate this claim with reference to VanderWeele (2022).
  3. Choose two of the diagrams above that are compatible with the same statistical factor model, and state what causal assumption differs between them.
  4. Explain why the choice matters for causal inference downstream (hint: consider what happens when you use the scale score as a confounder or outcome in a DAG).

Return to the opening example

Back to K6.

A total score can still be useful for screening. But without defended structure and invariance, we should avoid strong causal claims about cross-group latent differences.

Our job as investigators is to separate what the model fits from what the design identifies.

Pair exercise: measurement as an identification problem

  1. Explain to your partner how scalar invariance failure distorts conditional average treatment effect (CATE) estimates even when exchangeability and positivity hold.
  2. A colleague says "the K6 has been validated in hundreds of studies, so measurement is not a concern." Counter this claim in two sentences, distinguishing internal consistency from cross-group invariance.
  3. Propose a workflow step that belongs between drawing the DAG and running estimation, specifically to check measurement assumptions. State what it tests and what a failure would change about the analysis.

Appendix: if you use an LLM

Copy/paste prompt (use the LLM as a tutor, not a replacement):

You are my tutor. Do not solve the problem for me. Ask me short questions and wait for my answers.

We are working in the potential outcomes framework. Do not treat any associational model output (regression, factor analysis, SEM, invariance tests, fit indices) as evidence of causal structure.

Your job is to help me do the causal thinking. Start by asking me to state:

  1. The target population.
  2. The causal contrast (intervention vs control), with timing.
  3. The outcome, and how it is measured (what the questions/items are).
  4. The causal estimand (ATE, CATE, etc.).

Then ask me to list the identification assumptions (consistency, exchangeability, positivity) and the measurement assumptions I am making. If I mention "validity", "reliability", "fit", or "invariance", ask me what causal assumption I think that statistic is meant to support, and what causal alternative it fails to rule out.

Only after I answer should you give feedback. Keep feedback focused on whether my assumptions are explicit, plausible, and testable, and on what additional design or data would reduce reliance on unsupported causal assumptions.


Lab materials: Lab 10: Measurement Invariance

Bulbulia, J. A. (2023). A workflow for causal inference in cross-cultural psychology. Religion, Brain & Behavior, 13(3), 291–306. https://doi.org/10.1080/2153599X.2022.2070245

Fischer, R., & Karl, J. A. (2019). A primer to (cross-cultural) multi-group invariance testing possibilities in R. Frontiers in Psychology, 10, 1507.

Harkness, J. A., Van de Vijver, F. J., & Johnson, T. P. (2003). Questionnaire design in comparative research. Cross-Cultural Survey Methods, 19–34.

Harkness, J. A., et al. (2003). Questionnaire translation. In Cross-cultural survey methods (pp. 35–56). Wiley.

He, J., & van de Vijver, F. (2012). Bias and Equivalence in Cross-Cultural Research. Online Readings in Psychology and Culture, 2(2). https://doi.org/10.9707/2307-0919.1111

VanderWeele, T. J. (2022). Constructed measures and causal inference: Towards a new model of measurement for psychosocial constructs. Epidemiology, 33(1), 141–151. https://doi.org/10.1097/EDE.0000000000001434

VanderWeele, T. J., & Hernán, M. A. (2013). Causal inference under multiple versions of treatment. Journal of Causal Inference, 1(1), 1–20.

Week 11: In-Class Test 2 (20%)

20 May 2026 (w11)

This week is the second in-class test, covering material from weeks 8–10.

What is covered

  • Heterogeneous treatment effects and machine learning (week 8)
  • Resource allocation and policy trees (week 9)
  • Classical measurement theory from a causal perspective (week 10)

Format

  • Duration: 50 minutes (allocated time: 1 hour 50 minutes)
  • Closed book, no devices
  • Bring a pen or pencil
  • Location: EA120 (the usual seminar room)

Reminders

  • No electronic devices permitted during the test.
  • Arrive on time. The test begins at 14:10.
  • Write clearly; illegible answers cannot be marked.

Week 12: Student Presentations (10%)

27 May 2026

This week

Each student presents a concise summary of their research report.

The objective is clear causal communication: question, design, assumptions, findings, limits.

Format

  • 10 minutes per presentation.
  • 2–3 minutes for questions.
  • Location: EA120.

Suggested structure:

  1. Causal question and target population.
  2. Data and study design.
  3. Estimand and identification assumptions.
  4. Main result with uncertainty.
  5. One limitation and one next step.

Practical checklist

Preparation

  • Keep slides readable and sparse.
  • Use one figure or table per central claim.
  • Define symbols and acronyms at first use.
  • Rehearse timing. Ten minutes is short.

Research report due

The research report is due Friday 30 May (end of Week 12). Submit one PDF with an R code appendix via Nuku.

Lab 1: Git and GitHub

This session introduces version control with Git and GitHub. Setting up these tools first means you can track your work from day one.

Week 1 software requirement

Bring your laptop in week 1 and confirm Git/GitHub access. Install both R and RStudio in week 1, and use RStudio as the standard editor for this course. Instructions are in Lab 2: Install R and Set Up Your IDE.


What is version control?

Version control tracks every change you make to your files. Instead of saving files like report_v2_final_FINAL.docx, you keep a single file and Git remembers its entire history. You can go back to any previous version, see exactly what changed, and collaborate without overwriting each other's work.

GitHub is a website that hosts your Git repositories online. It serves as a backup and lets you share your work. There are other services like GitLab and Bitbucket, but GitHub is the most popular.

Why bother?

  • Your lab diary and final report will be easier to manage.
  • You will never lose work (every version is saved).
  • Employers value version control skills.
  • It is the standard tool for reproducible research: every change is tracked and time-stamped, which goes well beyond uploading static file versions to a service such as OSF.

Step 1: Create a GitHub account

  1. Go to https://github.com.
  2. Click Sign up and follow the prompts. (Get the student version -- see below).
  3. Choose a username you are happy to use professionally (e.g., jsmith-nz, not xXx_gamer_xXx).
  4. Verify your email address.

Student benefits

Apply for the GitHub Student Developer Pack with your university email. It includes free access to GitHub Copilot, cloud credits, and other developer tools.

Step 2: Install Git

macOS

Open Terminal (search for it in Spotlight) and type:

git --version

If Git is not installed, macOS will prompt you to install the Xcode Command Line Tools. Follow the prompts.

Windows

Download Git from https://git-scm.com/download/win. Run the installer and accept the defaults.

Verify installation

Open a terminal (Terminal on macOS, Git Bash on Windows) and type:

git --version

You should see something like git version 2.44.0.

Step 3: Configure Git

Tell Git your name and email (use the same email as your GitHub account):

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

Step 4: Set up SSH authentication

Before you can push code, GitHub needs to verify who you are. SSH keys let your computer prove its identity without a password each time. You do this once and it works from then on.

Check whether you already have a key

Many students already have an SSH key from a previous course, internship, or Git setup. Before creating a new one, check what is already in your ~/.ssh/ folder:

ls -la ~/.ssh

If you see files such as id_ed25519 and id_ed25519.pub, you may already have a usable key. If you are unsure, it is fine to create a fresh key for this course. GitHub allows more than one SSH key on an account.

Generate an SSH key

Open a terminal and run:

ssh-keygen -t ed25519 -C "your.email@example.com"

Use the email address attached to your GitHub account.

You will see prompts like these:

Generating public/private ed25519 key pair.
Enter file in which to save the key (/Users/yourname/.ssh/id_ed25519):

Press Enter to accept the default location unless you already have a key there and want to keep it. If you already have an id_ed25519 key and want a separate course key, you can type a different file name such as ~/.ssh/id_ed25519_psyc434.

Next you will be asked about a passphrase:

Enter passphrase (empty for no passphrase):
Enter same passphrase again:

You may press Enter twice to skip it. A passphrase adds security, but it also means you may need to unlock the key when you restart your computer.

This creates two files in ~/.ssh/:

File              What it is         Share it?
id_ed25519        Your private key   Never. Do not copy, email, upload, or commit this file. Anyone who has it can impersonate you.
id_ed25519.pub    Your public key    Yes. This is what you give to GitHub.

Protect your private key

The file ~/.ssh/id_ed25519 (no .pub) is your private key. Treat it like a password. Never paste it into a chat, never commit it to a repository, never upload it anywhere. If you suspect it has been exposed, delete the key from your GitHub account immediately at github.com/settings/keys and generate a new one.

Confirm that the files were created

Run:

ls -l ~/.ssh

You should see both the private key and the public key. If you saved the key under a custom name, look for that name instead of id_ed25519.

Add the key to the SSH agent

The SSH agent remembers your key so Git can use it automatically.

Start the agent:

eval "$(ssh-agent -s)"

Then add your key:

ssh-add ~/.ssh/id_ed25519

If you used a custom file name, replace id_ed25519 with that name.

On some macOS systems, the following command is preferred because it stores the passphrase in the keychain:

ssh-add --apple-use-keychain ~/.ssh/id_ed25519

If that command gives an error, use the plain ssh-add ~/.ssh/id_ed25519 command instead.

Add the public key to your GitHub account

Copy the public key (the file ending in .pub) to your clipboard.

macOS:

pbcopy < ~/.ssh/id_ed25519.pub

Windows (Git Bash):

cat ~/.ssh/id_ed25519.pub

Select and copy the output manually.

If pbcopy does not work on your system, you can also print the key manually:

cat ~/.ssh/id_ed25519.pub

Copy the entire line beginning with ssh-ed25519.

Then add it to GitHub:

  1. Go to github.com/settings/keys.
  2. Click New SSH key.
  3. Give it a title (e.g., "My laptop").
  4. Paste the key into the Key field.
  5. Click Add SSH key.

Test the connection

ssh -T git@github.com

The first time, you will usually see a message like:

The authenticity of host 'github.com (IP ADDRESS)' can't be established.
ED25519 key fingerprint is ...
Are you sure you want to continue connecting (yes/no/[fingerprint])?

Type yes and press Enter. You should then see:

Hi your-username! You've successfully authenticated, but GitHub does not provide shell access.

That message means it worked. All future git push and git pull commands will authenticate automatically.

Common problems

If you get Permission denied (publickey)

This usually means one of four things:

  1. You copied the wrong key to GitHub.
  2. You copied the private key instead of the public key.
  3. Your key was not added to the SSH agent.
  4. You are using a different GitHub account from the one attached to the key.

Work through these checks:

ls -l ~/.ssh
ssh-add -l
cat ~/.ssh/id_ed25519.pub

Then compare the printed public key with the one shown at github.com/settings/keys.
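
If the printed key matches the one on GitHub but authentication still fails, a verbose connection attempt shows which keys SSH is actually offering (look for lines beginning "Offering public key"):

ssh -vT git@github.com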

If ~/.ssh/id_ed25519.pub does not exist

The key pair was not created where you think it was. Run ls -la ~/.ssh and look for the file name you chose during ssh-keygen.

If you see Agent admitted failure to sign

Run:

ssh-add ~/.ssh/id_ed25519

and then test again with ssh -T git@github.com.

If ssh -T hangs or fails

Some university networks block SSH. If you are on campus Wi-Fi and the command hangs, try from a different network (e.g., your phone hotspot). If it still fails, contact the course coordinator.

Step 5: Create a private repository on GitHub

  1. Go to github.com/new.
  2. Name your repository psy434-labs (or similar).
  3. Set visibility to Private.
  4. Check Add a README file.
  5. Click Create repository.

Privacy and submission

Your GitHub repository must remain private for the duration of the course. Do not change its visibility to public. Lab diaries are submitted as .md files via NUKU, not through GitHub. GitHub is your version-control tool; NUKU is where marking happens. Submission instructions for each lab diary appear on the NUKU assignment page.

Step 6: Clone the repository to your computer

Cloning downloads a copy of the repository to your machine and links it to GitHub.

  1. On your repository page, click the green Code button.
  2. Select SSH and copy the URL (it starts with git@github.com:).
  3. Open a terminal and navigate to where you want to store your work:

mkdir -p ~/Documents/psy434
cd ~/Documents/psy434

  4. Clone the repository:

git clone git@github.com:YOUR-USERNAME/psy434-labs.git

Replace YOUR-USERNAME with your GitHub username.

  5. Move into the repository folder:

cd psy434-labs

You now have a local copy linked to GitHub.

Sanity check that you are in the right place:

pwd
ls

You should see your location end with psy434-labs, and you should see the files in your repository (at this point, just the README.md created in Step 5).

If you are new to Terminal

These commands help you check where you are and what files you have:

pwd        # print the current folder (your location)
ls         # list files in the current folder
cd ..      # go up one folder
cd ~       # go to your home folder

If you ever see an error like "No such file or directory", run pwd and ls to check your location and spelling.

Checkpoint

If you have cloned your repository successfully, you are on track. Everything below can be finished before next week's lab if you run out of time.

Step 7: The basic Git workflow

The everyday workflow has three steps: stage, commit, push.

Recommended editor

RStudio is the recommended editor for creating and editing .md, .R, and .qmd files. It reduces file-extension errors (for example, accidentally saving README.md.txt) and all lab instructions assume RStudio. Install instructions are in Lab 2. You may use another editor, but you will need to adapt instructions yourself.

1. Create your first file

Your repository already has a README.md from Step 5. Open it and replace its contents with:

# PSYCH 434 lab diary

This is my lab diary for PSYCH 434 (2026).

Save the file in the root of your repository folder (psy434-labs/). Use RStudio if available (File > New File > Text File), then save as README.md.

Windows file extensions

If you create the file in File Explorer, make sure it is named README.md (not README.md.txt). If you cannot see extensions, turn on "File name extensions" in File Explorer.

Creating a file from the command line

If you are already in your repository folder, you can create an empty file with:

touch README.md

Then open it in a text editor and paste in the text above. If you are not sure you are in the right place, run pwd and check that the folder name ends with psy434-labs.

2. Check what changed

Before staging, check what Git sees:

git status

You should see README.md in red under "Untracked files". This command is your best friend: run it whenever you are unsure what state your repository is in.

A typical first output looks something like this:

On branch main

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        README.md

nothing added to commit but untracked files present

Read this output slowly:

  1. On branch main tells you which branch you are working on.
  2. Untracked files means Git sees the file, but is not yet tracking it.
  3. The suggested git add command tells you the next step.

If you see "not a git repository"

You are not in your repository folder. Run pwd and ls, then cd psy434-labs and try git status again.

3. Stage the change

Staging tells Git which changes you want to include in your next snapshot:

git add README.md

Run git status again immediately afterward:

git status

Now README.md should appear in green under a heading such as Changes to be committed. That means the file is staged and ready for the next commit.

You can inspect exactly what is staged with:

git diff --staged

This command shows the line-by-line changes that will go into the commit. It is a good habit to check this before every commit.

Safety: always name the files you are staging

git add README.md stages one file. You know exactly what will be committed. Commands like git add . or git add -A stage everything Git can see in the current directory and below. If you run one of these from the wrong folder, you can accidentally stage thousands of files, including private data, SSH keys, or your entire home directory. One student ran git add .. (note the two dots, meaning the parent folder) and began uploading gigabytes of data.

Rules:

  1. Always run pwd first. Confirm you are inside your repository folder.
  2. Always run git status before committing. Check that only the files you expect appear in green.
  3. Prefer naming files explicitly (e.g., git add README.md diaries/lab-01.md). Use git add . only when you have just checked git status and every listed file belongs in the commit.
  4. If you see hundreds of files listed in git status, stop. You are probably in the wrong directory. Do not commit. Run pwd, then cd to the right folder.

4. Commit the change

A commit is a snapshot with a short message describing what you did:

git commit -m "add readme with course details"

Think of the commit message as an instruction to your future self. Good commit messages are short and specific.

Good examples:

git commit -m "add readme with course details"
git commit -m "start lab 01 diary"
git commit -m "fix typo in README"

Poor examples:

git commit -m "stuff"
git commit -m "update"
git commit -m "final final really final"

Sanity check that Git recorded your commit:

git log -1 --oneline

You should see your commit message listed.

Common mistake

There must be a space between commit and -m:

git commit -m "update readme"   # correct
git commit-m "update readme"    # wrong: git does not recognise a command called commit-m

Omitting the space after -m (as in -m"update readme") happens to work in most shells, but always write the space for readability.

If Git says "nothing to commit"

Either you forgot to save the file in your editor, or you forgot to stage it. Run:

git status

If the file appears under Changes not staged for commit, run git add README.md again. If Git shows nothing changed, save the file and try again.

5. Push to GitHub

Push sends your commits to GitHub so they are backed up online:

git push

If you set up authentication in Step 4, this should work without a password prompt.

If you see an error about an upstream branch, run:

git push -u origin HEAD

That command tells Git which remote branch your local branch should track. You usually only need to do it once.

If the push succeeds, Git will print a summary showing which branch was updated on GitHub.

If git push is rejected

Sometimes GitHub has changes that your computer does not yet have, for example if you edited the README on the GitHub website. In that case, run:

git pull --rebase
git push

If Git reports a conflict, stop and ask for help rather than guessing.

Step 8: Check your work

Go to your repository page on GitHub. You should see the updated README file with your changes.

You should also see:

  1. The latest commit message near the top of the file list.
  2. The commit count link, which you can click to view history.
  3. Your updated README rendered as formatted text on the repository front page.

If the page has not changed, refresh the browser and check that your git push command succeeded.

Quick reference

| Command | What it does |
|---------|--------------|
| git status | Show which files have changed |
| git add <file> | Stage a file for the next commit |
| git add -A | Stage all changes (use with care; see the safety note above) |
| git commit -m "message" | Save a snapshot with a message |
| git push | Upload commits to GitHub |
| git pull | Download changes from GitHub |
| git log --oneline | Show commit history |

Workflow summary

Edit files → git addgit commit -m "message"git push. Repeat.

Emergency: stuck in a rebase

If you accidentally start a rebase and want to get back to where you were:

git rebase --abort

If you are resolving a rebase conflict, the usual flow is:

git add <file>
git rebase --continue

If you are unsure, run git status and ask for help before you try random commands.

What never belongs in a repository

Git remembers everything you commit, even after you delete the file. If you push a secret to GitHub, assume it is compromised. Removing it from later commits does not remove it from the history.

Never commit any of the following:

| Category | Examples |
|----------|----------|
| SSH or API keys | ~/.ssh/id_ed25519, .env files, API tokens |
| Passwords and credentials | database connection strings, login details |
| Personal data | NZAVS datasets, participant records, anything with names or ID numbers |
| Large binary files | .zip, .rds, .csv, .mp4 (the .gitignore you create below excludes these) |

Your .gitignore blocks most data and output files automatically, but it cannot protect you if you stage files from outside your repository or override it with a force flag.

If you accidentally push something private

  1. Do not panic, but act quickly.
  2. Revoke the credential. If it is an SSH key, delete it from github.com/settings/keys and generate a new one (repeat Step 4). If it is an API token or password, revoke it on the service that issued it. Once revoked, the exposed copy is useless.
  3. Delete and recreate the repository. Removing the file from a later commit does not remove it from the history. The simplest fix is to delete the repository on GitHub, create a new one, and push your current local folder to it. Your local files are unaffected.
  4. If personal data was exposed (e.g., participant records with names or ID numbers), this is a privacy breach. Report it to the course coordinator immediately so it can be escalated to the university.

Terminal basics

You have already used a few terminal commands (cd, git clone). The terminal is a text interface for your computer. Every command you type runs a small program. Here are the commands you will use most often.

Where am I?

pwd

pwd (print working directory) shows the folder you are currently in.

List files

ls

ls lists the files and folders in the current directory. To see hidden files (names starting with .), use:

ls -a

Git stores its data in a hidden folder called .git. Try ls -a inside your repository to see it.

Change directory

cd ~/Documents/psy434/psy434-labs

cd moves you into a folder. A few shortcuts:

| Shortcut | Meaning |
|----------|---------|
| ~ | Your home folder |
| .. | One level up |
| . | The current folder |

So cd .. moves up one level, and cd ~ takes you home.

Create a folder

mkdir diaries

mkdir (make directory) creates a new folder.

Create an empty file

touch lab-01.md

touch creates an empty file if it does not already exist. (Windows users: touch works in Git Bash but not in PowerShell or Command Prompt. Make sure you are using Git Bash.)
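
If you do find yourself in PowerShell, the closest equivalent is the built-in New-Item command, though all lab instructions assume Git Bash:

New-Item -ItemType File lab-01.md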

Terminal quick reference

| Command | What it does |
|---------|--------------|
| pwd | Print the current directory |
| ls | List files |
| ls -a | List files including hidden ones |
| cd <folder> | Change directory |
| cd .. | Go up one level |
| mkdir <name> | Create a folder |
| touch <name> | Create an empty file |

Organise your repository

Set up a simple folder structure for the course. From inside your repository folder:

mkdir diaries data R

Git does not track empty folders. To make sure diaries/, data/, and R/ appear on GitHub, add a placeholder file to each:

touch diaries/.gitkeep data/.gitkeep R/.gitkeep

(data/.gitkeep is the only file in data/ that should be tracked.)

Each folder has a purpose:

| Folder | Contents |
|--------|----------|
| diaries/ | Weekly lab diary entries (.md files) |
| data/ | Datasets you generate or download (ignored by Git because of .gitignore, but useful for keeping your project organised locally) |
| R/ | R scripts and Quarto documents (.R, .qmd) |

A tidy repository separates source files (code you write) from generated output (plots, PDFs, HTML). Source files go into Git; output does not. If your code is correct, anyone can regenerate the output by running it. This principle, that results follow from code, is the basis of reproducible research.

Naming conventions

Good file names are lowercase, use hyphens instead of spaces, and sort naturally:

| Good | Bad | Why |
|------|-----|-----|
| lab-01.md | Lab 1.md | spaces break terminal commands |
| lab-02.md | lab2.md | the hyphen and zero-padding (01, 02, …) keep files in order |
| clean-data.R | CleanData_FINAL(2).R | one clear name, no version suffixes |
| fig-ate-by-age.png | Figure 3.png | describes content, not position |

Three rules of thumb:

  1. No spaces. Use hyphens (-) or underscores (_). Spaces require quoting in the terminal and cause errors in scripts.
  2. Zero-pad numbers. 01, 02, ... 10 sort correctly; 1, 2, ... 10 do not, because your system sorts 10 before 2 (see the example after this list).
  3. Name for content, not sequence. A file called analysis-ate.R is still meaningful six months later; script3.R is not.
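
For example, a folder of unpadded file names lists in this order, with 10 landing before 2:

ls
# lab-1.md  lab-10.md  lab-2.md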

Next, create a .gitignore file. This tells Git to ignore files that should not be tracked (system files, R temporary files, datasets, large binaries, etc.).

If you are using RStudio:

  1. Go to File > New File > Text File.
  2. Paste the contents below.
  3. Go to File > Save As....
  4. Save the file as .gitignore in the root of your repository (psy434-labs/).
  5. In the Files pane, click More > Show Hidden Files so dotfiles are visible.
  6. Open .gitignore from the Files pane whenever you want to edit it.

Paste the following contents:

# system files
.DS_Store
Thumbs.db

# R
.Rhistory
.RData
.Rproj.user
*.Rproj

# data files
data/**
!data/.gitkeep
*.rds
*.qs
*.parquet
*.arrow
*.csv
*.xlsx
*.sav
*.dta

# large binary and archive files
*.zip
*.7z
*.tar
*.gz
*.bz2
*.xz
*.dmg
*.iso
*.mp4
*.mov
*.avi
*.mp3
*.wav

# output files
*.pdf
*.html
*.png
*.jpg
*.jpeg
*.svg
*.gif
*.tiff
*_files/
*_cache/
.quarto/

Save the file in the root of your repository (the same folder as README.md). The filename must start with a dot: .gitignore, not gitignore.

Your repository should look like this:

psy434-labs/
├── .gitignore
├── README.md
├── R/
│   └── .gitkeep
├── data/
│   └── .gitkeep
└── diaries/
    └── .gitkeep

Notice that data/ exists on your computer but Git ignores its contents by default. This is intentional: data files can be large, and anyone with your code can regenerate them. Keep data local; keep code in Git.

Before every commit, check that you are staging only source files (.md, .R, .qmd, and small config files):

git status
git diff --cached --name-only

If you accidentally stage a data file or large object, unstage it:

git restore --staged <file>

If you already committed it, remove it from tracking while keeping the local copy:

git rm --cached <file>
git commit -m "stop tracking data file"
git push

Lab diary files go in the diaries/ folder, named by week number:

diaries/
├── lab-01.md
├── lab-02.md
├── lab-03.md
├── ...
└── lab-10.md

There is no lab-07.md (week 7 is test 1). Create your first diary file now:

touch diaries/lab-01.md

Markdown basics

Markdown is a plain-text format that converts to formatted documents. You write in a .md file using simple symbols for headings, bold, lists, and so on. GitHub renders markdown automatically, so your diary will look formatted when you view it online.

Headings

Use # symbols. More # signs mean smaller headings:

# Heading 1
## Heading 2
### Heading 3

Paragraphs

Separate paragraphs with a blank line. A single line break without a blank line will not start a new paragraph.

Bold and italics

This is **bold** and this is *italic*.

Lists

Unordered lists use -:

- first item
- second item
- third item

Numbered lists use 1., 2., etc.:

1. first step
2. second step
3. third step

Inline code

Wrap code in single backticks:

Use the `git push` command to upload your work.

Links

Put the link text in square brackets, followed by the URL in parentheses:

[GitHub](https://github.com)
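
GitHub-flavoured markdown also renders simple tables, which can be handy in lab diaries:

| week | topic          |
|------|----------------|
| 1    | git and github |
| 2    | r and rstudio  |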

Markdown reference

GitHub has a concise guide to GitHub-flavoured markdown: Basic writing and formatting syntax.

Write your first lab diary

Create your week 1 diary entry now. Open diaries/lab-01.md in RStudio and write ~150 words covering:

  1. What this lab covered and what you did.
  2. A connection to the week's lecture content.
  3. One thing you found useful, surprising, or challenging.

Use at least one heading, one bold or italic word, and one list. When you are done, stage these files, commit, and push:

git add .gitignore diaries/lab-01.md diaries/.gitkeep data/.gitkeep R/.gitkeep
git commit -m "add lab 01 diary and repo structure"
git push

Check your repository on GitHub to confirm the file appears and the markdown renders correctly.

Editors

RStudio is the recommended editor for this course. All examples and in-class instructions assume RStudio. You are welcome to use any editor you prefer (VS Code, Zed, Neovim, etc.), but you will need to translate instructions on your own. Avoid rich-text editors (Word, Pages, TextEdit) that can silently change file format or extensions.

Alternative: GitHub Desktop

If you prefer a graphical interface, download GitHub Desktop. It provides the same stage/commit/push workflow with buttons instead of terminal commands. Either approach is fine for this course.

Lab 2: Install R and Set Up Your IDE

Today's workflow

Complete this lab in a local RStudio project first. You can do all core exercises without git/GitHub, then connect to git/GitHub at the end.

This session introduces R and RStudio, then builds your core R skills.

Why learn R?

  • You will need it for your final report (if you choose the report option).
  • It supports your psychology coursework.
  • It builds coding skills that transfer to many domains of work, including working effectively with AI tools.

Installing R

  1. Visit CRAN at https://cran.r-project.org/.
  2. Select the version for your operating system (Windows, Mac, or Linux).
  3. Download and install by following the on-screen instructions.

Installing RStudio

Step 1: Install RStudio

  1. Go to https://posit.co/download/rstudio-desktop/.
  2. Choose the free version of RStudio Desktop and download it for your operating system.
  3. Install RStudio Desktop.
  4. Open RStudio to begin setting up your project environment.

Step 2: Choose your working folder and create lab folders

Use any folder you like for this lab. If you already have psy434-labs from Lab 1, you can use that. If not, create a new folder:

mkdir -p ~/Documents/psy434/lab-02
cd ~/Documents/psy434/lab-02
mkdir -p diaries data R
pwd
ls -a

If you chose a different location, use that path instead.

Step 3: Open your folder as an RStudio project

  1. In RStudio, go to File > New Project.
  2. Choose Existing Directory.
  3. Browse to your folder (e.g., ~/Documents/psy434/lab-02).
  4. Click Create Project.

RStudio creates a .Rproj file in your folder.

File naming

Use clear labels that anyone could understand. That "anyone" will be your future self. Prefer lowercase with hyphens: lab-02-intro.R, not Lab 2 Intro.R.

Step 4: Create your first R script

Now that RStudio is installed, download the starter script:

  1. Download the R script for this lab (right-click → Save As).
  2. Save it as R/lab-02.R inside your project folder.
  3. Open R/lab-02.R in RStudio.

If downloading is inconvenient, create your own script via File > New File > R Script and save it as R/lab-02.R.

Step 5: Working with R scripts

  1. Write your R code in the script editor. Execute code by selecting lines and pressing Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).
  2. Use comments (preceded by #) to document your code.
  3. Save your scripts regularly (Ctrl + S or Cmd + S).

Step 6: When you exit RStudio

Before concluding your work, restart R (Session > Restart R) and choose not to save the workspace image when prompted.

Workflow habits

  • Use clearly defined script names.
  • Annotate your code.
  • Save your scripts often (Ctrl + S or Cmd + S).

Running R from the terminal

You can run R without opening RStudio.

Interactive console

Open a terminal and type:

R

You will see the R prompt (>). Try a quick calculation:

1 + 1

Type q() to quit. When asked to save the workspace, type n.

Running a script

If you have an R script saved in your project folder, run it directly:

Rscript R/lab-02.R

Output prints to the terminal. This is useful for running code without opening RStudio, and is how R scripts are run on servers and in automated pipelines.
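
You can also evaluate a single expression without saving a script, using the -e flag:

Rscript -e "mean(c(1, 2, 3))"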

RStudio vs terminal

RStudio is easier for interactive exploration (viewing plots, inspecting data). The terminal is useful for running finished scripts and for working on remote machines. Both use the same R installation. Use whichever suits the task.


Basic R Commands

How to copy code from this page

  • Open File > New File > R Script in RStudio.
  • Name and save your new R script.
  • Copy the code blocks below into your script.
  • Save: Ctrl + S or Cmd + S.

Assignment (<-)

Assignment in R uses the <- operator:

x <- 10 # assigns the value 10 to x
y <- 5  # assigns the value 5 to y

RStudio shortcut for <-

  • macOS: Option + - (minus key)
  • Windows/Linux: Alt + - (minus key)

Concatenation (c())

The c() function combines multiple elements into a vector:

numbers <- c(1, 2, 3, 4, 5)
print(numbers)

Operations (+, -)

x <- 10
y <- 5

total <- x + y
print(total)

difference <- x - y
difference

Executing code

Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).

Multiplication (*) and Division (/)

product <- x * y
product

quotient <- x / y
quotient

# element-wise operations on vectors
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
vector_product <- vector1 * vector2
vector_product

vector_division <- vector1 / vector2
vector_division

Be mindful of division by zero: 10 / 0 returns Inf, and 0 / 0 returns NaN.

# integer division and modulo
integer_division <- 10 %/% 3  # 3
remainder <- 10 %% 3          # 1

rm() Remove Object

devil_number <- 666
devil_number
rm(devil_number)

Logic (!, !=, ==)

not_x <- !TRUE        # FALSE: ! negates a logical value
x_not_y <- x != y     # TRUE
x_equal_10 <- x == 10 # TRUE

OR (| and ||)

# element-wise OR
vector_or <- c(TRUE, FALSE) | c(FALSE, TRUE) # c(TRUE, TRUE)

# single OR (first element only)
single_or <- TRUE || FALSE # TRUE

AND (& and &&)

# element-wise AND
vector_and <- c(TRUE, FALSE) & c(FALSE, TRUE) # c(FALSE, FALSE)

# single AND (first element only)
single_and <- TRUE && FALSE # FALSE
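
The single forms are the ones to use inside if() conditions, which expect exactly one TRUE or FALSE. A small illustration:

x <- 10
y <- 5
if (x > 0 && y > 0) {
  print("both are positive")
}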

RStudio workflow shortcuts

  • Execute code line: Cmd + Return (Mac) or Ctrl + Enter (Win/Linux)
  • Insert section heading: Cmd + Shift + R (Mac) or Ctrl + Shift + R
  • Align code: Cmd + Shift + A (Mac) or Ctrl + Shift + A
  • Comment/uncomment: Cmd/Ctrl + Shift + C
  • Save all: Cmd/Ctrl + Shift + S
  • Find/replace: Cmd/Ctrl + F
  • New file: Cmd/Ctrl + Shift + N
  • Auto-complete: Tab

For more, explore Tools > Command Palette or Shift + Cmd/Ctrl + P.


Data Types in R

Integers

x <- 42L
str(x)         # int 42

y <- as.numeric(x)
str(y)         # num 42

Integers are useful for counts or indices that do not require fractional values.

Characters

name <- "Alice"

Characters represent text: names, labels, descriptions.

Factors

colours <- factor(c("red", "blue", "green"))

Factors represent categorical data with a limited set of levels.

Ordered factors

education_levels <- c("high school", "bachelor", "master", "ph.d.")

# unordered (levels default to alphabetical order)
education_factor_no_order <- factor(education_levels, ordered = FALSE)

# ordered: supply levels explicitly, otherwise R orders them alphabetically
education_factor <- factor(education_levels, levels = education_levels, ordered = TRUE)
education_factor

Ordered factors support logical comparisons based on level order:

edu1 <- ordered("bachelor", levels = education_levels)
edu2 <- ordered("master", levels = education_levels)
edu2 > edu1  # TRUE

Strings

you <- "world!"
greeting <- paste("hello,", you)
greeting  # "hello, world!"

Vectors

Vectors are homogeneous: all elements must be of the same type.

numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "cherry")
logical_vector <- c(TRUE, FALSE, TRUE, FALSE)

Manipulating vectors

vector_sum <- numeric_vector + 10
vector_multiplication <- numeric_vector * 2
vector_greater_than_three <- numeric_vector > 3

Accessing elements:

first_element <- numeric_vector[1]
some_elements <- numeric_vector[c(2, 4)]

Functions with vectors

mean(numeric_vector)
sum(numeric_vector)
sort(numeric_vector)
unique(character_vector)

Data Frames

Creating data frames

df <- data.frame(
  name = c("alice", "bob", "charlie"),
  age = c(25, 30, 35),
  gender = c("female", "male", "male")
)
head(df)
str(df)

Accessing elements

# by column name
name_column <- df$name

# by row and column
second_person <- df[2, ]
age_column <- df[, "age"]

# by condition
very_old_people <- subset(df, age > 25)
mean(very_old_people$age)

Exploring data frames

head(df)    # first six rows
tail(df)    # last six rows
str(df)     # structure
summary(df) # summary statistics

Manipulating data frames

# adding columns
df$employed <- c(TRUE, TRUE, FALSE)

# adding rows
new_person <- data.frame(name = "diana", age = 28, gender = "female", employed = TRUE)
df <- rbind(df, new_person)

# modifying values
df[4, "age"] <- 26

# removing columns
df$employed <- NULL

# removing rows
df <- df[-4, ]

rbind() and cbind()

rbind() combines data frames by rows; cbind() combines by columns. When using these functions, column names (for rbind) or row counts (for cbind) must match. We will use dplyr for more flexible joining in later weeks.


Summary statistics

set.seed(12345)
vector <- rnorm(n = 40, mean = 0, sd = 1)
mean(vector)
sd(vector)
min(vector)
max(vector)

table() for categorical data

set.seed(12345)
gender <- sample(c("male", "female"), size = 100, replace = TRUE, prob = c(0.5, 0.5))
education_level <- sample(c("high school", "bachelor", "master"), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2))
df_table <- data.frame(gender, education_level)
table(df_table)
table(df_table$gender, df_table$education_level)  # cross-tabulation
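
Counts are often easier to compare as proportions. prop.table() converts a table; margin = 1 makes each row sum to 1:

tab <- table(df_table$gender, df_table$education_level)
prop.table(tab, margin = 1)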

First Data Visualisation with ggplot2

ggplot2 is based on the Grammar of Graphics: you build plots layer by layer. Install it once if needed: install.packages("ggplot2").

library(ggplot2)

set.seed(12345)
student_data <- data.frame(
  name = c("alice", "bob", "charlie", "diana", "ethan", "fiona", "george", "hannah"),
  score = sample(80:100, 8, replace = TRUE),
  stringsAsFactors = FALSE
)
student_data$passed <- ifelse(student_data$score >= 90, "passed", "failed")
student_data$passed <- factor(student_data$passed, levels = c("failed", "passed"))
student_data$study_hours <- sample(5:15, 8, replace = TRUE)

Bar plot

ggplot(student_data, aes(x = name, y = score, fill = passed)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
  labs(title = "student scores", x = "student name", y = "score") +
  theme_minimal()

Scatter plot

ggplot(student_data, aes(x = study_hours, y = score, color = passed)) +
  geom_point(size = 4) +
  scale_color_manual(values = c("failed" = "red", "passed" = "blue")) +
  labs(title = "scores vs. study hours", x = "study hours", y = "score") +
  theme_minimal()

Box plot

ggplot(student_data, aes(x = passed, y = score, fill = passed)) +
  geom_boxplot() +
  scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
  labs(title = "score distribution by pass/fail status", x = "status", y = "score") +
  theme_minimal()

Histogram

ggplot(student_data, aes(x = score, fill = passed)) +
  geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
  scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
  labs(title = "histogram of scores", x = "score", y = "count") +
  theme_minimal()

Line plot (time series)

months <- factor(month.abb[1:8], levels = month.abb[1:8])
study_hours <- c(0, 3, 15, 30, 35, 120, 18, 15)
study_data <- data.frame(month = months, study_hours = study_hours)

ggplot(study_data, aes(x = month, y = study_hours, group = 1)) +
  geom_line(linewidth = 1, color = "blue") +
  geom_point(color = "red", size = 1) +
  labs(title = "monthly study hours", x = "month", y = "study hours") +
  theme_minimal()

Summary of today's lab

This session covered:

  • Installing and setting up R and RStudio
  • Basic arithmetic operations
  • Data structures: vectors, factors, data frames
  • Data visualisation with ggplot2

End-of-lab git/GitHub check-in

After you finish the R tasks above, you can save this work to GitHub.

If you already completed Lab 1: Git and GitHub, run:

git status
git add R/lab-02.R
git commit -m "add lab 2 r script and setup"
git push

If git/GitHub is still new for you, stop here and return to this section after your Lab 1 workflow is working.

Where to get help

  1. Large language models: LLMs are effective coding tutors. Help from LLMs for coding does not constitute a breach of academic integrity in this course. For your final report, cite all sources including LLMs.
  2. Stack Overflow: outstanding community resource.
  3. Cross Validated: best place for statistics advice.
  4. Developer websites: Tidyverse.
  5. Your tutors and course coordinator.

Appendix A: At-home exercises

Exercise 1: Install the tidyverse package

  1. Open RStudio.
  2. Go to Tools > Install Packages....
  3. Type tidyverse and check Install dependencies.
  4. Click Install.
  5. Load with library(tidyverse).

Exercise 2: Install the parameters and report packages

  1. Go to Tools > Install Packages....
  2. Type parameters, report.
  3. Check Install dependencies and click Install.

Exercise 2b: Install the causalworkshop package

The causalworkshop package provides simulated data for your research report. Run the following in your R console:

install.packages("remotes")
remotes::install_github("go-bayes/causalworkshop@v0.2.1")

Verify the installation:

library(causalworkshop)
d <- simulate_nzavs_data(n = 100, seed = 2026)
head(d)

Exercise 3: Basic operations

  1. Create vector_a <- c(2, 4, 6, 8) and vector_b <- c(1, 3, 5, 7).
  2. Add them, subtract them, multiply vector_a by 2, divide vector_b by 2.
  3. Calculate the mean and standard deviation of both.

Exercise 4: Working with data frames

  1. Create a data frame with columns id (1–4), name, score (88, 92, 85, 95).
  2. Add a passed column (pass mark = 90).
  3. Extract name and score of students who passed.
  4. Explore with summary(), head(), str().

Exercise 5: Logical operations and subsetting

  1. Subset student_data to find students who scored above the mean.
  2. Create an attendance vector and add it as a column.
  3. Subset to select only rows where students were present.

Exercise 6: Cross-tabulation

  1. Create factor variables fruit and colour.
  2. Make a data frame and use table() for cross-tabulation.
  3. Which fruit has the most colour variety?

Exercise 7: Visualisation with ggplot2

  1. Using student_data, create a bar plot of scores by name.
  2. Add a title, axis labels, and colour by pass/fail status.

Appendix B: Solutions

Solution 3: Basic operations

vector_a <- c(2, 4, 6, 8)
vector_b <- c(1, 3, 5, 7)

sum_vector <- vector_a + vector_b
diff_vector <- vector_a - vector_b
double_vector_a <- vector_a * 2
half_vector_b <- vector_b / 2

mean(vector_a); sd(vector_a)
mean(vector_b); sd(vector_b)

Solution 4: Working with data frames

student_data <- data.frame(
  id = 1:4,
  name = c("alice", "bob", "charlie", "diana"),
  score = c(88, 92, 85, 95),
  stringsAsFactors = FALSE
)
student_data$passed <- student_data$score >= 90
passed_students <- student_data[student_data$passed == TRUE, ]
summary(student_data)

Solution 5: Logical operations and subsetting

mean_score <- mean(student_data$score)
students_above_mean <- student_data[student_data$score > mean_score, ]

attendance <- c("present", "absent", "present", "present")
student_data$attendance <- attendance
present_students <- student_data[student_data$attendance == "present", ]

Solution 6: Cross-tabulation

fruit <- factor(c("apple", "banana", "apple", "orange", "banana"))
colour <- factor(c("red", "yellow", "green", "orange", "green"))
fruit_data <- data.frame(fruit, colour)
table(fruit_data$fruit, fruit_data$colour)
# apple has the most colour variety (red, green)

Solution 7: Visualisation

library(ggplot2)
ggplot(student_data, aes(x = name, y = score, fill = passed)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red")) +
  labs(title = "student scores", x = "name", y = "score") +
  theme_minimal()

Lab 3: Regression and Bias Mechanisms

R script

Download the R script for this lab (right-click → Save As)

This lab introduces regression as a tool for describing associations and then shows why regression coefficients do not become causal effects simply because we fit a model. We begin with a simple simulated regression, and then we examine three common sources of bias: omitted variable bias, mediator bias, and collider bias.

What you will learn

  1. How to simulate a simple regression model and interpret its slope.
  2. Why population inference depends on assumptions, not on software output alone.
  3. How omitted variable bias arises when a common cause is left out.
  4. How mediator bias arises when we condition on a variable on the causal pathway.
  5. How collider bias arises when we condition on a common effect.

Packages

required_packages <- c("tidyverse", "parameters", "report", "ggeffects", "ggdag")
missing_packages <- required_packages[
  !vapply(required_packages, \(pkg) requireNamespace(pkg, quietly = TRUE), logical(1))
]

if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

library(tidyverse)
library(parameters)
library(report)
library(ggeffects)
library(ggdag)

Regression refresher

A regression model describes how the expected value of an outcome changes with a predictor. In this lab, the first simulation uses study hours as the predictor and exam score as the outcome.

set.seed(123)
n <- 200
study_hours <- rnorm(n, mean = 10, sd = 2)
exam_score <- 50 + 3 * study_hours + rnorm(n, mean = 0, sd = 6)

df_regression <- tibble(
  study_hours = study_hours,
  exam_score = exam_score
)

fit_regression <- lm(exam_score ~ study_hours, data = df_regression)
summary(fit_regression)
report::report(fit_regression)

The coefficient for study_hours tells us how much the expected exam score changes for a one-unit increase in study hours. In this simulation, the data-generating slope is positive, so the fitted line should recover a positive relationship.

We can visualise the fitted line with raw data:

predicted_values <- ggeffects::ggpredict(fit_regression, terms = "study_hours")
plot(predicted_values, dot_alpha = 0.25, show_data = TRUE, jitter = 0.05)

A reminder about inference

Regression helps us summarise patterns in data, but it does not settle causal questions on its own. A coefficient can be statistically precise and still be causally misleading if the model conditions on the wrong variables, omits an important pre-treatment cause, or conditions on a post-treatment variable that should have been left alone.

This is why model fit is not the same as causal identification. A model with a lower BIC or higher $R^2$ can still answer the wrong causal question.

Omitted variable bias

Omitted variable bias occurs when a common cause of treatment and outcome is not included in the model. In the script, l causes both a and y, but a has no causal effect on y.

set.seed(434)
n <- 2000
l <- rnorm(n)
a <- 0.9 * l + rnorm(n)
y <- 1.2 * l + rnorm(n)

df_fork <- tibble(y = y, a = a, l = l)

fit_fork_naive <- lm(y ~ a, data = df_fork)
fit_fork_adjusted <- lm(y ~ a + l, data = df_fork)

parameters::model_parameters(fit_fork_naive)
parameters::model_parameters(fit_fork_adjusted)

The naive model reports an association between a and y because both are driven by l. Once we adjust for l, the coefficient for a should collapse toward zero. This is the logic of adjustment for observed pre-treatment confounding.
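
A quick back-of-envelope check, using the simulation's own coefficients: because var(l) = 1, the naive slope should be close to cov(a, y) / var(a):

# cov(a, y) = 0.9 * 1.2 * var(l); var(a) = 0.9^2 + 1
(0.9 * 1.2) / (0.9^2 + 1)  # approximately 0.60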

The script also draws a DAG for this pattern:

dag_fork <- dagify(
  y ~ l,
  a ~ l,
  exposure = "a",
  outcome = "y"
) |>
  tidy_dagitty(layout = "tree")

ggdag(dag_fork) +
  theme_dag_blank()

Mediator bias

Mediator bias occurs when we condition on a variable that lies on the pathway from treatment to outcome, even though our goal is to estimate the total effect.

set.seed(435)
n <- 2000
a <- rbinom(n, 1, 0.5)
m <- 1.5 * a + rnorm(n)
y <- 2 * m + rnorm(n)

df_pipe <- tibble(y = y, a = a, m = m)

fit_pipe_total <- lm(y ~ a, data = df_pipe)
fit_pipe_overcontrolled <- lm(y ~ a + m, data = df_pipe)

parameters::model_parameters(fit_pipe_total)
parameters::model_parameters(fit_pipe_overcontrolled)

Here the total effect of a on y operates through m: in this simulation it is 1.5 × 2 = 3, so the total-effect model should report a coefficient for a near 3. If we condition on m, we block the pathway that carries the treatment effect. As a result, the coefficient for a shrinks toward zero, even though the treatment really does matter.

In words, if you want the total effect, you do not control for the mediator.

Collider bias

Collider bias occurs when we condition on a variable that is caused by both the treatment and the outcome. Before conditioning, the treatment and outcome may be unrelated. After conditioning, we create a spurious association.

set.seed(436)
n <- 2000
a <- rnorm(n)
y <- rnorm(n)
c_var <- a + y + rnorm(n)

df_collider <- tibble(y = y, a = a, c_var = c_var)

fit_collider_unadjusted <- lm(y ~ a, data = df_collider)
fit_collider_adjusted <- lm(y ~ a + c_var, data = df_collider)

parameters::model_parameters(fit_collider_unadjusted)
parameters::model_parameters(fit_collider_adjusted)

The unadjusted model should show little or no relationship between a and y. The adjusted model should create one. That is collider bias.

Compare the three bias patterns

The script finishes by collecting the treatment coefficient from each simulation into a single table:

results <- tibble(
  scenario = c(
    "omitted variable bias",
    "omitted variable bias",
    "mediator bias",
    "mediator bias",
    "collider bias",
    "collider bias"
  ),
  model = c(
    "naive",
    "adjusted for l",
    "total effect",
    "overcontrolled",
    "do not condition",
    "condition on collider"
  ),
  estimate = c(
    coef(fit_fork_naive)[["a"]],
    coef(fit_fork_adjusted)[["a"]],
    coef(fit_pipe_total)[["a"]],
    coef(fit_pipe_overcontrolled)[["a"]],
    coef(fit_collider_unadjusted)[["a"]],
    coef(fit_collider_adjusted)[["a"]]
  )
) |>
  mutate(estimate = round(estimate, 3))

print(results)

This comparison is the point of the lab. Different forms of conditioning error create different forms of bias. Regression does not rescue us from those mistakes. We have to supply the right causal structure.

Exercise

  1. Increase the strength of the common cause l in the omitted variable simulation. What happens to the naive coefficient?
  2. Increase the effect of a on m in the mediator simulation. What happens to the difference between the total-effect and overcontrolled models?
  3. Increase the contribution of a and y to the collider c_var. What happens to the adjusted coefficient?

Take-home message

Regression is useful, but causal interpretation depends on design and assumptions. The practical rule is simple. Control for common causes. Do not control for mediators when estimating total effects. Do not control for colliders.

Lab 4: Writing Regression Models

R scripts

  1. Download the student practice script (right-click → Save As)
  2. Download the instructor script with extensions (right-click → Save As)

Last week asked you to learn R and causal inference at the same time. That is a heavy lift. This week slows down and focuses on one skill: writing regression models in R and seeing how the results change when you change the formula.

Start with the student practice script. It is shorter, repeats the same workflow, and gives you clear places to edit the model. The instructor script comes second. It adds extra annotation, more examples, and an optional extension that returns to the Week 4 question about samples and populations.

What you will learn

  1. How to write a simple regression formula in R.
  2. How adding a second predictor can change a coefficient.
  3. How an interaction changes fitted lines.
  4. How to rerun a model after changing the formula.
  5. Optional: how factor terms, curved relationships, and sample-to-population differences extend the same ideas.

Packages

The student script uses tidyverse. The instructor script also uses parameters.

required_packages <- c("tidyverse")
missing_packages <- required_packages[
  !vapply(required_packages, \(pkg) requireNamespace(pkg, quietly = TRUE), logical(1))
]

if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

library(tidyverse)

How to use this lab

  1. Open the student practice script first.
  2. Run one exercise at a time.
  3. Change only the formula after ~.
  4. Rerun the model and the lines immediately below it.
  5. Write down what changed before moving to the next exercise.

Student practice script

The student script has three core exercises.

  1. One predictor. Fit exam_score ~ study_hours, then change it to exam_score ~ 1 and see what the fitted line becomes.
  2. Add a second predictor. Start with exam_score ~ study_hours, then add motivation and see how the coefficient for study_hours changes.
  3. Add an interaction. Start with exam_score ~ study_hours + workshop, then change it to exam_score ~ study_hours * workshop and compare the fitted lines.

The aim is not to memorise syntax. The aim is to notice what each change in the formula does.
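
If you want to experiment before downloading the scripts, here is a self-contained sketch with the same variable names. The data-generating numbers are invented for illustration and will not match the course scripts:

library(tidyverse)

set.seed(2026)
n <- 300
study_hours <- rnorm(n, mean = 10, sd = 2)
motivation  <- 0.5 * study_hours + rnorm(n)  # invented: correlated with study hours
workshop    <- rbinom(n, 1, 0.5)             # invented: half attend a workshop
exam_score  <- 50 + 2 * study_hours + 1.5 * motivation +
  3 * workshop + 0.8 * study_hours * workshop +
  rnorm(n, sd = 5)

df_practice <- tibble(study_hours, motivation, workshop, exam_score)

# exercise 1: one predictor, then intercept only
coef(lm(exam_score ~ study_hours, data = df_practice))
coef(lm(exam_score ~ 1, data = df_practice))

# exercise 2: add a second predictor
coef(lm(exam_score ~ study_hours + motivation, data = df_practice))

# exercise 3: replace + with * to add an interaction
coef(lm(exam_score ~ study_hours * workshop, data = df_practice))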

Instructor script with extensions

Use the instructor script after you have worked through the student version.

It includes:

  1. More annotation around the simulation code.
  2. Extra exercises with a factor predictor and a curved relationship.
  3. Cleaner comparison tables.
  4. An optional extension on sample versus population estimands.

That final section reconnects the lab to the Week 4 theme. It is useful, but it is not the place to start if you are still getting comfortable with R syntax.

Questions to answer

  1. In exercise 1, what happens to the fitted line when you change exam_score ~ study_hours to exam_score ~ 1?
  2. In exercise 2, does the coefficient for study_hours get larger or smaller after adding motivation?
  3. In exercise 3, what changes when you replace + with * in the formula?
  4. Which formula felt easiest to interpret, and which felt hardest?

Optional extension

If you finish early, open the instructor script and run the optional section at the end. In one short paragraph, explain why the conditional coefficients can look similar even when the average treatment effect changes across populations.

Lab 5: Average Treatment Effects

R script

Download the R script for this lab (right-click → Save As)

This lab introduces several ways to estimate average treatment effects (ATEs). You will compare naive, regression-adjusted, g-computation, and causal forest estimates against known ground-truth effects, then finish with one short illustration of inverse probability of treatment weighting (IPTW).

What you will learn

  1. Why naive estimates of causal effects are biased when confounding is present
  2. How covariate adjustment and g-computation reduce this bias
  3. How confounding control can also come from an exposure model through IPTW
  4. How to fit a causal forest and extract the ATE
  5. How to validate estimates against ground truth

New packages

This lab uses the causalworkshop and grf packages. Install them before proceeding if you haven't already.

Setup and data

Install and load the required packages:

# install packages if needed
# install.packages("grf")
# if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
# remotes::install_github("go-bayes/causalworkshop@v0.2.1")

library(causalworkshop)
library(grf)
library(tidyverse)

Generate a simulated three-wave panel dataset. The data are modelled on the New Zealand Attitudes and Values Study (NZAVS), with baseline confounders (wave 0), binary exposures (wave 1), and continuous outcomes (wave 2). Crucially, the data contain known ground-truth treatment effects in the tau_* columns.

# simulate data
d <- simulate_nzavs_data(n = 5000, seed = 2026)

# check structure
dim(d)
names(d)

The data are in long format (three rows per individual). We need to separate the waves:

# separate waves
d0 <- d |> filter(wave == 0)  # baseline confounders
d1 <- d |> filter(wave == 1)  # exposure assignment
d2 <- d |> filter(wave == 2)  # outcomes

# verify alignment
stopifnot(all(d0$id == d1$id), all(d0$id == d2$id))

We will estimate the effect of community group participation (community_group) at wave 1 on wellbeing (wellbeing) at wave 2.

# ground truth: the true ATE
true_ate <- mean(d0$tau_community_wellbeing)
cat("True ATE:", round(true_ate, 3), "\n")

Naive ATE (biased)

A naive estimate ignores confounders. We simply regress the outcome on the exposure:

fit_naive <- lm(d2$wellbeing ~ d1$community_group)
naive_ate <- coef(fit_naive)[2]
cat("Naive ATE:", round(naive_ate, 3), "\n")
cat("True ATE: ", round(true_ate, 3), "\n")
cat("Bias:     ", round(naive_ate - true_ate, 3), "\n")

Why is the naive estimate biased?

People who join community groups differ systematically from those who don't. They tend to be more extraverted, more agreeable, and less neurotic. These same traits also affect wellbeing directly. The naive estimate captures both the causal effect and the confounding.

Adjusted ATE (regression)

We can reduce bias by conditioning on baseline confounders:

# construct analysis dataframe
df <- data.frame(
  y = d2$wellbeing,
  a = d1$community_group,
  age = d0$age,
  male = d0$male,
  nz_european = d0$nz_european,
  education = d0$education,
  partner = d0$partner,
  employed = d0$employed,
  log_income = d0$log_income,
  nz_dep = d0$nz_dep,
  agreeableness = d0$agreeableness,
  conscientiousness = d0$conscientiousness,
  extraversion = d0$extraversion,
  neuroticism = d0$neuroticism,
  openness = d0$openness,
  community_t0 = d0$community_group,
  wellbeing_t0 = d0$wellbeing
)

# regression with covariates
fit_adj <- lm(y ~ a + age + male + nz_european + education + partner +
                employed + log_income + nz_dep + agreeableness +
                conscientiousness + extraversion + neuroticism + openness +
                community_t0 + wellbeing_t0, data = df)

adj_ate <- coef(fit_adj)["a"]
cat("Adjusted ATE:", round(adj_ate, 3), "\n")
cat("True ATE:    ", round(true_ate, 3), "\n")
cat("Bias:        ", round(adj_ate - true_ate, 3), "\n")

What changed?

The adjusted estimate should be much closer to the true ATE. Conditioning on confounders breaks the spurious association between exposure and outcome (recall the fork structure from the ggdag tutorial).

G-computation by hand

G-computation estimates the ATE by predicting outcomes under counterfactual treatment assignments. We create two copies of the data, one where everyone is treated and one where everyone is untreated, predict outcomes for each, and take the average difference.

# create counterfactual datasets
df_treated <- df
df_treated$a <- 1

df_control <- df
df_control$a <- 0

# predict outcomes under each scenario
y_hat_treated <- predict(fit_adj, newdata = df_treated)
y_hat_control <- predict(fit_adj, newdata = df_control)

# ATE via g-computation
gcomp_ate <- mean(y_hat_treated - y_hat_control)
cat("G-computation ATE:", round(gcomp_ate, 3), "\n")
cat("True ATE:         ", round(true_ate, 3), "\n")

G-computation vs regression coefficient

When the treatment is binary and the model has no interactions, the g-computation ATE equals the regression coefficient on the treatment variable. They diverge when interactions are present, because g-computation averages over the empirical distribution of covariates.
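
You can confirm the equivalence with the objects created above; for this linear, no-interaction model the two estimates agree up to floating-point error:

all.equal(unname(adj_ate), gcomp_ate)  # TRUE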

ATE via causal forest

A causal forest estimates individual-level treatment effects $\widehat{\tau}(x_i)$ non-parametrically. The ATE is the average of these individual effects, with a valid standard error that accounts for the estimation uncertainty.

# construct matrices for the causal forest
covariate_cols <- c(
  "age", "male", "nz_european", "education", "partner", "employed",
  "log_income", "nz_dep", "agreeableness", "conscientiousness",
  "extraversion", "neuroticism", "openness",
  "community_t0", "wellbeing_t0"
)

X <- as.matrix(df[, covariate_cols])
Y <- df$y
W <- df$a

# fit causal forest
cf <- causal_forest(
  X, Y, W,
  num.trees = 1000,
  honesty = TRUE,
  tune.parameters = "all",
  seed = 2026
)

# extract ATE with standard error
ate_cf <- average_treatment_effect(cf)
cat("Causal forest ATE:", round(ate_cf["estimate"], 3),
    "(SE:", round(ate_cf["std.err"], 3), ")\n")
cat("True ATE:         ", round(true_ate, 3), "\n")

What is honesty?

Setting honesty = TRUE splits the training data in half: one half builds the tree structure, the other estimates the treatment effects within each leaf. This prevents overfitting and ensures valid confidence intervals.

Compare all estimates

results <- data.frame(
  method = c("Naive", "Adjusted regression", "G-computation", "Causal forest"),
  estimate = c(naive_ate, adj_ate, gcomp_ate, ate_cf["estimate"]),
  bias = c(naive_ate - true_ate, adj_ate - true_ate,
           gcomp_ate - true_ate, ate_cf["estimate"] - true_ate)
)
results$estimate <- round(results$estimate, 3)
results$bias <- round(results$bias, 3)
print(results)
cat("\nTrue ATE:", round(true_ate, 3), "\n")

Key takeaway

All three adjusted methods (regression, g-computation, causal forest) should recover the true ATE reasonably well. The naive estimate is substantially biased because it does not account for confounding. The causal forest additionally provides valid standard errors and, as we will see in Lab 6, individual-level treatment effect predictions.

Optional extension: the same ATE from an exposure model

So far we have controlled confounding through an outcome model. G-computation works by modelling $Y \mid A, L$ and then using predict() to compare the treated and untreated worlds.

IPTW takes the other route. It models treatment assignment, $A \mid L$, then gives more weight to people who received an unexpectedly rare treatment for their covariate pattern. This creates a pseudo-population in which treatment is less confounded by $L$.

# model the probability of treatment
ps_model <- glm(
  a ~ age + male + nz_european + education + partner +
    employed + log_income + nz_dep + agreeableness +
    conscientiousness + extraversion + neuroticism + openness +
    community_t0 + wellbeing_t0,
  data = df,
  family = binomial()
)

ps_hat <- predict(ps_model, type = "response")

# stabilised IPTW weights
p_treated <- mean(df$a)
iptw <- ifelse(
  df$a == 1,
  p_treated / ps_hat,
  (1 - p_treated) / (1 - ps_hat)
)

# quick weight check
tibble(
  statistic = c("min", "median", "max"),
  value = c(min(iptw), median(iptw), max(iptw))
)

# weighted ATE model
fit_iptw <- lm(y ~ a, data = df, weights = iptw)
iptw_ate <- coef(fit_iptw)[["a"]]

tibble(
  method = c("G-computation", "IPTW"),
  estimate = c(gcomp_ate, iptw_ate),
  bias = c(gcomp_ate - true_ate, iptw_ate - true_ate)
)

What to notice

IPTW is aiming at the same ATE as g-computation, but it gets there through an exposure model rather than an outcome model.

This is why IPTW is useful to see now. Later, doubly robust estimators combine both ideas: an outcome model and an exposure model.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

  1. Different exposure-outcome pair. Repeat the analysis using religious_service as the exposure and belonging as the outcome. How does the bias of the naive estimate compare? Check the true ATE using mean(d0$tau_religious_belonging).

  2. Omit baseline adjustment. Re-fit the causal forest without including community_t0 and wellbeing_t0 in the covariate matrix. How much does the ATE estimate change? Why might baseline values of the exposure and outcome be important confounders?

  3. Sample size comparison. Generate data with n = 1000 and n = 10000. How do the causal forest ATE estimates and standard errors change? What does this tell you about the precision of causal forest estimates?

Lab 6: Conditional Average Treatment Effects

R script

Download the R script for this lab (right-click → Save As)

This lab explores why functional form matters for estimating heterogeneous treatment effects. You will compare parametric and non-parametric estimators, examine individual-level predictions from causal forests, and test whether treatment effects genuinely vary across individuals.

What you will learn

  1. Why OLS can miss treatment effect heterogeneity
  2. How to extract individual treatment effect predictions from a causal forest
  3. How to test for significant heterogeneity using test_calibration()
  4. How to identify which covariates drive effect modification

Why functional form matters

When treatment effects vary across individuals, the method we use to estimate them matters. A linear model assumes effects change at a constant rate with each covariate; a causal forest can capture non-linear and interactive patterns.

library(causalworkshop)
library(grf)
library(tidyverse)

The simulate_nonlinear_data() function generates data where the true treatment effect surface is deliberately non-linear, so that flexible methods outperform rigid ones:

# simulate data with non-linear treatment effects
d_nl <- simulate_nonlinear_data(n = 2000, seed = 2026)

# compare four estimation methods
result <- compare_ate_methods(d_nl)

All four methods (OLS, polynomial, GAM, causal forest) recover the overall ATE reasonably well. But their ability to predict individual effects differs dramatically:

# compare RMSE for individual-level predictions
print(result$summary)

RMSE tells the story

RMSE (root mean squared error) measures how well each method predicts the true individual treatment effect $\tau(x_i)$. A lower RMSE means the method captures the heterogeneity pattern more accurately. OLS assumes a linear effect surface and typically has the highest RMSE.
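
Concretely, for $n$ individuals,

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\widehat{\tau}(x_i) - \tau(x_i)\right)^2}$$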

Individual treatment effects from the causal forest

Now we return to the NZAVS data from Lab 5. The causal forest estimates $\widehat{\tau}(x_i)$ for each individual: what would their outcome change be if they were treated versus untreated?

# simulate NZAVS data (same as Lab 5)
d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)
d1 <- d |> filter(wave == 1)
d2 <- d |> filter(wave == 2)

# construct matrices
covariate_cols <- c(
  "age", "male", "nz_european", "education", "partner", "employed",
  "log_income", "nz_dep", "agreeableness", "conscientiousness",
  "extraversion", "neuroticism", "openness",
  "community_group", "wellbeing"
)

X <- as.matrix(d0[, covariate_cols])
Y <- d2$wellbeing
W <- d1$community_group

# fit causal forest
cf <- causal_forest(
  X, Y, W,
  num.trees = 1000,
  honesty = TRUE,
  tune.parameters = "all",
  seed = 2026
)

Extract predicted individual treatment effects:

# predicted treatment effects for each individual
tau_hat <- predict(cf)$predictions

# summary statistics
cat("Mean tau_hat:  ", round(mean(tau_hat), 3), "\n")
cat("SD tau_hat:    ", round(sd(tau_hat), 3), "\n")
cat("Range tau_hat: ", round(range(tau_hat), 3), "\n")

Compare with the true individual effects:

# true individual effects from the data-generating process
tau_true <- d0$tau_community_wellbeing

# how well does the forest recover individual effects?
cat("Correlation(tau_hat, tau_true):", round(cor(tau_hat, tau_true), 3), "\n")
cat("RMSE:", round(sqrt(mean((tau_hat - tau_true)^2)), 3), "\n")

Visualise the distribution of predicted effects:

# histogram of predicted treatment effects
ggplot(data.frame(tau_hat = tau_hat), aes(x = tau_hat)) +
  geom_histogram(bins = 40, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = mean(tau_hat), colour = "red", linetype = "dashed") +
  labs(
    title = "Distribution of predicted treatment effects",
    x = expression(hat(tau)(x)),
    y = "Count"
  ) +
  theme_minimal()

Interpreting the histogram

If treatment effects were homogeneous, this histogram would be tightly concentrated around the ATE. A wide spread indicates heterogeneity: some people benefit more from community group participation than others.

Test for heterogeneity

The test_calibration() function tests whether the forest has detected genuine heterogeneity, or whether the variation in $\widehat{\tau}(x)$ is just noise.

# test for heterogeneity
cal_test <- test_calibration(cf)
print(cal_test)

Reading the calibration test

The key row is differential.forest.prediction. If its coefficient is significantly greater than zero (p < 0.05), the forest has detected meaningful variation in treatment effects beyond the overall mean. The mean.forest.prediction row tests whether the average effect is non-zero.
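
You can also inspect the test output programmatically. A minimal sketch, assuming test_calibration() returns its usual coeftest-style matrix with named rows (check print(cal_test) if your grf version prints something different):

# pull the heterogeneity row: estimate, standard error, statistic, p-value
het_row <- cal_test["differential.forest.prediction", ]
print(het_row)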

Variable importance

Which covariates drive the heterogeneity? The variable_importance() function measures how frequently each variable is used for splitting in the forest:

# variable importance
var_imp <- variable_importance(cf)
importance_df <- data.frame(
  variable = colnames(X),
  importance = as.numeric(var_imp)
) |>
  arrange(desc(importance))

print(importance_df)

Cross-reference with ground truth

The true treatment effect formula for community group participation on wellbeing is:

$$\tau = 0.20 + 0.10 \times \text{extraversion} + 0.05 \times \text{partner} - 0.03 \times \text{neuroticism}^2$$

So extraversion, partner status, and neuroticism should appear as important variables. Does the forest recover this pattern?
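
As a quick check, you can rebuild the formula by hand and compare it against the stored ground truth and the forest's predictions (a sketch using the baseline columns loaded above; if the simulated columns are on the scale the formula assumes, the first correlation should be 1):

# reconstruct the true effect from the stated formula
tau_formula <- 0.20 + 0.10 * d0$extraversion + 0.05 * d0$partner -
  0.03 * d0$neuroticism^2

cat("Correlation(formula, stored truth):", round(cor(tau_formula, tau_true), 3), "\n")
cat("Correlation(formula, tau_hat):     ", round(cor(tau_formula, tau_hat), 3), "\n")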

Subgroup analysis

We can examine whether predicted effects differ across subgroups defined by the important covariates:

# compare effects by extraversion
high_extra <- tau_hat[d0$extraversion > 0]
low_extra <- tau_hat[d0$extraversion <= 0]

cat("Mean tau_hat (high extraversion):", round(mean(high_extra), 3), "\n")
cat("Mean tau_hat (low extraversion): ", round(mean(low_extra), 3), "\n")
cat("Difference:                      ", round(mean(high_extra) - mean(low_extra), 3), "\n")

# compare effects by partner status
partnered <- tau_hat[d0$partner == 1]
unpartnered <- tau_hat[d0$partner == 0]

cat("\nMean tau_hat (partnered):  ", round(mean(partnered), 3), "\n")
cat("Mean tau_hat (unpartnered):", round(mean(unpartnered), 3), "\n")
cat("Difference:                ", round(mean(partnered) - mean(unpartnered), 3), "\n")

Do the subgroup differences match the ground truth?

The tau formula adds $+0.10 \times \text{extraversion}$ and $+0.05 \times \text{partner}$. Highly extraverted and partnered individuals should show larger predicted treatment effects. Check whether this matches what you observe.

Predicted vs true effects scatter plot

# scatter plot of predicted vs true individual effects
ggplot(data.frame(true = tau_true, predicted = tau_hat),
       aes(x = true, y = predicted)) +
  geom_point(alpha = 0.1, colour = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "red") +
  labs(
    title = "Predicted vs true individual treatment effects",
    x = expression(tau(x)),
    y = expression(hat(tau)(x))
  ) +
  theme_minimal()

Key takeaway

Causal forests can detect meaningful heterogeneity in treatment effects without requiring the analyst to specify the functional form in advance. The test_calibration() function provides a formal test for heterogeneity, and variable_importance() identifies which covariates drive it. In Lab 8, we will use these individual predictions to evaluate targeting strategies.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

  1. Different seed. Run compare_ate_methods() with seed = 42 instead of seed = 2026. Do the relative RMSE rankings change? Why or why not?

  2. Different exposure-outcome pair. Fit a causal forest for volunteer_work on self_esteem. Run test_calibration() and variable_importance(). Which covariates drive heterogeneity? Does this match the ground-truth tau formula? (Hint: check the simulate_nzavs_data documentation.)

  3. Why does OLS miss heterogeneity? In one paragraph, explain why a linear model that includes only main effects cannot capture the $-0.03 \times \text{neuroticism}^2$ term in the treatment effect formula. What would you need to add to the linear model to capture this non-linearity?

Lab 8: RATE and QINI Curves

R script

Download the R script for this lab (right-click → Save As)

This lab evaluates whether targeting treatment to those predicted to benefit most improves outcomes compared with treating everyone. You will compute RATE curves and QINI curves from causal forest predictions and assess targeting efficiency at different population percentiles.

What you will learn

  1. How to rank individuals by predicted treatment benefit
  2. How to compute and interpret RATE curves (gain over random assignment)
  3. How to compute and interpret QINI curves (cumulative targeting gain)
  4. How to characterise the covariate profile of high-benefit individuals

Connection to previous labs

This lab builds directly on Labs 5 and 6. You will use the causal forest fitted in those labs to evaluate whether targeting resources to the most responsive individuals is worthwhile.

Setup

library(causalworkshop)
library(grf)
library(tidyverse)

Install packages

If simulate_nzavs_data() is missing, upgrade causalworkshop and install grf:

install.packages("grf")
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("go-bayes/causalworkshop@v0.2.1")

Re-fit the causal forest from Labs 5-6 (or copy the code from Lab 5):

# simulate data
d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)
d1 <- d |> filter(wave == 1)
d2 <- d |> filter(wave == 2)

# construct matrices
covariate_cols <- c(
  "age", "male", "nz_european", "education", "partner", "employed",
  "log_income", "nz_dep", "agreeableness", "conscientiousness",
  "extraversion", "neuroticism", "openness",
  "community_group", "wellbeing"
)

X <- as.matrix(d0[, covariate_cols])
Y <- d2$wellbeing
W <- d1$community_group

# fit causal forest
cf <- causal_forest(
  X, Y, W,
  num.trees = 1000,
  honesty = TRUE,
  tune.parameters = "all",
  seed = 2026
)

# extract predicted individual treatment effects
tau_hat <- predict(cf)$predictions

Rank individuals by predicted benefit

The first step in any targeting analysis is to sort individuals from highest to lowest predicted treatment effect:

# sort by predicted benefit (descending)
n <- length(tau_hat)
tau_order <- order(tau_hat, decreasing = TRUE)
tau_sorted <- tau_hat[tau_order]

# what does the top of the distribution look like?
cat("Top 5 predicted effects:   ", round(head(tau_sorted, 5), 3), "\n")
cat("Bottom 5 predicted effects:", round(tail(tau_sorted, 5), 3), "\n")
cat("Overall mean:              ", round(mean(tau_hat), 3), "\n")

RATE curve

The RATE (Rank-Weighted Average Treatment Effect) curve shows how much we gain by targeting treatment to the top $q\%$ of predicted beneficiaries, compared with random assignment.

For each targeting rate $q$, we compute the average predicted effect among the top $q\%$ of individuals, minus the overall average:

# compute RATE curve
rates <- seq(0.05, 1.00, by = 0.05)
rate_results <- tibble(
  rate = numeric(),
  avg_tau_targeted = numeric(),
  gain_over_random = numeric()
)

for (r in rates) {
  n_targeted <- floor(r * n)
  targeted_idx <- tau_order[seq_len(n_targeted)]

  avg_targeted <- mean(tau_hat[targeted_idx])
  gain <- avg_targeted - mean(tau_hat)

  rate_results <- bind_rows(
    rate_results,
    tibble(rate = r, avg_tau_targeted = avg_targeted, gain_over_random = gain)
  )
}

print(rate_results |> mutate(across(where(is.numeric), \(x) round(x, 3))))

Plot the RATE curve:

ggplot(rate_results, aes(x = rate, y = gain_over_random)) +
  geom_line(colour = "#E69F00", linewidth = 1) +
  geom_point(colour = "#E69F00", size = 2) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(
    title = "RATE curve: gain from targeting",
    x = "Targeting rate (proportion treated)",
    y = "Gain over random assignment"
  ) +
  theme_minimal()

Reading the RATE curve

A steep curve at low targeting rates means a small group benefits substantially more than average. A flat curve means everyone benefits similarly, and targeting adds no value. The curve always reaches zero at 100% (treating everyone is the same as random).

QINI curve

The QINI curve measures the cumulative gain from targeting. For each percentile $p$, it computes the total benefit from targeting the top $p\%$, minus the proportional share they would get under random assignment:

# compute QINI curve
qini_results <- tibble(
  percentile = numeric(),
  cumulative_gain = numeric()
)

for (p in rates) {
  n_top <- floor(p * n)
  top_idx <- tau_order[seq_len(n_top)]

  # cumulative gain: total effect for targeted minus proportional share
  cum_gain <- sum(tau_hat[top_idx]) - p * sum(tau_hat)

  qini_results <- bind_rows(
    qini_results,
    tibble(percentile = p, cumulative_gain = cum_gain)
  )
}

print(qini_results |> mutate(across(where(is.numeric), \(x) round(x, 3))))

Plot the QINI curve:

ggplot(qini_results, aes(x = percentile, y = cumulative_gain)) +
  geom_line(colour = "#56B4E9", linewidth = 1) +
  geom_point(colour = "#56B4E9", size = 2) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(
    title = "QINI curve: cumulative targeting gain",
    x = "Population percentile",
    y = "Cumulative gain over random"
  ) +
  theme_minimal()

Compute the area under the QINI curve (AUQC) via trapezoidal approximation:

# area under QINI curve via trapezoidal rule
qini_for_area <- bind_rows(
  tibble(percentile = 0, cumulative_gain = 0),
  qini_results
)

auqc <- sum(
  diff(qini_for_area$percentile) *
    (head(qini_for_area$cumulative_gain, -1) +
       tail(qini_for_area$cumulative_gain, -1)) / 2
)
cat("Area Under QINI Curve (AUQC):", round(auqc, 3), "\n")

AUQC interpretation

A larger AUQC means targeting is more valuable. An AUQC near zero means there is little heterogeneity to exploit, and random assignment performs nearly as well as targeted assignment.

Targeting efficiency

Create a summary table comparing the top 10%, 20%, and 50% of predicted beneficiaries:

# targeting efficiency at key percentiles
top_10_idx <- tau_order[seq_len(floor(0.10 * n))]
top_20_idx <- tau_order[seq_len(floor(0.20 * n))]
top_50_idx <- tau_order[seq_len(floor(0.50 * n))]

overall_mean <- mean(tau_hat)

efficiency <- tibble(
  group = c("Top 10%", "Top 20%", "Top 50%", "Everyone"),
  avg_effect = c(
    mean(tau_hat[top_10_idx]),
    mean(tau_hat[top_20_idx]),
    mean(tau_hat[top_50_idx]),
    mean(tau_hat)
  )
) |>
  mutate(
    gain_vs_random = avg_effect - overall_mean,
    lift_vs_random = if_else(
      abs(overall_mean) > 1e-8,
      avg_effect / overall_mean,
      NA_real_
    ),
    efficiency_gain_pct = if_else(
      abs(overall_mean) > 1e-8,
      round((lift_vs_random - 1) * 100, 1),
      NA_real_
    )
  )

print(efficiency |> mutate(across(where(is.numeric), \(x) round(x, 3))))

If mean(tau_hat) is close to zero, prefer gain_vs_random over lift_vs_random: ratio-based measures become unstable when the denominator is near zero.

Characterise the covariate profile of high-benefit individuals:

# who are the top 10%?
top_10_data <- d0[tau_order[seq_len(floor(0.10 * n))], ]
everyone <- d0

cat("Top 10% vs everyone:\n")
cat("  Extraversion:    ", round(mean(top_10_data$extraversion), 2),
    "vs", round(mean(everyone$extraversion), 2), "\n")
cat("  Neuroticism:     ", round(mean(top_10_data$neuroticism), 2),
    "vs", round(mean(everyone$neuroticism), 2), "\n")
cat("  Partner (prop):  ", round(mean(top_10_data$partner), 2),
    "vs", round(mean(everyone$partner), 2), "\n")
cat("  Agreeableness:   ", round(mean(top_10_data$agreeableness), 2),
    "vs", round(mean(everyone$agreeableness), 2), "\n")

Key takeaway

RATE and QINI curves translate heterogeneous treatment effects into actionable targeting decisions. If the curves are steep, concentrating resources on high-benefit individuals improves overall outcomes. If the curves are flat, treating everyone equally is just as effective. In Lab 9, we will learn how to express these targeting decisions as simple, interpretable rules using policy trees.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

  1. Different outcome. Compute RATE and QINI curves for a different outcome (e.g., belonging or life_satisfaction). Is the AUQC larger or smaller? What does this imply about targeting?

  2. Negative effects. Some individuals may have $\widehat{\tau}(x) < 0$, meaning the treatment is predicted to harm them. How many individuals in your sample have negative predicted effects? What are the implications for resource allocation?

  3. AUTOC vs QINI weighting. The RATE curve (AUTOC weighting) emphasises the top of the ranking, while the QINI curve weights all percentiles equally. In one paragraph, explain when each metric would be more useful for a policy-maker.

Lab 9: Policy Trees

R script

Download the R script for this lab (right-click → Save As)

This lab moves from evaluating heterogeneity (Lab 8) to making decisions. Policy trees learn simple, interpretable treatment assignment rules from causal forest predictions. You will fit policy trees of different depths, evaluate their performance, and discuss the ethical implications of algorithmic treatment assignment.

What you will learn

  1. How to construct a reward matrix from causal forest predictions
  2. How to fit and interpret depth-1 and depth-2 policy trees
  3. How to evaluate policy performance against random assignment
  4. How to express learned rules in plain language

Connection to previous labs

This lab uses the causal forest and individual treatment effect predictions from Labs 5-8. The progression is: estimate effects (Lab 5) $\to$ discover heterogeneity (Lab 6) $\to$ evaluate targeting (Lab 8) $\to$ learn assignment rules (this lab).

Setup

library(causalworkshop)
library(grf)
library(policytree)
library(tidyverse)

Install packages

If needed, run:

install.packages(c("grf", "policytree"))
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("go-bayes/causalworkshop@v0.2.1")

Re-fit the causal forest (or re-use from previous labs):

# simulate data
d <- simulate_nzavs_data(n = 5000, seed = 2026)
d0 <- d |> filter(wave == 0)
d1 <- d |> filter(wave == 1)
d2 <- d |> filter(wave == 2)
tau_true <- d0$tau_community_wellbeing

# construct matrices
covariate_cols <- c(
  "age", "male", "nz_european", "education", "partner", "employed",
  "log_income", "nz_dep", "agreeableness", "conscientiousness",
  "extraversion", "neuroticism", "openness",
  "community_group", "wellbeing"
)

# train/test split for policy evaluation
set.seed(2026)
idx_train <- sample(seq_len(nrow(d0)), size = floor(0.7 * nrow(d0)))
idx_test <- setdiff(seq_len(nrow(d0)), idx_train)

X_train <- as.matrix(d0[idx_train, covariate_cols])
Y_train <- d2$wellbeing[idx_train]
W_train <- d1$community_group[idx_train]

X_test <- as.matrix(d0[idx_test, covariate_cols])
W_test <- d1$community_group[idx_test]
tau_true_test <- tau_true[idx_test]

cf <- causal_forest(
  X_train, Y_train, W_train,
  num.trees = 1000,
  honesty = TRUE,
  tune.parameters = "all",
  seed = 2026
)

tau_hat_train <- predict(cf)$predictions

The gamma matrix

A policy tree needs a reward matrix (called the "gamma matrix"). Each row is an individual; each column is an action. The entry gives the expected reward for assigning that individual to that action.

With two actions (treat vs not treat), the gamma matrix has two columns:

# construct gamma matrix
# column 1: reward if not treated (control) = 0 (baseline)
# column 2: reward if treated = predicted treatment effect
gamma_matrix_train <- cbind(
  control = rep(0, length(tau_hat_train)),
  treatment = tau_hat_train
)

head(gamma_matrix_train)

Why is the control reward zero?

We normalise the control reward to zero so that the treatment column represents the gain from treating. A positive value means treatment helps; a negative value means treatment harms. The policy tree then simply needs to decide: for which individuals is the gain positive enough to justify treatment?
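
Using raw forest predictions as rewards is a simplification for teaching. The policytree package also offers double_robust_scores(), which constructs the reward matrix from doubly robust scores and is generally preferred in applied work; a minimal sketch:

# alternative gamma matrix built from doubly robust scores
# (less sensitive to noise in the forest's point predictions)
gamma_dr <- double_robust_scores(cf)
head(gamma_dr)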

Fit a depth-1 policy tree

A depth-1 tree makes a single split, dividing the population into two groups based on one covariate:

# subsample for speed (policy tree fitting can be slow on large datasets)
set.seed(2026)
n_sample <- min(500, nrow(X_train))
idx <- sample(seq_len(nrow(X_train)), n_sample)

X_sample <- as.data.frame(X_train[idx, ])
gamma_sample <- gamma_matrix_train[idx, ]

# fit depth-1 policy tree
pt_depth1 <- policy_tree(X_sample, gamma_sample, depth = 1)

# print the tree
print(pt_depth1)

Visualise the tree:

plot(pt_depth1)

Reading the tree

The tree shows one splitting variable and a threshold. Individuals above the threshold go one way; individuals below go the other. The leaf labels (1 or 2) correspond to the columns of the gamma matrix: 1 = control, 2 = treatment.
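
To see how the rule partitions the sample, tabulate the predicted actions (1 = control, 2 = treatment):

# how many individuals does the depth-1 rule assign to each action?
table(predict(pt_depth1, X_sample))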

Fit a depth-2 policy tree

A depth-2 tree makes two sequential splits, creating four groups:

# fit depth-2 policy tree
pt_depth2 <- policy_tree(X_sample, gamma_sample, depth = 2)

# print and plot
print(pt_depth2)
plot(pt_depth2)

Evaluate policies

Predict treatment assignments on a held-out test set and compare with random assignment:

# predict actions for held-out test set
X_test_df <- as.data.frame(X_test)
actions_depth1 <- predict(pt_depth1, X_test_df)
actions_depth2 <- predict(pt_depth2, X_test_df)

# compute expected reward under each policy using true effects in the test set
# action = 1 means control, action = 2 means treatment
reward_depth1 <- ifelse(actions_depth1 == 2, tau_true_test, 0)
reward_depth2 <- ifelse(actions_depth2 == 2, tau_true_test, 0)
reward_random <- 0.5 * tau_true_test
reward_treat_all <- tau_true_test

# compare policies
policy_comparison <- tibble(
  policy = c("Random assignment", "Depth-1 tree", "Depth-2 tree", "Treat everyone"),
  expected_reward = c(
    mean(reward_random),
    mean(reward_depth1),
    mean(reward_depth2),
    mean(reward_treat_all)
  ),
  treat_rate = c(
    0.50,
    mean(actions_depth1 == 2),
    mean(actions_depth2 == 2),
    1.00
  )
)

print(policy_comparison |> mutate(across(where(is.numeric), \(x) round(x, 3))))

Is the improvement worth the complexity?

A depth-2 tree is harder to explain than a depth-1 tree but may assign treatments more efficiently. Compare the expected rewards: if depth-2 is only marginally better, the simpler depth-1 rule may be preferable for transparency.

Interpret the rules in plain language

Read the tree output and translate it into a decision rule:

# what variables and thresholds does the depth-2 tree use?
print(pt_depth2)

# example interpretation (your values will differ):
# "treat individuals with extraversion > 0.3 and baseline wellbeing < 0.1"

Exercise

Write out the depth-2 policy tree as a set of plain-language if-then rules. For example: "If extraversion is above X, then treat. Otherwise, if neuroticism is below Y, treat; otherwise, do not treat."

Compare policy assignments with actual treatment

How do the policy-recommended assignments compare with who actually received treatment in the data?

# agreement between policy and observed treatment
agreement <- tibble(
  actual = W_test,
  policy_depth2 = ifelse(actions_depth2 == 2, 1, 0)
) |>
  mutate(agree = actual == policy_depth2)

cat("Agreement rate:", round(mean(agreement$agree), 3), "\n")
cat("Policy treats: ", round(mean(agreement$policy_depth2), 3), "\n")
cat("Actual treated:", round(mean(agreement$actual), 3), "\n")

Ethical considerations

Statistical optimality is not social optimality

A policy tree maximises expected treatment benefit, but it does not account for:

  • Fairness. The tree may split on variables correlated with protected characteristics (ethnicity, gender, socioeconomic status). Even if a variable is not in the covariate set, proxy variables may reproduce discriminatory patterns.
  • Equity. Targeting resources to those who benefit most may mean those who benefit somewhat receive nothing. A policy that is statistically optimal may be socially unjust.
  • Transparency. A depth-2 tree is interpretable; a depth-5 tree is not. Policy-makers and the public need to understand the rule.
  • Override. A clinician or social worker should always be able to override an algorithmic recommendation based on individual context the model cannot see.

When presenting policy tree results, always discuss these trade-offs.

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

  1. Different outcome. Fit a policy tree for religious_service on belonging. Do the splitting variables change? What does this suggest about which covariates drive effect modification for different outcomes?

  2. Depth-3 tree. Fit a depth = 3 policy tree. Does the expected reward improve substantially over depth-2? Is the tree still interpretable?

  3. Discuss override. In one paragraph, describe a scenario where a clinician should override a policy tree recommendation. What information would the clinician have that the model does not?

Lab 10: Measurement Invariance

R script

Download the R script for this lab (right-click → Save As)

This lab introduces measurement invariance testing, a prerequisite for meaningful cross-group comparisons. You will conduct exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and multigroup CFA on simulated distress scale data with known non-invariance built in.

What you will learn

  1. How to assess factorability (KMO, Bartlett's test) and extract factors using EFA
  2. How to fit a CFA in lavaan and evaluate model fit (CFI, RMSEA, SRMR)
  3. How to test configural, metric, and scalar invariance across groups
  4. How to discover and release non-invariant items (partial invariance)
  5. Why measurement invariance matters for causal inference across groups

Connection to the lecture

This lab aligns with Week 10's lecture on classical measurement theory from a causal perspective. The lecture covers measurement error in DAGs, EFA, CFA, and VanderWeele's model linking measurement to causal identification. This lab focuses on the practical skills: fitting and comparing models.

Setup

library(causalworkshop)
library(psych)
library(lavaan)
library(tidyverse)

Install packages

If you haven't installed these packages, run:

install.packages(c("psych", "lavaan"))
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("go-bayes/causalworkshop@v0.2.1")

Generate data

The simulate_measurement_items() function generates a 6-item psychological distress scale (modelled on the Kessler-6) with a known factor structure and built-in measurement non-invariance across two groups:

# simulate measurement data
d <- simulate_measurement_items(n = 2000, seed = 2026)

# check structure
dim(d)
names(d)

The six items correspond to:

  • item_1: nervous
  • item_2: hopeless
  • item_3: restless
  • item_4: depressed
  • item_5: effort (everything was an effort)
  • item_6: worthless

# check the true factor loadings
attr(d, "true_loadings")

# check the true intercepts (they differ for items 3 and 5 across groups)
attr(d, "true_intercepts_group0")
attr(d, "true_intercepts_group1")

What is non-invariance?

Items 3 (restless) and 5 (effort) have different intercepts across groups. This means that at the same level of true distress, group 1 members score higher on these two items. If we ignore this, cross-group comparisons of mean distress scores will be biased.
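
You can see the problem directly in the raw item means. A minimal sketch, assuming the grouping column is named group (as in the multigroup models below): items 3 and 5 should show the largest gaps.

# raw item means by group
d |>
  group_by(group) |>
  summarise(across(item_1:item_6, mean))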

Exploratory factor analysis (EFA)

Before fitting a CFA, we check whether the data are factorable and how many factors to extract.

Factorability

# select just the items
items <- d |> select(item_1:item_6)

# Kaiser-Meyer-Olkin measure of sampling adequacy
psych::KMO(items)

# Bartlett's test of sphericity
psych::cortest.bartlett(cor(items), n = nrow(items))

Interpreting KMO

KMO values above 0.60 are considered adequate for factor analysis. Values above 0.80 are good. Bartlett's test should be significant (p < 0.05), indicating that correlations between items are sufficiently large for factor analysis.

Extract factors

# one-factor solution
fa_1 <- psych::fa(items, nfactors = 1, fm = "ml", rotate = "none")
print(fa_1$loadings, cutoff = 0.3)

# two-factor solution (for comparison)
fa_2 <- psych::fa(items, nfactors = 2, fm = "ml", rotate = "oblimin")
print(fa_2$loadings, cutoff = 0.3)

How many factors?

The one-factor solution should show all six items loading substantially on a single factor (consistent with the data-generating process). The two-factor solution should not improve fit meaningfully. Compare the proportion of variance explained and check whether the two-factor loadings make theoretical sense.
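
For a more formal guide to the number of factors, parallel analysis compares observed eigenvalues with those from random data; a sketch using psych::fa.parallel():

# parallel analysis: suggests how many factors to retain
psych::fa.parallel(items, fm = "ml")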

Confirmatory factor analysis (CFA)

Now we specify the one-factor model and fit it using lavaan:

# specify one-factor CFA model
model <- "
  distress =~ item_1 + item_2 + item_3 + item_4 + item_5 + item_6
"

# fit CFA on full sample
fit_cfa <- cfa(model, data = d)

# summary with fit measures
summary(fit_cfa, fit.measures = TRUE, standardized = TRUE)

Evaluate model fit:

# extract key fit indices
fit_indices <- fitmeasures(fit_cfa, c("cfi", "rmsea", "srmr"))
print(round(fit_indices, 3))

Fit index guidelines

  • CFI (Comparative Fit Index): > 0.95 is excellent, > 0.90 is acceptable
  • RMSEA (Root Mean Square Error of Approximation): < 0.06 is excellent, < 0.08 is acceptable
  • SRMR (Standardised Root Mean Square Residual): < 0.08 is acceptable

Multigroup CFA: invariance testing

We now test whether the factor structure is equivalent across the two groups. Invariance testing proceeds through a hierarchy of progressively stricter constraints:

Step 1: Configural invariance

The same factor structure holds in both groups, but loadings and intercepts are freely estimated:

fit_configural <- cfa(model, data = d, group = "group")
summary(fit_configural, fit.measures = TRUE)

Step 2: Metric invariance

Factor loadings are constrained to be equal across groups:

fit_metric <- cfa(model, data = d, group = "group",
                  group.equal = "loadings")
summary(fit_metric, fit.measures = TRUE)

Compare configural and metric models:

lavTestLRT(fit_configural, fit_metric)

Interpreting the comparison

A non-significant chi-square difference (p > 0.05) means the metric model fits no worse than the configural model, so equal loadings across groups are supported. Because chi-square is sensitive to large sample sizes, also inspect $\Delta$CFI, $\Delta$RMSEA, and $\Delta$SRMR below.

Step 3: Scalar invariance

Both loadings and intercepts are constrained to be equal:

fit_scalar <- cfa(model, data = d, group = "group",
                  group.equal = c("loadings", "intercepts"))
summary(fit_scalar, fit.measures = TRUE)

Compare metric and scalar models:

lavTestLRT(fit_metric, fit_scalar)

Expected result

Full scalar invariance should fail. In this lab, the chi-square difference is usually significant, and the fit indices should also worsen noticeably when the intercepts are constrained. This is by design: items 3 and 5 have different intercepts across groups in the data-generating process.

Discover partial non-invariance

When full scalar invariance fails, we can release constraints on specific items to achieve partial scalar invariance. Based on modification indices or theory, we free the intercepts of items 3 and 5:

# partial scalar invariance: free intercepts for items 3 and 5
model_partial <- "
  distress =~ item_1 + item_2 + item_3 + item_4 + item_5 + item_6
  item_3 ~ c(i3a, i3b) * 1
  item_5 ~ c(i5a, i5b) * 1
"

fit_partial <- cfa(model_partial, data = d, group = "group",
                   group.equal = c("loadings", "intercepts"))
summary(fit_partial, fit.measures = TRUE)

Compare partial scalar with metric invariance:

lavTestLRT(fit_metric, fit_partial)

Calculate fit-index changes between adjacent models:

delta_fit <- tibble(
  comparison = c("Metric - Configural", "Scalar - Metric", "Partial Scalar - Metric"),
  delta_cfi = c(
    fitmeasures(fit_metric, "cfi") - fitmeasures(fit_configural, "cfi"),
    fitmeasures(fit_scalar, "cfi") - fitmeasures(fit_metric, "cfi"),
    fitmeasures(fit_partial, "cfi") - fitmeasures(fit_metric, "cfi")
  ),
  delta_rmsea = c(
    fitmeasures(fit_metric, "rmsea") - fitmeasures(fit_configural, "rmsea"),
    fitmeasures(fit_scalar, "rmsea") - fitmeasures(fit_metric, "rmsea"),
    fitmeasures(fit_partial, "rmsea") - fitmeasures(fit_metric, "rmsea")
  ),
  delta_srmr = c(
    fitmeasures(fit_metric, "srmr") - fitmeasures(fit_configural, "srmr"),
    fitmeasures(fit_scalar, "srmr") - fitmeasures(fit_metric, "srmr"),
    fitmeasures(fit_partial, "srmr") - fitmeasures(fit_metric, "srmr")
  )
) |>
  mutate(across(starts_with("delta_"), \(x) round(x, 3)))

print(delta_fit)

Expected result

Partial scalar invariance should hold once the chi-square tests and fit-index changes are considered jointly. As a rule of thumb, evidence against invariance is often flagged by $\Delta$CFI < -0.01, $\Delta$RMSEA > 0.015, or $\Delta$SRMR > 0.01. By releasing items 3 and 5, you should see improved fit relative to the full scalar constraints.

Compare all models

# summary table of fit indices
models <- list(
  Configural = fit_configural,
  Metric = fit_metric,
  Scalar = fit_scalar,
  "Partial Scalar" = fit_partial
)

fit_table <- map_dfr(names(models), function(name) {
  fm <- fitmeasures(models[[name]], c("cfi", "rmsea", "srmr", "chisq", "df"))
  tibble(
    model = name,
    cfi = round(fm["cfi"], 3),
    rmsea = round(fm["rmsea"], 3),
    srmr = round(fm["srmr"], 3),
    chisq = round(fm["chisq"], 1),
    df = fm["df"]
  )
})

print(fit_table)

Connection to causal inference

Why measurement invariance matters for causal inference

If a scale measures the same construct differently across groups (non-invariance), then cross-group comparisons of treatment effects may be biased. In DAG terms, the measured outcome $Y^*$ is a function of both the true outcome $Y$ and group membership $G$:

$$Y^* = f(Y, G)$$

If $f$ differs by group (non-invariance), then even if the treatment has the same causal effect on $Y$ in both groups, the observed effect on $Y^*$ will differ. This is measurement bias in the causal inference framework.

Establishing measurement invariance before estimating treatment effects is therefore a prerequisite for valid cross-group comparisons.
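
A minimal sketch of this bias (illustrative numbers, not the lab's data-generating process): two groups with identical latent distress, but one item's intercept shifted upward for group 1.

# same latent distress in both groups; item intercept shifted by +0.3 in group 1
set.seed(2026)
n <- 2000
group <- rep(0:1, each = n / 2)
latent <- rnorm(n)
item <- 0.7 * latent + 0.3 * group + rnorm(n, sd = 0.5)

# latent means are (near) equal, but observed item means differ
round(tapply(latent, group, mean), 2)
round(tapply(item, group, mean), 2)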

Exercises

Lab diary

Complete at least two of the following exercises for your lab diary.

  1. Group by sex. Repeat the invariance testing using male as the grouping variable instead of group. Does the invariance pattern change? (Since the data-generating process only introduces non-invariance by group, invariance by male should hold.)

  2. Two-factor model. Fit a two-factor CFA where items 1-3 load on factor 1 and items 4-6 load on factor 2. Compare fit with the one-factor model. Does the data support a two-factor structure?

  3. Interpretation. In one paragraph, explain what you would conclude about using a single distress score to compare psychological wellbeing across two demographic groups, given that items 3 and 5 show non-invariance. What practical steps would you take before reporting cross-group differences?

Test 1: Study Sheet

Comprehensive study sheet covering lectures 1–6 (weeks 1–6).

Practice Quiz (2024, with Answers)

Last year's in-class quiz covering key concepts and causal diagrams. Includes answers.

Practice Test (2025 Test 1)

The 2025 PSYC 434 Test 1. Part 1 is multiple choice (20 questions, some with more than one correct answer). Part 2 is short answer (choose 4 of 6). Use this as practice for the 2026 test.

Potential Outcomes and Causal Inference

This page introduces the fundamental problem of causal inference, the potential outcomes framework, and the three identification assumptions needed to estimate causal effects from data. It draws on the Women's Health Initiative (WHI) hormone replacement therapy case study as a motivating example.


Motivating example: hormone replacement therapy

Observational evidence (1980s–1990s)

Throughout the 1980s and 1990s, observational studies suggested that oestrogen therapy reduced all-cause mortality in postmenopausal women by roughly 30% (hazard ratio $\approx 0.68$ for current users vs. never users). Professional bodies endorsed hormone replacement therapy (HRT) on this basis:

  • 1992, American College of Physicians: "Women who have coronary heart disease or who are at increased risk ... are likely to benefit from hormone therapy."
  • 1996, American Heart Association: "ERT does look promising as a long-term protection against heart attack."

The experiment disagreed

The Women's Health Initiative (WHI) was a large randomised, double-blind, placebo-controlled trial enrolling 16,000 women aged 50–79. Participants were assigned to oestrogen plus progestin therapy and followed for up to eight years.

The experimental hazard ratio for all-cause mortality was 1.23 (initiators vs. non-initiators), the opposite direction from the observational finding.

What went wrong?

The discrepancy was not a failure of causal assumptions. It was a failure of study design: the observational studies did not correctly emulate a target trial. Specifically, they failed to align "time zero" (the start of follow-up) with the moment of treatment initiation, introducing survivor bias. When investigators re-analysed the observational data using a target trial emulation framework that matched treatment initiation to the start of follow-up, the observational estimates aligned with the experimental findings.

Lesson

If you want causal inferences from observational data, design the analysis as though you were running an experiment. Specify the target trial first.


The fundamental problem of causal inference

Causality is never directly observed. To quantify a causal effect, we need to compare two states of the world for the same individual, but each individual can experience only one.

Notation

Let $A$ denote a binary exposure ($A = 1$: treated, $A = 0$: untreated) and $Y$ denote the outcome.

  • $Y_i(1)$: the potential outcome for individual $i$ under treatment.
  • $Y_i(0)$: the potential outcome for individual $i$ under control.

The individual causal effect is:

$$\tau_i = Y_i(1) - Y_i(0)$$

We say there is a causal effect when $Y_i(1) - Y_i(0) \neq 0$.

The missing-data problem

At most one potential outcome is observed for each individual. The unobserved outcome is the counterfactual:

  • If $A_i = 1$ is observed, then $Y_i(0)$ is counterfactual.
  • If $A_i = 0$ is observed, then $Y_i(1)$ is counterfactual.

Individual-level causal effects are therefore generally unidentifiable. However, under certain assumptions, we can identify average causal effects at the population level.
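
A tiny simulation makes the missing-data structure concrete (illustrative values only): we generate both potential outcomes for every individual, then reveal only the one corresponding to the treatment actually received.

# toy example: both potential outcomes exist, but only one is observed
set.seed(2026)
n <- 8
y0 <- round(rnorm(n), 2)       # potential outcome under control
y1 <- round(y0 + 0.5, 2)       # potential outcome under treatment (tau = 0.5)
a <- rbinom(n, 1, 0.5)         # treatment actually received
y <- ifelse(a == 1, y1, y0)    # observed outcome

# for each row, the potential outcome not selected by a is the counterfactual
data.frame(a, y, y0, y1)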


Three identification assumptions

1. Causal consistency

The potential outcome corresponding to the exposure an individual actually receives equals their observed outcome:

$$Y_i(a) = Y_i \quad \text{when } A_i = a$$

This assumption requires that treatment is well-defined (no hidden versions of treatment) and that there is no interference between units (one person's treatment does not affect another's outcome).

2. Exchangeability

The potential outcomes are independent of treatment assignment. In a randomised experiment, this holds by design. In observational studies, we require conditional exchangeability: after conditioning on a set of measured covariates $L$, treatment assignment is independent of potential outcomes:

$$Y(a) \coprod A \mid L$$

When exchangeability holds, the Average Treatment Effect (ATE) is identified:

$$\text{ATE} = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \mathbb{E}(Y \mid A=1) - \mathbb{E}(Y \mid A=0)$$

In observational settings with confounders $L$:

$$\text{ATE} = \sum_{l} \Big[\mathbb{E}(Y \mid A=1, L=l) - \mathbb{E}(Y \mid A=0, L=l)\Big] \Pr(L=l)$$

3. Positivity

Every individual has a non-zero probability of receiving each treatment level, conditional on their covariates:

$$P(A = a \mid L = l) > 0 \quad \text{for all } a \text{ and } l$$

Positivity is the only assumption that can be verified with data. Violations occur when certain subgroups never receive a particular treatment level, making causal effect estimates for those subgroups extrapolations rather than identifiable quantities.

Observational challenges

In observational settings, all three assumptions face threats. Causal consistency may fail when treatment varies across individuals (e.g., different forms of "religious service attendance"). Exchangeability is violated when unmeasured confounders exist. Positivity fails when certain subgroups have no access to treatment. These threats motivate the careful study designs and sensitivity analyses covered in later weeks.


From experiments to observational data

Randomised experiments address the fundamental problem by balancing confounders across treatment groups. Random assignment satisfies exchangeability by design, and controlled treatment administration satisfies consistency. Although individual causal effects remain unobservable, random assignment allows inference about average (marginal) causal effects.

In observational data, we must satisfy these three assumptions through study design, covariate adjustment, and sensitivity analysis. The remainder of this course develops the tools for doing so: causal diagrams (weeks 2–4), estimation methods (weeks 5–6, 8–9), and measurement considerations (week 10).


Discussion questions

  1. Where in your own research would an average treatment effect be the right causal estimand, and when would it mask disparities that stakeholders care about?
  2. Which of the three identification assumptions is most fragile in your field, and what designs or measurements could strengthen it?
  3. What are examples of post-treatment variables you have been tempted to adjust for, and how would doing so bias the total effect?

Further reading

  • Hernan MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC, 2025. Chapters 1–3. Book site
  • See the Course Readings page for a chapter-by-week guide.

The Causal Workflow: Ten Steps

This page summarises a ten-step framework for conducting causal inference with observational data. Each step addresses a potential threat to valid causal interpretation. The framework draws on Hernan and Robins (What If, 2024), VanderWeele (2022), and Bulbulia (2024).

How to use this page

This is a reference resource. Use it as a checklist when planning your research report. Each step links back to the lecture where it was introduced.

Step 0: State a well-defined treatment

Specify the hypothetical intervention precisely enough that every member of the target population could, in principle, receive it. "Mindfulness" is too vague because people meditate with different apps, in groups, alone, once, or every day. A clearer intervention is: "start a guided mindfulness app at the beginning of semester and complete one 10-minute session per day for eight weeks."

Precision here underwrites the causal consistency assumption (Step 5). If the treatment is vaguely defined, different people effectively receive different interventions, and the potential outcome $Y(a)$ is not well-defined. It also makes the relevant time origin visible.

See Week 5 for the formal definition of causal consistency.

Step 1: Establish time zero

Define the point at which treatment assignment begins and follow-up starts. We cannot do this until the treatment is specified, because time zero is defined relative to treatment assignment or initiation.

See Week 5 for why ambiguous time zero distorts target trial emulation.

Step 2: State a well-defined outcome

Define the outcome so the causal contrast is meaningful and temporally anchored. "Wellbeing" is underspecified; "psychological distress one year post-intervention, measured with the Kessler-6" is interpretable and reproducible. Include timing, scale, and instrument.

See Week 10 for how measurement choices affect causal identification.

Step 3: Clarify the target population

Say exactly who you aim to inform. Eligibility rules define the source population, but sampling and participation can yield a study population with a different distribution of effect modifiers. If you intend to generalise beyond the source population (transportability), articulate the additional conditions required.

See Week 6 for how effect modification interacts with population composition.

Step 4: Evaluate exchangeability

Make the case that potential outcomes are independent of treatment conditional on covariates: $Y(a) \coprod A \mid X$. Use DAGs, subject-matter arguments, pre-treatment covariate balance checks, and overlap diagnostics. If exchangeability is doubtful, redesign (e.g., stronger measurement, alternative identification strategies) rather than rely solely on modelling.

See Week 5 for the formal definition of exchangeability.

Step 5: Ensure causal consistency

Consistency requires that, for individuals receiving a treatment version compatible with level $a$, the observed outcome equals $Y(a)$. It also presumes well-defined versions and no interference between units. When multiple versions exist, either refine the intervention so versions are irrelevant to $Y(a)$, or condition on version-defining covariates.

See Week 5 for examples of consistency violations.

Step 6: Check positivity (overlap)

Each treatment level must occur with non-zero probability at every covariate profile needed for exchangeability:

$$P(A = a \mid L = l) > 0$$

Diagnose limited overlap using propensity score distributions or extreme weights. Consider design-stage remedies (trimming, restriction, adaptive sampling) before estimation.

See Week 5 for the formal positivity assumption.
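
A minimal sketch of the first diagnostic mentioned above, using toy data (in practice, substitute your own exposure and covariates):

library(tidyverse)

# toy data: one confounder L, binary exposure A
set.seed(2026)
dat <- tibble(L = rnorm(2000))
dat$A <- rbinom(nrow(dat), 1, plogis(1.5 * dat$L))

# estimated propensity scores
dat$ps <- fitted(glm(A ~ L, family = binomial, data = dat))

# overlapping densities suggest adequate overlap; separated densities do not
ggplot(dat, aes(x = ps, fill = factor(A))) +
  geom_density(alpha = 0.4) +
  labs(x = "Estimated propensity score", fill = "A") +
  theme_minimal()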

Step 7: Ensure measurement aligns with the scientific question

Verify that constructs are captured by instruments whose error structures do not distort the causal contrast of interest. Be explicit about forms of measurement error (classical, Berkson, differential, misclassification) and their structural implications for bias.

See Week 10 for how measurement error creates collider bias in DAGs.

Step 8: Preserve representativeness

End-of-study analyses should reflect the target population's distribution of effect modifiers. Differential attrition, non-response, or measurement processes tied to treatment and outcomes can induce selection bias. Plan strategies such as inverse probability weighting for censoring, multiple imputation, and sensitivity analyses for missing-not-at-random data.

See Week 4 for selection bias structures.

Step 9: Document transparently

Make assumptions, disagreements, and judgement calls legible. Register or timestamp your analytic plan. Include identification arguments (DAGs), code, and data where possible. Report robustness and sensitivity analyses. Transparent reasoning is a scientific result in its own right.

Summary table

| Step | Requirement | Core assumption |
|------|-------------|-----------------|
| 0 | Well-defined treatment | Consistency |
| 1 | Time zero aligned to assignment | Temporal coherence |
| 2 | Well-defined outcome | Interpretability |
| 3 | Target population | Generalisability |
| 4 | Exchangeability | Conditional independence |
| 5 | Causal consistency | No interference, well-defined versions |
| 6 | Positivity | Overlap |
| 7 | Measurement validity | No differential error |
| 8 | Representativeness | No selection bias |
| 9 | Transparent documentation | Reproducibility |

Reporting Guide

This guide covers how to report results from causal inference analyses. It follows the ten-step causal workflow and demonstrates best practices for communicating average treatment effects, heterogeneous effects, and sensitivity analyses. This resource directly supports the research report assessment (40%).


The ten-step causal inference workflow

Before reporting results, ensure each step has been addressed.

Steps 0–3: problem definition

  1. Well-defined treatment. Specify the exposure precisely, including the contrast (e.g., "weekly religious service attendance vs. less than weekly").
  2. Time zero. State when treatment assignment or initiation begins and when follow-up starts.
  3. Well-defined outcome. State the outcome measure, its scale, and when it was assessed.
  4. Target population. Define who the results apply to, including any weighting for population representativeness.

Steps 4–6: identification strategy

  1. Exchangeability. Describe how conditional independence was achieved (e.g., "rich baseline covariate control including 32 covariates").
  2. Consistency. Explain why the treatment is well-defined and uniform across individuals.
  3. Positivity. Report verification through transition tables showing exposure variation across covariate strata.

Steps 7–9: implementation

  1. Measurement validity. Note the psychometric properties of outcome scales.
  2. Attrition handling. Describe how panel dropout was addressed (e.g., inverse probability of censoring weights).
  3. Transparent reporting. Document all analytical decisions and assumptions.

Target trial emulation

Frame your causal question as: "How would outcomes change if we intervened to set everyone's exposure to level $a=1$ rather than $a=0$, conditional on baseline characteristics?"


Reporting average treatment effects

Standard ATE table format

Include these elements for each outcome:

| Outcome | Estimate (SD units) | 95% CI | E-value | Interpretation |
|---------|---------------------|--------|---------|----------------|
| Outcome A | 0.12 | [0.08, 0.16] | 1.8 | Small positive effect |
| Outcome B | 0.15 | [0.11, 0.19] | 2.1 | Moderate positive effect |

Key reporting elements

  1. Effect sizes: report in standard deviation units for continuous outcomes.
  2. Confidence intervals: show uncertainty around each estimate.
  3. E-values: indicate robustness to unmeasured confounding.
  4. Sample size: total analysed after exclusions and weighting.

Example results text

"Weekly religious service attendance showed positive causal effects across all cooperation measures. The largest effects were observed for social outcomes: sense of belonging ($\beta = 0.18$, 95% CI: 0.14–0.22) and social support ($\beta = 0.15$, 95% CI: 0.11–0.19). All effects were robust to moderate unmeasured confounding (E-values > 1.6)."


Reporting heterogeneous treatment effects

RATE results summary

When heterogeneity is detected, report it systematically:

| Outcome | RATE-AUTOC | p-value | RATE-Qini | p-value | Evidence |
|---------|------------|---------|-----------|---------|----------|
| Social support | 0.12 | 0.003 | 0.08 | 0.012 | Strong |
| Belonging | 0.15 | 0.001 | 0.11 | 0.004 | Strong |
| Charitable donations | 0.06 | 0.089 | 0.04 | 0.156 | Moderate |
| Volunteering | 0.03 | 0.234 | 0.02 | 0.445 | Weak |

Example heterogeneity text

"We found substantial heterogeneity in treatment effects for social outcomes (RATE-Qini > 0.08, $p < 0.05$), but limited heterogeneity for behavioural outcomes. This suggests that while religious service benefits most people socially, individual responses vary considerably in magnitude."


Reporting policy tree results

Present policy trees with both standardised and original-scale interpretations.

Subgroup identification with data-scale effects

High-response subgroups for charitable donations:

  1. Older adults with high agreeableness (age > 45, agreeableness > +1 SD)
    • Standardised effect: $\beta = 0.28$ (95% CI: 0.21–0.35)
    • Data-scale effect: NZ$847 additional annual donations (95% CI: NZ$635–1,058)
    • Sample proportion: 23%
  2. Parents with medium conscientiousness (parent = yes, conscientiousness > 0)
    • Standardised effect: $\beta = 0.22$ (95% CI: 0.16–0.28)
    • Data-scale effect: NZ$665 additional annual donations (95% CI: NZ$484–846)
    • Sample proportion: 31%
  3. All others
    • Standardised effect: $\beta = 0.08$ (95% CI: 0.04–0.12)
    • Data-scale effect: NZ$242 additional annual donations (95% CI: NZ$121–363)
    • Sample proportion: 46%

Example policy tree text

"Policy tree analysis identified two subgroups with enhanced treatment response for charitable donations. Older adults (45+) with high agreeableness showed the largest increase ($\beta = 0.28$ SD, equivalent to NZ$847 additional annual donations), representing 23% of the sample. In practical terms, targeted interventions toward these subgroups could generate 2.8–3.5 times more charitable giving than population-wide approaches."


Sensitivity analysis: E-values

Interpretation

E-values quantify robustness to unmeasured confounding. The E-value is the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need with both the treatment and the outcome to explain away the observed effect.

| E-value range | Interpretation |
|---------------|----------------|
| $\geq 2.0$ | Robust to strong confounding |
| 1.5–2.0 | Robust to moderate confounding |
| $< 1.5$ | Vulnerable to weak confounding |
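
To compute E-values in R, the EValue package provides helpers such as evalues.RR(); a minimal sketch with illustrative numbers:

# E-value for a risk ratio of 1.5 with 95% CI [1.2, 1.9]
# (install.packages("EValue") if needed)
library(EValue)
evalues.RR(est = 1.5, lo = 1.2, hi = 1.9)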

Example sensitivity text

"To assess robustness to unmeasured confounding, we calculated E-values for all estimates. The observed effect on sense of belonging (E-value = 2.4) would require an unmeasured confounder associated with both religious service and belonging by a risk ratio of 2.4 each to explain away the result."


Methods section template

A complete methods section following the ten-step workflow should include:

  1. Treatment definition: what the exposure is, how it is coded, and the contrast of interest.
  2. Time zero and follow-up: when assignment occurs, when follow-up starts, and why that timing matches the intervention.
  3. Outcome definition: measures used, timing of assessment, any transformations applied.
  4. Target population: sampling frame, weighting strategy, eligibility criteria.
  5. Causal identification: covariates conditioned on, justification for conditional exchangeability.
  6. Statistical analysis: estimation method, key tuning parameters (e.g., number of trees, minimum node size).
  7. Attrition handling: censoring weights, stages of dropout addressed.
  8. Heterogeneity assessment: RATE metrics, false discovery rate correction, policy tree depth.
  9. Sensitivity analysis: E-values for all primary estimates.

Reporting checklist

Do report

  • Both standardised and data-scale effects
  • Effect sizes with confidence intervals
  • Sample sizes after exclusions and weighting
  • E-values for sensitivity analysis
  • Clear practical interpretation of effect magnitudes
  • Subgroup sizes and effect magnifications
  • Target trial framework and causal question
  • Explicit treatment and outcome definitions

Do not report

  • Model coefficients without interpretation
  • p-values alone without effect sizes
  • Only standardised effects for policy-relevant outcomes
  • Technical details that obscure main findings
  • Causal claims beyond your identification strategy

Figure presentation

ATE plots

  • Use forest plots with confidence intervals.
  • Order by effect magnitude or E-value.
  • Include sample sizes.

Policy tree plots

  • Show decision rules clearly.
  • Include sample proportions in each node.
  • Provide plain-language interpretation alongside the tree.

Simulation Guide

Simulations are pedagogical tools that let us see what causal inference methods do when we know the truth. In observational research, we never know the true causal effect. In a simulation, we build the data-generating process ourselves, so we can compare each method's estimate against the ground-truth parameter. The four simulations in this course illustrate distinct threats to valid causal inference and distinct strategies for addressing them.

Download the R script

Required R packages

tidyverse, stdReg, gtsummary, clarify, grf


Generalisability and transportability

Connects to Week 4: external validity and selection bias.

This simulation creates two populations that differ in the prevalence of an effect modifier $Z$. In the sample, $Z = 1$ is rare ($p = 0.1$); in the target population, $Z = 1$ is common ($p = 0.5$). The treatment effect depends on $Z$: individuals with $Z = 1$ benefit more from treatment. Because the sample under-represents these high-benefit individuals, the naive sample Average Treatment Effect (ATE) underestimates the population ATE.

The simulation fits three models: an unweighted model on the sample, a weighted model on the sample (using inverse-probability-of-sampling weights), and an oracle model on the full population. The regression coefficients are nearly identical across all three models, yet the marginal ATEs differ. This dissociation is the central lesson: model coefficients describe conditional associations, but the ATE is a marginal quantity that depends on the distribution of effect modifiers in the target population. Weighting the sample to match the population distribution of $Z$ recovers the correct ATE.

The script also includes a manual calculation section that shows exactly what stdReg does under the hood: create counterfactual datasets in which everyone receives $A = 0$ and everyone receives $A = 1$, predict outcomes under each scenario, and take the mean difference. This "g-computation" procedure makes the marginalisation step explicit.
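
A minimal, self-contained version of that procedure, using toy data and lm() rather than stdReg, purely to show the mechanics:

library(tidyverse)

# toy data: L confounds A and Y; the true effect of A is 0.3
set.seed(2026)
n <- 5000
L <- rnorm(n)
A <- rbinom(n, 1, plogis(L))
Y <- 0.3 * A + 0.5 * L + rnorm(n)
dat <- tibble(L, A, Y)

# outcome model
fit <- lm(Y ~ A + L, data = dat)

# counterfactual datasets: everyone untreated vs everyone treated
dat0 <- dat |> mutate(A = 0)
dat1 <- dat |> mutate(A = 1)

# marginal ATE: mean difference in predicted outcomes
mean(predict(fit, newdata = dat1)) - mean(predict(fit, newdata = dat0))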

Key takeaway

Regression coefficients can be correct and yet the ATE can still be wrong for the target population. External validity requires that the distribution of effect modifiers in the sample matches the target, or that we adjust for the mismatch.


Cross-sectional data ambiguity

Connects to Week 3: confounding versus mediation.

This simulation generates data in which $A$ causes $L$, and $L$ causes $Y$. The variable $L$ is therefore a mediator, not a confounder. Two models are fit: one that conditions on $L$ (treating it as a confounder) and one that omits $L$ (treating it as a mediator). The model that conditions on $L$ returns a near-zero estimate for the effect of $A$ on $Y$ because it blocks the very path through which $A$ operates. The model that omits $L$ correctly recovers the total effect.

The crux of the problem is that with cross-sectional data alone, the investigator cannot distinguish a confounder from a mediator. Both the fork $A \leftarrow L \rightarrow Y$ and the chain $A \rightarrow L \rightarrow Y$ produce the same observed association between $A$, $L$, and $Y$. The correct modelling decision depends on the assumed causal structure, which the data themselves do not reveal.
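
A brief simulation of the two structures (illustrative coefficients): both produce associations among all three variables, and nothing in the cross-sectional correlations reveals which structure generated the data.

set.seed(2026)
n <- 10000

# chain: A -> L -> Y (L is a mediator)
A_chain <- rnorm(n)
L_chain <- 0.6 * A_chain + rnorm(n)
Y_chain <- 0.5 * L_chain + rnorm(n)

# fork: A <- L -> Y (L is a confounder)
L_fork <- rnorm(n)
A_fork <- 0.6 * L_fork + rnorm(n)
Y_fork <- 0.5 * L_fork + rnorm(n)

# all pairwise associations are present under both structures
round(cor(cbind(A = A_chain, L = L_chain, Y = Y_chain)), 2)
round(cor(cbind(A = A_fork, L = L_fork, Y = Y_fork)), 2)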

Warning

Good model fit does not resolve this ambiguity. A model that conditions on a mediator can fit the data well while returning a biased causal estimate. Model fit is a statistical property; confounding is a structural (causal) property.


Confounding control strategies

Connects to Weeks 3–4: conditioning choices and the backdoor criterion.

This simulation builds a three-wave panel structure with a baseline covariate $L_0$, a prior outcome $Y_0$, a prior exposure $A_0$, an unmeasured confounder $U$, a treatment $A_1$, and an outcome $Y_2$. The true treatment effect is $\delta_{A_1} = 0.3$, and the outcome also depends on $Y_0$, $A_0$, $L_0$, their interactions, and $U$.

Three models are compared. The "no control" model regresses $Y_2$ on $A_1$ alone and overestimates the effect because it leaves all confounding paths open. The "standard" model adds $L_0$ but still omits $Y_0$ and $A_0$, leaving residual confounding. The "interaction" model conditions on $L_0$, $A_0$, $Y_0$, and their interactions with $A_1$, recovering an estimate close to the true value.

The simulation uses the clarify package to obtain simulation-based confidence intervals for each ATE. The progressive improvement from no control to standard to interaction control illustrates that closing more backdoor paths moves the estimate closer to the truth, but only conditioning on the right set of variables eliminates confounding entirely.

Key takeaway

In a three-wave panel, conditioning on the prior exposure, prior outcome, and baseline covariates (along with their interactions) is ordinarily necessary to satisfy the backdoor criterion. Omitting any of these leaves residual confounding.


Causal forest estimation

Connects to Week 8: machine learning for heterogeneous treatment effects.

This simulation uses the same data-generating process as the confounding control simulation, then fits a causal forest (from the grf package) with $L_0$, $A_0$, and $Y_0$ as covariates. The causal forest is a non-parametric method that estimates individual-level treatment effects $\hat{\tau}(x)$ by partitioning the covariate space adaptively. Unlike the parametric models above, the causal forest does not require the investigator to specify interaction terms; it discovers them from the data.

The simulation reports the causal forest's ATE estimate and its standard error. Comparing this estimate to the parametric interaction model illustrates two points. First, the causal forest can recover the ATE without requiring the analyst to guess the correct functional form. Second, the causal forest provides a standard error that accounts for the adaptive splitting, making valid inference possible even in the non-parametric setting.
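Continuing that hypothetical data-generating process, a minimal grf sketch (causal_forest() and average_treatment_effect() are the package's actual entry points; the data objects come from the sketch above, not the course script):

library(grf)
X  <- cbind(L0, A0, Y0)          # confounder matrix
cf <- causal_forest(X, Y2, A1)   # arguments: covariates, outcome, treatment
average_treatment_effect(cf)     # ATE estimate with standard error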

Key takeaway

Causal forests automate the discovery of heterogeneous treatment effects but still require the investigator to supply the correct set of confounders. Machine learning solves the functional-form problem, not the identification problem.


Running the simulations

To run the full script, open R and execute:

source("scripts/simulations.R")

Each section prints its results to the console. You can also run individual sections by selecting and executing the relevant code blocks. The script sets random seeds so that results are reproducible.

Glossary and DAG Hand-outs

A reference page covering causal inference terminology and links to hand-out PDFs on directed acyclic graphs (DAGs), confounding, selection bias, and measurement error.


Causal inference glossary

Causal inference rests on mathematical foundations that enjoy broad agreement, but the terminology varies across disciplines. The same word sometimes carries different (even opposite) meanings in different literatures. Terms to watch include "selection", "fixed effects", "standardisation", "moderator", "structural equation model", and "identification".

Core concepts

| Term | Definition |
|---|---|
| Average Treatment Effect (ATE) | The expected difference in potential outcomes across the entire population: $\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$. Also called the marginal effect. |
| Conditional Average Treatment Effect (CATE) | The ATE within a subgroup defined by covariates $X = x$: $\text{CATE}(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$. |
| Potential outcomes | The outcomes that would be observed under each possible treatment level. For individual $i$: $Y_i(1)$ under treatment, $Y_i(0)$ under control. Also called counterfactual outcomes. |
| Counterfactual | The potential outcome corresponding to the treatment level not actually received. Unobservable for any given individual. |
| Causal consistency | $Y_i(a) = Y_i$ when $A_i = a$. The observed outcome equals the potential outcome under the treatment actually received. Requires well-defined treatment and no interference. |
| Exchangeability | Potential outcomes are independent of treatment assignment: $Y(a) \coprod A$. In observational studies, we require conditional exchangeability: $Y(a) \coprod A \mid L$. |
| Positivity | Every subgroup has a non-zero probability of receiving each treatment level: $P(A = a \mid L = l) > 0$. |

Graphical concepts

| Term | Definition |
|---|---|
| DAG (directed acyclic graph) | A graph with directed edges (arrows) and no cycles. Used to encode causal assumptions about which variables influence which. |
| Confounder | A common cause of both the exposure and the outcome. Creates a non-causal (backdoor) path that must be blocked for valid causal inference. |
| Collider | A variable caused by two or more other variables on a path. Conditioning on a collider opens a spurious association. |
| Mediator | A variable on the causal pathway between exposure and outcome ($A \to M \to Y$). Conditioning on a mediator blocks the indirect effect. |
| Backdoor path | A non-causal path from exposure to outcome that passes through a common cause. Blocking all backdoor paths satisfies the backdoor criterion. |
| d-separation | A graphical criterion for determining conditional independence. Two variables are d-separated given a set $Z$ if every path between them is blocked by $Z$. |

Estimation and sensitivity

| Term | Definition |
|---|---|
| Propensity score | The probability of receiving treatment given covariates: $e(L) = P(A = 1 \mid L)$. Used for weighting, matching, or stratification. |
| E-value | The minimum strength of association an unmeasured confounder would need with both treatment and outcome (on the risk ratio scale) to explain away an observed effect. |
| RATE (Rank Average Treatment Effect) | A metric for assessing treatment effect heterogeneity. Measures how well a prioritisation rule identifies individuals with larger effects. |
| QINI curve | A cumulative gain curve showing the benefit of treating individuals in order of predicted treatment effect. Area under the QINI curve summarises heterogeneity. |
| Policy tree | A decision tree that assigns treatment based on covariates to maximise a welfare criterion. Used for identifying high-response subgroups. |
| Doubly robust estimation | An estimation strategy that yields consistent causal estimates if either the outcome model or the propensity score model (but not necessarily both) is correctly specified. |
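To make the propensity-score entry concrete, a minimal inverse-probability-weighting sketch (the names dat, A, Y, L1, and L2 are hypothetical):

# fit the propensity score, then weight each arm to the full population
ps <- glm(A ~ L1 + L2, data = dat, family = binomial)$fitted.values
w  <- ifelse(dat$A == 1, 1 / ps, 1 / (1 - ps))
weighted.mean(dat$Y[dat$A == 1], w[dat$A == 1]) -
  weighted.mean(dat$Y[dat$A == 0], w[dat$A == 0])   # IPW estimate of the ATE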

DAG hand-outs

The following hand-outs cover DAG conventions, common structures, and specific forms of bias. All PDFs are available for download from the hand-outs folder (Dropbox).

Foundations

| Hand-out | Topic |
|---|---|
| 1a. Local conventions | Conventions for causal diagram construction and interpretation |
| 1b. Directed graph terminology | Core terminology for directed acyclic graphs |
| S1. Graphical key | Visual reference guide for DAG symbols and notation |
| S2. Glossary | Comprehensive glossary of causal inference terminology |

Common applications

| Hand-out | Topic |
|---|---|
| 2. Common causal questions | Frequently encountered causal questions and how to set them up |
| 6. Effect modification | When and how treatment effects vary across subgroups |
| 9. External validity | Generalising causal findings across populations and contexts |

Time series and confounding

| Hand-out | Topic |
|---|---|
| 3. Time series approaches | How longitudinal data help address confounding bias |
| 4. Three-wave panel methods | Using three-wave panel data for causal inference |
| 5. Time series limitations | When time series approaches may not resolve confounding |
| S3. Time-resolved confounding | Advanced approaches to time-varying confounding |

Advanced topics

| Hand-out | Topic |
|---|---|
| 7. Selection bias | Selection bias in longitudinal studies |
| 8. Measurement error | Structural approaches to representing and addressing measurement error |
| 10. Experimental design | How experiments address confounding and selection bias |

Supplementary materials

| Hand-out | Topic |
|---|---|
| S5. Timing examples | Practical examples of confounding and timing issues |
| S6. Detailed panel examples | What can go wrong in a three-wave panel |
| S7. Cross-sectional approaches | When to report multiple DAGs in cross-sectional studies |
| S8. Bias correction | Quantitative approaches to bias correction |
| S9. Mediator bias | Confounding bias in mediation analysis |
| S10. Misclassification bias | Examples of misclassification bias and bias towards the null |

Accessing hand-outs

All PDFs are in the hand-outs folder on Dropbox. File names match the numbering in the tables above (e.g., 1a-terminologylocalconventions.pdf, S2-glossary.pdf).

Causal Diagrams with ggdag

This tutorial introduces the ggdag package for drawing and analysing directed acyclic graphs (DAGs). DAGs encode causal assumptions and help identify which variables to condition on (and which to leave alone) when estimating causal effects.

How to use this page

This is a reference resource, not a graded lab. Work through the examples at your own pace and return when you need to draw or analyse a DAG for your research report.

Load libraries

# data wrangling
library(tidyverse)

# graphing
library(ggplot2)

# automated causal diagrams
library(ggdag)

The fork: omitted variable bias

Let's use ggdag to identify the confounding that arises from omitting L in a regression of Y on A.

First we write out the DAG:

# code for creating a DAG
graph_fork <- dagify(Y ~ L,
                   A ~ L,
                   exposure = "A",
                   outcome = "Y") |>
  tidy_dagitty(layout = "tree")

# plot the DAG
graph_fork |>
  ggdag() + theme_dag_blank() + labs(title = "L is a common cause of A and Y")

Next we ask ggdag which variables we need to include for an unbiased estimate:

ggdag::ggdag_adjustment_set(graph_fork) +
  theme_dag_blank() +
  labs(title = "{L} is the exclusive member of the confounder set for A and Y")

The causal graph tells us that, to obtain an unbiased estimate of the effect of A on Y, we must condition on L.

When we include the omitted variable L in our regression on the simulated data, the spurious association between A and Y disappears:

set.seed(123)
N <- 1000
L <- rnorm(N)
A <- rnorm(N, L)
Y <- rnorm(N, L)

# without control: A appears associated with Y
fit_fork <- lm(Y ~ A)
parameters::model_parameters(fit_fork)

# with control: association vanishes
fit_fork_controlled <- lm(Y ~ A + L)
parameters::model_parameters(fit_fork_controlled)

Mediation and the pipe

Suppose we are interested in the causal effect of A on Y, where the effect operates through a mediator M.

graph_mediation <- dagify(Y ~ M,
                 M ~ A,
                exposure = "A",
                outcome = "Y") |>
  ggdag::tidy_dagitty(layout = "tree")

graph_mediation |>
  ggdag() +
  theme_dag_blank() +
  labs(title = "Mediation Graph")

What should we condition on?

ggdag::ggdag_adjustment_set(graph_mediation)

"Backdoor Paths Unconditionally Closed" means we may obtain an unbiased estimate of A on Y without including additional variables, assuming our DAG is correct. There is no "backdoor path" from A to Y that would bias our estimate.

Two variables are d-connected if information flows between them (conditional on the graph), and d-separated if they are conditionally independent.

ggdag::ggdag_dconnected(graph_mediation)

Here d-connection is desirable because it means we can estimate A's effect on Y.

Simulation: pipe confounding

Suppose we want to know whether a ritual action condition (A) influences charity (Y), and the effect operates entirely through perceived social cohesion (M):

$A \to M \to Y$

set.seed(123)
N <- 100
c0 <- rnorm(N, 10, 2)
ritual <- rep(0:1, each = N / 2)
cohesion <- ritual * rnorm(N, .5, .2)
c1 <- c0 + ritual * cohesion
d <- data.frame(c0 = c0, c1 = c1, ritual = ritual, cohesion = cohesion)

# total effect of ritual on charity
parameters::model_parameters(lm(c1 ~ c0 + ritual, data = d))

# conditioning on the mediator blocks the causal path
parameters::model_parameters(lm(c1 ~ c0 + ritual + cohesion, data = d))

The estimated effect of ritual drops out when we include cohesion: once the model knows M, it gains no new information from A. Conditioning on a post-treatment variable in this way creates pipe confounding (post-treatment bias).

Masked relationships

When two correlated variables have opposing effects on the outcome, their individual effects can be "masked" in simple regressions.

dag_m1 <- dagify(K ~ C + R,
                 R ~ C,
                 exposure = "C",
                 outcome = "K") |>
  tidy_dagitty(layout = "tree")

dag_m1 |> ggdag()
set.seed(123)
n <- 100
C <- rnorm(n)
R <- rnorm(n, C)
K <- rnorm(n, R - C)
d_sim <- data.frame(K = K, R = R, C = C)

# C alone: weak or null
parameters::model_parameters(lm(K ~ C, data = d_sim))

# both: opposing effects "pop"
parameters::model_parameters(lm(K ~ C + R, data = d_sim))

Note that ggdag correctly identifies that you do not need to condition on R to estimate C's total effect on K:

dag_m1 |> ggdag_adjustment_set()

The total effect of C on K combines the direct path ($C \to K$) and the indirect path ($C \to R \to K$); these work in opposite directions.

Collider confounding

The selection-distortion effect (Berkson's paradox). Imagine there is no relationship between the newsworthiness and trustworthiness of science. Selection committees make decisions on the basis of both:

dag_sd <- dagify(S ~ N,
                 S ~ T,
                 labels = c("S" = "Selection",
                            "N" = "Newsworthy",
                            "T" = "Trustworthy")) |>
  tidy_dagitty(layout = "nicely")

dag_sd |>
  ggdag(text = FALSE, use_labels = "label") + theme_dag_blank()

When two arrows enter a variable, conditioning on it opens a path of information between its causes:

ggdag_dseparated(
  dag_sd,
  from = "T",
  to = "N",
  controlling_for = "S",
  text = FALSE,
  use_labels = "label"
) + theme_dag_blank()
# simulation of selection-distortion effect
set.seed(123)
n <- 1000
p <- 0.05

d <- tibble(
    newsworthiness  = rnorm(n, mean = 0, sd = 1),
    trustworthiness = rnorm(n, mean = 0, sd = 1)
  ) |>
  mutate(total_score = newsworthiness + trustworthiness) |>
  mutate(selected = ifelse(total_score >= quantile(total_score, 1 - p), TRUE, FALSE))

# correlation among selected proposals
d |>
  filter(selected == TRUE) |>
  select(newsworthiness, trustworthiness) |>
  cor()

Selection induces a spurious correlation. Among selected proposals, newsworthy ones appear less trustworthy, even though the two are independent in the population.

Collider bias in experiments

Conditioning on a post-treatment variable can open a spurious path even when no experimental effect exists:

dag_ex2 <- dagify(
  C1 ~ C0 + U,
  Ch ~ U + R,
  labels = c(
    "R" = "Ritual",
    "C1" = "Charity-post",
    "C0" = "Charity-pre",
    "Ch" = "Cohesion",
    "U" = "Religiousness (Unmeasured)"
  ),
  exposure = "R",
  outcome = "C1",
  latent = "U"
) |>
  control_for(c("Ch", "C0"))

dag_ex2 |>
  ggdag_collider(text = FALSE, use_labels = "label") +
  ggtitle("Cohesion is a collider that opens a path from ritual to charity")

Taxonomy of confounding

There are four basic structures:

The fork (omitted variable bias)

confounder_triangle(x = "Coffee", y = "Lung Cancer", z = "Smoking") |>
  ggdag_dconnected(text = FALSE, use_labels = "label")

The pipe (fully mediated effects)

mediation_triangle(x = NULL, y = NULL, m = NULL, x_y_associated = FALSE) |>
  tidy_dagitty(layout = "nicely") |>
  ggdag()

The collider

collider_triangle() |>
  ggdag_dseparated(controlling_for = "m")

Confounding by proxy

Controlling for a descendant of a collider introduces collider bias:

dag_sd <- dagify(
  Z ~ X,
  Z ~ Y,
  D ~ Z,
  labels = c("Z" = "Collider", "D" = "Descendant", "X" = "X", "Y" = "Y"),
  exposure = "X",
  outcome = "Y"
) |>
  control_for("D")

dag_sd |>
  ggdag_dseparated(
    from = "X", to = "Y",
    controlling_for = "D",
    text = FALSE, use_labels = "label"
  ) +
  ggtitle("D induces collider bias")

Rules for avoiding confounding

From Statistical Rethinking (p. 286):

  1. List all paths connecting the exposure and outcome
  2. Classify each path as open or closed (a path is open unless it contains a collider)
  3. Classify each path as a backdoor path (has an arrow entering the exposure)
  4. If there are open backdoor paths, decide which variable(s) to condition on to close them

Selection bias in sampling and longitudinal research

Selection bias arises when the sample is not representative of the target population due to conditioning on a collider or its descendant:

coords_mine <- tibble::tribble(
  ~name,           ~x,  ~y,
  "glioma",         1,   2,
  "hospitalized",   2,   3,
  "broken_bone",    3,   2,
  "reckless",       4,   1,
  "smoking",        5,   2
)

dagify(hospitalized ~ broken_bone + glioma,
       broken_bone ~ reckless,
       smoking ~ reckless,
       labels = c(hospitalized = "Hospitalization",
                  broken_bone = "Broken Bone",
                  glioma = "Glioma",
                  reckless = "Reckless \nBehaviour",
                  smoking = "Smoking"),
       coords = coords_mine) |>
  ggdag_dconnected("glioma", "smoking", controlling_for = "hospitalized",
                   text = FALSE, use_labels = "label", collider_lines = FALSE)

In longitudinal research, retention can act as a descendant of a collider, introducing bias when the sample is conditioned on being retained.

Summary

We control for variables to avoid omitted variable bias. But included variable bias is also commonplace. It arises from "pipes", "colliders", and conditioning on descendants of colliders. The ggdag package can help identify adjustment sets, but the results depend on assumptions encoded in your DAG that are not part of the data. Clarify your assumptions.

Vim Motions with Zed: 2-Week Starter Plan

This plan helps students build core Vim motions in a low-friction GUI environment before moving to terminal Neovim.

Setup

  1. Ask students to install Zed from the official download page: https://zed.dev/download.
  2. In Zed, enable Vim mode:
    1. Open command palette (cmd-shift-p on macOS, ctrl-shift-p on Windows/Linux).
    2. Search vim mode and enable it.
  3. Keep lessons short and repeat daily.

Learning Goals

By the end of two weeks, students should be able to:

  • move quickly with hjkl, w, b, e, 0, $, gg, G
  • edit confidently with x, dd, yy, p, u, ctrl-r
  • use operator + motion combinations (d2w, c$, yap)
  • navigate and search with /, n, N

Week 1: Movement and Basic Editing

Day 1

  • modes: normal vs insert (i, a, esc)
  • movement: hjkl
  • mini drill (5 min): move cursor only, no mouse

Day 2

  • word motions: w, b, e
  • line anchors: 0, $
  • drill: reach highlighted targets in as few keystrokes as possible

Day 3

  • delete and undo: x, dd, u, ctrl-r
  • drill: clean a noisy paragraph without mouse use

Day 4

  • copy/paste: yy, p
  • counts: 3w, 2dd, 5j
  • drill: duplicate and rearrange short code blocks

Day 5

  • search: /pattern, n, N
  • week check: 10-minute editing challenge in pairs

Week 2: Composable Editing

Day 6

  • operator + motion concept: d, c, y + motion
  • examples: dw, d$, cw

Day 7

  • text objects (starter): iw, aw, ip, ap
  • examples: diw, ciw, yap

Day 8

  • navigation scale-up: gg, G
  • drill: move through a longer file with no scrolling bar

Day 9

  • practical refactor drill:
    • rename repeated terms
    • reorder lines
    • delete filler text

Day 10

  • capstone: 15-minute “mouse-free edit” task
  • reflection: what still feels slow, what now feels automatic

Classroom Routine (5-10 minutes each class)

  1. One new motion rule.
  2. One short live demo.
  3. One timed drill.
  4. One debrief prompt: “Which keystroke saved you the most time today?”

Suggested Bridge to Neovim

After two weeks, invite interested students to repeat the same drills in Neovim. Keep tasks identical so they transfer motor memory rather than relearn workflows.

Zed Download and Install

Download Zed from the official page: https://zed.dev/download

After installation, enable Vim mode in Zed via the command palette by searching for vim mode. Warning: it will take several weeks to several months to learn Vim motions; however, the investment will save you time and effort in the long run.

Suggested Answers: Pair Exercises

These are brief suggested answers for the pair exercises embedded in weekly lectures. They are intended as discussion guides, not definitive solutions. Many exercises are deliberately open-ended.


Week 1

Formulating a contrast

A well-formed causal question might be: "Would replacing two hours of nightly screen time with two hours of reading reduce sleep onset latency (in minutes) over four weeks among 14-to-16-year-olds in Aotearoa New Zealand?" Both sides of the contrast are specified (screen time versus reading), the outcome is defined (sleep onset latency), the time horizon is stated (four weeks), and the target population is named (14-to-16-year-olds in Aotearoa NZ). Common critique points: "screen time" is vague (passive scrolling? gaming? messaging?), "poor sleep" needs a measurable operationalisation, and "teenagers" lumps heterogeneous age groups.

Three problems in one claim

  1. Definitional clarity: "religion" could mean attendance, belief, prayer, or community membership. "Mental health" could mean depression, life satisfaction, anxiety, or a composite. Neither side of the contrast is specified.
  2. Population specificity: the answer may differ between adolescents and older adults, between countries with majority-religion norms and secular societies, or between denominations.
  3. Unobservability: we cannot observe the same person both practising and not practising religion. The individual causal effect is missing by construction.

A rewrite: "Among adults aged 40-65 in Aotearoa New Zealand, would initiating weekly religious service attendance (versus maintaining no attendance) reduce depressive symptoms (PHQ-9 score) over 12 months?"


Week 2

Naming the structure

  1. Fork. SES causes both neighbourhood quality and health outcomes: neighbourhood $\leftarrow$ SES $\to$ health. Neighbourhood and health are marginally associated (through SES). Conditioning on SES removes the association.
  2. Chain. Drug $\to$ inflammation $\to$ pain. Drug and pain are marginally associated (through the mediating path). Conditioning on inflammation blocks the path and removes the association between drug and pain.
  3. Collider. Genetics $\to$ BP $\leftarrow$ diet. Genetics and diet are marginally independent (neither causes the other). Conditioning on blood pressure opens a spurious association: among people with the same BP, knowing genetic risk tells you something about diet, because a person with the same BP but low genetic risk must have a diet that pushes BP up (and vice versa).

Checking assumptions against a causal DAG

In the observational design, parental consent ($L$) is driven by SES ($U$), and $U$ also affects polio risk ($Y$). The backdoor path $A \leftarrow L \leftarrow U \to Y$ is open. Exchangeability fails: $Y(a) \cancel\coprod A$.

In the randomised design, $A$ is assigned by a chance mechanism ($\mathcal{R}$) that is independent of $U$ and $L$. The backdoor path through $L$ and $U$ is severed because $A$ no longer depends on $L$. Exchangeability holds: $Y(a) \coprod A$.

Positivity is more likely to fail in the observational design: some SES strata may have near-universal consent or refusal, leaving no comparison group.

Neurath's ship and your own causal DAG

Answers vary by discipline. The key check is whether the partner can identify a fork (common cause generating spurious association) and a chain (mediating path). The sceptic's challenge should propose either a reversed arrow or a missing common cause that would change the adjustment strategy.


Week 3

Applying the backdoor criterion

Paths from $A$ to $Y$: (1) $A \to M \to Y$ (causal, through mediator); (2) $A \leftarrow L_1 \to Y$ (backdoor through health consciousness); (3) $A \leftarrow L_1 \to L_2 \to Y$ (backdoor through health consciousness and diet).

$\{L_1\}$ satisfies the backdoor criterion: it blocks both backdoor paths (paths 2 and 3) and $L_1$ is not a descendant of $A$. Conditioning on $\{L_1\}$ supports exchangeability.

Adding $M$ violates the criterion because $M$ is a descendant of $A$ (it lies on the causal path $A \to M \to Y$). Conditioning on $M$ blocks part of the total effect we want to estimate.

M-bias in practice

The DAG: $U_1 \to A$ (attendance), $U_2 \to Y$ (giving), $U_1 \to L \leftarrow U_2$ (neighbourhood social capital is a collider of two unmeasured causes). Without conditioning on $L$, the path $A \leftarrow U_1 \to L \leftarrow U_2 \to Y$ is blocked at the collider $L$.

Conditioning on $L$ opens this path, creating a spurious association between $A$ and $Y$ through the unmeasured causes. "Adjust for all pre-treatment variables" fails because $L$ is pre-treatment but is a collider: conditioning on it opens, rather than closes, a biasing path.

$R^2$ versus identification

$R^2$ measures variance explained, a statistical property. Confounding is a structural property of the DAG. A model with high $R^2$ can still be biased if the adjustment set includes a collider (opening a spurious path) or a mediator (blocking part of the causal path).

Example DAG where the larger set introduces bias: if Investigator A's set includes a variable $C$ that is a collider ($A \to C \leftarrow Y$), conditioning on $C$ opens a non-causal path and biases the estimate, despite improving $R^2$. Investigator B's smaller set $\{$age, conscientiousness$\}$ would satisfy the backdoor criterion if conscientiousness blocks all backdoor paths and is not a descendant of $A$.


Week 4

Classifying measurement error

  1. Type 1: independent, uncorrelated. The screen-time noise and the wellbeing noise do not share a common cause and neither is causally affected by the other variable. This typically attenuates toward the null.
  2. Type 3: dependent, uncorrelated. The treatment (bilingualism) causally affects how the outcome (cognitive performance) is measured, because the test instrument is language-dependent. The DAG shows $A \to$ measurement-error node $\to Y^*$ (recorded outcome), opening a non-causal path from $A$ to $Y^*$.
  3. Type 2: independent, correlated. The shared translation team creates a common cause of errors in both measures. Neither variable's true value causes the other's measurement error, but the errors co-vary through the shared cause.

Collider bias versus confounding

The DAG: depression ($A$) $\to$ ward admission ($C$) $\leftarrow$ injury severity $\to$ recovery ($Y$). Marginally, $A$ and $Y$ may be independent (or associated only through a causal path). Restricting to admitted patients conditions on $C$, opening the path $A \to C \leftarrow$ injury severity $\to Y$.

This is not confounding. Confounding requires an open backdoor path through a common cause (e.g., $A \leftarrow L \to Y$). Here, the path was blocked before conditioning. Conditioning on the collider $C$ actively opens a previously blocked path. Among admitted patients, less depressed individuals tend to have more severe injuries (otherwise they would not have been admitted), creating a spurious negative association between depression and recovery.

Design fix: analyse all eligible patients regardless of admission status, or use inverse probability weighting to account for selection into the hospital sample.

Auditing a study for two failure modes

Selection bias: university mailing list recruitment acts as a filter. Academic motivation and language confidence jointly affect enrolment, making the analytic sample unrepresentative. If motivation or confidence also relates to bilingualism or cognitive outcomes, conditioning on sample membership distorts the contrast.

Measurement bias: type 3 (dependent, uncorrelated). The treatment (bilingualism) causally affects how the English-only cognitive test measures the outcome. Non-English-dominant bilinguals are systematically mismeasured, and this mismeasurement depends on treatment status.


Week 5

Building a potential outcomes table

The key distinction is between the hidden science and the observed data. In the hidden science, each student has both potential outcomes and an individual effect. In the observed data, one potential outcome and hence $\delta_i$ are missing for every student.

One possible hidden science table is:

| $i$ | $Y_i(1)$ | $Y_i(0)$ | $\delta_i$ |
|---|---|---|---|
| 1 | 0 | 1 | $-1$ |
| 2 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 |
| 4 | 1 | 1 | 0 |

If treatment assignment is $A_1=A_2=1$ and $A_3=A_4=0$, the observed-data table is:

| $i$ | $Y_i(1)$ | $Y_i(0)$ | $\delta_i$ | $A_i$ | $Y_i^{\text{obs}}$ |
|---|---|---|---|---|---|
| 1 | 0 | NA | NA | 1 | 0 |
| 2 | 0 | NA | NA | 1 | 0 |
| 3 | NA | 0 | NA | 0 | 0 |
| 4 | NA | 1 | NA | 0 | 1 |

The true ATE in the hidden science is $(-1 + 0 + 1 + 0)/4 = 0$. The naive observed difference in means is $\bar{Y}_{A=1}^{\mathrm{obs}} - \bar{Y}_{A=0}^{\mathrm{obs}} = 0 - 0.5 = -0.5$. The discrepancy arises because treatment assignment is not random with respect to the potential outcomes: students 1 and 2, who received $A=1$, differ in their hidden counterfactual outcomes from students 3 and 4, who received $A=0$. Exchangeability does not hold.
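These numbers can be checked with a few lines of R that reproduce the tables above:

po <- data.frame(
  i  = 1:4,
  y1 = c(0, 0, 1, 1),   # Y_i(1) in the hidden science
  y0 = c(1, 0, 0, 1),   # Y_i(0)
  a  = c(1, 1, 0, 0)    # treatment assignment
)
mean(po$y1 - po$y0)                               # true ATE: 0
mean(po$y1[po$a == 1]) - mean(po$y0[po$a == 0])   # naive difference: -0.5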

Tracing the identification logic

The claim "students who chose the mindfulness app had lower anxiety, therefore the app works" compares $\mathbb{E}[Y \mid A=1]$ with $\mathbb{E}[Y \mid A=0]$ and interprets the difference causally.

Consistency is questionable if "used the app" pools different versions of treatment under one label: different apps, different session lengths, different start dates, or irregular adherence. Multiple versions undermine the link between $A_i = 1$ and a well-defined $Y_i(1)$.

Exchangeability is the most plausible violated assumption: students who chose the app may have differed from non-users in baseline anxiety, motivation, help-seeking, or available time. The treated group may therefore have had different counterfactual outcomes even without the app, so $Y(0) \cancel\coprod A$.

Positivity may also fail: in some covariate strata, such as students with very high workload or students already receiving intensive therapy, almost no one may choose one side of the contrast, leaving no meaningful comparison group in those strata.

Designing a target trial

Causal estimand: the average difference in anxiety symptoms (e.g., GAD-7 score) at 6 months if all university students practised 20 minutes of daily meditation versus if all maintained their current routine (no meditation).

Time zero: the date of programme enrolment (or randomisation in the target trial).

Two baseline covariates with causal rationale: (1) baseline anxiety (GAD-7 at enrolment), because prior anxiety affects both the decision to meditate and the outcome; (2) academic workload (full-time vs part-time enrolment), because workload affects adherence to meditation and anxiety levels.

Positivity failure: students with severe clinical anxiety may be referred to treatment rather than a meditation programme, so the stratum "severe baseline anxiety" may contain no one in the meditation arm.


Week 6

Interaction versus effect modification

The causal estimand for interaction requires four potential outcomes: $Y(a=1,g=\text{young})$, $Y(a=1,g=\text{old})$, $Y(a=0,g=\text{young})$, $Y(a=0,g=\text{old})$. This is conceptually odd because we cannot intervene on age.

The causal estimand for effect modification involves one intervention (exercise) with subgroup contrasts: $\mathbb{E}[Y(1)-Y(0) \mid G=\text{young}]$ versus $\mathbb{E}[Y(1)-Y(0) \mid G=\text{old}]$. This is the design that matches the study.

The regression interaction term could be non-zero without causal modification if, for example, the linear specification is wrong (the true effect varies non-linearly with a confounder correlated with age), or if age is a collider or descendant of a collider in the DAG.

Why conditioning changes effect modification

Even without a direct $G \to Y$ path, the CATE varies by age because $G$ (age) affects $L$ (fitness), and if the treatment effect varies with $L$, then the distribution of $L$ within age strata determines the subgroup average. Different age groups have different fitness distributions, so $\tau(g)$ differs.

The colleague's null interaction conclusion is premature: the regression test depends on the conditioning set and the functional form. A non-significant $A \times G$ term in a linear model does not rule out effect modification visible with a richer specification or different conditioning set.

Two apparent modifiers could vanish together if both $G_1$ and $G_2$ are proxies for the same underlying variable $L$. Each captures part of the variation in $L$; conditioning on both accounts for $L$ fully, and the residual variation in treatment effect disappears.

From average to subgroup

If 60% of participants have CATE = 8 and 40% have CATE = $-2$: ATE = $0.6 \times 8 + 0.4 \times (-2) = 4.8 - 0.8 = 4.0$. Adjust proportions: e.g., 50% with CATE = 8 and 50% with CATE = $-2$: ATE = $4 - 1 = 3$ mmHg. The policy-maker misses that 50% of participants are harmed (blood pressure increases by 2 mmHg).

The claim "$\hat{\tau}(X_i) = 8$ means the programme will reduce my blood pressure by 8" confuses an estimated subgroup average with an individual effect. $\hat{\tau}(X_i) = 8$ estimates the average effect for everyone sharing person $i$'s measured profile. Person $i$'s true individual effect $Y_i(1) - Y_i(0)$ is unobservable and could be larger, smaller, or opposite in sign.


Week 8

From tree to forest to causal forest

A single regression tree is interpretable (you can read the decision path), but unstable: small changes in the data shift splits and predictions (high variance).

Averaging many trees (a forest) reduces variance. Each tree's idiosyncratic splits cancel out, producing smoother, more reliable predictions.

Two differences in a causal forest: (a) the target is $\tau(x) = \mathbb{E}[Y(1)-Y(0) \mid X=x]$, a treatment contrast, not a prediction of $Y$; (b) honest splitting uses one subsample to choose splits and a separate subsample to estimate contrasts within leaves. Honest splitting is necessary because treatment contrasts require estimating quantities under two exposures, only one of which is observed per person. Using the same data for splitting and estimation would overfit to noise in the individual-level contrasts.

Reading a TOC curve

A steep initial rise means treatment gains are heavily concentrated among the top-ranked individuals. The programme helps some people a lot but most people only a little.

Large AUTOC but small Qini at $q=0.3$ means that gains concentrate in a very narrow top slice (perhaps the top 5-10%), and by the time you expand to 30% coverage, the additional individuals contribute little. For a decision-maker with a 30% budget, the targeting advantage over random allocation is small.

Computing the TOC curve on training data overfits: the forest's rankings are optimised for the training sample, so in-sample evaluation inflates the apparent targeting value. Honest evaluation requires held-out or cross-fitted data.

Should we target?

The Qini addresses the causal estimand: "does treating the top $q$ fraction (ranked by estimated treatment effect) yield greater total benefit than treating a random $q$ fraction?" It goes beyond the ATE by asking whether benefits are concentrated enough to justify selective allocation.

Two non-statistical reasons not to target: (1) logistical feasibility (screening and scoring may cost more than universal provision); (2) stigma or fairness concerns (singling out individuals with high loneliness scores may be perceived as labelling).

Response to the stigma concern: "The concern about stigmatisation is legitimate and must shape how targeting is implemented. The evidence shows that some students benefit substantially more than others, but it does not mandate that selection criteria be disclosed or that participation be compulsory. A self-referral design using the targeting criteria as capacity planning could capture most of the benefit without labelling individuals."


Week 9

Designing a depth-2 policy rule

Example: split first on deprivation index (high vs low), then split the high-deprivation leaf on baseline loneliness (high vs low). Treat the "high deprivation, high loneliness" leaf. If roughly 40% of the population is high-deprivation and 50% of those are high-loneliness, the treated group is approximately 20%.

Two reasons to prefer depth-2 over depth-4: (1) a depth-2 tree has at most 4 leaves and 3 yes/no questions, which a policy-maker or clinician can explain in a sentence; (2) deeper trees split the sample into smaller subgroups, reducing the number of observations per leaf and increasing the variance of the estimated policy value.

Equity audit

A deprivation split indirectly stratifies by ethnicity because in Aotearoa NZ, Māori and Pasifika populations are disproportionately represented in high-deprivation areas due to historical and structural inequities. A rule that targets high deprivation will differentially affect these groups.

Applying governance checks: (1) "Who gains and who loses?" The high-deprivation-under-40 group gains access; everyone else is excluded. If the excluded group includes high-deprivation people over 40 who also benefit, the rule creates an age-based inequity within disadvantaged communities. (2) "Can affected communities understand and contest the rule?" A depth-2 tree is transparent enough to explain, but communities need a mechanism to challenge the split variables and thresholds.

Te Tiriti modification: guarantee a minimum allocation floor for Māori regardless of the tree's splits, ensuring that the algorithmic rule does not reduce access below current levels for tangata whenua.

"The algorithm is objective because it only uses data" is incorrect. The algorithm optimises a chosen objective function on data that reflect historical inequities. Structural disadvantage is encoded in the variables. Objectivity in computation does not imply fairness in outcomes.

Policy tree versus ranking

(a) Estimated policy value: Strategy A (pure ranking) typically achieves equal or slightly higher policy value because it uses the full granularity of $\hat{\tau}(X_i)$. Strategy B loses some value by collapsing to a few leaves.

(b) Explainability: Strategy B is far more explainable. A depth-2 tree is a short set of if-then rules. Strategy A requires explaining a continuous score derived from thousands of overlapping tree splits.

(c) Ability to answer "why was I selected?": Strategy B can give a clear answer ("because your deprivation index is above 8 and your loneliness score is above the median"). Strategy A can only say "because your estimated benefit score was in the top 20%," which is opaque.

Strategy A is preferable when the decision is internal (e.g., a research team allocating limited follow-up resources) and does not require public justification. Strategy B is preferable when the rule must be defended publicly, contested by affected communities, or implemented by non-technical staff.


Week 10

Interpreting invariance results

Configural invariance means the same items load on the same factors in both groups (same pattern of zero and non-zero loadings). Metric invariance means the factor loadings are equal across groups (a one-unit increase in the latent variable produces the same change in item responses). Scalar/threshold invariance means the intercepts (or thresholds for ordinal items) are equal, so the same latent level produces the same expected response.
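One common way to formalise this sequence is with nested multi-group models in lavaan; a sketch, assuming hypothetical item names k1-k6, a data frame d, and a grouping variable culture:

library(lavaan)
model <- "distress =~ k1 + k2 + k3 + k4 + k5 + k6"
fit_configural <- cfa(model, data = d, group = "culture")
fit_metric     <- cfa(model, data = d, group = "culture",
                      group.equal = "loadings")
fit_scalar     <- cfa(model, data = d, group = "culture",
                      group.equal = c("loadings", "intercepts"))
anova(fit_configural, fit_metric, fit_scalar)   # compare the nested models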

Failing scalar/threshold invariance means that even at the same latent distress level, the two groups endorse "felt hopeless" and "felt worthless" differently. A one-unit difference in total score does not correspond to the same latent difference across groups. Group mean comparisons on the total score therefore confound true latent differences with measurement artefact.

Hypothesis for differential functioning: cultural norms about expressing hopelessness or worthlessness may differ. In some cultural contexts, endorsing "felt worthless" may carry greater stigma, leading to systematically lower endorsement at the same latent distress level. Alternatively, translation may anchor response categories differently.

Fit is not identification

Good fit means the model reproduces the observed covariance matrix. It does not establish causal direction. Multiple causal structures can generate the same covariance pattern.

Reflective DAG: $\eta \to X_1, \eta \to X_2, \ldots, \eta \to X_6$. The latent variable causes the indicators. Formative DAG: $X_1 \to \eta, X_2 \to \eta, \ldots, X_6 \to \eta$. The indicators cause the composite. Both can produce identical fit statistics for a single-factor solution.
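Both structures can be drawn with the dagify() syntax used earlier in these pages (with ggdag loaded; eta stands in for the latent variable $\eta$, and three indicators are shown for brevity):

# reflective: the latent variable causes the indicators
dagify(X1 ~ eta, X2 ~ eta, X3 ~ eta, latent = "eta") |> ggdag()

# formative: the indicators cause the composite
dagify(eta ~ X1 + X2 + X3, latent = "eta") |> ggdag()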

The choice matters for downstream causal inference. If the construct is reflective and we use it as a confounder, we assume the latent variable is the true common cause. If the construct is actually formative (a composite of independent causes), conditioning on the composite may not block the backdoor paths we intend to close, because each component may have a different causal relationship with treatment and outcome.

Measurement as an identification problem

Scalar/threshold non-invariance means the same response pattern corresponds to different latent levels across groups. Put differently, the mapping from the latent outcome $Y^*$ to the measured outcome $Y$ depends on group membership. This matters for CATE because CATE is defined by a group contrast. If measurement differs by group, the estimated heterogeneity can be measurement artefact. This can happen even when exchangeability and positivity hold for $Y^*$, because the analysis uses $Y$, not $Y^*$.

Example intuition: population A is distressed and population B is not. In A, "worthlessness" may be caused by unemployment. In B, it may be rare and have different causes. The same K6 item can have different causal parents across groups. The factor structure and item means can therefore differ without any change in "true distress".

Counter to "validated in hundreds of studies": most validation evidence is about internal consistency, short-term stability, and associations with other variables. Those are associational properties. They do not establish that the items have the same meaning, the same causes, or the same measurement function across the particular groups you want to compare. A scale can be reliable within each group and still be non-comparable across groups.

Proposed workflow step (between DAG and estimation): write down a measurement submodel as causal assumptions. State whether your estimand is the effect on reported K6 ($Y$) or the effect on the underlying state ($Y^*$). Then draw a measurement DAG that makes explicit what causes item responses in each group, including stigma, translation, and response norms. Decide what design or data would support those assumptions. If you choose to run measurement invariance tests, treat them as descriptive stress tests of a specific reflective model, not as evidence that measurement is causally comparable. If the stress test fails, the honest conclusion is that your group comparison is not identified without stronger measurement assumptions or better data.