Week 5: Causal Inference: Average Treatment Effects
Date: 25 Mar 2026
Readings
Required
- Hernán & Robins (2025), chapters 1-3.
- Cashin et al. (2025). TARGET (Transparent Reporting of Observational Studies Emulating a Target Trial) statement.
Optional
- Neal (2020), chapters 1-2.
Key concepts for the test(s)
- The fundamental problem of causal inference
- Average (marginal) treatment effects
- Causal consistency
- Exchangeability
- Positivity
Lab
Use Lab 5: Average Treatment Effects for this week's hands-on work.
Terminology
In these notes, we use "potential outcomes" and "counterfactual outcomes" interchangeably.
Weeks 1 through 4 built a framework for asking causal questions and identifying the structural threats that obstruct answers: confounding, selection bias, and measurement error. Week 2 stated the three identification assumptions. This week shows how those assumptions, together with a well-specified target trial, connect a causal question to an estimable population contrast.
The shift in emphasis matters. Week 4 asked, "where can bias enter?" Week 5 asks, "what causal contrast do we want, and what assumptions let observed data stand in for the missing counterfactuals?"
Seminar
Opening example: one question, two different answers
Observational studies often suggest that students who choose to use a mindfulness app have lower anxiety than students who do not. Randomised trials usually find a smaller and less consistent average benefit once investigators standardise when the intervention begins, what counts as treatment, and when outcomes are measured.
Same substantive question. Different design. Different answer.
This week explains why.
A simple map for this week
This lecture has three moves.
Three moves
- Write the causal contrast we want.
- See why we cannot observe that contrast for one person.
- Use identification assumptions to recover a population average from observed data.
Potential outcomes and DAGs do different jobs. Potential outcomes define the causal contrast. DAGs help us judge whether exchangeability is plausible, which variables belong in $L$, and whether the design has a coherent time zero.
Step 1: state the causal question
Return to the mindfulness example. Let $A=1$ denote starting a guided mindfulness app at the beginning of semester and completing one 10-minute session per day for eight weeks. Let $A=0$ denote not starting the app. Let $Y$ be anxiety symptoms at week 8.
For student $i$, $Y_i(1)$ is the outcome under the app-based intervention, and $Y_i(0)$ the outcome without it. The individual causal effect is:
$$ Y_i(1)-Y_i(0). $$
We never observe both terms for one student at one time.
The fundamental problem of causal inference
The individual causal effect requires two quantities, but we can observe at most one. If student $i$ starts the app ($A_i = 1$), we observe $Y_i(1)$ but not $Y_i(0)$. If the student does not start it ($A_i = 0$), we observe $Y_i(0)$ but not $Y_i(1)$. The unobserved term is the counterfactual.
This is not a sample-size problem or a measurement problem. It is a structural feature of the physical world: a person cannot simultaneously exist under two incompatible conditions. No dataset, however large, contains both potential outcomes for one individual at one time. Individual causal effects are missing by necessity, not by accident.
Pair exercise: building a potential outcomes table
- Consider four students in the mindfulness example. Construct a table with columns: person ($i$), $Y_i(1)$, $Y_i(0)$, $\delta_i = Y_i(1) - Y_i(0)$, treatment received ($A_i$), and observed outcome ($Y_i^{\mathrm{obs}}$).
- Assign plausible values: let two students receive $A = 1$ and two receive $A = 0$. Make the true individual effects vary (e.g., one positive, one negative, two zero).
- Compute the true ATE from all four $\delta_i$ values.
- Compute the naive difference in means: $\bar{Y}_{A=1}^{\mathrm{obs}} - \bar{Y}_{A=0}^{\mathrm{obs}}$.
- Do the two quantities match? Explain why the discrepancy arises (or why it does not).
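One way to work through the exercise is sketched below in Python. The four students' values are hypothetical, chosen only so that the individual effects vary and the less anxious students happen to be the ones who use the app; lower scores mean less anxiety.

```python
from statistics import mean

# Hypothetical potential outcomes for four students: (i, Y_i(1), Y_i(0), A_i).
# In real data we would never see both Y_i(1) and Y_i(0) for one student.
students = [
    (1, 10, 14, 1),  # app lowers anxiety by 4
    (2, 12, 12, 1),  # no effect
    (3, 18, 18, 0),  # no effect
    (4, 22, 20, 0),  # app raises anxiety by 2
]


def observed(y1, y0, a):
    """The treatment received selects which potential outcome we observe."""
    return a * y1 + (1 - a) * y0


rows = [
    {"i": i, "y1": y1, "y0": y0, "delta": y1 - y0, "a": a,
     "y_obs": observed(y1, y0, a)}
    for i, y1, y0, a in students
]

# True ATE: average of the individual effects (needs both potential outcomes).
true_ate = mean(r["delta"] for r in rows)

# Naive contrast: difference in observed means between the two groups.
naive = (mean(r["y_obs"] for r in rows if r["a"] == 1)
         - mean(r["y_obs"] for r in rows if r["a"] == 0))

for r in rows:
    print(r)
print("true ATE:", true_ate)   # -0.5
print("naive difference:", naive)  # -8
```

In this made-up table the naive difference overstates the benefit because the students who used the app would have had lower anxiety under either condition.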
Step 2: move from individuals to populations
Because individual effects are unobservable, we target a population causal estimand. The average treatment effect (ATE) is:
$$ \text{ATE}=\mathbb{E}[Y(1)-Y(0)]. $$
This is the mean contrast if everyone in the target population received $A=1$ versus $A=0$.
Step 3: connect the causal estimand to observed data
The three identification assumptions are easier to remember if you ask what each one contributes to the argument.
Assumption 1: causal consistency
If $A_i=a$, then $Y_i=Y_i(a)$. The observed outcome equals the potential outcome corresponding to the treatment actually received.
Data scientists estimate parameters for observed data. Causal inference goes further: we estimate contrasts involving counterfactual parameters. We compute the average response when the entire target population is exposed, then when the entire population is unexposed, then contrast these averages. Consistency is what allows us to bridge from counterfactual to observed. Without it, potential outcomes remain purely theoretical.
The general switching equation expresses the observed outcome as a function of treatment and both potential outcomes:
$$ Y_i^{obs} = A_i \cdot Y_i(1) + (1 - A_i) \cdot Y_i(0). $$
Each person carries two potential outcomes, but we observe only the one selected by their actual treatment. For treated individuals ($A_i = 1$), the switching equation reduces to:
$$ Y_i^{obs} = 1 \cdot Y_i(1) + 0 \cdot Y_i(0) = Y_i(1). $$
For untreated individuals ($A_i = 0$):
$$ Y_i^{obs} = 0 \cdot Y_i(1) + 1 \cdot Y_i(0) = Y_i(0). $$
In short:
$$ Y_i = Y_i(1) \quad \text{if } A_i = 1; \qquad Y_i = Y_i(0) \quad \text{if } A_i = 0. $$
Consistency subsumes two conditions that are sometimes stated separately. No interference requires that one person's treatment does not affect another person's outcome. Treatment-version irrelevance requires that there is effectively only one version of each treatment level, so that any variation in how the treatment is delivered makes no difference to the outcome. Both are needed for $Y(a)$ to be well defined: if versions of a treatment have different effects, or if one person's outcome depends on other people's treatment, $Y(a)$ does not pick out a single quantity. In practice, consistency fails most often when treatment versions are mixed under one label. If "mindfulness practice" includes different apps, dosages, and start dates under the same label, $Y(1)$ does not refer to one intervention.
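A small numerical sketch may help show why mixed versions are a problem. The baseline mean, the two version names, and their effects are all hypothetical; the point is only that the "treated" mean then depends on the mix of versions rather than on one intervention.

```python
# Hypothetical numbers: mean anxiety without any app, and the effect of two
# different interventions that both get recorded as A = 1.
BASELINE_ANXIETY = 20.0
VERSION_EFFECT = {"daily_guided_app": -4.0, "one_off_breathing": -1.0}


def treated_mean(version_shares):
    """Mean outcome among the 'treated' when the label mixes several versions."""
    if abs(sum(version_shares.values()) - 1.0) > 1e-9:
        raise ValueError("version shares must sum to 1")
    return sum(
        share * (BASELINE_ANXIETY + VERSION_EFFECT[version])
        for version, share in version_shares.items()
    )


# Same label, different mixes of versions on two hypothetical campuses.
campus_a = treated_mean({"daily_guided_app": 0.9, "one_off_breathing": 0.1})
campus_b = treated_mean({"daily_guided_app": 0.2, "one_off_breathing": 0.8})

print(round(campus_a, 2), round(campus_b, 2))  # 16.3 18.4
```

Under these assumptions the "effect of $A=1$" differs across campuses even though no individual effect has changed, which is what it means for $Y(1)$ to be ill-defined.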
Assumption 2: exchangeability
The conditional probability of receiving every value of an exposure level, though not decided by investigators, depends only on the measured covariates ((chatton_g-computation_2020?); (hernan_causal_2023?)).
In a randomised trial, exchangeability holds unconditionally:
$$ Y(a) \coprod A. $$
In observational data, we require conditional exchangeability. For each $a$:
$$ Y(a) \coprod A \mid L, $$
where $L$ is the set of covariates sufficient to ensure the independence of the counterfactual outcomes and the exposure. Equivalently, $A \coprod Y(a) \mid L$. When this condition holds, counterfactual outcomes are independent of actual exposures received, conditional on $L$.
Conditional exchangeability is also known as the no-unmeasured-confounding assumption. It cannot be verified from observed data; it can only be defended by subject-matter knowledge and a plausible DAG.
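The contrast between randomised and self-selected treatment can be illustrated with a short simulation. This is a sketch under stated assumptions, not an analysis of real data: baseline distress is a binary common cause, the true effect of the app is fixed at $-2$ for everyone, and the selection probabilities are invented.

```python
import random
from statistics import mean


def simulate(n, randomised, seed=0):
    """Return the naive difference in observed means, treated minus untreated."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        # L: baseline distress, a cause of both treatment uptake and outcome.
        distress = 1 if rng.random() < 0.5 else 0
        y0 = 15.0 + 6.0 * distress + rng.gauss(0, 2)  # outcome without the app
        y1 = y0 - 2.0                                 # true effect: -2 for all
        if randomised:
            p_treat = 0.5                             # coin flip, ignores L
        else:
            p_treat = 0.2 if distress else 0.8        # less distressed opt in
        a = rng.random() < p_treat
        # Only the potential outcome for the treatment received is recorded.
        (treated if a else untreated).append(y1 if a else y0)
    return mean(treated) - mean(untreated)


rct_estimate = simulate(20_000, randomised=True)
observational_estimate = simulate(20_000, randomised=False)
print(round(rct_estimate, 2), round(observational_estimate, 2))
```

With these settings the randomised contrast should land near the true effect of $-2$, while the self-selected contrast should land near $-5.6$, because the comparison groups differ in baseline distress.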
Assumption 3: positivity
The probability of receiving every value of the exposure within all strata of covariates is greater than zero ((hernan_causal_2023?); (westreich_invited_2010?)):
$$ 0 < P(A=a \mid L=l) < 1, \quad \forall\, a \in \mathcal{A},\; \forall\, l \in \mathcal{L}. $$
There are two types of positivity violation.
Random non-positivity occurs when the sample data do not contain all levels of exposure within strata for whom exposures are defined. For example, if no participants aged 22–24 received treatment, investigators must extrapolate from other ages. Random non-positivity can be addressed by modelling assumptions, but those assumptions carry their own risks.
Deterministic non-positivity occurs when it is scientifically impossible for certain strata to receive specific levels of exposure. For example, biological males cannot receive hysterectomy. Deterministic violations require restricting the analysis to scientifically plausible cases.
Positivity is the one identification assumption we can partially check empirically. Plot propensity score distributions and look for gaps or near-zero densities.
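For a single discrete covariate, the same check can be done by tabulating the share treated in each stratum. The sketch below uses hypothetical records and hypothetical helper names; it flags strata where one treatment level is absent in the sample, and cannot by itself tell random from deterministic non-positivity.

```python
from collections import defaultdict

# Hypothetical records: (age band, treatment indicator A).
records = [
    ("18-21", 1), ("18-21", 0), ("18-21", 1),
    ("22-24", 0), ("22-24", 0),
    ("25+", 1), ("25+", 0),
]


def treatment_share_by_stratum(records):
    """Estimate P(A = 1 | L = l) as the share treated within each stratum."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n untreated, n treated]
    for stratum, a in records:
        counts[stratum][a] += 1
    return {s: n1 / (n0 + n1) for s, (n0, n1) in counts.items()}


def positivity_gaps(shares, eps=0.0):
    """Strata where the estimated share is within eps of 0 or 1."""
    return sorted(s for s, p in shares.items() if p <= eps or p >= 1 - eps)


shares = treatment_share_by_stratum(records)
gaps = positivity_gaps(shares)
print(shares)
print("strata lacking one treatment level:", gaps)  # ['22-24']
```

Setting `eps` above zero (for example 0.05) would also flag near-violations, where one treatment level is present but rare.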
What each assumption buys us
- Consistency links one observed outcome to one potential outcome. It is what connects counterfactual quantities to data.
- Exchangeability lets observed outcomes in one group stand in for the missing counterfactual outcomes in the other.
- Positivity ensures that the needed comparison exists in every relevant subgroup.
How the assumptions recover population contrasts
Start with the easiest case: a randomised trial, where exchangeability holds without conditioning. The assumptions work in sequence:
$$ \begin{aligned} \underbrace{\mathbb{E}[Y(1)]}_{\color{blue}{\text{everyone treated}}} &\overset{\text{exchangeability}}{=} \underbrace{\mathbb{E}[Y(1)\mid A=1]}_{\color{blue}{\text{treated arm}}} \overset{\text{consistency}}{=} \underbrace{\mathbb{E}[Y \mid A=1]}_{\color{teal}{\text{observed treated mean}}}, \\ \underbrace{\mathbb{E}[Y(0)]}_{\color{red}{\text{everyone untreated}}} &\overset{\text{exchangeability}}{=} \underbrace{\mathbb{E}[Y(0)\mid A=0]}_{\color{red}{\text{control arm}}} \overset{\text{consistency}}{=} \underbrace{\mathbb{E}[Y \mid A=0]}_{\color{orange}{\text{observed untreated mean}}}. \end{aligned} $$
So the ATE becomes
$$ \text{ATE} = \underbrace{\mathbb{E}[Y(1)]}_{\color{blue}{\text{counterfactual treated mean}}} - \underbrace{\mathbb{E}[Y(0)]}_{\color{red}{\text{counterfactual untreated mean}}} = \underbrace{\mathbb{E}[Y \mid A=1]}_{\color{teal}{\text{observed treated mean}}} - \underbrace{\mathbb{E}[Y \mid A=0]}_{\color{orange}{\text{observed untreated mean}}}. $$
This is the key identification move. We replace missing counterfactual averages with observed group averages.
In observational data, the same logic works only after conditioning on a sufficient set $L$. Positivity then ensures that each relevant stratum contains both treated and untreated individuals, so those adjusted comparisons are estimable.
Pair exercise: tracing the identification logic
- Your partner claims "students who chose the mindfulness app had lower anxiety, therefore the app works."
- Walk through each identification assumption in turn. Where does the reasoning break?
- Check consistency: were all students labelled $A = 1$ receiving the same intervention?
- Check exchangeability: is $Y(a) \coprod A$, or could the students who chose the app differ systematically from those who did not?
- Check positivity: are there covariate strata where no students used (or declined) the app?
- State which assumption is most plausibly violated and why.
The observational-data version
Assume consistency, exchangeability given $L$, and positivity. Then:
$$ \mathbb{E}[Y(a)] = \sum_l \mathbb{E}[Y \mid A=a, L=l]P(L=l). $$
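The steps behind this identity can be written out, assuming $L$ is discrete so that the sum is over its levels. Each line uses one ingredient:

$$
\begin{aligned}
\mathbb{E}[Y(a)] &= \sum_l \mathbb{E}[Y(a) \mid L=l]\,P(L=l) && \text{(law of total expectation)} \\
&= \sum_l \mathbb{E}[Y(a) \mid A=a, L=l]\,P(L=l) && \text{(conditional exchangeability)} \\
&= \sum_l \mathbb{E}[Y \mid A=a, L=l]\,P(L=l) && \text{(consistency)}.
\end{aligned}
$$

Positivity is what makes the second line legitimate: conditioning on $A=a$ within stratum $l$ is only defined when $P(A=a \mid L=l) > 0$.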
So the ATE is identified by standardisation:
$$ \text{ATE}=\sum_l \underbrace{\Big(\mathbb{E}[Y \mid A=1,L=l]-\mathbb{E}[Y \mid A=0,L=l]\Big)}_{\color{teal}{\text{within-stratum observed contrast}}} \underbrace{P(L=l)}_{\color{blue}{\text{stratum weight}}}. $$
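A minimal sketch of standardisation with one binary covariate follows. The stratum counts and means are hypothetical summary data, and the function names are illustrative rather than taken from any package; the example assumes $L$ (baseline distress) is sufficient for conditional exchangeability.

```python
# Hypothetical summary data by baseline distress stratum L.
# stratum -> (n treated, mean Y among treated, n untreated, mean Y among untreated)
strata = {
    "low_distress": (80, 13.0, 20, 15.0),
    "high_distress": (20, 19.0, 80, 21.0),
}


def standardised_ate(strata):
    """Weight each within-stratum contrast by the stratum's share, P(L = l)."""
    n_total = sum(n1 + n0 for n1, _, n0, _ in strata.values())
    ate = 0.0
    for n1, y1_bar, n0, y0_bar in strata.values():
        if n1 == 0 or n0 == 0:
            # Positivity fails here: there is no observed contrast to weight.
            raise ValueError("a stratum lacks treated or untreated individuals")
        ate += (y1_bar - y0_bar) * (n1 + n0) / n_total
    return ate


def naive_difference(strata):
    """Unadjusted difference in observed means, ignoring L."""
    n1_total = sum(n1 for n1, _, _, _ in strata.values())
    n0_total = sum(n0 for _, _, n0, _ in strata.values())
    treated = sum(n1 * y1 for n1, y1, _, _ in strata.values()) / n1_total
    untreated = sum(n0 * y0 for _, _, n0, y0 in strata.values()) / n0_total
    return treated - untreated


print(standardised_ate(strata))            # -2.0
print(round(naive_difference(strata), 2))  # -5.6
```

In this made-up example the within-stratum contrast is $-2$ in both strata, so the standardised estimate is $-2$, while the unadjusted contrast is $-5.6$ because low-distress students are over-represented among the treated.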
What we can check, and what we cannot
Positivity is the only assumption we can directly inspect with data. If some covariate strata contain no treated (or no untreated) individuals, the contrast for those strata relies on model extrapolation rather than observed comparisons.
Consistency requires that "treatment" means the same thing for everyone labelled $A = 1$. In the mindfulness example, beginning a guided app at semester start, trying one unguided breathing exercise in week 5, and attending a group class irregularly are different interventions. A well-specified target trial defines treatment precisely enough that consistency is defensible.
Exchangeability cannot be verified from observed data. We can check whether measured covariates are balanced after adjustment, but we cannot test whether unmeasured common causes remain. This is why the DAG matters: it forces investigators to state which variables they believe are sufficient and to defend that belief with subject-matter knowledge.
Design and subject-matter knowledge are not optional extras. They are what makes identification assumptions assessable.
Quick diagnostic
- If treatment is vague, worry about consistency.
- If treated and untreated people differ in causes of the outcome, worry about exchangeability.
- If one treatment level barely occurs in some subgroup, worry about positivity.
Return to the opening example
The mindfulness discrepancy illustrates what happens when investigators fail to emulate a target trial. If "users" are defined as students who have already adopted the app, the design mixes recent starters with persistent users who may differ in motivation, baseline distress, and help-seeking. That makes the exposed group look healthier than a true start-of-intervention comparison would justify.
The lesson is that design comes before estimation. If the hypothetical trial is not specified, the identifying assumptions are hard to interpret and even harder to defend. We now have the tools to identify and estimate an average causal contrast for a defined population. Week 6 asks the next question: does that contrast vary across subgroups?
Pair exercise: designing a target trial
- Your intervention is daily meditation (20 minutes). Your outcome is anxiety symptoms at 6 months. Your target population is university students.
- State the causal estimand precisely: what are the two contrast conditions?
- Define time zero (when does follow-up begin?).
- Name two baseline covariates you would adjust for, and give a causal rationale for each (draw a short DAG if it helps).
- Describe one plausible positivity failure in this setting (a subgroup where one side of the contrast is effectively empty).
Lab materials: Lab 5: Average Treatment Effects
The causal workflow
The identification assumptions are not items on a checklist to be ticked off after analysis. They are design commitments that must be defended before estimation begins. The following workflow organises these commitments into a sequence. Each step depends on the ones before it. Also see the course causal workflow reference page.
Step 0: state a well-defined treatment. Specify the hypothetical intervention precisely enough that every member of the target population could, in principle, receive it. "Mindfulness" is too vague: people meditate with different apps, in groups, alone, once, or every day. A clearer intervention is: "start a guided mindfulness app at the beginning of semester and complete one 10-minute session per day for eight weeks." Precision here underwrites consistency and makes the timeline visible.
Step 1: establish time zero. Define the point at which treatment assignment begins and follow-up starts. We cannot do this until the treatment is specified, because time zero is the moment that intervention becomes assigned or initiated. In the mindfulness example, time zero is the beginning of semester when students start the app or are assigned not to start it. Without a clear time zero, consistency is undermined and exchangeability is hard to assess.
Step 2: state a well-defined outcome. Define the outcome so the causal contrast is meaningful and temporally anchored. "Wellbeing" is underspecified; "psychological distress one year post-intervention measured with the Kessler-6" is interpretable and reproducible. Include timing, scale, and instrument.
Step 3: clarify the target population. Say exactly who you aim to inform. Eligibility rules define the source population, but sampling and participation can yield an analytic sample with a different distribution of effect modifiers (Bulbulia, 2024). If you intend to generalise beyond the source population (transport), articulate the additional conditions required.
Step 4: evaluate exchangeability. Make the case that potential outcomes are independent of treatment conditional on covariates: $Y(a) \coprod A \mid L$ (Hernán & Robins, 2020). Use design and diagnostics: DAGs, subject-matter arguments, pre-treatment covariate balance, and overlap checks. If exchangeability is doubtful, redesign rather than rely solely on modelling.
Step 5: ensure causal consistency. Consistency requires that, for units receiving a treatment version compatible with level $a$, the observed outcome equals $Y(a)$. It also presumes well-defined versions and no interference between units (VanderWeele & Hernán, 2013; Hernán & Robins, 2020). When multiple versions exist, either refine the intervention so versions are irrelevant to $Y(a)$, or condition on version-defining covariates.
Step 6: check positivity. Each treatment level must occur with non-zero probability at every covariate profile needed for exchangeability ((westreich_invited_2010?)). Diagnose limited overlap using propensity score distributions and extreme weights. Consider design-stage remedies (trimming, restriction) before estimation.
Step 7: ensure measurement aligns with the scientific question. Be explicit about probable forms of measurement error (classical, Berkson, differential, misclassification) and their structural implications for bias (Hernán & Robins, 2020; Bulbulia, 2024). Where feasible, incorporate validation studies or calibration models.
Step 8: preserve representativeness from start to finish. Differential attrition, non-response, or measurement processes tied to treatment and outcomes can induce selection bias in the presence of true effects (Hernán, 2017; Bulbulia, 2024). Plan strategies such as inverse probability weighting for censoring, multiple imputation under defensible mechanisms, and sensitivity analyses for data missing not at random.
Step 9: document the reasoning that supports steps 0–8. Make assumptions, disagreements, and judgement calls legible. Register or time-stamp the analytic plan. Include identification arguments, code, and data where possible. Report robustness and sensitivity analyses. Transparent reasoning is a scientific result in its own right (Ogburn & Shpitser, 2021).
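The extreme-weight diagnostic mentioned in Step 6 can be sketched as follows. The records are hypothetical, the propensity is estimated crudely as the cell share within each stratum, and the effective sample size uses Kish's approximation computed over the whole sample; a real analysis would normally use a fitted propensity model and examine each treatment arm.

```python
from collections import Counter

# Hypothetical (stratum, treatment) records; the "high" stratum has very few
# treated students, so overlap is limited there.
records = ([("low", 1)] * 40 + [("low", 0)] * 40
           + [("high", 1)] * 1 + [("high", 0)] * 19)


def ip_weights(records):
    """Inverse probability weights 1 / P(A = a | L = l), from cell shares."""
    n_stratum = Counter(stratum for stratum, _ in records)
    n_cell = Counter(records)
    return [n_stratum[s] / n_cell[(s, a)] for s, a in records]


def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / sum of squared weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


weights = ip_weights(records)
ess = effective_sample_size(weights)
print("largest weight:", max(weights))             # 20.0
print("effective sample size:", round(ess, 1))     # about 54 of 100 records
```

Here a single treated student in the "high" stratum carries a weight of 20, and the effective sample size falls to roughly half the number of records. That is the kind of signal that might prompt restriction or trimming at the design stage.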
Reporting: the TARGET checklist
The TARGET (Transparent Reporting of Observational Studies Emulating a Target Trial) statement (Cashin et al., 2025) provides a structured checklist for reporting studies that emulate a target trial. It maps directly onto the causal workflow above. Use it when writing up results.
| No. | Item | Workflow step |
|---|---|---|
| Title and Abstract | ||
| 1 | Identify that the study emulates a target trial; state objectives | Steps 0–1 |
| 2 | Report the data source(s) | Step 3 |
| 3 | Summarise assumptions, methods, findings, conclusions | Steps 4–9 |
| Introduction | ||
| 4 | Describe scientific background and gap | Motivation |
| 5 | Summarise the causal question | Steps 0–2 |
| 6 | Describe rationale for emulating a target trial | Steps 0–1 |
| Methods | ||
| 7 | Cite data sources; describe purpose, type, setting, time period | Step 3 |
| 8a | Eligibility criteria and operationalisation | Step 3 |
| 8b | Treatment strategies and operationalisation | Step 0 |
| 8c | Assignment procedures and operationalisation | Step 4 |
| 8d | Follow-up: starts at assignment; describe operationalisation | Step 1 |
| 8e | Outcomes and operationalisation | Step 2 |
| 8f | Causal contrasts and operationalisation | Steps 0–2 |
| 8g | Identifying assumptions; describe related variables | Steps 4–6 |
| 8h | Data analysis procedures for each causal estimand | Steps 4–8 |
| 8i | Additional analyses for each causal estimand | Step 9 |
| Results | ||
| 9 | Numbers assessed, eligible, and assigned | Step 3 |
| 10 | Baseline characteristics by treatment strategy | Step 4 |
| 11 | Follow-up length and reasons for end of follow-up | Steps 1, 8 |
| 12 | Missing data frequency by variable | Step 8 |
| 13 | Outcome frequency or distribution at each wave | Step 2 |
| 14 | Effect estimates with measures of precision | Steps 4–6 |
| 15 | Sensitivity of estimates to choices and assumptions | Step 9 |
| Discussion | ||
| 16 | Interpretation of key findings | — |
| 17 | Limitations, including differences between target trial and emulation | Steps 4–8 |
| Other Information | ||
| 18 | Ethics approval | — |
| 19 | Registration | Step 9 |
| 20 | Data and code availability | Step 9 |
| 21 | Funding sources | — |
| 22 | Conflicts of interest | — |
Appendix A: notation variants
Equivalent notations for the individual contrast include
$$ Y_i^{1} - Y_i^{0} $$
and
$$ Y_i(a=1) - Y_i(a=0). $$
Bulbulia, J. A. (2024). Methods in causal inference part 3: Measurement error and external validity threats. Evolutionary Human Sciences, 6, e42. https://doi.org/10.1017/ehs.2024.33
Cashin, A. G., Hansford, H. J., Hernán, M. A., et al. (2025). Transparent reporting of observational studies emulating a target trial—the TARGET statement. JAMA, 334(12), 1084–1093. https://doi.org/10.1001/jama.2025.13350
Hernán, M. A., & Robins, J. M. (2020). Causal inference: What if? Taylor & Francis. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Hernán, M. A. (2017). Invited commentary: Selection bias without colliders. American Journal of Epidemiology, 185(11), 1048–1050. https://doi.org/10.1093/aje/kwx077
Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (Draft). https://www.bradyneal.com/Introduction_to_Causal_Inference-Dec17_2020-Neal.pdf
Ogburn, E. L., & Shpitser, I. (2021). Causal modelling: The two cultures. Observational Studies, 7(1), 179–183. https://doi.org/10.1353/obs.2021.0006
VanderWeele, T. J., & Hernán, M. A. (2013). Causal inference under multiple versions of treatment. Journal of Causal Inference, 1(1), 1–20.