Week 5: Causal Inference: Average Treatment Effects

Slides

Date: 25 Mar 2026

Key idea

Causal inference replaces unobservable counterfactual averages with observed averages. Three identification assumptions (consistency, exchangeability, and positivity) license that substitution. They are design commitments to defend before estimation, not boxes to tick afterwards.

Readings

Required

  • Hernán & Robins (2025), chapters 1-3.
  • Cashin et al. (2025). TARGET (Transparent Reporting of Observational Studies Emulating a Target Trial) statement.

Optional

  • Neal (2020), chapters 1-2.

Key concepts for the test(s)

  • The fundamental problem of causal inference
  • Average (marginal) treatment effects
  • Causal consistency
  • Exchangeability
  • Positivity

Lab

Use Lab 5: Average Treatment Effects for this week's hands-on work.

Terminology

In these notes, we use "potential outcomes" and "counterfactual outcomes" interchangeably.


Weeks 1 through 4 built a framework for asking causal questions and identifying the structural threats that obstruct answers: confounding, selection bias, and measurement error. Week 2 stated the three identification assumptions. This week shows how those assumptions, together with a well-specified target trial, connect a causal question to an estimable population contrast. Weeks 1 through 4 asked, "where can bias enter?" Week 5 asks, "what causal contrast do we want, and what assumptions let observed data stand in for the missing counterfactuals?"

Seminar

Opening example: one question, two different answers

Observational studies often suggest that students who choose to use a mindfulness app have lower anxiety than students who do not. Randomised trials usually find a smaller and less consistent average benefit once investigators standardise when the intervention begins, what counts as treatment, and when outcomes are measured.

Same substantive question. Different design. Different answer.

This week explains why.

A simple map for this week

This lecture has three moves.

Three moves

  1. Write the causal contrast we want.
  2. See why we cannot observe that contrast for one person.
  3. Use identification assumptions to recover a population average from observed data.

Potential outcomes and DAGs do different jobs. Potential outcomes define the causal contrast. DAGs help us judge whether exchangeability is plausible, which variables belong in $L$, and whether the design has a coherent time zero.

Move 1: state the causal question

Return to the mindfulness example. Let $A=1$ denote starting a guided mindfulness app at the beginning of semester and completing one 10-minute session per day for eight weeks. Let $A=0$ denote not starting the app. Let $Y$ be anxiety symptoms at week 8.

For student $i$, $Y_i(1)$ is the outcome under the app-based intervention, and $Y_i(0)$ the outcome without it. The individual causal effect is:

$$ Y_i(1)-Y_i(0). $$

We never observe both terms for one student at one time.

The fundamental problem of causal inference

The individual causal effect requires two quantities, but we can observe at most one. If student $i$ starts the app ($A_i = 1$), we observe $Y_i(1)$ but not $Y_i(0)$. If the student does not start it ($A_i = 0$), we observe $Y_i(0)$ but not $Y_i(1)$. The unobserved term is the counterfactual.

This is a structural feature of the physical world: a person cannot simultaneously exist under two incompatible conditions. No dataset, however large, contains both potential outcomes for one individual at one time. Individual causal effects are missing by necessity.

Pair exercise: building a potential outcomes table

  1. Consider four students in the mindfulness example. Construct a table with columns: person ($i$), $Y_i(1)$, $Y_i(0)$, $\delta_i = Y_i(1) - Y_i(0)$, treatment received ($A_i$), and observed outcome ($Y_i^{\mathrm{obs}}$).
  2. Assign plausible values: let two students receive $A = 1$ and two receive $A = 0$. Make the true individual effects vary (e.g., one positive, one negative, two zero).
  3. Compute the true ATE from all four $\delta_i$ values.
  4. Compute the naive difference in means: $\bar{Y}_{A=1}^{\mathrm{obs}} - \bar{Y}_{A=0}^{\mathrm{obs}}$.
  5. Do the two quantities match? Explain why the discrepancy arises (or why it does not).
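One possible worked version of this exercise is sketched below. The four students' potential outcomes are invented for illustration (anxiety scores, lower is better), and the field names `y1`, `y0`, `a`, `delta`, and `y_obs` are assumptions of the sketch, not course notation. In real data only `y_obs` and `a` would be available.

```python
# Hypothetical potential outcomes for four students (invented values).
students = [
    {"i": 1, "y1": 10, "y0": 14, "a": 1},  # app lowers anxiety (delta = -4)
    {"i": 2, "y1": 12, "y0": 12, "a": 1},  # no effect
    {"i": 3, "y1": 20, "y0": 20, "a": 0},  # no effect
    {"i": 4, "y1": 24, "y0": 22, "a": 0},  # app raises anxiety (delta = +2)
]

for s in students:
    s["delta"] = s["y1"] - s["y0"]                      # individual causal effect
    s["y_obs"] = s["y1"] if s["a"] == 1 else s["y0"]    # only one outcome is revealed

# True ATE: average of all four individual effects (needs both potential outcomes).
true_ate = sum(s["delta"] for s in students) / len(students)

# Naive contrast: difference in observed means between the two groups.
treated = [s["y_obs"] for s in students if s["a"] == 1]
untreated = [s["y_obs"] for s in students if s["a"] == 0]
naive = sum(treated) / len(treated) - sum(untreated) / len(untreated)

print(f"true ATE = {true_ate}, naive difference = {naive}")
```

With these invented values the true ATE is -0.5 while the naive difference is -10, because the students who took the app already had lower anxiety under either condition. The gap reflects who chose treatment, not what treatment did.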

Move 2: from individuals to populations

Because individual effects are unobservable, we target a population causal estimand: the quantity we want our analysis to estimate. The average treatment effect (ATE) is:

$$ \text{ATE}=\mathbb{E}[Y(1)-Y(0)]. $$

This is the mean contrast if everyone in the target population received $A=1$ versus $A=0$.

Move 3: connect the causal estimand to observed data

The three identification assumptions are easier to remember if you ask what each one contributes to the argument.

Assumption 1: causal consistency

If $A_i=a$, then $Y_i=Y_i(a)$. The observed outcome equals the potential outcome corresponding to the treatment actually received.

Data scientists estimate parameters for observed data. Causal inference goes further: we estimate contrasts involving counterfactual parameters. We compute the average response when the entire target population is exposed, then when the entire population is unexposed, then contrast these averages. Consistency is what allows us to bridge from counterfactual to observed. Without it, potential outcomes remain purely theoretical.

The general switching equation expresses the observed outcome as a function of treatment and both potential outcomes:

$$ Y_i^{\mathrm{obs}} = A_i \cdot Y_i(1) + (1 - A_i) \cdot Y_i(0). $$

Each person carries two potential outcomes, but we observe only the one selected by their actual treatment. For treated individuals ($A_i = 1$), the switching equation reduces to:

$$ Y_i^{\mathrm{obs}} = 1 \cdot Y_i(1) + 0 \cdot Y_i(0) = Y_i(1). $$

For untreated individuals ($A_i = 0$):

$$ Y_i^{\mathrm{obs}} = 0 \cdot Y_i(1) + 1 \cdot Y_i(0) = Y_i(0). $$

In short:

$$ Y_i = Y_i(1) \quad \text{if } A_i = 1; \qquad Y_i = Y_i(0) \quad \text{if } A_i = 0. $$
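The switching equation can be written as a one-line function. This is a minimal sketch for a binary treatment; the function name `observed_outcome` is hypothetical.

```python
def observed_outcome(a, y1, y0):
    """Switching equation for binary treatment: Y_obs = A*Y(1) + (1 - A)*Y(0)."""
    if a not in (0, 1):
        raise ValueError("this sketch assumes a binary treatment coded 0/1")
    return a * y1 + (1 - a) * y0

# A student with Y(1) = 10 and Y(0) = 14 reveals a different value
# depending on the treatment actually received.
print(observed_outcome(1, y1=10, y0=14))  # treated: reveals Y(1)
print(observed_outcome(0, y1=10, y0=14))  # untreated: reveals Y(0)
```

The function needs both potential outcomes as inputs, which is exactly what an analyst never has. It describes how the data are generated, not something that can be computed from them.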

Consistency subsumes two conditions that are sometimes stated separately. No interference requires that one person's treatment does not affect another person's outcome. Treatment-version irrelevance requires that there is only one version of each treatment level, or that any versions that exist do not differ in their effect on the outcome. Both guard against the same failure: if treatment versions are heterogeneous or if interference exists, the potential outcome $Y(a)$ is ill-defined. Consistency fails when treatment versions are mixed under one label. If "mindfulness practice" includes different apps, dosages, and start dates under the same label, $Y(1)$ does not refer to one intervention.

Assumption 2: exchangeability

Within levels of the measured covariates $L$, treatment is independent of the potential outcomes: once we know a person's $L$, their treatment status carries no further information about how they would respond under either exposure (Chatton et al., 2020; Hernán & Robins, 2025).

In a randomised trial, exchangeability holds unconditionally:

$$ Y(a) \coprod A. $$

In observational data, we require conditional exchangeability. For each $a$:

$$ Y(a) \coprod A \mid L, $$

where $L$ is the set of covariates sufficient to ensure the independence of the counterfactual outcomes and the exposure. Equivalently, $A \coprod Y(a) \mid L$. When this condition holds, counterfactual outcomes are independent of actual exposures received, conditional on $L$.

Conditional exchangeability is also known as the no-unmeasured-confounding assumption. It cannot be verified from observed data. It can only be defended by subject-matter knowledge and a plausible DAG.

Assumption 3: positivity

Every exposure level has a non-zero probability of being received within every stratum of the covariates (Hernán & Robins, 2025; Westreich & Cole, 2010):

$$ 0 < P(A=a \mid L=l) < 1, \quad \forall\, a \in \mathcal{A},\; \forall\, l \in \mathcal{L}. $$

There are two types of positivity violation.

Random non-positivity occurs when the sample data do not contain all levels of exposure within strata for which exposures are defined. For example, if no participants aged 22–24 received treatment, investigators must extrapolate from other ages. Random non-positivity can be addressed by modelling assumptions, but those assumptions carry their own risks.

Deterministic non-positivity occurs when it is scientifically impossible for certain strata to receive specific levels of exposure. For example, biological males cannot receive hysterectomy. Deterministic violations require restricting the analysis to scientifically plausible cases.

Positivity is the one identification assumption we can partially check empirically. The propensity score is the conditional probability of receiving treatment given covariates, $P(A=1 \mid L)$. Plot propensity score distributions and look for gaps or near-zero densities.
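With a single discrete covariate, the estimated propensity score is just the proportion treated in each stratum, so an empirical positivity check can be sketched in a few lines. The records, the age bands, the function names, and the `eps=0.05` threshold below are all assumptions made for illustration. With many or continuous covariates one would instead fit a model for $P(A=1 \mid L)$ and plot its fitted values by treatment group.

```python
from collections import defaultdict

def stratum_propensities(records):
    """Return {stratum: empirical P(A=1 | L=stratum)} from (stratum, a) pairs."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_untreated, n_treated]
    for stratum, a in records:
        counts[stratum][a] += 1
    return {s: n1 / (n0 + n1) for s, (n0, n1) in counts.items()}

def positivity_flags(propensities, eps=0.05):
    """Strata whose estimated propensity lies within eps of 0 or 1."""
    return sorted(s for s, p in propensities.items() if p < eps or p > 1 - eps)

# Hypothetical sample: (age band, app use). Nobody aged 22-24 is treated.
records = (
    [("18-21", 1)] * 30 + [("18-21", 0)] * 70
    + [("22-24", 0)] * 40
    + [("25+", 1)] * 10 + [("25+", 0)] * 10
)

props = stratum_propensities(records)
flagged = positivity_flags(props)
print(props)
print("strata with poor overlap:", flagged)
```

A flagged stratum shows only that the sample lacks one side of the comparison. Whether that is random or deterministic non-positivity is a subject-matter judgement the data cannot make.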

What each assumption buys us

  • Consistency links one observed outcome to one potential outcome. It is what connects counterfactual quantities to data.
  • Exchangeability lets observed outcomes in one group stand in for the missing counterfactual outcomes in the other.
  • Positivity ensures that the needed comparison exists in every relevant subgroup.

How the assumptions recover population contrasts

Start with the easiest case: a randomised trial, where exchangeability holds without conditioning. The assumptions work in sequence:

$$ \begin{aligned} \underbrace{\mathbb{E}[Y(1)]}_{\textcolor{blue}{\text{everyone treated}}} &\overset{\text{exchangeability}}{=} \underbrace{\mathbb{E}[Y(1)\mid A=1]}_{\textcolor{blue}{\text{treated arm}}} \overset{\text{consistency}}{=} \underbrace{\mathbb{E}[Y \mid A=1]}_{\textcolor{teal}{\text{observed treated mean}}}, \newline \underbrace{\mathbb{E}[Y(0)]}_{\textcolor{red}{\text{everyone untreated}}} &\overset{\text{exchangeability}}{=} \underbrace{\mathbb{E}[Y(0)\mid A=0]}_{\textcolor{red}{\text{control arm}}} \overset{\text{consistency}}{=} \underbrace{\mathbb{E}[Y \mid A=0]}_{\textcolor{orange}{\text{observed untreated mean}}}. \end{aligned} $$

So the ATE becomes

$$ \begin{aligned} \text{ATE} &= \underbrace{\mathbb{E}[Y(1)]}_{\textcolor{blue}{\text{counterfactual treated mean}}} {}- \underbrace{\mathbb{E}[Y(0)]}_{\textcolor{red}{\text{counterfactual untreated mean}}} \newline &= \underbrace{\mathbb{E}[Y \mid A=1]}_{\textcolor{teal}{\text{observed treated mean}}} {}- \underbrace{\mathbb{E}[Y \mid A=0]}_{\textcolor{orange}{\text{observed untreated mean}}}. \end{aligned} $$

This is the key identification move. We replace missing counterfactual averages with observed group averages.
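A small simulation can make the randomised-trial argument concrete. In the sketch below both potential outcomes are generated for every student, which is only possible in a simulation. Treatment is then assigned by coin flip, and the observed difference in means is compared with the true ATE. The sample size, distributions, effect size, and seed are arbitrary assumptions.

```python
import random

rng = random.Random(2026)  # fixed seed so the sketch is reproducible
n = 20000

# Simulate both potential outcomes for every student (never possible with real data).
y0 = [rng.gauss(20, 5) for _ in range(n)]        # anxiety without the app
y1 = [y + rng.gauss(-2, 1) for y in y0]          # anxiety with the app
true_ate = sum(t - u for u, t in zip(y0, y1)) / n

# Randomisation: A is independent of (Y(0), Y(1)) by design, so exchangeability holds.
a = [1 if rng.random() < 0.5 else 0 for _ in range(n)]

# Consistency: each student reveals only the potential outcome for the arm received.
y_obs = [y1[i] if a[i] == 1 else y0[i] for i in range(n)]

treated = [y_obs[i] for i in range(n) if a[i] == 1]
control = [y_obs[i] for i in range(n) if a[i] == 0]
diff_in_means = sum(treated) / len(treated) - sum(control) / len(control)

print(f"true ATE       = {true_ate:.3f}")
print(f"observed diff  = {diff_in_means:.3f}")
```

The two printed numbers should agree up to sampling error, even though the second uses only one potential outcome per student.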

In observational data, the same logic works only after conditioning on a sufficient set $L$. Positivity then ensures that each relevant stratum contains both treated and untreated individuals, so those adjusted comparisons are estimable.

Pair exercise: tracing the identification logic

  1. Your partner claims "students who chose the mindfulness app had lower anxiety, therefore the app works."
  2. Walk through each identification assumption in turn. Where does the reasoning break?
  3. Check consistency: were all students labelled $A = 1$ receiving the same intervention?
  4. Check exchangeability: is $Y(a) \coprod A$, or could the students who chose the app differ systematically from those who did not?
  5. Check positivity: are there covariate strata where no students used (or declined) the app?
  6. State which assumption is most plausibly violated and why.

The observational-data version

Assume consistency, exchangeability given $L$, and positivity. Then:

$$ \mathbb{E}[Y(a)] = \sum_l \mathbb{E}[Y \mid A=a, L=l]P(L=l). $$
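The equality can be unpacked one assumption at a time. The following derivation is a sketch for a discrete $L$, using only the three assumptions stated above and the law of total expectation:

$$ \begin{aligned} \mathbb{E}[Y(a)] &= \sum_l \mathbb{E}[Y(a) \mid L=l]\,P(L=l) && \text{(law of total expectation)} \\ &= \sum_l \mathbb{E}[Y(a) \mid A=a, L=l]\,P(L=l) && \text{(conditional exchangeability)} \\ &= \sum_l \mathbb{E}[Y \mid A=a, L=l]\,P(L=l) && \text{(consistency)}. \end{aligned} $$

Positivity enters at the second line: conditioning on $A=a, L=l$ is only meaningful if that combination occurs with non-zero probability in every stratum with $P(L=l) > 0$.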

So the ATE is identified by standardisation:

$$ \text{ATE}=\sum_l \underbrace{\Big(\mathbb{E}[Y \mid A=1,L=l]-\mathbb{E}[Y \mid A=0,L=l]\Big)}_{\textcolor{teal}{\text{within-stratum observed contrast}}} \underbrace{P(L=l)}_{\textcolor{blue}{\text{stratum weight}}}. $$
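The standardisation formula can be applied directly when $L$ is discrete. The sketch below simulates a confounded observational study in which a binary covariate (read it as high baseline distress) lowers the chance of using the app and raises anxiety. The data-generating values, including the true effect of -2, are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(5)  # fixed seed so the sketch is reproducible
rows = []
for _ in range(20000):
    stratum = 1 if rng.random() < 0.5 else 0      # L: high baseline distress
    p_treat = 0.2 if stratum == 1 else 0.7        # distressed students use the app less
    a = 1 if rng.random() < p_treat else 0
    y = 15 + 8 * stratum - 2 * a + rng.gauss(0, 3)  # true effect of A is -2
    rows.append((stratum, a, y))

def mean(xs):
    return sum(xs) / len(xs)

# Naive contrast ignores L and mixes the effect of A with the effect of L.
naive = mean([y for _, a, y in rows if a == 1]) - mean([y for _, a, y in rows if a == 0])

# Standardisation: within-stratum contrasts weighted by P(L = l).
cells = defaultdict(list)
for stratum, a, y in rows:
    cells[(stratum, a)].append(y)

standardised = 0.0
for stratum in sorted({s for s, _, _ in rows}):
    weight = sum(1 for r in rows if r[0] == stratum) / len(rows)
    contrast = mean(cells[(stratum, 1)]) - mean(cells[(stratum, 0)])
    standardised += contrast * weight

print(f"naive        = {naive:.2f}")
print(f"standardised = {standardised:.2f}")
```

The naive contrast overstates the benefit because app users are disproportionately low-distress students, while the standardised estimate lands near the simulated effect. The result depends on $L$ being a sufficient adjustment set, which holds here only because the simulation was built that way.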

What we can check, and what we cannot

Positivity is the only assumption we can directly inspect with data. If some covariate strata contain no treated (or no untreated) individuals, the contrast for those strata relies on model extrapolation rather than observed comparisons.

Consistency requires that "treatment" means the same thing for everyone labelled $A = 1$. In the mindfulness example, beginning a guided app at semester start, trying one unguided breathing exercise in week 5, and attending a group class irregularly are different interventions. A well-specified target trial defines treatment precisely enough that consistency is defensible.

Exchangeability cannot be verified from observed data. We can check whether measured covariates are balanced after adjustment, but we cannot test whether unmeasured common causes remain. This is why the DAG matters: it forces investigators to state which variables they believe are sufficient and to defend that belief with subject-matter knowledge.

Design and subject-matter knowledge are not optional extras. They are what makes identification assumptions assessable.

Quick diagnostic

  • If treatment is vague, worry about consistency.
  • If treated and untreated people differ in causes of the outcome, worry about exchangeability.
  • If one treatment level barely occurs in some subgroup, worry about positivity.

Pair exercise: designing a target trial

  1. Your intervention is daily meditation (20 minutes). Your outcome is anxiety symptoms at 6 months. Your target population is university students.
  2. State the causal estimand precisely: what are the two contrast conditions?
  3. Define time zero (when does follow-up begin?).
  4. Name two baseline covariates you would adjust for, and give a causal rationale for each (draw a short DAG if it helps).
  5. Describe one plausible positivity failure in this setting (a subgroup where one side of the contrast is effectively empty).

The causal workflow

The identification assumptions are not items on a checklist to be ticked off after analysis. They are design commitments that must be defended before estimation begins. The following workflow organises these commitments into a sequence. Each step depends on the ones before it. Also see the course causal workflow reference page.

Step 0: define the target population. Say exactly who the answer is meant to inform before choosing the exposure or outcomes. "University students" may be too broad if the intervention is a guided app that only some students could realistically use. The population choice shapes which treatment versions are coherent, which outcomes matter, and where positivity may fail.

Step 1: state a well-defined treatment. Specify the hypothetical intervention precisely enough that every member of the target population could, in principle, receive it. "Mindfulness" is too vague: people meditate with different apps, in groups, alone, once, or every day. A clearer intervention is: "start a guided mindfulness app at the beginning of semester and complete one 10-minute session per day for eight weeks." Precision here underwrites consistency and makes the timeline visible.

Step 2: establish time zero. Define the point at which treatment assignment begins and follow-up starts. We cannot do this until the treatment is specified, because time zero is the moment that intervention becomes assigned or initiated. In the mindfulness example, time zero is the beginning of semester when students start the app or are assigned not to start it. Without a clear time zero, consistency is undermined and exchangeability is hard to assess.

Step 3: state a well-defined outcome. Define the outcome so the causal contrast is meaningful and temporally anchored. "Sense of Purpose" is underspecified; "psychological distress one year post-intervention measured with the Kessler-6" is interpretable and reproducible. Include timing, scale, and instrument.

Step 4: evaluate exchangeability. Make the case that potential outcomes are independent of treatment conditional on covariates: $Y(a) \coprod A \mid L$ (Hernán & Robins, 2025). Use design and diagnostics: DAGs, subject-matter arguments, pre-treatment covariate balance, and overlap checks. If exchangeability is doubtful, redesign rather than rely solely on modelling.

Step 5: ensure causal consistency. Consistency requires that, for units receiving a treatment version compatible with level $a$, the observed outcome equals $Y(a)$. It also presumes well-defined versions and no interference between units (VanderWeele & Hernán, 2013; Hernán & Robins, 2025). When multiple versions exist, either refine the intervention so versions are irrelevant to $Y(a)$, or condition on version-defining covariates.

Step 6: check positivity. Each treatment level must occur with non-zero probability at every covariate profile needed for exchangeability (Westreich & Cole, 2010). Diagnose limited overlap using propensity score distributions and extreme weights. Consider design-stage remedies (trimming, restriction) before estimation.
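The extreme-weight diagnostic in Step 6 can be sketched as follows. The propensity values are invented, the function name `ip_weight` is hypothetical, and the trimming bounds of 0.05 and 0.95 are an arbitrary assumption, not a course recommendation.

```python
def ip_weight(a, p):
    """Unstabilised inverse probability weight, where p = P(A=1 | L)."""
    if not 0 < p < 1:
        raise ValueError("propensity must lie strictly between 0 and 1")
    return 1 / p if a == 1 else 1 / (1 - p)

# Hypothetical (treatment received, estimated propensity) pairs.
units = [(1, 0.50), (0, 0.40), (1, 0.02), (0, 0.97), (1, 0.60)]

weights = [ip_weight(a, p) for a, p in units]
print("largest weight:", max(weights))  # one unit standing in for many signals poor overlap

# Restriction-style trimming: keep units whose propensity lies in [0.05, 0.95].
kept = [(a, p) for a, p in units if 0.05 <= p <= 0.95]
print("units kept after trimming:", len(kept))
```

Trimming changes the population the estimate describes, so it should be reported as a redefinition of the target population rather than as a purely technical fix.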

Step 7: ensure measurement aligns with the scientific question. Be explicit about probable forms of measurement error (classical, Berkson, differential, misclassification) and their structural implications for bias (Hernán & Robins, 2025; Bulbulia, 2024). Where feasible, incorporate validation studies or calibration models.

Step 8: preserve representativeness from start to finish. Differential attrition, non-response, or measurement processes tied to treatment and outcomes can induce selection bias in the presence of true effects (Hernán, 2017; Bulbulia, 2024). Plan strategies such as inverse probability weighting for censoring, multiple imputation under defensible mechanisms, and sensitivity analyses for data missing not at random.

Step 9: document the reasoning that supports steps 0–8. Make assumptions, disagreements, and judgement calls legible. Register or time-stamp the analytic plan. Include identification arguments, code, and data where possible. Report robustness and sensitivity analyses. Transparent reasoning is a scientific result in its own right (Ogburn & Shpitser, 2021).

A note on reporting standards

When a study that emulates a target trial is published, journals in epidemiology and medicine increasingly expect a thorough, standardised account of its design and assumptions. The TARGET statement (Transparent Reporting of Observational Studies Emulating a Target Trial; Cashin et al., 2025) is one such reporting format. It asks authors to state, point by point, the target population, the treatment strategies, time zero, the outcomes, the identification assumptions, and the sensitivity analyses. Each of those points corresponds to a step of the causal workflow above.

Read the TARGET statement once, to see the level of detail a professional causal analysis is held to. You are not asked to apply its full checklist in this course. For the final assessment, Option A reports follow the course research-report template and reporting guide, which ask for the same reasoning in a shorter, course-specific form.

Return to the opening example

The mindfulness discrepancy illustrates what happens when investigators fail to emulate a target trial. If "users" are defined as students who have already adopted the app, the design mixes recent starters with persistent users who may differ in motivation, baseline distress, and help-seeking. That makes the exposed group look healthier than a true start-of-intervention comparison would justify.

The lesson is that design comes before estimation. If the hypothetical trial is not specified, the identifying assumptions are hard to interpret and even harder to defend. We now have the tools to identify and estimate an average causal contrast for a defined population. Week 6 asks the next question: does that contrast vary across subgroups?

Lab materials: Lab 5: Average Treatment Effects


Appendix A: notation variants

Equivalent notations for the individual contrast include

$$ Y_i^{1} - Y_i^{0} $$

and

$$ Y_i(a=1) - Y_i(a=0). $$

Bulbulia, J. A. (2024). Methods in causal inference part 3: Measurement error and external validity threats. Evolutionary Human Sciences, 6, e42. https://doi.org/10.1017/ehs.2024.33

Cashin, A. G., Hansford, H. J., Hernán, M. A., Swanson, S. A., Lee, H., Jones, M. D., Dahabreh, I. J., Dickerman, B. A., Egger, M., Garcia-Albeniz, X., et al. (2025). Transparent reporting of observational studies emulating a target trial—the TARGET statement. JAMA, 334(12), 1084–1093. https://doi.org/10.1001/jama.2025.13350

Chatton, A., Le Borgne, F., Leyrat, C., Gillaizeau, F., Rousseau, C., Barbin, L., Laplaud, D., Léger, M., Giraudeau, B., & Foucher, Y. (2020). G-computation, propensity score-based methods, and targeted maximum likelihood estimator for causal inference with different covariates sets: a comparative simulation study. Scientific Reports, 10(1), 9219. https://doi.org/10.1038/s41598-020-65917-x

Hernán, M. A. (2017). Invited commentary: Selection bias without colliders. American Journal of Epidemiology, 185(11), 1048–1050. https://doi.org/10.1093/aje/kwx077

Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if. Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (Draft). https://www.bradyneal.com/Introduction_to_Causal_Inference-Dec17_2020-Neal.pdf

Ogburn, E. L., & Shpitser, I. (2021). Causal modelling: The two cultures. Observational Studies, 7(1), 179–183. https://doi.org/10.1353/obs.2021.0006

VanderWeele, T. J., & Hernán, M. A. (2013). Causal inference under multiple versions of treatment. Journal of Causal Inference, 1(1), 1–20.

Westreich, D., & Cole, S. R. (2010). Invited commentary: positivity in practice. American Journal of Epidemiology, 171(6). https://doi.org/10.1093/aje/kwp436