Week 5: Causal Inference: Average Treatment Effects

Date: 25 Mar 2026

Readings

Required

Hernan, M. A. & Robins, J. M. (2024). Causal Inference: What If. Chapters 1--3.

Optional

Neal, B. (2020). Introduction to Causal Inference. Chapters 1--2.

Key concepts for the test(s)

The Fundamental Problem of Causal Inference
Causal Inference in Randomised Experiments
Causal Inference in Observational Studies: Average (Marginal) Treatment Effects
Three Fundamental Assumptions for Causal Inference

Lab

For the lab, copy and paste code chunks following the "Part 2: Lab" section.

Terminology

In these notes, we use the terms "counterfactual outcomes" and "potential outcomes" interchangeably.

Seminar

Learning Outcomes

You will understand why causation is never directly observed.
You will understand how experiments address this "causal gap."
You will understand how applying three principles from experimental research allows human scientists to close this "causal gap" when making inferences about a population as a whole, that is, inferences about "marginal effects."

Opening

Robert Frost

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I --
I took the one less traveled by,
And that has made all the difference.

-- The Road Not Taken

Introduction: The Estrogen Therapy Paradox

Why observational data can mislead

In the 1980s and 1990s, observational studies found that postmenopausal women using estrogen replacement therapy had substantially lower all-cause mortality than non-users, with a hazard ratio of 0.68 (current users vs never-users). Major medical organisations endorsed estrogen therapy:

1992 American College of Physicians: "Women who are at increased risk of coronary heart disease are likely to benefit from hormone therapy."
1996 American Heart Association: "ERT does look promising as a long-term protection against heart attack."

The Women's Health Initiative (WHI) then ran a massive randomised, double-blind, placebo-controlled trial with 16,000 women. The result: a hazard ratio of 1.23 for initiators versus non-initiators. The treatment increased risk.

What went wrong? The observational studies were confounded. Women who chose estrogen therapy differed systematically from those who did not, in ways that also affected their health outcomes. The story illustrates why the three assumptions covered in this lecture (causal consistency, exchangeability, positivity) are not mere formalities.

Motivating Example

Consider the following cross-cultural question:

Does bilingualism improve cognitive abilities in children?

There is evidence that bilingual children perform better at cognitive tasks, but is this improvement truly caused by learning more than one language, or is it confounded by other factors (e.g., cultural environment, parental influence)? How can we know? Each child might answer, like the traveller in Frost's poem:

"And sorry I could not travel both / And be one traveler $\dots$ "

Part 1: The Fundamental Problem of Causal Inference as a Missing Data Problem

The fundamental problem of causal inference is that causality is never directly observed.

Let $Y$ and $A$ denote random variables.

We formulate a causal question by asking whether setting an exposure $A$ to level $A = a$ would lead to a different outcome $Y$ than would have occurred had the exposure been set to a different level, say $A = a^{'}$ . For simplicity, we imagine a binary exposure such that $A = 1$ denotes receiving the "bilingual" exposure and $A = 0$ denotes receiving the "monolingual" exposure. Assume these are the only two exposures of interest.

Let:

$Y_{i} (a = 1)$ denote the cognitive ability of child $i$ if the child were bilingual (potential outcome when $A_{i} = 1$ ).
$Y_{i} (a = 0)$ denote the cognitive ability of child $i$ if the child were monolingual (potential outcome when $A_{i} = 0$ ).

What does it mean to quantify a causal effect? We may define the individual-level causal effect of bilingualism on cognitive ability for child $i$ as the difference between two states of the world: one for which the child experiences a bilingual exposure and the other for which the child does not. We write this contrast by referring to the potential outcomes under different levels of exposure:

$\text{Causal Effect}_{i} = Y_{i} (1) - Y_{i} (0) .$

We say there is a causal effect of the bilingual exposure if

$Y_{i} (1) - Y_{i} (0) \neq 0.$

Because each child experiences only one exposure condition in reality, we cannot directly compute this difference from any dataset. The missing observation is called the counterfactual:

If $Y_{i} ∣ A_{i} = 1$ is observed, then $Y_{i} (0) ∣ A_{i} = 1$ is counterfactual.
If $Y_{i} ∣ A_{i} = 0$ is observed, then $Y_{i} (1) ∣ A_{i} = 0$ is counterfactual.

"And sorry I could not travel both / And be one traveler, long I stood $\dots$ "

In short, individuals cannot simultaneously experience both exposure conditions, so one outcome is inevitably missing.
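The missing-data framing can be made concrete with a small simulation. The sketch below is illustrative only: the scores, the effect size, and the function names (`simulate_children`, `observe`) are invented for this example and do not come from any real study. It generates both potential outcomes for each hypothetical child, then shows what a dataset would actually record.

```python
import random

def simulate_children(n, seed=434):
    """Simulate hypothetical children with BOTH potential outcomes (invented values)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        y0 = round(rng.gauss(100, 15), 1)    # Y_i(0): score if monolingual
        y1 = round(y0 + rng.gauss(2, 5), 1)  # Y_i(1): score if bilingual
        a = rng.randint(0, 1)                # exposure actually received
        rows.append({"id": i, "A": a, "Y0": y0, "Y1": y1})
    return rows

def observe(row):
    """Return what a dataset records: one outcome observed, the other missing."""
    return {
        "id": row["id"],
        "A": row["A"],
        "Y": row["Y1"] if row["A"] == 1 else row["Y0"],  # observed outcome
        "Y1": row["Y1"] if row["A"] == 1 else None,      # None marks the counterfactual
        "Y0": row["Y0"] if row["A"] == 0 else None,
    }

children = simulate_children(6)
observed = [observe(r) for r in children]
for r in observed:
    print(r)
```

Every printed row has exactly one of `Y1` and `Y0` set to `None`. Only the simulation knows both values, so only the simulation could compute an individual causal effect.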

How can we make contrasts between counterfactual (potential) outcomes?

Challenging material ahead

The next three sections introduce the formal assumptions that underpin all causal inference. The notation is dense, but the core ideas are intuitive. Focus on the plain-language interpretation, not the formalism. Each assumption is followed by a "key intuition" box.

Fundamental Assumption 1: Causal Consistency

Causal consistency means that the potential outcome corresponding to the exposure an individual actually receives is exactly what we observe. In other words, if individual $i$ receives exposure $a$ , then the potential outcome under that level of exposure, $Y_{i} (a)$ (equivalently, the counterfactual outcome under $A = a$ ), is equivalent to the observed outcome: $Y_{i} (a) \equiv (Y_{i} ∣ A_{i} = a)$ . Where the symbol $\equiv$ means "equivalent to", when we assume that the causal consistency assumption is satisfied, we assume that:

$\underbrace{Y_{i} (1)}_{\text{counterfactual}} \equiv \underbrace{(Y_{i} ∣ A_{i} = 1)}_{\text{observable}} , \qquad \underbrace{Y_{i} (0)}_{\text{counterfactual}} \equiv \underbrace{(Y_{i} ∣ A_{i} = 0)}_{\text{observable}} .$

Notice, however, that we cannot generally obtain individual causal effects, because at any given time each individual can receive at most one level of an exposure. Receiving one level of an exposure precludes receiving any other level of that exposure. Where the symbol $⟹$ means "implies":

$(Y_{i} ∣ A_{i} = 1) ⟹ (Y_{i} (0) ∣ A_{i} = 1) \text{ is counterfactual}$

Likewise:

$(Y_{i} ∣ A_{i} = 0) ⟹ (Y_{i} (1) ∣ A_{i} = 0) \text{ is counterfactual}$

Because of the laws of physics (above the atomic scale), an individual can experience only one exposure level at any moment. Consequently, we can observe only one of the two counterfactual outcomes needed to quantify a causal effect. This is the fundamental problem of causal inference. Counterfactual contrasts cannot be individually observed.

However, because of the causal consistency assumption, we can nevertheless recover half of the missing counterfactual (or "potential") outcomes needed to estimate average treatment effects. We may do this if two other assumptions are satisfied.

Key intuition: causal consistency

Causal consistency says: "what you see is what you get." If a person actually received the treatment, their observed outcome equals their potential outcome under treatment. This seems obvious, but it fails when treatments are vaguely defined (e.g., "exercise more") because different versions of the treatment could produce different outcomes.

Fundamental Assumption 2: Exchangeability

Exchangeability justifies recovering unobserved counterfactuals from observed outcomes and averaging them. By accepting that $Y_{i} (a) = Y_{i}$ if $A_{i} = a$ , we can estimate population-level average potential outcomes. In an experiment where exposure groups are comparable, we define the Average Treatment Effect (ATE) as:

$ATE = E [Y (1)] - E [Y (0)] = E (Y ∣ A = 1) - E (Y ∣ A = 0) .$

Because randomisation ensures that missing counterfactuals are exchangeable with those observed, we can still estimate $E [Y (a)]$ . For instance, assume:

$\underbrace{E [Y (1) ∣ A = 1]}_{\text{counterfactual}} = \underbrace{E [Y (1) ∣ A = 0]}_{\text{unobservable}} = \underbrace{E (Y ∣ A = 1)}_{\text{observed}}$

which lets us infer the average outcome if everyone were treated. Likewise, if

$\underbrace{E [Y (0) ∣ A = 0]}_{\text{counterfactual}} = \underbrace{E [Y (0) ∣ A = 1]}_{\text{unobservable}} = \underbrace{E (Y ∣ A = 0)}_{\text{observed}}$

then we can infer the average outcome if everyone were given the control. The difference between these two quantities gives the ATE:

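$ATE = \Big[ \underbrace{E [Y (1) ∣ A = 1]}_{\text{by consistency: } \equiv \text{ observed } E [Y ∣ A = 1]} \Pr (A = 1) + \underbrace{E [Y (1) ∣ A = 0]}_{\text{by exchangeability: unobservable, yet } \equiv E [Y ∣ A = 1]} \Pr (A = 0) \Big] - \Big[ \underbrace{E [Y (0) ∣ A = 0]}_{\text{by consistency: } \equiv \text{ observed } E [Y ∣ A = 0]} \Pr (A = 0) + \underbrace{E [Y (0) ∣ A = 1]}_{\text{by exchangeability: unobservable, yet } \equiv E [Y ∣ A = 0]} \Pr (A = 1) \Big]$

Because $\Pr (A = 1) + \Pr (A = 0) = 1$ , each bracket collapses to a single observed mean, so the ATE equals $E [Y ∣ A = 1] - E [Y ∣ A = 0]$ .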

We have it that $E [Y ∣ A = 1]$ and $E [Y ∣ A = 0]$ are observed, whereas $E [Y (1) ∣ A = 0]$ and $E [Y (0) ∣ A = 1]$ are not. If both consistency and exchangeability are satisfied then we may use the observed quantities to identify contrasts of counterfactual quantities.

Thus, although individual-level counterfactuals are missing, the consistency assumption and the exchangeability assumption allow us to identify the average effect of treatment using observed data. Randomised controlled experiments allow us to meet these assumptions. Randomisation warrants the exchangeability assumption. Control warrants the consistency assumption.

Key intuition: exchangeability

Exchangeability says: "the treated group is a valid stand-in for the untreated group, and vice versa." In a randomised experiment, this holds by design. In observational data, we try to achieve it by conditioning on all common causes of treatment and outcome (the confounders identified in your DAG).
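To see why exchangeability matters, it can help to simulate a world in which the true effect is known. The following is a minimal sketch under invented assumptions: a constant true effect of 2 points, a single binary common cause `l`, and made-up assignment probabilities. It compares a simple difference in means when exposure is randomised with the same contrast when `l` drives exposure.

```python
import random
from statistics import mean

TRUE_EFFECT = 2.0  # assumed constant effect of the bilingual exposure (illustrative)

def simulate(n, randomised, seed=5):
    """Return (a, y) pairs for n hypothetical children."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        l = rng.randint(0, 1)                 # hypothetical common cause L
        y0 = 100 + 10 * l + rng.gauss(0, 5)   # Y(0): L raises the outcome
        y1 = y0 + TRUE_EFFECT                 # Y(1)
        if randomised:
            p_treat = 0.5                     # coin flip that ignores L
        else:
            p_treat = 0.8 if l == 1 else 0.2  # L also drives exposure
        a = 1 if rng.random() < p_treat else 0
        y = y1 if a == 1 else y0              # consistency: we observe Y(a)
        data.append((a, y))
    return data

def difference_in_means(data):
    """Observed contrast E[Y | A = 1] - E[Y | A = 0]."""
    treated = [y for a, y in data if a == 1]
    control = [y for a, y in data if a == 0]
    return mean(treated) - mean(control)

experiment = difference_in_means(simulate(20_000, randomised=True))
observational = difference_in_means(simulate(20_000, randomised=False))
print(f"true ATE:   {TRUE_EFFECT}")
print(f"randomised: {experiment:.2f}")
print(f"confounded: {observational:.2f}")
```

Under randomisation the observed contrast should land close to the true effect. Under confounded assignment it should be much larger, because treated children are disproportionately those with `l = 1`.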

Fundamental Assumption 3: Positivity

There is one further assumption, called positivity. It states that treatment assignments cannot be deterministic. That is, for every covariate pattern $L = l$ , each individual has a non-zero probability of receiving every treatment level to be compared:

$P (A = a ∣ L = l) > 0.$

Randomised experiments achieve positivity by design, at least for the sample that is selected into the study. In observational settings, violations occur if some subgroups never receive a particular treatment. If treatments occur but are rare, positivity is not strictly violated, but we may lack sufficient data from which to obtain convincing causal inferences.

Positivity is the only assumption that can be verified with data. We will consider how to assess this assumption using data when we develop our data analytic workflows in the second half of this course.

Key intuition: positivity

Positivity says: "everyone could plausibly receive either treatment." If some people would never receive the treatment (e.g., a drug contraindicated for their condition), we cannot estimate what would happen if they did. Positivity is the only one of the three assumptions we can check directly with data.
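One simple way to look for positivity problems is to count how many observations fall in each combination of covariate stratum and treatment level. The sketch below uses invented records and hypothetical function names (`positivity_table`, `positivity_violations`). An empty cell in a sample flags a possible violation, but it cannot tell us whether the treatment was impossible or merely unobserved.

```python
from collections import Counter

def positivity_table(records):
    """Count observations in each (covariate stratum, treatment level) cell."""
    return Counter((l, a) for l, a in records)

def positivity_violations(records, treatment_levels=(0, 1)):
    """List the (stratum, treatment) cells that have no observations."""
    counts = positivity_table(records)
    strata = sorted({l for l, _ in records})
    return [(l, a) for l in strata for a in treatment_levels if counts[(l, a)] == 0]

# Invented data: (stratum, A) where A = 1 is the bilingual exposure.
records = [
    ("urban", 1), ("urban", 0), ("urban", 1),
    ("rural", 0), ("rural", 0),              # no bilingual children observed here
    ("migrant", 1), ("migrant", 0),
]
print(positivity_violations(records))  # [('rural', 1)]
```

Here no child in the "rural" stratum received `A = 1`, so any estimate of the effect for that stratum would rest on extrapolation.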

Section summary

The three assumptions work together. Causal consistency links potential outcomes to observed data. Exchangeability lets us use the treated group's outcomes to infer what would have happened to the untreated group (and vice versa). Positivity ensures that both groups exist at every covariate combination. When all three hold, the ATE is identified from observed data.

Challenges with Observational Data

1. Satisfying Causal Consistency is Difficult in Observational Settings

Below are some ways in which real-world complexities can violate causal consistency in observational studies. Causal consistency requires that there is no interference between units (one component of the "Stable Unit Treatment Value Assumption", or SUTVA). Causal consistency also requires that each treatment level is well-defined and applied uniformly. If these conditions fail, then $Y (a)$ may not reflect a consistent exposure across individuals. We are then comparing apples with oranges. Consider some examples:

Cultural differences: one group's "second-language exposure" may differ qualitatively from another's if cultural norms shape how, when, or by whom the second language is taught. A child in a bilingual community may receive diverse and immersive language experiences all of which are coded as $A = 1$ . Yet the treatments might be quite different. We might be comparing apples with oranges.
Age of acquisition: the developmental effect of learning a second language may vary by when the child is exposed. Comparing acquisition at, say, age two with acquisition at, say, age twelve might be comparing apples with oranges.
Language variation: sign languages, highly tonal languages, or unwritten languages may demand different cognitive tasks than spoken, nontonal, or widely documented languages. Lumping them together as "learning a second language" can obscure the fact that these distinct exposures might produce fundamentally different outcomes. Again, comparisons here would be of apples with oranges.

These sources of heterogeneity reveal why careful delineation of treatments is crucial. If the actual exposures differ across individuals, then consistency $(Y_{i} (a) = Y_{i} ∣ A_{i} = a)$ may fail, because $A = a$ is not the same phenomenon for everyone.
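A small arithmetic sketch shows how hidden versions of a treatment can make "the effect of $A = 1$" ambiguous. The two versions, their effects, and the community mixes below are hypothetical numbers chosen only for illustration.

```python
def effect_of_coded_treatment(version_effects, version_shares):
    """Average effect of 'A = 1' when A = 1 pools several treatment versions."""
    if abs(sum(version_shares.values()) - 1.0) > 1e-9:
        raise ValueError("version shares must sum to 1")
    return sum(version_effects[v] * version_shares[v] for v in version_effects)

# Hypothetical effects (in test-score points) of two versions coded as A = 1.
version_effects = {"immersive_home": 6.0, "weekly_class": 1.0}

# Hypothetical mixes of those versions in two communities.
community_a = {"immersive_home": 0.9, "weekly_class": 0.1}
community_b = {"immersive_home": 0.2, "weekly_class": 0.8}

effect_a = effect_of_coded_treatment(version_effects, community_a)
effect_b = effect_of_coded_treatment(version_effects, community_b)
print(f"community A: {effect_a:.1f}")
print(f"community B: {effect_b:.1f}")
```

Although both communities are coded $A = 1$ , the pooled effect differs because the underlying exposures differ. Under these assumptions, a single number for "the effect of bilingualism" would not describe either community well.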

2. Conditional Exchangeability (No Unmeasured Confounding) Is Difficult to Achieve

In theory, we can identify a causal effect from observational data if all confounders $L$ are measured. Formally, we need the potential outcomes to be independent of treatment once we condition on $L$ . One way to express this assumption is: $Y (a) ∐ A ∣ L$ . If the potential outcomes are independent of treatment assignment conditional on $L$ , and consistency and positivity also hold, we can identify the Average Treatment Effect (ATE) as:

$ATE = \sum_{l} [E (Y ∣ A = 1, L = l) - E (Y ∣ A = 0, L = l)] \Pr (L = l) .$
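The standardisation formula can be sketched directly in code. The example below is a minimal illustration under invented assumptions: one binary confounder `l`, a true effect of 2, and made-up assignment probabilities. The function names are hypothetical. It computes the stratum-specific contrasts and weights them by the share of each stratum.

```python
import random
from statistics import mean

def simulate_observational(n, seed=434):
    """Return (l, a, y) triples in which L affects both exposure and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l = rng.randint(0, 1)                            # measured confounder
        a = 1 if rng.random() < (0.8 if l == 1 else 0.2) else 0
        y = 100 + 10 * l + 2 * a + rng.gauss(0, 5)       # assumed true effect: 2
        rows.append((l, a, y))
    return rows

def naive_contrast(rows):
    """Unadjusted E[Y | A = 1] - E[Y | A = 0]."""
    return mean(y for _, a, y in rows if a == 1) - mean(y for _, a, y in rows if a == 0)

def standardised_ate(rows):
    """Sum over l of [E(Y | A=1, L=l) - E(Y | A=0, L=l)] * Pr(L=l)."""
    n = len(rows)
    total = 0.0
    for l in sorted({l for l, _, _ in rows}):
        stratum = [(a, y) for l_i, a, y in rows if l_i == l]
        # mean() raises StatisticsError if a stratum lacks a treatment level,
        # which is what a positivity violation looks like in this calculation.
        mean_treated = mean(y for a, y in stratum if a == 1)
        mean_control = mean(y for a, y in stratum if a == 0)
        total += (mean_treated - mean_control) * len(stratum) / n
    return total

rows = simulate_observational(20_000)
print(f"naive contrast:   {naive_contrast(rows):.2f}")
print(f"standardised ATE: {standardised_ate(rows):.2f}")
```

In this simulated setting the naive contrast should be badly inflated, while the standardised estimate should sit near the assumed effect of 2. The adjustment works here only because `l` is the sole common cause and is measured without error.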

In randomised experiments, exchangeability holds without any conditioning because $A$ is unrelated to potential outcomes by design. In observational studies, ensuring or approximating conditional exchangeability is often difficult. For example, bilingualism research would need to consider:

Cultural histories: cultures that value language acquisition might also value knowledge acquisition. Associations might arise from culture, not causation.
Personal values: families who place a high priority on bilingualism may also promote other developmental resources.

If important confounders go unmeasured or are poorly measured, these differences can bias causal effect estimates.

3. The Positivity Assumption May Fail: Treatments Might Not Exist for All

Positivity requires that each individual could, in principle, receive any exposure level. In real-world observational settings, some groups have no access to bilingual education (or no reason to be monolingual), making certain treatment levels impossible for them. If a treatment level does not appear in the data for a given subgroup, any causal effect estimate for that subgroup is purely an extrapolation (Westreich & Cole, 2010; Hernan & Robins, 2023).

Summary

We introduced the fundamental problem of causal inference by distinguishing correlation (associations in the data) from causation (contrasts between potential outcomes, of which only one can be observed for each individual).

Randomised experiments address this problem by balancing confounding variables across treatment levels. Although individual causal effects are unobservable, random assignment allows us to infer average causal effects, also called marginal effects.

In observational data, inferring average treatment effects demands that we satisfy three assumptions that are automatically satisfied in (well-conducted) experiments: causal consistency, exchangeability, and positivity. These assumptions ensure that we can compare like-with-like (that each treatment level is well-defined and comparable across individuals), that there are no unmeasured common causes of the exposure and outcomes that may lead to associations in the absence of causality, and that every exposure level is a real possibility for each subgroup.

Lab materials: Lab 5: Average Treatment Effects

Appendix A: Notation Variants

Consider that in the causal inference literature, we may write the contrast of two potential outcomes under treatment as:

$\text{Causal Effect}_{i} = Y_{i}^{a = 1} - Y_{i}^{a = 0}$

or:

$\text{Causal Effect}_{i} = Y_{i}^{1} - Y_{i}^{0}$

or:

$\text{Causal Effect}_{i} = Y_{1} - Y_{0}$

where the individual subscripts on the potential outcomes are dropped. You will soon encounter many notational variants across sources.

PSYC 434: Conducting Research Across Cultures