Suggested Answers: Pair Exercises
These are brief suggested answers for the pair exercises embedded in weekly lectures. They are intended as discussion guides, not definitive solutions. Many exercises are deliberately open-ended.
Week 1
Formulating a contrast
A well-formed causal question might be: "Would replacing two hours of nightly screen time with two hours of reading reduce sleep onset latency (in minutes) over four weeks among 14-to-16-year-olds in Aotearoa New Zealand?" Both sides of the contrast are specified (screen time versus reading), the outcome is defined (sleep onset latency), the time horizon is stated (four weeks), and the target population is named (14-to-16-year-olds in Aotearoa New Zealand). Common critique points for vaguer formulations: "screen time" is vague (passive scrolling? gaming? messaging?), "poor sleep" needs a measurable operationalisation, and "teenagers" lumps heterogeneous age groups.
Three problems in one claim
- Definitional clarity: "religion" could mean attendance, belief, prayer, or community membership. "Mental health" could mean depression, life satisfaction, anxiety, or a composite. Neither side of the contrast is specified.
- Population specificity: the answer may differ between adolescents and older adults, between countries with majority-religion norms and secular societies, or between denominations.
- Unobservability: we cannot observe the same person both practising and not practising religion. The individual causal effect is missing by construction.
A rewrite: "Among adults aged 40-65 in Aotearoa New Zealand, would initiating weekly religious service attendance (versus maintaining no attendance) reduce depressive symptoms (PHQ-9 score) over 12 months?"
Week 2
Naming the structure
- Fork. SES causes both neighbourhood quality and health outcomes: neighbourhood $\leftarrow$ SES $\to$ health. Neighbourhood and health are marginally associated (through SES). Conditioning on SES removes the association.
- Chain. Drug $\to$ inflammation $\to$ pain. Drug and pain are marginally associated (through the mediating path). Conditioning on inflammation blocks the path and removes the association between drug and pain.
- Collider. Genetics $\to$ BP $\leftarrow$ diet. Genetics and diet are marginally independent (neither causes the other and they share no common cause). Conditioning on blood pressure opens a spurious association: among people with the same BP, knowing genetic risk tells you something about diet (for example, someone with high genetic risk but normal BP probably has a healthier diet).
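The three patterns above can be checked with a small simulation. This is a minimal sketch under assumptions that are not part of the exercise: all variables are standard-normal with unit coefficients, and "conditioning" is approximated by a partial correlation (linear adjustment). In this setup, theory gives roughly $0.5 \to 0$ for the fork, $-0.58 \to 0$ for the chain, and $0 \to -0.5$ for the collider.

```python
import math
import random

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after linear adjustment for z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
n = 20_000

def noise():
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def add(*columns):
    return [sum(values) for values in zip(*columns)]

# Fork: SES causes both neighbourhood quality and health
ses = noise()
neighbourhood, health = add(ses, noise()), add(ses, noise())

# Chain: drug lowers inflammation, inflammation raises pain
drug = noise()
inflammation = add([-d for d in drug], noise())
pain = add(inflammation, noise())

# Collider: genetics and diet both raise blood pressure
genetics, diet = noise(), noise()
bp = add(genetics, diet, noise())

results = {
    "fork": (corr(neighbourhood, health), partial_corr(neighbourhood, health, ses)),
    "chain": (corr(drug, pain), partial_corr(drug, pain, inflammation)),
    "collider": (corr(genetics, diet), partial_corr(genetics, diet, bp)),
}
for name, (marginal, conditional) in results.items():
    print(f"{name:9s} marginal={marginal:+.2f} conditional={conditional:+.2f}")
```

The fork and chain lose their association once the middle variable is adjusted for, whereas the collider gains one.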
Checking assumptions against a causal DAG
In the observational design, parental consent ($L$) is driven by SES ($U$), and $U$ also affects polio risk ($Y$). The backdoor path $A \leftarrow L \leftarrow U \to Y$ is open. Exchangeability fails: $Y(a) \cancel\coprod A$.
In the randomised design, $A$ is assigned by a chance mechanism ($\mathcal{R}$) that is independent of $U$ and $L$. The backdoor path through $L$ and $U$ is severed because $A$ no longer depends on $L$. Exchangeability holds: $Y(a) \coprod A$.
Positivity is more likely to fail in the observational design: some SES strata may have near-universal consent or refusal, leaving no comparison group.
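The contrast between the two designs can be sketched in simulation. All numbers here are invented for illustration and are much larger than real polio risks: high SES ($U$) raises both consent ($L$) and baseline risk, and vaccination is assumed to lower risk by 5 percentage points. In the observational arm treatment equals consent; in the randomised arm it is a coin flip.

```python
import random

rng = random.Random(7)
N = 100_000
TRUE_EFFECT = -0.05  # assumed causal risk difference of vaccination

def simulate(randomised):
    rows = []
    for _ in range(N):
        u = rng.random() < 0.5                   # U: high SES
        l = rng.random() < (0.8 if u else 0.3)   # L: parental consent, driven by U
        # Observational design: treated if and only if consent was given.
        a = (rng.random() < 0.5) if randomised else l
        risk = 0.10 + 0.10 * u + TRUE_EFFECT * a  # U also raises risk
        y = rng.random() < risk
        rows.append((a, y))
    return rows

def risk_difference(rows):
    treated = [y for a, y in rows if a]
    control = [y for a, y in rows if not a]
    return sum(treated) / len(treated) - sum(control) / len(control)

observational_rd = risk_difference(simulate(randomised=False))
randomised_rd = risk_difference(simulate(randomised=True))
print(f"observational: {observational_rd:+.3f}")
print(f"randomised:    {randomised_rd:+.3f}  (assumed truth {TRUE_EFFECT:+.3f})")
```

With these assumed parameters the open backdoor path almost exactly cancels the protective effect in the observational comparison, while the randomised comparison recovers it.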
Neurath's ship and your own causal DAG
Answers vary by discipline. The key check is whether the partner can identify a fork (common cause generating spurious association) and a chain (mediating path). The sceptic's challenge should propose either a reversed arrow or a missing common cause that would change the adjustment strategy.
Week 3
Applying the backdoor criterion
Paths from $A$ to $Y$: (1) $A \to M \to Y$ (causal, through mediator); (2) $A \leftarrow L_1 \to Y$ (backdoor through health consciousness); (3) $A \leftarrow L_1 \to L_2 \to Y$ (backdoor through health consciousness and diet).
$\{L_1\}$ satisfies the backdoor criterion: it blocks both backdoor paths (paths 2 and 3) and $L_1$ is not a descendant of $A$. Conditioning on $\{L_1\}$ supports exchangeability.
Adding $M$ violates the criterion because $M$ is a descendant of $A$ (it lies on the causal path $A \to M \to Y$). Conditioning on $M$ blocks part of the total effect we want to estimate.
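A small simulation can make the three adjustment choices concrete. This is a sketch with invented binary variables and coefficients, chosen so that the total effect of $A$ on $Y$ is 1.0 (all of it through $M$). Adjustment is done by stratifying and then weighting stratum-specific differences by stratum size.

```python
import random
from collections import defaultdict

rng = random.Random(11)
N = 100_000

def draw_row():
    l1 = int(rng.random() < 0.5)             # health consciousness
    l2 = int(rng.random() < 0.3 + 0.4 * l1)  # diet, caused by L1
    a = int(rng.random() < 0.2 + 0.6 * l1)   # treatment, caused by L1
    m = int(rng.random() < 0.2 + 0.5 * a)    # mediator, caused by A
    y = 2.0 * m + 1.0 * l1 + 1.0 * l2 + rng.gauss(0, 1)
    return {"L1": l1, "L2": l2, "A": a, "M": m, "Y": y}

rows = [draw_row() for _ in range(N)]

def adjusted_difference(rows, adjust_for):
    """Stratify on the named covariates, then average the treated-minus-control
    difference in mean Y across strata, weighted by stratum share."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for r in rows:
        key = tuple(r[name] for name in adjust_for)
        strata[key][r["A"]].append(r["Y"])
    total = 0.0
    for groups in strata.values():
        share = (len(groups[0]) + len(groups[1])) / len(rows)
        diff = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
        total += share * diff
    return total

naive = adjusted_difference(rows, ())
adj_l1 = adjusted_difference(rows, ("L1",))
adj_l1_m = adjusted_difference(rows, ("L1", "M"))
print(f"no adjustment:   {naive:.2f}")
print(f"adjust for L1:   {adj_l1:.2f}  (assumed true total effect 1.00)")
print(f"adjust L1 and M: {adj_l1_m:.2f}")
```

Under these assumptions the unadjusted contrast is inflated by the open backdoor paths, adjusting for $L_1$ alone recovers the total effect, and adding $M$ removes the effect entirely because every causal path runs through the mediator.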
M-bias in practice
The DAG: $U_1 \to A$ (attendance), $U_2 \to Y$ (giving), $U_1 \to L \leftarrow U_2$ (neighbourhood social capital is a collider of two unmeasured causes). Without conditioning on $L$, the path $A \leftarrow U_1 \to L \leftarrow U_2 \to Y$ is blocked at the collider $L$.
Conditioning on $L$ opens this path, creating a spurious association between $A$ and $Y$ through the unmeasured causes. "Adjust for all pre-treatment variables" fails because $L$ is pre-treatment but is a collider: conditioning on it opens, rather than closes, a biasing path.
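The M-bias structure can be simulated directly. This sketch assumes a linear model with standard-normal noise and unit coefficients, and no arrow from $A$ to $Y$, so any non-zero coefficient on $A$ is bias. The adjusted coefficient uses the closed-form two-regressor least-squares formula; in this setup theory gives about $-0.2$.

```python
import random

rng = random.Random(3)
n = 20_000

def cov(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

u1 = [rng.gauss(0, 1) for _ in range(n)]  # unmeasured cause of attendance
u2 = [rng.gauss(0, 1) for _ in range(n)]  # unmeasured cause of giving
a = [u + rng.gauss(0, 1) for u in u1]     # attendance; has no effect on giving
y = [u + rng.gauss(0, 1) for u in u2]     # giving
l = [p + q + rng.gauss(0, 1) for p, q in zip(u1, u2)]  # social capital (collider)

# Slope of Y on A alone
unadjusted = cov(a, y) / cov(a, a)

# Coefficient on A in the least-squares fit of Y on A and L
adjusted = (cov(a, y) * cov(l, l) - cov(a, l) * cov(l, y)) / (
    cov(a, a) * cov(l, l) - cov(a, l) ** 2
)
print(f"unadjusted: {unadjusted:+.3f}")
print(f"adjusted for L: {adjusted:+.3f}")
```

The unadjusted slope is close to the true value of zero; adjusting for the pre-treatment collider moves the estimate away from it.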
$R^2$ versus identification
$R^2$ measures variance explained, a statistical property. Confounding is a structural property of the DAG. A model with high $R^2$ can still be biased if the adjustment set includes a collider (opening a spurious path) or a mediator (blocking part of the causal path).
Example DAG where the larger set introduces bias: if Investigator A's set includes a variable $C$ that is a collider ($A \to C \leftarrow Y$), conditioning on $C$ opens a non-causal path and biases the estimate, despite improving $R^2$. Investigator B's smaller set $\{$age, conscientiousness$\}$ would satisfy the backdoor criterion if conscientiousness blocks all backdoor paths and is not a descendant of $A$.
Week 4
Classifying measurement error
- Type 1: independent, uncorrelated. The screen-time noise and the purpose noise do not share a common cause and neither is causally affected by the other variable. This typically attenuates toward the null.
- Type 3: dependent, uncorrelated. The treatment (bilingualism) causally affects how the outcome (cognitive performance) is measured, because the test instrument is language-dependent. The DAG shows $A \to$ measurement error node $\to Y^*$ (recorded outcome), opening a non-causal path from $A$ to $Y^*$.
- Type 2: independent, correlated. The shared translation team creates a common cause of errors in both measures. Neither variable's true value causes the other's measurement error, but the errors co-vary through the shared cause.
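Two of these error types are easy to simulate. The sketch below assumes linear, standard-normal variables with invented error variances. Part 1 shows independent, uncorrelated errors shrinking a true slope of 1.0 (theory: about 0.5 here). Part 2 shows a shared error source creating a slope of about 0.2 when the true effect is zero.

```python
import random

rng = random.Random(5)
n = 20_000

def gauss_column(sd):
    return [rng.gauss(0, sd) for _ in range(n)]

def plus(x, e):
    return [xi + ei for xi, ei in zip(x, e)]

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Part 1: independent, uncorrelated errors (type 1); true slope is 1.0
x = gauss_column(1)
y = plus(x, gauss_column(1))
x_measured = plus(x, gauss_column(1))  # separate error source for the exposure
y_measured = plus(y, gauss_column(1))  # separate error source for the outcome
type1_true, type1_observed = slope(x, y), slope(x_measured, y_measured)

# Part 2: correlated errors (type 2); true slope is 0
x2, y2 = gauss_column(1), gauss_column(1)
shared = gauss_column(0.5)             # e.g. one team's errors entering both measures
type2_true = slope(x2, y2)
type2_observed = slope(plus(x2, shared), plus(y2, shared))

print(f"type 1: true {type1_true:.2f}, observed {type1_observed:.2f}")
print(f"type 2: true {type2_true:.2f}, observed {type2_observed:.2f}")
```

The first case biases toward the null, whereas the second can manufacture an association where none exists.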
Collider bias versus confounding
The DAG: depression ($A$) $\to$ ward admission ($C$) $\leftarrow$ injury severity $\to$ recovery ($Y$). Marginally, $A$ and $Y$ may be independent (or associated only through a causal path). Restricting to admitted patients conditions on $C$, opening the path $A \to C \leftarrow$ injury severity $\to Y$.
This is not confounding. Confounding requires an open backdoor path through a common cause (e.g., $A \leftarrow L \to Y$). Here, the path was blocked before conditioning. Conditioning on the collider $C$ actively opens a previously blocked path. Among admitted patients, less depressed individuals tend to have more severe injuries (otherwise they would not have been admitted). If more severe injuries slow recovery, depressed patients then appear to recover better than non-depressed patients, a spurious association that reflects selection rather than any effect of depression.
Design fix: analyse all eligible patients regardless of admission status, or use inverse probability weighting to account for selection into the hospital sample.
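The selection mechanism and the weighting fix can be sketched as follows. All probabilities are invented. Depression has no effect on recovery in this simulation, so any difference among admitted patients is produced by conditioning on admission. The weights use the true admission probabilities, which in a real study would have to be estimated and would require injury severity to be measured.

```python
import random

rng = random.Random(9)
N = 100_000

def p_admit(a, s):
    """Assumed admission probability given depression (a) and severe injury (s)."""
    return 0.1 + 0.4 * a + 0.4 * s

people = []
for _ in range(N):
    a = int(rng.random() < 0.3)            # depression
    s = int(rng.random() < 0.4)            # severe injury
    c = rng.random() < p_admit(a, s)       # ward admission (collider)
    y = int(rng.random() < 0.8 - 0.4 * s)  # recovery depends on severity only
    people.append((a, s, c, y))

def difference(rows, weighted=False):
    """Recovery rate of depressed minus non-depressed, optionally weighted by
    the inverse of each person's admission probability."""
    totals = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # per A: [weighted sum of Y, sum of weights]
    for a, s, _, y in rows:
        w = 1 / p_admit(a, s) if weighted else 1.0
        totals[a][0] += w * y
        totals[a][1] += w
    return totals[1][0] / totals[1][1] - totals[0][0] / totals[0][1]

admitted = [row for row in people if row[2]]
full_sample = difference(people)
admitted_only = difference(admitted)
admitted_ipw = difference(admitted, weighted=True)
print(f"all eligible patients: {full_sample:+.3f}")
print(f"admitted only:         {admitted_only:+.3f}")
print(f"admitted, weighted:    {admitted_ipw:+.3f}")
```

The full-sample and weighted contrasts sit near zero, while the unweighted admitted-only contrast does not.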
Auditing a study for two failure modes
Selection bias: university mailing list recruitment acts as a filter. Academic motivation and language confidence jointly affect enrolment, making the analytic sample unrepresentative. If motivation or confidence also relates to bilingualism or cognitive outcomes, conditioning on sample membership distorts the contrast.
Measurement bias: type 3 (dependent, uncorrelated). The treatment (bilingualism) causally affects how the English-only cognitive test measures the outcome. Non-English-dominant bilinguals are systematically mismeasured, and this mismeasurement depends on treatment status.
Week 5
Building a potential outcomes table
The key distinction is between the hidden science and the observed data. In the hidden science, each student has both potential outcomes and an individual effect. In the observed data, one potential outcome and hence $\delta_i$ are missing for every student.
One possible hidden science table is:
| $i$ | $Y_i(1)$ | $Y_i(0)$ | $\delta_i$ |
|---|---|---|---|
| 1 | 0 | 1 | $-1$ |
| 2 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 |
| 4 | 1 | 1 | 0 |
If treatment assignment is $A_1=A_2=1$ and $A_3=A_4=0$, the observed-data table is:
| $i$ | $Y_i(1)$ | $Y_i(0)$ | $\delta_i$ | $A_i$ | $Y_i^{\text{obs}}$ |
|---|---|---|---|---|---|
| 1 | 0 | NA | NA | 1 | 0 |
| 2 | 0 | NA | NA | 1 | 0 |
| 3 | NA | 0 | NA | 0 | 0 |
| 4 | NA | 1 | NA | 0 | 1 |
The true ATE in the hidden science is $(-1 + 0 + 1 + 0)/4 = 0$. The naive observed difference in means is $\bar{Y}_{A=1}^{\mathrm{obs}} - \bar{Y}_{A=0}^{\mathrm{obs}} = 0 - 0.5 = -0.5$. The discrepancy arises because treatment assignment is not random with respect to the potential outcomes: students 1 and 2, who received $A=1$, differ in their hidden counterfactual outcomes from students 3 and 4, who received $A=0$. Exchangeability does not hold.
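The arithmetic can be reproduced in a few lines. The sketch below encodes the hidden science table, recomputes the true ATE and the naive difference for the assignment given above, and then, as an extra step not required by the exercise, averages the naive difference over every way of treating exactly two of the four students.

```python
from itertools import combinations
from statistics import mean

# Hidden science from the table: student -> (Y(1), Y(0))
hidden = {1: (0, 1), 2: (0, 0), 3: (1, 0), 4: (1, 1)}

true_ate = mean(y1 - y0 for y1, y0 in hidden.values())

def naive_difference(treated_ids):
    """Observed mean among the treated minus observed mean among controls."""
    treated = [hidden[i][0] for i in hidden if i in treated_ids]
    control = [hidden[i][1] for i in hidden if i not in treated_ids]
    return mean(treated) - mean(control)

observed_gap = naive_difference({1, 2})  # students 1 and 2 treated

# All six assignments that treat exactly two students
all_gaps = [naive_difference(set(ids)) for ids in combinations(hidden, 2)]
average_gap = mean(all_gaps)

print("true ATE:", true_ate)
print("naive difference for this assignment:", observed_gap)
print("average naive difference over all assignments:", average_gap)
```

Any single assignment can miss the ATE, as this one does, but the average over all equally likely assignments equals it. That is the sense in which randomisation delivers exchangeability in expectation.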
Tracing the identification logic
The claim "students who chose the mindfulness app had lower anxiety, therefore the app works" compares $\mathbb{E}[Y \mid A=1]$ with $\mathbb{E}[Y \mid A=0]$ and interprets the difference causally.
Consistency is questionable if "used the app" pools different versions of treatment under one label: different apps, different session lengths, different start dates, or irregular adherence. Multiple versions undermine the link between $A_i = 1$ and a well-defined $Y_i(1)$.
Exchangeability is the most plausible violated assumption: students who chose the app may have differed from non-users in baseline anxiety, motivation, help-seeking, or available time. The treated group may therefore have had different counterfactual outcomes even without the app, so $Y(0) \cancel\coprod A$.
Positivity may also fail: in some covariate strata, such as students with very high workload or students already receiving intensive therapy, almost no one may choose one side of the contrast, leaving no meaningful comparison group in those strata.
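The role of each assumption can be seen by decomposing the observed contrast. The first equality uses consistency; the second adds and subtracts $\mathbb{E}[Y(0) \mid A=1]$:

$$
\begin{aligned}
\mathbb{E}[Y \mid A=1] - \mathbb{E}[Y \mid A=0]
&= \mathbb{E}[Y(1) \mid A=1] - \mathbb{E}[Y(0) \mid A=0] \\
&= \underbrace{\mathbb{E}[Y(1) - Y(0) \mid A=1]}_{\text{effect among app users}}
 + \underbrace{\mathbb{E}[Y(0) \mid A=1] - \mathbb{E}[Y(0) \mid A=0]}_{\text{selection bias}}
\end{aligned}
$$

The second term is zero only if $Y(0) \coprod A$. If app users would have had lower anxiety even without the app, that term is negative and the naive comparison overstates the benefit.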
Designing a target trial
Causal estimand: the average difference in anxiety symptoms (e.g., GAD-7 score) at 6 months if all university students practised 20 minutes of daily meditation versus if all maintained their current routine (no meditation).
Time zero: the date of programme enrolment (or randomisation in the target trial).
Two baseline covariates with causal rationale: (1) baseline anxiety (GAD-7 at enrolment), because prior anxiety affects both the decision to meditate and the outcome; (2) academic workload (full-time vs part-time enrolment), because workload affects adherence to meditation and anxiety levels.
Positivity failure: students with severe clinical anxiety may be referred to treatment rather than a meditation programme, so the stratum "severe baseline anxiety" may contain no one in the meditation arm.
Week 6
Interaction versus effect modification
The causal estimand for interaction requires four potential outcomes: $Y(a=1,g=\text{young})$, $Y(a=1,g=\text{old})$, $Y(a=0,g=\text{young})$, $Y(a=0,g=\text{old})$. This is conceptually odd because we cannot intervene on age.
The causal estimand for effect modification involves one intervention (exercise) with subgroup contrasts: $\mathbb{E}[Y(1)-Y(0) \mid G=\text{young}]$ versus $\mathbb{E}[Y(1)-Y(0) \mid G=\text{old}]$. This is the design that matches the study.
The regression interaction term could be non-zero without causal modification if, for example, the linear specification is wrong (the true effect varies non-linearly with a confounder correlated with age), or if age is a collider or descendant of a collider in the DAG.
Why conditioning changes effect modification
Even without a direct $G \to Y$ path, the CATE varies by age because $G$ (age) affects $L$ (fitness), and if the treatment effect varies with $L$, then the distribution of $L$ within age strata determines the subgroup average. Different age groups have different fitness distributions, so $\tau(g)$ differs.
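One way to write this down, assuming the treatment effect depends on fitness but not on age once fitness is known, that is $\mathbb{E}[Y(1)-Y(0) \mid L=l, G=g] = \tau_L(l)$:

$$
\tau(g) = \sum_{l} \tau_L(l) \, \Pr(L = l \mid G = g)
$$

With purely hypothetical numbers, suppose $\tau_L(\text{high}) = 6$ and $\tau_L(\text{low}) = 2$, with $\Pr(L=\text{high} \mid G=\text{young}) = 0.7$ and $\Pr(L=\text{high} \mid G=\text{old}) = 0.3$. Then $\tau(\text{young}) = 0.7 \times 6 + 0.3 \times 2 = 4.8$ and $\tau(\text{old}) = 0.3 \times 6 + 0.7 \times 2 = 3.2$. Age modifies the effect on the marginal scale, yet within each fitness level the effect is the same for both age groups.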
The colleague's null interaction conclusion is premature: the regression test depends on the conditioning set and the functional form. A non-significant $A \times G$ term in a linear model does not rule out effect modification visible with a richer specification or different conditioning set.
Two apparent modifiers could vanish together if both $G_1$ and $G_2$ are proxies for the same underlying variable $L$. Each captures part of the variation in $L$; conditioning on both accounts for $L$ fully, and the residual variation in treatment effect disappears.
From average to subgroup
If 60% of participants have CATE = 8 and 40% have CATE = $-2$: ATE = $0.6 \times 8 + 0.4 \times (-2) = 4.8 - 0.8 = 4.0$ mmHg. Adjust proportions: e.g., 50% with CATE = 8 and 50% with CATE = $-2$: ATE = $4 - 1 = 3$ mmHg. A policy-maker who looks only at the ATE misses that half of participants belong to a subgroup that is harmed on average (blood pressure increases by 2 mmHg).
The claim "$\hat{\tau}(X_i) = 8$ means the programme will reduce my blood pressure by 8" confuses an estimated subgroup average with an individual effect. $\hat{\tau}(X_i) = 8$ estimates the average effect for everyone sharing person $i$'s measured profile. Person $i$'s true individual effect $Y_i(1) - Y_i(0)$ is unobservable and could be larger, smaller, or opposite in sign.
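The averaging step can be wrapped in a small helper so that other subgroup mixes are easy to try. The function names are illustrative only, and positive values follow the sign convention above (a reduction in blood pressure).

```python
def ate_from_subgroups(subgroups):
    """subgroups: list of (population share, CATE) pairs whose shares sum to 1."""
    total_share = sum(share for share, _ in subgroups)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("subgroup shares must sum to 1")
    return sum(share * cate for share, cate in subgroups)

def share_in_harmed_subgroups(subgroups):
    """Population share belonging to subgroups whose average effect is negative."""
    return sum(share for share, cate in subgroups if cate < 0)

original_mix = [(0.6, 8), (0.4, -2)]
adjusted_mix = [(0.5, 8), (0.5, -2)]

ate_original = ate_from_subgroups(original_mix)
ate_adjusted = ate_from_subgroups(adjusted_mix)
harmed_share = share_in_harmed_subgroups(adjusted_mix)
print(ate_original, ate_adjusted, harmed_share)
```

Both mixes give a positive ATE, so the average alone cannot reveal how large the harmed subgroup is. Even the subgroup figures are averages, and they say nothing definite about any one person.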
Week 8
From tree to forest to causal forest
A single regression tree is interpretable (you can read the decision path), but unstable: small changes in the data shift splits and predictions (high variance).
Averaging many trees (a forest) reduces variance. Each tree's idiosyncratic splits cancel out, producing smoother, more reliable predictions.
Two differences in a causal forest: (a) the target is $\tau(x) = \mathbb{E}[Y(1)-Y(0) \mid X=x]$, a treatment contrast, not a prediction of $Y$; (b) honest splitting uses one subsample to choose splits and a separate subsample to estimate contrasts within leaves. Honest splitting is necessary because treatment contrasts require estimating quantities under two exposures, only one of which is observed per person. Using the same data for splitting and estimation would overfit to noise in the individual-level contrasts.
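The idea behind honest splitting can be sketched with a single split. This is a toy illustration under invented assumptions, not the algorithm used by any causal forest package. Treatment is randomised, the true effect is 2 when $x > 0.5$ and 0 otherwise, and the splitting criterion is a size-weighted squared gap between leaf effects. One half of the data chooses the split; the other half estimates the leaf effects.

```python
import random

rng = random.Random(42)

def simulate(n):
    """Randomised treatment; the effect is 2 when x > 0.5 and 0 otherwise (assumed)."""
    rows = []
    for _ in range(n):
        x = rng.random()
        a = rng.random() < 0.5
        tau = 2.0 if x > 0.5 else 0.0
        rows.append((x, a, tau * a + rng.gauss(0, 1)))
    return rows

def effect(rows):
    """Difference in mean outcome, treated minus control."""
    treated = [y for _, a, y in rows if a]
    control = [y for _, a, y in rows if not a]
    return sum(treated) / len(treated) - sum(control) / len(control)

def split(rows, threshold):
    left = [r for r in rows if r[0] <= threshold]
    right = [r for r in rows if r[0] > threshold]
    return left, right

def heterogeneity(rows, threshold):
    """Splitting criterion: size-weighted squared gap between leaf effects."""
    left, right = split(rows, threshold)
    weight = len(left) * len(right) / len(rows) ** 2
    return weight * (effect(right) - effect(left)) ** 2

data = simulate(20_000)
splitting_half, estimation_half = data[:10_000], data[10_000:]

# Step 1: choose the split using the splitting half only.
candidates = [i / 10 for i in range(1, 10)]
best = max(candidates, key=lambda t: heterogeneity(splitting_half, t))

# Step 2: estimate leaf effects using the estimation half only.
left, right = split(estimation_half, best)
honest_effects = {"x <= split": effect(left), "x > split": effect(right)}
print("chosen split:", best)
print("honest leaf effects:", honest_effects)
```

Because the estimation half played no part in choosing the threshold, noise that happened to favour a particular split cannot also inflate the leaf estimates.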
Reading a TOC curve
A steep initial rise means treatment gains are heavily concentrated among the top-ranked individuals. The programme helps some people a lot but most people only a little.
Large AUTOC but small Qini at $q=0.3$ means that gains concentrate in a very narrow top slice (perhaps the top 5-10%), and by the time you expand to 30% coverage, the additional individuals contribute little. For a decision-maker with a 30% budget, the targeting advantage over random allocation is small.
Computing the TOC curve on training data overfits: the forest's rankings are optimised for the training sample, so in-sample evaluation inflates the apparent targeting value. Honest evaluation requires held-out or cross-fitted data.
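The quantities discussed above can be computed on a toy population. This sketch assumes the individual effects are known, which is only possible because they are invented here: 5% of people gain 10 units and everyone else gains nothing. With real data the ranking and the evaluation scores would come from separate held-out or cross-fitted samples. The definitions used are assumptions for illustration: TOC($q$) is the mean effect in the top-$q$ fraction minus the ATE, the Qini gain is $q \times$ TOC($q$), and AUTOC is the average of TOC over $q$.

```python
n = 1000
# Hypothetical individual effects, known only because we made them up.
effects = [10.0] * 50 + [0.0] * 950
# Oracle ranking: largest effect first.
ranked = sorted(effects, reverse=True)
ate = sum(ranked) / n

def toc(q):
    """Mean effect among the top-q fraction minus the ATE."""
    k = max(1, round(q * n))
    return sum(ranked[:k]) / k - ate

def qini_gain(q):
    """Gain over random allocation from treating the top-q fraction."""
    return q * toc(q)

# AUTOC approximated as the average of TOC over the grid q = 1/n, ..., 1
autoc = sum(toc(k / n) for k in range(1, n + 1)) / n

print(f"ATE = {ate:.2f}, AUTOC = {autoc:.2f}")
print(f"Qini gain at q=0.05: {qini_gain(0.05):.3f}")
print(f"Qini gain at q=0.30: {qini_gain(0.30):.3f}")
```

In this toy population AUTOC is large relative to the ATE, but the gain over random allocation has already peaked at 5% coverage, so a 30% budget buys no additional targeting advantage.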
Should we target?
The Qini addresses the causal estimand: "does treating the top $q$ fraction (ranked by estimated treatment effect) yield greater total benefit than treating a random $q$ fraction?" It goes beyond the ATE by asking whether benefits are concentrated enough to justify selective allocation.
Two non-statistical reasons not to target: (1) logistical feasibility (screening and scoring may cost more than universal provision); (2) stigma or fairness concerns (singling out individuals with high loneliness scores may be perceived as labelling).
Response to the stigma concern: "The concern about stigmatisation is legitimate and must shape how targeting is implemented. The evidence shows that some students benefit substantially more than others, but it does not mandate that selection criteria be disclosed or that participation be compulsory. A self-referral design using the targeting criteria as capacity planning could capture most of the benefit without labelling individuals."
Week 9
Reading a simple policy rule
The supplied tree first splits on deprivation index, then splits the high-deprivation branch on baseline loneliness. The high-response leaf is "high deprivation, high loneliness".
Plain-language summary: the strongest estimated response is among residents in high-deprivation areas who also report high baseline loneliness.
If roughly 40% of the population is high-deprivation and half of that group is high-loneliness, the high-response region contains approximately:
$$ 0.40 \times 0.50 = 0.20 $$
That is about 20% of residents.
This does not mean the analysis must treat exactly 20%. In our lab workflow, the policy tree helps describe where estimated responses are strongest, and the report should give the expected mean difference for that region. The size of the region is still useful because it tells readers whether the finding describes a small niche group or a sizeable part of the eligible population.
Depth-1 versus depth-2
The depth-1 rule can be stated as: treat residents above the log-income threshold and do not treat residents at or below it. The depth-2 rule can be stated as: first split on openness; for those below the first threshold, split again on openness, and for those above it, split on neuroticism; treat only the leaves labelled "treat."
A lift of $0.028$ standard-deviation units is small for any one person. Across $10{,}000$ eligible people, however, it is an average improvement applied many times. If the rule is implemented correctly and the estimate transports, the population-level gain could be meaningful even though the individual-level gain sounds modest.
Deploy the depth-1 rule when implementation must be simple, when staff will apply the rule under time pressure, or when the depth-2 splits are unstable across resamples. Deploy the depth-2 rule when the intervention is high-stakes, the extra gain is meaningful, the confidence interval supports the improvement, and the implementing organisation can apply the rule reliably.
The extra evidence that would support depth-2 is a held-out policy-value estimate whose uncertainty clearly favours depth-2, stable splits across resamples, and an equity audit showing that the added split does not worsen disparities.
Equity audit
Plain-language rule: offer the programme to high-deprivation residents under 40.
A deprivation split can affect social groups differently because deprivation is correlated with many background conditions: income, housing, neighbourhood resources, family structure, age, health, and sometimes ethnicity. A rule that targets high deprivation may therefore produce uneven treatment shares even when group membership is not an explicit splitter. This does not make deprivation the causal root of the heterogeneity. Deprivation may be the strongest measured marker of deeper causes, including ethnic injustice or other upstream social processes.
The first empirical check is a cross-tabulation of assigned action by relevant social groups, ideally with uncertainty intervals for the treated share in each group. The analyst should also check the distribution of need and estimated benefit across those groups, because equal treatment shares can still hide unequal need.
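A first-pass audit table might look like the sketch below. The group labels and counts are hypothetical, and the Wilson score interval is one reasonable choice of uncertainty interval for a proportion, not the only one.

```python
import math
from collections import Counter

def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical records: (group label, whether the rule assigns treatment)
records = (
    [("group_a", True)] * 50 + [("group_a", False)] * 50
    + [("group_b", True)] * 20 + [("group_b", False)] * 180
)

totals = Counter(group for group, _ in records)
treated = Counter(group for group, assigned in records if assigned)

audit = {}
for group in sorted(totals):
    k, n_group = treated[group], totals[group]
    audit[group] = (k / n_group, *wilson_interval(k, n_group))

for group, (share, low, high) in audit.items():
    print(f"{group}: treated share {share:.2f} (95% interval {low:.2f} to {high:.2f})")
```

The same table should then be repeated for need and for estimated benefit, since treated shares alone do not show whether the rule reaches those who need it.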
Applying governance checks: (1) "Who gains and who loses?" The high-deprivation-under-40 group gains access; everyone else is excluded. If the excluded group includes high-deprivation people over 40 who also benefit, the rule creates an age-based inequity within disadvantaged communities. (2) "Can affected communities understand and contest the rule?" A depth-2 tree is transparent enough to explain, but communities need a mechanism to challenge the split variables and thresholds.
The model cannot decide whether maximising expected benefit, equal access, need, individual choice, fiscal constraint, or another principle should govern the allocation. That judgement belongs to democratic and institutional decision-making. The analyst can clarify the consequences of different choices and report whether the rule behaves as advertised.
"The algorithm is objective because it only uses data" is too quick. The algorithm optimises a chosen objective function on data produced by social institutions and past decisions. Mechanical consistency in computation does not settle whether the rule is fair, legitimate, or publicly acceptable.
Policy tree versus ranking
(a) Estimated policy value: Strategy A (pure ranking) typically achieves equal or slightly higher policy value because it uses the full granularity of $\hat{\tau}(X_i)$. Strategy B loses some value by collapsing to a few leaves.
(b) Explainability: Strategy B is far more explainable. A depth-2 tree is a short set of if-then rules. Strategy A requires explaining a continuous score derived from thousands of overlapping tree splits.
(c) Ability to answer "why was I selected?": Strategy B can give a clear answer ("because your deprivation index is above 8 and your loneliness score is above the median"). Strategy A can only say "because your estimated benefit score was in the top 20%" (assuming a 20% budget was imposed), which is opaque.
Strategy A is preferable when the decision is internal (e.g., a research team allocating limited follow-up resources), a fixed treatment share must be met exactly, and public justification is less central. Strategy B is preferable when the rule must be defended publicly, contested by affected communities, or implemented by non-technical staff. Under the default policytree workflow, the percentage treated is an output of the fitted rule, not a fixed input.
Both strategies must answer the same equity question: who is excluded under the rule, and does that exclusion worsen disparities for protected or structurally disadvantaged groups?
Week 10
Interpreting invariance results
Configural invariance means the same items load on the same factors in both groups (same pattern of zero and non-zero loadings). Metric invariance means the factor loadings are equal across groups (a one-unit increase in the latent variable produces the same change in item responses). Scalar/threshold invariance means the intercepts (or thresholds for ordinal items) are equal, so the same latent level produces the same expected response.
Failing scalar/threshold invariance means that even at the same latent distress level, the two groups endorse "felt hopeless" and "felt worthless" differently. A one-unit difference in total score does not correspond to the same latent difference across groups. Group mean comparisons on the total score therefore confound true latent differences with measurement artefact.
Hypothesis for differential functioning: cultural norms about expressing hopelessness or worthlessness may differ. In some cultural contexts, endorsing "felt worthless" may carry greater stigma, leading to systematically lower endorsement at the same latent distress level. Alternatively, translation may anchor response categories differently.
Fit is not identification
Good fit means the model reproduces the observed covariance matrix. It does not establish causal direction. Multiple causal structures can generate the same covariance pattern.
Reflective DAG: $\eta \to X_1, \eta \to X_2, \ldots, \eta \to X_6$. The latent variable causes the indicators. Formative DAG: $X_1 \to \eta, X_2 \to \eta, \ldots, X_6 \to \eta$. The indicators cause the composite. Both can produce identical fit statistics for a single-factor solution.
The choice matters for downstream causal inference. If the construct is reflective and we use it as a confounder, we assume the latent variable is the true common cause. If the construct is actually formative (a composite of independent causes), conditioning on the composite may not block the backdoor paths we intend to close, because each component may have a different causal relationship with treatment and outcome.
Measurement as an identification problem
Scalar/threshold non-invariance means the same response pattern corresponds to different latent levels across groups. Put differently, the mapping from the latent outcome $Y^*$ to the measured outcome $Y$ depends on group membership. This matters for CATE because CATE is defined by a group contrast. If measurement differs by group, the estimated heterogeneity can be measurement artefact. This can happen even when exchangeability and positivity hold for $Y^*$, because the analysis uses $Y$, not $Y^*$.
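A minimal numerical sketch of this point, under invented assumptions: latent distress $Y^*$ is standard normal without treatment, treatment shifts it by $-0.5$ in both groups, and a single item is endorsed when $Y^*$ exceeds a group-specific threshold. The thresholds (0 and 1.5) are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()
LATENT_EFFECT = -0.5  # assumed shift in latent distress Y*, identical in both groups
thresholds = {"group_a": 0.0, "group_b": 1.5}  # hypothetical endorsement thresholds

def endorsement_probability(latent_mean, threshold):
    """P(item endorsed) when Y* ~ Normal(latent_mean, 1) and endorsement needs Y* > threshold."""
    return 1 - std_normal.cdf(threshold - latent_mean)

observed_effect = {
    group: endorsement_probability(LATENT_EFFECT, cut) - endorsement_probability(0.0, cut)
    for group, cut in thresholds.items()
}
for group, value in observed_effect.items():
    print(f"{group}: change in endorsement probability {value:+.3f}")
```

The latent effect is the same in both groups, yet the effect on the measured item differs several-fold. An analysis of $Y$ would report effect heterogeneity that is produced entirely by the measurement function.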
Example intuition: population A is distressed and population B is not. In A, "worthlessness" may be caused by unemployment. In B, it may be rare and have different causes. The same K6 item can have different causal parents across groups. The factor structure and item means can therefore differ without any change in "true distress".
Counter to "validated in hundreds of studies": most validation evidence is about internal consistency, short-term stability, and associations with other variables. Those are associational properties. They do not establish that the items have the same meaning, the same causes, or the same measurement function across the particular groups you want to compare. A scale can be reliable within each group and still be non-comparable across groups.
Proposed workflow step (between DAG and estimation): write down a measurement submodel as causal assumptions. State whether your estimand is the effect on reported K6 ($Y$) or the effect on the underlying state ($Y^*$). Then draw a measurement DAG that makes explicit what causes item responses in each group, including stigma, translation, and response norms. Decide what design or data would support those assumptions. If you choose to run measurement invariance tests, treat them as descriptive stress tests of a specific reflective model, not as evidence that measurement is causally comparable. If the stress test fails, the honest conclusion is that your group comparison is not identified without stronger measurement assumptions or better data.