Suggested Answers: Pair Exercises
These are brief suggested answers for the pair exercises embedded in weekly lectures. They are intended as discussion guides, not definitive solutions. Many exercises are deliberately open-ended.
Week 1
Formulating a contrast
A well-formed causal question might be: "Would replacing two hours of nightly screen time with two hours of reading reduce sleep onset latency (in minutes) over four weeks among 14-to-16-year-olds in Aotearoa New Zealand?" Both sides of the contrast are specified (screen time versus reading), the outcome is defined (sleep onset latency), the time horizon is stated (four weeks), and the target population is named (14-to-16-year-olds in Aotearoa NZ). Common critique points: "screen time" is vague (passive scrolling? gaming? messaging?), "poor sleep" needs a measurable operationalisation, and "teenagers" lumps heterogeneous age groups.
Three problems in one claim
- Definitional clarity: "religion" could mean attendance, belief, prayer, or community membership. "Mental health" could mean depression, life satisfaction, anxiety, or a composite. Neither side of the contrast is specified.
- Population specificity: the answer may differ between adolescents and older adults, between countries with majority-religion norms and secular societies, or between denominations.
- Unobservability: we cannot observe the same person both practising and not practising religion. The individual causal effect is missing by construction.
A rewrite: "Among adults aged 40-65 in Aotearoa New Zealand, would initiating weekly religious service attendance (versus maintaining no attendance) reduce depressive symptoms (PHQ-9 score) over 12 months?"
Week 2
Naming the structure
- Fork. SES causes both neighbourhood quality and health outcomes: neighbourhood $\leftarrow$ SES $\to$ health. Neighbourhood and health are marginally associated (through SES). Conditioning on SES removes the association.
- Chain. Drug $\to$ inflammation $\to$ pain. Drug and pain are marginally associated (through the mediating path). Conditioning on inflammation blocks the path and removes the association between drug and pain.
- Collider. Genetics $\to$ BP $\leftarrow$ diet. Genetics and diet are marginally independent (neither causes the other). Conditioning on blood pressure opens a spurious association: among people with the same BP, knowing genetic risk tells you something about diet (they must compensate).
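The collider case can be checked with a quick simulation. The sketch below is a hypothetical linear model (variable names and coefficients are invented for illustration): genetics and diet are generated independently, yet restricting to a narrow blood-pressure stratum induces a strong negative correlation between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical collider: genetics -> bp <- diet; genetics and diet independent
genetics = rng.normal(size=n)
diet = rng.normal(size=n)
bp = genetics + diet + rng.normal(scale=0.5, size=n)

r_marginal = np.corrcoef(genetics, diet)[0, 1]

# "Conditioning" on blood pressure: restrict to a narrow stratum of similar BP
stratum = np.abs(bp) < 0.1
r_conditional = np.corrcoef(genetics[stratum], diet[stratum])[0, 1]

print(round(r_marginal, 3))     # near 0: marginally independent
print(round(r_conditional, 3))  # strongly negative inside the stratum
```

Within the stratum, a high genetic risk must be offset by a favourable diet (and vice versa), which is exactly the compensation intuition above.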
Checking assumptions against a causal DAG
In the observational design, parental consent ($L$) is driven by SES ($U$), and $U$ also affects polio risk ($Y$). The backdoor path $A \leftarrow L \leftarrow U \to Y$ is open. Exchangeability fails: $Y(a) \cancel\coprod A$.
In the randomised design, $A$ is assigned by a chance mechanism ($\mathcal{R}$) that is independent of $U$ and $L$. The backdoor path through $L$ and $U$ is severed because $A$ no longer depends on $L$. Exchangeability holds: $Y(a) \coprod A$.
Positivity is more likely to fail in the observational design: some SES strata may have near-universal consent or refusal, leaving no comparison group.
Neurath's ship and your own causal DAG
Answers vary by discipline. The key check is whether the partner can identify a fork (common cause generating spurious association) and a chain (mediating path). The sceptic's challenge should propose either a reversed arrow or a missing common cause that would change the adjustment strategy.
Week 3
Applying the backdoor criterion
Paths from $A$ to $Y$: (1) $A \to M \to Y$ (causal, through mediator); (2) $A \leftarrow L_1 \to Y$ (backdoor through health consciousness); (3) $A \leftarrow L_1 \to L_2 \to Y$ (backdoor through health consciousness and diet).
$\{L_1\}$ satisfies the backdoor criterion: it blocks both backdoor paths (paths 2 and 3), and $L_1$ is not a descendant of $A$. Conditioning on $\{L_1\}$ supports exchangeability.
Adding $M$ violates the criterion because $M$ is a descendant of $A$ (it lies on the causal path $A \to M \to Y$). Conditioning on $M$ blocks part of the total effect we want to estimate.
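Under an assumed linear data-generating model consistent with this DAG (all coefficients invented for illustration), the three adjustment choices can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed linear SEM for the DAG: L1 -> A, L1 -> Y, A -> M -> Y
l1 = rng.normal(size=n)                      # health consciousness
a = 0.8 * l1 + rng.normal(size=n)            # treatment
m = 0.5 * a + rng.normal(size=n)             # mediator
y = 0.5 * m + 0.7 * l1 + rng.normal(size=n)  # true total effect of A: 0.5*0.5 = 0.25

def ols_coef(outcome, regressors):
    X = np.column_stack([np.ones(n)] + regressors)
    return np.linalg.lstsq(X, outcome, rcond=None)[0][1]  # coef on first regressor

b_unadj = ols_coef(y, [a])       # biased upward: backdoor through L1 is open
b_adj = ols_coef(y, [a, l1])     # ~0.25: backdoor blocked by L1
b_med = ols_coef(y, [a, l1, m])  # ~0: conditioning on M blocks the causal path
print(round(b_unadj, 2), round(b_adj, 2), round(b_med, 2))
```

Only the middle estimate recovers the total effect; the third illustrates why adding the mediator destroys it.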
M-bias in practice
The DAG: $U_1 \to A$ (attendance), $U_2 \to Y$ (giving), $U_1 \to L \leftarrow U_2$ (neighbourhood social capital is a collider of two unmeasured causes). Without conditioning on $L$, the path $A \leftarrow U_1 \to L \leftarrow U_2 \to Y$ is blocked at the collider $L$.
Conditioning on $L$ opens this path, creating a spurious association between $A$ and $Y$ through the unmeasured causes. "Adjust for all pre-treatment variables" fails because $L$ is pre-treatment but is a collider: conditioning on it opens, rather than closes, a biasing path.
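A minimal simulation of the M-structure (all coefficients illustrative) shows the same reversal: the unadjusted estimate is unbiased for the true null effect, while adjusting for the pre-treatment collider $L$ is not.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Assumed M-structure: U1 -> A, U1 -> L <- U2, U2 -> Y; true effect of A on Y is zero
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)
l = u1 + u2 + rng.normal(size=n)   # collider: neighbourhood social capital
a = u1 + rng.normal(size=n)        # attendance
y = u2 + rng.normal(size=n)        # giving

def ols_coef(outcome, regressors):
    X = np.column_stack([np.ones(n)] + regressors)
    return np.linalg.lstsq(X, outcome, rcond=None)[0][1]

b_raw = ols_coef(y, [a])       # ~0: the path is blocked at the collider
b_cond = ols_coef(y, [a, l])   # non-zero: conditioning on L opens the path
print(round(b_raw, 3), round(b_cond, 3))
```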
$R^2$ versus identification
$R^2$ measures variance explained, a statistical property. Confounding is a structural property of the DAG. A model with high $R^2$ can still be biased if the adjustment set includes a collider (opening a spurious path) or a mediator (blocking part of the causal path).
Example DAG where the larger set introduces bias: if Investigator A's set includes a variable $C$ that is a collider ($A \to C \leftarrow Y$), conditioning on $C$ opens a non-causal path and biases the estimate, despite improving $R^2$. Investigator B's smaller set $\{\text{age, conscientiousness}\}$ would satisfy the backdoor criterion if conscientiousness blocks all backdoor paths and is not a descendant of $A$.
Week 4
Classifying measurement error
- Type 1: independent, uncorrelated. The screen-time noise and the wellbeing noise do not share a common cause and neither is causally affected by the other variable. This typically attenuates toward the null.
- Type 3: dependent, uncorrelated. The treatment (bilingualism) causally affects how the outcome (cognitive performance) is measured, because the test instrument is language-dependent. The DAG shows $A \to$ measurement error node $\to Y^*$ (recorded outcome), opening a non-causal path from $A$ to $Y^*$.
- Type 2: independent, correlated. The shared translation team creates a common cause of errors in both measures. Neither variable's true value causes the other's measurement error, but the errors co-vary through the shared cause.
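The attenuation claim for Type 1 error can be illustrated with a sketch (the true slope and noise scales are invented): with independent, uncorrelated noise added to both measures, the estimated slope shrinks by the classical reliability ratio, here $1/(1+1) = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Type 1 (independent, uncorrelated) measurement error; numbers are illustrative
a_true = rng.normal(size=n)
y_true = 0.5 * a_true + rng.normal(size=n)  # true slope 0.5
a_obs = a_true + rng.normal(size=n)         # reliability 1/(1+1) = 0.5
y_obs = y_true + rng.normal(size=n)         # outcome noise leaves the slope unbiased

slope_true = np.polyfit(a_true, y_true, 1)[0]
slope_obs = np.polyfit(a_obs, y_obs, 1)[0]
print(round(slope_true, 2), round(slope_obs, 2))  # ~0.5 versus ~0.25
```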
Collider bias versus confounding
The DAG: depression ($A$) $\to$ ward admission ($C$) $\leftarrow$ injury severity $\to$ recovery ($Y$). Marginally, $A$ and $Y$ may be independent (or associated only through a causal path). Restricting to admitted patients conditions on $C$, opening the path $A \to C \leftarrow$ injury severity $\to Y$.
This is not confounding. Confounding requires an open backdoor path through a common cause (e.g., $A \leftarrow L \to Y$). Here, the path was blocked before conditioning. Conditioning on the collider $C$ actively opens a previously blocked path. Among admitted patients, less depressed individuals tend to have more severe injuries (otherwise they would not have been admitted), so depressed admitted patients tend to have milder injuries and recover better, creating a spurious association that makes depression appear protective for recovery.
Design fix: analyse all eligible patients regardless of admission status, or use inverse probability weighting to account for selection into the hospital sample.
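A simulation sketch (hypothetical model in which depression truly has no effect on recovery) makes the mechanism concrete: marginally the two are independent, but restricting to admitted patients induces an association through injury severity.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Hypothetical model: depression -> admission <- severity -> recovery,
# and depression has NO causal effect on recovery
depression = rng.normal(size=n)
severity = rng.normal(size=n)                                  # independent of depression
admitted = depression + severity + rng.normal(scale=0.5, size=n) > 1.0
recovery = -severity + rng.normal(size=n)                      # severity worsens recovery

r_all = np.corrcoef(depression, recovery)[0, 1]
r_admitted = np.corrcoef(depression[admitted], recovery[admitted])[0, 1]
print(round(r_all, 3))       # ~0 in the full population
print(round(r_admitted, 3))  # clearly non-zero among the admitted
```

Analysing all eligible patients (the first correlation) is exactly the design fix described above.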
Auditing a study for two failure modes
Selection bias: university mailing list recruitment acts as a filter. Academic motivation and language confidence jointly affect enrolment, making the analytic sample unrepresentative. If motivation or confidence also relates to bilingualism or cognitive outcomes, conditioning on sample membership distorts the contrast.
Measurement bias: type 3 (dependent, uncorrelated). The treatment (bilingualism) causally affects how the English-only cognitive test measures the outcome. Non-English-dominant bilinguals are systematically mismeasured, and this mismeasurement depends on treatment status.
Week 5
Building a potential outcomes table
The key distinction is between the hidden science and the observed data. In the hidden science, each student has both potential outcomes and an individual effect. In the observed data, one potential outcome and hence $\delta_i$ are missing for every student.
One possible hidden science table is:
| $i$ | $Y_i(1)$ | $Y_i(0)$ | $\delta_i$ |
|---|---|---|---|
| 1 | 0 | 1 | $-1$ |
| 2 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 |
| 4 | 1 | 1 | 0 |
If treatment assignment is $A_1=A_2=1$ and $A_3=A_4=0$, the observed-data table is:
| $i$ | $Y_i(1)$ | $Y_i(0)$ | $\delta_i$ | $A_i$ | $Y_i^{\text{obs}}$ |
|---|---|---|---|---|---|
| 1 | 0 | NA | NA | 1 | 0 |
| 2 | 0 | NA | NA | 1 | 0 |
| 3 | NA | 0 | NA | 0 | 0 |
| 4 | NA | 1 | NA | 0 | 1 |
The true ATE in the hidden science is $(-1 + 0 + 1 + 0)/4 = 0$. The naive observed difference in means is $\bar{Y}_{A=1}^{\mathrm{obs}} - \bar{Y}_{A=0}^{\mathrm{obs}} = 0 - 0.5 = -0.5$. The discrepancy arises because treatment assignment is not random with respect to the potential outcomes: students 1 and 2, who received $A=1$, differ in their hidden counterfactual outcomes from students 3 and 4, who received $A=0$. Exchangeability does not hold.
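The two tables can be reproduced in a few lines; this sketch simply encodes the hidden-science table and the assignment from the text:

```python
import numpy as np

# Hidden-science table from the text: rows are students 1-4
y1 = np.array([0, 0, 1, 1])  # Y_i(1)
y0 = np.array([1, 0, 0, 1])  # Y_i(0)
a = np.array([1, 1, 0, 0])   # assignment A_1 = A_2 = 1, A_3 = A_4 = 0

ate_true = np.mean(y1 - y0)       # uses both potential outcomes (the hidden science)
y_obs = np.where(a == 1, y1, y0)  # the analyst sees only one outcome per student
naive = y_obs[a == 1].mean() - y_obs[a == 0].mean()
print(ate_true, naive)  # 0.0 -0.5
```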
Tracing the identification logic
The claim "students who chose the mindfulness app had lower anxiety, therefore the app works" compares $\mathbb{E}[Y \mid A=1]$ with $\mathbb{E}[Y \mid A=0]$ and interprets the difference causally.
Consistency is questionable if "used the app" pools different versions of treatment under one label: different apps, different session lengths, different start dates, or irregular adherence. Multiple versions undermine the link between $A_i = 1$ and a well-defined $Y_i(1)$.
Exchangeability is the most plausible violated assumption: students who chose the app may have differed from non-users in baseline anxiety, motivation, help-seeking, or available time. The treated group may therefore have had different counterfactual outcomes even without the app, so $Y(0) \cancel\coprod A$.
Positivity may also fail: in some covariate strata, such as students with very high workload or students already receiving intensive therapy, almost no one may choose one side of the contrast, leaving no meaningful comparison group in those strata.
Designing a target trial
Causal estimand: the average difference in anxiety symptoms (e.g., GAD-7 score) at 6 months if all university students practised 20 minutes of daily meditation versus if all maintained their current routine (no meditation).
Time zero: the date of programme enrolment (or randomisation in the target trial).
Two baseline covariates with causal rationale: (1) baseline anxiety (GAD-7 at enrolment), because prior anxiety affects both the decision to meditate and the outcome; (2) academic workload (full-time vs part-time enrolment), because workload affects adherence to meditation and anxiety levels.
Positivity failure: students with severe clinical anxiety may be referred to treatment rather than a meditation programme, so the stratum "severe baseline anxiety" may contain no one in the meditation arm.
Week 6
Interaction versus effect modification
The causal estimand for interaction requires four potential outcomes: $Y(a=1,g=\text{young})$, $Y(a=1,g=\text{old})$, $Y(a=0,g=\text{young})$, $Y(a=0,g=\text{old})$. This is conceptually odd because we cannot intervene on age.
The causal estimand for effect modification involves one intervention (exercise) with subgroup contrasts: $\mathbb{E}[Y(1)-Y(0) \mid G=\text{young}]$ versus $\mathbb{E}[Y(1)-Y(0) \mid G=\text{old}]$. This is the design that matches the study.
The regression interaction term could be non-zero without causal modification if, for example, the linear specification is wrong (the true effect varies non-linearly with a confounder correlated with age), or if age is a collider or descendant of a collider in the DAG.
Why conditioning changes effect modification
Even without a direct $G \to Y$ path, the CATE varies by age because $G$ (age) affects $L$ (fitness), and if the treatment effect varies with $L$, then the distribution of $L$ within age strata determines the subgroup average. Different age groups have different fitness distributions, so $\tau(g)$ differs.
The colleague's null interaction conclusion is premature: the regression test depends on the conditioning set and the functional form. A non-significant $A \times G$ term in a linear model does not rule out effect modification visible with a richer specification or different conditioning set.
Two apparent modifiers could vanish together if both $G_1$ and $G_2$ are proxies for the same underlying variable $L$. Each captures part of the variation in $L$; conditioning on both accounts for $L$ fully, and the residual variation in treatment effect disappears.
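The first point above can be simulated under an assumed linear model (invented coefficients): there is no direct $G \to Y$ term, yet the subgroup CATEs differ because $G$ shifts the fitness distribution.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000

# Invented linear model: no direct G -> Y path; G shifts fitness L,
# and the treatment effect grows with L
g = rng.integers(0, 2, size=n)        # 0 = old, 1 = young
l = g + rng.normal(size=n)            # young people are fitter on average
a = rng.integers(0, 2, size=n)        # randomised exercise
y = (1 + l) * a + rng.normal(size=n)  # effect = 1 + L; no G term in the outcome

def cate(group):
    sub = g == group
    return y[sub & (a == 1)].mean() - y[sub & (a == 0)].mean()

c_young, c_old = cate(1), cate(0)
print(round(c_young, 1), round(c_old, 1))  # subgroup effects differ (~2 versus ~1)
```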
From average to subgroup
If 60% of participants have CATE = 8 and 40% have CATE = $-2$: ATE = $0.6 \times 8 + 0.4 \times (-2) = 4.8 - 0.8 = 4.0$ mmHg. Adjusting the proportions to 50% with CATE = 8 and 50% with CATE = $-2$ gives ATE = $0.5 \times 8 - 0.5 \times 2 = 3$ mmHg. In either scenario, a policy-maker who sees only the positive ATE misses that 40-50% of participants are harmed (blood pressure increases by 2 mmHg).
The claim "$\hat{\tau}(X_i) = 8$ means the programme will reduce my blood pressure by 8" confuses an estimated subgroup average with an individual effect. $\hat{\tau}(X_i) = 8$ estimates the average effect for everyone sharing person $i$'s measured profile. Person $i$'s true individual effect $Y_i(1) - Y_i(0)$ is unobservable and could be larger, smaller, or opposite in sign.
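The share-weighted arithmetic above generalises to any partition into subgroups; a small helper (illustrative) makes the dependence on subgroup shares explicit:

```python
# ATE as the share-weighted average of subgroup CATEs (numbers from the example)
def ate(shares, cates):
    assert abs(sum(shares) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(s * c for s, c in zip(shares, cates))

print(round(ate([0.6, 0.4], [8, -2]), 1))  # 4.0
print(round(ate([0.5, 0.5], [8, -2]), 1))  # 3.0
```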
Week 8
From tree to forest to causal forest
A single regression tree is interpretable (you can read the decision path), but unstable: small changes in the data shift splits and predictions (high variance).
Averaging many trees (a forest) reduces variance. Each tree's idiosyncratic splits cancel out, producing smoother, more reliable predictions.
Two differences in a causal forest: (a) the target is $\tau(x) = \mathbb{E}[Y(1)-Y(0) \mid X=x]$, a treatment contrast, not a prediction of $Y$; (b) honest splitting uses one subsample to choose splits and a separate subsample to estimate contrasts within leaves. Honest splitting is necessary because treatment contrasts require estimating quantities under two exposures, only one of which is observed per person. Using the same data for splitting and estimation would overfit to noise in the individual-level contrasts.
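Honest splitting can be sketched on toy data (everything below, including the single candidate split, is invented for illustration): one half of the sample picks the threshold that maximises the between-leaf treatment contrast, and the held-out half estimates the leaf effects.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000

# Toy data: the true effect is +2 when x > 0 and 0 otherwise
x = rng.normal(size=n)
a = rng.integers(0, 2, size=n)                 # randomised treatment
y = np.where(x > 0, 2.0, 0.0) * a + rng.normal(size=n)

def leaf_effect(idx, mask):
    sel = idx[mask[idx]]
    return y[sel][a[sel] == 1].mean() - y[sel][a[sel] == 0].mean()

# Honest splitting: the first half chooses the threshold, the second estimates
split_idx, est_idx = np.arange(n // 2), np.arange(n // 2, n)
candidates = np.quantile(x[split_idx], np.linspace(0.1, 0.9, 17))
best = max(candidates,
           key=lambda c: abs(leaf_effect(split_idx, x <= c)
                             - leaf_effect(split_idx, x > c)))

# Leaf contrasts estimated on the held-out half
eff_left = leaf_effect(est_idx, x <= best)
eff_right = leaf_effect(est_idx, x > best)
print(round(best, 2), round(eff_left, 2), round(eff_right, 2))  # split near 0; ~0 versus ~2
```

Reusing the first half for both steps would let the chosen threshold chase noise in that half's contrasts, which is the overfitting that honesty avoids.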
Reading a TOC curve
A steep initial rise means treatment gains are heavily concentrated among the top-ranked individuals. The programme helps some people a lot but most people only a little.
Large AUTOC but small Qini at $q=0.3$ means that gains concentrate in a very narrow top slice (perhaps the top 5-10%), and by the time you expand to 30% coverage, the additional individuals contribute little. For a decision-maker with a 30% budget, the targeting advantage over random allocation is small.
Computing the TOC curve on training data overfits: the forest's rankings are optimised for the training sample, so in-sample evaluation inflates the apparent targeting value. Honest evaluation requires held-out or cross-fitted data.
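The concentration story behind "large AUTOC, small Qini at $q=0.3$" can be sketched numerically (all effect sizes hypothetical): when gains sit in a narrow top slice, the average effect among the targeted dilutes quickly as coverage expands from 5% to 30%.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000

# Invented effects: 5% of people benefit a lot, everyone else a little
tau = np.where(rng.uniform(size=n) < 0.05, 10.0, 0.5)
score = tau + rng.normal(scale=0.5, size=n)   # a noisy targeting score
order = np.argsort(-score)

# Average effect among the top-q fraction: a discrete version of the TOC height at q
def toc(q):
    k = round(q * n)
    return tau[order[:k]].mean()

mean_05, mean_30 = toc(0.05), toc(0.30)
print(round(mean_05, 1), round(mean_30, 1))  # large at 5%, diluted at 30%
```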
Should we target?
The Qini addresses the causal estimand: "does treating the top $q$ fraction (ranked by estimated treatment effect) yield greater total benefit than treating a random $q$ fraction?" It goes beyond the ATE by asking whether benefits are concentrated enough to justify selective allocation.
Two non-statistical reasons not to target: (1) logistical feasibility (screening and scoring may cost more than universal provision); (2) stigma or fairness concerns (singling out individuals with high loneliness scores may be perceived as labelling).
Response to the stigma concern: "The concern about stigmatisation is legitimate and must shape how targeting is implemented. The evidence shows that some students benefit substantially more than others, but it does not mandate that selection criteria be disclosed or that participation be compulsory. A self-referral design using the targeting criteria as capacity planning could capture most of the benefit without labelling individuals."
Week 9
Designing a depth-2 policy rule
Example: split first on deprivation index (high vs low), then split the high-deprivation leaf on baseline loneliness (high vs low). Treat the "high deprivation, high loneliness" leaf. If roughly 40% of the population is high-deprivation and 50% of those are high-loneliness, the treated group is approximately 20%.
Two reasons to prefer depth-2 over depth-4: (1) a depth-2 tree has at most 4 leaves and 3 yes/no questions, which a policy-maker or clinician can explain in a sentence; (2) deeper trees split the sample into smaller subgroups, reducing the number of observations per leaf and increasing the variance of the estimated policy value.
Equity audit
A deprivation split indirectly stratifies by ethnicity because in Aotearoa NZ, Māori and Pasifika populations are disproportionately represented in high-deprivation areas due to historical and structural inequities. A rule that targets high deprivation will differentially affect these groups.
Applying governance checks: (1) "Who gains and who loses?" The high-deprivation-under-40 group gains access; everyone else is excluded. If the excluded group includes high-deprivation people over 40 who also benefit, the rule creates an age-based inequity within disadvantaged communities. (2) "Can affected communities understand and contest the rule?" A depth-2 tree is transparent enough to explain, but communities need a mechanism to challenge the split variables and thresholds.
Te Tiriti modification: guarantee a minimum allocation floor for Māori regardless of the tree's splits, ensuring that the algorithmic rule does not reduce access below current levels for tangata whenua.
"The algorithm is objective because it only uses data" is incorrect. The algorithm optimises a chosen objective function on data that reflect historical inequities. Structural disadvantage is encoded in the variables. Objectivity in computation does not imply fairness in outcomes.
Policy tree versus ranking
(a) Estimated policy value: Strategy A (pure ranking) typically achieves equal or slightly higher policy value because it uses the full granularity of $\hat{\tau}(X_i)$. Strategy B loses some value by collapsing to a few leaves.
(b) Explainability: Strategy B is far more explainable. A depth-2 tree is a short set of if-then rules. Strategy A requires explaining a continuous score derived from thousands of overlapping tree splits.
(c) Ability to answer "why was I selected?": Strategy B can give a clear answer ("because your deprivation index is above 8 and your loneliness score is above the median"). Strategy A can only say "because your estimated benefit score was in the top 20%," which is opaque.
Strategy A is preferable when the decision is internal (e.g., a research team allocating limited follow-up resources) and does not require public justification. Strategy B is preferable when the rule must be defended publicly, contested by affected communities, or implemented by non-technical staff.
Week 10
Interpreting invariance results
Configural invariance means the same items load on the same factors in both groups (same pattern of zero and non-zero loadings). Metric invariance means the factor loadings are equal across groups (a one-unit increase in the latent variable produces the same change in item responses). Scalar/threshold invariance means the intercepts (or thresholds for ordinal items) are equal, so the same latent level produces the same expected response.
Failing scalar/threshold invariance means that even at the same latent distress level, the two groups endorse "felt hopeless" and "felt worthless" differently. A one-unit difference in total score does not correspond to the same latent difference across groups. Group mean comparisons on the total score therefore confound true latent differences with measurement artefact.
Hypothesis for differential functioning: cultural norms about expressing hopelessness or worthlessness may differ. In some cultural contexts, endorsing "felt worthless" may carry greater stigma, leading to systematically lower endorsement at the same latent distress level. Alternatively, translation may anchor response categories differently.
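A numeric sketch of scalar non-invariance (loadings and intercept shift invented): two groups with identical latent distributions differ on the observed item purely through an intercept shift, so an item-mean comparison manufactures a group difference.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# Invented example: identical latent distress distributions in groups A and B,
# but group B's item has a shifted intercept (scalar non-invariance)
eta_a = rng.normal(size=n)
eta_b = rng.normal(size=n)
item_a = 0.8 * eta_a + rng.normal(scale=0.5, size=n)
item_b = 0.8 * eta_b - 0.4 + rng.normal(scale=0.5, size=n)

latent_diff = eta_a.mean() - eta_b.mean()
item_diff = item_a.mean() - item_b.mean()
print(round(latent_diff, 2), round(item_diff, 2))  # ~0 versus ~0.4
```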
Fit is not identification
Good fit means the model reproduces the observed covariance matrix. It does not establish causal direction. Multiple causal structures can generate the same covariance pattern.
Reflective DAG: $\eta \to X_1, \eta \to X_2, \ldots, \eta \to X_6$. The latent variable causes the indicators. Formative DAG: $X_1 \to \eta, X_2 \to \eta, \ldots, X_6 \to \eta$. The indicators cause the composite. Both can produce identical fit statistics for a single-factor solution.
The choice matters for downstream causal inference. If the construct is reflective and we use it as a confounder, we assume the latent variable is the true common cause. If the construct is actually formative (a composite of independent causes), conditioning on the composite may not block the backdoor paths we intend to close, because each component may have a different causal relationship with treatment and outcome.
Measurement as an identification problem
Scalar/threshold non-invariance means the same response pattern corresponds to different latent levels across groups. Put differently, the mapping from the latent outcome $Y^*$ to the measured outcome $Y$ depends on group membership. This matters for CATE because CATE is defined by a group contrast. If measurement differs by group, the estimated heterogeneity can be measurement artefact. This can happen even when exchangeability and positivity hold for $Y^*$, because the analysis uses $Y$, not $Y^*$.
Example intuition: population A is distressed and population B is not. In A, "worthlessness" may be caused by unemployment. In B, it may be rare and have different causes. The same K6 item can have different causal parents across groups. The factor structure and item means can therefore differ without any change in "true distress".
Counter to "validated in hundreds of studies": most validation evidence is about internal consistency, short-term stability, and associations with other variables. Those are associational properties. They do not establish that the items have the same meaning, the same causes, or the same measurement function across the particular groups you want to compare. A scale can be reliable within each group and still be non-comparable across groups.
Proposed workflow step (between DAG and estimation): write down a measurement submodel as causal assumptions. State whether your estimand is the effect on reported K6 ($Y$) or the effect on the underlying state ($Y^*$). Then draw a measurement DAG that makes explicit what causes item responses in each group, including stigma, translation, and response norms. Decide what design or data would support those assumptions. If you choose to run measurement invariance tests, treat them as descriptive stress tests of a specific reflective model, not as evidence that measurement is causally comparable. If the stress test fails, the honest conclusion is that your group comparison is not identified without stronger measurement assumptions or better data.