Test 2 Practice Question Answers

Use these answers to check your own work after you have tried the practice questions. Strong answers can be phrased in different ways. The key is to show the causal logic clearly.

Heterogeneous Treatment Effects

  1. The average treatment effect (ATE) is the mean contrast $\mathbb{E}[Y(1)-Y(0)]$ in the target population. The conditional average treatment effect (CATE) is $\mathbb{E}[Y(1)-Y(0)\mid X=x]$, the average contrast within people who share covariates $x$. The distinction matters when the average hides large differences in who benefits.

  2. No. Non-significant age and gender interactions only rule out those two pre-specified linear interactions with the available power and model form. Heterogeneity may involve other variables, non-linear patterns, or high-dimensional combinations.

  3. Honest splitting separates the data used to choose tree splits from the data used to estimate effects within leaves. It reduces overfitting and prevents the forest from exaggerating heterogeneity by using the same noise twice.

  4. A doubly robust estimator remains consistent if either the outcome model or the treatment/propensity model is correctly specified. The practical advantage is protection against one form of model misspecification.

  5. RATE-AUTOC = 0.04 with CI $[0.01,0.07]$ suggests the forest's ranking describes useful treatment-effect heterogeneity: people ranked higher appear to benefit more than people ranked lower. The same estimate with CI $[-0.02,0.10]$ is inconclusive: the point estimate is positive, but the data are compatible with no useful targeting signal. Neither result identifies the causal source of the heterogeneity or shows that any splitter is itself a cause.

  6. A Qini curve plots the treatment share targeted on the $x$-axis and cumulative gain over a baseline allocation on the $y$-axis. When targeting helps, the curve rises steeply above the diagonal among the first people targeted. When targeting does not help, it stays close to the diagonal.

  7. Sparse regions have few comparable treated and untreated observations. The forest must extrapolate more, leaves contain less information, and the CATE estimate has higher variance.

  8. Regression only tests the interactions the investigator specifies. A causal forest can search for non-linear and high-dimensional heterogeneity that a small regression interaction set would miss, though the forest still needs validation.
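
The double robustness described in answer 4 can be checked with a small simulation. This is a minimal sketch, not the course's implementation: the data-generating process, the function names `aipw_ate` and `simulate`, and the choice to plug in known (not estimated) nuisance functions are all assumptions made for illustration. It uses the augmented inverse-probability-weighted (AIPW) score, one common doubly robust estimator.

```python
import random


def aipw_ate(y, a, e_hat, mu1_hat, mu0_hat):
    """Augmented inverse-probability-weighted (AIPW) estimate of the ATE.

    Each unit contributes the outcome-model contrast plus a
    propensity-weighted correction built from the outcome-model residual.
    """
    scores = [
        m1 - m0 + ai * (yi - m1) / ei - (1 - ai) * (yi - m0) / (1 - ei)
        for yi, ai, ei, m1, m0 in zip(y, a, e_hat, mu1_hat, mu0_hat)
    ]
    return sum(scores) / len(scores)


def simulate(n, seed=0):
    """Hypothetical data: one covariate x, confounded treatment, true ATE = 1.0."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    e = [0.2 + 0.6 * xi for xi in x]                 # true propensity score
    a = [1 if rng.random() < ei else 0 for ei in e]  # treatment depends on x
    y = [2.0 * xi + 1.0 * ai + rng.gauss(0, 1) for xi, ai in zip(x, a)]
    return x, e, a, y


x, e_true, a, y = simulate(50_000)
n = len(y)

mu1_true = [2.0 * xi + 1.0 for xi in x]  # correct outcome model under treatment
mu0_true = [2.0 * xi for xi in x]        # correct outcome model under control
zeros = [0.0] * n                        # deliberately wrong outcome model
half = [0.5] * n                         # deliberately wrong propensity model

ate_outcome_wrong = aipw_ate(y, a, e_true, zeros, zeros)
ate_propensity_wrong = aipw_ate(y, a, half, mu1_true, mu0_true)
ate_both_wrong = aipw_ate(y, a, half, zeros, zeros)

print(round(ate_outcome_wrong, 2),
      round(ate_propensity_wrong, 2),
      round(ate_both_wrong, 2))
```

Under these assumptions the first two estimates should land near the true value of 1.0, while the third is biased because neither model is correct. In practice both nuisance models are estimated from data, usually with cross-fitting, which this sketch leaves out.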

Policy Trees, Fairness, and Judgement

  1. A policy tree converts estimated treatment-effect heterogeneity into a simple allocation rule: for these covariates, treat; for those covariates, do not treat. A CATE estimate describes the expected treatment contrast at $X=x$. Policy learning compares whole allocation rules by their expected utility or policy value if the rule were applied to the target population.

  2. Fit the shallower and deeper trees as candidates, then prefer the simpler tree unless the deeper tree's held-out policy-value point gain clears the prespecified threshold. Uncertainty and stability guide how cautiously to interpret a depth-2 rule that clears the threshold; whether the intervals overlap is not the selection rule. Simpler trees are easier to explain, more stable, and less likely to be misapplied.

  3. Offer the programme to residents whose deprivation index is above 1.2 and who are aged 40 or younger.

  4. A strong fairness check could ask: what share of each relevant social group receives the programme; whether split variables are proxies for sensitive characteristics; whether people with similar need receive similar access; whether the rule is stable across resamples; and whether there is an override or appeal pathway.

  5. Deprivation is correlated with many social conditions, such as income, housing, neighbourhood resources, age, health, and sometimes ethnicity. A rule that uses deprivation may therefore allocate treatment unevenly across groups even if group membership is not in the tree. The split is descriptive: deprivation may be the strongest measured marker even if the root causes of the heterogeneity lie upstream, for example in housing policy, labour-market exclusion, discrimination, or other causes of deprivation.

  6. Statistical evidence can estimate expected benefits, uncertainty, and subgroup patterns. It cannot decide which public value should govern allocation: maximising total benefit, equal access, need, individual choice, cost control, or another principle.

  7. An organiser might override the rule for someone assigned "do not treat" who is in acute crisis or whose situation changed after baseline measurement. The organiser has current, relational, and contextual information that the model does not contain.

  8. A ranking rule can treat the top 20% by $\hat{\tau}(x)$ when a budget fixes the treatment share. It uses the full continuous CATE score, so the first task is estimating person-level treatment contrasts. The course policy-tree workflow solves a different problem: it estimates the utility or policy value of simple if-then allocation rules, fits depth-1 and depth-2 candidates, and chooses depth-2 only if the held-out policy-value point gain clears the prespecified threshold. The treated share is whatever the selected rule implies. Ranking may be preferable when trained staff can use a score safely, a fixed budget must be met exactly, and the extra policy value is large enough to justify lower transparency.
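
The contrast in answer 8 between a budget-fixed ranking rule and an if-then rule can be made concrete with a short sketch. The ten people, their `tau_hat` scores, and the group labels are hypothetical values invented for illustration, and the function names are assumptions. The tree rule encodes the plain-language rule from answer 3.

```python
def top_share_rule(tau_hat, share):
    """Treat the top `share` of people ranked by estimated CATE (budget fixed)."""
    n_treat = round(share * len(tau_hat))
    # Indices sorted from largest to smallest estimated benefit.
    order = sorted(range(len(tau_hat)), key=lambda i: tau_hat[i], reverse=True)
    treated = set(order[:n_treat])
    return [i in treated for i in range(len(tau_hat))]


def tree_rule(deprivation, age):
    """If-then rule from answer 3: deprivation index above 1.2 and aged 40 or younger."""
    return [d > 1.2 and g <= 40 for d, g in zip(deprivation, age)]


def treated_share_by_group(treat, group):
    """Share treated within each group label (a basic fairness summary)."""
    totals, treated = {}, {}
    for t, g in zip(treat, group):
        totals[g] = totals.get(g, 0) + 1
        treated[g] = treated.get(g, 0) + int(t)
    return {g: treated[g] / totals[g] for g in totals}


# Hypothetical people: (tau_hat, deprivation index, age, group label).
people = [
    (0.30, 1.5, 35, "A"), (0.25, 1.4, 52, "A"), (0.20, 1.3, 28, "B"),
    (0.15, 0.9, 33, "B"), (0.10, 1.6, 39, "A"), (0.08, 0.7, 61, "B"),
    (0.05, 1.1, 45, "A"), (0.03, 0.5, 30, "B"), (0.02, 1.25, 40, "A"),
    (0.01, 0.8, 22, "A"),
]
tau_hat = [p[0] for p in people]
deprivation = [p[1] for p in people]
age = [p[2] for p in people]
group = [p[3] for p in people]

ranked = top_share_rule(tau_hat, share=0.20)  # treated share is set by the budget
tree = tree_rule(deprivation, age)            # treated share follows from the rule

print(sum(ranked) / len(ranked))
print(sum(tree) / len(tree))
print(treated_share_by_group(tree, group))
```

The ranking rule treats exactly the budgeted share, whereas the tree rule treats whoever satisfies its conditions. The group summary is the first item in the fairness check from answer 4; it describes who receives treatment and does not by itself say whether the allocation is fair.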

Outcome-Wide Reporting

  1. The four causal estimands are the effects of the same exposure on each outcome: purpose, belonging, self-esteem, and life satisfaction. Formally, $\mathrm{ATE}_k=\mathbb{E}[Y_k(1)-Y_k(0)]$ for $k=1,\ldots,4$.

  2. With four outcomes, the probability of at least one false-positive result is higher than 5%: about 18.5% if the four tests were independent and every null hypothesis were true. Bonferroni controls the family-wise error rate by testing each outcome at $\alpha=0.05/4=0.0125$, equivalent to reporting 98.75% confidence intervals for each outcome.

  3. An E-value of 2.0 means that an unmeasured confounder would need to be associated with both the exposure and the outcome by a risk ratio of at least 2.0 each, above and beyond measured covariates, to explain away the estimated effect. It is a sensitivity summary, not proof that confounding is absent.

  4. The exposure shows its clearest evidence for the two outcomes whose Bonferroni-adjusted intervals exclude zero. The other two estimates point in the same direction but remain uncertain after correcting for the four-outcome family. A non-specialist summary should emphasise the overall pattern, the stronger outcomes, and the fact that some outcomes remain compatible with no effect.
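
The arithmetic behind answers 2 and 3 can be sketched in a few lines. The function names are assumptions made for illustration. The E-value function uses the standard VanderWeele-Ding formula for a risk ratio, $\mathrm{RR}+\sqrt{\mathrm{RR}(\mathrm{RR}-1)}$; the input of 4/3 is an illustrative choice because it yields an E-value of exactly 2.0, not a result from any study.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_alpha(alpha, m):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / m


def fwer_if_independent(alpha, m):
    """Chance of at least one false positive across m independent true-null tests."""
    return 1 - (1 - alpha) ** m


def z_critical(alpha):
    """Two-sided normal critical value for significance level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def e_value(rr):
    """E-value for a risk ratio; ratios below 1 are inverted first."""
    if rr < 1:
        rr = 1 / rr
    return rr + sqrt(rr * (rr - 1))


alpha_each = bonferroni_alpha(0.05, 4)

print(alpha_each)                              # per-outcome alpha
print(round(fwer_if_independent(0.05, 4), 3))  # uncorrected family-wise error
print(round(z_critical(0.05), 2))              # multiplier for an unadjusted 95% interval
print(round(z_critical(alpha_each), 2))        # multiplier for a 98.75% interval
print(round(e_value(4 / 3), 2))                # risk ratio of 4/3 gives E-value 2.0
```

The larger critical value is why Bonferroni-adjusted intervals are wider than unadjusted ones, and why an estimate can exclude zero before correction but not after it.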

Measurement

  1. In a reflective model, the latent construct causes the indicators. This is awkward for causal inference if the latent variable is treated as a real causal object without a clear intervention. In a formative model, the indicators compose the construct. This is awkward because intervening on the construct is not the same as intervening on each component item.

  2. Measurement invariance means that a scale relates to the underlying construct in the same way across groups. Scalar invariance fails when item intercepts or thresholds differ across cultures, so the same observed score can imply different latent levels. Mean comparisons may then mix real differences with measurement artefacts.

  3. "Mindfulness intervention" can bundle different versions: breathing exercises, meditation apps, classroom lessons, retreats, or teacher-led practice. Consistency is threatened because $Y(1)$ is not a single well-defined potential outcome if different people receive different versions.

  4. A suitable graph is $A \to Y^\ast \to Y$, with $U_Y \to Y$, and differential measurement error represented by $A \to U_Y \to Y$. The path $A \to U_Y \to Y$ means the exposure changes how the outcome is measured, so the observed effect on $Y$ may partly reflect measurement processes rather than the true outcome $Y^\ast$.

  5. Baseline outcome adjustment controls much of the stable prior difference between exposed and unexposed people. In a three-wave panel that also adjusts for baseline exposure, a remaining unmeasured confounder would need to affect both the later exposure and the later outcome conditional on the earlier exposure and earlier outcome, which is a stronger requirement than in a cross-sectional comparison.

  6. A design response is to adapt, translate, pilot, or replace items before data collection, ideally with cultural consultation. An analytic response is to test measurement invariance and use group-specific models, partial invariance, sensitivity analysis, or avoid mean comparisons where invariance fails.
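
The graph in answer 4 can be illustrated with a simulation in which the exposure has no effect on the true outcome but still shifts the observed one. This is a sketch under stated assumptions: the additive form $Y = Y^\ast + U_Y$, the randomised exposure, and the parameter values are all chosen for illustration and are not part of the original answer.

```python
import random
from statistics import mean


def simulate(n, true_effect, reporting_shift, seed=0):
    """Simulate A -> Y* -> Y with differential error through A -> U_Y -> Y.

    true_effect:      effect of exposure A on the true outcome Y*.
    reporting_shift:  how much A moves the measurement error U_Y.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.random() < 0.5                         # randomised exposure, so no confounding
        y_true = true_effect * a + rng.gauss(0, 1)     # Y*
        u_y = reporting_shift * a + rng.gauss(0, 0.5)  # U_Y depends on A
        rows.append((a, y_true, y_true + u_y))         # observed Y = Y* + U_Y (assumed additive)
    return rows


def contrast(rows, column):
    """Difference in means (exposed minus unexposed) for the chosen column."""
    exposed = [r[column] for r in rows if r[0]]
    unexposed = [r[column] for r in rows if not r[0]]
    return mean(exposed) - mean(unexposed)


rows = simulate(n=40_000, true_effect=0.0, reporting_shift=0.3)
effect_on_true = contrast(rows, column=1)      # contrast in Y*
effect_on_observed = contrast(rows, column=2)  # contrast in observed Y

print(round(effect_on_true, 2), round(effect_on_observed, 2))
```

With these settings the contrast in $Y^\ast$ should be close to zero while the contrast in observed $Y$ should be close to the reporting shift, so the apparent effect comes entirely from the measurement path. Randomisation removes confounding here but does not remove this bias.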

Synthesis Questions

  1. A strong workflow states the four causal estimands for weekly volunteer work versus no weekly volunteer work; defines the target population and time horizon; states consistency, conditional exchangeability, and positivity; adjusts for baseline covariates, baseline exposure, and baseline outcomes; fits a causal forest; reports four ATEs with Bonferroni-adjusted intervals; reports E-values for point estimates and confidence limits; tests heterogeneity with RATE or calibration; and presents a forest plot plus table.

  2. Estimate CATEs with a causal forest, check whether heterogeneity is real, then move from person-level treatment contrasts to policy learning. Policy learning estimates the expected utility or policy value of candidate allocation rules if they were applied to the target population. Fit shallow policy trees, evaluate held-out policy value and stability, and translate the chosen rule into plain language. The fairness check should consider group treatment shares implied by the rule, proxy variables, need, and override procedures. The model cannot decide whether the allocation should prioritise total benefit, equal access, greatest need, cost control, or another public value. If a fixed budget is added, the analysis must also compare rules or rankings under that budget; the current default policy-tree workflow does not impose the percentage treated by itself.

  3. Measurement threat: the K6 may not be invariant across countries; test invariance, adapt items, or avoid direct mean comparisons if scalar invariance fails. Treatment-version threat: "school-based mindfulness" may differ across teachers, schools, and countries; define the intervention more precisely or analyse versions separately. Identification threat: treated and untreated students may differ in baseline distress, family background, school resources, or selection into the programme; use a clear DAG, baseline adjustment, and sensitivity analysis.

  4. A good reply is: "The corrected intervals mean we should not claim four separate effects. The evidence is strongest for the outcomes whose Bonferroni-adjusted intervals exclude zero. For the other outcomes, the estimates may still contribute to the overall pattern, but they remain compatible with no effect after family-wise correction. I would present the result as an outcome-wide pattern with graded uncertainty, not as four confirmed findings."