Week 9: Resource Allocation and Policy Trees

Date: 6 May 2026

Key idea

A policy tree turns an opaque heterogeneity ranking into a short allocation rule that a non-specialist can read, apply, and contest. The variables the tree splits on are useful for targeting, but they are not thereby causes.

Readings

Required

grf: Generalised Random Forests

Policy learning (tutorial)

RATE (tutorial)

QINI curves (tutorial)

Optional

VanderWeele et al. (2020)

Suzuki et al. (2020)

Bulbulia (2024)

Hoffman et al. (2023)

Key concepts

Outcome-wide evidence asks whether a prespecified exposure has a credible pattern across several outcomes.

Policy learning compares allocation rules by their estimated expected utility.

Policy value must be evaluated out of sample.

Shallow policy trees trade a little value for a lot of interpretability.

Fairness constraints are design choices, not automatic outputs.

Week 8 introduced causal forests and ranking diagnostics one outcome at a time so the CATE machinery was visible. Week 9 connects that machinery to the final assignment's decision sequence. First, estimate outcome-wide ATEs for the prespecified outcome family. Second, for each outcome with enough evidence to discuss, use policy trees to summarise where treatment appears most valuable. Rankings alone are not policy. This week asks how much of the underlying heterogeneity a short, publicly defensible policy tree can recover.

Where we are in the heterogeneity sequence

Week 6: define effect modification and CATE.

Week 8: estimate CATE rankings and ask whether rankings contain useful targeting information.

Week 9: turn modelled heterogeneity into interpretable policy trees.

Week 10: ask whether the measured outcomes and covariates support the interpretation we place on the tree.

Final assessment: report outcome-wide ATEs, then interpret policy trees as readable targeting summaries.

Seminar

Motivating example

A district health board is considering a community-group intervention. The analysis begins outcome-wide: purpose, belonging, self-esteem, and life satisfaction are all possible targets, and investigators should not pick whichever outcome looks most convenient after the fact. Suppose the outcome-wide screen suggests that sense of purpose is the clearest outcome to inspect further. Week 8 then asks whether effects on purpose vary across residents and whether a CATE ranking has targeting value. Week 9 asks the next question: can that ranking be turned into a rule the board could describe in public?

We work through purpose for concreteness. The final assignment applies the same broad logic across the four wellbeing outcomes: report the outcome-wide ATEs, then use policy trees as the interpretable heterogeneity output. RATE and Qini remain useful for understanding rankings, but they are not the report scaffold.

Where Week 9 fits

A policy tree is not the first analysis to run. It is a late-stage summary, used after the outcome-wide ATEs have established which outcomes need interpretation. Week 8 estimated and evaluated single-outcome CATE rankings as teaching diagnostics; Week 9 reopens the outcome-wide screen as the gate before policy work, then fits, evaluates, and interprets the policy tree.

Outcome-wide evidence before targeting

Before fitting a policy tree, ask whether the underlying average effects are credible enough to justify a targeting exercise. In the final assignment, students estimate one exposure across four outcomes: purpose, belonging, self-esteem, and life satisfaction. This is the same outcome-wide screen introduced in Week 8, now used as the context for policy-tree interpretation. The point is to read the pattern across outcomes, not to hunt for whichever row happens to look best.

VanderWeele's outcome-wide design starts from a common problem in social and psychological science: exposures often plausibly affect several outcomes, and investigators can tell a convincing story after the fact if they focus only on whichever row looks strongest. An outcome-wide design disciplines that choice. It asks about a prespecified family of outcomes under the same exposure, target population, time zero, and adjustment set (VanderWeele et al., 2020).

The motivation is disciplined comparison. A broad pattern across outcomes can support a stronger substantive interpretation than one isolated estimate. A specific pattern can show that the exposure seems more relevant for some outcomes than others. A null pattern can prevent investigators from over-selling a single noisy association. Because the same exposure is being evaluated repeatedly, outcome-wide designs also require multiplicity control and sensitivity analysis.

Read the forest plot in four passes:

Is the pattern broad, outcome-specific, or mostly null?
Which outcomes remain statistically reliable after multiplicity correction?
Which estimates are robust enough to unmeasured confounding to discuss seriously?
Which outcome, if any, has enough evidence to justify asking a Week 8 targeting question and a Week 9 policy-tree question?

When a single exposure is tested against several outcomes, the probability that at least one confidence interval excludes zero by chance alone exceeds the nominal per-test error rate. A simple correction is Bonferroni:

$$ \alpha_{\text{per outcome}} = \frac{\alpha_{\text{family-wise}}}{K}. $$

With $K = 4$ outcomes and $\alpha_{\text{family-wise}} = 0.05$, each outcome is tested at $\alpha = 0.0125$. Equivalently, report a 98.75% confidence interval for each outcome. This is conservative, but it is transparent and easy to explain.
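The arithmetic can be checked in a few lines. This is a minimal Python sketch for illustration only; the function name `bonferroni` is hypothetical, and the uncorrected family-wise figure assumes the four tests are independent, which a real outcome family will rarely satisfy exactly.

```python
from statistics import NormalDist


def bonferroni(alpha_family: float, k: int) -> dict:
    """Per-outcome alpha, matching CI level, and normal critical value."""
    alpha_each = alpha_family / k
    return {
        "alpha_per_outcome": alpha_each,
        "ci_level": 1 - alpha_each,
        # Two-sided critical value for a normal-based interval.
        "z_crit": NormalDist().inv_cdf(1 - alpha_each / 2),
    }


result = bonferroni(alpha_family=0.05, k=4)

# Chance of at least one false positive across 4 uncorrected tests,
# ASSUMING the tests are independent (an illustration only).
fwer_uncorrected = 1 - (1 - 0.05) ** 4

print(result)
print(round(fwer_uncorrected, 3))
```

Under the normal approximation, the correction widens each interval from about 1.96 to about 2.50 standard errors either side of the estimate.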

The second question is sensitivity to unmeasured confounding. The E-value is the minimum association strength, on the risk-ratio scale, that an unmeasured confounder would need to have with both the exposure and the outcome, above and beyond the measured covariates, to explain away the estimated effect (VanderWeele & Ding, 2017). Larger E-values mean that a stronger unmeasured confounder would be needed.

Do not treat E-values as a universal pass/fail rule. Whether an E-value is large enough depends on the study design, measured covariates, outcome, and the kinds of unmeasured confounding that remain plausible. In your report, state the E-value for the point estimate and for the confidence limit closest to the null, then interpret it in context. The confidence-limit E-value is usually the more cautious summary because it asks how strong confounding would need to be to move the uncertainty interval to include no effect. VanderWeele and Mathur also recommend reporting E-values with enough context that readers can compare them with known covariate-outcome and covariate-exposure associations rather than reading them as standalone thresholds (VanderWeele & Mathur, 2020).
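For a risk ratio above 1, the E-value formula in VanderWeele and Ding (2017) is $\text{RR} + \sqrt{\text{RR}(\text{RR} - 1)}$; a protective estimate is inverted first. The sketch below applies that formula in Python. The function names and input numbers are hypothetical. It works on the risk-ratio scale only: a standardised effect on a continuous wellbeing outcome would first need converting to an approximate risk ratio, a step not shown here.

```python
import math


def e_value(rr: float) -> float:
    """E-value for a risk ratio (VanderWeele & Ding, 2017 formula)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1 / rr  # protective estimates are inverted first
    return rr + math.sqrt(rr * (rr - 1))


def e_value_for_limit(lower: float, upper: float) -> float:
    """E-value for the confidence limit closest to the null (RR = 1)."""
    if lower <= 1 <= upper:
        return 1.0  # interval already includes the null
    closest = lower if lower > 1 else upper
    return e_value(closest)


# Hypothetical numbers, for illustration only.
point = e_value(1.5)
limit = e_value_for_limit(1.2, 1.9)
print(round(point, 2), round(limit, 2))
```

The confidence-limit value is smaller than the point-estimate value whenever the interval excludes the null, which is why it is the more cautious summary.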

From ranking to policy

Week 8 produced a ranked list: who is predicted to benefit most. A ranking is informative, but it is not a rule. A rule maps any covariate profile $x$ to an action $d(x) \in \{0, 1\}$, "treat" or "do not treat". Two profiles that are nearly identical should receive nearly identical decisions, and a ranking gives no such guarantee, because rank position drifts with sample size.

The causal estimand has changed. Conditional average treatment effect (CATE) estimation asks for a contrast conditional on covariates, $\tau(x) = \mathbb{E}[Y(1)-Y(0)\mid X=x]$. Policy learning asks which allocation rule has the highest expected utility if applied to the target population. We need a way to score competing rules: how much utility each would deliver, on average, if applied to the whole population. Call that number the rule's policy value, written $V(d)$:

$$ V(d) = \mathbb{E}[Y(d(X))]. $$
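A toy population makes the definition concrete. In the Python sketch below, both potential outcomes are written down for every person, which is possible only in a simulation; all values, the deprivation threshold, and the function names are hypothetical.

```python
# Each tuple: (deprivation_score, Y(0), Y(1)). Hypothetical values;
# both potential outcomes are visible only because this is a toy.
population = [
    (0.9, 2.0, 3.0),
    (0.8, 2.5, 3.1),
    (0.3, 3.0, 3.0),
    (0.2, 3.5, 3.3),
]


def policy_value(rule, people) -> float:
    """Average outcome if everyone were assigned the action rule(x)."""
    total = 0.0
    for x, y0, y1 in people:
        total += y1 if rule(x) == 1 else y0
    return total / len(people)


def treat_none(x):
    return 0


def treat_all(x):
    return 1


def treat_high(x):
    # Hypothetical threshold rule: treat when deprivation exceeds 0.5.
    return 1 if x > 0.5 else 0


values = {
    "treat none": policy_value(treat_none, population),
    "treat all": policy_value(treat_all, population),
    "treat high deprivation": policy_value(treat_high, population),
}
print(values)
```

In this toy, the targeted rule scores higher than treating everyone because one person does worse under treatment. With real data only one potential outcome per person is observed, so $V(d)$ has to be estimated rather than read off.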

In the course lab, utility is the outcome $Y$ itself. That assumes the treatment and control actions have equal cost, or that costs are being ignored for the teaching example. If treatment costs money, time, risk, or staff capacity, the utility should be a net utility such as $U(a) = Y(a) - c_a(X)$. If costs change, the best policy can change too: a rule that maximises wellbeing gain when treatment is cheap may no longer maximise net utility when treatment is expensive or scarce. With costs included, the rule should treat when the expected benefit is large enough to justify the cost, not simply when the expected benefit is positive.
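The effect of cost on the decision can be sketched in Python as follows. The CATE values and costs are hypothetical, and the sketch assumes cost is expressed in the same units as the outcome.

```python
def treat_decision(expected_benefit: float, cost: float) -> int:
    """Return 1 (treat) when expected benefit exceeds cost, else 0."""
    return 1 if expected_benefit - cost > 0 else 0


# Hypothetical CATE estimates for four covariate profiles.
tau_hat = [0.30, 0.12, 0.05, -0.02]

# Same estimates, two cost scenarios (cost in outcome units: an assumption).
decisions = {
    cost: [treat_decision(t, cost) for t in tau_hat]
    for cost in (0.0, 0.10)
}
print(decisions)
```

With zero cost, every profile with a positive estimated benefit is treated; with a cost of 0.10, the profile with a benefit of 0.05 drops out even though its benefit is still positive.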

The default policytree problem does not set a fixed percentage of people to treat. Instead, it estimates the utility of candidate allocation rules and asks which action has the highest estimated reward in each leaf. The share treated is an output of the fitted rule, not an input.

If a project has a fixed budget, the policy problem can be written with a treatment-share cap:

$$ \mathbb{E}[d(X)] \le q. $$

That is useful to discuss, but it is not what the course assignment enforces. In the assignment scaffold, policy trees identify regions where the estimated reward for treatment is higher than control, then report the expected mean differences and the size of those regions. Production scripts add further safeguards and interpretation layers, including more outcomes, sample weights, adverse-outcome flipping (reorienting harmful outcomes so a higher value always means a better result), cross-validated heterogeneity interpretation, depth comparison, and larger stability runs.

Two practical complications remain. First, $V(d)$ is a counterfactual quantity, since for any individual we observe at most one of $Y(0)$ or $Y(1)$. We need to evaluate candidate rules without assigning everyone to them. Second, the rule must be small enough to defend in front of a non-statistical audience. Both constraints push the analysis toward shallow, transparent decision trees evaluated with doubly-robust scores.

Why policy trees

A causal forest maps a high-dimensional covariate vector $X$ to a personalised CATE score $\hat{\tau}(X)$. The score says how much treatment is expected to change the outcome for people with covariates $X=x$. It does not decide how to allocate a programme. The policytree algorithm bridges that gap by comparing the utility of allocation rules. It collapses the forest's many $\hat{\tau}(X)$ values into a single shallow decision tree, where each split is chosen to maximise expected policy value subject to the depth budget (Sverdrup et al., 2024).

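The search is easiest to picture top down. At the root, it considers every covariate and every threshold as a candidate split; within each child it considers further splits until the depth limit is reached, and each leaf receives the locally optimal action. In policytree, policy_tree() evaluates these combinations by exhaustive search rather than fixing one split at a time, so the fitted tree maximises estimated policy value among all trees of the requested depth. Because the rule must hold for the entire population, not only for those who happened to receive each action, the search uses doubly-robust scores rather than raw outcomes.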

Define $\Gamma_i(a)$ as person $i$'s estimated outcome contribution if an allocation rule assigned action $a$. In the lab, outcome contribution and utility contribution are the same because we are treating $Y$ as the utility. The augmented inverse-propensity-weighted (AIPW) version is:

$$ \Gamma_i(a) = \mu_a(X_i) + \frac{\mathbb{1}\{A_i = a\}}{\Pr(A = a \mid X_i)}\,\bigl(Y_i - \mu_a(X_i)\bigr), $$

where $\mu_a(x)$ is the outcome model and $\Pr(A = a \mid x)$ is the propensity score. The propensity score is not a potential outcome. It is the estimated probability of receiving action $a$ given covariates. The first term is the outcome model's best guess at $Y_i(a)$ for everyone. The second term uses the people who actually received action $a$ to correct that guess. The propensity score appears in the denominator because only a fraction of people with covariates like $X_i$ received action $a$; inverse-propensity weighting scales their residuals so they represent the whole covariate stratum, not only the observed treated or untreated cases.
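The score is simple to compute once the two nuisance estimates are in hand. A minimal Python sketch follows. The function name and the numbers for one person are hypothetical. In practice $\mu_a$ and the propensity score come from fitted models, usually with cross-fitting or out-of-bag prediction, which this sketch takes as given.

```python
def aipw_score(a: int, y: float, a_obs: int,
               mu_a: float, prop_a: float) -> float:
    """Gamma_i(a): outcome-model guess plus a weighted residual correction.

    mu_a   : outcome-model prediction for action a at this person's X
    prop_a : estimated Pr(A = a | X) for this person (must be > 0)
    """
    indicator = 1.0 if a_obs == a else 0.0
    return mu_a + indicator / prop_a * (y - mu_a)


# Hypothetical person: observed treated (A = 1), outcome 3.4.
# Outcome model predicts 3.0 under treatment, 2.6 under control;
# estimated propensity of treatment is 0.4, so control is 0.6.
gamma_treat = aipw_score(a=1, y=3.4, a_obs=1, mu_a=3.0, prop_a=0.4)
gamma_control = aipw_score(a=0, y=3.4, a_obs=1, mu_a=2.6, prop_a=0.6)
print(gamma_treat, gamma_control)
```

For the action the person actually received, the residual of 0.4 is scaled up by 1/0.4 and added to the model's guess. For the action they did not receive, the indicator is zero and the score is the outcome model's prediction alone.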

Under consistency, exchangeability, positivity, and correct specification of either the outcome model or the propensity model, averaging $\Gamma_i(a)$ estimates the mean potential outcome $\mathbb{E}[Y(a)]$. That is why $\Gamma_i(a)$ is useful for policy learning: it is an action-specific pseudo-outcome. In the simple lab case, that pseudo-outcome is also a pseudo-utility. If costs are included, the same logic applies after replacing $Y(a)$ with net utility $U(a)$.

This is the bridge from individual scores to policy value. For any candidate rule $d$, evaluate the score for the action that rule assigns to each person, then average:

$$ \widehat{V}(d) = \frac{1}{n}\sum_{i=1}^{n}\Gamma_i(d(X_i)). $$

The policy tree search compares candidate rules by this average. A split is useful when assigning different actions in the two child nodes raises $\widehat{V}(d)$ relative to a simpler rule. The tree therefore searches for a simple allocation rule with high estimated utility, and each leaf names the action — treat or control — that maximises that utility.
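The comparison can be illustrated with a deliberately small case: one covariate and one split. The Python sketch below is not the policytree implementation, which searches over many covariates and deeper trees. The function name and the scores are hypothetical.

```python
def best_depth1_rule(x, gamma0, gamma1):
    """Search thresholds on one covariate for the best depth-1 rule.

    x       : covariate values
    gamma0  : doubly-robust scores for control, one per person
    gamma1  : doubly-robust scores for treatment, one per person
    Returns (estimated value, threshold, left action, right action).
    """
    n = len(x)
    best = None
    for threshold in sorted(set(x)):
        left = [i for i in range(n) if x[i] <= threshold]
        right = [i for i in range(n) if x[i] > threshold]
        total = 0.0
        actions = []
        for leaf in (left, right):
            sum0 = sum(gamma0[i] for i in leaf)
            sum1 = sum(gamma1[i] for i in leaf)
            # Each leaf takes the action with the larger summed score.
            actions.append(1 if sum1 > sum0 else 0)
            total += max(sum0, sum1)
        candidate = (total / n, threshold, actions[0], actions[1])
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


# Hypothetical scores for six people.
x = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
gamma0 = [3.0, 3.2, 3.1, 2.5, 2.4, 2.2]
gamma1 = [2.8, 3.0, 3.2, 3.0, 3.1, 3.0]

value, threshold, left_action, right_action = best_depth1_rule(x, gamma0, gamma1)
print(value, threshold, left_action, right_action)
```

Here the best rule assigns control at or below the chosen threshold and treatment above it. Note that the value is computed on the same people used to choose the split, so it is optimistic.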

In this course we cap tree depth at two. Three reasons motivate the cap. First, at most three yes/no questions per rule means the logic fits on a slide for policy-makers or clinicians. Second, each leaf retains enough observations to yield a stable effect estimate, and stability matters more than precision when the audience is non-technical. Third, deeper trees increase computational complexity faster than they improve payoffs; the gain from a depth-3 tree over a depth-2 tree is usually small relative to the loss in clarity.

Pair exercise: reading a simple policy rule

Suppose a depth-2 policy tree identifies a strong-response region:

First split on deprivation: high versus low.

Among the high-deprivation group, split on baseline loneliness: high versus low.

The largest estimated mean difference is in the high-deprivation, high-loneliness leaf.

In pairs:

Draw this tree and label the high-response leaf.

Write the high-response region as one sentence a community organiser could repeat.

Suppose 40% of residents are high-deprivation, and half of those are high-loneliness. What share of all eligible residents falls in the high-response region? Show the multiplication.

The lab reports strong-response regions and expected mean differences; it does not force a 20% treatment cap. Why is it still useful to know the approximate size of the region?

Reading a policy tree

Two visualisations work in tandem. The decision-tree diagram shows the rule abstractly: which covariate, which threshold, which leaf. The prediction-points scatter shows the rule as a partition of the underlying covariate space, with each individual coloured by the assigned action.

Read the two together. Where the scatter has many control-coloured points clustered against a treat boundary, the rule is treating individuals whose own predicted benefit is small but whose neighbourhood benefit is large. That is the cost of forcing a sharp threshold onto a smooth surface, and it is one reason a policy tree should never be the only output of an analysis. The underlying CATE estimates and the calibration test from Week 8 carry information the tree discards.

Choosing tree depth

The depth budget is a design choice, not a property of the data. A depth-1 rule asks one question; a depth-2 rule asks two or three. Whether the extra depth pays off depends on whether the second-level splits carve out subregions in which the locally optimal action differs from the parent's recommendation.

Compare the two depths for community-group participation on purpose.

On the same held-out sample, the depth-1 policy delivers an estimated policy value of $0.215$ (the rule's expected outcome on the held-out fold, in the outcome's own units); the depth-2 policy delivers $0.243$, a point gain of $0.028$ (about thirteen percent above depth-1). The gain is real in the fitted comparison, but small relative to the additional complexity.

The course rule is prespecified. Under this parsimony rule, prefer the shallower tree unless the depth-2 point gain in held-out policy value clears the stated gain threshold. Uncertainty intervals, stability checks, equity considerations, and implementation burden then inform how confidently investigators should act on the selected rule. For a high-stakes clinical intervention, a threshold-clearing gain may justify the extra complexity only with strong validation. For a community programme that must be administered by volunteers, the simpler depth-1 rule may be preferable even when depth-2 clears the point-gain threshold, because the rule is more likely to be applied correctly in the field. A rule a worker mis-remembers is not the rule that was evaluated.
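The parsimony rule reduces to one comparison. The Python sketch below uses the two held-out values reported above. The function name and both threshold values are hypothetical, since the course's prespecified threshold is not stated in this section.

```python
def choose_depth(value_depth1: float, value_depth2: float,
                 gain_threshold: float) -> int:
    """Prefer depth 1 unless the depth-2 point gain clears the threshold."""
    gain = value_depth2 - value_depth1
    return 2 if gain > gain_threshold else 1


v1, v2 = 0.215, 0.243          # held-out policy values at depth 1 and 2
gain = v2 - v1                 # absolute point gain
relative_gain = gain / v1      # gain as a share of the depth-1 value

print(round(gain, 3), round(relative_gain, 3))

# HYPOTHETICAL thresholds, chosen only to show both outcomes of the rule.
strict_choice = choose_depth(v1, v2, gain_threshold=0.05)
lenient_choice = choose_depth(v1, v2, gain_threshold=0.01)
print(strict_choice, lenient_choice)
```

The same gain selects depth 1 under the stricter threshold and depth 2 under the more lenient one, which is why the threshold has to be fixed before the trees are inspected.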

The full research workflow compares both depths automatically. Lab 9 shows depth-1 and depth-2 outputs so you can practise the same judgement: examine both trees, then justify the chosen depth in the methods section.

Pair exercise: depth-1 versus depth-2

Look at the depth-1 and depth-2 policy trees above. State each rule in one sentence of plain language.

The depth-2 rule lifts policy value by $0.028$ over the depth-1 rule. For $10{,}000$ eligible people, describe what a small average lift can mean at population scale.

Name one reason to use the depth-1 rule despite the lift.

Name one reason to use the depth-2 rule.

State what extra evidence would make you more comfortable choosing the depth-2 rule.

Off-policy evaluation

A policy fitted to a sample will tend to look better on that sample than it does elsewhere, because the same data chose the rule and scored it. The serious test is held-out evaluation. The research workflow uses AIPW scores computed on a held-out fold, then estimates policy value and its sampling distribution by averaging over individuals' scores under the proposed rule. The estimator is doubly robust in the sense above; the standard errors come from a plug-in or bootstrap calculation that respects the sampling design.

Two consequences follow. First, the same rule can have different value estimates depending on which fold is used; small samples make this drift visible. The lab's stability analysis (margot_policy_tree_stability()) repeats the fit across many train-test splits to surface this variability. Rules that change splits across resamples are flagged as unstable, even if any single fit looks reasonable. Second, the policy value's confidence interval reflects only sampling noise. Model misspecification, residual confounding, and measurement error sit outside the interval. Sensitivity analysis on the underlying causal estimates and checks on the rule itself (below) carry the rest of the inferential burden.
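The idea behind a stability check can be sketched with a simple tally. The Python below is not how margot_policy_tree_stability() works internally. It assumes the root splitter from each repeated fit has already been recorded, and the function name, the 0.7 cut-off, and the example results are all hypothetical.

```python
from collections import Counter


def split_stability(root_splits, min_share=0.7):
    """Share of resamples choosing each root splitter, plus a flag.

    root_splits : variable chosen at the root in each train-test split
    min_share   : HYPOTHETICAL cut-off below which the rule is flagged
    """
    counts = Counter(root_splits)
    top_variable, top_count = counts.most_common(1)[0]
    top_share = top_count / len(root_splits)
    shares = {name: c / len(root_splits) for name, c in counts.items()}
    return shares, top_variable, top_share >= min_share


# Hypothetical results from ten repeated fits.
splits = ["deprivation"] * 6 + ["loneliness"] * 3 + ["age"]
shares, top_variable, is_stable = split_stability(splits)
print(shares, top_variable, is_stable)
```

In this example the most common splitter appears in six of ten fits, below the illustrative cut-off, so the rule would be flagged even though any single fit might look reasonable.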

If the held-out value is statistically reliably above the no-treatment baseline, the rule is worth considering. If the held-out value is indistinguishable from random allocation, the rule is not. The policy_value_audit that margot_policy_workflow() returns flags both cases automatically.
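A plug-in version of the held-out calculation can be sketched in Python. It assumes independent observations with no sample weights or clustering, so it does not respect a complex sampling design. The scores, the rule's assignments, and the function name are hypothetical, and eight people is far too few for the normal approximation to be trusted.

```python
import math
from statistics import NormalDist, mean, stdev


def value_with_ci(scores, level=0.95):
    """Mean of per-person scores with a plug-in normal interval."""
    n = len(scores)
    estimate = mean(scores)
    se = stdev(scores) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate, se, (estimate - z * se, estimate + z * se)


# Hypothetical held-out AIPW scores for eight people.
gamma0 = [3.0, 3.2, 3.1, 2.5, 2.4, 2.2, 2.9, 2.6]   # control scores
gamma1 = [2.8, 3.0, 3.2, 3.0, 3.1, 3.0, 2.7, 3.1]   # treatment scores
rule = [0, 0, 1, 1, 1, 1, 0, 1]                      # action assigned by d

under_rule = [g1 if d == 1 else g0 for g0, g1, d in zip(gamma0, gamma1, rule)]

value, se, ci = value_with_ci(under_rule)

# Paired contrast with a treat-no-one baseline: the same people appear
# in both arms, so the differences are averaged person by person.
gain, gain_se, gain_ci = value_with_ci(
    [u - g0 for u, g0 in zip(under_rule, gamma0)]
)
print(round(value, 3), round(gain, 3))
print(gain_ci)
```

The interval for the paired gain is the one to compare with zero when asking whether the rule beats the baseline. It still reflects sampling noise only.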

Practical interpretation

A policy tree's splits are selected to maximise policy value, not to identify causally privileged variables. If a depth-2 tree splits on openness and neuroticism, that does not mean these variables are deep causes of purpose; it means they help separate high-value and low-value treatment regions under the reward objective supplied to policytree. The same data with a different outcome could pick out an entirely different splitter.

Conversely, a variable that does not appear in the tree is not necessarily causally irrelevant. The tree search may have found a single composite that explains most of the heterogeneity, leaving secondary variables unused. Variable-importance summaries from the underlying causal forest (e.g. grf::variable_importance()) give a complementary picture of what the forest used to estimate $\hat{\tau}(x)$, even if those variables do not appear at the top of any policy tree.

When you write up a policy-tree analysis, name the splitters, state the policy value with its confidence interval, and explicitly disclaim the causal interpretation of the splits. Readers who skim past those clarifications will otherwise treat the splitters as causes.

Heterogeneity as scientific discovery

CATE machinery maps treatment effects across a high-dimensional covariate space. It helps test whether our conventional categories (gender, age group, clinical severity) capture the differences that matter. Sometimes they do; often they do not. Discovering where the forest finds meaningful splits can generate fresh hypotheses about who responds and why, even when no policy decision is on the table. A forest that splits on loneliness rather than age, for example, suggests a hypothesis about social connection; it does not prove that loneliness is the causal source of the heterogeneity. This use of heterogeneity is exploratory, not confirmatory, and a follow-up study designed around the discovered subgroup is the proper test.

Why effect modifiers are descriptive

Heterogeneous treatment effect (HET) estimates are descriptions of how treatment contrasts vary across measured covariates. They are not causal effects of the covariates that appear in the forest or policy tree. A variable can be a useful effect modifier because it is correlated with a deeper cause, because it is downstream of a deeper cause, or because it is the best measured proxy available.

Consider a simple randomised experiment. The exposure $A$ affects the outcome $Y$, and $Z$ is the variable that directly helps explain why the treatment effect differs. If $Z$ also affects $G$, then $G$ can look like an effect modifier by proxy. The modifier describes where effects vary; it does not identify the root cause of that variation.

Figure: A randomised experiment in which $Z$ is the direct effect modifier and $G$ is a proxy effect modifier because $Z$ affects $G$.

If investigators condition on $Z$, $G$ may no longer help describe effect heterogeneity. The apparent importance of a modifier is therefore relative to the other variables in the model.

Figure: Conditioning on $Z$ removes the association between $G$ and the outcome-relevant effect-modification path.
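The proxy argument can be verified on a small constructed population. In the Python sketch below, the treatment effect depends on $Z$ alone, while $G$ is more common when $Z = 1$. The cell counts, effect sizes, and function names are hypothetical.

```python
# Hypothetical population counts by (Z, G). Z drives the treatment
# effect; G is caused by Z, so it is more common when Z = 1.
counts = {(1, 1): 40, (1, 0): 10, (0, 1): 10, (0, 0): 40}


def effect(z: int, g: int) -> float:
    """True treatment effect: depends on Z only, never on G."""
    return 1.0 if z == 1 else 0.0


def mean_effect(condition) -> float:
    """Average true effect among cells where condition(z, g) holds."""
    cells = [(zg, n) for zg, n in counts.items() if condition(*zg)]
    total = sum(n for _, n in cells)
    return sum(effect(*zg) * n for zg, n in cells) / total


# Marginally, G looks like an effect modifier.
gap_by_g = mean_effect(lambda z, g: g == 1) - mean_effect(lambda z, g: g == 0)

# Within levels of Z, G carries no further information.
gap_within_z1 = (mean_effect(lambda z, g: z == 1 and g == 1)
                 - mean_effect(lambda z, g: z == 1 and g == 0))
gap_within_z0 = (mean_effect(lambda z, g: z == 0 and g == 1)
                 - mean_effect(lambda z, g: z == 0 and g == 0))

print(gap_by_g, gap_within_z1, gap_within_z0)
```

The average effect differs by 0.6 between the two levels of $G$, yet the difference is zero within each level of $Z$. A tree given only $G$ would split on it usefully without $G$ causing anything.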

The same point matters for allocation rules. Suppose a policy tree splits on deprivation. That split may be useful for prediction and allocation, but it does not prove that deprivation is the causal source of the different treatment response. If ethnicity or other upstream causes affect deprivation, and deprivation is the strongest measured marker in the data, the tree may split on deprivation while the deeper causal explanation lies upstream.

Figure: A proxy structure in which an upstream variable affects a downstream marker that helps describe effect heterogeneity.

The practical rule is: use HETs for description, targeting, and hypothesis generation. Do not read a splitter as a cause without a separate design that identifies the causal effect of that splitter.

Fairness and public judgement

Efficiency is not the only consideration. A rule that maximises policy value can still be hard to explain, hard to apply, or unacceptable to the people affected by it. Three concerns recur.

Proxy variables can affect social groups differently. A split on deprivation, income, postcode, age, or baseline wellbeing may reproduce group differences even when group membership is not used explicitly. Removing an explicit variable does not remove its proxies.

Targeting concentrates resources on those who benefit most, by design. People who would benefit somewhat from the intervention receive nothing under a tight budget. That trade-off may be defensible, but it is a value choice. State it plainly rather than hiding it inside an objective function.

Statistical evidence can inform public judgement. It cannot make the judgement for us. Statisticians and psychologists can estimate benefits, harms, uncertainty, and subgroup patterns. Public allocation also depends on values that citizens and institutions debate and decide. The analyst's job is to make the trade-offs visible.

Before recommending a rule, investigators should check:

Who gains and who loses?
Are protected groups differentially affected through proxies?
Does the rule reduce or worsen disparities?
Can affected communities understand and contest the rule?
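One way to start on the first two checks is to tabulate the rule's assignments by a group label that was not an input to the rule. The Python sketch below shows the tabulation only. The group labels, counts, and function name are hypothetical, and a gap between groups is a prompt for judgement, not a verdict.

```python
from collections import defaultdict


def treated_share_by_group(records):
    """Share assigned to treatment within each group.

    records : iterable of (group_label, assigned_action) pairs
    """
    treated = defaultdict(int)
    totals = defaultdict(int)
    for group, action in records:
        totals[group] += 1
        treated[group] += action
    return {group: treated[group] / totals[group] for group in totals}


# Hypothetical rule output: group labels were NOT inputs to the rule.
records = (
    [("group_a", 1)] * 30 + [("group_a", 0)] * 20
    + [("group_b", 1)] * 10 + [("group_b", 0)] * 40
)
group_shares = treated_share_by_group(records)
print(group_shares)
```

In this constructed example the rule treats a much larger share of one group than the other, even though group membership never entered the tree.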

Pair exercise: fairness check

A policy tree treats residents who are high-deprivation and under 40.

Translate the rule into plain language.

Explain how the deprivation split can affect social groups differently even when group membership is not used.

Name the table you would compute to check this empirically.

Apply two checks from the list above.

Name one value judgement the model cannot settle.

Your partner says "the algorithm is objective because it only uses data." Counter this claim in two sentences.

Workflow for this week

Estimate heterogeneity and targeting value (Week 8 outputs).
Fit shallow policy trees at multiple depths using the policytree default reward objective.
Evaluate each policy out of sample with AIPW scores; report policy value with a confidence interval.
Examine stability across train-test splits and reject unstable rules even when any single fit looks reasonable.
Conduct a fairness check before recommending a rule.
Report trade-offs: value, fairness, transparency, and feasibility.

Which question are we answering?

The methods in weeks 5-10 answer related but different questions. The sequence matters because each step changes the causal estimand, the statistical summary, and the decision that the evidence can support.

Tool	Question	Main quantity	What it can support	What it does not settle
Outcome-wide ATE	Does one exposure appear to improve outcomes on average across a prespecified outcome family?	$\mathbb{E}[Y_k(1)-Y_k(0)]$ for each outcome $k$	Whether the exposure has credible average evidence for one or more outcomes	Who should receive treatment
Prespecified group CATE	Do average effects differ across a group chosen before looking at results?	$\mathbb{E}[Y(1)-Y(0)\mid G=g]$	Whether an interpretable group comparison deserves attention	Individual assignment or a complete allocation rule
Forest CATE ranking	Who is predicted to benefit more, conditional on baseline covariates?	$\hat{\tau}(X)$	A ranking for possible targeting	A transparent public rule, or a fixed treatment share by itself
RATE / Qini	Does targeting by the CATE ranking improve outcomes over random or uniform allocation?	Targeting gain over the ranked population	Whether heterogeneity has practical value	Which simple rule should be used
Policy tree	What short allocation rule has high expected policy value?	$V(d)=\mathbb{E}[Y(d(X))]$	A defensible, interpretable rule to consider	Whether the split variables are causes, or whether the rule is fair
Measurement checks	Do the outcomes and covariates mean what the analysis assumes?	Fit, invariance, construct, and measurement-error diagnostics	Whether interpretation needs qualification before policy use	The causal effect, or the fairness of the rule

The table is also a writing guide. In a report, avoid presenting policy trees as though they answer the same question as an average treatment effect. The ATE asks whether the exposure helps on average. CATE summaries ask whether effects vary. RATE and Qini ask whether a ranking is useful for targeting. Policy trees ask whether a short rule can recover enough value to be worth considering. Measurement checks ask whether the variables inside the analysis support the interpretation being placed on them.

Return to the opening example

Back to the district health board. The question is not only "what rule maximises sample gain?" The question is also whether the rule works out of sample, can be explained, and is acceptable to the people who would live with it. If there is a fixed budget, that constraint must be added explicitly; it is not supplied by the default policytree call. That is what moves a tree from a slide in a methods talk to a rule a clinic might actually use.

The workflow from question to policy rule is now in place. One assumption has been present throughout but never examined: that our instruments measure the same construct across the groups we compare. Week 10 asks whether that assumption holds.

Pair exercise: policy tree versus ranking

Strategy A ranks individuals by $\hat{\tau}(X_i)$ and, if a budget is fixed, could treat the top 20%. Strategy B fits depth-1 and depth-2 policy-tree candidates using the default reward objective, then selects the simpler rule unless depth-2 clears the prespecified gain threshold. The treated share is whatever the selected rule implies.

Compare the two strategies on: (a) estimated policy value, (b) explainability to a non-technical audience, and (c) ability to answer "why was I selected?"

State one scenario where Strategy A (pure ranking) is preferable.

State one scenario where Strategy B (policy tree) is preferable.

State one fairness question both strategies must answer before anyone uses them.

Lab materials: Lab 9: Policy Trees

Assessment checkpoint

We will reserve about 25 minutes today for the assessments.

Test 2, 5 minutes. The test covers weeks 8-10: heterogeneous treatment effects, policy trees, and measurement. Bring one A4 sheet of notes. No devices.
Test preparation, 10 minutes. The public study sheet and practice questions are linked from the resources section of the course book. Use them to practise short answers, not just definitions.
Research report and Marsden option, 10 minutes. Option A should follow the current research-report template or the Google Drive mirror if already downloaded: choose religious_service or volunteer_work, then report effects on all four wellbeing outcomes. Historical examples can help you see tone and structure, but this year's report is narrower and uses simulated data. For Option B, the old Marsden example is a model of compact grant style only; follow the current criteria in Assessments.

Bulbulia, J. A. (2024). A practical guide to causal inference in three-wave panel studies. PsyArXiv Preprints. https://doi.org/10.31234/osf.io/uyg3d

Hoffman, K. L., Salazar-Barreto, D., Rudolph, K. E., & Díaz, I. (2023). Introducing longitudinal modified treatment policies: A unified framework for studying complex exposures. https://doi.org/10.48550/arXiv.2304.09460

Suzuki, E., Shinozaki, T., & Yamamoto, E. (2020). Causal Diagrams: Pitfalls and Tips. Journal of Epidemiology, 30(4), 153–162. https://doi.org/10.2188/jea.JE20190192

Sverdrup, E., Kanodia, A., Zhou, Z., Athey, S., & Wager, S. (2024). Policytree: Policy learning via doubly robust empirical welfare maximization over trees. https://CRAN.R-project.org/package=policytree

VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268–274. https://doi.org/10.7326/M16-2607

VanderWeele, T. J., & Mathur, M. B. (2020). Commentary: Developing best-practice guidelines for the reporting of e-values. International Journal of Epidemiology, 49(5), 1495–1497. https://doi.org/10.1093/ije/dyaa094

VanderWeele, T. J., Mathur, M. B., & Chen, Y. (2020). Outcome-wide longitudinal designs for causal inference: A new template for empirical studies. Statistical Science, 35(3), 437–466.

PSYC 434: Conducting Research Across Cultures