Week 10: Classical Measurement Theory from a Causal Perspective
Date: 13 May 2026
Key idea
Standard psychometric theories rest on strong, usually unstated, causal assumptions about how a construct and its indicators relate. These assumptions are especially likely to fail when measures are compared across cultures. Measurement is therefore part of causal identification, not a separate preliminary step.
Readings
Required
Optional
Key concepts
- EFA and CFA are model-building tools, not causal proofs.
- Invariance tests are associational diagnostics under a chosen measurement model, not tests of causal comparability.
- Reflective and formative equations need explicit causal interpretation.
- Measurement assumptions can open or close bias paths in DAGs.
Weeks 1 through 9 built a workflow from causal question to policy recommendation. Every step assumed that the outcomes we measure mean the same thing for every group in the target population. If they do not, contrasts between groups can reflect measurement artefact, not causal differences. This week examines that assumption.
Where we are in the heterogeneity sequence
- Week 6: define effect modification and CATE.
- Week 8: estimate CATE rankings and ask whether rankings contain useful targeting information.
- Week 9: turn modelled heterogeneity into interpretable policy trees.
- Week 10: ask whether the measured outcomes and covariates support the interpretation we place on the tree.
- Final assessment: report outcome-wide ATEs, then interpret policy trees as readable targeting summaries.
Week 10 closes the heterogeneity sequence by asking what the policy tree is made of. Its splits use measured covariates. Its value is judged on measured outcomes. Its fairness depends on whether those measurements behave comparably across groups. A clear tree can still mislead if the variables inside it do not mean what the analysis says they mean.
| Object | Question | Output |
|---|---|---|
| Outcome-wide ATEs | Which outcomes show credible average evidence? | Four-outcome ATE table or plot |
| CATE | For whom does the effect vary? | $\hat{\tau}(X)$ estimates |
| RATE / Qini | Does the ranking carry targeting value? | Diagnostic curves and summaries |
| Policy tree | Can we state a readable rule? | Depth-1 or depth-2 allocation rule |
| Measurement checks | Do the variables mean what the rule assumes? | Cautions about construct meaning and comparability |
Seminar
Motivating example
The Kessler-6 (K6) is widely used to screen psychological distress.
Two questions must be addressed before we try to compare scores causally across groups.
- Do the six items map to the same latent structure?
- Is that structure invariant across groups?
These questions are necessary, but they are not sufficient. Even if a measurement model fits well and invariance tests pass, causal interpretation still depends on a defended causal story about what the construct is and how it is measured.
This is why measurement belongs inside causal inference, not beside it. Measurement is part of identification: if a measure is unstable across the groups we compare, effect estimates and group contrasts can be distorted even when the adjustment set is otherwise defensible.
Classical validity and its limits
Psychology textbooks organise measurement quality around four types of validity.
Four classical validity types
- Content validity: the degree to which an instrument covers the intended domain.
- Construct validity: whether the construct is accurately defined and operationalised.
- Criterion validity: whether an instrument accurately predicts performance on an external criterion.
- Ecological validity: whether an instrument reflects real-world situations and behaviour.
These categories organise useful intuitions. From a causal perspective, each conflates problems that need to be kept separate.
Content validity asks whether items span the construct's domain. It does not specify the causal direction between construct and indicators. Does the construct cause the items (a reflective model), or do the items constitute the construct (a formative model)? Without stating the causal structure, "measures what it's intended to measure" has no formal content.
Construct validity bundles two separate questions. "Accurately defined" concerns the target quantity: what state of the world are we trying to capture? This is analogous to defining a causal estimand (which intervention, in which population, compared with what alternative). "Operationalised" concerns whether the same score means the same thing across individuals. This is a consistency question. Lumping both under one label obscures where a measurement fails.
Criterion validity is purely associational. An instrument can predict an outcome well for non-causal reasons: shared confounders, reverse causation, collider bias. "Predicts performance" tells you nothing about whether the instrument captures the construct that causally affects the criterion. Weeks 2 through 4 showed why prediction and causation are different questions. The same distinction applies here.
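A minimal simulation sketch of the shared-confounder case follows. All variable names and effect sizes are illustrative assumptions, not estimates from any study. The instrument predicts the criterion only because both pick up the same confounder; the construct itself has no effect on the criterion.

```r
# Hypothetical simulation: names and coefficients are illustrative assumptions.
set.seed(2026)
n <- 10000

confounder <- rnorm(n)                                        # shared cause
construct  <- rnorm(n)                                        # the state we intend to measure
instrument <- 0.3 * construct + 0.8 * confounder + rnorm(n)   # score also picks up the confounder
criterion  <- 0.8 * confounder + rnorm(n)                     # construct has NO effect on the criterion

# "Criterion validity" looks respectable: the instrument predicts the criterion.
r_instrument_criterion <- cor(instrument, criterion)

# Yet the construct is unrelated to the criterion.
r_construct_criterion <- cor(construct, criterion)

round(c(instrument = r_instrument_criterion, construct = r_construct_criterion), 2)
```

A predictive correlation of this kind is compatible with the construct having no causal role at all.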
Ecological validity gestures at transportability without specifying what changes across settings. A causal framework asks: does the construct-to-indicator relationship hold in the target population? That question is testable through measurement invariance. "Reflects real-world situations" is not testable.
These four categories are qualitative checklists, not properties of a formal model. A causal approach specifies the directed graph connecting constructs to indicators, states the assumptions under which observed scores recover latent quantities, and tests those assumptions. The rest of this lecture shows what that looks like in practice.
Learning outcomes
By the end of this week, you should be able to:
- State what each classical validity type (content, construct, criterion, ecological) claims, and identify the causal assumptions each leaves implicit.
- Run exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) with clear model-comparison logic.
- Run configural, metric, and scalar/threshold invariance tests, and state what they do not establish.
- Explain why good fit does not prove a causal latent model.
- Explain why invariance tests do not deliver causal structure.
- Link measurement assumptions to DAG-based bias reasoning.
Part 1: practical workflow with K6
This lecture has two parts. Part 1 is the practical workflow you may be asked to run: exploratory and confirmatory factor analysis, then invariance testing. Part 2 is the causal reasoning that says what those tools can and cannot establish. Work the tools first, then read Part 2 before trusting them.
Step 1: prepare data and inspect factorability
library(margot)
library(tidyverse)
library(performance)
# keep the six K6 items from the 2018 wave
k6 <- df_nz |>
  filter(wave == 2018) |>
  select(
    kessler_depressed,
    kessler_effort,
    kessler_hopeless,
    kessler_worthless,
    kessler_nervous,
    kessler_restless
  )
# Bartlett's test of sphericity and the KMO measure
check_factorstructure(k6)
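Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) measure are entry checks: they indicate whether the items correlate enough to be worth factoring. They do not validate causal interpretation.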
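To see what these two entry checks compute, here is a base R sketch on simulated items (a stand-in, not the survey data; the loading of 0.7 is an assumption). It uses the textbook formulas for Bartlett's statistic and the overall KMO, and is not a substitute for `check_factorstructure()`.

```r
# Simulated stand-in for six correlated items (assumed common loading of 0.7).
set.seed(2026)
n <- 2000
eta <- rnorm(n)
items <- sapply(1:6, function(i) 0.7 * eta + rnorm(n, sd = sqrt(1 - 0.7^2)))
colnames(items) <- paste0("item", 1:6)

R <- cor(items)
p <- ncol(R)

# Bartlett's test of sphericity: H0 is that R is an identity matrix.
bartlett_chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
bartlett_df <- p * (p - 1) / 2
bartlett_p <- pchisq(bartlett_chisq, df = bartlett_df, lower.tail = FALSE)

# KMO: squared correlations relative to squared correlations
# plus squared partial correlations (off-diagonal elements only).
R_inv <- solve(R)
partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
off <- row(R) != col(R)
kmo <- sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))

c(chisq = bartlett_chisq, df = bartlett_df, p = bartlett_p, KMO = kmo)
```

Both quantities are functions of the correlation matrix alone, which is why they cannot speak to causal direction.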
Step 2: run EFA
library(psych)
library(parameters)
# three-factor EFA with oblique (oblimin) rotation; report each item's largest loading
efa <- psych::fa(k6, nfactors = 3, rotate = "oblimin") |>
  model_parameters(sort = TRUE, threshold = "max")
efa
Oblique rotation is usually appropriate because psychological dimensions often co-vary.
Step 3: compare CFA candidates
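library(datawizard)
library(parameters) # efa_to_cfa()
library(lavaan)
library(performance)
set.seed(123)
# 70/30 split: derive candidate structures on the training set, confirm on the test set
# (the argument is `proportion` in current datawizard; older releases used `training_proportion`)
parts <- data_partition(k6, proportion = 0.7, seed = 123)
# data_partition() adds a row-id column; drop it so it is not treated as an item
train <- parts$p_0.7 |> dplyr::select(-dplyr::any_of(".row_id"))
test <- parts$test |> dplyr::select(-dplyr::any_of(".row_id"))
# EFA on the training data, converted to lavaan CFA syntax
m1_syntax <- psych::fa(train, nfactors = 1) |> efa_to_cfa()
m2_syntax <- psych::fa(train, nfactors = 2) |> efa_to_cfa()
m3_syntax <- psych::fa(train, nfactors = 3) |> efa_to_cfa()
# fit each candidate on the held-out test data
m1 <- cfa(m1_syntax, data = test)
m2 <- cfa(m2_syntax, data = test)
m3 <- cfa(m3_syntax, data = test)
compare_performance(m1, m2, m3, verbose = FALSE)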
Read CFI, RMSEA, AIC, and BIC together. Prefer simpler models when fit is comparable. Do not suppress convergence or Heywood warnings here. A three-factor model on six items is near the limit of what the data can identify, so any warnings m3 produces are part of the model-comparison evidence, not noise to hide.
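The identification limit can be made concrete with the standard degrees-of-freedom count for an unrestricted (exploratory) factor model with $p$ items and $k$ factors. The helper name `efa_df` is hypothetical. The count applies to the exploratory step that generates the candidate syntax; the simple-structure CFA models fitted afterwards impose extra zero loadings and so have different degrees of freedom.

```r
# Degrees of freedom for an exploratory factor model with p items and k factors.
efa_df <- function(p, k) ((p - k)^2 - (p + k)) / 2

# One, two, and three factors on six items.
sapply(1:3, function(k) efa_df(p = 6, k = k))
# returns 9 4 0
```

With zero degrees of freedom, a three-factor exploratory solution on six items reproduces the correlations without being testable against them.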
Step 4: test invariance across groups
We teach invariance testing because it is widely used and because you may be asked to use it in comparative work. Treat it as a descriptive diagnostic, not as a generator of causal insight. A multi-group CFA invariance test is conditional on a specified measurement model (usually reflective), a chosen parameterisation, and assumptions such as local independence (no causal relations among items once the latent variable is held fixed). Passing an invariance test does not show that the same causally relevant construct exists in both groups. It only shows that a particular statistical measurement model, with particular equality constraints, is compatible with the observed covariance structure. Failures and successes are both compatible with many causal stories. Causal structure is underdetermined by associations in the data.
In this lecture, causal assumptions come first: statistical tests cannot replace reasoning about those assumptions. We must define the construct and state a causal measurement story. Only then can invariance tests play a role, by checking some statistical implications of that story.
For ordinal items, threshold invariance is the analogue of scalar invariance. In practice, use an estimator suited to ordinal indicators (for example WLSMV) when items are Likert-type.
library(semTools)
k6_eth <- df_nz |>
  filter(wave == 2018, eth_cat %in% c("euro", "maori")) |>
  select(
    kessler_depressed,
    kessler_effort,
    kessler_hopeless,
    kessler_worthless,
    kessler_nervous,
    kessler_restless,
    eth_cat
  )
# invariance testing is run end-to-end in Lab 10.
# current semTools fits the configural / metric / scalar sequence with
# measEq.syntax() and compares the steps with compareFit(). the older
# measurementInvariance() wrapper has been removed and no longer runs.
- Configural invariance: same loading pattern.
- Metric invariance: same loadings.
- Scalar/threshold invariance: same intercepts (continuous) or thresholds (ordinal).
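The three levels can be sketched with lavaan alone on simulated continuous indicators. This is an illustrative assumption-laden sketch, not the Lab 10 workflow: it treats items as continuous (so it tests intercepts, not thresholds), the population loadings are made up, and one intercept is deliberately shifted in the second group.

```r
library(lavaan)

set.seed(2026)

# Population model used only to simulate illustrative data (assumed loadings).
pop_model <- "f =~ 0.8*y1 + 0.7*y2 + 0.6*y3 + 0.7*y4"
g1 <- simulateData(pop_model, sample.nobs = 500)
g2 <- simulateData(pop_model, sample.nobs = 500)
g2$y4 <- g2$y4 + 0.5                      # shift one item intercept in group b
dat <- rbind(cbind(g1, grp = "a"), cbind(g2, grp = "b"))

model <- "f =~ y1 + y2 + y3 + y4"

# Configural: same pattern, all parameters free across groups.
fit_configural <- cfa(model, data = dat, group = "grp")
# Metric: loadings constrained equal.
fit_metric <- cfa(model, data = dat, group = "grp", group.equal = "loadings")
# Scalar: loadings and intercepts constrained equal.
fit_scalar <- cfa(model, data = dat, group = "grp",
                  group.equal = c("loadings", "intercepts"))

# Nested comparison of the three steps.
lavTestLRT(fit_configural, fit_metric, fit_scalar)
```

The comparison reports whether each added set of equality constraints worsens fit. It does not say why an intercept differs.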
How are these tests useful?
- They describe whether a chosen measurement model assigns similar roles to items across groups (within that model class). This can be a compact summary of group differences in the covariance structure.
- They can suggest where to look. If constraints fail, you learn which items or thresholds are most responsible, which can motivate substantive work (translation review, response-process interviews, item-by-item analysis, or redesign).
- They keep you honest about what is not identified. Even if every invariance test "passes", the causal meaning of the construct is not certified. If a test "fails", it does not tell you whether the problem is measurement bias or a real causal difference in the underlying state (for example, different causes of distress producing different item dynamics). To decide that, you need a causal story and often new data, not a better fit index.
Also note that item means can differ for many reasons that have nothing to do with biased measurement. If group A is more distressed than group B, we should expect different item means even under perfect measurement. Invariance testing describes whether a particular psychometric model is stable across groups. A failed test, on its own, does not establish whether a group difference is measurement artefact or a real difference in the underlying state.
Under the standard invariance interpretation, if scalar/threshold invariance fails then latent mean comparisons are not identified within that measurement model. In this course, treat this as a warning about interpretability under the assumed model, not as evidence that any observed group difference is "measurement bias" rather than a real difference in underlying causal reality.
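A single-item simulation sketch shows why an observed mean gap is ambiguous. The loading, error scale, and shifts are assumptions chosen so that both stories imply the same expected gap of 0.4. In a multi-item model the two stories can be separated only by assuming that enough of the remaining items are invariant.

```r
set.seed(2026)
n <- 5000
lambda <- 0.8   # common loading (assumed)

# Story 1: real difference in the latent state, identical measurement.
eta_a1 <- rnorm(n, mean = 0)
eta_b1 <- rnorm(n, mean = 0.5)
item_a1 <- lambda * eta_a1 + rnorm(n, sd = 0.6)
item_b1 <- lambda * eta_b1 + rnorm(n, sd = 0.6)

# Story 2: identical latent state, shifted item intercept in group B.
eta_a2 <- rnorm(n, mean = 0)
eta_b2 <- rnorm(n, mean = 0)
item_a2 <- lambda * eta_a2 + rnorm(n, sd = 0.6)
item_b2 <- 0.4 + lambda * eta_b2 + rnorm(n, sd = 0.6)

gap_story1 <- mean(item_b1) - mean(item_a1)
gap_story2 <- mean(item_b2) - mean(item_a2)
round(c(real_difference = gap_story1, intercept_shift = gap_story2), 2)
```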
Pair exercise: interpreting invariance results
- The K6 is tested across two ethnic groups. Configural invariance holds. Metric invariance holds. Scalar/threshold invariance fails on two items: "felt hopeless" and "felt worthless."
- State what each level of invariance means in plain language (same structure, same loadings, same intercepts/thresholds).
- State the standard invariance interpretation of scalar/threshold failure for group mean comparisons, then state the causal critique: why a cross-sectional associational test cannot decide whether the difference is measurement bias or a real difference in the underlying causal reality.
- Propose a hypothesis for why "felt hopeless" and "felt worthless" might function differently across groups (consider cultural norms, translation, different anchoring of response categories, or different causes of distress).
- Suppose all invariance tests passed. State one reason this would still not settle the causal question of whether "distress" is the same outcome across groups.
Part 2: how traditional measurement theory fails (for causal inference)
Part 1 showed the tools; Part 2 asks what they establish. Causal inference operates under assumptions, and for measurement it is causality all the way down. This follows from the potential outcomes framework: causal contrasts require well-defined outcomes under interventions, and constructed measures are themselves outcomes of causal processes (question wording, translation, response styles, incentives, and the world that generates the experiences being reported).
Classical psychometric checks (internal consistency, model fit, invariance tests) can organise associations. They do not, by themselves, evaluate the causal assumptions about direction and causal efficacy that are often imported into practice when we move from a measurement model to a causal claim. The causal question is not "does the model fit?" The causal question is "under which causal assumptions does this measured quantity behave like the variable in our DAG?"
Two ways of thinking about measurement in psychometric research
In psychometric research, formative and reflective models describe the relationship between latent variables and indicators. VanderWeele discusses this distinction, and its implications for causal inference with constructed measures, in the required reading (VanderWeele, 2022).
Reflective model (factor analysis)
In a reflective measurement model (an effect-indicator model), the latent variable is taken to cause the observed indicators. Each indicator is a reflection of the latent variable.
The reflective model is often written:
$$ X_i = \lambda_i\eta + \varepsilon_i. $$
Here, $X_i$ is an observed indicator, $\lambda_i$ is its loading, $\eta$ is a latent variable, and $\varepsilon_i$ is an error term. The equation is a statistical description. The stronger claim enters when we interpret it structurally: we treat $\eta$ as causally efficacious, and we treat the indicators as interchangeable reflections of it.
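As a sketch of what the equation implies statistically, the simulation below generates three indicators from a standardised latent variable. The loadings are illustrative assumptions. With unit-variance $\eta$ and unit-variance indicators, the implied correlation between indicators $i$ and $j$ is $\lambda_i\lambda_j$.

```r
set.seed(2026)
n <- 20000
lambda <- c(0.8, 0.7, 0.6)                 # illustrative loadings (assumed)

eta <- rnorm(n)                            # latent variable, variance 1
eps <- sapply(lambda, function(l) rnorm(n, sd = sqrt(1 - l^2)))  # unique errors
X <- outer(eta, lambda) + eps              # X_i = lambda_i * eta + eps_i

implied <- outer(lambda, lambda)           # model-implied correlations
diag(implied) <- 1

round(cor(X), 2)   # sample correlations
implied            # lambda_i * lambda_j off the diagonal
```

The simulation builds the causal direction in by construction. The correlation matrix it produces does not carry that direction with it.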
Formative model (factor analysis)
In a formative measurement model (a cause-indicator model), the indicators are taken to give rise to, or determine, a (univariate) latent variable. Correlation or interchangeability between indicators is not required. Each indicator can contribute distinctively to the latent variable.
The formative model is often written:
$$ \eta = \sum_i \lambda_iX_i + \varepsilon. $$
Again, the equation is a statistical description. It is not an automatic statement about causal direction.
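A matching sketch for the formative equation, with assumed weights and deliberately independent indicators, shows that nothing in this form requires the indicators to correlate with one another.

```r
set.seed(2026)
n <- 20000

# Three indicators drawn independently of each other (illustrative).
X <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))
lambda <- c(0.5, 0.3, 0.2)                 # illustrative weights (assumed)

# eta = sum_i lambda_i * X_i + eps
eta <- as.vector(X %*% lambda) + rnorm(n, sd = 0.1)

round(cor(X), 2)        # indicators are essentially uncorrelated
round(cor(X, eta), 2)   # yet each is related to the composite
```

Internal-consistency statistics would look poor here even though the composite is defined exactly as intended.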
Statistical models versus structural interpretations
VanderWeele distinguishes a statistical model from a structural interpretation (VanderWeele, 2022). A statistical model describes patterns in the observed covariance structure. A structural interpretation adds causal claims about how the world generates those patterns.
The two diagrams below show structural assumptions that are often taken for granted when scale scores are then used as exposures, outcomes, or confounders in causal analyses.
Why fit is not enough
A well-fitting factor model can be compatible with multiple causal structures. Fit indices alone cannot establish that one latent variable causes all indicators.
This is the central discipline point for this lecture. Fit is about what the statistical model can represent. Identification is about whether, under stated assumptions, the data identify the causal contrast we want.
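A small worked case makes the point concrete. The sketch below computes the population correlations of three variables that affect one another directly, with no latent variable anywhere (the path coefficients are assumptions). It then finds one-factor loadings that reproduce those correlations exactly. A three-indicator, one-factor model is just-identified, so this is the extreme case; with more indicators the model becomes testable, but good fit still does not single out the latent-cause story.

```r
# Population model with NO latent variable:
# x1 -> x2, x1 -> x3, x2 -> x3 (coefficients are illustrative assumptions).
B <- matrix(0, 3, 3, dimnames = list(paste0("x", 1:3), paste0("x", 1:3)))
B["x2", "x1"] <- 0.5
B["x3", "x1"] <- 0.4
B["x3", "x2"] <- 0.4

A <- solve(diag(3) - B)     # x = A %*% e, with independent unit-variance errors e
Sigma <- A %*% t(A)         # implied covariance matrix
R <- cov2cor(Sigma)

# One-factor loadings that reproduce three correlations exactly.
lambda <- c(
  sqrt(R[1, 2] * R[1, 3] / R[2, 3]),
  sqrt(R[1, 2] * R[2, 3] / R[1, 3]),
  sqrt(R[1, 3] * R[2, 3] / R[1, 2])
)
implied <- outer(lambda, lambda)
diag(implied) <- 1

round(lambda, 3)                    # all below 1: an admissible one-factor solution
all.equal(unname(R), implied)       # the factor model reproduces R
```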
Problems with the structural interpretations of reflective and formative factor models
Even if we grant the reflective or formative equations as useful statistical summaries, cross-sectional data do not, by themselves, decide the direction of causation among latent variables and indicators (VanderWeele, 2022). This creates a problem because the standard structural interpretations of reflective and formative models are used implicitly across psychology.
The same statistical forms can be compatible with alternative causal stories in which indicators (or the realities they partially reflect) are causally efficacious for the outcome. The compatibility examples below illustrate the issue. The point is not that one of these diagrams is "true". The point is that fit alone does not decide among them.
There are other compatible structural interpretations as well. For example, the "latent" reality may be multivariate, with different constituents giving rise to different indicators, and only some constituents being causally efficacious for the outcome.
VanderWeele's key observation is that cross-sectional data can describe relationships, but they cannot conclusively determine causal direction. This is worrying because it means that many psychometric checks do not explicitly evaluate the causal assumptions that later causal claims rely upon (VanderWeele, 2022). VanderWeele also discusses longitudinal tests for structural interpretations of univariate latent variables that often do not support the simple causal stories that are presumed. We might describe the uncritical reliance on factor-model structural interpretations as one component of a wider "causal crisis" in the social sciences (Bulbulia, 2023).
Multiple versions perspective
A coarse score may combine multiple underlying states. This is a multiple-versions problem. We can still estimate useful associations, but interpretation must state what is being averaged and what is unidentified.
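To get a sense of how much a total score can average over, the sketch below counts the distinct response patterns behind each total, assuming six items each scored 0 to 4 and summed (the scoring range is an assumption stated here, not taken from the text above).

```r
# Assumption: six items, each scored 0-4, summed to a 0-24 total.
profiles <- expand.grid(rep(list(0:4), 6))
totals <- rowSums(profiles)

# Number of distinct response patterns behind each total score.
patterns_per_total <- table(totals)
patterns_per_total[c("0", "6", "12", "24")]
```

Only the extreme totals correspond to a single pattern. Mid-range totals pool many different item profiles, which is the sense in which a score can stand for multiple versions of the underlying state.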
Review: multiple versions of treatment
The theory of multiple versions of treatment addresses the fact that real interventions are rarely uniform. Let $K$ denote the "true" versioned treatment and let $A$ be a coarsened indicator of $K$.
Recall that a causal effect is the difference in expected potential outcomes if everyone were exposed to one level of a treatment versus another. Under conditional exchangeability given covariates $L$, this contrast is identified by standardising over $L$:
$$ \delta = \sum_l \left( \mathbb{E}[Y\mid A=a,l] - \mathbb{E}[Y\mid A=a^*,l] \right) P(l). $$
Under the multiple-versions interpretation, we can express a consistent estimand in terms of $K$:
$$ \delta = \sum_{k,l} \mathbb{E}[Y_k\mid l] P(k\mid a,l) P(l) - \sum_{k,l} \mathbb{E}[Y_k\mid l] P(k\mid a^*,l) P(l). $$
This corresponds to a hypothetical randomised trial in which, within strata of $L$, the treated group receives versions $K$ drawn from the version distribution among those with $A=a$ and the control group receives versions drawn from the version distribution among those with $A=a^*$ (VanderWeele & Hernan, 2013).
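The versioned estimand can be checked by hand with a small numeric example. All inputs below are hypothetical: two versions, two strata of $L$, and made-up values for $\mathbb{E}[Y_k\mid l]$, $P(k\mid a,l)$, $P(k\mid a^*,l)$, and $P(l)$.

```r
# Hypothetical inputs: two versions (k1, k2), two strata of L (l0, l1).
p_l <- c(l0 = 0.6, l1 = 0.4)                                   # P(l)

# E[Y_k | l]: rows are versions, columns are strata.
ey_k <- rbind(k1 = c(l0 = 1.0, l1 = 1.5),
              k2 = c(l0 = 2.0, l1 = 3.0))

# Version distributions within strata, for A = a and A = a*.
p_k_a     <- rbind(k1 = c(l0 = 0.2, l1 = 0.3), k2 = c(l0 = 0.8, l1 = 0.7))
p_k_astar <- rbind(k1 = c(l0 = 0.9, l1 = 0.6), k2 = c(l0 = 0.1, l1 = 0.4))

# Average over versions within each stratum, then standardise over L.
mean_a     <- sum(colSums(ey_k * p_k_a) * p_l)
mean_astar <- sum(colSums(ey_k * p_k_astar) * p_l)

delta <- mean_a - mean_astar
delta   # 0.6
```

The contrast depends on the version distributions as well as on the version-specific outcomes, so the same $A$ contrast can mean different things in populations that receive different mixes of versions.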
Reflective and formative measurement models as multiple versions
VanderWeele suggests using this framework to interpret constructed measures of psychosocial constructs (VanderWeele, 2022). Roughly, if $A$ is a constructed measure from indicators $(X_1,\dots,X_n)$, then $A$ can be treated as a coarsened indicator of an underlying reality, and the multiple-versions logic can preserve causal interpretability under strong assumptions.
One way to express this is to replace $K$ with an underlying (possibly multivariate) reality $\eta$, and to treat changes in a constructed measure as shifting the distribution of $\eta$ versions:
$$ \delta = \sum_{\eta,l} \mathbb{E}[Y_\eta\mid l] P(\eta\mid A=a+1,l) P(l) - \sum_{\eta,l} \mathbb{E}[Y_\eta\mid l] P(\eta\mid A=a,l) P(l). $$
This offers a reason not to despair. But it is not a free pass. The interpretation remains obscure when we do not have a clear definition of what the causally relevant constituents of the construct are, and when we have not explicitly stated which causal assumptions connect indicators, measures, and outcomes.
VanderWeele's model of reality
VanderWeele concludes by arguing that traditional univariate reflective and formative models do not adequately capture the relations between underlying causally relevant phenomena and our indicators and measures. He argues that the causally relevant constituents of reality are almost always multidimensional, that measure construction should start from construct definition, and that structural interpretations should be tested rather than presumed (VanderWeele, 2022).
VanderWeele's argument can be summarised as the following propositions (VanderWeele, 2022).
- Traditional univariate reflective and formative models do not adequately capture the relations between causally relevant phenomena and indicators and measures.
- The causally relevant constituents of reality related to psychosocial constructs are almost always multidimensional, giving rise to indicators and to our language and concepts.
- Construct measurement should start from an explicit construct definition, from which items are derived and justified.
- The presumption of a structural univariate reflective model can impair measure construction, evaluation, and use.
- If a structural interpretation of a univariate reflective factor model is proposed, it should be tested rather than presumed; factor analysis alone is not sufficient evidence.
- Even when causally relevant constituents are multidimensional but a univariate measure is used, associations with outcomes can be interpreted using multiple versions of treatment theory, though interpretation is obscured without clarity about constituents.
- When data permit, examining associations item-by-item, or in conceptually related item sets, can provide insight into facets of the construct.
This is a compelling sketch. It is not yet a complete causal recipe. In particular, it is not a causal DAG in the sense we have used throughout the course, because the arrows are not yet a clear set of causal claims that we can test with d-separation. It motivates the question we care about in causal inference: what assumptions do we need to connect our constructed measures to the causal contrasts we want to estimate?
A pragmatic causal response: measurement error as a structural threat
We can bring this discussion back to the causal workflow by using causal diagrams to represent measurement dynamics. Let $\eta_A$ denote a "true" exposure state, $\eta_Y$ a "true" outcome state, and $\eta_L$ a "true" confounder state. Let $A_{f(X_1,\dots,X_n)}$, $Y_{f(X_1,\dots,X_n)}$, and $L_{f(X_1,\dots,X_n)}$ denote constructed measures (functions of indicators). Allow unmeasured sources of measurement error, $U_A$, $U_Y$, and $U_L$, to influence the constructed measures.
Read the diagram as a measurement-augmented causal model.
- The $\eta$ nodes are latent realities ($\eta_L$, $\eta_A$, $\eta_Y$). They are the states we would ideally intervene on and measure without error.
- The $\mathrm{var}_{f(X_1,\dots,X_n)}$ nodes are constructed measures: functions of multiple indicators. Simpler measurement DAGs compress this notation, writing the true outcome state as $Y^\ast$ and the recorded measure as $Y$; in that shorthand $\eta_Y$ is $Y^\ast$ and $Y_{f(X_1,\dots,X_n)}$ is $Y$.
- The $U$ nodes are unmeasured sources of error in those constructed measures. They include stable reporting tendencies, transient mood at the time of survey completion, social desirability, and culturally patterned response styles.
The key edges have the following interpretations.
- $\eta_L \rightarrow L_{f(X_1,\dots,X_n)}$: the true confounder state affects its measured realisation.
- $\eta_A \rightarrow A_{f(X_1,\dots,X_n)}$: the true exposure state affects its measured realisation.
- $\eta_Y \rightarrow Y_{f(X_1,\dots,X_n)}$: the true outcome state affects its measured realisation.
- $U_L \rightarrow L_{f(X_1,\dots,X_n)}$, $U_A \rightarrow A_{f(X_1,\dots,X_n)}$, $U_Y \rightarrow Y_{f(X_1,\dots,X_n)}$: unmeasured error sources distort each constructed measure. In the strongest language, our measures "see as through a mirror, in darkness" relative to the underlying reality they hope to capture.
- Correlated errors: $U$ nodes may share common causes, so error in one domain can correlate with error in another (for example, a general tendency to present oneself favourably affects multiple self-reports).
- Directed errors: true states can affect how other variables are measured (for example, exercise might change how people interpret distress items), creating pathways from $\eta_A$ into $U_Y$.
The utility of describing measurement dynamics using causal graphs is that we can see when measurement itself creates new paths. The act of conditioning on measured variables can introduce collider bias when both true states and measurement errors feed into the measured nodes. When unmeasured (multivariate) psycho-physical states are related to unmeasured sources of error in the measurement of those states, measurement can open pathways to confounding.
One key warning: measurement error opens additional pathways to confounding either when errors are correlated or when the exposure causally affects the error in the measured outcome.
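Both cases can be sketched in a simulation where the true exposure has no effect on the true outcome. The error structures and coefficients are assumptions chosen for illustration. A third case with independent, undirected errors is included for comparison.

```r
set.seed(2026)
n <- 20000

eta_A <- rnorm(n)            # true exposure state
eta_Y <- rnorm(n)            # true outcome state: NO effect of eta_A

# Case 1: correlated errors through a shared reporting style.
style <- rnorm(n)
A_corr <- eta_A + 0.8 * style + rnorm(n, sd = 0.5)
Y_corr <- eta_Y + 0.8 * style + rnorm(n, sd = 0.5)

# Case 2: directed error, the true exposure shifts how the outcome is reported.
A_dir <- eta_A + rnorm(n, sd = 0.5)
Y_dir <- eta_Y + 0.5 * eta_A + rnorm(n, sd = 0.5)   # 0.5 * eta_A enters via U_Y, not eta_Y

# Case 3: independent, undirected errors.
A_ind <- eta_A + rnorm(n, sd = 0.5)
Y_ind <- eta_Y + rnorm(n, sd = 0.5)

slopes <- c(
  correlated  = unname(coef(lm(Y_corr ~ A_corr))[2]),
  directed    = unname(coef(lm(Y_dir ~ A_dir))[2]),
  independent = unname(coef(lm(Y_ind ~ A_ind))[2])
)
round(slopes, 2)
```

In the first two cases the measured exposure is associated with the measured outcome although the true effect is zero by construction.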
Confounding control by baseline measures in three-wave panels
One pragmatic design response is to measure baseline values of exposure and outcome (and key confounders), then estimate effects forward in time using a three-wave panel. Adjusting for each variable's own baseline value is called lagged-self adjustment.
- This design adjusts for baseline measurements of both exposure and outcome.
- Understanding this approach in the context of potential directed and correlated measurement errors clarifies its strengths and limitations.
- Baseline measures can reduce the chance that unmeasured sources of measurement error are correlated with later changes in exposure and outcome.
- For example, if individuals have a stable social desirability bias, baseline adjustment absorbs much of its stable component. To create new bias, the tendency would have to change after baseline in a way not captured by the baseline measures.
- However, we cannot eliminate the possibility of new bias development, nor directed effects of exposure on outcome reporting.
- Attrition and non-response can create additional directed measurement structures.
- Despite these challenges, including baseline exposure and outcome measures should be standard practice in multi-wave studies because it reduces the likelihood of novel confounding.
- Because we can never be certain the assumptions hold, we should still perform sensitivity analyses.
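The logic of lagged-self adjustment can be sketched with a simulated three-wave panel in which a stable, unmeasured reporting style contaminates every self-report and the true exposure has no effect on the true outcome. Variable names and coefficients are illustrative assumptions.

```r
set.seed(2026)
n <- 20000

style <- rnorm(n)                               # stable reporting tendency (unmeasured)

eta_A0 <- rnorm(n)                              # true exposure, baseline
eta_A1 <- 0.6 * eta_A0 + rnorm(n, sd = 0.8)     # true exposure, wave 1
eta_Y0 <- rnorm(n)                              # true outcome, baseline
eta_Y2 <- 0.6 * eta_Y0 + rnorm(n, sd = 0.8)     # true outcome, wave 2: NO effect of exposure

# Every self-report carries the same stable style.
A0 <- eta_A0 + 0.7 * style
A1 <- eta_A1 + 0.7 * style
Y0 <- eta_Y0 + 0.7 * style
Y2 <- eta_Y2 + 0.7 * style

unadjusted  <- unname(coef(lm(Y2 ~ A1))["A1"])
lagged_self <- unname(coef(lm(Y2 ~ A1 + A0 + Y0))["A1"])
round(c(unadjusted = unadjusted, lagged_self = lagged_self), 2)
```

Adjusting for baseline exposure and outcome shrinks the spurious coefficient but does not remove it, because the baseline measures are themselves noisy proxies for the reporting style. This is one reason sensitivity analysis remains necessary.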
Pair exercise: fit is not identification
- A one-factor confirmatory factor analysis (CFA) of six K6 items yields CFI = 0.98 and RMSEA = 0.03.
- A colleague claims "the good fit confirms that distress causes all six responses." Evaluate this claim with reference to VanderWeele (2022).
- Choose two of the diagrams above that are compatible with the same statistical factor model, and state what causal assumption differs between them.
- Explain why the choice matters for causal inference downstream (hint: consider what happens when you use the scale score as a confounder or outcome in a DAG).
Return to the opening example
Back to K6.
A total score can still be useful for screening. But without defended structure and invariance, we should avoid strong causal claims about cross-group latent differences.
Our job as investigators is to separate what the model fits from what the design identifies. That discipline applies directly to the final assignment. Outcome-wide ATEs require outcomes that mean what the report says they mean. Policy trees require covariates whose split points can be interpreted without turning proxies into causes. Fairness checks require attention to whether measurement differs across groups before a rule is treated as publicly defensible.
Pair exercise: measurement as an identification problem
- Explain to your partner how scalar invariance failure distorts conditional average treatment effect (CATE) estimates even when exchangeability and positivity hold.
- A colleague says "the K6 has been validated in hundreds of studies, so measurement is not a concern." Counter this claim in two sentences, distinguishing internal consistency from cross-group invariance.
- Propose a workflow step that belongs between drawing the DAG and running estimation, specifically to check measurement assumptions. State what it tests and what a failure would change about the analysis.
Lab materials: Lab 10: End-to-End Research Report
Appendix: VanderWeele measurement lectures
These two lectures are optional supplements for students who want the measurement argument in seminar form. They develop the same core point as the required reading: constructed measures need explicit definitions, item choices, and causal interpretation before they can be used safely in causal analyses.
- Tyler VanderWeele, Constructed Measures and Causal Inference: Towards a New Model of Measurement, Johns Hopkins Causal Inference Working Group. The linked timestamp begins in the measurement-model discussion.
- Tyler VanderWeele, Causal Inference and Measure Construction: Towards a New Model of Measurement, Online Causal Inference Seminar. The linked timestamp begins near the practical implications for measure construction.
References
Bulbulia, J. A. (2023). A workflow for causal inference in cross-cultural psychology. Religion, Brain & Behavior, 13(3), 291–306. https://doi.org/10.1080/2153599X.2022.2070245
Fischer, R., & Karl, J. A. (2019). A primer to (cross-cultural) multi-group invariance testing possibilities in R. Frontiers in Psychology, 10, 1507.
Harkness, J. A., Van de Vijver, F. J., & Johnson, T. P. (2003). Questionnaire design in comparative research. In Cross-cultural survey methods (pp. 19–34). Wiley.
Harkness, J., et al. (2003). Questionnaire translation. In Cross-cultural survey methods (pp. 35–56). Wiley.
He, J., & van de Vijver, F. (2012). Bias and equivalence in cross-cultural research. Online Readings in Psychology and Culture, 2(2). https://doi.org/10.9707/2307-0919.1111
VanderWeele, T. J. (2022). Constructed measures and causal inference: Towards a new model of measurement for psychosocial constructs. Epidemiology, 33(1), 141–151. https://doi.org/10.1097/EDE.0000000000001434
VanderWeele, T. J., & Hernan, M. A. (2013). Causal inference under multiple versions of treatment. Journal of Causal Inference, 1(1), 1–20.