Week 10: Classical Measurement Theory from a Causal Perspective
Date: 13 May 2026
Readings
Required
Optional
Key concepts
- EFA and CFA are model-building tools, not causal proofs.
- Invariance tests are associational diagnostics under a chosen measurement model, not tests of causal comparability.
- Reflective and formative equations need explicit causal interpretation.
- Measurement assumptions can open or close bias paths in DAGs.
Weeks 1 through 9 built a workflow from causal question to policy recommendation. Every step assumed that the outcomes we measure mean the same thing for every group in the target population. If they do not, contrasts between groups can reflect measurement artefact, not causal differences. This week examines that assumption.
Seminar
Classical validity and its limits
Psychology textbooks organise measurement quality around four types of validity.
Four classical validity types
- Content validity: the degree to which an instrument covers the intended domain.
- Construct validity: whether the construct is accurately defined and operationalised.
- Criterion validity: whether an instrument accurately predicts performance on an external criterion.
- Ecological validity: whether an instrument reflects real-world situations and behaviour.
These categories organise useful intuitions. From a causal perspective, each conflates problems that need to be kept separate.
Content validity asks whether items span the construct's domain. It does not specify the causal direction between construct and indicators. Does the construct cause the items (a reflective model), or do the items constitute the construct (a formative model)? Without stating the causal structure, "measures what it's intended to measure" has no formal content.
Construct validity bundles two separate questions. "Accurately defined" concerns the target quantity: what state of the world are we trying to capture? This is analogous to defining a causal estimand (which intervention, in which population, compared with what alternative). "Operationalised" concerns whether the same score means the same thing across individuals. This is a consistency question. Lumping both under one label obscures where a measurement fails.
Criterion validity is purely associational. An instrument can predict an outcome well for non-causal reasons: shared confounders, reverse causation, collider bias. "Predicts performance" tells you nothing about whether the instrument captures the construct that causally affects the criterion. Weeks 2 through 4 showed why prediction and causation are different questions. The same distinction applies here.
Ecological validity gestures at transportability without specifying what changes across settings. A causal framework asks: does the construct-to-indicator relationship hold in the target population? That question is testable through measurement invariance. "Reflects real-world situations" is not testable.
These four categories are qualitative checklists, not properties of a formal model. A causal approach specifies the directed graph connecting constructs to indicators, states the assumptions under which observed scores recover latent quantities, and tests those assumptions. The rest of this lecture shows what that looks like in practice.
Motivating example
The Kessler-6 (K6) is widely used to screen psychological distress.
Two questions must be addressed before we try to compare scores causally across groups.
- Do the six items map to the same latent structure?
- Is that structure invariant across groups?
These questions are necessary, but they are not sufficient. Even if a measurement model fits well and invariance tests pass, causal interpretation still depends on a defended causal story about what the construct is and how it is measured.
Why this is a causal lecture
Measurement is part of identification. If measurement is unstable, effect estimates and group contrasts can be distorted even when adjustment sets are otherwise defensible.
Learning outcomes
By the end of this week, you should be able to:
- State what each classical validity type (content, construct, criterion, ecological) claims, and identify the causal assumptions each leaves implicit.
- Run EFA and CFA with clear model-comparison logic.
- Run configural, metric, and scalar/threshold invariance tests, and state what they do not establish.
- Explain why good fit does not prove a causal latent model.
- Explain why invariance tests do not deliver causal structure.
- Link measurement assumptions to DAG-based bias reasoning.
Part 1: practical workflow with K6
Step 1: prepare data and inspect factorability
library(margot)
library(tidyverse)
library(performance)
k6 <- df_nz |>
filter(wave == 2018) |>
select(
kessler_depressed,
kessler_effort,
kessler_hopeless,
kessler_worthless,
kessler_nervous,
kessler_restless
)
check_factorstructure(k6)
Bartlett and KMO are entry checks. They do not validate causal interpretation.
Step 2: run EFA
library(psych)
library(parameters)
efa <- psych::fa(k6, nfactors = 3, rotate = "oblimin") |>
model_parameters(sort = TRUE, threshold = "max")
efa
Oblique rotation is usually appropriate because psychological dimensions often co-vary.
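Before committing to a factor count, it can help to triangulate across several retention criteria rather than fixing `nfactors` by hand. A minimal sketch using `parameters::n_factors()` on the `k6` data frame from Step 1 (this aggregates heuristics such as parallel analysis and scree-based rules; it is a descriptive aid, not a causal claim about latent structure):

```r
library(parameters)

# aggregate multiple retention methods (parallel analysis, scree,
# information criteria) and report how many methods favour each solution
n <- n_factors(k6)
n

# as.data.frame(n) lists each method's individual suggestion
```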
Step 3: compare CFA candidates
library(datawizard)
library(lavaan)
library(performance)
set.seed(123)
parts <- data_partition(k6, training_proportion = 0.7, seed = 123)
train <- parts$p_0.7
test <- parts$test
m1_syntax <- psych::fa(train, nfactors = 1) |> efa_to_cfa()
m2_syntax <- psych::fa(train, nfactors = 2) |> efa_to_cfa()
m3_syntax <- psych::fa(train, nfactors = 3) |> efa_to_cfa()
m1 <- suppressWarnings(cfa(m1_syntax, data = test))
m2 <- suppressWarnings(cfa(m2_syntax, data = test))
m3 <- suppressWarnings(cfa(m3_syntax, data = test))
compare_performance(m1, m2, m3, verbose = FALSE)
Read CFI, RMSEA, AIC, and BIC together. Prefer simpler models when fit is comparable.
Step 4: test invariance across groups
We teach invariance testing because it is widely used and because you may be asked to use it in comparative work. Treat it as a descriptive diagnostic, not as a generator of causal insight. A multi-group CFA invariance test is conditional on a specified measurement model (usually reflective), a chosen parameterisation, and assumptions such as local independence (no causal relations among items once the latent variable is held fixed). Passing an invariance test does not show that the same causally relevant construct exists in both groups. It only shows that a particular statistical measurement model, with particular equality constraints, is compatible with the observed covariance structure. Failures and successes are both compatible with many causal stories. Causal structure is underdetermined by associations in the data.
In this lecture, causal assumptions come first, and this means that statistical tests cannot replace thinking about our assumptions. We must define the construct and state a causal measurement story. Only then can invariance tests play a role, by checking some statistical implications of that story.
For ordinal items, threshold invariance is the analogue of scalar invariance. In practice, fit an ordinal estimator (for example WLSMV) when items are Likert-type.
library(semTools)
k6_eth <- df_nz |>
filter(wave == 2018, eth_cat %in% c("euro", "maori")) |>
select(
kessler_depressed,
kessler_effort,
kessler_hopeless,
kessler_worthless,
kessler_nervous,
kessler_restless,
eth_cat
)
# example template
# note: measurementInvariance() is deprecated in recent versions of
# semTools; measEq.syntax() is its current replacement
# measurementInvariance(
# model = model_syntax,
# data = k6_eth,
# group = "eth_cat"
# )
- Configural invariance: same loading pattern.
- Metric invariance: same loadings.
- Scalar/threshold invariance: same intercepts (continuous) or thresholds (ordinal).
How are these tests useful?
- They describe whether a chosen measurement model assigns similar roles to items across groups (within that model class). This can be a compact summary of group differences in the covariance structure.
- They can suggest where to look. If constraints fail, you learn which items or thresholds are most responsible, which can motivate substantive work (translation review, response-process interviews, item-by-item analysis, or redesign).
- They keep you honest about what is not identified. Even if every invariance test "passes", the causal meaning of the construct is not certified. If a test "fails", it does not tell you whether the problem is measurement bias or a real causal difference in the underlying state (for example, different causes of distress producing different item dynamics). To decide that, you need a causal story and often new data, not a better fit index.
Also note that item means can differ for many reasons that have nothing to do with biased measurement. If group A is more distressed than group B, we should expect different item means even under perfect measurement. Invariance testing is not a way to label differences as artefact. It is a way to describe whether a particular psychometric model is stable across groups.
Under the standard invariance interpretation, if scalar/threshold invariance fails then latent mean comparisons are not identified within that measurement model. In this course, treat this as a warning about interpretability under the assumed model, not as evidence that any observed group difference is "measurement bias" rather than a real difference in underlying causal reality.
Pair exercise: interpreting invariance results
- The K6 is tested across two ethnic groups. Configural invariance holds. Metric invariance holds. Scalar/threshold invariance fails on two items: "felt hopeless" and "felt worthless."
- State what each level of invariance means in plain language (same structure, same loadings, same intercepts/thresholds).
- State the standard invariance interpretation of scalar/threshold failure for group mean comparisons, then state the causal critique: why a cross-sectional associational test cannot decide whether the difference is measurement bias or a real difference in the underlying causal reality.
- Propose a hypothesis for why "felt hopeless" and "felt worthless" might function differently across groups (consider cultural norms, translation, different anchoring of response categories, or different causes of distress).
- Suppose all invariance tests passed. State one reason this would still not settle the causal question of whether "distress" is the same outcome across groups.
Part 2: how traditional measurement theory fails (for causal inference)
This part returns to the core point of the lecture. Causal inference operates under assumptions. For measurement, it is causality all the way down. This is not a matter of attitude. It is a consequence of the potential outcomes framework: causal contrasts require well-defined outcomes under interventions, and constructed measures are outcomes of causal processes (question wording, translation, response styles, incentives, and the world that generates the experiences being reported).
Classical psychometric checks (internal consistency, model fit, invariance tests) can organise associations. They do not, by themselves, evaluate the causal assumptions about direction and causal efficacy that are often imported into practice when we move from a measurement model to a causal claim. The causal question is not "does the model fit?" The causal question is "under which causal assumptions does this measured quantity behave like the variable in our DAG?"
Two ways of thinking about measurement in psychometric research
In psychometric research, formative and reflective models describe the relationship between latent variables and indicators. VanderWeele discusses this distinction, and its implications for causal inference with constructed measures, in the required reading (VanderWeele, 2022).
Reflective model (factor analysis)
In a reflective measurement model (an effect-indicator model), the latent variable is taken to cause the observed indicators. Each indicator is a reflection of the latent variable.
The reflective model is often written:
$$ X_i = \lambda_i\eta + \varepsilon_i. $$
Here, $X_i$ is an observed indicator, $\lambda_i$ is its loading, $\eta$ is a latent variable, and $\varepsilon_i$ is an error term. The equation is a statistical description. The stronger claim enters when we interpret it structurally: we treat $\eta$ as causally efficacious, and we treat the indicators as interchangeable reflections of it.
Formative model (factor analysis)
In a formative measurement model (a cause-indicator model), the indicators are taken to give rise to, or determine, a (univariate) latent variable. Correlation or interchangeability between indicators is not required. Each indicator can contribute distinctively to the latent variable.
The formative model is often written:
$$ \eta = \sum_i \lambda_iX_i + \varepsilon. $$
Again, the equation is a statistical description. It is not an automatic statement about causal direction.
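The difference between the two statistical forms can be made concrete with a small simulation. In the reflective form, a common latent cause induces correlation among indicators; in the formative form, indicators can be mutually independent and still define the composite. (Illustrative only; the loadings and sample size are arbitrary choices, not estimates from any dataset.)

```r
set.seed(1)
n <- 5000

# reflective: eta causes each indicator, so indicators correlate
eta_r <- rnorm(n)
x1 <- 0.8 * eta_r + rnorm(n, sd = 0.6)
x2 <- 0.7 * eta_r + rnorm(n, sd = 0.7)
cor(x1, x2)  # substantial correlation, induced entirely by the common cause

# formative: independent indicators jointly determine the composite
z1 <- rnorm(n)
z2 <- rnorm(n)
eta_f <- 0.6 * z1 + 0.6 * z2 + rnorm(n, sd = 0.3)
cor(z1, z2)  # near zero: the formative form requires no indicator correlation
```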
Statistical models versus structural interpretations
VanderWeele distinguishes a statistical model from a structural interpretation (VanderWeele, 2022). A statistical model describes patterns in the observed covariance structure. A structural interpretation adds causal claims about how the world generates those patterns.
The two diagrams below show structural assumptions that are often taken for granted when scale scores are then used as exposures, outcomes, or confounders in causal analyses.
Why fit is not enough
A well-fitting factor model can be compatible with multiple causal structures. Fit indices alone cannot establish that one latent variable causes all indicators.
This is the central discipline point for this lecture. Fit is about what the statistical model can represent. Identification is about whether, under stated assumptions, the data identify the causal contrast we want.
Problems with the structural interpretations of reflective and formative factor models
Even if we grant the reflective or formative equations as useful statistical summaries, cross-sectional data do not, by themselves, decide the direction of causation among latent variables and indicators (VanderWeele, 2022). This creates a problem because the standard structural interpretations of reflective and formative models are used implicitly across psychology.
The same statistical forms can be compatible with alternative causal stories in which indicators (or the realities they partially reflect) are causally efficacious for the outcome. The compatibility examples below illustrate the issue. The point is not that one of these diagrams is "true". The point is that fit alone does not decide among them.
There are other compatible structural interpretations as well. For example, the "latent" reality may be multivariate, with different constituents giving rise to different indicators, and only some constituents being causally efficacious for the outcome.
VanderWeele's key observation is that cross-sectional data can describe relationships, but they cannot conclusively determine causal direction. This is worrying because it means that many psychometric checks do not explicitly evaluate the causal assumptions that later causal claims rely upon (VanderWeele, 2022). VanderWeele also discusses longitudinal tests for structural interpretations of univariate latent variables that often do not support the simple causal stories that are presumed. We might describe the uncritical reliance on factor-model structural interpretations as one component of a wider "causal crisis" in the social sciences (Bulbulia, 2023).
Multiple versions perspective
A coarse score may combine multiple underlying states. This is a multiple-versions problem. We can still estimate useful associations, but interpretation must state what is being averaged and what is unidentified.
Review: multiple versions of treatment
The theory of multiple versions of treatment addresses the fact that real interventions are rarely uniform. Let $K$ denote the "true" versioned treatment and let $A$ be a coarsened indicator of $K$.
Recall that a causal effect is defined as the difference in expected potential outcomes if everyone were exposed to one level of a treatment versus another, conditional on covariates $L$:
$$ \delta = \sum_l \left( \mathbb{E}[Y\mid A=a,l] - \mathbb{E}[Y\mid A=a^*,l] \right) P(l). $$
Under the multiple-versions interpretation, we can express a consistent estimand in terms of $K$:
$$ \delta = \sum_{k,l} \mathbb{E}[Y_k\mid l] P(k\mid a,l) P(l) - \sum_{k,l} \mathbb{E}[Y_k\mid l] P(k\mid a^*,l) P(l). $$
This corresponds to a hypothetical randomised trial in which, within strata of $L$, the treated group receives versions $K$ drawn from the version distribution among those with $A=a$ and the control group receives versions drawn from the version distribution among those with $A=a^*$ (VanderWeele & Hernán, 2013).
Reflective and formative measurement models as multiple versions
VanderWeele suggests using this framework to interpret constructed measures of psychosocial constructs (VanderWeele, 2022). Roughly, if $A$ is a constructed measure from indicators $(X_1,\dots,X_n)$, then $A$ can be treated as a coarsened indicator of an underlying reality, and the multiple-versions logic can preserve causal interpretability under strong assumptions.
One way to express this is to replace $K$ with an underlying (possibly multivariate) reality $\eta$, and to treat changes in a constructed measure as shifting the distribution of $\eta$ versions:
$$ \delta = \sum_{\eta,l} \mathbb{E}[Y_\eta\mid l] P(\eta\mid A=a+1,l) P(l) - \sum_{\eta,l} \mathbb{E}[Y_\eta\mid l] P(\eta\mid A=a,l) P(l). $$
This offers a reason not to despair. But it is not a free pass. The interpretation remains obscure when we do not have a clear definition of what the causally relevant constituents of the construct are, and when we have not explicitly stated which causal assumptions connect indicators, measures, and outcomes.
VanderWeele's model of reality
VanderWeele concludes by arguing that traditional univariate reflective and formative models do not adequately capture the relations between underlying causally relevant phenomena and our indicators and measures. He argues that the causally relevant constituents of reality are almost always multidimensional, that measure construction should start from construct definition, and that structural interpretations should be tested rather than presumed (VanderWeele, 2022).
VanderWeele's argument can be summarised as the following propositions (VanderWeele, 2022).
- Traditional univariate reflective and formative models do not adequately capture the relations between causally relevant phenomena and indicators and measures.
- The causally relevant constituents of reality related to psychosocial constructs are almost always multidimensional, giving rise to indicators and to our language and concepts.
- Construct measurement should start from an explicit construct definition, from which items are derived and justified.
- The presumption of a structural univariate reflective model can impair measure construction, evaluation, and use.
- If a structural interpretation of a univariate reflective factor model is proposed, it should be tested rather than presumed; factor analysis alone is not sufficient evidence.
- Even when causally relevant constituents are multidimensional but a univariate measure is used, associations with outcomes can be interpreted using multiple versions of treatment theory, though interpretation is obscured without clarity about constituents.
- When data permit, examining associations item-by-item, or in conceptually related item sets, can provide insight into facets of the construct.
This is a compelling sketch. It is not yet a complete causal recipe. In particular, it is not a causal DAG in the sense we have used throughout the course, because the arrows are not yet a clear set of causal claims that we can test with d-separation. It motivates the question we care about in causal inference: what assumptions do we need to connect our constructed measures to the causal contrasts we want to estimate?
A pragmatic causal response: measurement error as a structural threat
We can bring this discussion back to the causal workflow by using causal diagrams to represent measurement dynamics. Let $\eta_A$ denote a "true" exposure state, $\eta_Y$ a "true" outcome state, and $\eta_L$ a "true" confounder state. Let $A_{f(X_1,\dots,X_n)}$, $Y_{f(X_1,\dots,X_n)}$, and $L_{f(X_1,\dots,X_n)}$ denote constructed measures (functions of indicators). Allow unmeasured sources of measurement error, $U_A$, $U_Y$, and $U_L$, to influence the constructed measures.
Read the diagram as a measurement-augmented causal model.
- The $\eta$ nodes are latent realities ($\eta_L$, $\eta_A$, $\eta_Y$). They are the states we would ideally intervene on and measure without error.
- The $var_{f(X_1,\dots,X_n)}$ nodes are constructed measures: functions of multiple indicators.
- The $U$ nodes are unmeasured sources of error in those constructed measures. They include stable reporting tendencies, transient mood at the time of survey completion, social desirability, and culturally patterned response styles.
The key edges have the following interpretations.
- $\eta_L \rightarrow L_{f(X_1,\dots,X_n)}$: the true confounder state affects its measured realisation.
- $\eta_A \rightarrow A_{f(X_1,\dots,X_n)}$: the true exposure state affects its measured realisation.
- $\eta_Y \rightarrow Y_{f(X_1,\dots,X_n)}$: the true outcome state affects its measured realisation.
- $U_L \rightarrow L_{f(X_1,\dots,X_n)}$, $U_A \rightarrow A_{f(X_1,\dots,X_n)}$, $U_Y \rightarrow Y_{f(X_1,\dots,X_n)}$: unmeasured error sources distort each constructed measure. In the strongest language, our measures "see as through a mirror, in darkness" relative to the underlying reality they hope to capture.
- Correlated errors: $U$ nodes may share common causes, so error in one domain can correlate with error in another (for example, a general tendency to present oneself favourably affects multiple self-reports).
- Directed errors: true states can affect how other variables are measured (for example, exercise might change how people interpret distress items), creating pathways from $\eta_A$ into $U_Y$.
The utility of describing measurement dynamics using causal graphs is that we can see when measurement itself creates new paths. The act of conditioning on measured variables can introduce collider bias when both true states and measurement errors feed into the measured nodes. When unmeasured (multivariate) psycho-physical states are related to unmeasured sources of error in the measurement of those states, measurement can open pathways to confounding.
One key warning is that measurement error opens additional pathways to confounding when either errors are correlated or when the exposure causally affects the error in the measured outcome.
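This reasoning can be checked mechanically with the dagitty R package. Below is a minimal sketch of a measurement-augmented DAG in which the true exposure state affects error in the measured outcome; the node names follow the notation above, and the graph itself is an illustrative assumption, not a claim about any particular study:

```r
library(dagitty)

g <- dagitty("dag {
  eta_A -> eta_Y      # true causal effect of interest
  eta_A -> A_meas     # true exposure state affects its measured realisation
  eta_Y -> Y_meas     # true outcome state affects its measured realisation
  U_A -> A_meas       # unmeasured error in the exposure measure
  U_Y -> Y_meas       # unmeasured error in the outcome measure
  eta_A -> U_Y        # directed error: true exposure shapes outcome reporting
}")

# enumerate paths between the measured nodes; any path that bypasses
# eta_A -> eta_Y is a potential bias path opened by measurement
paths(g, from = "A_meas", to = "Y_meas")
```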
Confounding control by baseline measures in three-wave panels
One pragmatic design response is to measure baseline values of exposure and outcome (and key confounders), then estimate effects forward in time using a three-wave panel.
- This design adjusts for baseline measurements of both exposure and outcome.
- Understanding this approach in the context of potential directed and correlated measurement errors clarifies its strengths and limitations.
- Baseline measures can reduce the chance that unmeasured sources of measurement error are correlated with later changes in exposure and outcome.
- For example, if individuals carry a stable social desirability bias, conditioning on baseline measures absorbs its influence; to introduce new bias, the bias would have to change after baseline in a way unrelated to its baseline expression.
- However, we cannot eliminate the possibility of new bias development, nor directed effects of exposure on outcome reporting.
- Attrition and non-response can create additional directed measurement structures.
- Despite these challenges, including baseline exposure and outcome measures should be standard practice in multi-wave studies because it reduces the likelihood of novel confounding.
- Because we can never be certain the assumptions hold, we should still perform sensitivity analyses.
Pair exercise: fit is not identification
- A one-factor confirmatory factor analysis (CFA) of six K6 items yields CFI = 0.98 and RMSEA = 0.03.
- A colleague claims "the good fit confirms that distress causes all six responses." Evaluate this claim with reference to VanderWeele (2022).
- Choose two of the diagrams above that are compatible with the same statistical factor model, and state what causal assumption differs between them.
- Explain why the choice matters for causal inference downstream (hint: consider what happens when you use the scale score as a confounder or outcome in a DAG).
Return to the opening example
Back to K6.
A total score can still be useful for screening. But without defended structure and invariance, we should avoid strong causal claims about cross-group latent differences.
Our job as investigators is to separate what the model fits from what the design identifies.
Pair exercise: measurement as an identification problem
- Explain to your partner how scalar invariance failure distorts conditional average treatment effect (CATE) estimates even when exchangeability and positivity hold.
- A colleague says "the K6 has been validated in hundreds of studies, so measurement is not a concern." Counter this claim in two sentences, distinguishing internal consistency from cross-group invariance.
- Propose a workflow step that belongs between drawing the DAG and running estimation, specifically to check measurement assumptions. State what it tests and what a failure would change about the analysis.
Appendix: if you use an LLM
Copy/paste prompt (use the LLM as a tutor, not a replacement):
You are my tutor. Do not solve the problem for me. Ask me short questions and wait for my answers.
We are working in the potential outcomes framework. Do not treat any associational model output (regression, factor analysis, SEM, invariance tests, fit indices) as evidence of causal structure.
Your job is to help me do the causal thinking. Start by asking me to state:
- The target population.
- The causal contrast (intervention vs control), with timing.
- The outcome, and how it is measured (what the questions/items are).
- The causal estimand (ATE, CATE, etc.).
Then ask me to list the identification assumptions (consistency, exchangeability, positivity) and the measurement assumptions I am making. If I mention "validity", "reliability", "fit", or "invariance", ask me what causal assumption I think that statistic is meant to support, and what causal alternative it fails to rule out.
Only after I answer should you give feedback. Keep feedback focused on whether my assumptions are explicit, plausible, and testable, and on what additional design or data would reduce reliance on unsupported causal assumptions.
Lab materials: Lab 10: Measurement Invariance
Bulbulia, J. A. (2023). A workflow for causal inference in cross-cultural psychology. Religion, Brain & Behavior, 13(3), 291–306. https://doi.org/10.1080/2153599X.2022.2070245
Fischer, R., & Karl, J. A. (2019). A primer to (cross-cultural) multi-group invariance testing possibilities in R. Frontiers in Psychology, 10, 1507.
Harkness, J. A., Van de Vijver, F. J., & Johnson, T. P. (2003). Questionnaire design in comparative research. In Cross-cultural survey methods (pp. 19–34). Wiley.
Harkness, J. A., et al. (2003). Questionnaire translation. In Cross-cultural survey methods (pp. 35–56). Wiley.
He, J., & Van de Vijver, F. J. R. (2012). Bias and equivalence in cross-cultural research. Online Readings in Psychology and Culture, 2(2). https://doi.org/10.9707/2307-0919.1111
VanderWeele, T. J. (2022). Constructed measures and causal inference: Towards a new model of measurement for psychosocial constructs. Epidemiology, 33(1), 141–151. https://doi.org/10.1097/EDE.0000000000001434
VanderWeele, T. J., & Hernán, M. A. (2013). Causal inference under multiple versions of treatment. Journal of Causal Inference, 1(1), 1–20.