Selection Bias and Measurement Bias

quick reset: what would you condition on?

for each graph, ask:

what path connects A and Y?
would conditioning on the middle variable help, hurt, or change the question?

warm-up 1: common cause

question: if you want the causal effect of A on Y, would you condition on \boxed{L}?

warm-up 2: mediator

question: if you want the total effect of A on Y, would you condition on \boxed{M}?

warm-up 3: collider

question: if you want the causal effect of A on Y, would conditioning on \boxed{C} help?

warm-up 4: descendant of a collider

question: if you do not condition on C, but you do condition on its descendant \boxed{D}, what happens?

warm-up 5: descendant of a common cause

question: if L is unmeasured, would conditioning on \boxed{D} be the same as conditioning on \boxed{L}?

warm-up summary

three rules from earlier weeks:

common cause: often condition
mediator: do not condition if you want the total effect
collider: do not condition

two extensions for this week:

descendant of a collider: also dangerous to condition on
descendant of a common cause: not a guaranteed substitute for the common cause
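
The five warm-up answers can be checked with a small simulation. This is only a sketch: the data-generating process, the coefficients, and the variable names D_C and D_L are assumptions made for illustration, and the true effect of A on Y is set to zero so any non-zero slope is bias.

```r
set.seed(2024)
n <- 50000

# Assumed data-generating process: A has NO causal effect on Y.
L   <- rnorm(n)              # common cause of A and Y
A   <- 0.8 * L + rnorm(n)
Y   <- 0.8 * L + rnorm(n)
C   <- A + Y + rnorm(n)      # collider: common effect of A and Y
D_C <- C + rnorm(n)          # descendant of the collider
D_L <- L + rnorm(n)          # noisy descendant of the common cause

# Helper: slope on A from a linear regression.
slope <- function(formula) unname(coef(lm(formula))["A"])

b_crude  <- slope(Y ~ A)           # open path A <- L -> Y: biased away from 0
b_adj_L  <- slope(Y ~ A + L)       # conditioning on L closes the path
b_adj_DL <- slope(Y ~ A + D_L)     # proxy for L: only part of the bias removed
b_coll   <- slope(Y ~ A + L + C)   # conditioning on the collider opens a path
b_desc   <- slope(Y ~ A + L + D_C) # its descendant does the same, more weakly

round(c(crude = b_crude, adj_L = b_adj_L, adj_D_L = b_adj_DL,
        collider = b_coll, collider_desc = b_desc), 2)
```

Under these assumptions only the model adjusting for L (and nothing downstream of A and Y) should return a slope near zero.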

Motivating example: one study, two failure modes

bilingualism study recruited through university mailing lists:

Selection: people with high academic motivation and strong language confidence are more likely to enrol
Measurement: the cognitive task is validated only in English, so non-English-dominant participants may be mismeasured

Why this week extends Weeks 2–3

weeks 2–3: confounding

week 4 adds:

Threat	Source of bias
Selection bias	Who enters the analytic sample
Measurement bias	How variables are recorded

Common causal questions as graphs

different questions require different graphs

Measurement error: two dimensions

two dimensions of measurement error:

	Uncorrelated errors	Correlated errors
Independent	Attenuates effects	Creates spurious associations
Dependent	Opens non-causal paths	Biases in either direction

Independent measurement error

left: uncorrelated errors tend to attenuate estimated effects toward zero. right: a shared cause of errors (U) creates spurious associations even when no causal effect exists.
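
A minimal simulated sketch of both panels, assuming classical additive error in a continuous exposure. All numbers and the names A_star, A0_star, and Y0_star are hypothetical.

```r
set.seed(1)
n <- 50000

# Left panel: true effect of A on Y is 0.5; A is recorded with independent noise.
A      <- rnorm(n)
Y      <- 0.5 * A + rnorm(n)
A_star <- A + rnorm(n)                              # error unrelated to A, Y, or other errors

b_true <- unname(coef(lm(Y ~ A))["A"])              # close to 0.5
b_obs  <- unname(coef(lm(Y ~ A_star))["A_star"])    # pulled toward zero

# Right panel: no causal effect at all, but both errors share a cause U.
A0      <- rnorm(n)
Y0      <- rnorm(n)                                 # independent of A0 by construction
U       <- rnorm(n)                                 # shared cause of the two errors
A0_star <- A0 + U
Y0_star <- Y0 + U

r_true <- cor(A0, Y0)                               # close to 0
r_obs  <- cor(A0_star, Y0_star)                     # clearly positive

round(c(b_true = b_true, b_obs = b_obs, r_true = r_true, r_obs = r_obs), 2)
```

The amount of attenuation in the left panel depends on how noisy the measure is. Here the error variance equals the variance of A, an arbitrary choice.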

Dependent measurement error

left: the true exposure affects measurement of the outcome (red diagonal), opening a non-causal path. right: dependent errors with a shared cause (U) can bias in either direction.
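
The left panel can be sketched as follows, loosely echoing the English-only cognitive task from the opening example. The effect sizes and the name Y_star are assumptions for illustration, and the true effect of A on Y is zero by construction.

```r
set.seed(11)
n <- 50000

A <- rbinom(n, 1, 0.5)   # hypothetical exposure, e.g. non-English-dominant bilingual
Y <- rnorm(n)            # true outcome: unaffected by A

# Recorded outcome: the instrument under-scores exposed participants,
# so the measurement error depends on the true exposure.
Y_star <- Y - 0.5 * A + rnorm(n, sd = 0.5)

b_true <- unname(coef(lm(Y ~ A))["A"])        # close to 0
b_obs  <- unname(coef(lm(Y_star ~ A))["A"])   # close to -0.5: pure measurement artefact

round(c(b_true = b_true, b_obs = b_obs), 2)
```

Flipping the sign of the 0.5 term would flip the sign of the artefact, which is one way to see that this kind of error can bias in either direction.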

Selection bias

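selection can act like collider conditioning: analysing only selected participants conditions on selection, and if both A and Y (or their causes) affect selection, a non-causal path opens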
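
A simulated sketch of this, assuming selection S depends on both A and Y and that A has no effect on Y. The threshold and noise level are arbitrary.

```r
set.seed(7)
n <- 50000

A <- rnorm(n)
Y <- rnorm(n)                    # independent of A in the full population
S <- (A + Y + rnorm(n)) > 1      # selection into the sample depends on A and Y

d <- data.frame(A, Y, S)

b_full     <- unname(coef(lm(Y ~ A, data = d))["A"])          # close to 0
b_selected <- unname(coef(lm(Y ~ A, data = d[d$S, ]))["A"])   # negative

round(c(full = b_full, selected = b_selected), 2)
```

No adjustment variable was added to the model here. The conditioning happens through who is in the data, which is why selection bias can be easy to miss in an analysis script.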

Selection bias without colliders

No confounding, no collider. A is randomised. Yet if Z modifies the effect of A on Y (open circle), and Z is distributed differently in the sample than in the target population, the sample ATE does not transport.
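
A sketch of this case with simulated data. The effect sizes and selection probabilities are assumptions, and the last step assumes the target distribution of Z is known.

```r
set.seed(3)
n <- 100000

# Target population: Z is 50/50, A is randomised.
Z <- rbinom(n, 1, 0.5)
A <- rbinom(n, 1, 0.5)
Y <- 1 * A + 2 * A * Z + rnorm(n)   # effect of A is 1 when Z = 0 and 3 when Z = 1

# Selection depends on Z only (not on A or Y), so no collider is involved.
S <- rbinom(n, 1, ifelse(Z == 1, 0.9, 0.1)) == 1

pop  <- data.frame(Z, A, Y)
samp <- pop[S, ]

# Difference in mean outcome between treated and untreated.
ate <- function(d) mean(d$Y[d$A == 1]) - mean(d$Y[d$A == 0])

ate_target <- ate(pop)    # about 2
ate_sample <- ate(samp)   # about 2.8: valid for the sample, wrong for the target

# Standardise the sample's Z-specific effects to the target distribution of Z.
ate_by_z        <- sapply(0:1, function(z) ate(samp[samp$Z == z, ]))
p_z_target      <- c(mean(pop$Z == 0), mean(pop$Z == 1))
ate_transported <- sum(ate_by_z * p_z_target)   # about 2 again

round(c(target = ate_target, sample = ate_sample,
        transported = ate_transported), 2)
```

The sample estimate is internally valid because A is randomised. It fails only as a claim about the target population.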

Target, source, and analytic populations

Population	Role
Target	Where we want the causal claim to apply
Source	Where recruitment occurs
Analytic sample	Who is actually analysed

transportability asks whether effect-relevant structure carries across populations

WEIRD samples and effect heterogeneity

WEIRD sampling is a problem when effect modifiers are distributed differently in the target population

Link to Week 10

measurement invariance is transport for constructs

Return to the opening example

back to the bilingualism study:

Selection: why did these participants enter the analytic sample, and does that selection depend on treatment or outcome?
Measurement: do the instruments measure the same constructs across all participants?

reading a regression in R

basic pattern:

fit <- lm(
  exam_score ~ study_hours + motivation,
  data = df_scores
)

read it left to right:

fit <-	store the model
lm()	fit a linear model
exam_score	outcome
~	modelled as a function of
study_hours + motivation	predictors
data = df_scores	where the variables live

what changes when the formula changes?

# one predictor
lm(exam_score ~ study_hours, data = df_scores)

# no predictor, only an intercept
lm(exam_score ~ 1, data = df_scores)

# two predictors
lm(exam_score ~ study_hours + motivation, data = df_scores)

# interaction
lm(exam_score ~ study_hours * workshop, data = df_scores)

~ 1	fits a flat mean line
+ motivation	adjusts for one more variable
* workshop	allows the slope for study_hours to differ by workshop
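
To try these formulas without the course data, here is a sketch that builds a stand-in df_scores. The simulated values and coefficients are invented for illustration only, and workshop is assumed to be coded 0/1.

```r
set.seed(42)
n <- 200

# Hypothetical stand-in for df_scores (not the course dataset).
df_scores <- data.frame(
  study_hours = runif(n, 0, 20),
  motivation  = rnorm(n),
  workshop    = rbinom(n, 1, 0.5)
)
df_scores$exam_score <- 50 + 1.5 * df_scores$study_hours +
  3 * df_scores$motivation +
  0.5 * df_scores$study_hours * df_scores$workshop +
  rnorm(n, sd = 5)

# Intercept only: the single coefficient is the sample mean of the outcome.
fit_mean <- lm(exam_score ~ 1, data = df_scores)
coef(fit_mean)
mean(df_scores$exam_score)

# a * b is shorthand for a + b + a:b (both main effects plus their product).
fit_int  <- lm(exam_score ~ study_hours * workshop, data = df_scores)
fit_long <- lm(exam_score ~ study_hours + workshop + study_hours:workshop,
               data = df_scores)
names(coef(fit_int))
```

The study_hours:workshop coefficient is the difference in the study_hours slope between the two workshop groups.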

useful follow-up lines

summary(fit)

inspect the fitted model.

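# abline() needs an intercept and one slope, so refit with a single predictor
fit_simple <- lm(exam_score ~ study_hours, data = df_scores)
plot(df_scores$study_hours, df_scores$exam_score)
abline(fit_simple)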

see what the fitted line is doing.

predict(fit)

get fitted values from the model.
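
predict() can also return predictions for new rows through its newdata argument. A sketch with a simulated stand-in for df_scores (the values and the new_students data frame are hypothetical):

```r
set.seed(42)
n <- 200

# Hypothetical stand-in for df_scores (not the course dataset).
df_scores <- data.frame(
  study_hours = runif(n, 0, 20),
  motivation  = rnorm(n)
)
df_scores$exam_score <- 50 + 1.5 * df_scores$study_hours +
  3 * df_scores$motivation + rnorm(n, sd = 5)

fit <- lm(exam_score ~ study_hours + motivation, data = df_scores)

# With no newdata, predict() returns the fitted values for the original rows.
pred_in <- predict(fit)

# With newdata, it predicts for the supplied rows; the column names
# must match the predictors in the formula.
new_students <- data.frame(study_hours = c(5, 10), motivation = c(0, 0))
pred_new <- predict(fit, newdata = new_students)
pred_new
```

The two new students differ only by five study hours, so their predictions differ by five times the study_hours coefficient.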

Readings

Required and optional readings for each week are listed on the course readings page.