Causal inference
Suppose we want to infer whether a dichotomous exposure has a causal effect. We must answer three questions:
- What would happen with exposure?
- What would happen with no exposure?
- Do these potential outcomes differ?
Consider a binary exposure \(A\) with two levels, \(A = 0\) and \(A=1\), and a continuous outcome \(Y\). Estimating the causal effect of \(A\) on \(Y\) requires contrasting two states of the world: the state \(Y^{a=1}\) when, perhaps contrary to fact, \(A\) is set to \(1\), and the state \(Y^{a=0}\) when, perhaps contrary to fact, \(A\) is set to \(0\). We say that \(A\) affects \(Y\) when \(Y^{a=1} - Y^{a=0} \neq 0\) (Hernan and Robins 2023).
The fundamental problem of causal inference
At any given time, for any individual, at most one level of the exposure \(A\) may be realised. That is, \(Y^{a=1}\) and \(Y^{a=0}\) are never simultaneously observed in any individual. As such, individual-level causal effects are not typically identified in the data. This is referred to as “the fundamental problem of causal inference” (Rubin 1976; Gelman, Hill, and Vehtari 2020). \(Y^{a=1}\) and \(Y^{a=0}\) are therefore called “potential” or “counterfactual” outcomes.
Although we cannot generally observe causal effects in any individual, when certain assumptions are satisfied, we may identify the average or marginal causal effect at the level of a population. We say there is an average or marginal causal effect if the average outcome were everyone in the population exposed differs from the average outcome were no one exposed: \(E(Y^{a=1}) - E(Y^{a=0})\neq 0\). Because the difference of the means is equivalent to the mean of the differences, we may equivalently say there is a marginal causal effect if \(E(Y^{a=1} - Y^{a=0})\neq 0\) (Hernan and Robins 2023).¹
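To make the distinction between individual and average effects concrete, here is a minimal Python simulation (not part of the NZAVS workflow, which uses R). All numbers are assumptions chosen for illustration: a simulated population in which both potential outcomes are known by construction, an assumed true average effect of 0.5, and a randomised exposure.

```python
import random
import statistics

random.seed(2022)  # arbitrary seed, for reproducibility only

n = 10_000
# Hypothetical potential outcomes: in a simulation we can see both,
# in real data only one of them is ever observed for each unit.
y0 = [random.gauss(0.0, 1.0) for _ in range(n)]
y1 = [v + 0.5 + random.gauss(0.0, 0.2) for v in y0]  # assumed average effect: 0.5

# The mean of the differences equals the difference of the means.
mean_of_differences = statistics.fmean([b - c for b, c in zip(y1, y0)])
difference_of_means = statistics.fmean(y1) - statistics.fmean(y0)

# Randomised exposure: each unit reveals exactly one potential outcome.
exposure = [random.randint(0, 1) for _ in range(n)]
y_obs = [b if e == 1 else c for e, b, c in zip(exposure, y1, y0)]

exposed = [y for y, e in zip(y_obs, exposure) if e == 1]
unexposed = [y for y, e in zip(y_obs, exposure) if e == 0]
observed_contrast = statistics.fmean(exposed) - statistics.fmean(unexposed)

print(f"mean of differences: {mean_of_differences:.3f}")
print(f"difference of means: {difference_of_means:.3f}")
print(f"observed contrast under randomisation: {observed_contrast:.3f}")
```

The first two quantities agree up to floating-point error. The third only approximates them, and only because the exposure was assigned at random.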
Critically, to ensure valid inference, the potential outcomes must be (conditionally) independent of the exposure:
\[Y^a \perp\!\!\!\perp A|L\]
Or equivalently:
\[A \perp\!\!\!\perp Y^a |L\]
Outside of controlled experiments we cannot generally ensure this conditional independence. For this reason, we use sensitivity analyses, such as the E-value, to assess how robust our estimates are to unmeasured confounding (VanderWeele, Mathur, and Chen 2020).
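As a rough illustration of what an E-value computes, the sketch below implements the widely used point-estimate formula on the risk-ratio scale, \(RR + \sqrt{RR(RR - 1)}\), taking the inverse first when the risk ratio is below 1. The function name `e_value` is our own, hypothetical choice, and the sketch covers only the point estimate of a risk ratio; it is not the code used in our analyses.

```python
import math


def e_value(risk_ratio: float) -> float:
    """E-value for a point estimate on the risk-ratio scale.

    Hypothetical helper for illustration only.
    """
    if risk_ratio <= 0:
        raise ValueError("risk_ratio must be positive")
    # Protective estimates (RR < 1) are inverted before applying the formula.
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


# An observed risk ratio of 2 could be explained away only by an unmeasured
# confounder associated with both exposure and outcome by a risk ratio of
# at least this size, conditional on the measured covariates.
print(round(e_value(2.0), 3))
```

A null estimate (a risk ratio of 1) has an E-value of 1, meaning no unmeasured confounding is needed to explain it.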
G-computation (or regression standardisation)
To consistently estimate a causal effect, we must infer the average outcomes for the entire population were it subject to two different levels of the exposure variable, \(A = a\) and \(A=a*\):
\[ATE = E[Y^{a}] - E[Y^{a*}] = \sum_l E[Y|A=a, L=l]\Pr[L=l] - \sum_l E[Y|A=a*, L=l]\Pr[L=l]\]
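For readers who want to see why a counterfactual mean can be written in terms of observed quantities, here is a sketch of the standard identification argument for a single exposure level \(a\). It relies on the conditional independence stated earlier, and also on consistency and positivity, two standard assumptions that we do not spell out in this post.

\[
\begin{aligned}
E[Y^{a}] &= \sum_l E[Y^{a}|L=l]\Pr[L=l] && \text{(law of total expectation)}\\
&= \sum_l E[Y^{a}|A=a, L=l]\Pr[L=l] && \text{(conditional independence: } Y^a \perp\!\!\!\perp A|L\text{)}\\
&= \sum_l E[Y|A=a, L=l]\Pr[L=l] && \text{(consistency: } Y = Y^{a} \text{ when } A = a\text{)}
\end{aligned}
\]

Positivity, \(\Pr[A=a|L=l] > 0\) for every \(l\) with \(\Pr[L=l] > 0\), ensures that each conditional expectation in the second and third lines is well defined. The same argument applies at level \(a*\).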
We need not estimate \(\Pr(L = l)\) directly. Rather, we obtain the weighted mean over the distribution of the confounders in the data by taking the double expectation (Hernan and Robins 2023, 166):
\[ATE = E[E(Y|A = a, \boldsymbol{L}) - E(Y|A = a*, \boldsymbol{L})]\]
To obtain marginal contrasts for the expected outcomes for the entire population, we use the stdReg package in R (Sjölander 2016).
The steps for G-computation
- First, we fit a regression for each outcome on the exposure \(A_{t0}\) and baseline covariates \(\boldsymbol{L} = (L_1, L_2, L_3, \dots, L_n)\). To avoid implausible assumptions of linearity, we model the relationship between the exposure and each of the outcomes of interest using a cubic spline. We include in the set of baseline confounders \(\boldsymbol{L}\) the baseline measure of the exposure as well as the baseline response (or responses):
\[\{A_{t-1}, \boldsymbol{Y}_{t-1} = (Y_{1, t-1}, Y_{2, t-1}, Y_{3, t-1}, \dots, Y_{n, t-1})\} \subset \boldsymbol{L}\]
This gives us
\[E(Y|A, \boldsymbol{L})\]
- Second, we use the model fitted in the first step to predict the values of a potential outcome \(Y^{a}\) by setting the exposure to the value \(A = a\).
This gives us
\[\hat{E}(Y|A = a, \boldsymbol{L})\]
- Third, we use the model fitted in the first step to predict the values of a different potential outcome \(Y^{a*}\) by setting the exposure to a different value, \(A = a*\).
This gives us
\[\hat{E}(Y|A = a*,\boldsymbol{L})\]
- Fourth, we obtain the focal contrast as the difference in the expected average outcomes when the exposure is set to levels \(a\) and \(a*\). Where outcomes are continuous, we calculate the mean of \(Y^a\) and the mean of \(Y^{a*}\) and then obtain their difference. This gives us:
\[\hat{ATE} = \hat{E}[\hat{E}(Y|A = a, \boldsymbol{L}) - \hat{E}(Y|A = a*, \boldsymbol{L})]\]
Where outcomes are binary, we calculate the causal risk ratio in moving between the different levels of \(A\). Where a binary outcome is more common than 10%, we use a log-normal model to calculate a causal rate ratio (VanderWeele, Mathur, and Chen 2020).
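The four steps above can be sketched in a few lines of Python for a continuous outcome. This is an illustration under strong simplifying assumptions, not our stdReg workflow: the data are simulated, there is a single binary confounder, the true effect is set to 1.0 by construction, and the outcome model is a saturated model (cell means) rather than a spline regression. All variable names are hypothetical.

```python
import random
import statistics
from collections import defaultdict

random.seed(1)  # arbitrary seed, for reproducibility only

# Simulated data with confounding: L affects both the exposure and the outcome.
n = 20_000
rows = []
for _ in range(n):
    l = random.randint(0, 1)                        # binary baseline confounder
    a = 1 if random.random() < (0.8 if l else 0.2) else 0
    y = 1.0 * a + 2.0 * l + random.gauss(0.0, 1.0)  # assumed true effect of A: 1.0
    rows.append((a, l, y))

# Step 1: fit E(Y | A, L). With two binary predictors, a saturated model
# is simply the mean of Y within each (A, L) cell.
cells = defaultdict(list)
for a, l, y in rows:
    cells[(a, l)].append(y)
fitted = {cell: statistics.fmean(values) for cell, values in cells.items()}

# Step 2: predict every unit's outcome with the exposure set to a = 1.
pred_a = [fitted[(1, l)] for _, l, _ in rows]
# Step 3: predict every unit's outcome with the exposure set to a* = 0.
pred_a_star = [fitted[(0, l)] for _, l, _ in rows]

# Step 4: average the predictions over the sample and take the difference.
ate_hat = statistics.fmean(pred_a) - statistics.fmean(pred_a_star)

# For comparison: the unadjusted contrast, which ignores L.
naive = (statistics.fmean([y for a, _, y in rows if a == 1])
         - statistics.fmean([y for a, _, y in rows if a == 0]))

print(f"G-computation estimate: {ate_hat:.2f}")
print(f"Unadjusted contrast:    {naive:.2f}")
```

In this simulated setting the standardised estimate lands near the assumed effect, whereas the unadjusted contrast is inflated because exposed units are more likely to have \(L = 1\).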
The stdReg package in R calculates standard errors using the delta method, from which we construct confidence intervals under asymptotic assumptions (Sjölander 2016). Additionally, we pool uncertainty arising from the multiple imputation procedure by employing Rubin’s Rules.
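For a scalar estimate, Rubin's Rules can be sketched as follows. The function name `pool_rubin` and the input values are hypothetical, chosen only to show the arithmetic; the sketch returns the pooled estimate and its total variance and leaves aside degrees-of-freedom adjustments.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool estimates from m imputed datasets (hypothetical helper).

    estimates: one point estimate per imputed dataset
    variances: the squared standard error of each estimate
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, with one variance each")
    pooled = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between  # total variance
    return pooled, total


# Made-up ATE estimates from three imputed datasets, each with standard error 0.2.
pooled_estimate, total_variance = pool_rubin([0.9, 1.0, 1.1], [0.04, 0.04, 0.04])
print(f"pooled estimate: {pooled_estimate:.3f}")
print(f"pooled standard error: {total_variance ** 0.5:.3f}")
```

The between-imputation term is what carries the extra uncertainty due to missing data: the more the estimates disagree across imputed datasets, the wider the pooled interval.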
Applications
See upcoming “Outcome-wide science” reports for illustrations of how we apply G-computation to NZAVS studies.
Acknowledgements
We are grateful to Arvid Sjölander for his help in modifying his stdReg package in R to enable G-computation with multiply-imputed datasets.
For a simple visual guide to G-computation, see Kat Hoffman’s blog.
References
Footnotes
Our example uses a binary exposure, but we may also contrast the counterfactual outcomes for two levels of a continuous exposure. To do this, however, we must state the levels of exposure at which we seek comparisons.
Citation
@online{bulbulia2022,
author = {Joseph Bulbulia},
title = {G-Computation in {NZAVS} {Studies}},
date = {2022-11-05},
url = {https://go-bayes.github.io/b-causal/},
langid = {en}
}