G-computation in NZAVS Studies

Causal Inference

Joseph Bulbulia

Victoria University of Wellington, New Zealand



Causal Inference

Suppose we want to infer whether a binary exposure has a causal effect. We must answer three questions:

  1. What would happen with exposure?
  2. What would happen without exposure?
  3. Do these potential outcomes differ?

Consider a binary exposure \(A\) with two levels, \(A = 0\) and \(A = 1\) and a continuous outcome \(Y\). Estimating the causal effect of \(A\) on outcome \(Y\) requires contrasting two counterfactual states of the world: the state \(Y^{a=1}\) when, hypothetically, \(A\) is set to \(1\) and the state \(Y^{a=0}\) when, hypothetically, \(A\) is set to \(0\). We say that \(A\) affects \(Y\) when the quantities \(Y^{a=1} - Y^{a=0} \neq 0\) (Hernan and Robins 2023).

The Fundamental Problem of Causal Inference

At any given time, for any individual, at most only one level of the exposure \(A\) may be realised. That is, \(Y^{a=1}\) and \(Y^{a=0}\) are never simultaneously observed for any individual. As such, individual-level causal effects are not typically identifiable in the data. This is referred to as “the fundamental problem of causal inference” (Rubin 1976; Gelman, Hill, and Vehtari 2020). \(Y^{a=1}\) and \(Y^{a=0}\) are therefore called “potential” or “counterfactual” outcomes.

Although we generally cannot observe individual causal effects, when certain assumptions are satisfied, we may identify the average or marginal causal effect at the population level. We say there is an average or marginal causal effect if the difference of the means of those who are exposed and not exposed does not equal zero: \(E(Y^{a=1}) - E(Y^{a=0})\neq 0\). Because the difference of the means is equivalent to the mean of the differences, we may equivalently say there is a marginal causal effect if \(E(Y^{a=1} - Y^{a=0})\neq 0\) (Hernan and Robins 2023).

To ensure valid inference, the potential outcomes must be conditionally independent of the exposure:

\[Y^a \perp\!\!\!\perp A|L\]

Outside of controlled experiments, we cannot typically ensure this conditional independence. For this reason, we use sensitivity analysis, such as the E-value (VanderWeele, Mathur, and Chen 2020).


What is G-Computation?

G-computation, sometimes called “regression standardisation” is a non-parametric method for estimating causal effects from observational data in the presence of confounding variables. It was introduced by James Robins in 1986 and is based on the potential outcomes framework. G-computation allows for estimating the causal effects of time-varying treatments and can handle complex treatment regimens and effect modification.

How Does G-Computation Work?

To consistently estimate a causal association from statistical associations in the data, we must infer the average outcomes for the entire population were it subject to different levels of the exposure variable \(A = a\) and \(A=a^*\). The average treatment effect (ATE) can be expressed as:

\[ATE = E[Y^{a^*}] - E[Y^a] = \sum_l E[Y|A=a^*, L=l]\Pr[L=l] - \sum_l E[Y|A=a, L=l]\Pr[L=l]\]

However, we need not estimate \(\Pr(L = l)\) directly. Instead, we can obtain the weighted mean for the distribution of the confounders in the data by taking the double expectation: (Hernan and Robins 2023, 166).

\[ATE = E[E(Y|A = a, \boldsymbol{L}) - E(Y|A = a^*, \boldsymbol{L})]\]

Step 1. Fit a regression for the outcome on the exposure \(A_{t0}\) and baseline covariates \(\boldsymbol{L} = (L_1, L_2,L_3 \dots L_n)\). To avoid making unjustified linearity assumptions, model the relationship between the exposure and each of the potential outcomes of interest using a cubic spline. Include in the set of baseline confounders \(\boldsymbol{L}\), the baseline measure of the exposure as well as the baseline responses:

\[\{A_{t-1}, \boldsymbol{Y_{t-1}} = (Y_{1, t-1}, Y_{2, t-1}, Y_{3, t-1}\dots Y_{n, t-1})\} \subset \boldsymbol{L}\]

This provides

\[E(Y|A, \boldsymbol{L})\]

Step 2. Use the model from step 1 to predict the values of a potential outcome \(Y^{a}\) by setting the exposure to the value \(A = a\).

This provides

\[\hat{E}(Y|A = a, \boldsymbol{L})\]

Step 3. Use the model from step 1 to predict the values of a different potential outcome \(Y^{a*}\) by setting the exposure to a different value of \(A = a^{\star}\).

This provides

\[\hat{E}(Y|A = a^*, \boldsymbol{L})\]

Step 4. Obtain the focal contrast as the expected difference in the average outcomes when exposure is at levels \(a*\) and \(a\). For continuous outcomes, calculate the mean of \(Y^a\) and the mean of \(Y^{a*}\) and then find their difference. This provides:

\[\hat{ATE} = \hat{E}[\hat{E}(Y|A = a, \boldsymbol{L}) - \hat{E}(Y|A = a^*, \boldsymbol{L})]\]

Step 5. For binary outcomes, calculate the causal risk ratio in moving between different levels of A. When binary outcomes are more common than 10%, use a log-normal model to calculate a causal rate ratio (VanderWeele, Mathur, and Chen 2020).

Step 6. The stdReg package in R calculates standard errors using the Delta method, from which we construct confidence intervals under asymptotic assumptions (Sjölander 2016). Additionally, we pool uncertainty arising from the multiple imputation procedure by employing Rubin’s Rules.

Step 7. Finally, we perform sensitivity analyses to assess the sensitivity of the causal effect estimates to modelling assumptions.

What are the assumptions of g-computation?

  1. Outcome Model Misspecification: G-computation’s performance depends on the correct specification of the outcome model. If the model is misspecified, g-computation can produce biased estimates of the ATE.

  2. Confounding: G-computation relies on the assumption that there are no unmeasured confounders. If this assumption is violated, the estimates can be biased.


See upcoming “Outcome-wide science” reports for illustrations of how we apply G-computation to NZAVS studies.


We are grateful to Arvid Sjölander for his help in modifying his stdReg package in R to enable G-computation with multiply-imputed datasets.

For a simple visual guide to G-computation see Kat Hoffman’s blog


Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.
Hernan, M. A., and J. M. Robins. 2023. Causal Inference. Chapman & Hall/CRC Monographs on Statistics & Applied Probab. Taylor & Francis. https://books.google.co.nz/books?id=\_KnHIAAACAAJ.
Rubin, D. B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92. https://doi.org/10.1093/biomet/63.3.581.
Sjölander, Arvid. 2016. “Regression Standardization with the R Package stdReg.” European Journal of Epidemiology 31 (6): 563–74. https://doi.org/10.1007/s10654-016-0157-3.
VanderWeele, Tyler J, Maya B Mathur, and Ying Chen. 2020. “Outcome-Wide Longitudinal Designs for Causal Inference: A New Template for Empirical Studies.” Statistical Science 35 (3): 437466.




BibTeX citation:
  author = {Bulbulia, Joseph},
  title = {G-Computation in {NZAVS} {Studies}},
  date = {2022-11-05},
  url = {https://go-bayes.github.io/b-causal/},
  langid = {en}
For attribution, please cite this work as:
Bulbulia, Joseph. 2022. “G-Computation in NZAVS Studies.” November 5, 2022. https://go-bayes.github.io/b-causal/.