Causal Inference: Estimation of ATE and CATE

Author
Affiliation

Joseph Bulbulia

Victoria University of Wellington, New Zealand

Published

April 29, 2025

Key concepts for the test(s):
  • Confounding vs. confounder
  • Causal Estimand, Statistical Estimand, Statistical Estimator
  • Inverse probability of treatment weights using propensity scores: modelling the exposure or treatment, not the outcome
  • Subgroup analysis using propensity scores
For the lab, copy and paste code chunks following the “LAB” section.

Seminar

Learning Outcomes

  • You will learn how to state a causal question using a three-wave panel design
  • You will learn how to identify causal effects in three waves of panel data
  • You will learn how to perform propensity-score weighting: modelling the exposure or treatment, not the outcome
  • You will learn how to perform subgroup analysis with doubly robust estimation

Why is this lesson important?

The methods you will learn today will help you to define and answer comparative questions in psychology.

Review: The Fundamental Problem of Causal Inference as a Missing Data Problem

Recall the fundamental problem of causal inference, returning to the question of whether bilingualism improves cognitive abilities:

  • Y_i^{a = 1}: The cognitive ability of child i if they were bilingual. This is the counterfactual outcome when A = 1.
  • Y_i^{a = 0}: The cognitive ability of child i if they were monolingual. This is the counterfactual outcome when A = 0.

The causal effect of bilingualism on cognitive ability for individual i is then defined as the difference between these potential outcomes:

\text{Causal Effect}_i = Y_i^{a=1} - Y_i^{a=0}

We say there is a causal effect if:

Y_i^{a=1} - Y_i^{a=0} \neq 0

However, we only observe one of the potential outcomes for each child. The other outcome is not observed because a child cannot simultaneously receive and not receive bilingual exposure.

The fact that causal contrasts are not observed in individuals is called “The fundamental problem of causal inference.”

Although we typically cannot observe individual causal effects, we can obtain average causal effects when certain assumptions are satisfied.

\begin{align}
E(\delta) &= E(Y^{a=1} - Y^{a=0}) \\
&= E(Y^{a=1}) - E(Y^{a=0}) \\
&= \text{ATE}
\end{align}

We may identify average causal effects from the data when the following assumptions are met:

  • Causal Consistency: the exposure values under comparison correspond to well-defined interventions that, in turn, correspond to the versions of treatment in the data.
  • Positivity: the probability of receiving every value of the exposure within every stratum of covariates is greater than zero.
  • Exchangeability: the conditional probability of receiving every value of the exposure, though not decided by the investigators, depends only on the measured covariates.

Further assumptions:

  • No Interference, also known as the Stable Unit Treatment Value Assumption (SUTVA), requires that the treatment given to one unit (e.g., person, group, organization) does not interfere with the potential outcomes of another unit. Put differently, there are no “spillover” effects. Note: this assumption may be regarded as part of causal consistency, namely that each individual has only one potential outcome under each treatment condition.
  • Correctly specified model: the requirement that the underlying statistical model used to estimate causal effects accurately represents the true relationships between the variables of interest. We say the model should capture the “functional form” of the relationship between the treatment, the outcome, and any covariates; the functional form should be flexible enough to capture the true underlying relationship, and the estimated causal effects may be biased if it is not. Additionally, the model must avoid omitted-variable bias by including all relevant confounders, and it should correctly handle missing data arising from non-response or loss to follow-up. We will return to the bias arising from missing data in the weeks ahead. For now, it is important to note that causal inference assumes our model is correctly specified.
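To make these ideas concrete, here is a minimal R simulation (entirely hypothetical data, not part of the lab). Because the exposure is randomised, exchangeability holds by design, and the difference in observed group means recovers the ATE even though no individual causal effect is ever observed.

# minimal simulation: individual effects are unobservable, but the ATE is identified
# under exchangeability (here guaranteed by randomisation)
set.seed(123)
n  <- 10000
y0 <- rnorm(n, mean = 5, sd = 1)      # potential outcome under a = 0
y1 <- y0 + 0.5                        # potential outcome under a = 1 (true ATE = 0.5)
a  <- rbinom(n, 1, 0.5)               # randomised exposure
y  <- ifelse(a == 1, y1, y0)          # only one potential outcome is observed per person

mean(y1 - y0)                         # true ATE (knowable only inside a simulation)
mean(y[a == 1]) - mean(y[a == 0])     # estimate from the observed data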

Subgroup analysis

Recall, Effect Modification (also known as “heterogeneity of treatment effects” and “effect-measure modification”) occurs when the causal effect of intervention A varies across different levels of another variable G:

E(Y^{a=1}|G=g_1, L=l) - E(Y^{a=0}|G=g_1, L=l) \neq E(Y^{a=1}|G=g_2, L=l) - E(Y^{a=0}|G=g_2, L=l)

Effect modification indicates that the magnitude of the causal effect of intervention A is related to the modifier variable G level. As discussed last week, effect modification can be observed even when there is no direct causal interaction between the treatment and the modifier variable. We noted that interaction in causal inference refers to a situation where the combined effect of two interventions is not equal to the sum of their individual effects. Effect modification, on the other hand, occurs when the causal effect of one intervention varies across different levels of another variable.

We also noted that, for comparative research, we are typically interested in effect modification, which requires subgroup analysis.

Causal Estimand, Statistical Estimand, Statistical Estimator

Let’s set subgroup analysis to the side for a moment and begin focussing on statistical estimation.

Suppose a researcher wants to understand the causal effect of marriage on individual happiness. Participants in the study are surveyed for their marital status (“married” or “not married”) and their self-reported happiness on a scale from 1 to 10.

Causal Estimand

  • Definition: The causal estimand is the specific quantity or parameter that we aim to estimate to understand the causal effect of an intervention or treatment on an outcome.

  • Example: Here, the Causal Estimand would be the Average Treatment Effect (ATE) of being married on happiness. Specifically, we define the ATE as the difference in the potential outcomes of happiness if all individuals were married versus if no individuals were married:

    \text{ATE} = E[Y^{a=1} - Y^{a=0}]

    Here, Y^{a=1} represents the potential happiness score if an individual is married, and Y^{a=0} if they are not married.

Next step: Are Causal Assumptions Met?

  • Identification (Exchangeability): balance in the confounders across the treatments to be compared

  • Consistency: well-defined interventions

  • Positivity: treatments occur within levels of covariates L

Statistical Estimand (next step)

  • The problem: how do we bridge the gap between potential outcomes and data?

  • Definition: the statistical estimand is the parameter or function that summarises the relationship between variables as described by a statistical model applied to data.

  • Example: for our study, the Statistical Estimand might be the mean difference in happiness scores between individuals who are married and those who are not, as derived from a linear regression model:

    \text{Happiness} = \beta_0 + \beta_1 \times \text{Married} + \epsilon

    In this equation, \beta_1 represents the estimated difference in happiness scores between the married and non-married groups.

Statistical Estimator

  • Definition: a statistical estimator is a rule or method by which a numerical estimate of a statistical estimand is calculated from the data.

  • Example: in our marriage study, the Statistical Estimator for \beta_1 is the ordinary least squares (OLS) estimator. This estimator is used to calculate \beta_1 from the sample data provided by the survey. It provides an estimate of the impact of being married on happiness, calculated using: \hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} where X_i is a binary indicator for being married (1 for married, 0 for not married), Y_i is the observed happiness score, and \bar{X}, \bar{Y} are the sample means of X and Y, respectively.

(Note: you will not need to know this equation for the quiz)
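As a quick check, we can verify in R that this hand-computed slope matches the coefficient returned by lm(). The data below are simulated and purely illustrative (the variables married and happiness are hypothetical stand-ins for the survey measures):

# simulate illustrative data and compare the hand-computed OLS slope with lm()
set.seed(2025)
n <- 500
married   <- rbinom(n, 1, 0.5)                      # binary exposure indicator
happiness <- 5 + 0.4 * married + rnorm(n)           # illustrative happiness score

beta1_hand <- sum((married - mean(married)) * (happiness - mean(happiness))) /
  sum((married - mean(married))^2)

beta1_lm <- coef(lm(happiness ~ married))["married"]

c(by_hand = beta1_hand, by_lm = unname(beta1_lm))   # the two estimates agree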

The upshot: we anchor our causal inquiries within a multi-step framework of data analysis. This involves:

  1. clearly defining our causal estimand within a specified target population,
  2. clarifying our assumptions, especially the identification assumptions,
  3. describing a statistical strategy for extracting this estimand from the data, and then
  4. applying an algorithm that embodies this statistical method.

Methods for Statistical Estimation in Causal Inference: Inverse Probability of Treatment Weights Using Propensity Scores

Last week, we discussed confounding control using regression adjustment. Recall the formula for the average treatment effect (ATE) when conditioning on a set of covariates L:

\begin{aligned} \text{ATE} = E[Y^{a=1} \mid L = l] - E[Y^{a=0} \mid L = l] \quad \text{for any value of } l \end{aligned}

“We say that a set L of measured non-descendants of treatment A is a sufficient set for confounding adjustment when conditioning on L blocks all backdoor paths—that is, the treated and the untreated are exchangeable within levels of L” (Hernán & Robins, Causal Inference, p. 86).

This formula calculates the expected outcome difference between treated (a=1) and untreated (a=0) groups, given a specific value of the covariates l.

Inverse Probability of Treatment Weighting (IPTW) takes a different approach. We create a pseudo-population where the treatment assignment is independent of the observed covariates by assigning weights to each individual based on their propensity scores.

We do this by modelling the treatment, not the outcome.

Denote the treatment indicator by A, where A = 1 if an individual receives treatment and A = 0 otherwise. L represents the vector of observed covariates, and Y^a the potential outcomes. The propensity score, e(L), is defined as the probability of receiving the treatment given the observed covariates:

e(L) = P(A = 1 \mid L)

To obtain IPTW weights, compute the inverse probability of treatment:

v_i = \frac{A_i}{\hat{e}(L_i)} + \frac{1 - A_i}{1 - \hat{e}(L_i)}

which simplifies to

v_i = \begin{cases} \frac{1}{\hat{e}(L_i)} & \text{if } A_i = 1 \\ \frac{1}{1-\hat{e}(L_i)} & \text{if } A_i = 0 \end{cases}

where v_i is the IPTW weight for individual i, A_i is the treatment indicator for individual i, and \hat{e}(L_i) is the estimated propensity score for individual i.
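A minimal sketch of this step in R, using simulated data (the names A, L, e_hat, and v are hypothetical, not drawn from the lab data): estimate the propensity score with a logistic regression of the exposure on the covariates, then form the IPTW weights.

# estimate propensity scores and IPTW weights on simulated data
set.seed(42)
n <- 2000
L <- rnorm(n)                                          # a single measured confounder
A <- rbinom(n, 1, plogis(-0.5 + 0.8 * L))              # exposure depends on L
e_hat <- glm(A ~ L, family = binomial)$fitted.values   # estimated propensity score e(L)
v <- A / e_hat + (1 - A) / (1 - e_hat)                 # IPTW weights
summary(v)                                             # inspect for extreme weights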

How might we use these weights to obtain causal effect estimates?

Marginal Structural Models (MSMs)

Marginal Structural Models (MSMs) estimate causal effects without requiring an “outcome model” that stratifies on covariates. Rather, MSMs employ weights derived from the inverse probability of treatment weighting (IPTW) to create a pseudo-population in which the distribution of covariates is independent of treatment assignment over time.

The general form of an MSM can be expressed as follows:

E[Y^a] = \beta_0 + \beta_1a

where E[Y^a] is the expected outcome under treatment a and \beta_0 and \beta_1 are parameters estimated by fitting the weighted model. Again, the weights used in the MSM, typically derived from the IPTW (or another treatment model), adjust for the confounding, allowing the model to estimate the unbiased effect of the treatment on the outcome without requiring covariates in the model.

Where do weights fit in? Note that we have E[Y^a] in place of E[Y \mid A=a]. When applying the IPTW weights to the linear model E[Y^a] = \beta_0 + \beta_1 a, each observation’s contribution is weighted by v_i, so estimation becomes a weighted sum of outcomes in which each individual is up- or down-weighted to reflect their probability of receiving the treatment they actually received, given their covariates.
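Continuing the sketch above (still simulated, illustrative data): we fit the MSM as a weighted regression of the outcome on the treatment alone; confounding by L is handled by the weights rather than by covariates in the outcome model.

# fit the MSM by weighted least squares; beta_1 estimates E[Y^{a=1}] - E[Y^{a=0}]
Y <- 1 + 0.6 * A + 0.5 * L + rnorm(n)             # outcome confounded by L
msm <- lm(Y ~ A, weights = v)                     # weights stand in for covariate adjustment
coef(msm)["A"]                                    # estimate of the ATE
# for valid standard errors, use a robust (sandwich) variance, e.g.
# lmtest::coeftest(msm, vcov = sandwich::vcovHC(msm, type = "HC2"))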

Interpretation of \beta_0 and \beta_1 in a Marginal Structural Model

Binary Treatment

In models where the treatment a is binary (e.g., a = 0 or a = 1), such as in many causal inference studies:

  • \beta_0: the expected value of the outcome Y when the treatment is not applied (a = 0). This is the baseline level of the outcome in the absence of treatment.
  • \beta_1: the change in the expected outcome when the treatment status changes from 0 to 1. In logistic regression, \beta_1 represents the log-odds ratio of the outcome for the treatment group relative to the control group. In linear regression, \beta_1 quantifies the difference in the average outcome between the treated and untreated groups.

Continuous Treatment

When the treatment a is continuous, the interpretation of \beta_0 and \beta_1 adjusts slightly:

  • \beta_0: represents the expected value of the outcome Y when the treatment a is at its reference value (often zero).
  • \beta_1: represents the expected change in the outcome for each unit increase in the treatment. In this case, \beta_1 measures the gradient or slope of the relationship between the treatment and the outcome. For every one-unit increase in treatment, the outcome changes by \beta_1 units, assuming all other factors remain constant.

How can we apply marginal structural models in subgroups?

Assumptions

  • Model assumptions: the treatment model is correctly specified.
  • Causal assumptions: all confounders are appropriately controlled, positivity and consistency assumptions hold.

Calculating Treatment Weights (Propensity Scores) and Confounding Control in Subgroups

We may often achieve greater balance when conducting weighted analyses in subgroups by estimating propensity scores within these subgroups. The propensity score e(L, G) is the conditional probability of receiving the exposure A = 1, given the covariates L and the subgroup indicator G. This is often modelled using logistic regression or other methods that ensure covariate balance. We define the estimated propensity score as follows:

\hat{e} = P(A = 1 \mid L, G) = f_A(L, G; \theta_A)

Here, f_A(L, G; \theta_A) is the statistical model estimating the probability of exposure A = 1 given covariates L and subgroup G. The parameter vector \theta_A encapsulates all the coefficients in this model, including intercepts, slopes, and potentially other parameters depending on the model’s complexity (e.g., interaction terms, non-linear effects, etc.).

We then calculate a weight for each individual, denoted v, from the estimated propensity score. These weights depend on A: they equal the inverse of the estimated propensity score \hat{e} for exposed individuals and the inverse of 1 - \hat{e} for unexposed individuals.

Propensity scores are estimated separately within strata of the subgroup to control for potential confounding tailored to each subgroup. These weights v are specific to each individual in subgroup G. In the lab, we will clarify how to fit models to estimate contrasts for the causal effects within groups \hat{\delta}_{g}, \hat{\delta}_{g'}, etc., and how to obtain estimates for group-wise differences:

\hat{\gamma} = \overbrace{\big( \hat{E}[Y^a \mid G=g] - \hat{E}[Y^{a'} \mid G=g] \big)}^{\hat{\delta}_g} - \overbrace{\big( \hat{E}[Y^a \mid G=g'] - \hat{E}[Y^{a'} \mid G=g'] \big)}^{\hat{\delta}_{g'}}

  • \hat{E}[Y^a \mid G=g]: Estimated expected outcome when treatment a is applied to subgroup G=g.

  • \hat{E}[Y^{a'} \mid G=g]: Estimated expected outcome when a different treatment or control a' is applied to the same subgroup G=g.

  • \hat{\delta}_g: Represents the estimated treatment effect within subgroup G=g, computed as the difference in expected outcomes between treatment a and a' within this subgroup.

  • \hat{E}[Y^a \mid G=g']: Estimated expected outcome when treatment a is applied to a different subgroup G=g'.

  • \hat{E}[Y^{a'} \mid G=g']: Estimated expected outcome when the treatment or control a' is applied to subgroup G=g'.

  • \hat{\delta}_{g'}: Represents the estimated treatment effect within subgroup G=g', computed as the difference in expected outcomes between treatment a and a' within this subgroup.

  • \hat{\gamma}: The overall measure calculated from your formula represents the difference in treatment effects between two subgroups, G=g and G=g'. It quantifies how the effect of switching between treatments a and a' differs across the two subgroups.

Considerations

  • Estimation: to estimate the expected outcomes \hat{E}[Y^a \mid G] and \hat{E}[Y^{a'} \mid G], we require statistical models. If we use regression, we include interaction terms between treatment and subgroup indicators to estimate subgroup-specific treatment effects directly (see the sketch after this list). Valid use of these estimates depends on correct model specification.
  • Confidence intervals: we may compute confidence intervals for \hat{\gamma} using the bootstrap, the delta method, or – in our exercises – simulation-based methods.
  • Causal assumptions: again, a causal interpretation of \hat{\gamma} relies on satisfying both causal assumptions and modelling assumptions. Here, we have described estimation using propensity scores.
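A minimal sketch of subgroup-specific weighting in R (simulated data; all names are hypothetical): propensity scores are estimated separately within each level of G, and a weighted outcome model with a treatment-by-subgroup interaction recovers \hat{\delta}_g, \hat{\delta}_{g'}, and their difference \hat{\gamma}.

# subgroup-specific propensity scores and a weighted interaction model (simulated data)
set.seed(7)
n <- 4000
G <- rbinom(n, 1, 0.5)                                  # subgroup indicator
L <- rnorm(n)                                           # measured confounder
A <- rbinom(n, 1, plogis(-0.3 + 0.7 * L + 0.3 * G))     # exposure depends on L and G
Y <- 1 + 0.5 * A + 0.4 * L + 0.3 * G + 0.4 * A * G + rnorm(n)
dat <- data.frame(Y, A, L, G)

# estimate propensity scores separately within each subgroup
dat$e_hat <- NA
for (g in 0:1) {
  idx <- dat$G == g
  dat$e_hat[idx] <- glm(A ~ L, family = binomial, data = dat[idx, ])$fitted.values
}
dat$v <- with(dat, A / e_hat + (1 - A) / (1 - e_hat))

# weighted outcome model with a treatment-by-subgroup interaction
fit <- lm(Y ~ A * G, data = dat, weights = v)
coef(fit)["A"]      # delta_g: effect of A in subgroup G = 0
coef(fit)["A:G"]    # difference in treatment effects between subgroups (gamma, up to the sign convention)
# delta_g' = coef(fit)["A"] + coef(fit)["A:G"]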

Doubly Robust Estimation

We can combine regression-based estimation with propensity score estimation to obtain doubly robust estimation. I will walk you through the steps in the lab. The TL;DR is this: doubly robust estimation reduces reliance on any single model being correctly specified. If either the propensity score model or the outcome regression model is correctly specified, the estimator will be unbiased – provided the other causal inference assumptions are met.
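A minimal sketch of the idea (again on simulated data, continuing the sketch above; this is not the margot implementation used in the lab): fit an IPTW-weighted outcome regression that also includes the covariates, then standardise (g-computation) over the sample to obtain a marginal ATE estimate that is doubly robust.

# doubly robust sketch: weighted outcome regression + standardisation (simulated data)
fit_dr <- glm(Y ~ A * (L + G), data = dat, weights = v, family = gaussian())
pred1  <- predict(fit_dr, newdata = transform(dat, A = 1))   # everyone set to exposed
pred0  <- predict(fit_dr, newdata = transform(dat, A = 0))   # everyone set to unexposed
mean(pred1) - mean(pred0)                                    # doubly robust ATE estimate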

Because we cannot know for certain whether these assumptions are met, we will need to perform a sensitivity analysis, the topic of next week.

I’ll show you in lab how to employ simulation-based inference methods to compute standard errors and confidence intervals, following the approaches suggested by Greifer (2023).

Readings:

Noah Greifer’s software and blogs: https://ngreifer.github.io/blog/subgroup-analysis-psm/

Lab

Today we estimate causal effects

# uncomment, update the margot package
# devtools::install_github("go-bayes/margot")

# load packages
library(tidyverse)
library(margot)
library(skimr)
library(naniar)
library(WeightIt)
library(clarify)
library(MatchThem)
library(cobalt)
library(MatchIt)
library(kableExtra)
library(lmtp)
library(SuperLearner)
library(ranger)
library(xgboost)
library(nnls)
library(mice)

if (!requireNamespace("doParallel", quietly = TRUE)) {
  install.packages("doParallel")
}
if (!requireNamespace("foreach", quietly = TRUE)) {
  install.packages("foreach")
}
library(doParallel)
library(foreach)
# create a folder named 'saved' (you should have done this last week).
# set your path to this folder

# save in cloud if using github
push_mods <- here::here('/Users/joseph/Library/CloudStorage/Dropbox-v-project/data/saved')

# check it is correct
# uncomment, check
#push_mods

# set seed for reproducibility
set.seed(123)


# eliminate haven labels
df_nz <- as.data.frame(df_nz)
df_nz <- haven::zap_formats(df_nz)
df_nz <- haven::zap_label(df_nz)
df_nz <- haven::zap_widths(df_nz)

# read functions
# source("/Users/joseph/GIT/templates/functions/funs.R")

# experimental functions (more functions)
# source(
#   "https://raw.githubusercontent.com/go-bayes/templates/main/functions/experimental_funs.R"
# )
# # set exposure name

name_exposure <-  "perfectionism"

# check missing values
skimr::skim(df_nz) |> arrange(n_missing)
Name df_nz
Number of rows 60000
Number of columns 130
_______________________
Column type frequency:
factor 6
numeric 124
________________________
Group variables None
Data summary

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
id 0 1.00 FALSE 20000 1: 3, 2: 3, 3: 3, 4: 3
wave 0 1.00 FALSE 3 201: 20000, 201: 20000, 202: 20000
gen_cohort 0 1.00 TRUE 5 gen: 24828, gen: 19965, gen: 11664, gen: 2280
eth_cat 786 0.99 FALSE 4 eur: 47556, mao: 6843, asi: 3321, pac: 1494
religion_bigger_denominations 12562 0.79 FALSE 11 not: 30926, chr: 5557, cat: 3632, ang: 2274
alert_level_combined_lead 20784 0.65 FALSE 6 no_: 35035, ear: 1618, ale: 1274, ale: 670

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
male 0 1.00 0.37 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
sample_frame 0 1.00 8.26 2.86 1.10 6.90 10.10 10.10 10.90 ▁▁▁▂▇
sample_frame_opt_in 0 1.00 0.03 0.16 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
year_measured 0 1.00 0.79 0.42 -1.00 1.00 1.00 1.00 1.00 ▁▁▂▁▇
rural_gch_2018_l 578 0.99 1.66 0.98 1.00 1.00 1.00 2.00 5.00 ▇▂▂▁▁
regc_2022_l 593 0.99 7.16 5.07 1.00 2.00 7.00 13.00 99.00 ▇▁▁▁▁
nz_dep2018 596 0.99 4.79 2.74 1.00 2.04 4.95 7.00 10.00 ▇▇▆▅▃
born_nz 702 0.99 0.78 0.41 0.00 1.00 1.00 1.00 1.00 ▂▁▁▁▇
edu 1523 0.97 5.42 2.69 0.00 3.00 7.00 7.00 10.00 ▃▃▃▇▂
nzsei_13_l 4500 0.92 54.57 16.64 10.00 42.00 56.00 69.00 90.00 ▁▃▅▇▁
age 12229 0.80 50.96 13.76 18.12 41.19 53.35 61.56 97.90 ▂▅▇▂▁
sample_weights 12229 0.80 1.00 1.27 0.30 0.48 0.65 0.93 19.96 ▇▁▁▁▁
sfhealth 12242 0.80 5.04 1.17 1.00 4.34 5.04 5.96 7.00 ▁▂▃▇▅
perfectionism 12247 0.80 3.12 1.37 1.00 2.01 3.00 4.03 7.00 ▇▇▅▃▁
gratitude 12248 0.80 5.90 0.88 1.00 5.34 6.01 6.65 7.00 ▁▁▁▅▇
forgiveness 12255 0.80 5.08 1.27 1.00 4.31 5.32 6.02 7.00 ▁▂▃▇▇
vengeful_rumin 12255 0.80 2.92 1.27 1.00 1.98 2.68 3.68 7.00 ▇▇▃▂▁
lifemeaning 12261 0.80 5.45 1.18 1.00 4.96 5.53 6.46 7.00 ▁▁▃▅▇
support 12268 0.80 5.94 1.12 1.00 5.35 6.29 6.96 7.00 ▁▁▁▃▇
nwi 12306 0.79 5.30 1.67 0.00 4.30 5.33 6.38 10.00 ▁▃▇▆▁
sfhealth_get_sick_easier_reversed 12344 0.79 5.67 1.62 1.00 4.99 6.02 6.99 7.00 ▁▁▁▁▇
meaning_sense 12380 0.79 5.72 1.20 1.00 5.01 5.99 6.96 7.00 ▁▁▁▂▇
parent 12398 0.79 0.73 0.45 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
self_control_wish_more_reversed 12398 0.79 3.73 1.82 1.00 2.03 3.04 5.03 7.00 ▇▆▅▃▆
pwb_your_future_security 12403 0.79 6.32 2.35 0.00 4.98 6.98 8.01 10.00 ▂▃▆▇▆
emotion_regulation_hide_neg_emotions 12408 0.79 4.19 1.66 1.00 2.99 4.04 5.05 7.00 ▆▅▆▇▇
pwb_your_health 12442 0.79 6.68 2.33 0.00 5.02 7.02 8.04 10.00 ▁▂▅▇▇
pwb_standard_living 12457 0.79 7.68 2.00 0.00 6.97 8.00 9.02 10.00 ▁▁▂▅▇
pwb_your_relationships 12460 0.79 7.71 2.26 0.00 6.97 8.03 9.03 10.00 ▁▁▂▃▇
neighbourhood_community 12461 0.79 4.27 1.63 1.00 3.00 4.04 5.95 7.00 ▆▅▆▇▇
power_no_control_composite 12469 0.79 2.92 1.41 1.00 1.96 2.55 3.99 7.00 ▇▅▆▂▁
power_no_control_composite_reversed 12469 0.79 5.08 1.41 1.00 4.01 5.45 6.04 7.00 ▁▂▆▅▇
sfhealth_expect_worse_health_reversed 12478 0.79 4.12 1.73 1.00 2.98 4.00 5.96 7.00 ▆▆▆▃▇
employed 12558 0.79 0.79 0.41 0.00 1.00 1.00 1.00 1.00 ▂▁▁▁▇
religion_religious 12561 0.79 0.35 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
conscientiousness 12616 0.79 5.12 1.05 1.00 4.47 5.24 5.96 7.00 ▁▁▅▇▅
extraversion 12618 0.79 3.86 1.19 1.00 3.01 3.80 4.72 7.00 ▂▆▇▅▁
openness 12622 0.79 4.96 1.11 1.00 4.22 5.00 5.77 7.00 ▁▂▆▇▅
agreeableness 12626 0.79 5.36 0.99 1.00 4.74 5.48 6.03 7.00 ▁▁▃▇▆
belong 12626 0.79 5.11 1.08 1.00 4.36 5.30 5.98 7.00 ▁▁▃▇▅
honesty_humility 12627 0.79 5.50 1.16 1.00 4.75 5.72 6.47 7.00 ▁▁▃▆▇
neuroticism 12628 0.79 3.48 1.15 1.00 2.71 3.47 4.25 7.00 ▃▇▇▃▁
self_esteem 12633 0.79 5.14 1.29 1.00 4.32 5.34 6.04 7.00 ▁▂▃▇▇
power_self_nocontrol 12642 0.79 2.96 1.64 1.00 1.96 2.95 4.02 7.00 ▇▂▂▂▂
kessler6_sum 12650 0.79 5.31 4.11 0.00 2.03 4.04 7.04 24.00 ▇▆▂▁▁
kessler_latent_depression 12654 0.79 0.58 0.74 0.00 0.01 0.32 0.98 4.00 ▇▂▁▁▁
kessler_latent_anxiety 12656 0.79 1.20 0.77 0.00 0.66 1.04 1.68 4.00 ▇▇▆▁▁
lifesat 12666 0.79 5.29 1.21 1.00 4.53 5.50 6.03 7.00 ▁▁▃▆▇
support_turnto 12688 0.79 5.88 1.28 1.00 5.03 6.02 6.99 7.00 ▁▁▁▂▇
emotion_regulation_out_control 12701 0.79 2.81 1.66 1.00 1.04 2.03 4.01 7.00 ▇▂▂▂▁
modesty 12709 0.79 6.04 0.92 1.00 5.50 6.24 6.79 7.00 ▁▁▁▃▇
pers_honestyhumility_seen_expensivecar 12737 0.79 5.66 1.56 1.00 4.97 6.02 6.99 7.00 ▁▁▁▂▇
belong_beliefs 12741 0.79 4.77 1.28 1.00 3.99 4.99 5.98 7.00 ▂▂▆▇▇
hlth_weight 12742 0.79 79.29 18.55 35.03 65.04 77.00 89.99 209.99 ▅▇▁▁▁
pers_honestyhumility_deserve_more 12743 0.79 5.30 1.53 1.00 4.02 5.97 6.96 7.00 ▁▁▂▂▇
hlth_fatigue 12756 0.79 1.64 1.07 0.00 0.98 1.96 2.04 4.00 ▃▇▇▃▁
kessler_depressed 12765 0.79 0.47 0.78 0.00 0.00 0.02 0.98 4.00 ▇▂▁▁▁
belong_routside_reversed 12767 0.79 5.04 1.74 1.00 3.97 5.95 6.96 7.00 ▂▂▂▂▇
pers_agreeable_sympathise_others 12767 0.79 5.64 1.16 1.00 5.00 5.98 6.05 7.00 ▁▁▁▃▇
pers_honestyhumility_pleasure_expensivegoods 12767 0.79 5.32 1.66 1.00 4.01 5.98 6.97 7.00 ▁▂▂▂▇
bodysat 12768 0.79 4.23 1.70 1.00 2.99 4.04 5.97 7.00 ▅▅▅▆▇
rumination 12768 0.79 0.83 0.99 0.00 0.00 0.95 1.04 4.00 ▇▅▂▁▁
kessler_restless 12779 0.79 1.22 0.94 0.00 0.95 1.02 1.99 4.00 ▅▇▆▂▁
kessler_nervous 12785 0.79 1.16 0.94 0.00 0.04 1.01 1.98 4.00 ▆▇▅▂▁
kessler_worthless 12788 0.79 0.50 0.84 0.00 0.00 0.02 0.99 4.00 ▇▂▁▁▁
selfesteem_failure_reversed 12791 0.79 5.23 1.63 1.00 4.01 5.97 6.96 7.00 ▁▂▂▂▇
kessler_effort 12794 0.79 1.23 0.98 0.00 0.05 1.01 1.99 4.00 ▅▇▅▂▁
hlth_height 12795 0.79 1.70 0.10 1.26 1.63 1.69 1.77 2.10 ▁▂▇▃▁
kessler_hopeless 12797 0.79 0.78 0.88 0.00 0.00 0.96 1.04 4.00 ▇▅▃▁▁
selfesteem_satself 12797 0.79 5.13 1.45 1.00 4.03 5.95 6.02 7.00 ▁▁▂▃▇
pers_conscientious_chores_done 12808 0.79 4.67 1.50 1.00 3.96 4.98 5.99 7.00 ▂▃▅▆▇
pers_conscientious_forget_putback 12812 0.79 5.15 1.65 1.00 4.00 5.96 6.04 7.00 ▂▂▂▂▇
belong_accept 12814 0.79 5.52 1.27 1.00 4.98 5.98 6.04 7.00 ▁▁▁▃▇
pers_agreeable_no_interest_others 12819 0.79 5.39 1.36 1.00 4.96 5.97 6.04 7.00 ▁▁▂▃▇
pers_extraversion_talk_peopleparties 12820 0.79 3.81 1.72 1.00 2.04 3.98 5.02 7.00 ▇▅▆▅▆
pers_extraversion_life_party 12828 0.79 3.42 1.44 1.00 2.03 3.05 4.04 7.00 ▇▆▇▅▂
selfesteem_postiveself 12830 0.79 5.06 1.41 1.00 4.02 5.03 6.01 7.00 ▁▂▃▅▇
pers_neuroticism_upset_easily 12843 0.79 3.47 1.52 1.00 2.02 3.03 4.96 7.00 ▇▅▅▃▂
pers_neuroticism_mostly_relaxed 12844 0.79 3.43 1.42 1.00 2.03 3.03 4.04 7.00 ▇▇▆▃▂
pers_neuroticism_mood_swings 12847 0.79 3.15 1.59 1.00 1.98 2.99 4.03 7.00 ▇▃▃▂▂
pers_neuroticism_seldom_blue 12851 0.79 3.86 1.66 1.00 2.04 3.99 5.02 7.00 ▇▅▆▆▆
pers_conscientious_like_order 12853 0.79 5.41 1.29 1.00 4.96 5.96 6.04 7.00 ▁▁▂▃▇
pers_extraversion_keepbackground 12854 0.79 3.92 1.49 1.00 2.98 3.99 5.00 7.00 ▆▆▇▆▅
pers_agreeable_feel_others_emotions 12859 0.79 5.32 1.31 1.00 4.96 5.96 6.03 7.00 ▁▁▂▃▇
pers_agreeable_no_interest_others_probs 12870 0.79 5.07 1.48 1.00 4.01 5.04 6.02 7.00 ▁▂▂▃▇
pers_honestyhumility_feel_entitled 12873 0.79 5.69 1.32 1.00 4.99 6.00 6.97 7.00 ▁▁▁▂▇
pers_openness_vivid_imagination 12882 0.79 4.67 1.52 1.00 3.96 4.98 5.99 7.00 ▂▃▅▆▇
pers_extraversion_dont_talkalot 12894 0.79 4.31 1.59 1.00 3.01 4.03 5.96 7.00 ▅▅▇▆▇
pers_conscientious_make_mess 12897 0.79 5.26 1.37 1.00 4.03 5.96 6.03 7.00 ▁▁▂▃▇
pers_openness_good_imagination 12904 0.78 5.22 1.55 1.00 4.02 5.96 6.04 7.00 ▁▂▂▃▇
lifesat_satlife 12917 0.78 5.67 1.23 1.00 5.00 5.99 6.95 7.00 ▁▁▁▂▇
pers_openness_difficult_abstraction 12918 0.78 4.95 1.54 1.00 3.98 5.02 6.02 7.00 ▂▂▃▃▇
pers_openness_interested_abstraction 13013 0.78 5.00 1.50 1.00 3.99 5.02 6.02 7.00 ▁▂▃▃▇
hlth_bmi 13016 0.78 27.41 5.93 12.35 23.39 26.26 30.18 75.56 ▆▇▁▁▁
smoker 13077 0.78 0.06 0.24 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
self_control_have_lots 13107 0.78 5.04 1.43 1.00 4.02 5.03 6.01 7.00 ▁▂▂▅▇
religion_identification_level 13120 0.78 2.33 2.17 1.00 1.00 1.00 4.00 7.00 ▇▁▁▁▂
meaning_purpose 13302 0.78 5.18 1.43 1.00 4.04 5.04 6.03 7.00 ▁▁▂▅▇
support_help 13303 0.78 6.07 1.16 1.00 5.96 6.04 7.00 7.00 ▁▁▁▁▇
partner 13333 0.78 0.76 0.43 0.00 1.00 1.00 1.00 1.00 ▂▁▁▁▇
permeability_individual 13371 0.78 5.20 1.24 1.00 4.05 5.03 6.01 7.00 ▁▁▃▆▇
support_noguidance_reversed 13373 0.78 5.88 1.47 1.00 5.04 6.04 7.00 7.00 ▁▁▁▁▇
impermeability_group 13408 0.78 3.67 1.58 1.00 2.04 3.97 4.99 7.00 ▇▆▇▆▃
alcohol_frequency 13462 0.78 2.17 1.35 0.00 1.00 2.02 3.03 5.00 ▇▇▇▇▃
hours_charity 13563 0.77 1.42 4.18 0.00 0.00 0.02 0.99 168.00 ▇▁▁▁▁
hours_exercise 13565 0.77 6.02 7.80 0.00 2.00 4.03 7.02 168.00 ▇▁▁▁▁
religion_believe_god 13809 0.77 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
religion_believe_spirit 13809 0.77 0.67 0.47 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
sfhealth_your_health 13942 0.77 5.33 1.26 1.00 4.96 5.96 6.02 7.00 ▁▁▂▃▇
power_others_control 14051 0.77 2.87 1.61 1.00 1.95 2.05 4.01 7.00 ▇▂▂▂▁
emotion_regulation_change_thinking_to_calm 14177 0.76 4.81 1.45 1.00 3.99 5.00 5.99 7.00 ▂▂▅▇▇
lifesat_ideal 14186 0.76 4.89 1.43 1.00 4.00 5.01 6.00 7.00 ▂▂▃▆▇
religion_perceive_religious_discrim 14198 0.76 1.96 1.40 1.00 1.00 1.04 2.04 7.00 ▇▁▁▁▁
hlth_sleep_hours 14277 0.76 6.94 1.12 2.00 6.02 7.00 7.97 20.00 ▁▇▁▁▁
charity_donate 14501 0.76 1013.80 6682.04 0.00 20.03 149.97 500.02 650000.00 ▇▁▁▁▁
alcohol_intensity 14605 0.76 2.08 2.08 0.00 0.98 1.97 2.96 36.00 ▇▁▁▁▁
political_conservative 14644 0.76 3.57 1.37 1.00 2.05 3.96 4.04 7.00 ▆▅▇▃▂
household_inc 14744 0.75 117927.50 97598.66 1.00 59999.96 99999.98 150000.02 3000000.00 ▇▁▁▁▁
pol_wing 15139 0.75 3.71 1.30 1.00 2.97 3.98 4.05 7.00 ▅▅▇▃▂
sexual_satisfaction 15680 0.74 4.51 1.77 1.00 3.04 4.97 6.00 7.00 ▃▂▅▅▇
emp_job_sat 23630 0.61 5.31 1.45 1.00 4.96 5.96 6.04 7.00 ▁▁▂▃▇
emp_job_secure 23677 0.61 5.49 1.55 1.00 4.97 5.99 6.97 7.00 ▁▁▁▂▇
emp_job_valued 23912 0.60 5.18 1.65 1.00 4.02 5.96 6.05 7.00 ▂▁▂▃▇
home_owner 27754 0.54 0.76 0.43 0.00 1.00 1.00 1.00 1.00 ▂▁▁▁▇
# obtain ids for individuals who participated in 2018 and have no missing baseline exposure
ids_2018 <- df_nz |>
  dplyr::filter(year_measured == 1, wave == 2018) |>
  dplyr::filter(!is.na(!!sym(name_exposure))) |> # criteria, no missing
  dplyr::filter(!is.na(eth_cat)) |> # criteria, no missing
  pull(id)


# obtain ids for individuals who participated in 2019
ids_2019 <- df_nz |>
  dplyr::filter(year_measured == 1, wave == 2019) |>
  dplyr::filter(!is.na(!!sym(name_exposure))) |> # criteria, no missing
  pull(id)

# intersect IDs from 2018 and 2019 to ensure participation in both years
ids_2018_2019 <- intersect(ids_2018, ids_2019)

# data wrangling
dat_long <- df_nz |>
  dplyr::filter(id %in% ids_2018_2019 &
                  wave %in% c(2018, 2019, 2020)) |>
  arrange(id, wave) |>
  select(
    "id",
    "wave",
    "year_measured",
    "age",
    "male",
    "born_nz",
    "eth_cat",
    #factor(EthCat, labels = c("Euro", "Maori", "Pacific", "Asian")),
    "employed",
    # are you currently employed? (this includes self-employment or casual work)
    "edu",
    # "gen_cohort",
    "household_inc",
    "partner",
    # 0 = no, 1 = yes
    "parent",
    #"alert_level_combined_lead", # see bibliography
    # 0 = no, 1 = yes
    "political_conservative", # see nzavs sheet
    "hours_exercise", # see nzavs sheet
    "agreeableness", 
    # Mini-IPIP6 Agreeableness (also modelled as empathy facet)
    # Sympathize with others' feelings.
    # Am not interested in other people's problems.
    # Feel others' emotions.
    # Am not really interested in others.
    "conscientiousness",
    # see mini ipip6
    # Get chores done right away.
    # Like order.
    # Make a mess of things.
    # Often forget to put things back in their proper place.
    "extraversion",
    # Mini-IPIP6 Extraversion
    # Am the life of the party.
    # Don't talk a lot.
    # Keep in the background.
    # Talk to a lot of different people at parties.
    "honesty_humility",
    # see mini ipip6
    # Would like to be seen driving around in a very expensive car.
    # Would get a lot of pleasure from owning expensive luxury goods.
    # Feel entitled to more of everything.
    # Deserve more things in life.
    "openness",
    # see mini ipip6
    # Have a vivid imagination.
    # Have difficulty understanding abstract ideas.
    # Do not have a good imagination.
    # Am not interested in abstract ideas.
    "neuroticism",
    # see mini ipip6
    # Have frequent mood swings.
    # Am relaxed most of the time.
    # Get upset easily.
    # Seldom feel blue.
    "modesty",
    # # see mini ipip6
    # # I want people to know that I am an important person of high status,
    # # I am an ordinary person who is no better than others.
    # # I wouldn’t want people to treat me as though I were superior to them.
    # # I think that I am entitled to more respect than the average person is
    #"w_gend_age_ethnic",
    "neighbourhood_community",
    # #I feel a sense of community with others in my local neighbourhood.
    "belong", # see nzavs sheet
    "rural_gch_2018_l",# see nzavs sheet
    "support",
    # "support_help",
    # # 'There are people I can depend on to help me if I really need it.
    # "support_turnto",
    # # There is no one I can turn to for guidance in times of stress.
    # "support_rnoguidance",
    #There is no one I can turn to for guidance in times of stress.
    "perfectionism",
    "religion_religious",
    "kessler_latent_depression",
    "kessler_latent_anxiety"
  ) |>
  mutate(
    #initialize 'censored'
    censored = ifelse(lead(year_measured) == 1, 1, 0),
    
    # modify 'censored' based on the condition; no need to check for NA here as 'censored' is already defined in the previous step
    censored =  ifelse(is.na(censored) &
                         year_measured == 1, 1, censored),
    # create urban binary variable
    
    urban = ifelse(rural_gch_2018_l == 1, 1, 0)
    
  ) |>
  select(-c(year_measured, rural_gch_2018_l) )|>
  dplyr::mutate(
    # rescale these variables, to get all variables on a similar scale
    # otherwise your models can blow up, or become uninterpretable. 
    household_inc_log = log(household_inc + 1),
    hours_exercise_log = log(hours_exercise + 1)  ) |>
  dplyr::select(
    -c(
      household_inc,
      hours_exercise)
  ) |>
  droplevels() |>
  # dplyr::rename(sample_weights = w_gend_age_ethnic,
  #               sample_origin =  sample_origin_names_combined) |>
  arrange(id, wave) |>
  mutate(
    urban = as.numeric(as.character(urban)),
    #   parent = as.numeric(as.character(parent)),
    partner = as.numeric(as.character(partner)),
    born_nz = as.numeric(as.character(born_nz)),
    censored = as.numeric(as.character(censored)),
    employed = as.numeric(as.character(employed))
  ) |>
  droplevels() |>
  arrange(id, wave) |>
  data.frame()


# inspect data
glimpse(dat_long)
Rows: 42,990
Columns: 29
$ id                        <fct> 1, 1, 1, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6,…
$ wave                      <fct> 2018, 2019, 2020, 2018, 2019, 2020, 2018, 20…
$ age                       <dbl> 40.31341, 41.33449, 42.24450, 47.14449, 47.9…
$ male                      <dbl> 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ born_nz                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ eth_cat                   <fct> euro, euro, euro, maori, maori, maori, euro,…
$ employed                  <dbl> 1, 1, 1, 0, 1, NA, 0, 0, 0, 1, 1, 0, 0, 1, 1…
$ edu                       <dbl> 8, 8, 8, 3, 3, 3, 6, 6, 6, 6, 6, 6, 7, 7, 7,…
$ partner                   <dbl> 1, 1, 1, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ parent                    <dbl> 1, 1, 1, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ political_conservative    <dbl> 1.000000, 1.022343, 1.042282, 3.021098, 3.02…
$ agreeableness             <dbl> 6.235760, 6.216993, 6.539885, 4.997000, 4.54…
$ conscientiousness         <dbl> 5.781907, 6.004211, 4.719577, 5.281236, 4.73…
$ extraversion              <dbl> 4.740896, 4.737648, 4.724690, 2.752063, 3.23…
$ honesty_humility          <dbl> 5.954770, 3.970874, 4.988354, 2.249136, 2.75…
$ openness                  <dbl> 6.502722, 6.469405, 6.237941, 7.000000, 6.52…
$ neuroticism               <dbl> 2.543777, 3.049655, 2.960180, 2.455489, 2.24…
$ modesty                   <dbl> 6.786189, 6.002267, 5.702049, 5.239746, 4.74…
$ neighbourhood_community   <dbl> 6.951882, 6.005449, 5.986452, 4.994146, 3.97…
$ belong                    <dbl> 4.958810, 6.039168, 5.382137, 6.332741, 5.68…
$ support                   <dbl> 7.000000, 6.959343, 6.967418, 6.983696, 7.00…
$ perfectionism             <dbl> 4.303349, 3.670650, 3.337095, 2.660515, 3.33…
$ religion_religious        <int> 0, 0, 0, 0, 0, NA, 1, 1, 1, 0, 0, 0, 0, 0, 0…
$ kessler_latent_depression <dbl> 0.037011885, 0.000000000, 0.006337577, 0.000…
$ kessler_latent_anxiety    <dbl> 0.96853714, 0.98302076, 1.71493734, 1.029461…
$ censored                  <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ urban                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ household_inc_log         <dbl> 12.206078, 12.206078, 11.849405, 11.532738, …
$ hours_exercise_log        <dbl> 0.41844130, 0.01955105, 0.00000000, 3.219609…
# more thorough view
#skimr::skim(dat_long)

Next, let’s ensure that we obtain a 50/50 representation of gender in our target population by constructing sample weights.

# balance on gender weights
# calculate gender weights assuming male is coded as 1 and female as 0
prop_male_population <-
  0.5  # target proportion of males in the population
prop_female_population <-
  0.5  # target proportion of females in the population

prop_male_sample <- mean(dat_long$male)
prop_female_sample <- 1 - prop_male_sample

gender_weight_male <- prop_male_population / prop_male_sample
gender_weight_female <- prop_female_population / prop_female_sample

dat_long$sample_weights <-
  ifelse(dat_long$male == 1, gender_weight_male, gender_weight_female)

# we will upweight males and down weight non-males to obtain a balance of gender in the *target* population
table(dat_long$sample_weights)

0.783745351126667  1.38107170393216 
            27426             15564 

Next let’s get our data into shape. Today we’re going to consider a ‘treatment’ of perfectionism in which we shift people by different quantiles of perfectionism:

dat_long <-
  margot::create_ordered_variable(dat_long, "perfectionism", n_divisions = 4) #

=== Creating Ordered Variable ===
Note: ties.method set to last based on cutpoint_inclusive.

Variable Summary:
  Min value: 1 
  Max value: 7 
  Number of NA values: 2795 

Using quantile breaks:
  Number of divisions: 4 
  Breaks:


|    Break|
|--------:|
| 1.000000|
| 1.997943|
| 2.986456|
| 4.016010|
| 7.000000|

New Variable Summary:
  New column name: perfectionism_cat 
  Category distribution:


|Category  | Count|
|:---------|-----:|
|[1.0,2.0] | 10049|
|(2.0,3.0] | 10049|
|(3.0,4.0] | 10049|
|(4.0,7.0] | 10048|
|NA        |  2795|

Ordered variable created successfully 👍 
================================================================================
# view scale
# uncomment to see break points
# print(quantile(dat_long$perfectionism, probs = seq(0, 1, 1/4), na.rm = TRUE))

# n by groups
table(dat_long$perfectionism_cat)

[1.0,2.0] (2.0,3.0] (3.0,4.0] (4.0,7.0] 
    10049     10049     10049     10048 
# obtain labels if useful later
labels <- levels(dat_long$perfectionism_cat)

Consider where the quantiles break

dt_19 <- dat_long |> filter(wave == 2019)

margot::margot_plot_categorical(col_name = "perfectionism",
                                    # custom_breaks = c(1,1.9,3.9,7),
                                    n_divisions = 4,
                                     cutpoint_inclusive = "upper",
                                     binwidth = .5,
                                     dt_19)

=== Creating Ordered Variable ===
Note: ties.method set to last based on cutpoint_inclusive.

Variable Summary:
  Min value: 1 
  Max value: 7 
  Number of NA values: 0 

Using quantile breaks:
  Number of divisions: 4 
  Breaks:


|    Break|
|--------:|
| 1.000000|
| 1.996297|
| 2.985367|
| 4.017218|
| 7.000000|

New Variable Summary:
  New column name: perfectionism_cat 
  Category distribution:


|Category  | Count|
|:---------|-----:|
|[1.0,2.0] |  3583|
|(2.0,3.0] |  3582|
|(3.0,4.0] |  3583|
|(4.0,7.0] |  3582|

Ordered variable created successfully 👍 
================================================================================

Consider mean

dt_19 <- dat_long |> filter(wave == 2019)

margot::coloured_histogram_sd(col_name = "perfectionism", binwidth = .5, dt_19)

margot::margot_plot_shift(col_name = "perfectionism", binwidth = .1, 
                                 shift = "down",
                                 range_highlight = c(1,2),
                                 dt_19)

Set variables

Next, let’s set our baseline variables:

baseline_vars = c("age", "male", "edu", "eth_cat", "partner", "employed", "born_nz", "neighbourhood_community", "household_inc_log", "parent", "religion_religious", "urban", "employed", "sample_weights")

# treatment
exposure_vars = c("perfectionism_cat") 

# outcome, can be many
outcome_vars = c("kessler_latent_anxiety", "kessler_latent_depression")

Multiply impute missing values

In our first analysis, we will not merely impute baseline values. Rather, we will impute baseline and outcome values within quantiles of the exposure. Consider: why should we impute within the treatments to be compared? What are the lingering dangers of this approach?

# make long data wide

# remove sample weights, return them later
prep_dat_wide <- margot_wide(dat_long, 
                             baseline_vars = baseline_vars, 
                             exposure_var = exposure_vars,
                             outcome_vars = outcome_vars)


# filter data, so that we impute within quartiles
list_filtered_df <-
  margot::margot_filter(prep_dat_wide, exposure_vars = "t1_perfectionism_cat", sort_var = "id")

# check that these add up to the total data set
a <- nrow( list_filtered_df$tile_1)
b <- nrow( list_filtered_df$tile_2)
c <- nrow( list_filtered_df$tile_3)
d <-nrow( list_filtered_df$tile_4)

# must sum to equal
a + b + c + d == nrow(prep_dat_wide)
logical(0)

Visualise missing values

# visually inspect missingness
naniar::vis_miss(prep_dat_wide, warn_large_data = FALSE)

# check for collinear vars
mice:::find.collinear(prep_dat_wide) # note sample weights is collinear
[1] "t0_sample_weights"

Imputations

Next we impute by quartile. Save this output in your folder.

# impute by quartile
mice_health <- margot::impute_and_combine(list_filtered_df,  m = 5 )


margot::here_save(mice_health, "mice_health")

Wrangling

# read data if you are not running the imputation again
mice_health <- here_read("mice_health")

# post-imputation arrange
mice_health_mids <- mice_health |>
  arrange(.imp, id) |>
  rename(sample_weights = t0_sample_weights) |>
dplyr::mutate(
  across(
    where(is.numeric) &
      !sample_weights,
    ~ scale(.x),
    .names = "{col}_z"
  )
) |>
  select(-.imp_z, -.id_z) |>
  select(where(is.factor),
         sample_weights,
         ends_with("_z"),
         .imp,
         .id) |>
  relocate(sample_weights, .before = starts_with("t1_"))  |>
  relocate(id, .before = sample_weights)  |>
  relocate(starts_with("t0_"), .before = starts_with("t1_"))  |>
  relocate(starts_with("t2_"), .after = starts_with("t1_"))  |>
  arrange(.imp, id) |>
  droplevels() |>
  mutate_if(is.matrix, as.vector) |>
  as.mids()

# make long
mice_health_long <- mice::complete(mice_health_mids, "long", inc = TRUE)


#Save
here_save(mice_health_mids, "mice_health_mids")
here_save(mice_health_long, "mice_health_long")

Compute propensity scores for IPTW using machine learning

# this is how you read saved files without the margot package
mice_health_mids <- readRDS(here::here(push_mods, "mice_health_mids"))
mice_health_long <- readRDS(here::here(push_mods, "mice_health_long"))

# propensity scores --------------------------------------------------------


# set exposure
X = "t1_perfectionism_cat"

# set estimand
estimand = "ATE"

# set baseline variables
baseline_vars_models = mice_health_long |>
  dplyr::select(starts_with("t0")) |> colnames() # note: we earlier renamed `t0_sample_weights` to `sample_weights`

baseline_vars_models
# obtain propensity scores
match_ebal_ate <- margot::match_mi_general(data = mice_health_mids, 
                                      X = X, 
                                      baseline_vars = baseline_vars_models, 
                                      estimand = "ATE",  
                                   #   focal = "tile_3", #for ATT
                                      method = "ebal", 
                                      sample_weights = "sample_weights")

# save output
margot::here_save(match_ebal_ate, "match_ebal_ate")
# read output
match_ebal_ate <- margot::here_read("match_ebal_ate")


# check balance
#bal.tab(match_ebal_ate)

# visualise imbalance
love.plot(match_ebal_ate, binary = "std", thresholds = c(m = .1),
          wrap = 50, position = "bottom", size =3)

Visualise imbalance

# consider results 

sum_ebal_match_ebal_ate <- summary(match_ebal_ate)

# summarise
sum_ebal_match_ebal_ate

# visualise
plot(sum_ebal_match_ebal_ate)

There are some extreme weights. We can trim the weights as follows. (Note there is no hard-and-fast rule about how much to trim weights by; here we trim at the 99th percentile.)

# trimmed weights
match_ebal_ate_trim <- WeightIt::trim(match_ebal_ate, at = .99)

# check balance
# bal.tab(match_ebal_ate_trim)

# summary
summary_match_ebal_ate_trim <- summary(match_ebal_ate_trim)

# check - extreme weights gone
plot(summary_match_ebal_ate_trim)

# plot for balance
love.plot(match_ebal_ate_trim, binary = "std", thresholds = c(m = .1),
          wrap = 50, position = "bottom", size =2, limits = list(m = c(-.5, .5)))

Estimation

# here_save(match_ebal_ate_trim, "match_ebal_ate_trim")
# match_ebal_ate_trim <- here_read( "match_ebal_ate_trim")

# set data frame; output of match_mi_general model
df_ate = match_ebal_ate_trim
str(df_ate)
str(match_ebal_ate_trim)
# remind self of levels if needed
# levels(mice_health_long$t1_perfectionism_4tile)

# set treatment level
treat_0 = "[1.0,2.0]" # lowest quartile
treat_1 = "(4.0,7.0]" # third quartile

# bootstrap simulations ( generally use 1000)
nsims <- 1000

# cores
cl =  parallel::detectCores () 

estimand = "ATE"

# as specified
vcov = "HC2" # robust standard errors. 

# cores
cores = parallel::detectCores() # use all cores

# Example call to the function
# propensity score only model

# code review: these are the functions 'under the hood'

# this function is defined outside of both main functions
# build_formula_str <- function(Y, X, continuous_X, splines, baseline_vars) {
#   baseline_part <- if (!(length(baseline_vars) == 1 && baseline_vars == "1")) {
#     paste(baseline_vars, collapse = "+")
#   } else {
#     "1"
#   }
# 
#   if (continuous_X && splines) {
#     paste(Y, "~ bs(", X, ")", "*", "(", baseline_part, ")")
#   } else {
#     paste(Y, "~", X, "*", "(", baseline_part, ")")
#   }
# }
# 
# causal_contrast_marginal <- function(df, Y, X, baseline_vars = "1", treat_0, treat_1,
#                                      estimand = c("ATE", "ATT"), type = c("RR", "RD"),
#                                      nsims = 200, cores = parallel::detectCores(), family = "gaussian",
#                                      weights = NULL, continuous_X = FALSE, splines = FALSE, vcov = "HC2",
#                                      verbose = FALSE) {
#   # validate the type argument
#   type <- match.arg(type, choices = c("RR", "RD"))
# 
#   # validate the family argument
#   if (is.character(family)) {
#     if (!family %in% c("gaussian", "binomial", "Gamma", "inverse.gaussian", "poisson", "quasibinomial", "quasipoisson", "quasi")) {
#       stop("Invalid 'family' argument. Please specify a valid family function.")
#     }
#     family_fun <- get(family, mode = "function", envir = parent.frame())
#   } else if (inherits(family, "family")) {
#     family_fun <- family
#   } else {
#     stop("Invalid 'family' argument. Please specify a valid family function or character string.")
#   }
# 
#   # use the updated causal_contrast_engine function
#   results <- causal_contrast_engine(
#     df = df,
#     Y = Y,
#     X = X,
#     baseline_vars = baseline_vars,
#     treat_0 = treat_0,
#     treat_1 = treat_1,
#     estimand = estimand,
#     type = type,
#     nsims = nsims,
#     cores = cores,
#     family = family,
#     weights = weights,
#     continuous_X = continuous_X,
#     splines = splines,
#     vcov = vcov,
#     verbose = verbose
#   )
# 
#   return(results)
# }
# 
# causal_contrast_engine <- function(df, Y, X, baseline_vars, treat_0, treat_1,
#                                    estimand = c("ATE", "ATT"), type = c("RR", "RD"),
#                                    nsims = 200, cores = parallel::detectCores(),
#                                    family = "gaussian", weights = TRUE,
#                                    continuous_X = FALSE, splines = FALSE,
#                                    vcov = "HC2", verbose = FALSE) {
#   # check if required packages are installed
#   required_packages <- c("clarify", "rlang", "glue", "parallel", "mice")
#   for (pkg in required_packages) {
#     if (!requireNamespace(pkg, quietly = TRUE)) {
#       stop(paste0("Package '", pkg, "' is needed for this function but is not installed"))
#     }
#   }
# 
#   # helper function to process a single imputed dataset
#   process_imputation <- function(imputed_data, weights) {
#     # build formula
#     formula_str <- build_formula_str(Y, X, continuous_X, splines, baseline_vars)
#     
#     # apply model
#     fit <- glm(
#       as.formula(formula_str),
#       weights = weights,
#       family = get(family, mode = "function", envir = parent.frame()),
#       data = imputed_data
#     )
#     
#     return(fit)
#   }
# 
#   # check if df is a wimids object
#   if (inherits(df, "wimids")) {
#     # Extract the mids object and weights
#     mids_obj <- df$object
#     weights_list <- lapply(df$models, function(m) m$weights)
#     
#     # process each imputed dataset
#     fits <- lapply(1:mids_obj$m, function(i) {
#       imputed_data <- mice::complete(mids_obj, i)
#       process_imputation(imputed_data, weights_list[[i]])
#     })
#     
#     # create a misim object
#     sim.imp <- clarify::misim(fits, n = nsims, vcov = vcov)
#   } else {
#     # If df is not a wimids object, process it as before
#     weight_var <- if (!is.null(weights) && weights %in% names(df)) df[[weights]] else NULL
#     
#     fit <- process_imputation(df, weight_var)
#     sim.imp <- clarify::sim(fit, n = nsims, vcov = vcov)
#   }
# 
#   if (continuous_X) {
#     estimand <- "ATE"
#   }
# 
#   # compute the average marginal effects
#   if (!continuous_X && estimand == "ATT") {
#     subset_expr <- rlang::expr(!!rlang::sym(X) == !!treat_1)
#     sim_estimand <- clarify::sim_ame(
#       sim.imp,
#       var = X,
#       subset = eval(subset_expr),
#       cl = cores,
#       verbose = FALSE
#     )
#   } else {
#     sim_estimand <- clarify::sim_ame(sim.imp,
#       var = X,
#       cl = cores,
#       verbose = FALSE
#     )
# 
#     treat_0_name <- paste0("`E[Y(", treat_0, ")]`")
#     treat_1_name <- paste0("`E[Y(", treat_1, ")]`")
# 
#     if (type == "RR") {
#       rr_expression_str <- glue::glue("{treat_1_name}/{treat_0_name}")
#       rr_expression <- rlang::parse_expr(rr_expression_str)
#       sim_estimand <- transform(sim_estimand, RR = eval(rr_expression))
#       return(summary(sim_estimand))
#     } else if (type == "RD") {
#       rd_expression_str <- glue::glue("{treat_1_name} - {treat_0_name}")
#       rd_expression <- rlang::parse_expr(rd_expression_str)
#       sim_estimand <- transform(sim_estimand, RD = eval(rd_expression))
#       return(summary(sim_estimand))
#     } else {
#       stop("Invalid type. Please choose 'RR' or 'RD'")
#     }
#   }
# }

propensity_mod_fit_t2_kessler_latent_anxiety_z <- margot::double_robust_marginal(
  df = df_ate,
  Y = "t2_kessler_latent_anxiety_z",
  X = X, 
  baseline_vars = 1, # we are not regressing with any covariates
  treat_1 = treat_1,
  treat_0 = treat_0,
  nsims = 1000,
  cores = cores,
  family = "gaussian",
  weights = TRUE,
  continuous_X = FALSE,
  splines = FALSE,
  estimand = "ATE",
  type_causal = "RD",  
  type_tab = "RD",    
  vcov = vcov,         
  new_name = "Kessler Anxiety (PR)",
  delta = 1,
  sd = 1
)



# save
here_save(propensity_mod_fit_t2_kessler_latent_anxiety_z, "propensity_mod_fit_t2_kessler_latent_anxiety_z")

propensity_mod_fit_t2_kessler_latent_depression_z <-margot::double_robust_marginal(
  df = df_ate,
  Y = "t2_kessler_latent_depression_z",
  X = X,  
  baseline_vars = 1, #no covariates, only the propensity scores
  treat_1 = treat_1,
  treat_0 = treat_0,
  nsims = 1000,
  cores = cores,
  family = "gaussian",
  weights = TRUE,
  continuous_X = FALSE,
  splines = FALSE,
  estimand = "ATE",
  type_causal = "RD",  
  type_tab = "RD",    
  vcov = vcov,         
  new_name = "Kessler Depression (PR)",
  delta = 1,
  sd = 1
)

# save
here_save(propensity_mod_fit_t2_kessler_latent_depression_z, "propensity_mod_fit_t2_kessler_latent_depression_z")



## Doubly robust
mod_fit_t2_kessler_latent_anxiety_z <-margot::double_robust_marginal(
  df = df_ate,
  Y = "t2_kessler_latent_anxiety_z",
  X = X,
  baseline_vars = baseline_vars_models, # doubly robust: baseline covariates are included in the outcome model
  treat_1 = treat_1,
  treat_0 = treat_0,
  nsims = 1000,
  cores = cores,
  family = "gaussian",
  weights = TRUE,
  continuous_X = FALSE,
  splines = FALSE,
  estimand = "ATE",
  type_causal = "RD",  
  type_tab = "RD",    
  vcov = vcov,         
  new_name = "Kessler Anxiety (DR)",
  delta = 1,
  sd = 1
)
# save
here_save(mod_fit_t2_kessler_latent_anxiety_z, "mod_fit_t2_kessler_latent_anxiety_z")


# model 2
mod_fit_t2_kessler_latent_depression_z <-margot::double_robust_marginal(
  df = df_ate,
  Y = "t2_kessler_latent_depression_z",
  X = X,
  baseline_vars = baseline_vars_models,
  treat_1 = treat_1,
  treat_0 = treat_0,
  nsims = 1000,
  cores = cores,
  family = "gaussian",
  weights = TRUE,
  continuous_X = FALSE,
  splines = FALSE,
  estimand = "ATE",
  type_causal = "RD",  
  type_tab = "RD",    
  vcov = vcov,         
  new_name = "Kessler Depression (DR)",
  delta = 1,
  sd = 1
)

# save
here_save(mod_fit_t2_kessler_latent_depression_z, "mod_fit_t2_kessler_latent_depression_z")

Results

# to save time, we do not run the models
# recover saved models 

# propensity score  models
propensity_mod_fit_t2_kessler_latent_depression_z<- here_read("propensity_mod_fit_t2_kessler_latent_depression_z")
propensity_mod_fit_t2_kessler_latent_anxiety_z<- here_read("propensity_mod_fit_t2_kessler_latent_anxiety_z")

# doubly robust models
mod_fit_t2_kessler_latent_depression_z<- here_read("mod_fit_t2_kessler_latent_depression_z")
mod_fit_t2_kessler_latent_anxiety_z<- here_read("mod_fit_t2_kessler_latent_anxiety_z")


# tables
library(kableExtra)

# propensity score only
tab_double_pr <- rbind(propensity_mod_fit_t2_kessler_latent_depression_z$tab_results,
      propensity_mod_fit_t2_kessler_latent_anxiety_z$tab_results)
# bind results in a table
tab_double_robust <- rbind(mod_fit_t2_kessler_latent_depression_z$tab_results,
      mod_fit_t2_kessler_latent_anxiety_z$tab_results)


# combine the individual results
tab_combo <- rbind(tab_double_pr,tab_double_robust)


# table
tab_combo |> 
  kbl(format = "markdown")

We see that the doubly robust estimator leads to lower overall effect sizes.

Packages

report::cite_packages()
  - Analytics R, Weston S (2022). _iterators: Provides Iterator Construct_. R package version 1.0.14, <https://CRAN.R-project.org/package=iterators>.
  - Bulbulia J (2024). _margot: MARGinal Observational Treatment-effects_. doi:10.5281/zenodo.10907724 <https://doi.org/10.5281/zenodo.10907724>, R package version 0.3.1.0 Functions to obtain MARGinal Observational Treatment-effects from observational data., <https://go-bayes.github.io/margot/>.
  - Chang W (2023). _extrafont: Tools for Using Fonts_. R package version 0.19, <https://CRAN.R-project.org/package=extrafont>.
  - Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, Chen K, Mitchell R, Cano I, Zhou T, Li M, Xie J, Lin M, Geng Y, Li Y, Yuan J (2024). _xgboost: Extreme Gradient Boosting_. R package version 1.7.8.1, <https://CRAN.R-project.org/package=xgboost>.
  - Corporation M, Weston S (2022). _doParallel: Foreach Parallel Adaptor for the 'parallel' Package_. R package version 1.0.17, <https://CRAN.R-project.org/package=doParallel>.
  - Csárdi G (2024). _crayon: Colored Terminal Output_. R package version 1.5.3, <https://CRAN.R-project.org/package=crayon>.
  - Csárdi G (2025). _cli: Helpers for Developing Command Line Interfaces_. R package version 3.6.3.9002, commit 8320914f110ddaa5f3004f2a4421535661b18ea1, <https://github.com/r-lib/cli>.
  - Greifer N (2024). _cobalt: Covariate Balance Tables and Plots_. R package version 4.5.5, <https://CRAN.R-project.org/package=cobalt>.
  - Greifer N (2024). _WeightIt: Weighting for Covariate Balance in Observational Studies_. R package version 1.3.2, <https://CRAN.R-project.org/package=WeightIt>.
  - Greifer N, Worthington S, Iacus S, King G (2024). _clarify: Simulation-Based Inference for Regression Models_. R package version 0.2.1, <https://CRAN.R-project.org/package=clarify>.
  - Grolemund G, Wickham H (2011). "Dates and Times Made Easy with lubridate." _Journal of Statistical Software_, *40*(3), 1-25. <https://www.jstatsoft.org/v40/i03/>.
  - Hastie T (2024). _gam: Generalized Additive Models_. R package version 1.22-5, <https://CRAN.R-project.org/package=gam>.
  - Ho D, Imai K, King G, Stuart E (2011). "MatchIt: Nonparametric Preprocessing for Parametric Causal Inference." _Journal of Statistical Software_, *42*(8), 1-28. doi:10.18637/jss.v042.i08 <https://doi.org/10.18637/jss.v042.i08>.
  - Microsoft, Weston S (2022). _foreach: Provides Foreach Looping Construct_. R package version 1.5.2, <https://CRAN.R-project.org/package=foreach>.
  - Mullen KM, van Stokkum IHM (2024). _nnls: The Lawson-Hanson Algorithm for Non-Negative Least Squares (NNLS)_. R package version 1.6, <https://CRAN.R-project.org/package=nnls>.
  - Müller K, Wickham H (2023). _tibble: Simple Data Frames_. R package version 3.2.1, <https://CRAN.R-project.org/package=tibble>.
  - Pishgar F, Greifer N, Leyrat C, Stuart E (2021). "MatchThem:: Matching and Weighting after Multiple Imputation." _The R Journal_. doi:10.32614/RJ-2021-073 <https://doi.org/10.32614/RJ-2021-073>, <https://journal.r-project.org/archive/2021/RJ-2021-073/>.
  - Polley E, LeDell E, Kennedy C, van der Laan M (2024). _SuperLearner: Super Learner Prediction_. R package version 2.0-29, <https://CRAN.R-project.org/package=SuperLearner>.
  - R Core Team (2024). _R: A Language and Environment for Statistical Computing_. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>.
  - Tierney N, Cook D (2023). "Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations." _Journal of Statistical Software_, *105*(7), 1-31. doi:10.18637/jss.v105.i07 <https://doi.org/10.18637/jss.v105.i07>.
  - van Buuren S, Groothuis-Oudshoorn K (2011). "mice: Multivariate Imputation by Chained Equations in R." _Journal of Statistical Software_, *45*(3), 1-67. doi:10.18637/jss.v045.i03 <https://doi.org/10.18637/jss.v045.i03>.
  - Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). _skimr: Compact and Flexible Summaries of Data_. R package version 2.1.5, <https://CRAN.R-project.org/package=skimr>.
  - Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_. Springer-Verlag New York. ISBN 978-3-319-24277-4, <https://ggplot2.tidyverse.org>.
  - Wickham H (2023). _forcats: Tools for Working with Categorical Variables (Factors)_. R package version 1.0.0, <https://CRAN.R-project.org/package=forcats>.
  - Wickham H (2023). _stringr: Simple, Consistent Wrappers for Common String Operations_. R package version 1.5.1, <https://CRAN.R-project.org/package=stringr>.
  - Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." _Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
  - Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A Grammar of Data Manipulation_. R package version 1.1.4, <https://CRAN.R-project.org/package=dplyr>.
  - Wickham H, Henry L (2023). _purrr: Functional Programming Tools_. R package version 1.0.2, <https://CRAN.R-project.org/package=purrr>.
  - Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text Data_. R package version 2.1.5, <https://CRAN.R-project.org/package=readr>.
  - Wickham H, Vaughan D, Girlich M (2024). _tidyr: Tidy Messy Data_. R package version 1.3.1, <https://CRAN.R-project.org/package=tidyr>.
  - Williams N, Díaz I (2023). "lmtp: An R package for estimating the causal effects of modified treatment policies." _Observational Studies_. <https://muse.jhu.edu/article/883479>. Díaz I, Williams N, Hoffman K, Schneck E (2021). "Non-parametric causal effects based on longitudinal modified treatment policies." _Journal of the American Statistical Association_. doi:10.1080/01621459.2021.1955691 <https://doi.org/10.1080/01621459.2021.1955691>.
  - Wright MN, Ziegler A (2017). "ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R." _Journal of Statistical Software_, *77*(1), 1-17. doi:10.18637/jss.v077.i01 <https://doi.org/10.18637/jss.v077.i01>.
  - Xie Y (2024). _knitr: A General-Purpose Package for Dynamic Report Generation in R_. R package version 1.49, <https://yihui.org/knitr/>. Xie Y (2015). _Dynamic Documents with R and knitr_, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, <https://yihui.org/knitr/>. Xie Y (2014). "knitr: A Comprehensive Tool for Reproducible Research in R." In Stodden V, Leisch F, Peng RD (eds.), _Implementing Reproducible Computational Research_. Chapman and Hall/CRC. ISBN 978-1466561595.
  - Xie Y (2024). _tinytex: Helper Functions to Install and Maintain TeX Live, and Compile LaTeX Documents_. R package version 0.54, <https://github.com/rstudio/tinytex>. Xie Y (2019). "TinyTeX: A lightweight, cross-platform, and easy-to-maintain LaTeX distribution based on TeX Live." _TUGboat_, *40*(1), 30-32. <https://tug.org/TUGboat/Contents/contents40-1.html>.
  - Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and Pipe Syntax_. R package version 1.4.0, <https://CRAN.R-project.org/package=kableExtra>.