Causal Diagrams with ggdag
This tutorial introduces the ggdag package for drawing and analysing directed acyclic graphs (DAGs). DAGs encode causal assumptions and help identify which variables to condition on (and which to leave alone) when estimating causal effects.
This is a reference resource, not a graded lab. Work through the examples at your own pace and return when you need to draw or analyse a DAG for your research report.
Load libraries
# data wrangling
library(tidyverse)
# graphing
library(ggplot2)
# automated causal diagrams
library(ggdag)
The fork: omitted variable bias
Let's use ggdag to identify confounding arising from omitting L in our regression of A on Y.
First we write out the DAG:
# code for creating a DAG
graph_fork <- dagify(Y ~ L,
                     A ~ L,
                     exposure = "A",
                     outcome = "Y") |>
  tidy_dagitty(layout = "tree")
# plot the DAG
graph_fork |>
ggdag() + theme_dag_blank() + labs(title = "L is a common cause of A and Y")
Next we ask ggdag which variables we need to include for an unbiased estimate:
ggdag::ggdag_adjustment_set(graph_fork) +
theme_dag_blank() +
labs(title = "{L} is the exclusive member of the confounder set for A and Y")
The causal graph tells us that, to obtain an unbiased estimate of the effect of A on Y, we must condition on L.
When we include the omitted variable L in our simulated dataset, it breaks the spurious association between A and Y:
set.seed(123)
N <- 1000
L <- rnorm(N)
A <- rnorm(N, L)
Y <- rnorm(N, L)
# without control: A appears associated with Y
fit_fork <- lm(Y ~ A)
parameters::model_parameters(fit_fork)
# with control: association vanishes
fit_fork_controlled <- lm(Y ~ A + L)
parameters::model_parameters(fit_fork_controlled)
Mediation and the pipe
Suppose we are interested in the causal effect of A on Y, where the effect operates through a mediator M.
graph_mediation <- dagify(Y ~ M,
                          M ~ A,
                          exposure = "A",
                          outcome = "Y") |>
  ggdag::tidy_dagitty(layout = "tree")
graph_mediation |>
ggdag() +
theme_dag_blank() +
labs(title = "Mediation Graph")
What should we condition on?
ggdag::ggdag_adjustment_set(graph_mediation)
"Backdoor Paths Unconditionally Closed" means we may obtain an unbiased estimate of the effect of A on Y without including additional variables, assuming our DAG is correct. There is no "backdoor path" from A to Y that would bias our estimate.
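You can also list the paths explicitly rather than reading them off the plot. A minimal sketch using dagitty::paths() (the dagitty package is a dependency of ggdag); the DAG is rebuilt here so the chunk stands alone:

```r
library(ggdag)

# rebuild the mediation DAG so this chunk is self-contained
g <- dagify(Y ~ M,
            M ~ A,
            exposure = "A",
            outcome = "Y")

# enumerate all paths between the exposure and the outcome
p <- dagitty::paths(g)
p$paths  # the single causal path A -> M -> Y
p$open   # TRUE: the path is open without conditioning on anything
```

Because the only path is the causal one, there is no backdoor path to close.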
Two variables are d-connected if information flows between them (conditional on the graph), and d-separated if they are conditionally independent.
ggdag::ggdag_dconnected(graph_mediation)
Here d-connection is desirable because it means we can estimate A's effect on Y.
Simulation: pipe confounding
Suppose we want to know whether a ritual action condition (A) influences charity (Y), and the effect operates entirely through perceived social cohesion (M):
set.seed(123)
N <- 100
c0 <- rnorm(N, 10, 2)
ritual <- rep(0:1, each = N / 2)
cohesion <- ritual * rnorm(N, .5, .2)
c1 <- c0 + ritual * cohesion
d <- data.frame(c0 = c0, c1 = c1, ritual = ritual, cohesion = cohesion)
# total effect of ritual on charity
parameters::model_parameters(lm(c1 ~ c0 + ritual, data = d))
# conditioning on the mediator blocks the causal path
parameters::model_parameters(lm(c1 ~ c0 + ritual + cohesion, data = d))
The direct effect of ritual drops out when we include cohesion. Once the model knows M, it gets no new information from A. Conditioning on a post-treatment variable creates pipe confounding.
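The same conclusion can be checked symbolically: once we condition on the mediator, the exposure and outcome are d-separated. A sketch using dagitty::dseparated() (a ggdag dependency), on the mediation DAG rebuilt locally:

```r
library(ggdag)

g <- dagify(Y ~ M,
            M ~ A,
            exposure = "A",
            outcome = "Y")

# A and Y are d-connected marginally...
dagitty::dseparated(g, "A", "Y", list())  # FALSE
# ...but d-separated once we condition on the mediator M
dagitty::dseparated(g, "A", "Y", "M")     # TRUE
```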
Masked relationships
When two correlated variables have opposing effects on the outcome, their individual effects can be "masked" in simple regressions.
dag_m1 <- dagify(K ~ C + R,
                 R ~ C,
                 exposure = "C",
                 outcome = "K") |>
  tidy_dagitty(layout = "tree")
dag_m1 |> ggdag()
set.seed(123)
n <- 100
C <- rnorm(n)
R <- rnorm(n, C)
K <- rnorm(n, R - C)
d_sim <- data.frame(K = K, R = R, C = C)
# C alone: weak or null
parameters::model_parameters(lm(K ~ C, data = d_sim))
# both: opposing effects "pop"
parameters::model_parameters(lm(K ~ C + R, data = d_sim))
Note that ggdag correctly identifies that you do not need to condition on R to estimate C's total effect on K:
dag_m1 |> ggdag_adjustment_set()
The total effect of C on K combines the direct path (C → K) and the indirect path (C → R → K); these work in opposite directions.
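You can verify this decomposition by enumerating the paths. A sketch using dagitty::paths() (a ggdag dependency), with the masked-relationship DAG rebuilt so the chunk is self-contained:

```r
library(ggdag)

g <- dagify(K ~ C + R,
            R ~ C,
            exposure = "C",
            outcome = "K")

p <- dagitty::paths(g)
p$paths  # two causal paths: C -> K and C -> R -> K
p$open   # both open; neither is a backdoor path, so no adjustment is needed
```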
Collider confounding
This is the selection-distortion effect, also known as Berkson's paradox. Imagine there is no relationship between the newsworthiness and trustworthiness of science, but selection committees make decisions on the basis of both:
dag_sd <- dagify(S ~ N,
                 S ~ T,
                 labels = c("S" = "Selection",
                            "N" = "Newsworthy",
                            "T" = "Trustworthy")) |>
  tidy_dagitty(layout = "nicely")
dag_sd |>
ggdag(text = FALSE, use_labels = "label") + theme_dag_blank()
When two arrows enter a variable, conditioning on it opens a path of information between its causes:
ggdag_dseparated(
  dag_sd,
  from = "T",
  to = "N",
  controlling_for = "S",
  text = FALSE,
  use_labels = "label"
) + theme_dag_blank()
# simulation of selection-distortion effect
set.seed(123)
n <- 1000
p <- 0.05
d <- tibble(
  newsworthiness = rnorm(n, mean = 0, sd = 1),
  trustworthiness = rnorm(n, mean = 0, sd = 1)
) |>
  mutate(total_score = newsworthiness + trustworthiness) |>
  mutate(selected = total_score >= quantile(total_score, 1 - p))
# correlation among selected proposals
d |>
filter(selected == TRUE) |>
select(newsworthiness, trustworthiness) |>
cor()
Selection induces a spurious correlation. Among selected proposals, newsworthy ones appear less trustworthy, even though the two are independent in the population.
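For contrast, check the correlation in the full, unselected sample; it is essentially zero. A base-R sketch of the same simulation, rebuilt here so the chunk stands alone:

```r
set.seed(123)
n <- 1000
p <- 0.05

newsworthiness  <- rnorm(n)
trustworthiness <- rnorm(n)
total_score <- newsworthiness + trustworthiness
selected <- total_score >= quantile(total_score, 1 - p)

# near zero in the full population...
cor(newsworthiness, trustworthiness)
# ...strongly negative among the selected top 5%
cor(newsworthiness[selected], trustworthiness[selected])
```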
Collider bias in experiments
Conditioning on a post-treatment variable can open a spurious path even when no experimental effect exists:
dag_ex2 <- dagify(
  C1 ~ C0 + U,
  Ch ~ U + R,
  labels = c(
    "R" = "Ritual",
    "C1" = "Charity-post",
    "C0" = "Charity-pre",
    "Ch" = "Cohesion",
    "U" = "Religiousness (Unmeasured)"
  ),
  exposure = "R",
  outcome = "C1",
  latent = "U"
) |>
  control_for(c("Ch", "C0"))
dag_ex2 |>
ggdag_collider(text = FALSE, use_labels = "label") +
ggtitle("Cohesion is a collider that opens a path from ritual to charity")
Taxonomy of confounding
There are four basic structures:
The fork (omitted variable bias)
confounder_triangle(x = "Coffee", y = "Lung Cancer", z = "Smoking") |>
ggdag_dconnected(text = FALSE, use_labels = "label")
The pipe (fully mediated effects)
mediation_triangle(x = NULL, y = NULL, m = NULL, x_y_associated = FALSE) |>
tidy_dagitty(layout = "nicely") |>
ggdag()
The collider
collider_triangle() |>
ggdag_dseparated(controlling_for = "m")
Confounding by proxy
Controlling for a descendant of a collider introduces collider bias:
dag_desc <- dagify(
  Z ~ X,
  Z ~ Y,
  D ~ Z,
  labels = c("Z" = "Collider", "D" = "Descendant", "X" = "X", "Y" = "Y"),
  exposure = "X",
  outcome = "Y"
) |>
  control_for("D")
dag_desc |>
  ggdag_dseparated(
    from = "X", to = "Y",
    controlling_for = "D",
    text = FALSE, use_labels = "label"
  ) +
  ggtitle("D induces collider bias")
Rules for avoiding confounding
From Statistical Rethinking (p. 286):
- List all paths connecting the exposure and outcome
- Classify each path as open or closed (a path is open unless it contains a collider)
- Classify each path as a backdoor path (has an arrow entering the exposure)
- If there are open backdoor paths, decide which variable(s) to condition on to close them
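ggdag's adjustment-set functions apply these rules mechanically via the dagitty package. As a sketch, dagitty::adjustmentSets() on the fork DAG from the first section recovers {L}:

```r
library(ggdag)

# the fork from the first section: L is a common cause of A and Y
g <- dagify(Y ~ L,
            A ~ L,
            exposure = "A",
            outcome = "Y")

# which conditioning sets close all open backdoor paths?
dagitty::adjustmentSets(g)  # the single minimal set: { L }
```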
Selection bias in sampling and longitudinal research
Selection bias arises when the sample is not representative of the target population due to conditioning on a collider or its descendant:
coords_mine <- tibble::tribble(
  ~name,          ~x, ~y,
  "glioma",        1,  2,
  "hospitalized",  2,  3,
  "broken_bone",   3,  2,
  "reckless",      4,  1,
  "smoking",       5,  2
)
dagify(hospitalized ~ broken_bone + glioma,
       broken_bone ~ reckless,
       smoking ~ reckless,
       labels = c(hospitalized = "Hospitalization",
                  broken_bone = "Broken Bone",
                  glioma = "Glioma",
                  reckless = "Reckless \nBehaviour",
                  smoking = "Smoking"),
       coords = coords_mine) |>
  ggdag_dconnected("glioma", "smoking", controlling_for = "hospitalized",
                   text = FALSE, use_labels = "label", collider_lines = FALSE)
In longitudinal research, retention can act as a descendant of a collider, introducing bias when the sample is conditioned on being retained.
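A minimal base-R simulation of that mechanism (the variable names here are hypothetical, not from the examples above): both the exposure and an unmeasured trait drive retention, the exposure has no true effect on the outcome, and analysing only retained participants induces a spurious effect.

```r
set.seed(123)
n <- 10000

exposure <- rbinom(n, 1, 0.5)  # e.g., assignment to a demanding condition
trait    <- rnorm(n)           # unmeasured trait that also drives the outcome
outcome  <- trait + rnorm(n)   # the exposure has NO effect on the outcome

# retention depends on both the exposure and the trait, so it is a collider
retained <- (exposure + trait) > 1

# full sample: exposure coefficient is approximately zero, as it should be
coef(lm(outcome ~ exposure))["exposure"]
# retained subsample: a spurious negative exposure effect appears
coef(lm(outcome ~ exposure, subset = retained))["exposure"]
```

Conditioning the analysis on retention (by only observing retained participants) opens the path between the exposure and the unmeasured trait, which then leaks into the outcome.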
Summary
We control for variables to avoid omitted variable bias. But included variable bias is also commonplace. It arises from "pipes", "colliders", and conditioning on descendants of colliders. The ggdag package can help identify adjustment sets, but the results depend on assumptions encoded in your DAG that are not part of the data. Clarify your assumptions.