Lab 3: Regression, Graphing, and Simulation
Download the R script for this lab (right-click → Save As)
Simulating data: Outcome ~ Treatment
Step 1: Set up your R environment
Ensure R or RStudio is installed and open.
Step 2: Set a seed for reproducibility
set.seed(123)
Step 3: Simulate continuous data
n <- 100
mean <- 50
sd <- 10
data_continuous <- rnorm(n, mean, sd)
head(data_continuous)
hist(data_continuous)
Step 4: Simulate categorical data
levels <- c("Male", "Female")
data_categorical <- sample(levels, n, replace = TRUE)
table(data_categorical)
# with unequal probabilities
data_categorical_unequal <- sample(levels, n, replace = TRUE, prob = c(0.3, 0.7))
table(data_categorical_unequal)
Simulating outcomes from treatments
library(tidyverse)
library(ggplot2)
library(patchwork)
library(parameters)
library(report)
library(ggeffects)
set.seed(123)
groupA_scores <- rnorm(100, mean = 100, sd = 15)
groupB_scores <- rnorm(100, mean = 105, sd = 15)
df_scores <- data.frame(
group = rep(c("A", "B"), each = 100),
scores = c(groupA_scores, groupB_scores)
)
df_scores_1 <- df_scores |>
mutate(group = as.factor(group))
Visualising simulated data
ggplot(df_scores_1, aes(x = group, y = scores, fill = group)) +
geom_boxplot() +
theme_minimal() +
labs(title = "score distribution by group", x = "group", y = "scores")
Histogram
ggplot(df_scores_1, aes(x = scores, fill = group)) +
geom_histogram(binwidth = 5, color = "black") +
labs(title = "distribution of scores by group", x = "scores", y = "frequency") +
facet_wrap(~group, ncol = 1) +
theme_minimal()
Exercise 1
- Modify the simulation parameters to change each group's mean and standard deviation. Observe how these changes affect the distribution.
- Experiment with different bin widths in the histogram. How do large and small bin widths speak differently to the data?
Simulating data for statistical tests
# one-sample t-test
set.seed(123)
data <- rnorm(100, mean = 5, sd = 1)
mod_first_test <- t.test(data, mu = 4)
parameters::parameters(mod_first_test)
report::report(mod_first_test)
# two-sample t-test
group1 <- rnorm(50, mean = 5, sd = 1)
group2 <- rnorm(50, mean = 5.5, sd = 1)
mod_t_test_result <- t.test(group1, group2)
report::report(mod_t_test_result)
# paired t-test
pre_test <- rnorm(30, mean = 80, sd = 10)
post_test <- rnorm(30, mean = pre_test + 5, sd = 5)
mod_pre_post <- t.test(pre_test, post_test, paired = TRUE)
report::report(mod_pre_post)
Understanding regression using simulation
set.seed(123)
n <- 100
treatment <- rnorm(n, mean = 50, sd = 10)
beta_a <- 2
outcome <- 5 + beta_a * treatment + rnorm(n, mean = 0, sd = 20)
df <- data.frame(treatment = treatment, outcome = outcome)
fit <- lm(outcome ~ treatment, data = df)
summary(fit)
predicted_values <- ggeffects::ggemmeans(fit, terms = c("treatment"))
plot(predicted_values, dot_alpha = 0.35, show_data = TRUE, jitter = .1)
Equivalence of ANOVA and regression
set.seed(123)
n <- 90
k <- 3
group <- factor(rep(1:k, each = n/k))
means <- c(100, 100, 220)
sd <- 15
y <- rnorm(n, mean = rep(means, each = n/k), sd = sd)
df_1 <- cbind.data.frame(y, group)
# ANOVA
anova_model <- aov(y ~ group, data = df_1)
report::report(anova_model)
# regression (equivalent)
fit_regression <- lm(y ~ group, data = df_1)
parameters::model_parameters(fit_regression)
Combination plots
library(patchwork)
coefficient_plot <- plot(parameters::model_parameters(fit_regression))
predictive_plot <- plot(
ggeffects::ggpredict(fit_regression, terms = "group"),
dot_alpha = 0.35, show_data = TRUE, jitter = .1, colors = "reefs"
) +
scale_y_continuous(limits = c(0, 260)) +
labs(title = "predictive graph", x = "treatment group", y = "response")
my_first_combination_plot <- coefficient_plot / predictive_plot +
plot_annotation(title = "coefficient and predictive plots", tag_levels = "A")
my_first_combination_plot
Upshot
ANOVA partitions variance into between-group and within-group components. Regression estimates the mean of the dependent variable as a linear function of independent variables. For many questions ANOVA is appropriate, but when comparing groups we often want finer-grained comparisons. Note: comparisons do not establish causality.
Exercise 2
- Simulate two continuous variables
YandAwithn = 100. VariableAshould have mean 50, sd 10. - Define the effect of
AonYasbeta_A <- 2. - Include an error term with sd 20.
- Run
lm(Y ~ A)and report results.
Template:
library(parameters)
set.seed() # pick a number
n <- 100
A <- rnorm(n, mean = 50, sd = 10)
beta_A <- 2
df_3 <- data.frame(
A = A,
Y = 5 + beta_A * A + rnorm(n, mean = 0, sd = 20)
)
fit_3 <- lm(Y ~ A, data = df_3)
parameters::model_parameters(fit_3)
report::report(fit_3)
What you have learned
- Data simulation: simulating datasets in R for exploring statistical concepts and causal inference.
- Data visualisation: expanding your capacity for visualisation with ggplot2.
- Statistical tests: t-tests, ANOVA, and regression.
- ANOVA and regression: their equivalence and when to prefer regression.
Package references
- ggplot2: declarative graphics based on the Grammar of Graphics.
- Parameters: utilities for processing model parameters.
- Report: automated report generation from statistical models.
Appendix A: Solution to Exercise 2
library(parameters)
set.seed(12345)
n <- 100
A <- rnorm(n, mean = 50, sd = 10)
beta_A <- 2
df_3 <- data.frame(
A = rnorm(n, mean = 50, sd = 10),
Y = 5 + beta_A * A + rnorm(n, mean = 0, sd = 20)
)
fit_3 <- lm(Y ~ A, data = df_3)
parameters::model_parameters(fit_3)
report::report(fit_3)
Appendix B: ANOVA post hoc comparisons
tukey_post_hoc <- TukeyHSD(anova_model)
print(tukey_post_hoc)
plot(tukey_post_hoc)
Appendix C: Adding complexity in simulation
data_frame <- data.frame(
ID = 1:n,
Gender = sample(c("Male", "Female"), n, replace = TRUE),
Age = rnorm(n, mean = 30, sd = 5),
Income = rnorm(n, mean = 50000, sd = 10000)
)
# income as a function of age
intercept <- 20000
beta_age <- 1500
error_sd <- 10000
Age <- rnorm(n, mean = 30, sd = 5)
Income <- intercept + beta_age * Age + rnorm(n, mean = 0, sd = error_sd)
data_complex <- data.frame(Age, Income)
ggplot(data_complex, aes(x = Age, y = Income)) +
geom_point() +
theme_minimal() +
labs(title = "simulated age vs. income", x = "age", y = "income")
Appendix D: Causal Inference Glossary
Download the glossary of key terms in causal inference (PDF)