Lab 3: Regression and Bias Mechanisms

R script

Download the R script for this lab (right-click → Save As)

This lab introduces regression as a tool for describing associations and then shows why regression coefficients do not become causal effects simply because we fit a model. We begin with a simple simulated regression, and then we examine three common sources of bias: omitted variable bias, mediator bias, and collider bias.

What you will learn

  1. How to simulate a simple regression model and interpret its slope.
  2. Why population inference depends on assumptions, not on software output alone.
  3. How omitted variable bias arises when a common cause is left out.
  4. How mediator bias arises when we condition on a variable on the causal pathway.
  5. How collider bias arises when we condition on a common effect.

Packages

required_packages <- c("tidyverse", "parameters", "report", "ggeffects", "ggdag")
missing_packages <- required_packages[
  !vapply(required_packages, \(pkg) requireNamespace(pkg, quietly = TRUE), logical(1))
]

if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

library(tidyverse)
library(parameters)
library(report)
library(ggeffects)
library(ggdag)

Regression refresher

A regression model describes how the expected value of an outcome changes with a predictor. In this lab, the first simulation uses study hours as the predictor and exam score as the outcome.

set.seed(123)
n <- 200
study_hours <- rnorm(n, mean = 10, sd = 2)
exam_score <- 50 + 3 * study_hours + rnorm(n, mean = 0, sd = 6)

df_regression <- tibble(
  study_hours = study_hours,
  exam_score = exam_score
)

fit_regression <- lm(exam_score ~ study_hours, data = df_regression)
summary(fit_regression)
report::report(fit_regression)

The coefficient for study_hours tells us how much the expected exam score changes for each additional hour of study. In this simulation, the data-generating slope is 3, so the fitted estimate should land close to 3, with some sampling noise from the error term.

We can visualise the fitted line with raw data:

predicted_values <- ggeffects::ggpredict(fit_regression, terms = "study_hours")
plot(predicted_values, dot_alpha = 0.25, show_data = TRUE, jitter = 0.05)

A reminder about inference

Regression helps us summarise patterns in data, but it does not settle causal questions on its own. A coefficient can be statistically precise and still be causally misleading if the model conditions on the wrong variables, omits an important pre-treatment cause, or conditions on a post-treatment variable that should have been left alone.

This is why model fit is not the same as causal identification. A model with a lower BIC or higher $R^2$ can still answer the wrong causal question.
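To make that concrete, here is a small self-contained sketch (it previews the collider setup covered later in this lab): conditioning on a common effect of a and y raises $R^2$ substantially while biasing the coefficient of interest.

```r
# Self-contained sketch: better fit is not better identification.
set.seed(1)
n <- 2000
a <- rnorm(n)              # treatment, truly unrelated to the outcome
y <- rnorm(n)              # outcome
c_var <- a + y + rnorm(n)  # a common effect of both

fit_plain <- lm(y ~ a)
fit_overfit <- lm(y ~ a + c_var)

summary(fit_plain)$r.squared    # close to zero
summary(fit_overfit)$r.squared  # much larger
coef(fit_overfit)[["a"]]        # clearly negative, though the true effect is zero
```

The second model fits far better, yet its coefficient for a answers the wrong causal question.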

Omitted variable bias

Omitted variable bias occurs when a common cause of treatment and outcome is not included in the model. In the script, l causes both a and y, but a has no causal effect on y.

set.seed(434)
n <- 2000
l <- rnorm(n)
a <- 0.9 * l + rnorm(n)
y <- 1.2 * l + rnorm(n)

df_fork <- tibble(y = y, a = a, l = l)

fit_fork_naive <- lm(y ~ a, data = df_fork)
fit_fork_adjusted <- lm(y ~ a + l, data = df_fork)

parameters::model_parameters(fit_fork_naive)
parameters::model_parameters(fit_fork_adjusted)

The naive model reports an association between a and y because both are driven by l. Once we adjust for l, the coefficient for a should collapse toward zero. This is the logic of adjustment for observed pre-treatment confounding.

The script also draws a DAG for this pattern:

dag_fork <- dagify(
  y ~ l,
  a ~ l,
  exposure = "a",
  outcome = "y"
) |>
  tidy_dagitty(layout = "tree")

ggdag(dag_fork) +
  theme_dag_blank()

Mediator bias

Mediator bias occurs when we condition on a variable that lies on the pathway from treatment to outcome, even though our goal is to estimate the total effect.

set.seed(435)
n <- 2000
a <- rbinom(n, 1, 0.5)
m <- 1.5 * a + rnorm(n)
y <- 2 * m + rnorm(n)

df_pipe <- tibble(y = y, a = a, m = m)

fit_pipe_total <- lm(y ~ a, data = df_pipe)
fit_pipe_overcontrolled <- lm(y ~ a + m, data = df_pipe)

parameters::model_parameters(fit_pipe_total)
parameters::model_parameters(fit_pipe_overcontrolled)

Here the total effect of a on y operates through m. If we condition on m, we block the pathway that carries the treatment effect. As a result, the coefficient for a shrinks toward zero, even though the treatment really does matter.

In words, if you want the total effect, you do not control for the mediator.
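Because the simulation uses linear paths, the total effect is just the product of the path coefficients: a → m is 1.5 and m → y is 2, so the total effect of a on y is 3. A self-contained check:

```r
# Re-simulate the pipe and verify the product-of-paths arithmetic.
set.seed(435)
n <- 2000
a <- rbinom(n, 1, 0.5)
m <- 1.5 * a + rnorm(n)   # a -> m path: 1.5
y <- 2 * m + rnorm(n)     # m -> y path: 2

coef(lm(y ~ a))[["a"]]      # total effect, close to 1.5 * 2 = 3
coef(lm(y ~ a + m))[["a"]]  # close to 0 once the pathway is blocked
```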

Collider bias

Collider bias occurs when we condition on a variable that is caused by both the treatment and the outcome. Before conditioning, the treatment and outcome may be unrelated. After conditioning, we create a spurious association.

set.seed(436)
n <- 2000
a <- rnorm(n)
y <- rnorm(n)
c_var <- a + y + rnorm(n)

df_collider <- tibble(y = y, a = a, c_var = c_var)

fit_collider_unadjusted <- lm(y ~ a, data = df_collider)
fit_collider_adjusted <- lm(y ~ a + c_var, data = df_collider)

parameters::model_parameters(fit_collider_unadjusted)
parameters::model_parameters(fit_collider_adjusted)

The unadjusted model should show little or no relationship between a and y. The adjusted model should manufacture one: conditioning on c_var induces a spurious (here, negative) association between variables that are causally unrelated. That is collider bias.
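The script draws a DAG only for the fork; an analogous diagram for the collider is easy to sketch in the same style (this block is an addition, not part of the downloadable script):

```r
library(ggdag)

# c_var is a common effect: arrows point from a and y into the collider.
dag_collider <- dagify(
  c_var ~ a + y,
  exposure = "a",
  outcome = "y"
) |>
  tidy_dagitty(layout = "tree")

ggdag(dag_collider) +
  theme_dag_blank()
```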

Compare the three bias patterns

The script finishes by collecting the treatment coefficient from each simulation into a single table:

results <- tibble(
  scenario = c(
    "omitted variable bias",
    "omitted variable bias",
    "mediator bias",
    "mediator bias",
    "collider bias",
    "collider bias"
  ),
  model = c(
    "naive",
    "adjusted for l",
    "total effect",
    "overcontrolled",
    "do not condition",
    "condition on collider"
  ),
  estimate = c(
    coef(fit_fork_naive)[["a"]],
    coef(fit_fork_adjusted)[["a"]],
    coef(fit_pipe_total)[["a"]],
    coef(fit_pipe_overcontrolled)[["a"]],
    coef(fit_collider_unadjusted)[["a"]],
    coef(fit_collider_adjusted)[["a"]]
  )
) |>
  mutate(estimate = round(estimate, 3))

print(results)

This comparison is the point of the lab. Different forms of conditioning error create different forms of bias. Regression does not rescue us from those mistakes. We have to supply the right causal structure.

Exercise

  1. Increase the strength of the common cause l in the omitted variable simulation. What happens to the naive coefficient?
  2. Increase the effect of a on m in the mediator simulation. What happens to the difference between the total-effect and overcontrolled models?
  3. Increase the contribution of a and y to the collider c_var. What happens to the adjusted coefficient?
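One way to approach exercise 3 is to wrap the collider simulation in a function with an adjustable path strength (the function name and interface here are illustrative, not part of the lab script):

```r
# Illustrative helper: how strongly does conditioning on the collider
# bias the adjusted coefficient as the paths into c_var get stronger?
simulate_collider_bias <- function(strength, n = 2000, seed = 436) {
  set.seed(seed)
  a <- rnorm(n)
  y <- rnorm(n)
  c_var <- strength * a + strength * y + rnorm(n)
  coef(lm(y ~ a + c_var))[["a"]]
}

simulate_collider_bias(strength = 1)  # moderate negative bias
simulate_collider_bias(strength = 3)  # stronger negative bias
```

The same wrapper pattern works for the other two exercises: parameterise the coefficient you are asked to vary and compare the resulting estimates.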

Take-home message

Regression is useful, but causal interpretation depends on design and assumptions. The practical rule is simple. Control for common causes. Do not control for mediators when estimating total effects. Do not control for colliders.