Lab 2: Install R and RStudio
Download the R script for this lab (right-click → Save As)
This session introduces R and RStudio.
Why learn R?
- You will need it for your final report (if you choose the report option).
- It supports your psychology coursework.
- It enhances your coding skills, which will help you in many domains of work, including utilising AI (!).
Installing R
- Visit CRAN at https://cran.r-project.org/.
- Select the version for your operating system (Windows, Mac, or Linux).
- Download and install by following the on-screen instructions.
Installing RStudio
See Johannes Karl's video for a walkthrough.
Step 1: Install RStudio
- Go to https://www.rstudio.com/products/rstudio/download/.
- Choose the free version of RStudio Desktop and download it for your operating system.
- Install RStudio Desktop.
- Open RStudio to begin setting up your project environment.
Step 2: Create a new project
- In RStudio, go to File > New Project.
- Choose New Directory for a new project or Existing Directory if you have a folder ready.
- For a new project, select New Project, then provide a directory name.
- Specify the location where the project folder will be created.
- Click Create Project.
- Use a clear folder structure.
- If you are using GitHub, create a location on your machine (not Dropbox).
- If you are not using GitHub, choose the cloud (Dropbox or similar).
- When creating new files and scripts, use clear labels that anyone could understand. That "anyone" will be your future self.
Step 3: Give the project structure
- Within your project, create folders to organise scripts and data. Common folder names include R/ for R scripts, data/ for datasets, and doc/ for documentation.
- To create a new R script, go to File > New File > R Script. Save the script in your R/ folder with a meaningful file name.
- If you set up Git in week 1, you can initialise a Git repository when creating a new project to track your R work.
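The folder layout described in Step 3 can also be created from R itself. The sketch below is a minimal example, assuming a hypothetical project folder called my_lab_project placed inside a temporary directory so that it does not touch your real files. In practice you would run the dir.create() calls from inside your own RStudio project.

```r
# Hypothetical project root inside R's temporary directory (an assumption for
# illustration; replace with your own project path when working for real).
project_root <- file.path(tempdir(), "my_lab_project")

# Folder names follow the convention described above.
folders <- file.path(project_root, c("R", "data", "doc"))

for (folder in folders) {
  # recursive = TRUE also creates project_root if it does not exist yet;
  # showWarnings = FALSE keeps R quiet if the folder is already there.
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
list.files(project_root)
```

Using file.path() rather than pasting slashes by hand keeps the paths valid on Windows, Mac, and Linux.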
Step 4: Working with R scripts
- Write your R code in the script editor. Execute code by selecting lines and pressing Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).
- Use comments (preceded by #) to document your code.
- Save your scripts regularly (Ctrl + S or Cmd + S).
Step 5: When you exit RStudio
Before concluding your work, save your workspace or clear it to start fresh (Session > Restart R).
- Use clearly defined script names.
- Annotate your code.
- Save your scripts often (Ctrl + S or Cmd + S).
Basic R Commands
- Open File > New File > R Script in RStudio.
- Name and save your new R script.
- Copy the code blocks below into your script.
- Save: Ctrl + S or Cmd + S.
Assignment (<-)
Assignment in R uses the <- operator:
x <- 10 # assigns the value 10 to x
y <- 5 # assigns the value 5 to y
Concatenation (c())
The c() function combines multiple elements into a vector:
numbers <- c(1, 2, 3, 4, 5)
print(numbers)
Operations (+, -)
x <- 10
y <- 5
total <- x + y # named total rather than sum, so the built-in sum() function is not masked
print(total)
difference <- x - y
difference
Multiplication (*) and Division (/)
product <- x * y
product
quotient <- x / y
quotient
# element-wise operations on vectors
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
vector_product <- vector1 * vector2
vector_product
vector_division <- vector1 / vector2
vector_division
Be mindful of division by zero: 10 / 0 returns Inf, and 0 / 0 returns NaN.
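Because Inf and NaN propagate silently through later calculations, it can help to detect them explicitly. The following is a small sketch, not a standard R function: safe_divide() is a hypothetical helper that replaces any non-finite result with NA.

```r
# Inf and NaN can be detected with base R predicates.
is.infinite(10 / 0)  # TRUE
is.nan(0 / 0)        # TRUE

# Hypothetical helper: divide, then turn Inf, -Inf, and NaN into NA.
safe_divide <- function(numerator, denominator) {
  result <- numerator / denominator
  # is.finite() is FALSE for Inf, -Inf, NaN, and NA
  result[!is.finite(result)] <- NA
  result
}

safe_divide(c(10, 0, 6), c(0, 0, 3))  # NA NA 2
```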
# integer division and modulo
integer_division <- 10 %/% 3 # 3
remainder <- 10 %% 3 # 1
rm() Remove Object
devil_number <- 666
devil_number
rm(devil_number)
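To confirm that rm() really removed an object, you can ask R whether a name is currently defined. This is a minimal sketch using the base function exists(); temporary_value is just an illustrative name.

```r
temporary_value <- 123
exists("temporary_value")  # TRUE: the object is in the workspace

rm(temporary_value)
exists("temporary_value")  # FALSE: the object has been removed
```

Typing the name of a removed object at the console gives an "object not found" error, which is the same thing you would see after rm(devil_number).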
Logic (!, !=, ==)
x_not_y <- x != y # TRUE
x_equal_10 <- x == 10 # TRUE
not_x_equal_10 <- !x_equal_10 # FALSE (! negates a logical value)
OR (| and ||)
# element-wise OR
vector_or <- c(TRUE, FALSE) | c(FALSE, TRUE) # c(TRUE, TRUE)
# scalar OR (each side must be a single TRUE/FALSE; longer vectors are an error in R >= 4.3)
single_or <- TRUE || FALSE # TRUE
AND (& and &&)
# element-wise AND
vector_and <- c(TRUE, FALSE) & c(FALSE, TRUE) # c(FALSE, FALSE)
# scalar AND (each side must be a single TRUE/FALSE; longer vectors are an error in R >= 4.3)
single_and <- TRUE && FALSE # FALSE
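In practice, the element-wise operators are used on whole vectors, while && and || are meant for single TRUE/FALSE values such as the condition of an if statement. The sketch below, using a made-up scores vector, shows how any() and all() reduce a logical vector to one value that can safely be combined with &&.

```r
scores <- c(72, 95, 88)         # illustrative data, not from the lab
passed <- scores >= 80          # element-wise comparison: FALSE TRUE TRUE

any(passed)  # TRUE: at least one score is 80 or above
all(passed)  # FALSE: not every score is 80 or above

# && expects a single TRUE/FALSE on each side, so reduce vectors first.
if (length(scores) > 0 && all(scores <= 100)) {
  message("scores look valid")
}
```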
RStudio keyboard shortcuts
- Execute code line: Cmd + Return (Mac) or Ctrl + Enter (Win/Linux)
- Insert section heading: Cmd + Shift + R (Mac) or Ctrl + Shift + R
- Align code: Cmd + Shift + A (Mac) or Ctrl + Shift + A
- Comment/uncomment: Cmd/Ctrl + Shift + C
- Save all: Cmd/Ctrl + Shift + S
- Find/replace: Cmd/Ctrl + F
- New file: Cmd/Ctrl + Shift + N
- Auto-complete: Tab
For more, explore Tools > Command Palette or Shift + Cmd/Ctrl + P.
Data Types in R
Integers
x <- 42L
str(x) # int 42
y <- as.numeric(x)
str(y) # num 42
Integers are useful for counts or indices that do not require fractional values.
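The L suffix matters because a plain number such as 42 is stored as a double, not an integer. A short sketch with typeof() shows which operations keep the integer type; count is an illustrative name.

```r
count <- 7L

typeof(count)         # "integer"
typeof(7)             # "double": no L suffix, so R stores a double

typeof(count + 1L)    # "integer": integer plus integer stays integer
typeof(count + 1)     # "double": mixing with a double promotes the result
typeof(count / 2L)    # "double": ordinary division always returns a double
typeof(count %/% 2L)  # "integer": integer division keeps the integer type
```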
Characters
name <- "Alice"
Characters represent text: names, labels, descriptions.
Factors
colors <- factor(c("red", "blue", "green"))
Factors represent categorical data with a limited set of levels.
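Under the hood, a factor stores its levels once and then records each observation as an integer code. The sketch below, with an illustrative colours_seen vector, shows the levels, the counts per level, and the underlying codes.

```r
colours_seen <- factor(c("red", "blue", "red", "green", "red"))

levels(colours_seen)      # "blue" "green" "red" (sorted alphabetically by default)
table(colours_seen)       # counts per level: blue 1, green 1, red 3
as.integer(colours_seen)  # 3 1 3 2 3: position of each value in levels()
```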
Ordered factors
education_levels <- c("high school", "bachelor", "master", "ph.d.")
# unordered
education_factor_no_order <- factor(education_levels, ordered = FALSE)
# ordered: pass levels explicitly, otherwise R orders them alphabetically
education_factor <- factor(education_levels, levels = education_levels, ordered = TRUE)
education_factor
Ordered factors support logical comparisons based on level order:
edu1 <- ordered("bachelor", levels = education_levels)
edu2 <- ordered("master", levels = education_levels)
edu2 > edu1 # TRUE
Strings
you <- "world!"
greeting <- paste("hello,", you)
greeting # "hello, world!"
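A few other base R functions are handy when working with strings. The sketch below is illustrative only; first_name is a made-up variable.

```r
first_name <- "Alice"

paste("hello,", first_name)      # "hello, Alice" (paste() inserts a space)
paste0("hello_", first_name)     # "hello_Alice" (paste0() inserts nothing)
nchar(first_name)                # 5
toupper(first_name)              # "ALICE"

# sprintf() fills placeholders: %s for text, %d for an integer
sprintf("%s scored %d points", first_name, 92L)  # "Alice scored 92 points"
```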
Vectors
Vectors are homogeneous: all elements must be of the same type.
numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "cherry")
logical_vector <- c(TRUE, FALSE, TRUE, FALSE)
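One consequence of homogeneity is that R silently converts mixed inputs to a single common type. A minimal sketch of this coercion, using an illustrative vector called mixed:

```r
mixed <- c(1, "two", TRUE)
mixed         # "1" "two" "TRUE": everything became text
class(mixed)  # "character"

# Numbers and logicals together become numeric (TRUE is 1, FALSE is 0).
numbers_and_logicals <- c(1, TRUE, FALSE)
numbers_and_logicals         # 1 1 0
class(numbers_and_logicals)  # "numeric"
```

If a column you expected to be numeric turns out to be character, a stray text value in the data is a common cause.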
Manipulating vectors
vector_sum <- numeric_vector + 10
vector_multiplication <- numeric_vector * 2
vector_greater_than_three <- numeric_vector > 3
Accessing elements:
first_element <- numeric_vector[1]
some_elements <- numeric_vector[c(2, 4)]
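Square brackets accept more than positive positions. The sketch below shows a few other common indexing patterns on an illustrative values vector.

```r
values <- c(10, 20, 30, 40, 50)

values[2:4]          # 20 30 40: a range of positions
values[-1]           # 20 30 40 50: a negative index drops that position
values[values > 25]  # 30 40 50: a logical condition keeps the TRUE positions
which(values > 25)   # 3 4 5: the positions where the condition is TRUE
```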
Functions with vectors
mean(numeric_vector)
sum(numeric_vector)
sort(numeric_vector)
unique(character_vector)
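Real data often contain missing values, and many summary functions return NA when they meet one. A short sketch, assuming a made-up reaction_times vector with one missing value:

```r
reaction_times <- c(320, 410, NA, 385)

mean(reaction_times)                # NA: a missing value makes the mean unknown
mean(reaction_times, na.rm = TRUE)  # mean of the three observed values
sum(is.na(reaction_times))          # 1: number of missing values
```

The na.rm = TRUE argument is also accepted by sum(), sd(), min(), and max().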
Data Frames
Creating data frames
df <- data.frame(
name = c("alice", "bob", "charlie"),
age = c(25, 30, 35),
gender = c("female", "male", "male")
)
head(df)
str(df)
Accessing elements
# by column name
names <- df$name
# by row and column
second_person <- df[2, ]
age_column <- df[, "age"]
# by condition
very_old_people <- subset(df, age > 25)
mean(very_old_people$age)
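The same kind of conditional selection can be written with square brackets instead of subset(). The sketch below uses its own small data frame, people, so that it runs on its own.

```r
people <- data.frame(
  name = c("alice", "bob", "charlie"),
  age = c(25, 30, 35),
  stringsAsFactors = FALSE
)

# Rows where the condition is TRUE, all columns (note the trailing comma).
older <- people[people$age > 25, ]

# Same rows, but only the name column.
older_names <- people[people$age > 25, "name"]  # "bob" "charlie"

# Conditions can be combined with the element-wise & operator.
people[people$age > 25 & people$name != "bob", ]
```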
Exploring data frames
head(df) # first six rows
tail(df) # last six rows
str(df) # structure
summary(df) # summary statistics
Manipulating data frames
# adding columns
df$employed <- c(TRUE, TRUE, FALSE)
# adding rows
new_person <- data.frame(name = "diana", age = 28, gender = "female", employed = TRUE)
df <- rbind(df, new_person)
# modifying values
df[4, "age"] <- 26
# removing columns
df$employed <- NULL
# removing rows
df <- df[-4, ]
rbind() and cbind()
rbind() combines data frames by rows; cbind() combines by columns. When using these functions, column names (for rbind) or row counts (for cbind) must match. We will use dplyr for more flexible joining in later weeks.
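A minimal sketch of both functions, using two small made-up data frames (group_a and group_b) that share the same column names:

```r
group_a <- data.frame(id = 1:2, score = c(88, 92))
group_b <- data.frame(id = 3:4, score = c(85, 95))

# rbind(): stack rows; the column names must match.
all_rows <- rbind(group_a, group_b)
dim(all_rows)  # 4 2

# cbind(): add columns side by side; the number of rows must match.
extra <- data.frame(passed = all_rows$score >= 90)
all_cols <- cbind(all_rows, extra)
dim(all_cols)  # 4 3
```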
Summary statistics
set.seed(12345)
vector <- rnorm(n = 40, mean = 0, sd = 1)
mean(vector)
sd(vector)
min(vector)
max(vector)
table() for categorical data
set.seed(12345)
gender <- sample(c("male", "female"), size = 100, replace = TRUE, prob = c(0.5, 0.5))
education_level <- sample(c("high school", "bachelor", "master"), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2))
df_table <- data.frame(gender, education_level)
table(df_table)
table(df_table$gender, df_table$education_level) # cross-tabulation
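Counts are often easier to interpret as proportions. As a small sketch, prop.table() converts the output of table() into proportions that sum to 1; the handedness variable here is simulated purely for illustration.

```r
set.seed(12345)
handedness <- sample(c("left", "right"), size = 50, replace = TRUE)

counts <- table(handedness)
counts

# Convert counts to proportions, then round for display.
proportions <- prop.table(counts)
round(proportions, 2)
```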
First Data Visualisation with ggplot2
ggplot2 is based on the Grammar of Graphics: you build plots layer by layer.
library(ggplot2)
set.seed(12345)
student_data <- data.frame(
name = c("alice", "bob", "charlie", "diana", "ethan", "fiona", "george", "hannah"),
score = sample(80:100, 8, replace = TRUE),
stringsAsFactors = FALSE
)
student_data$passed <- ifelse(student_data$score >= 90, "passed", "failed")
student_data$passed <- factor(student_data$passed, levels = c("failed", "passed"))
student_data$study_hours <- sample(5:15, 8, replace = TRUE)
Bar plot
ggplot(student_data, aes(x = name, y = score, fill = passed)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
labs(title = "student scores", x = "student name", y = "score") +
theme_minimal()
Scatter plot
ggplot(student_data, aes(x = study_hours, y = score, color = passed)) +
geom_point(size = 4) +
scale_color_manual(values = c("failed" = "red", "passed" = "blue")) +
labs(title = "scores vs. study hours", x = "study hours", y = "score") +
theme_minimal()
Box plot
ggplot(student_data, aes(x = passed, y = score, fill = passed)) +
geom_boxplot() +
scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
labs(title = "score distribution by pass/fail status", x = "status", y = "score") +
theme_minimal()
Histogram
ggplot(student_data, aes(x = score, fill = passed)) +
geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
scale_fill_manual(values = c("failed" = "red", "passed" = "blue")) +
labs(title = "histogram of scores", x = "score", y = "count") +
theme_minimal()
Line plot (time series)
months <- factor(month.abb[1:8], levels = month.abb[1:8])
study_hours <- c(0, 3, 15, 30, 35, 120, 18, 15)
study_data <- data.frame(month = months, study_hours = study_hours)
ggplot(study_data, aes(x = month, y = study_hours, group = 1)) +
geom_line(linewidth = 1, color = "blue") +
geom_point(color = "red", size = 1) +
labs(title = "monthly study hours", x = "month", y = "study hours") +
theme_minimal()
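The layer-by-layer idea can be made explicit by storing a plot in an object and adding to it one step at a time. The sketch below assumes ggplot2 is installed and uses R's built-in mtcars data rather than the lab's student_data, so it runs on its own.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only; this draws empty axes.
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: add a layer of points.
with_points <- base_plot + geom_point()

# Step 3: add labels and a theme.
with_labels <- with_points +
  labs(title = "fuel economy vs. weight", x = "weight", y = "miles per gallon") +
  theme_minimal()

# Printing the object draws the plot.
with_labels
```

Building a plot in steps like this makes it easier to find which layer causes a problem when a plot does not look as expected.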
Base R Graphs
Base R provides plotting functions without additional packages.
# scatter plot
plot(student_data$study_hours, student_data$score,
main = "scores vs. study hours", xlab = "study hours", ylab = "score",
pch = 19, col = ifelse(student_data$passed == "passed", "blue", "red"))
# histogram
hist(student_data$score, breaks = 5, col = "skyblue",
main = "histogram of student scores", xlab = "scores", border = "white")
# box plot
boxplot(score ~ passed, data = student_data,
main = "score distribution by pass/fail status",
xlab = "status", ylab = "scores", col = c("red", "blue"))
Summary of Today's Lab
This session covered:
- Installing and setting up R and RStudio
- Basic arithmetic operations
- Data structures: vectors, factors, data frames
- Data visualisation with ggplot2 and base R
Where to get help
- Large language models: LLMs are effective coding tutors. Help from LLMs for coding does not constitute a breach of academic integrity in this course. For your final report, cite all sources including LLMs.
- Stack Overflow: outstanding community resource.
- Cross Validated: best place for statistics advice.
- Developer websites: Tidyverse.
- Your tutors and course coordinator.
Recommended reading
- Wickham, H., & Grolemund, G. (2016). R for Data Science. O'Reilly Media. Available online
- Meghan Hall's lecture: https://meghan.rbind.io/talk/neair/
- RStudio learning materials: https://education.rstudio.com/learn/beginner/
- Johannes Karl on getting started in R: https://www.youtube.com/embed/haYxa3vWA28
Appendix A: At-Home Exercises
- Open RStudio.
- Go to Tools > Install Packages....
- Type tidyverse and check Install dependencies.
- Click Install.
- Load with library(tidyverse).
- Go to Tools > Install Packages....
- Type parameters, report.
- Check Install dependencies and click Install.
The causalworkshop package provides simulated data for your research report. Run the following in your R console:
install.packages("remotes")
remotes::install_github("go-bayes/causalworkshop@v0.2.1")
Verify the installation:
library(causalworkshop)
d <- simulate_nzavs_data(n = 100, seed = 2026)
head(d)
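If you are unsure whether the packages from the earlier exercises installed correctly, one way to check is with the base function requireNamespace(). This is a rough sketch rather than a required step; the package names are the ones listed above.

```r
packages_needed <- c("tidyverse", "parameters", "report")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise.
# (As a side effect it loads the package's namespace, but does not attach it.)
is_installed <- vapply(packages_needed, requireNamespace, logical(1), quietly = TRUE)
is_installed

# Names of any packages that still need installing.
packages_needed[!is_installed]
```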
Exercise 3: Basic operations
- Create vector_a <- c(2, 4, 6, 8) and vector_b <- c(1, 3, 5, 7).
- Add them, subtract them, multiply vector_a by 2, divide vector_b by 2.
- Calculate the mean and standard deviation of both.
Exercise 4: Working with data frames
- Create a data frame with columns id (1–4), name, score (88, 92, 85, 95).
- Add a passed column (pass mark = 90).
- Extract name and score of students who passed.
- Explore with summary(), head(), str().
Exercise 5: Logical operations and subsetting
- Subset student_data to find students who scored above the mean.
- Create an attendance vector and add it as a column.
- Subset to select only rows where students were present.
Exercise 6: Cross-tabulation
- Create factor variables fruit and colour.
- Make a data frame and use table() for cross-tabulation.
- Which fruit has the most colour variety?
Exercise 7: Visualisation
- Using student_data, create a bar plot of scores by name.
- Add a title, axis labels, and colour by pass/fail status.
Appendix B: Solutions
Solution 3: Basic operations
vector_a <- c(2, 4, 6, 8)
vector_b <- c(1, 3, 5, 7)
sum_vector <- vector_a + vector_b
diff_vector <- vector_a - vector_b
double_vector_a <- vector_a * 2
half_vector_b <- vector_b / 2
mean(vector_a); sd(vector_a)
mean(vector_b); sd(vector_b)
Solution 4: Working with data frames
student_data <- data.frame(
id = 1:4,
name = c("alice", "bob", "charlie", "diana"),
score = c(88, 92, 85, 95),
stringsAsFactors = FALSE
)
student_data$passed <- student_data$score >= 90
passed_students <- student_data[student_data$passed == TRUE, ]
summary(student_data)
Solution 5: Logical operations and subsetting
mean_score <- mean(student_data$score)
students_above_mean <- student_data[student_data$score > mean_score, ]
attendance <- c("present", "absent", "present", "present")
student_data$attendance <- attendance
present_students <- student_data[student_data$attendance == "present", ]
Solution 6: Cross-tabulation
fruit <- factor(c("apple", "banana", "apple", "orange", "banana"))
colour <- factor(c("red", "yellow", "green", "orange", "green"))
fruit_data <- data.frame(fruit, colour)
table(fruit_data$fruit, fruit_data$colour)
# apple (red, green) and banana (yellow, green) tie for the most colour variety, with two colours each
Solution 7: Visualisation
library(ggplot2)
ggplot(student_data, aes(x = name, y = score, fill = passed)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red")) +
labs(title = "student scores", x = "name", y = "score") +
theme_minimal()
Appendix C: Other Data Types
Arrays and matrices
matrix_1 <- matrix(1:9, nrow = 3)
array_1 <- array(1:12, dim = c(2, 3, 2))
# convert matrix to data frame
df_matrix_1 <- data.frame(matrix_1)
colnames(df_matrix_1) <- c("col_1", "col_2", "col_3")
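Matrices are indexed with [row, column]. The sketch below uses an illustrative matrix m and also shows that matrix() fills values column by column unless byrow = TRUE is set.

```r
m <- matrix(1:9, nrow = 3)  # filled column by column

m[2, 3]  # 8: row 2, column 3
m[, 1]   # 1 2 3: the whole first column
dim(m)   # 3 3: rows, then columns

# Filling row by row gives a different layout.
m_by_row <- matrix(1:9, nrow = 3, byrow = TRUE)
m_by_row[2, 3]  # 6
```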
Lists
my_list <- list(name = "John Doe", age = 30, scores = c(90, 80, 70))
# access elements
my_list$name
my_list[["scores"]]
# modify
my_list$gender <- "Male"
my_list$age <- 31
my_list$scores <- NULL
# lists as function return values
calculate_stats <- function(numbers) {
list(mean = mean(numbers), sum = sum(numbers))
}
results <- calculate_stats(c(1, 2, 3, 4, 5))
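Once a function returns a list, the individual results can be pulled out by name. The sketch below repeats the calculate_stats() definition from above so that it runs on its own.

```r
calculate_stats <- function(numbers) {
  list(mean = mean(numbers), sum = sum(numbers))
}

results <- calculate_stats(c(1, 2, 3, 4, 5))

results$mean      # 3
results[["sum"]]  # 15
names(results)    # "mean" "sum"
str(results)      # compact overview of the list's contents
```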