<- 10 # assigns the value 10 to x
x <- 5 # assigns the value 5 to y y
By the end of this lab you will:
- Have R and R-studio Downloaded on your machine
- Be able to use R for basic analysis and graphing
Things To Remember
- File names and code should be legible
- Learn macros to save time and order your code
- Learning is making mistakes; try first, and then seek help.
Introduction
- Why learn R?
- You’ll need it for your final report.
- Supports your psychology coursework.
- Enhances your coding skills.
Install R
- Visit the comprehensive r archive network (cran) at https://cran.r-project.org/
- Select the version of r suitable for your operating system (windows, mac, or linux)
- Download and install it by following the on-screen instructions
Install RStudio
- Visit rstudio download page at https://www.rstudio.com/products/rstudio/download/
- Choose the free version of rstudio desktop,
- Download it for your operating system
- Install and open
Create new project
file
>new project
- Choose new directory
- Specify the location where the project folder will be created
- Click
create project
Exercise 1: Install tidyverse
- Open rstudio: launch rstudio on your computer
tools
>install packages
- Type
tidyverse
- Click on the install button
- Type
library(tidyverse)
in the console and pressenter
Execut Code
- Use
Ctrl + Enter
(Windows/Linux) orCmd + Enter
(Mac).
Assignment Operator
The assignment operator in R is <-
. This operator assigns values to variables.
Alternative assignment:
= 10
x = 5 y
Comparing values:
10 == 5 # returns FALSE
[1] FALSE
RStudio Assignment Operator Shortcut
- For macOS:
Option
+-
inserts<-
. - For Windows and Linux:
Alt
+-
inserts<-
.
Keyboard Shortcuts
Explore keyboard shortcuts in RStudio through Tools
-> Keyboard Shortcuts Help
.
Concatenation
The c()
function combines multiple elements into a vector.
<- c(1, 2, 3, 4, 5) # a vector of numbers
numbers print(numbers)
[1] 1 2 3 4 5
Arithmetic Operations
Addition and subtraction in R:
<- x + y
sum print(sum)
[1] 15
<- x - y
difference print(difference)
[1] 5
Multiplication and Division
# Scalar operations
<- x * y
product <- x / y
quotient
# vector multiplication and division
<- c(1, 2, 3)
vector1 <- c(4, 5, 6)
vector2
# Vector operations
<- vector1 * vector2
vector_product <- vector1 / vector2 vector_division
Be cautious with division by zero:
<- 10 / 0 # Inf
result <- 0 / 0 # NaN zero_division
Integer division and modulo operation:
<- 10 %/% 3
integer_division <- 10 %% 3 remainder
Logical Operators
Examples of NOT, NOT EQUAL, and EQUAL operations:
<- x != y
x_not_y <- x == 10 x_equal_10
OR and AND operations:
<- c(TRUE, FALSE) | c(FALSE, TRUE)
vector_or <- TRUE || FALSE
single_or
<- c(TRUE, FALSE) & c(FALSE, TRUE)
vector_and <- TRUE && FALSE single_and
Integers
- Whole numbers without decimal points, defined with an
L
suffix
<- 42L
x str(x) # check type
int 42
- Conversion to numeric
<- as.numeric(x)
y str(y)
num 42
Characters
- Text strings enclosed in quotes
<- "alice" name
Factors
- Represent categorical data with limited values
<- factor(c("red", "blue", "green")) colors
Ordered Factors
- Factors with and without inherent order
<- c("high school", "bachelor", "master", "ph.d.")
education_levels <- factor(education_levels, ordered = FALSE)
education_factor_no_order <- factor(education_levels, ordered = TRUE)
education_factor <- factor(education_levels, levels = education_levels, ordered = TRUE) education_ordered_explicit
- Operations with ordered factors
<- ordered("bachelor", levels = education_levels)
edu1 <- ordered("master", levels = education_levels)
edu2 > edu1 # logical comparison edu2
[1] TRUE
- Modifying ordered factors
<- c("primary school", "high school", "bachelor", "master", "ph.d.")
new_levels <- factor(education_levels, levels = new_levels, ordered = TRUE)
education_updated str(education_updated)
Ord.factor w/ 5 levels "primary school"<..: 2 3 4 5
table(education_updated)
education_updated
primary school high school bachelor master ph.d.
0 1 1 1 1
Strings
- Sequences of characters
<- 'world!'
you <- paste("hello,", you)
greeting # hello world
greeting
[1] "hello, world!"
Vectors
- Fundamental data structure in R
<- c(1, 2, 3, 4, 5)
numeric_vector <- c("apple", "banana", "cherry")
character_vector <- c(TRUE, FALSE, TRUE, FALSE) logical_vector
- Manipulating vectors
<- numeric_vector + 10
vector_sum <- numeric_vector * 2
vector_multiplication <- numeric_vector > 3 vector_greater_than_three
table()
Function
- Generates frequency tables for categorical data
table(vector_greater_than_three)
vector_greater_than_three
FALSE TRUE
3 2
Dataframes
- Creating and manipulating data frames
# clear previous `df` object (if any)
rm(df)
<- data.frame(
df name = c("alice", "bob", "charlie"),
age = c(25, 30, 35),
gender = c("female", "male", "male")
)# look at structure
head(df)
name age gender
1 alice 25 female
2 bob 30 male
3 charlie 35 male
str(df)
'data.frame': 3 obs. of 3 variables:
$ name : chr "alice" "bob" "charlie"
$ age : num 25 30 35
$ gender: chr "female" "male" "male"
table(df$gender)
female male
1 2
table(df$age)
25 30 35
1 1 1
table(df$name)
alice bob charlie
1 1 1
Access Data Frame Elements
- By column name and row/column indexing
# by column name
<- df$name
names # by row and column
<- df[2, ]
second_person <- df[, "age"] age_column
Using subset()
Function
- Extracting rows based on conditions
<- subset(df, age > 25)
very_old_people summary(very_old_people$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.00 31.25 32.50 32.50 33.75 35.00
mean(very_old_people$age)
[1] 32.5
min(very_old_people$age)
[1] 30
Explore Data Frames
- Using
head()
,tail()
, andstr()
head(df)
name age gender
1 alice 25 female
2 bob 30 male
3 charlie 35 male
tail(df)
name age gender
1 alice 25 female
2 bob 30 male
3 charlie 35 male
str(df)
'data.frame': 3 obs. of 3 variables:
$ name : chr "alice" "bob" "charlie"
$ age : num 25 30 35
$ gender: chr "female" "male" "male"
Modify Data Frames
- Add and modify columns and rows
# add columns
$employed <- c(TRUE, TRUE, FALSE)
df# add rows
<- data.frame(name = "diana", age = 28, gender = "female", employed = TRUE)
new_person <- rbind(df, new_person)
df # modify values
4, "age"] <- 26
df[ df
name age gender employed
1 alice 25 female TRUE
2 bob 30 male TRUE
3 charlie 35 male FALSE
4 diana 26 female TRUE
rbind()
and cbind()
- Adding rows and columns to data frames
# add rows with `rbind()`
<- data.frame(name = "eve", age = 32, gender = "female", employed = TRUE)
new_person <- rbind(df, new_person)
df # add columns with `cbind()`
<- c("engineer", "doctor", "artist", "teacher", "doctor")
occupation_vector <- cbind(df, occupation_vector)
df df
name age gender employed occupation_vector
1 alice 25 female TRUE engineer
2 bob 30 male TRUE doctor
3 charlie 35 male FALSE artist
4 diana 26 female TRUE teacher
5 eve 32 female TRUE doctor
Data Structure View
- Using
summary()
,str()
,head()
, andtail()
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
tail(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
Statistical Functions
mean()
,sd()
,min()
,max()
, andtable()
# seed for reproducibility
set.seed(12345)
<- rnorm(n = 40, mean = 0, sd = 1)
vector mean(vector) # calculates mean
[1] 0.2401853
sd(vector) # computes standard deviation
[1] 1.038425
min(vector) # finds minimum value
[1] -1.817956
max(vector) # finds maximum value
[1] 2.196834
Introduction to ggplot2
- Visualizing data with
ggplot2
# seed for reproducibility
set.seed(12345)
# ensure ggplot2 is installed and loaded
if (!require(ggplot2)) install.packages("ggplot2")
library(ggplot2)
# simulate student data
<- data.frame(
student_data name = c("alice", "bob", "charlie", "diana", "ethan", "fiona", "george", "hannah"),
score = sample(80:100, 8, replace = TRUE),
stringsasfactors = FALSE
)$passed <- ifelse(student_data$score >= 90, "passed", "failed")
student_data$passed <- factor(student_data$passed, levels = c("failed", "passed"))
student_data$study_hours <- sample(5:15, 8, replace = TRUE) student_data
ggplot2 Barplot: score for each name
ggplot(student_data, aes(x = name, y = score)) +
geom_bar(stat = "identity")
- enhanced bar plot with titles, axis labels, and modified colours
ggplot(student_data, aes(x = name, y = score, fill = passed)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red")) +
labs(title = "student scores", x = "student name", y = "score") +
theme_minimal()
ggplot2 Scatterplot: student scores against study hours
ggplot(student_data, aes(x = study_hours, y = score, color = passed)) +
geom_point(size = 4) +
labs(title = "student scores vs. study hours", x = "study hours", y = "score") +
theme_minimal() +
scale_color_manual(values = c("failed" = "red", "passed" = "blue"))
ggplot2 Boxplot: scores by pass/fail status
ggplot(student_data, aes(x = passed, y = score, fill = passed)) +
geom_boxplot() +
labs(title = "score distribution by pass/fail status", x = "status", y = "score") +
theme_minimal() +
scale_fill_manual(values = c("failed" = "red", "passed" = "blue"))
median (Q2/50th percentile): divides the dataset into two halves.
first quartile (Q1/25th percentile): lower edge indicating that 25% of the data falls below this value.
third quartile (Q3/75th percentile): upper edge of the box represents the third quartile, showing that 75% of the data is below this value.
interquartile range (IQR): height of the box represents the IQR: distance between the first and third quartiles (Q3 - Q1) / middle 50% of the data.
whiskers: The lines extending from the top and bottom of the box (the “whiskers”) indicate the range of the data, typically to the smallest and largest values within 1.5 * IQR from the first and third quartiles, respectively. Points outside this range are often considered outliers and can be plotted individually.
outliers: points that lie beyond the whiskers
ggplot2 Histogram: distribution of scores
ggplot(student_data, aes(x = score, fill = passed)) +
geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
labs(title = "histogram of scores", x = "score", y = "count") +
theme_minimal() +
scale_fill_manual(values = c("failed" = "red", "passed" = "blue"))
ggplot2 Lineplot
# prep data
<- factor(month.abb[1:8], levels = month.abb[1:8])
months <- c(0, 3, 15, 30, 35, 120, 18, 15)
study_hours <- data.frame(month = months, study_hours = study_hours)
study_data
# line plot
ggplot(study_data, aes(x = month, y = study_hours, group = 1)) +
geom_line(linewidth = 1, color = "blue") +
geom_point(color = "red", size = 1) +
labs(title = "monthly study hours", x = "month", y = "study hours") +
theme_minimal()
Base R Scatter Plot: scores vs. study hours
# scatter plot
plot(student_data$study_hours, student_data$score,
main = "scatter plot of scores vs. study hours",
xlab = "study hours", ylab = "score",
pch = 19, col = ifelse(student_data$passed == "passed", "blue", "red"))
Base R Histogram
- histogram to visualise distribution of student scores
# histogram
hist(student_data$score,
breaks = 5,
col = "skyblue",
main = "histogram of student scores",
xlab = "scores",
border = "white")
Base R Boxplots: Distribution by Pass/Fail
# boxplot
boxplot(score ~ passed, data = student_data,
main = "score distribution by pass/fail status",
xlab = "status", ylab = "scores",
col = c("red", "blue"))
median (Q2/50th percentile): divides the dataset into two halves.
first quartile (Q1/25th percentile): lower edge indicating that 25% of the data falls below this value.
third quartile (Q3/75th percentile): upper edge of the box represents the third quartile, showing that 75% of the data is below this value.
interquartile range (IQR): height of the box represents the IQR: distance between the first and third quartiles (Q3 - Q1) / middle 50% of the data.
whiskers: The lines extending from the top and bottom of the box (the “whiskers”) indicate the range of the data, typically to the smallest and largest values within 1.5 * IQR from the first and third quartiles, respectively. Points outside this range are often considered outliers and can be plotted individually.
outliers: points that lie beyond the whiskers
Base R Barplot: Score Distributions
# prep data for the barplot
<- table(student_data$score)
scores_table barplot(scores_table,
main = "Barplot of Scores",
xlab = "Scores",
ylab = "Frequency",
col = "skyblue",
border = "white")
Base R Line Plot
# convert 'month' to a numeric scale for plotting positions
<- 1:length(study_data$month) # Simple numeric sequence
months_num
# Plotting points with suppressed x-axis
plot(months_num, study_data$study_hours,
type = "p", # Points
pch = 19, # Type of point
col = "red",
xlab = "Month",
ylab = "Study Hours",
main = "Monthly Study Hours",
xaxt = "n") # Suppress the x-axis
# add lines between points
lines(months_num, study_data$study_hours,
col = "blue",
lwd = 1) # Line width
# add custom month labels to the x-axis at appropriate positions
axis(1, at = months_num, labels = study_data$month, las=2) # `las=2` makes labels perpendicular to axis
What have your learned?
- You have Base R and R-studio Downloaded on your machine
- You are able able to use R for basic analysis and graphing
- You will need to practice, and will have lots of opporunity.
Where to Get Help
- Large Language Models (LLMs): LLMs are trained on extensive datasets. They are extremely good coding tutors. Open AI’s GPT-4 considerably outperforms GPT-3.5. However GPT 3.5 should be good enough. Gemini has a two-month free trial. LLM’s are rapidly evolving. However, presently, to use these tools, and to spot their errors, you will need to know how to code. Which is fortunate because coding makes you smarter!
Note: you will not be assessed for R-code. Help from LLM’s for coding does not consitute a breach of academic integrity in this course. Your tests are in-class; no LLM’s allowed. For your final report, you will need to cite all sources, and how you used them, including LLMs.
Stack Overflow: an outstanding resource for most problems. Great community.
Cross-validated the best place to go for stats advice. (LLM’s are only safe for standard statistics. They do not perform well for causal inference.)
Developer Websites and GitHub Pages: Tidyverse
Your tutors and course coordinator. We care. We’re here to help you!
References
Wickham, H., & Grolemund, G. (2016). R for Data Science. O’Reilly Media. [Available online](https://r4ds.had.co.nz
A helpful resource for learning R is Megan Hall’s lecture available at: https://meghan.rbind.io/talk/neair/.
RStudio has compiled numerous accessible materials for learning R, which can be found here: https://education.rstudio.com/learn/beginner/.