In this lecture we’ll first introduce you to the ggplot2 package, and its vocabulary, for creating graphs in R. We’ll mostly follow the approach described in the book “R for data science,” which can be found here.
We’ll then turn to data-wrangling using the dplyr package.
Both ggplot2 and dplyr can be found in library(tidyverse).
Step 1. Load tidyverse:
library(tidyverse)
Step 2. Make sure your dataset is loaded. We’ll start with the mpg dataset
#inspect the mpg dataset
head(mpg)
# A tibble: 6 x 11
manufacturer model displ year cyl trans drv cty hwy fl
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
1 audi a4 1.8 1999 4 auto(l… f 18 29 p
2 audi a4 1.8 1999 4 manual… f 21 29 p
3 audi a4 2 2008 4 manual… f 20 31 p
4 audi a4 2 2008 4 auto(a… f 21 30 p
5 audi a4 2.8 1999 6 auto(l… f 16 26 p
6 audi a4 2.8 1999 6 manual… f 18 26 p
# … with 1 more variable: class <chr>
Step 3. Inspect the relationship between highway fuel efficiency and a car’s engine size (which is given by the variable displ).
# Create graph (no title or axis labels yet)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
A basic problem with this graph is that we don’t know what it is representing. To avoid this problem, it is useful to get into the habit of adding titles to your graphs, and also of using informative axis labels. We do this by adding additional layers.
# Create graph and add title
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
labs(title = "Negative relationship between engine displacement and fuel efficiency.") +
xlab("Engine displacement (litres)") +
ylab("Highway miles per gallon")
Let’s walk through the logic of the ggplot2 “grammar”:
First we call the data
# here we are calling up the data
ggplot(data = mpg)
Next, we add a layer of points, by mapping the relevant columns of this dataset to the x and y positions
# Here, we add a layer of points, mapping displ to the x axis and hwy to the y axis
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Then we add the title
# Create graph and add title
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
labs(title = "Negative relationship between engine displacement and fuel efficiency.")
Then we add the labels
# Create graph and add title
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
labs(title ="Negative relationship between engine displacement and fuel efficiency.") +
xlab("Engine displacement (litres)") +
ylab("Highway miles per gallon")
We can change the axis starting positions:
# Create graph and add title
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
labs(title ="Negative relationship between engine displacement and fuel efficiency.") +
xlab("Engine displacement (litres)") +
ylab("Highway miles per gallon") +
expand_limits(x = 0, y = 0)
The generic method for adding layers is as follows:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Here we can use the “color =” option. [1]
# Which cases interest you in this graph?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
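Here’s the shape aesthetic. By default ggplot2 uses only six shapes at a time, so the seventh class (suv) is left unplotted and a warning is shown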
# Which cases interest you in this graph?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
Here’s a size command
# Which cases interest you in this graph?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cty))
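Here’s the fill aesthetic. The default point shape has no fill, so we set shape = 21 (a fillable circle) outside aes() to make the fill visible
# Which cases interest you in this graph?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, fill = cty), shape = 21)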
Here’s the alpha aesthetic. Because we want the same transparency for every point, alpha is set outside aes()
# Which cases interest you in this graph?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), alpha = .1)
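An aesthetic such as alpha can either be mapped to data inside aes() or set to a fixed value outside it, and the two behave differently. The sketch below is only an illustration of that distinction; the object names p_mapped and p_set are our own.

```r
library(ggplot2)

# Mapped: alpha sits inside aes(), so ggplot2 treats 0.1 as data,
# passes it through an alpha scale and draws a legend for it.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = 0.1))

# Set: alpha sits outside aes(), so every point is drawn at 10% opacity
# and no legend is created.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), alpha = 0.1)

# The opacity actually used for drawing can be read from the built layer
unique(layer_data(p_set)$alpha)
```

As a rule of thumb, put something inside aes() when it should vary with a column of the data, and outside when it should be the same for every point.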
Here’s the alpha aesthetic combined with the size aesthetic
# Which cases interest you in this graph?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = cty, size = cty))
We can create multiple graphs using facets
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
facet_wrap(~ class, nrow = 2)
We use facet_grid for graphing the relationship between two variables within the levels of one or two further variables.
Note the difference between these two graphs:
Here the focus is on the relationship between class and the x variable, displacement
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
facet_grid(class ~ .) + theme(legend.position = "none")
Here the focus is on the relationship between class and the y variable, highway mileage
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
facet_grid(. ~ class) + theme(legend.position = "none")
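The formula passed to facet_grid reads as rows ~ columns, with a dot meaning "no faceting in this direction". As a hedged sketch, the same kind of layout can also be spelled out with the rows and cols arguments; the choice of drv and cyl here is ours, purely for illustration.

```r
library(ggplot2)

# Rows are split by drive train (drv), columns by number of cylinders (cyl).
# This is the same layout as facet_grid(drv ~ cyl).
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(rows = vars(drv), cols = vars(cyl))

p_grid
```

Unlike facet_wrap, facet_grid draws a panel for every combination of the two variables, including combinations with no cars in them.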
We can focus on the relationship between class and the x and y variables simultaneously. Here we add the `year` indicator, and we do not see much of an improvement in highway mileage for the different classes, adjusting for displacement:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
facet_grid(class ~ year) + theme(legend.position = "bottom") +
labs(title ="Negative relationship between engine displacement and fuel efficiency by class.") +
xlab("Engine displacement (litres)") +
ylab("Highway miles per gallon")
# set better theme
theme_set(theme_classic())
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy)) +
labs(title ="Negative relationship between engine displacement and fuel efficiency.") +
xlab("Engine displacement (litres)") +
ylab("Highway miles per gallon")
Add points as a layer
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy)) +
theme(legend.position = "bottom") +
labs(title ="Negative relationship between engine displacement and fuel efficiency.") +
xlab("Engine displacement (litres)") +
ylab("Highway miles per gallon")
We can write this more compactly, by including the mapping with the data layer
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
Then we can include mappings for specific layers
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
We can add a grouping factor e.g. for “drv”, thus creating multiple lines
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, group = drv)) +
geom_point(aes(color = class)) +
geom_smooth()
We can replace the smooths with linear models
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, linetype = drv)) +
geom_point(aes(color = class)) +
geom_smooth(method = "lm")
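The straight lines drawn by geom_smooth(method = "lm") are ordinary least-squares fits. As a rough sketch of what sits behind them, we can fit the model directly with lm(). Note the assumption: the plot above fits one line per drv group, whereas this example fits a single line to all cars, so the numbers will not match any one of the plotted lines.

```r
library(ggplot2)  # only needed here for the mpg data

# Regress highway mileage on engine displacement for all cars at once
fit <- lm(hwy ~ displ, data = mpg)

# Intercept and slope of the fitted line; the slope on displ is negative,
# i.e. larger engines go with lower highway mileage
coef(fit)
```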
First we’ll get the flights data
library(nycflights13)
head(flights)
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Next we’ll create some data frames to help us illustrate points
df <- data.frame(
colour = c("blue", "black", "blue", "blue", "black"), value = 1:5)
head(df)
colour value
1 blue 1
2 black 2
3 blue 3
4 blue 4
5 black 5
Recall our logical operators. These will be essential for data wrangling
knitr::include_graphics("logic.png")
filter: keeps rows matching criteria
Keep only blue rows:
df%>%
filter(colour == "blue")
colour value
1 blue 1
2 blue 3
3 blue 4
Keep values 1 through 4
Another way to do the same
df %>%
filter (value != 5)
colour value
1 blue 1
2 black 2
3 blue 3
4 blue 4
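There are usually several equivalent ways to express the same filter. The following sketch rebuilds the small df from above and keeps values 1 through 4 in three ways; the object names are ours and only serve the comparison.

```r
library(dplyr)

df <- data.frame(
  colour = c("blue", "black", "blue", "blue", "black"), value = 1:5)

by_comparison <- df %>% filter(value <= 4)              # upper bound only
by_membership <- df %>% filter(value %in% 1:4)          # explicit set of values
by_range      <- df %>% filter(between(value, 1, 4))    # inclusive range

# All three keep the same four rows
identical(by_comparison, by_membership)
identical(by_comparison, by_range)
```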
How can we find all flights that left in January?
head(flights)
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
flights%>%
dplyr::filter(month ==1)
# A tibble: 27,004 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 26,994 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Flights whose departure was delayed by more than 15 minutes but that still arrived on time
flights%>%
dplyr::filter (dep_delay >15 & arr_delay <=0)
# A tibble: 4,314 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 1025 951 34 1258
2 2013 1 1 1033 1017 16 1130
3 2013 1 1 2052 2029 23 2349
4 2013 1 1 2107 2040 27 2354
5 2013 1 2 727 645 42 1024
6 2013 1 2 1004 945 19 1251
7 2013 1 2 1031 1015 16 1135
8 2013 1 2 1500 1430 30 1741
9 2013 1 2 1737 1720 17 1908
10 2013 1 2 1831 1815 16 2130
# … with 4,304 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
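Two details of filter are worth seeing on a small example: conditions separated by commas are combined with "and", and rows where the condition evaluates to NA are dropped. The toy_flights tibble below is made up for illustration and is not taken from the flights data.

```r
library(dplyr)

# Hypothetical delays; NA stands for a missing record
toy_flights <- tibble(
  dep_delay = c(20, NA, 5, 30),
  arr_delay = c(-3, 1, -2, NA)
)

# Conditions joined with & ...
with_and <- toy_flights %>% filter(dep_delay > 15 & arr_delay <= 0)

# ... or separated by commas give the same rows
with_comma <- toy_flights %>% filter(dep_delay > 15, arr_delay <= 0)

# Only the first row survives: the last row left late, but its arrival
# delay is NA, so the condition is NA and the row is dropped
with_and
```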
select: picks columns by column name
Select the colour column
df%>%
dplyr::select ( colour )
colour
1 blue
2 black
3 blue
4 blue
5 black
Another way?
df%>%
dplyr::select ( !value )
colour
1 blue
2 black
3 blue
4 blue
5 black
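When a dataset has many columns, naming each one gets tedious, and select offers some shortcuts. The sketch below uses a one-row toy tibble of our own, with column names borrowed from flights, to show three of them.

```r
library(dplyr)

# Hypothetical single row, for illustration only
toy_flights <- tibble(
  year = 2013L, month = 1L, day = 1L,
  dep_delay = 2, arr_delay = 11, carrier = "UA"
)

date_cols  <- toy_flights %>% select(year:day)            # a run of adjacent columns
delay_cols <- toy_flights %>% select(ends_with("delay"))  # names ending in "delay"
no_carrier <- toy_flights %>% select(-carrier)            # everything except carrier

names(date_cols)
names(delay_cols)
names(no_carrier)
```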
arrange: reorders rows
df %>%
arrange(value)
colour value
1 blue 1
2 black 2
3 blue 3
4 blue 4
5 black 5
df %>%
arrange(desc(value))
colour value
1 black 5
2 blue 4
3 blue 3
4 black 2
5 blue 1
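One thing to keep in mind before sorting real data: arrange always puts missing values at the end, whether the order is ascending or descending. A minimal sketch with a made-up tibble:

```r
library(dplyr)

toy <- tibble(x = c(2, NA, 1))

ascending  <- toy %>% arrange(x)        # 1, 2, NA
descending <- toy %>% arrange(desc(x))  # 2, 1, NA (NA still last)

ascending
descending
```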
Task: how would we order flights by departure date and time?
flights %>%
arrange(month, day, dep_time)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Task: which flights have the greatest difference between departure delay and arrival delay?
flights%>%
arrange(desc(dep_delay - arr_delay))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 6 13 1907 1512 235 2134
2 2013 2 26 1000 900 60 1513
3 2013 2 23 1226 900 206 1746
4 2013 5 13 1917 1900 17 2149
5 2013 2 27 924 900 24 1448
6 2013 7 14 1917 1829 48 2109
7 2013 7 17 2004 1930 34 2224
8 2013 12 27 1719 1648 31 1956
9 2013 5 2 1947 1949 -2 2209
10 2013 11 13 2024 2015 9 2251
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Note that this could be written more briefly, without the pipe, as:
arrange(flights, desc(dep_delay - arr_delay))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 6 13 1907 1512 235 2134
2 2013 2 26 1000 900 60 1513
3 2013 2 23 1226 900 206 1746
4 2013 5 13 1917 1900 17 2149
5 2013 2 27 924 900 24 1448
6 2013 7 14 1917 1829 48 2109
7 2013 7 17 2004 1930 34 2224
8 2013 12 27 1719 1648 31 1956
9 2013 5 2 1947 1949 -2 2209
10 2013 11 13 2024 2015 9 2251
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
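The pipe simply passes whatever is on its left as the first argument of the function on its right. As a small sketch on the df from earlier (rebuilt here so the example runs on its own), a two-step pipeline and its nested equivalent give identical results:

```r
library(dplyr)

df <- data.frame(
  colour = c("blue", "black", "blue", "blue", "black"), value = 1:5)

# Read left to right: take df, keep blue rows, then sort by value (descending)
piped <- df %>%
  filter(colour == "blue") %>%
  arrange(desc(value))

# The same steps written inside-out, without the pipe
nested <- arrange(filter(df, colour == "blue"), desc(value))

identical(piped, nested)
```

With a single step the nested form is shorter, but once several steps are chained the piped form is usually easier to read.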
mutate: adds new variables
df %>%
mutate(double_value = 2 * value)
colour value double_value
1 blue 1 2
2 black 2 4
3 blue 3 6
4 blue 4 8
5 black 5 10
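Several variables can be created in one mutate call, and later ones may refer to earlier ones. A short sketch, where quadruple_value is a name we introduce only for illustration:

```r
library(dplyr)

df <- data.frame(
  colour = c("blue", "black", "blue", "blue", "black"), value = 1:5)

df_more <- df %>%
  mutate(
    double_value    = 2 * value,
    quadruple_value = 2 * double_value  # uses the column created one line above
  )

df_more
```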
Order flights by the greatest difference between departure delay and arrival delay:
flights %>%
mutate(diff_dep_arr = dep_delay - arr_delay)%>%
select(flight,diff_dep_arr)%>%
arrange(desc(diff_dep_arr))
# A tibble: 336,776 x 2
flight diff_dep_arr
<int> <dbl>
1 4377 109
2 51 87
3 51 80
4 1465 79
5 51 76
6 673 74
7 1532 74
8 1284 73
9 612 73
10 427 72
# … with 336,766 more rows
summarise: reduces variables to values
Sum all values in the df dataset
df %>%
summarise (total = sum(value))
total
1 15
Summarise the values by colour group, and give the number of items per colour group
df %>%
group_by(colour) %>%
summarise(total = sum(value),
n = n())
# A tibble: 2 x 3
colour total n
<chr> <int> <int>
1 black 7 2
2 blue 8 3
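Counting rows per group is so common that dplyr has a shortcut for it. The sketch below compares the group_by plus summarise version with count on the same df; the object names are ours.

```r
library(dplyr)

df <- data.frame(
  colour = c("blue", "black", "blue", "blue", "black"), value = 1:5)

# Long form: group, then count the rows in each group
by_hand <- df %>%
  group_by(colour) %>%
  summarise(n = n())

# Shortcut: count() groups, counts and ungroups in one step
by_count <- df %>% count(colour)

by_count
```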
Useful summary functions are:
min(x)
max(x)
mean(x)
n()
n_distinct(x)
sum(x)
sum(x > 10)
mean(x > 10)
sd(x)
var(x)
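The two entries with a comparison inside them rely on TRUE counting as 1 and FALSE as 0, so sum(x > 10) gives the number of values above 10 and mean(x > 10) gives their proportion. A small sketch with made-up numbers:

```r
library(dplyr)

toy <- tibble(x = c(5, 12, 20, 8, 12))

toy_summary <- toy %>%
  summarise(
    n            = n(),            # number of rows: 5
    n_unique     = n_distinct(x),  # distinct values: 4 (12 appears twice)
    n_over_10    = sum(x > 10),    # how many values exceed 10: 3
    prop_over_10 = mean(x > 10)    # what share exceed 10: 0.6
  )

toy_summary
```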
Task: how many flights flew on Christmas?
head(flights)
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
flights %>%
filter( month == 12, day == 25)%>%
summarise (n = n())
# A tibble: 1 x 1
n
<int>
1 719
Calculate the average departure delay:
flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
delay
<dbl>
1 12.6
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
delay
<dbl>
1 12.6
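The na.rm = TRUE argument matters because a single missing value makes the mean itself missing. A minimal sketch with a made-up vector:

```r
# Hypothetical delays with one missing value
delays <- c(2, NA, 10)

mean_with_na    <- mean(delays)                # NA: the missing value propagates
mean_without_na <- mean(delays, na.rm = TRUE)  # 6: computed from 2 and 10 only

mean_with_na
mean_without_na
```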
Here we:
Group flights by destination.
Summarise to compute distance, average delay, and number of flights.
Remove Honolulu airport, because it is so far away
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL")
head(delays)
# A tibble: 6 x 4
dest count dist delay
<chr> <int> <dbl> <dbl>
1 ABQ 254 1826 4.38
2 ACK 265 199 4.85
3 ALB 439 143 14.4
4 ANC 8 3370 -2.5
5 ATL 17215 757. 11.3
6 AUS 2439 1514. 6.02
flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay)) %>% # not cancelled
group_by(tailnum) %>% # group by unique aircraft
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10) +
labs(title = "Variation in average delay by tail number")
Suppose you only want to keep your newly created variables. In this case you can use transmute:
new_flights <-transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
head(new_flights)
# A tibble: 6 x 3
gain hours gain_per_hour
<dbl> <dbl> <dbl>
1 -9 3.78 -2.38
2 -16 3.78 -4.23
3 -31 2.67 -11.6
4 17 3.05 5.57
5 19 1.93 9.83
6 -16 2.5 -6.4
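As far as we know, recent dplyr releases mark transmute as superseded and suggest mutate with .keep = "none" instead. The sketch below, on a toy tibble with made-up values, assumes a dplyr version that has the .keep argument (1.0.0 or later); for brand-new columns like these the two calls should give the same result.

```r
library(dplyr)

# Hypothetical values, for illustration only
toy_flights <- tibble(
  dep_delay = c(2, 4, 2),
  arr_delay = c(11, 20, 33),
  air_time  = c(227, 227, 160)
)

via_transmute <- transmute(toy_flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

# Same computations, keeping none of the original columns
via_mutate <- mutate(toy_flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours,
  .keep = "none"
)

names(via_mutate)
```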
To learn more, go to https://dplyr.tidyverse.org/
[1] Removing the axis titles and labels here just to keep the code compact.
Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. Source code is available at https://go-bayes.github.io/psych-447/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Bulbulia (2021, March 9). Psych 447: Visualisation. Retrieved from https://vuw-psych-447.netlify.app/posts/3_1/
BibTeX citation
@misc{bulbulia2021visualisation, author = {Bulbulia, Joseph}, title = {Psych 447: Visualisation}, url = {https://vuw-psych-447.netlify.app/posts/3_1/}, year = {2021} }