Visualisation

Joseph Bulbulia (https://josephbulbulia.netlify.app), Victoria University of Wellington (https://www.wgtn.ac.nz)
2021-MAR-9

Data visualisation with ggplot2

Introduction

In this lecture we’ll first introduce you to the ggplot2 package, and its vocabulary, for creating graphs in R. We’ll mostly follow the approach described in the book “R for Data Science,” which is available online at https://r4ds.had.co.nz.

We’ll then turn to data-wrangling using the dplyr package.

Both ggplot2 and dplyr are attached when you run library(tidyverse).

Creating a graph

Step 1, load tidyverse:
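
A minimal sketch of this step, assuming the tidyverse package is already installed (if it is not, run install.packages("tidyverse") once beforehand):

```r
# Load the tidyverse, which attaches ggplot2 and dplyr (among other packages)
library(tidyverse)
```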

Step 2. Make sure your dataset is loaded. We’ll start with the mpg dataset, which ships with ggplot2.

# inspect the mpg dataset
head(mpg)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans   drv     cty   hwy fl   
  <chr>        <chr> <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
1 audi         a4      1.8  1999     4 auto(l… f        18    29 p    
2 audi         a4      1.8  1999     4 manual… f        21    29 p    
3 audi         a4      2    2008     4 manual… f        20    31 p    
4 audi         a4      2    2008     4 auto(a… f        21    30 p    
5 audi         a4      2.8  1999     6 auto(l… f        16    26 p    
6 audi         a4      2.8  1999     6 manual… f        18    26 p    
# … with 1 more variable: class <chr>

Step 3. Inspect the relationship between highway fuel efficiency and a car’s engine size (which is given by the variable displ).

# Create graph
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

A basic problem with this graph is that we don’t know what it is representing. To avoid this problem, it is useful to get into the habit of adding titles to your graphs, and also of using informative axis labels. We do this by adding additional layers.

# Create graph and add title
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  labs(title = "Negative relationship between engine displacement and fuel efficiency.") + 
  xlab("Engine displacement (litres)") + 
  ylab("Highway miles per gallon")

Let’s walk through the logic of the ggplot2 “grammar”:

First we call the data

# here we are calling up the data
ggplot(data = mpg)

Next, we add a layer of points, mapping the relevant columns of this dataset to the x and y axes

# Here we add a layer of points, mapping displ to the x axis and hwy to the y axis
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Then we add the title

# Create graph and add title
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  labs(title =  "Negative relationship between engine displacement and fuel efficiency.")

Then we add the labels

# Create graph, add title and axis labels
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  labs(title = "Negative relationship between engine displacement and fuel efficiency.") + 
  xlab("Engine displacement (litres)") + 
  ylab("Highway miles per gallon")

We can change the axis starting positions:

# Create graph, add title and axis labels, and start both axes at zero
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  labs(title = "Negative relationship between engine displacement and fuel efficiency.") + 
  xlab("Engine displacement (litres)") + 
  ylab("Highway miles per gallon") + 
  expand_limits(x = 0, y = 0)

The generic method for adding layers is as follows:

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Using ggplot2 to highlight elements of interest

Here we can use the “color =” option (see footnote 1).

# Which cases interest you in this graph?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) 

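Here’s the shape aesthetic. Note that ggplot2 uses at most six shapes by default, so one of the seven classes (suv) will not be plotted: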

# Which cases interest you in this graph?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) 

Here’s the size aesthetic

# Which cases interest you in this graph?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = cty)) 

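Here’s the fill aesthetic. Fill only affects point shapes that have an interior (shapes 21 to 25), so we also set shape = 21:

# Which cases interest you in this graph?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, fill = cty), shape = 21) 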

Here’s the alpha (transparency) setting. Because we want the same transparency for every point, we set it outside aes():

# Which cases interest you in this graph?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), alpha = .1) 

Here’s the alpha aesthetic combined with the size aesthetic

# Which cases interest you in this graph?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = cty, size = cty)) 
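
The difference between mapping an aesthetic to a variable (inside aes()) and setting it to a fixed value (outside aes()) can be sketched as follows. The object names and the colour "blue" are arbitrary choices for illustration.

```r
library(ggplot2)

# Mapping: colour varies with the class variable, and a legend is drawn
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Setting: every point is drawn in the same fixed colour, with no legend
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

p_mapped
p_set
```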

Facets

We can create multiple graphs using facets

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) + 
   facet_wrap(~ class, nrow = 2)

We use facet_grid for graphing the relationship between two variables.

Note the difference between these two graphs:

Here the focus is on the relationship between class and the x variable, displacement:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(class ~ .)  + theme(legend.position = "none") 

Here the focus is on the relationship between class and the y variable, highway mileage:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(. ~ class) + theme(legend.position = "none") 

We can focus on the relationship between class and the x and y variables simultaneously. Here we add the year indicator, and we do not see much of an improvement in highway mileage for the different classes, adjusting for displacement:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(class ~ year) + theme(legend.position = "bottom") +
  labs(title ="Negative relationship between engine displacement and fuel efficiency by class.") + 
  xlab("Engine displacement (litres)") + 
  ylab("Highway miles per gallon") 
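
In facet_grid the formula reads rows ~ columns. As a further sketch, the plot below facets by two variables other than the ones used above; the choice of drv and cyl is ours, for illustration only.

```r
library(ggplot2)

# Rows are split by drive type (drv), columns by number of cylinders (cyl)
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

p
```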

Understanding your data through graphs

We can graph the relationship as a smoothed line:

# set a cleaner default theme for the plots that follow
theme_set(theme_classic())
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy)) + 
  labs(title = "Negative relationship between engine displacement and fuel efficiency.") +
  xlab("Engine displacement (litres)") +
  ylab("Highway miles per gallon") 

We can add the points as a layer:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(mapping = aes(x = displ, y = hwy)) +
  theme(legend.position = "bottom") +
  labs(title = "Negative relationship between engine displacement and fuel efficiency.") +
  xlab("Engine displacement (litres)") +
  ylab("Highway miles per gallon") 

We can write this more compactly, by including the mapping with the data layer

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

Then we can include mappings for specific layers

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()

We can add a grouping factor e.g. for “drv”, thus creating multiple lines

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, group = drv)) + 
  geom_point(aes(color = class)) + 
  geom_smooth()

We can replace the smooths with linear models

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, linetype = drv)) + 
  geom_point(aes(color = class)) + 
  geom_smooth(method = "lm")
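
To check the direction of the relationship numerically, one can fit a single linear model of the kind that geom_smooth(method = "lm") draws for each group. This is a minimal sketch using base R's lm() on the whole dataset; the object name fit is ours.

```r
library(ggplot2)  # provides the mpg dataset

# Regress highway fuel efficiency on engine displacement
fit <- lm(hwy ~ displ, data = mpg)

# A negative coefficient for displ means that larger engines tend to
# travel fewer highway miles per gallon
coef(fit)
```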

Transforming data

First we’ll get the flights data, which ships with the nycflights13 package:
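
A minimal sketch of the loading step, assuming the nycflights13 package has been installed; head(flights) produces the preview shown here.

```r
# Assumes nycflights13 is installed, e.g. via install.packages("nycflights13")
library(nycflights13)
library(dplyr)

# Preview the first six rows of the flights data
head(flights)
```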

# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1     1      517            515         2      830
2  2013     1     1      533            529         4      850
3  2013     1     1      542            540         2      923
4  2013     1     1      544            545        -1     1004
5  2013     1     1      554            600        -6      812
6  2013     1     1      554            558        -4      740
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Next we’ll create a small data frame to help us illustrate points

df <- data.frame(
  colour = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)
head(df)
  colour value
1   blue     1
2  black     2
3   blue     3
4   blue     4
5  black     5

Revisiting logical operators

Recall our logical operators. These will be essential for data wrangling.

knitr::include_graphics("logic.png")
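
As a quick refresher, the three most common logical operators (& for AND, | for OR, ! for NOT) can be sketched on the small df data frame. The particular conditions and object names are ours, chosen for illustration.

```r
library(dplyr)

# The same toy data frame used in this section
df <- data.frame(
  colour = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# AND: both conditions must hold (blue rows with a value above 1)
blue_and_big <- df %>% filter(colour == "blue" & value > 1)

# OR: at least one condition must hold (black rows, or the row with value 1)
black_or_one <- df %>% filter(colour == "black" | value == 1)

# NOT: negate a condition (every row that is not blue)
not_blue <- df %>% filter(!(colour == "blue"))

blue_and_big
black_or_one
not_blue
```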

Command filter: keeps rows matching criteria

Keep only blue rows:

df %>%
  filter(colour == "blue")
  colour value
1   blue     1
2   blue     3
3   blue     4

Keep only values 1 and 4:

df %>%
  filter(value %in% c(1, 4))
  colour value
1   blue     1
2   blue     4

Keep values 1 through 4

df %>%
  filter (value %in% c(1:4))
  colour value
1   blue     1
2  black     2
3   blue     3
4   blue     4

Another way to do the same, which works here because 5 is the only other value:

df %>%
  filter (value != 5)
  colour value
1   blue     1
2  black     2
3   blue     3
4   blue     4

Task

How can we find all flights that left in January?

head(flights)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1     1      517            515         2      830
2  2013     1     1      533            529         4      850
3  2013     1     1      542            540         2      923
4  2013     1     1      544            545        -1     1004
5  2013     1     1      554            600        -6      812
6  2013     1     1      554            558        -4      740
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

flights %>%
  dplyr::filter(month == 1)
# A tibble: 27,004 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 26,994 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Flights that departed more than 15 minutes late but still arrived on time:

flights %>%
  dplyr::filter(dep_delay > 15 & arr_delay <= 0)
# A tibble: 4,314 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1     1025            951        34     1258
 2  2013     1     1     1033           1017        16     1130
 3  2013     1     1     2052           2029        23     2349
 4  2013     1     1     2107           2040        27     2354
 5  2013     1     2      727            645        42     1024
 6  2013     1     2     1004            945        19     1251
 7  2013     1     2     1031           1015        16     1135
 8  2013     1     2     1500           1430        30     1741
 9  2013     1     2     1737           1720        17     1908
10  2013     1     2     1831           1815        16     2130
# … with 4,304 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
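
One detail worth keeping in mind: filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA (for example, flights with missing delays) are dropped. A small sketch on made-up data; the data frame toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical delays; NA stands in for a missing record
toy <- data.frame(
  dep_delay = c(20, NA, 5, 30),
  arr_delay = c(-3, NA, 0, 12)
)

# Rows where the condition is NA are silently dropped
kept_default <- toy %>%
  filter(dep_delay > 15 & arr_delay <= 0)

# To keep the rows with missing values as well, ask for them explicitly
kept_with_na <- toy %>%
  filter(is.na(dep_delay) | (dep_delay > 15 & arr_delay <= 0))

kept_default
kept_with_na
```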

Command select: picks columns by column name

Select the colour column:

df %>%
  dplyr::select(colour)
  colour
1   blue
2  black
3   blue
4   blue
5  black

Another way is to drop the value column:

df %>%
  dplyr::select(!value)
  colour
1   blue
2  black
3   blue
4   blue
5  black

Or, equivalently:

df %>%
  dplyr::select(-c(value))
  colour
1   blue
2  black
3   blue
4   blue
5  black
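
With more columns, select() is often combined with helper functions. A brief sketch on the mpg dataset, using starts_with() and a column range; the choice of columns and object names is ours, for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Columns whose names start with "c"
c_columns <- mpg %>% select(starts_with("c"))

# A contiguous range of columns, from manufacturer through displ
first_columns <- mpg %>% select(manufacturer:displ)

names(c_columns)
names(first_columns)
```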

Command arrange reorders rows

df %>%
  arrange(value)
  colour value
1   blue     1
2  black     2
3   blue     3
4   blue     4
5  black     5

In descending order:

df %>%
  arrange(desc(value))
  colour value
1  black     5
2   blue     4
3   blue     3
4  black     2
5   blue     1

Task: how would we order flights by departure date and time?

flights %>%
  arrange(month, day, dep_time)
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Task: which flights have the greatest difference between departure delay and arrival delay?

flights %>%
  arrange(desc(dep_delay - arr_delay))
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     6    13     1907           1512       235     2134
 2  2013     2    26     1000            900        60     1513
 3  2013     2    23     1226            900       206     1746
 4  2013     5    13     1917           1900        17     2149
 5  2013     2    27      924            900        24     1448
 6  2013     7    14     1917           1829        48     2109
 7  2013     7    17     2004           1930        34     2224
 8  2013    12    27     1719           1648        31     1956
 9  2013     5     2     1947           1949        -2     2209
10  2013    11    13     2024           2015         9     2251
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Note that this could be written more briefly, without the pipe:

arrange(flights, desc(dep_delay - arr_delay))
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     6    13     1907           1512       235     2134
 2  2013     2    26     1000            900        60     1513
 3  2013     2    23     1226            900       206     1746
 4  2013     5    13     1917           1900        17     2149
 5  2013     2    27      924            900        24     1448
 6  2013     7    14     1917           1829        48     2109
 7  2013     7    17     2004           1930        34     2224
 8  2013    12    27     1719           1648        31     1956
 9  2013     5     2     1947           1949        -2     2209
10  2013    11    13     2024           2015         9     2251
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
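
The equivalence of the piped and direct forms can be checked on the small df data frame. A minimal sketch; the object names piped and direct are ours.

```r
library(dplyr)

df <- data.frame(
  colour = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# Piped form: df is passed along as the first argument to arrange()
piped <- df %>% arrange(desc(value))

# Direct form: df is written out as the first argument
direct <- arrange(df, desc(value))

# Both forms return the same data frame
identical(piped, direct)
```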

Command mutate: adds new variables

df %>%
  mutate(double_value = 2 * value)
  colour value double_value
1   blue     1            2
2  black     2            4
3   blue     3            6
4   blue     4            8
5  black     5           10

Task: order flights by the greatest difference between departure delay and arrival delay.

flights %>%
  mutate(diff_dep_arr = dep_delay - arr_delay) %>%
  select(flight, diff_dep_arr) %>%
  arrange(desc(diff_dep_arr))
# A tibble: 336,776 x 2
   flight diff_dep_arr
    <int>        <dbl>
 1   4377          109
 2     51           87
 3     51           80
 4   1465           79
 5     51           76
 6    673           74
 7   1532           74
 8   1284           73
 9    612           73
10    427           72
# … with 336,766 more rows

Command summarise: reduces variables to values

Sum all values in the df dataset

df %>%
  summarise (total = sum(value))
  total
1    15

Summarise the values by colour group, and give the number of items in each group

df %>%
  group_by(colour) %>%
  summarise(total = sum(value),
            n = n())
# A tibble: 2 x 3
  colour total     n
  <chr>  <int> <int>
1 black      7     2
2 blue       8     3

Useful summary functions are:
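
As an illustration, here are a few summary functions that are often used inside summarise(), sketched on the small df data frame. The selection and the column names are ours and are not exhaustive.

```r
library(dplyr)

df <- data.frame(
  colour = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

by_colour <- df %>%
  group_by(colour) %>%
  summarise(
    n_rows       = n(),            # number of rows in the group
    mean_value   = mean(value),    # arithmetic mean
    median_value = median(value),  # middle value
    sd_value     = sd(value),      # standard deviation
    min_value    = min(value),     # smallest value
    max_value    = max(value)      # largest value
  )

by_colour
```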

Task: how many flights flew on Christmas Day?

head(flights)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1     1      517            515         2      830
2  2013     1     1      533            529         4      850
3  2013     1     1      542            540         2      923
4  2013     1     1      544            545        -1     1004
5  2013     1     1      554            600        -6      812
6  2013     1     1      554            558        -4      740
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

flights %>%
  filter(month == 12, day == 25) %>%
  summarise(n = n())
# A tibble: 1 x 1
      n
  <int>
1   719

Calculate average delay:

flights %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
  delay
  <dbl>
1  12.6

Equivalently, without the pipe:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
  delay
  <dbl>
1  12.6
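
The na.rm = TRUE argument matters because a single missing value otherwise makes the whole mean missing. A minimal sketch with a made-up vector; the values are hypothetical.

```r
# A small vector with one missing value (hypothetical delays in minutes)
x <- c(2, NA, 10)

mean(x)                # NA: the missing value propagates
mean(x, na.rm = TRUE)  # 6: the mean of the non-missing values 2 and 10
```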

Multiple pipe operators

Here we:

  1. Group flights by destination.

  2. Summarise to compute distance, average delay, and number of flights.

  3. Remove Honolulu airport (HNL), because it is so far away.

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(dest != "HNL")

head(delays)
# A tibble: 6 x 4
  dest  count  dist delay
  <chr> <int> <dbl> <dbl>
1 ABQ     254 1826   4.38
2 ACK     265  199   4.85
3 ALB     439  143  14.4 
4 ANC       8 3370  -2.5 
5 ATL   17215  757. 11.3 
6 AUS    2439 1514.  6.02

We can also pipe a summary directly into ggplot. Here we plot each aircraft’s average arrival delay against its number of flights:

flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>% # not cancelled
  group_by(tailnum) %>% # group by unique aircraft
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) %>%
  ggplot(mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10) + 
  labs(title = "Variation in average delay by tail number")

Other functions

Suppose you only wanted to keep your mutated variables. In this case you can use transmute:

new_flights <- transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
head(new_flights)
# A tibble: 6 x 3
   gain hours gain_per_hour
  <dbl> <dbl>         <dbl>
1    -9  3.78         -2.38
2   -16  3.78         -4.23
3   -31  2.67        -11.6 
4    17  3.05          5.57
5    19  1.93          9.83
6   -16  2.5          -6.4 
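
One way to think about transmute() is as mutate() followed by select() on the new columns. A minimal sketch on the small df data frame; the object names are ours.

```r
library(dplyr)

df <- data.frame(
  colour = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

# transmute() returns only the newly created column
only_new <- df %>% transmute(double_value = 2 * value)

# mutate() followed by select() gives the same columns and values
same_result <- df %>%
  mutate(double_value = 2 * value) %>%
  select(double_value)

all.equal(only_new, same_result)
```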

To learn more, go to https://dplyr.tidyverse.org/


  1. Removing the title and axis labels here just to keep the code compact.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. Source code is available at https://go-bayes.github.io/psych-447/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Bulbulia (2021, March 9). Psych 447: Visualisation. Retrieved from https://vuw-psych-447.netlify.app/posts/3_1/

BibTeX citation

@misc{bulbulia2021visualisation,
  author = {Bulbulia, Joseph},
  title = {Psych 447: Visualisation},
  url = {https://vuw-psych-447.netlify.app/posts/3_1/},
  year = {2021}
}