dplyr: a grammar of data manipulation

Overview - Unary verbs

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

select() picks variables based on their names (it’s a filter on columns of the data frame)
filter() picks cases based on their values (it’s a filter on rows of the data frame)
mutate() adds new variables that are typically functions of existing variables
arrange() changes the ordering of the rows
group_by() partitions rows of data into groups defined by the values of some variables
summarise() reduces multiple values down to a single summary, typically used in combination with group_by()

Overview - Binary verbs

In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.

In dplyr, there are three families of verbs that work with two tables at a time:

mutating joins, which add new variables to one table from matching rows in another
filtering joins, which filter observations from one table based on whether or not they match an observation in the other table
set operations, which combine the observations in the data sets as if they were set elements

New York flights dataset

The nycflights13 dataset contains information about all flights that departed from New York City in 2013 and related metadata (planes, airports, airlines and weather conditions).

Type the name of the table in the help to read about the meaning of the columns.

Here is an Entity-Relationship schema for the dataset:

library("nycflights13")

The Star Wars dataset

All the Star Wars data you’ve ever wanted: Planets, Spaceships, Vehicles, People, Films and Species. From all SEVEN Star Wars films

The Star Wars API, or SWAPI, is the world’s first quantified and programmatically-accessible data source for all the data from the Star Wars canon universe!

A fraction of the data source is contained in the dataset starwars in the dplyr package:

library("dplyr")

Pipes

Pipes are a powerful tool for clearly expressing a sequence of multiple operations.

Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head

Intermediate steps

foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)

Too many unimportant names!

Overwrite the original

foo_foo <- hop(foo_foo, through = forest)
foo_foo <- scoop(foo_foo, up = field_mice)
foo_foo <- bop(foo_foo, on = head)

Debugging is painful!

Function composition

bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ), 
  on = head
)

Hard for a human to consume!

Use the pipe

foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mouse) %>%
  bop(on = head)

This is our favorite form, because it focuses on verbs, not nouns!

Unary verbs

filter

Verb filter() picks cases based on their values (it’s a filter on rows of the data frame).

starwars %>% 
  filter(species == "Droid")

## # A tibble: 6 × 14
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
## 4 IG-88     200   140 none       metal       red               15 none  masculi…
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
## 6 BB8        NA    NA none       none        black             NA none  masculi…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

starwars %>% 
  filter(species == "Droid" & eye_color == "red")

## # A tibble: 3 × 14
##   name  height  mass hair_color skin_color  eye_color birth_year sex   gender   
##   <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>    
## 1 R2-D2     96    32 <NA>       white, blue red               33 none  masculine
## 2 R5-D4     97    32 <NA>       white, red  red               NA none  masculine
## 3 IG-88    200   140 none       metal       red               15 none  masculine
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

# or
starwars %>% 
  filter(species == "Droid", eye_color == "red")

## # A tibble: 3 × 14
##   name  height  mass hair_color skin_color  eye_color birth_year sex   gender   
##   <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>    
## 1 R2-D2     96    32 <NA>       white, blue red               33 none  masculine
## 2 R5-D4     97    32 <NA>       white, red  red               NA none  masculine
## 3 IG-88    200   140 none       metal       red               15 none  masculine
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

starwars %>% 
  filter(species == "Droid" | species == "Wookiee")

## # A tibble: 8 × 14
##   name      height  mass hair_color skin_color  eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
## 1 C-3PO        167    75 <NA>       gold        yellow           112 none  mascu…
## 2 R2-D2         96    32 <NA>       white, blue red               33 none  mascu…
## 3 R5-D4         97    32 <NA>       white, red  red               NA none  mascu…
## 4 Chewbacca    228   112 brown      unknown     blue             200 male  mascu…
## 5 IG-88        200   140 none       metal       red               15 none  mascu…
## 6 R4-P17        96    NA none       silver, red red, blue         NA none  femin…
## 7 Tarfful      234   136 brown      brown       blue              NA male  mascu…
## 8 BB8           NA    NA none       none        black             NA none  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

starwars %>% 
  filter(species == "Droid", !is.na(birth_year))

## # A tibble: 3 × 14
##   name  height  mass hair_color skin_color  eye_color birth_year sex   gender   
##   <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>    
## 1 C-3PO    167    75 <NA>       gold        yellow           112 none  masculine
## 2 R2-D2     96    32 <NA>       white, blue red               33 none  masculine
## 3 IG-88    200   140 none       metal       red               15 none  masculine
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

select

Verb select() picks variables based on their names (it’s a filter on columns of the data frame).

starwars %>% 
  select(name, gender, birth_year)

## # A tibble: 87 × 3
##    name               gender    birth_year
##    <chr>              <chr>          <dbl>
##  1 Luke Skywalker     masculine       19  
##  2 C-3PO              masculine      112  
##  3 R2-D2              masculine       33  
##  4 Darth Vader        masculine       41.9
##  5 Leia Organa        feminine        19  
##  6 Owen Lars          masculine       52  
##  7 Beru Whitesun lars feminine        47  
##  8 R5-D4              masculine       NA  
##  9 Biggs Darklighter  masculine       24  
## 10 Obi-Wan Kenobi     masculine       57  
## # … with 77 more rows

starwars %>% 
  select(name, ends_with("color"))

## # A tibble: 87 × 4
##    name               hair_color    skin_color  eye_color
##    <chr>              <chr>         <chr>       <chr>    
##  1 Luke Skywalker     blond         fair        blue     
##  2 C-3PO              <NA>          gold        yellow   
##  3 R2-D2              <NA>          white, blue red      
##  4 Darth Vader        none          white       yellow   
##  5 Leia Organa        brown         light       brown    
##  6 Owen Lars          brown, grey   light       blue     
##  7 Beru Whitesun lars brown         light       blue     
##  8 R5-D4              <NA>          white, red  red      
##  9 Biggs Darklighter  black         light       brown    
## 10 Obi-Wan Kenobi     auburn, white fair        blue-gray
## # … with 77 more rows

starwars %>% 
  select(-contains("color"))

## # A tibble: 87 × 11
##    name    height  mass birth_year sex   gender homeworld species films vehicles
##    <chr>    <int> <dbl>      <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>  
##  1 Luke S…    172    77       19   male  mascu… Tatooine  Human   <chr… <chr [2…
##  2 C-3PO      167    75      112   none  mascu… Tatooine  Droid   <chr… <chr [0…
##  3 R2-D2       96    32       33   none  mascu… Naboo     Droid   <chr… <chr [0…
##  4 Darth …    202   136       41.9 male  mascu… Tatooine  Human   <chr… <chr [0…
##  5 Leia O…    150    49       19   fema… femin… Alderaan  Human   <chr… <chr [1…
##  6 Owen L…    178   120       52   male  mascu… Tatooine  Human   <chr… <chr [0…
##  7 Beru W…    165    75       47   fema… femin… Tatooine  Human   <chr… <chr [0…
##  8 R5-D4       97    32       NA   none  mascu… Tatooine  Droid   <chr… <chr [0…
##  9 Biggs …    183    84       24   male  mascu… Tatooine  Human   <chr… <chr [0…
## 10 Obi-Wa…    182    77       57   male  mascu… Stewjon   Human   <chr… <chr [1…
## # … with 77 more rows, and 1 more variable: starships <list>

Play

The name of all characters with blue eyes and blond hair

Solution

starwars %>% 
  filter(hair_color == "blond", eye_color == "blue") %>% 
  select(name)

## # A tibble: 3 × 1
##   name            
##   <chr>           
## 1 Luke Skywalker  
## 2 Anakin Skywalker
## 3 Finis Valorum

Play

The film starred by Luke Skywalker

Solution

starwars %>% 
  filter(name == "Luke Skywalker") %>% 
  select(films)

## # A tibble: 1 × 1
##   films    
##   <list>   
## 1 <chr [5]>

starwars %>% 
  filter(name == "Luke Skywalker") %>% 
  pull(films) %>% 
  unlist()

## [1] "The Empire Strikes Back" "Revenge of the Sith"    
## [3] "Return of the Jedi"      "A New Hope"             
## [5] "The Force Awakens"

starwars %>% 
  select(name)

## # A tibble: 87 × 1
##    name              
##    <chr>             
##  1 Luke Skywalker    
##  2 C-3PO             
##  3 R2-D2             
##  4 Darth Vader       
##  5 Leia Organa       
##  6 Owen Lars         
##  7 Beru Whitesun lars
##  8 R5-D4             
##  9 Biggs Darklighter 
## 10 Obi-Wan Kenobi    
## # … with 77 more rows

starwars %>% 
  pull(name)

##  [1] "Luke Skywalker"        "C-3PO"                 "R2-D2"                
##  [4] "Darth Vader"           "Leia Organa"           "Owen Lars"            
##  [7] "Beru Whitesun lars"    "R5-D4"                 "Biggs Darklighter"    
## [10] "Obi-Wan Kenobi"        "Anakin Skywalker"      "Wilhuff Tarkin"       
## [13] "Chewbacca"             "Han Solo"              "Greedo"               
## [16] "Jabba Desilijic Tiure" "Wedge Antilles"        "Jek Tono Porkins"     
## [19] "Yoda"                  "Palpatine"             "Boba Fett"            
## [22] "IG-88"                 "Bossk"                 "Lando Calrissian"     
## [25] "Lobot"                 "Ackbar"                "Mon Mothma"           
## [28] "Arvel Crynyd"          "Wicket Systri Warrick" "Nien Nunb"            
## [31] "Qui-Gon Jinn"          "Nute Gunray"           "Finis Valorum"        
## [34] "Jar Jar Binks"         "Roos Tarpals"          "Rugor Nass"           
## [37] "Ric Olié"              "Watto"                 "Sebulba"              
## [40] "Quarsh Panaka"         "Shmi Skywalker"        "Darth Maul"           
## [43] "Bib Fortuna"           "Ayla Secura"           "Dud Bolt"             
## [46] "Gasgano"               "Ben Quadinaros"        "Mace Windu"           
## [49] "Ki-Adi-Mundi"          "Kit Fisto"             "Eeth Koth"            
## [52] "Adi Gallia"            "Saesee Tiin"           "Yarael Poof"          
## [55] "Plo Koon"              "Mas Amedda"            "Gregar Typho"         
## [58] "Cordé"                 "Cliegg Lars"           "Poggle the Lesser"    
## [61] "Luminara Unduli"       "Barriss Offee"         "Dormé"                
## [64] "Dooku"                 "Bail Prestor Organa"   "Jango Fett"           
## [67] "Zam Wesell"            "Dexter Jettster"       "Lama Su"              
## [70] "Taun We"               "Jocasta Nu"            "Ratts Tyerell"        
## [73] "R4-P17"                "Wat Tambor"            "San Hill"             
## [76] "Shaak Ti"              "Grievous"              "Tarfful"              
## [79] "Raymus Antilles"       "Sly Moore"             "Tion Medon"           
## [82] "Finn"                  "Rey"                   "Poe Dameron"          
## [85] "BB8"                   "Captain Phasma"        "Padmé Amidala"

Play

flights on Christmas
flights that have a valid delay either on departure or on arrival
flights that have a valid delay on departure but not on arrival

Solution

# flights on Christmas
filter(flights, month == 12 & day == 25)

## # A tibble: 719 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    25      456            500        -4      649            651
##  2  2013    12    25      524            515         9      805            814
##  3  2013    12    25      542            540         2      832            850
##  4  2013    12    25      546            550        -4     1022           1027
##  5  2013    12    25      556            600        -4      730            745
##  6  2013    12    25      557            600        -3      743            752
##  7  2013    12    25      557            600        -3      818            831
##  8  2013    12    25      559            600        -1      855            856
##  9  2013    12    25      559            600        -1      849            855
## 10  2013    12    25      600            600         0      850            846
## # ℹ 709 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

# flights that have a valid delay 
# either on departure or on arrival
filter(flights, !is.na(dep_delay) | !is.na(arr_delay))

## # A tibble: 328,521 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 328,511 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

# flights that have a valid delay on departure and not on arrival
filter(flights, !is.na(dep_delay) & is.na(arr_delay))

## # A tibble: 1,175 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     1525           1530        -5     1934           1805
##  2  2013     1     1     1528           1459        29     2002           1647
##  3  2013     1     1     1740           1745        -5     2158           2020
##  4  2013     1     1     1807           1738        29     2251           2103
##  5  2013     1     1     1939           1840        59       29           2151
##  6  2013     1     1     1952           1930        22     2358           2207
##  7  2013     1     1     2016           1930        46       NA           2220
##  8  2013     1     2      905            822        43     1313           1045
##  9  2013     1     2     1125            925       120     1445           1146
## 10  2013     1     2     1848           1840         8     2333           2151
## # ℹ 1,165 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

mutate

Verb mutate() adds new variables that are typically functions of existing variables.

starwars %>% 
  mutate(bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)

## # A tibble: 87 × 4
##    name               height  mass   bmi
##    <chr>               <int> <dbl> <dbl>
##  1 Luke Skywalker        172    77  26.0
##  2 C-3PO                 167    75  26.9
##  3 R2-D2                  96    32  34.7
##  4 Darth Vader           202   136  33.3
##  5 Leia Organa           150    49  21.8
##  6 Owen Lars             178   120  37.9
##  7 Beru Whitesun lars    165    75  27.5
##  8 R5-D4                  97    32  34.0
##  9 Biggs Darklighter     183    84  25.1
## 10 Obi-Wan Kenobi        182    77  23.2
## # … with 77 more rows

Play

If a table lacks a natural primary key (a set of attributes that identify observations), it’s sometimes useful to add one that simply contains the row number. This is called a surrogate key.

This is the case of flights table. Add an unique attribute id to the flights table.

Solution

flights = 
  flights %>% 
  mutate(id = 1:nrow(flights)) %>%
  select(id, everything())

flights

## # A tibble: 336,776 × 20
##       id  year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1     1  2013     1     1      517            515         2      830
##  2     2  2013     1     1      533            529         4      850
##  3     3  2013     1     1      542            540         2      923
##  4     4  2013     1     1      544            545        -1     1004
##  5     5  2013     1     1      554            600        -6      812
##  6     6  2013     1     1      554            558        -4      740
##  7     7  2013     1     1      555            600        -5      913
##  8     8  2013     1     1      557            600        -3      709
##  9     9  2013     1     1      557            600        -3      838
## 10    10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

Play

Add a catchup variable that contains the catch up time of the flight defined as the departure delay minus the arrival delay; then select id and delays of flights that catched up during the flight

Solution

# flights that catched up during the flight sorted by catch up time
flights %>% 
  mutate(catchup = dep_delay - arr_delay) %>%
  select(id, dep_delay, arr_delay, catchup) %>%
  filter(catchup > 0)

## # A tibble: 221,565 × 4
##       id dep_delay arr_delay catchup
##    <int>     <dbl>     <dbl>   <dbl>
##  1     4        -1       -18      17
##  2     5        -6       -25      19
##  3     8        -3       -14      11
##  4     9        -3        -8       5
##  5    12        -2        -3       1
##  6    14        -2       -14      12
##  7    16         0        -4       4
##  8    17        -1        -8       7
##  9    18         0        -7       7
## 10    20         1        -6       7
## # … with 221,555 more rows

arrange

Verb arrange() changes the ordering of the rows.

starwars %>% 
  arrange(mass) %>% 
  select(name, mass, height)

## # A tibble: 87 × 3
##    name                   mass height
##    <chr>                 <dbl>  <int>
##  1 Ratts Tyerell            15     79
##  2 Yoda                     17     66
##  3 Wicket Systri Warrick    20     88
##  4 R2-D2                    32     96
##  5 R5-D4                    32     97
##  6 Sebulba                  40    112
##  7 Dud Bolt                 45     94
##  8 Padmé Amidala            45    165
##  9 Wat Tambor               48    193
## 10 Sly Moore                48    178
## # … with 77 more rows

starwars %>% 
  arrange(desc(mass)) %>% 
  select(name, mass, height)

## # A tibble: 87 × 3
##    name                   mass height
##    <chr>                 <dbl>  <int>
##  1 Jabba Desilijic Tiure  1358    175
##  2 Grievous                159    216
##  3 IG-88                   140    200
##  4 Darth Vader             136    202
##  5 Tarfful                 136    234
##  6 Owen Lars               120    178
##  7 Bossk                   113    190
##  8 Chewbacca               112    228
##  9 Jek Tono Porkins        110    180
## 10 Dexter Jettster         102    198
## # … with 77 more rows

starwars %>% 
  arrange(-mass, -height) %>% 
  select(name, mass, height)

## # A tibble: 87 × 3
##    name                   mass height
##    <chr>                 <dbl>  <int>
##  1 Jabba Desilijic Tiure  1358    175
##  2 Grievous                159    216
##  3 IG-88                   140    200
##  4 Tarfful                 136    234
##  5 Darth Vader             136    202
##  6 Owen Lars               120    178
##  7 Bossk                   113    190
##  8 Chewbacca               112    228
##  9 Jek Tono Porkins        110    180
## 10 Dexter Jettster         102    198
## # … with 77 more rows

Play

The name and birth year of all human characters living in Tatooine sorted by birth year

Solution

starwars %>% 
  filter(species == "Human" & homeworld == "Tatooine") %>% 
  select(name, birth_year) %>% 
  arrange(birth_year)

## # A tibble: 8 × 2
##   name               birth_year
##   <chr>                   <dbl>
## 1 Luke Skywalker           19  
## 2 Biggs Darklighter        24  
## 3 Darth Vader              41.9
## 4 Anakin Skywalker         41.9
## 5 Beru Whitesun lars       47  
## 6 Owen Lars                52  
## 7 Shmi Skywalker           72  
## 8 Cliegg Lars              82

Play

The characters sorted in decreasing order of popularity (the popularity is the number of films starred by a characted)

Hint: use sapply function.

Solution

starwars %>% 
  mutate(popularity = sapply(films, length)) %>% 
  arrange(desc(popularity)) %>% 
  select(name, popularity)

## # A tibble: 87 × 2
##    name           popularity
##    <chr>               <int>
##  1 R2-D2                   7
##  2 C-3PO                   6
##  3 Obi-Wan Kenobi          6
##  4 Luke Skywalker          5
##  5 Leia Organa             5
##  6 Chewbacca               5
##  7 Yoda                    5
##  8 Palpatine               5
##  9 Darth Vader             4
## 10 Han Solo                4
## # … with 77 more rows

# why not like that?
starwars %>% 
  mutate(popularity = length(films)) %>% 
  arrange(desc(popularity)) %>% 
  select(name, popularity)

## # A tibble: 87 × 2
##    name               popularity
##    <chr>                   <int>
##  1 Luke Skywalker             87
##  2 C-3PO                      87
##  3 R2-D2                      87
##  4 Darth Vader                87
##  5 Leia Organa                87
##  6 Owen Lars                  87
##  7 Beru Whitesun lars         87
##  8 R5-D4                      87
##  9 Biggs Darklighter          87
## 10 Obi-Wan Kenobi             87
## # … with 77 more rows

group by and summarise

verb group_by() partitions rows of data into groups defined by the values of some variables
verb summarise() reduces multiple values down to a single summary
they are typically used in combination.

group by and summarise

starwars %>%
  group_by(species) %>%
  summarise(count = n()) %>% 
  arrange(-count)

## # A tibble: 38 × 2
##    species  count
##    <chr>    <int>
##  1 Human       35
##  2 Droid        6
##  3 <NA>         4
##  4 Gungan       3
##  5 Kaminoan     2
##  6 Mirialan     2
##  7 Twi'lek      2
##  8 Wookiee      2
##  9 Zabrak       2
## 10 Aleena       1
## # … with 28 more rows

starwars %>%
  count(species, sort = TRUE)

## # A tibble: 38 × 2
##    species      n
##    <chr>    <int>
##  1 Human       35
##  2 Droid        6
##  3 <NA>         4
##  4 Gungan       3
##  5 Kaminoan     2
##  6 Mirialan     2
##  7 Twi'lek      2
##  8 Wookiee      2
##  9 Zabrak       2
## 10 Aleena       1
## # … with 28 more rows

# group species by cardinality and average mass. Notice some averages are NA, why?
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass))

## # A tibble: 38 × 3
##    species       n  mass
##    <chr>     <int> <dbl>
##  1 Aleena        1    15
##  2 Besalisk      1   102
##  3 Cerean        1    82
##  4 Chagrian      1    NA
##  5 Clawdite      1    55
##  6 Droid         6    NA
##  7 Dug           1    40
##  8 Ewok          1    20
##  9 Geonosian     1    80
## 10 Gungan        3    NA
## # ℹ 28 more rows

# average on valid values only. Notice the NaN value for Chagrian species. Why?
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE))

## # A tibble: 38 × 3
##    species       n  mass
##    <chr>     <int> <dbl>
##  1 Aleena        1  15  
##  2 Besalisk      1 102  
##  3 Cerean        1  82  
##  4 Chagrian      1 NaN  
##  5 Clawdite      1  55  
##  6 Droid         6  69.8
##  7 Dug           1  40  
##  8 Ewok          1  20  
##  9 Geonosian     1  80  
## 10 Gungan        3  74  
## # ℹ 28 more rows

# filter valid observations
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(!is.nan(mass)) %>% 
  arrange(desc(mass))

## # A tibble: 32 × 3
##    species          n   mass
##    <chr>        <int>  <dbl>
##  1 Hutt             1 1358  
##  2 Kaleesh          1  159  
##  3 Wookiee          2  124  
##  4 Trandoshan       1  113  
##  5 Besalisk         1  102  
##  6 Neimodian        1   90  
##  7 Kaminoan         2   88  
##  8 Nautolan         1   87  
##  9 Mon Calamari     1   83  
## 10 Human           35   82.8
## # ℹ 22 more rows

Play

The number of characters with a given eye and hair color sorted in decreasing order

Solution

count(starwars, hair_color, eye_color, sort=TRUE)

## # A tibble: 35 × 3
##    hair_color eye_color     n
##    <chr>      <chr>     <int>
##  1 black      brown         9
##  2 brown      brown         9
##  3 none       black         9
##  4 brown      blue          7
##  5 none       orange        7
##  6 none       yellow        6
##  7 blond      blue          3
##  8 none       blue          3
##  9 none       red           3
## 10 black      blue          2
## # … with 25 more rows

starwars %>% 
  group_by(hair_color, eye_color) %>% 
  summarise(n = n()) %>% 
  arrange(-n)

## # A tibble: 35 × 3
## # Groups:   hair_color [13]
##    hair_color eye_color     n
##    <chr>      <chr>     <int>
##  1 black      brown         9
##  2 brown      brown         9
##  3 none       black         9
##  4 brown      blue          7
##  5 none       orange        7
##  6 none       yellow        6
##  7 blond      blue          3
##  8 none       blue          3
##  9 none       red           3
## 10 black      blue          2
## # … with 25 more rows

Play

the number of flights per day
the busy days (with more than 1000 flights)

Solution

# the number of flights per day
group_by(flights, month, day) %>%
  summarise(count = n())

## # A tibble: 365 × 3
## # Groups:   month [12]
##    month   day count
##    <int> <int> <int>
##  1     1     1   842
##  2     1     2   943
##  3     1     3   914
##  4     1     4   915
##  5     1     5   720
##  6     1     6   832
##  7     1     7   933
##  8     1     8   899
##  9     1     9   902
## 10     1    10   932
## # … with 355 more rows

# or
count(flights, month, day)

## # A tibble: 365 × 3
##    month   day     n
##    <int> <int> <int>
##  1     1     1   842
##  2     1     2   943
##  3     1     3   914
##  4     1     4   915
##  5     1     5   720
##  6     1     6   832
##  7     1     7   933
##  8     1     8   899
##  9     1     9   902
## 10     1    10   932
## # … with 355 more rows

# the busy days (with more than 1000 flights)
count(flights, month, day) %>%
  filter(n > 1000)

## # A tibble: 14 × 3
##    month   day     n
##    <int> <int> <int>
##  1     7     8  1004
##  2     7     9  1001
##  3     7    10  1004
##  4     7    11  1006
##  5     7    12  1002
##  6     7    17  1001
##  7     7    18  1003
##  8     7    25  1003
##  9     7    31  1001
## 10     8     7  1001
## 11     8     8  1001
## 12     8    12  1001
## 13    11    27  1014
## 14    12     2  1004

Play

The mean departure delay per day sorted in decreasing order of all flights on busy days of July

Solution

# the mean departure delay per day sorted in 
# decreasing order of all flights on busy days of July 
flights %>% 
  filter(month == 7) %>%
  group_by(day) %>%
  summarise(n = n(), avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  filter(n > 1000) %>%
  arrange(desc(avg_delay))

## # A tibble: 9 × 3
##     day     n avg_delay
##   <int> <int>     <dbl>
## 1    10  1004     52.9 
## 2     8  1004     37.3 
## 3     9  1001     30.7 
## 4    12  1002     25.1 
## 5    11  1006     23.6 
## 6    18  1003     20.6 
## 7    25  1003     19.7 
## 8    17  1001     13.7 
## 9    31  1001      6.28

Binary verbs

Overview

It’s rare that a data analysis involves only a single table of data
In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them
All two-table verbs work similarly: the first two arguments are tables to combine and the output is always a new table

Set operations

These expect the input tables x and y to have the same variables, and treat the observations like sets, hence the resulting table have unique observations:

intersect(x, y): return observations in both x and y
union(x, y): return observations in either x or y
setdiff(x, y): return observations in x, but not in y

Set operations

df1 = tibble(x = c(1, 2), y = c("a", "a"))
df2 = tibble(x = c(1, 1), y = c("a", "b"))

df1

## # A tibble: 2 × 2
##       x y    
##   <dbl> <chr>
## 1     1 a    
## 2     2 a

df2

## # A tibble: 2 × 2
##       x y    
##   <dbl> <chr>
## 1     1 a    
## 2     1 b

intersect(df1, df2)

## # A tibble: 1 × 2
##       x y    
##   <dbl> <chr>
## 1     1 a

union(df1, df2)

## # A tibble: 3 × 2
##       x y    
##   <dbl> <chr>
## 1     1 a    
## 2     2 a    
## 3     1 b

setdiff(df1, df2)

## # A tibble: 1 × 2
##       x y    
##   <dbl> <chr>
## 1     2 a

setdiff(df2, df1)

## # A tibble: 1 × 2
##       x y    
##   <dbl> <chr>
## 1     1 b

Joins

mutating joins add new variables to one table from matching rows in another
- inner join includes observations that match in both tables
- outer join (left, right, full) includes also observations that do not match in one of the tables
filtering joins filter observations from one table from matching rows in another
- semi-join filter observations from one table based on whether they match an observation in the other table
- anti-join filter observations from one table based on whether they do not match an observation in the other table

Inner join

Inner join only includes observations that match in both tables.

Inner join

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

inner_join(x, y)

## # A tibble: 2 × 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2

Left join

Left join includes observations that match in both tables (like inner join) plus the observations of the left table that find no match in the right table.

Left join

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

left_join(x, y)

## # A tibble: 3 × 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     3 x3    <NA>

Right join

Right join includes observations that match in both tables (like inner join) plus the observations of the right table that find no match in the left table.

Right join

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

right_join(x, y)

## # A tibble: 3 × 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     4 <NA>  y3

Full join

Full join includes observations that match in both tables (like inner join) plus the observations of the left table that find no match in the right table and those in the right table that find no match in the left table.

Full join

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

full_join(x, y)

## # A tibble: 4 × 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     3 x3    <NA> 
## 4     4 <NA>  y3

Semi-join

Semi-join keeps all observations in the first table that have a match in the second table.

Semi-join

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

semi_join(x, y)

## # A tibble: 2 × 2
##     key val_x
##   <dbl> <chr>
## 1     1 x1   
## 2     2 x2

Anti-join

Anti-join keeps all observations in the first table that have no match in the second table.

Anti-join

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

anti_join(x, y)

## # A tibble: 1 × 2
##     key val_x
##   <dbl> <chr>
## 1     3 x3

Joining NYC flights

A natural join is a join on common attributes of tables.

# Drop unimportant variables so it's easier 
# to understand the join results
flights2 = flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)

flights2

## # A tibble: 336,776 × 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # … with 336,766 more rows

airlines

## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

flights2 %>% 
  left_join(airlines)

## # A tibble: 336,776 × 9
##     year month   day  hour origin dest  tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

left_join(flights2, airlines)

## # A tibble: 336,776 × 9
##     year month   day  hour origin dest  tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

Joining NYC flights

Both tables flights and planes have year columns, but they mean different things, so we don’t want to join them.

flights2

## # A tibble: 336,776 × 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ℹ 336,766 more rows

planes

## # A tibble: 3,322 × 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## # ℹ 3,312 more rows

flights2 %>% 
  left_join(planes, join_by(tailnum))

## # A tibble: 336,776 × 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type            
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult…
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult…
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult…
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult…
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult…
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult…
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult…
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult…
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult…
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
## # ℹ 336,766 more rows
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## #   seats <int>, speed <int>, engine <chr>

Note that the year columns in the output are disambiguated with a suffix.

Joining NYC flights

Each flight has an origin and destination airport, so we need to specify which one we want to join to:

flights2

## # A tibble: 336,776 × 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # … with 336,766 more rows

airports

## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # … with 1,448 more rows

flights2 %>% 
  left_join(airports, join_by(dest == faa))

## # A tibble: 336,776 × 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Georg…  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Georg…  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami…  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts…  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chica…  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort …  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi…  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan…  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica…  42.0 -87.9   668
## # ℹ 336,766 more rows
## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

flights2 %>% 
  left_join(airports, join_by(origin == faa))

## # A tibble: 336,776 × 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Newar…  40.7 -74.2    18
##  2  2013     1     1     5 LGA    IAH   N24211  UA      La Gu…  40.8 -73.9    22
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      John …  40.6 -73.8    13
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      John …  40.6 -73.8    13
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      La Gu…  40.8 -73.9    22
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Newar…  40.7 -74.2    18
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Newar…  40.7 -74.2    18
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      La Gu…  40.8 -73.9    22
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      John …  40.6 -73.8    13
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      La Gu…  40.8 -73.9    22
## # ℹ 336,766 more rows
## # ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

Joining NYC flights

Use semi-join to filter observations of the first table that match in the second table. For instance:

Which are the flights to the top-10 popular destinations?

top_dest = flights %>%
  count(dest, sort = TRUE) %>%
  head(10) # select first 10 rows

top_dest

## # A tibble: 10 × 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705

semi_join(flights2, top_dest)

## # A tibble: 141,145 × 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 JFK    MIA   N619AA  AA     
##  2  2013     1     1     6 LGA    ATL   N668DN  DL     
##  3  2013     1     1     5 EWR    ORD   N39463  UA     
##  4  2013     1     1     6 EWR    FLL   N516JB  B6     
##  5  2013     1     1     6 JFK    MCO   N593JB  B6     
##  6  2013     1     1     6 LGA    ORD   N3ALAA  AA     
##  7  2013     1     1     6 JFK    LAX   N29129  UA     
##  8  2013     1     1     6 EWR    SFO   N53441  UA     
##  9  2013     1     1     5 JFK    BOS   N708JB  B6     
## 10  2013     1     1     6 LGA    FLL   N595JB  B6     
## # … with 141,135 more rows

Joining NYC flights

Use anti-join to check foreign key integrity.

For example, the attribute tailnum in flights table refers to (is a foreign key of) the same attribute in table planes.

The foreign key constraint claims that:

Each valid foreign key value (tailnum in flights) must correspond to a value in the referenced table (tailnum in planes).

flights %>% 
  anti_join(planes, join_by(tailnum)) %>% 
  count(tailnum, sort = TRUE)

## # A tibble: 722 × 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # ℹ 712 more rows

Mind that in case foreign key integrity is violated, left join and inner join produce different outputs:

flights2 %>% 
  left_join(planes, join_by(tailnum))

## # A tibble: 336,776 × 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type            
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult…
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult…
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult…
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult…
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult…
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult…
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult…
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult…
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult…
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
## # ℹ 336,766 more rows
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## #   seats <int>, speed <int>, engine <chr>

flights2 %>% 
  inner_join(planes, join_by(tailnum))

## # A tibble: 284,170 × 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type            
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult…
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult…
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult…
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult…
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult…
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult…
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult…
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult…
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult…
## 10   2013     1     1     6 JFK    PBI   N793JB  B6        2011 Fixed wing mult…
## # ℹ 284,160 more rows
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## #   seats <int>, speed <int>, engine <chr>

Play

Flights that flew with a plane manufactured by BOEING

Solution

# flights that flew with a plane manufactured by BOEING
flights %>% 
  inner_join(planes, join_by(tailnum)) %>%
  select(id, tailnum, manufacturer) %>%
  filter(manufacturer == "BOEING")

## # A tibble: 82,912 × 3
##       id tailnum manufacturer
##    <int> <chr>   <chr>       
##  1     1 N14228  BOEING      
##  2     2 N24211  BOEING      
##  3     3 N619AA  BOEING      
##  4     5 N668DN  BOEING      
##  5     6 N39463  BOEING      
##  6    13 N29129  BOEING      
##  7    14 N53441  BOEING      
##  8    17 N76515  BOEING      
##  9    23 N633AA  BOEING      
## 10    24 N3739P  BOEING      
## # ℹ 82,902 more rows

# more efficient
flights %>% 
  inner_join(filter(planes, manufacturer == "BOEING"), join_by(tailnum)) %>%
  select(id, tailnum, manufacturer)

## # A tibble: 82,912 × 3
##       id tailnum manufacturer
##    <int> <chr>   <chr>       
##  1     1 N14228  BOEING      
##  2     2 N24211  BOEING      
##  3     3 N619AA  BOEING      
##  4     5 N668DN  BOEING      
##  5     6 N39463  BOEING      
##  6    13 N29129  BOEING      
##  7    14 N53441  BOEING      
##  8    17 N76515  BOEING      
##  9    23 N633AA  BOEING      
## 10    24 N3739P  BOEING      
## # ℹ 82,902 more rows

library(microbenchmark)

first = function(flights, planes) {
  flights %>% 
  inner_join(planes, join_by(tailnum)) %>%
  select(id, tailnum, manufacturer) %>%
  filter(manufacturer == "BOEING") 
}

second = function(flights, planes) {
  flights %>% 
  inner_join(filter(planes, manufacturer == "BOEING"), join_by(tailnum)) %>%
  select(id, tailnum, manufacturer)
}

# times in ms
microbenchmark(first(flights, planes), 
               second(flights, planes), 
               times = 10)

## Unit: milliseconds
##                     expr      min       lq     mean   median       uq       max
##   first(flights, planes) 80.98152 85.73728 97.77476 90.12195 96.20821 136.27755
##  second(flights, planes) 34.28921 35.68643 41.60185 41.09479 46.06985  54.54973
##  neval
##     10
##     10

Play

Flights that flew to a destination with an altitude greater than 6000 feet sorted by altitude

Solution

# flights that flew to a destination with an altitude 
# greater than 6000 feet sorted by altitude
flights %>% 
  inner_join(filter(airports, alt > 6000), join_by(dest == faa)) %>%
  select(id, dest, name, alt) %>%
  arrange(alt)

## # A tibble: 253 × 4
##        id dest  name                   alt
##     <int> <chr> <chr>                <dbl>
##  1    153 JAC   Jackson Hole Airport  6451
##  2   1068 JAC   Jackson Hole Airport  6451
##  3 100067 JAC   Jackson Hole Airport  6451
##  4 101062 JAC   Jackson Hole Airport  6451
##  5 101913 JAC   Jackson Hole Airport  6451
##  6 102007 JAC   Jackson Hole Airport  6451
##  7 102805 JAC   Jackson Hole Airport  6451
##  8 103800 JAC   Jackson Hole Airport  6451
##  9 104658 JAC   Jackson Hole Airport  6451
## 10 105490 JAC   Jackson Hole Airport  6451
## # ℹ 243 more rows

Play

Flights that took off with a plane with 4 engines and a visibility lower than 3 miles

Solution

# flights that took off with a plane with 
# 4 engines and a visibility lower than 3 miles
flights %>%
  inner_join(filter(weather, visib < 3), join_by(origin, time_hour)) %>%
  inner_join(filter(planes, engines == 4), join_by(tailnum)) %>%
  select(id, engines, visib)

## # A tibble: 11 × 3
##        id engines visib
##     <int>   <int> <dbl>
##  1  10066       4  2.5 
##  2  11278       4  0.25
##  3  25295       4  0   
##  4 120234       4  0.06
##  5 120640       4  0.12
##  6 120734       4  0.06
##  7 182566       4  0.12
##  8 210774       4  1.25
##  9 213173       4  0.25
## 10 227971       4  1   
## 11 228191       4  2

Play

Flights with destination and origin airports with an altitude difference of more than 6000 feet

Solution

# flights with destination and origin airports with 
# an altitude difference of more than 6000 feet
flights %>% 
  inner_join(airports, join_by(origin == faa)) %>%
  inner_join(airports, join_by(dest == faa)) %>%
  select(id, alt.x, alt.y) %>%
  mutate(altdelta = abs(alt.y - alt.x)) %>%
  filter(altdelta > 6000) %>% 
  arrange(id)

## # A tibble: 253 × 4
##       id alt.x alt.y altdelta
##    <int> <dbl> <dbl>    <dbl>
##  1   153    18  6451     6433
##  2   194    18  6540     6522
##  3   571    13  6540     6527
##  4  1068    18  6451     6433
##  5  1095    18  6540     6522
##  6  1492    13  6540     6527
##  7  2052    18  6540     6522
##  8  2509    13  6540     6527
##  9  2975    18  6540     6522
## 10  3477    13  6540     6527
## # ℹ 243 more rows

# Alternative: use cross-join
cross_join(select(airports, faa, alt), select(airports, faa, alt)) %>% 
  mutate(altdelta = abs(alt.y - alt.x)) %>%
  filter(altdelta > 6000) %>%
  inner_join(select(flights, id, origin, dest), join_by(faa.x == origin, faa.y == dest))  %>%
  select(id, alt.x, alt.y, altdelta) %>% 
  arrange(id)

## # A tibble: 253 × 4
##       id alt.x alt.y altdelta
##    <int> <dbl> <dbl>    <dbl>
##  1   153    18  6451     6433
##  2   194    18  6540     6522
##  3   571    13  6540     6527
##  4  1068    18  6451     6433
##  5  1095    18  6540     6522
##  6  1492    13  6540     6527
##  7  2052    18  6540     6522
##  8  2509    13  6540     6527
##  9  2975    18  6540     6522
## 10  3477    13  6540     6527
## # ℹ 243 more rows

# which is faster?
library(microbenchmark)

first = function(flights, airports) {
  flights %>% 
  inner_join(airports, join_by(origin == faa)) %>%
  inner_join(airports, join_by(dest == faa)) %>%
  select(id, alt.x, alt.y) %>%
  mutate(altdelta = abs(alt.y - alt.x)) %>%
  filter(altdelta > 6000) %>% 
  arrange(id)
}

second = function(flights, airports) {
  cross_join(select(airports, faa, alt), select(airports, faa, alt)) %>% 
  mutate(altdelta = abs(alt.y - alt.x)) %>%
  filter(altdelta > 6000) %>%
  inner_join(select(flights, id, origin, dest), join_by(faa.x == origin, faa.y == dest))  %>%
  select(id, alt.x, alt.y, altdelta) %>% 
  arrange(id)
}

# times in ms
microbenchmark(first(flights, airports), 
               second(flights, airports), 
               times = 10)

## Unit: milliseconds
##                       expr      min      lq     mean   median       uq      max
##   first(flights, airports) 154.6796 183.633 216.7170 219.7652 238.2998 282.0651
##  second(flights, airports) 122.5329 160.119 174.7753 164.6483 190.3577 229.3254
##  neval
##     10
##     10

Play

Check that attribute tailnum is a key for table planes, that is it identifies the table observations.

Solution

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)

## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>

Play

check that the foreign key constraint from attribute dest of table flights to attribute faa of table airports does not hold
set to NA the attribute dest of the rows of flights that do not match an airport
do not remove the rows of flights that do not match an airport since they contain other useful information
then check the foreign key constraint again

Solution

flights %>%
  anti_join(airports, join_by(dest == faa)) %>%
  count(dest, sort = TRUE)

## # A tibble: 4 × 2
##   dest      n
##   <chr> <int>
## 1 SJU    5819
## 2 BQN     896
## 3 STT     522
## 4 PSE     365

id_set = 
  flights %>% 
  anti_join(airports, join_by(dest == faa)) %>% 
  pull(id)

flightsValid = 
  flights %>% 
  mutate(dest = ifelse(id %in% id_set, NA, dest))

flightsValid %>%
  anti_join(airports, join_by(dest == faa)) %>%
  count(dest, sort = TRUE)

## # A tibble: 1 × 2
##   dest      n
##   <chr> <int>
## 1 <NA>   7602

Dig deeper

dplyr cheat sheet
data transformation with dplyr
joins with dplyr
tidy your datasets with tidyr
query a database with dplyr using dbplyr
query Apache Spark cluster computing platform with dplyr using sparklyr