In this trailer we will use the nycflights13 dataset. It contains information about all flights that departed from New York City (e.g. EWR, JFK and LGA) in 2013 and related metadata (planes, airports, airlines and weather conditions). It includes the following entities:

The entities are associated with the following relationships:

Entity-Relationship model

The Entity-Relationship model (ER model) is a conceptual model of data. As such it describes at high level the reality modeled regardless of how the data will be logically and physically represented.

The ER model defines a conceptual data schema. A schema describes the structure or form of the data, but does not describe the actual data, that is the instances or occurrences of the schema. For example, the ER schema may say that the plane entity has the manufacturer and model attributes but does not say what the actual values of these attributes are. Compared to object-oriented programming, the schema corresponds to the notion of class and an instance of the schema is an object of the class.

An ER schema contains the following components:

Both entities and relationships can have atomic properties called attributes. A key of an entity is one or more attributes that identify unambiguously the instances of the entity.

Here is an ER schema for the nycflights13 dataset:

Relational model

The relational model consists of two main components:

The relational model is based on the concept of relation (not to be confused with the conceptual relationship of the ER model), whose representation is a table (or dataframe in R’s jargon). The concept of relation is formal and comes from set theory, a part of mathematics. The concept of table is intuitive and is used in various contexts outside databases. The success of the relational model lies precisely in the conjunction of a formal concept, the relation, which allowed the development of a theory of relational databases with results of practical impact, with an intuitive concept, the table, which made it possible the relational model also to end users without any mathematical notion.

The following properties characterize the relational mode:

  1. a relation consists of rows (also called tuples) and columns (also called attributes);
  2. each attribute of the relation is associated with a domain, that is a set of values it can take. Attributes can have missing values, that are unknown or non-existent values;
  3. every tuple of the relation is distinct from the others. Every relation must have a primary key, that is a set of attributes that uniquely identify every tuple of the relation;
  4. information is typically distributed on different interrelated tables. Attributes that reference information contained in another table are called foreign keys. The non-null values of a foreign key must match values of the referenced attribute.

Here is the relational model of the nycflights13 dataset. Primary key attributes are underlined, foreign key references are represented as R(A) –> S(B), meaning that A in table R references B in table S:

flight(id, month, day, hour, origin, destination, tailnum, carrier, time_hour)
plane(tailnum)
airline(carrier)
airport(code)
weather(time_hour, airport)
flight(tailnum) –> plane(tailnum)
flight(carrier) –> airline(carrier)
flight(origin) –> airport(code)
flight(destination) –> airport(code)
flight(time_hour, origin) –> weather(time_hour, airport)
weather(airport) –> airport(code)

Relational algebra

The relational algebra is a collection of operators; each operator takes as input either a single relation or a pair of relations and outputs a single relation as its result. A relational query is a composition of a finite number of relational operators. In that sense a query is procedural, since it specifies the order in which the operators comprising the query are to be evaluated.

The relational algebra contains the following operators. Set operators union \(\cup\), intersection \(\cap\) and set difference \(\setminus\) have the usual semantics.

The projection operator \[\pi(T, A_1, \ldots A_n)\] selects columns \(A_1, \ldots A_n\) from table \(T\). Consider the table:

planes
## # A tibble: 3,322 x 9
##    tailnum  year type       manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>      <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed win… EMBRAER       EMB-1…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  3 N103US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  5 N10575   2002 Fixed win… EMBRAER       EMB-1…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  7 N107US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## # … with 3,312 more rows

The projection:

\[ \pi(planes, \{tailnum, manufacturer\}) \]

is the following table:

## # A tibble: 3,322 x 2
##    tailnum manufacturer    
##    <chr>   <chr>           
##  1 N10156  EMBRAER         
##  2 N102UW  AIRBUS INDUSTRIE
##  3 N103US  AIRBUS INDUSTRIE
##  4 N104UW  AIRBUS INDUSTRIE
##  5 N10575  EMBRAER         
##  6 N105UW  AIRBUS INDUSTRIE
##  7 N107US  AIRBUS INDUSTRIE
##  8 N108UW  AIRBUS INDUSTRIE
##  9 N109UW  AIRBUS INDUSTRIE
## 10 N110UW  AIRBUS INDUSTRIE
## # … with 3,312 more rows

The selection operator \[\sigma(T, \varphi)\] selects rows of \(T\) that satisfy formula \(\varphi\). Consider the table:

planes
## # A tibble: 3,322 x 9
##    tailnum  year type       manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>      <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed win… EMBRAER       EMB-1…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  3 N103US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  5 N10575   2002 Fixed win… EMBRAER       EMB-1…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  7 N107US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## # … with 3,312 more rows

The selection: \[ \sigma(planes, manufacturer = EMBRAER \wedge engines = 2 ) \]

is the following table:

## # A tibble: 299 x 9
##    tailnum  year type        manufacturer model  engines seats speed engine
##    <chr>   <int> <chr>       <chr>        <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  2 N10575   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  3 N11106   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  4 N11107   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  5 N11109   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  6 N11113   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  7 N11119   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  8 N11121   2003 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  9 N11127   2003 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
## 10 N11137   2003 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
## # … with 289 more rows

The join operator \[T_1 \Join T_2\] of tables \(T_1\) and \(T_2\) having the attributes \(X\) in common concatenates every tuple of \(T_1\) with every tuple of \(T_2\) and selects those having the same \(X\)-values.

Consider the relations:

airflights
## # A tibble: 336,776 x 7
##    month   day flight origin dest  tailnum carrier
##    <int> <int>  <int> <chr>  <chr> <chr>   <chr>  
##  1     1     1   1545 EWR    IAH   N14228  UA     
##  2     1     1   1714 LGA    IAH   N24211  UA     
##  3     1     1   1141 JFK    MIA   N619AA  AA     
##  4     1     1    725 JFK    BQN   N804JB  B6     
##  5     1     1    461 LGA    ATL   N668DN  DL     
##  6     1     1   1696 EWR    ORD   N39463  UA     
##  7     1     1    507 EWR    FLL   N516JB  B6     
##  8     1     1   5708 LGA    IAD   N829AS  EV     
##  9     1     1     79 JFK    MCO   N593JB  B6     
## 10     1     1    301 LGA    ORD   N3ALAA  AA     
## # … with 336,766 more rows
airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

The join \[airflights \Join airlines\] is the relation:

## Joining, by = "carrier"
## # A tibble: 336,776 x 8
##    month   day flight origin dest  tailnum carrier name                    
##    <int> <int>  <int> <chr>  <chr> <chr>   <chr>   <chr>                   
##  1     1     1   1545 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2     1     1   1714 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3     1     1   1141 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4     1     1    725 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5     1     1    461 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6     1     1   1696 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7     1     1    507 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8     1     1   5708 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9     1     1     79 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10     1     1    301 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

The renaming operator \[\rho(T, B = A)\] renames attribute A of table T with B. Consider the table:

aircrafts
## # A tibble: 3,322 x 3
##    tailnum engine    engines
##    <chr>   <chr>       <int>
##  1 N10156  Turbo-fan       2
##  2 N102UW  Turbo-fan       2
##  3 N103US  Turbo-fan       2
##  4 N104UW  Turbo-fan       2
##  5 N10575  Turbo-fan       2
##  6 N105UW  Turbo-fan       2
##  7 N107US  Turbo-fan       2
##  8 N108UW  Turbo-fan       2
##  9 N109UW  Turbo-fan       2
## 10 N110UW  Turbo-fan       2
## # … with 3,312 more rows

The rename \[\rho(aircrafts, engineType = engine) \]

## # A tibble: 3,322 x 3
##    tailnum engineType engines
##    <chr>   <chr>        <int>
##  1 N10156  Turbo-fan        2
##  2 N102UW  Turbo-fan        2
##  3 N103US  Turbo-fan        2
##  4 N104UW  Turbo-fan        2
##  5 N10575  Turbo-fan        2
##  6 N105UW  Turbo-fan        2
##  7 N107US  Turbo-fan        2
##  8 N108UW  Turbo-fan        2
##  9 N109UW  Turbo-fan        2
## 10 N110UW  Turbo-fan        2
## # … with 3,312 more rows

Query: SQL

Structured Query Language (SQL) is a language that allows the definition (creation of schema and integrity constraints), the manipulation (insert, update and delete of data) and the query (data retrieval) of a relational database. Here we are interested in the query part of the language. It contains the following operators

The projection:

\[ \pi(planes, \{tailnum, manufacturer\}) \]

is the following SQL query:

select tailnum, manufacturer 
from planes
## # A tibble: 3,322 x 2
##    tailnum manufacturer    
##    <chr>   <chr>           
##  1 N10156  EMBRAER         
##  2 N102UW  AIRBUS INDUSTRIE
##  3 N103US  AIRBUS INDUSTRIE
##  4 N104UW  AIRBUS INDUSTRIE
##  5 N10575  EMBRAER         
##  6 N105UW  AIRBUS INDUSTRIE
##  7 N107US  AIRBUS INDUSTRIE
##  8 N108UW  AIRBUS INDUSTRIE
##  9 N109UW  AIRBUS INDUSTRIE
## 10 N110UW  AIRBUS INDUSTRIE
## # … with 3,312 more rows

The selection: \[ \sigma(planes, manufacturer = EMBRAER \wedge engines = 2 ) \]

is the following SQL query:

select *
from planes
where manufacturer = "EMBRAER" and engines = 2
## # A tibble: 299 x 9
##    tailnum  year type        manufacturer model  engines seats speed engine
##    <chr>   <int> <chr>       <chr>        <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  2 N10575   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  3 N11106   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  4 N11107   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  5 N11109   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  6 N11113   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  7 N11119   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  8 N11121   2003 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  9 N11127   2003 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
## 10 N11137   2003 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
## # … with 289 more rows

The join \[airflights \Join airlines\] is the following SQL query:

select *
from airflights join airlines on airflights.carrier = airlines.carrier
## Joining, by = "carrier"
## # A tibble: 336,776 x 8
##    month   day flight origin dest  tailnum carrier name                    
##    <int> <int>  <int> <chr>  <chr> <chr>   <chr>   <chr>                   
##  1     1     1   1545 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2     1     1   1714 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3     1     1   1141 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4     1     1    725 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5     1     1    461 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6     1     1   1696 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7     1     1    507 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8     1     1   5708 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9     1     1     79 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10     1     1    301 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

SQL adds an operator order by to sort rows of tables. For instance the following query order flights in decreasing order with respect to departure delay:

select flight, dep_delay
from flights 
order by dep_delay desc
## # A tibble: 336,776 x 2
##    flight dep_delay
##     <int>     <dbl>
##  1     51      1301
##  2   3535      1137
##  3   3695      1126
##  4    177      1014
##  5   3075      1005
##  6   2391       960
##  7   2119       911
##  8   2007       899
##  9   2047       898
## 10    172       896
## # … with 336,766 more rows

You can introduce new attributes using the as operator. In the following we define catchup as the difference between departure and arrival delays:

select flight, (dep_delay - arr_delay) as catchup
from flights
## # A tibble: 336,776 x 2
##    flight catchup
##     <int>   <dbl>
##  1   1545      -9
##  2   1714     -16
##  3   1141     -31
##  4    725      17
##  5    461      19
##  6   1696     -16
##  7    507     -24
##  8   5708      11
##  9     79       5
## 10    301     -10
## # … with 336,766 more rows

Finally SQL has an operator group by to partition rows of a table into groups by common values of some attributes. We can use summary functions like count, sum and mean on the table groups. For instance, the following query groups flights by day and month and then counts the number of flights in each group:

select month, day, count(*) as count
from flights
group by month, day
## # A tibble: 365 x 3
## # Groups:   month [12]
##    month   day count
##    <int> <int> <int>
##  1     1     1   842
##  2     1     2   943
##  3     1     3   914
##  4     1     4   915
##  5     1     5   720
##  6     1     6   832
##  7     1     7   933
##  8     1     8   899
##  9     1     9   902
## 10     1    10   932
## # … with 355 more rows

Notice the operator as which creates a new attribute called count.

Query: dplyr

Generally, dplyr is a little easier to use than SQL because dplyr is specialized to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis. There is an immediate correspondence between SQL clauses and dplyr verbs:

The SQL projection:

select tailnum, manufacturer 
from planes

writes as:

select(planes, tailnum, manufacturer)
## # A tibble: 3,322 x 2
##    tailnum manufacturer    
##    <chr>   <chr>           
##  1 N10156  EMBRAER         
##  2 N102UW  AIRBUS INDUSTRIE
##  3 N103US  AIRBUS INDUSTRIE
##  4 N104UW  AIRBUS INDUSTRIE
##  5 N10575  EMBRAER         
##  6 N105UW  AIRBUS INDUSTRIE
##  7 N107US  AIRBUS INDUSTRIE
##  8 N108UW  AIRBUS INDUSTRIE
##  9 N109UW  AIRBUS INDUSTRIE
## 10 N110UW  AIRBUS INDUSTRIE
## # … with 3,312 more rows

The SQL selection:

select *
from planes
where manufacturer = "EMBRAER" and engines = 2

writes as:

filter(planes, manufacturer == "EMBRAER" & engines == 2)
## # A tibble: 299 x 9
##    tailnum  year type        manufacturer model  engines seats speed engine
##    <chr>   <int> <chr>       <chr>        <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  2 N10575   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  3 N11106   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  4 N11107   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  5 N11109   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  6 N11113   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  7 N11119   2002 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  8 N11121   2003 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
##  9 N11127   2003 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
## 10 N11137   2003 Fixed wing… EMBRAER      EMB-1…       2    55    NA Turbo…
## # … with 289 more rows

The SQL join:

select *
from airflights join airlines on airflights.carrier = airlines.carrier

writes as:

## # A tibble: 336,776 x 8
##    month   day flight origin dest  tailnum carrier name                    
##    <int> <int>  <int> <chr>  <chr> <chr>   <chr>   <chr>                   
##  1     1     1   1545 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2     1     1   1714 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3     1     1   1141 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4     1     1    725 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5     1     1    461 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6     1     1   1696 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7     1     1    507 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8     1     1   5708 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9     1     1     79 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10     1     1    301 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

The SQL order by:

select flight, dep_delay
from flights 
order by dep_delay desc

writes as:

## # A tibble: 336,776 x 2
##    flight dep_delay
##     <int>     <dbl>
##  1     51      1301
##  2   3535      1137
##  3   3695      1126
##  4    177      1014
##  5   3075      1005
##  6   2391       960
##  7   2119       911
##  8   2007       899
##  9   2047       898
## 10    172       896
## # … with 336,766 more rows

The SQL query using as operator:

select flight, (dep_delay - arr_delay) as catchup
from flights

writes as:

# dplyr
select(mutate(flights, catchup = dep_delay - arr_delay), flight, catchup)
## # A tibble: 336,776 x 2
##    flight catchup
##     <int>   <dbl>
##  1   1545      -9
##  2   1714     -16
##  3   1141     -31
##  4    725      17
##  5    461      19
##  6   1696     -16
##  7    507     -24
##  8   5708      11
##  9     79       5
## 10    301     -10
## # … with 336,766 more rows

Finally the SQL group by:

select month, day, count(*) as count
from flights
group by month, day

writes as:

## # A tibble: 365 x 3
## # Groups:   month [12]
##    month   day count
##    <int> <int> <int>
##  1     1     1   842
##  2     1     2   943
##  3     1     3   914
##  4     1     4   915
##  5     1     5   720
##  6     1     6   832
##  7     1     7   933
##  8     1     8   899
##  9     1     9   902
## 10     1    10   932
## # … with 355 more rows

Visualization

A barplot showing the number of flights per hour:

ggplot(data = flights) +
  geom_bar(mapping = aes(x = as.factor(hour)))

A boxplot of the number of flight per day of the week:

daily <- flights %>% 
  mutate(date = make_date(year, month, day)) %>% 
  group_by(date) %>% 
  summarise(n = n())

daily <- daily %>% 
  mutate(wday = wday(date, label = TRUE))

ggplot(data = daily, mapping = aes(wday, n)) + 
  geom_boxplot()

An histogram of the departure delay of flights with at most one hour of delay:

flights60 = filter(flights, dep_delay <=60)

ggplot(data = flights60) +
  geom_histogram(mapping = aes(x = dep_delay))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A scatter plot of arrival versus departure delays:

flightsDecember = filter(flights, month == 12, !is.na(dep_delay), !is.na(arr_delay))

ggplot(data = flightsDecember) +
  geom_point(mapping = aes(x = dep_delay, y = arr_delay))