In this trailer we will use the nycflights13 dataset. It contains information about all flights that departed from New York City (e.g. EWR, JFK and LGA) in 2013 and related metadata (planes, airports, airlines and weather conditions). It includes the following entities:
The entities are associated with the following relationships:
The Entity-Relationship model (ER model) is a conceptual model of data. As such it describes at high level the reality modeled regardless of how the data will be logically and physically represented.
The ER model defines a conceptual data schema. A schema describes the structure or form of the data, but does not describe the actual data, that is the instances or occurrences of the schema. For example, the ER schema may say that the plane entity has the manufacturer and model attributes but does not say what the actual values of these attributes are. Compared to object-oriented programming, the schema corresponds to the notion of class and an instance of the schema is an object of the class.
An ER schema contains the following components:
flies
among flights and planes associating each flight with the corresponding plane (or viceversa).Both entities and relationships can have atomic properties called attributes. A key of an entity is one or more attributes that identify unambiguously the instances of the entity.
Here is an ER schema for the nycflights13 dataset:
The relational model consists of two main components:
The relational model is based on the concept of relation (not to be confused with the conceptual relationship of the ER model), whose representation is a table (or dataframe in R’s jargon). The concept of relation is formal and comes from set theory, a part of mathematics. The concept of table is intuitive and is used in various contexts outside databases. The success of the relational model lies precisely in the conjunction of a formal concept, the relation, which allowed the development of a theory of relational databases with results of practical impact, with an intuitive concept, the table, which made it possible the relational model also to end users without any mathematical notion.
The following properties characterize the relational mode:
Here is the relational model of the nycflights13 dataset. Primary key attributes are underlined, foreign key references are represented as R(A) –> S(B), meaning that A in table R references B in table S:
flight(id, month, day, hour, origin, destination, tailnum, carrier, time_hour)
plane(tailnum)
airline(carrier)
airport(code)
weather(time_hour, airport)
flight(tailnum) –> plane(tailnum)
flight(carrier) –> airline(carrier)
flight(origin) –> airport(code)
flight(destination) –> airport(code)
flight(time_hour, origin) –> weather(time_hour, airport)
weather(airport) –> airport(code)
The relational algebra is a collection of operators; each operator takes as input either a single relation or a pair of relations and outputs a single relation as its result. A relational query is a composition of a finite number of relational operators. In that sense a query is procedural, since it specifies the order in which the operators comprising the query are to be evaluated.
The relational algebra contains the following operators. Set operators union \(\cup\), intersection \(\cap\) and set difference \(\setminus\) have the usual semantics.
The projection operator \[\pi(T, A_1, \ldots A_n)\] selects columns \(A_1, \ldots A_n\) from table \(T\). Consider the table:
planes
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 3 N103US 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 5 N10575 2002 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 7 N107US 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## # … with 3,312 more rows
The projection:
\[ \pi(planes, \{tailnum, manufacturer\}) \]
is the following table:
## # A tibble: 3,322 x 2
## tailnum manufacturer
## <chr> <chr>
## 1 N10156 EMBRAER
## 2 N102UW AIRBUS INDUSTRIE
## 3 N103US AIRBUS INDUSTRIE
## 4 N104UW AIRBUS INDUSTRIE
## 5 N10575 EMBRAER
## 6 N105UW AIRBUS INDUSTRIE
## 7 N107US AIRBUS INDUSTRIE
## 8 N108UW AIRBUS INDUSTRIE
## 9 N109UW AIRBUS INDUSTRIE
## 10 N110UW AIRBUS INDUSTRIE
## # … with 3,312 more rows
The selection operator \[\sigma(T, \varphi)\] selects rows of \(T\) that satisfy formula \(\varphi\). Consider the table:
planes
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 3 N103US 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 5 N10575 2002 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 7 N107US 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## # … with 3,312 more rows
The selection: \[ \sigma(planes, manufacturer = EMBRAER \wedge engines = 2 ) \]
is the following table:
## # A tibble: 299 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 2 N10575 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 3 N11106 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 4 N11107 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 5 N11109 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 6 N11113 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 7 N11119 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 8 N11121 2003 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 9 N11127 2003 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 10 N11137 2003 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## # … with 289 more rows
The join operator \[T_1 \Join T_2\] of tables \(T_1\) and \(T_2\) having the attributes \(X\) in common concatenates every tuple of \(T_1\) with every tuple of \(T_2\) and selects those having the same \(X\)-values.
Consider the relations:
airflights
## # A tibble: 336,776 x 7
## month day flight origin dest tailnum carrier
## <int> <int> <int> <chr> <chr> <chr> <chr>
## 1 1 1 1545 EWR IAH N14228 UA
## 2 1 1 1714 LGA IAH N24211 UA
## 3 1 1 1141 JFK MIA N619AA AA
## 4 1 1 725 JFK BQN N804JB B6
## 5 1 1 461 LGA ATL N668DN DL
## 6 1 1 1696 EWR ORD N39463 UA
## 7 1 1 507 EWR FLL N516JB B6
## 8 1 1 5708 LGA IAD N829AS EV
## 9 1 1 79 JFK MCO N593JB B6
## 10 1 1 301 LGA ORD N3ALAA AA
## # … with 336,766 more rows
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
The join \[airflights \Join airlines\] is the relation:
## Joining, by = "carrier"
## # A tibble: 336,776 x 8
## month day flight origin dest tailnum carrier name
## <int> <int> <int> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 1545 EWR IAH N14228 UA United Air Lines Inc.
## 2 1 1 1714 LGA IAH N24211 UA United Air Lines Inc.
## 3 1 1 1141 JFK MIA N619AA AA American Airlines Inc.
## 4 1 1 725 JFK BQN N804JB B6 JetBlue Airways
## 5 1 1 461 LGA ATL N668DN DL Delta Air Lines Inc.
## 6 1 1 1696 EWR ORD N39463 UA United Air Lines Inc.
## 7 1 1 507 EWR FLL N516JB B6 JetBlue Airways
## 8 1 1 5708 LGA IAD N829AS EV ExpressJet Airlines Inc.
## 9 1 1 79 JFK MCO N593JB B6 JetBlue Airways
## 10 1 1 301 LGA ORD N3ALAA AA American Airlines Inc.
## # … with 336,766 more rows
The renaming operator \[\rho(T, B = A)\] renames attribute A of table T with B. Consider the table:
aircrafts
## # A tibble: 3,322 x 3
## tailnum engine engines
## <chr> <chr> <int>
## 1 N10156 Turbo-fan 2
## 2 N102UW Turbo-fan 2
## 3 N103US Turbo-fan 2
## 4 N104UW Turbo-fan 2
## 5 N10575 Turbo-fan 2
## 6 N105UW Turbo-fan 2
## 7 N107US Turbo-fan 2
## 8 N108UW Turbo-fan 2
## 9 N109UW Turbo-fan 2
## 10 N110UW Turbo-fan 2
## # … with 3,312 more rows
The rename \[\rho(aircrafts, engineType = engine) \]
## # A tibble: 3,322 x 3
## tailnum engineType engines
## <chr> <chr> <int>
## 1 N10156 Turbo-fan 2
## 2 N102UW Turbo-fan 2
## 3 N103US Turbo-fan 2
## 4 N104UW Turbo-fan 2
## 5 N10575 Turbo-fan 2
## 6 N105UW Turbo-fan 2
## 7 N107US Turbo-fan 2
## 8 N108UW Turbo-fan 2
## 9 N109UW Turbo-fan 2
## 10 N110UW Turbo-fan 2
## # … with 3,312 more rows
Structured Query Language (SQL) is a language that allows the definition (creation of schema and integrity constraints), the manipulation (insert, update and delete of data) and the query (data retrieval) of a relational database. Here we are interested in the query part of the language. It contains the following operators
select
to select columnsfrom
to select tableswhere
to filter rowsas
to introduce new attributesorder by
to sort rowsgroup by
to partition rows into groupsjoin
to join tablesThe projection:
\[ \pi(planes, \{tailnum, manufacturer\}) \]
is the following SQL query:
select tailnum, manufacturer from planes
## # A tibble: 3,322 x 2
## tailnum manufacturer
## <chr> <chr>
## 1 N10156 EMBRAER
## 2 N102UW AIRBUS INDUSTRIE
## 3 N103US AIRBUS INDUSTRIE
## 4 N104UW AIRBUS INDUSTRIE
## 5 N10575 EMBRAER
## 6 N105UW AIRBUS INDUSTRIE
## 7 N107US AIRBUS INDUSTRIE
## 8 N108UW AIRBUS INDUSTRIE
## 9 N109UW AIRBUS INDUSTRIE
## 10 N110UW AIRBUS INDUSTRIE
## # … with 3,312 more rows
The selection: \[ \sigma(planes, manufacturer = EMBRAER \wedge engines = 2 ) \]
is the following SQL query:
select * from planes where manufacturer = "EMBRAER" and engines = 2
## # A tibble: 299 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 2 N10575 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 3 N11106 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 4 N11107 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 5 N11109 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 6 N11113 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 7 N11119 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 8 N11121 2003 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 9 N11127 2003 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 10 N11137 2003 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## # … with 289 more rows
The join \[airflights \Join airlines\] is the following SQL query:
select * from airflights join airlines on airflights.carrier = airlines.carrier
## Joining, by = "carrier"
## # A tibble: 336,776 x 8
## month day flight origin dest tailnum carrier name
## <int> <int> <int> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 1545 EWR IAH N14228 UA United Air Lines Inc.
## 2 1 1 1714 LGA IAH N24211 UA United Air Lines Inc.
## 3 1 1 1141 JFK MIA N619AA AA American Airlines Inc.
## 4 1 1 725 JFK BQN N804JB B6 JetBlue Airways
## 5 1 1 461 LGA ATL N668DN DL Delta Air Lines Inc.
## 6 1 1 1696 EWR ORD N39463 UA United Air Lines Inc.
## 7 1 1 507 EWR FLL N516JB B6 JetBlue Airways
## 8 1 1 5708 LGA IAD N829AS EV ExpressJet Airlines Inc.
## 9 1 1 79 JFK MCO N593JB B6 JetBlue Airways
## 10 1 1 301 LGA ORD N3ALAA AA American Airlines Inc.
## # … with 336,766 more rows
SQL adds an operator order by
to sort rows of tables. For instance the following query order flights in decreasing order with respect to departure delay:
select flight, dep_delay from flights order by dep_delay desc
## # A tibble: 336,776 x 2
## flight dep_delay
## <int> <dbl>
## 1 51 1301
## 2 3535 1137
## 3 3695 1126
## 4 177 1014
## 5 3075 1005
## 6 2391 960
## 7 2119 911
## 8 2007 899
## 9 2047 898
## 10 172 896
## # … with 336,766 more rows
You can introduce new attributes using the as operator. In the following we define catchup as the difference between departure and arrival delays:
select flight, (dep_delay - arr_delay) as catchup from flights
## # A tibble: 336,776 x 2
## flight catchup
## <int> <dbl>
## 1 1545 -9
## 2 1714 -16
## 3 1141 -31
## 4 725 17
## 5 461 19
## 6 1696 -16
## 7 507 -24
## 8 5708 11
## 9 79 5
## 10 301 -10
## # … with 336,766 more rows
Finally SQL has an operator group by
to partition rows of a table into groups by common values of some attributes. We can use summary functions like count, sum and mean on the table groups. For instance, the following query groups flights by day and month and then counts the number of flights in each group:
select month, day, count(*) as count from flights group by month, day
## # A tibble: 365 x 3
## # Groups: month [12]
## month day count
## <int> <int> <int>
## 1 1 1 842
## 2 1 2 943
## 3 1 3 914
## 4 1 4 915
## 5 1 5 720
## 6 1 6 832
## 7 1 7 933
## 8 1 8 899
## 9 1 9 902
## 10 1 10 932
## # … with 355 more rows
Notice the operator as
which creates a new attribute called count
.
Generally, dplyr is a little easier to use than SQL because dplyr is specialized to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis. There is an immediate correspondence between SQL clauses and dplyr verbs:
select
in SQL is select()
in dplyrwhere
in SQL is filter()
in dplyrorder by
in SQL is arrange()
in dplyrgroup by
in SQL is group_by()
in dplyras
in select
in SQL is mutate()
in dplyrjoin
in SQL is join()
in dplyrThe SQL projection:
select tailnum, manufacturer from planes
writes as:
select(planes, tailnum, manufacturer)
## # A tibble: 3,322 x 2
## tailnum manufacturer
## <chr> <chr>
## 1 N10156 EMBRAER
## 2 N102UW AIRBUS INDUSTRIE
## 3 N103US AIRBUS INDUSTRIE
## 4 N104UW AIRBUS INDUSTRIE
## 5 N10575 EMBRAER
## 6 N105UW AIRBUS INDUSTRIE
## 7 N107US AIRBUS INDUSTRIE
## 8 N108UW AIRBUS INDUSTRIE
## 9 N109UW AIRBUS INDUSTRIE
## 10 N110UW AIRBUS INDUSTRIE
## # … with 3,312 more rows
The SQL selection:
select * from planes where manufacturer = "EMBRAER" and engines = 2
writes as:
filter(planes, manufacturer == "EMBRAER" & engines == 2)
## # A tibble: 299 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 2 N10575 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 3 N11106 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 4 N11107 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 5 N11109 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 6 N11113 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 7 N11119 2002 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 8 N11121 2003 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 9 N11127 2003 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## 10 N11137 2003 Fixed wing… EMBRAER EMB-1… 2 55 NA Turbo…
## # … with 289 more rows
The SQL join:
select * from airflights join airlines on airflights.carrier = airlines.carrier
writes as:
## # A tibble: 336,776 x 8
## month day flight origin dest tailnum carrier name
## <int> <int> <int> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 1545 EWR IAH N14228 UA United Air Lines Inc.
## 2 1 1 1714 LGA IAH N24211 UA United Air Lines Inc.
## 3 1 1 1141 JFK MIA N619AA AA American Airlines Inc.
## 4 1 1 725 JFK BQN N804JB B6 JetBlue Airways
## 5 1 1 461 LGA ATL N668DN DL Delta Air Lines Inc.
## 6 1 1 1696 EWR ORD N39463 UA United Air Lines Inc.
## 7 1 1 507 EWR FLL N516JB B6 JetBlue Airways
## 8 1 1 5708 LGA IAD N829AS EV ExpressJet Airlines Inc.
## 9 1 1 79 JFK MCO N593JB B6 JetBlue Airways
## 10 1 1 301 LGA ORD N3ALAA AA American Airlines Inc.
## # … with 336,766 more rows
The SQL order by:
select flight, dep_delay from flights order by dep_delay desc
writes as:
## # A tibble: 336,776 x 2
## flight dep_delay
## <int> <dbl>
## 1 51 1301
## 2 3535 1137
## 3 3695 1126
## 4 177 1014
## 5 3075 1005
## 6 2391 960
## 7 2119 911
## 8 2007 899
## 9 2047 898
## 10 172 896
## # … with 336,766 more rows
The SQL query using as operator:
select flight, (dep_delay - arr_delay) as catchup from flights
writes as:
# dplyr
select(mutate(flights, catchup = dep_delay - arr_delay), flight, catchup)
## # A tibble: 336,776 x 2
## flight catchup
## <int> <dbl>
## 1 1545 -9
## 2 1714 -16
## 3 1141 -31
## 4 725 17
## 5 461 19
## 6 1696 -16
## 7 507 -24
## 8 5708 11
## 9 79 5
## 10 301 -10
## # … with 336,766 more rows
Finally the SQL group by:
select month, day, count(*) as count from flights group by month, day
writes as:
## # A tibble: 365 x 3
## # Groups: month [12]
## month day count
## <int> <int> <int>
## 1 1 1 842
## 2 1 2 943
## 3 1 3 914
## 4 1 4 915
## 5 1 5 720
## 6 1 6 832
## 7 1 7 933
## 8 1 8 899
## 9 1 9 902
## 10 1 10 932
## # … with 355 more rows
A barplot showing the number of flights per hour:
ggplot(data = flights) +
geom_bar(mapping = aes(x = as.factor(hour)))
A boxplot of the number of flight per day of the week:
daily <- flights %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarise(n = n())
daily <- daily %>%
mutate(wday = wday(date, label = TRUE))
ggplot(data = daily, mapping = aes(wday, n)) +
geom_boxplot()
An histogram of the departure delay of flights with at most one hour of delay:
flights60 = filter(flights, dep_delay <=60)
ggplot(data = flights60) +
geom_histogram(mapping = aes(x = dep_delay))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A scatter plot of arrival versus departure delays:
flightsDecember = filter(flights, month == 12, !is.na(dep_delay), !is.na(arr_delay))
ggplot(data = flightsDecember) +
geom_point(mapping = aes(x = dep_delay, y = arr_delay))