We formally define the relational model, the relational algebra and the relational calculus.
Informally, the relational model has only one data structure, the relation. The schema of a relation \(R\) is a set of attributes \[schema(R) = \{A_1, A_2,...,A_m\}\] describing the properties of the relation, and the schema of a database is a set of relation schemas.
Each attribute \(A\) is associated with a domain of values, denoted \(\cal{D}(A)\). A domain is atomic if it is a nondecomposable set of values. A tuple over a relation schema \(R\), with \(schema(R) = \{A_1, A_2,...,A_m\}\) is a member of the Cartesian product: \[\cal{D}(A_1) \times \cal{D}(A_2) \times \ldots \times \cal{D}(A_m)\] A relation is a finite set of tuples and a database is a finite set of relations. It is important to remember that relations and databases are finite sets – only a finite amount of information can be stored in a computer.
The relational algebra is a collection of operators; each operator takes as input either a single relation or a pair of relations and outputs a single relation as its result. A relational query is a composition of a finite number of relational operators. In that sense a query is procedural, since it specifies the order in which the operators comprising the query are to be evaluated.
The relational algebra was first presented in Codd’s seminal 1970 paper (A relational model of data for large shared data banks) and a variant of the domain relational calculus was first presented in 1972 in another Codd’s fundamental papers (Relational completeness of data base sublanguages). Since then the relational algebra has become a yardstick for measuring the expressiveness of any relational query language.
Our style of presentation of the relational algebra operators is: for each operator we give an informal definition of the operator, then we give its formal definition, and finally we give an example of its use over the database nycflights13
.
The set-theoretic operators, union, difference and intersection, defined below, are all binary operators which take two relations over a common relation schema and return a relation over the same schema. The union of two relations \(r_1\) and \(r_2\) over relation schema R is the set of tuples that are either in \(r_1\) or in \(r_2\):
\[r_1 \cup r_2 = \{t \ | \ t \in r_1 \ \vee \ t \in r_2\}\]
Consider relations:
r1
## # A tibble: 5 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
r2
## # A tibble: 7 x 2
## carrier name
## <chr> <chr>
## 1 B6 JetBlue Airways
## 2 DL Delta Air Lines Inc.
## 3 EV ExpressJet Airlines Inc.
## 4 F9 Frontier Airlines Inc.
## 5 FL AirTran Airways Corporation
## 6 HA Hawaiian Airlines Inc.
## 7 MQ Envoy Air
The union \[r_1 \cup r_2\] is the relation:
## # A tibble: 10 x 2
## carrier name
## <chr> <chr>
## 1 MQ Envoy Air
## 2 HA Hawaiian Airlines Inc.
## 3 FL AirTran Airways Corporation
## 4 F9 Frontier Airlines Inc.
## 5 EV ExpressJet Airlines Inc.
## 6 DL Delta Air Lines Inc.
## 7 B6 JetBlue Airways
## 8 AS Alaska Airlines Inc.
## 9 AA American Airlines Inc.
## 10 9E Endeavor Air Inc.
The difference between two relations \(r_1\) and \(r_2\) over relation schema R is the set of tuples that are in \(r_1\) but not in \(r_2\):
\[r_1 \setminus r_2 = \{t \ | \ t \in r_1 \ \wedge \ t \not\in r_2\}\]
Consider relations:
r1
## # A tibble: 5 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
r2
## # A tibble: 7 x 2
## carrier name
## <chr> <chr>
## 1 B6 JetBlue Airways
## 2 DL Delta Air Lines Inc.
## 3 EV ExpressJet Airlines Inc.
## 4 F9 Frontier Airlines Inc.
## 5 FL AirTran Airways Corporation
## 6 HA Hawaiian Airlines Inc.
## 7 MQ Envoy Air
The difference \[r_1 \setminus r_2\] is the relation:
## # A tibble: 3 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
The intersection between two relations \(r_1\) and \(r_2\) over relation schema R is the set of tuples that are both in \(r_1\) and in \(r_2\):
\[r_1 \cap r_2 = \{t \ | \ t \in r_1 \ \wedge \ t \in r_2\}\]
Consider relations:
r1
## # A tibble: 5 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
r2
## # A tibble: 7 x 2
## carrier name
## <chr> <chr>
## 1 B6 JetBlue Airways
## 2 DL Delta Air Lines Inc.
## 3 EV ExpressJet Airlines Inc.
## 4 F9 Frontier Airlines Inc.
## 5 FL AirTran Airways Corporation
## 6 HA Hawaiian Airlines Inc.
## 7 MQ Envoy Air
The intersection \[r_1 \cap r_2\] is the relation:
## # A tibble: 2 x 2
## carrier name
## <chr> <chr>
## 1 B6 JetBlue Airways
## 2 DL Delta Air Lines Inc.
Can we express intersection in terms of difference?
The projection of a relation \(r\) over relation schema \(R\) onto a set of attributes \(X\) included in \(schema(R)\) is the set of tuples resulting from projecting each of the tuples in \(r\) onto attributes in \(X\):
\[\pi(r, X) = \{t[X] \ | \ t \in r\}\] where \(t[X]\) is the projection of tuple \(t\) onto attributes in \(X\), that is, the tuple that contains only values of \(t\) for attributes in \(X\).
Consider the following relation:
planes
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wi… EMBRAER EMB-1… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wi… EMBRAER EMB-1… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 7 N107US 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## # ... with 3,312 more rows
The projection \[ \pi(planes, \{tailnum, manufacturer\}) \] is the relation:
## # A tibble: 3,322 x 2
## tailnum manufacturer
## <chr> <chr>
## 1 N10156 EMBRAER
## 2 N102UW AIRBUS INDUSTRIE
## 3 N103US AIRBUS INDUSTRIE
## 4 N104UW AIRBUS INDUSTRIE
## 5 N10575 EMBRAER
## 6 N105UW AIRBUS INDUSTRIE
## 7 N107US AIRBUS INDUSTRIE
## 8 N108UW AIRBUS INDUSTRIE
## 9 N109UW AIRBUS INDUSTRIE
## 10 N110UW AIRBUS INDUSTRIE
## # ... with 3,312 more rows
Selection of tuples from a relation \(r\) with respect to a selection formula \(F\) is the subset of tuples from \(r\) that satisfy formula \(F\). A selection formula is defined recursively as follows:
For instance:
\[A = B \wedge (C = c_1 \vee C = c_2)\]
Informally a tuple, \(t\), logically implies a formula, \(F\), if the tuple satisfies \(F\). Formally, tuple \(t\) logically implies \(F\), written \(t \models F\), is defined recursively, as follows:
The selection, \(\sigma\), applied to a relation \(r\) over relation schema \(R\) with respect to a selection formula \(F\) over \(R\) is defined by:
\[\sigma(r, F) = \{t \ | \ t \in r \ \wedge \ t \models F\}\]
Consider the relation:
planes
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wi… EMBRAER EMB-1… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wi… EMBRAER EMB-1… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 7 N107US 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo…
## # ... with 3,312 more rows
The selection \[ \sigma(planes, manufacturer = EMBRAER \wedge engines = 2 ) \] is the relation:
## # A tibble: 299 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 2 N10575 2002 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 3 N11106 2002 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 4 N11107 2002 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 5 N11109 2002 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 6 N11113 2002 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 7 N11119 2002 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 8 N11121 2003 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 9 N11127 2003 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## 10 N11137 2003 Fixed win… EMBRAER EMB-1… 2 55 NA Turbo…
## # ... with 289 more rows
There is a natural correspondence between the Boolean logical connectives \(\vee\), \(\wedge\), \(\neg\) present in selection formulae and the set operators \(\cup\), \(\cap\), \(\setminus\) respectively:
Informally, the natural join of two relations \(r_1\) over relation schema \(R_1\) and \(r_2\) over relation schema \(R_2\), with \(schema(R_1) \cap schema(R_2)\) being the set of attributes \(X\), is the relation containing tuples that result from concatenating every tuple of \(r_1\) with every tuple of \(r_2\) both of which have the same \(X\)-values. The attributes in \(X\) are called the join attributes of \(R_1\) and \(R_2\).
Formally, the natural join, \(r_1 \Join r_2\) , of two relations \(r_1\) over relation schema \(R_1\) and \(r_2\) over relation schema \(R_2\) is a relation over relation schema \(R\) with \(schema(R) = schema(R_1) \cup schema(R_2)\) defined by
\[r_1 \Join r_2 = \{t \ | \ t[schema(R_1)] \in r_1 \ \wedge \ t[schema(R_2)] \in r_2\}\]
Consider the relations:
airflights
## # A tibble: 336,776 x 7
## month day flight origin dest tailnum carrier
## <int> <int> <int> <chr> <chr> <chr> <chr>
## 1 1 1 1545 EWR IAH N14228 UA
## 2 1 1 1714 LGA IAH N24211 UA
## 3 1 1 1141 JFK MIA N619AA AA
## 4 1 1 725 JFK BQN N804JB B6
## 5 1 1 461 LGA ATL N668DN DL
## 6 1 1 1696 EWR ORD N39463 UA
## 7 1 1 507 EWR FLL N516JB B6
## 8 1 1 5708 LGA IAD N829AS EV
## 9 1 1 79 JFK MCO N593JB B6
## 10 1 1 301 LGA ORD N3ALAA AA
## # … with 336,766 more rows
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
The join \[airflights \Join airlines\] is the relation:
## Joining, by = "carrier"
## # A tibble: 336,776 x 8
## month day flight origin dest tailnum carrier name
## <int> <int> <int> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 1545 EWR IAH N14228 UA United Air Lines Inc.
## 2 1 1 1714 LGA IAH N24211 UA United Air Lines Inc.
## 3 1 1 1141 JFK MIA N619AA AA American Airlines Inc.
## 4 1 1 725 JFK BQN N804JB B6 JetBlue Airways
## 5 1 1 461 LGA ATL N668DN DL Delta Air Lines Inc.
## 6 1 1 1696 EWR ORD N39463 UA United Air Lines Inc.
## 7 1 1 507 EWR FLL N516JB B6 JetBlue Airways
## 8 1 1 5708 LGA IAD N829AS EV ExpressJet Airlines Inc.
## 9 1 1 79 JFK MCO N593JB B6 JetBlue Airways
## 10 1 1 301 LGA ORD N3ALAA AA American Airlines Inc.
## # … with 336,766 more rows
What is \(r_1 \Join r_2\) if \(schema(R_1) = schema(R_2)\)? In this case \[schema(R) = schema(R_1) \cup schema(R_2) = schema(R_1) = schema(R_2)\]
and hence:
\[r_1 \Join r_2 = \{t \ | \ t \in r_1 \ \wedge \ t \in r_2\} = r_1 \cap r_2\]
What is \(r_1 \Join r_2\) if \(schema(R_1) \cap schema(R_2) = \emptyset\)? In this case \(schema(R) = schema(R_1) \cup schema(R_2)\) contains all attributes in \(R_1\) and all attributes in \(R_2\). Since the set of join attributes is empty, each combination of a tuple of \(r_1\) and a tuple of \(r_2\) satisfies the join condition and hence is part of the result relation. Hence:
\[r_1 \Join r_2 = r_1 \times r_2 \]
The renaming operator allows us to change the name of an attribute in a schema of a relation. Renaming is useful when we want to take the set operations over different schemas, and when we want to take the natural join of two relations over a set of attributes other than the set of common ones.
Let \(r\) be a relation over relation schema \(R\), \(A\) be an attribute of \(schema(R)\) and \(B\) be an attribute not in \(schema(R)\). The renaming, \(\rho\), of \(A\) to \(B\) in \(r\), is a relation over relation schema \(S\), where \(schema(S) = (schema(R) \setminus \{A\}) \cup \{B\}\), defined by
\[\rho(r, B = A) = \{t \ | \ \exists u \in r. \ t[schema(R) \setminus \{B\}] = u[schema(R) \setminus \{A\}] \ \wedge \ t[B] = u[A]\}\]
It’s handy to extend the rename operator to cope with many renaming pairs: \[\rho(r, B_1 = A_1, B_2 = A_2, \ldots, B_n = A_n) \]
Consider the relation:
aircrafts
## # A tibble: 3,322 x 4
## manufacturer tailnum engine engines
## <chr> <chr> <chr> <int>
## 1 EMBRAER N10156 Turbo-fan 2
## 2 AIRBUS INDUSTRIE N102UW Turbo-fan 2
## 3 AIRBUS INDUSTRIE N103US Turbo-fan 2
## 4 AIRBUS INDUSTRIE N104UW Turbo-fan 2
## 5 EMBRAER N10575 Turbo-fan 2
## 6 AIRBUS INDUSTRIE N105UW Turbo-fan 2
## 7 AIRBUS INDUSTRIE N107US Turbo-fan 2
## 8 AIRBUS INDUSTRIE N108UW Turbo-fan 2
## 9 AIRBUS INDUSTRIE N109UW Turbo-fan 2
## 10 AIRBUS INDUSTRIE N110UW Turbo-fan 2
## # … with 3,312 more rows
The rename \[\rho(aircrafts, engineType = engine, engineNumber = engines) \]
## # A tibble: 3,322 x 4
## manufacturer tailnum engineType engineNumber
## <chr> <chr> <chr> <int>
## 1 EMBRAER N10156 Turbo-fan 2
## 2 AIRBUS INDUSTRIE N102UW Turbo-fan 2
## 3 AIRBUS INDUSTRIE N103US Turbo-fan 2
## 4 AIRBUS INDUSTRIE N104UW Turbo-fan 2
## 5 EMBRAER N10575 Turbo-fan 2
## 6 AIRBUS INDUSTRIE N105UW Turbo-fan 2
## 7 AIRBUS INDUSTRIE N107US Turbo-fan 2
## 8 AIRBUS INDUSTRIE N108UW Turbo-fan 2
## 9 AIRBUS INDUSTRIE N109UW Turbo-fan 2
## 10 AIRBUS INDUSTRIE N110UW Turbo-fan 2
## # … with 3,312 more rows
A relational algebra expression (or query) is a well-formed expression composed of a finite number of relational algebra operators whose operands are relation schemas which can be treated as input variables to the query. An answer to a query is obtained by replacing every occurrence of a relation schema in the query by a relation over the schema and computing the result by invoking the algebra operators present in the query.
For example, the following query retrieves the month, day, origin, destination, carrier code and carrier name of flights that flew with carrier JetBlue Airways.
\[ \pi(\sigma(flights \Join airlines, name = {\rm JetBlue\ Airways}), \{month, day, origin, dest, carrier, name\}) \] It would be handy to introduce the pipe operator (>) in relational algebra queries. For instance, the above query would become more readable:
\[ \begin{array}{l} (flights \Join airlines) > & \\ \sigma(name = {\rm JetBlue\ Airways}) > & \\ \pi(\{month, day, origin, dest, carrier, name\}) \end{array} \]
The query answer set is a relation with 54,635 tuples and 6 attributes (we only show the first 10 rows):
## # A tibble: 54,635 x 6
## month day origin dest carrier name
## <int> <int> <chr> <chr> <chr> <chr>
## 1 1 1 JFK BQN B6 JetBlue Airways
## 2 1 1 EWR FLL B6 JetBlue Airways
## 3 1 1 JFK MCO B6 JetBlue Airways
## 4 1 1 JFK PBI B6 JetBlue Airways
## 5 1 1 JFK TPA B6 JetBlue Airways
## 6 1 1 JFK BOS B6 JetBlue Airways
## 7 1 1 LGA FLL B6 JetBlue Airways
## 8 1 1 EWR PBI B6 JetBlue Airways
## 9 1 1 JFK RSW B6 JetBlue Airways
## 10 1 1 JFK SJU B6 JetBlue Airways
## # … with 54,625 more rows
The set of queries, which are expressible in the relational algebra, is considered to be the minimal set of queries that any query language for the relational model should possess. Thus the relational algebra provides a yardstick for measuring the expressive power of a query language for the relational model independently of any implementation.