We formally define the relational model, the relational algebra and the relational calculus.

Relational model

Informally, the relational model has only one data structure, the relation. The schema of a relation \(R\) is a set of attributes \[schema(R) = \{A_1, A_2,...,A_m\}\] describing the properties of the relation, and the schema of a database is a set of relation schemas.

Each attribute \(A\) is associated with a domain of values, denoted \(\cal{D}(A)\). A domain is atomic if it is a nondecomposable set of values. A tuple over a relation schema \(R\), with \(schema(R) = \{A_1, A_2,...,A_m\}\) is a member of the Cartesian product: \[\cal{D}(A_1) \times \cal{D}(A_2) \times \ldots \times \cal{D}(A_m)\] A relation is a finite set of tuples and a database is a finite set of relations. It is important to remember that relations and databases are finite sets – only a finite amount of information can be stored in a computer.

Relational algebra

The relational algebra is a collection of operators; each operator takes as input either a single relation or a pair of relations and outputs a single relation as its result. A relational query is a composition of a finite number of relational operators. In that sense a query is procedural, since it specifies the order in which the operators comprising the query are to be evaluated.

The relational algebra was first presented in Codd’s seminal 1970 paper (A relational model of data for large shared data banks) and a variant of the domain relational calculus was first presented in 1972 in another Codd’s fundamental papers (Relational completeness of data base sublanguages). Since then the relational algebra has become a yardstick for measuring the expressiveness of any relational query language.

Our style of presentation of the relational algebra operators is: for each operator we give an informal definition of the operator, then we give its formal definition, and finally we give an example of its use over the database nycflights13.

Union

The set-theoretic operators, union, difference and intersection, defined below, are all binary operators which take two relations over a common relation schema and return a relation over the same schema. The union of two relations \(r_1\) and \(r_2\) over relation schema R is the set of tuples that are either in \(r_1\) or in \(r_2\):

\[r_1 \cup r_2 = \{t \ | \ t \in r_1 \ \vee \ t \in r_2\}\]

Consider relations:

r1
## # A tibble: 5 x 2
##   carrier name                  
##   <chr>   <chr>                 
## 1 9E      Endeavor Air Inc.     
## 2 AA      American Airlines Inc.
## 3 AS      Alaska Airlines Inc.  
## 4 B6      JetBlue Airways       
## 5 DL      Delta Air Lines Inc.
r2
## # A tibble: 7 x 2
##   carrier name                       
##   <chr>   <chr>                      
## 1 B6      JetBlue Airways            
## 2 DL      Delta Air Lines Inc.       
## 3 EV      ExpressJet Airlines Inc.   
## 4 F9      Frontier Airlines Inc.     
## 5 FL      AirTran Airways Corporation
## 6 HA      Hawaiian Airlines Inc.     
## 7 MQ      Envoy Air

The union \[r_1 \cup r_2\] is the relation:

## # A tibble: 10 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 MQ      Envoy Air                  
##  2 HA      Hawaiian Airlines Inc.     
##  3 FL      AirTran Airways Corporation
##  4 F9      Frontier Airlines Inc.     
##  5 EV      ExpressJet Airlines Inc.   
##  6 DL      Delta Air Lines Inc.       
##  7 B6      JetBlue Airways            
##  8 AS      Alaska Airlines Inc.       
##  9 AA      American Airlines Inc.     
## 10 9E      Endeavor Air Inc.

Difference

The difference between two relations \(r_1\) and \(r_2\) over relation schema R is the set of tuples that are in \(r_1\) but not in \(r_2\):

\[r_1 \setminus r_2 = \{t \ | \ t \in r_1 \ \wedge \ t \not\in r_2\}\]

Consider relations:

r1
## # A tibble: 5 x 2
##   carrier name                  
##   <chr>   <chr>                 
## 1 9E      Endeavor Air Inc.     
## 2 AA      American Airlines Inc.
## 3 AS      Alaska Airlines Inc.  
## 4 B6      JetBlue Airways       
## 5 DL      Delta Air Lines Inc.
r2
## # A tibble: 7 x 2
##   carrier name                       
##   <chr>   <chr>                      
## 1 B6      JetBlue Airways            
## 2 DL      Delta Air Lines Inc.       
## 3 EV      ExpressJet Airlines Inc.   
## 4 F9      Frontier Airlines Inc.     
## 5 FL      AirTran Airways Corporation
## 6 HA      Hawaiian Airlines Inc.     
## 7 MQ      Envoy Air

The difference \[r_1 \setminus r_2\] is the relation:

## # A tibble: 3 x 2
##   carrier name                  
##   <chr>   <chr>                 
## 1 9E      Endeavor Air Inc.     
## 2 AA      American Airlines Inc.
## 3 AS      Alaska Airlines Inc.

Intersection

The intersection between two relations \(r_1\) and \(r_2\) over relation schema R is the set of tuples that are both in \(r_1\) and in \(r_2\):

\[r_1 \cap r_2 = \{t \ | \ t \in r_1 \ \wedge \ t \in r_2\}\]

Consider relations:

r1
## # A tibble: 5 x 2
##   carrier name                  
##   <chr>   <chr>                 
## 1 9E      Endeavor Air Inc.     
## 2 AA      American Airlines Inc.
## 3 AS      Alaska Airlines Inc.  
## 4 B6      JetBlue Airways       
## 5 DL      Delta Air Lines Inc.
r2
## # A tibble: 7 x 2
##   carrier name                       
##   <chr>   <chr>                      
## 1 B6      JetBlue Airways            
## 2 DL      Delta Air Lines Inc.       
## 3 EV      ExpressJet Airlines Inc.   
## 4 F9      Frontier Airlines Inc.     
## 5 FL      AirTran Airways Corporation
## 6 HA      Hawaiian Airlines Inc.     
## 7 MQ      Envoy Air

The intersection \[r_1 \cap r_2\] is the relation:

## # A tibble: 2 x 2
##   carrier name                
##   <chr>   <chr>               
## 1 B6      JetBlue Airways     
## 2 DL      Delta Air Lines Inc.

Can we express intersection in terms of difference?

Projection

The projection of a relation \(r\) over relation schema \(R\) onto a set of attributes \(X\) included in \(schema(R)\) is the set of tuples resulting from projecting each of the tuples in \(r\) onto attributes in \(X\):

\[\pi(r, X) = \{t[X] \ | \ t \in r\}\] where \(t[X]\) is the projection of tuple \(t\) onto attributes in \(X\), that is, the tuple that contains only values of \(t\) for attributes in \(X\).

Consider the following relation:

planes
## # A tibble: 3,322 x 9
##    tailnum  year type      manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>     <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wi… EMBRAER       EMB-1…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wi… EMBRAER       EMB-1…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## # ... with 3,312 more rows

The projection \[ \pi(planes, \{tailnum, manufacturer\}) \] is the relation:

## # A tibble: 3,322 x 2
##    tailnum manufacturer    
##    <chr>   <chr>           
##  1 N10156  EMBRAER         
##  2 N102UW  AIRBUS INDUSTRIE
##  3 N103US  AIRBUS INDUSTRIE
##  4 N104UW  AIRBUS INDUSTRIE
##  5 N10575  EMBRAER         
##  6 N105UW  AIRBUS INDUSTRIE
##  7 N107US  AIRBUS INDUSTRIE
##  8 N108UW  AIRBUS INDUSTRIE
##  9 N109UW  AIRBUS INDUSTRIE
## 10 N110UW  AIRBUS INDUSTRIE
## # ... with 3,312 more rows

Selection

Selection of tuples from a relation \(r\) with respect to a selection formula \(F\) is the subset of tuples from \(r\) that satisfy formula \(F\). A selection formula is defined recursively as follows:

  1. A simple selection formula over a relation schema \(R\) is either an expression of the form \(A=a\) or an expression of the form \(A=B\) where \(A, B \in schema(R)\) and \(a \in \cal{D}(A)\).
  2. A selection formula over R is a well-formed expression composed of one or more simple selection formulae over R together with the Boolean logical connectives: \(\wedge\) (and), \(\vee\) (or), \(\neg\) (not) and parentheses.

For instance:

\[A = B \wedge (C = c_1 \vee C = c_2)\]

Informally a tuple, \(t\), logically implies a formula, \(F\), if the tuple satisfies \(F\). Formally, tuple \(t\) logically implies \(F\), written \(t \models F\), is defined recursively, as follows:

  1. \(t \models A=a\) if \(t[A]=a\) is true
  2. \(t \models A=B\) if \(t[A]=t[B]\) is true
  3. \(t \models F_1 \wedge F_2\) if \(t \models F_1\) is true and \(t \models F_2\) is true
  4. \(t \models F_1 \vee F_2\) if \(t \models F_1\) is true or \(t \models F_2\) is true
  5. \(t \models \neg F_1\) if \(t \models F_1\) is false

The selection, \(\sigma\), applied to a relation \(r\) over relation schema \(R\) with respect to a selection formula \(F\) over \(R\) is defined by:

\[\sigma(r, F) = \{t \ | \ t \in r \ \wedge \ t \models F\}\]

Consider the relation:

planes
## # A tibble: 3,322 x 9
##    tailnum  year type      manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>     <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wi… EMBRAER       EMB-1…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wi… EMBRAER       EMB-1…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## # ... with 3,312 more rows

The selection \[ \sigma(planes, manufacturer = EMBRAER \wedge engines = 2 ) \] is the relation:

## # A tibble: 299 x 9
##    tailnum  year type       manufacturer model  engines seats speed engine
##    <chr>   <int> <chr>      <chr>        <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed win… EMBRAER      EMB-1…       2    55    NA Turbo…
##  2 N10575   2002 Fixed win… EMBRAER      EMB-1…       2    55    NA Turbo…
##  3 N11106   2002 Fixed win… EMBRAER      EMB-1…       2    55    NA Turbo…
##  4 N11107   2002 Fixed win… EMBRAER      EMB-1…       2    55    NA Turbo…
##  5 N11109   2002 Fixed win… EMBRAER      EMB-1…       2    55    NA Turbo…
##  6 N11113   2002 Fixed win… EMBRAER      EMB-1…       2    55    NA Turbo…
##  7 N11119   2002 Fixed win… EMBRAER      EMB-1…       2    55    NA Turbo…
##  8 N11121   2003 Fixed win… EMBRAER      EMB-1…       2    55    NA Turbo…
##  9 N11127   2003 Fixed win… EMBRAER      EMB-1…       2    55    NA Turbo…
## 10 N11137   2003 Fixed win… EMBRAER      EMB-1…       2    55    NA Turbo…
## # ... with 289 more rows

There is a natural correspondence between the Boolean logical connectives \(\vee\), \(\wedge\), \(\neg\) present in selection formulae and the set operators \(\cup\), \(\cap\), \(\setminus\) respectively:

  1. \(\sigma(r, F_1 \vee F_2) = \sigma(r, F_1) \cup \sigma(r, F_2)\)
  2. \(\sigma(r, F_1 \wedge F_2) = \sigma(r, F_1) \cap \sigma(r, F_2)\)
  3. \(\sigma(r, \neg F) = r \setminus \sigma(r, F)\)

Join

Informally, the natural join of two relations \(r_1\) over relation schema \(R_1\) and \(r_2\) over relation schema \(R_2\), with \(schema(R_1) \cap schema(R_2)\) being the set of attributes \(X\), is the relation containing tuples that result from concatenating every tuple of \(r_1\) with every tuple of \(r_2\) both of which have the same \(X\)-values. The attributes in \(X\) are called the join attributes of \(R_1\) and \(R_2\).

Formally, the natural join, \(r_1 \Join r_2\) , of two relations \(r_1\) over relation schema \(R_1\) and \(r_2\) over relation schema \(R_2\) is a relation over relation schema \(R\) with \(schema(R) = schema(R_1) \cup schema(R_2)\) defined by

\[r_1 \Join r_2 = \{t \ | \ t[schema(R_1)] \in r_1 \ \wedge \ t[schema(R_2)] \in r_2\}\]

Consider the relations:

airflights
## # A tibble: 336,776 x 7
##    month   day flight origin dest  tailnum carrier
##    <int> <int>  <int> <chr>  <chr> <chr>   <chr>  
##  1     1     1   1545 EWR    IAH   N14228  UA     
##  2     1     1   1714 LGA    IAH   N24211  UA     
##  3     1     1   1141 JFK    MIA   N619AA  AA     
##  4     1     1    725 JFK    BQN   N804JB  B6     
##  5     1     1    461 LGA    ATL   N668DN  DL     
##  6     1     1   1696 EWR    ORD   N39463  UA     
##  7     1     1    507 EWR    FLL   N516JB  B6     
##  8     1     1   5708 LGA    IAD   N829AS  EV     
##  9     1     1     79 JFK    MCO   N593JB  B6     
## 10     1     1    301 LGA    ORD   N3ALAA  AA     
## # … with 336,766 more rows
airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

The join \[airflights \Join airlines\] is the relation:

## Joining, by = "carrier"
## # A tibble: 336,776 x 8
##    month   day flight origin dest  tailnum carrier name                    
##    <int> <int>  <int> <chr>  <chr> <chr>   <chr>   <chr>                   
##  1     1     1   1545 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2     1     1   1714 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3     1     1   1141 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4     1     1    725 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5     1     1    461 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6     1     1   1696 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7     1     1    507 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8     1     1   5708 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9     1     1     79 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10     1     1    301 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

What is \(r_1 \Join r_2\) if \(schema(R_1) = schema(R_2)\)? In this case \[schema(R) = schema(R_1) \cup schema(R_2) = schema(R_1) = schema(R_2)\]

and hence:

\[r_1 \Join r_2 = \{t \ | \ t \in r_1 \ \wedge \ t \in r_2\} = r_1 \cap r_2\]

What is \(r_1 \Join r_2\) if \(schema(R_1) \cap schema(R_2) = \emptyset\)? In this case \(schema(R) = schema(R_1) \cup schema(R_2)\) contains all attributes in \(R_1\) and all attributes in \(R_2\). Since the set of join attributes is empty, each combination of a tuple of \(r_1\) and a tuple of \(r_2\) satisfies the join condition and hence is part of the result relation. Hence:

\[r_1 \Join r_2 = r_1 \times r_2 \]

Renaming

The renaming operator allows us to change the name of an attribute in a schema of a relation. Renaming is useful when we want to take the set operations over different schemas, and when we want to take the natural join of two relations over a set of attributes other than the set of common ones.

Let \(r\) be a relation over relation schema \(R\), \(A\) be an attribute of \(schema(R)\) and \(B\) be an attribute not in \(schema(R)\). The renaming, \(\rho\), of \(A\) to \(B\) in \(r\), is a relation over relation schema \(S\), where \(schema(S) = (schema(R) \setminus \{A\}) \cup \{B\}\), defined by

\[\rho(r, B = A) = \{t \ | \ \exists u \in r. \ t[schema(R) \setminus \{B\}] = u[schema(R) \setminus \{A\}] \ \wedge \ t[B] = u[A]\}\]

It’s handy to extend the rename operator to cope with many renaming pairs: \[\rho(r, B_1 = A_1, B_2 = A_2, \ldots, B_n = A_n) \]

Consider the relation:

aircrafts
## # A tibble: 3,322 x 4
##    manufacturer     tailnum engine    engines
##    <chr>            <chr>   <chr>       <int>
##  1 EMBRAER          N10156  Turbo-fan       2
##  2 AIRBUS INDUSTRIE N102UW  Turbo-fan       2
##  3 AIRBUS INDUSTRIE N103US  Turbo-fan       2
##  4 AIRBUS INDUSTRIE N104UW  Turbo-fan       2
##  5 EMBRAER          N10575  Turbo-fan       2
##  6 AIRBUS INDUSTRIE N105UW  Turbo-fan       2
##  7 AIRBUS INDUSTRIE N107US  Turbo-fan       2
##  8 AIRBUS INDUSTRIE N108UW  Turbo-fan       2
##  9 AIRBUS INDUSTRIE N109UW  Turbo-fan       2
## 10 AIRBUS INDUSTRIE N110UW  Turbo-fan       2
## # … with 3,312 more rows

The rename \[\rho(aircrafts, engineType = engine, engineNumber = engines) \]

## # A tibble: 3,322 x 4
##    manufacturer     tailnum engineType engineNumber
##    <chr>            <chr>   <chr>             <int>
##  1 EMBRAER          N10156  Turbo-fan             2
##  2 AIRBUS INDUSTRIE N102UW  Turbo-fan             2
##  3 AIRBUS INDUSTRIE N103US  Turbo-fan             2
##  4 AIRBUS INDUSTRIE N104UW  Turbo-fan             2
##  5 EMBRAER          N10575  Turbo-fan             2
##  6 AIRBUS INDUSTRIE N105UW  Turbo-fan             2
##  7 AIRBUS INDUSTRIE N107US  Turbo-fan             2
##  8 AIRBUS INDUSTRIE N108UW  Turbo-fan             2
##  9 AIRBUS INDUSTRIE N109UW  Turbo-fan             2
## 10 AIRBUS INDUSTRIE N110UW  Turbo-fan             2
## # … with 3,312 more rows

A relational algebra expression (or query) is a well-formed expression composed of a finite number of relational algebra operators whose operands are relation schemas which can be treated as input variables to the query. An answer to a query is obtained by replacing every occurrence of a relation schema in the query by a relation over the schema and computing the result by invoking the algebra operators present in the query.

For example, the following query retrieves the month, day, origin, destination, carrier code and carrier name of flights that flew with carrier JetBlue Airways.

\[ \pi(\sigma(flights \Join airlines, name = {\rm JetBlue\ Airways}), \{month, day, origin, dest, carrier, name\}) \] It would be handy to introduce the pipe operator (>) in relational algebra queries. For instance, the above query would become more readable:

\[ \begin{array}{l} (flights \Join airlines) > & \\ \sigma(name = {\rm JetBlue\ Airways}) > & \\ \pi(\{month, day, origin, dest, carrier, name\}) \end{array} \]

The query answer set is a relation with 54,635 tuples and 6 attributes (we only show the first 10 rows):

## # A tibble: 54,635 x 6
##    month   day origin dest  carrier name           
##    <int> <int> <chr>  <chr> <chr>   <chr>          
##  1     1     1 JFK    BQN   B6      JetBlue Airways
##  2     1     1 EWR    FLL   B6      JetBlue Airways
##  3     1     1 JFK    MCO   B6      JetBlue Airways
##  4     1     1 JFK    PBI   B6      JetBlue Airways
##  5     1     1 JFK    TPA   B6      JetBlue Airways
##  6     1     1 JFK    BOS   B6      JetBlue Airways
##  7     1     1 LGA    FLL   B6      JetBlue Airways
##  8     1     1 EWR    PBI   B6      JetBlue Airways
##  9     1     1 JFK    RSW   B6      JetBlue Airways
## 10     1     1 JFK    SJU   B6      JetBlue Airways
## # … with 54,625 more rows

The set of queries, which are expressible in the relational algebra, is considered to be the minimal set of queries that any query language for the relational model should possess. Thus the relational algebra provides a yardstick for measuring the expressive power of a query language for the relational model independently of any implementation.