Structured Query Language (SQL) is a language that allows:

  1. the definition (creation of schema and integrity constraints),
  2. the manipulation (insert, update and delete of data) and
  3. the query (data retrieval) of a relational database.

SQL was one of the first commercial languages for Edgar F. Codd’s relational model, as described in his influential 1970 paper, “A Relational Model of Data for Large Shared Data Banks”.

SQL is a declarative language: it allows you to specify what to retrieve without saying how. When a query is executed by the query processor, it is translated into a procedural language that specifies how to access the data. There are generally more than one possible translations of an SQL query into the procedural language. The task of the query optimizer is to choose the most efficient execution plan.

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite it a good choice for data analysis:

People who understand SQL can employ the sqlite3 command-line shell (or various third-party SQLite access programs) to analyze large datasets. Raw data can be imported from CSV files, then that data can be sliced and diced to generate a myriad of summary reports. More complex analysis can be done using simple scripts written in Tcl or Python (both of which come with SQLite built-in) or in R or other languages using readily available adaptors. (…) The same thing can be done with an enterprise client/server database, of course. The advantage of SQLite is that it is easier to install and use and the resulting database is a single file that can be written to a USB memory stick or emailed to a colleague.

Furthermore, it is helpful in education and training:

Because it is simple to setup and use (installation is trivial: just copy the sqlite3 or sqlite3exe executable to the target machine and run it) SQLite makes a good database engine for use in teaching SQL. Students can easily create as many databases as they like and can email databases to the instructor for comments or grading. For more advanced students who are interested in studying how an RDBMS is implemented, the modular and well-commented and documented SQLite code can serve as a good basis.

Data definition and management

We are going to create a database in SQLite corresponding to the dataset nycflights13 of NYC flights in 2013. In particular, we will follow these steps:

  1. download and compile SQLite
  2. check integrity constraints
  3. export data frames into CSV files
  4. create a database in SQLite
  5. create tables with primary and foreign key constraints
  6. insert data into tables using the exported CSV files
  7. create indices
  8. make a backup of the database
  9. try to violate primary and foreign key constraints with insertion, deletion, and update
  10. restore the database from the backup

Download and compile SQLite

Download the latest source version of SQLite from https://www.sqlite.org. On UNIX systems compile the source with:

gcc shell.c sqlite3.c -lpthread -ldl -o sqlite

You may also download the precompiled binaries for your platform (in the following I have renamed sqlite3 with sqlite). It is handy to add the SQLite folder to the $PATH system variable. For instance, on Mac OS, edit file /etc/paths with command:

sudo nano /etc/paths

and add the SQLite folder at the bottom. Save and exit. Close and re-open the terminal. Check with

echo $PATH

Check integrity contraints

Table flights doesn’t have an explicit primary key. Indeed, the attribute sequence year, month, day, sched_dep_time, flight, carrier is a key for flights:

library(nycflights13)
library(dplyr)

flights %>% 
  count(year, month, day, sched_dep_time, flight, carrier) %>% 
  filter(n > 1)
## # A tibble: 0 x 7
## # … with 7 variables: year <int>, month <int>, day <int>, sched_dep_time <int>,
## #   flight <int>, carrier <chr>, n <int>

This means that there are no two rows with the same values for attributes year, month, day, sched_dep_time, flight, carrier. However, this key sequence is rather long (6 attributes) and hence neither handy nor efficient. If a table lacks a natural primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key:

flights <- 
  flights %>% 
  arrange(year, month, day, sched_dep_time) %>%
  mutate(id = row_number()) %>%
  select(id, everything())

The statement above first sorts flights in chronological increasing order and then adds to the flights table an id attribute corresponding to the table row number. Hence, each flight has a unique serial number.

We also add an id attribute for the weather table:

weather <- 
  weather %>% 
  arrange(year, month, day, origin) %>%
  mutate(id = row_number()) %>%
  select(id, everything())

On the other hand, the other tables have key attributes:

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
## # A tibble: 0 x 2
## # … with 2 variables: tailnum <chr>, n <int>
airports %>% 
  count(faa) %>% 
  filter(n > 1)
## # A tibble: 0 x 2
## # … with 2 variables: faa <chr>, n <int>
airlines %>% 
  count(carrier) %>% 
  filter(n > 1)
## # A tibble: 0 x 2
## # … with 2 variables: carrier <chr>, n <int>

Unfortunately, some foreign key constraints are not satisfied by our dataset:

# find rows violating FKC
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # … with 712 more rows
# cure the violation of FKC
id_fk = anti_join(flights, planes, by = "tailnum")$id
flights = mutate(flights, tailnum = ifelse(id %in% id_fk, NA, tailnum))


# find rows violating FKC
flights %>%
  anti_join(airports, by = c("dest" = "faa")) %>%
  count(dest, sort = TRUE)
## # A tibble: 4 x 2
##   dest      n
##   <chr> <int>
## 1 SJU    5819
## 2 BQN     896
## 3 STT     522
## 4 PSE     365
# cure the violation of FKC
id_fk = anti_join(flights, airports, by = c("dest" = "faa"))$id
flights = mutate(flights, dest = ifelse(id %in% id_fk, NA, dest))

# find rows violating FKC
flights %>%
  anti_join(airports, by = c("origin" = "faa")) %>%
  count(dest, sort = TRUE)
## # A tibble: 0 x 2
## # … with 2 variables: dest <chr>, n <int>
# find rows violating FKC
flights %>%
  anti_join(airlines, by = "carrier") %>%
  count(dest, sort = TRUE)
## # A tibble: 0 x 2
## # … with 2 variables: dest <chr>, n <int>

Export data frames into CSV files.

Here we export the dataframes into CSV files.

write.table(airlines, file = "airlines.csv", sep = ",", row.names=FALSE, col.names = FALSE, quote=FALSE, na="", qmethod = "double")
write.table(airports, file = "airports.csv", sep = ",", row.names=FALSE, col.names = FALSE, quote=FALSE, na="", qmethod = "double")
write.table(planes, file = "planes.csv", sep = ",", row.names=FALSE, col.names = FALSE, quote=FALSE, na="", qmethod = "double")
write.table(weather, file = "weather.csv", sep = ",", row.names=FALSE, col.names = FALSE, quote=FALSE, na="", qmethod = "double")
write.table(flights, file = "flights.csv", sep = ",", row.names=FALSE, col.names = FALSE, quote=FALSE, na="", qmethod = "double")

Notice: no row and column names, no quotation of strings, NA as empty string in the CSV files.

Create a database in SQLite

Here we move to SQLite shell and create a database called nycflights13:

sqlite nycflights13

Or to open an existing database:

.open nycflights13

Check there is a file called nycflights13 in the working folder (this is indeed the database).

.databases

Create tables with primary and foreign key constraints

Here we use SQL as a data definition language. The commands to create the schema and the integrity constraints are contained in file create.sql

.read create.sql

You can list names of tables:

.tables

Or show the schema:

.schema

Here is the full syntax of SQL create table statement. Drop a table with drop table. The alter table command allows the user to rename a table or to add a new column to an existing table.

Insert data into tables using the exported CSV files

Here we read data from the CSV files and import them into the database, generating a database instance.

Show settings:

 .show

Set column separator to comma and import mode to CSV:

.separator ,
.mode csv

Import CSV files into tables:

.import airlines.csv airlines
.import airports.csv airports
.import planes.csv planes 
.import weather.csv weather
.import flights.csv flights

Retrieve first 10 rows of tables just populated:

select * from airlines limit 10; 
select * from airports limit 10; 
select * from planes limit 10; 
select * from weather limit 10; 
select * from flights limit 10;

Replace empty values (NA values) with NULL values:

update flights set dep_delay = NULL where dep_delay = "";
update flights set arr_delay = NULL where arr_delay = "";
update flights set dep_time = NULL where dep_time = "";
update flights set arr_time = NULL where arr_time = "";
update flights set tailnum = NULL where tailnum = "";
update flights set dest = NULL where dest = "";

Create indexes

Memory in computers divides into:

  • primary memory, which is fast, directly accessible, expensive and volatile;
  • secondary memory is slow, inexpensive and permanent. It’s content, in order to be processed, must first be loaded into primary memory.

Databases are typically stored in secondary memory and loaded into primary memory for computation. Secondary memory readings, therefore, is the bottleneck for the database performance.

An auxiliary access structure, more frequently called an index, is a data structure that serves to speed up access to records in response to certain queries. Indexes are auxiliary and optional structures with respect to file storage structures and do not affect the primary organization of records in files. An index is defined on one or more attributes of a table and, in general, the queries involving the indices are faster. In fact, in order to find a record of which the value of the field is known, instead of scanning the data file, it is possible to search the index file for the address of the block containing the record. The corresponding block is then loaded into primary memory and the record is searched in the block in main memory.

Indexes work similarly to the index of a textbook, that is an ordered list of terms associated with the pages in which these terms appear in the text. To find a certain topic in the text, you can scan the book, or, more efficiently, find the term in the sorted and smaller index, retrieve the page number and then directly access the page in the text.

Good practices include:

  1. define an index on the primary key of each table and on the foreign keys. Indeed, selection and join operations often involve these attributes;
  2. define an index on other attributes frequently used in queries;
  3. if the table is subject to frequent updates, then minimize the use of the indexes on the table, as the indexes must be updated after each update of the table.

In our database, we define unique indices on the primary keys of flights:

create unique index airlines_index on airlines(carrier); 
create unique index airports_index on airports(faa); 
create unique index planes_index on planes(tailnum);
create unique index weather_index on weather(id);
create unique index flights_index on flights(id);

Moreover, we define indices on the foreign keys as well:

create index flights_tailnum on flights(tailnum); 
create index flights_origin on flights(origin); 
create index flights_dest on flights(dest); 
create index flights_carrier on flights(carrier); 

Here is the full syntax of SQL create index statement. Drop an index with drop index.

Backup the database

Let’s backup our database.

.backup nycflights13_backup_20200424075949

Check that there is a new file called nycflights13_backup_20200424075949 in your folder. This is the backed up database. It’s good practice to timestamp the backup file, so you can keep track of different versions of the database.

Violate integrity constraints

We try to force integrity of the database.

We first try to violate primary key constraint. Let’s first select all from airlines.

select * from airlines;

You can see there is an airline with carrier 9E. Since carrier is the key of the table, we cannot insert another airline with this carrier. But we try!

insert into airlines values ("9E", "My Air");

The insertion failed with the message:

Error: UNIQUE constraint failed: airlines.carrier

Let’s now try to violate foreign key constraints. First enable foreign key constraint check (this is not the default):

PRAGMA foreign_keys = ON;

The carrier XY does not exist in the airlines table, hence we are not allowed to insert a flight flying with this carrier. Let’s try anyway:

insert into flights(id, carrier) values ("0", "XY");

The insertion failed with the message:

Error: FOREIGN KEY constraint failed

Here is the full syntax of SQL insert statement.

Let’s now delete carrier UA. This is not possible, since this carrier is referenced in the flights table, hence a deletion of this carrier would make the database inconsistent:

delete from airlines where carrier = "UA";

Again, the deletion failed because of foreign key constraint violation. Here is the full syntax of SQL delete statement.

The same happens if we try to update carrier UA with XY:

update airlines set carrier = "XY" where carrier = "UA";

Here is the full syntax of SQL update statement.

It is possible to check if the database violates foreign key constraints with PRAGMA foreign_key_check command:

PRAGMA foreign_key_check;

Notice that the PRAGMA statements is an SQL extension specific to SQLite and is not compatible with any other SQL database engine.

Restore from backup

If you did mistakes, do not panic. You can restore the database from the previous backup:

.restore nycflights13_backup_20200424075949

Query

Besides definition and manipulation, SQL can be used to query, that is retrieve, data from the database. An SQL query is a statement that inputs one or more tables of the database and outputs a table. A query in SQL is a select statement and can have different components, called clauses, including from, where, group by, and order.

Select and from

The first line of the query is called the select clause and is used to select the columns of the table that interest us. The * operator allows you to select all columns. The second line of the query is called from clause and serves to indicate which tables to use.

Here we select all columns from flights table:

select *
from flights

Or only some of them:

select id, flight, carrier
from flights

Tip: run command .mode column on SQLite shell for a table output. Write queries on a file, for instance query.sql, and run them with command .read query.sql. Use limit 10 at the end of the query to limit output to 10 rows.

Where

The where clause defines a predicate on the attribute values of the selected tables. The rows that satisfy the predicate, i.e. for which predicate is true, are inserted in the result table.

select *
from planes
where manufacturer = "AIRBUS INDUSTRIE"
select *
from planes
where engines > 2

A predicate is a combination of atomic predicates with the Boolean operators and, or and not. An atomic predicate is obtained by comparing two expressions on attribute values using the following comparison operators: =, <>, <,>, <=,> =. Examples of expressions are an attribute and a constant. String, time and date constants are written in quotes, but you do not need quotes for numbers. In addition, the like operator allows a comparison with strings containing special values _ (any character) and % (an arbitrary, possibly empty, character sequence). You can compare the value of an expression with the null value with the is null and is not null predicates.

Time to practice! Write and run the following queries:

-- flights on Christmas
select *
from flights 
where ___

-- id, departure delay and arrival delay of flights that have either a null departure delay or a null arrival delay
select ___
from flights 
where ___ is null or ___ is null

-- id, departure time and arrival time of flights that have a not null departure time and a not null arrival time. 
-- First version using is not null
select ___
from flights 
where dep_time ___ and arr_time ___

-- Second version using not()
select id, dep_time, arr_time
from flights 
where not(dep_time ___ or ___ is null)

-- planes with tailnum that starts with N
select *
from planes
where tailnum like ___


-- planes with tailnum that starts with N10
select *
from planes
where tailnum like ___

The logic of the predicates in SQL is three values: true, false and unknown. Unknown means that either the attribute does not exists or the value of it is not known. Unknown values are represented with constant NULL in SQL and NA in R, but the logic is the same (use is.na function in R to test for NA).

Any comparison in which at least one component has unknown value results in the unknown value. The exceptions are is null and is not null, the value of which is always true or false, even if the comparison value is unknown:

0 == NA
## [1] NA
0 != NA
## [1] NA
NA == NA
## [1] NA
is.na(NA)
## [1] TRUE
!is.na(NA)
## [1] FALSE

Boolean operators are extended to the unknown value in the following way:

TRUE & NA
## [1] NA
FALSE & NA
## [1] FALSE
NA & NA
## [1] NA
TRUE | NA
## [1] TRUE
FALSE | NA
## [1] NA
NA | NA
## [1] NA
!NA
## [1] NA

Hence, what is the result of the following query? All planes?

select *
from planes
where (engines > 2) or (engine <= 2)

Order by

The order by clause defines an ordering of the rows of the result table:

select *
from planes
order by engines
select *
from planes
order by engines desc

Write and run the following ones:

-- id and departure delay of flights that have a positive departure delay sorted by delay departure in decreasing order

select id, dep_delay
from flights 
where ___
order by ___


-- id, departure and arrival delays, and catch up of flights that catched up during the flight sorted by catch up time in decreasing order
select id, dep_delay, arr_delay, (arr_delay - dep_delay) as catchup
from ___
where ___
order by ___

Notice the use of the as clause to create a new computed attribute in the output.

Group by and having

An aggregation operator is a function that inputs a set of tuples in a table and outputs an atomic value. The SQL standard provides the following aggregation operators: count, min, max, sum, avg.

--- count tuples of planes
select count(*)
from planes

--- count number of not null values for attribute manufacturer in table planes
select count(manufacturer)
from planes

--- count number of distinct, not null values for attribute manufacturer in table planes
select count(distinct manufacturer)
from planes

--- summary of seats of planes
select max(seats), min(seats), sum(seats), avg(seats), sum(seats) / count(seats)
from planes

The examples of aggregation operators seen so far operate on the set of all the rows in a table. The group by clause allows you to partition the rows of a table into subsets and apply the aggregation operators to the individual subsets. This clause refers to a set of attributes and groups the rows that have the same value for the attributes of the argument set. In the select clause you can use aggregate operators and attributes but the latter must be included in the group by clause.

Try these ones:

-- the number of flights per destination
select dest, count(*) as count
from flights
group by dest

-- the top ten destination by number of flights

-- the number of flights per month and day sorted in decreasing order
select month, day, count(*) as count
from ___
group by ___
order by ___

-- the mean departure delay per month and day sorted in decreasing order
select ___
from ___
group by ___
order by ___
  • What are the most and less popular destinations?
  • What are the most and less busy days?
  • What are the most and least delayed days?

If we want to select only the groups that satisfy a certain predicate we can use the having clause. A group predicate can use aggregation operators and the group by attributes.

Try these ones:

-- the busy days (those with more than 1000 flights)
select month, day, count(*) as count
from flights
group by month, day
having count > 1000

-- the popular destinations (those with more than 1000 flights) sorted by number of flights in descending order
select dest, count(*) as count
from ___
group by ___
having ___
order by ___

-- the mean departure delay per day sorted in decreasing order of all flights on busy days (those with more than 1000 flights) of July 
select day, count(*) as count, round(avg(dep_delay), 2) as delay
from ___
where ___
group by ___
having ___
order by ___

The last query uses all the SQL clauses. Pay attention to the difference between the where and having clauses: the where clause contains tuple predicates and serves to filter the tuples of the tables, the having clause contains group predicates and serves to filter the groups obtained by the group by clause. Conceptually, first we filter the tuples and then we make and filter the groups.

Set operators

SQL allows to make the union (union), the intersection (intersect) and the difference (except) between sets of rows of tables. These operators require that the schemas they operate on have the same number of attributes and that their domains are compatible. By default, these operators eliminate duplicates from the result, thus returning a set in the mathematical sense of the term. It is possible to recover the duplicates by following the keyword all to the name of the operator.

-- airports that are either origins or destinations (without duplicates)
select origin
from flights
union
select dest
from flights

-- airports that are either origins or destinations (with duplicates)
select origin
from flights
union all
select dest
from flights
  

-- airports that are both origins and destinations
select origin
from flights
intersect
select dest
from flights

-- airports that are either origins but not destinations
select origin
from flights
except
select dest
from flights

Join

So far we have seen queries on a single table. In the relational model, each independent concept is represented with a table and the correspondences between the data in different tables are made by comparing the values of the attributes (generally between a foreign key and a primary key). In particular, join queries exploit the value-based correspondence model, typical of the relational model, to retrieve related data that are distributed over more than one table.

-- flights that flew with a plane manufactured by BOEING
select flights.id, flights.tailnum, planes.manufacturer
from flights, planes
where flights.tailnum = planes.tailnum and planes.manufacturer = "BOEING"

The join condition is flights.tailnum = planes.tailnum: it matches the foreign key tailnum of flights with the primary key tailnum of planes. Notice that different tables can share the same attribute names, hence it is good practice in join queries to prefix the attribute name with the table name. More generally, a join condition between two tables R and S is a conjunction (according to the and Boolean operator) of atomic conditions of type \(A \, \theta \, B\), where \(A\) is an attribute of R, \(B\) is an attribute of S, and \(\theta\) is any comparison operator. Usually, but not necessarily, \(A\) is a primary key, \(B\) is an external key referring to \(A\) and \(\theta\) is the equality operator (=).

An operational view of the join is helpful to understand the join operation. Suppose we join tables R and S with a join condition \(\theta\). If we were to plan this operation, we would have to write two nested loops. The outer loop scans the rows of table R. For each row \(r\) of R, the inner loop scans the rows \(s\) of table S and checks whether the joint row \(r \cdot s\) satisfies \(\theta\) condition. In this case the row is filtered in the output.

Let’s try another join:

-- flights that flew to a destination with an altitude greater than 6000 feet sorted by altitude
select flights.id, flights.dest, airports.name, airports.alt
from ___, ___
where ___ and ___
order by ___

You can join more than two tables. However, mind that the computational complexity grown exponentially in the number of joint tables. The following query is computationally heavy hence limit the result to 10 rows:

-- flights that took off with a plane with 4 engines and a visibility lower than 3 miles
select flights.id, planes.engines, weather.visib
from flights, weather, planes
where flights.time_hour = weather.time_hour and flights.origin = weather.origin 
      and flights.tailnum = planes.tailnum and weather.visib < 3  and planes.engines = 4
limit 10

You can join the same two tables. In this case, you need to rename tables with as operator:

-- flight paths of length 2 (X --> Y --> Z)
select distinct f1.origin, f1.dest, f2.dest
from flights as f1, flights as f2
where f1.dest = f2.origin


-- flight paths of length 3 (X --> Y --> W --> Z)
select distinct f1.origin, f1.dest, f2.dest, f3.dest
from ___, ___, ___
where ___ and ___



-- flight paths of arbitrary length (this is tricky!)

Can you express the last one with SQL? Why? We will come back to this issue when discussing expressiveness of SQL.

Finally, consider the following:

-- flights with destination and origin airports with an altitude difference of more than 6000 feet
select flights.id, airports2.alt, airports1.alt, (airports2.alt - airports1.alt) as altdiff
from flights, airports as airports1, airports as airports2
where flights.origin = airports1.faa and flights.dest = airports2.faa and altdiff > 6000

The join operation is so common in relational databases that there exists (an optional) special clause called join. For instance, you can write the first join query as follows:

-- flights that flew with a plane manufactured by BOEING
select flights.id, flights.tailnum, planes.manufacturer
from flights join planes on flights.tailnum = planes.tailnum
where planes.manufacturer = "BOEING"

Using join clause explicitly has two advantages:

  • it is cleaner, since it separates join conditions (in the join clause) from other tuple conditions (in the where clause)
  • it allows to distinguish between different types of joins: inner join, left outer join, right outer join, and full outer join.
    1. inner join filters the joint rows that satisfy the join condition (this is the one we used so far);
    2. left outer join filters the joint rows that satisfy the join condition plus the the rows of the left hand side table that find no match in the right table (these rows will have NULL values for the right table attributes in the output);
    3. right outer join filters the joint rows that satisfy the join condition plus the the rows of the right hand side table that find no match in the left table (these rows will have NULL values for the left table attributes in the output);
    4. full outer join is the distinct union of left and right outer joins: it filters the joint rows that satisfy the join condition, plus left hand side table rows that find no match in the right table, plus right hand side table rows that find no match in the left table.

Let’s try the left join:

-- flights and the corresponding plane manufactured, including flights that have no match in planes (there are some)
select flights.id, flights.tailnum, planes.manufacturer
from flights left join planes on flights.tailnum = planes.tailnum

-- inner count (only those with a match)
select count(*)
from flights join planes on flights.tailnum = planes.tailnum

-- outer count (all flights)
select count(*)
from flights left join planes on flights.tailnum = planes.tailnum

Right and full joins are not implemented in SQLite. However, they can be simulated using left join and union operators. How?

-- simulate a right join on planes and flights using left join
select  planes.manufacturer, flights.id, flights.tailnum
from planes right join flights on planes.tailnum = flights.tailnum 



-- simulate a full join on planes and flights using left join and union
select flights.id, flights.tailnum, planes.manufacturer
from flights full join planes on flights.tailnum = planes.tailnum


Why SQL?

It is reasonable to ask why SQL offers exactly the described features and not others. An answer to this question is provided by the theory of relational databases.

As we have already said, the strength of the relational model lies in its practical-theoretical duality: a table is at the same time a concrete object that is simple to describe and understand but also the representation of a mathematical relation. The theory of relational databases is a mathematical theory that studies the properties (such as expressiveness and complexity) of the relational model and related query languages.

So let’s try to reinterpret SQL within this theory. The discussion will necessarily be informal and intuitive. For a formal discussion see Sections 2.1 and 2.2 of the Elements of Relational Database Theory of Paris C. Kanellakis.

The fundamental operations underlying SQL language are:

These operations form the core of the language and are the basis for defining other SQL constructs (for example, the intersection operation can be defined in terms of union and difference). These operations form the relational algebra: the algebra operates on relations with the above fundamental operations. A relational algebra expression evaluates to or computes a relation.

Relational algebra is a procedural language: a program (expression) of the algebra specifies how to calculate the result, and not what result we want to achieve, as happens for declarative languages (for example SQL). Therefore relational algebra is a procedural (or operational) counterpart of the SQL language. We can therefore answer the initial question, that is, why SQL has just those functionalities and not others, focusing on its procedural counterpart. In particular, the choice of fundamental operations of relational algebra is not dictated by chance but is justified by the following properties. The first one concerns the expressive power of the algebra:

Relative algebra has the expressiveness of the first order predicate calculus

This result establishes the expressiveness of the relational algebra: the fundamental operations on which algebra is defined and which are the basis of SQL allow to specify all the formulas of the first order predicate calculus. Moreover this result also sets the limits of the relational algebra (and of SQL). For example, it is well known that the transitive closure operation of a relation (such as finding paths of arbitrary length on a graph) is not definable at in first order logic. This operation will therefore not be definable neither in relational algebra nor in SQL (this is why we cannot write in SQL a query that finds all flight paths in our NYC flight database).

As for computational complexity, the problem of the evaluation of an expression of the relational algebra with respect to a database is complete in the complexity class PSPACE, i.e. in the class of problems solvable using polynomial space. This seems to be bad news, as it means that polynomial algorithms for this problem do not exist. In fact, the complexity of evaluating an expression is exponential in the dimension of expression and polynomial in the size of the database. The size of the database is far larger to that of the query, so much so that the complexity of the query is often considered as a constant and no longer as a parameter of complexity. Therefore fixing the dimension of the query as constant, and therefore referring only to the complexity of the data, we can have that:

The problem of relational algebra evaluation is in PTIME, the class of problems solvable in polynomial time. In particular, the problem lies in the LOGSPACE class, a subclass of PTIME, which corresponds to problems that can be solved in logarithmic space.

The LOGSPACE class is low class in the hierarchy of computational complexity classes. In particular, the problems contained in this class can be solved in parallel ways.

The last nice property of relational algebra is optimization:

Each expression in the select-from-where fragment of SQL can be transformed, with polynomial complexity with respect to its size, in an equivalent expression of the relational algebra and vice versa.

This property characterizes relational algebra as the procedural counterpart of SQL. Note that the transformation from SQL to algebra is efficient. This result opens the door to a series of optimizations that can be taken during the evaluation of an SQL query. An SQL query, to be evaluated, is first transformed into an expression in relational algebra. This expression is then rewritten in some equivalent but optimized version with respect to some measure of complexity. Since the most expensive operation of relational algebra is the join and the complexity of the join depends on the cardinality of the argument tables, the goal of rewriting relational expressions is to minimize the number of join operations and to perform these operations on small tables. This last result is obtained by pushing the selection inside the joins in order to filter the tables before submitting them to the joining operation.

All in all, the theoretical motivations behind the success of SQL are its expressiveness and the ability to efficiently solve its queries.