Structured Query Language (SQL) is a language that allows:
SQL was one of the first commercial languages for Edgar F. Codd’s relational model, as described in his influential 1970 paper, “A Relational Model of Data for Large Shared Data Banks”.
SQL is a declarative language: it allows you to specify what to retrieve without saying how. When a query is executed by the query processor, it is translated into a procedural language that specifies how to access the data. There are generally more than one possible translations of an SQL query into the procedural language. The task of the query optimizer is to choose the most efficient execution plan.
SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite it a good choice for data analysis:
People who understand SQL can employ the sqlite3 command-line shell (or various third-party SQLite access programs) to analyze large datasets. Raw data can be imported from CSV files, then that data can be sliced and diced to generate a myriad of summary reports. More complex analysis can be done using simple scripts written in Tcl or Python (both of which come with SQLite built-in) or in R or other languages using readily available adaptors. (…) The same thing can be done with an enterprise client/server database, of course. The advantage of SQLite is that it is easier to install and use and the resulting database is a single file that can be written to a USB memory stick or emailed to a colleague.
Furthermore, it is helpful in education and training:
Because it is simple to setup and use (installation is trivial: just copy the sqlite3 or sqlite3exe executable to the target machine and run it) SQLite makes a good database engine for use in teaching SQL. Students can easily create as many databases as they like and can email databases to the instructor for comments or grading. For more advanced students who are interested in studying how an RDBMS is implemented, the modular and well-commented and documented SQLite code can serve as a good basis.
We are going to create a database in SQLite corresponding to the dataset nycflights13 of NYC flights in 2013. In particular, we will follow these steps:
Download the latest source version of SQLite from https://www.sqlite.org. On UNIX systems compile the source with:
gcc shell.c sqlite3.c -lpthread -ldl -o sqlite
You may also download the precompiled binaries for your platform (in the following I have renamed sqlite3 with sqlite). It is handy to add the SQLite folder to the $PATH system variable. For instance, on Mac OS, edit file /etc/paths with command:
sudo nano /etc/paths
and add the SQLite folder at the bottom. Save and exit. Close and re-open the terminal. Check with
echo $PATH
Table flights doesn’t have an explicit primary key. Indeed, the attribute sequence year, month, day, sched_dep_time, flight, carrier is a key for flights:
library(nycflights13)
library(dplyr)
flights %>%
count(year, month, day, sched_dep_time, flight, carrier) %>%
filter(n > 1)
## # A tibble: 0 x 7
## # … with 7 variables: year <int>, month <int>, day <int>, sched_dep_time <int>,
## # flight <int>, carrier <chr>, n <int>
This means that there are no two rows with the same values for attributes year, month, day, sched_dep_time, flight, carrier. However, this key sequence is rather long (6 attributes) and hence neither handy nor efficient. If a table lacks a natural primary key, it’s sometimes useful to add one with mutate()
and row_number()
. That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key:
flights <-
flights %>%
arrange(year, month, day, sched_dep_time) %>%
mutate(id = row_number()) %>%
select(id, everything())
The statement above first sorts flights in chronological increasing order and then adds to the flights table an id attribute corresponding to the table row number. Hence, each flight has a unique serial number.
We also add an id attribute for the weather table:
weather <-
weather %>%
arrange(year, month, day, origin) %>%
mutate(id = row_number()) %>%
select(id, everything())
On the other hand, the other tables have key attributes:
planes %>%
count(tailnum) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # … with 2 variables: tailnum <chr>, n <int>
airports %>%
count(faa) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # … with 2 variables: faa <chr>, n <int>
airlines %>%
count(carrier) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # … with 2 variables: carrier <chr>, n <int>
Unfortunately, some foreign key constraints are not satisfied by our dataset:
# find rows violating FKC
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
## tailnum n
## <chr> <int>
## 1 <NA> 2512
## 2 N725MQ 575
## 3 N722MQ 513
## 4 N723MQ 507
## 5 N713MQ 483
## 6 N735MQ 396
## 7 N0EGMQ 371
## 8 N534MQ 364
## 9 N542MQ 363
## 10 N531MQ 349
## # … with 712 more rows
# cure the violation of FKC
id_fk = anti_join(flights, planes, by = "tailnum")$id
flights = mutate(flights, tailnum = ifelse(id %in% id_fk, NA, tailnum))
# find rows violating FKC
flights %>%
anti_join(airports, by = c("dest" = "faa")) %>%
count(dest, sort = TRUE)
## # A tibble: 4 x 2
## dest n
## <chr> <int>
## 1 SJU 5819
## 2 BQN 896
## 3 STT 522
## 4 PSE 365
# cure the violation of FKC
id_fk = anti_join(flights, airports, by = c("dest" = "faa"))$id
flights = mutate(flights, dest = ifelse(id %in% id_fk, NA, dest))
# find rows violating FKC
flights %>%
anti_join(airports, by = c("origin" = "faa")) %>%
count(dest, sort = TRUE)
## # A tibble: 0 x 2
## # … with 2 variables: dest <chr>, n <int>
# find rows violating FKC
flights %>%
anti_join(airlines, by = "carrier") %>%
count(dest, sort = TRUE)
## # A tibble: 0 x 2
## # … with 2 variables: dest <chr>, n <int>
Here we export the dataframes into CSV files.
write.table(airlines, file = "airlines.csv", sep = ",", row.names=FALSE, col.names = FALSE, quote=FALSE, na="", qmethod = "double")
write.table(airports, file = "airports.csv", sep = ",", row.names=FALSE, col.names = FALSE, quote=FALSE, na="", qmethod = "double")
write.table(planes, file = "planes.csv", sep = ",", row.names=FALSE, col.names = FALSE, quote=FALSE, na="", qmethod = "double")
write.table(weather, file = "weather.csv", sep = ",", row.names=FALSE, col.names = FALSE, quote=FALSE, na="", qmethod = "double")
write.table(flights, file = "flights.csv", sep = ",", row.names=FALSE, col.names = FALSE, quote=FALSE, na="", qmethod = "double")
Notice: no row and column names, no quotation of strings, NA as empty string in the CSV files.
Here we move to SQLite shell and create a database called nycflights13:
sqlite nycflights13
Or to open an existing database:
.open nycflights13
Check there is a file called nycflights13 in the working folder (this is indeed the database).
.databases
Here we use SQL as a data definition language. The commands to create the schema and the integrity constraints are contained in file create.sql
.read create.sql
You can list names of tables:
.tables
Or show the schema:
.schema
Here is the full syntax of SQL create table statement. Drop a table with drop table. The alter table command allows the user to rename a table or to add a new column to an existing table.
Here we read data from the CSV files and import them into the database, generating a database instance.
Show settings:
.show
Set column separator to comma and import mode to CSV:
.separator , .mode csv
Import CSV files into tables:
.import airlines.csv airlines .import airports.csv airports .import planes.csv planes .import weather.csv weather .import flights.csv flights
Retrieve first 10 rows of tables just populated:
select * from airlines limit 10; select * from airports limit 10; select * from planes limit 10; select * from weather limit 10; select * from flights limit 10;
Replace empty values (NA values) with NULL values:
update flights set dep_delay = NULL where dep_delay = ""; update flights set arr_delay = NULL where arr_delay = ""; update flights set dep_time = NULL where dep_time = ""; update flights set arr_time = NULL where arr_time = ""; update flights set tailnum = NULL where tailnum = ""; update flights set dest = NULL where dest = "";
Memory in computers divides into:
Databases are typically stored in secondary memory and loaded into primary memory for computation. Secondary memory readings, therefore, is the bottleneck for the database performance.
An auxiliary access structure, more frequently called an index, is a data structure that serves to speed up access to records in response to certain queries. Indexes are auxiliary and optional structures with respect to file storage structures and do not affect the primary organization of records in files. An index is defined on one or more attributes of a table and, in general, the queries involving the indices are faster. In fact, in order to find a record of which the value of the field is known, instead of scanning the data file, it is possible to search the index file for the address of the block containing the record. The corresponding block is then loaded into primary memory and the record is searched in the block in main memory.
Indexes work similarly to the index of a textbook, that is an ordered list of terms associated with the pages in which these terms appear in the text. To find a certain topic in the text, you can scan the book, or, more efficiently, find the term in the sorted and smaller index, retrieve the page number and then directly access the page in the text.
Good practices include:
In our database, we define unique indices on the primary keys of flights:
create unique index airlines_index on airlines(carrier); create unique index airports_index on airports(faa); create unique index planes_index on planes(tailnum); create unique index weather_index on weather(id); create unique index flights_index on flights(id);
Moreover, we define indices on the foreign keys as well:
create index flights_tailnum on flights(tailnum); create index flights_origin on flights(origin); create index flights_dest on flights(dest); create index flights_carrier on flights(carrier);
Here is the full syntax of SQL create index statement. Drop an index with drop index.
Let’s backup our database.
.backup nycflights13_backup_20200424075949
Check that there is a new file called nycflights13_backup_20200424075949 in your folder. This is the backed up database. It’s good practice to timestamp the backup file, so you can keep track of different versions of the database.
We try to force integrity of the database.
We first try to violate primary key constraint. Let’s first select all from airlines.
select * from airlines;
You can see there is an airline with carrier 9E. Since carrier is the key of the table, we cannot insert another airline with this carrier. But we try!
insert into airlines values ("9E", "My Air");
The insertion failed with the message:
Error: UNIQUE constraint failed: airlines.carrier
Let’s now try to violate foreign key constraints. First enable foreign key constraint check (this is not the default):
PRAGMA foreign_keys = ON;
The carrier XY does not exist in the airlines table, hence we are not allowed to insert a flight flying with this carrier. Let’s try anyway:
insert into flights(id, carrier) values ("0", "XY");
The insertion failed with the message:
Error: FOREIGN KEY constraint failed
Here is the full syntax of SQL insert statement.
Let’s now delete carrier UA. This is not possible, since this carrier is referenced in the flights table, hence a deletion of this carrier would make the database inconsistent:
delete from airlines where carrier = "UA";
Again, the deletion failed because of foreign key constraint violation. Here is the full syntax of SQL delete statement.
The same happens if we try to update carrier UA with XY:
update airlines set carrier = "XY" where carrier = "UA";
Here is the full syntax of SQL update statement.
It is possible to check if the database violates foreign key constraints with PRAGMA foreign_key_check command:
PRAGMA foreign_key_check;
Notice that the PRAGMA statements is an SQL extension specific to SQLite and is not compatible with any other SQL database engine.
If you did mistakes, do not panic. You can restore the database from the previous backup:
.restore nycflights13_backup_20200424075949
Besides definition and manipulation, SQL can be used to query, that is retrieve, data from the database. An SQL query is a statement that inputs one or more tables of the database and outputs a table. A query in SQL is a select statement and can have different components, called clauses, including from, where, group by, and order.
The first line of the query is called the select clause and is used to select the columns of the table that interest us. The * operator allows you to select all columns. The second line of the query is called from clause and serves to indicate which tables to use.
Here we select all columns from flights table:
select * from flights
Or only some of them:
select id, flight, carrier from flights
Tip: run command .mode column
on SQLite shell for a table output. Write queries on a file, for instance query.sql, and run them with command .read query.sql
. Use limit 10
at the end of the query to limit output to 10 rows.
The where clause defines a predicate on the attribute values of the selected tables. The rows that satisfy the predicate, i.e. for which predicate is true, are inserted in the result table.
select * from planes where manufacturer = "AIRBUS INDUSTRIE"
select * from planes where engines > 2
A predicate is a combination of atomic predicates with the Boolean operators and, or and not. An atomic predicate is obtained by comparing two expressions on attribute values using the following comparison operators: =, <>, <,>, <=,> =. Examples of expressions are an attribute and a constant. String, time and date constants are written in quotes, but you do not need quotes for numbers. In addition, the like
operator allows a comparison with strings containing special values _
(any character) and %
(an arbitrary, possibly empty, character sequence). You can compare the value of an expression with the null value with the is null
and is not null
predicates.
Time to practice! Write and run the following queries:
-- flights on Christmas select * from flights where ___ -- id, departure delay and arrival delay of flights that have either a null departure delay or a null arrival delay select ___ from flights where ___ is null or ___ is null -- id, departure time and arrival time of flights that have a not null departure time and a not null arrival time. -- First version using is not null select ___ from flights where dep_time ___ and arr_time ___ -- Second version using not() select id, dep_time, arr_time from flights where not(dep_time ___ or ___ is null) -- planes with tailnum that starts with N select * from planes where tailnum like ___ -- planes with tailnum that starts with N10 select * from planes where tailnum like ___
The logic of the predicates in SQL is three values: true, false and unknown. Unknown means that either the attribute does not exists or the value of it is not known. Unknown values are represented with constant NULL
in SQL and NA
in R, but the logic is the same (use is.na
function in R to test for NA).
Any comparison in which at least one component has unknown value results in the unknown value. The exceptions are is null
and is not null
, the value of which is always true or false, even if the comparison value is unknown:
0 == NA
## [1] NA
0 != NA
## [1] NA
NA == NA
## [1] NA
is.na(NA)
## [1] TRUE
!is.na(NA)
## [1] FALSE
Boolean operators are extended to the unknown value in the following way:
TRUE & NA
## [1] NA
FALSE & NA
## [1] FALSE
NA & NA
## [1] NA
TRUE | NA
## [1] TRUE
FALSE | NA
## [1] NA
NA | NA
## [1] NA
!NA
## [1] NA
Hence, what is the result of the following query? All planes?
select * from planes where (engines > 2) or (engine <= 2)
The order by clause defines an ordering of the rows of the result table:
select * from planes order by engines
select * from planes order by engines desc
Write and run the following ones:
-- id and departure delay of flights that have a positive departure delay sorted by delay departure in decreasing order select id, dep_delay from flights where ___ order by ___ -- id, departure and arrival delays, and catch up of flights that catched up during the flight sorted by catch up time in decreasing order select id, dep_delay, arr_delay, (arr_delay - dep_delay) as catchup from ___ where ___ order by ___
Notice the use of the as
clause to create a new computed attribute in the output.
An aggregation operator is a function that inputs a set of tuples in a table and outputs an atomic value. The SQL standard provides the following aggregation operators: count, min, max, sum, avg.
--- count tuples of planes select count(*) from planes --- count number of not null values for attribute manufacturer in table planes select count(manufacturer) from planes --- count number of distinct, not null values for attribute manufacturer in table planes select count(distinct manufacturer) from planes --- summary of seats of planes select max(seats), min(seats), sum(seats), avg(seats), sum(seats) / count(seats) from planes
The examples of aggregation operators seen so far operate on the set of all the rows in a table. The group by clause allows you to partition the rows of a table into subsets and apply the aggregation operators to the individual subsets. This clause refers to a set of attributes and groups the rows that have the same value for the attributes of the argument set. In the select clause you can use aggregate operators and attributes but the latter must be included in the group by clause.
Try these ones:
-- the number of flights per destination select dest, count(*) as count from flights group by dest -- the top ten destination by number of flights -- the number of flights per month and day sorted in decreasing order select month, day, count(*) as count from ___ group by ___ order by ___ -- the mean departure delay per month and day sorted in decreasing order select ___ from ___ group by ___ order by ___
If we want to select only the groups that satisfy a certain predicate we can use the having clause. A group predicate can use aggregation operators and the group by attributes.
Try these ones:
-- the busy days (those with more than 1000 flights) select month, day, count(*) as count from flights group by month, day having count > 1000 -- the popular destinations (those with more than 1000 flights) sorted by number of flights in descending order select dest, count(*) as count from ___ group by ___ having ___ order by ___ -- the mean departure delay per day sorted in decreasing order of all flights on busy days (those with more than 1000 flights) of July select day, count(*) as count, round(avg(dep_delay), 2) as delay from ___ where ___ group by ___ having ___ order by ___
The last query uses all the SQL clauses. Pay attention to the difference between the where and having clauses: the where clause contains tuple predicates and serves to filter the tuples of the tables, the having clause contains group predicates and serves to filter the groups obtained by the group by clause. Conceptually, first we filter the tuples and then we make and filter the groups.
SQL allows to make the union (union), the intersection (intersect) and the difference (except) between sets of rows of tables. These operators require that the schemas they operate on have the same number of attributes and that their domains are compatible. By default, these operators eliminate duplicates from the result, thus returning a set in the mathematical sense of the term. It is possible to recover the duplicates by following the keyword all
to the name of the operator.
-- airports that are either origins or destinations (without duplicates) select origin from flights union select dest from flights -- airports that are either origins or destinations (with duplicates) select origin from flights union all select dest from flights -- airports that are both origins and destinations select origin from flights intersect select dest from flights -- airports that are either origins but not destinations select origin from flights except select dest from flights
So far we have seen queries on a single table. In the relational model, each independent concept is represented with a table and the correspondences between the data in different tables are made by comparing the values of the attributes (generally between a foreign key and a primary key). In particular, join queries exploit the value-based correspondence model, typical of the relational model, to retrieve related data that are distributed over more than one table.
-- flights that flew with a plane manufactured by BOEING select flights.id, flights.tailnum, planes.manufacturer from flights, planes where flights.tailnum = planes.tailnum and planes.manufacturer = "BOEING"
The join condition is flights.tailnum = planes.tailnum
: it matches the foreign key tailnum of flights with the primary key tailnum of planes. Notice that different tables can share the same attribute names, hence it is good practice in join queries to prefix the attribute name with the table name. More generally, a join condition between two tables R and S is a conjunction (according to the and Boolean operator) of atomic conditions of type \(A \, \theta \, B\), where \(A\) is an attribute of R, \(B\) is an attribute of S, and \(\theta\) is any comparison operator. Usually, but not necessarily, \(A\) is a primary key, \(B\) is an external key referring to \(A\) and \(\theta\) is the equality operator (=).
An operational view of the join is helpful to understand the join operation. Suppose we join tables R and S with a join condition \(\theta\). If we were to plan this operation, we would have to write two nested loops. The outer loop scans the rows of table R. For each row \(r\) of R, the inner loop scans the rows \(s\) of table S and checks whether the joint row \(r \cdot s\) satisfies \(\theta\) condition. In this case the row is filtered in the output.
Let’s try another join:
-- flights that flew to a destination with an altitude greater than 6000 feet sorted by altitude select flights.id, flights.dest, airports.name, airports.alt from ___, ___ where ___ and ___ order by ___
You can join more than two tables. However, mind that the computational complexity grown exponentially in the number of joint tables. The following query is computationally heavy hence limit the result to 10 rows:
-- flights that took off with a plane with 4 engines and a visibility lower than 3 miles select flights.id, planes.engines, weather.visib from flights, weather, planes where flights.time_hour = weather.time_hour and flights.origin = weather.origin and flights.tailnum = planes.tailnum and weather.visib < 3 and planes.engines = 4 limit 10
You can join the same two tables. In this case, you need to rename tables with as operator:
-- flight paths of length 2 (X --> Y --> Z) select distinct f1.origin, f1.dest, f2.dest from flights as f1, flights as f2 where f1.dest = f2.origin -- flight paths of length 3 (X --> Y --> W --> Z) select distinct f1.origin, f1.dest, f2.dest, f3.dest from ___, ___, ___ where ___ and ___ -- flight paths of arbitrary length (this is tricky!)
Can you express the last one with SQL? Why? We will come back to this issue when discussing expressiveness of SQL.
Finally, consider the following:
-- flights with destination and origin airports with an altitude difference of more than 6000 feet select flights.id, airports2.alt, airports1.alt, (airports2.alt - airports1.alt) as altdiff from flights, airports as airports1, airports as airports2 where flights.origin = airports1.faa and flights.dest = airports2.faa and altdiff > 6000
The join operation is so common in relational databases that there exists (an optional) special clause called join. For instance, you can write the first join query as follows:
-- flights that flew with a plane manufactured by BOEING select flights.id, flights.tailnum, planes.manufacturer from flights join planes on flights.tailnum = planes.tailnum where planes.manufacturer = "BOEING"
Using join clause explicitly has two advantages:
Let’s try the left join:
-- flights and the corresponding plane manufactured, including flights that have no match in planes (there are some) select flights.id, flights.tailnum, planes.manufacturer from flights left join planes on flights.tailnum = planes.tailnum -- inner count (only those with a match) select count(*) from flights join planes on flights.tailnum = planes.tailnum -- outer count (all flights) select count(*) from flights left join planes on flights.tailnum = planes.tailnum
Right and full joins are not implemented in SQLite. However, they can be simulated using left join and union operators. How?
-- simulate a right join on planes and flights using left join select planes.manufacturer, flights.id, flights.tailnum from planes right join flights on planes.tailnum = flights.tailnum -- simulate a full join on planes and flights using left join and union select flights.id, flights.tailnum, planes.manufacturer from flights full join planes on flights.tailnum = planes.tailnum
It is reasonable to ask why SQL offers exactly the described features and not others. An answer to this question is provided by the theory of relational databases.
As we have already said, the strength of the relational model lies in its practical-theoretical duality: a table is at the same time a concrete object that is simple to describe and understand but also the representation of a mathematical relation. The theory of relational databases is a mathematical theory that studies the properties (such as expressiveness and complexity) of the relational model and related query languages.
So let’s try to reinterpret SQL within this theory. The discussion will necessarily be informal and intuitive. For a formal discussion see Sections 2.1 and 2.2 of the Elements of Relational Database Theory of Paris C. Kanellakis.
The fundamental operations underlying SQL language are:
These operations form the core of the language and are the basis for defining other SQL constructs (for example, the intersection operation can be defined in terms of union and difference). These operations form the relational algebra: the algebra operates on relations with the above fundamental operations. A relational algebra expression evaluates to or computes a relation.
Relational algebra is a procedural language: a program (expression) of the algebra specifies how to calculate the result, and not what result we want to achieve, as happens for declarative languages (for example SQL). Therefore relational algebra is a procedural (or operational) counterpart of the SQL language. We can therefore answer the initial question, that is, why SQL has just those functionalities and not others, focusing on its procedural counterpart. In particular, the choice of fundamental operations of relational algebra is not dictated by chance but is justified by the following properties. The first one concerns the expressive power of the algebra:
Relative algebra has the expressiveness of the first order predicate calculus
This result establishes the expressiveness of the relational algebra: the fundamental operations on which algebra is defined and which are the basis of SQL allow to specify all the formulas of the first order predicate calculus. Moreover this result also sets the limits of the relational algebra (and of SQL). For example, it is well known that the transitive closure operation of a relation (such as finding paths of arbitrary length on a graph) is not definable at in first order logic. This operation will therefore not be definable neither in relational algebra nor in SQL (this is why we cannot write in SQL a query that finds all flight paths in our NYC flight database).
As for computational complexity, the problem of the evaluation of an expression of the relational algebra with respect to a database is complete in the complexity class PSPACE, i.e. in the class of problems solvable using polynomial space. This seems to be bad news, as it means that polynomial algorithms for this problem do not exist. In fact, the complexity of evaluating an expression is exponential in the dimension of expression and polynomial in the size of the database. The size of the database is far larger to that of the query, so much so that the complexity of the query is often considered as a constant and no longer as a parameter of complexity. Therefore fixing the dimension of the query as constant, and therefore referring only to the complexity of the data, we can have that:
The problem of relational algebra evaluation is in PTIME, the class of problems solvable in polynomial time. In particular, the problem lies in the LOGSPACE class, a subclass of PTIME, which corresponds to problems that can be solved in logarithmic space.
The LOGSPACE class is low class in the hierarchy of computational complexity classes. In particular, the problems contained in this class can be solved in parallel ways.
The last nice property of relational algebra is optimization:
Each expression in the select-from-where fragment of SQL can be transformed, with polynomial complexity with respect to its size, in an equivalent expression of the relational algebra and vice versa.
This property characterizes relational algebra as the procedural counterpart of SQL. Note that the transformation from SQL to algebra is efficient. This result opens the door to a series of optimizations that can be taken during the evaluation of an SQL query. An SQL query, to be evaluated, is first transformed into an expression in relational algebra. This expression is then rewritten in some equivalent but optimized version with respect to some measure of complexity. Since the most expensive operation of relational algebra is the join and the complexity of the join depends on the cardinality of the argument tables, the goal of rewriting relational expressions is to minimize the number of join operations and to perform these operations on small tables. This last result is obtained by pushing the selection inside the joins in order to filter the tables before submitting them to the joining operation.
All in all, the theoretical motivations behind the success of SQL are its expressiveness and the ability to efficiently solve its queries.