The logical design of a database consists in translating the conceptual schema of the data into a logical schema. The logical model is independent of the physical model, i.e. the physical representation of the data. The logical model that we will use is called relational model (RM).

The relational model was proposed in 1970 by Edgar F. Codd in his article A Relational Model for Large Shared Data Banks, which appeared in the journal Communications of the ACM. In the abstract of the paper Codd writes:

A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced.

The relational model consists of two main components:

The relational model is based on the concept of relation (not to be confused with the conceptual relationship of the ER model), whose representation is a table (or dataframe in R’s jargon). The concept of relation is formal and comes from set theory, a part of mathematics. The concept of table is intuitive and is used in various contexts outside databases. The success of the relational model lies precisely in the conjunction of a formal concept, the relation, which allowed the development of a theory of relational databases with results of practical impact, with an intuitive concept, the table, which made it possible the relational model also to end users without any mathematical notion. The approach followed in this class will be intuitive. A good text for the theory of relational databases is the article Elements of Relational Database Theory by Paris C. Kanellakis.

Consider the planes table that contains information on planes in the NYC flight dataset (only the first 6 rows are shown):

tailnum year type manufacturer model engines seats speed engine
N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan
N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan

The following properties characterize the relational model (we will use the terms relation and table interchangeably):

  1. a relation consists of rows and columns. The columns, also called attributes, have a name that identifies them. These names must be distinguished from one another within the table. The rows, also called tuples, contain a value for each attribute of the table. A cell is the intersection of a row and a column (the value of an attribute for a given row). The order of the rows and that of the columns in the table is not relevant;
  2. each attribute of the relation is associated with a domain, that is a set of values it can take. A domain in the relational model must contain only atomic (indivisible) values. For instance, the engines attribute of the planes table has the domain of natural numbers, the number of engines of the plane;
  3. every tuple of the relation is distinct from the others. Every relation must have a primary key, that is a set of attributes that uniquely identify every tuple of the relation. For instance, tailnum attribute is the key of the planes table;
  4. attributes can have null values, that are unknown or non-existent values. The attributes of the primary key can not have null values. For instance, attribute speed contains null values in the first 6 rows of table planes (notice that null values are represented with constant NA in R and with constant NULL in databases).

In a relation we distinguish the schema from the content:

For instance, the schema of the planes relation is the following (notice that the key attributes are underlined):

planes(tailnum, year, type, manufacturer, model, engines, seats, speed, engine)

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of related data are called a relational database. Hence we define:

We make a fundamental observation. In the relational model, all information is represented by relations that correspond to tables with attributes. Thus, in this model, both the entities and the relationships of the conceptual schema are implemented in relations. In particular, the conceptual relationships between entities are implemented at a logical level through tables that contain values common to other tables to which they refer.

For instance, consider a database that contains a weighted directed network. The conceptual schema contains a node entity and a recursive edge relationship:

The logical schema contains two relations:

The database schema is as follows:

node(id, name)
edge(from, to, weight)

The attributes from and to of the edge table are called foreign keys, since they refer to key values in another table (the node table in this case).

An instance of the database containing a cycle of length 6 is the following:

##   id name
## 1  1    a
## 2  2    b
## 3  3    c
## 4  4    d
## 5  5    e
## 6  6    f
##   from to weight
## 1    1  2      1
## 2    2  3      1
## 3    3  4      3
## 4    4  5      4
## 5    5  6      2
## 6    6  1      6

Notice that values of the from and to columns in the edge tables refer to valid id values in the node table.

Besides relations, the relational model defines integrity constraints. Integrity constraints serve to maintain the consistency of the database. Only database instances that satisfy all the integrity constraints are valid. Integrity constraints include:

Foreign key constraints are written as an arrow from the foreign key to the referenced primary key. For instance, for the database representing a network, we have the following two constraints:

edge(from) –> node(id)
edge(to) –> node(id)

ER to RM translation

There exists a translation from the ER model to the RM. It works by mapping entities and relationships of the conceptual ER schema into relations and integrity constraints of the logical RM schema.

Entities

An entity is translated into a relation with the same name, the same attributes and the same primary key constraints.

flight(id, dep_time, arr_time)

One-to-one relationships

There are two translations. The first one embeds the relationship rules inside the table king:

king(name, country)
kingdom(country)
king(country) –> kingdom(country)

The second one embeds the relationship rules inside the table kingdom (notice the renaming of the attribute name of king):

king(name)
kingdom(country, king)
kingdom(king) –> king(name)

The first one is to prefer, there might be kingdoms without kings, but every king has a kingdom. Hence, the second schema might introduce NULL values.

One-to-many relationships

The translations embeds the relationship flies inside the table flight:

flight(id, tailnum)
plane(tailnum)
flight(tailnum) –> plane(tailnum)

Notice that a flight is connected with at most one plane, while a plane might have more flights associated with it. Making the inverse translation, that is embedding flies into plane, would violate the primary key integrity constraint of the plane table.

Many-to-many relationships

The translations creates a table for the relationship enroll:

student(code)
course(name)
enroll(student, course)
enroll(student) –> student(code)
enroll(course) –> course(name)

Notice that the key of the enroll table is the composition of the keys of the two related tables student and course. Moreover, two foreign key constraints are necessary, one for each referenced key.