The logical design of a database consists in translating the conceptual schema of the data into a logical schema. The logical model is independent of the physical model, i.e. the physical representation of the data. The logical model that we will use is called relational model (RM).
The relational model was proposed in 1970 by Edgar F. Codd in his article A Relational Model for Large Shared Data Banks, which appeared in the journal Communications of the ACM. In the abstract of the paper Codd writes:
A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced.
The relational model consists of two main components:
The relational model is based on the concept of relation (not to be confused with the conceptual relationship of the ER model), whose representation is a table (or dataframe in R’s jargon). The concept of relation is formal and comes from set theory, a part of mathematics. The concept of table is intuitive and is used in various contexts outside databases. The success of the relational model lies precisely in the conjunction of a formal concept, the relation, which allowed the development of a theory of relational databases with results of practical impact, with an intuitive concept, the table, which made it possible the relational model also to end users without any mathematical notion. The approach followed in this class will be intuitive. A good text for the theory of relational databases is the article Elements of Relational Database Theory by Paris C. Kanellakis.
Consider the planes table that contains information on planes in the NYC flight dataset (only the first 6 rows are shown):
tailnum | year | type | manufacturer | model | engines | seats | speed | engine |
---|---|---|---|---|---|---|---|---|
N10156 | 2004 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NA | Turbo-fan |
N102UW | 1998 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N103US | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N104UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N10575 | 2002 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NA | Turbo-fan |
N105UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
The following properties characterize the relational model (we will use the terms relation and table interchangeably):
engines
attribute of the planes table has the domain of natural numbers, the number of engines of the plane;tailnum
attribute is the key of the planes table;speed
contains null values in the first 6 rows of table planes (notice that null values are represented with constant NA
in R and with constant NULL
in databases).In a relation we distinguish the schema from the content:
For instance, the schema of the planes relation is the following (notice that the key attributes are underlined):
planes(tailnum, year, type, manufacturer, model, engines, seats, speed, engine)
It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of related data are called a relational database. Hence we define:
We make a fundamental observation. In the relational model, all information is represented by relations that correspond to tables with attributes. Thus, in this model, both the entities and the relationships of the conceptual schema are implemented in relations. In particular, the conceptual relationships between entities are implemented at a logical level through tables that contain values common to other tables to which they refer.
For instance, consider a database that contains a weighted directed network. The conceptual schema contains a node entity and a recursive edge relationship:
The logical schema contains two relations:
The database schema is as follows:
node(id, name)
edge(from, to, weight)
The attributes from and to of the edge table are called foreign keys, since they refer to key values in another table (the node table in this case).
An instance of the database containing a cycle of length 6 is the following:
## id name
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## from to weight
## 1 1 2 1
## 2 2 3 1
## 3 3 4 3
## 4 4 5 4
## 5 5 6 2
## 6 6 1 6
Notice that values of the from and to columns in the edge tables refer to valid id values in the node table.
Besides relations, the relational model defines integrity constraints. Integrity constraints serve to maintain the consistency of the database. Only database instances that satisfy all the integrity constraints are valid. Integrity constraints include:
Foreign key constraints are written as an arrow from the foreign key to the referenced primary key. For instance, for the database representing a network, we have the following two constraints:
edge(from) –> node(id)
edge(to) –> node(id)
There exists a translation from the ER model to the RM. It works by mapping entities and relationships of the conceptual ER schema into relations and integrity constraints of the logical RM schema.
An entity is translated into a relation with the same name, the same attributes and the same primary key constraints.
flight(id, dep_time, arr_time)
There are two translations. The first one embeds the relationship rules inside the table king:
king(name, country)
kingdom(country)
king(country) –> kingdom(country)
The second one embeds the relationship rules inside the table kingdom (notice the renaming of the attribute name of king):
king(name)
kingdom(country, king)
kingdom(king) –> king(name)
The first one is to prefer, there might be kingdoms without kings, but every king has a kingdom. Hence, the second schema might introduce NULL values.
The translations embeds the relationship flies inside the table flight:
flight(id, tailnum)
plane(tailnum)
flight(tailnum) –> plane(tailnum)
Notice that a flight is connected with at most one plane, while a plane might have more flights associated with it. Making the inverse translation, that is embedding flies into plane, would violate the primary key integrity constraint of the plane table.
The translations creates a table for the relationship enroll:
student(code)
course(name)
enroll(student, course)
enroll(student) –> student(code)
enroll(course) –> course(name)
Notice that the key of the enroll table is the composition of the keys of the two related tables student and course. Moreover, two foreign key constraints are necessary, one for each referenced key.