I will teach how to organize, transform, analyse and visualize small and big data, as well as how to effectively communicate the outcomes of the workflow. This is a path starting at Edgar Frank Codd, passing through Hadley Wickham, and ending at Giorgia Lupi.

The course will be multi-task (learn, make, use, watch, glance, read, dig, listen; see more below) and multi-teacher (I will be assisted by other real and virtual teachers). Some basics in programming and statistics are desirable.

## Play

1. Getting started
2. Relational databases
• learn Introduction to databases. Chapters from 1.1 to 1.6 of book LL99
• learn An informal overview of the relational model. Chapter 1.7.1 of book LL99
• learn The data structure. Chapter 3.1 of book LL99
• learn Integrity constraints. Chapter 3.4 of book LL99
• learn Update anomalies and normal forms. Chapters 4.1 and 4.4 of book LL99
• learn SQL. Chapter 3.2.2 of book LL99 and Teatro SQL
• use SQLite
• make Create a database in SQLite corresponding to the dataset nycflights13. At work
• make Write queries in SQL on the nycflights13 database. At work
• read DBI and RSQLite R packages’ vignettes
• make Create, populate and query database nycflights13 using DBI and RSQLite packages of R. At work
• read 10 Easy Steps to a Complete Understanding of SQL
• dig Relational algebra and relational calculus. Chapters 3.2.1 and 3.2.2 of book LL99
• dig Relational model and relational algebra
• listen Invited speaker: Angelo Montanari on data normalization
3. Explore
4. Wrangle
5. Program
6. Model
7. Non-tidy data
8. Communicate
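
The create, populate and query steps of the make tasks in unit 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the course solution: it uses Python's standard `sqlite3` module instead of the DBI and RSQLite packages of R, the two tables keep only a handful of nycflights13 column names, and the carriers and rows are invented for illustration.

```python
import sqlite3

# In-memory database; pass a file path instead to persist it.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Create: a reduced, illustrative schema (not the full nycflights13 one).
conn.executescript("""
CREATE TABLE airlines (
    carrier TEXT PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE flights (
    id        INTEGER PRIMARY KEY,
    carrier   TEXT NOT NULL REFERENCES airlines(carrier),
    origin    TEXT NOT NULL,
    dest      TEXT NOT NULL,
    dep_delay INTEGER
);
""")

# Populate: hypothetical carriers and flights, invented for this sketch.
with conn:
    conn.executemany(
        "INSERT INTO airlines (carrier, name) VALUES (?, ?)",
        [("XX", "Example Air"), ("YY", "Sample Airways")],
    )
    conn.executemany(
        "INSERT INTO flights (id, carrier, origin, dest, dep_delay) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (1, "XX", "JFK", "LAX", 10),
            (2, "XX", "JFK", "SFO", 20),
            (3, "YY", "EWR", "ORD", -5),
            (4, "YY", "LGA", "ORD", None),  # missing delay, stored as NULL
        ],
    )

# Query: number of flights and mean departure delay per airline.
# AVG skips NULL values, while COUNT(*) counts every row.
rows = conn.execute("""
SELECT a.name, COUNT(*) AS n_flights, AVG(f.dep_delay) AS mean_delay
FROM flights AS f
JOIN airlines AS a ON a.carrier = f.carrier
GROUP BY a.name
ORDER BY a.name
""").fetchall()

# Integrity constraint: a flight with an unknown carrier is rejected.
try:
    with conn:
        conn.execute(
            "INSERT INTO flights (id, carrier, origin, dest, dep_delay) "
            "VALUES (5, 'ZZ', 'JFK', 'LAX', 0)"
        )
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(rows)
print("unknown carrier rejected:", fk_rejected)
conn.close()
```

The foreign key from `flights.carrier` to `airlines.carrier` is a small instance of the integrity constraints covered in the unit: the database itself refuses a row that refers to a carrier that does not exist.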

You will go through different tasks: learn, make, use, watch, glance, read, dig, listen. A legend is below:

• learn: I teach, you listen (and hopefully learn).
• make: I give you an assignment and you work on it during class. We discuss the solutions during the next class.
• use: You use a piece of software: you download, install and run it for the first time. I give you a brief practical introduction to it.
• watch: We watch a video together. By and large, the video acts as a teaser, introducing the next topic in an informal and attractive way.
• glance: You take a quick look at something, generally an informative website. I steer you towards the most important sections.
• read: You read a story, typically at home. We discuss it together during the following class.
• dig: You read a more in-depth theoretical treatment of the current topic, normally at home. We talk about it during one of the next classes.
• listen: The class is given by an invited speaker, an expert in the field.

## Books

• WG17 R for Data Science. Hadley Wickham and Garrett Grolemund. O’Reilly. 2017.
• T11 R Cookbook. Paul Teetor. O’Reilly Media. 2011.
• W10 ggplot2: Elegant Graphics for Data Analysis. Hadley Wickham. Springer. 2010.
• LL99 A Guided Tour of Relational Databases and Beyond. Mark Levene and George Loizou. Springer. 1999.
• MS06 An Introduction to XML and Web Technologies. Anders Møller and Michael I. Schwartzbach. Addison-Wesley. 2006.
• HMU06 Introduction to Automata Theory, Languages, and Computation. John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman. Addison-Wesley. 2006.
• T01 The Visual Display of Quantitative Information. Edward R. Tufte. Graphics Press. 2001.
• C12 The Functional Art. Alberto Cairo. New Riders. 2012.
• BBLL12 Generative Design. Hartmut Bohnacker, Benedikt Gross, Julia Laub, Claudius Lazzeroni. Princeton Architectural Press. 2012.

## Data challenges

Data challenges have 3 components:

• Input, which consists of:
1. a dataset of raw data. No data model is assumed. The data should be open so it can be freely distributed.
2. a set of data questions and challenges, formulated in natural language, whose answers might be (but are not necessarily) hidden behind the raw data. Questions should be sufficiently general and compelling to spark the attention and curiosity of scholars.
• Analysis notebook: a stream of analyses and visualizations aimed at approaching the given data questions and challenges. Ideally, the notebook is written in some popular, free language (like R or Python) and is self-contained, so that it can be easily distributed, executed and modified by other scholars. Qualities such as readability, conciseness, elegance and efficiency of the notebook are relevant, although not crucial.
• Output: these are the suggested answers to the given data questions and challenges. Answers might be partial (not definitive). The same question can be answered with different notebooks. A (modest) degree of subjectivity in the interpretation of the data answers is expected.

The following are examples of data challenges you are invited to try:

1. Which are the winners and losers in the last Italian soccer Serie A league? challenge
2. Which is the best team ever in Italian soccer? challenge
3. Is there a first-mover advantage in chess? challenge
4. Are female dolphins more social than male dolphins? challenge
5. Which are the most dangerous terrorists involved in the Madrid train bombing attack of 2004? challenge
6. Is child mortality decreasing over time? challenge
7. Are low quality diamonds more expensive? challenge
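
The three components of a data challenge (input, analysis notebook, output) can be sketched in a few lines, here for the diamonds question. This is only an illustrative sketch: the column names `cut`, `carat` and `price` are assumptions, the five rows are invented and say nothing about what the real dataset shows, and the helper `mean_by` is hypothetical.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Input 1: raw data. These rows are invented for illustration only.
RAW = """cut,carat,price
Fair,1.0,4000
Fair,1.2,4800
Ideal,0.3,900
Ideal,0.4,1400
Ideal,1.0,6000
"""

# Input 2: the data question, formulated in natural language.
QUESTION = "Are low quality diamonds more expensive?"


def mean_by(rows, key, value):
    """Average the numeric column `value` within each group of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[value]))
    return {group: mean(values) for group, values in groups.items()}


# Analysis notebook: a short stream of analyses on the raw data.
rows = list(csv.DictReader(io.StringIO(RAW)))

# Analysis 1: mean price per cut, ignoring the size of the stones.
price_by_cut = mean_by(rows, "cut", "price")

# Analysis 2: mean price per carat per cut, which accounts for size.
for row in rows:
    row["price_per_carat"] = float(row["price"]) / float(row["carat"])
per_carat_by_cut = mean_by(rows, "cut", "price_per_carat")

# Output: suggested (partial) answers to the question.
print(QUESTION)
print("mean price by cut:", price_by_cut)
print("mean price per carat by cut:", per_carat_by_cut)
```

On these toy rows the two analyses point in opposite directions: the lower-quality cut has the higher mean price, but the lower mean price per carat. This is the sense in which the same question can be answered with different notebooks, and why some subjectivity in interpreting the answers is expected.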