Data Science, reloaded

In the Dear Data Science course we covered analytics, visualization and modelling for relational (tabular) data. In this course we will approach network data as well as text mining.

Play

Teasers
- watch The joy of stats by Hans Rosling
- watch The power of networks by Manuel Lima
Relational data
- Catch up, if necessary, by learning the Data Science part of the Dear Data Science course
Network science
- listen Notes on linear algebra and matrix theory
- learn Notes on graph theory [1 / 2]
- datacamp Network Science in R - A Tidy Approach
- Real-world networks
  - watch A visual history of human knowledge
  - watch Is time a network?
  - glance Gallery: Gorgeous networks that help us understand the world
  - glance Visual complexity
  - glance Networkism
  - learn Classes of networks
    - learn Technological networks
    - learn Social networks
    - learn Information networks
    - learn Biological networks
- Packages
  - The igraph package
    - glance igraph
    - learn Getting started with igraph [html / Rmd]
  - The ggraph package
    - glance ggraph
    - learn Getting started with ggraph [html / Rmd]
  - The tidygraph package
    - glance tidygraph
    - learn Getting started with tidygraph [html / Rmd]
- Centrality
  - learn The 3 usual suspects: Degree, Closeness and Betweenness
  - learn Recursive centrality: Eigenvector, Katz, PageRank, and HITS
  - dig PageRank: Standing on the shoulders of giants
  - dig Current-flow centralities
- Rating and ranking
  - Massey
  - Keener
  - Offense-Defense
  - Elo
- Power
  - learn A measure of power in networks
  - dig A theory on power in networks
  - dig Bargaining and power in networks. Chapter 12 in book Networks, crowds and markets
- Similarity and heterogeneity
  - learn Similarity
  - learn Heterogeneity
- Community detection
  - learn Modularity
  - learn Spectral community detection
  - learn Hierarchical clustering
  - learn Other methods
- Structure
  - learn Network models
  - learn Components and resilience
  - make Components and resilience in R [html / Rmd]
  - watch The science of six degrees of separation
  - read Chains, by Frigyes Karinthy
  - read Erdős number
  - watch The strength of weak ties
  - learn Small-world networks
  - make Small-world networks in R [html / Rmd]
  - learn Degree distribution
  - make Degree distribution in R [html / Rmd]
  - read Power-law distribution
  - learn Transitivity and reciprocity
  - learn Assortative mixing
Text mining
- Strings and regular expressions
  - learn Strings with stringr
  - dig Regular expressions and automata (Chapters 3 and 4)
  - glance Cheatsheet. Regular expressions
- Text Mining with R - A tidy approach
  - learn The tidy text format
  - learn Sentiment analysis
  - learn tf-idf
  - learn n-grams and correlations
  - learn Converting to and from non-tidy formats
  - learn Topic modelling
  - make Mining financial articles
  - make The great library heist
Blockchain & IPFS
- Blockchain
  - watch What is a blockchain?
  - learn Building a blockchain in R
  - dig Blockchain leashed
- IPFS
  - glance InterPlanetary File System
  - watch IPFS Simply Explained
  - watch IPFS and the Permanent Web

Task-tag legend

You will go through different tasks: learn, make, use, watch, glance, read, dig, listen. A legend is below:

learn: I teach, you listen (and hopefully learn).
make: I give you an assignment, you make it during the class. We discuss the solutions during the next class.
use: You use a piece of software: you download, install and run it for the first time. I give you a brief practical introduction to it.
watch: We watch a video together. By and large, the video acts as a teaser, introducing the next topic in an informal and attractive way.
glance: You take a quick look at something, generally an informative website. I steer you towards the most important sections.
read: You read a story, typically at home. We discuss it together during the following class.
dig: You read a theoretical deepening of the current topic, normally at home. We talk about it during one of the next classes.
listen: The class is given by an invited speaker, an expert in the field.

Tools

Packages

all packages of the tidyverse, in particular dplyr and ggplot2 for data manipulation and visualization
igraph, ggraph, and tidygraph for network analysis and visualization
tidytext for text analysis

Books

E-learning

As an academic professional, I can sign my class up for an entire semester for free via DataCamp for the Classroom. This has some benefits:

you can learn by doing, using the DataCamp platform;
I can assign particular courses or chapters, and see who finished on time and who missed the deadline;
I can track student progress, grade automatically, and download reports.

Datasets

Data challenges

In a data story (or data challenge) you tell a story with data. Find a dataset, pose questions, and try to solve them using an analysis notebook in R. Follow your curiosity and be creative.

Which are the performance classes in the latest Italian soccer Serie A league? challenge
Is there a first-mover advantage in chess? challenge
Which are the most powerful countries in the European natural gas market? challenge
Detect the most dangerous terrorists involved in the Madrid train bombing attack of 2004 challenge
Discover the most interdisciplinary and autarchic disciplines in science challenge
Detect communities in a Karate club friendship network challenge
Attack the resilience of the Madrid train bombing terror network challenge
Are relationships among dolphins assortative by sex? And by degree? challenge
Analyse the market of crypto art challenge

Exam

The exam consists of a written exam and a project with oral presentation.

The written part consists of a list of questions, either open questions or exercises, covering the entire syllabus. During the written exam students cannot use any material.
The project consists of one significant data challenge chosen by the student. It is done individually and must use methods, languages and software tools seen during the course. The student will discuss the project on the day of the written exam, in at most 25 minutes, using a presentation on a personal laptop (bring adapters). The presentation must focus on the dataset used, the data questions, the analyses performed and the results obtained. Both the project and the presentation skills will be evaluated. Each student can discuss the project only once. If the student fails the written part, the bonus of the project remains valid.

The final mark will be a weighted average of the written and project parts of the exam. The weight of the project is \(\varphi^{-1}\), where \(\varphi\) is the golden ratio. Passing marks range from 18 to 30. Excellent projects will be awarded a praise bonus of 1 to 3 points, which is added to the weighted average. The final mark is rounded to the closest integer.
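
As a worked illustration of the marking rule, here is a sketch under one assumption: the two weights sum to one, so the written part carries weight \(1 - \varphi^{-1}\). The symbols \(w\) (written mark), \(p\) (project mark), \(b\) (praise bonus, 0 if none) and \(m\) (final mark) are introduced here for illustration only.

\[
\varphi = \frac{1+\sqrt{5}}{2} \approx 1.618, \qquad
\varphi^{-1} = \frac{\sqrt{5}-1}{2} \approx 0.618, \qquad
1 - \varphi^{-1} = \varphi^{-2} \approx 0.382
\]

\[
m = \operatorname{round}\bigl( (1-\varphi^{-1})\, w + \varphi^{-1}\, p + b \bigr)
\]

For example, with \(w = 24\), \(p = 28\) and \(b = 0\), the weighted average is approximately \(0.382 \cdot 24 + 0.618 \cdot 28 \approx 9.17 + 17.30 = 26.47\), which rounds to a final mark of 26. Under this reading the project counts for roughly 62% of the mark and the written part for roughly 38%.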