Data Science

“Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.” (Hadley Wickham)

“Instead of using data just to become more efficient, we can use data to become more humane and to connect with ourselves and others at a deeper level.” (Giorgia Lupi)

Hadley Wickham’s view

Giorgia Lupi’s view

Data Science graph

The Giorgia Lupi view

Import

  • first you must import your data into R
  • this typically means that you take data stored in a file, database, or web API, and load it into a data frame in R
  • if you can’t get your data into R, you can’t do data science on it
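
As a rough illustration of the import step, here is a minimal sketch in Python (standard library only) rather than R. The CSV text, its column names, and the helper `import_rows` are all hypothetical, made up for this example; a list of dicts stands in for a data frame.

```python
import csv
import io

# Hypothetical CSV content. In practice this text would come from a file,
# a database export, or a web API response.
RAW_CSV = """city,year,distance_km,time_h
Turin,2023,120,2
Turin,2024,90,1.5
Milan,2024,200,4
"""


def import_rows(text):
    """Parse CSV text into a list of dicts, one dict per row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = import_rows(RAW_CSV)
print(len(rows), "rows imported")
print(rows[0])
```

Note that `csv.DictReader` returns every value as a string, so numeric columns still need converting before any analysis.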

Tidy

  • once you’ve imported your data, it is a good idea to tidy it
  • tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored
  • in brief, when your data is tidy, each column is a variable and each row is an observation
  • tidy data is important because the consistent structure lets you focus on questions about the data, not fighting to get the data into the right form to answer your questions
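
To make "each column is a variable and each row is an observation" concrete, here is a small sketch in Python (not R) that reshapes a wide table into a tidy one. The data, the column names, and the helper `pivot_longer` are hypothetical and only loosely mirror what a pivoting function in a data frame library does.

```python
# Hypothetical wide table: the year is hidden in the column names,
# so one row holds two observations.
WIDE = [
    {"country": "A", "1999": 745, "2000": 2666},
    {"country": "B", "1999": 37737, "2000": 80488},
]


def pivot_longer(rows, id_col, names_to, values_to):
    """Return one row per (id, former column name) pair."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_col:
                continue
            # Assumption for this example: the former column names are years.
            tidy.append({id_col: row[id_col], names_to: int(key), values_to: value})
    return tidy


tidy = pivot_longer(WIDE, "country", "year", "cases")
for observation in tidy:
    print(observation)
```

After reshaping, `country`, `year`, and `cases` are each a column, and every row is a single country-year observation.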

Transform

  • once you have tidy data, a common first step is to transform (or query) it
  • transformation includes:
    • narrowing in on observations of interest (like all people in one city, or all data from the last year)
    • creating new variables that are functions of existing variables (like computing speed from distance and time)
    • calculating a set of summary statistics (like counts or means)
  • together, tidying and transforming are called wrangling, because getting your data into a form that’s natural to work with often feels like a fight!
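
The three kinds of transformation listed above can be sketched as follows, in Python (standard library only) rather than R. The `TRIPS` records and their field names are hypothetical; the new variable is a speed computed from distance and time.

```python
from statistics import mean

# Hypothetical tidy data: one row per trip.
TRIPS = [
    {"city": "Turin", "year": 2023, "distance_km": 120.0, "time_h": 2.0},
    {"city": "Turin", "year": 2024, "distance_km": 90.0, "time_h": 1.5},
    {"city": "Milan", "year": 2024, "distance_km": 200.0, "time_h": 4.0},
]

# 1. Narrow in on observations of interest: trips from one city.
turin = [t for t in TRIPS if t["city"] == "Turin"]

# 2. Create a new variable as a function of existing ones.
with_speed = [{**t, "speed_kmh": t["distance_km"] / t["time_h"]} for t in turin]

# 3. Calculate summary statistics: a count and a mean.
summary = {
    "n": len(with_speed),
    "mean_speed_kmh": mean(t["speed_kmh"] for t in with_speed),
}
print(summary)
```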

Visualize and model

  • once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling
  • these have complementary strengths and weaknesses so any real analysis will iterate between them many times

Visualize

  • visualisation is a fundamentally human activity
  • a good visualisation will show you things that you did not expect, or raise new questions about the data
  • a good visualisation might also hint that you’re asking the wrong question, or you need to collect different data
  • visualisations can surprise you, but don’t scale particularly well because they require a human to interpret them

Model

  • models are complementary tools to visualisation
  • the goal of a model is to provide a simple low-dimensional summary of a dataset
  • ideally, the model will capture true signals (i.e. patterns generated by the phenomenon of interest) and ignore noise (i.e. random variation that you’re not interested in)
  • models are a fundamentally mathematical or computational tool, so they generally scale well
  • but “the map is not the territory”: every model makes assumptions, and those assumptions open a gap between reality and the model of it
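
One way to picture a "simple low-dimensional summary" is a straight line fitted by ordinary least squares, sketched here in Python (standard library only) rather than R. The data points and the helper `fit_line` are hypothetical; the choice of a linear model is itself an assumption about the data, which is exactly the kind of assumption the last bullet warns about.

```python
from statistics import mean

# Hypothetical observations: roughly linear, with some noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# The fitted line is the "signal"; the residuals are what the model treats as noise.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```

Five data points are reduced to two numbers, a slope and an intercept. Whatever the line cannot express ends up in the residuals, whether it is genuine noise or structure the model was never able to see.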

Communicate

  • the last step of data science is communication, an absolutely critical part of any data analysis project
  • it doesn’t matter how well your models and visualisations have led you to understand the data unless you can also communicate your results to others, including your future self