“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy
“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy
The data scientist manages variables, observations and values:
There are three interrelated rules which make a dataset tidy:
Why ensure that your data is tidy? There are two main advantages:
Each dataset below shows the same values of four variables country, year, population, and cases, but each dataset organises the values in a different way.
library(tidyr) table1
## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583
table3
## # A tibble: 6 x 3 ## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583
# Spread across two tibbles table4a # cases
## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766
table4b # population
## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583
Apply gathering when variables are in fact values and a single row gathers many observations
Gathering makes wide tables narrower and longer.
This is the case of table4a
and table4b
above, that we can gather as follows:
gather(table4a, `1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766
# Development on gather() is complete, and for new code we recommend switching to pivot_longer() pivot_longer(table4a, c(`1999`, `2000`), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766
gather(table4b, `1999`, `2000`, key = "year", value = "population")
## # A tibble: 6 x 3 ## country year population ## <chr> <chr> <int> ## 1 Afghanistan 1999 19987071 ## 2 Brazil 1999 172006362 ## 3 China 1999 1272915272 ## 4 Afghanistan 2000 20595360 ## 5 Brazil 2000 174504898 ## 6 China 2000 1280428583
# Development on gather() is complete, and for new code we recommend switching to pivot_longer() pivot_longer(table4b, c(`1999`, `2000`), names_to = "year", values_to = "population")
## # A tibble: 6 x 3 ## country year population ## <chr> <chr> <int> ## 1 Afghanistan 1999 19987071 ## 2 Afghanistan 2000 20595360 ## 3 Brazil 1999 172006362 ## 4 Brazil 2000 174504898 ## 5 China 1999 1272915272 ## 6 China 2000 1280428583
Finally, we can join the gathered tables to obtain the original tidy data frame stored in table1
above:
library(dplyr) left_join(gather(table4a, `1999`, `2000`, key = "year", value = "cases"), gather(table4b, `1999`, `2000`, key = "year", value = "population"))
## # A tibble: 6 x 4 ## country year cases population ## <chr> <chr> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Brazil 1999 37737 172006362 ## 3 China 1999 212258 1272915272 ## 4 Afghanistan 2000 2666 20595360 ## 5 Brazil 2000 80488 174504898 ## 6 China 2000 213766 1280428583
Apply spreading when values are in fact variables and a single observation is spread across many rows.
Spreading makes long tables shorter and wider.
This is the case of table2
above, that we can spread as follows:
table2
## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583
spread(table2, key = type, value = count)
## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
# Development on spread() is complete, and for new code we recommend switching to pivot_wider() pivot_wider(table2, names_from = type, values_from = count)
## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
Apply separating when values are composite (not atomic): a column gathers many variables.
This is the case of table3
above, that we can separate as follows:
table3
## # A tibble: 6 x 3 ## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583
separate(table3, rate, into = c("cases", "population"), sep = "/", convert = TRUE)
## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
You can also pass a vector of integers to sep
. separate()
will interpret the integers as positions to split at:
table5 = separate(table3, year, into = c("century", "year"), sep = 2) table5
## # A tibble: 6 x 4 ## country century year rate ## <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272 ## 6 China 20 00 213766/1280428583
Apply uniting when values are partial: a variable is spread across many columns.
This is the case of table5
below, that we can unite as follows:
table5
## # A tibble: 6 x 4 ## country century year rate ## <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272 ## 6 China 20 00 213766/1280428583
unite(table5, new, century, year, sep = "")
## # A tibble: 6 x 3 ## country new rate ## <chr> <chr> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583