The tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report. There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided.

library(tidyr)
library(dplyr)

View who dataset in RStudio with View() function. This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we’ll need multiple steps to tidy it.

Let’s have a look at what we’ve got:

What about all the other columns (e.g. new_sp_m014, new_ep_m014, new_ep_f014)?

  1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
  2. The next two or three letters describe the type of TB:
    • rel stands for cases of relapse
    • ep stands for cases of extrapulmonary TB
    • sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
    • sp stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive)
  3. The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
  4. The remaining numbers gives the age group. The dataset groups cases into seven age groups:
    • 014 = 0 – 14 years old
    • 1524 = 15 – 24 years old
    • 2534 = 25 – 34 years old
    • 3544 = 35 – 44 years old
    • 4554 = 45 – 54 years old
    • 5564 = 55 – 64 years old
    • 65 = 65 or older

We ask you to tidy the above data set following these steps.

Step 1: gather

Gather together all the columns from new_sp_m014 to newrel_f65. Use the variable key for the gathered columns and variable cases for the values. Notice there are a lot of missing values in the current representation, so use na.rm to remove them.

View(who)
who1 <- who %>% 
  gather(___, key = "___", value = "___", na.rm = ___)
who1
who1 <- who %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
## # A tibble: 76,046 x 6
##    country     iso2  iso3   year key         cases
##  * <chr>       <chr> <chr> <int> <chr>       <int>
##  1 Afghanistan AF    AFG    1997 new_sp_m014     0
##  2 Afghanistan AF    AFG    1998 new_sp_m014    30
##  3 Afghanistan AF    AFG    1999 new_sp_m014     8
##  4 Afghanistan AF    AFG    2000 new_sp_m014    52
##  5 Afghanistan AF    AFG    2001 new_sp_m014   129
##  6 Afghanistan AF    AFG    2002 new_sp_m014    90
##  7 Afghanistan AF    AFG    2003 new_sp_m014   127
##  8 Afghanistan AF    AFG    2004 new_sp_m014   139
##  9 Afghanistan AF    AFG    2005 new_sp_m014   151
## 10 Afghanistan AF    AFG    2006 new_sp_m014   193
## # ... with 76,036 more rows

Step 2: replace

We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel we have newrel Use stringr::str_replace() to replace the characters “newrel” with “new_rel”. This makes all variable names consistent.

who2 <- who1 %>% 
  mutate(___ = ___(___, "newrel", "new_rel"))
who2
who2 <- who1 %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
## # A tibble: 76,046 x 6
##    country     iso2  iso3   year key         cases
##    <chr>       <chr> <chr> <int> <chr>       <int>
##  1 Afghanistan AF    AFG    1997 new_sp_m014     0
##  2 Afghanistan AF    AFG    1998 new_sp_m014    30
##  3 Afghanistan AF    AFG    1999 new_sp_m014     8
##  4 Afghanistan AF    AFG    2000 new_sp_m014    52
##  5 Afghanistan AF    AFG    2001 new_sp_m014   129
##  6 Afghanistan AF    AFG    2002 new_sp_m014    90
##  7 Afghanistan AF    AFG    2003 new_sp_m014   127
##  8 Afghanistan AF    AFG    2004 new_sp_m014   139
##  9 Afghanistan AF    AFG    2005 new_sp_m014   151
## 10 Afghanistan AF    AFG    2006 new_sp_m014   193
## # ... with 76,036 more rows

Step 3: separate

Separate the values in each code with two passes of separate(). First, separate key after underscore character in three new variables:

who3 <- who2 %>% 
  ___(___, c("new", "type", "sexage"), sep = "___")
who3
who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep = "_")
who3
## # A tibble: 76,046 x 8
##    country     iso2  iso3   year new   type  sexage cases
##    <chr>       <chr> <chr> <int> <chr> <chr> <chr>  <int>
##  1 Afghanistan AF    AFG    1997 new   sp    m014       0
##  2 Afghanistan AF    AFG    1998 new   sp    m014      30
##  3 Afghanistan AF    AFG    1999 new   sp    m014       8
##  4 Afghanistan AF    AFG    2000 new   sp    m014      52
##  5 Afghanistan AF    AFG    2001 new   sp    m014     129
##  6 Afghanistan AF    AFG    2002 new   sp    m014      90
##  7 Afghanistan AF    AFG    2003 new   sp    m014     127
##  8 Afghanistan AF    AFG    2004 new   sp    m014     139
##  9 Afghanistan AF    AFG    2005 new   sp    m014     151
## 10 Afghanistan AF    AFG    2006 new   sp    m014     193
## # ... with 76,036 more rows

Next separate sexage into sex and age by splitting after the first character:

who4 <- who3 %>% 
  ___(sexage, c("___", "___"), sep = ___)
who4
who4 <- who3 %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who4
## # A tibble: 76,046 x 9
##    country     iso2  iso3   year new   type  sex   age   cases
##    <chr>       <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
##  1 Afghanistan AF    AFG    1997 new   sp    m     014       0
##  2 Afghanistan AF    AFG    1998 new   sp    m     014      30
##  3 Afghanistan AF    AFG    1999 new   sp    m     014       8
##  4 Afghanistan AF    AFG    2000 new   sp    m     014      52
##  5 Afghanistan AF    AFG    2001 new   sp    m     014     129
##  6 Afghanistan AF    AFG    2002 new   sp    m     014      90
##  7 Afghanistan AF    AFG    2003 new   sp    m     014     127
##  8 Afghanistan AF    AFG    2004 new   sp    m     014     139
##  9 Afghanistan AF    AFG    2005 new   sp    m     014     151
## 10 Afghanistan AF    AFG    2006 new   sp    m     014     193
## # ... with 76,036 more rows

Step 4: drop

Then we might as well drop the new column because it’s constant in this dataset. While we’re dropping columns, let’s also drop iso2 and iso3 since they’re redundant.

# check that new is constant
who4 %>% ___

# drop redundant columns
who5 <- who4 %>% 
  select(___, ___, ___)
who4 %>% 
  count(new)
## # A tibble: 1 x 2
##   new       n
##   <chr> <int>
## 1 new   76046
who5 <- who4 %>% 
  select(-new, -iso2, -iso3)

Last step: use pipe

You have written the code a piece at a time, assigning each interim result to a new variable. Instead, gradually build up a complex pipe.

who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>% 
  separate(sexage, c("sex", "age"), sep = 1) %>%
  select(-new, -iso2, -iso3)  
## # A tibble: 76,046 x 6
##    country      year var   sex   age   value
##    <chr>       <int> <chr> <chr> <chr> <int>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1998 sp    m     014      30
##  3 Afghanistan  1999 sp    m     014       8
##  4 Afghanistan  2000 sp    m     014      52
##  5 Afghanistan  2001 sp    m     014     129
##  6 Afghanistan  2002 sp    m     014      90
##  7 Afghanistan  2003 sp    m     014     127
##  8 Afghanistan  2004 sp    m     014     139
##  9 Afghanistan  2005 sp    m     014     151
## 10 Afghanistan  2006 sp    m     014     193
## # ... with 76,036 more rows