Tidy the WHO dataset

The tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report. There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided.

This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we’ll need multiple steps to tidy it.

Let’s have a look at what we’ve got:

It looks like country, iso2, and iso3 are three variables that redundantly specify the country.
year is clearly also a variable.

What about all the other columns (e.g. new_sp_m014, new_ep_m014, new_ep_f014)?

The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two or three letters describe the type of TB:
- rel stands for cases of relapse
- ep stands for cases of extrapulmonary TB
- sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
- sp stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive)
The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
The remaining numbers gives the age group. The dataset groups cases into seven age groups:
- 014 = 0 – 14 years old
- 1524 = 15 – 24 years old
- 2534 = 25 – 34 years old
- 3544 = 35 – 44 years old
- 4554 = 45 – 54 years old
- 5564 = 55 – 64 years old
- 65 = 65 or older

We ask you to tidy the above data set following these steps.

Step 1: gather

Gather together all the columns from new_sp_m014 to newrel_f65. Use the variable key for the gathered columns and variable cases for the values. Notice there are a lot of missing values in the current representation, so use values_drop_na to remove them.

who1 <- who %>% 
  pivot_longer(___, names_to = ___, values_to = ___, values_drop_na = ___)
who1

## # A tibble: 7,240 x 60
##    country  iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##    <chr>    <chr> <chr> <int>       <int>        <int>        <int>        <int>
##  1 Afghani… AF    AFG    1980          NA           NA           NA           NA
##  2 Afghani… AF    AFG    1981          NA           NA           NA           NA
##  3 Afghani… AF    AFG    1982          NA           NA           NA           NA
##  4 Afghani… AF    AFG    1983          NA           NA           NA           NA
##  5 Afghani… AF    AFG    1984          NA           NA           NA           NA
##  6 Afghani… AF    AFG    1985          NA           NA           NA           NA
##  7 Afghani… AF    AFG    1986          NA           NA           NA           NA
##  8 Afghani… AF    AFG    1987          NA           NA           NA           NA
##  9 Afghani… AF    AFG    1988          NA           NA           NA           NA
## 10 Afghani… AF    AFG    1989          NA           NA           NA           NA
## # … with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>,
## #   new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>,
## #   new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
## #   new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
## #   new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
## #   new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
## #   new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
## #   new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>,
## #   new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
## #   new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>,
## #   new_ep_m4554 <int>, new_ep_m5564 <int>, new_ep_m65 <int>,
## #   new_ep_f014 <int>, new_ep_f1524 <int>, new_ep_f2534 <int>,
## #   new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>,
## #   new_ep_f65 <int>, newrel_m014 <int>, newrel_m1524 <int>,
## #   newrel_m2534 <int>, newrel_m3544 <int>, newrel_m4554 <int>,
## #   newrel_m5564 <int>, newrel_m65 <int>, newrel_f014 <int>,
## #   newrel_f1524 <int>, newrel_f2534 <int>, newrel_f3544 <int>,
## #   newrel_f4554 <int>, newrel_f5564 <int>, newrel_f65 <int>

## # A tibble: 76,046 x 6
##    country     iso2  iso3   year key          cases
##    <chr>       <chr> <chr> <int> <chr>        <int>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36
## # … with 76,036 more rows

Step 2: replace

We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel we have newrel Use stringr::str_replace() to replace the characters “newrel” with “new_rel” in the key column. This makes all variable names consistent.

who2 <- who1 %>% 
  mutate(key = stringr::str_replace(___, ___, ___))
who2

## # A tibble: 76,046 x 6
##    country     iso2  iso3   year key          cases
##    <chr>       <chr> <chr> <int> <chr>        <int>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36
## # … with 76,036 more rows

Step 3: separate

Separate the values in each code with three passes of separate(). First, separate key after underscore character in three new variables: new, type, and sexage

who3 <- who2 %>% 
  ___(___, into = ___, sep = "___")
who3

## # A tibble: 76,046 x 8
##    country     iso2  iso3   year new   type  sexage cases
##    <chr>       <chr> <chr> <int> <chr> <chr> <chr>  <int>
##  1 Afghanistan AF    AFG    1997 new   sp    m014       0
##  2 Afghanistan AF    AFG    1997 new   sp    m1524     10
##  3 Afghanistan AF    AFG    1997 new   sp    m2534      6
##  4 Afghanistan AF    AFG    1997 new   sp    m3544      3
##  5 Afghanistan AF    AFG    1997 new   sp    m4554      5
##  6 Afghanistan AF    AFG    1997 new   sp    m5564      2
##  7 Afghanistan AF    AFG    1997 new   sp    m65        0
##  8 Afghanistan AF    AFG    1997 new   sp    f014       5
##  9 Afghanistan AF    AFG    1997 new   sp    f1524     38
## 10 Afghanistan AF    AFG    1997 new   sp    f2534     36
## # … with 76,036 more rows

Next separate sexage into sex and age by splitting after the first character:

who4 <- who3 %>% 
  ___(___, into = ___, sep = ___)
who4

## # A tibble: 76,046 x 9
##    country     iso2  iso3   year new   type  sex   age   cases
##    <chr>       <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
##  1 Afghanistan AF    AFG    1997 new   sp    m     014       0
##  2 Afghanistan AF    AFG    1997 new   sp    m     1524     10
##  3 Afghanistan AF    AFG    1997 new   sp    m     2534      6
##  4 Afghanistan AF    AFG    1997 new   sp    m     3544      3
##  5 Afghanistan AF    AFG    1997 new   sp    m     4554      5
##  6 Afghanistan AF    AFG    1997 new   sp    m     5564      2
##  7 Afghanistan AF    AFG    1997 new   sp    m     65        0
##  8 Afghanistan AF    AFG    1997 new   sp    f     014       5
##  9 Afghanistan AF    AFG    1997 new   sp    f     1524     38
## 10 Afghanistan AF    AFG    1997 new   sp    f     2534     36
## # … with 76,036 more rows

Finally separate age into age_low and age_high by splitting after the second character (first replace “014” with “0014” and “65” with “6565”):

who5 <- who4 %>% 
  mutate(age = stringr::str_replace(___, ___, ___)) %>%
  mutate(age = stringr::str_replace(___, ___, ___)) %>%
  separate(___, into = ___, sep = ___)

## # A tibble: 76,046 x 10
##    country     iso2  iso3   year new   type  sex   age_low age_high cases
##    <chr>       <chr> <chr> <int> <chr> <chr> <chr> <chr>   <chr>    <int>
##  1 Afghanistan AF    AFG    1997 new   sp    m     00      14           0
##  2 Afghanistan AF    AFG    1997 new   sp    m     15      24          10
##  3 Afghanistan AF    AFG    1997 new   sp    m     25      34           6
##  4 Afghanistan AF    AFG    1997 new   sp    m     35      44           3
##  5 Afghanistan AF    AFG    1997 new   sp    m     45      54           5
##  6 Afghanistan AF    AFG    1997 new   sp    m     55      64           2
##  7 Afghanistan AF    AFG    1997 new   sp    m     65      65           0
##  8 Afghanistan AF    AFG    1997 new   sp    f     00      14           5
##  9 Afghanistan AF    AFG    1997 new   sp    f     15      24          38
## 10 Afghanistan AF    AFG    1997 new   sp    f     25      34          36
## # … with 76,036 more rows

Step 4: drop

Check that the new column is constant in this dataset. Hence drop column new as well as iso2 and iso3 since they’re redundant.

# check that new is constant
who5 %>% ___

# drop redundant columns
who6 <- who5 %>% 
  ___(___, ___, ___)
who6

## # A tibble: 1 x 2
##   new       n
##   <chr> <int>
## 1 new   76046

## # A tibble: 76,046 x 7
##    country      year type  sex   age_low age_high cases
##    <chr>       <int> <chr> <chr> <chr>   <chr>    <int>
##  1 Afghanistan  1997 sp    m     00      14           0
##  2 Afghanistan  1997 sp    m     15      24          10
##  3 Afghanistan  1997 sp    m     25      34           6
##  4 Afghanistan  1997 sp    m     35      44           3
##  5 Afghanistan  1997 sp    m     45      54           5
##  6 Afghanistan  1997 sp    m     55      64           2
##  7 Afghanistan  1997 sp    m     65      65           0
##  8 Afghanistan  1997 sp    f     00      14           5
##  9 Afghanistan  1997 sp    f     15      24          38
## 10 Afghanistan  1997 sp    f     25      34          36
## # … with 76,036 more rows

Last step: use pipe

You have written the code a piece at a time, assigning each interim result to a new variable. Instead, gradually build up a complex pipe.