The tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report. There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided.
library(tidyr)
library(dplyr)
View who dataset in RStudio with View() function. This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we’ll need multiple steps to tidy it.
Let’s have a look at what we’ve got:
country, iso2, and iso3 are three variables that redundantly specify the country.year is clearly also a variable.What about all the other columns (e.g. new_sp_m014, new_ep_m014, new_ep_f014)?
rel stands for cases of relapseep stands for cases of extrapulmonary TBsn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)sp stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive)m) and females (f).014 = 0 – 14 years old1524 = 15 – 24 years old2534 = 25 – 34 years old3544 = 35 – 44 years old4554 = 45 – 54 years old5564 = 55 – 64 years old65 = 65 or olderWe ask you to tidy the above data set following these steps.
Gather together all the columns from new_sp_m014 to newrel_f65. Use the variable key for the gathered columns and variable cases for the values. Notice there are a lot of missing values in the current representation, so use na.rm to remove them.
who1 <- who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel we have newrel Use stringr::str_replace() to replace the characters “newrel” with “new_rel” in the key column. This makes all variable names consistent.
who2 <- who1 %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
Separate the values in each code with two passes of separate(). First, separate key after underscore character in three new variables: new, type, and sexage
who3 <- who2 %>%
separate(key, into = c("new", "type", "sexage"), sep = "_")
who3
Next separate sexage into sex and age by splitting after the first character:
who4 <- who3 %>%
separate(sexage, into = c("sex", "age"), sep = 1)
who4
Finally separate age into age_low and age_high by splitting after the second character (first replace “014” with “0014” and “65” with “6565”):
who5 <- who4 %>%
mutate(age = stringr::str_replace(age, "014", "0014")) %>%
mutate(age = stringr::str_replace(age, "65", "6565")) %>%
separate(age, into = c("age_low", "age_high"), sep = 2)
Check that the new column is constant in this dataset. Hence drop column new as well as iso2 and iso3 since they’re redundant.
who5 %>% count(new)
who6 <- who5 %>%
select(-new, -iso2, -iso3)
who5
You have written the code a piece at a time, assigning each interim result to a new variable. Instead, gradually build up a complex pipe.
who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, into = c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, into = c("sex", "age"), sep = 1) %>%
mutate(age = stringr::str_replace(age, "014", "0014")) %>%
mutate(age = stringr::str_replace(age, "65", "6565")) %>%
separate(age, into = c("age_low", "age_high"), sep = 2) %>%
select(-new, -iso2, -iso3)