The tidyr::who
dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report. There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided.
library(tidyr)
library(dplyr)
View who dataset in RStudio with View()
function. This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who
is messy, and we’ll need multiple steps to tidy it.
Let’s have a look at what we’ve got:
country
, iso2
, and iso3
are three variables that redundantly specify the country.year
is clearly also a variable.What about all the other columns (e.g. new_sp_m014
, new_ep_m014
, new_ep_f014
)?
rel
stands for cases of relapseep
stands for cases of extrapulmonary TBsn
stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)sp
stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive)m
) and females (f
).014
= 0 – 14 years old1524
= 15 – 24 years old2534
= 25 – 34 years old3544
= 35 – 44 years old4554
= 45 – 54 years old5564
= 55 – 64 years old65
= 65 or olderWe ask you to tidy the above data set following these steps.
Gather together all the columns from new_sp_m014
to newrel_f65
. Use the variable key
for the gathered columns and variable cases
for the values. Notice there are a lot of missing values in the current representation, so use na.rm
to remove them.
who1 <- who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel
we have newrel
Use stringr::str_replace()
to replace the characters “newrel” with “new_rel” in the key
column. This makes all variable names consistent.
who2 <- who1 %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
Separate the values in each code with two passes of separate()
. First, separate key after underscore character in three new variables: new
, type
, and sexage
who3 <- who2 %>%
separate(key, into = c("new", "type", "sexage"), sep = "_")
who3
Next separate sexage
into sex
and age
by splitting after the first character:
who4 <- who3 %>%
separate(sexage, into = c("sex", "age"), sep = 1)
who4
Finally separate age
into age_low
and age_high
by splitting after the second character (first replace “014” with “0014” and “65” with “6565”):
who5 <- who4 %>%
mutate(age = stringr::str_replace(age, "014", "0014")) %>%
mutate(age = stringr::str_replace(age, "65", "6565")) %>%
separate(age, into = c("age_low", "age_high"), sep = 2)
Check that the new
column is constant in this dataset. Hence drop column new
as well as iso2
and iso3
since they’re redundant.
who5 %>% count(new)
who6 <- who5 %>%
select(-new, -iso2, -iso3)
who5
You have written the code a piece at a time, assigning each interim result to a new variable. Instead, gradually build up a complex pipe.
who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, into = c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, into = c("sex", "age"), sep = 1) %>%
mutate(age = stringr::str_replace(age, "014", "0014")) %>%
mutate(age = stringr::str_replace(age, "65", "6565")) %>%
separate(age, into = c("age_low", "age_high"), sep = 2) %>%
select(-new, -iso2, -iso3)