The tidyr::who
dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report. There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided.
library(tidyr)
library(dplyr)
View who dataset in RStudio with View()
function. This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who
is messy, and we’ll need multiple steps to tidy it.
Let’s have a look at what we’ve got:
country
, iso2
, and iso3
are three variables that redundantly specify the country.year
is clearly also a variable.What about all the other columns (e.g. new_sp_m014
, new_ep_m014
, new_ep_f014
)?
rel
stands for cases of relapseep
stands for cases of extrapulmonary TBsn
stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)sp
stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive)m
) and females (f
).014
= 0 – 14 years old1524
= 15 – 24 years old2534
= 25 – 34 years old3544
= 35 – 44 years old4554
= 45 – 54 years old5564
= 55 – 64 years old65
= 65 or olderWe ask you to tidy the above data set following these steps.
Gather together all the columns from new_sp_m014
to newrel_f65
. Use the variable key
for the gathered columns and variable cases
for the values. Notice there are a lot of missing values in the current representation, so use na.rm
to remove them.
View(who)
who1 <- who %>%
gather(___, key = "___", value = "___", na.rm = ___)
who1
who1 <- who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
## # A tibble: 76,046 x 6
## country iso2 iso3 year key cases
## * <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1998 new_sp_m014 30
## 3 Afghanistan AF AFG 1999 new_sp_m014 8
## 4 Afghanistan AF AFG 2000 new_sp_m014 52
## 5 Afghanistan AF AFG 2001 new_sp_m014 129
## 6 Afghanistan AF AFG 2002 new_sp_m014 90
## 7 Afghanistan AF AFG 2003 new_sp_m014 127
## 8 Afghanistan AF AFG 2004 new_sp_m014 139
## 9 Afghanistan AF AFG 2005 new_sp_m014 151
## 10 Afghanistan AF AFG 2006 new_sp_m014 193
## # ... with 76,036 more rows
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel
we have newrel
Use stringr::str_replace()
to replace the characters “newrel” with “new_rel”. This makes all variable names consistent.
who2 <- who1 %>%
mutate(___ = ___(___, "newrel", "new_rel"))
who2
who2 <- who1 %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
## # A tibble: 76,046 x 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1998 new_sp_m014 30
## 3 Afghanistan AF AFG 1999 new_sp_m014 8
## 4 Afghanistan AF AFG 2000 new_sp_m014 52
## 5 Afghanistan AF AFG 2001 new_sp_m014 129
## 6 Afghanistan AF AFG 2002 new_sp_m014 90
## 7 Afghanistan AF AFG 2003 new_sp_m014 127
## 8 Afghanistan AF AFG 2004 new_sp_m014 139
## 9 Afghanistan AF AFG 2005 new_sp_m014 151
## 10 Afghanistan AF AFG 2006 new_sp_m014 193
## # ... with 76,036 more rows
Separate the values in each code with two passes of separate()
. First, separate key after underscore character in three new variables:
who3 <- who2 %>%
___(___, c("new", "type", "sexage"), sep = "___")
who3
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
who3
## # A tibble: 76,046 x 8
## country iso2 iso3 year new type sexage cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m014 0
## 2 Afghanistan AF AFG 1998 new sp m014 30
## 3 Afghanistan AF AFG 1999 new sp m014 8
## 4 Afghanistan AF AFG 2000 new sp m014 52
## 5 Afghanistan AF AFG 2001 new sp m014 129
## 6 Afghanistan AF AFG 2002 new sp m014 90
## 7 Afghanistan AF AFG 2003 new sp m014 127
## 8 Afghanistan AF AFG 2004 new sp m014 139
## 9 Afghanistan AF AFG 2005 new sp m014 151
## 10 Afghanistan AF AFG 2006 new sp m014 193
## # ... with 76,036 more rows
Next separate sexage
into sex
and age
by splitting after the first character:
who4 <- who3 %>%
___(sexage, c("___", "___"), sep = ___)
who4
who4 <- who3 %>%
separate(sexage, c("sex", "age"), sep = 1)
who4
## # A tibble: 76,046 x 9
## country iso2 iso3 year new type sex age cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m 014 0
## 2 Afghanistan AF AFG 1998 new sp m 014 30
## 3 Afghanistan AF AFG 1999 new sp m 014 8
## 4 Afghanistan AF AFG 2000 new sp m 014 52
## 5 Afghanistan AF AFG 2001 new sp m 014 129
## 6 Afghanistan AF AFG 2002 new sp m 014 90
## 7 Afghanistan AF AFG 2003 new sp m 014 127
## 8 Afghanistan AF AFG 2004 new sp m 014 139
## 9 Afghanistan AF AFG 2005 new sp m 014 151
## 10 Afghanistan AF AFG 2006 new sp m 014 193
## # ... with 76,036 more rows
Then we might as well drop the new
column because it’s constant in this dataset. While we’re dropping columns, let’s also drop iso2
and iso3
since they’re redundant.
# check that new is constant
who4 %>% ___
# drop redundant columns
who5 <- who4 %>%
select(___, ___, ___)
who4 %>%
count(new)
## # A tibble: 1 x 2
## new n
## <chr> <int>
## 1 new 76046
who5 <- who4 %>%
select(-new, -iso2, -iso3)
You have written the code a piece at a time, assigning each interim result to a new variable. Instead, gradually build up a complex pipe.
who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
separate(sexage, c("sex", "age"), sep = 1) %>%
select(-new, -iso2, -iso3)
## # A tibble: 76,046 x 6
## country year var sex age value
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # ... with 76,036 more rows