The tidyr::who
dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report. There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided.
This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who
is messy, and we’ll need multiple steps to tidy it.
Let’s have a look at what we’ve got:
country
, iso2
, and iso3
are three variables that redundantly specify the country.year
is clearly also a variable.What about all the other columns (e.g. new_sp_m014
, new_ep_m014
, new_ep_f014
)?
rel
stands for cases of relapseep
stands for cases of extrapulmonary TBsn
stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)sp
stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive)m
) and females (f
).014
= 0 – 14 years old1524
= 15 – 24 years old2534
= 25 – 34 years old3544
= 35 – 44 years old4554
= 45 – 54 years old5564
= 55 – 64 years old65
= 65 or olderWe ask you to tidy the above data set following these steps.
Gather together all the columns from new_sp_m014
to newrel_f65
. Use the variable key
for the gathered columns and variable cases
for the values. Notice there are a lot of missing values in the current representation, so use values_drop_na
to remove them.
who1 <- who %>%
pivot_longer(___, names_to = ___, values_to = ___, values_drop_na = ___)
who1
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
## <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 Afghani… AF AFG 1980 NA NA NA NA
## 2 Afghani… AF AFG 1981 NA NA NA NA
## 3 Afghani… AF AFG 1982 NA NA NA NA
## 4 Afghani… AF AFG 1983 NA NA NA NA
## 5 Afghani… AF AFG 1984 NA NA NA NA
## 6 Afghani… AF AFG 1985 NA NA NA NA
## 7 Afghani… AF AFG 1986 NA NA NA NA
## 8 Afghani… AF AFG 1987 NA NA NA NA
## 9 Afghani… AF AFG 1988 NA NA NA NA
## 10 Afghani… AF AFG 1989 NA NA NA NA
## # … with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>,
## # new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>,
## # new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
## # new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
## # new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
## # new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
## # new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
## # new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>,
## # new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
## # new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>,
## # new_ep_m4554 <int>, new_ep_m5564 <int>, new_ep_m65 <int>,
## # new_ep_f014 <int>, new_ep_f1524 <int>, new_ep_f2534 <int>,
## # new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>,
## # new_ep_f65 <int>, newrel_m014 <int>, newrel_m1524 <int>,
## # newrel_m2534 <int>, newrel_m3544 <int>, newrel_m4554 <int>,
## # newrel_m5564 <int>, newrel_m65 <int>, newrel_f014 <int>,
## # newrel_f1524 <int>, newrel_f2534 <int>, newrel_f3544 <int>,
## # newrel_f4554 <int>, newrel_f5564 <int>, newrel_f65 <int>
## # A tibble: 76,046 x 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36
## # … with 76,036 more rows
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel
we have newrel
Use stringr::str_replace()
to replace the characters “newrel” with “new_rel” in the key
column. This makes all variable names consistent.
who2 <- who1 %>%
mutate(key = stringr::str_replace(___, ___, ___))
who2
## # A tibble: 76,046 x 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36
## # … with 76,036 more rows
Separate the values in each code with three passes of separate()
. First, separate key after underscore character in three new variables: new
, type
, and sexage
who3 <- who2 %>%
___(___, into = ___, sep = "___")
who3
## # A tibble: 76,046 x 8
## country iso2 iso3 year new type sexage cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m014 0
## 2 Afghanistan AF AFG 1997 new sp m1524 10
## 3 Afghanistan AF AFG 1997 new sp m2534 6
## 4 Afghanistan AF AFG 1997 new sp m3544 3
## 5 Afghanistan AF AFG 1997 new sp m4554 5
## 6 Afghanistan AF AFG 1997 new sp m5564 2
## 7 Afghanistan AF AFG 1997 new sp m65 0
## 8 Afghanistan AF AFG 1997 new sp f014 5
## 9 Afghanistan AF AFG 1997 new sp f1524 38
## 10 Afghanistan AF AFG 1997 new sp f2534 36
## # … with 76,036 more rows
Next separate sexage
into sex
and age
by splitting after the first character:
who4 <- who3 %>%
___(___, into = ___, sep = ___)
who4
## # A tibble: 76,046 x 9
## country iso2 iso3 year new type sex age cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m 014 0
## 2 Afghanistan AF AFG 1997 new sp m 1524 10
## 3 Afghanistan AF AFG 1997 new sp m 2534 6
## 4 Afghanistan AF AFG 1997 new sp m 3544 3
## 5 Afghanistan AF AFG 1997 new sp m 4554 5
## 6 Afghanistan AF AFG 1997 new sp m 5564 2
## 7 Afghanistan AF AFG 1997 new sp m 65 0
## 8 Afghanistan AF AFG 1997 new sp f 014 5
## 9 Afghanistan AF AFG 1997 new sp f 1524 38
## 10 Afghanistan AF AFG 1997 new sp f 2534 36
## # … with 76,036 more rows
Finally separate age
into age_low
and age_high
by splitting after the second character (first replace “014” with “0014” and “65” with “6565”):
who5 <- who4 %>%
mutate(age = stringr::str_replace(___, ___, ___)) %>%
mutate(age = stringr::str_replace(___, ___, ___)) %>%
separate(___, into = ___, sep = ___)
## # A tibble: 76,046 x 10
## country iso2 iso3 year new type sex age_low age_high cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m 00 14 0
## 2 Afghanistan AF AFG 1997 new sp m 15 24 10
## 3 Afghanistan AF AFG 1997 new sp m 25 34 6
## 4 Afghanistan AF AFG 1997 new sp m 35 44 3
## 5 Afghanistan AF AFG 1997 new sp m 45 54 5
## 6 Afghanistan AF AFG 1997 new sp m 55 64 2
## 7 Afghanistan AF AFG 1997 new sp m 65 65 0
## 8 Afghanistan AF AFG 1997 new sp f 00 14 5
## 9 Afghanistan AF AFG 1997 new sp f 15 24 38
## 10 Afghanistan AF AFG 1997 new sp f 25 34 36
## # … with 76,036 more rows
Check that the new
column is constant in this dataset. Hence drop column new
as well as iso2
and iso3
since they’re redundant.
# check that new is constant
who5 %>% ___
# drop redundant columns
who6 <- who5 %>%
___(___, ___, ___)
who6
## # A tibble: 1 x 2
## new n
## <chr> <int>
## 1 new 76046
## # A tibble: 76,046 x 7
## country year type sex age_low age_high cases
## <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 00 14 0
## 2 Afghanistan 1997 sp m 15 24 10
## 3 Afghanistan 1997 sp m 25 34 6
## 4 Afghanistan 1997 sp m 35 44 3
## 5 Afghanistan 1997 sp m 45 54 5
## 6 Afghanistan 1997 sp m 55 64 2
## 7 Afghanistan 1997 sp m 65 65 0
## 8 Afghanistan 1997 sp f 00 14 5
## 9 Afghanistan 1997 sp f 15 24 38
## 10 Afghanistan 1997 sp f 25 34 36
## # … with 76,036 more rows
You have written the code a piece at a time, assigning each interim result to a new variable. Instead, gradually build up a complex pipe.