The tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report. There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided.
This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we’ll need multiple steps to tidy it.
Let’s have a look at what we’ve got:
country, iso2, and iso3 are three variables that redundantly specify the country.year is clearly also a variable.What about all the other columns (e.g. new_sp_m014, new_ep_m014, new_ep_f014)?
rel stands for cases of relapseep stands for cases of extrapulmonary TBsn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)sp stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive)m) and females (f).014 = 0 – 14 years old1524 = 15 – 24 years old2534 = 25 – 34 years old3544 = 35 – 44 years old4554 = 45 – 54 years old5564 = 55 – 64 years old65 = 65 or olderWe ask you to tidy the above data set following these steps.
Gather together all the columns from new_sp_m014 to newrel_f65. Use the variable key for the gathered columns and variable cases for the values. Notice there are a lot of missing values in the current representation, so use values_drop_na to remove them.
who1 <- who %>%
pivot_longer(___, names_to = ___, values_to = ___, values_drop_na = ___)
who1
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
## <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 Afghani… AF AFG 1980 NA NA NA NA
## 2 Afghani… AF AFG 1981 NA NA NA NA
## 3 Afghani… AF AFG 1982 NA NA NA NA
## 4 Afghani… AF AFG 1983 NA NA NA NA
## 5 Afghani… AF AFG 1984 NA NA NA NA
## 6 Afghani… AF AFG 1985 NA NA NA NA
## 7 Afghani… AF AFG 1986 NA NA NA NA
## 8 Afghani… AF AFG 1987 NA NA NA NA
## 9 Afghani… AF AFG 1988 NA NA NA NA
## 10 Afghani… AF AFG 1989 NA NA NA NA
## # … with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>,
## # new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>,
## # new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
## # new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
## # new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
## # new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
## # new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
## # new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>,
## # new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
## # new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>,
## # new_ep_m4554 <int>, new_ep_m5564 <int>, new_ep_m65 <int>,
## # new_ep_f014 <int>, new_ep_f1524 <int>, new_ep_f2534 <int>,
## # new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>,
## # new_ep_f65 <int>, newrel_m014 <int>, newrel_m1524 <int>,
## # newrel_m2534 <int>, newrel_m3544 <int>, newrel_m4554 <int>,
## # newrel_m5564 <int>, newrel_m65 <int>, newrel_f014 <int>,
## # newrel_f1524 <int>, newrel_f2534 <int>, newrel_f3544 <int>,
## # newrel_f4554 <int>, newrel_f5564 <int>, newrel_f65 <int>
## # A tibble: 76,046 x 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36
## # … with 76,036 more rows
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel we have newrel Use stringr::str_replace() to replace the characters “newrel” with “new_rel” in the key column. This makes all variable names consistent.
who2 <- who1 %>%
mutate(key = stringr::str_replace(___, ___, ___))
who2
## # A tibble: 76,046 x 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36
## # … with 76,036 more rows
Separate the values in each code with three passes of separate(). First, separate key after underscore character in three new variables: new, type, and sexage
who3 <- who2 %>%
___(___, into = ___, sep = "___")
who3
## # A tibble: 76,046 x 8
## country iso2 iso3 year new type sexage cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m014 0
## 2 Afghanistan AF AFG 1997 new sp m1524 10
## 3 Afghanistan AF AFG 1997 new sp m2534 6
## 4 Afghanistan AF AFG 1997 new sp m3544 3
## 5 Afghanistan AF AFG 1997 new sp m4554 5
## 6 Afghanistan AF AFG 1997 new sp m5564 2
## 7 Afghanistan AF AFG 1997 new sp m65 0
## 8 Afghanistan AF AFG 1997 new sp f014 5
## 9 Afghanistan AF AFG 1997 new sp f1524 38
## 10 Afghanistan AF AFG 1997 new sp f2534 36
## # … with 76,036 more rows
Next separate sexage into sex and age by splitting after the first character:
who4 <- who3 %>%
___(___, into = ___, sep = ___)
who4
## # A tibble: 76,046 x 9
## country iso2 iso3 year new type sex age cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m 014 0
## 2 Afghanistan AF AFG 1997 new sp m 1524 10
## 3 Afghanistan AF AFG 1997 new sp m 2534 6
## 4 Afghanistan AF AFG 1997 new sp m 3544 3
## 5 Afghanistan AF AFG 1997 new sp m 4554 5
## 6 Afghanistan AF AFG 1997 new sp m 5564 2
## 7 Afghanistan AF AFG 1997 new sp m 65 0
## 8 Afghanistan AF AFG 1997 new sp f 014 5
## 9 Afghanistan AF AFG 1997 new sp f 1524 38
## 10 Afghanistan AF AFG 1997 new sp f 2534 36
## # … with 76,036 more rows
Finally separate age into age_low and age_high by splitting after the second character (first replace “014” with “0014” and “65” with “6565”):
who5 <- who4 %>%
mutate(age = stringr::str_replace(___, ___, ___)) %>%
mutate(age = stringr::str_replace(___, ___, ___)) %>%
separate(___, into = ___, sep = ___)
## # A tibble: 76,046 x 10
## country iso2 iso3 year new type sex age_low age_high cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m 00 14 0
## 2 Afghanistan AF AFG 1997 new sp m 15 24 10
## 3 Afghanistan AF AFG 1997 new sp m 25 34 6
## 4 Afghanistan AF AFG 1997 new sp m 35 44 3
## 5 Afghanistan AF AFG 1997 new sp m 45 54 5
## 6 Afghanistan AF AFG 1997 new sp m 55 64 2
## 7 Afghanistan AF AFG 1997 new sp m 65 65 0
## 8 Afghanistan AF AFG 1997 new sp f 00 14 5
## 9 Afghanistan AF AFG 1997 new sp f 15 24 38
## 10 Afghanistan AF AFG 1997 new sp f 25 34 36
## # … with 76,036 more rows
Check that the new column is constant in this dataset. Hence drop column new as well as iso2 and iso3 since they’re redundant.
# check that new is constant
who5 %>% ___
# drop redundant columns
who6 <- who5 %>%
___(___, ___, ___)
who6
## # A tibble: 1 x 2
## new n
## <chr> <int>
## 1 new 76046
## # A tibble: 76,046 x 7
## country year type sex age_low age_high cases
## <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 00 14 0
## 2 Afghanistan 1997 sp m 15 24 10
## 3 Afghanistan 1997 sp m 25 34 6
## 4 Afghanistan 1997 sp m 35 44 3
## 5 Afghanistan 1997 sp m 45 54 5
## 6 Afghanistan 1997 sp m 55 64 2
## 7 Afghanistan 1997 sp m 65 65 0
## 8 Afghanistan 1997 sp f 00 14 5
## 9 Afghanistan 1997 sp f 15 24 38
## 10 Afghanistan 1997 sp f 25 34 36
## # … with 76,036 more rows
You have written the code a piece at a time, assigning each interim result to a new variable. Instead, gradually build up a complex pipe.