Some data structures are designed to store document collections before tokenization. Such a collection is often called a corpus.
One common example is Corpus objects from the tm package. These store text alongside metadata, which may include an ID, date/time, title, or language for each document.
For example, the tm package comes with the acq corpus, containing 50 articles from the news service Reuters.
library(tm)
data("acq")
acq
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 50
# first document
acq[[1]]
## <<PlainTextDocument>>
## Metadata: 15
## Content: chars: 1287
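To see what one of these documents holds, the metadata and text can be pulled out with tm's accessor functions. The following is a minimal sketch, assuming the tm package is installed. The variable name first_doc is only illustrative and does not appear in the original example.

```r
library(tm)
data("acq")

# Illustrative name: the first document in the corpus
first_doc <- acq[[1]]

# All metadata fields stored with this document
meta(first_doc)

# A single metadata field, here the document ID
meta(first_doc, "id")

# The text itself, returned as a character vector
content(first_doc)
```

Each document in the corpus behaves like a list element, so the same calls work for acq[[2]], acq[[3]], and so on.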
We can use the tidy() method to construct a table with one row per document, including the metadata (such as id and datetimestamp) as columns alongside the text.
library(tidytext)
acq_td <- tidy(acq)
acq_td
## # A tibble: 50 × 16
## author datetimestamp description heading id language origin topics
## <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> 1987-02-26 15:18:06 "" COMPUT… 10 en Reute… YES
## 2 <NA> 1987-02-26 15:19:15 "" OHIO M… 12 en Reute… YES
## 3 <NA> 1987-02-26 15:49:56 "" MCLEAN… 44 en Reute… YES
## 4 By Cal … 1987-02-26 15:51:17 "" CHEMLA… 45 en Reute… YES
## 5 <NA> 1987-02-26 16:08:33 "" <COFAB… 68 en Reute… YES
## 6 <NA> 1987-02-26 16:32:37 "" INVEST… 96 en Reute… YES
## 7 By Patt… 1987-02-26 16:43:13 "" AMERIC… 110 en Reute… YES
## 8 <NA> 1987-02-26 16:59:25 "" HONG K… 125 en Reute… YES
## 9 <NA> 1987-02-26 17:01:28 "" LIEBER… 128 en Reute… YES
## 10 <NA> 1987-02-26 17:08:27 "" GULF A… 134 en Reute… YES
## # ℹ 40 more rows
## # ℹ 8 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
## # places <named list>, people <lgl>, orgs <lgl>, exchanges <lgl>, text <chr>
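Because the tidied table is wide, it can help to look at just a few columns and to confirm that there is one row per document. This is a small sketch under the assumption that tm, tidytext, and dplyr are installed. The choice of columns is arbitrary.

```r
library(tm)
library(tidytext)
library(dplyr)

data("acq")
acq_td <- tidy(acq)

# One row per document: the row count should match the corpus length
nrow(acq_td) == length(acq)

# A narrower view of the table (columns chosen only for illustration)
acq_td %>%
  select(id, datetimestamp, heading, text)
```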
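This table can then be tokenized with unnest_tokens(), which yields one row per word while keeping the document metadata columns. Here we also drop the places list column and remove stop words.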
library(dplyr)

acq_tokens <- acq_td %>%
  select(-places) %>%                 # drop the places list column
  unnest_tokens(word, text) %>%       # one row per word
  anti_join(stop_words, by = "word")  # remove common stop words

acq_tokens
## # A tibble: 4,092 × 15
## author datetimestamp description heading id language origin topics
## <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> 1987-02-26 15:18:06 "" COMPUTER… 10 en Reute… YES
## 2 <NA> 1987-02-26 15:18:06 "" COMPUTER… 10 en Reute… YES
## 3 <NA> 1987-02-26 15:18:06 "" COMPUTER… 10 en Reute… YES
## 4 <NA> 1987-02-26 15:18:06 "" COMPUTER… 10 en Reute… YES
## 5 <NA> 1987-02-26 15:18:06 "" COMPUTER… 10 en Reute… YES
## 6 <NA> 1987-02-26 15:18:06 "" COMPUTER… 10 en Reute… YES
## 7 <NA> 1987-02-26 15:18:06 "" COMPUTER… 10 en Reute… YES
## 8 <NA> 1987-02-26 15:18:06 "" COMPUTER… 10 en Reute… YES
## 9 <NA> 1987-02-26 15:18:06 "" COMPUTER… 10 en Reute… YES
## 10 <NA> 1987-02-26 15:18:06 "" COMPUTER… 10 en Reute… YES
## # ℹ 4,082 more rows
## # ℹ 7 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
## # people <lgl>, orgs <lgl>, exchanges <lgl>, word <chr>
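Once the corpus is in this one-word-per-row form, ordinary dplyr verbs apply. As a hedged sketch of one possible next step (not necessarily the analysis the original text goes on to perform), the tokens can be counted overall or per document, assuming tm, tidytext, and dplyr are installed.

```r
library(tm)
library(tidytext)
library(dplyr)

data("acq")

# Rebuild the token table so this example is self-contained
acq_tokens <- tidy(acq) %>%
  select(-places) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Most common words across all articles
acq_tokens %>%
  count(word, sort = TRUE)

# Word counts within each article, using the id column from the metadata
acq_tokens %>%
  count(id, word, sort = TRUE)
```

Counting by id works here because tidy() carried the document-level metadata through to every token row.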