The quality of the glue

  • Most of the existing R tools for natural language processing, besides the tidytext package, are not compatible with this format
  • The CRAN Task View for Natural Language Processing lists a large selection of packages that take other input structures and provide non-tidy outputs
  • Notable packages include tm and quanteda
  • These packages are very useful in text mining applications, and many existing text datasets are structured according to their formats

Document-term matrix

One of the most common structures that text mining packages work with is the document-term matrix (or DTM). This is a matrix where:

  • each row represents one document (such as a book or article)
  • each column represents one term
  • each value (typically) contains the number of appearances of that term in that document
  • since most pairings of document and term do not occur (they have the value zero), DTMs are usually implemented as sparse matrices
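
As a toy illustration (made-up documents, base R only), a dense version of such a matrix can be built by counting terms per document:

```r
# two tiny documents, represented as a dense document-term matrix
docs <- list(d1 = c("the", "cat", "sat"), d2 = c("the", "the", "dog"))

# the vocabulary: all distinct terms, sorted
terms <- sort(unique(unlist(docs)))

# one row per document, one column per term, values are counts
dtm <- t(sapply(docs, function(d) table(factor(d, levels = terms))))
dtm
```

Real document-term matrices have far more zero entries than this toy one, which is why sparse representations pay off.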

Document-term matrix

DTM objects cannot be used directly with tidy tools, just as tidy data frames cannot be used as input for most text mining packages.

Thus, the tidytext package provides two verbs that convert between the two formats:

  • tidy() turns a document-term matrix into a tidy data frame
  • cast() turns a tidy one-term-per-row data frame into a matrix. tidytext provides three variations of this verb, each converting to a different type of matrix:
    • cast_sparse() converts to a sparse matrix from the Matrix package
    • cast_dtm() converts to a DocumentTermMatrix object from the tm package
    • cast_dfm() converts to a dfm object from the quanteda package
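
As a minimal sketch (toy counts, invented document and term names), a tidy one-term-per-row table can be cast straight to a sparse matrix:

```r
library(tidytext)
library(dplyr)

# a tiny tidy one-term-per-row table
tidy_counts <- tibble(
  document = c("d1", "d1", "d2"),
  term     = c("cat", "sat", "dog"),
  count    = c(1, 1, 2)
)

# cast to a sparse matrix: rows = documents, columns = terms
m <- cast_sparse(tidy_counts, document, term, count)
m
```

The unobserved document-term pairings (such as d1 and dog) appear as zeros in the matrix, even though they have no row in the tidy table.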

Document-term matrix in tm package

Perhaps the most widely used implementation of DTMs in R is the DocumentTermMatrix class in the tm package. Many available text mining datasets are provided in this format.

For example, consider the collection of Associated Press newspaper articles included in the topicmodels package.

library(tm)
library(tidyverse)
library(tidytext)


# load the specified dataset
data("AssociatedPress", package = "topicmodels")
AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
# tidy the DTM
ap_td <- tidy(AssociatedPress)
ap_td
## # A tibble: 302,031 × 3
##    document term       count
##       <int> <chr>      <dbl>
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # ℹ 302,021 more rows

Notice that only the non-zero values are included in the tidied output; this means the tidied version has no rows where count is zero.

Document-term matrix in quanteda package

Other text mining packages provide alternative implementations of document-term matrices, such as the dfm (document-feature matrix) class from the quanteda package.

For example, the quanteda package comes with a corpus of presidential inauguration speeches, which can be converted to a dfm with the dfm() function.

library(quanteda)
data("data_corpus_inaugural", package = "quanteda")
inaug_dfm <- dfm(tokens(data_corpus_inaugural), verbose = FALSE)
inaug_dfm
## Document-feature matrix of: 59 documents, 9,437 features (91.84% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens  of the senate and house representatives :
##   1789-Washington               1  71 116      1  48     2               2 1
##   1793-Washington               0  11  13      0   2     0               0 1
##   1797-Adams                    3 140 163      1 130     0               2 0
##   1801-Jefferson                2 104 130      0  81     0               0 1
##   1805-Jefferson                0 101 143      0  93     0               0 0
##   1809-Madison                  1  69 104      0  43     0               0 0
##                  features
## docs              among vicissitudes
##   1789-Washington     1            1
##   1793-Washington     0            0
##   1797-Adams          4            0
##   1801-Jefferson      1            0
##   1805-Jefferson      7            0
##   1809-Madison        0            0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,427 more features ]
# tidy the DFM
inaug_td <- tidy(inaug_dfm)
inaug_td
## # A tibble: 45,452 × 3
##    document        term            count
##    <chr>           <chr>           <dbl>
##  1 1789-Washington fellow-citizens     1
##  2 1797-Adams      fellow-citizens     3
##  3 1801-Jefferson  fellow-citizens     2
##  4 1809-Madison    fellow-citizens     1
##  5 1813-Madison    fellow-citizens     1
##  6 1817-Monroe     fellow-citizens     5
##  7 1821-Monroe     fellow-citizens     1
##  8 1841-Harrison   fellow-citizens    11
##  9 1845-Polk       fellow-citizens     1
## 10 1849-Taylor     fellow-citizens     1
## # ℹ 45,442 more rows

Going back: from tidy to matrices

Just as some existing text mining packages provide document-term matrices as sample data or output, some algorithms expect such matrices as input.

Therefore, tidytext provides cast_ verbs for converting from a tidy form to these matrices.

For example, we could take the tidied AP dataset and cast it back into a document-term matrix using the cast_dtm() function.

ap_td %>%
  cast_dtm(document, term, count)
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

Similarly, we could cast the table into quanteda’s dfm object with cast_dfm().

ap_td %>%
  cast_dfm(document, term, count)
## Document-feature matrix of: 2,246 documents, 10,473 features (98.72% sparse) and 0 docvars.
##     features
## docs adding adult ago alcohol allegedly allen apparently appeared arrested
##    1      1     2   1       1         1     1          2        1        1
##    2      0     0   0       0         0     0          0        1        0
##    3      0     0   1       0         0     0          0        1        0
##    4      0     0   3       0         0     0          0        0        0
##    5      0     0   0       0         0     0          0        0        0
##    6      0     0   2       0         0     0          0        0        0
##     features
## docs assault
##    1       1
##    2       0
##    3       0
##    4       0
##    5       0
##    6       0
## [ reached max_ndoc ... 2,240 more documents, reached max_nfeat ... 10,463 more features ]

Some tools simply require a sparse matrix:

library(Matrix)

# cast into a Matrix object
m <- ap_td %>%
  cast_sparse(document, term, count)

# dimensions
dim(m)
## [1]  2246 10473
# first 10 terms of first document
m[1, 1:10]
##     adding      adult        ago    alcohol  allegedly      allen apparently 
##          1          2          1          1          1          1          2 
##   appeared   arrested    assault 
##          1          1          1
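
One advantage of the plain sparse matrix (continuing with m from above) is that ordinary matrix operations apply directly; for example, summing columns gives per-term totals:

```r
# total occurrences of each term across all documents
term_totals <- Matrix::colSums(m)
head(sort(term_totals, decreasing = TRUE))
```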

Corpus

Some data structures are designed to store a document collection before tokenization; such a collection is often called a corpus.

A corpus is a document collection before tokenization

One common example is Corpus objects from the tm package. These store text alongside metadata, which may include an ID, date/time, title, or language for each document.
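
As a minimal sketch (invented texts), a Corpus can also be built in memory from a character vector:

```r
library(tm)

# build a small in-memory corpus from a character vector
txts <- c("Stocks rose sharply today.", "The merger was approved.")
small_corpus <- VCorpus(VectorSource(txts))

content(small_corpus[[2]])   # the raw text of the second document
meta(small_corpus[[1]])      # the (mostly default) metadata of the first
```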

For example, the tm package comes with the acq corpus, containing 50 articles from the news service Reuters.

data("acq")
acq
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 50
# first document
acq[[1]]
## <<PlainTextDocument>>
## Metadata:  15
## Content:  chars: 1287

We can thus use the tidy() method to construct a table with one row per document, including the metadata (such as id and datetimestamp) as columns alongside the text.

acq_td <- tidy(acq)
acq_td
## # A tibble: 50 × 16
##    author   datetimestamp       description heading id    language origin topics
##    <chr>    <dttm>              <chr>       <chr>   <chr> <chr>    <chr>  <chr> 
##  1 <NA>     1987-02-26 15:18:06 ""          COMPUT… 10    en       Reute… YES   
##  2 <NA>     1987-02-26 15:19:15 ""          OHIO M… 12    en       Reute… YES   
##  3 <NA>     1987-02-26 15:49:56 ""          MCLEAN… 44    en       Reute… YES   
##  4 By Cal … 1987-02-26 15:51:17 ""          CHEMLA… 45    en       Reute… YES   
##  5 <NA>     1987-02-26 16:08:33 ""          <COFAB… 68    en       Reute… YES   
##  6 <NA>     1987-02-26 16:32:37 ""          INVEST… 96    en       Reute… YES   
##  7 By Patt… 1987-02-26 16:43:13 ""          AMERIC… 110   en       Reute… YES   
##  8 <NA>     1987-02-26 16:59:25 ""          HONG K… 125   en       Reute… YES   
##  9 <NA>     1987-02-26 17:01:28 ""          LIEBER… 128   en       Reute… YES   
## 10 <NA>     1987-02-26 17:08:27 ""          GULF A… 134   en       Reute… YES   
## # ℹ 40 more rows
## # ℹ 8 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
## #   places <named list>, people <lgl>, orgs <lgl>, exchanges <lgl>, text <chr>

This can then be used with unnest_tokens().

acq_tokens <- acq_td %>%
  select(-places) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

acq_tokens
## # A tibble: 4,092 × 15
##    author datetimestamp       description heading   id    language origin topics
##    <chr>  <dttm>              <chr>       <chr>     <chr> <chr>    <chr>  <chr> 
##  1 <NA>   1987-02-26 15:18:06 ""          COMPUTER… 10    en       Reute… YES   
##  2 <NA>   1987-02-26 15:18:06 ""          COMPUTER… 10    en       Reute… YES   
##  3 <NA>   1987-02-26 15:18:06 ""          COMPUTER… 10    en       Reute… YES   
##  4 <NA>   1987-02-26 15:18:06 ""          COMPUTER… 10    en       Reute… YES   
##  5 <NA>   1987-02-26 15:18:06 ""          COMPUTER… 10    en       Reute… YES   
##  6 <NA>   1987-02-26 15:18:06 ""          COMPUTER… 10    en       Reute… YES   
##  7 <NA>   1987-02-26 15:18:06 ""          COMPUTER… 10    en       Reute… YES   
##  8 <NA>   1987-02-26 15:18:06 ""          COMPUTER… 10    en       Reute… YES   
##  9 <NA>   1987-02-26 15:18:06 ""          COMPUTER… 10    en       Reute… YES   
## 10 <NA>   1987-02-26 15:18:06 ""          COMPUTER… 10    en       Reute… YES   
## # ℹ 4,082 more rows
## # ℹ 7 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
## #   people <lgl>, orgs <lgl>, exchanges <lgl>, word <chr>

Play

  • create a corpus from documents on your hard disk
  • analyse the word frequency of the corpus
  • plot the 10 most frequent words
# read documents matching a pattern from a directory
class_corpus <- VCorpus(DirSource(directory = "..", 
                                  pattern = "\\.Rmd$",
                                  recursive = TRUE))

# one document per row
class_corpus_tidy <- tidy(class_corpus)


# add Rmd-specific noise words to the stop word list
stop_words <- stop_words %>% 
  bind_rows(tibble(
    word = c("0", "1", "2", "3", "false", "true", "https", "___", "library", "aes"),
    lexicon = "Rmd"
  ))

# tokenization
class_corpus_tokens <- class_corpus_tidy %>%
  select(datetimestamp, id, language, text) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

class_corpus_tokens
## # A tibble: 58,932 × 4
##    datetimestamp       id                language word                 
##    <dttm>              <chr>             <chr>    <chr>                
##  1 2024-12-13 12:55:19 assortativity.Rmd en       title                
##  2 2024-12-13 12:55:19 assortativity.Rmd en       assortativity        
##  3 2024-12-13 12:55:19 assortativity.Rmd en       author               
##  4 2024-12-13 12:55:19 assortativity.Rmd en       massimo              
##  5 2024-12-13 12:55:19 assortativity.Rmd en       franceschet          
##  6 2024-12-13 12:55:19 assortativity.Rmd en       output               
##  7 2024-12-13 12:55:19 assortativity.Rmd en       ioslides_presentation
##  8 2024-12-13 12:55:19 assortativity.Rmd en       css                  
##  9 2024-12-13 12:55:19 assortativity.Rmd en       style.css            
## 10 2024-12-13 12:55:19 assortativity.Rmd en       incremental          
## # ℹ 58,922 more rows
# count
class_corpus_tokens %>% 
  count(word, sort = TRUE)
## # A tibble: 7,728 × 2
##    word        n
##    <chr>   <int>
##  1 data      490
##  2 network   397
##  3 nodes     384
##  4 graph     382
##  5 word      325
##  6 matrix    302
##  7 alpha     284
##  8 edges     271
##  9 mutate    263
## 10 text      258
## # ℹ 7,718 more rows
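
One way to complete the last bullet, assuming the class_corpus_tokens table built above, is a simple bar chart:

```r
library(ggplot2)

# plot the 10 most frequent words
class_corpus_tokens %>% 
  count(word, sort = TRUE) %>% 
  slice_max(n, n = 10) %>% 
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "frequency", y = NULL)
```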