Text mining

with information growing at exponential rates, it’s no surprise that historians are referring to this period of history as the Information Age
analysts are often trained to handle tabular or rectangular data that is mostly numeric, but much of the data proliferating today is unstructured and text-heavy
many of us who work in analytical fields are not trained in even simple interpretation of natural language

Tidy text

tidytext is an R package for analysing text within the tidyverse philosophy
treating text as data frames of individual words allows us to manipulate, summarize, and visualize the characteristics of text easily and integrate natural language processing into effective workflows of the tidyverse

The tidy text approach

using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text
as described by Hadley Wickham, tidy data has a specific structure:
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table

One-token-per-row

we define the tidy text format as being a table with one-token-per-row
a token is a meaningful unit of text, such as a word, a sentence, or paragraph, that we are interested in using for analysis
tokenization is the process of splitting text into tokens

unnest_tokens

unnest_tokens is the main verb of tidytext
it splits text into tokens and outputs a one-token-per-row table
takes 3 main parameters:
1. tbl: a data frame containing the text to tokenize
2. output: the output column to be created
3. input: the input column that gets split
punctuation is stripped
tokens are converted to lowercase
other columns, such as the line number each word came from, are retained

# the tidy tools
library(tidyverse)
# the tidy tools for text
library(tidytext)

# Emily Dickinson wrote some lovely text in her time
text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")


# a data frame with one row per line of the poem
# (tibble() replaces the deprecated data_frame())
text_df <- tibble(line = 1:4, text = text)
text_df

## # A tibble: 4 × 2
##    line text                                  
##   <int> <chr>                                 
## 1     1 Because I could not stop for Death -  
## 2     2 He kindly stopped for me -            
## 3     3 The Carriage held but just Ourselves -
## 4     4 and Immortality

# tokenization: one row per word
unnest_tokens(tbl = text_df, output = word, input = text)

## # A tibble: 20 × 2
##     line word       
##    <int> <chr>      
##  1     1 because    
##  2     1 i          
##  3     1 could      
##  4     1 not        
##  5     1 stop       
##  6     1 for        
##  7     1 death      
##  8     2 he         
##  9     2 kindly     
## 10     2 stopped    
## 11     2 for        
## 12     2 me         
## 13     3 the        
## 14     3 carriage   
## 15     3 held       
## 16     3 but        
## 17     3 just       
## 18     3 ourselves  
## 19     4 and        
## 20     4 immortality
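
The same verb can also produce other kinds of tokens or keep the original capitalization. The following is a minimal sketch, assuming a recent tidytext release in which the `token`, `n`, and `to_lower` arguments behave as documented; `poem_df` is a name introduced here and holds only the first two lines of the poem.

```r
library(tibble)
library(tidytext)

# hypothetical two-line input, reusing text from the example above
poem_df <- tibble(line = 1:2,
                  text = c("Because I could not stop for Death -",
                           "He kindly stopped for me -"))

# keep the original capitalization instead of lowercasing
words_cased <- unnest_tokens(poem_df, output = word, input = text,
                             to_lower = FALSE)

# tokenize into overlapping pairs of consecutive words (bigrams)
bigrams <- unnest_tokens(poem_df, output = bigram, input = text,
                         token = "ngrams", n = 2)

words_cased
bigrams
```

In both results the `line` column is carried along, so each token can still be traced back to the line it came from.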

Stop words and stems

stop words are words which are filtered out before processing of natural language data (text), such as “the”, “is”, “at”, “which”, “for”, “an”, and “on”
stemming is the process of reducing inflected words to their word stem, base or root form. For instance, a stemming algorithm might reduce the words fishing, fished, and fisher to the stem fish
a popular stemmer is Porter’s stemming algorithm
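
As a small illustration of both ideas, the sketch below filters a handful of words against the `stop_words` table from tidytext and then stems what is left with `wordStem()` from SnowballC, which applies Porter's algorithm by default. The word list `toy_words` is made up for this example.

```r
library(dplyr)
library(tibble)
library(tidytext)
library(SnowballC)

# made-up input: a mix of stop words and content words
toy_words <- tibble(word = c("the", "fishing", "is", "fished", "at", "boats"))

# drop every row whose word appears in the stop word lexicons
content_words <- anti_join(toy_words, stop_words, by = "word")

# reduce the remaining words to their stems
stemmed <- mutate(content_words, stem = wordStem(word))
stemmed
```

A stemmer does not always collapse every related form to the same stem, so it is worth inspecting the `stem` column before relying on it.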

Jane Austen’s novels

let’s use the text of Jane Austen’s 6 completed, published novels from the janeaustenr package, and transform them into a tidy format
the janeaustenr package provides these texts in a one-row-per-line format, where a line in this context is analogous to a literal printed line in a physical book

library(janeaustenr)
library(stringr)

# one printed line of text per row
austen_books()

## # A tibble: 73,422 × 2
##    text                    book               
##  * <chr>                   <fct>              
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility
##  2 ""                      Sense & Sensibility
##  3 "by Jane Austen"        Sense & Sensibility
##  4 ""                      Sense & Sensibility
##  5 "(1811)"                Sense & Sensibility
##  6 ""                      Sense & Sensibility
##  7 ""                      Sense & Sensibility
##  8 ""                      Sense & Sensibility
##  9 ""                      Sense & Sensibility
## 10 "CHAPTER 1"             Sense & Sensibility
## # ℹ 73,412 more rows

# add line and chapter numbers relative to books
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(
           str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup()

original_books

## # A tibble: 73,422 × 4
##    text                    book                linenumber chapter
##    <chr>                   <fct>                    <int>   <int>
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility          1       0
##  2 ""                      Sense & Sensibility          2       0
##  3 "by Jane Austen"        Sense & Sensibility          3       0
##  4 ""                      Sense & Sensibility          4       0
##  5 "(1811)"                Sense & Sensibility          5       0
##  6 ""                      Sense & Sensibility          6       0
##  7 ""                      Sense & Sensibility          7       0
##  8 ""                      Sense & Sensibility          8       0
##  9 ""                      Sense & Sensibility          9       0
## 10 "CHAPTER 1"             Sense & Sensibility         10       1
## # ℹ 73,412 more rows
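
The chapter counter above relies on `cumsum()` treating `TRUE` as 1 and `FALSE` as 0, so the running total increases by one at every chapter heading. A minimal sketch on a made-up vector of lines (`toy_lines` is not part of the Austen data):

```r
library(stringr)

# made-up lines standing in for the text column
toy_lines <- c("A TITLE", "", "CHAPTER 1", "some text", "more text",
               "Chapter II", "final text")

# TRUE where a line starts with "chapter" followed by a digit or roman numeral
is_heading <- str_detect(toy_lines,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# running count of headings seen so far; 0 before the first chapter
chapter <- cumsum(is_heading)
chapter
## [1] 0 0 1 1 1 2 2
```

Lines before the first heading get chapter 0, which is why the title matter of each novel shows `chapter = 0` in the output.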

# tokenize: one word per row
tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books

## # A tibble: 725,055 × 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 and        
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by         
##  5 Sense & Sensibility          3       0 jane       
##  6 Sense & Sensibility          3       0 austen     
##  7 Sense & Sensibility          5       0 1811       
##  8 Sense & Sensibility         10       1 chapter    
##  9 Sense & Sensibility         10       1 1          
## 10 Sense & Sensibility         13       1 the        
## # ℹ 725,045 more rows

# remove stop words
stop_words

## # A tibble: 1,149 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ℹ 1,139 more rows

# name the join column explicitly to avoid the "Joining with" message
tidy_books <- tidy_books %>%
  anti_join(stop_words, by = "word")

# word frequency
tidy_books %>%
  count(word, sort = TRUE)

## # A tibble: 13,914 × 2
##    word       n
##    <chr>  <int>
##  1 miss    1855
##  2 time    1337
##  3 fanny    862
##  4 dear     822
##  5 lady     817
##  6 sir      806
##  7 day      797
##  8 emma     787
##  9 sister   727
## 10 house    699
## # ℹ 13,904 more rows

# plot word frequency
tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  # reorder levels of factor word wrt n
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
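
`reorder()` matters in the plot above because ggplot2 draws a discrete axis in factor-level order, which is alphabetical by default. A minimal sketch with made-up labels and counts shows how the levels change:

```r
# made-up labels and counts, for illustration only
labels <- c("alpha", "beta", "gamma")
counts <- c(10, 30, 20)

# default factor levels are alphabetical
levels(factor(labels))

# reorder() sorts the levels by the second argument, in increasing order
by_count <- reorder(labels, counts)
levels(by_count)
## [1] "alpha" "gamma" "beta"
```

Because the levels increase with the count, `coord_flip()` places the most frequent word at the top of the bar chart.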

library(wordcloud)

tidy_books %>%
  count(word) %>%
  # evaluate an R expression in an environment constructed from data
  with(wordcloud(word, n, max.words = 100))

# Porter's word stemming
library(SnowballC)
# note: this overwrites tidy_books, so all later steps work on stems
tidy_books <- tidy_books %>%
  mutate(word = wordStem(word)) # stemming

# plot word frequency
tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

# tokenize by pattern (regular expression)
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, 
                token = "regex", 
                pattern = "(Chapter|CHAPTER) [\\dIVXLC]{1,8}") %>%
  ungroup()

# how many chapters in each book?
austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n()) %>% 
  arrange(-chapters)

## # A tibble: 6 × 2
##   book                chapters
##   <fct>                  <int>
## 1 Pride & Prejudice         62
## 2 Emma                      56
## 3 Sense & Sensibility       51
## 4 Mansfield Park            49
## 5 Northanger Abbey          32
## 6 Persuasion                25

Project Gutenberg

now that we’ve used the janeaustenr package to explore tidying text, let’s introduce the gutenbergr package
the gutenbergr package provides access to the public domain works from the Project Gutenberg collection
we will mostly use the function gutenberg_download() that downloads one or more works from Project Gutenberg by ID
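
To find the ID of a work, the package also bundles a metadata table that can be searched without downloading any text. The following is a minimal sketch, assuming the `gutenberg_works()` helper and assuming that the bundled metadata records the author as a string containing "Wells, H. G."; both should be checked against the installed version of gutenbergr.

```r
library(dplyr)
library(stringr)
library(gutenbergr)

# filter the bundled metadata; the author spelling is an assumption about
# how Project Gutenberg records H.G. Wells
wells_works <- gutenberg_works(str_detect(author, fixed("Wells, H. G.")))

# the gutenberg_id column holds the IDs to pass to gutenberg_download()
wells_works %>%
  select(gutenberg_id, title)
```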

Project Gutenberg - H.G. Wells

Let’s look at some science fiction and fantasy novels by H.G. Wells, who lived in the late 19th and early 20th centuries. Let’s get four of his works by their Project Gutenberg IDs.

Download the RDS file.

library(gutenbergr)
# run once and save the result as RDS
#hgwells <- gutenberg_download(c(35, 36, 5230, 159))
#write_rds(hgwells, "hgwells.rds")

# read from RDS
hgwells <- read_rds("hgwells.rds")
hgwells

## # A tibble: 20,347 × 2
##    gutenberg_id text              
##           <int> <chr>             
##  1           35 "The Time Machine"
##  2           35 ""                
##  3           35 "An Invention"    
##  4           35 ""                
##  5           35 "by H. G. Wells"  
##  6           35 ""                
##  7           35 ""                
##  8           35 "CONTENTS"        
##  9           35 ""                
## 10           35 " I Introduction" 
## # ℹ 20,337 more rows

tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

tidy_hgwells %>%
  count(word, sort = TRUE)

## # A tibble: 11,830 × 2
##    word       n
##    <chr>  <int>
##  1 time     461
##  2 people   302
##  3 door     260
##  4 heard    249
##  5 black    232
##  6 stood    229
##  7 white    224
##  8 hand     218
##  9 kemp     213
## 10 eyes     210
## # ℹ 11,820 more rows

Project Gutenberg - Brontë sisters

Now let’s get some well-known works of the Brontë sisters, whose lives overlapped with Jane Austen’s somewhat but who wrote in a rather different style. Let’s get five of their works by their Project Gutenberg IDs.

Download the RDS file.

# run once and save the result as RDS
#bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767))
#write_rds(bronte, "bronte.rds")

# read from RDS
bronte <- read_rds("bronte.rds")
bronte

## # A tibble: 80,089 × 2
##    gutenberg_id text                              
##           <int> <chr>                             
##  1          767 "Agnes Grey"                      
##  2          767 "A NOVEL,"                        
##  3          767 ""                                
##  4          767 "by ACTON BELL."                  
##  5          767 ""                                
##  6          767 "LONDON:"                         
##  7          767 "THOMAS CAUTLEY NEWBY, PUBLISHER,"
##  8          767 "72, MORTIMER ST., CAVENDISH SQ." 
##  9          767 ""                                
## 10          767 "1847."                           
## # ℹ 80,079 more rows

tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

tidy_bronte %>%
  count(word, sort = TRUE)

## # A tibble: 23,303 × 2
##    word       n
##    <chr>  <int>
##  1 time    1064
##  2 miss     854
##  3 day      826
##  4 hand     767
##  5 eyes     713
##  6 don’t    666
##  7 night    648
##  8 heart    638
##  9 looked   601
## 10 door     591
## # ℹ 23,293 more rows

It is interesting that “time”, “eyes”, “hand”, and “door” are in the top 10 for both H.G. Wells and the Brontë sisters.

Compare words used by Jane Austen, the Brontë sisters, and H.G. Wells

Now, let’s calculate the frequency for each word for the works of Jane Austen, the Brontë sisters, and H.G. Wells by binding the data frames together.

frequency <-
  bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
            mutate(tidy_hgwells, author = "H.G. Wells"),
            mutate(tidy_books, author = "Jane Austen")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(author, proportion) 

frequency

## # A tibble: 30,375 × 4
##    word        `Brontë Sisters` `H.G. Wells` `Jane Austen`
##    <chr>                  <dbl>        <dbl>         <dbl>
##  1 a                 0.0000587     0.0000148    0.0000138 
##  2 a'n't            NA            NA            0.00000460
##  3 aback             0.00000391    0.0000148   NA         
##  4 abaht             0.00000391   NA           NA         
##  5 abandon           0.0000313     0.0000148    0.00000460
##  6 abandoned         0.0000900     0.000178    NA         
##  7 abandoning        0.00000391    0.0000445   NA         
##  8 abandonment       0.0000196     0.0000148   NA         
##  9 abart            NA             0.0000148   NA         
## 10 abase             0.00000391   NA           NA         
## # ℹ 30,365 more rows

We use str_extract() here because the UTF-8 encoded texts from Project Gutenberg have some examples of words with underscores around them to indicate emphasis (like italics).
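
Two steps of this pipeline are easy to check on a toy example: `str_extract()` keeps only the first run of lowercase letters and apostrophes, which drops the surrounding underscores, and the grouped `mutate()` turns counts into within-author proportions. The sketch below uses made-up authors and words; `toy` and `toy_freq` are names introduced here.

```r
library(dplyr)
library(stringr)
library(tibble)

# made-up tokens, two of them wrapped in underscores for emphasis
toy <- tibble(author = c("A", "A", "A", "B", "B"),
              word   = c("_any_", "any", "time", "time", "_time_"))

toy_freq <- toy %>%
  # "_any_" becomes "any"; plain words are unchanged
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  # share of each word among all words by the same author
  mutate(proportion = n / sum(n)) %>%
  ungroup()

toy_freq
```

Without the `str_extract()` step, "_any_" and "any" would be counted as two different words.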

Let’s compare the word frequencies of Jane Austen, the Brontë sisters, and H.G. Wells:

library(scales)

# correlate frequencies of words in `Brontë Sisters` and `Jane Austen` books
# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = `Brontë Sisters`, y = `Jane Austen`)) +
  geom_abline(color = "gray40", lty = 2) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  labs(y = "Jane Austen", x = "Brontë Sisters") +
  theme_bw()

# correlate frequencies of words in `H.G. Wells` and `Jane Austen` books
ggplot(frequency, aes(x = `H.G. Wells`, y = `Jane Austen`)) +
  geom_abline(color = "gray40", lty = 2) +
  #geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  labs(y = "Jane Austen", x = "H.G. Wells") +
  theme_bw()

Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Austen and the Brontë sisters, and between Austen and Wells?

# quantify correlation
cor.test(frequency$`Brontë Sisters`,  frequency$`Jane Austen`)

## 
##  Pearson's product-moment correlation
## 
## data:  frequency$`Brontë Sisters` and frequency$`Jane Austen`
## t = 50.92, df = 3480, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6339628 0.6720503
## sample estimates:
##       cor 
## 0.6534199

cor.test(frequency$`H.G. Wells`,  frequency$`Jane Austen`)

## 
##  Pearson's product-moment correlation
## 
## data:  frequency$`H.G. Wells` and frequency$`Jane Austen`
## t = 17.938, df = 2295, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3142812 0.3860345
## sample estimates:
##       cor 
## 0.3506724

Just as we saw in the plots, the word frequencies are more correlated between the Austen and Brontë novels than between Austen and H.G. Wells.
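
The degrees of freedom in these tests are far smaller than the number of rows in `frequency` because `cor.test()` only uses rows where both columns are non-missing, that is, words that occur in both sets of texts. A minimal sketch with made-up numbers:

```r
# made-up proportions with missing values on either side
x <- c(0.10, 0.20, NA,   0.40, 0.50, 0.70)
y <- c(0.15, NA,   0.30, 0.35, 0.60, 0.65)

# pairs with a missing value on either side are dropped before testing
n_complete <- sum(complete.cases(x, y))

result <- cor.test(x, y)

n_complete                 # number of usable pairs
unname(result$parameter)   # degrees of freedom = usable pairs - 2
```

Words used by only one of the two authors therefore do not enter the correlation at all.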

Play

find and download from Project Gutenberg Homer’s Iliad (RDS) and Odyssey (RDS)
compare words used in the two poems