n-grams

We’ll explore some of the methods tidytext offers for calculating and visualizing relationships between words in your text dataset.

An n-gram is a consecutive sequence of n words. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.

library(tidyverse)
library(tidytext)
library(janeaustenr)


austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  filter(!is.na(bigram))

austen_bigrams
## # A tibble: 662,783 × 2
##    book                bigram         
##    <fct>               <chr>          
##  1 Sense & Sensibility sense and      
##  2 Sense & Sensibility and sensibility
##  3 Sense & Sensibility by jane        
##  4 Sense & Sensibility jane austen    
##  5 Sense & Sensibility chapter 1      
##  6 Sense & Sensibility the family     
##  7 Sense & Sensibility family of      
##  8 Sense & Sensibility of dashwood    
##  9 Sense & Sensibility dashwood had   
## 10 Sense & Sensibility had long       
## # … with 662,773 more rows
# count bigrams
austen_bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 193,209 × 2
##    bigram       n
##    <chr>    <int>
##  1 of the    2853
##  2 to be     2670
##  3 in the    2221
##  4 it was    1691
##  5 i am      1485
##  6 she had   1405
##  7 of her    1363
##  8 to the    1315
##  9 she was   1309
## 10 had been  1206
## # … with 193,199 more rows
# remove stop words
bigrams_separated <- austen_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts
## # A tibble: 28,974 × 3
##    word1   word2         n
##    <chr>   <chr>     <int>
##  1 sir     thomas      266
##  2 miss    crawford    196
##  3 captain wentworth   143
##  4 miss    woodhouse   143
##  5 frank   churchill   114
##  6 lady    russell     110
##  7 sir     walter      108
##  8 lady    bertram     101
##  9 miss    fairfax      98
## 10 colonel brandon      96
## # … with 28,964 more rows
# back to bigrams
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

bigrams_united
## # A tibble: 38,913 × 2
##    book                bigram                  
##    <fct>               <chr>                   
##  1 Sense & Sensibility jane austen             
##  2 Sense & Sensibility chapter 1               
##  3 Sense & Sensibility norland park            
##  4 Sense & Sensibility surrounding acquaintance
##  5 Sense & Sensibility late owner              
##  6 Sense & Sensibility advanced age            
##  7 Sense & Sensibility constant companion      
##  8 Sense & Sensibility happened ten            
##  9 Sense & Sensibility henry dashwood          
## 10 Sense & Sensibility norland estate          
## # … with 38,903 more rows
# count 3-grams
austen_books() %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>% 
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)
## # A tibble: 6,140 × 4
##    word1     word2     word3         n
##    <chr>     <chr>     <chr>     <int>
##  1 dear      miss      woodhouse    20
##  2 miss      de        bourgh       17
##  3 lady      catherine de           11
##  4 poor      miss      taylor       11
##  5 sir       walter    elliot       10
##  6 catherine de        bourgh        9
##  7 dear      sir       thomas        8
##  8 replied   miss      crawford      7
##  9 sir       william   lucas         7
## 10 ten       thousand  pounds        7
## # … with 6,130 more rows

tf-idf of bigrams

A bigram can also be treated as a term in a document in the same way that we treated individual words. For example, we can look at the tf-idf of bigrams across Austen novels.

bigram_tf_idf <- bigrams_united %>%
  count(book, bigram) %>%
  bind_tf_idf(bigram, book, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
## # A tibble: 31,391 × 6
##    book                bigram                n     tf   idf tf_idf
##    <fct>               <chr>             <int>  <dbl> <dbl>  <dbl>
##  1 Mansfield Park      sir thomas          266 0.0304  1.79 0.0545
##  2 Persuasion          captain wentworth   143 0.0290  1.79 0.0519
##  3 Mansfield Park      miss crawford       196 0.0224  1.79 0.0402
##  4 Persuasion          lady russell        110 0.0223  1.79 0.0399
##  5 Persuasion          sir walter          108 0.0219  1.79 0.0392
##  6 Emma                miss woodhouse      143 0.0173  1.79 0.0309
##  7 Northanger Abbey    miss tilney          74 0.0165  1.79 0.0295
##  8 Sense & Sensibility colonel brandon      96 0.0155  1.79 0.0278
##  9 Sense & Sensibility sir john             94 0.0152  1.79 0.0273
## 10 Pride & Prejudice   lady catherine       87 0.0139  1.79 0.0248
## # … with 31,381 more rows
bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  group_by(book) %>%
  top_n(12, tf_idf) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, tf_idf)) %>%
  ggplot(aes(bigram, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free") +
  coord_flip() +
  labs(y = "tf-idf of bigram to novel",
       x = "")

  • there are advantages and disadvantages to examining the tf-idf of bigrams rather than individual words
  • pairs of consecutive words might capture structure that isn’t present when one is just counting single words, and may provide context that makes tokens more understandable
  • however, the per-bigram counts are also sparser: a typical two-word pair is rarer than either of its component words (the quick check after this list illustrates this)
  • thus, bigrams can be especially useful when you have a very large text dataset
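As a rough check of that sparsity (this snippet is not from the original analysis; it reuses the austen_bigrams object built above), we can compare the number of distinct single words with the number of distinct bigrams:

# distinct single words in the novels
austen_books() %>%
  unnest_tokens(word, text) %>%
  distinct(word) %>%
  nrow()

# distinct bigrams: far more of them, so each one is counted far fewer times
austen_bigrams %>%
  distinct(bigram) %>%
  nrow()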

Bigrams to provide context in sentiment analysis

  • we can use bigrams to provide context in sentiment analysis
  • our sentiment analysis approach above simply counted the appearance of positive or negative words, according to a reference lexicon
  • one of the problems with this approach is that a word’s context can matter nearly as much as its presence
  • for example, the words “happy” and “like” will be counted as positive, even in a sentence like “I’m not happy and I don’t like it!”
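A minimal sketch of that problem (not part of the original analysis): tokenizing the sentence into single words and joining to the AFINN lexicon scores “happy” and “like” as positive, with no awareness of the “not” and “don’t” that precede them.

# unigram sentiment scoring of a negated sentence: the negations are lost
tibble(text = "I'm not happy and I don't like it!") %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word")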

Now that we have the data organized into bigrams, it’s easy to tell how often words are preceded by a word like “not”:

bigrams_separated %>%
  filter(word1 == "not") %>%
  count(word1, word2, sort = TRUE)
## # A tibble: 1,178 × 3
##    word1 word2     n
##    <chr> <chr> <int>
##  1 not   be      580
##  2 not   to      335
##  3 not   have    307
##  4 not   know    237
##  5 not   a       184
##  6 not   think   162
##  7 not   been    151
##  8 not   the     135
##  9 not   at      126
## 10 not   in      110
## # … with 1,168 more rows
AFINN <- get_sentiments("afinn")

not_words <- bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(AFINN, by = c("word2" = "word")) %>%
  count(word2, value, sort = TRUE)

not_words
## # A tibble: 229 × 3
##    word2   value     n
##    <chr>   <dbl> <int>
##  1 like        2    95
##  2 help        2    77
##  3 want        1    41
##  4 wish        1    39
##  5 allow       1    30
##  6 care        2    21
##  7 sorry      -1    20
##  8 leave      -1    17
##  9 pretend    -1    17
## 10 worth       2    17
## # … with 219 more rows

It’s worth asking which words contributed the most in the “wrong” direction. To compute that, we can multiply each word’s AFINN value by the number of times it appears after “not”.

not_words %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by \"not\"") +
  ylab("Sentiment value * number of occurrences") +
  coord_flip()

Bigrams as graphs

We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time.

As one common visualization, we can arrange the words into a network, or graph.

library(igraph)
library(tidygraph)
library(ggraph)


# original counts
bigram_counts
## # A tibble: 28,974 × 3
##    word1   word2         n
##    <chr>   <chr>     <int>
##  1 sir     thomas      266
##  2 miss    crawford    196
##  3 captain wentworth   143
##  4 miss    woodhouse   143
##  5 frank   churchill   114
##  6 lady    russell     110
##  7 sir     walter      108
##  8 lady    bertram     101
##  9 miss    fairfax      98
## 10 colonel brandon      96
## # … with 28,964 more rows
# create the graph
bigram_graph <- bigram_counts %>%
  filter(n > 20) %>% 
  as_tbl_graph()

bigram_graph
## # A tbl_graph: 85 nodes and 70 edges
## #
## # A directed acyclic simple graph with 17 components
## #
## # Node Data: 85 × 1 (active)
##   name   
##   <chr>  
## 1 sir    
## 2 miss   
## 3 captain
## 4 frank  
## 5 lady   
## 6 colonel
## # … with 79 more rows
## #
## # Edge Data: 70 × 3
##    from    to     n
##   <int> <int> <int>
## 1     1    27   266
## 2     2    28   196
## 3     3    29   143
## # … with 67 more rows
# arrows
a <- grid::arrow(type = "closed", length = unit(.1, "inches"))

# plot the graph
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 1) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

  • note that this is a visualization of a Markov chain, a common model in text processing
  • In a Markov chain, each choice of word depends only on the previous word
  • in this case, a random generator following this model might spit out “dear”, then “sir”, then “william/walter/thomas/thomas’s”, by following each word to the most common words that follow it
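A minimal sketch of such a generator, reusing the bigram_counts object from above (the function generate_phrase and its arguments are invented for illustration, not part of the original code):

generate_phrase <- function(start, n_steps = 3) {
  out <- start
  current <- start
  for (i in seq_len(n_steps)) {
    # candidate next words, weighted by how often they follow the current word
    nxt <- bigram_counts %>% filter(word1 == current)
    if (nrow(nxt) == 0) break
    current <- sample(nxt$word2, size = 1, prob = nxt$n)
    out <- c(out, current)
  }
  paste(out, collapse = " ")
}

set.seed(123)
generate_phrase("dear")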

Co-occurrence of words

  • tokenizing by n-gram is a useful way to explore pairs of adjacent words
  • however, we may also be interested in words that tend to co-occur within particular documents or particular chapters, even if they don’t occur next to each other
  • the widyr package makes operations such as computing counts and correlations easy by simplifying the pattern of widening the data, performing an operation, and then re-tidying it
  • the widyr package:
    1. first casts a tidy dataset into a wide matrix
    2. performs an operation such as a correlation on it
    3. then re-tidies the result
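A toy sketch of that widen, operate, re-tidy pattern done by hand with tidyr (the data and column names are made up; pairwise_cor() below performs the equivalent steps in one call):

toy <- tibble(section = c(1, 1, 2, 2, 3),
              word    = c("austen", "novel", "austen", "emma", "novel"))

toy %>%
  mutate(present = 1) %>%
  # 1. cast into a wide section-by-word matrix
  pivot_wider(names_from = word, values_from = present, values_fill = 0) %>%
  select(-section) %>%
  # 2. perform an operation (here, a correlation) on the matrix
  cor() %>%
  # 3. re-tidy the result into one row per pair of words
  as_tibble(rownames = "item1") %>%
  pivot_longer(-item1, names_to = "item2", values_to = "correlation")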

Co-occurrence of words: the widyr package

library(widyr)

# separate words in sections (10 lines)
austen_section_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!(word %in% stop_words$word))

austen_section_words
## # A tibble: 37,240 × 3
##    book              section word        
##    <fct>               <dbl> <chr>       
##  1 Pride & Prejudice       1 truth       
##  2 Pride & Prejudice       1 universally 
##  3 Pride & Prejudice       1 acknowledged
##  4 Pride & Prejudice       1 single      
##  5 Pride & Prejudice       1 possession  
##  6 Pride & Prejudice       1 fortune     
##  7 Pride & Prejudice       1 wife        
##  8 Pride & Prejudice       1 feelings    
##  9 Pride & Prejudice       1 views       
## 10 Pride & Prejudice       1 entering    
## # … with 37,230 more rows
# count words co-occurring within sections
word_pairs <- austen_section_words %>%
  pairwise_count(word, section, sort = TRUE)

word_pairs
## # A tibble: 796,008 × 3
##    item1     item2         n
##    <chr>     <chr>     <dbl>
##  1 darcy     elizabeth   144
##  2 elizabeth darcy       144
##  3 miss      elizabeth   110
##  4 elizabeth miss        110
##  5 elizabeth jane        106
##  6 jane      elizabeth   106
##  7 miss      darcy        92
##  8 darcy     miss         92
##  9 elizabeth bingley      91
## 10 bingley   elizabeth    91
## # … with 795,998 more rows

Co-occurrence of words: correlation coefficient

Pairs like “Elizabeth” and “Darcy” are the most common co-occurring words, but that’s not particularly meaningful since they’re also the most common individual words.

We may instead want to examine correlation among words, which indicates how often they appear together relative to how often they appear separately.

In particular, here we’ll focus on the phi coefficient, a common measure for binary correlation.

The focus of the phi coefficient is how much more likely it is that either both word X and Y appear, or neither do, than that one appears without the other.

Consider the following table:

            Has word Y       No word Y        Total
Has word X  \(n_{11}\)       \(n_{10}\)       \(n_{1\cdot}\)
No word X   \(n_{01}\)       \(n_{00}\)       \(n_{0\cdot}\)
Total       \(n_{\cdot 1}\)  \(n_{\cdot 0}\)  \(n\)

In terms of this table, the phi coefficient is:

\[\phi=\frac{n_{11}n_{00}-n_{10}n_{01}}{\sqrt{n_{1\cdot}n_{0\cdot}n_{\cdot0}n_{\cdot1}}}\]

Introduced by Karl Pearson, this measure is similar to the Pearson correlation coefficient in its interpretation.
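As a quick sanity check of the formula, here is a small helper (hypothetical, not part of widyr) that computes phi directly from the four cell counts in the table above:

phi_coefficient <- function(n11, n10, n01, n00) {
  n1_ <- n11 + n10   # sections containing word X
  n0_ <- n01 + n00   # sections without word X
  n_1 <- n11 + n01   # sections containing word Y
  n_0 <- n10 + n00   # sections without word Y
  (n11 * n00 - n10 * n01) / sqrt(n1_ * n0_ * n_1 * n_0)
}

phi_coefficient(n11 = 10, n10 = 5, n01 = 5, n00 = 80)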

# we need to filter for at least relatively common words first
word_cors <- austen_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)

word_cors %>%
  filter(item1 == "pounds")
## # A tibble: 393 × 3
##    item1  item2     correlation
##    <chr>  <chr>           <dbl>
##  1 pounds thousand       0.701 
##  2 pounds ten            0.231 
##  3 pounds fortune        0.164 
##  4 pounds settled        0.149 
##  5 pounds wickham's      0.142 
##  6 pounds children       0.129 
##  7 pounds mother's       0.119 
##  8 pounds believed       0.0932
##  9 pounds estate         0.0890
## 10 pounds ready          0.0860
## # … with 383 more rows

Just as we used ggraph to visualize bigrams, we can use it to visualize the correlations and clusters of words that were found by the widyr package.

word_cors %>%
  filter(correlation > .15) %>%
  as_tbl_graph() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 1) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
