Term frequency in Jane Austen’s novels

library(tidyverse)
library(tidytext)
library(janeaustenr)


# tokenize the novels into words and count occurrences per book
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

# total number of words in each book, joined back onto the counts
total_words <- book_words %>% 
  group_by(book) %>% 
  summarize(total = sum(n))

book_words <- 
  left_join(book_words, total_words, by = "book")

# term frequency distribution
ggplot(book_words, aes(n / total, fill = book)) +
  geom_histogram(show.legend = FALSE) +
  # You can leave one value as NA to compute from the range of the data.
  xlim(NA, 0.0009) +
  facet_wrap(~book, ncol = 2, scales = "free_y")

Zipf’s law

These plots exhibit similarly long-tailed distributions for all the novels: many words occur rarely, and only a few words occur very frequently.

Distributions like these are typical in language: a classic version of this relationship is called Zipf’s law, after George Zipf, a 20th-century American linguist.

Suppose you rank all words in decreasing order of appearance (from the most frequent to the rarest). The rank of a word is its position (1, 2, …) in this ordering.

Zipf’s law states that the frequency at which a word appears is inversely proportional to its rank.

The frequency \(p_k\) of the word at rank \(k\) is hence expressed by the following equation:

\[p_k = \alpha k^{-1}\] where \(\alpha\) is a constant.
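
For example, if we take \(\alpha \approx 0.039\) (roughly the observed frequency of the top-ranked word in Mansfield Park, as the table below shows), Zipf’s law predicts a frequency of \(0.039/k\) for the word at rank \(k\). A minimal sketch of these expected frequencies:

alpha <- 0.039  # assumed value, read off the Mansfield Park output below
tibble(rank = 1:5, expected_frequency = alpha / rank)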

Zipf’s law for Jane Austen’s novels

# within each book the words are already sorted by descending n,
# so row_number() gives each word's rank
freq_by_rank <- book_words %>% 
  group_by(book) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total)

freq_by_rank
## # A tibble: 40,379 x 6
## # Groups:   book [6]
##    book              word      n  total  rank `term frequency`
##    <fct>             <chr> <int>  <int> <int>            <dbl>
##  1 Mansfield Park    the    6206 160460     1           0.0387
##  2 Mansfield Park    to     5475 160460     2           0.0341
##  3 Mansfield Park    and    5438 160460     3           0.0339
##  4 Emma              to     5239 160996     1           0.0325
##  5 Emma              the    5201 160996     2           0.0323
##  6 Emma              and    4896 160996     3           0.0304
##  7 Mansfield Park    of     4778 160460     4           0.0298
##  8 Pride & Prejudice the    4331 122204     1           0.0354
##  9 Emma              of     4291 160996     4           0.0267
## 10 Pride & Prejudice to     4162 122204     2           0.0341
## # … with 40,369 more rows

Zipf’s law is often visualized by plotting rank on the x-axis and term frequency on the y-axis, on logarithmic scales.

Plotted this way, an inversely proportional relationship appears as a line with a constant, negative slope. Indeed, if \(p_k = \alpha k^{-1}\), then:

\[ \ln p_k = \ln \alpha - \ln k \]

Hence, plotting a Zipf distribution on a log-log scale will result in a straight line with slope \(-1\).

freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, group = book, color = book)) + 
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10()

The slope is not quite constant, though; perhaps we could view this as a broken power law with, say, three sections.

Let’s see what the exponent of the power law is for the middle section of the rank range (here, ranks 11 through 499, excluding the very most common words):

rank_subset <- freq_by_rank %>% 
  filter(rank < 500,
         rank > 10)

# fit a line to log10(term frequency) vs. log10(rank);
# the slope estimates the power-law exponent
mod <- lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)
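
If Zipf’s law held exactly, the fitted slope would be \(-1\). Inspecting the coefficients shows how close this corpus comes:

coef(mod)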

# overlay the fitted power law on the rank / frequency plot
freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = book)) + 
  geom_abline(intercept = coef(mod)[1], 
              slope = coef(mod)[2], 
              color = "gray50", 
              linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10()

We have found a result close to the classic version of Zipf’s law for the corpus of Jane Austen’s novels.

  • the deviations we see here at high rank are not uncommon for many kinds of language
  • a corpus of language often contains fewer rare words than predicted by a single power law
  • the deviations at low rank are more unusual. Jane Austen uses a lower percentage of the most common words than many collections of language

Play

  1. write a function to test Zipf’s law for the words in a given book (see the sketch after this list)
  2. use the function on Homer’s Iliad and Odyssey
  3. remove stop words and redo the test
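
A possible starting point for the first exercise (a minimal sketch: `tokens` is assumed to be a one-row-per-token data frame, as produced by unnest_tokens(), and the rank window mirrors the one used above):

zipf_exponent <- function(tokens) {
  freq_by_rank <- tokens %>%
    count(word, sort = TRUE) %>%
    mutate(rank = row_number(),
           term_frequency = n / sum(n))
  # fit the power law on the middle section of the rank range
  mod <- lm(log10(term_frequency) ~ log10(rank),
            data = filter(freq_by_rank, rank > 10, rank < 500))
  unname(coef(mod)[2])  # the fitted slope, i.e. the power-law exponent
}

For the Homer texts, gutenberg_download() from the gutenbergr package is one way to fetch the raw text, and anti_join(stop_words) removes stop words for the third step.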

Analyzing word and document frequency

  • a central question in text mining and natural language processing is how to quantify what a document in a collection (or corpus) is about
  • one measure of how important a word may be is its term frequency (tf), how frequently a word occurs in a document
  • there are words in a document, however, that occur many times but may not be important, because they are also very common in the other documents of the corpus
  • an approach to remove these words is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents

The tf.idf statistic

For a given term \(t\) and a document \(d\) in a collection of documents, we have:

\[ \mathrm{tf.idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) = \frac{f_{d,t}}{f_{d}} \cdot \ln \frac{n}{n_t} \]

where:

  • \(f_{d,t}\) is the number of times term \(t\) appears in document \(d\)
  • \(f_d\) is the number of words of document \(d\)
  • \(n\) is the number of documents in the collection
  • \(n_t\) is the number of documents in the collection containing the term \(t\)

The statistic tf.idf is intended to measure how important a word is to a document in a collection (or corpus) of documents.
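
As a sanity check, idf can be computed by hand directly from this definition (a sketch using the book_words table built earlier; tidytext’s bind_tf_idf, used below, does this work for us):

n_books <- n_distinct(book_words$book)    # n: number of documents (books)
book_words %>%
  group_by(word) %>%
  summarize(n_t = n_distinct(book),       # n_t: documents containing the term
            idf = log(n_books / n_t))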

Calculating tf-idf

  • the bind_tf_idf function in the tidytext package takes a tidy text dataset as input with one row per token (term), per document
  • one column (word here) contains the terms/tokens
  • one column contains the documents (book in this case)
  • the last necessary column contains the counts, how many times each document contains each term (n in this example)

book_words
## # A tibble: 40,379 x 4
##    book              word      n  total
##    <fct>             <chr> <int>  <int>
##  1 Mansfield Park    the    6206 160460
##  2 Mansfield Park    to     5475 160460
##  3 Mansfield Park    and    5438 160460
##  4 Emma              to     5239 160996
##  5 Emma              the    5201 160996
##  6 Emma              and    4896 160996
##  7 Mansfield Park    of     4778 160460
##  8 Pride & Prejudice the    4331 122204
##  9 Emma              of     4291 160996
## 10 Pride & Prejudice to     4162 122204
## # … with 40,369 more rows

book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words
## # A tibble: 40,379 x 7
##    book              word      n  total     tf   idf tf_idf
##    <fct>             <chr> <int>  <int>  <dbl> <dbl>  <dbl>
##  1 Mansfield Park    the    6206 160460 0.0387     0      0
##  2 Mansfield Park    to     5475 160460 0.0341     0      0
##  3 Mansfield Park    and    5438 160460 0.0339     0      0
##  4 Emma              to     5239 160996 0.0325     0      0
##  5 Emma              the    5201 160996 0.0323     0      0
##  6 Emma              and    4896 160996 0.0304     0      0
##  7 Mansfield Park    of     4778 160460 0.0298     0      0
##  8 Pride & Prejudice the    4331 122204 0.0354     0      0
##  9 Emma              of     4291 160996 0.0267     0      0
## 10 Pride & Prejudice to     4162 122204 0.0341     0      0
## # … with 40,369 more rows

Notice that idf, and thus tf-idf, is zero for these extremely common words. These are all words that appear in all six of Jane Austen’s novels, so their idf term is \(\ln(6/6) = \ln 1 = 0\).

Let’s look at terms with high tf-idf in Jane Austen’s works.

book_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))
## # A tibble: 40,379 x 6
##    book                word          n      tf   idf  tf_idf
##    <fct>               <chr>     <int>   <dbl> <dbl>   <dbl>
##  1 Sense & Sensibility elinor      623 0.00519  1.79 0.00931
##  2 Sense & Sensibility marianne    492 0.00410  1.79 0.00735
##  3 Mansfield Park      crawford    493 0.00307  1.79 0.00551
##  4 Pride & Prejudice   darcy       373 0.00305  1.79 0.00547
##  5 Persuasion          elliot      254 0.00304  1.79 0.00544
##  6 Emma                emma        786 0.00488  1.10 0.00536
##  7 Northanger Abbey    tilney      196 0.00252  1.79 0.00452
##  8 Emma                weston      389 0.00242  1.79 0.00433
##  9 Pride & Prejudice   bennet      294 0.00241  1.79 0.00431
## 10 Persuasion          wentworth   191 0.00228  1.79 0.00409
## # … with 40,369 more rows

Here we see all proper nouns, names that are in fact important in these novels. None of them occurs in all of the novels, and they are the important, characteristic words for each text within the corpus of Jane Austen’s novels.

Let’s look at a visualization for these high tf-idf words:

book_words %>%
  group_by(book) %>% 
  slice_max(tf_idf, n = 15) %>% 
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()
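
One optional refinement: reorder() computes a single global ordering of words, so the bars within each facet are not guaranteed to be perfectly sorted. tidytext’s reorder_within() and scale_x_reordered() order words within each book independently; a sketch:

book_words %>%
  group_by(book) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  # append the book name to each word so ordering is per-facet
  mutate(word = reorder_within(word, tf_idf, book)) %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +   # strips the appended book name from the labels
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()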

Still all proper nouns!

  • these words are, as measured by tf-idf, the most important to each novel, and most readers would likely agree
  • what measuring tf-idf has done here is show us that Jane Austen used similar language across her six novels
  • what distinguishes one novel from the rest within the collection of her works are the proper nouns, the names of people and places
  • this is the point of tf-idf: it identifies words that are important to one document within a collection of documents