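freq_by_rank <- book_words %>%
  group_by(book) %>%
  # assumes book_words is sorted by descending n (as in the output below),
  # so the row number within each book is the word's rank
  mutate(rank = row_number(),
         `term frequency` = n / total)
freq_by_rank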
## # A tibble: 40,379 x 6
## # Groups: book [6]
## book word n total rank `term frequency`
## <fct> <chr> <int> <int> <int> <dbl>
## 1 Mansfield Park the 6206 160460 1 0.0387
## 2 Mansfield Park to 5475 160460 2 0.0341
## 3 Mansfield Park and 5438 160460 3 0.0339
## 4 Emma to 5239 160996 1 0.0325
## 5 Emma the 5201 160996 2 0.0323
## 6 Emma and 4896 160996 3 0.0304
## 7 Mansfield Park of 4778 160460 4 0.0298
## 8 Pride & Prejudice the 4331 122204 1 0.0354
## 9 Emma of 4291 160996 4 0.0267
## 10 Pride & Prejudice to 4162 122204 2 0.0341
## # … with 40,369 more rows
Zipf’s law is often visualized by plotting rank on the x-axis and term frequency on the y-axis, on logarithmic scales.
Plotting this way, an inversely proportional relationship will have a constant, negative slope. Indeed, if \(p_k = \alpha k^{-1}\), then:
\[
\ln p_k = \ln \alpha - \ln k
\]
Hence, plotting a Zipf distribution on a log-log scale results in a straight line with a slope of -1.
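As a quick sanity check of this claim, the sketch below builds an ideal Zipf distribution and fits a straight line in log-log space. The value of `alpha` and the number of ranks are arbitrary choices for illustration and are not taken from the Austen data.

```r
# Ideal Zipf distribution: p_k = alpha / k
# (alpha and the rank range are illustrative assumptions)
alpha <- 0.1
k <- 1:1000
p <- alpha / k

# Fit a straight line on log-log axes
zipf_fit <- lm(log10(p) ~ log10(k))

# First element is the intercept, second is the slope
zipf_coef <- unname(coef(zipf_fit))
print(zipf_coef)
```

Up to floating point error, the slope should come out as -1 and the intercept as `log10(alpha)`. The base of the logarithm does not affect the slope, which is why the derivation can use the natural log while the plots use `log10`.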
freq_by_rank %>%
  ggplot(aes(rank, `term frequency`, group = book, color = book)) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()

The slope is not quite constant, though; perhaps we could view this as a broken power law with, say, three sections.
Let’s see what the exponent of the power law is for the initial section of the rank range:
# keep only the initial section of the rank range
rank_subset <- freq_by_rank %>%
  filter(rank < 500,
         rank > 10)
# linear fit in log-log space; the slope estimates the power-law exponent
mod <- lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)
freq_by_rank %>%
  ggplot(aes(rank, `term frequency`, color = book)) +
  # overlay the fitted line as a dashed gray reference
  geom_abline(intercept = mod$coefficients[1],
              slope = mod$coefficients[2],
              color = "gray50",
              linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()

We have found a result close to the classic version of Zipf’s law for the corpus of Jane Austen’s novels.
- The deviations we see here at high rank are not uncommon for many kinds of language.
- A corpus of language often contains fewer rare words than a single power law predicts.
- The deviations at low rank are more unusual: Jane Austen uses a lower percentage of the most common words than many collections of language.
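To put numbers on a broken power law rather than judging it from the plot, one option is to fit the exponent separately on each rank range. The sketch below does this with a hypothetical helper, `fit_power_law()`, applied to synthetic two-segment data. The helper name, the break point at rank 100, and the exponents -0.8 and -1.2 are all assumptions made for illustration, not values from the Austen corpus.

```r
# Hypothetical helper: fit log10(freq) ~ log10(rank) on the open interval (lo, hi)
fit_power_law <- function(rank, freq, lo, hi) {
  keep <- rank > lo & rank < hi
  fit <- lm(log10(freq[keep]) ~ log10(rank[keep]))
  setNames(coef(fit), c("intercept", "slope"))
}

# Synthetic stand-in data: exponent -0.8 up to rank 100, then -1.2,
# with the second constant chosen so the curve is continuous at the break
rank <- 1:5000
freq <- ifelse(rank <= 100,
               0.05 * rank^(-0.8),
               0.05 * 100^0.4 * rank^(-1.2))

head_fit <- fit_power_law(rank, freq, lo = 10, hi = 100)
tail_fit <- fit_power_law(rank, freq, lo = 100, hi = 5000)
print(rbind(head = head_fit, tail = tail_fit))

# With the data from this section, the pooled fit over ranks 10 to 500 would be:
# fit_power_law(freq_by_rank$rank, freq_by_rank[["term frequency"]], lo = 10, hi = 500)
```

On the synthetic data each fit recovers the exponent of its own segment. On real text the segments are not exact power laws, so the estimated slopes depend on where the rank boundaries are placed.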