Topic modeling

  • in text mining, we often have collections of documents, such as blog posts or news articles, that we’d like to divide into natural groups so that we can understand them separately
  • topic modeling is a method for unsupervised classification of such documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for

Latent Dirichlet allocation

  • Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model
  • it treats each document as a mixture of topics, and each topic as a mixture of words
  • this allows documents to overlap each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language

Latent Dirichlet allocation

Without diving into the math behind the model, we can understand it as being guided by two principles:

  • Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B”
  • Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally
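
As a toy sketch of these two principles (the numbers below are invented for illustration, not taken from any fitted model), we can write both kinds of mixtures as tidy data frames; this is also the shape in which we will extract them from a fitted model below:

library(tibble)

# invented document-topic proportions (the example above):
# each document's proportions sum to 1
doc_topics <- tribble(
  ~document, ~topic, ~proportion,
  1,         "A",    0.9,
  1,         "B",    0.1,
  2,         "A",    0.3,
  2,         "B",    0.7
)

# invented topic-word probabilities, showing only each topic's most
# common words; note that "budget" appears in both topics
topic_words <- tribble(
  ~topic,          ~term,        ~probability,
  "politics",      "president",  0.020,
  "politics",      "congress",   0.015,
  "politics",      "budget",     0.010,
  "entertainment", "movies",     0.018,
  "entertainment", "television", 0.013,
  "entertainment", "budget",     0.010
)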

Word-topic probabilities

We already introduced the tidy() method for tidying model objects. The tidytext package provides this method for extracting the per-topic-per-word probabilities, called \(\beta\) (“beta”), from the model.

library(tidyverse)
library(tidytext)
library(topicmodels)

# This is a collection of 2246 news articles from an American news agency, mostly published around 1988.
data("AssociatedPress")

# set a seed so that the output of the model is predictable
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))

ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 20,946 × 3
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 aaron      1.69e-12
##  2     2 aaron      3.90e- 5
##  3     1 abandon    2.65e- 5
##  4     2 abandon    3.99e- 5
##  5     1 abandoned  1.39e- 4
##  6     2 abandoned  5.88e- 5
##  7     1 abandoning 2.45e-33
##  8     2 abandoning 2.34e- 5
##  9     1 abbott     2.13e- 6
## 10     2 abbott     2.97e- 5
## # ℹ 20,936 more rows
# each topic is a probability distribution over words: its betas sum to 1
ap_topics %>% 
  group_by(topic) %>% 
  summarize(sum(beta))
## # A tibble: 2 × 2
##   topic `sum(beta)`
##   <int>       <dbl>
## 1     1        1.00
## 2     2        1.00
# the 10 terms that are most common within each topic
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

ap_top_terms
## # A tibble: 20 × 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     1 percent    0.00981
##  2     1 million    0.00684
##  3     1 new        0.00594
##  4     1 year       0.00575
##  5     1 billion    0.00427
##  6     1 last       0.00368
##  7     1 two        0.00360
##  8     1 company    0.00348
##  9     1 people     0.00345
## 10     1 market     0.00333
## 11     2 i          0.00705
## 12     2 president  0.00489
## 13     2 government 0.00452
## 14     2 people     0.00407
## 15     2 soviet     0.00372
## 16     2 new        0.00370
## 17     2 bush       0.00370
## 18     2 two        0.00361
## 19     2 years      0.00339
## 20     2 states     0.00320
ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()

  • this visualization lets us understand the two topics that were extracted from the articles
  • the most common words in topic 1 include “percent”, “million”, “billion”, and “company”, which suggests it may represent business or financial news
  • those most common in topic 2 include “president”, “government”, and “soviet”, suggesting that this topic represents political news
  • one important observation about the words in each topic is that some words, such as “new” and “people”, are common within both topics
  • this is an advantage of topic modeling as opposed to “hard clustering” methods: topics used in natural language could have some overlap in terms of words
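
As a quick check of this overlap, here is a small sketch using the ap_top_terms table computed above to list the terms that rank in the top ten of both topics:

# terms that appear in the top 10 of both topics
ap_top_terms %>%
  count(term) %>%
  filter(n > 1)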

Word-topic probabilities

As an alternative, we could consider the terms that had the greatest difference in \(\beta\) between topic 1 and topic 2.

This can be quantified by the log ratio of the two:

\[\log_2\frac{\beta_2}{\beta_1}\]

A log ratio is useful because it makes the difference symmetrical:

  • \(\beta_2\) being twice as large leads to a log ratio of 1
  • \(\beta_1\) being twice as large results in -1
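
For instance, with two made-up beta values:

log2(0.0002 / 0.0001)  # beta_2 twice as large: log ratio = 1
## [1] 1
log2(0.0001 / 0.0002)  # beta_1 twice as large: log ratio = -1
## [1] -1
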
beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  # filter for relatively common words
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread
## # A tibble: 198 × 4
##    term              topic1      topic2 log_ratio
##    <chr>              <dbl>       <dbl>     <dbl>
##  1 administration 0.000431  0.00138         1.68 
##  2 ago            0.00107   0.000842       -0.339
##  3 agreement      0.000671  0.00104         0.630
##  4 aid            0.0000476 0.00105         4.46 
##  5 air            0.00214   0.000297       -2.85 
##  6 american       0.00203   0.00168        -0.270
##  7 analysts       0.00109   0.000000578   -10.9  
##  8 area           0.00137   0.000231       -2.57 
##  9 army           0.000262  0.00105         2.00 
## 10 asked          0.000189  0.00156         3.05 
## # ℹ 188 more rows
beta_spread %>%
  group_by(direction = log_ratio > 0) %>%
  slice_max(abs(log_ratio), n = 10) %>%
  ungroup() %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(term, log_ratio, fill = direction)) +
  geom_col(show.legend = FALSE) +
  labs(y = "Log2 ratio of beta in topic 2 / topic 1") +
  coord_flip()

  • we can see that the words more common in topic 2 (but not in topic 1) include political parties such as “democratic” and “republican”, as well as politicians’ names such as “dukakis” and “gorbachev”
  • topic 1 was more characterized by currencies like “yen” and “dollar”, as well as financial terms such as “index”, “prices” and “rates”
  • this helps confirm that the two topics the algorithm identified were political and financial news
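
To double-check any single term, we can look up its per-topic probabilities directly; a small sketch (the choice of terms is ours):

# per-topic probabilities for a few characteristic terms
ap_topics %>%
  filter(term %in% c("yen", "dollar", "dukakis", "gorbachev")) %>%
  arrange(term, topic)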

Document-topic probabilities

Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics.

We can examine the per-document-per-topic probabilities, called \(\gamma\) (“gamma”), with the matrix = "gamma" argument to tidy().

ap_documents <- tidy(ap_lda, matrix = "gamma")
ap_documents
## # A tibble: 4,492 × 3
##    document topic    gamma
##       <int> <int>    <dbl>
##  1        1     1 0.248   
##  2        2     1 0.362   
##  3        3     1 0.527   
##  4        4     1 0.357   
##  5        5     1 0.181   
##  6        6     1 0.000588
##  7        7     1 0.773   
##  8        8     1 0.00445 
##  9        9     1 0.967   
## 10       10     1 0.147   
## # ℹ 4,482 more rows
# each document is a probability distribution over topics: its gammas sum to 1
ap_documents %>% 
  group_by(document) %>% 
  summarize(sum(gamma))
## # A tibble: 2,246 × 2
##    document `sum(gamma)`
##       <int>        <dbl>
##  1        1            1
##  2        2            1
##  3        3            1
##  4        4            1
##  5        5            1
##  6        6            1
##  7        7            1
##  8        8            1
##  9        9            1
## 10       10            1
## # ℹ 2,236 more rows
# the topic that is most associated with each document
ap_documents %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() 
## # A tibble: 2,246 × 3
##    document topic gamma
##       <int> <int> <dbl>
##  1        3     1 0.527
##  2        7     1 0.773
##  3        9     1 0.967
##  4       11     1 0.995
##  5       15     1 0.999
##  6       16     1 0.761
##  7       19     1 0.999
##  8       20     1 0.999
##  9       21     1 0.910
## 10       22     1 0.977
## # ℹ 2,236 more rows

Document-topic probabilities

ap_documents
## # A tibble: 4,492 × 3
##    document topic    gamma
##       <int> <int>    <dbl>
##  1        1     1 0.248   
##  2        2     1 0.362   
##  3        3     1 0.527   
##  4        4     1 0.357   
##  5        5     1 0.181   
##  6        6     1 0.000588
##  7        7     1 0.773   
##  8        8     1 0.00445 
##  9        9     1 0.967   
## 10       10     1 0.147   
## # ℹ 4,482 more rows

We can see that many of these documents were drawn from a mix of the two topics, but that document 6 was drawn almost entirely from topic 2, having a \(\gamma\) from topic 1 close to zero.
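
We can confirm this by pulling both of document 6's topic probabilities:

# both gamma values for document 6
ap_documents %>%
  filter(document == 6)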

To check this answer, we could tidy the document-term matrix and check what the most common words in that document were:

tidy(AssociatedPress) %>%
  filter(document == 6) %>%
  arrange(desc(count))
## # A tibble: 287 × 3
##    document term           count
##       <int> <chr>          <dbl>
##  1        6 noriega           16
##  2        6 panama            12
##  3        6 jackson            6
##  4        6 powell             6
##  5        6 administration     5
##  6        6 economic           5
##  7        6 general            5
##  8        6 i                  5
##  9        6 panamanian         5
## 10        6 american           4
## # ℹ 277 more rows

Based on the most common words, this appears to be an article about the relationship between the American government and Panamanian dictator Manuel Noriega, which means the algorithm was right to place it in topic 2 (as political/national news).
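
As a further check, tidytext also provides an augment() method for LDA models, which attaches the topic to which each term's count was assigned; a short sketch tallying document 6's word-level assignments:

# add a .topic column giving each term's assigned topic
assignments <- augment(ap_lda, data = AssociatedPress)

# total word count in document 6 assigned to each topic
assignments %>%
  filter(document == 6) %>%
  count(.topic, wt = count)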

Play

  1. download Homer’s Odyssey from Project Gutenberg (RDS)
  2. treat each book of the Odyssey as a document
  3. run LDA with two topics and give an interpretation of the outcome
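
A possible starting sketch; the Gutenberg ID (1727, assumed here to be Samuel Butler's translation) and the "BOOK I."-style headings used to split the books are assumptions, so adapt as needed if you use the provided RDS file instead:

library(tidyverse)
library(tidytext)
library(topicmodels)
library(gutenbergr)  # assumption: downloading directly rather than using the RDS

# download the Odyssey (ID 1727 assumed to be Butler's translation)
odyssey <- gutenberg_download(1727) %>%
  # assumption: each book opens with a heading like "BOOK I."
  mutate(book = cumsum(str_detect(text, "^BOOK [IVXLC]+"))) %>%
  filter(book > 0)

# one document per book: tokenize, drop stop words, cast to a DTM
odyssey_dtm <- odyssey %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word) %>%
  cast_dtm(book, word, n)

odyssey_lda <- LDA(odyssey_dtm, k = 2, control = list(seed = 1234))

# inspect the top terms in each topic to interpret the outcome
tidy(odyssey_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)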