We already introduced the tidy() method for turning model objects into tidy data frames. The tidytext package provides this method for extracting the per-topic-per-word probabilities, called \(\beta\) (“beta”), from a fitted LDA model.
library(tidyverse)
library(tidytext)
library(topicmodels)
# This is a collection of 2246 news articles from an American news agency, mostly published around 1988.
data("AssociatedPress")
# set a seed so that the output of the model is predictable
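# fit a two-topic LDA model (k = 2)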
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 20,946 × 3
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 aaron      1.69e-12
##  2     2 aaron      3.90e- 5
##  3     1 abandon    2.65e- 5
##  4     2 abandon    3.99e- 5
##  5     1 abandoned  1.39e- 4
##  6     2 abandoned  5.88e- 5
##  7     1 abandoning 2.45e-33
##  8     2 abandoning 2.34e- 5
##  9     1 abbott     2.13e- 6
## 10     2 abbott     2.97e- 5
## # ℹ 20,936 more rows
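This tidies the model into a one-topic-per-term-per-row format, which makes individual terms easy to inspect with filter(). As a minimal check (the betas are the same ones printed above), “aaron” is essentially absent from topic 1 but has a small probability under topic 2:
# look at a single term; its betas match the rows printed above
ap_topics %>%
  filter(term == "aaron")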
# a topic is a probability distribution over the vocabulary,
# so each topic's betas sum to 1
ap_topics %>%
  group_by(topic) %>%
  summarize(sum(beta))
## # A tibble: 2 × 2
##   topic `sum(beta)`
##   <int>       <dbl>
## 1     1        1.00
## 2     2        1.00
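Equivalently, each topic assigns a probability to every one of the 10,473 terms in the vocabulary (20,946 rows across 2 topics). As a minimal sketch, counting rows per topic confirms this:
# every vocabulary term appears once per topic (output not shown)
ap_topics %>%
  count(topic)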
# the 10 terms that are most common within each topic
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)
ap_top_terms
## # A tibble: 20 × 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     1 percent    0.00981
##  2     1 million    0.00684
##  3     1 new        0.00594
##  4     1 year       0.00575
##  5     1 billion    0.00427
##  6     1 last       0.00368
##  7     1 two        0.00360
##  8     1 company    0.00348
##  9     1 people     0.00345
## 10     1 market     0.00333
## 11     2 i          0.00705
## 12     2 president  0.00489
## 13     2 government 0.00452
## 14     2 people     0.00407
## 15     2 soviet     0.00372
## 16     2 new        0.00370
## 17     2 bush       0.00370
## 18     2 two        0.00361
## 19     2 years      0.00339
## 20     2 states     0.00320
ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered()
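Here reorder_within() appends the topic to each term behind the scenes so that terms can be ordered independently within each facet, and scale_y_reordered() strips that suffix back off the axis labels.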

- this visualization lets us understand the two topics that were extracted from the articles
- the most common words in topic 1 include “percent”, “million”, “billion”, and “company”, suggesting it may represent business or financial news
- the most common words in topic 2 include “president”, “government”, and “soviet”, suggesting it represents political news
- one important observation is that some words, such as “new” and “people”, are common within both topics (see the quick check after this list)
- this is an advantage of topic modeling over “hard clustering” methods: topics used in natural language can overlap in the words they use
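As a quick check of that overlap (a minimal sketch; the betas for these terms also appear in the top-terms table above):
# "new" and "people" carry substantial probability under both topics
ap_topics %>%
  filter(term %in% c("new", "people")) %>%
  arrange(term, topic)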