The RDS file (downloaded in May 2018) contains articles relevant to nine major technology stocks: Microsoft, Apple, Google, Amazon, Facebook, Twitter, IBM, Yahoo, and Netflix.
You are interested in using news to analyze the market and make investment decisions. Use sentiment analysis to determine whether the news coverage was positive or negative.
Use the Loughran and McDonald dictionary of financial sentiment terms as the lexicon. This dictionary was developed from analyses of financial reports, and it intentionally avoids words like “share” and “fool”, as well as subtler terms like “liability” and “risk”, that may not have a negative meaning in a financial context. The Loughran data divides words into six sentiments: “positive”, “negative”, “litigious”, “uncertainty”, “constraining”, and “superfluous”. Use the command get_sentiments("loughran") to get the lexicon.
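To make the lexicon lookup concrete, here is a minimal sketch of how an inner join tags tokens with a sentiment. The `toy_lexicon` and `toy_tokens` tables are made-up stand-ins that only mimic the `word` and `sentiment` columns of the real lexicon; their entries are assumptions chosen for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the lexicon: one row per (word, sentiment) pair.
toy_lexicon <- tibble(
  word      = c("loss", "gain", "litigation", "may"),
  sentiment = c("negative", "positive", "litigious", "uncertainty")
)

# Hypothetical tokens, one word per row.
toy_tokens <- tibble(word = c("the", "loss", "may", "offset", "the", "gain"))

# inner_join keeps only the tokens that appear in the lexicon,
# so words without a sentiment label ("the", "offset") drop out.
tagged <- inner_join(toy_tokens, toy_lexicon, by = "word")
tagged
```

Because unmatched words are discarded, every count computed after the join refers to lexicon words only, not to all words in the articles.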
library(tidytext)
library(tidyverse)
library(quanteda)
library(tm)
stock_articles <- read_rds("stock_articles.rds")
stock_articles
## # A tibble: 9 × 3
## company symbol corpus
## <chr> <chr> <list>
## 1 Microsoft MSFT <WebCorps>
## 2 Apple AAPL <WebCorps>
## 3 Google GOOG <WebCorps>
## 4 Amazon AMZN <WebCorps>
## 5 Facebook FB <WebCorps>
## 6 Twitter TWTR <WebCorps>
## 7 IBM IBM <WebCorps>
## 8 Yahoo YHOO <WebCorps>
## 9 Netflix NFLX <WebCorps>
# extract tokens
stock_tokens <- stock_articles %>%
  mutate(corpus = map(corpus, tidy)) %>%
  unnest(corpus) %>%
  unnest_tokens(word, text) %>%
  select(company, word)
stock_tokens
## # A tibble: 54,125 × 2
## company word
## <chr> <chr>
## 1 Microsoft share
## 2 Microsoft file
## 3 Microsoft this
## 4 Microsoft nov
## 5 Microsoft 29
## 6 Microsoft 2017
## 7 Microsoft file
## 8 Microsoft photo
## 9 Microsoft shows
## 10 Microsoft microsoft
## # … with 54,115 more rows
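The tokenization step can be tried in isolation. The sketch below runs unnest_tokens() on a single made-up sentence (the `toy` table and its text are hypothetical, not taken from the articles) to show that each word becomes its own row, lowercased and stripped of punctuation.

```r
library(tibble)
library(tidytext)

# Hypothetical one-row input: a company name and a short piece of text.
toy <- tibble(
  company = "Microsoft",
  text    = "This Nov. 29, 2017, file photo shows Microsoft."
)

# Split the text column into one lowercase word per row;
# the other columns (here, company) are repeated for every word.
toy_tokens <- unnest_tokens(toy, word, text)
toy_tokens
```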
# visualize tokens grouped by sentiments
stock_tokens %>%
  count(word) %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  group_by(sentiment) %>%
  top_n(5, n) %>%
  ungroup() %>%
  # reorder treats its first argument as a categorical variable, and
  # reorders its levels based on the values of a second variable
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ sentiment, scales = "free") +
  ylab("Frequency of this word in the recent financial articles")
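The effect of reorder() in the plot code above is easiest to see on a tiny vector. This is a minimal sketch with made-up words and counts (both hypothetical); it uses only base R.

```r
# Hypothetical words and their counts.
words  <- c("loss", "gain", "risk")
counts <- c(5, 12, 8)

# Default factor levels are alphabetical.
levels(factor(words))

# reorder() sorts the levels by the second argument (ascending),
# which is what makes the bars appear in order of frequency.
ordered_words <- reorder(words, counts)
levels(ordered_words)
```

Without this step ggplot2 would lay the bars out alphabetically, since that is the default level order of a factor built from strings.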
# table sentiments per company
stock_sentiment_count <- stock_tokens %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(sentiment, company) %>%
  spread(sentiment, n, fill = 0)
stock_sentiment_count
## # A tibble: 9 × 7
## company constraining litigious negative positive superfluous uncertainty
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Amazon 19 13 103 149 3 111
## 2 Apple 8 14 143 82 0 82
## 3 Facebook 12 63 129 30 0 52
## 4 Google 21 3 51 52 1 34
## 5 IBM 2 5 31 46 1 36
## 6 Microsoft 0 3 48 42 0 34
## 7 Netflix 6 9 114 103 1 64
## 8 Twitter 6 11 52 18 1 27
## 9 Yahoo 11 167 300 56 0 56
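The reshaping from one row per (sentiment, company) pair to one row per company can be sketched on a few counts taken from the table above. In current tidyr, spread() is superseded by pivot_wider(); the sketch below uses pivot_wider() as an alternative way to get the same wide layout. The `long_counts` name is an assumption for illustration.

```r
library(tibble)
library(tidyr)

# A few long-format counts copied from the table above. Apple has no
# "superfluous" row, as would happen when count() finds no such words.
long_counts <- tribble(
  ~sentiment,    ~company, ~n,
  "negative",    "Amazon", 103,
  "positive",    "Amazon", 149,
  "superfluous", "Amazon",   3,
  "negative",    "Apple",  143,
  "positive",    "Apple",   82
)

# One column per sentiment; missing combinations are filled with 0,
# matching spread(sentiment, n, fill = 0).
wide_counts <- pivot_wider(
  long_counts,
  names_from  = sentiment,
  values_from = n,
  values_fill = 0
)
wide_counts
```

The fill matters for the scoring step: without it, a company with no words in some category would carry NA instead of 0.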
# visualize positive/negative companies
stock_sentiment_count %>%
  # net sentiment: (positive - negative) as a share of all
  # positive and negative words, so the score lies in [-1, 1]
  mutate(score = (positive - negative) / (positive + negative)) %>%
  mutate(company = reorder(company, score)) %>%
  ggplot(aes(company, score, fill = score > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = "Company",
       y = "Positivity score among 20 news articles")
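As a check on the score, the arithmetic can be reproduced directly from the positive and negative columns of the table above. This sketch uses only base R; the vector names are chosen here for illustration.

```r
# Positive and negative word counts copied from the table above.
positive <- c(Amazon = 149, Apple = 82, Facebook = 30, Google = 52, IBM = 46,
              Microsoft = 42, Netflix = 103, Twitter = 18, Yahoo = 56)
negative <- c(Amazon = 103, Apple = 143, Facebook = 129, Google = 51, IBM = 31,
              Microsoft = 48, Netflix = 114, Twitter = 52, Yahoo = 300)

# Net sentiment as a share of all positive and negative words.
score <- (positive - negative) / (positive + negative)

# Companies from most negative to most positive coverage.
round(sort(score), 3)
```

For example, Amazon's score is (149 - 103) / (149 + 103) = 46 / 252. A score of 1 would mean only positive words were found and -1 only negative words. With these counts, IBM scores highest and Yahoo lowest.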