Mining financial articles

The RDS file (downloaded in May 2018) contains articles relevant to nine major technology stocks: Microsoft, Apple, Google, Amazon, Facebook, Twitter, IBM, Yahoo, and Netflix.

You are interested in using news to analyze the market and make investment decisions. Use sentiment analysis to determine whether the news coverage was positive or negative.

Use the Loughran and McDonald dictionary of financial sentiment terms as the lexicon. This dictionary was developed from analyses of financial reports and intentionally avoids words like “share” and “fool”, as well as subtler terms like “liability” and “risk”, which may not carry a negative meaning in a financial context. The Loughran data divides words into six sentiments: “positive”, “negative”, “litigious”, “uncertainty”, “constraining”, and “superfluous”. Call get_sentiments("loughran") to load the lexicon.

library(tidytext)
library(tidyverse)
library(quanteda)
library(tm)

stock_articles <- read_rds("stock_articles.rds")
stock_articles

## # A tibble: 9 × 3
##   company   symbol corpus    
##   <chr>     <chr>  <list>    
## 1 Microsoft MSFT   <WebCorps>
## 2 Apple     AAPL   <WebCorps>
## 3 Google    GOOG   <WebCorps>
## 4 Amazon    AMZN   <WebCorps>
## 5 Facebook  FB     <WebCorps>
## 6 Twitter   TWTR   <WebCorps>
## 7 IBM       IBM    <WebCorps>
## 8 Yahoo     YHOO   <WebCorps>
## 9 Netflix   NFLX   <WebCorps>

# extract tokens
stock_tokens <- stock_articles %>%
  mutate(corpus = map(corpus, tidy)) %>%
  unnest(corpus) %>%
  unnest_tokens(word, text) %>%
  select(company, word)

stock_tokens

## # A tibble: 54,125 × 2
##    company   word     
##    <chr>     <chr>    
##  1 Microsoft share    
##  2 Microsoft file     
##  3 Microsoft this     
##  4 Microsoft nov      
##  5 Microsoft 29       
##  6 Microsoft 2017     
##  7 Microsoft file     
##  8 Microsoft photo    
##  9 Microsoft shows    
## 10 Microsoft microsoft
## # … with 54,115 more rows
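
To make the tokenization step more concrete, the sketch below runs unnest_tokens() on a single made-up sentence. The toy tibble and its text are assumptions for illustration only and are not taken from the downloaded articles.

```r
library(tibble)
library(tidytext)

# Hypothetical one-row input; the real data comes from stock_articles
toy <- tibble(company = "Microsoft",
              text = "Microsoft shares rose after the earnings report")

# One row per word; unnest_tokens() lowercases the text by default
toy_tokens <- unnest_tokens(toy, word, text)
toy_tokens
```

Each word of the sentence becomes its own row, and the company column is carried along, which is the same shape as stock_tokens.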

# visualize tokens grouped by sentiments
stock_tokens %>%
  count(word) %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  group_by(sentiment) %>%
  top_n(5, n) %>%
  ungroup() %>%
  # reorder() treats its first argument as a categorical variable and
  # reorders its levels by the values of a second variable
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ sentiment, scales = "free") +
  ylab("Frequency of this word in the recent financial articles")
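
The effect of reorder() in the plot above can be checked in isolation. This is a minimal sketch using made-up words and counts (both are assumptions, not values from the articles).

```r
# Hypothetical words and counts, for illustration only
word <- c("loss", "gain", "risk")
n    <- c(3, 1, 2)

# Levels are sorted by n in ascending order, not alphabetically
word_ordered <- reorder(word, n)
levels(word_ordered)
```

Because the factor levels follow the counts, the bars in each facet are drawn in order of frequency rather than in alphabetical order.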

# tabulate sentiment counts per company
stock_sentiment_count <- stock_tokens %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(sentiment, company) %>%
  spread(sentiment, n, fill = 0)

stock_sentiment_count

## # A tibble: 9 × 7
##   company   constraining litigious negative positive superfluous uncertainty
##   <chr>            <dbl>     <dbl>    <dbl>    <dbl>       <dbl>       <dbl>
## 1 Amazon              19        13      103      149           3         111
## 2 Apple                8        14      143       82           0          82
## 3 Facebook            12        63      129       30           0          52
## 4 Google              21         3       51       52           1          34
## 5 IBM                  2         5       31       46           1          36
## 6 Microsoft            0         3       48       42           0          34
## 7 Netflix              6         9      114      103           1          64
## 8 Twitter              6        11       52       18           1          27
## 9 Yahoo               11       167      300       56           0          56
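
The join, count, and reshape steps behind this table can be sketched on a tiny example. The tokens and the three-word lexicon below are hand-made stand-ins (assumptions for illustration, not the real Loughran data), and pivot_wider() is used here as the current tidyr counterpart of spread().

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens and a hand-made stand-in lexicon
toy_tokens <- tibble(
  company = c("A", "A", "A", "B", "B"),
  word    = c("loss", "gain", "the", "loss", "lawsuit")
)
toy_lexicon <- tibble(
  word      = c("loss", "gain", "lawsuit"),
  sentiment = c("negative", "positive", "litigious")
)

toy_counts <- toy_tokens %>%
  # keep only words found in the lexicon ("the" is dropped)
  inner_join(toy_lexicon, by = "word") %>%
  # one row per sentiment/company pair
  count(sentiment, company) %>%
  # one column per sentiment, with 0 where a company has no matches
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)

toy_counts
```

Words that do not appear in the lexicon contribute nothing to the counts, so the totals in the table reflect only sentiment-bearing words, not article length.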

# visualize positive/negative companies
stock_sentiment_count %>%
  # net sentiment: (positive - negative) as a share of all positive and negative words
  mutate(score = (positive - negative) / (positive + negative)) %>%
  mutate(company = reorder(company, score)) %>%
  ggplot(aes(company, score, fill = score > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = "Company",
       y = "Positivity score among 20 news articles")
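
As a worked check of the score formula, the sketch below recomputes it in base R for three companies, using the positive and negative counts shown in the stock_sentiment_count table. The counts data frame is a hand-copied subset introduced here for illustration.

```r
# Positive and negative counts copied from the stock_sentiment_count table
counts <- data.frame(
  company  = c("Amazon", "Facebook", "Yahoo"),
  positive = c(149, 30, 56),
  negative = c(103, 129, 300)
)

# Score ranges from -1 (all negative) to +1 (all positive)
counts$score <- (counts$positive - counts$negative) /
  (counts$positive + counts$negative)

counts
```

For Amazon this gives (149 - 103) / (149 + 103) = 46 / 252, roughly 0.18, while Facebook and Yahoo both come out strongly negative. Dividing by the combined count keeps companies with more coverage from dominating the comparison.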