Let’s work with a dataset of the number of flights that leave NYC per day. We’ll also use the lubridate
package for dates and times manipulation. Let’s get started by counting the number of flights per day and visualising it.
library(dplyr)
library(ggplot2)
library(modelr)
library(nycflights13)
library(lubridate)
daily <- flights %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarise(n = n())
daily
## # A tibble: 365 x 2
## date n
## <date> <int>
## 1 2013-01-01 842
## 2 2013-01-02 943
## 3 2013-01-03 914
## 4 2013-01-04 915
## 5 2013-01-05 720
## 6 2013-01-06 832
## 7 2013-01-07 933
## 8 2013-01-08 899
## 9 2013-01-09 902
## 10 2013-01-10 932
## # … with 355 more rows
ggplot(daily, aes(date, n)) +
geom_line()
Understanding the long-term trend is challenging because there’s a very strong day-of-week effect that dominates the subtler patterns. Let’s start by looking at the distribution of flight numbers by day-of-week:
daily <- daily %>%
mutate(wday = wday(date, label = TRUE))
daily
## # A tibble: 365 x 3
## date n wday
## <date> <int> <ord>
## 1 2013-01-01 842 Mar
## 2 2013-01-02 943 Mer
## 3 2013-01-03 914 Gio
## 4 2013-01-04 915 Ven
## 5 2013-01-05 720 Sab
## 6 2013-01-06 832 Dom
## 7 2013-01-07 933 Lun
## 8 2013-01-08 899 Mar
## 9 2013-01-09 902 Mer
## 10 2013-01-10 932 Gio
## # … with 355 more rows
ggplot(daily, aes(wday, n)) +
geom_boxplot()
There are fewer flights on weekends because most travel is for business. The effect is particularly pronounced on Saturday: you might sometimes leave on Sunday for a Monday morning meeting, but it’s very rare that you’d leave on Saturday as you’d much rather be at home with your family.
Let’s remove weekend days from the orginal plot showing the number of flights per date:
ggplot(filter(daily, ! (wday %in% c("Sab", "Dom"))), aes(date, n)) +
geom_line() +
geom_smooth(se = FALSE, span = 0.20)
We see a clearer pattern here than before, with fewer flights at the beginning and end of the years, and more flights in the middle.
Another way to remove this strong pattern is to use a model. First, we fit the model, and display its predictions overlaid on the original data:
mod <- lm(n ~ wday, data = daily)
grid <- daily %>%
data_grid(wday) %>%
add_predictions(mod, "n")
grid
## # A tibble: 7 x 2
## wday n
## <ord> <dbl>
## 1 Dom 891.
## 2 Lun 975.
## 3 Mar 951.
## 4 Mer 963.
## 5 Gio 966.
## 6 Ven 967.
## 7 Sab 745.
ggplot(daily, aes(wday, n)) +
geom_boxplot() +
geom_point(data = grid, colour = "red", size = 4)
Next we compute and visualise the residuals:
daily <- daily %>%
add_residuals(mod)
daily %>%
ggplot(aes(date, resid)) +
geom_ref_line(h = 0) +
geom_line()
Note the change in the y-axis: now we are seeing the deviation from the expected number of flights, given the day of week. This plot is useful because now that we’ve removed much of the large day-of-week effect, we can see some of the subtler patterns that remain:
There seems to be some smoother long term trend over the course of a year. We can highlight that trend with geom_smooth()
:
daily %>%
ggplot(aes(date, resid)) +
geom_ref_line(h = 0) +
geom_line(colour = "grey50") +
geom_smooth(se = FALSE, span = 0.20)
There are fewer flights in January (and December), and more in summer (May-Sep).
Drawing a plot with one line for each day of the week makes clear that the model fails to accurately predict the number of flights on Saturday: during summer there are more flights than we expect, and during Fall there are fewer.
ggplot(daily, aes(date, resid, colour = wday)) +
geom_ref_line(h = 0) +
geom_line()
There are some days with far fewer flights than expected:
daily %>%
filter(resid < -100) %>%
arrange(resid)
## # A tibble: 11 x 4
## date n wday resid
## <date> <int> <ord> <dbl>
## 1 2013-11-28 634 Gio -332.
## 2 2013-11-29 661 Ven -306.
## 3 2013-12-25 719 Mer -244.
## 4 2013-07-04 737 Gio -229.
## 5 2013-12-24 761 Mar -190.
## 6 2013-12-31 776 Mar -175.
## 7 2013-09-01 718 Dom -173.
## 8 2013-05-26 729 Dom -162.
## 9 2013-07-05 822 Ven -145.
## 10 2013-01-01 842 Mar -109.
## 11 2013-01-20 786 Dom -105.
# 9-11 effect?
daily %>%
filter(date == "2013-09-11")
## # A tibble: 1 x 4
## date n wday resid
## <date> <int> <ord> <dbl>
## 1 2013-09-11 947 Mer -15.7
If you’re familiar with American public holidays, you might spot New Year’s day, July 4th, Thanksgiving and Christmas. There are some others that don’t seem to correspond to public holidays.