There are three popular correlation coefficients: Pearson, Spearman and Kendall.
The Pearson correlation coefficient estimates a value-based (linear) correlation and is most appropriate when the data come from a bivariate normal distribution. It is the covariance of the two variables divided by the product of their standard deviations:
\[cor(X,Y) =
\frac{cov(X, Y)}{\sigma_{X} \cdot \sigma_{Y}} =
\frac{\sum_{i} (x_i - \mu_X) (y_i - \mu_Y)}{\sqrt{\sum_{i} (x_i - \mu_X)^2} \sqrt{\sum_{i} (y_i - \mu_Y)^2}}\]
library(ggplot2)
displ = mpg$displ
hwy = mpg$hwy
cor(displ, hwy)
## [1] -0.76602
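The formula can also be evaluated term by term. The following is a minimal sketch on two small, made-up vectors (`x` and `y` are hypothetical illustration data, not taken from `mpg`); it checks that the hand-computed value agrees with `cor()` and with `cov()` divided by the product of `sd()`.

```r
# Hypothetical toy vectors, used only for illustration
x <- c(1.8, 2.0, 2.8, 3.1, 4.2, 5.3)
y <- c(29, 31, 26, 27, 23, 20)

# Numerator: sum of products of deviations from the means
num <- sum((x - mean(x)) * (y - mean(y)))

# Denominator: product of the root sums of squared deviations
den <- sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2))

manual <- num / den

# Both comparisons should give TRUE
all.equal(manual, cor(x, y))
all.equal(manual, cov(x, y) / (sd(x) * sd(y)))
```

The factor \(1/(n-1)\) in the sample covariance and in the sample standard deviations cancels, which is why the ratio of plain sums gives the same result.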
Kendall’s and Spearman’s correlation coefficients are used to estimate a rank-based correlation. These are more robust and have been recommended if the data do not necessarily come from a bivariate normal distribution.
The Spearman correlation coefficient applies Pearson’s formula to the ranks of the observations rather than to the values themselves. For instance:
# vectors
x = c(1, 100, 2, 0)
y = c(500, 6, 2, 4)
# ranks
(rankx = rank(x))
## [1] 2 4 3 1
(ranky = rank(y))
## [1] 4 3 1 2
cor(x, y, method = "spearman")
## [1] 0
cor(rankx, ranky)
## [1] 0
cor(displ, hwy, method = "spearman")
## [1] -0.8266576
cor(rank(displ), rank(hwy))
## [1] -0.8266576
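When there are no ties, the Spearman coefficient can also be written with the well-known shortcut \(1 - 6 \sum_i d_i^2 / (n (n^2 - 1))\), where \(d_i\) is the difference between the two ranks of observation \(i\). The sketch below checks this on the small vectors `x` and `y` from the example above; the names `d` and `rho_shortcut` are introduced here only for illustration.

```r
# Same small vectors as in the example above (no tied values)
x <- c(1, 100, 2, 0)
y <- c(500, 6, 2, 4)

# Rank differences: (2, 4, 3, 1) - (4, 3, 1, 2) = (-2, 1, 2, -1)
d <- rank(x) - rank(y)
n <- length(x)

# Shortcut formula, valid only when there are no ties
rho_shortcut <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Should give TRUE
all.equal(rho_shortcut, cor(x, y, method = "spearman"))
```

The shortcut should not be expected to hold for `displ` and `hwy`, because both contain many tied values; the rank-based Pearson computation shown above remains valid in that case.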
The Kendall correlation coefficient is defined as the difference between the numbers of concordant and discordant pairs, relative to the total number of pairs:
\[\tau = \frac{n(concord) - n(discord)}{n (n-1) /2}
\] where \(n\) is the number of observations, \(n(concord)\) is the number of concordant pairs and \(n(discord)\) is the number of discordant pairs.
- A pair \((i,j)\) is concordant if both \(x_i < x_j\) and \(y_i < y_j\) or both \(x_i > x_j\) and \(y_i > y_j\).
- A pair \((i,j)\) is discordant if \(x_i < x_j\) and \(y_i > y_j\) or \(x_i > x_j\) and \(y_i < y_j\).
- If \(x_i = x_j\) or \(y_i = y_j\) the pair is neither concordant nor discordant.
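The definition can be applied directly by enumerating all pairs. The following is a minimal sketch on the small vectors `x` and `y` used in the Spearman example; the helper names `pairs`, `s`, `n_concord` and `n_discord` are introduced here only for illustration.

```r
# Same small vectors as in the Spearman example (no tied values)
x <- c(1, 100, 2, 0)
y <- c(500, 6, 2, 4)
n <- length(x)

# All n (n - 1) / 2 index pairs (i, j) with i < j, one pair per column
pairs <- combn(n, 2)
i <- pairs[1, ]
j <- pairs[2, ]

# +1 for a concordant pair, -1 for a discordant pair, 0 if either variable is tied
s <- sign(x[i] - x[j]) * sign(y[i] - y[j])

n_concord <- sum(s > 0)  # 3 concordant pairs
n_discord <- sum(s < 0)  # 3 discordant pairs

tau <- (n_concord - n_discord) / (n * (n - 1) / 2)

# Should give TRUE
all.equal(tau, cor(x, y, method = "kendall"))
```

With tied values, the result of `cor(..., method = "kendall")` may differ from this plain count, since the formula above does not adjust the denominator for ties.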
Find a concordant and a discordant pair in the mpg dataset with respect to the displ and hwy variables.
library(dplyr)
# sort by displ and number the rows so that pairs can be referred to by id
mpg %>%
  arrange(displ) %>%
  mutate(id = row_number()) %>%
  select(id, displ, hwy)
## # A tibble: 234 × 3
## id displ hwy
## <int> <dbl> <int>
## 1 1 1.6 33
## 2 2 1.6 32
## 3 3 1.6 32
## 4 4 1.6 29
## 5 5 1.6 32
## 6 6 1.8 29
## 7 7 1.8 29
## 8 8 1.8 26
## 9 9 1.8 25
## 10 10 1.8 34
## # ℹ 224 more rows
cor(displ, hwy, method = "kendall")
## [1] -0.6536974
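One possible answer to the exercise can be read off the first rows of the sorted table above. The sketch below uses a hypothetical helper, `pair_type()`, which is not part of any package and simply encodes the three cases of the definition.

```r
# Hypothetical helper: classify a pair of observations (x_i, y_i) and (x_j, y_j)
pair_type <- function(x_i, y_i, x_j, y_j) {
  s <- sign(x_i - x_j) * sign(y_i - y_j)
  if (s > 0) {
    "concordant"
  } else if (s < 0) {
    "discordant"
  } else {
    "neither"  # tie in at least one of the variables
  }
}

# ids 4 and 10: displ goes from 1.6 to 1.8 and hwy goes from 29 to 34
pair_type(1.6, 29, 1.8, 34)  # "concordant"

# ids 1 and 6: displ goes from 1.6 to 1.8 and hwy goes from 33 to 29
pair_type(1.6, 33, 1.8, 29)  # "discordant"

# ids 1 and 2: same displ (1.6), so the pair counts as neither
pair_type(1.6, 33, 1.6, 32)  # "neither"
```

The negative Kendall coefficient indicates that discordant pairs such as ids 1 and 6 outnumber concordant ones in this dataset.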