This famous (Fisher’s or Anderson’s) iris data set gives the measurements, in centimeters, of four variables (sepal length, sepal width, petal length, and petal width) for 50 flowers from each of 3 species of iris: Iris setosa, Iris versicolor, and Iris virginica. The data were collected by Edgar Anderson: Anderson, Edgar (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society, 59, 2–5.

Challenges

  1. Explore the pairwise correlation among variables
  2. Use package corrplot to visualize correlations among variables

We first explore the correlation between two variables (sepal and petal lengths) using the Pearson, Spearman and Kendall correlation coefficients. The Pearson coefficient measures linear association using the observed values themselves, and its significance test assumes that the data come from a bivariate normal distribution.

\[cor(X,Y) = \frac{cov(X, Y)}{\sigma_{X} \cdot \sigma_{Y}} = \frac{\sum_{i} (x_i - \mu_X) (y_i - \mu_Y)}{\sqrt{\sum_{i} (x_i - \mu_X)^2} \sqrt{\sum_{i} (y_i - \mu_Y)^2}}\]
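As a minimal sketch of the formula above, the Pearson coefficient can be computed by hand in base R and compared with the built-in `cor()`. The variable names `dx`, `dy` and `r_manual` are introduced here for illustration only, and the sample means stand in for \(\mu_X\) and \(\mu_Y\).

```r
# Pearson correlation computed directly from the formula (base R only)
x <- iris$Sepal.Length
y <- iris$Petal.Length

# deviations from the sample means
dx <- x - mean(x)
dy <- y - mean(y)

# numerator: sum of cross-products
# denominator: product of the square roots of the sums of squares
r_manual <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

r_manual
cor(x, y)  # built-in result; should agree up to floating-point error
```

The normalising constants of the covariance and the standard deviations cancel, which is why only sums appear in the second form of the formula.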

Kendall’s or Spearman’s statistic is used to estimate a rank-based correlation. These are more robust and are recommended when the data do not necessarily come from a bivariate normal distribution. Spearman’s coefficient applies the Pearson formula to the ranks of the observations rather than to their values. Kendall’s coefficient is defined as:

\[\tau = \frac{n(concord) - n(discord)}{n (n-1) /2} \] where \(n\) is the number of observations, \(n(concord)\) is the number of concordant pairs and \(n(discord)\) is the number of discordant pairs.
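The two rank-based definitions can be checked with a short sketch in base R. The first part applies Pearson to the ranks of the iris measurements. The second part counts concordant and discordant pairs on a small tie-free toy sample (`a` and `b` are hypothetical values chosen for illustration, not iris data). A toy sample is used because, when the data contain ties, as the iris measurements do, `cor(..., method = "kendall")` applies a tie correction to the denominator, so it need not match the simple formula above.

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Spearman: Pearson applied to the ranks (ties receive average ranks)
rho_manual <- cor(rank(x), rank(y))
rho_manual
cor(x, y, method = "spearman")

# Kendall from the definition, on a tie-free toy sample (hypothetical values)
a <- c(1, 2, 3, 4, 5)
b <- c(2, 1, 4, 3, 5)

# all n(n-1)/2 pairs of observations, one pair per column
idx <- combn(length(a), 2)

# +1 if a pair is ordered the same way in a and b (concordant), -1 otherwise
s <- sign(a[idx[1, ]] - a[idx[2, ]]) * sign(b[idx[1, ]] - b[idx[2, ]])
n_concord <- sum(s > 0)
n_discord <- sum(s < 0)

tau_manual <- (n_concord - n_discord) / (length(a) * (length(a) - 1) / 2)
tau_manual
cor(a, b, method = "kendall")  # agrees here because there are no ties
```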

# Load the packages used below: ggplot2 (ggplot), dplyr (filter) and corrplot (corrplot)
library(ggplot2)
library(dplyr)
library(corrplot)

# Attach the data set to the R search path, so that its columns can be accessed simply by name.
attach(iris)

# Pearson correlation
cor(Sepal.Length, Petal.Length)
## [1] 0.8717538
# correlation test
cor.test(Sepal.Length, Petal.Length)
## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Length and Petal.Length
## t = 21.646, df = 148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8270363 0.9055080
## sample estimates:
##       cor 
## 0.8717538
# do the two variables look normally distributed? (visual check with histograms)
par(mfrow = c(1, 2) )
hist(Sepal.Length, main="")
hist(Petal.Length, main="")

par(mfrow = c(1, 1) )

# Spearman correlation
cor(Sepal.Length, Petal.Length, method = "spearman")
## [1] 0.8818981
# Kendall correlation
cor(Sepal.Length, Petal.Length, method = "kendall")
## [1] 0.7185159
# draw scatterplot
plot(Sepal.Length, Petal.Length)

ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() + 
  geom_smooth()

# keep only the setosa flowers (filter() from dplyr, not stats::filter)
iris_setosa = dplyr::filter(iris, Species == "setosa")
cor.test(iris_setosa$Sepal.Length, iris_setosa$Petal.Length)
## 
##  Pearson's product-moment correlation
## 
## data:  iris_setosa$Sepal.Length and iris_setosa$Petal.Length
## t = 1.9209, df = 48, p-value = 0.0607
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01206954  0.50776233
## sample estimates:
##       cor 
## 0.2671758
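The setosa result shows that the correlation within one species can be much weaker than the pooled correlation over all 150 flowers. A minimal sketch of the same comparison for every species, using base R only, is given here. The names `by_species` and `r_by_species` are introduced for illustration.

```r
# Pearson correlation between sepal and petal length within each species
by_species <- split(iris, iris$Species)  # list of three data frames

r_by_species <- sapply(
  by_species,
  function(d) cor(d$Sepal.Length, d$Petal.Length)
)

round(r_by_species, 2)

# pooled correlation over all flowers, for comparison
cor(iris$Sepal.Length, iris$Petal.Length)
```

Comparing the per-species values with the pooled one indicates how much of the overall association comes from differences between species rather than from the relationship within a species.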

We check pairwise correlation among the 4 variables (6 pairs):

dataset = iris[, 1:4]

# correlation matrix
M = cor(dataset)
round(M, 2)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length         1.00       -0.12         0.87        0.82
## Sepal.Width         -0.12        1.00        -0.43       -0.37
## Petal.Length         0.87       -0.43         1.00        0.96
## Petal.Width          0.82       -0.37         0.96        1.00
# scatterplot matrix of all pairs of variables
pairs(dataset)

# correlation plot matrix using the corrplot package
corrplot(M, method="ellipse")

# show the correlation coefficients (lower triangle) alongside the ellipses (upper triangle)
corrplot.mixed(M, lower="number", upper="ellipse")

# reorder variables by the angular order of the eigenvectors (AOE) so that correlated variables sit together
corrplot(M, order = "AOE")
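As a complement to the plots, the six pairwise correlations can also be listed as a table. The sketch below uses base R only and takes the upper triangle of the correlation matrix, so each pair appears once. The names `idx` and `pair_table` are introduced for illustration.

```r
# correlation matrix of the four measurements
M <- cor(iris[, 1:4])

# row/column positions of the upper triangle: one entry per pair of variables
idx <- which(upper.tri(M), arr.ind = TRUE)

pair_table <- data.frame(
  var1 = rownames(M)[idx[, "row"]],
  var2 = colnames(M)[idx[, "col"]],
  r    = M[idx]
)

# strongest associations first, regardless of sign
pair_table <- pair_table[order(-abs(pair_table$r)), ]
print(pair_table, row.names = FALSE, digits = 2)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the end of the list.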