R as a language for statistical computing

We start with an overview of R graphics. R includes tools for drawing the most common charts. Use curve to draw a function given as an expression of x over an interval:

curve(x^3, from=0, to=2)
curve(x^2, from=0, to=2, col="red", add=TRUE)
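
Since the second curve is overlaid on the first, a legend helps tell them apart; a minimal sketch with the legend function (position and labels are illustrative):

legend("topleft", legend = c("x^3", "x^2"), col = c("black", "red"), lty = 1)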

A scatter plot is drawn with command plot:

plot(serieA$score, serieA$F)

To interactively identify the points on the plot, use function identify:

identify(serieA$score, serieA$F, serieA$team)

To add point labels to the plot, use text:

plot(serieA$score, serieA$F, xlim=c(20,70), ylim=c(20,60), xlab="score", ylab="goals", main="Serie A")

text(serieA$score[serieA$team!="Bologna"], serieA$F[serieA$team!="Bologna"], labels=serieA$team[serieA$team!="Bologna"], adj=c(0,-0.5), cex=0.75)

text(serieA$score[serieA$team=="Bologna"], serieA$F[serieA$team=="Bologna"], labels=c("Bologna"), adj=c(0,-1.5), cex=0.75)

You can customize plots using graphical parameters. You can specify these parameters as arguments of graphics functions or with the par function, which sets them for the current graphics device (see the help for par). For instance, you may draw multiple charts within the same figure by setting the mfcol parameter (its value is a vector giving the number of rows and the number of columns of the chart matrix):

par(mfcol = c(3,2))
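
For instance, a minimal sketch that draws two charts side by side and then restores the previous settings (par returns the old parameter values, so they can be reinstated afterwards):

old.par = par(mfcol = c(1, 2))  # save the previous settings
curve(sin, -pi, pi)             # first chart (left)
curve(cos, -pi, pi)             # second chart (right)
par(old.par)                    # restore the previous settings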

Or scale the text in the chart (title, axis annotation, labels) with the cex parameter (1 is the default):

par(cex = 1.5)

The default plotting color is controlled by the col parameter, while line type and line width are set by lty and lwd, respectively:

plot(sin, -pi, pi, col = "blue", lty = 2, lwd = 2) 
lines(c(-pi/2, pi/2), c(-1, 1), lty = 3, lwd = 3)

To change the symbol used for points, use the pch parameter:

plot(rep(1, 5), pch = 1, ylim = c(1,5))
points(rep(2, 5), pch = 2)
points(rep(3, 5), pch = 3)
points(rep(4, 5), pch = 4)
points(rep(5, 5), pch = 5)
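
R offers standard plotting symbols numbered from 0 to 25; a quick way to preview them all is to draw one point per symbol:

plot(0:25, pch = 0:25)  # one point for each standard plotting symbol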

To suppress an axis, set xaxt="n" or yaxt="n". To add a customized axis, use the axis function:

plot(1:5, xaxt="n")
axis(side = 1, at = 1:5, labels = month.abb[1:5])
axis(side = 3, at = 1:5, labels = letters[1:5])
axis(side = 4, at = 1:5, labels = LETTERS[1:5])

To draw bar charts, use barplot:

univ = data.frame(
   year = c(2005, 2006, 2007, 2008, 2009),
   cs = c(1200, 1190, 1100, 1120, 890),
   eng = c(6200, 6690, 6700, 7120, 7150),
   med = c(8900, 8790, 8760, 8800, 9010),
   bio = c(3300, 3490, 3660, 4300, 4510),
   phy = c(2190, 2000, 1890, 1740, 1500)
)  
 
univ.m = as.matrix(univ[2:6])
rownames(univ.m) = univ[,1]

> univ.m  
       cs  eng  med  bio  phy
2005 1200 6200 8900 3300 2190
2006 1190 6690 8790 3490 2000
2007 1100 6700 8760 3660 1890
2008 1120 7120 8800 4300 1740
2009  890 7150 9010 4510 1500

barplot(univ.m[1,])

The following draws one bar per column, with the rows stacked within each bar:

barplot(univ.m, legend=TRUE)

To use juxtaposed (side-by-side) bars instead:

barplot(univ.m, beside=TRUE, legend=TRUE)

To swap the roles of the two variables, transpose the matrix (function t computes the transpose):

barplot(t(univ.m), legend=TRUE, ylim=c(0, 33000))

Pie and dot charts are alternative representations to bar charts:

x = c(35, 11, 30, 8, 6)
names(x) = c("PDL", "LN", "PD", "IDV", "UC")
pie(x)
dotchart(x)

Plot categorical data with spineplot and assocplot. For instance, we analyze how co-authorship influences the quality of academic papers in the medical sciences. Consider a contingency table where rows are classes of paper co-authors and columns are paper quality judgements (Limited, Acceptable, Good, and Excellent).

> peer = matrix(scan("peer.dat", 0), nrow=4, byrow=TRUE)
Read 16 items

dimnames(peer) = list(c("1-6", "7-8", "9-10", ">10"), c("L", "A", "G", "E"))
> peer
      L   A   G   E
1-6  82 177 363 159
7-8  31 124 306 135
9-10 16  78 303 132
>10  18  84 307 222

A spineplot can be seen as a generalization of a stacked bar plot where the widths of the bars correspond to the relative frequencies of 'x', and the heights of the bars correspond to the conditional relative frequencies of 'y' in every 'x' group.

spineplot(peer, main="Effect of co-authorship on paper quality", xlab="authors", ylab="quality")

Function assocplot plots a Cohen-Friendly association plot, a set of bar charts showing the deviation from the expected frequencies in case of independence of the two variables. Each cell is represented by a rectangle that has area proportional to the difference in observed and expected frequencies. The rectangles in each row are positioned relative to a baseline indicating independence. If the observed frequency of a cell is greater than the expected one, the box rises above the baseline; otherwise, the box falls below the baseline.

assocplot(peer, main="Effect of co-authorship on paper quality", xlab="authors", ylab="quality")

You can visualize the distribution of a sample in several ways. Function hist generates a histogram:

# random sample of size 1000 according to normal distribution
bell = rnorm(1000)

hist(bell, breaks = seq(floor(min(bell)), ceiling(max(bell)), 0.2), probability=TRUE, main="Histogram of a normal sample")
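
Since probability=TRUE puts the histogram on the density scale, we can overlay the theoretical standard normal density for comparison; a minimal sketch:

curve(dnorm(x), col = "red", add = TRUE)  # theoretical density over the histogram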

Function density calculates a kernel density estimate of the sample distribution, which can then be plotted. Additionally, you may add a rug (a 1-dimensional plot in which each point is represented by a short line segment):

plot(density(bell), main="Density of a normal sample")
rug(bell)

A quantile-quantile plot, or Q-Q plot, plots the quantiles of an empirical distribution against the quantiles of a theoretical distribution. It is useful to compare empirical and theoretical distributions. Use function qqnorm to compare an empirical sample with the normal distribution, qqline to draw a fitting line, and qqplot to compare two empirical samples:

qqnorm(bell)
qqline(bell)
unif = runif(1000)
qqplot(bell, unif)

A box plot shows the 1st quartile, the 2nd quartile (median), and the 3rd quartile of the distribution inside a box, plus upper and lower adjacent values as whiskers of the box and possible outliers above and below them. By default, the whiskers extend to the most extreme data point that is no more than 1.5 times the interquartile range away from the box; use the range parameter to control the whiskers.

boxplot(bell)
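
For instance, setting range to zero makes the whiskers span the data extremes, so no point is flagged as an outlier (a minimal sketch):

boxplot(bell, range = 0)  # whiskers extend to the minimum and maximum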

Finally, to save charts in common formats, call the device function (such as png, postscript, or pdf) passing the file name, plot the chart, and remember to close the device (and save the file) with the dev.off function:

png("plot.png")
plot(1:10)
dev.off() 

postscript("plot.eps", onefile=FALSE);
plot(1:10)
dev.off() 

pdf("plot.pdf", onefile=FALSE);
plot(1:10)
dev.off() 

Let's do some basic statistics with R. R includes a variety of functions for calculating summary statistics. As an example, we load citations received by Computer Science journal papers:

> cites = scan("cscites.txt", 0)
Read 9140 items

> min(cites)
[1] 0

> max(cites)
[1] 1014

> mean(cites)
[1] 11.69026

> median(cites)
[1] 4

> var(cites)
[1] 1002.378

> sd(cites)
[1] 31.66035

To ignore NA values, set the na.rm parameter to TRUE. To symmetrically trim the vector before computing the mean (discarding outliers), use the trim parameter:

> mean(c(1, 2, NA))
[1] NA

> mean(c(1, 2, NA), na.rm=TRUE)
[1] 1.5

> mean(cites, trim=0.1)
[1] 6.053063

The following functions provide convenient summaries:

> range(cites)
[1]    0 1014

> fivenum(cites)
[1]    0    1    4   11 1014

> summary(cites)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    4.00   11.69   11.00 1014.00 

To compute a specific quantile:

> quantile(cites, 0.60)
60% 
  6 

> quantile(cites, seq(0, 1, 0.1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
   0    0    0    1    2    4    6    9   14   27 1014 

We can write simple functions to compute skewness and kurtosis of a sample distribution:

skew = function(x) sum((x - mean(x))^3) / (length(x) * sd(x)^3)

> skew(cites)
[1] 13.07667

> skew(rnorm(1000))
[1] -0.01478604

kurt = function(x) sum((x - mean(x))^4) / (length(x) * sd(x)^4) - 3

> kurt(cites)
[1] 284.0616

> kurt(rnorm(1000))
[1] -0.1071780

Correlation among variables is computed with the cor function:

> cor(serieA$score, serieA$F)
[1] 0.8428363

> cor(serieA$score, serieA$S)
[1] -0.5867943

> cor(serieA$score, serieA$N)
[1] 0.09274693

We can generate a scatterplot matrix for all pairwise combinations of the variables in a matrix or data frame with function pairs:

df = serieA[, c("score", "V", "N", "P", "F", "S")] 

> df
   score  V  N  P  F  S
1     63 18  9  4 58 28
2     62 18  8  5 56 35
3     60 17  9  5 49 29
4     51 14  9  8 46 38
5     48 14  6 11 48 44
6     48 12 12  7 41 36
7     48 13  9  9 38 37
8     44 13  5 13 43 36
9     44 12  8 11 51 51
10    43 11 10 10 38 37
11    42 11  9 11 31 38
12    40 11  7 13 48 47
13    38 10  8 13 27 29
14    35  9  8 14 34 44
15    35  8 11 12 34 36
16    33  7 12 12 27 33
17    32  8  8 15 38 49
18    28  7  7 17 29 42
19    26  6  8 17 32 53
20    25  6  7 18 21 47

pairs(df)

Similarly, we can generate a correlation matrix for all pairwise combinations of the variables in a matrix or data frame with function cor:

> cor(df)
            score           V           N          P          F          S
score  1.00000000  0.98727658  0.09274693 -0.9556292  0.8428363 -0.5867943
V      0.98727658  1.00000000 -0.06676016 -0.8966296  0.8654055 -0.5253528
N      0.09274693 -0.06676016  1.00000000 -0.3819344 -0.1303003 -0.3924024
P     -0.95562917 -0.89662964 -0.38193436  1.0000000 -0.7437634  0.6607481
F      0.84283632  0.86540551 -0.13030033 -0.7437634  1.0000000 -0.1621842
S     -0.58679433 -0.52535279 -0.39240241  0.6607481 -0.1621842  1.0000000

Interestingly, score is essentially uncorrelated with the number of ties (N), and its association with goals scored (F) is stronger, in absolute value, than its association with goals conceded (S). This suggests that the best strategy to win the championship is to attack rather than to defend.

To exclude NA values, set the parameter use to "complete.obs" (see the sketch after the tests below). To test whether a correlation is statistically significant, run a correlation test using cor.test. If the p-value is lower than a significance level, usually set to 5%, then the correlation is significant at that level:

> cor.test(serieA$score, serieA$F)

        Pearson's product-moment correlation

data:  serieA$score and serieA$F 
t = 6.6445, df = 18, p-value = 3.095e-06
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.6384339 0.9361856 
sample estimates:
      cor 
0.8428363 

> cor.test(serieA$score, serieA$S)

        Pearson's product-moment correlation

data:  serieA$score and serieA$S 
t = -3.0745, df = 18, p-value = 0.006533
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 -0.8171316 -0.1948770 
sample estimates:
       cor 
-0.5867943 

> cor.test(serieA$score, serieA$N)

        Pearson's product-moment correlation

data:  serieA$score and serieA$N 
t = 0.3952, df = 18, p-value = 0.6973
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 -0.3647438  0.5141651 
sample estimates:
       cor 
0.09274693 
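
As for the use parameter mentioned above, here is a minimal sketch with a hypothetical pair of vectors containing a missing value:

x = c(1, 2, 3, NA, 5)            # hypothetical data with an NA
y = c(2, 1, 4, 5, 6)
cor(x, y)                        # NA: missing values propagate by default
cor(x, y, use = "complete.obs")  # computed on the complete pairs only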

By default, cor computes the Pearson correlation. To calculate the nonparametric Spearman and Kendall correlations, use the method parameter:

> cor(serieA$score, serieA$F, method="spearman")
[1] 0.8228183

> cor(serieA$score, serieA$F, method="kendall")
[1] 0.661249
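
Function cor.test accepts the same method parameter, so the significance of a nonparametric correlation can be tested as well; for instance:

cor.test(serieA$score, serieA$F, method = "spearman")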

Let us test the association between categorical variables. A contingency table can be constructed from two vectors of observations with function table:

authors = c("1-6", "1-6", "1-6", "1-6", "7-8", "7-8", "7-8", "9-10", "9-10", ">10")
quality = c("L", "L", "L", "L", "A", "A", "A", "G", "E", "E") 

cross.tab = table(authors, quality)

> cross.tab
       quality
authors A E G L
   >10  0 1 0 0
   1-6  0 0 0 4
   7-8  3 0 0 0
   9-10 0 1 1 0
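
Given such a table, functions addmargins and prop.table compute marginal totals and relative frequencies, respectively; a minimal sketch on the table above:

addmargins(cross.tab)              # add row and column totals
prop.table(cross.tab, margin = 1)  # relative frequencies within each row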

Given a contingency table, we may use Pearson's Chi-squared test to check whether there exists a significant association between the two variables in the table. Consider again the relationship between the number of authors and the quality of papers in the medical sciences. The variable authors is categorized in disjoint intervals, and quality can take four categorical values (E, G, A, L):

> peer
      L   A   G   E
1-6  82 177 363 159
7-8  31 124 306 135
9-10 16  78 303 132
>10  18  84 307 222

> chisq.test(peer)

        Pearson's Chi-squared test

data:  peer 
X-squared = 110.1166, df = 9, p-value < 2.2e-16

The p-value is the probability of obtaining by chance a departure from the expected frequencies at least as large as the observed one, assuming the variables are independent. Since the p-value is nearly 0, the observed results are very far from independence, and we can reject the hypothesis that the number of authors and the quality of papers are independent. For instance, the observed number of excellent papers with more than 10 authors is 222, while the expected one under independence would be much smaller:

> sum(peer[">10",]) * sum(peer[,"E"]) / sum(peer)
[1] 161.1699
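
The full matrix of expected frequencies is available in the expected component of the value returned by chisq.test; a minimal sketch:

chisq.test(peer)$expected  # expected counts under independence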

We can also use the Chi-squared test to check whether the population probabilities equal those specified in the parameter p, or are all equal if p is not given. Let's check whether the decimal digits of Pi form a good random source of digits:

> pi.digits = scan("pi.dat", 0)
Read 10000 items

> table(pi.digits)
pi.digits
   0    1    2    3    4    5    6    7    8    9 
 968 1026 1021  974 1012 1046 1021  970  948 1014 
 
> chisq.test(table(pi.digits))

        Chi-squared test for given probabilities

data:  table(pi.digits) 
X-squared = 9.318, df = 9, p-value = 0.4085

Since the p-value is very high, we cannot reject the hypothesis that probabilities are all equal.
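
The same test can be written with the probabilities made explicit through the p parameter (here, ten equally likely digits), yielding the identical result:

chisq.test(table(pi.digits), p = rep(0.1, 10))  # uniform digit probabilities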

R provides functions to compute common probability distributions. For instance, let's work with the normal distribution (to change distribution, just plug in the name of the distribution and its parameters). To find the probability density at a given value, use the dnorm function:

> dnorm(0)
[1] 0.3989423

> dnorm(0, mean = 1, sd = 2)
[1] 0.1760327

plot(dnorm, -3, 3, main="Density function for normal distribution")

Let's plot densities with different standard deviations on the same plot:

plot(function(x) {dnorm(x)}, -5, 5, ylab="dnorm")
plot(function(x) {dnorm(x, sd=1.5)}, -5, 5, col="red", add=TRUE)
plot(function(x) {dnorm(x, sd=2)}, -5, 5, col="blue", add=TRUE)
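
As noted above, switching distribution amounts to plugging in its name and parameters; a minimal sketch with the exponential distribution (the rate value is an arbitrary illustrative choice):

dexp(1, rate = 2)    # density at 1
pexp(1, rate = 2)    # P(X <= 1)
qexp(0.5, rate = 2)  # median
rexp(5, rate = 2)    # random sample of size 5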

To find the distribution function at a given value q, that is, the probability p = P(X <= q), use the pnorm function. Set parameter lower.tail = FALSE for the complementary probability p = P(X > q).

> pnorm(0)
[1] 0.5

> pnorm(0, lower.tail=FALSE)
[1] 0.5

# probability that the value is less than 1 standard deviation below the mean
> pnorm(-1)
[1] 0.1586553

# probability that the value is less than 2 standard deviations below the mean
> pnorm(-2)
[1] 0.02275013

# probability that the value is within 2 standard deviations from the mean
> pnorm(2) - pnorm(-2) 
[1] 0.9544997

plot(pnorm, -3, 3, main="Distribution function for normal distribution")

The quantile function is the inverse of the distribution function. For a given probability value p, it returns q such that p = P(X <= q).

# 1st quartile
> qnorm(0.25)
[1] -0.6744898

# 2nd quartile (median)
> qnorm(0.5)
[1] 0

# 3rd quartile
> qnorm(0.75)
[1] 0.6744898

# left and right sides of a 95% confidence interval
> c(qnorm(0.025), qnorm(0.975))
[1] -1.959964  1.959964

plot(qnorm, 0, 1)

Finally, a random sample taken from the normal distribution is generated with rnorm:

> rnorm(10)
 [1] -0.70682679 -0.64476272  1.99289305  0.04677949 -0.30369483  0.97250698
 [7]  0.04484939  0.54280398 -1.52733082 -0.79838026
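
Random samples differ from run to run; to make them reproducible, fix the seed of the random number generator with set.seed before sampling (the seed value is arbitrary):

set.seed(42)  # arbitrary seed, for reproducibility
rnorm(10)     # now the same sample is drawn at every run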