R as a data manipulation language

You can use the scan function to read data from a file into a vector. Let us read files vector.dat, vector.csv, and vector.txt:

> scan("vector.dat", what=integer())
Read 8 items
[1]  2  3  5  7 11 13 17 19

# or, almost equivalently (what=0 reads the values as doubles, not integers):
scan("vector.dat", 0)

> scan("vector.csv", what=character(), sep=",")
Read 4 items
[1] "Hello World!" "Hello"        "World"        "!"

# or equivalently:
scan("vector.csv", "a", sep=",")

> scan("vector.txt", what=character(), quote="\"")
Read 4 items
[1] "Hello World!" "Hello"        "World"        "!"
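
Since the data files themselves are not shown, the following sketch writes hypothetical contents to temporary files and reads them back. The file contents and the variable names are assumptions made for illustration. It also shows that what=integer() yields an integer vector, whereas what=0 yields doubles:

```r
# Hypothetical contents: the original vector.dat and vector.csv are not shown.
dat <- tempfile(fileext = ".dat")
writeLines("2 3 5 7 11 13 17 19", dat)

ints <- scan(dat, what = integer(), quiet = TRUE)  # quiet suppresses "Read 8 items"
dbls <- scan(dat, what = 0, quiet = TRUE)

typeof(ints)  # "integer"
typeof(dbls)  # "double"

csv <- tempfile(fileext = ".csv")
writeLines("Hello World!,Hello,World,!", csv)
words <- scan(csv, what = character(), sep = ",", quiet = TRUE)

# Remove the temporary files.
unlink(c(dat, csv))
```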

To read a matrix, first scan the data into a vector, and then map the vector into the desired matrix. Let us read matrix.dat:

> matrix(scan("matrix.dat", 0), nrow=4, byrow=TRUE)
Read 12 items
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
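
The byrow argument matters because matrix fills column by column by default. A small sketch, using an in-memory vector in place of the contents of matrix.dat:

```r
v <- 1:12  # stands in for the 12 values scanned from matrix.dat

by_row <- matrix(v, nrow = 4, byrow = TRUE)  # fill one row at a time
by_col <- matrix(v, nrow = 4)                # default: fill one column at a time

by_row[2, ]  # 4 5 6
by_col[2, ]  # 2 6 10

# Filling by column and transposing gives the same result as byrow=TRUE.
same <- identical(t(matrix(v, nrow = 3)), by_row)
```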

You can read the table teams.dat into a data frame:

> read.table("teams.dat", header=TRUE)
     team score win tie lost
1   Inter    59  17   8    3
2   Milan    58  17   7    4
3    Roma    53  15   8    5
4 Palermo    46  13   7    8

You can write data to a file (after converting it to a data frame) as follows:

league = read.table("teams.dat", header=TRUE)
write.table(league, file="league.txt", quote=FALSE, row.names=FALSE)

To save a compressed version of an object:

save(league, file="league.dat")

To load the saved object:

load("league.dat")
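
Let us check both round trips on a small example. This is only a sketch: the data frame is a cut-down version of the league table, and the temporary file names are assumptions. Note that load restores the object under its original name and returns that name:

```r
league <- data.frame(
  team  = c("Inter", "Milan", "Roma", "Palermo"),
  score = c(59, 58, 53, 46)
)

# Text round trip: write.table followed by read.table.
txt <- tempfile(fileext = ".txt")
write.table(league, file = txt, quote = FALSE, row.names = FALSE)
back <- read.table(txt, header = TRUE)

# Binary round trip: save followed by load.
rda <- tempfile(fileext = ".RData")
save(league, file = rda)
rm(league)            # remove the object from the workspace
loaded <- load(rda)   # restores league; returns the character vector "league"

unlink(c(txt, rda))
```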

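The c and paste functions both combine vectors, but with different effects: c appends the elements of its arguments into one longer vector, while paste converts them to strings and joins them element by element: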

x = letters[1:4]
> x
[1] "a" "b" "c" "d"

y = LETTERS[1:4]
> y
[1] "A" "B" "C" "D"

> c(x, y)
[1] "a" "b" "c" "d" "A" "B" "C" "D"

> paste(x, y) 
[1] "a A" "b B" "c C" "d D"

> paste(x, y, sep="/")
[1] "a/A" "b/B" "c/C" "d/D"

> paste(x, collapse="-")
[1] "a-b-c-d"
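
A few more variations, as a sketch: paste0 is paste with an empty separator, shorter arguments are recycled, and sep and collapse can be combined. The variable names are chosen for illustration:

```r
x <- letters[1:4]

labels <- paste0(x, 1:4)             # "a1" "b2" "c3" "d4"
ids <- paste("id", 1:3, sep = "_")   # "id" is recycled: "id_1" "id_2" "id_3"

# sep joins element by element, then collapse joins the results into one string.
joined <- paste(x, LETTERS[1:4], sep = "", collapse = "+")   # "aA+bB+cC+dD"
```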

Functions cbind and rbind add columns and rows, respectively, to matrices and data frames:

m = matrix(data = 1:9, nrow=3, byrow=TRUE)

> rbind(m, c(1, 1, 1))
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]    1    1    1

> cbind(m, c(1, 1, 1))
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    1
[2,]    4    5    6    1
[3,]    7    8    9    1

league = read.table("teams.dat", header=TRUE)
> league
     team score win tie lost
1   Inter    59  17   8    3
2   Milan    58  17   7    4
3    Roma    53  15   8    5
4 Palermo    46  13   7    8

top = league[1:2,]
bottom = league[3:4,]

> top
   team score win tie lost
1 Inter    59  17   8    3
2 Milan    58  17   7    4

> bottom
     team score win tie lost
3    Roma    53  15   8    5
4 Palermo    46  13   7    8

> rbind(top, bottom)
     team score win tie lost
1   Inter    59  17   8    3
2   Milan    58  17   7    4
3    Roma    53  15   8    5
4 Palermo    46  13   7    8

goals = c(58, 56, 49, 46)
> cbind(league, goals)
     team score win tie lost goals
1   Inter    59  17   8    3    58
2   Milan    58  17   7    4    56
3    Roma    53  15   8    5    49
4 Palermo    46  13   7    8    46
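
When applied to data frames, rbind matches columns by name rather than by position. The following sketch uses two small hypothetical data frames whose columns appear in a different order:

```r
top <- data.frame(team = c("Inter", "Milan"), score = c(59, 58))
bottom <- data.frame(score = c(53, 46), team = c("Roma", "Palermo"))  # columns swapped

# The result takes its column order from the first argument.
all_teams <- rbind(top, bottom)
names(all_teams)   # "team" "score"
```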

One way to take a subset of a data set is to use bracket notation. Let us work with the ranking of the Italian soccer Serie A. Use the first dimension in square brackets to filter the table rows, and the second dimension to select the table columns:

serieA = read.table("serieA.dat", header=TRUE)

> serieA[serieA$score > 44 & serieA$F > 38, c("team", "score")]
      team score
1    Inter    63
2     Roma    62
3    Milan    60
4  Palermo    51
5 Juventus    48
6   Napoli    48

Equivalently, you may use the subset function:

subset(serieA, score > 44 & F > 38, c("team", "score"))
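
The two forms differ when the condition evaluates to NA: bracket notation returns a row of NAs for each such case, whereas subset drops it. A minimal sketch on a hypothetical data frame with one missing score:

```r
d <- data.frame(team = c("A", "B", "C"), score = c(50, NA, 40))

by_bracket <- d[d$score > 44, ]        # condition is TRUE, NA, FALSE
by_subset <- subset(d, score > 44)     # NA is treated as FALSE
by_which <- d[which(d$score > 44), ]   # which() also ignores NA

nrow(by_bracket)  # 2: team A plus a row of NAs
nrow(by_subset)   # 1
```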

We can select a random sample of a data set as follows:

> sample(1:100, 10)
 [1] 84 18 42 39 11  4 32 70 71 54

# random permutation
> sample(1:10, 10)
 [1]  4  2  9  6 10  8  1  7  5  3

# with element replacement
> sample(c(0,1), 10, replace=TRUE)
 [1] 0 1 1 1 0 1 0 0 1 1
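
Two common variations, sketched on made-up data: set.seed makes a sample reproducible, and sampling row indices draws a random subset of the rows of a data frame. The seed value and the data frame are arbitrary choices for illustration:

```r
# The same seed gives the same sample.
set.seed(42)
first <- sample(1:100, 10)
set.seed(42)
second <- sample(1:100, 10)
identical(first, second)  # TRUE

# Draw 5 of the 20 rows at random, without replacement.
d <- data.frame(id = 1:20, value = (1:20)^2)
rows <- d[sample(nrow(d), 5), ]
```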

To merge information from different tables, use the merge function, which implements the join operation of relational databases:

league = read.table("teams.dat", header=TRUE)
story = read.table("story.dat", header=TRUE)

> league
     team score win tie lost
1   Inter    59  17   8    3
2   Milan    58  17   7    4
3    Roma    53  15   8    5
4 Palermo    46  13   7    8

> story
     team    city year
1   Inter  Milano 1907
2   Milan  Milano 1901
3    Roma    Roma 1910
4 Palermo Palermo 1934

> merge(league, story)
     team score win tie lost    city year
1   Inter    59  17   8    3  Milano 1907
2   Milan    58  17   7    4  Milano 1901
3 Palermo    46  13   7    8 Palermo 1934
4    Roma    53  15   8    5    Roma 1910

# which is equivalent to:
merge(league, story, by.x=c("team"), by.y=c("team"))
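
By default merge keeps only the rows whose key appears in both tables (an inner join). The all.x, all.y and all arguments give left, right and full outer joins. Let us try this on two cut-down, hypothetical tables that share only some teams:

```r
league <- data.frame(team = c("Inter", "Milan", "Roma"),
                     score = c(59, 58, 53))
story <- data.frame(team = c("Inter", "Milan", "Palermo"),
                    year = c(1907, 1901, 1934))

inner <- merge(league, story)                # Inter, Milan
left <- merge(league, story, all.x = TRUE)   # also Roma, with year NA
full <- merge(league, story, all = TRUE)     # also Palermo, with score NA
```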

A convenient function for transforming a data frame is transform:

> league
     team score win tie lost
1   Inter    59  17   8    3
2   Milan    58  17   7    4
3    Roma    53  15   8    5
4 Palermo    46  13   7    8

> transform(league, score = score - 3, win = win - 1, lost = lost + 1) 
     team score win tie lost
1   Inter    56  16   8    4
2   Milan    55  16   7    5
3    Roma    50  14   8    6
4 Palermo    43  12   7    9

You can also use transform to add new variables:

> transform(league, old.score = win * 2 + tie) 
     team score win tie lost old.score
1   Inter    59  17   8    3        42
2   Milan    58  17   7    4        41
3    Roma    53  15   8    5        38
4 Palermo    46  13   7    8        33
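
Note that transform returns a new data frame and leaves its argument unchanged, so the result must be assigned to keep it. A short sketch on a cut-down league table (the column name points is an assumption for illustration):

```r
league <- data.frame(team = c("Inter", "Milan"),
                     win = c(17, 17),
                     tie = c(8, 7))

updated <- transform(league, points = 3 * win + tie)

"points" %in% names(league)   # FALSE: the original is untouched
updated$points                # 59 58

# The same column added in place with the $ operator.
league$points <- 3 * league$win + league$tie
```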

R provides a number of functions for summarizing data, that is, for aggregating records to build a smaller data set. The apply function applies a function along the specified dimension of an array (MARGIN=1 for rows, MARGIN=2 for columns):

a = 1:9
dim(a) = c(3, 3)
> a
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

> apply(X=a, MARGIN=1, sum)
[1] 12 15 18

> apply(X=a, MARGIN=2, max)
[1] 3 6 9
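
The function passed to apply can also be user-defined, and it may return more than one value. A sketch on the same matrix, with variable names chosen for illustration:

```r
a <- matrix(1:9, nrow = 3)   # same matrix as above, filled by column

row_sums <- apply(a, 1, sum)
all(row_sums == rowSums(a))  # TRUE: rowSums is a shortcut for this case

# An anonymous function: the spread (max - min) of each column.
col_spread <- apply(a, 2, function(col) max(col) - min(col))   # 2 2 2

# When the function returns a vector, each result becomes a column.
row_ranges <- apply(a, 1, range)   # 2 x 3 matrix; first column is 1 7
```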

Function tapply summarizes a vector by group: it partitions the vector elements according to one or more grouping factors (INDEX) and applies the given function to each group:

letter = c("A", "A", "A", "B", "B", "C")
number = 1:6
f = data.frame(letter, number)
> f
  letter number
1      A      1
2      A      2
3      A      3
4      B      4
5      B      5
6      C      6

> tapply(X = f$number, INDEX = list(f$letter), sum)
A B C 
6 9 6

# passing parameters to the aggregation function
> tapply(X = f$number, INDEX = list(f$letter), sum, na.rm = TRUE)
A B C 
6 9 6

color = c("green", "blue", "green", "red", "red", "white")
f = data.frame(letter, color, number)
> f
  letter color number
1      A green      1
2      A  blue      2
3      A green      3
4      B   red      4
5      B   red      5
6      C white      6

> tapply(f$number, list(f$letter, f$color), sum)
  blue green red white
A    2     4  NA    NA
B   NA    NA   9    NA
C   NA    NA  NA     6
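
The na.rm argument only makes a difference when the data contain missing values. The following sketch uses a hypothetical variant of the data with one NA, and also shows aggregate, which computes similar group summaries but returns a data frame:

```r
f <- data.frame(
  letter = c("A", "A", "A", "B", "B", "C"),
  number = c(1, NA, 3, 4, 5, 6)   # hypothetical data with one missing value
)

with_na <- tapply(f$number, f$letter, sum)                    # A is NA
without_na <- tapply(f$number, f$letter, sum, na.rm = TRUE)   # A is 4

# The formula interface of aggregate omits rows with NA by default.
agg <- aggregate(number ~ letter, data = f, FUN = sum)
```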

To count the number of observations that take on each possible value of a variable, use the table function:

> table(c(0,1,1,1,2,2,3))
0 1 2 3 
1 3 2 1

Let us tabulate the first 10,000 digits of Pi after the decimal point (courtesy of the Pi search page):

> pi.digits = scan("pi.dat", 0)
Read 10000 items

> table(pi.digits)
pi.digits
   0    1    2    3    4    5    6    7    8    9 
 968 1026 1021  974 1012 1046 1021  970  948 1014

Let us work with categorical data:

pressure = c("low", "normal", "low", "low", "normal", "normal", "normal", "high", "high", "high")
cholesterol = c("low", "low", "low", "low", "normal", "high", "normal", "high", "normal", "high")
health = data.frame(pressure, cholesterol)
> health
   pressure cholesterol
1       low         low
2    normal         low
3       low         low
4       low         low
5    normal      normal
6    normal        high
7    normal      normal
8      high        high
9      high      normal
10     high        high

> table(health$pressure)

  high    low normal 
     3      3      4 

> table(health$cholesterol)

  high    low normal 
     3      4      3 

> table(health$pressure, health$cholesterol)
        
         high low normal
  high      2   0      1
  low       0   3      0
  normal    1   1      2
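
Counts are often easier to compare as proportions or with totals attached. A brief sketch on the same data, using prop.table and addmargins from base R:

```r
pressure <- c("low", "normal", "low", "low", "normal",
              "normal", "normal", "high", "high", "high")
cholesterol <- c("low", "low", "low", "low", "normal",
                 "high", "normal", "high", "normal", "high")

counts <- table(pressure, cholesterol)

row_props <- prop.table(counts, margin = 1)  # each row sums to 1
with_totals <- addmargins(counts)            # adds a "Sum" row and column

with_totals["Sum", "Sum"]  # 10
```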

Find and remove duplicates with the duplicated and unique functions:

> league
     team score win tie lost
1   Inter    59  17   8    3
2   Milan    58  17   7    4
3    Roma    53  15   8    5
4 Palermo    46  13   7    8

> duplicated(league)
[1] FALSE FALSE FALSE FALSE

dup.league = rbind(league, league[1:2,])
> dup.league
     team score win tie lost
1   Inter    59  17   8    3
2   Milan    58  17   7    4
3    Roma    53  15   8    5
4 Palermo    46  13   7    8
5   Inter    59  17   8    3
6   Milan    58  17   7    4

> duplicated(dup.league)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE

> dup.league[!duplicated(dup.league), ]
     team score win tie lost
1   Inter    59  17   8    3
2   Milan    58  17   7    4
3    Roma    53  15   8    5
4 Palermo    46  13   7    8

> unique(dup.league)
     team score win tie lost
1   Inter    59  17   8    3
2   Milan    58  17   7    4
3    Roma    53  15   8    5
4 Palermo    46  13   7    8
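
Rows can also be treated as duplicates when they agree on a key column only. A sketch on hypothetical data in which one team appears twice with different scores:

```r
d <- data.frame(team = c("Inter", "Milan", "Inter"),
                score = c(59, 58, 60))

duplicated(d)        # FALSE FALSE FALSE: no two rows are fully identical
duplicated(d$team)   # FALSE FALSE  TRUE

keep_first <- d[!duplicated(d$team), ]                  # Inter 59, Milan 58
keep_last <- d[!duplicated(d$team, fromLast = TRUE), ]  # Milan 58, Inter 60
```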

Sort vectors and data frames with sort and order:

a = sample(1:100, 10)
> a
 [1]  4 78 93 42 67 69 52 33 97 89

> sort(a)
 [1]  4 33 42 52 67 69 78 89 93 97

> sort(a, decreasing=TRUE)
 [1] 97 93 89 78 69 67 52 42 33  4

> order(a)
 [1]  1  8  4  7  5  6  2 10  3  9

> a[order(a)]
 [1]  4 33 42 52 67 69 78 89 93 97

> serieA[, c("team", "score", "F", "S")]
         team score  F  S
1       Inter    63 58 28
2        Roma    62 56 35
3       Milan    60 49 29
4     Palermo    51 46 38
5    Juventus    48 48 44
6      Napoli    48 41 36
7   Sampdoria    48 38 37
8  Fiorentina    44 43 36
9       Genoa    44 51 51
10       Bari    43 38 37
11      Parma    42 31 38
12   Cagliari    40 48 47
13     Chievo    38 27 29
14    Bologna    35 34 44
15    Catania    35 34 36
16      Lazio    33 27 33
17    Udinese    32 38 49
18   Atalanta    28 29 42
19      Siena    26 32 53
20    Livorno    25 21 47

> serieA[order(serieA$score), c("team", "score", "F", "S")]
         team score  F  S
20    Livorno    25 21 47
19      Siena    26 32 53
18   Atalanta    28 29 42
17    Udinese    32 38 49
16      Lazio    33 27 33
14    Bologna    35 34 44
15    Catania    35 34 36
13     Chievo    38 27 29
12   Cagliari    40 48 47
11      Parma    42 31 38
10       Bari    43 38 37
8  Fiorentina    44 43 36
9       Genoa    44 51 51
5    Juventus    48 48 44
6      Napoli    48 41 36
7   Sampdoria    48 38 37
4     Palermo    51 46 38
3       Milan    60 49 29
2        Roma    62 56 35
1       Inter    63 58 28

> serieA[order(serieA$score, serieA$F, -serieA$S, decreasing=TRUE), c("team", "score", "F", "S")]
         team score  F  S
1       Inter    63 58 28
2        Roma    62 56 35
3       Milan    60 49 29
4     Palermo    51 46 38
5    Juventus    48 48 44
6      Napoli    48 41 36
7   Sampdoria    48 38 37
9       Genoa    44 51 51
8  Fiorentina    44 43 36
10       Bari    43 38 37
11      Parma    42 31 38
12   Cagliari    40 48 47
13     Chievo    38 27 29
15    Catania    35 34 36
14    Bologna    35 34 44
16      Lazio    33 27 33
17    Udinese    32 38 49
18   Atalanta    28 29 42
19      Siena    26 32 53
20    Livorno    25 21 47
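
In the last example, S is negated so that, with decreasing=TRUE, ties are broken in favor of the smaller S value. Negation works only for numeric keys. For a character key, one option is a vector-valued decreasing argument, which requires method="radix". A sketch on a hypothetical four-team table:

```r
d <- data.frame(team = c("Napoli", "Juventus", "Inter", "Sampdoria"),
                score = c(48, 48, 63, 48))

# Score descending, then team name ascending.
idx <- order(d$score, d$team, decreasing = c(TRUE, FALSE), method = "radix")
ranked <- d[idx, ]

as.character(ranked$team)   # "Inter" "Juventus" "Napoli" "Sampdoria"
```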