You can use the scan function to read data from a file into a vector. Let us read files vector.dat, vector.csv, and vector.txt:
> scan("vector.dat", what=integer()) Read 8 items [1] 2 3 5 7 11 13 17 19 # or equivalently: scan("vector.dat", 0)
> scan("vector.csv", what=character(), sep=",") Read 4 items [1] "Hello World!" "Hello" "World" "!" # or equivalently: scan("vector.csv", "a", sep=",")
> scan("vector.txt", what=character(), quote="\"") Read 4 items [1] "Hello World!" "Hello" "World" "!"
To read a matrix, first scan the data into a vector, and then map the vector into the desired matrix. Let us read matrix.dat:
> matrix(scan("matrix.dat", 0), nrow=4, byrow=TRUE) Read 12 items [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 [4,] 10 11 12
You can read the table teams.dat into a data frame:
> read.table("teams.dat", header=TRUE) team score win tie lost 1 Inter 59 17 8 3 2 Milan 58 17 7 4 3 Roma 53 15 8 5 4 Palermo 46 13 7 8
You can write data to a file (after converting it to a data frame) as follows:
league = read.table("teams.dat", header=TRUE) write.table(league, file="league.txt", quote=FALSE, row.names=FALSE)
To save a compressed version of an object:
save(league, file="league.dat")
To load the saved object:
load("league.dat")
The c and paste functions concatenate multiple vectors into a single vector with different effects:
x = letters[1:4] > x [1] "a" "b" "c" "d" y = LETTERS[1:4] > y [1] "A" "B" "C" "D" > c(x, y) [1] "a" "b" "c" "d" "A" "B" "C" "D" > paste(x, y) [1] "a A" "b B" "c C" "d D" > paste(x, y, sep="/") [1] "a/A" "b/B" "c/C" "d/D" > paste(x, collapse="-") [1] "a-b-c-d"
Functions cbind and rbind add columns and rows, respectively, to matrices and data frames:
m = matrix(data = 1:9, nrow=3, byrow=TRUE) > rbind(m, c(1, 1, 1)) [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 [4,] 1 1 1 > cbind(m, c(1, 1, 1)) [,1] [,2] [,3] [,4] [1,] 1 2 3 1 [2,] 4 5 6 1 [3,] 7 8 9 1 league = read.table("teams.dat", header=TRUE) > league team score win tie lost 1 Inter 59 17 8 3 2 Milan 58 17 7 4 3 Roma 53 15 8 5 4 Palermo 46 13 7 8 top = league[1:2,] bottom = league[3:4,] > top team score win tie lost 1 Inter 59 17 8 3 2 Milan 58 17 7 4 > bottom team score win tie lost 3 Roma 53 15 8 5 4 Palermo 46 13 7 8 > rbind(top, bottom) team score win tie lost 1 Inter 59 17 8 3 2 Milan 58 17 7 4 3 Roma 53 15 8 5 4 Palermo 46 13 7 8 goals = c(58, 56, 49, 46); > cbind(league, goals) team score win tie lost goals 1 Inter 59 17 8 3 58 2 Milan 58 17 7 4 56 3 Roma 53 15 8 5 49 4 Palermo 46 13 7 8 46
One way to take the subset of a data set is to use the bracket notation. Let us work with the ranking of Italian soccer Serie A. Use the first dimension in square brackets to filter the table rows, and the second dimension to select the table columns:
serieA = read.table("serieA.dat", header=TRUE) > serieA[serieA$score > 44 & serieA$F > 38, c("team", "score")] team score 1 Inter 63 2 Roma 62 3 Milan 60 4 Palermo 51 5 Juventus 48 6 Napoli 48
Equivalently, you may use the subset function:
subset(serieA, score > 44 & F > 38, c("team", "score"))
We can select a random sample of a data set as follows:
> sample(1:100, 10) [1] 84 18 42 39 11 4 32 70 71 54 # random permutation > sample(1:10, 10) [1] 4 2 9 6 10 8 1 7 5 3 # with element replacement > sample(c(0,1), 10, replace=TRUE) [1] 0 1 1 1 0 1 0 0 1 1
To merge information from different tables use the merge function. It implements the join operation of relational databases:
league = read.table("teams.dat", header=TRUE) story = read.table("story.dat", header=TRUE) > league team score win tie lost 1 Inter 59 17 8 3 2 Milan 58 17 7 4 3 Roma 53 15 8 5 4 Palermo 46 13 7 8 > story team city year 1 Inter Milano 1907 2 Milan Milano 1901 3 Roma Roma 1910 4 Palermo Palermo 1934 > merge(league, story) team score win tie lost city year 1 Inter 59 17 8 3 Milano 1907 2 Milan 58 17 7 4 Milano 1901 3 Palermo 46 13 7 8 Palermo 1934 4 Roma 53 15 8 5 Roma 1910 # which is equivalent to: merge(league, story, by.x=c("team"), by.y=c("team"))
A convenient function for transforming a data frame is transform:
> league team score win tie lost 1 Inter 59 17 8 3 2 Milan 58 17 7 4 3 Roma 53 15 8 5 4 Palermo 46 13 7 8 > transform(league, score = score - 3, win = win - 1, lost = lost + 1) team score win tie lost 1 Inter 56 16 8 4 2 Milan 55 16 7 5 3 Roma 50 14 8 6 4 Palermo 43 12 7 9
You can also use transform to add new variables:
> transform(league, old.score = win * 2 + tie) team score win tie lost old.score 1 Inter 59 17 8 3 42 2 Milan 58 17 7 4 41 3 Roma 53 15 8 5 38 4 Palermo 46 13 7 8 33
R provides a number of functions to summarizing data, aggregating records together to build a smaller data set. Function apply applies a function to the specified dimensions of an array:
a = 1:9; dim(a) = c(3, 3) > a [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 > apply(X=a, MARGIN=1, sum) [1] 12 15 18 > apply(X=a, MARGIN=2, max) [1] 3 6 9
Function tapply summarizes a vector using a specified partition of the vector elements and function to apply:
letter = c("A", "A", "A", "B", "B", "C") number = 1:6 f = data.frame(letter, number) > f letter number 1 A 1 2 A 2 3 A 3 4 B 4 5 B 5 6 C 6 > tapply(X = f$number, INDEX = list(f$letter), sum) A B C 6 9 6 # passing parameters to the aggregation function > tapply(X = f$number, INDEX = list(f$letter), sum, na.rm = TRUE) A B C 6 9 6 color = c("green", "blue", "green", "red", "red", "white") f = data.frame(letter, color, number) > f letter color number 1 A green 1 2 A blue 2 3 A green 3 4 B red 4 5 B red 5 6 C white 6 > tapply(f$number, list(f$letter, f$color), sum) blue green red white A 2 4 NA NA B NA NA 9 NA C NA NA NA 6
To count the number of observations that take on each possible value of a variable use function table:
> table(c(0,1,1,1,2,2,3)) 0 1 2 3 1 3 2 1
Let us tabulate the digits of Pi (courtesy of the Pi search page) after the decimal point:
> pi.digits = scan("pi.dat", 0) Read 10000 items > table(pi.digits) pi.digits 0 1 2 3 4 5 6 7 8 9 968 1026 1021 974 1012 1046 1021 970 948 1014
Let us work with categorial data:
pressure = c("low", "normal", "low", "low", "normal", "normal", "normal", "high", "high", "high") cholesterol = c("low", "low", "low", "low", "normal", "high", "normal", "high", "normal", "high") health = data.frame(pressure, cholesterol) > health pressure cholesterol 1 low low 2 normal low 3 low low 4 low low 5 normal normal 6 normal high 7 normal normal 8 high high 9 high normal 10 high high > table(health$pressure) high low normal 3 3 4 > table(health$cholesterol) high low normal 3 4 3 > table(health$pressure, health$cholesterol) high low normal high 2 0 1 low 0 3 0 normal 1 1 2
Find and remove duplicates with duplicated and unique functions:
> league team score win tie lost 1 Inter 59 17 8 3 2 Milan 58 17 7 4 3 Roma 53 15 8 5 4 Palermo 46 13 7 8 > duplicated(league) [1] FALSE FALSE FALSE FALSE dup.league = rbind(league, league[1:2,]) > dup.league team score win tie lost 1 Inter 59 17 8 3 2 Milan 58 17 7 4 3 Roma 53 15 8 5 4 Palermo 46 13 7 8 5 Inter 59 17 8 3 6 Milan 58 17 7 4 > duplicated(dup.league) [1] FALSE FALSE FALSE FALSE TRUE TRUE > dup.league[!duplicated(dup.league), ] team score win tie lost 1 Inter 59 17 8 3 2 Milan 58 17 7 4 3 Roma 53 15 8 5 4 Palermo 46 13 7 8 > unique(dup.league) team score win tie lost 1 Inter 59 17 8 3 2 Milan 58 17 7 4 3 Roma 53 15 8 5 4 Palermo 46 13 7 8
Sort vectors and data frames with sort and order:
a = sample(1:100, 10) > a [1] 4 78 93 42 67 69 52 33 97 89 > sort(a) [1] 4 33 42 52 67 69 78 89 93 97 > sort(a, decreasing=TRUE) [1] 97 93 89 78 69 67 52 42 33 4 > order(a) [1] 1 8 4 7 5 6 2 10 3 9 > a[order(a)] [1] 4 33 42 52 67 69 78 89 93 97 > serieA[, c("team", "score", "F", "S")] team score F S 1 Inter 63 58 28 2 Roma 62 56 35 3 Milan 60 49 29 4 Palermo 51 46 38 5 Juventus 48 48 44 6 Napoli 48 41 36 7 Sampdoria 48 38 37 8 Fiorentina 44 43 36 9 Genoa 44 51 51 10 Bari 43 38 37 11 Parma 42 31 38 12 Cagliari 40 48 47 13 Chievo 38 27 29 14 Bologna 35 34 44 15 Catania 35 34 36 16 Lazio 33 27 33 17 Udinese 32 38 49 18 Atalanta 28 29 42 19 Siena 26 32 53 20 Livorno 25 21 47 > serieA[order(serieA$score), c("team", "score", "F", "S")] team score F S 20 Livorno 25 21 47 19 Siena 26 32 53 18 Atalanta 28 29 42 17 Udinese 32 38 49 16 Lazio 33 27 33 14 Bologna 35 34 44 15 Catania 35 34 36 13 Chievo 38 27 29 12 Cagliari 40 48 47 11 Parma 42 31 38 10 Bari 43 38 37 8 Fiorentina 44 43 36 9 Genoa 44 51 51 5 Juventus 48 48 44 6 Napoli 48 41 36 7 Sampdoria 48 38 37 4 Palermo 51 46 38 3 Milan 60 49 29 2 Roma 62 56 35 1 Inter 63 58 28 > serieA[order(serieA$score, serieA$F, -serieA$S, decreasing=TRUE), c("team", "score", "F", "S")] team score F S 1 Inter 63 58 28 2 Roma 62 56 35 3 Milan 60 49 29 4 Palermo 51 46 38 5 Juventus 48 48 44 6 Napoli 48 41 36 7 Sampdoria 48 38 37 9 Genoa 44 51 51 8 Fiorentina 44 43 36 10 Bari 43 38 37 11 Parma 42 31 38 12 Cagliari 40 48 47 13 Chievo 38 27 29 15 Catania 35 34 36 14 Bologna 35 34 44 16 Lazio 33 27 33 17 Udinese 32 38 49 18 Atalanta 28 29 42 19 Siena 26 32 53 20 Livorno 25 21 47