Basics

R

R is a free software environment for statistical computing and graphics. It is an essential language for data scientists. Its strengths include:

  1. capability: it offers a vast set of functionalities
  2. community: it has a large, ever-growing community of users
  3. performance: it is fast when the data fits in main memory

RStudio

RStudio is an integrated development environment (IDE) for R. It includes:

  • a console
  • a syntax-highlighting editor that supports direct code execution
  • tools for plotting, history, debugging and workspace management

Help and packages

  • get help, get help, get help! Use Stack Overflow for specific questions
# help on log function
?log
  • R comes with a number of packages; some of them are loaded by default
# install or update a package (only once!)
install.packages("igraph")

# load a package (when you need it)
library(igraph)

# list all packages where an update is available
old.packages()

# update all available packages
update.packages()

Basic arithmetic and logic operators

  • arithmetic: addition (+), subtraction (-), multiplication (*), division (/), integer division (%/%), modulus (%%), exponentiation (^)
  • comparison: equal (==), different (!=), less than (<), greater than (>), less than or equal to (<=), greater than or equal to (>=)
  • logic operators: conjunction (&), disjunction (|), negation (!), exclusive disjunction (xor)
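
As a quick illustration of the less familiar operators in the list above, here is a minimal sketch; the numbers are arbitrary and were chosen only for this example:

```r
# integer division and modulus
7 %/% 2
## [1] 3
7 %% 2
## [1] 1
# the modulus takes the sign of the divisor
(-7) %% 2
## [1] 1
# exponent
2 ^ 10
## [1] 1024
# comparison combined with logic operators
(3 < 5) & (5 < 4)
## [1] FALSE
xor(TRUE, FALSE)
## [1] TRUE
```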

Play

Explain the following mismatch between math and R:

\[ (\sqrt{2}) ^ 2 \stackrel{?}{=} 2\]

sqrt(2) ^ 2 == 2
## [1] FALSE

Solution

The computer uses finite-precision binary arithmetic, and the binary representation of \(\sqrt{2}\) has infinitely many digits, so it is rounded. The rounding error then propagates to the square.
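
In practice, floating-point numbers are usually compared up to a tolerance rather than with ==. A minimal sketch using the base function all.equal; the explicit threshold 1e-9 is an arbitrary choice made for this example:

```r
# exact comparison fails
sqrt(2) ^ 2 == 2
## [1] FALSE
# the difference is tiny but not zero
abs(sqrt(2) ^ 2 - 2) < 1e-9
## [1] TRUE
# compare up to a numerical tolerance
isTRUE(all.equal(sqrt(2) ^ 2, 2))
## [1] TRUE
```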

Play

Define the xor operator in terms of conjunction (&), disjunction (|), and negation (!).

Solution

x = TRUE
y = TRUE
# first solution
(x | y) & !(x & y)
## [1] FALSE
# second solution
(x & !y) | (y & !x)
## [1] FALSE
x = TRUE
y = FALSE
(x | y) & !(x & y)
## [1] TRUE
(x & !y) | (y & !x)
## [1] TRUE

Special values

  • the value NA (not available) is used to represent missing values;
  • the value NULL is the null object (not to be confused with NULL in databases);
  • the value Inf stands for positive infinity;
  • the value NaN (not a number) is the result of a computation that makes no sense.

Special values

NA & TRUE
## [1] NA
NA & FALSE
## [1] FALSE
NA | TRUE
## [1] TRUE
NA | FALSE
## [1] NA
!NA
## [1] NA
2^1024
## [1] Inf
1/0
## [1] Inf
0 / 0
## [1] NaN
Inf - Inf
## [1] NaN
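
Special values cannot be detected reliably with ==, so base R provides dedicated test functions. The following sketch shows the most common ones; the vector x is made up for the example:

```r
x = c(1, NA, 3)
is.na(x)
## [1] FALSE  TRUE FALSE
# NA propagates through computations unless it is removed
sum(x)
## [1] NA
sum(x, na.rm = TRUE)
## [1] 4
# comparing with NA yields NA, not TRUE: use is.na() instead
NA == NA
## [1] NA
is.nan(0 / 0)
## [1] TRUE
is.finite(1 / 0)
## [1] FALSE
is.null(NULL)
## [1] TRUE
```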

Variables

Of course, you may use variables to store values. There are 3 equivalent ways to assign a value to a variable:

x = 42  # my favourite
x <- 42 # the conventional one, recommended by most style guides
42 -> x # used rarely

# print x
x
## [1] 42
# print structure of x (with type)
str(x)
##  num 42

Atomic types

R has four main atomic types:

# double (double-precision number)
x = 108.801
typeof(x)
## [1] "double"
# integer (integer number)
x = 108L
typeof(x)
## [1] "integer"
# character (a string of characters)
x = "108L"
typeof(x)
## [1] "character"
# logical (a Boolean, either TRUE or FALSE)
x = TRUE
typeof(x)
## [1] "logical"

Data structures

Outline

The main data structures used in R include:

  • atomic vector
  • list
  • matrix
  • data frame

They can be classified by dimension and by whether their elements share the same type:

Dim   Homogeneous     Heterogeneous
1d    atomic vector   list
2d    matrix          data frame

Atomic vectors

A vector is a sequence of elements with the same type. Vector indexes start at 1 (not 0).

# create a vector with c() function
c(1, 3, 5, 7)
## [1] 1 3 5 7
# concatenate vectors
c(c(1, 3), c(5, 7))
## [1] 1 3 5 7
# element-wise sum
c(1, 2, 3, 4) + c(10, 20, 30, 40)
## [1] 11 22 33 44
# recycling
10 + c(1, 2, 3, 4)
## [1] 11 12 13 14
# element-wise product
c(1, 2, 3, 4) * c(10, 20, 30, 40)
## [1]  10  40  90 160
# recycling
10 * c(1, 2, 3, 4)
## [1] 10 20 30 40
# scalar product (the result is a 1x1 matrix)
c(1, 2, 3, 4) %*% c(10, 20, 30, 40)
##      [,1]
## [1,]  300

Atomic vectors

x = c(TRUE, FALSE, TRUE, FALSE)
(y = !x) # also prints result
## [1] FALSE  TRUE FALSE  TRUE
x & y
## [1] FALSE FALSE FALSE FALSE
x | y
## [1] TRUE TRUE TRUE TRUE
xor(x, y)
## [1] TRUE TRUE TRUE TRUE

Indexing

You may refer to members of a vector in several ways:

primes = c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
primes[5]
## [1] 11
primes[c(1, 5, 10)]
## [1]  2 11 29
primes[-1]
## [1]  3  5  7 11 13 17 19 23 29
primes[-c(1, 5, 10)]
## [1]  3  5  7 13 17 19 23
primes > 15
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
primes[primes > 15]
## [1] 17 19 23 29
# modify the vector
primes[primes > 15] = Inf
primes
##  [1]   2   3   5   7  11  13 Inf Inf Inf Inf

Coercion (type cast)

All elements of an atomic vector must have the same type, so when you combine different types they are coerced to the most flexible one. Types from least to most flexible are: logical, integer, double, and character.

x = c(TRUE, TRUE, FALSE, FALSE)
# how many TRUE?
sum(x)
## [1] 2
# how many TRUE on average
mean(x)
## [1] 0.5
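
The coercion order can be observed directly with typeof. A short sketch with values picked only for illustration; the last two calls show explicit coercion with the base functions as.integer and as.logical:

```r
# logical + integer -> integer
typeof(c(TRUE, 1L))
## [1] "integer"
# integer + double -> double
typeof(c(1L, 2.5))
## [1] "double"
# anything + character -> character
typeof(c(2.5, "a"))
## [1] "character"
# explicit coercion
as.integer("42")
## [1] 42
as.logical(c(0, 1, 2))
## [1] FALSE  TRUE  TRUE
```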

Named vectors

Vector elements can have names:

x = c(a = 1, b = 2, c = 3)
# or
x = c(1, 2, 3)
names(x) = c("a", "b", "c")
x["a"]
## a 
## 1
x[c("a", "b")]
## a b 
## 1 2

Play

Given the vector of integers from 0 to 100 (hint: use the : operator to generate it), select all numbers that are:

  • even
  • even or divisible by 5
  • odd and divisible by 7

Solution

# vector
x = 0:100

# even 
x[x %% 2 == 0]
##  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36
## [20]  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74
## [39]  76  78  80  82  84  86  88  90  92  94  96  98 100
# even or divisible by 5
x[x %% 2 == 0 | x %% 5 == 0]
##  [1]   0   2   4   5   6   8  10  12  14  15  16  18  20  22  24  25  26  28  30
## [20]  32  34  35  36  38  40  42  44  45  46  48  50  52  54  55  56  58  60  62
## [39]  64  65  66  68  70  72  74  75  76  78  80  82  84  85  86  88  90  92  94
## [58]  95  96  98 100
# odd and divisible by 7
x[x %% 2 == 1 & x %% 7 == 0]
## [1]  7 21 35 49 63 77 91

Play

Write a logical condition that is TRUE if the number is prime (Hint: take advantage of the all function).

Solution

n = 109
n == 2L || all(n %% 2:(n-1) != 0)
## [1] TRUE
n = 111
n == 2L || all(n %% 2:(n-1) != 0)
## [1] FALSE
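
The same condition can be wrapped in a reusable function. The sketch below is a variant, not the original solution: the name is_prime is hypothetical, and it only tests divisors up to the square root of n, which is enough because any larger divisor is paired with a smaller one.

```r
# hypothetical helper: TRUE if n is a prime number
is_prime = function(n) {
  if (n < 2) return(FALSE)
  if (n < 4) return(TRUE)   # 2 and 3 are prime
  # no divisor between 2 and floor(sqrt(n))
  all(n %% 2:floor(sqrt(n)) != 0)
}

is_prime(109)
## [1] TRUE
is_prime(111)
## [1] FALSE
# keep only the primes up to 20
Filter(is_prime, 1:20)
## [1]  2  3  5  7 11 13 17 19
```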

Factors

A factor is a vector that can contain only predefined values, and is used to store categorical variables (for instance sex or religion).

Factors are built on top of integer vectors using the levels attribute, which defines the set of allowed values.

x = factor(c("male", "female", "female", "male", "male"))
x
## [1] male   female female male   male  
## Levels: female male
typeof(x)
## [1] "integer"
levels(x)
## [1] "female" "male"
# if you assign a value that is not a level,
# a warning is issued and an NA is generated
x[1] = "unknown"
x
## [1] <NA>   female female male   male  
## Levels: female male
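
If a value is expected to appear later, it can be declared up front through the levels argument of factor. A minimal sketch, where the extra level "unknown" is an assumption made for this example:

```r
x = factor(c("male", "female", "female", "male", "male"),
           levels = c("female", "male", "unknown"))
levels(x)
## [1] "female"  "male"    "unknown"
# allowed now: no warning and no NA
x[1] = "unknown"
as.character(x)
## [1] "unknown" "female"  "female"  "male"    "male"
# the underlying integer codes index the levels
as.integer(x)
## [1] 3 1 1 2 2
```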

Lists

A list is a sequence of elements that might have different types.

# create a list
l = list(thing = "hat", size = 8.25, female = TRUE)

# print the list
l
## $thing
## [1] "hat"
## 
## $size
## [1] 8.25
## 
## $female
## [1] TRUE
str(l)
## List of 3
##  $ thing : chr "hat"
##  $ size  : num 8.25
##  $ female: logi TRUE
# an element
l$thing
## [1] "hat"
l[[1]]
## [1] "hat"
# a sublist
l[c("thing", "size")]
## $thing
## [1] "hat"
## 
## $size
## [1] 8.25
l[c(1, 2)]
## $thing
## [1] "hat"
## 
## $size
## [1] 8.25

Lists

“If list x is a train carrying objects, then x[[5]] is the object in car 5; x[5] is car number 5.”

# a sublist containing the first element of the list
l[1]
## $thing
## [1] "hat"
typeof(l[1])
## [1] "list"
# the first element of the list
l[[1]]
## [1] "hat"
typeof(l[[1]])
## [1] "character"

Lists

List elements can be of any type, atomic or compound. Hence a list can contain other lists, making it a nested list.

l = list(1, list(1, 2, 3), list("a", 1, list("TRUE", "FALSE")))
str(l)
## List of 3
##  $ : num 1
##  $ :List of 3
##   ..$ : num 1
##   ..$ : num 2
##   ..$ : num 3
##  $ :List of 3
##   ..$ : chr "a"
##   ..$ : num 1
##   ..$ :List of 2
##   .. ..$ : chr "TRUE"
##   .. ..$ : chr "FALSE"

Play

Consider the list:

l = list(1, list(1, 2, 3), list("a", 1, list("TRUE", "FALSE")))

Find:

  • the list list(1, 2, 3)
  • the element 1 of list list(1, 2, 3)
  • the element TRUE of list list("TRUE", "FALSE")

Solution

l = list(1, list(1, 2, 3), list("a", 1, list("TRUE", "FALSE")))
l[[2]]
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
l[[2]][[1]]
## [1] 1
l[[3]][[3]][[1]]
## [1] "TRUE"

Play

Write a list containing the information of the Porphyrian Tree. Then select the insensitive part of the tree.

Solution

substance = 
  list(immaterial = "spirit", 
       material = list(
                   body = list(
                     inanimate = "mineral", 
                     animate = list(
                       living = list(
                         insensitive = "plant", 
                         sensitive = list(
                           irrational = "beast", 
                           rational = 
                             list(human = c("Arendt", "Butler", "Barad"))))))))

str(substance)
## List of 2
##  $ immaterial: chr "spirit"
##  $ material  :List of 1
##   ..$ body:List of 2
##   .. ..$ inanimate: chr "mineral"
##   .. ..$ animate  :List of 1
##   .. .. ..$ living:List of 2
##   .. .. .. ..$ insensitive: chr "plant"
##   .. .. .. ..$ sensitive  :List of 2
##   .. .. .. .. ..$ irrational: chr "beast"
##   .. .. .. .. ..$ rational  :List of 1
##   .. .. .. .. .. ..$ human: chr [1:3] "Arendt" "Butler" "Barad"
substance$material$body$animate$living$insensitive
## [1] "plant"

Matrices

A matrix is a 2-dimensional vector: all its elements have the same type, and they are arranged in rows and columns of equal length.

# by row
M = matrix(data = 1:9, nrow = 3, byrow = TRUE)
M
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
# by column (the default)
N = matrix(data = 1:9, ncol = 3)
N
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
nrow(M)
## [1] 3
ncol(M)
## [1] 3
dim(M)
## [1] 3 3
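
Under the hood, a matrix is an atomic vector with a dim attribute. The following sketch (the vector v is made up for this example) turns a plain vector into a matrix by setting its dimensions, and shows that both matrix-style and vector-style indexing reach the same element:

```r
v = 1:6
# v becomes a 2x3 matrix, filled by column
dim(v) = c(2, 3)
is.matrix(v)
## [1] TRUE
# matrix-style indexing
v[2, 3]
## [1] 6
# vector-style indexing reaches the same element
v[6]
## [1] 6
typeof(v)
## [1] "integer"
```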

Indexing

M
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
# element in row 1 and column 2
M[1, 2]
## [1] 2
# first row
M[1, ]
## [1] 1 2 3
# first column
M[ ,1]
## [1] 1 4 7
# sub-matrix
M[1:2, 1:2]
##      [,1] [,2]
## [1,]    1    2
## [2,]    4    5
M[-3, -3]
##      [,1] [,2]
## [1,]    1    2
## [2,]    4    5
# diagonal
diag(M)
## [1] 1 5 9

Add rows and columns

P = matrix(data = runif(9), nrow = 3, byrow = TRUE)
P
##           [,1]       [,2]       [,3]
## [1,] 0.9027998 0.20863525 0.06612994
## [2,] 0.5501463 0.75075613 0.74441485
## [3,] 0.2105642 0.08571193 0.40182803
# add column
cbind(P, c(0, 0, 0))
##           [,1]       [,2]       [,3] [,4]
## [1,] 0.9027998 0.20863525 0.06612994    0
## [2,] 0.5501463 0.75075613 0.74441485    0
## [3,] 0.2105642 0.08571193 0.40182803    0
# modify matrix
P
##           [,1]       [,2]       [,3]
## [1,] 0.9027998 0.20863525 0.06612994
## [2,] 0.5501463 0.75075613 0.74441485
## [3,] 0.2105642 0.08571193 0.40182803
P = cbind(P, c(0, 0, 0))
P
##           [,1]       [,2]       [,3] [,4]
## [1,] 0.9027998 0.20863525 0.06612994    0
## [2,] 0.5501463 0.75075613 0.74441485    0
## [3,] 0.2105642 0.08571193 0.40182803    0
# add row
P = rbind(P, c(0, 0, 0, 0))
P
##           [,1]       [,2]       [,3] [,4]
## [1,] 0.9027998 0.20863525 0.06612994    0
## [2,] 0.5501463 0.75075613 0.74441485    0
## [3,] 0.2105642 0.08571193 0.40182803    0
## [4,] 0.0000000 0.00000000 0.00000000    0

Operations on matrices

M
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
N
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# element-wise sum
M + N
##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    6   10   14
## [3,]   10   14   18
# element-wise product
M * N
##      [,1] [,2] [,3]
## [1,]    1    8   21
## [2,]    8   25   48
## [3,]   21   48   81
# matrix product
M %*% N
##      [,1] [,2] [,3]
## [1,]   14   32   50
## [2,]   32   77  122
## [3,]   50  122  194
# matrix transpose
M
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
t(M)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# matrix inverse
C = matrix(c(1,0,1, 1,1,1, 1,1,0), nrow=3, byrow=TRUE)
D = solve(C)
D
##      [,1] [,2] [,3]
## [1,]    1   -1    1
## [2,]   -1    1    0
## [3,]    0    1   -1
D %*% C
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
C %*% D
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
# linear systems C x = b
C
##      [,1] [,2] [,3]
## [1,]    1    0    1
## [2,]    1    1    1
## [3,]    1    1    0
b = c(2, 1, 3)
# the system is:
# x1      + x3 = 2
# x1 + x2 + x3 = 1
# x1 + x2      = 3
x = solve(C,b)
x
## [1]  4 -1 -2
C %*% x
##      [,1]
## [1,]    2
## [2,]    1
## [3,]    3
# matrix spectrum
spectrum = eigen(C)
# columns are the eigenvectors
spectrum$vectors
##            [,1]       [,2]       [,3]
## [1,] -0.4151581 -0.4743098 -0.6026918
## [2,] -0.7480890 -0.2110877  0.7515444
## [3,] -0.5176936  0.8546767  0.2682231
# eigenvalues
spectrum$values
## [1]  2.2469796 -0.8019377  0.5549581
# check 
x = spectrum$vectors[, 1]
lambda = spectrum$values[1]
lambda * x
## [1] -0.9328517 -1.6809406 -1.1632470
C %*% x
##            [,1]
## [1,] -0.9328517
## [2,] -1.6809406
## [3,] -1.1632470

Play

  1. verify that the trace of a matrix (the sum of the diagonal elements) is the sum of its eigenvalues
  2. compute the determinant of a matrix as the product of its eigenvalues (use function prod)

Solution

C = matrix(c(1,0,1, 1,1,1, 1,1,0), nrow=3, byrow=TRUE)

(v = eigen(C)$values)
## [1]  2.2469796 -0.8019377  0.5549581
sum(v)
## [1] 2
sum(diag(C))
## [1] 2
prod(v)
## [1] -1

Data frames

A data frame is a list of vectors of equal length (called columns). A data frame is like a database table:

  • each column has a name and contains elements of the same type
  • all columns have the same length, but different columns may have different types
name = c("John", "Samuel", "Uma", "Bruce", "Tim")
age = c(23, 31, 17, 41, 25)
married = c(TRUE, FALSE, FALSE, TRUE, TRUE)

pulp = data.frame(name, age, married)
pulp
##     name age married
## 1   John  23    TRUE
## 2 Samuel  31   FALSE
## 3    Uma  17   FALSE
## 4  Bruce  41    TRUE
## 5    Tim  25    TRUE
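
The claim that a data frame is a list of typed columns can be checked directly. In this sketch the data frame is rebuilt so that the block runs on its own; stringsAsFactors = FALSE is set explicitly as an assumption, so that the name column stays a character vector on R versions older than 4.0 too:

```r
name = c("John", "Samuel", "Uma", "Bruce", "Tim")
age = c(23, 31, 17, 41, 25)
married = c(TRUE, FALSE, FALSE, TRUE, TRUE)
pulp = data.frame(name, age, married, stringsAsFactors = FALSE)

# a data frame is a list...
is.list(pulp)
## [1] TRUE
# ...with one element per column
length(pulp)
## [1] 3
# rows and columns
dim(pulp)
## [1] 5 3
# each column has its own type
unname(sapply(pulp, typeof))
## [1] "character" "double"    "logical"
```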

Indexing

# first row
pulp[1, ]
##   name age married
## 1 John  23    TRUE
# first column
# matrix style
pulp[ ,1]
## [1] "John"   "Samuel" "Uma"    "Bruce"  "Tim"
pulp[, "name"]
## [1] "John"   "Samuel" "Uma"    "Bruce"  "Tim"
# list style (remember a data frame is a list)
pulp$name 
## [1] "John"   "Samuel" "Uma"    "Bruce"  "Tim"
pulp[[1]]
## [1] "John"   "Samuel" "Uma"    "Bruce"  "Tim"
# filtering
pulp[pulp$name == "Uma", ]
##   name age married
## 3  Uma  17   FALSE
pulp[pulp$age < 18, ]
##   name age married
## 3  Uma  17   FALSE
pulp[pulp$married, "name"]
## [1] "John"  "Bruce" "Tim"
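
Base R also offers subset and with, which look up column names inside the data frame rather than in the workspace. A short sketch of these alternatives, with the data frame rebuilt so that the block is self-contained:

```r
pulp = data.frame(
  name = c("John", "Samuel", "Uma", "Bruce", "Tim"),
  age = c(23, 31, 17, 41, 25),
  married = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# rows where married is TRUE
subset(pulp, married)$name
## [1] "John"  "Bruce" "Tim"
# evaluate an expression using the columns of pulp
with(pulp, name[age < 18])
## [1] "Uma"
# sort the names by age
pulp[order(pulp$age), "name"]
## [1] "Uma"    "John"   "Tim"    "Samuel" "Bruce"
```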

Play

Extract from the pulp data frame the names of the adults who are not married.

Solution

pulp[!pulp$married & pulp$age >= 18, "name"]
## [1] "Samuel"

Nested data frames

Since a data frame is a list, and lists can contain other lists as elements, you can create nested data frames, that is data frames whose elements are data frames.

# a data frame
Venus = data.frame(
  x = c(17, 19), 
  y = c("Hello", "Venus")
)

# a data frame
Jupiter = data.frame(
  x = c(21, 23), 
  y = c("Hello", "Jupiter")
)

# a nested data frame
# I() treats the object ‘as is’
worlds = data.frame(
  x = I(list(Venus, Jupiter)), 
  y = c("Hello", "Worlds")
)

str(worlds)
## 'data.frame':    2 obs. of  2 variables:
##  $ x:List of 2
##   ..$ :'data.frame': 2 obs. of  2 variables:
##   .. ..$ x: num  17 19
##   .. ..$ y: chr  "Hello" "Venus"
##   ..$ :'data.frame': 2 obs. of  2 variables:
##   .. ..$ x: num  21 23
##   .. ..$ y: chr  "Hello" "Jupiter"
##   ..- attr(*, "class")= chr "AsIs"
##  $ y: chr  "Hello" "Worlds"
worlds$x[[1]]
##    x     y
## 1 17 Hello
## 2 19 Venus
worlds$x[[2]]
##    x       y
## 1 21   Hello
## 2 23 Jupiter

Programming

Conditional and repetition

R is a Turing-complete (functional) programming language.

It includes conditional statements:

x = 49
if (x %% 7 == 0) x else -x
## [1] 49

And loops:

x = 108
i = 2
while (i <= x/2) {
  if (x %% i == 0) print(i)
  i = i + 1
}
## [1] 2
## [1] 3
## [1] 4
## [1] 6
## [1] 9
## [1] 12
## [1] 18
## [1] 27
## [1] 36
## [1] 54
for (i in 2:(x/2)) {
  if (x %% i == 0) print(i)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 6
## [1] 9
## [1] 12
## [1] 18
## [1] 27
## [1] 36
## [1] 54

Kinds of loops

df = data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

# we know the sequence and output lengths
# create a vector of a given size
output = vector("double", ncol(df))  
for (i in seq_along(df)) {
  output[i] = mean(df[[i]])          
}

Kinds of loops

# we know the sequence length but we do NOT know the output length (slow solution)
means = c(0, 1, 2)
# a vector of doubles of length 0
output = double() 
for (i in seq_along(means)) {
  n = sample(1:100, 1)
  # dynamically increase the vector (slow)
  output = c(output, rnorm(n, means[i])) 
}

# create a list with length(means) elements (faster solution)
output = vector("list", length(means))
for (i in seq_along(means)) {
  n = sample(1:100, 1)
  output[[i]] = rnorm(n, means[i])
}
# unlist the list into a vector
output = unlist(output)  

Kinds of loops

# we do not know the sequence length
# iterate until a sequence of Heads of length difficulty is found
flips = 0
nheads = 0
difficulty = 10

while (nheads < difficulty) {
  if (sample(c("T", "H"), 1) == "H") {
    nheads = nheads + 1
  } else {
    nheads = 0
  }
  flips = flips + 1
}
flips
## [1] 2837

Avoid loops (if possible)

Most of the time you can perform your task by applying (vectorized) functions instead of writing loops. This is typically faster.

x = 1:100

# compute the sum (bad)
s = 0
for (i in 1:length(x)) {               
  s = s + x[i]
}
s
## [1] 5050
# compute the sum (good)
sum(x)
## [1] 5050
# even faster (closed form, valid only because x is 1:n)
n = length(x)
n * (n+1) / 2
## [1] 5050
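
The same idea applies to element-wise computations: arithmetic operators already work on whole vectors, so an explicit loop is rarely needed. A minimal sketch comparing the two styles on a made-up vector:

```r
x = 1:10

# loop version: preallocate, then fill
squares = numeric(length(x))
for (i in seq_along(x)) {
  squares[i] = x[i]^2
}

# vectorized version gives the same result
identical(squares, x^2)
## [1] TRUE
# vectorized filtering and aggregation: sum of the even elements
sum(x[x %% 2 == 0])
## [1] 30
```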

Functions

You may use built-in functions:

log
## function (x, base = exp(1))  .Primitive("log")
args(log)
## function (x, base = exp(1)) 
## NULL
log(x = 128, base = 2)
## [1] 7
log(base = 2, x = 128)
## [1] 7
log(128, 2)
## [1] 7
log(2, 128)
## [1] 0.1428571
log(128)
## [1] 4.85203

Or define your own functions:

euclidean = function(x=0, y=0) {sqrt(x^2 + y^2)}

euclidean(1, 1)
## [1] 1.414214
euclidean(1)
## [1] 1
euclidean()
## [1] 0

Or define your own binary operators using functions:

# xor
'%()%' = function(x, y) {(x | y) & !(x & y)}

TRUE %()% TRUE 
## [1] FALSE
TRUE %()% FALSE
## [1] TRUE
FALSE %()% TRUE
## [1] TRUE
FALSE %()% FALSE
## [1] FALSE

Functionals

Functions may be recursive:

factorial = function(x) {
 if (x == 0) 1 else x * factorial(x-1)
}
factorial(5)
## [1] 120

You may write functionals, that is, functions that take other functions as arguments:

# compute f(1) + f(2) + ... + f(n)
g = function(f, n) {
  total = 0   # avoid masking the built-in sum()
  for (i in seq_len(n)) total = total + f(i)
  return(total)
}
 
g(factorial, 5)
## [1] 153
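
The loop inside such a functional can itself be replaced by a functional. The sketch below is an alternative, with the hypothetical name g2, that uses the base function vapply; it assumes that f returns a single number for each input:

```r
# hypothetical loop-free variant of g
g2 = function(f, n) {
  # apply f to 1, 2, ..., n and sum the results
  sum(vapply(seq_len(n), f, numeric(1)))
}

g2(factorial, 5)
## [1] 153
# an anonymous function works as well: 1 + 4 + 9
g2(function(i) i^2, 3)
## [1] 14
```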

Apply-like functionals

An application of functionals and iteration is the set of apply-like functionals:

df = data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

# apply mean to each column of data frame, returns a list
lapply(df, mean)
## $a
## [1] -0.03719727
## 
## $b
## [1] -0.1672439
## 
## $c
## [1] -0.3510924
## 
## $d
## [1] 0.388815
# apply mean to each column of data frame, returns an atomic vector
sapply(df, mean)
##           a           b           c           d 
## -0.03719727 -0.16724386 -0.35109241  0.38881500
# apply to a vector
sapply(1:100, function(x) {x^2})
##   [1]     1     4     9    16    25    36    49    64    81   100   121   144
##  [13]   169   196   225   256   289   324   361   400   441   484   529   576
##  [25]   625   676   729   784   841   900   961  1024  1089  1156  1225  1296
##  [37]  1369  1444  1521  1600  1681  1764  1849  1936  2025  2116  2209  2304
##  [49]  2401  2500  2601  2704  2809  2916  3025  3136  3249  3364  3481  3600
##  [61]  3721  3844  3969  4096  4225  4356  4489  4624  4761  4900  5041  5184
##  [73]  5329  5476  5625  5776  5929  6084  6241  6400  6561  6724  6889  7056
##  [85]  7225  7396  7569  7744  7921  8100  8281  8464  8649  8836  9025  9216
##  [97]  9409  9604  9801 10000
mtx <- cbind(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)


# apply mean to each column of matrix, returns an atomic vector
apply(mtx, 2, mean)
##           a           b           c           d 
##  0.06472701 -0.38940661  0.45435349 -0.36427525
# apply mean to each row of matrix, returns an atomic vector
apply(mtx, 1, mean)
##  [1]  0.36985815  0.53755424  0.06308799 -0.81283658 -0.12011102  0.41473053
##  [7]  0.89464644 -1.02795479 -0.36259165 -0.54288674
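
For the common case of means, base R provides the specialized functions colMeans and rowMeans, and vapply is a stricter relative of sapply that checks the type of each result. A minimal sketch on a smaller matrix; the seed is arbitrary and only makes the example reproducible:

```r
set.seed(1)  # arbitrary seed
mtx = cbind(a = rnorm(10), b = rnorm(10))

# same results as apply() with mean
all.equal(apply(mtx, 2, mean), colMeans(mtx))
## [1] TRUE
all.equal(apply(mtx, 1, mean), rowMeans(mtx))
## [1] TRUE

# vapply declares that each result is a single number
df = as.data.frame(mtx)
all.equal(vapply(df, mean, numeric(1)), sapply(df, mean))
## [1] TRUE
```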

Play

Write a function that, given a square matrix \(A\) and an integer \(n \geq 0\), computes the power \(A^n\) (use diag function to build the identity matrix).

Solution

power = function(A, n) {
  k = nrow(A)
  I = diag(k)
  if (n == 0) return(I)
  if (n == 1) return(A)
  B = A
  for (i in 2:n) {
    B = B %*% A 
  }
  return(B)
}

A = matrix(c(1,2,0, 0,3,0, 0,5,1), nrow=3, byrow=TRUE)
power(A, 5)
##      [,1] [,2] [,3]
## [1,]    1  242    0
## [2,]    0  243    0
## [3,]    0  605    1
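
The solution above performs n - 1 matrix products. A different technique, exponentiation by squaring, needs only about log2(n) squarings. The sketch below is an alternative, not the original solution; the name fast_power is hypothetical:

```r
# hypothetical alternative: exponentiation by squaring
fast_power = function(A, n) {
  result = diag(nrow(A))        # identity matrix
  while (n > 0) {
    # multiply in the current power when the lowest bit of n is 1
    if (n %% 2 == 1) result = result %*% A
    A = A %*% A                 # square
    n = n %/% 2                 # drop the lowest bit
  }
  result
}

A = matrix(c(1,2,0, 0,3,0, 0,5,1), nrow=3, byrow=TRUE)
fast_power(A, 5)
##      [,1] [,2] [,3]
## [1,]    1  242    0
## [2,]    0  243    0
## [3,]    0  605    1
```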

Play

The determinant of a square matrix is the product of the eigenvalues of the matrix.

  1. write a function that, given a matrix \(A\) computes the determinant of \(A\) (use the function prod)
  2. we know that \[\det(A^n) = \det(A)^n\] Write a function that, given a matrix \(A\) and an integer \(n \geq 0\), computes the determinant of \(A^n\)

Solution

# note: this definition masks the built-in det()
det = function(A) {
  v = eigen(A)$values
  return (prod(v))
}

detn = function(A, n) {
  v = eigen(A)$values
  return (prod(v)^n)
}

(A = matrix(c(1,2,0, 0,3,0, 0,5,1), nrow=3, byrow=TRUE))
##      [,1] [,2] [,3]
## [1,]    1    2    0
## [2,]    0    3    0
## [3,]    0    5    1
det(A)
## [1] 3
detn(A, 5)
## [1] 243
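
As a sanity check, the eigenvalue-based determinant can be compared with the built-in function, called here as base::det so that it is found even if a function named det has been defined in the workspace. Re() is used as a precaution, since eigen may return complex values for non-symmetric matrices:

```r
A = matrix(c(1,2,0, 0,3,0, 0,5,1), nrow=3, byrow=TRUE)
v = eigen(A)$values

# product of the eigenvalues vs. built-in determinant
all.equal(Re(prod(v)), base::det(A))
## [1] TRUE
# the same check for the power rule det(A^n) = det(A)^n, with n = 5
all.equal(Re(prod(v))^5, base::det(A %*% A %*% A %*% A %*% A))
## [1] TRUE
```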

Plot

Barplot

# a data matrix
M = matrix(c(
   c(1200, 1190, 1100, 1120, 890),
   c(6200, 6690, 6700, 7120, 7150),
   c(8900, 8790, 8760, 8800, 9010),
   c(3300, 3490, 3660, 4300, 4510),
   c(2190, 2000, 1890, 1740, 1500)), ncol = 5
)  

# give names to rows
rownames(M) = 2014:2018
# give names to columns
colnames(M) = LETTERS[1:5]
M
##         A    B    C    D    E
## 2014 1200 6200 8900 3300 2190
## 2015 1190 6690 8790 3490 2000
## 2016 1100 6700 8760 3660 1890
## 2017 1120 7120 8800 4300 1740
## 2018  890 7150 9010 4510 1500
# barplot
barplot(M[1,])

# stacked barplot
barplot(M, legend=TRUE)

# grouped (side-by-side) barplot
barplot(M, beside=TRUE, legend=TRUE)

Histogram

# histogram
x = rnorm(1000)
hist(x, probability=TRUE, main="Histogram of a normal sample")
# add a rug (one tick mark per observation)
rug(x)

# density plot
plot(density(x), main="Density of a normal sample")
rug(x)

Boxplot

# boxplot
# If range is positive, the whiskers extend to the most extreme 
# data point  which is no more than range times the interquartile 
# range from the box.  A value of zero causes the whiskers to extend 
# to the data extremes.
x = rnorm(1000)
boxplot(x, range = 1.5)

boxplot(x, range = 0)

Scatter plot

# scatter plot
x = rnorm(100)
y = rnorm(100)
plot(x, y)

y = x + runif(100)
plot(x, y)

R Markdown

  • R Markdown provides a unified authoring framework for data science, combining your code, its results, and your prose commentary
  • R Markdown documents are fully reproducible and support many output formats, like HTML, PDF, and slideshows

R Markdown

R Markdown files are designed to be used in three ways:

  1. for communicating to decision makers, who want to focus on the conclusions, not the code behind the analysis
  2. for collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (the code)
  3. as an environment in which to do data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking

Play

  1. open a new R Markdown document in RStudio
  2. generate an output document by clicking on the Knit button
  3. read the source R Markdown document and compare it with the rendered output document
  4. modify the source R Markdown with something new you’ve learnt

Dig deeper