Data import

Tibbles are data frames, but they tweak some older behaviours to make life a little easier. They are defined in the tibble package.

Most R packages use regular data frames, so you might want to coerce a data frame to a tibble:

library(tibble)
as_tibble(iris)

## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

You can create a new tibble from individual vectors with tibble(); it will automatically recycle inputs of length 1, and allows you to refer to variables that you just created:

tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)

## # A tibble: 5 x 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26
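The recycling rule is narrower than in base R: only inputs of length 1 are recycled. The following is a minimal sketch with made-up values; the names `ok` and `failed` are illustrative only.

```r
library(tibble)

# Length-1 inputs are recycled to match the other columns,
# and later columns can refer to earlier ones.
ok <- tibble(x = 1:4, y = "a", z = x * 2)

# Inputs of any other mismatched length are not recycled: tibble() raises an error.
# tryCatch() turns that error into a flag so the script keeps running.
failed <- tryCatch(
  {
    tibble(x = 1:4, y = 1:2)
    FALSE
  },
  error = function(e) TRUE
)
```

Here `ok` has four rows with `y` repeated in each one, while `failed` is `TRUE` because a length-2 input cannot be combined with a length-4 one.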

If you’re already familiar with data frames, note that tibbles do much less:

they never change the names of variables
they never do partial matching when subsetting

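# they never change the names of variables
# tibble() does not rename the duplicate; recent versions raise an error instead
tibble(a = 1:10, a = 11:20)
# data.frame() silently renames the second column to a.1
data.frame(a = 1:10, a = 11:20)

# tibble() keeps the non-syntactic name as it is
names(tibble(`crazy name` = 1))
# data.frame() converts it to crazy.name
names(data.frame(`crazy name` = 1))

# they never do partial matching when subsetting
tb <- tibble(abc = 1:10)
df <- data.frame(abc = 1:10)
tb$a  # NULL, with a warning about an unknown column
df$a  # partial match: returns the abc column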

Here you’ll learn how to read plain-text rectangular files, in particular comma-separated values (CSV) files, into R using the readr package.

library(readr)

Read a CSV file with read_csv(); it prints out a column specification that gives the name and type of each column:

heights <- read_csv("http://users.dimi.uniud.it/~massimo.franceschet/ns/plugandplay/import/heights.csv")
heights

## # A tibble: 1,192 x 6
##     earn height sex       ed   age race    
##    <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>   
##  1 50000   74.4 male      16    45 white   
##  2 60000   65.5 female    16    58 white   
##  3 30000   63.6 female    16    29 white   
##  4 50000   63.1 female    16    91 other   
##  5 51000   63.4 female    17    39 white   
##  6  9000   64.4 female    15    26 white   
##  7 29000   61.7 female    12    49 white   
##  8 32000   72.7 male      17    46 white   
##  9  2000   72.0 male      15    21 hispanic
## 10 27000   72.2 male      12    26 white   
## # … with 1,182 more rows
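The column specification reflects types that read_csv() guessed from the data. As a minimal sketch, assuming you would rather state the types yourself, the `col_types` argument accepts a `cols()` specification; the inline data and the column names `id`, `score` and `label` below are invented for illustration.

```r
library(readr)

# Hypothetical inline data; a string containing a newline is read as literal CSV text
csv_text <- "id,score,label\n1,2.5,a\n2,3.5,b"

# Let readr guess every column's type: whole numbers are guessed as double
guessed <- read_csv(csv_text, col_types = cols())

# State the types explicitly: id is now parsed as an integer column
explicit <- read_csv(
  csv_text,
  col_types = cols(
    id = col_integer(),
    score = col_double(),
    label = col_character()
  )
)

class(guessed$id)
class(explicit$id)
```

Spelling out the types makes an import fail loudly when a file changes shape, instead of silently producing a column of a different type.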

You can also view the data frame with the View() function.

read_csv() uses the first line of the data (the header) for the column names, which is a very common convention. There are cases where you might want to tweak this behaviour:

# skip lines
read_csv("The first line of metadata
          The second line of metadata
          x,y,z
          1,2,3", 
         skip = 2)

## # A tibble: 1 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3

# skip comments    
read_csv("# A comment I want to skip
          x,y,z
          1,2,3", 
         comment = "#")

## # A tibble: 1 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3

# no header
read_csv("1,2,3\n4,5,6", col_names = FALSE)

## # A tibble: 2 x 3
##      X1    X2    X3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))

## # A tibble: 2 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6
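These options can be combined. The sketch below, using invented inline data, skips an unwanted header line and supplies replacement names in a single call; `csv_text` and `renamed` are illustrative names.

```r
library(readr)

# Hypothetical data whose header uses names we want to replace
csv_text <- "Old Name 1,Old Name 2\n1,2\n3,4"

renamed <- read_csv(
  csv_text,
  skip = 1,                 # drop the original header line
  col_names = c("x", "y"),  # supply our own names instead
  col_types = cols()        # guess every column's type
)

names(renamed)
```

Because `col_names` is given as a character vector, every remaining line is treated as data, so `renamed` has two rows.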

You can specify which data to interpret as NA values as follows:

# NA as NA
read_csv("a,b,c\n1,NA,.")

## # A tibble: 1 x 3
##       a b     c    
##   <dbl> <lgl> <chr>
## 1     1 NA    .

# . as NA
read_csv("a,b,c\n1,NA,.", na = ".")

## # A tibble: 1 x 3
##       a b     c    
##   <dbl> <chr> <lgl>
## 1     1 NA    NA
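As the second result shows, setting `na` replaces the default markers, so the literal text NA is then read as a string. The `na` argument also accepts several markers at once. A minimal sketch with invented data, assuming `.` and `-` both denote missing values:

```r
library(readr)

# Hypothetical data that marks missing values with "." and "-"
csv_text <- "a,b\n1,.\n2,-\n3,4"

parsed <- read_csv(csv_text, na = c(".", "-"), col_types = cols())

# Both markers become NA, so column b can still be parsed as numeric
is.na(parsed$b)
```

To keep the default markers as well, list them alongside your own, for example `na = c("", "NA", ".", "-")`.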

The readr package also comes with a useful function for writing data back to disk: write_csv().

write_csv(heights, "heights.csv")

However, the type information of columns is lost when you save to CSV. This makes CSVs a little unreliable for caching interim results. One alternative is the RDS format (R’s custom binary format), which readr supports through write_rds() and read_rds():

write_rds(heights, "heights.rds")
read_rds("heights.rds")

## # A tibble: 1,192 x 6
##     earn height sex       ed   age race    
##    <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>   
##  1 50000   74.4 male      16    45 white   
##  2 60000   65.5 female    16    58 white   
##  3 30000   63.6 female    16    29 white   
##  4 50000   63.1 female    16    91 other   
##  5 51000   63.4 female    17    39 white   
##  6  9000   64.4 female    15    26 white   
##  7 29000   61.7 female    12    49 white   
##  8 32000   72.7 male      17    46 white   
##  9  2000   72.0 male      15    21 hispanic
## 10 27000   72.2 male      12    26 white   
## # … with 1,182 more rows
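The type loss is easiest to see with a column that has no plain-text equivalent, such as a factor. The following is a minimal sketch on a small invented tibble, written to temporary files; none of these names or values come from the heights data.

```r
library(tibble)
library(readr)

# A small tibble with an integer and a factor column (hypothetical data)
df <- tibble(id = 1:3, grp = factor(c("a", "b", "a")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(df, csv_path)
write_rds(df, rds_path)

from_csv <- read_csv(csv_path, col_types = cols())  # types are guessed again
from_rds <- read_rds(rds_path)                      # types are restored as saved

class(from_csv$grp)  # "character": the factor came back as plain text
class(from_rds$grp)  # "factor"
```

The CSV round trip also turns the integer `id` column into a double, because the types are guessed afresh from the text.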
