The goal of this exercise is to read complex data on wins and losses for all World Series games.
1. Read the help page for function scan. In particular, pay attention to the arguments what, skip, and nlines.
2. Use scan to read the data on wins and losses for all World Series games. Make a numeric vector for the years and a character vector for the patterns of wins and losses.
3. scan reads from left to right, but the dataset is organized by columns, so the years appear in a strange order. Use function order to order the data chronologically.
# Read the dataset with function scan
world_series <- scan("http://lib.stat.cmu.edu/datasets/wseries",
___, # - Skip the first 35 lines
___, # - Then read 23 lines of data
___) # - The data occurs in pairs: a year (numeric) and a pattern (character)
# Find the permutation that sorts the years chronologically (use function order)
perm <- order(___)
# Using the sorting permutation, make a data frame with the years and patterns in chronological order
world_series <- data.frame(year = ___, pattern = ___, stringsAsFactors = ___)
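The scan and order steps can be tried out on a small in-memory example first. This is only a sketch: the contents of toy_text (its years and win/loss strings) are invented for illustration and are not taken from the wseries file, and the text argument of scan stands in for the URL.

```r
# Invented toy input laid out by columns, like the real file:
# reading left to right gives the years out of order.
toy_text <- c(
  "Toy header line (skipped by skip = 1)",
  "1903 WLW 1905 LWW",
  "1904 LLW 1906 WWL",
  "Toy footer line (left unread by nlines = 2)"
)

# what is a list, so scan cycles through its components:
# a number goes to year, the next field goes to pattern, and so on.
toy <- scan(text = toy_text,
            what = list(year = numeric(), pattern = character()),
            skip = 1,     # skip the header line
            nlines = 2,   # then read two lines of data
            quiet = TRUE)

# order returns the permutation that sorts the years
perm <- order(toy$year)

# Apply the same permutation to both vectors so the pairs stay aligned
toy_df <- data.frame(year = toy$year[perm],
                     pattern = toy$pattern[perm],
                     stringsAsFactors = FALSE)
toy_df
```

Indexing both vectors with the same permutation, rather than sorting the years on their own, is what keeps each pattern attached to its year.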
The package readr guesses the type of each column heuristically: it reads the first 1000 rows and applies some (moderately conservative) rules to them. This challenging CSV illustrates some problems.
1. Read challenge.csv with read_csv(). You’ll see some problems. Print the data frame and notice the types of the columns.
2. Fix the problems, for example by supplying a column specification or by using the guess_max parameter.
# SOLUTION
library(readr)
# read with no column spec
challenge <- read_csv("challenge.csv")
# print challenge
challenge
# read with column spec
challenge <- read_csv("challenge.csv", col_types = cols(x = col_double(), y = col_date()))
# print challenge
challenge
# another solution
read_csv("challenge.csv", guess_max = 1001)
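To see why the guess can go wrong without needing challenge.csv itself, the situation can be imitated with a temporary file. This is a hedged sketch: the file, its 1002 rows, and the names toy, guessed, explicit, and wider are assumptions made up for illustration, and the type guessed for y may depend on the installed readr version.

```r
library(readr)

# Hypothetical toy file: y is missing in the first 1000 rows
# and only contains dates in the last two rows.
toy_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  x = c(1:1000, 0.5, 0.25),
  y = c(rep(NA, 1000), "2020-01-01", "2020-01-02")
)
write.csv(toy, toy_path, row.names = FALSE)

# 1. Let readr guess: the rows it inspects carry no information about y
guessed <- suppressWarnings(suppressMessages(read_csv(toy_path)))
sapply(guessed, class)

# 2. State the column types explicitly, so no guessing is involved
explicit <- read_csv(toy_path,
                     col_types = cols(x = col_double(), y = col_date()))
sapply(explicit, class)

# 3. Let readr inspect one more row before it guesses
wider <- suppressWarnings(suppressMessages(read_csv(toy_path, guess_max = 1001)))
sapply(wider, class)

unlink(toy_path)
```

An explicit column specification is the more robust of the two fixes, because it does not depend on where the first informative value happens to sit in the file.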