Dry run \(\alpha\)

The goal of this exercise is to read complex data on wins and losses for all World Series games.

  1. Read the R documentation for the function scan. In particular, pay attention to the arguments what, skip, and nlines.
  2. Use scan to read the data on wins and losses for all World Series games. Make a numeric vector for the years and a character vector for the patterns of wins and losses.
  3. The function scan reads from left to right, but the dataset is organized by columns, so the years appear in a strange order. Use the function order to sort the data chronologically.
  4. Finally, make a data frame with the year and pattern components.
# Read the dataset with function scan
world_series <- scan("http://lib.stat.cmu.edu/datasets/wseries",
                     ___, # - Skip the first 35 lines
                     ___, # - Then read 23 lines of data
                     ___) # - The data occurs in pairs: a year (numeric) and a pattern (character)


# find the permutation that sorts the years (use function order)
perm <- order(___)

# using the sorting permutation, make a data frame with the years and patterns in chronological order
world_series <- data.frame(year = ___, pattern = ___, stringsAsFactors = ___)
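
The steps above can be tried on a small scale before touching the real file. The following is only a sketch on made-up data: the text in toy_text and the names toy, perm, and toy_df are assumptions for illustration and are not part of the World Series dataset. It shows how what, skip, and nlines interact when each line holds several (year, pattern) pairs, and how order restores the chronological order.

```r
# Toy input, made up for illustration: one header line to skip, then two
# data lines that each hold two (year, pattern) pairs side by side.
toy_text <- "header line to skip
2001 WLW 2003 LLW
2002 WWL 2004 LWL"

# 'what' is a list template: scan cycles through its components, so the
# fields are read alternately as a number and as a string.
toy <- scan(text = toy_text,
            skip = 1,      # skip the header line
            nlines = 2,    # then read two lines of data
            what = list(year = numeric(0), pattern = character(0)),
            quiet = TRUE)

toy$year                   # years in left-to-right reading order

# order() returns the permutation that sorts the years
perm <- order(toy$year)

# apply the same permutation to both components
toy_df <- data.frame(year = toy$year[perm],
                     pattern = toy$pattern[perm],
                     stringsAsFactors = FALSE)
toy_df
```

Because scan reads across each line before moving to the next, toy$year comes back out of order, and indexing both components with perm keeps each pattern attached to its year.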

Dry run \(\beta\)

The package readr guesses the type of each column: by default it reads the first 1000 rows and applies some (moderately conservative) heuristics to them. This challenging CSV illustrates some problems with that approach.

  1. Download the CSV into your working directory.
  2. Read it with read_csv(). You’ll see some problems. Print the data frame and note the column types.
  3. Open the CSV file in RStudio and scroll down past line 1000. What do you see?
  4. Try to solve the problems using an appropriate column specification.
  5. Solve the problems using an appropriate value for the guess_max argument.
# SOLUTION

library(readr)

# read with no column spec
challenge <- read_csv("challenge.csv")

# print challenge
challenge

# read with column spec
challenge <- read_csv("challenge.csv", col_types = cols(x = col_double(), y = col_date()))

# print challenge
challenge

# another solution: let readr use one more row when guessing the column types
read_csv("challenge.csv", guess_max = 1001)
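
To check that a column specification really resolved the parsing problems, it can help to inspect the result with problems() and spec(). The sketch below assumes a tiny made-up file (written to a temporary path, with the hypothetical names toy_path and toy_csv) in which column y is empty at the top and only contains a date further down. It is not the challenge file itself.

```r
library(readr)

# Hypothetical miniature of the situation: y is empty in the first rows
# and holds a date only in the last row.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,", "2,", "3.5,2020-01-31"), toy_path)

# An explicit column specification, so nothing is left to guessing
toy_csv <- read_csv(toy_path,
                    col_types = cols(x = col_double(), y = col_date()))

problems(toy_csv)   # rows that failed to parse under this spec, if any
spec(toy_csv)       # the column specification that was actually used
toy_csv
```

With an explicit specification the result does not depend on how many rows readr inspects, which makes it the more reproducible of the two solutions.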