This is a conqueror’s proof assessment.

Warm-up

  1. load packages tidyverse, lubridate, hexbin and modelr
  2. import CSV files for football matches and team ratings into tibbles football_matches and football_ratings
  3. glance at the imported tibbles. Read the field notes for football_matches. In football_ratings, the variables are: Team (the team name), V (number of victories), P (number of draws), S (number of defeats), GF (goals for), GS (goals against).
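
A minimal sketch of the import-and-glance steps. The real CSV file names are not given here, so the example writes a tiny hypothetical file to a temporary path and reads it back; the rows and team names are made up for illustration. The remaining packages (lubridate, hexbin, modelr) would be loaded with library() in the same way.

```r
library(tidyverse)

# Hypothetical stand-in for the real ratings CSV, whose name is not given
csv_path <- tempfile(fileext = ".csv")
write_lines(
  c("Team,V,P,S,GF,GS",
    "Alfa,2,1,0,5,2",
    "Beta,0,1,2,2,5"),
  csv_path
)

# Import into a tibble, then take a quick look at columns and types
football_ratings <- read_csv(csv_path, show_col_types = FALSE)
glimpse(football_ratings)
```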

Transform

  1. select variables of interest, that is those from Date to AR
  2. create Date objects from the Date variable with the dmy function
  3. get team names and arrange them in alphabetical order
  4. add to football_matches columns HomeTeamId and AwayTeamId with numeric identifiers for the teams
  5. in football_matches, move columns Date, HomeTeam, AwayTeam, HomeTeamId, and AwayTeamId to the front
  6. add to football_ratings variables DG for the difference between goals for and goals against and PT for the number of points (a victory is 3 points, a draw is 1 point, a defeat is 0 points)
  7. add to football_ratings variable region with the region (north, center, or south Italy) of the city of the team
  8. arrange football_ratings by points
  9. add to football_ratings variable league with values: champions (from rank 1 to 3), europa (from rank 4 to 5), nothing (from rank 6 to 17), and retro (from rank 18 to 20)
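
One possible way to approach steps 2 to 6, 8 and 9 of the transform list, sketched on toy data. The team names and numbers are hypothetical, the helper column rank is introduced only for illustration, and sorting by decreasing points is an assumption (so that rank 1 is the team with the most points). Ties in points are not handled.

```r
library(tidyverse)
library(lubridate)

# Toy matches with hypothetical team names; Date is day/month/year text
football_matches <- tibble(
  Date = c("20/08/2016", "21/08/2016", "27/08/2016"),
  HomeTeam = c("Gamma", "Alfa", "Beta"),
  AwayTeam = c("Alfa", "Beta", "Gamma")
)

# Steps 2-3: parse the dates, then collect team names in alphabetical order
football_matches <- football_matches %>% mutate(Date = dmy(Date))
teams <- sort(unique(c(football_matches$HomeTeam, football_matches$AwayTeam)))

# Steps 4-5: a team id is its position in the sorted vector of names
football_matches <- football_matches %>%
  mutate(HomeTeamId = match(HomeTeam, teams),
         AwayTeamId = match(AwayTeam, teams)) %>%
  select(Date, HomeTeam, AwayTeam, HomeTeamId, AwayTeamId, everything())

# Toy ratings for 20 hypothetical teams playing 38 matches each
football_ratings <- tibble(
  Team = sprintf("Team%02d", 1:20),
  V = 20:1,
  P = rep(5L, 20),
  S = 38L - V - P,
  GF = 2L * V,
  GS = S
)

# Steps 6, 8, 9: goal difference, points, ranking, and league label
football_ratings <- football_ratings %>%
  mutate(DG = GF - GS, PT = 3 * V + P) %>%
  arrange(desc(PT)) %>%
  mutate(rank = row_number(),
         league = case_when(
           rank <= 3  ~ "champions",
           rank <= 5  ~ "europa",
           rank <= 17 ~ "nothing",
           TRUE       ~ "retro"
         ))
```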

Query

  1. compare descriptive statistics (mean, median, max and sd) of home and away goals
  2. compute the absolute number of teams, the relative number of teams, and the average number of points of teams grouped by region and arrange the result by absolute number of teams in decreasing order
  3. group matches by goal spread
  4. retrieve the busy months (those with more than 40 matches)
  5. retrieve the matches during the busy months
  6. retrieve matches played at home by teams that qualified for the champions league
  7. retrieve matches played between teams from the south
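
A hedged sketch of queries 1 to 5 on toy data. The goal columns FTHG and FTAG (home and away goals) are assumed names, the data are made up, and the threshold min_matches is lowered from 40 to 2 so that the tiny example returns something. Grouping by month alone assumes a single season, in which no calendar month occurs twice.

```r
library(tidyverse)
library(lubridate)

# Toy data; FTHG / FTAG are assumed column names for home / away goals
matches <- tibble(
  Date = as.Date("2016-08-20") + c(0, 1, 7, 30, 31, 60),
  FTHG = c(2, 0, 1, 3, 1, 2),
  FTAG = c(1, 0, 2, 0, 1, 2)
)
ratings <- tibble(
  Team = c("Alfa", "Beta", "Gamma", "Delta"),
  region = c("north", "north", "center", "south"),
  PT = c(70, 50, 40, 30)
)

# Query 1: the same statistics for home and away goals, side by side
goal_stats <- matches %>%
  summarise(across(c(FTHG, FTAG),
                   list(mean = mean, median = median, max = max, sd = sd)))

# Query 2: absolute and relative number of teams, mean points per region
region_summary <- ratings %>%
  group_by(region) %>%
  summarise(n_teams = n(), avg_points = mean(PT)) %>%
  mutate(share = n_teams / sum(n_teams)) %>%
  arrange(desc(n_teams))

# Query 3: number of matches for each goal spread (home minus away)
spread_counts <- matches %>% count(spread = FTHG - FTAG)

# Queries 4-5: busy months, then the matches played in those months
min_matches <- 2  # the exercise uses 40
matches_by_month <- matches %>% mutate(month = month(Date))
busy_months <- matches_by_month %>% count(month) %>% filter(n > min_matches)
busy_matches <- matches_by_month %>% semi_join(busy_months, by = "month")
```

The semi_join keeps only the matches whose month appears in busy_months, without adding columns from it.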

Visualize

  1. plot goals against points
  2. add variable region to the previous plot
  3. add global smoothed line to the previous plot
  4. make a barplot with variable region
  5. make a barplot with variables region and league
  6. make a boxplot with variables region and league and reorder region by median
  7. add the mean to the boxplot and reorder region by mean
  8. use stat_summary to display min, max and mean and reorder region by mean
  9. plot goals against points faceting over region
  10. plot goals against points faceting over league and region
  11. plot count over region and league
  12. plot histograms of home and away team goals as well as goal spread
  13. plot shots on target versus goals
  14. plot fouls committed versus yellow cards
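
A rough sketch of a few of the plots above, on toy data with hypothetical values. It reads "goals against points" as goals for (GF) on the x axis and points (PT) on the y axis, which is an assumption. Mapping colour inside geom_point only, rather than in ggplot, is what keeps the smoothed line global instead of one line per region.

```r
library(tidyverse)

# Toy ratings for 20 hypothetical teams
ratings <- tibble(
  Team = sprintf("Team%02d", 1:20),
  region = rep(c("north", "center", "south", "north"), 5),
  GF = 30 + 2 * (20:1) + rep(c(3, -2, 0, 1, -3), 4),
  PT = 15 + 3 * (20:1) + rep(c(-2, 1, 0, 2, -1), 4)
)

# Plots 1-3: scatter plot, coloured by region, with one global smoother
p_scatter <- ggplot(ratings, aes(x = GF, y = PT)) +
  geom_point(aes(colour = region)) +
  geom_smooth(se = FALSE)

# Plots 6-7: boxplot of points per region, regions ordered by median,
# with the mean of each region drawn as a cross
p_box <- ggplot(ratings, aes(x = reorder(region, PT, FUN = median), y = PT)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4) +
  labs(x = "region")

# Plot 8: min, max and mean per region, regions ordered by mean
p_range <- ggplot(ratings, aes(x = reorder(region, PT, FUN = mean), y = PT)) +
  stat_summary(fun.min = min, fun.max = max, fun = mean) +
  labs(x = "region")

# Plot 9: one scatter panel per region
p_facet <- ggplot(ratings, aes(x = GF, y = PT)) +
  geom_point() +
  facet_wrap(~ region)
```

Printing any of these objects, for example p_box, draws the plot.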

Program

  1. program a function that computes team ratings (victories, draws, defeats, points, goals for, goals against) from the team matches
  2. program a function that computes foresight prediction accuracy of match result using team points
  3. add the home field advantage (HFA) to the previous function. Is the accuracy improved with HFA? Visualize the accuracy with and without HFA
  4. program a function that computes team points after every match for a given team and visualize the temporal evolution of points for two given teams
  5. visualize the temporal evolution of points for all teams
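
A minimal sketch of tasks 1 to 3, assuming the matches tibble has columns HomeTeam, AwayTeam, FTHG and FTAG (assumed names for home and away goals). The function names compute_ratings and prediction_accuracy, the argument hfa, and the toy matches are all hypothetical. The prediction rule is one simple choice among many: the team with more points is predicted to win, with hfa points added to the home side. For a true foresight evaluation, the ratings passed in should be computed from matches played before the ones being predicted; the usage below reuses the same matches only to keep the example short.

```r
library(tidyverse)

# Ratings from matches: each match contributes one row per team
compute_ratings <- function(matches) {
  home <- matches %>% transmute(Team = HomeTeam, GF = FTHG, GS = FTAG)
  away <- matches %>% transmute(Team = AwayTeam, GF = FTAG, GS = FTHG)
  bind_rows(home, away) %>%
    group_by(Team) %>%
    # V, P, S must come first: they need the per-match GF and GS,
    # which the last two expressions replace with season totals
    summarise(V = sum(GF > GS), P = sum(GF == GS), S = sum(GF < GS),
              GF = sum(GF), GS = sum(GS)) %>%
    mutate(PT = 3 * V + P) %>%
    arrange(desc(PT))
}

# Share of matches whose result (H, D, A) is predicted by comparing points
prediction_accuracy <- function(matches, ratings, hfa = 0) {
  pts <- deframe(select(ratings, Team, PT))  # named vector: team -> points
  matches %>%
    mutate(diff = unname(pts[HomeTeam]) + hfa - unname(pts[AwayTeam]),
           predicted = case_when(diff > 0 ~ "H", diff < 0 ~ "A", TRUE ~ "D"),
           actual = case_when(FTHG > FTAG ~ "H", FTHG < FTAG ~ "A", TRUE ~ "D")) %>%
    summarise(accuracy = mean(predicted == actual)) %>%
    pull(accuracy)
}

# Toy usage with hypothetical teams
toy_matches <- tibble(
  HomeTeam = c("Alfa", "Beta", "Gamma"),
  AwayTeam = c("Beta", "Gamma", "Alfa"),
  FTHG = c(2, 0, 1),
  FTAG = c(1, 0, 3)
)
toy_ratings <- compute_ratings(toy_matches)
acc_plain <- prediction_accuracy(toy_matches, toy_ratings)
acc_hfa <- prediction_accuracy(toy_matches, toy_ratings, hfa = 1)
```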

Model

  1. model PT in terms of GF using linear regression. Plot residuals and sort teams by residuals. Which are the top-ranked and bottom-ranked teams?
  2. do the same for GS and DG. Which model is the best?
  3. model PT in terms of GF and GS using multiple linear regression. What is the difference with respect to the model of PT in terms of DG? Which of GF and GS contributes more to PT? Use the answer to suggest a market strategy for a team.
  4. model PT in terms of V, P and S using multiple linear regression. Explain the outcome.
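
A sketch of the modelling steps on toy ratings with made-up values, using lm and add_residuals from modelr (which adds a resid column). Comparing models by R-squared is one possible criterion, not necessarily the intended one. In the toy data every team plays 38 matches, so S = 38 - V - P is fully determined by V and P, and PT = 3 * V + P holds exactly; this is why the last model is a perfect fit and cannot estimate a separate coefficient for S.

```r
library(tidyverse)
library(modelr)

# Toy ratings for 20 hypothetical teams playing 38 matches each
ratings <- tibble(
  Team = sprintf("Team%02d", 1:20),
  V = 20:1,
  P = rep(c(4L, 6L, 8L, 10L), 5),
  S = 38L - V - P,
  GF = 25 + 2 * V + rep(c(3, -2, 0, 1, -3), 4),
  GS = 60 - V + rep(c(-1, 2, 0, -2, 1), 4),
  DG = GF - GS,
  PT = 3 * V + P
)

# Task 1: simple regression, then teams sorted by residual;
# a positive residual means more points than the goals scored suggest
mod_gf <- lm(PT ~ GF, data = ratings)
by_residual <- ratings %>%
  add_residuals(mod_gf) %>%
  arrange(desc(resid)) %>%
  select(Team, GF, PT, resid)

# Task 2: the same for GS and DG, compared here by R-squared
mod_gs <- lm(PT ~ GS, data = ratings)
mod_dg <- lm(PT ~ DG, data = ratings)
r2 <- c(GF = summary(mod_gf)$r.squared,
        GS = summary(mod_gs)$r.squared,
        DG = summary(mod_dg)$r.squared)

# Task 3: GF and GS get separate coefficients instead of a shared one
mod_gf_gs <- lm(PT ~ GF + GS, data = ratings)

# Task 4: S is a linear combination of the intercept, V and P,
# so lm reports NA for its coefficient
mod_vps <- lm(PT ~ V + P + S, data = ratings)
vps_coef <- coef(mod_vps)
```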