Overview

  • the goal of a model is to provide a simple low-dimensional summary of a dataset
  • ideally, the model will capture true signals (i.e. patterns generated by the phenomenon of interest), and ignore noise (i.e. random variation that you’re not interested in)
  • we are going to use models as a tool for exploration (hypothesis generation), not for confirming that a hypothesis is true (hypothesis confirmation)

Models

We’re going to use models to partition data into patterns (signal) and residuals (noise).

\[observed = pattern + residual\]

There are two parts to a model:

  1. First, you define a family of models that express a precise, but generic, pattern that you want to capture. For example, the pattern might be a straight line, or a quadratic curve. You will express the model family as an equation like \[y = a_1 \cdot x + a_2\] or \[y = a_1 \cdot x^{a_2}\] Here, \(x\) and \(y\) are known variables from your data, and \(a_1\) and \(a_2\) are parameters that can vary to capture different patterns.

  2. Next, you generate a fitted model by finding the model from the family that is the closest to your data. This takes the generic model family and makes it specific, like \[y = 3 \cdot x + 7\] or \[y = 9 \cdot x ^ 2\]
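
As a rough illustration of this two-step process, here is a minimal sketch that searches the family \(y = a_1 \cdot x + a_2\) for the member closest to the data by minimizing the root-mean-squared residual with optim(). It uses the sim1 dataset from modelr (introduced in the code section below); the helper name measure_distance is made up for illustration.

library(modelr)

# distance between a candidate model y = a1 * x + a2 and the data
measure_distance <- function(params, data) {
  predicted <- params[1] * data$x + params[2]
  sqrt(mean((data$y - predicted)^2))
}

# let optim() search the parameter space for the closest model
best <- optim(c(0, 0), measure_distance, data = sim1)
best$par # fitted slope a1 and intercept a2, close to the lm() coefficients shown later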

The map is not the territory

It’s important to understand that a fitted model is just the closest model from a family of models.

That implies that you have the best model within the chosen family; it doesn’t imply that you have a good model in general.

All models are wrong, but some are useful

Linear regression

  • linear regression is a linear approach to modelling the relationship between a response variable (or dependent variable) and one or more explanatory variables (also known as independent variables)

\[ y = a_0 + a_1 x_1 + a_2 x_2 + \ldots + a_n x_n \]

  • the case of one explanatory variable (\(n = 1\)) is called simple linear regression; for more than one (\(n > 1\)), the process is called multiple linear regression
  • linear regression models are often fitted using the least squares approach, that is, by finding the linear model (a line or, more generally, a plane or hyperplane) that minimizes the sum of the squares of the residuals (the differences between observed and predicted values), as sketched below
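
For the simple (one-variable) case, the least squares solution can even be written in closed form: the slope is the covariance of \(x\) and \(y\) divided by the variance of \(x\), and the intercept makes the line pass through the point of means. A minimal sketch, again using the sim1 dataset from modelr:

library(modelr)

# closed-form least squares for simple linear regression
slope <- cov(sim1$x, sim1$y) / var(sim1$x)
intercept <- mean(sim1$y) - slope * mean(sim1$x)
c(intercept = intercept, slope = slope) # same values as coef(lm(y ~ x, data = sim1))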

Linear regression

Linear regression has many practical uses. Most applications fall into one of the following two broad categories:

  • if the goal is to explain variation in the response variable that can be attributed to variation in the explanatory variables, linear regression analysis can be applied to quantify the strength of the relationship between the response and the explanatory variables, and in particular to determine whether there is a positive correlation, a negative correlation, or no correlation (independence) between the dependent and independent variables

  • if the goal is prediction, linear regression can be used to fit a predictive model to an observed data set of values of the response and explanatory variables. After developing such a model, if additional values of the explanatory variables are collected without an accompanying response value, the fitted model can be used to make a prediction of the response
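
As a sketch of the prediction use case: fit the model on the observed data, then feed it explanatory values that have no accompanying response. The new values in new_data below are made up for illustration.

library(modelr)

mod <- lm(y ~ x, data = sim1)
new_data <- data.frame(x = c(2.5, 7.5)) # explanatory values without a response
predict(mod, newdata = new_data)        # predicted responses from the fitted model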

Response, prediction and residual

When we plot the data together with the fitted model, we have three interesting values for each value of variable \(x\):

  1. the value of variable \(y\) observed in the dataset (the response);
  2. the value of variable \(y\) predicted by the model (the prediction);
  3. the difference between the observed and predicted values of variable \(y\) (the residual).

  • the predictions tell you the pattern that the model has captured
  • the residuals tell you what the model has missed
  • residuals are powerful because they allow us to use models to remove striking patterns so we can study the subtler trends that remain
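
These three quantities can be computed directly from a fitted model; a minimal sketch with sim1 (the variable names are made up for illustration):

library(modelr)

mod <- lm(y ~ x, data = sim1)
observed  <- sim1$y                # the response
predicted <- predict(mod)          # the prediction: the pattern captured by the model
residual  <- observed - predicted  # the residual: what the model has missed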

How to detect a linear correlation

  1. scatter plot the explanatory and response variables and visually inspect the plot
  2. if you see any correlation, compute a linear regression model, plot it over the scatter plot and, again, visually inspect the plot
  3. compute the Pearson correlation coefficient \(R\): it is the covariance of the two variables divided by the product of their standard deviations. It runs from -1 to 1, with values close to 1 indicating positive correlation, values close to -1 indicating negative correlation, and values close to 0 indicating no correlation (independence)
  4. you can also compute the coefficient of determination \(R^2\), that is, the square of the correlation coefficient. This runs from 0 to 1, with 1 indicating correlation (either positive or negative) and 0 indicating independence. The least squares regression line is the one that maximizes \(R^2\)
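
The four steps above in one compact sketch, using sim1 from modelr; geom_smooth(method = "lm") overlays the least squares line on the scatter plot:

library(ggplot2)
library(modelr)

# steps 1-2: scatter plot with the regression line on top
ggplot(sim1, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# step 3: Pearson correlation coefficient R
r <- cor(sim1$x, sim1$y)
# step 4: coefficient of determination R^2
r^2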

Correlation coefficients

There are three popular correlation coefficients: Pearson, Spearman and Kendall.

The Pearson correlation coefficient is used to estimate a rating-based (value-based) correlation when the data come from a bivariate normal distribution. It is the covariance of the two variables, divided by the product of their standard deviations:

\[cor(X,Y) = \frac{cov(X, Y)}{\sigma_{X} \cdot \sigma_{Y}} = \frac{\sum_{i} (x_i - \mu_X) (y_i - \mu_Y)}{\sqrt{\sum_{i} (x_i - \mu_X)^2} \sqrt{\sum_{i} (y_i - \mu_Y)^2}}\]
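
As a sanity check, this formula can be translated directly into code and compared with the built-in cor(); a minimal sketch with sim1 (the helper name pearson is made up for illustration):

library(modelr)

pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

pearson(sim1$x, sim1$y)                 # from the formula above
cor(sim1$x, sim1$y, method = "pearson") # built-in, same value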

Kendall’s and Spearman’s correlation coefficients are used to estimate a rank-based correlation. These are more robust and are recommended when the data do not necessarily come from a bivariate normal distribution.

The Spearman correlation coefficient applies Pearson’s formula to the ranks of the values rather than to the ratings (the values themselves).

Kendall correlation coefficient is defined as the relative difference of concordant and discordant pairs among the two variables:

\[\frac{n(\text{concord}) - n(\text{discord})}{n(n-1)/2}\] where \(n(\text{concord})\) is the number of concordant pairs and \(n(\text{discord})\) is the number of discordant pairs.

  • A pair \((i,j)\) is concordant if both \(x_i < x_j\) and \(y_i < y_j\) or both \(x_i > x_j\) and \(y_i > y_j\).
  • A pair \((i,j)\) is discordant if \(x_i < x_j\) and \(y_i > y_j\) or \(x_i > x_j\) and \(y_i < y_j\).
  • If \(x_i = x_j\) or \(y_i = y_j\) the pair is neither concordant nor discordant.
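
This definition translates directly into code by counting concordant and discordant pairs; a minimal sketch (the helper name kendall_tau is made up for illustration). Note that the built-in cor(..., method = "kendall") applies a tie correction, so the two values agree exactly only when there are no ties in \(x\) or \(y\).

kendall_tau <- function(x, y) {
  n <- length(x)
  concord <- 0
  discord <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
      if (s > 0) concord <- concord + 1 # concordant pair
      if (s < 0) discord <- discord + 1 # discordant pair
      # s == 0: tie in x or y, neither concordant nor discordant
    }
  }
  (concord - discord) / (n * (n - 1) / 2)
}

library(modelr)
kendall_tau(sim1$x, sim1$y) # compare with cor(sim1$x, sim1$y, method = "kendall")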

Code

library(dplyr)
library(ggplot2)
library(modelr)

# dataset
sim1
## # A tibble: 30 × 2
##        x     y
##    <int> <dbl>
##  1     1  4.20
##  2     1  7.51
##  3     1  2.13
##  4     2  8.99
##  5     2 10.2 
##  6     2 11.3 
##  7     3  7.36
##  8     3 10.5 
##  9     3 10.5 
## 10     4 12.4 
## # ℹ 20 more rows
# scatter plot
ggplot(sim1, aes(x,y)) + geom_point()

# linear model 
mod1 <- lm(y ~ x, data = sim1)

# model summary (all information)
summary(mod1)
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1469 -1.5197  0.1331  1.4670  4.6516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2208     0.8688   4.858 4.09e-05 ***
## x             2.0515     0.1400  14.651 1.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.203 on 28 degrees of freedom
## Multiple R-squared:  0.8846, Adjusted R-squared:  0.8805 
## F-statistic: 214.7 on 1 and 28 DF,  p-value: 1.173e-14
# coefficients
mod1$coefficients
## (Intercept)           x 
##    4.220822    2.051533

The model predicts \(y\) in terms of \(x\) using the following linear relationship:

\[ y = 4.220822 + 2.051533 \cdot x \]

# visualizing the model
ggplot(sim1, aes(x,y)) + 
  geom_point() + 
  geom_abline(intercept = mod1$coefficients[1], 
              slope = mod1$coefficients[2], 
              color = "red")

Let’s compute the correlation coefficients:

# correlation coefficients
cor(sim1$x, sim1$y, method = "pearson") # default
## [1] 0.9405384
cor(sim1$x, sim1$y, method = "spearman") 
## [1] 0.9526352
cor(sim1$x, sim1$y, method = "kendall") 
## [1] 0.8410127

And the coefficient of determination:

# coefficient of determination
summary(mod1)$r.squared
## [1] 0.8846124
# or
cor(sim1$x, sim1$y)^2
## [1] 0.8846124

A general method

# To visualise a model, it is very useful to be able to generate 
# an evenly spaced grid of points from the data
(grid <- data_grid(sim1, x))
## # A tibble: 10 × 1
##        x
##    <int>
##  1     1
##  2     2
##  3     3
##  4     4
##  5     5
##  6     6
##  7     7
##  8     8
##  9     9
## 10    10
# add values predicted by the model over the grid
(grid <- add_predictions(grid, mod1))
## # A tibble: 10 × 2
##        x  pred
##    <int> <dbl>
##  1     1  6.27
##  2     2  8.32
##  3     3 10.4 
##  4     4 12.4 
##  5     5 14.5 
##  6     6 16.5 
##  7     7 18.6 
##  8     8 20.6 
##  9     9 22.7 
## 10    10 24.7
# plot both observed and predicted values
ggplot(sim1, aes(x = x)) +
  geom_point(aes(y = y)) + # observed values
  geom_line(data = grid, mapping = aes(y = pred), colour = "red") # predicted values

# add predictions to the dataset
(sim1 <- add_predictions(sim1, mod1))
## # A tibble: 30 × 3
##        x     y  pred
##    <int> <dbl> <dbl>
##  1     1  4.20  6.27
##  2     1  7.51  6.27
##  3     1  2.13  6.27
##  4     2  8.99  8.32
##  5     2 10.2   8.32
##  6     2 11.3   8.32
##  7     3  7.36 10.4 
##  8     3 10.5  10.4 
##  9     3 10.5  10.4 
## 10     4 12.4  12.4 
## # ℹ 20 more rows
# add residuals to the dataset
(sim1 <- add_residuals(sim1, mod1))
## # A tibble: 30 × 4
##        x     y  pred    resid
##    <int> <dbl> <dbl>    <dbl>
##  1     1  4.20  6.27 -2.07   
##  2     1  7.51  6.27  1.24   
##  3     1  2.13  6.27 -4.15   
##  4     2  8.99  8.32  0.665  
##  5     2 10.2   8.32  1.92   
##  6     2 11.3   8.32  2.97   
##  7     3  7.36 10.4  -3.02   
##  8     3 10.5  10.4   0.130  
##  9     3 10.5  10.4   0.136  
## 10     4 12.4  12.4   0.00763
## # ℹ 20 more rows
# frequency polygon of residuals (their mean is always 0)
ggplot(sim1, aes(resid)) + 
  geom_freqpoly(binwidth = 0.5)

# scatter plot of the residuals against x (residuals plotted as the outcome)
ggplot(sim1, aes(x, resid)) + 
  geom_ref_line(h = 0) +
  geom_point() 

Learn more