$$ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \def\pmacro{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\simh0{\mathop{\sim}_{H_0}} \def\df{\texttt{df}} \def\res{e} \def\xomega{x} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \newcommand{\Rset}{\mbox{$\mathbb{R}$}} \def\param{\theta} \def\setparam{\Theta} \def\xnew{x_{\rm new}} \def\fnew{f_{\rm new}} \def\ynew{y_{\rm new}} \def\nnew{n_{\rm new}} \def\enew{e_{\rm new}} \def\Xnew{X_{\rm new}} \def\hfnew{\widehat{\fnew}} \def\degree{m} \def\nbeta{d} \newcommand{\limite}[1]{\mathop{\longrightarrow}\limits_{#1}} \def\ka{k{\scriptstyle a}} \def\ska{k{\scriptscriptstyle a}} \def\kel{k{\scriptstyle e}} \def\skel{k{\scriptscriptstyle e}} \def\cl{C{\small l}} \def\Tlag{T\hspace{-0.1em}{\scriptstyle lag}} \def\sTlag{T\hspace{-0.07em}{\scriptscriptstyle lag}} \def\Tk{T\hspace{-0.1em}{\scriptstyle k0}} \def\sTk{T\hspace{-0.07em}{\scriptscriptstyle k0}} \def\thalf{t{\scriptstyle 1/2}} \newcommand{\Dphi}[1]{\partial_\pphi #1} \def\asigma{a} \def\pphi{\psi} \newcommand{\stheta}{{\theta^\star}} \newcommand{\htheta}{{\widehat{\theta}}} $$


1 Introduction

The cars data give the speed of cars and the distances taken to stop (note that the data were recorded in the 1920s).

The data frame consists of 50 observations (rows) and 2 variables (columns): speed (mph) and stopping distance (ft).

data(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Scatter plots can help visualize any relationship between the explanatory variable speed (also called regression variable, or predictor) and the response variable dist.

library(ggplot2)
theme_set(theme_bw())
pl <- ggplot(cars) + geom_point(aes(x=speed, y=dist), size=2, colour="#993399") + 
  xlab("Speed (mph)") + ylab("Stopping distance (ft)")  
print(pl)

Based on this data, our objective is to build a regression model of the form

\[y_j = f(x_j) + e_j \quad ; \quad 1 \leq j \leq n\] where \((x_j, 1 \leq j \leq n)\) and \((y_j, 1 \leq j \leq n)\) represent, respectively, the \(n\) measured speeds and distances and where \((e_j, 1 \leq j \leq n)\) is a sequence of residual errors. In other words, \(e_j\) represents the difference between the distance predicted by the model \(f(x_j)\) and the observed distance \(y_j\).

We will restrict ourselves to polynomial regression, by considering functions of the form \[ \begin{aligned} f(x) &= f(x ; c_0, c_1, c_2, \ldots, c_d) \\ &= c_0 + c_1 x + c_2 x^2 + \ldots + c_d x^d \end{aligned} \]

Building a polynomial regression model requires performing several tasks, which are described in the following sections.


2 Fitting polynomial models

2.1 Fitting a polynomial of degree 0

A very basic model considers that the observations \((y_j)\) fluctuate around a constant value \(c_0\): \[y_j = c_0 + e_j\]

attach(cars)
lm0 <- lm(dist ~ 1)
lm0
## 
## Call:
## lm(formula = dist ~ 1)
## 
## Coefficients:
## (Intercept)  
##       42.98

The estimated coefficient \(\hat{c}_0\) is the empirical mean of the observed distances:

mean(dist)
## [1] 42.98

Looking at how the model fits the data is enough to conclude that this model is misspecified.

pl + geom_hline(yintercept=coef(lm0)[1],size=1, colour="#339900")

Indeed, a clear increasing trend is visible in the data while the model assumes that there is no trend. We therefore reject this first model.


2.2 Fitting a polynomial of degree 1

2.2.1 Numerical results

The scatter plot above rather suggests a linearly increasing relationship between the explanatory and response variables. Let us therefore assume now a linear trend which is mathematically represented by a polynomial of degree 1:

\[y_j = c_0 + c_1 x_j + e_j \quad ; \quad 1 \leq j \leq n\]

We can then fit this model to our data using the function lm:

lm1 <- lm(dist ~ speed, data=cars)
coef(lm1)
## (Intercept)       speed 
##  -17.579095    3.932409

These coefficients are the intercept and the slope of the regression line, but more informative results about this model are available:

summary(lm1)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The slope \(c_1\) is clearly statistically significant (p-value = 1.49e-12), and the model explains about \(65\%\) of the variability of the data. The confidence interval for \(c_1\) confirms that an increase in speed leads to a significant increase in the stopping distance.

confint(lm1)
##                  2.5 %    97.5 %
## (Intercept) -31.167850 -3.990340
## speed         3.096964  4.767853

See how these quantities are computed.
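
As a sketch (using only the standard least-squares formulas for the Gaussian linear model, nothing specific to this example), these quantities can be reproduced by hand from the design matrix:

X <- model.matrix(lm1)                    # design matrix: intercept and speed
y <- cars$dist
n <- nrow(X); p <- ncol(X)
beta.hat <- solve(t(X) %*% X, t(X) %*% y) # least-squares estimates
e <- y - X %*% beta.hat                   # residuals
sigma2.hat <- sum(e^2)/(n - p)            # residual variance; sqrt(sigma2.hat) is the residual standard error 15.38
se.beta <- sqrt(sigma2.hat * diag(solve(t(X) %*% X)))  # standard errors of the estimates
t.stat <- beta.hat/se.beta                # t-statistics
p.value <- 2*pt(-abs(t.stat), df = n - p) # two-sided p-values
# 95% confidence intervals, as returned by confint(lm1)
cbind(beta.hat - qt(0.975, n - p)*se.beta, beta.hat + qt(0.975, n - p)*se.beta)
# R-squared: proportion of the variability explained by the model
1 - sum(e^2)/sum((y - mean(y))^2)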


2.2.2 Some diagnostic plots

The fact that the slope is significantly different from zero does not imply that this polynomial model of degree 1 correctly describes the data: at this stage, we can only conclude that a polynomial of degree 1 better explains the variability of the data than a constant model.

Diagnostic plots are visual tools that allow one to "see" if something is not right between a chosen model and the data it is hypothesized to describe.

First, we can add the regression line to the plot of the data.

pl  + geom_abline(intercept=coef(lm1)[1],slope=coef(lm1)[2],size=1, colour="#339900")

The regression line describes the global trend in the data pretty well: based on this plot, there is no reason to reject the model.

Several diagnostic plots are available for an lm object. The first two are a plot of residuals against fitted values and a normal QQ plot.

par(mfrow = c(1, 2))
plot(lm1, which=c(1,2))

The residual plot shows a slight (decreasing then increasing) trend, which suggests that the residuals are not identically distributed around 0. Furthermore, the QQ plot shows that the extreme residual values are not the extreme values of a normal distribution. It may therefore be necessary to improve the regression model.


See how these plots are obtained.
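
For instance, the same two plots can be reconstructed directly from the fitted values and residuals (a sketch; plot.lm uses standardized residuals for the QQ plot):

par(mfrow = c(1, 2))
# residuals against fitted values
plot(fitted(lm1), residuals(lm1), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 3)
# normal QQ plot of the standardized residuals
qqnorm(rstandard(lm1))
qqline(rstandard(lm1), lty = 3)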


2.2.3 The predictive performance of the model

Even if the model is somewhat misspecified, it may have good predictive properties.

A common practice is to split the dataset into an 80:20 sample (training:test), build the model on the 80% training sample, and then use the fitted model to predict the response variable on the test data.

This way, we obtain the model's predicted values for the 20% test data. We can then see how the model performs on this "new" data by comparing the predicted values with the observed ones. We can also check the stability of the predictions by comparing them with those obtained previously, when the complete data were used for building the model.

Let us first randomly define the training and test samples:

set.seed(100)
n <- nrow(cars)
i.training <- sort(sample(n,round(n*0.8)))
cars.training <- cars[i.training,]
cars.test  <- cars[-i.training,]
pred1a.test <- predict(lm1, newdata=cars.test)
lm1.training <- lm(dist ~ speed, data=cars.training)
pred1b.test <- predict(lm1.training, newdata=cars.test)
data.frame(cars.test, pred1a.test, pred1b.test)
##    speed dist pred1a.test pred1b.test
## 1      4    2   -1.849460   -5.392776
## 4      7   22    9.947766    7.555787
## 8     10   26   21.744993   20.504349
## 20    14   26   37.474628   37.769100
## 26    15   54   41.407036   42.085287
## 31    17   50   49.271854   50.717663
## 37    19   46   57.136672   59.350038
## 39    20   32   61.069080   63.666225
## 40    20   48   61.069080   63.666225
## 42    20   56   61.069080   63.666225
dist.test <- cars.test$dist
par(mfrow=c(1,2))
plot(pred1b.test, dist.test)
abline(a=0, b=1, lty=2)
plot(pred1b.test, pred1a.test)
abline(a=0, b=1, lty=2)

On the one hand, it is reassuring to see that removing part of the data has very little impact on the predictions (right graph). On the other hand, the predictive performance of the model remains limited because of the natural variability of the data (left graph). Indeed, the model built with the training sample explains only 69% of the variability of the new test sample.

cor.test <- cor(pred1a.test, dist.test)
R2.test <- cor.test^2
R2.test
## [1] 0.6851758

Let us see now how the model behaves with extreme values by defining as test sample the data points with the 5 smallest and the 5 largest speeds.

i.training <- c(6:45)
cars.training <- cars[i.training,]
cars.test  <- cars[-i.training,]
dist.test <- cars.test$dist
pred1a.test <- predict(lm1, newdata=cars.test)
lm1.training <- lm(dist ~ speed, data=cars.training)
pred1b.test <- predict(lm1.training, newdata=cars.test)
par(mfrow=c(1,2))
plot(pred1b.test, dist.test)
abline(a=0, b=1, lty=2)
plot(pred1b.test, pred1a.test)
abline(a=0, b=1, lty=2)

We see that the model now underestimates the stopping distance for the largest speeds.

In conclusion, the model may be used with confidence for predicting the stopping distance at central speed values. Another model should probably be developed for predicting stopping distances at extreme speeds.


2.2.4 Confidence interval and prediction interval

Imagine we aim to predict the stopping distance for a speed \(x\). Using the estimated coefficients of our polynomial model, the prediction will be \[\hat{f}(x) = \hat{c}_0 + \hat{c}_1x \] This prediction is itself an estimate since it is based on the estimated coefficients \(\hat{c}_0\) and \(\hat{c}_1\). The uncertainty of the prediction should therefore be quantified by providing a confidence interval for \(f(x)\).

Since the model can be used for predicting the stopping distance for non-extreme speeds, we will restrict ourselves to speeds between 6 and 23 mph.

alpha <- 0.05
df.new <- data.frame(speed=(6:23))
conf.dist <- predict(lm1, newdata = df.new, interval="confidence", level=1-alpha) 
head(conf.dist)
##         fit       lwr      upr
## 1  6.015358 -2.973341 15.00406
## 2  9.947766  1.678977 18.21656
## 3 13.880175  6.307527 21.45282
## 4 17.812584 10.905120 24.72005
## 5 21.744993 15.461917 28.02807
## 6 25.677401 19.964525 31.39028

A prediction interval for a new measured distance \(y=f(x)+e\) can also be computed. This prediction interval takes into account both the uncertainty on the predicted distance \(f(x)\) and the variability of the measure, represented in the model by the residual error \(e\).

pred.dist <- predict(lm1, newdata = df.new, interval="prediction", level=1-alpha) 
head(pred.dist)
##         fit        lwr      upr
## 1  6.015358 -26.187314 38.21803
## 2  9.947766 -22.061423 41.95696
## 3 13.880175 -17.956287 45.71664
## 4 17.812584 -13.872245 49.49741
## 5 21.744993  -9.809601 53.29959
## 6 25.677401  -5.768620 57.12342

Let us plot these two intervals.

df.new[c("fit","lwr.conf", "upr.conf")] <- conf.dist
df.new[c("lwr.pred", "upr.pred")] <- pred.dist[,2:3]
pl +   
  geom_ribbon(data=df.new, aes(x=speed, ymin=lwr.pred, ymax=upr.pred), alpha=0.1, inherit.aes=F, fill="blue") + 
  geom_ribbon(data=df.new, aes(x=speed, ymin=lwr.conf, ymax=upr.conf), alpha=0.2, inherit.aes=F, fill="#339900") +  
  geom_line(data=df.new, aes(x=speed, y=fit), colour="#339900", size=1)

These intervals are of very little interest and should not be used in practice. Indeed, we can see that both the confidence interval and the prediction interval contain negative values for small or moderate speeds, which is obviously unrealistic. We will see later how to transform the model and/or the data in order to take into account some constraints on the data.


See how these intervals are computed.
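
As a sketch, these intervals can be reproduced from the usual formulas for the Gaussian linear model: the standard error of \(\hat{f}(x)\) involves the design matrix only, while the prediction interval adds the residual variance.

x.new <- 6:23
X.new <- cbind(1, x.new)                       # design matrix for the new speeds
beta.hat <- coef(lm1)
sigma2.hat <- summary(lm1)$sigma^2
S <- solve(t(model.matrix(lm1)) %*% model.matrix(lm1))
f.new <- as.vector(X.new %*% beta.hat)         # predicted distances
v <- diag(X.new %*% S %*% t(X.new))
se.conf <- sqrt(sigma2.hat * v)                # standard error of f(x)
se.pred <- sqrt(sigma2.hat * (1 + v))          # standard error of a new observation y
t.crit <- qt(1 - alpha/2, df = lm1$df.residual)
# confidence intervals for f(x), then prediction intervals for new observations
head(cbind(fit = f.new, lwr = f.new - t.crit*se.conf, upr = f.new + t.crit*se.conf))
head(cbind(fit = f.new, lwr = f.new - t.crit*se.pred, upr = f.new + t.crit*se.pred))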


2.3 Fitting a polynomial of degree 2

We can expect to better describe the extreme values by using a polynomial of higher degree. Let us therefore fit a polynomial of degree 2 to the data.

lm2 <- lm(dist ~  speed + I(speed^2))
summary(lm2)
## 
## Call:
## lm(formula = dist ~ speed + I(speed^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.720  -9.184  -3.188   4.628  45.152 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.47014   14.81716   0.167    0.868
## speed        0.91329    2.03422   0.449    0.656
## I(speed^2)   0.09996    0.06597   1.515    0.136
## 
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared:  0.6673, Adjusted R-squared:  0.6532 
## F-statistic: 47.14 on 2 and 47 DF,  p-value: 5.852e-12

Surprisingly, the 3 \(t\)-tests seem to indicate that none of the coefficients is statistically significant. This result should be considered with caution. Indeed, it does not mean that the three coefficients can be replaced by 0 (i.e. removed) in the model.

The difficulty in interpreting these p-values is due to the strong correlations that exist between the estimates \(\hat{c}_0\), \(\hat{c}_1\) and \(\hat{c}_2\). Recall that the variance-covariance matrix of the vector of estimates is \(\sigma^2(X^\prime X)^{-1}\). We can then easily derive the correlation matrix from this covariance matrix.

X <- model.matrix(lm2)
S <- solve(t(X)%*%X)
d <- sqrt(diag(S))
R <- S/(d%*%t(d))
R
##             (Intercept)      speed I(speed^2)
## (Intercept)   1.0000000 -0.9605503  0.8929849
## speed        -0.9605503  1.0000000 -0.9794765
## I(speed^2)    0.8929849 -0.9794765  1.0000000

We will see below how to use a sequence of orthogonal polynomials in order to get uncorrelated estimates.

The R-squared indicates that 66.7% of the variability of the data is explained by a polynomial of degree 2, instead of 65.1% with a polynomial of degree 1, while the standard deviation of the residual error is 15.18 ft with this model instead of 15.38 ft with the previous one. These improvements are really too small to justify the use of a quadratic polynomial.

Nevertheless, plotting the data together with the fitted polynomial suggests that the extreme values are better predicted with a polynomial of degree 2 than with a straight line.

f <- function(x,c) coef(lm2)[1] + coef(lm2)[2]*x + coef(lm2)[3]*x^2
pl  + geom_abline(intercept=coef(lm1)[1],slope=coef(lm1)[2],size=1, colour="#339900") +
  stat_function(fun=f, colour="red", size=1)

The predictive performance of the model needs to be assessed in order to confirm this property. We can again use the 80% most central data points to build the model and test it on the remaining 20%.

lm2t <- lm(dist ~ speed + I(speed^2), data=cars.training)
f <- function(x,c) coef(lm2t)[1] + coef(lm2t)[2]*x + coef(lm2t)[3]*x^2
pl + stat_function(fun=f, colour="#339900", size=1) + geom_point(data=cars.test, aes(x=speed, y=dist), color="blue")

The shape of the polynomial is totally different when it is built using the training data only. That means that the model is quite unstable and depends strongly on a few data values. Repeating the same experiment (under the same experimental conditions) would therefore probably lead to significantly different results.

Such a lack of reproducibility of the predictions reinforces our view that this model should be rejected.


2.4 Fitting a polynomial without intercept

The stopping distance predicted by the model should be zero when the speed is zero. This constraint can easily be enforced by removing the intercept from the regression model: \[ f(x) = c_1 x + c_2 x^2 + \ldots + c_d x^d \] Let us fit a regression model of degree 2 without intercept to our data.

lm2.noint <- lm(dist ~ -1 + speed + I(speed^2))
coef(lm2.noint)
##      speed I(speed^2) 
## 1.23902996 0.09013877

We can check that the design matrix consists of the 2 columns \((x_j)\) and \((x_j^2)\).

X <- model.matrix(lm2.noint)
head(X)
##   speed I(speed^2)
## 1     4         16
## 2     4         16
## 3     7         49
## 4     7         49
## 5     8         64
## 6     9         81

We can plot the data with the predicted response \(f(x)\), confidence interval for \(f(x)\) and prediction intervals for new observations \(y\).

df.new <- data.frame(speed=(0:27))
conf.dist <- predict(lm2.noint, newdata = df.new, interval="confidence", level=1-alpha) 
pred.dist <- predict(lm2.noint, newdata = df.new, interval="prediction", level=1-alpha) 
df.new[c("fit","lwr.conf", "upr.conf")] <- conf.dist
df.new[c("lwr.pred", "upr.pred")] <- pred.dist[,2:3]
pl +   
  geom_ribbon(data=df.new, aes(x=speed, ymin=lwr.pred, ymax=upr.pred), alpha=0.1, inherit.aes=F, fill="blue") + 
  geom_ribbon(data=df.new, aes(x=speed, ymin=lwr.conf, ymax=upr.conf), alpha=0.2, inherit.aes=F, fill="#339900") +  
  geom_line(data=df.new, aes(x=speed, y=fit), colour="#339900", size=1)

The confidence interval for \(f(x)\) now contains only positive values for any \(x>0\) and reduces to 0 when \(x=0\).

On the other hand, prediction intervals for new observations still contain negative values. This is due to the fact that the residual error model is a constant error model \(y=f(x)+e\), assuming that the variance of the error \(e\) does not depend on the predicted value \(f(x)\). This hypothesis does not reflect the true phenomenon and should be rejected. We will see that an appropriate alternative may consist in transforming the data.


2.5 Using orthogonal polynomials

Instead of defining each component of the regression model explicitly, as in lm(dist ~ speed + I(speed^2)), we can equivalently define the regression model as a polynomial of degree 2:

lm2.poly <- lm(dist ~ poly(speed, degree=2, raw=T))
M <- model.matrix(lm2.poly)
head(M)
##   (Intercept) poly(speed, degree = 2, raw = T)1
## 1           1                                 4
## 2           1                                 4
## 3           1                                 7
## 4           1                                 7
## 5           1                                 8
## 6           1                                 9
##   poly(speed, degree = 2, raw = T)2
## 1                                16
## 2                                16
## 3                                49
## 4                                49
## 5                                64
## 6                                81

The results obtained with both methods are identical:

print(coef(lm2.poly))
##                       (Intercept) poly(speed, degree = 2, raw = T)1 
##                         2.4701378                         0.9132876 
## poly(speed, degree = 2, raw = T)2 
##                         0.0999593

We have seen that this model leads to highly correlated estimators of the coefficients, which makes the interpretation of the results difficult. The option raw=FALSE (which is the default) uses an orthogonal basis of polynomials.

Let us see the results obtained with a polynomial of degree 1

lm1.ortho <- lm(dist ~ poly(speed, degree=1))
M1 <- model.matrix(lm1.ortho)
head(M1)
##   (Intercept) poly(speed, degree = 1)
## 1           1              -0.3079956
## 2           1              -0.3079956
## 3           1              -0.2269442
## 4           1              -0.2269442
## 5           1              -0.1999270
## 6           1              -0.1729098
t(M1)%*%M1
##                          (Intercept) poly(speed, degree = 1)
## (Intercept)             5.000000e+01            1.665335e-15
## poly(speed, degree = 1) 1.665335e-15            1.000000e+00

Here, the estimated intercept \(\hat{c}_0\) is the empirical mean \(\bar{y}=\) 42.98

coef(lm1.ortho)
##             (Intercept) poly(speed, degree = 1) 
##                 42.9800                145.5523

Adding a term of degree 2 keeps the first two columns of the design matrix unchanged:

lm2.ortho <- lm(dist ~ poly(speed, degree=2))
M2 <- model.matrix(lm2.ortho)
head(M2)
##   (Intercept) poly(speed, degree = 2)1 poly(speed, degree = 2)2
## 1           1               -0.3079956               0.41625480
## 2           1               -0.3079956               0.41625480
## 3           1               -0.2269442               0.16583013
## 4           1               -0.2269442               0.16583013
## 5           1               -0.1999270               0.09974267
## 6           1               -0.1729098               0.04234892
t(M2)%*%M2
##                           (Intercept) poly(speed, degree = 2)1
## (Intercept)              5.000000e+01             1.665335e-15
## poly(speed, degree = 2)1 1.665335e-15             1.000000e+00
## poly(speed, degree = 2)2 1.110223e-16             5.551115e-17
##                          poly(speed, degree = 2)2
## (Intercept)                          1.110223e-16
## poly(speed, degree = 2)1             5.551115e-17
## poly(speed, degree = 2)2             1.000000e+00

Let us look at the results obtained with this model:

summary(lm2.ortho)
## 
## Call:
## lm(formula = dist ~ poly(speed, degree = 2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.720  -9.184  -3.188   4.628  45.152 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                42.980      2.146  20.026  < 2e-16 ***
## poly(speed, degree = 2)1  145.552     15.176   9.591 1.21e-12 ***
## poly(speed, degree = 2)2   22.996     15.176   1.515    0.136    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared:  0.6673, Adjusted R-squared:  0.6532 
## F-statistic: 47.14 on 2 and 47 DF,  p-value: 5.852e-12

These results give rise to several comments:

  • the first two estimated coefficients \(\hat{c}_0\) and \(\hat{c}_1\) are those obtained previously with a polynomial of degree 1
  • the p-values of the \(t\)-tests can now be properly interpreted and confirm our previous analysis: the first two coefficients are statistically significant. The need for a quadratic term is much less obvious (\(p=0.136\)). Furthermore, confidence intervals can be derived for each coefficient:
confint(lm2.ortho)
##                               2.5 %    97.5 %
## (Intercept)               38.662361  47.29764
## poly(speed, degree = 2)1 115.021940 176.08257
## poly(speed, degree = 2)2  -7.534552  53.52608
  • The estimated variance of the residuals \(\hat{\sigma}^2\), the R-squared value and the adjusted R-squared value are identical whether orthogonal or non-orthogonal polynomials are used. Indeed, even if the parameterization is different, the two models represent the same polynomial of degree 2 and therefore lead to the same predictions.


See how these quantities are computed.
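
As a sketch, the estimates, their standard errors, \(\hat{\sigma}^2\) and the R-squared values for lm2.ortho can be reproduced by hand; note that \(M^\prime M\) is diagonal here, since the polynomial basis is orthogonal:

M <- model.matrix(lm2.ortho)
y <- cars$dist
n <- nrow(M); p <- ncol(M)
beta.hat <- as.vector(solve(t(M) %*% M, t(M) %*% y))  # t(M) %*% M is diagonal (orthogonal basis)
e <- y - M %*% beta.hat
sigma2.hat <- sum(e^2)/(n - p)                        # sqrt(sigma2.hat) is the residual standard error 15.18
se.beta <- sqrt(sigma2.hat * diag(solve(t(M) %*% M)))
data.frame(estimate = beta.hat, se = se.beta, row.names = colnames(M))
# R-squared and adjusted R-squared
R2 <- 1 - sum(e^2)/sum((y - mean(y))^2)
c(R2 = R2, adj.R2 = 1 - (1 - R2)*(n - 1)/(n - p))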

3 Model comparison


3.1 t-test

Several quantitative criteria and several statistical tests are available for comparing models.

First, we have seen that \(t\)-tests are performed for each coefficient of the model:

summary(lm1.ortho)$coefficients
##                         Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)              42.9800   2.175002 19.76090 1.061332e-24
## poly(speed, degree = 1) 145.5523  15.379587  9.46399 1.489836e-12
summary(lm2.ortho)$coefficients
##                           Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)               42.98000    2.14622 20.025902 1.188618e-24
## poly(speed, degree = 2)1 145.55226   15.17607  9.590906 1.211446e-12
## poly(speed, degree = 2)2  22.99576   15.17607  1.515265 1.364024e-01

Based on these results, we can conclude that \(c_0 \neq 0\) and \(c_1 \neq 0\). On the other hand, the data does not allow us to conclude that \(c_2 \neq 0\).


3.2 Analysis-of-variance (anova)

By construction, the F-statistic provided with the summary of a fitted model is the F-statistic of an ANOVA for testing this fitted model against the model without any explanatory variable, lm0. This test is known as the overall F-test for regression.

anova(lm0, lm1)
## Analysis of Variance Table
## 
## Model 1: dist ~ 1
## Model 2: dist ~ speed
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)    
## 1     49 32539                                 
## 2     48 11354  1     21186 89.567 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm0, lm2)
## Analysis of Variance Table
## 
## Model 1: dist ~ 1
## Model 2: dist ~ speed + I(speed^2)
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     49 32539                                  
## 2     47 10825  2     21714 47.141 5.852e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Not surprisingly, both lm1 and lm2 are clearly preferred to lm0. As already stated above, the \(t\)-test and the \(F\)-test for testing lm0 against lm1 are equivalent.
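
This equivalence is easy to check numerically (a quick check, nothing model-specific): the F-statistic of the ANOVA is the square of the t-statistic of the slope.

# overall F-statistic of lm1 versus squared t-statistic of the slope
t.slope <- summary(lm1)$coefficients["speed", "t value"]
c(F.anova = anova(lm0, lm1)$F[2], t.squared = t.slope^2)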

We can also perform an ANOVA to test lm1 against lm2

anova(lm1, lm2)
## Analysis of Variance Table
## 
## Model 1: dist ~ speed
## Model 2: dist ~ speed + I(speed^2)
##   Res.Df   RSS Df Sum of Sq     F Pr(>F)
## 1     48 11354                          
## 2     47 10825  1    528.81 2.296 0.1364

lm1 is slightly preferred to lm2 according to this criterion.


3.3 Likelihood ratio test (LRT)

The likelihood ratio test can be used for comparing two nested models. Since our three polynomial models are nested, we can use the LRT for testing one of these models against another.

The log-likelihoods \(\llike_0(\hat\theta_0)\), \(\llike_1(\hat\theta_1)\) and \(\llike_2(\hat\theta_2)\) are available, where \(\hat\theta_k\) is the maximum likelihood estimate of the parameters of model lmk, for \(k=0, 1, 2\).

logLik(lm0)
logLik(lm1)
logLik(lm2)
## 'log Lik.' -232.9012 (df=2)
## 'log Lik.' -206.5784 (df=3)
## 'log Lik.' -205.386 (df=4)

For testing lm0 against lm1 for instance, we compute the test statistic \(2(\llike_1(\hat\theta_1) - \llike_0(\hat\theta_0))\), which has a \(\chi^2\) distribution with 1 degree of freedom under lm0.

dl <- 2*as.numeric(logLik(lm1) - logLik(lm0))
1-pchisq(dl,1)
## [1] 3.995693e-13

The test statistic \(2(\llike_2(\hat\theta_2) - \llike_1(\hat\theta_1))\) is used for testing lm1 against lm2:

dl <- 2*as.numeric(logLik(lm2) - logLik(lm1))
1-pchisq(dl,1)
## [1] 0.122521

We can see with this example that the ANOVA and the LRT give very similar results and lead to the same conclusion: the constant model can be rejected with high confidence (i.e. the stopping distance depends on the speed), while a model with both a linear and a quadratic component is not preferred to a model with a linear component only (i.e. we don't reject the hypothesis that the stopping distance increases linearly with the speed).


3.4 Information criteria

Information criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) can also be used for comparing models that are not necessarily nested.

AIC(lm0,lm1,lm2)
##     df      AIC
## lm0  2 469.8024
## lm1  3 419.1569
## lm2  4 418.7721
BIC(lm0,lm1,lm2)
##     df      BIC
## lm0  2 473.6265
## lm1  3 424.8929
## lm2  4 426.4202

Models with the lowest AIC and/or BIC are preferred. Here, both criteria agree on rejecting lm0 with high confidence. AIC has a very slight preference for lm2, while BIC penalizes higher-degree polynomials a little more and prefers lm1. Nevertheless, these differences are not large enough to definitively select either of these two models.


See how these quantities are computed.
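
As a sketch, both criteria can be recomputed directly from the log-likelihood, penalized by the number of estimated parameters \(k\) (including the residual variance): \({\rm AIC} = -2\llike + 2k\) and \({\rm BIC} = -2\llike + k\log(n)\).

k1 <- attr(logLik(lm1), "df")   # 3 parameters for lm1: intercept, slope and residual variance
n <- nrow(cars)
c(AIC = -2*as.numeric(logLik(lm1)) + 2*k1,
  BIC = -2*as.numeric(logLik(lm1)) + log(n)*k1)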

4 Data transformation

When fitting a linear regression model one assumes that there is a linear relationship between the response variable and each of the explanatory variables. However, in many situations there may instead be a non-linear relationship between the variables. This can sometimes be remedied by applying a suitable transformation to some (or all) of the variables, such as power transformations or logarithms.

In addition, transformations can be used to correct violations of model assumptions such as constant error variance and normality.

Applying suitable transformations to the original data prior to performing regression may be sufficient to make linear regression models appropriate for the transformed data.

4.1 Log-based transformations

Let's plot the data using semi-log scales: we use a logarithmic scale for the x-axis and a linear scale for the y-axis in the first plot, and a linear scale for the x-axis and a logarithmic scale for the y-axis in the second plot.

library(gridExtra)
pl1 <- pl + scale_x_log10() 
pl2 <- pl + scale_y_log10() 
grid.arrange(pl1, pl2, nrow=1)

Neither of these two plots shows a linear trend. Let us then try a log-log transformation.

print(pl + scale_x_log10() + scale_y_log10() )

Looks much better!

We can then try to fit the following model to the data: \[ \log(y_j) = c_0 + c_1\log(x_j) + e_j \] Another representation of this model exhibits a power relationship between the explanatory and response variables, \[ y_j = A \, x_j^{c_1} \, e^{e_j} \] where \(A=e^{c_0}\). This model ensures that the response \(y\) can only take positive values. Furthermore, the variability of the response increases with the explanatory variable. Indeed, since \(e^{e_j} \approx 1 + e_j\) for small \(e_j\), \[ y_j \approx A \, x_j^{c_1}(1 + e_j)\]

lm1.log <- lm(log(dist) ~ log(speed))
summary(lm1.log)
## 
## Call:
## lm(formula = log(dist) ~ log(speed))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00215 -0.24578 -0.02898  0.20717  0.88289 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.7297     0.3758  -1.941   0.0581 .  
## log(speed)    1.6024     0.1395  11.484 2.26e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4053 on 48 degrees of freedom
## Multiple R-squared:  0.7331, Adjusted R-squared:  0.7276 
## F-statistic: 131.9 on 1 and 48 DF,  p-value: 2.259e-15

This model now explains 73% of the variability of the (transformed) data.


4.2 Diagnostic plots

Let us plot the data with the regression line and the residuals

par(mfrow=c(1,2))
plot(log(speed),log(dist))
abline(a=coef(lm1.log)[1], b=coef(lm1.log)[2], col="#339900")
plot(lm1.log, which=1)

par(mfrow=c(1,1))

These two diagnostic plots show that the model fits the data quite well.

In the next plot, we compare the fits obtained when the model is built using the complete data (green) and when only the training sample is used (red).

i.training <- c(6:45)
lm1.training <- lm(log(dist) ~ log(speed), data=cars[i.training,])

ggplot(cars, aes(x=log(speed), y=log(dist))) + geom_point(size=2, colour="#993399") + 
geom_point(data=cars[-i.training,], size=2, colour="red") +   geom_abline(intercept=coef(lm1.log)[1], slope=coef(lm1.log)[2], colour="#339900", size=1)  + 
  geom_abline(intercept=coef(lm1.training)[1], slope=coef(lm1.training)[2], colour="red", size=1)+
    xlab("speed (mph)") + ylab("stopping distance (ft)")  

These two fits are very similar: the estimation is relatively insensitive to extreme values.


4.3 Confidence interval and prediction interval

Based on the previous analysis, the model can be used for predicting stopping distances for a wide range of speeds. Let us compute and display the confidence interval for the regression line (of the model lm(log(dist) ~ log(speed))) and the prediction interval for new observed log-distances.

alpha <- 0.05
df.new <- data.frame(speed=(3:30))
conf.dist <- predict(lm1.log, newdata = df.new, interval="confidence", level=1-alpha) 
pred.dist <- predict(lm1.log, newdata = df.new, interval="prediction", level=1-alpha) 

df.new[c("fit","lwr.conf", "upr.conf")] <- conf.dist
df.new[c("lwr.pred", "upr.pred")] <- pred.dist[,2:3]
ggplot(cars) + geom_point(aes(x=log(speed), y=log(dist)), size=2, colour="#993399") + 
  geom_ribbon(data=df.new, aes(x=log(speed), ymin=lwr.pred, ymax=upr.pred), alpha=0.1, inherit.aes=F, fill="blue") + 
  geom_ribbon(data=df.new, aes(x=log(speed), ymin=lwr.conf, ymax=upr.conf), alpha=0.2, inherit.aes=F, fill="#339900") +  
  geom_line(data=df.new, aes(x=log(speed), y=fit), colour="#339900", size=1) +
    xlab("log-speed (mph)") + ylab("log-stopping distance (ft)")  

We can derive from these intervals a confidence interval and a prediction interval for the non-transformed data (i.e. using the original scale).

ggplot(cars) + geom_point(aes(x=speed, y=dist), size=2, colour="#993399") + 
  geom_ribbon(data=df.new, aes(x=speed, ymin=exp(lwr.pred), ymax=exp(upr.pred)), alpha=0.1, inherit.aes=F, fill="blue") + 
  geom_ribbon(data=df.new, aes(x=speed, ymin=exp(lwr.conf), ymax=exp(upr.conf)), alpha=0.2, inherit.aes=F, fill="#339900") +  
  geom_line(data=df.new, aes(x=speed, y=exp(fit)), colour="#339900", size=1) +
    xlab("speed (mph)") + ylab("stopping distance (ft)")  

The general trend describes quite well how the stopping distance increases with the speed. Nevertheless, the residual error model seems to overestimate the variability of the data at the highest speeds.


4.4 Model comparison

We will now investigate whether the numerical criteria for model comparison confirm that this model based on a log-log transformation should be preferred to the polynomial models.

Let us first compute the log-likelihood for this model.

logLik(lm1.log)
## 'log Lik.' -24.76592 (df=3)

WARNING: this value of the log-likelihood cannot be compared as it is to the log-likelihoods computed previously. Indeed, by definition, the log-likelihood for model lm1.log is based on the probability density function (pdf) of \(\log(y)\):

\[ \begin{aligned} \llike_{\rm lm1.log} &= \llike(\hat{\theta}_{\rm lm1.log}) \\ &= \log \pmacro (\log(y) ; \hat{\theta}_{\rm lm1.log}) \end{aligned} \] where \(\hat{\theta}_{\rm lm1.log}\) is the ML estimate of \(\theta\) for model lm1.log.

The pdf of \(\log(y)\) under this model therefore cannot be directly compared to the pdf of \(y\) under another model.

We therefore need to compute the pdf of \(y\) under this model. For any \(\theta\) and any \(1 \leq j \leq n\),

\[ \begin{aligned} \pmacro(y_j ; \theta) &= \pmacro(\log(y_j) ; \theta)\times \frac{1}{y_j} \end{aligned} \] and \[ \log (\pmacro(y ; \theta)) = \log (\pmacro (\log(y) ; \theta)) - \sum_{j=1}^n \log(y_j) \]

# log-likelihood of the original distances y under the log-log model
logLik(lm1.log) - sum(log(dist))
## 'log Lik.' -201.5613 (df=3)

The information criteria should also take this transformation into account:

AIC(lm1.log) - 2*sum(log(1/dist))
## [1] 409.1226
BIC(lm1.log) - 2*sum(log(1/dist))
## [1] 414.8587

Both criteria prefer this model based on a log-log transformation to any polynomial model without transformation.

Remark: since models lm1 and lm1.log have the same number of parameters, the log-likelihoods without penalisation can be used for comparing these two models.
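
For instance, comparing the log-likelihood of lm1 with the corrected log-likelihood of lm1.log (both expressed for the original distances):

c(lm1 = as.numeric(logLik(lm1)),
  lm1.log = as.numeric(logLik(lm1.log)) - sum(log(cars$dist)))

The log-log model has the larger log-likelihood, in agreement with the AIC and BIC comparison above.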