January 11th, 2017

The data `cars`

give the speed of cars and the distances taken to stop (Note that the data were recorded in the 1920s).

The data frame consists of 50 observations (rows) and 2 variables (columns): speed (mph), stopping distance (ft)

```
data(cars)
head(cars)
```

```
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
```

Scatter plots can help visualize any relationship between the explanatory variable *speed* (also called regression variable, or predictor) and the response variable *dist*.

```
library(ggplot2)
theme_set(theme_bw())
pl <- ggplot(cars) + geom_point(aes(x=speed, y=dist), size=2, colour="#993399") +
xlab("Speed (mph)") + ylab("Stopping distance (ft)")
print(pl)
```

Based on this data, our objective is to build a regression model of the form

\[y_j = f(x_j) + e_j \quad ; \quad 1 \leq j \leq n\] where \((x_j, 1 \leq j \leq n)\) and \((y_j, 1 \leq j \leq n)\) represent, respectively, the \(n\) measured speeds and distances and where \((e_j, 1 \leq j \leq n)\) is a sequence of *residual errors*. In other words, \(e_j\) represents the difference between the distance predicted by the model \(f(x_j)\) and the observed distance \(y_j\).

We will restrict ourselves to polynomial regression, by considering functions of the form \[ \begin{aligned} f(x) &= f(x ; c_0, c_1, c_2, \ldots, c_d) \\ &= c_0 + c_1 x + c_2 x^2 + \ldots + c_d x^d \end{aligned} \]

Building a polynomial regression model requires to perform the following taks:

- For a given degree \(d\),
- estimate the parameters of the model,
- assess the validity of the model,
- evaluate the predictive performance of the model

- Compare the different possible models and select the best one(s).

A very basic model considers that the observations \((y_j)\) fluctuates around a constant value \(c_0\): \[y_j = c_0 + e_j\]

```
attach(cars)
lm0 <- lm(dist ~ 1)
lm0
```

```
##
## Call:
## lm(formula = dist ~ 1)
##
## Coefficients:
## (Intercept)
## 42.98
```

The coefficient \(c_0\) is the empirical mean

`mean(dist)`

`## [1] 42.98`

Looking how the model fits the data is more than enough for concluding that this model is misspecified.

`pl + geom_hline(yintercept=coef(lm0)[1],size=1, colour="#339900")`

Indeed, a clear increasing trend is visible in the data while the model assumes that there is no trend. We therefore reject this first model.

The scatter plot above rather suggests a linearly increasing relationship between the explanatory and response variables. Let us therefore assume now a linear trend which is mathematically represented by a polynomial of degree 1:

\[y_j = c_0 + c_1 x_j + e_j \quad ; \quad 1 \leq j \leq n\]

We can then fit this model to our data using the function `lm`

:

```
lm1 <- lm(dist ~ speed, data=cars)
coef(lm1)
```

```
## (Intercept) speed
## -17.579095 3.932409
```

These coefficients are the intercept and the slope of the regression line, but more informative results about this model are available:

`summary(lm1)`

```
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
```

The slope \(c_1\) is clearly statistically significant (p-value = 1.49e-12) while the model explains about \(65\%\) of the variability of the data. The confidence interval for \(c_1\) confirms that an increase of the speed leads to a significant increase of the stopping distance.

`confint(lm1)`

```
## 2.5 % 97.5 %
## (Intercept) -31.167850 -3.990340
## speed 3.096964 4.767853
```

See how these quantities are computed.

The fact that the slope is significantly different from zero does not imply that this polynomial model of degree 1 correctly describes the data: at this stage, we can only conclude that a polynomial of degree 1 better explains the variability of the data than a constant model.

Diagnostic plots are visual tools that allows one to ``see?????? if something is not right between a chosen model and the data it is hypothesized to describe.

First, we can add the regression line to the plot of the data.

`pl + geom_abline(intercept=coef(lm1)[1],slope=coef(lm1)[2],size=1, colour="#339900")`

The regression line decribes pretty well the global trend in the data: based on this graphic, there is no reason to reject the model.

Several diagnotic plots are available for a `lm`

object. The first two are a plot of residuals against fitted values and a normal QQ plot.

```
par(mfrow = c(1, 2))
plot(lm1, which=c(1,2))
```

The residual plot shows a slight (decreasing and increasing) trend which suggests that the residuals are not identically distributed around 0. Furthermore, the QQ plot shows that the extreme residual values are not the extreme values of a normal distribution. It may be therefore necessary to improve the regression model.

See how these plot are obtained.

Even if the model it somewhat misspecified, it may have good predictive properties.

A common practice is to split the dataset into a 80:20 sample (training:test), then, build the model on the 80% sample and then use the model thus built to predict the reponse variable on test data.

Doing it this way, we will have the model predicted values for the 20% data (test). We can then see how the model will perform with this ``new?????? data, by comparing these predicted values with the original ones. We can alo check the stability of the prediction given by the model, by comparing these predicted values with those obtained previouly, when the complete data were used for building the model.

Let us first randomly define the training and test samples:

```
set.seed(100)
n <- nrow(cars)
i.training <- sort(sample(n,round(n*0.8)))
cars.training <- cars[i.training,]
cars.test <- cars[-i.training,]
```

`pred1a.test <- predict(lm1, newdata=cars.test)`

```
lm1.training <- lm(dist ~ speed, data=cars.training)
pred1b.test <- predict(lm1.training, newdata=cars.test)
```

`data.frame(cars.test, pred1a.test, pred1b.test)`

```
## speed dist pred1a.test pred1b.test
## 1 4 2 -1.849460 -5.392776
## 4 7 22 9.947766 7.555787
## 8 10 26 21.744993 20.504349
## 20 14 26 37.474628 37.769100
## 26 15 54 41.407036 42.085287
## 31 17 50 49.271854 50.717663
## 37 19 46 57.136672 59.350038
## 39 20 32 61.069080 63.666225
## 40 20 48 61.069080 63.666225
## 42 20 56 61.069080 63.666225
```

```
dist.test <- cars.test$dist
par(mfrow=c(1,2))
plot(pred1b.test, dist.test)
abline(a=0, b=1, lty=2)
plot(pred1b.test, pred1a.test)
abline(a=0, b=1, lty=2)
```

On one hand, it is reassuring to see that removing part of the data has a very little impact on the predictions (right graph). On the other hand, the predictive performance of the model remains limited because of the natural variability of the data (left graph). Indeed, this model built with the training sample only explains 69 % of the variabilility of the new test sample.

```
cor.test <- cor(pred1a.test, dist.test)
R2.test <- cor.test^2
R2.test
```

`## [1] 0.6851758`

Let us see now how the model behaves with extreme values by defining as test sample the data points with the 5 smallest and the 5 largest speeds.

```
i.training <- c(6:45)
cars.training <- cars[i.training,]
cars.test <- cars[-i.training,]
dist.test <- cars.test$dist
pred1a.test <- predict(lm1, newdata=cars.test)
lm1.training <- lm(dist ~ speed, data=cars.training)
pred1b.test <- predict(lm1.training, newdata=cars.test)
par(mfrow=c(1,2))
plot(pred1b.test, dist.test)
abline(a=0, b=1, lty=2)
plot(pred1b.test, pred1a.test)
abline(a=0, b=1, lty=2)
```

We see that the model now underestimate the stopping distance for the largest speeds.

In conclusion, the model may be used with confidence for predicting the stopping distance for given speeds which are central values. Another model should probably be developed for predicting stopping distances for extreme speed values.

Imagine we aim to predict the stopping distance for a speed \(x\). Using the estimated coefficients of our polynomial model, the prediction will be \[\hat{f}(x) = \hat{c}_0 + \hat{c}_1x \] This prediction is an estimation since it is based on estimated coefficients \(\hat{c}_0\) and \(\hat{c}_1\). Then, the uncertainty on the prediction should be quantified by providing a confidence interval for \(f(x)\).

Since the model can be used for predicting the stopping distance for non extreme speeds, we will restrict ourselves to speeds between 6 and 23 mph.

```
alpha <- 0.05
df.new <- data.frame(speed=(6:23))
conf.dist <- predict(lm1, newdata = df.new, interval="confidence", level=1-alpha)
head(conf.dist)
```

```
## fit lwr upr
## 1 6.015358 -2.973341 15.00406
## 2 9.947766 1.678977 18.21656
## 3 13.880175 6.307527 21.45282
## 4 17.812584 10.905120 24.72005
## 5 21.744993 15.461917 28.02807
## 6 25.677401 19.964525 31.39028
```

A prediction interval for a new measured distance \(y=f(x)+e\) can also be computed. This prediction interval takes into account both the uncertainty on the predicted distance \(f(x)\) and the variability of the measure, represented in the model by the residual error \(e\).

```
pred.dist <- predict(lm1, newdata = df.new, interval="prediction", level=1-alpha)
head(pred.dist)
```

```
## fit lwr upr
## 1 6.015358 -26.187314 38.21803
## 2 9.947766 -22.061423 41.95696
## 3 13.880175 -17.956287 45.71664
## 4 17.812584 -13.872245 49.49741
## 5 21.744993 -9.809601 53.29959
## 6 25.677401 -5.768620 57.12342
```

Let us plot these two intervals.

```
df.new[c("fit","lwr.conf", "upr.conf")] <- conf.dist
df.new[c("lwr.pred", "upr.pred")] <- pred.dist[,2:3]
pl +
geom_ribbon(data=df.new, aes(x=speed, ymin=lwr.pred, ymax=upr.pred), alpha=0.1, inherit.aes=F, fill="blue") +
geom_ribbon(data=df.new, aes(x=speed, ymin=lwr.conf, ymax=upr.conf), alpha=0.2, inherit.aes=F, fill="#339900") +
geom_line(data=df.new, aes(x=speed, y=fit), colour="#339900", size=1)
```

These intervals are of very little interest and should not be used in practice.

Indeed, we can see that both the confidence interval and the prediction interval contain negative values for small or moderate speeds, which is obviously unrealistic??? We???ll see later how to tranform the model and/or the data in order to take into account some constraints about the data.

See how these intervals are computed.

We can expect to better describe the extreme values by using a polynomial of higher degree. Let us therefore fit a polynomial of degree 2 to the data.

```
lm2 <- lm(dist ~ speed + I(speed^2))
summary(lm2)
```

```
##
## Call:
## lm(formula = dist ~ speed + I(speed^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.720 -9.184 -3.188 4.628 45.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47014 14.81716 0.167 0.868
## speed 0.91329 2.03422 0.449 0.656
## I(speed^2) 0.09996 0.06597 1.515 0.136
##
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6532
## F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12
```

Surprisingly, the 3 \(t\)-tests seem to indicate that none of the coefficients is statistically significant. This result should be considered with caution. Indeed, it does not mean that the three coefficients can be replaced by 0 (i.e.??removed) in the model.

The difficulty of interpretating these p-values is due to the strong correlations that exist between the estimates \(\hat{c}_0\),

\(\hat{c}_1\) and \(\hat{c}_2\). It is recalled that the variance covariance matrix of the vector of estimates is \(\sigma^2(X^\prime X)^{-1}\). We can then easily derive the correlation matrix from covariance matrix.

```
X <- model.matrix(lm2)
S <- solve(t(X)%*%X)
d <- sqrt(diag(S))
R <- S/(d%*%t(d))
R
```

```
## (Intercept) speed I(speed^2)
## (Intercept) 1.0000000 -0.9605503 0.8929849
## speed -0.9605503 1.0000000 -0.9794765
## I(speed^2) 0.8929849 -0.9794765 1.0000000
```

We will see below how to use a sequence of orthogonal polynomials in order to get uncorrelated estimates.

The R-squared indicate that 66.7% of the variability of the data is explained by a polynomial of degree 2, instead of 65.1% with a polynomial of degree 1. On the other hand, the standard deviation of the residual error is 15.18 ft with this model instead of 15.38 ft with the previous one. These improvments are really to small to justify the use of a quadratic polynomial.

Nevertheless, plotting the data together with the fitted polynomial suggests that the extreme values are better predicted with a polynomial of degree 2 than with a straight line.

```
f <- function(x,c) coef(lm2)[1] + coef(lm2)[2]*x + coef(lm2)[3]*x^2
#f <- function(x,c) coef(lm2)[1]*x + coef(lm2)[2]*x^2
pl + geom_abline(intercept=coef(lm1)[1],slope=coef(lm1)[2],size=1, colour="#339900") +
stat_function(fun=f, colour="red", size=1)
```

The predictive performance of the model need to be assesed in order to confirm this property of the model. We can again use the 80% most central data points to build the model and test it on the 20% remaining data points.

```
lm2t <- lm(dist ~ speed + I(speed^2), data=cars.training)
f <- function(x,c) coef(lm2t)[1] + coef(lm2t)[2]*x + coef(lm2t)[3]*x^2
pl + stat_function(fun=f, colour="#339900", size=1) + geom_point(data=cars.test, aes(x=speed, y=dist), color="blue")
```

The shape of the polynomial is totally different when it is built using the training data only. That means that the model is quite unstable, depending strongly on some data values. Then, repeating the same experiment (under the same experimental conditions) would probably lead to significantly different results.

Such lack of reproducibility of the predictions reinforces our idea that this model should be rejected.

The stopping distance predicted by the model should be null when the speed is null. This constraint can easily be achieved by removing the intercept from the regression model: \[ f(x) = c_1 x + c_2 x^2 + \ldots + c_d x^d \] Let us fit a regression model of degree 2 without intercept to our data.

```
lm2.noint <- lm(dist ~ -1 + speed + I(speed^2))
coef(lm2.noint)
```

```
## speed I(speed^2)
## 1.23902996 0.09013877
```

We can check that the design matrix consists of the 2 columns \((x_j)\) and \((x_j^2)\).

```
X <- model.matrix(lm2.noint)
head(X)
```

```
## speed I(speed^2)
## 1 4 16
## 2 4 16
## 3 7 49
## 4 7 49
## 5 8 64
## 6 9 81
```

We can plot the data with the predicted response \(f(x)\), confidence interval for \(f(x)\) and prediction intervals for new observations \(y\).

```
df.new <- data.frame(speed=(0:27))
conf.dist <- predict(lm2.noint, newdata = df.new, interval="confidence", level=1-alpha)
pred.dist <- predict(lm2.noint, newdata = df.new, interval="prediction", level=1-alpha)
df.new[c("fit","lwr.conf", "upr.conf")] <- conf.dist
df.new[c("lwr.pred", "upr.pred")] <- pred.dist[,2:3]
pl +
geom_ribbon(data=df.new, aes(x=speed, ymin=lwr.pred, ymax=upr.pred), alpha=0.1, inherit.aes=F, fill="blue") +
geom_ribbon(data=df.new, aes(x=speed, ymin=lwr.conf, ymax=upr.conf), alpha=0.2, inherit.aes=F, fill="#339900") +
geom_line(data=df.new, aes(x=speed, y=fit), colour="#339900", size=1)
```

Confidence interval for \(f(x)\) now only contain positive values for any \(x>0\) and is reduce to 0 when \(x=0\).

On the other hand, prediction intervals for new observations still contain negative values. This is due to the fact that the residual error model is a *constant error model* \(y=f(x)+e\), assuming that the variance of the error \(e\) does not depend on the predicted value \(f(x)\). This hypothesis doe not reflect the true phenomena and should be rejected. We will see that an appropriate alternative may consist in transforming the data.

Instead of defining each component of the regression model `lm(dist ~ speed + I(speed^2) )`

, it is equivalent to define explicitly the regression model as a polynomial of degree 2

```
lm2.poly <- lm(dist ~ poly(speed, degree=2, raw=T))
M <- model.matrix(lm2.poly)
head(M)
```

```
## (Intercept) poly(speed, degree = 2, raw = T)1
## 1 1 4
## 2 1 4
## 3 1 7
## 4 1 7
## 5 1 8
## 6 1 9
## poly(speed, degree = 2, raw = T)2
## 1 16
## 2 16
## 3 49
## 4 49
## 5 64
## 6 81
```

Results obtained with both methods are absolutely identical

`print(coef(lm2.poly))`

```
## (Intercept) poly(speed, degree = 2, raw = T)1
## 2.4701378 0.9132876
## poly(speed, degree = 2, raw = T)2
## 0.0999593
```

We have seen that this model leads to highly correlated estimators of the coefficients of the model which make it difficult the interpretation of the results. The option `raw=FALSE`

(which is the default) allows to use an orthogonal basis of polynomial.

Let us see the results obtained with a polynomial of degree 1

```
lm1.ortho <- lm(dist ~ poly(speed, degree=1))
M1 <- model.matrix(lm1.ortho)
head(M1)
```

```
## (Intercept) poly(speed, degree = 1)
## 1 1 -0.3079956
## 2 1 -0.3079956
## 3 1 -0.2269442
## 4 1 -0.2269442
## 5 1 -0.1999270
## 6 1 -0.1729098
```

`t(M1)%*%M1`

```
## (Intercept) poly(speed, degree = 1)
## (Intercept) 5.000000e+01 1.665335e-15
## poly(speed, degree = 1) 1.665335e-15 1.000000e+00
```

Here, the estimated intercept \(\hat{c}_0\) is the empirical mean \(\bar{y}=\) 42.98

`coef(lm1.ortho)`

```
## (Intercept) poly(speed, degree = 1)
## 42.9800 145.5523
```

Adding a term of degree 2 keeps the first two column of the design matrix unchanged

```
lm2.ortho <- lm(dist ~ poly(speed, degree=2))
M2 <- model.matrix(lm2.ortho)
head(M2)
```

```
## (Intercept) poly(speed, degree = 2)1 poly(speed, degree = 2)2
## 1 1 -0.3079956 0.41625480
## 2 1 -0.3079956 0.41625480
## 3 1 -0.2269442 0.16583013
## 4 1 -0.2269442 0.16583013
## 5 1 -0.1999270 0.09974267
## 6 1 -0.1729098 0.04234892
```

`t(M2)%*%M2`

```
## (Intercept) poly(speed, degree = 2)1
## (Intercept) 5.000000e+01 1.665335e-15
## poly(speed, degree = 2)1 1.665335e-15 1.000000e+00
## poly(speed, degree = 2)2 1.110223e-16 5.551115e-17
## poly(speed, degree = 2)2
## (Intercept) 1.110223e-16
## poly(speed, degree = 2)1 5.551115e-17
## poly(speed, degree = 2)2 1.000000e+00
```

Let us look at the results obtained with this model:

`summary(lm2.ortho)`

```
##
## Call:
## lm(formula = dist ~ poly(speed, degree = 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.720 -9.184 -3.188 4.628 45.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.980 2.146 20.026 < 2e-16 ***
## poly(speed, degree = 2)1 145.552 15.176 9.591 1.21e-12 ***
## poly(speed, degree = 2)2 22.996 15.176 1.515 0.136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6532
## F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12
```

These results give rise to several comments:

- the first two estimated coefficients \(\hat{c}_0\) and \(\hat{c}_1\) are those obtained previously with a polynomial of degree 1
- the p-values of the \(t\)-test can now properly be interpreted and confirm our previous analyis: the first two coefficients are statistically significant. The need of a quadratic term is much less obvious (\(p=0.136\)). Furthermore, confidence intervals can be derived for each coefficient

`confint(lm2.ortho)`

```
## 2.5 % 97.5 %
## (Intercept) 38.662361 47.29764
## poly(speed, degree = 2)1 115.021940 176.08257
## poly(speed, degree = 2)2 -7.534552 53.52608
```

- The estimated variance of the residuals \(\hat{\sigma}^2\), the R-squared value and the adjusted R-squared value are identical when orthogonal or non orthogonal polynomial are used. Indeed, even if the parameterization is different, the two models are the same polynomials of degree 2 and therefore lead to the same predictions.

Several quantitative criteria and several statistical tests are available for comparing models.

First, we have seen that \(t\)-test are performed for each coefficient of the model

`summary(lm1.ortho)$coefficients`

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.9800 2.175002 19.76090 1.061332e-24
## poly(speed, degree = 1) 145.5523 15.379587 9.46399 1.489836e-12
```

`summary(lm2.ortho)$coefficients`

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.98000 2.14622 20.025902 1.188618e-24
## poly(speed, degree = 2)1 145.55226 15.17607 9.590906 1.211446e-12
## poly(speed, degree = 2)2 22.99576 15.17607 1.515265 1.364024e-01
```

Based on these results, we can conclude that \(c_0 \neq 0\) and \(c_1 \neq 0\). On the other hand, the data does not allow us to conclude that \(c_2 \neq 0\).

By construction, the F-statistics provided with the summary of a fitted model is the F-statistics of an anova for testing this fitted model against the model without explanatory variable `lm0`

. This test is known as the overall F-test for regression.

`anova(lm0, lm1)`

```
## Analysis of Variance Table
##
## Model 1: dist ~ 1
## Model 2: dist ~ speed
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 32539
## 2 48 11354 1 21186 89.567 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(lm0, lm2)`

```
## Analysis of Variance Table
##
## Model 1: dist ~ 1
## Model 2: dist ~ speed + I(speed^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 32539
## 2 47 10825 2 21714 47.141 5.852e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Not surprisingly, both `lm1`

and `lm2`

are clearly preferred to `lm0`

. As already stated above the \(t\)-test and the \(F-test\) for testing `lm0`

against `lm1`

are equivalent.

We can also perform an ANOVA to test `lm1`

against `lm2`

`anova(lm1, lm2)`

```
## Analysis of Variance Table
##
## Model 1: dist ~ speed
## Model 2: dist ~ speed + I(speed^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 11354
## 2 47 10825 1 528.81 2.296 0.1364
```

`lm1`

is slightly preferred to `lm2`

for this criteria.

The likelihood ratio test can be used for testing 2 nested models. Since our 3 polynomial models are nested, we can use the LRT fo testing one of these model against another one.

log-likelihoods \(\llike_0(\hat\theta_0)\), \(\llike_1(\hat\theta_1)\) and \(\llike_2(\hat\theta_2)\) are available, where \(\hat\theta_k\) is the maximum likelihood estimate of the parameter of model `lmk`

for \(k=0, 1, 2\).

```
logLik(lm0)
logLik(lm1)
logLik(lm2)
```

`## 'log Lik.' -232.9012 (df=2)`

`## 'log Lik.' -206.5784 (df=3)`

`## 'log Lik.' -205.386 (df=4)`

For testing `lm0`

against `lm1`

for instance, we compute the test statistics \(2(\llike_1(\hat\theta_1) - \llike_0(\hat\theta_0)\) which has a \(chi^2\) distribution with 1 degree of freedom under `lm0`

.

```
dl <- 2*as.numeric(logLik(lm1) - logLik(lm0))
1-pchisq(dl,1)
```

`## [1] 3.995693e-13`

The test statistics \(2(\llike_2(\hat\theta_2) - \llike_1(\hat\theta_1)\) is used for testing `lm1`

against `lm2`

```
dl <- 2*as.numeric(logLik(lm2) - logLik(lm1))
1-pchisq(dl,1)
```

`## [1] 0.122521`

We can see with this example that the ANOVA and the LRT give very similar results and lead to the same conclusion: the constant model can be rejected with high confidence (i.e.??the stopping distance depends on the speed), while a model with both a linear and a quadratic component is not preferred to a model with a linear component only (i.e.??we don???t reject the hypothesis that the stopping distance increase linearly with the speed).

Information criteria such as the Akaike information criteria (AIC) and the Bayesian information criateria (BIC) can also be used for comparing models which are not necessarily nested.

`AIC(lm0,lm1,lm2)`

```
## df AIC
## lm0 2 469.8024
## lm1 3 419.1569
## lm2 4 418.7721
```

`BIC(lm0,lm1,lm2)`

```
## df BIC
## lm0 2 473.6265
## lm1 3 424.8929
## lm2 4 426.4202
```

Models with lowest AIC and/or BIC will be preferred. Here, both criteria agree for rejecting `lm0`

with high confidence. AIC has a very slight preference for `lm2`

while BIC penalizes a little bit more polynomial with higher degree and prefers `lm1`

. Nevertheless, these differences are not large enough for selecting definitely any of these 2 models.

When fitting a linear regression model one assumes that there is a linear relationship between the response variable and each of the explanatory variables. However, in many situations there may instead be a non-linear relationship between the variables. This can sometimes be remedied by applying a suitable transformation to some (or all) of the variables, such as power transformations or logarithms.

In addition, transformations can be used to correct violations of model assumptions such as constant error variance and normality.

Applying suitable transformations to the original data prior to performing regression may be sufficient to make linear regression models appropriate for the transformed data.

Let???s plot the data using semi-log scales: we use a logarithmic scale for the x-axis and a linear scale for the y-axis in the first plot, a linear scale for the x-axis and a logarithmic scale for the y-axis in the second plot.

```
library(gridExtra)
pl1 <- pl + scale_x_log10()
pl2 <- pl + scale_y_log10()
grid.arrange(pl1, pl2, nrow=1)
```

None of thee two plots show any linear trend. Then, let us try with a log-log transformation.

`print(pl + scale_x_log10() + scale_y_log10() )`

Looks much better!

We can then try to fit the following model on the data: \[ \log(y_j) = c_0 + c_1\log(x_j) + e_j \] Another representation of this model exhibits a power relationship between the explanatory and response variables \[ y_j = A \, x_j^{c_1} \, e^{e_j} \] where \(A=e^{c_0}\). This model ensures that the response \(y\) can only take positive values. Furthermore, the variability of the response increases with the explanatory variable. Indeed, for small \(e_j\), \[ y_j \approx A \, x_j^{c_1}(1 + e_j)\]

```
lm1.log <- lm(log(dist) ~ log(speed))
summary(lm1.log)
```

```
##
## Call:
## lm(formula = log(dist) ~ log(speed))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00215 -0.24578 -0.02898 0.20717 0.88289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.7297 0.3758 -1.941 0.0581 .
## log(speed) 1.6024 0.1395 11.484 2.26e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4053 on 48 degrees of freedom
## Multiple R-squared: 0.7331, Adjusted R-squared: 0.7276
## F-statistic: 131.9 on 1 and 48 DF, p-value: 2.259e-15
```

This model now explains 73% of the (transformed) data.

Let us plot the data with the regression line and the residuals

```
par(mfrow=c(1,2))
plot(log(speed),log(dist))
abline(a=coef(lm1.log)[1], b=coef(lm1.log)[2], col="#339900")
plot(lm1.log, which=1)
```

`par(mfrow=c(1,1))`

These two diagnostic plots show that the model fit quite well the data.

In the next plot, we compare the fits obtained when the model is built using the complete data (green) and when only the training sample is used (red).

```
i.training <- c(6:45)
lm1.training <- lm(log(dist) ~ log(speed), data=cars[i.training,])
ggplot(cars, aes(x=log(speed), y=log(dist))) + geom_point(size=2, colour="#993399") +
geom_point(data=cars[-i.training,], size=2, colour="red") + geom_abline(intercept=coef(lm1.log)[1], slope=coef(lm1.log)[2], colour="#339900", size=1) +
geom_abline(intercept=coef(lm1.training)[1], slope=coef(lm1.training)[2], colour="red", size=1)+
xlab("speed (mph)") + ylab("stopping distance (ft)")
```

These two models are very similar: the estimation is relatively insensible to extreme values.

Based on the previous analysis, the model can be used for predicting stopping distances for a wide range of speeds. Let us compute and display the confidence interval for the regression line (of the model `lm(log(dist) ~ log(speed)))`

and the prediction interval for new observed log-distances.

```
alpha <- 0.05
df.new <- data.frame(speed=(3:30))
conf.dist <- predict(lm1.log, newdata = df.new, interval="confidence", level=1-alpha)
pred.dist <- predict(lm1.log, newdata = df.new, interval="prediction", level=1-alpha)
df.new[c("fit","lwr.conf", "upr.conf")] <- conf.dist
df.new[c("lwr.pred", "upr.pred")] <- pred.dist[,2:3]
ggplot(cars) + geom_point(aes(x=log(speed), y=log(dist)), size=2, colour="#993399") +
geom_ribbon(data=df.new, aes(x=log(speed), ymin=lwr.pred, ymax=upr.pred), alpha=0.1, inherit.aes=F, fill="blue") +
geom_ribbon(data=df.new, aes(x=log(speed), ymin=lwr.conf, ymax=upr.conf), alpha=0.2, inherit.aes=F, fill="#339900") +
geom_line(data=df.new, aes(x=log(speed), y=fit), colour="#339900", size=1) +
xlab("log-speed (mph)") + ylab("log-stopping distance (ft)")
```

We can derive from these intervals a confidence interval and a prediction interval for the non transformed data (i.e.??using the original scale).

```
ggplot(cars) + geom_point(aes(x=speed, y=dist), size=2, colour="#993399") +
geom_ribbon(data=df.new, aes(x=speed, ymin=exp(lwr.pred), ymax=exp(upr.pred)), alpha=0.1, inherit.aes=F, fill="blue") +
geom_ribbon(data=df.new, aes(x=speed, ymin=exp(lwr.conf), ymax=exp(upr.conf)), alpha=0.2, inherit.aes=F, fill="#339900") +
geom_line(data=df.new, aes(x=speed, y=exp(fit)), colour="#339900", size=1) +
xlab("speed (mph)") + ylab("stopping distance (ft)")
```

The general trend describes quite well how the stopping distance increases with the speed. Nevertheles, the residual error model seems to over-estimate the variability of the data for the greatest speeds.

We will investigate now if the numerical criteria for model comparison confirm that this model based on a log-log transformation should be preferred to the polynomial models.

Let u compute first the log-likelihood for this model.

`logLik(lm1.log)`

`## 'log Lik.' -24.76592 (df=3)`

**WARNING:** This value of the log-likelihood cannot be compared a it is to the log-likelihoods computed previously. Indeed, by definition, the log-likelihood for model `lm1.log`

is the probability density function (pdf) of \(log(y)\).

\[
\begin{aligned}
\llike_{\rm lm1.log} &= \llike(\hat{\theta}_{\rm lm1.log}) \\
&= \pmacro (\log(y) ; \hat{\theta}_{\rm lm1.log})
\end{aligned}
\] where \(\hat{\theta}_{\rm lm1.log}\) is the ML estimator of \(\theta\) for model `lm1.log`

.

Then, the pdf of \(log(y)\) under this model cannot be directly compared to the pdf of \(y\) under another model.

We therefore need to compute the pdf of \(y\) under this model. For any \(\theta\) and any \(1 \leq j \leq n\),

\[ \begin{aligned} \pmacro(y_j ; \theta) &= \pmacro(\log(y_j) ; \theta)\times \frac{1}{y_j} \end{aligned} \] and \[ \log (\pmacro(y ; \theta)) = \log (\pmacro (\log(y) ; \theta)) - \sum_{j=1}^n \log(y_j) \]

```
#new.cars <- data.frame(dist=cars$dist/100, speed=cars$speed/100)
#new.lm1 <- lm(dist ~ speed, data = new.cars)
#summary(new.lm1)
#logLik(new.lm1)
#logLik(new.lm1) + n*log(1/100)
```

```
# log.y1 <- predict(lm1.log)
# RSS <- sum((log(y)-log.y1)^2)
# -n/2 -n/2*log(2*pi*RSS/n)
logLik(lm1.log) - sum(log(dist))
```

`## 'log Lik.' -201.5613 (df=3)`

The information criteria should also take into account this transformation

`AIC(lm1.log) - 2*sum(log(1/dist))`

`## [1] 409.1226`

`BIC(lm1.log) - 2*sum(log(1/dist))`

`## [1] 414.8587`

Both criteria prefer this model based on a log-log transformation than any polynomial model without any transformation.

**Remark:** Since models `lm1`

and `lm1.log`

have the same number of parameters, log-likelihoods without penalisation can be used for comparing these 2 models.