$$ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \def\pmacro{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\simh0{\mathop{\sim}_{H_0}} \def\df{\texttt{df}} \def\res{e} \def\xomega{x} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \newcommand{\Rset}{\mbox{$\mathbb{R}$}} \def\param{\theta} \def\setparam{\Theta} \def\xnew{x_{\rm new}} \def\fnew{f_{\rm new}} \def\ynew{y_{\rm new}} \def\nnew{n_{\rm new}} \def\enew{e_{\rm new}} \def\Xnew{X_{\rm new}} \def\hfnew{\widehat{\fnew}} \def\degree{m} \def\nbeta{d} \newcommand{\limite}[1]{\mathop{\longrightarrow}\limits_{#1}} \def\ka{k{\scriptstyle a}} \def\ska{k{\scriptscriptstyle a}} \def\kel{k{\scriptstyle e}} \def\skel{k{\scriptscriptstyle e}} \def\cl{C{\small l}} \def\Tlag{T\hspace{-0.1em}{\scriptstyle lag}} \def\sTlag{T\hspace{-0.07em}{\scriptscriptstyle lag}} \def\Tk{T\hspace{-0.1em}{\scriptstyle k0}} \def\sTk{T\hspace{-0.07em}{\scriptscriptstyle k0}} \def\thalf{t{\scriptstyle 1/2}} \newcommand{\Dphi}[1]{\partial_\pphi #1} \def\asigma{a} \def\pphi{\psi} \newcommand{\stheta}{{\theta^\star}} \newcommand{\htheta}{{\widehat{\theta}}} $$


1 Introduction

The cars data give the speed of cars and the distances taken to stop (note that the data were recorded in the 1920s).

The data frame consists of 50 observations (rows) and 2 variables (columns): speed (mph), stopping distance (ft).

Let us plot this data.

library(ggplot2)
theme_set(theme_bw())
pl <- ggplot(cars, aes(x=speed, y=dist)) + geom_point(size=2, colour="#993399") + 
  xlab("Speed (mph)") + ylab("Stopping distance (ft)")  
print(pl)

The scatter plot above suggests a linearly increasing relationship between the explanatory and response variables. We are therefore assuming a linear trend which is mathematically represented by a polynomial of degree 1:

\[y_j = c_0 + c_1 x_j + e_j \quad ; \quad 1 \leq j \leq n\]
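Before fitting the model formally, we can overlay the least-squares line on the scatter plot; here is a minimal sketch reusing the plot object pl defined above (geom_smooth with method = "lm" fits the same degree-1 polynomial):

# overlay the fitted least-squares line on the scatter plot
print(pl + geom_smooth(method = "lm", se = FALSE))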

Let us fit this model to the data using the function lm:

attach(cars)
lm1 <- lm(dist ~ speed)

The function summary computes and returns a list of summary statistics of the fitted linear model given in lm1:

summary(lm1)
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
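The quantities displayed by summary can also be extracted programmatically; for instance, coef applied to the summary object returns the coefficient table as a numeric matrix:

# coefficient table (estimates, standard errors, t values, p-values)
coef(summary(lm1))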

Some diagnostic plots are available

par(mfrow=c(1,2))
plot(lm1, which=c(1,2))

as well as the log-likelihood and some information criteria:

logLik(lm1)
## 'log Lik.' -206.5784 (df=3)
AIC(lm1)
## [1] 419.1569
BIC(lm1)
## [1] 424.8929

Let us see what all these numbers and diagnostic plots are and how they are calculated for any polynomial model.


2 Fitting a polynomial model as a linear model

We first need to rewrite the polynomial regression model \[ y_j = c_0 + c_1 x_j + c_2 x^2_j + \ldots + c_{\degree} x^{\degree}_j + e_j \] in matrix form \[y = X\beta + e \] where \[ y = \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right) \quad ,\quad X = \left( \begin{array}{cccc} 1 & x_1 & \cdots & x_1^{\degree} \\ 1 & x_2 & \cdots & x_2^{\degree} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^{\degree} \end{array}\right) \quad , \quad \beta = \left( \begin{array}{c} c_0 \\ c_1 \\ \vdots \\ c_{\degree} \end{array}\right) \quad , \quad e = \left( \begin{array}{c} e_1 \\ e_2 \\ \vdots \\ e_n \end{array}\right) \] Then, \(X\beta\) is the vector of predicted values and \(e\) the vector of residual errors.
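In R, this design matrix is built automatically by lm and can be retrieved with model.matrix; as a quick check for the degree-1 model fitted above:

# design matrix of lm1: a column of ones and the speed column
head(model.matrix(lm1))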


2.1 The least squares method

A method for automatically choosing the "best" parameters \((\hat{c}_0,\hat{c}_1)\) consists in minimizing the sum of squared errors of prediction, i.e. the residual sum of squares (RSS):

\[ \begin{aligned} RSS &= \sum_{j=1}^n (y_j - f(x_j))^2 \\ &= \sum_{j=1}^n \left( y_j - \sum_{k=0}^{\degree} c_k x_j^k \right)^2 \\ &= \|y - X\beta \|^2 \end{aligned} \]

Then, \[ \begin{aligned} \hat{\beta} &= \argmin{\beta}(\|y - X\beta \|^2) \\ &= (X^\prime X)^{-1}X^\prime y \end{aligned} \]

y <- dist
x <- speed
n <- length(y)
m <- 1
X <- NULL
for (k in (0:m))
  X <- cbind(X, x^k)
beta.est <- solve(t(X)%*%X)%*%t(X)%*%y 
beta.est
##            [,1]
## [1,] -17.579095
## [2,]   3.932409
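As a sanity check, these hand-computed estimates should match (up to numerical precision) the coefficients returned by lm:

# compare the explicit least-squares solution with the coefficients of lm1
cbind(beta.est, coef(lm1))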


2.2 Estimation of the residual error variance

We will assume that the residuals \((e_j)\) are independent random variables with mean 0 and variance \(\sigma^2\): \[ \esp{e_j} = 0 \quad ; \quad \esp{e^2_j} = \sigma^2 \]

Let \(\nbeta = \degree+1\) be the length of vector \(\beta\). Then, an unbiased estimate of \(\sigma^2\) is \[ \begin{aligned} \hat{\sigma}^2 &= \frac{1}{n-\nbeta} \|y - X\hat{\beta} \|^2 \\ &= \frac{1}{n-\nbeta} \sum_{j=1}^n (y_j - \hat{c}_0 - \hat{c}_1 x_j - \ldots - \hat{c}_{\degree} x_j^{\degree})^2 \end{aligned} \]

res <- y-X%*%beta.est
d <- m+1
sigma2.est <- sum(res^2)/(n-d)
sigma2.est
## [1] 236.5317

The standard deviation of the residual errors, called the residual standard error in R, is the square root of this estimated variance:

sqrt(sigma2.est)
## [1] 15.37959
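This is the value reported as "Residual standard error" in the summary above; it is also stored in the summary object:

# residual standard error stored in the summary object
summary(lm1)$sigma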


2.3 The standard errors of the estimates

We can remark that \[ \begin{aligned} \hat{\beta} &= (X^\prime X)^{-1}X^\prime y \\ &= \beta + (X^\prime X)^{-1}X^\prime e \end{aligned} \]

Then, since \(\esp{e}=0\), \[ \esp{\hat{\beta}} = \beta\]

and \[ \begin{aligned} \var{\hat\beta} &= \var{(X^\prime X)^{-1}X^\prime e} \\ &= (X^\prime X)^{-1}X^\prime \var{e} X (X^\prime X)^{-1} \\ &= \sigma^2 (X^\prime X)^{-1} \end{aligned} \]

We can therefore use this formula to compute the variance-covariance matrix of \((\hat{c}_0, \hat{c}_1)\):

( V.beta <- sigma2.est*solve(t(X)%*%X))
##           [,1]       [,2]
## [1,] 45.676514 -2.6588234
## [2,] -2.658823  0.1726509

Then, the standard error of each component of \(\hat\beta\) is defined as the square root of the diagonal elements of the variance-covariance matrix \(V=\var{\hat\beta}\): \[{\rm se}(\hat\beta_k) = \sqrt{V_{k k}} \]

(se.beta <- sqrt(diag(V.beta)))
## [1] 6.7584402 0.4155128
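The same variance-covariance matrix and standard errors can be obtained directly from the fitted model with vcov:

# variance-covariance matrix of the estimates and associated standard errors
vcov(lm1)
sqrt(diag(vcov(lm1)))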


2.4 Statistical tests for the model parameters

We will assume that the residuals \((e_j)\) are independent and normally distributed with mean 0 and variance \(\sigma^2\): \[ e_j \iid {\cal N}(0 \ , \ \sigma^2) \]

Then, \(\hat{\beta}\) is also normally distributed:

\[ \hat{\beta} \sim {\cal N}(\beta \ , \ \sigma^2 (X^\prime X)^{-1}) \] and, for \(k=1, 2, \ldots , \nbeta\), \[t_k = \frac{\hat{\beta}_k - \beta_k}{{\rm se}(\hat{\beta}_k)}\] follows a \(t\)-distribution with \(n-d\) degrees of freedom.

For each component \(\beta_k\) of \(\beta\), we can then perform a \(t\)-test (known as the Wald test) to test \[ H_{k,0} : ``\beta_k = 0" \quad \text{versus} \quad H_{k,1}: ``\beta_k \neq 0" \]

Indeed, under the null hypothesis \(H_{k,0}\), \(t_{{\rm stat}, k} = {\hat{\beta}_k}/{{\rm se}(\hat{\beta}_k)}\) follows a \(t\)-distribution with \(n-d\) degrees of freedom.

The \(p\)-value for this test is therefore

\[\begin{aligned} p_k &= \prob{|t_{n-d}| \geq |t_{{\rm stat}, k}^{\rm obs}| } \\ &= 2(1 - \prob{t_{n-d} \leq |t_{{\rm stat}, k}^{\rm obs}| } ) \end{aligned} \]

t.stat <- beta.est/se.beta
res.beta <- cbind(beta.est, se.beta, t.stat, 2*(1 - pt(abs(t.stat),n-d)))
colnames(res.beta) <- c("beta.est", "se.beta", "t.stat", "P(>|t|)")
res.beta
##        beta.est   se.beta    t.stat      P(>|t|)
## [1,] -17.579095 6.7584402 -2.601058 1.231882e-02
## [2,]   3.932409 0.4155128  9.463990 1.489919e-12


2.5 Confidence interval for the model parameters

The function confint computes confidence intervals for \(c_0\) and \(c_1\) (default level = 95%):

confint(lm1)
##                  2.5 %    97.5 %
## (Intercept) -31.167850 -3.990340
## speed         3.096964  4.767853

Using the fact that \(t_k\) follows a \(t\)-distribution with \(n-\nbeta\) degrees of freedom, we can build a confidence interval for \(\beta_k\) of level \(1-\alpha\):

\[{\rm CI}_{1-\alpha}(\beta_k) = [\hat{\beta}_k + qt_{\alpha/2, n-d}\ {\rm se}(\hat{\beta}_k) \ , \ \hat{\beta}_k + qt_{1-\alpha/2, n-d}\ {\rm se}(\hat{\beta}_k)]\] where \(qt_{\alpha/2, n-d}\) and \(qt_{1-\alpha/2, n-d}\) are the quantiles of order \(\alpha/2\) and \(1-\alpha/2\) for a \(t\)-distribution with \(n-\nbeta\) df.

Indeed, we can easily check that \(\prob{{\rm CI}_{1-\alpha}(\beta_k) \ni \beta_k} = 1-\alpha\).

level <- 0.95
alpha <- 1 - level
cbind(beta.est + se.beta*qt(alpha/2,n-d) , beta.est + se.beta*qt(1-alpha/2,n-d) )
##            [,1]      [,2]
## [1,] -31.167850 -3.990340
## [2,]   3.096964  4.767853
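This coverage property can also be illustrated by simulation. Here is a minimal sketch: we treat the estimates obtained above as hypothetical true values (beta.true, sigma.true), repeatedly simulate data from the model, and compute the proportion of simulated 95% confidence intervals that contain the true slope; this proportion should be close to 0.95.

# Monte Carlo illustration of the coverage of the 95% CI for the slope c1
set.seed(1)                      # for reproducibility
beta.true <- beta.est            # hypothetical "true" coefficients
sigma.true <- sqrt(sigma2.est)   # hypothetical "true" residual standard deviation
L <- 1000                        # number of simulated data sets
covered <- logical(L)
for (l in 1:L) {
  y.sim <- X %*% beta.true + rnorm(n, 0, sigma.true)
  b <- solve(t(X) %*% X) %*% t(X) %*% y.sim
  s2 <- sum((y.sim - X %*% b)^2) / (n - d)
  se1 <- sqrt(s2 * solve(t(X) %*% X)[2, 2])
  ci <- b[2] + c(qt(0.025, n - d), qt(0.975, n - d)) * se1
  covered[l] <- (ci[1] <= beta.true[2]) & (beta.true[2] <= ci[2])
}
mean(covered)                    # empirical coverage, expected to be close to 0.95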


2.6 Coefficient of determination

The multiple R-squared \(R^2\) is the proportion of variation in the response variable that has been explained by the model. Using the fact that \[ \|y - \bar{y}\|^2 = \|X\hat{\beta} - \bar{y}\|^2 + \|y - X\hat{\beta} \|^2 \]

\[ \begin{aligned} R^2 &= \frac{\|X\hat{\beta} - \bar{y}\|^2}{\|y - \bar{y}\|^2} \\ &= 1 - \frac{\|y - X\hat{\beta} \|^2}{\|y - \bar{y}\|^2} \end{aligned} \]

1- sum((y-X%*%beta.est)^2)/sum((y-mean(y))^2)
## [1] 0.6510794
sum((X%*%beta.est - mean(y))^2)/sum((y-mean(y))^2)
## [1] 0.6510794

By construction, adding more predictors to the model, i.e. increasing the degree of the polynomial, always increases the R-squared value. The adjusted R-squared penalizes this effect by normalizing each term by its associated degrees of freedom.

\[ \begin{aligned} R^2_{\rm adj} &= 1 - \frac{\|y - X\hat{\beta} \|^2/(n-\nbeta)}{\|y - \bar{y}\|^2/(n-1)} \end{aligned} \] The R-squared is a purely descriptive statistic. The adjusted R-squared should preferably be used to compare the explanatory power of models built from the same dataset.

1- sum((y-X%*%beta.est)^2)/(n-d)/(sum((y-mean(y))^2)/(n-1))
## [1] 0.6438102
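Both the multiple and the adjusted R-squared are stored in the summary object and can be checked directly:

# multiple and adjusted R-squared as returned by summary
c(summary(lm1)$r.squared, summary(lm1)$adj.r.squared)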


2.7 F-test of the overall significance

An F-test is also performed to test whether at least one of the coefficients \(c_1, c_2, \ldots , c_{\degree}\) is nonzero: \[ \begin{aligned} H_0 &: \quad (c_1, c_2, \cdots ,c_{\degree}) = (0, 0, \cdots, 0) \\ H_1 &: \quad (c_1, c_2, \cdots ,c_{\degree}) \neq (0, 0, \cdots, 0) \end{aligned} \]

The test statistic for testing \(H_0\) against \(H_1\) is \[ \begin{aligned} F_{\rm stat} &= \frac{\|X\hat{\beta} - \bar{y}\|^2/(\nbeta-1)}{\|y - X\hat{\beta} \|^2/(n-\nbeta)} \end{aligned} \]

F.stat <- (sum((X%*%beta.est - mean(y))^2)/(d-1))/(sum((y-X%*%beta.est)^2)/(n-d))
F.stat
## [1] 89.56711

Under the null hypothesis, the test statistic \(F_{\rm stat}\) has an F distribution with \((\nbeta-1,n-\nbeta)\) degrees of freedom. The \(p\)-value of the F-test is therefore \[\text{p-value(F-test)} = \prob{F_{\nbeta-1,n-\nbeta} > F_{\rm stat}}=1- \prob{F_{\nbeta-1,n-\nbeta} \leq F_{\rm stat}}\]

1-pf(F.stat,d-1,n-d)
## [1] 1.489919e-12

Remark: the t-test and the F-test are equivalent for linear models with only one predictor. In the case of polynomial regression of degree \(m=1\), either test can be used for testing whether \(c_1=0\). Indeed, \[ \begin{aligned} F_{\rm stat} &= \frac{\|\hat{c}_0 + \hat{c}_1 x - \bar{y}\|^2}{\|y - \hat{c}_0 - \hat{c}_1 x\|^2/(n-2)} \\ &= \frac{\hat{c}_1^2 \|x - \bar{x}\|^2}{\hat{\sigma}^2} \\ &= \frac{\hat{c}_1^2}{se^2(\hat{c}_1)} \\ &= t_{\rm stat}^2 \end{aligned} \]

c(F.stat,  t.stat[2]^2)
## [1] 89.56711 89.56711

Furthermore, if \(t_{\rm stat}\) has a \(t\) distribution with \(n-2\) df, then \(t_{\rm stat}^2\) has an \(F\) distribution with \((1,n-2)\) df. Both p-values are therefore equal.
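We can check numerically that the two p-values coincide, using the overall F statistic and the t statistic for the slope computed above:

# p-value of the t-test for c1 and p-value of the F-test: identical
c(2*(1 - pt(abs(t.stat[2]), n-d)), 1 - pf(F.stat, d-1, n-d))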


2.8 The residuals

A summary of the distribution of the residuals \(y_j - \hat{f}(x_j)\) includes the range, the mean and the quartiles:

f.est <- X%*%beta.est
summary(y-f.est)
##        V1         
##  Min.   :-29.069  
##  1st Qu.: -9.525  
##  Median : -2.272  
##  Mean   :  0.000  
##  3rd Qu.:  9.215  
##  Max.   : 43.201


3 Some diagnostic plots

Several diagnostic plots are available. The first one plots the residuals versus the predicted (or fitted) values, together with a smooth line.

plot(lm1, which=1)

We can reproduce this plot:

lo2 <- loess(resid(lm1) ~ fitted(lm1), degree = 1, span=0.8)
plot(fitted(lm1),resid(lm1))
lines(fitted(lm1),predict(lo2), col='red', lwd=2)
abline(a=0, b=0, lty=2)

The second diagnostic plot is a normal QQ plot.

plot(lm1, which=2)

The QQ plot is obtained by plotting the standardized residuals (i.e. the residuals divided by their standard deviation) versus the theoretical quantiles of order \(1/(n+1)\), \(2/(n+1)\), \(\ldots\), \(n/(n+1)\). If the residuals are normally distributed, then the points should be randomly distributed around the line \(y=x\).

plot(qnorm((1:n)/(n+1)),sort(resid(lm1)/sd(resid(lm1))), xlab="theoretical quantiles", ylab="standardized residuals")
abline(a=0, b=1, lty=2)


4 Model comparison

4.1 ANOVA

Let lm1 and lm2 be two nested linear models with, respectively, \(\nbeta_1\) and \(\nbeta_2\) coefficients.

Let \(\hat{y}_1\) and \(\hat{y}_2\) be the predicted values under, respectively, lm1 and lm2. Cochran's theorem states that \[ \|y - \hat{y}_1 \|^2 = \|\hat{y}_2 - \hat{y}_1\|^2 + \|y - \hat{y}_2 \|^2 \]

Then, the statistic used for the test is

\[ \begin{aligned} F_{\rm stat} &= \frac{\text{explained variance}}{\text{unexplained variance}}\\ &= \frac{\|\hat{y}_2 - \hat{y}_1\|^2/(\nbeta_2 - \nbeta_1)}{\|y - \hat{y}_2 \|^2/(n-\nbeta_2)} \end{aligned} \]

Under model lm1, the test statistic \(F_{\rm stat}\) follows an \(F\) distribution with \((\nbeta_2-\nbeta_1 , n-\nbeta_2)\) df.

Let lm1 and lm2 be the polynomial models of degree 1 and 2, respectively.

lm2 <- lm(dist ~ poly(speed, degree=2))
lm2
## 
## Call:
## lm(formula = dist ~ poly(speed, degree = 2))
## 
## Coefficients:
##              (Intercept)  poly(speed, degree = 2)1  
##                    42.98                    145.55  
## poly(speed, degree = 2)2  
##                    23.00

We can then perform an ANOVA to test lm1 against lm2:

anova(lm1, lm2)
## Analysis of Variance Table
## 
## Model 1: dist ~ speed
## Model 2: dist ~ poly(speed, degree = 2)
##   Res.Df   RSS Df Sum of Sq     F Pr(>F)
## 1     48 11354                          
## 2     47 10825  1    528.81 2.296 0.1364

Let us check these results:

y1 <- predict(lm1)
RSS1 <- sum((y-y1)^2)

y2 <- predict(lm2)
RSS2 <- sum((y-y2)^2)

ESS2 <- sum((y2-y1)^2)
c(RSS1, RSS2 + ESS2)
## [1] 11353.52 11353.52
d1 <- 2 ; d2 <- 3
F.stat <- (ESS2/(d2-d1))/(RSS2/(n-d2))
c(RSS2, ESS2, F.stat, 1-pf(F.stat,d2-d1,n-d2))
## [1] 1.082472e+04 5.288051e+02 2.296027e+00 1.364024e-01


4.2 Likelihood

The function logLik can be used for computing the log-likelihood under lm1 or lm2:

logLik(lm1)
## 'log Lik.' -206.5784 (df=3)
logLik(lm2)
## 'log Lik.' -205.386 (df=4)

These values of the log-likelihood implicitly assume that the residual errors are independent and identically distributed (i.i.d.) according to a normal distribution with mean 0 and variance \(\sigma^2\). Then,

\[ y_j \sim {\cal N}\left(\sum_{k=0}^\degree c_k x_j^k,\sigma^2\right)\]

and the log-likelihood is \[ \llike(\param) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^n\left(y_j -\sum_{k=0}^\degree c_k x_j^k\right)^2 \]

The ML estimator of \(\sigma^2\) is

\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^n\left(y_j -\sum_{k=0}^\degree \hat{c}_k \, x_j^k\right)^2 \]

Then, the log-likelihood computed with \(\hat{\param}=(\hat{\beta},\hat{\sigma}^2)\) reduces to \[ \llike(\hat{\param}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\left(\frac{1}{n}\sum_{j=1}^n \hat{e}_j^2\right) -\frac{n}{2} \] where \[\sum_{j=1}^n \hat{e}_j^2 = \sum_{j=1}^n \left(y_j -\sum_{k=0}^\degree \hat{c}_k \, x_j^k \right)^2\] is the residual sum of squares (RSS). Let us apply this formula for computing the log-likelihood for models lm1 and lm2:

LL1 <- -n/2*log(2*pi) -n/2*log(RSS1/n)-n/2
LL2 <- -n/2*log(2*pi) -n/2*log(RSS2/n)-n/2 
c(LL1, LL2)
## [1] -206.5784 -205.3860


4.3 Likelihood ratio test

Since models lm1 and lm2 are nested, we can then perform the likelihood ratio test (LRT) for testing lm1 against lm2.

Let \(\llike_1\) and \(\llike_2\) be the log-likelihood functions for models lm1 and lm2. Then, for large \(n\), the distribution of the test statistic \[ \begin{aligned} LRT_{\rm stat} &= 2(\llike_2(\hat\theta_2) - \llike_1(\hat\theta_1)) \\ &= n \log \left( \frac{\sum_{j=1}^n(y_j - f_1(x_j,\hat{\beta_1}))^2}{\sum_{j=1}^n(y_j - f_2(x_j,\hat{\beta_2}))^2} \right) \end{aligned} \] can be approximated by a \(\chi^2\) distribution with \(\nbeta_2-\nbeta_1=\degree_2-\degree_1\) df.

2*as.numeric(logLik(lm2) - logLik(lm1))
## [1] 2.384795

n*log(sum((y-y1)^2)/sum((y-y2)^2))
## [1] 2.384795
LRT.stat <- n*log(sum((y-y1)^2)/sum((y-y2)^2))
p.value  <- 1-pchisq(LRT.stat,d2-d1)
p.value
## [1] 0.122521


4.4 Deviance

The deviance of a given regression model, for a given set of observations \(y\), is a measure of goodness of fit, defined in R as:

\[ D = \sum_{j=1}^n(y_j - f(x_j,\hat{\beta}))^2 \]

c(deviance(lm1), deviance(lm2))
## [1] 11353.52 10824.72
c(sum((y-y1)^2), sum((y-y2)^2))
## [1] 11353.52 10824.72


4.5 Information criteria

The functions AIC and BIC compute the Akaike information criterion and the Bayesian information criterion:

AIC(lm1)
## [1] 419.1569
BIC(lm1)
## [1] 424.8929

AIC and BIC are penalized versions of the log-likelihood defined by:

\[ \begin{aligned} AIC &= -2\llike(\hat{\param}) + 2P \\ BIC &= -2\llike(\hat{\param}) + \log(n)P \end{aligned} \] where \(P\) is the number of parameters of the model, i.e. the length of \(\param\). Here, \(P=\degree+2\) for a polynomial model of degree \(\degree\).

On the one hand, \(-2\llike(\hat{\param})\) decreases when \(P\) increases. On the other hand, the penalization term (\(2P\) or \(\log(n)P\)) increases with \(P\). The objective of these criteria is to propose a model with an optimal compromise between the goodness of fit (measured by the log-likelihood) and the complexity of the model (measured by the number of parameters \(P\)).

m1 <- 1 #degree of the polynomial
P1 <- m1 + 2
aic1 <- -2*logLik(lm1) + 2*P1
bic1 <- -2*logLik(lm1) + log(n)*P1
c(aic1, bic1)
## [1] 419.1569 424.8929
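These criteria can then be used to compare models lm1 and lm2; the model with the smaller value of a criterion is the one preferred by that criterion:

# compare the degree-1 and degree-2 models (smaller values are preferred)
c(AIC(lm1), AIC(lm2))
c(BIC(lm1), BIC(lm2))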


5 Confidence intervals and prediction intervals

For given values \(\xnew\) of the explanatory variable, we can use the fitted model for estimating the predicted response \(\fnew=f(\xnew)\):

speed.new <- c(5,15,25)
new.df <- data.frame(speed=speed.new)
predict(lm1,newdata=new.df)
##         1         2         3 
##  2.082949 41.407036 80.731124

This estimate is defined as \[ \hfnew = f(\xnew ,\hat{\beta}) \] Thus, if \(f\) is a polynomial of degree \(\degree\),

\[ \hfnew = \Xnew \hat{\beta} \] where the \(k\)-th column of \(\Xnew\) is \(\xnew^{k-1}\).

X1 <- cbind(1,speed)
beta1.est <- solve(t(X1)%*%X1)%*%t(X1)%*%y
X.new <- cbind(1,speed.new)
f.new <- X.new%*%beta1.est
t(f.new)
##          [,1]     [,2]     [,3]
## [1,] 2.082949 41.40704 80.73112

\(\hfnew\) is a random variable since it is a function of the observed \(y\). We can then compute a confidence interval for \(\fnew\):

level=0.9
predict(lm1,newdata=new.df, interval = "confidence", level=level)
##         fit       lwr      upr
## 1  2.082949 -6.031168 10.19707
## 2 41.407036 37.748435 45.06564
## 3 80.731124 73.110888 88.35136

Since \(\hfnew= \Xnew \hat{\beta}\), \[ \hfnew \sim {\cal N} (\fnew , {\rm Var}(\hfnew ) ) \] where \[\begin{aligned} {\rm Var}(\hfnew ) &= \Xnew {\rm Var}(\hat{\beta}) \Xnew^\prime \\ &= \sigma^2 \Xnew(X^\prime X)^{-1}\Xnew^\prime \end{aligned}\]

An estimate \(\widehat{{\rm Var}(\hfnew)}\) of \({\rm Var}(\hfnew)\) is obtained by using \(\hat\sigma^2\) instead of \(\sigma^2\).

Then, \[ \begin{aligned} \left({{\rm Var}(\hfnew)} \right)^{-1/2}(\hfnew - \fnew) &\sim {\cal N}(0, {\rm Id}_{\nnew} ) \\ \left(\widehat{{\rm Var}(\hfnew)} \right)^{-1/2}(\hfnew - \fnew) &\sim t_{\nnew,n-\nbeta} \end{aligned} \] where \(t_{\nnew,n-\nbeta}\) is the multivariate \(t\) distribution with \(n-\nbeta\) degrees of freedom (the components of this \(\nnew\)-vector are independent and follow a \(t\) distribution with \(n-\nbeta\) df).

V.newf <- X.new%*%V.beta%*%t(X.new)
sd.newf <- sqrt(diag(V.newf))
d1 <- m1 + 1
cbind(f.new, f.new+qt((1-level)/2, n-d1)*sd.newf, f.new+qt((1+level)/2, n-d1)*sd.newf)
##           [,1]      [,2]     [,3]
## [1,]  2.082949 -6.031168 10.19707
## [2,] 41.407036 37.748435 45.06564
## [3,] 80.731124 73.110888 88.35136

Consider now a vector of new measured values \(\ynew\). We can again use the predict function for computing a prediction interval for \(\ynew\):

predict(lm1,newdata=new.df, interval = "prediction", level=level)
##         fit       lwr       upr
## 1  2.082949 -24.95816  29.12406
## 2 41.407036  15.35386  67.46022
## 3 80.731124  53.83408 107.62816

By definition of the model, \[ \ynew \sim {\cal N} (\fnew , \sigma^2 \, {\rm Id}_{\nnew} ) \] Then, if we want to compute a prediction interval for \(\ynew\), we must take into account the variability of \(\ynew\) around \(\fnew\), but also the uncertainty on \(\fnew\) since it is unknown:

\[\ynew = \hfnew + (\fnew-\hfnew) + \enew \] Thus, \[ \ynew - \hfnew \sim {\cal N}(0, {\rm Var}(\hfnew) + \sigma^2 \, {\rm Id}_{\nnew} ) \] Then, \[ \begin{aligned} \left(\Xnew {\rm Var}(\hat{\beta}) \Xnew^\prime + {\sigma}^2 {\rm Id}_{\nnew} \right)^{-1/2}(\ynew - \hfnew) & \sim {\cal N}(0, {\rm Id}_{\nnew} ) \\ \left(\Xnew \widehat{{\rm Var}(\hat{\beta})} \Xnew^\prime + \hat{\sigma}^2 {\rm Id}_{\nnew} \right)^{-1/2}(\ynew - \hfnew) & \sim t_{\nnew,n-\nbeta} \end{aligned} \]

V.newy <- V.newf + sigma2.est*diag(rep(1,length(f.new)))
sd.newy <- sqrt(diag(V.newy))
cbind(f.new, f.new+qt((1-level)/2, n-d1)*sd.newy, f.new+qt((1+level)/2, n-d1)*sd.newy)
##           [,1]      [,2]      [,3]
## [1,]  2.082949 -24.95816  29.12406
## [2,] 41.407036  15.35386  67.46022
## [3,] 80.731124  53.83408 107.62816


6 Using orthogonal polynomials

Let \(X\) be the design matrix for a polynomial model of degree \(\degree\), \[ X = \left( \begin{array}{cccc} 1 & x_1 & \cdots & x_1^{\degree} \\ 1 & x_2 & \cdots & x_2^{\degree} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^{\degree} \end{array}\right) \] Then, the columns of \(X\) are highly correlated since \[ (X^\prime X)_{kl} = \sum_{j=1}^n x_j^{k+l-2} \]
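We can illustrate this on the cars data by computing the correlations between the non-constant columns of the raw design matrix of degree 4 (a short sketch):

# correlations between the columns x, x^2, x^3, x^4 of the raw design matrix
X.raw <- outer(x, 1:4, "^")
colnames(X.raw) <- paste0("x^", 1:4)
round(cor(X.raw), 3)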


6.1 The Gram-Schmidt procedure

Orthogonal polynomials can be obtained by applying the Gram-Schmidt orthogonalization process to the basis \(1\), \(x\), \(x^2\), \(\ldots\), \(x^{\degree}\):

\[ \begin{aligned} p_0(x) &= 1 \\ p_1(x) &= x - \frac{\langle x, p_0 \rangle}{\langle p_0, p_0 \rangle} p_0(x) \\ p_2(x) &= x^2 - \frac{\langle x^2 , p_1 \rangle}{\langle p_1, p_1 \rangle} p_1(x) - \frac{\langle x^2 , p_0 \rangle}{\langle p_0, p_0 \rangle} p_0(x) \\ &\vdots \\ p_{\degree}(x) &= x^{\degree} - \frac{\langle x^{\degree} , p_{\degree-1} \rangle}{\langle p_{\degree-1}, p_{\degree-1} \rangle} p_{\degree-1}(x) - \cdots - \frac{\langle x^{\degree} , p_0 \rangle}{\langle p_0, p_0 \rangle} p_0(x) \end{aligned} \]

Let \(\tilde{X} = (p_0(x) , p_1(x), \cdots , p_{\degree}(x))\) be the new design matrix. Then, \[ \tilde{X}^\prime \tilde{X} = \left( \begin{array}{cccc} n & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots& \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{array} \right) \]

u0 <- rep(1,n)
X2.ortho <- u0

u1 <- x - sum(x*u0)/sum(u0^2)*u0
e1 <- u1/norm(u1, type="2" )
X2.ortho <- cbind(X2.ortho, e1)
u2 <- x^2 - sum(x^2*u0)/sum(u0^2)*u0 - sum(x^2*u1)/sum(u1^2)*u1
e2 <- u2/norm(u2, type="2" )
X2.ortho <- cbind(X2.ortho, e2)
head(X2.ortho)
##      X2.ortho         e1         e2
## [1,]        1 -0.3079956 0.41625480
## [2,]        1 -0.3079956 0.41625480
## [3,]        1 -0.2269442 0.16583013
## [4,]        1 -0.2269442 0.16583013
## [5,]        1 -0.1999270 0.09974267
## [6,]        1 -0.1729098 0.04234892
t(X2.ortho)%*%X2.ortho
##              X2.ortho           e1           e2
## X2.ortho 5.000000e+01 4.996004e-16 1.776357e-15
## e1       4.996004e-16 1.000000e+00 1.249001e-15
## e2       1.776357e-15 1.249001e-15 1.000000e+00

More generally, the Gram-Schmidt orthogonalization process is easy to implement:

GramSchmidt <- function(x,m)
{
  n <- length(x)
  X <- matrix(nrow=n, ncol=m+1)
  X[,1] <- rep(1,n)
  for (k in (1:m)){
    tk <- x^k
    uk <- tk
    for (j in (1:k)){
      ukj <- X[,j]
      uk <- uk - sum(tk*ukj)/sum(ukj^2)*ukj
    }
    X[,k+1] <- uk/norm(uk, type="2")
  }
  return(X)
}

It can then be used for any degree \(\degree\):

X4.ortho <- GramSchmidt(x,4)
head(X4.ortho)
##      [,1]       [,2]       [,3]        [,4]       [,5]
## [1,]    1 -0.3079956 0.41625480 -0.35962151  0.2454125
## [2,]    1 -0.3079956 0.41625480 -0.35962151  0.2454125
## [3,]    1 -0.2269442 0.16583013  0.05253037 -0.2757077
## [4,]    1 -0.2269442 0.16583013  0.05253037 -0.2757077
## [5,]    1 -0.1999270 0.09974267  0.11603440 -0.2650106
## [6,]    1 -0.1729098 0.04234892  0.15002916 -0.2078087
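As a final check, the Gram matrix of X4.ortho should be (up to numerical error) the diagonal matrix described above, and the non-constant columns should coincide, up to the sign of each column, with the orthogonal polynomials returned by poly:

# Gram matrix: n for the constant column, 1 on the remaining diagonal, 0 elsewhere
round(t(X4.ortho) %*% X4.ortho, 8)
# comparison with poly (agreement expected up to column signs and rounding error)
max(abs(abs(X4.ortho[, -1]) - abs(poly(x, 4))))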