$$ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \def\pmacro{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\simh0{\mathop{\sim}_{H_0}} \def\df{\texttt{df}} \def\res{e} \def\xomega{x} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \newcommand{\Rset}{\mbox{$\mathbb{R}$}} \def\param{\theta} \def\setparam{\Theta} \def\xnew{x_{\rm new}} \def\fnew{f_{\rm new}} \def\ynew{y_{\rm new}} \def\nnew{n_{\rm new}} \def\enew{e_{\rm new}} \def\Xnew{X_{\rm new}} \def\hfnew{\widehat{\fnew}} \def\degree{m} \def\nbeta{d} \newcommand{\limite}[1]{\mathop{\longrightarrow}\limits_{#1}} \def\ka{k{\scriptstyle a}} \def\ska{k{\scriptscriptstyle a}} \def\kel{k{\scriptstyle e}} \def\skel{k{\scriptscriptstyle e}} \def\cl{C{\small l}} \def\Tlag{T\hspace{-0.1em}{\scriptstyle lag}} \def\sTlag{T\hspace{-0.07em}{\scriptscriptstyle lag}} \def\Tk{T\hspace{-0.1em}{\scriptstyle k0}} \def\sTk{T\hspace{-0.07em}{\scriptscriptstyle k0}} \def\thalf{t{\scriptstyle 1/2}} \newcommand{\Dphi}[1]{\partial_\pphi #1} \def\asigma{a} \def\pphi{\psi} \newcommand{\stheta}{{\theta^\star}} \newcommand{\htheta}{{\widehat{\theta}}} $$


1 Introduction

The faithful data consist of the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
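The data are available in any R session as the data frame faithful, with one row per eruption:

head(faithful)
dim(faithful)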

Let us see what these data look like.

library(ggplot2);  theme_set(theme_bw())
(pl <- ggplot(data=faithful) +  geom_point(aes(eruptions,waiting), size=1.5, colour="#993399") +
  ylab("Waiting time to next eruption (min)") + xlab("Duration of the eruption (min)"))

We aim to fit a model to this data that describes the relationship between duration and waiting time.

If we try to fit a polynomial model, we can check that a polynomial of degree 4 is considered the "best" polynomial model.

x <- faithful$eruptions
y <- faithful$waiting
poly4 <- lm(y ~poly(x, 4))
xc <- seq(1, 6, by=0.01)
dfc <- data.frame(x=xc)
dfc$poly4 <- predict(poly4,dfc)
print(pl +   geom_line(data=dfc,aes(x,poly4), size=1, colour="#339900"))

Even if this model is the "best" polynomial model, we may have serious doubts about its ability to predict waiting times for durations outside of the observed range.
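One way to check this kind of claim is to compare polynomial fits of increasing degree with an information criterion such as AIC (a quick sketch; other selection criteria could be used as well):

# AIC of polynomial models of degree 1 to 6 (lower is better)
sapply(1:6, function(m) AIC(lm(y ~ poly(x, m))))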

Furthermore, the parameters of the model, i.e. the polynomial's coefficients, have no obvious physical interpretation. Using a polynomial model here, we are therefore not seeking to build a structural model \(f\) that approximates a physical phenomenon, but merely seeking to relate the variability in the observations to the explanatory variables \(x\), \(x^2\), \(\ldots\)

We therefore need to consider other types of models, i) that do not necessarily assume linear relationships between the response variable and the explanatory variables, ii) whose parameters have some physical interpretation.

A logistic function (or logistic curve) is a common "S"-shaped (sigmoid) curve, with equation: \[ f_1(x) = \frac{A}{1+{\rm exp}(-\gamma(x-\tau))} \]



Here, \(A\) is the limiting value (when \(x \to \infty\)), \(\gamma\) measures the steepness of the curve and \(\tau\) is the \(x\)-value of the sigmoid's midpoint.
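To visualize the role of each parameter, here is a minimal sketch using arbitrary values \(A=90\), \(\gamma=2\) and \(\tau=2\):

# the logistic function f1, plotted with arbitrary parameter values
f1 <- function(x, A, gamma, tau) A/(1 + exp(-gamma*(x - tau)))
xg <- seq(0, 6, by=0.01)
plot(xg, f1(xg, A=90, gamma=2, tau=2), type="l", xlab="x", ylab="f1(x)")
abline(h=90, lty=2)   # A: the upper asymptote
abline(v=2, lty=2)    # tau: the midpoint, where f1(tau) = A/2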

This model is a nonlinear model in the sense that the regression function \(f_1\) is a nonlinear function of the parameters.

Let us fit this model to our data using the nls function.

nlm1 <- nls(y ~  A/(1+exp(-gamma*(x-tau))), start=c(A=90, gamma=2, tau=2))
summary(nlm1)
## 
## Formula: y ~ A/(1 + exp(-gamma * (x - tau)))
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## A      93.1097     4.5080  20.654  < 2e-16 ***
## gamma   0.6394     0.1022   6.254 1.57e-09 ***
## tau     1.4623     0.1092  13.391  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.763 on 269 degrees of freedom
## 
## Number of iterations to convergence: 8 
## Achieved convergence tolerance: 6.017e-06

We will see in the next sections what these results are and how they are computed.

An extension of the logistic function assumes a minimum waiting time \(S\) between eruptions:

\[ f_2(x) = S+ \frac{A-S}{1+{\rm exp}(-\gamma(x-\tau))} \]
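Here \(S\) is the lower asymptote of the curve (its limit when \(x \to -\infty\)) while \(A\) remains the upper asymptote. A quick numerical check, again with arbitrary parameter values:

# limits of the extended logistic function: approximately c(S, A)
f2 <- function(x, A, gamma, tau, S) S + (A-S)/(1 + exp(-gamma*(x - tau)))
f2(c(-100, 100), A=90, gamma=2, tau=2, S=50)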



We can again use nls to fit this nonlinear model to the data,

nlm2 <- nls(y ~  (A-S)/(1+exp(-gamma*(x-tau))) + S, start=c(A=90, gamma=2, tau=2, S=50))
summary(nlm2)
## 
## Formula: y ~ (A - S)/(1 + exp(-gamma * (x - tau))) + S
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## A      82.4659     0.9973  82.689  < 2e-16 ***
## gamma   2.2539     0.4355   5.175 4.47e-07 ***
## tau     3.0553     0.1107  27.610  < 2e-16 ***
## S      51.3221     1.8303  28.040  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.622 on 268 degrees of freedom
## 
## Number of iterations to convergence: 9 
## Achieved convergence tolerance: 6.286e-06

We can now compute and plot the waiting times predicted by these two fitted models:

dfc$nlm1 <- predict(nlm1, newdata=dfc)
dfc$nlm2 <- predict(nlm2, newdata=dfc)
dfc$poly4 <- NULL

library(reshape2)
melt.df <- melt(dfc, id="x", variable.name = "model", value.name = "f")

print(pl + geom_line(data=melt.df,aes(x,f,colour=model), size=1) + scale_colour_manual( values=c("#E69F00", "#56B4E9")))

We will see in the next sections how such models are fitted, evaluated and compared.

2 Fitting a nonlinear model

2.1 Estimation of the parameters of the model

2.1.1 Least squares estimation

In the model \[y_j = f(x_j,\beta) + e_j \quad ; \quad 1\leq j \leq n\] the least squares (LS) estimator of \(\beta\) minimizes the residual sum of squares (RSS): \[ \hat{\beta} = \argmin{\beta}\sum_{j=1}^n (y_j - f(x_j,\beta))^2 \] Here, there is no closed-form expression for \(\hat{\beta}\). An optimization procedure is therefore used for computing \(\hat{\beta}\).

Let us define a first model \({\cal M}_1\) as \[y_j = f_1(x_j,\beta) + e_j \quad ; \quad 1\leq j \leq n\] where \(\beta = (A,\gamma,\tau)\) and let us compute the LS estimate of \(\beta\). Function nlm carries out a minimization of the residual sum of squares using a Newton-type algorithm.

# prediction function f1 and residual sum of squares for model M1
pred1 <- function(beta, x) { beta[1]/(1+exp(-beta[2]*(x-beta[3]))) }
rss1  <- function(beta, x, y) { sum((y - pred1(beta, x))^2) }

r.nlm1 <- nlm(rss1, c(90, 2, 2), x, y)
beta.est <- r.nlm1$estimate
beta.names <- names(coef(nlm1))
names(beta.est) <- beta.names 
beta.est
##          A      gamma        tau 
## 93.1100965  0.6393835  1.4622674

Assume now that the residual errors are random variables with mean 0 and variance \(\sigma^2\): \[ \esp{e_j}=0 \quad ; \quad \esp{e_j^2}=\sigma^2 \quad ; \quad 1 \leq j \leq n \] Then, following the approach used for linear models, if \(\beta\) is a vector of length \(d\), there are \(n-d\) degrees of freedom and the residual error variance is estimated by \[ \hat{\sigma}^2 = \frac{1}{n-d} \sum_{j=1}^n(y_j - f(x_j,\hat{\beta}))^2 \]

n <- length(x)
d <- length(beta.est)
df <- n-d
sigma.est <- sqrt(rss1(beta.est, x, y)/df)
sigma.est
## [1] 5.762887


2.1.2 Maximum likelihood estimation

Let \(e_j=\sigma \varepsilon_j\) where \((\varepsilon_j)\) is a sequence of independent and normally distributed random variables with mean 0 and variance \(1\) \[ \varepsilon_j \iid {\cal N}(0, 1). \] We can then rewrite the model as follows: \[y_j = f(x_j,\beta) + \sigma \varepsilon_j \quad ; \quad 1\leq j \leq n\]
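Under this assumption, the log-likelihood of the observations is \[ \llike(\beta,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^n \left(y_j - f(x_j,\beta)\right)^2. \]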

The maximum likelihood (ML) estimator of \(\beta\) coincides with the least squares estimator

\[\begin{aligned} \hat{\beta} &= \argmin{\beta}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \end{aligned}\]

and the ML estimator of \(\sigma^2\) is

\[ \hat{\sigma}^2_{\rm ml} = \frac{1}{n} \sum_{j=1}^n (y_j - f(x_j,\hat{\beta}))^2 \]

sigma.ml <- sqrt(rss1(beta.est, x, y)/n)
sigma.ml
## [1] 5.731018
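The two estimates only differ through the normalizing constant, since \(\hat{\sigma}^2_{\rm ml} = \frac{n-d}{n}\,\hat{\sigma}^2\). We can check this numerically:

sigma.est*sqrt(df/n)   # should coincide with sigma.ml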


2.2 Standard errors of the parameter estimates

Several methods exist for estimating the standard errors of the parameter estimates. In particular, the nls function uses a linear approximation of the model, but a likelihood approach or a parametric bootstrap may also provide estimates of these s.e.

2.2.1 Linearization approach

An nls object has methods for several generic functions, including vcov which computes the variance-covariance matrix of the estimated parameters \(\hat{\beta}\).

vcov(nlm1)
##                A        gamma          tau
## A     20.3218236 -0.451415395  0.432034293
## gamma -0.4514154  0.010452938 -0.008769245
## tau    0.4320343 -0.008769245  0.011923497

The standard errors of the estimates are then the square roots of the diagonal elements of this matrix

sqrt(diag(vcov(nlm1)))
##         A     gamma       tau 
## 4.5079733 0.1022396 0.1091948

The nls function linearizes the model for computing this variance-covariance matrix. Indeed, for any \(\beta\) "close" to \(\hat{\beta}\), \[ f(x_j , \beta) \simeq f(x_j , \hat{\beta}) + \nabla f(x_j , \hat{\beta})(\beta - \hat{\beta}) \] where \(\nabla f(x_j , {\beta})\) is the gradient of \(f(x_j , {\beta})\), i.e. the row vector of the first derivatives of \(f(x_j ,{\beta})\) with respect to the \(d\) components of \(\beta\). Setting \(z_j=y_j-f(x_j , \hat{\beta}) + \nabla f(x_j , \hat{\beta}) \hat{\beta}\) and \(g_j=\nabla f(x_j , \hat{\beta})\), the original model can be approximated by the linear model \[ z_j = g_j \ \beta + e_j. \] Writing this model in matrix form \(z=G\, \beta + e\), where \(g_j\) is the \(j\)th row of the matrix \(G\), we can check that the LS estimator of \(\beta\) for this linearized model is the LS estimator \(\hat{\beta}\) of the original model

\[\begin{aligned} \hat{\beta} &= \argmin{\beta}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \\ &= \argmin{\beta}\sum_{j=1}^n(z_j - g_j\beta)^2 \end{aligned}\]

Proof: Let \(\tilde{\beta}\) be the LS estimator of the linearized model. Then,

\[\begin{aligned} \tilde{\beta} &= \argmin{\beta}\|z - G\beta \|^2 \\ &= (G^\prime G)^{-1}G^\prime z \\ &= (G^\prime G)^{-1}G^\prime (y - f(x,\hat\beta) + G\hat\beta) \\ &= \hat\beta + (G^\prime G)^{-1}\nabla f(x,\hat\beta) (y - f(x,\hat\beta) ) \\ \end{aligned}\]

By definition, \(\hat\beta\) minimizes \(U(\beta) = \| y -f(x,\beta) \|^2\). Then, \[\begin{aligned} \nabla U(\hat\beta) &= -2\nabla f(x,\hat\beta) (y - f(x,\hat\beta) ) \\ &= 0 \end{aligned}\] Thus, \(\tilde\beta=\hat\beta\). \(\Box\)

hx <- deriv(y ~ A/(1+exp(-k*(x-tau))), c("A", "k", "tau"), function(A, k, tau, x){} ) 
fr <- hx(beta.est[1], beta.est[2], beta.est[3], x)
G <- attr(fr,"gradient")
z <- y - pred1(beta.est, x) + G%*%beta.est
solve(t(G)%*%G)%*%t(G)%*%z
##          [,1]
## A   93.110121
## k    0.639383
## tau  1.462268

Since \(\hat{\beta} =(G^\prime G)^{-1}G^\prime z\), the variance-covariance of \(\hat\beta\) can be approximated by \[{\rm Var}_{\rm lin}(\hat\beta) = \hat\sigma^2 (G^\prime G)^{-1} \]

V.lin <- sigma.est^2*solve(t(G)%*%G)
V.lin
##              A            k          tau
## A   20.3230488 -0.451427057  0.432076372
## k   -0.4514271  0.010452837 -0.008769862
## tau  0.4320764 -0.008769862  0.011924743

and we can derive standard errors \(({\rm se}_{\rm lin}(\hat{\beta}_k), 1 \leq k \leq \nbeta)\) for the parameter estimates,

se.lin <- sqrt(diag(V.lin))
se.lin
##         A         k       tau 
## 4.5081092 0.1022391 0.1092005


2.2.2 Maximum likelihood approach

Let \(I_y(\hat{\beta})\) be the observed Fisher information matrix at \(\hat{\beta}\):

\[ \begin{aligned} I_y({\hat\beta}) &= - \frac{\partial^2}{\partial \beta \partial \beta^\prime} \llike(\hat{\beta},\hat{\sigma}^2) \\ &= \frac{1}{2\hat\sigma^2}\frac{\partial^2}{\partial \beta \partial \beta^\prime} \left(\sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2 \right). \end{aligned} \]

Then, the asymptotic normality of the maximum likelihood estimator implies that the variance of \(\hat{\beta}\) can be approximated by the inverse of \(I_y(\hat{\beta})\):

\[{\rm Var}_{\rm ml}(\hat{\beta}) = I_y(\hat{\beta})^{-1}.\]

Function nlm can return the Hessian of the objective function rss1, i.e. the matrix of the second derivatives \(\partial^2/\partial \beta \partial \beta^\prime \sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2\)

r.nlm1 <- nlm(rss1, c(90, 2, 2), x, y, hessian=TRUE)
r.nlm1$hessian
##            [,1]      [,2]        [,3]
## [1,]   324.8939   10848.5   -3794.477
## [2,] 10848.4973  379318.3 -114782.340
## [3,] -3794.4768 -114782.3   58845.488

We then derive the FIM and the variance \({\rm Var}_{\rm ml}(\hat{\beta})\):

I1 <- r.nlm1$hessian/(2*sigma.ml^2)
V.ml <- solve(I1)
V.ml
##            [,1]         [,2]         [,3]
## [1,] 17.4344832 -0.386666074  0.369991062
## [2,] -0.3866661  0.008998208 -0.007381366
## [3,]  0.3699911 -0.007381366  0.010576191

and

se.ml <- sqrt(diag(V.ml))
se.ml
## [1] 4.17546203 0.09485888 0.10284061

Remark: The Hessian of \(\sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2\) can be computed as a function of the first and second derivatives of \(f\): \[ \frac{\partial^2}{\partial \beta \partial \beta^\prime} \left(\sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2 \right) = 2 \sum_{j=1}^n \left( \left(\frac{\partial}{\partial \beta}f(x_j,\hat\beta)\right)\left(\frac{\partial}{\partial \beta}f(x_j,\hat\beta)\right)^\prime - \frac{\partial^2}{\partial \beta \partial \beta^\prime}f(x_j,\hat\beta)\,\left(y_j - f(x_j,\hat\beta)\right) \right) \]

hx <- deriv3(y ~ amax/(1+exp(-k*(x-tau))), c("amax", "k", "tau"),
             function(amax, k, tau, x){} ) 

fr <- hx(beta.est[1], beta.est[2], beta.est[3], x)
gr <- attr(fr,"gradient")
hr <- attr(fr,"hessian")
H <- matrix(nrow=d, ncol=d)
for (k in (1:d))
  H[k,] <- -2*colSums(hr[,,k]*(y-fr)) + 2*gr[,k]%*%gr
H   # should closely match the Hessian returned by nlm above

The standard error of \(\hat{\sigma}^2\) is \(\hat{\sigma}^2/\sqrt{n/2}\):

sigma.est^2/sqrt(n/2)


2.2.3 Parametric bootstrap

If we were able to repeat the same experiment under the same conditions, we would observe \(y^{(1)} = (y^{(1)}_j, 1\leq j \leq n)\) and we would compute \(\beta^{(1)}\), an estimate of \(\beta\). Then, if we could repeat this experiment \(L\) times, we would get \(L\) estimates \(\beta^{(1)}\), \(\beta^{(2)}\), \(\ldots\), \(\beta^{(L)}\) of \(\beta\). This sequence of estimates \((\beta^{(\ell)}, 1 \leq \ell \leq L)\) would be a sample of random variables distributed as \(\hat{\beta}\) and could therefore be used for estimating this distribution.

When such replicates are not available, parametric bootstrap (or Monte-Carlo simulation) is a way to mimic the repetition of an experiment.

For \(\ell=1,2,\ldots, L\), we generate "observations" \((y^{(\ell)}_j, 1\leq j \leq n)\) with the model of interest, the original explanatory variables \((x_j, 1\leq j\leq n)\) and using the estimated parameters \(\hat\theta=(\hat\beta,\hat{\sigma}^2)\):

\[ y^{(\ell)}_j = f(x_j,\hat{\beta}) + \hat{\sigma} \varepsilon^{(\ell)}_j \quad ; \quad 1\leq j \leq n \] where \(\varepsilon^{(\ell)}_j \iid {\cal N}(0,1)\). We also compute the LS /ML estimate of \(\beta\): \[\hat\beta^{(\ell)} = \argmin{\beta}\sum_{j=1}^n(y^{(\ell)}_j - f(x_j,\beta))^2\] The variance-covariance matrix of \(\hat\beta\) is then estimated by the empirical variance-covariance matrix of \((\hat\beta^{(\ell)})\):

\[ {\rm Var}_{\rm mc}(\hat{\beta}) = \frac{1}{L-1}\sum_{\ell=1}^L ( \hat\beta^{(\ell)} - \bar{\beta} )( \hat\beta^{(\ell)} - \bar{ \beta} )^\prime \] where \(\bar{ \beta} = 1/L \sum_{\ell=1}^L \hat\beta^{(\ell)}\).

L <- 1000
pred.nlm1 <- predict(nlm1)
B <- matrix(nrow=L, ncol=d)
for (l in (1:L)) {
  yl <- pred.nlm1 + sigma.est*rnorm(n) 
  rl.nlm1 <- nls(yl ~  pred1(beta,x), start=list(beta=beta.est))
  B[l,] <- coef(rl.nlm1)
}
V.mc <- cov(B)
V.mc
##            [,1]        [,2]        [,3]
## [1,] 29.0923837 -0.51565237  0.80267867
## [2,] -0.5156524  0.01088116 -0.01194807
## [3,]  0.8026787 -0.01194807  0.02656156
se.mc <- sqrt(diag(V.mc))
se.mc
## [1] 5.3937356 0.1043128 0.1629772

Remark: it would be equivalent to directly compute the empirical standard deviation of each component of the sequence \((\hat\beta^{(\ell)})\):

apply(B,MARGIN=2,FUN=sd)
## [1] 5.3937356 0.1043128 0.1629772

One of the main advantages of this method is that it doesn't make any distributional assumption about \(\hat\beta\), contrary to the maximum likelihood approach, which relies on the asymptotic normality of the estimator (with a known asymptotic variance). This asymptotic distribution is then used with a finite set of observations to approximate the distribution of the ML estimator, without knowing how good this approximation is. For its part, the linearization approach makes use of an approximation of the structural model, again without knowing how good this approximation is.

In this example, the ML approach seems to underestimate the standard errors of the estimates. On the other hand, the results obtained with the linearization approach are very similar to those obtained by Monte Carlo simulation.
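To make the comparison concrete, we can put the three sets of standard errors side by side:

# standard errors obtained by linearization, maximum likelihood and parametric bootstrap
round(cbind(lin=se.lin, ml=se.ml, mc=se.mc), 3)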


2.3 Statistical tests for the model parameters

The summary of model nlm1 includes several pieces of information about the model parameters:

summary(nlm1)$coefficient
##         Estimate Std. Error   t value     Pr(>|t|)
## A     93.1097482  4.5079733 20.654458 1.975076e-57
## gamma  0.6393914  0.1022396  6.253852 1.567079e-09
## tau    1.4622600  0.1091948 13.391302 1.110292e-31

Let \(\beta = (\beta_k, 1\leq k \leq d)\) be the \(d\)-vector of parameters of the model. In the linearized model \(z=G\beta+e\),

\(t_k = (\hat{\beta}_k - \beta_k)/{\rm se}(\hat{\beta}_k)\) follows a \(t\)-distribution with \(n-d\) degrees of freedom. We can then perform a \(t\)-test to test if \(\beta_k=0\).

The test statistic is \(t_{{\rm stat}, k} = {\hat{\beta}_k}/{{\rm se}(\hat{\beta}_k)}\) and the \(p\)-value for this test is \[p_k = 2\left(1 - \prob{T_{n-d} \leq |t_{{\rm stat}, k}| }\right) \]

t.stat <- beta.est/se.lin
p.value <- 2*(1 - pt(abs(t.stat),n-d))
res.beta <- cbind(beta.est, se.lin, t.stat, p.value)
res.beta
##         beta.est    se.lin    t.stat      p.value
## A     93.1100965 4.5081092 20.653913 0.000000e+00
## gamma  0.6393835 0.1022391  6.253805 1.567486e-09
## tau    1.4622674 0.1092005 13.390669 0.000000e+00


2.4 Confidence intervals for the model parameters


2.4.1 Linearization approach

Using the linearized model \(z=G\beta+e\), we can compute a confidence interval for each component of \(\beta\) as we do with any linear model:

\[{\rm CI}_{{\rm lin}, 1-\alpha}(\beta_k) = [\hat{\beta}_k + qt_{\alpha/2, n-d}\ {\rm se}_{\rm lin}(\hat{\beta}_k) \ , \ \hat{\beta}_k + qt_{1-\alpha/2, n-d}\ {\rm se}_{\rm lin}(\hat{\beta}_k)]\] where \(qt_{p,\nu}\) is the quantile of order \(p\) for a \(t\)-distribution with \(\nu\) degrees of freedom.

level <- 0.95
alpha <- 1 - level
ci.lin <- cbind(beta.est + qt(alpha/2,n-d)*se.lin , beta.est + qt(1-alpha/2,n-d)*se.lin)
row.names(ci.lin) <- beta.names
colnames(ci.lin) <- c(paste0((1-level)/2*100,"%"),paste0((1+level)/2*100,"%"))
ci.lin
##             2.5%       97.5%
## A     84.2344319 101.9857610
## gamma  0.4380929   0.8406741
## tau    1.2472711   1.6772637


2.4.2 Maximum likelihood approach

We can adopt the same approach with the ML estimate. Here, the standard errors \(({\rm se}_{\rm ml}(\hat{\beta}_k), 1 \leq k \leq \nbeta)\) are derived from the asymptotic variance-covariance matrix \({\rm Var}_{\rm ml}(\hat{\beta})\) of the parameter estimates.

\[{\rm CI}_{{\rm ml},1-\alpha}(\beta_k) = [\hat{\beta}_k + qt_{\alpha/2, n-d}\ {\rm se}_{\rm ml}(\hat{\beta}_k) \ , \ \hat{\beta}_k + qt_{1-\alpha/2, n-d}\ {\rm se}_{\rm ml}(\hat{\beta}_k)]\]

ci.ml <- cbind(beta.est + qt(alpha/2,n-d)*se.ml , beta.est + qt(1-alpha/2,n-d)*se.ml)
row.names(ci.ml) <- beta.names
colnames(ci.ml) <- c(paste0((1-level)/2*100,"%"),paste0((1+level)/2*100,"%"))
ci.ml
##             2.5%       97.5%
## A     84.8893551 101.3308379
## gamma  0.4526233   0.8261438
## tau    1.2597925   1.6647422


2.4.3 Parametric bootstrap

The sequence \((\hat{\beta}^{(\ell)}, 1 \leq \ell \leq L)\) obtained by Monte Carlo simulation can be used for computing an empirical confidence interval:

\[{\rm CI}_{{\rm mc},1-\alpha}(\beta_k) = [\hat{\beta}_{k,\alpha/2} \ , \ \hat{\beta}_{k,1-\alpha/2} ] \] where, for any \(0 < p < 1\), \(\hat{\beta}_{k,p}\) is the empirical quantile of order \(p\) of \((\hat{\beta}^{(\ell)}_k, 1 \leq \ell \leq L)\):

b <- c(apply(B,MARGIN=2,function(x) quantile(x,alpha/2)),
       apply(B,MARGIN=2,function(x) quantile(x,1-alpha/2)))
ci.mc=matrix(b,ncol=2)
row.names(ci.mc) <- beta.names
colnames(ci.mc) <- c(paste0((1-level)/2*100,"%"),paste0((1+level)/2*100,"%"))
ci.mc
##             2.5%       97.5%
## A     86.4303239 107.4691420
## gamma  0.4443587   0.8704914
## tau    1.3170630   1.9225209

Remark: These confidence intervals are slightly biased. We will use a linear model to explain where this bias comes from and show how to remove it.

Consider the linear model \(y = X\beta + \sigma\varepsilon\). A confidence interval of level \(1-\alpha\) for \(\beta_k\) is \[{\rm CI}_{ 1-\alpha}(\beta_k) = [\hat{\beta}_k + qt_{\alpha/2, n-d}\ {\rm se}(\hat{\beta}_k) \ , \ \hat{\beta}_k + qt_{1-\alpha/2, n-d}\ {\rm se}(\hat{\beta}_k)]\] where \({\rm se}(\hat{\beta}_k) = \hat\sigma \sqrt{(X^\prime X)^{-1}_{kk}}\).

On the other hand, for \(\ell=1,2,\ldots,L\), \[y^{(\ell)} = X\hat{\beta} + \hat{\sigma} \varepsilon^{(\ell)} \] and \[\hat{\beta}^{(\ell)} = \hat\beta + \hat{\sigma}(X^\prime X)^{-1}X^\prime\varepsilon^{(\ell)}\]

Thus, conditionally on the observations \(y\), i.e. conditionally on \(\hat\beta\), \(\hat{\beta}_k^{(\ell)} \sim {\cal N}(\hat\beta_k \ , \ {\rm se}^2(\hat{\beta}_k))\). Then, the empirical quantile \(\hat{\beta}_{k,p}\) is an estimator of the quantile of order \(p\) of a normal distribution with mean \(\hat\beta_k\) and variance \({\rm se}^2(\hat{\beta}_k)\). In other words, the confidence interval \({\rm CI}_{{\rm mc},1-\alpha}(\beta_k)\) is an estimator of the interval \([\hat{\beta}_k + q{\cal N}_{\alpha/2}\ {\rm se}(\hat{\beta}_k) \ , \ \hat{\beta}_k + q{\cal N}_{1-\alpha/2}\ {\rm se}(\hat{\beta}_k)]\), where \(q{\cal N}_p\) is the quantile of order \(p\) for a \({\cal N}(0,1)\) distribution.

We see that these quantiles for a normal distribution should be transformed into quantiles for a \(t\)-distribution with \(n-d\) df.

An unbiased confidence interval for \(\beta_k\) is therefore \[{\rm CI}^\star_{{\rm mc},1-\alpha}(\beta_k) = [\hat{\beta}_k + \frac{qt_{\alpha/2, n-d}}{q{\cal N}_{\alpha/2}}(\hat{\beta}_{k,\alpha/2} - \hat{\beta}_k)\ , \ \hat{\beta}_k + \frac{qt_{1-\alpha/2, n-d}}{q{\cal N}_{1-\alpha/2}}(\hat{\beta}_{k,1-\alpha/2} - \hat{\beta}_k)] \] The same correction can be used for nonlinear models.

rq <- qt(1-alpha/2,df)/qnorm(1-alpha/2)
beta.est + rq*(ci.mc-beta.est)
##             2.5%       97.5%
## A     86.4001349 107.5340372
## gamma  0.4434773   0.8715359
## tau    1.3164067   1.9246010


2.4.4 Profile likelihood

Function confint uses the profile likelihood method for computing confidence intervals for parameters in a fitted model.

confint(nlm1)
##             2.5%       97.5%
## A     87.1321726 105.5760369
## gamma  0.4625255   0.8324095
## tau    1.3109469   1.8569351

Profile likelihood confidence intervals are based on the log-likelihood function.

Imagine that we want to compute a confidence interval for \(\beta_1\). The profile likelihood of \(\beta_1\) is defined by

\[\like_p(\beta_1) = \max_{\beta_2, \ldots, \beta_d}\like(\beta_1, \beta_2, \ldots, \beta_d )\]

\(\like_p(\beta_1)\) no longer depends on \(\beta_2, \ldots, \beta_d\), since these parameters have been profiled out.

As an example, let us compute and display the profile log-likelihood of \(A\) for model nlm1

pred1.A=function(gamma,tau,x,A){ f <-  A/(1+exp(-gamma*(x-tau))) ; return(f)}

v.A <- seq(86,110,by=0.1)
ll.A <- NULL
for (A in v.A) {
  nlm1.A <- nls(y ~  pred1.A(gamma,tau,x,A), start=list(gamma=beta.est[2],tau=beta.est[3]))
  ll.A <- c(ll.A,as.numeric(logLik(nlm1.A)))
}

df.A <- data.frame(A=v.A, ll=ll.A)
ggplot(data=df.A) + geom_line(aes(A,ll), color="blue", size=1) + 
  geom_vline(xintercept=beta.est[1], color="red", linetype = "longdash") + 
  scale_x_continuous(breaks = c(seq(85,110,by=5),round(beta.est[[1]],2)),"A")

Consider the test of \(H_0: \beta_1 = \beta_1^\star\) against \(H_1: \beta_1 \neq \beta_1^\star\). The likelihood ratio statistic is

\[LR_{\rm stat} = 2\left(\log(\like(\hat\beta)) - \log(\like_p(\beta_1^\star))\right)\] where \(\hat\beta\) is the value of \(\beta\) that maximises the likelihood \(\like(\beta)\) under \(H_1\).

Under \(H_0\), \(LR_{\rm stat}\) follows a \(\chi^2\) distribution with 1 df. Then, the test is significant (i.e. we reject \(H_0\)) if \(LR_{\rm stat}> q\chi^2_{1,1-\alpha}\), where \(q\chi^2_{1,1-\alpha}\) is the quantile of order \(1-\alpha\) for a \(\chi^2\) distribution with 1 df.

A "profile likelihood confidence interval" of level \(1-\alpha\) for \(\beta_1\) consists of those values \(\beta_1^\star\) for which the test is not significant.

dll.A <- 2*(as.numeric(logLik(nlm1)) - ll.A)
df.A <- data.frame(A=v.A, dll=dll.A)

qlevel <- qchisq(level,1)
lpA <- function(A, qlevel) {
nlm1.A <- nls(y ~  A/(1+exp(-gamma*(x-tau))), start=list(gamma=beta.est[2],tau=beta.est[3]))
as.numeric(2*(logLik(nlm1) -logLik(nlm1.A) ) - qlevel) }
c1.A <- uniroot(lpA, c(80,beta.est[1]), qlevel)$root
c2.A <- uniroot(lpA, c(beta.est[1], 110), qlevel)$root

ggplot(data=df.A) + geom_line(aes(A,dll), size=1, color="blue") + geom_hline(yintercept=qlevel, color="red", linetype = "longdash") +
  geom_segment(aes(x=c1.A, xend=c1.A, y=-Inf, yend=qlevel),  color="red", linetype = "longdash")  +
  geom_segment(aes(x=c2.A, xend=c2.A, y=-Inf, yend=qlevel),  color="red", linetype = "longdash")  +
  scale_y_continuous(breaks = c(0,2,3,6,round(qlevel,2)),1) +
  scale_x_continuous(breaks = c(seq(85,100,by=5),round(c1.A,1),round(c2.A,1), 110),"A") 

Let us now compute the profile likelihood confidence intervals for \(\gamma\) and \(\tau\)

lpgamma <- function(gamma) {
  nlm1.gamma <- nls(y ~  A/(1+exp(-gamma*(x-tau))), start=list(A=beta.est[1],tau=beta.est[3]))
  as.numeric(2*(logLik(nlm1) -logLik(nlm1.gamma) ) - qlevel) }
c1.gamma <- uniroot(lpgamma, lower=0.40, upper=beta.est[2])$root
c2.gamma <- uniroot(lpgamma, lower=beta.est[2], upper=1)$root

lptau <- function(tau) {
  nlm1.tau <- nls(y ~ A/(1+exp(-gamma*(x-tau))), start=list(A=beta.est[1],gamma=beta.est[2]))
  as.numeric(2*(logLik(nlm1) -logLik(nlm1.tau) ) - qlevel) }
c1.tau <- uniroot(lptau, lower=1.25, upper=beta.est[3])$root
c2.tau <- uniroot(lptau, lower=beta.est[3], upper=2.25)$root

ci.pl <- rbind(rbind(c(c1.A , c2.A),c(c1.gamma , c2.gamma)),c(c1.tau , c2.tau))
row.names(ci.pl) <- beta.names
colnames(ci.pl) <- c(paste0((1-level)/2*100,"%"),paste0((1+level)/2*100,"%"))
ci.pl
##             2.5%       97.5%
## A     87.1523081 105.4511191
## gamma  0.4636369   0.8310879
## tau    1.3118040   1.8520538

Remark:

The confint R function doesn't use a \(\chi^2\) distribution with 1 df for the LRT statistic \(LR_{\rm stat}\) (which is theoretically the right asymptotic distribution).

Instead, the square root of \(LR_{\rm stat}\) is assumed to follow a half \(t\)-distribution with \(n-\nbeta\) df. The null hypothesis \(H_0\) is then rejected when \(LR_{\rm stat}> qt_{1-\alpha/2,n-\nbeta}^2\).

qlevel <- qt(1-alpha/2,df)^2
c1R.A <- uniroot(lpA, c(80,beta.est[1]), qlevel)$root
c2R.A <- uniroot(lpA, c(beta.est[1], 110), qlevel)$root
c(c1R.A, c2R.A)
## [1]  87.13229 105.53713

The two tests - and hence the two confidence intervals - are equivalent for large \(n\), since a \(t\)-distribution with \(n\) df converges to a \({\cal N}(0,1)\) when \(n\) goes to infinity. Then, for any \(0 < p < 1\), \[ (qt_{p,n})^2 \limite{n\to \infty} q\chi^2_{p,1}\]
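We can check this numerically, here with the quantiles of order \(0.975\):

# squared t quantiles approach the chi-squared quantile as the number of df increases
sapply(c(10, 50, 272, 10000), function(n) qt(0.975, n)^2)
qchisq(0.95, 1)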

3 Diagnostic plots

Let us plot i) the observed waiting times versus predicted waiting times, and ii) the residuals versus predicted waiting times

par(mfrow=c(1,2))
plot(predict(nlm1), y)
abline(a=0, b=1, lty=1, col="magenta")
residual.nlm1 <- resid(nlm1)/sd(resid(nlm1))
plot(predict(nlm1), residual.nlm1)
abline(a=0, b=0, lty=2, col="magenta")

On the one hand, the observations appear randomly distributed around the line \(y=x\). On the other hand, the residuals appear evenly distributed around 0, with a constant variance.

Based on these graphs, we therefore don't have any good reason for rejecting model nlm1.


4 Model comparison

We can produce the same diagnostic plots with model nlm2 and arrive at the same conclusion concerning this model.

par(mfrow=c(1,2))
plot(predict(nlm2), y)
abline(a=0, b=1, lty=1, col="magenta")
residual.nlm2 <- resid(nlm2)/sd(resid(nlm2))
plot(predict(nlm2), residual.nlm2)
abline(a=0, b=0, lty=2, col="magenta")

Since nlm1 and nlm2 are two possible models for fitting our data, we need some criteria for comparing them. The statistical tests and the information criteria used for comparing linear models can also be used for comparing nonlinear models.

First, we can perform an ANOVA to test model nlm1 against model nlm2, since these two models are nested (nlm1 corresponds to nlm2 when \(S=0\))

anova(nlm1, nlm2)
## Analysis of Variance Table
## 
## Model 1: y ~ A/(1 + exp(-gamma * (x - tau)))
## Model 2: y ~ (A - S)/(1 + exp(-gamma * (x - tau))) + S
##   Res.Df Res.Sum Sq Df Sum Sq F value    Pr(>F)    
## 1    269     8933.7                                
## 2    268     8469.4  1  464.3  14.692 0.0001578 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let \(RSS_1 = \|y- f_1(x,\hat\beta_1)\|^2\) and \(RSS_2 = \|y- f_2(x,\hat\beta_2)\|^2\) be the residual sums of squares under, respectively, nlm1 and nlm2, and let \(d_1\) and \(d_2\) be the lengths of vectors \(\beta_1\) and \(\beta_2\). Then, \[ F_{\rm stat} = \frac{(RSS_1 - RSS_2)/(d_2-d_1)}{RSS_2/(n-d_2)} \]

RSS1 <-sum(resid(nlm1)^2)
RSS2 <-sum(resid(nlm2)^2)
d1   <- length(coef(nlm1))
d2   <- length(coef(nlm2))
F.stat <- ((RSS1-RSS2)/(d2-d1))/(RSS2/(n-d2))
c(RSS2, RSS1-RSS2, F.stat, 1-pf(F.stat,d2-d1,n-d2))
## [1] 8.469424e+03 4.642987e+02 1.469192e+01 1.577976e-04

Another way to test nlm1 against nlm2 consists in testing if \(S=0\) in model nlm2:

summary(nlm2)$coefficients
##        Estimate Std. Error   t value      Pr(>|t|)
## A     82.465891  0.9973074 82.688535 8.974353e-193
## gamma  2.253936  0.4355265  5.175197  4.469537e-07
## tau    3.055263  0.1106570 27.610217  2.420401e-80
## S     51.322076  1.8302947 28.040335  1.110856e-81

Even though both tests clearly prefer model nlm2, we can remark that the \(t\)-test and the \(F\)-test are not equivalent, since the models are nonlinear.

Information criteria such as AIC and BIC also prefer model nlm2:

as.matrix(AIC(nlm1, nlm2))
##      df      AIC
## nlm1  4 1729.668
## nlm2  5 1717.152
as.matrix(BIC(nlm1, nlm2))
##      df      BIC
## nlm1  4 1744.092
## nlm2  5 1735.181
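These criteria can be recomputed by hand from the log-likelihood, with \(k\) the total number of estimated parameters (the regression parameters plus the residual variance): \({\rm AIC} = -2\log\like + 2k\) and \({\rm BIC} = -2\log\like + k\,\log(n)\). For instance, for model nlm1:

k1 <- length(coef(nlm1)) + 1
c(AIC = -2*as.numeric(logLik(nlm1)) + 2*k1,
  BIC = -2*as.numeric(logLik(nlm1)) + log(n)*k1)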


5 Confidence intervals and prediction intervals for predicted values

5.1 The delta-method

For a given value \(x_0\) of the explanatory variable \(x\), we can use the model \(f\) with the estimated parameter \(\hat{\beta}\) and predict the response as \(f(x_0,\hat{\beta})\).

Since \(\hat{\beta}\) is a random vector with variance-covariance matrix \({\rm Var}(\hat{\beta})\), \(f(x_0,\hat{\beta})\) is also a random variable that can be approximated by a linear function of \(\hat{\beta}\): \[ f(x_0 , \beta) \simeq f(x_0 , \hat{\beta}) + \nabla f(x_0 , \hat{\beta})(\beta - \hat{\beta}) \]

Then, the so-called delta-method consists in using this approximation for approximating the variance of \(f(x_0,\hat{\beta})\) by

\[ {\rm Var}(f(x_0,\hat{\beta})) \simeq \nabla f(x_0 , \hat{\beta}) {\rm Var}(\hat{\beta}) \nabla f(x_0 , \hat{\beta})^\prime, \]

and we can now use this approximation for computing a \((1-\alpha)100\%\) confidence interval for each prediction \(f(x_0,\beta)\):

\[{\rm CI}_{{\rm lin}, 1-\alpha}(f(x_0,\beta)) = [f(x_0,\hat{\beta}) + qt_{\alpha/2, n-d}\ {\rm s.e.}(f(x_0,\hat{\beta})) \ , \ f(x_0,\hat{\beta}) + qt_{1-\alpha/2, n-d}\ {\rm s.e.}(f(x_0,\hat{\beta}))]\]

where \({\rm s.e.}(f(x_0,\hat{\beta}))\) is the standard error of \(f(x_0,\hat{\beta})\) defined as \[ {\rm s.e.}(f(x_0,\hat{\beta})) = \sqrt{\nabla f(x_0 , \hat{\beta}) {\rm Var}(\hat{\beta}) \nabla f(x_0 , \hat{\beta})^\prime} \]

We can also compute a prediction interval for a future observation \[y_0 = f(x_0,\beta) + e_0\]

The prediction for \(y_0\) is \[\hat{y}_0 = f(x_0 , \hat{\beta}).\] The standard error for this prediction should then take into account both the uncertainty on \(f(x_0,\beta)\) and the variability of the residual error \(e_0\):

\[ {\rm s.e.}(\hat{y}_0) = \sqrt{\nabla f(x_0 , \hat{\beta}) {\rm Var}(\hat{\beta}) \nabla f(x_0 , \hat{\beta})^\prime + \sigma^2} \] Then, \[{\rm CI}_{{\rm lin}, 1-\alpha}(y_0) = [f(x_0,\hat{\beta}) + qt_{\alpha/2, n-d}\ {\rm s.e.}(\hat{y}_0) \ , \ f(x_0,\hat{\beta}) + qt_{1-\alpha/2, n-d}\ {\rm s.e.}(\hat{y}_0)]\]

As an example, let us compute the variance of \(f_2(x_0,\hat{\beta}_2)\) for \(x_0 = 1, 1.1, 1.2, \ldots, 5.9, 6\),

fgh2 <- deriv(y ~ (A-S)/(1+exp(-gamma*(x-tau))) + S, c("A", "gamma", "tau", "S"), function(A, gamma, tau, S, x){} ) 

x.new <- seq(1, 6, by=0.1)
beta2.est <- coef(nlm2)
f.new <- fgh2(beta2.est[1],beta2.est[2],beta2.est[3],beta2.est[4], x.new)

g.new <- attr(f.new,"gradient")
V.beta2 <- vcov(nlm2)
GS=rowSums((g.new%*%V.beta2)*g.new)
head(GS)
## [1] 2.337435 2.159290 1.959564 1.739721 1.503322 1.256727

We can then derive a \(95\%\) confidence interval for each \(f_2(x_0,\beta)\)

alpha <- 0.05
deltaf <- sqrt(GS)*qt(1-alpha/2,df)
df.delta <- data.frame(x=x.new, f=f.new, lwr.conf=f.new-deltaf, upr.conf=f.new+deltaf)
head(df.delta)
##     x        f lwr.conf upr.conf
## 1 1.0 51.62222 48.61215 54.63229
## 2 1.1 51.69719 48.80410 54.59027
## 3 1.2 51.79059 49.03455 54.54664
## 4 1.3 51.90682 49.30997 54.50366
## 5 1.4 52.05119 49.63721 54.46516
## 6 1.5 52.23014 50.02302 54.43727

and for each \(y_0\)

sigma2.est <- summary(nlm2)$sigma
deltay <- sqrt(GS + sigma2.est^2)*qt(1-alpha/2,df)
df.delta[c("lwr.pred","upr.pred")] <- cbind(f.new - deltay,f.new + deltay)

We can now plot these two intervals together with the data:

pl + geom_ribbon(data=df.delta, aes(x=x.new, ymin=lwr.pred, ymax=upr.pred), alpha=0.1, fill="blue") + 
  geom_ribbon(data=df.delta, aes(x=x.new, ymin=lwr.conf, ymax=upr.conf), alpha=0.2, fill="#339900") +  
  geom_line(data=df.delta, aes(x=x.new, y=f.new), colour="#339900", size=1)

5.2 Parametric bootstrap

As we have already seen, parametric bootstrap consists in simulating \(L\) replicates of the experiment, by drawing random observations with the fitted model, i.e. using the estimated parameters \(\hat\theta=(\hat\beta,\hat{\sigma}^2)\). Then, for each replicate,

  1. an estimate \(\hat{\beta}^{(\ell)}\) of the vector of parameters \(\beta\) is computed,
  2. a prediction \(f(x_0,\hat{\beta}^{(\ell)})\) is computed for each value of \(x_0\),
  3. a new observation \(y_0^{(\ell)}\) is randomly drawn for each value of \(x_0\).

We can then use the empirical quantiles of \((f(x_0,\hat{\beta}^{(\ell)}), 1 \leq \ell \leq L)\) and \((y_0^{(\ell)}, 1 \leq \ell \leq L)\) to compute a confidence interval for \(f_0=f(x_0,\beta)\) and a prediction interval for \(y_0\).

Let \(f_{0,p}\) and \(y_{0,p}\) be the empirical quantiles of order \(p\) of \((f(x_0,\hat{\beta}^{(\ell)}), 1 \leq \ell \leq L)\) and \((y_0^{(\ell)}, 1 \leq \ell \leq L)\), respectively. Instead of using the empirical intervals \[\begin{aligned} {\rm CI}_{{\rm mc},1-\alpha}(f_0) &= [f_{0,\alpha/2} \ , \ f_{0,1-\alpha/2} ] \\ {\rm CI}_{{\rm mc},1-\alpha}(y_0) &= [y_{0,\alpha/2} \ , \ y_{0,1-\alpha/2} ] \end{aligned}\] we can define the confidence and prediction intervals using the correction previously introduced for obtaining unbiased intervals in the case of a linear model:

\[\begin{aligned} {\rm CI}^\star_{{\rm mc},1-\alpha}(f_0) &= [f(x_0,\hat{\beta}) + \frac{qt_{\alpha/2, n-d}}{q{\cal N}_{\alpha/2}}(f_{0,\alpha/2} - f(x_0,\hat{\beta}))\ , \ f(x_0,\hat{\beta}) + \frac{qt_{1-\alpha/2, n-d}}{q{\cal N}_{1-\alpha/2}}(f_{0,1-\alpha/2} - f(x_0,\hat{\beta}))] \\ {\rm CI}^\star_{{\rm mc},1-\alpha}(y_0) &= [f(x_0,\hat{\beta}) + \frac{qt_{\alpha/2, n-d}}{q{\cal N}_{\alpha/2}}(y_{0,\alpha/2} - f(x_0,\hat{\beta}))\ , \ f(x_0,\hat{\beta}) + \frac{qt_{1-\alpha/2, n-d}}{q{\cal N}_{1-\alpha/2}}(y_{0,1-\alpha/2} - f(x_0,\hat{\beta}))] \end{aligned}\]

pred2 <- function(beta, x){ f <-  beta[4] + (beta[1]-beta[4])/(1+exp(-beta[2]*(x-beta[3]))) }
pred.nlm2 <- pred2(beta2.est,x)
L <- 1000
n.new <- length(x.new)
df.mc <- data.frame(x=x.new, f=f.new)
F <- matrix(nrow=L,ncol=n.new)
Y <- matrix(nrow=L,ncol=n.new)
for (l in (1:L)) {
  yl    <- pred.nlm2 + rnorm(n,0,sigma2.est)
  rl.nlm2 <- nls(yl ~ pred2(beta,x), start=list(beta=beta2.est))
  fl <- predict(rl.nlm2, newdata=df.mc)
  F[l,] <- fl
  Y[l,] <- fl + rnorm(n.new, 0, sigma2.est)   # one new observation for each value of x.new
}

df.mc[c("lwr.conf","upr.conf")] <- t(apply(F,MARGIN=2,function(x) quantile(x,c(alpha/2,1-alpha/2))))
df.mc[c("lwr.pred","upr.pred")] <- t(apply(Y,MARGIN=2,function(x) quantile(x,c(alpha/2,1-alpha/2))))

df.mc[,(3:6)] <- f.new + rq*(df.mc[,(3:6)] - f.new)

pl + geom_ribbon(data=df.mc, aes(x=x.new, ymin=lwr.pred, ymax=upr.pred), alpha=0.1, fill="blue") + 
  geom_ribbon(data=df.mc, aes(x=x.new, ymin=lwr.conf, ymax=upr.conf), alpha=0.2, fill="#339900") +  
  geom_line(data=df.mc, aes(x=x.new, y=f.new), colour="#339900", size=1)