$$ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \def\pmacro{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\simh0{\mathop{\sim}_{H_0}} \def\df{\texttt{df}} \def\res{e} \def\xomega{x} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \newcommand{\Rset}{\mbox{$\mathbb{R}$}} \def\param{\theta} \def\setparam{\Theta} \def\xnew{x_{\rm new}} \def\fnew{f_{\rm new}} \def\ynew{y_{\rm new}} \def\nnew{n_{\rm new}} \def\enew{e_{\rm new}} \def\Xnew{X_{\rm new}} \def\hfnew{\widehat{\fnew}} \def\degree{m} \def\nbeta{d} \newcommand{\limite}[1]{\mathop{\longrightarrow}\limits_{#1}} \def\ka{k{\scriptstyle a}} \def\ska{k{\scriptscriptstyle a}} \def\kel{k{\scriptstyle e}} \def\skel{k{\scriptscriptstyle e}} \def\cl{C{\small l}} \def\Tlag{T\hspace{-0.1em}{\scriptstyle lag}} \def\sTlag{T\hspace{-0.07em}{\scriptscriptstyle lag}} \def\Tk{T\hspace{-0.1em}{\scriptstyle k0}} \def\sTk{T\hspace{-0.07em}{\scriptscriptstyle k0}} \def\thalf{t{\scriptstyle 1/2}} \newcommand{\Dphi}[1]{\partial_\pphi #1} \def\asigma{a} \def\pphi{\psi} \newcommand{\stheta}{{\theta^\star}} \newcommand{\htheta}{{\widehat{\theta}}} $$


1 The model

A regression model relates a response variable \(y\) to a set of explanatory variables \(x\). Assuming that we have access to \(n\) pairs of values \((x_j, y_j)\), \(1 \leq j \leq n\), of these variables, the regression model is assumed to take the form \[y_j = f(x_j,\beta) + \sigma \varepsilon_j \quad ; \quad 1\leq j \leq n\]

where \(f\) is a structural model which depends on a \(d\)-vector of parameters \(\beta\) and where \((\varepsilon_j, 1 \leq j \leq n)\) is a sequence of independent and normally distributed random variables with mean 0 and variance \(1\): \[ \varepsilon_j \iid {\cal N}(0, 1). \] Then, the \(y_j\) are also independent and normally distributed:

\[ y_j \sim {\cal N}(f(x_j, \beta),\sigma^2).\]

The vector \(y=(y_1,y_2,\ldots,y_n)\) is therefore a Gaussian vector whose probability density function (pdf) depends on a vector of parameters \(\param=(\beta,\sigma^2)\):

\[\begin{aligned} \pmacro(y ; \param) &= \prod_{j=1}^n \pmacro(y_j; \param) \\ &= \prod_{j=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \text{exp}\left(-\frac{1}{2\sigma^2}(y_j - f(x_j, \beta))^2 \right) \\ &= \frac{1}{(2\pi \sigma^2)^{n/2}} \text{exp}\left(-\frac{1}{2\sigma^2}\sum_{j=1}^n(y_j - f(x_j, \beta))^2\right). \end{aligned}\]

For a given vector of observations \(y\), the likelihood \(\like\) is the function of the parameter \(\param=(\beta,\sigma^2)\) defined as:

\[ \begin{aligned} \like(\param) &= \pmacro(y ; \param) \\ \end{aligned} \] The log-likelihood is therefore \[ \begin{aligned} \llike(\param) &= \log(\pmacro(y ; \param)) \\ &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \end{aligned} \]
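To make this concrete, here is a minimal Python sketch of the log-likelihood, assuming a hypothetical structural model \(f(x,\beta)=\beta_1(1-e^{-\beta_2 x})\) and simulated data; all names and numerical values below are purely illustrative.

```python
# Minimal sketch (illustrative only): log-likelihood of the regression model
# y_j = f(x_j, beta) + sigma * eps_j, with a hypothetical structural model
# f(x, beta) = beta_1 * (1 - exp(-beta_2 * x)).
import numpy as np

def f(x, beta):
    # hypothetical structural model, chosen only for illustration
    return beta[0] * (1.0 - np.exp(-beta[1] * x))

def loglik(theta, x, y):
    # theta = (beta_1, ..., beta_d, sigma^2)
    beta, sigma2 = theta[:-1], theta[-1]
    n = len(y)
    rss = np.sum((y - f(x, beta)) ** 2)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - rss / (2 * sigma2)

# simulated data under assumed "true" values beta* = (10, 0.6), sigma* = 0.5
rng = np.random.default_rng(1)
x = np.linspace(0.5, 10, 50)
beta_star, sigma_star = np.array([10.0, 0.6]), 0.5
y = f(x, beta_star) + sigma_star * rng.standard_normal(x.size)
print(loglik(np.array([10.0, 0.6, 0.25]), x, y))
```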


2 The maximum likelihood estimator

Assume that \(\param\) takes its values in a subset \(\setparam\) of \(\Rset^P\). Then, the Maximum Likelihood (ML) estimator of \(\param\) is a function of \(y\) that maximizes the likelihood function:

\[ \begin{aligned} \hat{\param} & = \argmax{\param \in \setparam}\like(\param) \\ & = \argmax{\param \in \setparam}\llike(\param) \end{aligned} \]

Maximization of the log-likelihood can be performed in two steps. First, for any given value of \(\sigma^2\), maximizing \(\llike(\beta,\sigma^2)\) with respect to \(\beta\) amounts to minimizing the residual sum of squares:

\[\begin{aligned} \hat{\beta} &= \argmin{\beta} \left\{ n\log(2\pi) + n\log(\sigma^2) + \frac{1}{\sigma^2}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \right\} \\ &= \argmin{\beta}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \end{aligned}\]

We see that, for this model, the Maximum Likelihood estimator \(\hat{\beta}\) is also the Least Squares estimator of \(\beta\). In a second step, \(\hat{\sigma}^2\) is obtained by maximizing \(\llike(\hat{\beta},\sigma^2)\) with respect to \(\sigma^2\):

\[\begin{aligned} \hat{\sigma}^2 &= \argmin{\sigma^2 \in \Rset^+} \left\{ n\log(2\pi) + n\log(\sigma^2) + \frac{1}{\sigma^2}\sum_{j=1}^n(y_j - f(x_j,\hat{\beta}))^2 \right\} \\ &= \frac{1}{n}\sum_{j=1}^n(y_j - f(x_j,\hat{\beta}))^2 \end{aligned}\]

Finally, the log-likelihood computed with \(\hat{\param}=(\hat{\beta},\hat{\sigma}^2)\) reduces to \[ \llike(\hat{\param}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\left(\frac{1}{n}\sum_{j=1}^n(y_j - f(x_j,\hat{\beta}))^2\right) -\frac{n}{2} \]
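As an illustration, the two-step maximization can be sketched as follows, reusing `f`, `loglik`, `x` and `y` from the sketch above; `scipy.optimize.least_squares` is just one possible way to minimize the residual sum of squares.

```python
# Minimal sketch of the two-step ML estimation: beta_hat minimizes the residual
# sum of squares, then sigma2_hat = RSS/n; reuses f, x, y defined above.
import numpy as np
from scipy.optimize import least_squares

def fit_ml(x, y, beta0):
    res = least_squares(lambda b: y - f(x, b), beta0)   # least squares <=> ML for beta
    beta_hat = res.x
    n = len(y)
    sigma2_hat = np.sum((y - f(x, beta_hat)) ** 2) / n  # ML estimator of sigma^2
    ll_hat = -0.5 * n * (np.log(2 * np.pi) + np.log(sigma2_hat) + 1.0)
    return beta_hat, sigma2_hat, ll_hat

beta_hat, sigma2_hat, ll_hat = fit_ml(x, y, beta0=np.array([5.0, 1.0]))
```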


3 The Fisher Information matrix

3.1 Some general definitions

The partial derivative of the log-likelihood with respect to \(\theta\) is called the score. Under general regularity conditions, the expected value of the score is 0. Indeed, it is easy to show that \[\esp{\frac{\partial}{\partial \theta} \log\pmacro(y;\stheta)}=0 ,\] where \(\stheta\) is the "true" unknown value of \(\theta\), i.e. the value such that the observations \(y\) were generated with the model \(\pmacro(\cdot;\stheta)\).

The variance of the score is called the Fisher information matrix (FIM): \[ I_n(\stheta) = \esp{\left(\frac{\partial}{\partial \theta} \log\pmacro(y;\stheta)\right)\left(\frac{\partial}{\partial \theta} \log\pmacro(y;\stheta)\right)^\prime} . \] Furthermore, it can be shown that if \(\llike\) is twice differentiable with respect to \(\theta\),

\[ \begin{aligned} I_n(\stheta) &= - \esp{\frac{\partial^2}{\partial \theta \partial \theta^\prime} \log\pmacro(y;\stheta)} \\ &= - \sum_{j=1}^n \esp{\frac{\partial^2}{\partial \theta \partial \theta^\prime} \log\pmacro(y_j;\stheta)} \end{aligned} \]
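Both identities can be checked numerically. The sketch below, which reuses the simulated example above, approximates the score by finite differences and verifies by Monte Carlo that its mean is close to 0 and that its covariance is close to \(I_n(\stheta)\).

```python
# Minimal sketch (Monte Carlo check; reuses f, loglik, beta_star, sigma_star, x, rng):
# the score at the true parameter has mean ~ 0 and covariance ~ I_n(theta*).
def score(theta, x, y, h=1e-5):
    # central finite-difference gradient of the log-likelihood w.r.t. theta
    g = np.zeros(len(theta))
    for i in range(len(theta)):
        e = np.zeros(len(theta)); e[i] = h
        g[i] = (loglik(theta + e, x, y) - loglik(theta - e, x, y)) / (2 * h)
    return g

theta_star = np.append(beta_star, sigma_star ** 2)
scores = np.array([score(theta_star, x,
                         f(x, beta_star) + sigma_star * rng.standard_normal(x.size))
                   for _ in range(2000)])
print(scores.mean(axis=0))  # approximately 0
print(np.cov(scores.T))     # approximately I_n(theta*)
```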

3.2 The central limit theorem

The following central limit theorem (CLT) holds under certain regularity conditions: \[ I_n(\stheta)^{\frac{1}{2}}(\htheta-\stheta) \limite{n\to \infty} {\mathcal N}(0,{\rm Id}_P) . \] This theorem shows that, under relevant hypotheses, the estimator \(\htheta\) is consistent and converges to \(\stheta\) at rate \(\sqrt{n}\), since \(I_n={\cal O}(n)\).

The normalizing matrix \(I_n(\stheta)\) is unknown since it depends on the unknown parameter \(\stheta\). We can use instead the observed Fisher information: \[ \begin{aligned} I_y(\htheta) &= - \frac{\partial^2}{\partial \theta \partial \theta^\prime} \llike(\htheta) \\ &=-\sum_{j=1}^n \frac{\partial^2}{\partial \theta \partial \theta^\prime} \log \pmacro(y_j ; \htheta). \end{aligned} \] We can then approximate the distribution of \(\htheta\) by a normal distribution with mean \(\stheta\) and variance-covariance matrix \(I_y(\htheta)^{-1}\): \[ \htheta \approx {\mathcal N}(\stheta , I_y(\htheta)^{-1}) . \] The square roots of the diagonal elements of \(I_y(\htheta)^{-1}\) are called the standard errors (s.e.) of the elements of \(\htheta\).
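A possible numerical implementation, reusing `loglik` and the estimates computed above, approximates the observed Fisher information by a finite-difference Hessian and derives the standard errors from its inverse.

```python
# Minimal sketch: observed Fisher information I_y(theta_hat) as minus the Hessian
# of the log-likelihood at theta_hat (central finite differences); the standard
# errors are the square roots of the diagonal of its inverse.
def observed_fim(ll, theta_hat, *args, h=1e-4):
    p = len(theta_hat)
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            ei, ej = np.zeros(p), np.zeros(p)
            ei[i], ej[j] = h, h
            H[i, j] = (ll(theta_hat + ei + ej, *args) - ll(theta_hat + ei - ej, *args)
                       - ll(theta_hat - ei + ej, *args) + ll(theta_hat - ei - ej, *args)) / (4 * h ** 2)
    return -H

theta_hat = np.append(beta_hat, sigma2_hat)
I_obs = observed_fim(loglik, theta_hat, x, y)
se = np.sqrt(np.diag(np.linalg.inv(I_obs)))  # standard errors of the elements of theta_hat
```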


3.3 The FIM for a regression model

We have seen that, for a regression model, \[ \begin{aligned} \llike(\param) &= \llike(\beta,\sigma^2) \\ &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \end{aligned} \] By definition,

\[ I_n(\theta) = \left( \begin{array}{cc} -\esp{\frac{\partial^2}{\partial \beta \partial \beta^\prime} \llike(\beta,\sigma^2)} & -\esp{\frac{\partial^2}{\partial \beta \partial \sigma^2} \llike(\beta,\sigma^2)} \\ -\esp{\frac{\partial^2}{\partial \sigma^2 \partial \beta^\prime } \llike(\beta,\sigma^2)} & -\esp{\frac{\partial^2}{\partial (\sigma^{2})^2 } \llike(\beta,\sigma^2)} \end{array} \right) \]

Then, \[ \begin{aligned} \esp{\frac{\partial^2}{\partial \beta \partial \sigma^2} \llike(\beta,\sigma^2)} &= -\frac{1}{\sigma^4} \sum_{j=1}^n \frac{\partial}{\partial \beta}f(x_j,\beta) \,\esp{y_j - f(x_j,\beta)} \\ &= 0 \end{aligned} \]

and the FIM reduces to

\[ I_n(\theta) = \left( \begin{array}{cc} - \esp{\frac{\partial^2}{\partial \beta \partial \beta^\prime} \llike(\beta,\sigma^2)} & 0 \\ 0 & -\esp{\frac{\partial^2}{\partial (\sigma^{2})^2 } \llike(\beta,\sigma^2)} \end{array} \right) \]

Because of the block structure of \(I_n(\stheta)\), the variance-covariance matrix of \(\hat{\beta}\) can be estimated by \(I^{-1}_y(\hat{\beta})\) where \[ \begin{aligned} I_y(\hat{\beta}) &= - \frac{\partial^2}{\partial \beta \partial \beta^\prime} \llike(\hat{\beta},\hat{\sigma}^2) \\ &= \frac{1}{2\hat\sigma^2}\frac{\partial^2}{\partial \beta \partial \beta^\prime} \left(\sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2 \right) \\ &= \frac{1}{\hat\sigma^2} \sum_{j=1}^n \left( \left(\frac{\partial}{\partial \beta}f(x_j,\hat\beta)\right)\left(\frac{\partial}{\partial \beta}f(x_j,\hat\beta)\right)^\prime - \frac{\partial^2}{\partial \beta \partial \beta^\prime}f(x_j,\hat\beta)\,(y_j - f(x_j,\hat\beta)) \right) \end{aligned} \] Remark: In the case of a linear model \(y=X\beta+e\), the second derivative of \(f\) vanishes and we find that \(I_y(\hat{\beta}) = (X^\prime X)/\hat\sigma^2\).
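The remark for the linear model can be checked with a short sketch; the design matrix and the simulated linear data below are hypothetical choices, reusing `x` and `rng` from above.

```python
# Minimal sketch (illustrative): for a linear model y = X beta + e, the observed
# information for beta is X'X / sigma2_hat.
X = np.column_stack([np.ones_like(x), x])                    # hypothetical design matrix
y_lin = X @ np.array([2.0, 1.5]) + 0.5 * rng.standard_normal(x.size)
beta_lin = np.linalg.lstsq(X, y_lin, rcond=None)[0]          # least squares == ML estimate
sigma2_lin = np.sum((y_lin - X @ beta_lin) ** 2) / len(y_lin)
I_beta = X.T @ X / sigma2_lin                                # I_y(beta_hat)
se_beta = np.sqrt(np.diag(np.linalg.inv(I_beta)))            # standard errors of beta_hat
```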

The variance of \(\hat{\sigma}^2\) is estimated by \(I^{-1}_y(\hat{\sigma}^2)\) where \[ \begin{aligned} I_y(\hat{\sigma}^2) &= -\frac{\partial^2}{\partial (\sigma^{2})^2 } \llike(\hat{\beta},\hat{\sigma}^2) \\ &= -\frac{n}{2\hat\sigma^4} + \frac{1}{\hat\sigma^6}\sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2 \\ &= \frac{n}{2\hat\sigma^4} \end{aligned} \]

Then, \({\rm se}(\hat{\sigma}^2) = \hat{\sigma}^2/\sqrt{n/2}\).
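In code, reusing `sigma2_hat` and `y` from the nonlinear fit above, this standard error is simply:

```python
# Standard error of sigma2_hat, from I_y(sigma2_hat) = n / (2 * sigma2_hat^2)
se_sigma2 = sigma2_hat / np.sqrt(len(y) / 2)
```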