
# 1 The model

A regression model relates a response variable $$y$$ to a set of explanatory variables $$x$$. Assuming that we have access to $$n$$ sets of values $$(x_j, y_j)$$, $$1 \leq j \leq n$$, of these variables, the regression model is assumed to take the form $y_j = f(x_j,\beta) + \sigma \varepsilon_j \quad ; \quad 1\leq j \leq n$

where $$f$$ is a structural model which depends on a $$d$$-vector of parameters $$\beta$$ and where $$(\varepsilon_j, 1 \leq j \leq n)$$ is a sequence of independent and normally distributed random variables with mean 0 and variance $$1$$: $\varepsilon_j \iid {\cal N}(0, 1).$ Then, the $$y_j$$ are also independent and normally distributed:

$y_j \sim {\cal N}(f(x_j, \beta),\sigma^2).$
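To make this concrete, here is a minimal simulation sketch in Python. The structural model $$f$$ (a two-parameter exponential decay), the design points and the parameter values are illustrative assumptions, not part of the model description above.

```python
import numpy as np

# Minimal simulation sketch. The structural model f, the design points x_j and
# the parameter values below are illustrative assumptions.
def f(x, beta):
    # hypothetical structural model: two-parameter exponential decay
    return beta[0] * np.exp(-beta[1] * x)

rng = np.random.default_rng(1234)
n = 100
x = np.linspace(0.1, 10.0, n)        # design points x_j
beta_true = np.array([5.0, 0.4])     # beta used to generate the data
sigma_true = 0.5                     # residual standard deviation sigma

# y_j = f(x_j, beta) + sigma * eps_j, with eps_j iid N(0, 1)
y = f(x, beta_true) + sigma_true * rng.standard_normal(n)
```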

The vector $$y=(y_1,y_2,\ldots,y_n)$$ is therefore a Gaussian vector whose probability density function (pdf) depends on a vector of parameters $$\param=(\beta,\sigma^2)$$:

\begin{aligned} \pmacro(y ; \param) &= \prod_{j=1}^n \pmacro(y_j; \param) \\ &= \prod_{j=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \text{exp}\left(-\frac{1}{2\sigma^2}(y_j - f(x_j, \beta))^2 \right) \\ &= \frac{1}{(2\pi \sigma^2)^{n/2}} \text{exp}\left(-\frac{1}{2\sigma^2}\sum_{j=1}^n(y_j - f(x_j, \beta))^2\right). \end{aligned}

For a given vector of observations $$y$$, the likelihood $$\like$$ is the function of the parameter $$\param=(\beta,\sigma^2)$$ defined as:

\begin{aligned} \like(\param) &= \pmacro(y ; \param) \end{aligned} The log-likelihood is therefore \begin{aligned} \llike(\param) &= \log(\pmacro(y ; \param)) \\ &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \end{aligned}
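This log-likelihood is straightforward to code. A minimal sketch in Python, continuing the simulation above (the structural model $$f$$ and the data are still the illustrative assumptions introduced earlier):

```python
def loglik(theta, x, y):
    """Log-likelihood of theta = (beta, sigma^2) for the regression model."""
    beta, sigma2 = theta[:-1], theta[-1]
    resid = y - f(x, beta)
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - 0.5 * np.sum(resid ** 2) / sigma2)

# evaluate at the parameter values used for the simulation: theta = (5.0, 0.4, 0.25)
print(loglik(np.array([5.0, 0.4, 0.25]), x, y))
```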

# 2 The maximum likelihood estimator

Assume that $$\param$$ takes its values in a subset $$\setparam$$ of $$\Rset^P$$. Then, the Maximum Likelihood (ML) estimator of $$\param$$ is a function of $$y$$ that maximizes the likelihood function:

\begin{aligned} \hat{\param} & = \argmax{\param \in \setparam}\like(\param) \\ & = \argmax{\param \in \setparam}\llike(\param) \end{aligned}

Maximization of the log-likelihood can be performed in two steps:

• $$\beta$$, the parameter of the structural model, is estimated by minimizing the residual sum of squares:

\begin{aligned} \hat{\beta} &= \argmin{\beta} \left\{ n\log(2\pi) + n\log(\sigma^2) + \frac{1}{\sigma^2}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \right\} \\ &= \argmin{\beta}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \end{aligned}

We see that, for this model, the Maximum Likelihood estimator $$\hat{\beta}$$ is also the Least Squares estimator of $$\beta$$.

• $$\sigma^2$$, the variance of the residual errors $$e_j = \sigma\varepsilon_j$$, is estimated in a second step:

\begin{aligned} \hat{\sigma}^2 &= \argmin{\sigma^2 \in \Rset^+} \left\{ n\log(2\pi) + n\log(\sigma^2) + \frac{1}{\sigma^2}\sum_{j=1}^n(y_j - f(x_j,\hat{\beta}))^2 \right\} \\ &= \frac{1}{n}\sum_{j=1}^n(y_j - f(x_j,\hat{\beta}))^2 \end{aligned}

Finally, the log-likelihood computed with $$\hat{\param}=(\hat{\beta},\hat{\sigma}^2)$$ reduces to $\llike(\hat{\param}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\left(\frac{1}{n}\sum_{j=1}^n(y_j - f(x_j,\hat{\beta}))^2\right) -\frac{n}{2}$
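These two steps are easy to carry out numerically. A sketch using `scipy.optimize.least_squares` on the simulated data and hypothetical $$f$$ from the earlier snippets:

```python
from scipy.optimize import least_squares

# Step 1: beta_hat minimizes the residual sum of squares
residuals = lambda beta: y - f(x, beta)
fit = least_squares(residuals, x0=np.array([1.0, 1.0]))
beta_hat = fit.x

# Step 2: sigma2_hat is the mean squared residual
sigma2_hat = np.mean(residuals(beta_hat) ** 2)

# log-likelihood at the ML estimate, using the closed form above
ll_hat = -0.5 * n * (np.log(2 * np.pi) + np.log(sigma2_hat) + 1)
print(beta_hat, sigma2_hat, ll_hat)
```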

# 3 The Fisher Information matrix

## 3.1 Some general definitions

The partial derivative of the log-likelihood with respect to $$\theta$$ is called the score. Under general regularity conditions, the expected value of the score is 0. Indeed, it is easy to show that $\esp{\frac{\partial}{\partial \theta} \log\pmacro(y;\stheta)}=0 ,$ where $$\stheta$$ is the true, unknown value of $$\theta$$, i.e. the value such that the observations $$y$$ were generated with the model $$\pmacro(\cdot;\stheta)$$.
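This identity follows from the fact that $$\pmacro(\cdot;\stheta)$$ integrates to 1, assuming that differentiation and integration can be interchanged:

\begin{aligned} \esp{\frac{\partial}{\partial \theta} \log\pmacro(y;\stheta)} &= \int \frac{\partial}{\partial \theta} \log\pmacro(y;\stheta) \, \pmacro(y;\stheta) \, dy \\ &= \int \frac{\partial}{\partial \theta} \pmacro(y;\stheta) \, dy \\ &= \frac{\partial}{\partial \theta} \int \pmacro(y;\stheta) \, dy \\ &= \frac{\partial}{\partial \theta} 1 = 0 . \end{aligned}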

The variance of the score is called the Fisher information matrix (FIM): $I_n(\stheta) = \esp{\left(\frac{\partial}{\partial \theta} \log\pmacro(y;\stheta)\right)\left(\frac{\partial}{\partial \theta} \log\pmacro(y;\stheta)\right)^\prime} .$ Furthermore, it can be shown that if $$\llike$$ is twice differentiable with respect to $$\theta$$,

\begin{aligned} I_n(\stheta) &= - \esp{\frac{\partial^2}{\partial \theta \partial \theta^\prime} \log\pmacro(y;\stheta)} \\ &= - \sum_{j=1}^n \esp{\frac{\partial^2}{\partial \theta \partial \theta^\prime} \log\pmacro(y_j;\stheta)} \end{aligned}

## 3.2 The central limit theorem

The following central limit theorem (CLT) holds under certain regularity conditions: $I_n(\stheta)^{\frac{1}{2}}(\htheta-\stheta) \limite{n\to \infty} {\mathcal N}(0,{\rm Id}_P) .$ This theorem shows that, under the relevant hypotheses, the estimator $$\htheta$$ is consistent and converges to $$\stheta$$ at rate $$\sqrt{n}$$ since $$I_n={\cal O}(n)$$.

The normalizing term $$I_n(\stheta)^{-1}$$ is unknown since it depends on the unknown parameter $$\stheta$$. We can use instead the observed Fisher information: \begin{aligned} I_y(\htheta) &= - \frac{\partial^2}{\partial \theta \partial \theta^\prime} \llike(\htheta) \\ &=-\sum_{j=1}^n \frac{\partial^2}{\partial \theta \partial \theta^\prime} \log \pmacro(y_j ; \htheta). \end{aligned} We can then approximate the distribution of $$\htheta$$ by a normal distribution with mean $$\stheta$$ and variance-covariance matrix $$I_{y}(\htheta)^{-1}$$: $\htheta \approx {\mathcal N}(\stheta , I_{y}(\htheta)^{-1}) .$ The square roots of the diagonal elements of $$I_{y}(\htheta)^{-1}$$ are called the standard errors (s.e.) of the elements of $$\htheta$$.
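In practice, the observed Fisher information can be obtained numerically as minus a finite-difference Hessian of the log-likelihood at $$\htheta$$. A minimal sketch, reusing the hypothetical model and the `loglik` function from the snippets above:

```python
def numerical_hessian(fun, theta, h=1e-4):
    """Central finite-difference Hessian of fun at theta."""
    p = len(theta)
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            tpp = theta.copy(); tpp[i] += h; tpp[j] += h
            tpm = theta.copy(); tpm[i] += h; tpm[j] -= h
            tmp = theta.copy(); tmp[i] -= h; tmp[j] += h
            tmm = theta.copy(); tmm[i] -= h; tmm[j] -= h
            H[i, j] = (fun(tpp) - fun(tpm) - fun(tmp) + fun(tmm)) / (4 * h ** 2)
    return H

theta_hat = np.append(beta_hat, sigma2_hat)
I_obs = -numerical_hessian(lambda t: loglik(t, x, y), theta_hat)  # observed FIM
se = np.sqrt(np.diag(np.linalg.inv(I_obs)))                       # standard errors
print(se)
```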

## 3.3 The FIM for a regression model

We have seen that, for a regression model, \begin{aligned} \llike(\param) &= \llike(\beta,\sigma^2) \\ &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \end{aligned} By definition,

$I_n(\theta) = \left( \begin{array}{cc} -\esp{\frac{\partial^2}{\partial \beta \partial \beta^\prime} \llike(\beta,\sigma^2)} & -\esp{\frac{\partial^2}{\partial \beta \partial \sigma^2} \llike(\beta,\sigma^2)} \\ -\esp{\frac{\partial^2}{\partial \sigma^2 \partial \beta^\prime } \llike(\beta,\sigma^2)} & -\esp{\frac{\partial^2}{\partial (\sigma^2)^2 } \llike(\beta,\sigma^2)} \end{array} \right)$

Then, \begin{aligned} \esp{\frac{\partial^2}{\partial \beta \partial \sigma^2} \llike(\beta,\sigma^2)} &= -\frac{1}{\sigma^4} \sum_{j=1}^n \frac{\partial}{\partial \beta}f(x_j,\beta) \, \esp{y_j - f(x_j,\beta)} \\ &= 0 \end{aligned}

and the FIM reduces to

$I_n(\theta) = \left( \begin{array}{cc} - \esp{\frac{\partial^2}{\partial \beta \partial \beta^\prime} \llike(\beta,\sigma^2)} & 0 \\ 0 & -\esp{\frac{\partial^2}{\partial (\sigma^2)^2 } \llike(\beta,\sigma^2)} \end{array} \right)$

Because of the block structure of $$I_n(\stheta)$$, the variance-covariance matrix of $$\hat{\beta}$$ can be estimated by $$I^{-1}_y(\hat{\beta})$$ where \begin{aligned} I_y(\hat{\beta}) &= - \frac{\partial^2}{\partial \beta \partial \beta^\prime} \llike(\hat{\beta},\hat{\sigma}^2) \\ &= \frac{1}{2\hat\sigma^2}\frac{\partial^2}{\partial \beta \partial \beta^\prime} \left(\sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2 \right) \\ &= \frac{1}{\hat\sigma^2} \sum_{j=1}^n \left( \left(\frac{\partial}{\partial \beta}f(x_j,\hat\beta)\right)\left(\frac{\partial}{\partial \beta}f(x_j,\hat\beta)\right)^\prime - \frac{\partial^2}{\partial \beta \partial \beta^\prime}f(x_j,\hat\beta)\,(y_j - f(x_j,\hat\beta)) \right) \end{aligned} Remark: In the case of a linear model $$y=X\beta+e$$, we find that $$I_y(\hat{\beta}) = (X^\prime X)/\hat\sigma^2$$.
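Since $$\esp{y_j - f(x_j,\beta)}=0$$, the second term in the sum has expectation zero, and dropping it gives the outer-product (Gauss-Newton) form $$\frac{1}{\hat\sigma^2}\sum_{j=1}^n \left(\frac{\partial}{\partial \beta}f(x_j,\hat\beta)\right)\left(\frac{\partial}{\partial \beta}f(x_j,\hat\beta)\right)^\prime$$, which is exact in the linear case. A sketch of this computation with numerical gradients of the hypothetical $$f$$, reusing the earlier snippets:

```python
def grad_f(xj, beta, h=1e-6):
    """Central finite-difference gradient of f(xj, .) with respect to beta."""
    g = np.zeros(len(beta))
    for k in range(len(beta)):
        bp = beta.copy(); bp[k] += h
        bm = beta.copy(); bm[k] -= h
        g[k] = (f(xj, bp) - f(xj, bm)) / (2 * h)
    return g

# Outer-product approximation of I_y(beta_hat); for a linear model y = X beta + e
# this is exactly X'X / sigma2_hat
G = np.array([grad_f(xj, beta_hat) for xj in x])      # n x d matrix of gradients
I_beta = G.T @ G / sigma2_hat
se_beta = np.sqrt(np.diag(np.linalg.inv(I_beta)))     # standard errors of beta_hat
print(se_beta)
```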

The variance of $$\hat{\sigma}^2$$ is estimated by $$I^{-1}_y(\hat{\sigma}^2)$$ where \begin{aligned} I_y(\hat{\sigma}^2) &= -\frac{\partial^2}{\partial (\sigma^2)^2 } \llike(\hat{\beta},\hat{\sigma}^2) \\ &= -\frac{n}{2\hat\sigma^4} + \frac{1}{\hat\sigma^6}\sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2 \\ &= \frac{n}{2\hat\sigma^4} \end{aligned}

Then, $${\rm se}(\hat{\sigma}^2) = \hat{\sigma}^2/\sqrt{n/2}$$.
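As a quick numerical check of this expression, one can simulate many datasets from the hypothetical model used above and compare the empirical standard deviation of $$\hat{\sigma}^2$$ with $$\sigma^2\sqrt{2/n}$$:

```python
# Monte Carlo check: empirical sd of sigma2_hat vs. sigma^2 * sqrt(2/n)
reps = 500
sigma2_hats = np.empty(reps)
for r in range(reps):
    y_sim = f(x, beta_true) + sigma_true * rng.standard_normal(n)
    res = least_squares(lambda b: y_sim - f(x, b), x0=np.array([1.0, 1.0]))
    sigma2_hats[r] = np.mean((y_sim - f(x, res.x)) ** 2)

print(sigma2_hats.std(), sigma_true ** 2 * np.sqrt(2 / n))
```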