$$ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \def\pmacro{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\simh0{\mathop{\sim}_{H_0}} \def\df{\texttt{df}} \def\res{e} \def\xomega{x} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \newcommand{\Rset}{\mbox{$\mathbb{R}$}} \def\param{\theta} \def\setparam{\Theta} \def\xnew{x_{\rm new}} \def\fnew{f_{\rm new}} \def\ynew{y_{\rm new}} \def\nnew{n_{\rm new}} \def\enew{e_{\rm new}} \def\Xnew{X_{\rm new}} \def\hfnew{\widehat{\fnew}} \def\degree{m} \def\nbeta{d} \newcommand{\limite}[1]{\mathop{\longrightarrow}\limits_{#1}} \def\ka{k{\scriptstyle a}} \def\ska{k{\scriptscriptstyle a}} \def\kel{k{\scriptstyle e}} \def\skel{k{\scriptscriptstyle e}} \def\cl{C{\small l}} \def\Tlag{T\hspace{-0.1em}{\scriptstyle lag}} \def\sTlag{T\hspace{-0.07em}{\scriptscriptstyle lag}} \def\Tk{T\hspace{-0.1em}{\scriptstyle k0}} \def\sTk{T\hspace{-0.07em}{\scriptscriptstyle k0}} \def\thalf{t{\scriptstyle 1/2}} \newcommand{\Dphi}[1]{\partial_\pphi #1} \def\asigma{a} \def\pphi{\psi} \newcommand{\stheta}{{\theta^\star}} \newcommand{\htheta}{{\widehat{\theta}}} $$


1 The faithful data

1.1 The data

The faithful data consist of the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

data("faithful")
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

We will consider the waiting time in the following. Let us display the empirical distribution of this variable,

y = faithful$waiting
hist(y, xlab="waiting time (mn)")

We clearly see 2 modes: the waiting times seem to be distributed either around 50mn or around 80mn.


1.2 k-means clustering

Imagine that we want to partition the data \((y_i , 1 \leq i \leq n)\) into \(L\) clusters. Let \(\mu_1, \mu_2, \ldots, \mu_L\) be the centers of these \(L\) clusters. A way to decide which cluster an observation \(y_i\) belongs to consists in minimizing the distance between \(y_i\) and the centers \((\mu_\ell)\).

Let \(z_i\) be a label variable such that \(z_i=\ell\) if observation \(i\) belongs to cluster \(\ell\). Then,

\[z_i = {\rm arg}\min_{ \ell \in \{1,2, \ldots,L\}} (y_i-\mu_\ell)^2\] The centers \((\mu_\ell)\) can be estimated by minimizing the within-cluster sum of squares

\[\begin{aligned} U(\mu_1,\mu_2,\ldots,\mu_L) & =\sum_{i=1}^n \min_{\ell \in \{1, \ldots, L\}} (y_i - \mu_\ell)^2 \\ &= \sum_{i=1}^n (y_i - \mu_{z_i})^2 \\ &= \sum_{i=1}^n \sum_{\ell=1}^L (y_i - \mu_\ell)^2 \one_{{z}_i=\ell} \end{aligned}\]

For \(\ell=1,2,\ldots, L\), the solution \(\hat\mu_\ell\) is the empirical mean computed in cluster \(\ell\). Let \(n_\ell = \sum_{i=1}^n \one_{{z}_i=\ell}\) be the number of observations belonging to cluster \(\ell\). Then \[ \hat\mu_\ell = \frac{1}{n_\ell} \sum_{i=1}^n y_i\one_{{z}_i=\ell} \]
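
These two relations suggest the classical alternating scheme (Lloyd's algorithm): assign each observation to its nearest center, then recompute each center as the empirical mean of its cluster, and iterate. Below is a minimal sketch for \(L=2\) (a hypothetical helper; the analysis that follows instead minimizes the criterion \(U\) directly):

# minimal sketch of the alternating k-means scheme for two clusters
# (assumes both clusters remain non-empty during the iterations)
km2 <- function(y, mu, K=20) {
  for (k in 1:K) {
    z  <- ifelse((y - mu[1])^2 <= (y - mu[2])^2, 1, 2)   # assignment step
    mu <- c(mean(y[z == 1]), mean(y[z == 2]))            # update step
  }
  return(list(centers=mu, cluster=z))
}
km2(y, mu=c(50, 80))$centers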

Let us compute the centers of the two clusters for our faithful data:

U <- function(mu,y) {
  e1 <- (y-mu[1])^2
  e2 <- (y-mu[2])^2
  e.min <- pmin(e1,e2)
  e <- sum(e.min)
  return(e)
}

r.cluster <- nlm(U,c(50,80),y)
mu.est <- r.cluster$estimate
print(mu.est)
## [1] 54.74997 80.28484

We can then classify the \(n\) observations into these 2 clusters

e1 <- (y-mu.est[1])^2
e2 <- (y-mu.est[2])^2
group.est <- rep(1,length(y))
group.est[which(e2<e1)] <- 2
group.est <- as.factor(group.est)

library(ggplot2)
theme_set(theme_bw())

ggplot() + geom_point(data=data.frame(y,group.est), aes(x=y,y=0,colour=group.est),size=3) + geom_point(data=data.frame(y=mu.est,group=factor(c(1,2))), aes(y,0,colour=group),size=10, shape="x")+ geom_vline(xintercept=mean(mu.est))

and compute the sizes, the empirical means and standard deviations for each cluster:

print(c(length(y[group.est==1]),length(y[group.est==2])))
## [1] 100 172
print(c(mean(y[group.est==1]),mean(y[group.est==2])))
## [1] 54.75000 80.28488
print(c(sd(y[group.est==1]),sd(y[group.est==2])))
## [1] 5.895341 5.627335

Remark: we could equivalently use the function kmeans:

r.km <- kmeans(y, centers=2)
r.km$cluster
##   [1] 2 1 2 1 2 1 2 2 1 2 1 2 2 1 2 1 1 2 1 2 1 1 2 2 2 2 1 2 2 2 2 2 1 2 2
##  [36] 1 1 2 1 2 2 1 2 1 2 2 1 1 2 1 2 2 1 2 1 2 2 1 2 2 1 2 1 2 1 2 2 2 1 2
##  [71] 2 1 2 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 1 2 1 2 2
## [106] 1 2 1 2 2 2 1 2 2 1 2 1 2 1 2 1 2 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
## [141] 2 1 2 2 2 1 2 1 2 1 2 2 1 2 2 2 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 1 2 2 2
## [176] 2 2 1 2 2 1 2 2 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1 2 1 2 2 1 2 1 2 2 1 2
## [211] 2 2 1 2 1 2 1 2 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 2 1 1 2 2 1 2 1 2 1 2
## [246] 2 1 2 1 2 1 2 2 2 2 2 2 2 1 2 2 2 1 2 1 1 2 2 1 2 1 2
r.km$size
## [1] 100 172
as.vector(r.km$centers)
## [1] 54.75000 80.28488
sqrt(r.km$withinss/(r.km$size-1))
## [1] 5.895341 5.627335


1.3 Mixture of probability distributions

In a probability framework,

  • the labels \(z_1, \ldots, z_n\) are a sequence of random variables that take their values in \(\{1, 2, \ldots, L \}\) and such that, for \(\ell=1,2,\ldots, L\), \[\prob{z_i = \ell} = \pi_\ell\] where \(\sum_{\ell=1}^L \pi_\ell = 1\).

  • the observations in group \(\ell\), i.e. such that \(z_i=\ell\), are independent and follow the same probability distribution \(f_\ell\), \[ y_i | z_i=\ell \iid f_\ell\]

The probability distribution of \(y_i\) is therefore a mixture of \(L\) distributions: \[\begin{aligned} \pmacro(y_i) &= \sum_{\ell=1}^L \pmacro(y_i , z_i=\ell) \\ &= \sum_{\ell=1}^L \prob{ z_i=\ell}\pmacro(y_i | z_i=\ell) \\ &= \sum_{\ell=1}^L \pi_\ell \, f_\ell(y_i) \end{aligned}\] If, for each \(\ell\), \(f_\ell\) is a normal distribution with mean \(\mu_\ell\) and variance \(\sigma^2_\ell\), the model is a Gaussian mixture model:

\[ y_i \iid \sum_{\ell=1}^L \pi_\ell \, \cal{N}(\mu_\ell \ , \ \sigma^2_\ell) \] The vector of parameters of the model is \(\theta = (\pi_1,\mu_1,\sigma^2_1,\cdots,\pi_L,\mu_L,\sigma^2_L )\) and the likelihood function is

\[\begin{aligned} \like(\theta , y) &= \pmacro(y ; \theta) \\ &= \prod_{i=1}^n \pmacro(y_i ; \theta) \\ &= \prod_{i=1}^n \left( \sum_{\ell=1}^L \prob{ z_i=\ell ; \theta}\pmacro(y_i | z_i=\ell ;\theta) \right) \\ &= \prod_{i=1}^n \left( \sum_{\ell=1}^L \frac{\pi_\ell}{\sqrt{2\pi \sigma^2_\ell}} \ \exp \left\{-\frac{1}{2\sigma_\ell^2}(y_i - \mu_\ell)^2 \right\} \right) \end{aligned}\]
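
Before turning to estimation, note that this hierarchical definition is straightforward to simulate: first draw each label \(z_i\) with probabilities \((\pi_\ell)\), then draw \(y_i\) given \(z_i\). Here is a minimal sketch for \(L=2\), with arbitrary parameter values chosen for illustration only:

# simulate from a two-component Gaussian mixture: draw the labels first, then y given z
# (arbitrary illustrative parameter values, not estimated from the data)
n.sim  <- 500
p.sim  <- c(0.35, 0.65)
mu.sim <- c(55, 80)
sd.sim <- c(6, 6)
z.sim  <- sample(1:2, n.sim, replace=TRUE, prob=p.sim)
y.sim  <- rnorm(n.sim, mean=mu.sim[z.sim], sd=sd.sim[z.sim])
hist(y.sim, breaks=20)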

The maximum likelihood (ML) estimate of \(\theta\) cannot be computed in a closed form but several methods can be used for maximizing this likelihood function.

For instance, a Newton-type algorithm can be used to minimize the deviance \(-2\log(\like(\theta , y))\):

mixt.deviance <- function(theta,y) {
  pi1    <- theta[1]
  pi2    <- 1 - pi1
  mu1    <- theta[2]
  mu2    <- theta[3]
  sigma1 <- theta[4]
  sigma2 <- theta[5]
  pdf <- pi1*dnorm(y,mu1,sigma1) + pi2*dnorm(y,mu2,sigma2)
  deviance <- -2*sum(log(pdf))
  return(deviance)
}

r.nlm <- nlm(mixt.deviance,c(.25,52,82,10,10),y)
theta.est <- c(r.nlm$estimate[1], 1-r.nlm$estimate[1], r.nlm$estimate[2:5])
print(matrix(theta.est,nrow=3,byrow = T))
##            [,1]       [,2]
## [1,]  0.3608861  0.6391139
## [2,] 54.6148563 80.0910688
## [3,]  5.8712181  5.8677343
print(mixt.deviance(r.nlm$estimate,y))
## [1] 2068.003
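
Remark: the deviance could equally be minimized with another general-purpose optimizer; for instance, a quick sketch with the function optim (an alternative to nlm, not used in the rest of this document):

# same minimization with optim; the extra argument y is passed on to mixt.deviance
r.optim <- optim(c(.25,52,82,10,10), mixt.deviance, y=y)
r.optim$par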

We can then plot the empirical distribution of the data together with the probability density function of the mixture:

dmixt <- function(x,theta) {
  pi1 <- theta[1]
  pi2 <- theta[2]
  mu1 <- theta[3]
  mu2 <- theta[4]
  sigma1 <- theta[5]
  sigma2 <- theta[6]
  f1 <- dnorm(x,mu1,sigma1)
  f2 <- dnorm(x,mu2,sigma2)
  f <- pi1*f1 + pi2*f2
  return(f)
}

x <- (35:100)
pdf.mixt <- dmixt(x,theta.est)
ggplot(data=faithful) + geom_histogram(aes(x=waiting, y=..density..), bins=12) +
  geom_line(data=data.frame(x,pdf.mixt), aes(x,pdf.mixt),colour="red",size=1.5)

Comparing the empirical and theoretical cumulative distribution functions (cdf) shows how well the mixture model fits the data

pmixt <- function(x,theta) {
  pi1 <- theta[1]
  pi2 <- theta[2]
  mu1 <- theta[3]
  mu2 <- theta[4]
  sigma1 <- theta[5]
  sigma2 <- theta[6]
  F1 <- pnorm(x,mu1,sigma1)
  F2 <- pnorm(x,mu2,sigma2)
  F  <- pi1*F1 + pi2*F2
  return(F)
}
cdf.mixt <- pmixt(x,theta.est)

ggplot(data=faithful) + stat_ecdf(aes(waiting), geom = "step", size=1) +
  geom_line(data=data.frame(x,cdf.mixt), aes(x,cdf.mixt),colour="red",size=1) +
  ylab("cdf")

The estimated mixture distribution \(F_{\hat{\theta}}\) (obtained with the maximum likelihood estimate \(\hat\theta\)) seems to perfectly fit the empirical distribution of the faithful data.

We can perform a Kolmogorov-Smirnov test for testing \(H_0: y_i \sim \ F_{\hat{\theta}}\) versus \(H_1: y_i \sim \!\!\!\!\!/ \ F_{\hat{\theta}}\):

ks.test(y,pmixt,theta.est)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  y
## D = 0.033545, p-value = 0.9195
## alternative hypothesis: two-sided

We can compute the posterior distribution of the label variables: \[\begin{aligned} \prob{z_i=\ell \ | \ y_i \ ; \ \hat{\theta}} &= \frac{\prob{z_i=\ell \ ; \ \hat{\theta}}\pmacro(y_i \ | \ z_i=\ell \ ; \ \hat{\theta})}{\pmacro(y_i \ ; \ \hat{\theta})} \\ &= \frac{\prob{z_i=\ell \ ; \ \hat{\theta}}\pmacro(y_i \ | \ z_i=\ell \ ; \ \hat{\theta})} {\sum_{j=1}^L\prob{z_i=j \ ; \ \hat{\theta}}\pmacro(y_i \ | \ z_i=j \ ; \ \hat{\theta})} \\ &= \frac{\frac{\hat\pi_\ell}{\sqrt{2\pi \hat\sigma_\ell^2}} \exp \left\{-\frac{1}{2\hat\sigma_\ell^2}(y_i - \hat\mu_\ell)^2 \right\}} {\sum_{j=1}^L\frac{\hat\pi_j}{\sqrt{2\pi \hat\sigma_j^2}} \exp \left\{-\frac{1}{2\hat\sigma_j^2}(y_i - \hat\mu_j)^2 \right\}} \end{aligned}\]

pi1 <- theta.est[1]
pi2 <- theta.est[2]
mu1 <- theta.est[3]
mu2 <- theta.est[4]
sigma1 <- theta.est[5]
sigma2 <- theta.est[6]
p1.post <- pi1*dnorm(y,mu1,sigma1)/(pi1*dnorm(y,mu1,sigma1) + pi2*dnorm(y,mu2,sigma2))
p2.post <- 1-p1.post
head(data.frame(y, p1.post, p2.post))
##    y      p1.post      p2.post
## 1 79 1.030773e-04 9.998969e-01
## 2 54 9.999093e-01 9.066699e-05
## 3 74 4.135431e-03 9.958646e-01
## 4 62 9.673803e-01 3.261974e-02
## 5 85 1.223394e-06 9.999988e-01
## 6 55 9.998100e-01 1.900018e-04

The Expectation-Maximization (EM) algorithm implemented in the mixtools library could also be used for computing the ML estimate of \(\theta\). Both algorithms provide the same results.

library(mixtools)
r.mixEM = normalmixEM(y)
## number of iterations= 17
list(p=r.mixEM$lambda,mu=r.mixEM$mu,sigma=r.mixEM$sigma)
## $p
## [1] 0.3608871 0.6391129
## 
## $mu
## [1] 54.6149 80.0911
## 
## $sigma
## [1] 5.871248 5.867713
-2*r.mixEM$loglik
## [1] 2068.003
plot(r.mixEM,which=2)

head(r.mixEM$posterior)
##            comp.1       comp.2
## [1,] 1.030892e-04 9.998969e-01
## [2,] 9.999093e-01 9.065912e-05
## [3,] 4.135776e-03 9.958642e-01
## [4,] 9.673822e-01 3.261781e-02
## [5,] 1.223601e-06 9.999988e-01
## [6,] 9.998100e-01 1.899863e-04


1.4 Mixture model versus clustering

Let us sample some data from a Gaussian mixture model. Of course, the labels of the simulated data are known.

set.seed(12345)
y1 <- rnorm(120, 0, 1)
y2 <- rnorm(80, 3, 1)
y=c(y1, y2)
z=c(rep(1,120),rep(2,80))
d <- data.frame(y,z)

We can use the \(k\)-means method to create two clusters and compute the proportion, center and standard deviation for each cluster,

r.km <- kmeans(y, centers=c(-1,1))
list(p=r.km$size/sum(r.km$size), mu=as.vector(r.km$centers), sigma=sqrt(r.km$withinss/r.km$size))
## $p
## [1] 0.53 0.47
## 
## $mu
## [1] -0.04233643  2.90990098
## 
## $sigma
## [1] 0.8947708 0.9812917

We can instead use the EM algorithm with the same data and display the estimated parameters

r.mixEM = normalmixEM(y)
## number of iterations= 303
list(p=r.mixEM$lambda,mu=r.mixEM$mu,sigma=r.mixEM$sigma)
## $p
## [1] 0.5622311 0.4377689
## 
## $mu
## [1] 0.1440742 2.8878443
## 
## $sigma
## [1] 1.079719 1.110258

Since the "true" labels are known, we can compute the classification error rate for each method (in % here):

d$km <- r.km$cluster
d$EM <- (r.mixEM$posterior[,1]<0.5)+1
print(c(mean(d$z != d$km)*100, mean(d$z != d$EM)*100))
## [1] 12.0 10.5

Of course, these results strongly depend on the model, i.e. on the parameters of the mixture. This Shiny app allows one to modify the parameters of the second distribution (the first distribution is assumed to be a normal distribution with mean 0 and variance 1) and compare the results provided by the two methods.


2 Some EM-type algorithms

2.1 Maximisation of the complete likelihood

Assume first that the labels \((z_i)\) are known. Estimation of the parameters of the model is then straightforward: for \(\ell =1,2, \ldots, L\),

\[\begin{aligned} \hat{\pi}_\ell &= \frac{\sum_{i=1}^n \one_{z_i=\ell}}{n} \\ \hat{\mu}_\ell &= \frac{\sum_{i=1}^n y_i\one_{z_i=\ell}}{\sum_{i=1}^n \one_{z_i=\ell}} \\ \hat{\sigma}_\ell^2 &= \frac{\sum_{i=1}^n y_i^2\one_{z_i=\ell}}{\sum_{i=1}^n \one_{z_i=\ell}} - \hat{\mu}_\ell^2\\ \end{aligned}\]

Then, we see that \(S(z,y) = (\sum_{i=1}^n \one_{z_i=\ell} \ , \ \sum_{i=1}^n y_i\one_{z_i=\ell}\ , \ \sum_{i=1}^n y_i^2\one_{z_i=\ell} \ ; \ 1 \leq \ell \leq L)\) is the vector of sufficient statistics of the complete model. Indeed, the maximum likelihood estimate of \(\theta\) is a function of \(S(z,y)\):

\[\begin{aligned} \hat{\theta} &= (\hat{\pi}_1,\hat{\mu}_1,\hat{\sigma}_1^2, \ldots , \hat{\pi}_L,\hat{\mu}_L,\hat{\sigma}_L^2 ) \\ &= \hat\Theta(S(z,y)) \end{aligned}\] where \(\hat\Theta\) is the function defining the maximum likelihood estimator of \(\theta\).
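
When the labels are observed, these formulas are immediate to implement. Here is a minimal sketch of \(\hat\Theta\) (a hypothetical helper, applied to the simulated data of section 1.4, for which the labels are known):

# complete-data ML estimates (the function Theta-hat) when the labels z are known
theta.complete <- function(y, z, L=2) {
  sapply(1:L, function(l) {
    n.l <- sum(z == l)
    m.l <- sum(y[z == l])/n.l
    c(p = n.l/length(y),
      mu = m.l,
      sigma2 = sum(y[z == l]^2)/n.l - m.l^2)
  })
}
theta.complete(d$y, d$z)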


2.2 The EM algorithm

When the labels \((z_i)\) are unknown, the sufficient statistics \(S(z,y)\) cannot be computed. Then, the idea of EM is to replace \(S(z,y)\) by its conditional expectation \(\esp{S(z,y)|y ;\theta}\).

The problem is that this conditional expectation depends on the unknown parameter \(\theta\). EM is therefore an iterative procedure, where, at iteration \(k\):

  • the E-step computes \(S_k(y) = \esp{S(z,y)|y ;\theta_{k-1}}\)
  • the M-step updates the parameter estimate: \[ \theta_k = \hat\Theta(S_k(y)) \]

Let us see now how to compute \(\esp{S(z,y)|y ;\theta_{k-1}}\).

First, for any \(1\leq i \leq n\), and any \(1 \leq \ell \leq L\), let \[\tau_{i\ell,k} = \esp{\one_{z_i=\ell} \ | \ y_i \ ; \ \theta_{k-1}}\]

By definition, \[\begin{aligned} \tau_{i\ell,k} &= \esp{\one_{z_i=\ell} \ | \ y_i \ ; \ \theta_{k-1}} \\ &= \prob{z_i=\ell \ | \ y_i \ ; \ \theta_{k-1}} \\ &= \frac{\prob{z_i=\ell\ ; \ \theta_{k-1}}\pmacro(y_i \ | \ z_i=\ell \ ; \ \theta_{k-1})}{\pmacro(y_i \ ; \ \theta_{k-1})} \\ &= \frac{\prob{z_i=\ell\ ; \ \theta_{k-1}}f_\ell(y_i \ ; \ \theta_{k-1})} {\sum_{j=1}^L \prob{z_i=j\ ; \ \theta_{k-1}}f_j(y_i \ ; \ \theta_{k-1})} \\ &= \frac{\pi_{\ell,k-1}f_\ell(y_i \ ; \ \theta_{k-1})} {\sum_{j=1}^L \pi_{j,k-1}f_j(y_i \ ; \ \theta_{k-1})} \end{aligned}\] where \(\pi_{\ell,k-1}=\prob{z_i=\ell\ ; \ \theta_{k-1}}\) is the estimate of \(\pi_\ell\) obtained at iteration \(k-1\) and where \[f_\ell(y_i \ ; \ \theta_{k-1}) = \frac{1}{\sqrt{2\pi\sigma_{\ell,k-1}^2}} \exp\left\{ -\frac{1}{2\sigma_{\ell,k-1}^2}(y_i - \mu_{\ell,k-1})^2 \right\} \] is the probability density function of \(y_i\) when \(z_i=\ell\), computed at iteration \(k-1\) using \(\theta_{k-1}\).

The expected values of the other sufficient statistics can now easily be computed. Indeed, for \(i=1,2,\ldots,n\) and \(\ell = 1,2,\ldots,L\), \[\begin{aligned} \esp{y_i\one_{z_i=\ell} \ | \ y_i \ ; \ \theta_{k-1}} & = y_i\tau_{i\ell,k} \\ \esp{y_i^2\one_{z_i=\ell} \ | \ y_i \ ; \ \theta_{k-1}} & = y_i^2\tau_{i\ell,k} \end{aligned}\]

Then, the \(k\)-th iteration of the EM algorithm for a Gaussian mixture consists in

  • computing \(\tau_{i\ell,k}\) for \(i=1,2,\ldots,n\) and \(\ell = 1,2,\ldots,L\), using \(\theta_{k-1}\),
  • computing \(\theta_{k}= (\pi_{\ell,k},\mu_{\ell,k},\sigma_{\ell,k}^2 ; 1 \leq \ell \leq L)\) where

\[\begin{aligned} \pi_{\ell,k} &= \frac{\sum_{i=1}^n \tau_{i\ell,k}}{n} \\ \mu_{\ell,k} &= \frac{\sum_{i=1}^n y_i\tau_{i\ell,k}}{\sum_{i=1}^n \tau_{i\ell,k}} \\ \sigma_{\ell,k}^2 &= \frac{\sum_{i=1}^n y_i^2\tau_{i\ell,k}}{\sum_{i=1}^n \tau_{i\ell,k}} - \mu_{\ell,k}^2\\ \end{aligned}\]

For a given set of initial values \(\theta_0 = (\pi_{\ell,0},\mu_{\ell,0},\sigma_{\ell,0}^2 \ ; \ 1 \leq \ell \leq L)\) and a given number of iterations \(K\), the following function returns the EM estimate \(\theta_K\), the sequence of estimates \((\theta_k \ , \ 0\leq k \leq K)\) and the deviance computed with the final estimate \(-2\log(\pmacro(y\ ; \ \theta_K))\).

mixt2.em <- function(y, p, mu, sigma, K)
{
  # initialization
  like <- p[1]*dnorm(y,mu[1],sigma[1]) + p[2]*dnorm(y,mu[2],sigma[2])
  deviance <- -2*sum(log(like))
  res <- matrix(NA,K+1,8)
  res[1,] <- c(0, p, mu, sigma, deviance)
  for (k in 1:K) {
    # E step
    d1<-p[1]*dnorm(y,mu[1],sigma[1])
    d2<-p[2]*dnorm(y,mu[2],sigma[2])
    tau1 <-d1/(d1+d2)
    tau2 <- 1-tau1
    
    # M step
    p[1] <- mean(tau1)
    mu[1] <- sum(tau1*y)/sum(tau1)
    sigma[1] <-sqrt(sum(tau1*(y^2))/sum(tau1)-(mu[1])^2)
    p[2] <- 1-p[1]
    mu[2] <- sum((tau2)*y)/sum((tau2))
    sigma[2] <-sqrt(sum(tau2*(y^2))/sum(tau2)-(mu[2])^2)
    
    # -2 x LL
    like <- p[1]*dnorm(y,mu[1],sigma[1]) + p[2]*dnorm(y,mu[2],sigma[2])
    deviance <- -2*sum(log(like))
    
    # add results to output
    res[k+1,] <- c(k, p, mu, sigma, deviance)
  }
  res <- data.frame(res)
  names(res) <- c("iteration","p1","p2","mu1","mu2","sigma1","sigma2","deviance")
  out <- list(parameters=c(p, mu, sigma), deviance=deviance, res=res)
  return(out)
}

Let us use this function with our faithful data

r.mixt2 <- mixt2.em(faithful$waiting, p=c(.5,.5), mu=c(60,70), sigma=c(2,2), K=20)
r.mixt2$parameters
## [1]  0.3608821  0.6391179 54.6147241 80.0909857  5.8711065  5.8678180
r.mixt2$deviance
## [1] 2068.003

and let us plot the convergence of the algorithm:

library(reshape2)
library(gridExtra)

plotConvergence <- function(df, title=NULL, ylim=NULL)
{
  G <- (dim(df)[2]-1)/3
  df1<-melt(df[,c(1,2,3)],id.vars="iteration",value.name="proportion",variable.name="group")
  graf1 <- ggplot(df1)+geom_line(aes(iteration,proportion,color=group)) + theme(legend.position="none")  
  if (!is.null(ylim))  graf1 <- graf1 + ylim(ylim[1]*c(-1,1))
  df2<-melt(df[,c(1,4,5)],id.vars="iteration",value.name="mean",variable.name="group")
  graf2 <- ggplot(df2)+geom_line(aes(iteration,mean,color=group)) + theme(legend.position="none") 
  if (!is.null(ylim))  graf2 <- graf2 + ylim(ylim[2]*c(-1,1))
  df3<-melt(df[,c(1,6,7)],id.vars="iteration",value.name="standard.deviation",variable.name="group")
  graf3 <- ggplot(df3) + geom_line(aes(iteration,standard.deviation,color=group)) + theme(legend.position="none") 
  if (!is.null(ylim))  graf3 <- graf3 + ylim(ylim[3]*c(-1,1))
  graf4 <- ggplot(df)+geom_line(aes(iteration,deviance))  
  grid.arrange(graf1,graf2,graf3,graf4,nrow=2, top=title)
}
plotConvergence(r.mixt2$res, title="EM")


2.3 Running EM with different initial values

The sequence of EM estimates \((\theta_k)\) depends on the initial guess \(\theta_0\). Let us plot the convergence of the algorithm obtained with several initial values:

plotConvMC <- function(df, title=NULL)
{
  G <- (ncol(df)-2)/3
  df$rep <- as.factor(df$rep)
  graf <- vector("list", ncol(df)-2)
  for (j in (2:(ncol(df)-1))) {
    grafj <- ggplot(df)+geom_line(aes_string(df[,1],df[,j],color=df[,ncol(df)])) +
      xlab("iteration") + ylab(names(df[j])) + theme(legend.position = "none")
     graf[[j-1]] <- grafj
  }
  do.call("grid.arrange", c(graf, ncol=3, top=title))
}
D.em <- NULL
set.seed(1234)
for (m in (1:10)) {
  p1 <- runif(1,0.1,0.9)
  df.em <- mixt2.em(faithful$waiting, p=c(p1, 1-p1),mu=rnorm(2,70,15),sigma=rlnorm(2,2,0.7), K=50)$res
  df.em$rep <- m
  D.em <- rbind(D.em,df.em)
}
plotConvMC(D.em)

We see that, up to a permutation of the labels (which are interchangeable), all the runs converge to the same solution in this example.
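
Because of this label switching, a common convention before comparing runs is to reorder the estimated components, for instance by increasing mean. A minimal sketch (a hypothetical helper, not used elsewhere in this document):

# reorder a two-component estimate c(p1,p2,mu1,mu2,sigma1,sigma2) by increasing mean
relabel2 <- function(theta) {
  o <- order(theta[3:4])
  return(c(theta[1:2][o], theta[3:4][o], theta[5:6][o]))
}
relabel2(mixt2.em(faithful$waiting, p=c(.7,.3), mu=c(80,55), sigma=c(5,5), K=50)$parameters)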

Nevertheless, a very poor initial guess may lead to very poor convergence of EM. In this example, one of the proportions converges to 0:

df.em <- mixt2.em(faithful$waiting, p=c(0.5, 0.5), mu=c(30, 40), sigma=c(2,10), K=5)
plotConvergence(df.em$res)


2.4 A stochastic version of EM

A stochastic version of EM consists, at iteration \(k\), in replacing the unknown labels \((z_i)\) by a sequence \((z_i^{(k)})\), where \(z_i^{(k)}\) is sampled from the conditional distribution of \(z_i\): \[\begin{aligned} \prob{z_i^{(k)}=\ell} &= \prob{z_i=\ell \ | \ y_i \ ; \ \theta_{k-1} } \\ &= \frac{\pi_{\ell,k-1}f_\ell(y_i \ ; \ \theta_{k-1})} {\sum_{j=1}^L \pi_{j,k-1}f_j(y_i \ ; \ \theta_{k-1})} \end{aligned}\]

We can then use these sampled labels for computing the sufficient statistics \(S(z^{(k)},y)\) and updating the estimate of \(\theta\) as \(\theta_k = \hat\Theta(S(z^{(k)},y))\).

mixt2.sem <- function(y, p, mu, sigma, K)
{
  # initialization
  like <- p[1]*dnorm(y,mu[1],sigma[1]) + p[2]*dnorm(y,mu[2],sigma[2])
  deviance <- -2*sum(log(like))
  res <- matrix(NA,K+1,8)
  res[1,] <- c(0, p, mu, sigma, deviance)
  n <- length(y)
  for (k in 1:K) {
    # S step
    d1<-p[1]*dnorm(y,mu[1],sigma[1])
    d2<-p[2]*dnorm(y,mu[2],sigma[2])
    ind1 <- (runif(n) < d1/(d1+d2))
    ind2 <- 1-ind1
    
    # M step
    p[1] <- mean(ind1)
    mu[1] <- sum(ind1*y)/sum(ind1)
    sigma[1] <-sqrt(sum(ind1*(y^2))/sum(ind1)-(mu[1])^2)
    p[2] <- 1-p[1]
    mu[2] <- sum((ind2)*y)/sum((ind2))
    sigma[2] <-sqrt(sum(ind2*(y^2))/sum(ind2)-(mu[2])^2)
    
    # -2 x LL
    like <- p[1]*dnorm(y,mu[1],sigma[1]) + p[2]*dnorm(y,mu[2],sigma[2])
    deviance <- -2*sum(log(like))
    
    # add results to output
    res[k+1,] <- c(k, p, mu, sigma, deviance)
  }
  res <- data.frame(res)
  names(res) <- c("iteration","p1","p2","mu1","mu2","sigma1","sigma2","deviance")
  out <- list(parameters=c(p, mu, sigma), deviance=deviance, res=res)
  return(out)
}
r.mixt2 <- mixt2.sem(faithful$waiting, p=c(.2,.8), mu=c(75,75), sigma=c(10,4), K=30)
r.mixt2$parameters
## [1]  0.3529412  0.6470588 54.4166667 79.8863636  5.8070838  6.1022198
r.mixt2$deviance
## [1] 2068.409
plotConvergence(r.mixt2$res, title="SEM")



3 The Iris data

The famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. (from ?iris).

The Iris flower data set is a nice example for learning supervised classification algorithms, and is known as a difficult case for unsupervised learning.

iris <- datasets::iris
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
species_labels <- iris[,5]
species_col <- c("#7DB0DD","#86B875","#E495A5")

pairs(iris[,-5], col = species_col[species_labels],
     lower.panel = NULL, cex.labels=2, pch=19, cex = 1)

par(xpd = TRUE)
legend(x = 0.1, y = 0.4, cex = 1.2,
       legend = as.character(levels(species_labels)), fill = species_col)

We can see that the Setosa species is distinctly different from Versicolor and Virginica (it has lower petal length and width). But Versicolor and Virginica cannot easily be separated based on measurements of their sepal and petal width/length.

The same conclusion can be drawn by looking at the parallel coordinates plot of the data:

# https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html
MASS::parcoord(iris[,-5], col = species_col[species_labels], var.label = TRUE, lwd = 2)

# Add a legend
par(xpd = TRUE)
legend(x = 1.75, y = -.25, cex = 1,
   legend = as.character(levels(species_labels)),
    fill = species_col, horiz = TRUE)

3.1 Supervised classification

3.1.1 Logistic regression for a binary variable

Let \(y_i\) be a binary response that takes its values in \(\{0,1\}\) and let \(c_{i1}, \ldots, c_{iM}\) be \(M\) explanatory variables (or predictors).

Formally, the logistic regression model assumes that

\[\begin{aligned} \log\left(\frac{\prob{y_i=1}}{\prob{y_i=0}}\right) &= \log\left(\frac{\prob{y_i=1}}{1 - \prob{y_i=1}}\right) \\ &= \beta_0 + \sum_{m=1}^M \beta_m c_{im} \end{aligned}\] Then, \[\prob{y_i=1} = \frac{1}{1+ e^{-\beta_0 - \sum_{m=1}^M \beta_m c_{im}}}\]

myiris <- cbind(iris,virginica=ifelse(iris$Species=="virginica",1,0))
iris.glm <- glm(virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, family=binomial, data=myiris)
summary(iris.glm)
## 
## Call:
## glm(formula = virginica ~ Sepal.Length + Sepal.Width + Petal.Length + 
##     Petal.Width, family = binomial, data = myiris)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.01105  -0.00065   0.00000   0.00048   1.78065  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -42.638     25.708  -1.659   0.0972 .
## Sepal.Length   -2.465      2.394  -1.030   0.3032  
## Sepal.Width    -6.681      4.480  -1.491   0.1359  
## Petal.Length    9.429      4.737   1.990   0.0465 *
## Petal.Width    18.286      9.743   1.877   0.0605 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.954  on 149  degrees of freedom
## Residual deviance:  11.899  on 145  degrees of freedom
## AIC: 21.899
## 
## Number of Fisher Scoring iterations: 12
as.vector(round(fitted(iris.glm)))
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1
iris.glm$linear.predictors[71]
##         71 
## -0.3853463
sum(cbind(1,iris[71,1:4])*iris.glm$coefficients)
## [1] -0.3853463
iris.glm$fitted.values[71]
##        71 
## 0.4048381
1/(1+exp(-iris.glm$linear.predictors[71]))
##        71 
## 0.4048381
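
A quick way to summarize this binary classification (a small sketch, not shown above) is to cross-tabulate the rounded fitted probabilities against the true indicator:

# confusion table: rounded fitted probabilities vs the true virginica indicator
table(predicted=round(fitted(iris.glm)), observed=myiris$virginica)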


3.1.2 Logistic regression with more than two classes

Assume now that \(y_i\) takes its values in \(\{1,2,\ldots,L\}\). The logistic regression model is now

\[\begin{aligned} \log\left(\frac{\prob{y_i=\ell}}{\prob{y_i=L}}\right) &= \beta_{\ell0} + \sum_{m=1}^M \beta_{\ell m} c_{im} \quad , \quad \ell=1,2,\ldots,L-1 \end{aligned}\] where class \(L\) is the reference class (equivalently, \(\beta_{L0} = \beta_{Lm} = 0\) for all \(m\)). Then,

\[ \prob{y_i=\ell} = \frac{e^{\beta_{\ell0} + \sum_{m=1}^M \beta_{\ell m} c_{im}}} {\sum_{j=1}^L e^{\beta_{j0} + \sum_{m=1}^M \beta_{j m} c_{im}}} \quad , \quad \ell=1,2,\ldots,L \]

library(VGAM)
## Loading required package: stats4
## Loading required package: splines
myiris <- cbind(myiris,versicolor=ifelse(iris$Species=="versicolor",1,0))
myiris <- cbind(myiris,setosa=ifelse(iris$Species=="setosa",1,0))

head(myiris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species virginica
## 1          5.1         3.5          1.4         0.2  setosa         0
## 2          4.9         3.0          1.4         0.2  setosa         0
## 3          4.7         3.2          1.3         0.2  setosa         0
## 4          4.6         3.1          1.5         0.2  setosa         0
## 5          5.0         3.6          1.4         0.2  setosa         0
## 6          5.4         3.9          1.7         0.4  setosa         0
##   versicolor setosa
## 1          0      1
## 2          0      1
## 3          0      1
## 4          0      1
## 5          0      1
## 6          0      1
iris.vglm <- vglm(cbind(virginica,versicolor,setosa) ~ Sepal.Length +
Sepal.Width + Petal.Length + Petal.Width, family=multinomial,
data=myiris)

summary(iris.vglm)
## 
## Call:
## vglm(formula = cbind(virginica, versicolor, setosa) ~ Sepal.Length + 
##     Sepal.Width + Petal.Length + Petal.Width, family = multinomial, 
##     data = myiris)
## 
## 
## Pearson residuals:
##                       Min         1Q     Median        3Q   Max
## log(mu[,1]/mu[,3]) -1.810 -0.0003211  1.265e-09 0.0002391 1.393
## log(mu[,2]/mu[,3]) -1.393 -0.0002585 -6.687e-06 0.0004556 1.810
## 
## Coefficients: 
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept):1    -35.365  19432.051  -0.002    0.999
## (Intercept):2      7.272  19432.034   0.000    1.000
## Sepal.Length:1    -9.242   5790.844  -0.002    0.999
## Sepal.Length:2    -6.777   5790.843  -0.001    0.999
## Sepal.Width:1    -12.569   2888.841  -0.004    0.997
## Sepal.Width:2     -5.889   2888.838  -0.002    0.998
## Petal.Length:1    22.630   4446.221   0.005    0.996
## Petal.Length:2    13.201   4446.219   0.003    0.998
## Petal.Width:1     33.919   7133.176   0.005    0.996
## Petal.Width:2     15.633   7133.169   0.002    0.998
## 
## Number of linear predictors:  2 
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 11.8985 on 290 degrees of freedom
## 
## Log-likelihood: -5.9493 on 290 degrees of freedom
## 
## Number of iterations: 21 
## 
## Reference group is level  3  of the response
iris.vglm@predictors[71,]
## log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3]) 
##           39.56470           39.95004
c(sum(cbind(1,iris[71,1:4])*iris.vglm@coefficients[c(1,3,5,7,9)]), 
  sum(cbind(1,iris[71,1:4])*iris.vglm@coefficients[c(2,4,6,8,10)]))
## [1] 39.56470 39.95004
iris.vglm@fitted.values[71,]
##    virginica   versicolor       setosa 
## 4.048381e-01 5.951619e-01 2.657985e-18
lp71 <- as.vector(c(iris.vglm@predictors[71,],0))
exp(lp71)/sum(exp(lp71))
## [1] 4.048381e-01 5.951619e-01 2.657985e-18
head(fitted(iris.vglm))
##      virginica   versicolor setosa
## 1 5.895408e-39 3.817290e-12      1
## 2 2.007661e-35 2.812349e-10      1
## 3 1.073625e-36 8.972486e-11      1
## 4 8.785319e-34 4.462530e-09      1
## 5 4.226744e-39 4.171875e-12      1
## 6 1.894589e-36 5.671292e-11      1
# Now we use the fitted probabilities of the model as the model-based
# classification of each flower.  How does this compare
# to the real types?


rclass.iris <- ifelse(apply(abs(round(fitted(iris.vglm)) - myiris[,6:8]),1,sum)==0,1,0)
mean(rclass.iris) 
## [1] 0.9866667

3.2 Unsupervised classification

Ignoring the known labels (species) of the Fisher iris data, let us identify three clusters with the \(k\)-means method and compute the misclassification rate:

set.seed(1234) # labels are the original ones with this seed (avoid permutation)
r.km <- kmeans(iris[,1:4], centers=3)
mean(r.km$cluster!=as.numeric(iris$Species))*100
## [1] 10.66667

Let us now fit a mixture of three multivariate Gaussian distributions and compute the misclassification rate for this model:

library(mixtools)
set.seed(3333)  # labels are the original ones with this seed (avoid permutation)
r.em <- mvnormalmixEM(iris[,1:4], k=3,arbvar=FALSE)
## number of iterations= 31
post <- r.em$posterior
r.em$class <- sapply(seq(nrow(post)), function(i) { j <- which.max(as.vector(post[i,])) })
mean(r.em$class!=as.numeric(iris$Species))*100
## [1] 11.33333
detach(package:mixtools)
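
To see which species the two unsupervised methods actually confuse, one can cross-tabulate the assignments against the true species (a quick sketch; it relies on the seeds above so that the cluster numbering matches the species ordering):

# cross-tabulate the unsupervised assignments against the true species
table(kmeans=r.km$cluster, species=iris$Species)
table(EM=r.em$class, species=iris$Species)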

Let us plot the original data with the three distributions estimated with EM

library(ellipse)
library(gridExtra)

Species <- levels(iris$Species)
j1=1;  j2=2
df_ell <- data.frame()
for(g in (1:3)){
#  M=r.em$sigma[[g]][c(j1,j2),c(j1,j2)]
  M=r.em$sigma[c(j1,j2),c(j1,j2)]
  c=r.em$mu[[g]][c(j1,j2)]
  df_ell <- rbind(df_ell, cbind(as.data.frame(ellipse(M,centre=c, level=0.68)), group=Species[g]))
}
pl1 <- ggplot(data=iris) + geom_point(aes_string(x=iris[,j1],y=iris[,j2], colour=iris[,5])) + 
  theme(legend.position="none") +xlab(names(iris)[j1])+ylab(names(iris)[j2]) +
geom_path(data=df_ell, aes(x=x, y=y,color=group), size=1, linetype=1) 

j1=3;  j2=4
df_ell <- data.frame()
for(g in (1:3)){
#  M=r.em$sigma[[g]][c(j1,j2),c(j1,j2)]
  M=r.em$sigma[c(j1,j2),c(j1,j2)]
  c=r.em$mu[[g]][c(j1,j2)]
  df_ell <- rbind(df_ell, cbind(as.data.frame(ellipse(M,centre=c, level=0.68)), group=Species[g]))
}
pl2 <- ggplot(data=iris) + geom_point(aes_string(x=iris[,j1],y=iris[,j2], colour=iris[,5])) + 
  theme(legend.position="none") +xlab(names(iris)[j1])+ylab(names(iris)[j2]) +
geom_path(data=df_ell, aes(x=x, y=y,color=group), size=1, linetype=1) 

grid.arrange(pl1, pl2)

This Shiny app allows one to select the two variables to plot.