$$ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \def\pmacro{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\simh0{\mathop{\sim}_{H_0}} \def\df{\texttt{df}} \def\res{e} \def\xomega{x} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \newcommand{\Rset}{\mbox{$\mathbb{R}$}} \def\param{\theta} \def\setparam{\Theta} \def\xnew{x_{\rm new}} \def\fnew{f_{\rm new}} \def\ynew{y_{\rm new}} \def\nnew{n_{\rm new}} \def\enew{e_{\rm new}} \def\Xnew{X_{\rm new}} \def\hfnew{\widehat{\fnew}} \def\degree{m} \def\nbeta{d} \newcommand{\limite}[1]{\mathop{\longrightarrow}\limits_{#1}} \def\ka{k{\scriptstyle a}} \def\ska{k{\scriptscriptstyle a}} \def\kel{k{\scriptstyle e}} \def\skel{k{\scriptscriptstyle e}} \def\cl{C{\small l}} \def\Tlag{T\hspace{-0.1em}{\scriptstyle lag}} \def\sTlag{T\hspace{-0.07em}{\scriptscriptstyle lag}} \def\Tk{T\hspace{-0.1em}{\scriptstyle k0}} \def\sTk{T\hspace{-0.07em}{\scriptscriptstyle k0}} \def\thalf{t{\scriptstyle 1/2}} \newcommand{\Dphi}[1]{\partial_\pphi #1} \def\asigma{a} \def\pphi{\psi} \newcommand{\stheta}{{\theta^\star}} \newcommand{\htheta}{{\widehat{\theta}}} $$


1 Introduction

Let us generate a binary \(100\times100\) image \(X = (X_{i,j})\) where each of the \(10\,000\) pixels takes its values in \(\{1,2\}\):

temp <- (matrix(1,nrow=3,ncol=3) %x% matrix(c(2,1,1,2),2))[1:5,1:5]
M1 <- temp %x% matrix(1,nrow=10,ncol=10)    # a 50x50 checkerboard with 10x10 cells
M2 <- matrix(2,nrow=50, ncol=50)            # two other 50x50 patterns
M2[21:30,1:10] <- M2[21:30,41:50] <- 1
M2[11:40,11:20] <- M2[11:40,31:40] <- 1
M2[1:50,21:30] <- 1
M3 <- matrix(2,nrow=50, ncol=50)
M3[11:20,] <- M3[31:40,] <- 1
M3[1:10,21:30] <- M3[41:50,21:30] <- 1
M4 <- 3 - M2                                # the "negative" of M2 (1s and 2s swapped)
X.ori <- cbind(rbind(M1,M4),rbind(M2,M3))   # assemble the four blocks into a 100x100 image

image(X.ori, axes=FALSE, frame.plot=FALSE, col = grey(c(0, 0.8)))

A degraded version of the original image is observed. Some white pixels are converted to black and some black pixels are converted to white. For the sake of simplicity, we assume in this example that the edges are not modified:

p <- 0.2                         # probability that a pixel is flipped
set.seed(12345)
N <- dim(X.ori)[1]
Y <- X.ori
Y1 <- Y[2:(N-1),2:(N-1)]         # interior pixels only (the edges are left unchanged)
i1 <- which(runif(((N-2)^2)) < p)
Y1[i1] <- 3-Y1[i1]               # flip the selected pixels (1 <-> 2)
Y[2:(N-1),2:(N-1)] <- Y1

image(Y, axes=FALSE, frame.plot=FALSE, col=grey(c(0, 0.8)))

We aim to recover the original image \(X\) from the degraded observed image \(Y\), using a probabilistic approach.

We will use the approach described in the seminal paper:

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721-741.


2 Markov random fields

Let \(S = \{(i,j) \ , \ 1 \leq i,j\leq N \}\) denote the \(N\times N\) lattice. \(S\) is the set of pixel sites for the intensity process \(X\), where \(X= \{ X_{i,j} \ , (i,j) \in S \}\) denotes the gray levels of the original digitized image.

Let \({\cal G} = \{ {\cal G}_s , s\in S \}\) be a neighborhood system for \(S\), meaning any collection of subsets of \(S\) such that

  1. \(s \notin {\cal G}_s\)
  2. \(r \in {\cal G}_s \Leftrightarrow s \in {\cal G}_r\)

Thus, \({\cal G}_s\) is the set of neighbors of \(s\) and \((S, {\cal G})\) is a graph in the usual way.

For any pixel \(s=(i,j)\), we will be interested in homogeneous neighborhood systems of the form

\[{\cal G}_{s} = {\cal G}_{i,j} = \{ (k,\ell) \in S \ : \ 0 < (i-k)^2 + (j-\ell)^2 \leq \delta \}\] For example, \(\delta=1\) yields the usual 4-neighbor system, while \(\delta=2\) adds the diagonal neighbors and yields the 8-neighbor system.
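As a small illustration (a minimal sketch, not part of the original code; the helper neighbors is hypothetical), the following function lists the neighbors of a site \((i,j)\) on an \(N\times N\) lattice for a given \(\delta\):

neighbors <- function(i, j, N, delta=1) {
  g <- expand.grid(k=1:N, l=1:N)     # all the sites of the lattice
  d2 <- (g$k - i)^2 + (g$l - j)^2    # squared distance to (i,j)
  g[d2 > 0 & d2 <= delta, ]          # sites within distance delta, excluding (i,j) itself
}
neighbors(3, 3, N=5, delta=1)        # the 4 nearest neighbors of site (3,3)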

A subset \(C \subset S\) is a clique if every pair of distinct sites in \(C\) are neighbors. In a 4-neighbor system, the cliques are the singletons \(\{(i,j)\}\) and all the pairs of neighbors \(\{(i,j) , (i+1,j) \}\) and \(\{(i,j) , (i,j+1)\}\).

We assume a common state space \(\Lambda = \{1, 2, \ldots, L \}\) so that \(X_s \in \Lambda\) for all \(s\). Let \(\Omega\) be the set of all possible configurations: \[\Omega = \{ \xomega= (x_s , s \in S) : \ x_s\in \Lambda \}\]

A priori knowledge about the image is captured by the probability distribution \(\prob{X=\xomega}\). In particular, \(X\) is a Markov Random Field (MRF) with respect to \({\cal G}\) if

  1. \(\prob{X=\xomega} >0\) for all \(\xomega \in \Omega\),
  2. \(\prob{X_s=x_s | X_r=x_r , r \neq s} = \prob{X_s=x_s | X_r=x_r , r \in {\cal G}_{s}}\) for every \(s\in S\).

The collection of conditional probabilities \((\prob{X_s=x_s | X_r=x_r , r \in {\cal G}_{s}})\) is called the local characteristics of the MRF and it turns out that the joint probability distribution \(\prob{X=\xomega}\) is uniquely determined by these conditional probabilities.


2.1 Gibbs distributions

A Gibbs distribution relative to \((S,{\cal G})\) is a probability measure \(\pi\) on \(\Omega\) with the following representation \[\pi(\xomega) = \frac{1}{Z}e^{-U(\xomega)}\] where \(Z\) is the normalizing constant called the partition function: \[Z = \sum_{\xomega \in \Omega} e^{-U(\xomega)}\]

\(U\), called energy function, is of the form \[U(\xomega) = \sum_{C \in {\cal C}}V_C(\xomega)\] where \({\cal C}\) is the set of cliques for \({\cal G}\). Each \(V_C\) is a function on \(\Omega\) with the property that \(V_C(\xomega)\) depends only on those coordinates \(x_s\) of \(\xomega\) for which \(s\in C\).

Such a family \(\{V_C, C \in {\cal C} \}\) is called a potential.

Theorem: Let \({\cal G}\) be a neighborhood system. Then \(X\) is an MRF with respect to \({\cal G}\) if and only if \(\pi(\xomega) = \prob{X=\xomega}\) is a Gibbs distribution with respect to \({\cal G}\).

This equivalence provides us with a simple, practical way of specifying MRFs, namely by specifying potentials, which is easy, instead of local characteristics, which may be tricky.

We will see that it may also be useful to define a family of Gibbs distributions \((\pi_T, T>0)\) on \(\Omega\) where \[\pi_T(\xomega) = \frac{1}{Z_T}e^{-U(\xomega)/T}\] where \(Z_T = \sum_{\xomega \in \Omega} e^{-U(\xomega)/T}\).

\(T\) stands for temperature: \(T\) controls the degree of "peaking" in the distribution \(\pi_T\). Choosing a small \(T\) exaggerates the modes of \(\pi\), while a large \(T\) flattens \(\pi\) toward the uniform distribution on \(\Omega\).
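The effect of \(T\) can be seen on a toy example (a sketch with made-up energies, not from the original text): take three configurations with energies 0, 1 and 4 and compare \(\pi_T\) for several temperatures.

U <- c(0, 1, 4)
for (T in c(0.2, 1, 10)) {
  piT <- exp(-U/T)/sum(exp(-U/T))    # Gibbs distribution at temperature T
  cat("T =", T, ":", round(piT, 3), "\n")
}

For \(T=0.2\), almost all the mass sits on the minimum-energy configuration, while for \(T=10\) the distribution is nearly flat.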


2.2 Some examples of Markov random fields

Assume that \(X_s \in \{1, 2, \ldots, L\}\) and let \(\xomega\) be a given configuration \((x_s , s\in S)\). We can then use the following energy function:

\[U(\xomega) = \sum_{s \in S}\sum_{\ell=1}^{L}\alpha_\ell\one_{x_s=\ell} + \beta \sum_{s,r \ {\rm neighbors}}\one_{x_s \neq x_r}\] where \(\alpha_L=0\) for identifiability reasons.

For any \(\ell\), a large value of \(\alpha_\ell\) tends to penalize the level \(\ell\) since \(\prob{X_s=\ell}\) decreases when \(\alpha_\ell\) increases (the other coefficients being fixed).

A positive value of \(\beta\) penalizes configurations where neighbor sites take different values. A large value of \(\beta\) then tends to create homogeneous regions, i.e. regions where \(X\) takes the same value. \(\beta=0\) means that the \(X_s\) are independent.

For any site \(s \in S\), let \((s) = S - s\) be the remaining sites. Then, \[\begin{aligned} \prob{X=\xomega} &= \prob{(X_s,X_{(s)})=(x_s,x_{(s)})} \\ &=\prob{X_s=x_s | X_{(s)}=x_{(s)}} \prob{X_{(s)}=x_{(s)}}\\ &= \prob{X_s = x_s | X_r=x_r, r \in {\cal G}_s}\prob{X_{(s)}=x_{(s)}} \\ &= C(x_{(s)}) \exp\left( -\sum_{\ell=1}^{L}\alpha_\ell\one_{x_s=\ell} - \beta \sum_{r \in {\cal G}_s} \one_{x_r \neq x_s} \right) \end{aligned}\] where \(C(x_{(s)})\) does not depend on \(x_s\).

Then, for \(\ell=1,2, \ldots , L\),

\[\prob{X_s = \ell | X_r=x_r, r \in {\cal G}_s} = \frac{e^{ -\alpha_\ell - \beta \sum_{r \in {\cal G}_s} \one_{x_r \neq \ell}}} {\sum_{m=1}^L e^{ -\alpha_m - \beta \sum_{r \in {\cal G}_s} \one_{x_r \neq m}}} \] If we furthermore assume that \(\prob{X_s=1} = \prob{X_s=2} = \cdots = \prob{X_s=L}\), then all the \(\alpha_\ell\) should be equal and the conditional distribution reduces to

\[\prob{X_s = \ell | X_r=x_r, r \in {\cal G}_s} = \frac{e^{ - \beta \sum_{r \in {\cal G}_s} \one_{x_r \neq \ell}}} {\sum_{m=1}^L e^{ - \beta \sum_{r \in {\cal G}_s} \one_{x_r \neq m}}} \]
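A minimal sketch of this conditional distribution (the helper cond.prob is hypothetical, not part of the original code), assuming a common \(\beta\) and \(L\) levels:

cond.prob <- function(neighbors, beta, L) {
  u <- sapply(1:L, function(l) -beta*sum(neighbors != l))   # -beta * number of disagreeing neighbors
  exp(u)/sum(exp(u))
}
cond.prob(c(1, 1, 2, 1), beta=1, L=2)   # P(X_s=1 | neighbors) and P(X_s=2 | neighbors)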

In the particular case of a binary random field, \(X_s\) takes its values in \(\{1,2\}\) and

\[\begin{aligned} \prob{X=\xomega} &= \frac{1}{Z} \exp \left( - \alpha \sum_{s \in S}\one_{x_s=1} \ - \ \beta \!\!\!\sum_{s,r \ {\rm neighbors}} \!\!\!\one_{x_s \neq x_r} \right) \end{aligned}\]

The conditional distribution of \(X_s\) then writes

\[\begin{aligned} \prob{X_s = 1 | X_r=x_r, r \in {\cal G}_s} &= 1 -\prob{X_s = 2 | X_r=x_r, r \in {\cal G}_s} \\ &= \frac{e^{ - \alpha - \beta \sum_{r \in {\cal G}_s} \one_{x_r=2}}} { e^{ - \alpha - \beta \sum_{r \in {\cal G}_s} \one_{x_r=2}} + e^{ - \beta \sum_{r \in {\cal G}_s} \one_{x_r=1}}} \end{aligned}\]

The relative proportions of 1 and 2 depend on \(\alpha\):

  • if \(\alpha > 0\), then \(\prob{X_s=2}>\prob{X_s=1}\),
  • if \(\alpha = 0\), then \(\prob{X_s=1}=\prob{X_s=2}\),
  • if \(\alpha < 0\), then \(\prob{X_s=1}>\prob{X_s=2}\).
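To make the role of \(\alpha\) concrete (a small numerical sketch, not from the original text), we can evaluate the conditional probability above when all four neighbors are equal to 2:

p.cond1 <- function(alpha, beta, neighbors) {
  u1 <- alpha + beta*sum(neighbors == 2)    # conditional energy of the state X_s = 1
  u2 <- beta*sum(neighbors == 1)            # conditional energy of the state X_s = 2
  exp(-u1)/(exp(-u1) + exp(-u2))            # P(X_s = 1 | neighbors)
}
sapply(c(-1, 0, 1), p.cond1, beta=1, neighbors=c(2, 2, 2, 2))

The probability of the level 1 decreases as \(\alpha\) increases, and remains small here because the four neighbors all favor the level 2.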


2.3 The Gibbs sampler

The Gibbs sampler is a stochastic relaxation algorithm, which generates new configurations from a given Gibbs distribution.

At each iteration the new configuration is chosen using the local characteristics (neighborhood) of the distribution \(\prob{X=\xomega}\).

More precisely, a pixel \(s_k\) is chosen at iteration \(k\) and a new \(X_{s_k}\) is sampled from the conditional distribution \(\prob{X_{s_k}= \ell | X_{(s_k)}=x_{(s_k)}}\).

Given any initial condition \(X(0)\), we thus obtain a sequence \(X(1)\), \(X(2)\), \(X(3)\), \(\ldots\) of configurations, where \(X(k)\) and \(X(k-1)\) can only differ in coordinate \(s_k\).

The Gibbs Sampler produces an ergodic Markov chain \((X(k), k\geq 0)\) with \(\pi\) as stationary distribution:

Theorem: Assume that for each \(s \in S\), the sequence \((s_k, k\geq 1)\) contains \(s\) infinitely often. Then for every starting configuration \(x(0) \in \Omega\) and every \(\xomega \in \Omega\),

\[\lim_{k \to \infty} \prob{X(k)=\xomega |X(0)=x(0)} = \pi(\xomega) \] Several strategies are possible for defining the sequence \((s_k)\). The simplest sequence is a raster scan, i.e. repeatedly visiting all the sites in some natural order. We instead use a sequence of random scans, obtained from random permutations of the rows and columns of the image.

The Gibbs sampler for the binary random field described above, with a nearest-neighbor system, is implemented in the R code below. For the sake of simplicity, the edges are kept fixed in this example.

alpha <- 0.03
beta <- 1
nmcmc <- 50      # number of complete scans of the image
N <- 100
set.seed(12345)
X <- matrix(runif(N^2)<0.5,N)+1    # random initial configuration with values in {1,2}
for (k in (1:nmcmc)) {
  # random scan: visit the interior sites in a random order (the edges stay fixed)
  for (i in sample(2:(N-1))) {
    for (j in sample(2:(N-1))) {
      Vij <- c(X[i-1,j], X[i+1,j], X[i,j-1], X[i,j+1])   # the 4 neighbors of pixel (i,j)
      u1 <-   beta*sum(Vij!=1) + alpha
      u2 <-   beta*sum(Vij!=2) 
      p1 <- 1/(1+exp(-u2+u1))    # conditional probability that X[i,j] = 1
      r <- runif(1)
      if (r < p1)
        X[i,j] <- 1
      else 
        X[i,j] <- 2
    }
  }
}
image(X, axes=FALSE, frame.plot=FALSE, col = grey(c(0, 0.8)))

The following Shiny application allows one to modify the initial configuration and the parameters of the model.


3 The posterior distribution

In a probabilistic framework, we assume that the original image \(X= (X_s, s\in S)\) is randomly degraded.

A simple model for the observed image \(Y = (Y_s, s\in S)\) assumes that the value \(X_s\) of each pixel may be randomly changed with probability \(p\):

\[\prob{Y_s \neq X_s} = 1 - \prob{Y_s = X_s} = p\] We will furthermore assume that the degradations of the \(N^2\) pixels are independent. Then,

\[\begin{aligned} \prob{Y=y |X=x} &= \prod_{s \in S} \prob{Y_s=y_s |X_s=x_s}\\ &= \prod_{s \in S} p^{\one_{y_s\neq x_s}}(1-p)^{\one_{y_s = x_s}} \\ &= (1-p)^{N^2} \left( \frac{p}{1-p}\right)^{\sum_{s \in S}\one_{y_s\neq x_s}} \end{aligned}\]
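As a quick sanity check (a minimal sketch, assuming Y, X.ori and p are still defined as above), the log-likelihood \(\log \prob{Y=y \, | \, X=x}\) can be computed directly from this formula:

loglik <- function(y, x, p) {
  length(x)*log(1-p) + sum(y != x)*log(p/(1-p))   # N^2*log(1-p) + (# flipped pixels)*log(p/(1-p))
}
loglik(Y, X.ori, p)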

Then, the posterior distribution of the original image \(X\) is

\[\begin{aligned} \prob{X=x |Y=y} &= \frac{\prob{Y=y |X=x}\prob{X=x}}{\prob{Y=y}} \\ &= C \exp \left\{ -\lambda \sum_{s \in S}\one_{y_s\neq x_s} - \alpha \sum_{s \in S}\one_{x_s=1} \ - \ \beta \!\!\!\sum_{s,r \ {\rm neighbors}} \!\!\!\one_{x_s \neq x_r} \right\} \end{aligned}\] where \(\lambda= \log((1-p)/p)\)

The posterior distribution \(\prob{X=x |Y=y}\) is therefore a Gibbs distribution with energy function \[U_{X|Y}(x) = \lambda \sum_{s \in S}\one_{y_s\neq x_s} + \alpha \sum_{s \in S}\one_{x_s=1} \ + \ \beta \!\!\!\sum_{s,r \ {\rm neighbors}} \!\!\!\one_{x_s \neq x_r} \] The local characteristics of this posterior distribution are the conditional distributions

\[\begin{aligned} \prob{X_s = 1 | Y_s=y_s \ , \ X_r=x_r, r \in {\cal G}_s} &= 1 -\prob{X_s = 2 | Y_s=y_s \ , \ X_r=x_r, r \in {\cal G}_s} \\ &= \frac{e^{ - \alpha - \beta \sum_{r \in {\cal G}_s} \one_{x_r=2} -\lambda \one_{y_s=2}}} { e^{ - \alpha - \beta \sum_{r \in {\cal G}_s} \one_{x_r=2} -\lambda \one_{y_s=2}} + e^{ - \beta \sum_{r \in {\cal G}_s} \one_{x_r=1} -\lambda \one_{y_s=1}}} \end{aligned}\]

The Gibbs sampler previously used for sampling the prior distribution \(\prob{X=x}\) can easily be extended to sample the posterior distribution \(\prob{X=x | Y=y}\).

alpha <- 0
beta <- 1.5
lambda <- log((1-p)/p)
nmcmc <- 50
set.seed(12345)
X <- Y   # initialize the chain with the observed (degraded) image
for (k in (1:nmcmc)) {
  for (i in sample(2:(N-1))) {
    for (j in sample(2:(N-1))) {
      Vij <- c(X[i-1,j], X[i+1,j], X[i,j-1], X[i,j+1])
      # the conditional energies now include the data term lambda*(Y[i,j] != level)
      u1 <-   beta*sum(Vij!=1) + alpha + lambda*(Y[i,j]!=1)
      u2 <-   beta*sum(Vij!=2) + lambda*(Y[i,j]!=2)
      p1 <- 1/(1+exp(-u2+u1))
      r <- runif(1)
      if (r < p1)
        X[i,j] <- 1
      else 
        X[i,j] <- 2
    }
  }
}
image(X, axes=FALSE, frame.plot=FALSE, col = grey(c(0, 0.8)))

The following Shiny application allows one to modify the parameters of the model.


4 Some restoration algorithms

In a probabilistic framework, the problem of image restoration is an estimation problem: we use the observed image \(Y\) for estimating the unobserved original image \(X\).

The maximum a posteriori (MAP) estimate of \(X\) maximizes the posterior distribution \(\prob{X=x |Y=y}\). The MAP estimate \(\hat{X}^{\rm MAP}\) is also the configuration of minimum energy:

\[\begin{aligned} \hat{X}^{\rm MAP} &= \argmax{x \in \Omega} \prob{X=x |Y=y} \\ &= \argmin{x \in \Omega} U_{X|Y}(x) \end{aligned}\] MAP estimation clearly represents a formidable computational problem. The number of possible configurations is \(L^{N^2}\), which rules out any direct search, even for a binary image (\(L=2\)).
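Although the exact MAP estimate is out of reach, the posterior energy \(U_{X|Y}\) of any candidate image can be evaluated and used to compare restorations. Here is a minimal sketch for the 4-neighbor system (the helper U.post is hypothetical; alpha, beta, lambda and Y are assumed to be defined as above):

U.post <- function(x, y, alpha, beta, lambda) {
  N <- nrow(x)
  nv <- sum(x[1:(N-1), ] != x[2:N, ])     # vertical pairs of neighbors with different values
  nh <- sum(x[, 1:(N-1)] != x[, 2:N])     # horizontal pairs of neighbors with different values
  lambda*sum(y != x) + alpha*sum(x == 1) + beta*(nv + nh)
}
U.post(Y, Y, alpha, beta, lambda)   # energy of the observed image itself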


4.1 Iterative Conditional Modes (ICM)

The iterative conditional modes algorithm is a deterministic iterative-improvement method that generates a sequence of images that monotonically increases the objective function, i.e. the posterior probability.

At iteration \(k\), a site \(s_k\) is chosen and the conditional distribution of \(X_{s_k}\) is maximized:

\[x_{s_k}(k) = \argmax{1 \leq \ell \leq L} \prob{X_{s_k} = \ell | Y_{s_k}=y_{s_k} \ , \ X_r=x_r(k-1), r \in {\cal G}_{s_k}} \]

The algorithm stops when no new changes are made during a complete scan of the \(N^2\) pixels.

X.icm <- Y    # start from the observed image
dM <- 1
set.seed(12345)
while (dM >0) {    # iterate until a complete scan produces no change
  X0 <- X.icm
  for (i in sample(2:(N-1))) {
    for (j in sample(2:(N-1))) {
      Vij <- c(X.icm[i-1,j], X.icm[i+1,j], X.icm[i,j-1], X.icm[i,j+1])
      u1 <-   beta*sum(Vij!=1) + alpha + lambda*(Y[i,j]!=1)
      u2 <-   beta*sum(Vij!=2) + lambda*(Y[i,j]!=2)
      # deterministic update: choose the level with the smallest conditional energy
      if (u1 < u2)
        X.icm[i,j] <- 1
      else
        X.icm[i,j] <- 2
    }
  }
  dM <- sum(abs(X0-X.icm))   # number of pixels modified during this scan
}
image(X.icm, axes=FALSE, frame.plot=FALSE, col = grey(c(0, 0.8)))

print(sum(abs(X.ori-X.icm)/(N^2)))
## [1] 0.0366


4.2 Simulated annealing

In contrast to the ICM algorithm, stochastic relaxation permits changes that increase the energy function. These are made on a random basis, the effect of which is to avoid convergence to a local maximum of the posterior distribution.

The Gibbs sampler is combined with an annealing schedule: the energy function \(U_{X|Y}/T_k\) is used at iteration \(k\) instead of the original energy function \(U_{X|Y}\), where \((T_k, k\geq 0)\) is a sequence of decreasing temperatures.

If the initial temperature \(T_0\) is large enough and the temperature decreases slowly enough, it can be shown that the sequence \((X(k))\) converges to the MAP estimate of \(X\).

X.sa <- Y        # start from the observed image
nsa <- 100       # number of complete scans
T <- 5           # initial temperature
gamma <- 0.95    # geometric cooling factor
set.seed(12345)
for (k in (1:nsa)) {
  for (i in sample(2:(N-1))) {
    for (j in sample(2:(N-1))) {
      Vij <- c(X.sa[i-1,j], X.sa[i+1,j], X.sa[i,j-1], X.sa[i,j+1])
      # conditional energies divided by the current temperature (alpha = 0 here)
      u1 <- (beta*sum(Vij!=1) + lambda*(Y[i,j]!=1))/T
      u2 <- (beta*sum(Vij!=2) + lambda*(Y[i,j]!=2))/T
      p1 <- 1/(1+exp(-u2+u1))
      r <- runif(1)
      if (r < p1)
        X.sa[i,j] <- 1
      else 
        X.sa[i,j] <- 2
    }
  }
  T <- T*gamma   # decrease the temperature after each scan
}
image(X.sa, axes=FALSE, frame.plot=FALSE, col = grey(c(0, 0.8)))

print(sum(abs(X.ori-X.sa)/(N^2)))
## [1] 0.0266
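As a quick comparison (a sketch, assuming the objects defined above are still in the workspace), the error rates of the degraded image and of the two restorations can be put side by side:

c(degraded = mean(X.ori != Y),
  icm      = mean(X.ori != X.icm),
  sa       = mean(X.ori != X.sa))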