$$ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \def\pmacro{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\simh0{\mathop{\sim}_{H_0}} \def\df{\texttt{df}} \def\res{e} \def\xomega{x} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \newcommand{\Rset}{\mbox{$\mathbb{R}$}} \def\param{\theta} \def\setparam{\Theta} \def\xnew{x_{\rm new}} \def\fnew{f_{\rm new}} \def\ynew{y_{\rm new}} \def\nnew{n_{\rm new}} \def\enew{e_{\rm new}} \def\Xnew{X_{\rm new}} \def\hfnew{\widehat{\fnew}} \def\degree{m} \def\nbeta{d} \newcommand{\limite}[1]{\mathop{\longrightarrow}\limits_{#1}} \def\ka{k{\scriptstyle a}} \def\ska{k{\scriptscriptstyle a}} \def\kel{k{\scriptstyle e}} \def\skel{k{\scriptscriptstyle e}} \def\cl{C{\small l}} \def\Tlag{T\hspace{-0.1em}{\scriptstyle lag}} \def\sTlag{T\hspace{-0.07em}{\scriptscriptstyle lag}} \def\Tk{T\hspace{-0.1em}{\scriptstyle k0}} \def\sTk{T\hspace{-0.07em}{\scriptscriptstyle k0}} \def\thalf{t{\scriptstyle 1/2}} \newcommand{\Dphi}[1]{\partial_\pphi #1} \def\asigma{a} \def\pphi{\psi} \newcommand{\stheta}{{\theta^\star}} \newcommand{\htheta}{{\widehat{\theta}}} $$


1 Detection of changes in the CAC 40 index

The Figure below displays the CAC 40 index during a given period of time.

library(ggplot2) ; theme_set(theme_bw())
d <- read.csv(file="CAC40Data.txt", header=TRUE, sep="\t")
ggplot(data=d) +  geom_line(aes(time,y), color="#6666CC") +  ylab("CAC 40 index")

  1. Assuming a piecewise constant volatility, propose a mathematical model for this data.
  2. Extend the dynamic programming algorithm for this model.
  3. Propose a segmentation of the CAC 40 index.


2 Segmentation of array CGH data

The purpose of array-based Comparative Genomic Hybridization (array CGH) is to detect and map chromosomal aberrations, on a genomic scale, in a single experiment. Since chromosomal copy numbers can not be measured directly, two samples of genomic DNA (referred to as the reference and test DNAs) are differentially labelled with fluorescent dyes and competitively hybridized to known mapped sequences (referred to as BACs) that are immobilized on a slide. Subsequently, the ratio of the intensities of the two fluorochromes is computed and a CGH profile is constituted for each chromosome when the log2 of fluorescence ratios are ranked and plotted according to the physical position of their corresponding BACs on the genome.

Each profile can be viewed as a succession of ???segments??? that represent homogeneous regions in the genome whose BACs share the same relative copy number on average.

d <- read.csv(file="CGHdata.txt", header=TRUE, sep=" ")
ggplot(data=d) +  geom_point(aes(position,y), color="#6666CC") +  ylab("fluorescence")

  1. Assuming that both the central value and the variability of the data are both piecewise constant, propose a mathematical model for this data.
  2. Extend the dynamic programming algorithm for this model.
  3. Propose a segmentation of the CGH profile.


3 Testing the presence of a change point

Consider the well log data data between \(t=651\) and \(t=950\):

d <- read.csv(file="wellLogData.txt", header=TRUE, sep="\t")
ggplot(data=d[651:950,] ) +  geom_point(aes(time,y), color="#6666CC") +  ylab("nuclear reponse")

  1. Test if the mean changes at time \(t=750\). Test if the mean changes at time \(t=850\). What can we conclude from these two tests?
  2. Test if the mean changes somewhere between \(t=651\) and \(t=949\).


4 Linear transition between segments

The model used for the well log data assumes that the underlying signal suddenly jumps from a valor to another one.

If we look at the data more in detail, we can see that this assumption is an approximation which is not realistic: the transition around \(t=252\), for instance, is not abrupt but seems to occur during some time interval.

d <- read.csv(file="wellLogData.txt", header=TRUE, sep="\t")
ggplot(data=d[151:350,] ) +  geom_point(aes(time,y), color="#6666CC") +  ylab("nuclear reponse")

  1. Write a mathematical model which now assumes linear transitions between consecutive mean value.
  2. Propose and implement an algorithm for estimating the underlying signal in presence of a single change point. Apply this algorithm for estimating the undelying signal of our data between \(t=151\) and \(t=350\).
  3. In the case of multiple change points, how could we extend the dynamic programming algorithm for this new model?