
1 Gene expression

The dataset geHT.csv consists of gene expression measurements for ten genes under control and treatment conditions, with four replicates each.

  1. Test the hypothesis that the mean of the control expression values is 2000.

  2. Test that there is no overall difference between treatment and control for any of the genes (that is, test whether the whole experiment failed or there are no differentially expressed genes).

  3. Test whether the variances of the gene expression values are the same under treatment and control conditions.

2 Smoking, no smoking

  1. There are 88 smokers in a group of 300 people from the same population. Test that the proportion of smokers in this population is less than or equal to 0.25, greater than or equal to 0.25, and equal to 0.25. Show that we can use either an exact test or a test relying on an approximation.

  2. There are 90 smokers in another group of 400 people from a different population. Can we conclude that the proportions of smokers differ between these two populations?
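As a starting point for both questions, here is a minimal Python sketch using only the standard library. It computes exact binomial tail probabilities and the normal approximation for 88 smokers out of 300 against the reference value 0.25, followed by a pooled two-proportion z-test for 88/300 versus 90/400. The function names are illustrative, and the two-sided exact p-value uses the "double the smaller tail" convention, which is only one of several possible definitions.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_pvalues(x, n, p0):
    """Exact p-values for the three alternatives about a proportion."""
    upper = sum(binom_pmf(k, n, p0) for k in range(x, n + 1))  # P(X >= x)
    lower = sum(binom_pmf(k, n, p0) for k in range(0, x + 1))  # P(X <= x)
    # Assumption: two-sided p-value defined as twice the smaller tail.
    two_sided = min(1.0, 2 * min(lower, upper))
    return {"greater": upper, "less": lower, "two_sided": two_sided}


def normal_approx_pvalues(x, n, p0):
    """z statistic and p-values from the normal approximation."""
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    cdf = NormalDist().cdf
    pvalues = {
        "greater": 1 - cdf(z),
        "less": cdf(z),
        "two_sided": 2 * (1 - cdf(abs(z))),
    }
    return z, pvalues


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z-test for equality of two proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


exact = exact_binomial_pvalues(88, 300, 0.25)
z_one, approx = normal_approx_pvalues(88, 300, 0.25)
z_two, p_two = two_proportion_z(88, 300, 90, 400)

print("exact:", exact)
print("normal approximation: z =", round(z_one, 3), approx)
print("two populations: z =", round(z_two, 3), "p =", round(p_two, 4))
```

The key "greater" corresponds to the null hypothesis that the proportion is at most 0.25, and "less" to the null hypothesis that it is at least 0.25.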

3 Alzheimer’s disease

Dementia is the result of various cerebral disorders, leading to an acquired loss of memory and impaired cognitive ability. The most common forms are Alzheimer’s disease and vascular dementia.

In a study, patients were treated either with Cerebrolysin or Donepezil. The datafile scoreAD.csv reports the difference in a score obtained by these patients before and after treatment (a negative value indicates an improvement).

  1. Test if Cerebrolysin and Donepezil have a beneficial effect on patients.

  2. A doctor claims that the score decrease is greater than 2 on average for patients who take Cerebrolysin. What do you think of this hypothesis? What should be the null hypothesis?

  3. Test if the two drugs can be considered equivalent, where equivalence means that the difference between their effects is on average i) less than 2, ii) less than 4.
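One common way to formalize the equivalence question is sketched below. The notation is an assumption introduced here: \(\mu_C\) and \(\mu_D\) denote the mean score differences under Cerebrolysin and Donepezil, and \(\delta\) is the equivalence margin (2 or 4).

$$ H_0: |\mu_C - \mu_D| \geq \delta \quad \text{versus} \quad H_1: |\mu_C - \mu_D| < \delta $$

Under this formulation, equivalence is the alternative hypothesis. It is typically assessed with two one-sided tests (TOST), one for \(H_{0,1}: \mu_C - \mu_D \leq -\delta\) and one for \(H_{0,2}: \mu_C - \mu_D \geq \delta\), and equivalence is concluded at level \(\alpha\) only if both are rejected at level \(\alpha\).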

4 Epileptic activity

It is frequently assumed that the daily numbers of epileptic seizures are independent Poisson random variables.

  1. The daily numbers of epileptic seizures of a given patient are reported in the datafile epilepsy1.csv. Use the Poisson dispersion test and the chi-square goodness-of-fit test to test whether these data follow a Poisson distribution.

  2. Compare the Type I error rate and the power of these two tests via Monte Carlo simulation.
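The Monte Carlo comparison could be organized along the following lines. This is a minimal sketch using only the Python standard library, and it covers the Poisson dispersion test only. Everything specific in it is an assumption made for illustration: the upper-tail (overdispersion) form of the test, 50 days of observation, a Poisson mean of 3, a gamma-mixed Poisson as the alternative, and 2000 replicates. The chi-square goodness-of-fit test can be evaluated by passing its own p-value function to the same `rejection_rate` loop.

```python
import math
import random


def chi2_sf(x, df):
    """Upper tail P(chi2_df > x), closed form valid for integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2
    if df % 2 == 0:
        total, exponents = 0.0, [float(j) for j in range(df // 2)]
    else:
        total = math.erfc(math.sqrt(y))
        exponents = [j + 0.5 for j in range((df - 1) // 2)]
    for a in exponents:
        # Each term is y^a * exp(-y) / Gamma(a + 1), computed in log space.
        total += math.exp(-y + a * math.log(y) - math.lgamma(a + 1))
    return min(total, 1.0)


def rpois(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def dispersion_pvalue(counts):
    """Poisson dispersion test: sum((x - mean)^2) / mean vs chi2(n - 1)."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        return 1.0  # all counts are zero: no evidence of overdispersion
    d = sum((c - mean) ** 2 for c in counts) / mean
    return chi2_sf(d, n - 1)


def rejection_rate(sampler, pvalue_fn, n_days, n_rep, alpha, rng):
    """Fraction of simulated datasets for which the test rejects at level alpha."""
    rejections = 0
    for _ in range(n_rep):
        counts = [sampler(rng) for _ in range(n_days)]
        if pvalue_fn(counts) < alpha:
            rejections += 1
    return rejections / n_rep


rng = random.Random(1)
# Under H0: Poisson counts with mean 3 (hypothetical value).
type1 = rejection_rate(lambda r: rpois(3.0, r), dispersion_pvalue, 50, 2000, 0.05, rng)
# Under H1: overdispersed counts, Poisson with a gamma-distributed mean (mean 3).
power = rejection_rate(
    lambda r: rpois(r.gammavariate(2.0, 1.5), r), dispersion_pvalue, 50, 2000, 0.05, rng
)
print("estimated Type I error rate:", type1)
print("estimated power:", power)
```

The estimated Type I error rate should be compared with the nominal level 0.05, and the power should be examined for several alternatives and sample sizes rather than for the single scenario used here.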

5 Type 2 diabetes

An investigator is exploring whether the expression levels of genes significantly differ between a sample of healthy individuals and a sample of individuals with Type 2 diabetes. He performs a separate t-test comparing the two samples for each of 5,000 different genes, and uses \(\alpha = 0.05\) as his cutoff. His analysis identifies 411 genes as having different expression levels between the two samples.

  1. The investigator reasons that because he carried out his t-tests using a type I error rate of 5%, he should expect about 5% of the 411 genes that he discovered to be type I errors. Is this reasoning correct or incorrect? If it is incorrect, what’s wrong with it?

  2. What is the investigator’s false discovery rate?
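The arithmetic behind these two questions can be sketched as follows. The function name and the `n_true_null` parameter are illustrative. The true number of genes with no difference is unknown, so the default setting, in which all 5,000 genes are treated as true nulls, gives an upper bound on the expected number of false positives rather than the exact false discovery rate.

```python
def fdr_upper_bound(n_tests, alpha, n_discoveries, n_true_null=None):
    """Expected false positives and the implied bound on the false discovery rate."""
    # Assumption: if the number of true nulls is unknown, take all tests as null.
    m0 = n_tests if n_true_null is None else n_true_null
    expected_false_positives = m0 * alpha
    bound = min(1.0, expected_false_positives / n_discoveries)
    return expected_false_positives, bound


expected_fp, fdr_bound = fdr_upper_bound(5000, 0.05, 411)
print("expected false positives (at most):", expected_fp)
print("bound on the false discovery rate:", round(fdr_bound, 3))
```

The Type I error rate applies to the genes for which the null hypothesis is true, not to the 411 discoveries, which is why the expected number of false positives is computed from the number of tests.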

6 Identification of genes

Breast cancer is the most common malignant disease in Western women. In these patients, it is not the primary tumour, but its metastases at distant sites that are the main cause of death.

Prognostic markers are needed to identify patients who are at the highest risk for developing metastases, which might enable oncologists to begin tailoring treatment strategies to individual patients. Gene-expression signatures of primary breast tumours might be one way to identify the patients who are most likely to develop metastatic cancer.

The datafile geneMFS.csv contains the expression level of 11 genes and the metastasis-free survival (the period until metastasis is detected) for 527 patients.

The objective of this study is to identify which genes may be markers of good or poor prognosis for the development of metastasis.

  1. Graphically compare the distribution of the gene expressions in the groups of patients with early metastasis (MFS < 1000) and late metastasis (MFS > 1000).

  2. Compare the gene expression levels in these two groups using a parametric test.

  3. Compare these results with those obtained using a nonparametric test.