$$ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \def\pmacro{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\simh0{\mathop{\sim}_{H_0}} \def\df{\texttt{df}} \def\res{e} \def\xomega{x} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \newcommand{\Rset}{\mbox{$\mathbb{R}$}} \def\param{\theta} \def\setparam{\Theta} \def\xnew{x_{\rm new}} \def\fnew{f_{\rm new}} \def\ynew{y_{\rm new}} \def\nnew{n_{\rm new}} \def\enew{e_{\rm new}} \def\Xnew{X_{\rm new}} \def\hfnew{\widehat{\fnew}} \def\degree{m} \def\nbeta{d} \newcommand{\limite}[1]{\mathop{\longrightarrow}\limits_{#1}} \def\ka{k{\scriptstyle a}} \def\ska{k{\scriptscriptstyle a}} \def\kel{k{\scriptstyle e}} \def\skel{k{\scriptscriptstyle e}} \def\cl{C{\small l}} \def\Tlag{T\hspace{-0.1em}{\scriptstyle lag}} \def\sTlag{T\hspace{-0.07em}{\scriptscriptstyle lag}} \def\Tk{T\hspace{-0.1em}{\scriptstyle k0}} \def\sTk{T\hspace{-0.07em}{\scriptscriptstyle k0}} \def\thalf{t{\scriptstyle 1/2}} \newcommand{\Dphi}[1]{\partial_\pphi #1} \def\asigma{a} \def\pphi{\psi} \newcommand{\stheta}{{\theta^\star}} \newcommand{\htheta}{{\widehat{\theta}}} $$


1 Introduction

When we perform a large number of statistical tests, some will have \(p\)-values less than 0.05 purely by chance, even if all the null hypotheses are really true.

More precisely, if we do a large number \(m\) of statistical tests, and for \(m_{0\cdot}\) of them the null hypothesis is actually true, we would expect about 5% of the \(m_{0\cdot}\) tests to be significant at the 0.05 level, just due to chance: these significant results are false discoveries (or false positives). On the other hand, if some alternative hypothesis are true, we can miss some of them: these non significant results are false negatives.

\[ \begin{array}{c|c|c|c} & \text{Null hypothesis true} & \text{Alternative hypothesis true} & \\ \hline \text{Test non significant} & m_{00} & m_{10} & m_{\cdot 0} \\ \hline \text{Test significant} & m_{01} & m_{11} & m_{\cdot 1} \\ \hline & m_{0 \cdot} & m_{1 \cdot} & m \end{array} \]

If important conclusions and decisions are based on these false positives, it is then important to control the family-wise error rate (FWER):

\[ FWER = \prob{m_{01}\geq 1} \]

When true positives are expected, it is possible to miss some of them. We then necessarily need to accept false positives if we want to limit the number of these false negatives. It is important in such situation to control the false discovery rate (FDR)

\[FDR = \esp{\frac{m_{01}}{m_{01} + m_{11}}} = \esp{\frac{m_{01}}{m_{\cdot1} }}\]

Several procedures exist for controlling either the FWER or the FDR.


2 Distribution of the p-values

2.1 Introduction

The health effects of a Roundup-tolerant genetically modified maize, cultivated with or without Roundup, and Roundup alone, were studied during a 2 years study in rats.

For each sex, one control group had access to plain water and standard diet from the closest isogenic non-transgenic maize control; six groups were fed with 11, 22 and 33% of GM NK603 maize either treated or not with Roundup. The final three groups were fed with the control diet and had access to water supplemented with different concentrations of Roundup.

A sample of 200 rats including 100 males and 100 females was randomized into 20 groups of 10 rats of the same sex. Within each group, rats received the same diet. For each sex, there are therefore nine experimental groups and one control group.

The file ratSurvival.csv reports the lifespan (in days) for each animal. Here, the experiment stopped after 720 days. Then, the reported survival time is 720 for those animals who were still alive at the end of the experiment.

See the opinion of the Haut Conseil des Biotechnologies for more information about this study.

Here is a summary of the data,

data <- read.csv("ratSurvival.csv")
summary(data)
##       time              regimen      gender   
##  Min.   :100.0   control    :20   female:100  
##  1st Qu.:590.0   NK603-11%  :20   male  :100  
##  Median :660.0   NK603-11%+R:20               
##  Mean   :631.6   NK603-22%  :20               
##  3rd Qu.:720.0   NK603-22%+R:20               
##  Max.   :720.0   NK603-33%  :20               
##                  (Other)    :80
levels(data$regimen)
##  [1] "control"     "NK603-11%"   "NK603-11%+R" "NK603-22%"   "NK603-22%+R"
##  [6] "NK603-33%"   "NK603-33%+R" "RoundUp A"   "RoundUp B"   "RoundUp C"


2.2 Single comparison between 2 groups

One objective of this study is the comparison of the survival between the control group and the experimental groups.

Consider for instance the control group of females

data.control <- subset(data, regimen=="control" & gender=="female")
data.control$time
##  [1] 540 645 720 720 720 720 720 720 720 720

Only 2 rats of this group died before the end of the experiment. On the other hand, 7 females of the group fed with 22% of maize NK693 died during the experiment.

data.test <- subset(data, regimen=="NK603-22%" & gender=="female")
data.test$time
##  [1] 290 475 480 510 550 555 650 720 720 720

A negative effect of the diet on the survival means that the rats of the experimental group tend to die before those of the control group. Then, we would like to test \(H_0\) : the NK603 11% diet has no effect on the survival of female rats''* against $H_1$: *the NK603 11% diet leads to decreased survival time for female rats??????.

In terms of survival functions, that means that, under \(H_1\), the probability to be alive at a given time \(t\) is lower for a rat of the experimental group than for a rat of the control group. We then would like to test \[H_0: \ ``\prob{T_{\rm test}>t} = \prob{T_{\rm control}>t}, \text{ for any } t>0"" \quad \text{versus} \quad H_1: \ ``\prob{T_{\rm test}>t} < \prob{T_{\rm control}>t}, \text{ for any } t>0" \]

Because of the (right) censoring process, we cannot just compare the mean survival times using a \(t\)-test. On the other hand, we can use the Wilcoxon-Mann-Whitney test which precisely aims to compare the ranks of the survival times in both groups.

wilcox.test(data.test$time, data.control$time, alternative="less")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data.test$time and data.control$time
## W = 22, p-value = 0.01144
## alternative hypothesis: true location shift is less than 0

Here, the \(p\)-value should lead us to reject the null hypothesis and conclude that 22% of the GM maize in the diet has a negative effect on the survival.

2.3 A single comparison??? among many others

Should we really accept this conclusion as it stands? No, because we don???t know the whole storyRemember that there are 9 experimental groups for each sex. Then, 18 comparisons with the control groups are performed.

library(ggplot2) ; theme_set(theme_bw())
pval <- stat <- gender <- regimen <- NULL
for (g in levels(data$gender)) {
  data.control <- subset(data, regimen=="control" & gender==g)
  for (r in levels(data$regimen)) {
    if (r != "control") {
      data.test <- subset(data, gender==g & regimen==r)
      wt <- wilcox.test(data.test$time, data.control$time, alternative="less")
      pval <- c(pval,wt$p.value)
      stat <- c(stat,wt$statistic)
      gender <- c(gender, g)
      regimen <- c(regimen, r)
    }
  }
}
R <- data.frame(gender=gender, regimen=regimen,  stat=stat, p.value=pval)
R <-  R[order(R$p.value),]
R
##    gender     regimen stat    p.value
## 3  female   NK603-22% 22.0 0.01143779
## 4  female NK603-22%+R 25.0 0.02131745
## 9  female   RoundUp C 26.5 0.02845455
## 1  female   NK603-11% 33.0 0.07166095
## 8  female   RoundUp B 34.0 0.08459162
## 7  female   RoundUp A 34.5 0.09156610
## 6  female NK603-33%+R 36.0 0.11556850
## 5  female   NK603-33% 39.0 0.17583926
## 2  female NK603-11%+R 40.0 0.18797769
## 14   male   NK603-33% 40.5 0.24733242
## 13   male NK603-22%+R 42.0 0.28493944
## 11   male NK603-11%+R 50.0 0.51508635
## 15   male NK603-33%+R 51.0 0.54532608
## 18   male   RoundUp C 54.5 0.64764235
## 17   male   RoundUp B 58.5 0.75283120
## 12   male   NK603-22% 64.0 0.86465902
## 10   male   NK603-11% 74.5 0.97160483
## 16   male   RoundUp A 75.5 0.97688776

Let us plot the ordered p-values:

ggplot(data=R) + geom_point(aes(x=1:18,color=regimen,y=p.value,shape=gender), size=4) + 
  scale_y_log10() + xlab("regimen") + ylab("p-value") + scale_x_continuous(breaks=NULL)

If we then decide to only report the largest observed differences, associated to the smallest \(p\)-values, how can we conclude that these differences are statistically significant?


2.4 The Bonferroni correction

Imagine that we perform \(m\) comparisons and that all the \(m\) null hypotheses are true, i.e. \(m=m_{0\cdot}\).. If we use the same significance level \(\alpha_m\) for the \(m\) tests, how should we choose \(\alpha_m\) in order to control the family-wise error rate (FWER)?

\[ \begin{aligned} {\rm FWER} &= \prob{m_{01}\geq 1} \\ &= 1 - \prob{m_{01}= 0} \\ &= 1 - (1-\alpha_m)^m \end{aligned} \]

Then, if we set FWER\(=\alpha\), the significance level for each individual test should be \[\begin{aligned} \alpha_m &= 1 - (1-\alpha)^{\frac{1}{m}} \quad \quad (\text{Sidak correction})\\ & \simeq \frac{\alpha}{m} \quad \quad (\text{Bonferroni correction}) \end{aligned}\]

Let \(p_k\) be the \(p\)-value of the \(k\)-th test. Using the Bonferroni correction, the \(k\)-th test is significant if \[ p_k \leq \alpha_m \quad \Longleftrightarrow \quad m \, p_k \leq \alpha \] We can then either compare the original \(p\)-value \(p_k\) to the corrected significance level \(\alpha/m\), or compare the adjusted \(p\)-value \(p_k^{\rm (bonferroni)} =\min(1, m \, p_k)\) to the critical value \(\alpha\).

m <- nrow(R)
R$pv.bonf <- pmin(1, R$p.value*m)
R
##    gender     regimen stat    p.value   pv.bonf
## 3  female   NK603-22% 22.0 0.01143779 0.2058803
## 4  female NK603-22%+R 25.0 0.02131745 0.3837141
## 9  female   RoundUp C 26.5 0.02845455 0.5121818
## 1  female   NK603-11% 33.0 0.07166095 1.0000000
## 8  female   RoundUp B 34.0 0.08459162 1.0000000
## 7  female   RoundUp A 34.5 0.09156610 1.0000000
## 6  female NK603-33%+R 36.0 0.11556850 1.0000000
## 5  female   NK603-33% 39.0 0.17583926 1.0000000
## 2  female NK603-11%+R 40.0 0.18797769 1.0000000
## 14   male   NK603-33% 40.5 0.24733242 1.0000000
## 13   male NK603-22%+R 42.0 0.28493944 1.0000000
## 11   male NK603-11%+R 50.0 0.51508635 1.0000000
## 15   male NK603-33%+R 51.0 0.54532608 1.0000000
## 18   male   RoundUp C 54.5 0.64764235 1.0000000
## 17   male   RoundUp B 58.5 0.75283120 1.0000000
## 12   male   NK603-22% 64.0 0.86465902 1.0000000
## 10   male   NK603-11% 74.5 0.97160483 1.0000000
## 16   male   RoundUp A 75.5 0.97688776 1.0000000

Using the Bonferroni correction, none of the 18 comparisons is significant.

Remark: the function p.adjust proposes several adjustements of the \(p\)-values for multiple comparisons, including the Bonferroni adjustment:

p.adjust(pval, method = "bonferroni")
##  [1] 1.0000000 1.0000000 0.2058803 0.3837141 1.0000000 1.0000000 1.0000000
##  [8] 1.0000000 0.5121818 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [15] 1.0000000 1.0000000 1.0000000 1.0000000

The Bonferroni correction is appropriate when a single false positive in a set of tests would be a problem. It is mainly useful when there are a fairly small number of multiple comparisons and very few of them might be significant. The main drawback of the Bonferroni correction is its lack of power: it may lead to a very high rate of false negatives.

Assume, for instance that all the female rats of all the experimental groups die before the end of the experiment. Each of the \(m=18\) original tests is significant, but the Bonferroni correction would lead to considering none of these tests as significant:

x <- subset(data, regimen=="control" & gender=="female")$time
y <- seq(610,700,length=10)
c(wilcox.test(x,y,"greater")$p.value, wilcox.test(x,y,"greater")$p.value*18)
## [1] 0.004444026 0.079992464


2.5 Permutation test

A permutation test (also called a randomization test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance levels. Prediction intervals can also be derived.

In our example, imagine that the null hypothesis is true. We can then randomly exchange the labels (i.e.??the regimen) and perform the 18 comparisons between the experimental groups and the control groups.

set.seed(100)
dperm.m <- subset(data,  gender=="male")
n.m <- dim(dperm.m)[1]
dperm.m$regimen <- dperm.m$regimen[sample(n.m)]
dperm.f <- subset(data,  gender=="female")
n.f <- dim(dperm.f)[1]
dperm.f$regimen <- dperm.f$regimen[sample(n.f)]
dperm <- rbind(dperm.m,dperm.f)

pval <- gender <- regimen <- NULL
for (g in levels(dperm$gender)) {
  dperm.control <- subset(dperm, regimen=="control" & gender==g)
  for (r in levels(dperm$regimen)) {
    if (r != "control") {
      dperm.test <- subset(dperm, gender==g & regimen==r)
      wt <- wilcox.test(dperm.test$time, dperm.control$time, alternative="less")
      pval <- c(pval,wt$p.value)
      gender <- c(gender, g)
      regimen <- c(regimen, r)
    }
  }
}
R.p <- data.frame(gender=gender, regimen=regimen, p.value=pval)
Ro.p <-  R.p[order(R.p$p.value),]

The test statistics and the \(p\)-values now really behave how they are supposed to behave under the null hypothesis

ggplot(data=Ro.p) + geom_point(aes(x=1:18,color=regimen,y=p.value,shape=gender), size=4) + 
  scale_y_log10() + xlab("regimen") + ylab("p-value") + scale_x_continuous(breaks=NULL)

If we now repeat the same experiment using many different permutations, we will be able to estimate the \(m\) distributions of the \(m\) test statistics as well as the \(m\) distributions of the \(m\) \(p\)-values under the null hypothesis.

L <- 1000
dperm.m <- subset(data,  gender=="male")
n.m <- dim(dperm.m)[1]
dperm.f <- subset(data,  gender=="female")
n.f <- dim(dperm.f)[1]
PV <- ST <- NULL

for (l in (1:L)) {
  dperm.m$regimen <- dperm.m$regimen[sample(n.m)]
  dperm.f$regimen <- dperm.f$regimen[sample(n.f)]
  dperm <- rbind(dperm.m,dperm.f)
  pval <- stat <- NULL
  for (g in levels(dperm$gender)) {
    dperm.control <- subset(dperm, regimen=="control" & gender==g)
    for (r in levels(dperm$regimen)) {
      if (r != "control") {
        dperm.test <- subset(dperm, gender==g & regimen==r)
        wt <- wilcox.test(dperm.test$time, dperm.control$time, alternative="less")
        pval <- c(pval,wt$p.value)
        stat <- c(stat,wt$statistic)
      }
    }
  }
  PV <- rbind(PV,sort(pval))
  ST <- rbind(ST,sort(stat))
}

We can estimate, for instance, prediction intervals of level 90% for the \(m=18\) ordered \(p\)-values

q <- apply(PV, MARGIN = 2, quantile, probs = c(0.05, 0.5, 0.95))
q <- as.data.frame(t(q))
names(q) <- c("low","median","up")
q$rank <- 1:dim(q)[1]

and plot them???

pl <- ggplot(data=q) +
  geom_errorbar(aes(x=rank, ymin=low, ymax=up), width=0.2,size=1.5,colour="grey50") +
  scale_y_log10() + xlab("regimen") + ylab("p-value") + scale_x_continuous(breaks=NULL) 
pl

??? with the original \(p\)-values

pl +  geom_point(data=R, aes(x=1:18,color=regimen,y=p.value,shape=gender), size=4) 

Here, all the \(p\)-values, including the smallest ones, belong to the 90% prediction intervals: all the observed \(p\)-values behave individually how they are expected to behave under the null hypothesis.

In particular, when 18 comparisons are performed, it???s not unlikely under the null hypothesis to obtain a smallest p-value less than or equal to the observed one (0.011).

The probability of such event can easily be estimated by Monte Carlo simulation. Let \(p_{(1),\ell}\) be the smallest p-value obtained from the \(\ell\)-th replicate of the Monte Carlo. Then,

\[ \prob{p_{(1)} \leq p_{(1)}^{\rm obs}} \simeq \frac{1}{L} \sum_{\ell=1}^{L} \one\left\{p_{(1),\ell} \leq p_{(1)}^{\rm obs}\right\} \]

mean(ST[,1]<R$stat[1])
## [1] 0.139


3 Controlling the False Discovery Rate

3.1 Detecting associations

(Example from Handbook of Biological Statistics)

Garc??a-Arenzana et al. (2014) tested associations of 25 dietary variables with mammographic density, an important risk factor for breast cancer, in Spanish women. They found the following results

data <- read.csv("dietary.csv")
data
##              dietary p.value
## 1     Total calories   0.001
## 2          Olive oil   0.008
## 3         Whole milk   0.039
## 4         White meat   0.041
## 5           Proteins   0.042
## 6               Nuts   0.061
## 7  Cereals and pasta   0.074
## 8         White fish   0.205
## 9             Butter   0.212
## 10        Vegetables   0.216
## 11      Skimmed milk   0.222
## 12          Red meat   0.251
## 13             Fruit   0.269
## 14              Eggs   0.275
## 15         Blue fish   0.340
## 16           Legumes   0.341
## 17     Carbohydrates   0.384
## 18          Potatoes   0.569
## 19             Bread   0.594
## 20              Fats   0.696
## 21            Sweets   0.762
## 22    Dairy products   0.940
## 23 Semi-skimmed milk   0.942
## 24        Total meat   0.975
## 25    Processed meat   0.986

We can see that five of the variables show a significant \(p\)-value (<0.05). However, because Garc??a-Arenzana et al. (2014) tested 25 dietary variables, we would expect one or two variables to show a significant result purely by chance, even if diet had no real effect on mammographic density.

Applying the Bonferroni correction, we divide \(\alpha=0.05\) by the number of tests (\(m=25\)) to get the Bonferroni critical value, so a test would have to have \(p<0.002\) to be significant. Under that criterion, only the test for total calories is significant.

An alternative approach is to control the false discovery rate, i.e the expected proportion of ``discoveries" (significant results) that are actually false positives. FDR control offers a way to increase power while maintaining some principled bound on error.

Imagine for instance that we compare expression levels for 20,000 genes between liver tumors and normal liver cells. We are going to do additional experiments on any genes that show a significant difference between the normal and tumor cells. Then, because we don???t want to miss genes of interest, we are willing to accept up to 25% of the genes with significant results being false positives. We???ll find out they???re false positives when we do the followup experiments. In this case, we would set the false discovery rate to 25%.

The Benjamini-Hochberg (BH) procedure controls the FDR??? and it is simple to use!

Indeed, for a given \(\alpha\) and a given sequence of ordered \(p\)-values \(P_{(1)}\), \(P_{(2)}\), , \(P_{(m)}\), it consists in computing the \(m\) adjusted \(p\)-values defined as

\[ P_{(i)}^{\rm BH} = \min\left( P_{(i)}\frac{m}{i} \ , \ P_{(i+1)}^{\rm BH} \right)\]

p.bh <- data$p.value
m <- length(p.bh)
for (i in ((m-1):1)) 
  p.bh[i] <- min(data$p.value[i]*m/i , p.bh[i+1])
data$p.bh <- p.bh
data
##              dietary p.value      p.bh
## 1     Total calories   0.001 0.0250000
## 2          Olive oil   0.008 0.1000000
## 3         Whole milk   0.039 0.2100000
## 4         White meat   0.041 0.2100000
## 5           Proteins   0.042 0.2100000
## 6               Nuts   0.061 0.2541667
## 7  Cereals and pasta   0.074 0.2642857
## 8         White fish   0.205 0.4910714
## 9             Butter   0.212 0.4910714
## 10        Vegetables   0.216 0.4910714
## 11      Skimmed milk   0.222 0.4910714
## 12          Red meat   0.251 0.4910714
## 13             Fruit   0.269 0.4910714
## 14              Eggs   0.275 0.4910714
## 15         Blue fish   0.340 0.5328125
## 16           Legumes   0.341 0.5328125
## 17     Carbohydrates   0.384 0.5647059
## 18          Potatoes   0.569 0.7815789
## 19             Bread   0.594 0.7815789
## 20              Fats   0.696 0.8700000
## 21            Sweets   0.762 0.9071429
## 22    Dairy products   0.940 0.9860000
## 23 Semi-skimmed milk   0.942 0.9860000
## 24        Total meat   0.975 0.9860000
## 25    Processed meat   0.986 0.9860000

Notice that we could equivalently use the function p.adjust:

p.adjust(data$p.value, method = "BH")
##  [1] 0.0250000 0.1000000 0.2100000 0.2100000 0.2100000 0.2541667 0.2642857
##  [8] 0.4910714 0.4910714 0.4910714 0.4910714 0.4910714 0.4910714 0.4910714
## [15] 0.5328125 0.5328125 0.5647059 0.7815789 0.7815789 0.8700000 0.9071429
## [22] 0.9860000 0.9860000 0.9860000 0.9860000

Then, the discoveries, i.e.??the significant tests, are those with an ajusted p-value less than \(\alpha\).

It can be shown that this procedure guarantees that for independent tests, and for any alternative hypothesis,

\[\begin{aligned} {\rm FDR} &= \esp{\frac{m_{01}}{m_{01} + m_{11}}} \\ &\leq \frac{m_{0\cdot}}{m} \alpha \\ &\leq \alpha \end{aligned}\] where \(m_{0\cdot}\) is the (unknown) total number of true null hypotheses, and where the first inequality is an equality with continuous \(p\)-value distributions.

In our example, the first five tests would be significant with \(\alpha=0.25\), which means that we expect no more than 25% of these 5 tests to be false discoveries.

Remark 1: The BH procedure is equivalent to consider as significant the non adjusted \(p\)-values smaller than a threshold \(P_{\rm BH}\) defined as

\[ P_{\rm BH} = \max_i \left\{ P_{(i)}: \ \ P_{(i)} \leq \alpha \frac{i}{m} \right\} \] In other words, the largest \(p\)-value that has \(P_{(i)}<(i/m)\alpha\) is significant, and all of the \(P\)-values smaller than it are also significant, even the ones that aren???t less than their Benjamini-Hochberg critical value \(\alpha \times i/m\)

alpha <- 0.25
m <- dim(data)[1]
data$critical.value <- (1:m)/m*alpha
data
##              dietary p.value      p.bh critical.value
## 1     Total calories   0.001 0.0250000           0.01
## 2          Olive oil   0.008 0.1000000           0.02
## 3         Whole milk   0.039 0.2100000           0.03
## 4         White meat   0.041 0.2100000           0.04
## 5           Proteins   0.042 0.2100000           0.05
## 6               Nuts   0.061 0.2541667           0.06
## 7  Cereals and pasta   0.074 0.2642857           0.07
## 8         White fish   0.205 0.4910714           0.08
## 9             Butter   0.212 0.4910714           0.09
## 10        Vegetables   0.216 0.4910714           0.10
## 11      Skimmed milk   0.222 0.4910714           0.11
## 12          Red meat   0.251 0.4910714           0.12
## 13             Fruit   0.269 0.4910714           0.13
## 14              Eggs   0.275 0.4910714           0.14
## 15         Blue fish   0.340 0.5328125           0.15
## 16           Legumes   0.341 0.5328125           0.16
## 17     Carbohydrates   0.384 0.5647059           0.17
## 18          Potatoes   0.569 0.7815789           0.18
## 19             Bread   0.594 0.7815789           0.19
## 20              Fats   0.696 0.8700000           0.20
## 21            Sweets   0.762 0.9071429           0.21
## 22    Dairy products   0.940 0.9860000           0.22
## 23 Semi-skimmed milk   0.942 0.9860000           0.23
## 24        Total meat   0.975 0.9860000           0.24
## 25    Processed meat   0.986 0.9860000           0.25
library(gridExtra)
pl1 <- ggplot(data) + geom_line(aes(x=1:m,y=p.value), colour="blue") +
geom_line(aes(x=1:m,y=critical.value), colour="red") +xlab("(i)")
grid.arrange(pl1,pl1 + xlim(c(1,8)) + ylim(c(0,0.21)) + geom_vline(xintercept=5.5))

The largest \(p\)-value with \(P_{(i)}<(i/m)\alpha\) is proteins, where the individual \(p\)-value (0.042) is less than the \((i/m)\alpha\) value of 0.050. Thus the first five tests would be significant.

Remark 2: The FDR is not bounded by \(\alpha\), but by \((m_{0\cdot}/m) \alpha\). We could increase the global power of the tests and get a FDR equal to the desired level \(\alpha\), either by defining the critical values as \((i/m_{0\cdot})\alpha\), or by multiplying the adjusted \(p\)-values by \(m_{0\cdot}/m\).

Unfortunately, \(m_{0\cdot}\) is unknown??? but it can be estimated, as the number of non significant tests for instance.

m0.est <- sum(data$p.bh>alpha)
data$crit.valc <- round(data$critical.value*m/m0.est,4)
data$p.bhc <- round(data$p.bh*m0.est/m,4)
head(data,10)
##              dietary p.value      p.bh critical.value crit.valc  p.bhc
## 1     Total calories   0.001 0.0250000           0.01    0.0125 0.0200
## 2          Olive oil   0.008 0.1000000           0.02    0.0250 0.0800
## 3         Whole milk   0.039 0.2100000           0.03    0.0375 0.1680
## 4         White meat   0.041 0.2100000           0.04    0.0500 0.1680
## 5           Proteins   0.042 0.2100000           0.05    0.0625 0.1680
## 6               Nuts   0.061 0.2541667           0.06    0.0750 0.2033
## 7  Cereals and pasta   0.074 0.2642857           0.07    0.0875 0.2114
## 8         White fish   0.205 0.4910714           0.08    0.1000 0.3929
## 9             Butter   0.212 0.4910714           0.09    0.1125 0.3929
## 10        Vegetables   0.216 0.4910714           0.10    0.1250 0.3929

We would consider the 7 first \(p\)-values as significant using this new correction.


3.2 A Monte Carlo simulation

Let us perform a Monte Carlo simulation to better understand the impact of these corrections.

Let us assume that we observe \((x_{ij}, 1\leq i \leq n_x, 1 \leq j \leq m)\) and \((y_{ij}, 1\leq i \leq n_y, 1 \leq j \leq m)\) where ,

\[\begin{aligned} x_{ij} & \iid {\cal N}(\mu_{x,j} \ , \ \sigma_x^2) \\ y_{ij} & \iid {\cal N}(\mu_{y,j} \ , \ \sigma_y^2) \end{aligned} \]

For \(j=1,2,\ldots, m\), we want to test \(H_{0,j}: \mu_{x,j}=\mu_{y,j}\) versus \(H_{1,j}: \mu_{x,j} \neq \mu_{y,j}\).

For the simulation, we will use \(m=140\), \(n_x=n_y=50\), \(\sigma_x^2=\sigma_y^2=1\) and \(\mu_{x,j}=0\) for \(1\leq j \leq m\).

Furthermore, the null hypothesis is true for the first \(m_{0,\cdot}=120\) variables, assuming that \(\mu_{y,j}=0\) for \(1\leq j \leq m_{0,\cdot}\). Then, for \(m_{0,\cdot}+1\leq j \leq m\), \(\mu_{y,j}\) varies from 0.3 to 0.6, which means that the alternative hypothesis is true for these \(m_{1,\cdot}=20\) variables.

nx <- 50
ny <- 50
m0 <- 120
m1 <- 20
mu.x <- 0
mu.y <- c(rep(0,m0),seq(0.3,0.6,length=m1))

For each of the \(L=1000\) simulated replicate of the same experiment, we will randomly sample observation \(x\) and \(y\) from the model and perform a \(t\)-test for each of the \(m\) variables. We therefore get \(m\) \(p\)-values for each of these \(L\) replicates.

L <- 1000
m <- m0+m1
set.seed <- 12345
pval <- matrix(ncol=L,nrow=m)
for (l in (1:L))
{
  x.sim <- matrix(rnorm(nx*m, mu.x), ncol=nx)
  y.sim <- matrix(rnorm(ny*m, mu.y), ncol=ny)
  dat <- cbind(x.sim, y.sim)
  pval[,l] <- apply(dat, 1, function(dat) {
    t.test(x = dat[1:nx], y = dat[(nx + 1):(nx + ny)])$p.value})
}

Setting the significance level \(\alpha\) to 0.2, we can compute for each replicate the numbers of true and false discoveries \(m_{11}\) and \(m_{01}\), as well as the numbers of true and false nondiscoveries \(m_{00}\) and \(m_{10}\).

We can then compute the proportion of wrongly rejected null hypotheses

alpha=0.2
m01 <- colSums(pval[1:m0,] < alpha)
mean(m01/m0)
## [1] 0.2009333

As expected, this proportion is close to \(\alpha\) which is precisely the probability to wrongly reject the null hypothesis.

The proportion of correctly rejected null hypotheses is an estimate of the power of the test

m11 <- colSums(pval[(m0+1):m,] < alpha)
mean(m11/m1)
## [1] 0.80705

The False Discovery Rate is estimated as the proportion of false discoveries

mean(m01/(m01+m11))
## [1] 0.5952523

This means that among the significant results, about 60% of them are false discoveries.

Let us now apply the Bonferroni correction, and compute the proportion of wrongly and correctly rejected null hypotheses and the proportion of false discoveries

pval.b <- apply(pval,2, function(pval) {p.adjust(pval,method="bonferroni")})
m01.b <- colSums(pval.b[1:m0,] < alpha)
m11.b <- colSums(pval.b[(m0+1):m,] < alpha)
c(mean(m01.b/m0), mean(m11.b/m1),mean(m01.b/(m01.b+m11.b),na.rm=TRUE))
## [1] 0.001283333 0.180950000 0.037476158

Very few of the true null hypotheses are rejected, which may be a good point, but the price to pay is a very low power (less that 20%).

The familywise error rate (FWER) is the probability \(\prob{m_{01}\geq 1}\) to reject at least one of the true null hypothesis. This probability remains quite small when the Bonferroni correction is used.

mean(m01.b>=1)
## [1] 0.148

The Benjamini-Hochberg correction increases the power and controls the FDR as expected since the proportion of false discoveries remains below the level \(\alpha\).

pval.bh <- apply(pval,2, function(pval) {p.adjust(pval,method="BH")})
m01.bh <- colSums(pval.bh[1:m0,] < alpha)
m11.bh <- colSums(pval.bh[(m0+1):m,] < alpha)
c(mean(m01.bh/m0), mean(m11.bh/m1),mean(m01.bh/(m01.bh+m11.bh),na.rm=TRUE))
## [1] 0.01786667 0.43115000 0.17250494

Lastly, we can slightly improve the BH procedure by multiplying the \(p\)-values by the ratio \(\hat{m}_{0\cdot}/m\)

pval.bhc <- pval.bh
for (l in (1:L)) {
  m0e <- sum(pval.bh[,l]>alpha)
  pval.bhc[,l] <- pval.bh[,l]*m0e/m
}

m01.bhc <- colSums(pval.bhc[1:m0,] < alpha)
m11.bhc <- colSums(pval.bhc[(m0+1):m,] < alpha)
c(mean(m01.bhc/m0), mean(m11.bhc/m1),mean(m01.bhc/(m01.bhc+m11.bhc),na.rm=TRUE))
## [1] 0.02068333 0.44910000 0.18481846

Benjamini, Y., and Y. Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57: 289-300.

Garc??a-Arenzana, N., E.M. Navarrete-Mu??oz, V. Lope, P. Moreo, S. Laso-Pablos, N. Ascunce, F. Casanova-G??mez, C. S??nchez-Contador, C. Santamari??a, N. Aragon??s, B.P. G??mez, J. Vioque, and M. Poll??n. 2014. Calorie intake, olive oil consumption and mammographic density among Spanish women. International journal of cancer 134: 1916-1925.