January 12th, 2020

When we perform a large number of statistical tests, some will have \(p\)-values less than 0.05 purely by chance, even if all the null hypotheses are really true.

More precisely, if we do a large number \(m\) of statistical tests, and for \(m_{0\cdot}\) of them the null hypothesis is actually true, we would expect about 5% of the \(m_{0\cdot}\) tests to be significant at the 0.05 level, just due to chance: these significant results are false discoveries (or false positives). On the other hand, if some alternative hypothesis are true, we can miss some of them: these non significant results are false negatives.

\[ \begin{array}{c|c|c|c} & \text{Null hypothesis true} & \text{Alternative hypothesis true} & \\ \hline \text{Test non significant} & m_{00} & m_{10} & m_{\cdot 0} \\ \hline \text{Test significant} & m_{01} & m_{11} & m_{\cdot 1} \\ \hline & m_{0 \cdot} & m_{1 \cdot} & m \end{array} \]

If important conclusions and decisions are based on these false positives, it is then important to control the *family-wise error rate* (FWER):

- the family-wise error rate is the probability of making one or more false discoveries, or type I errors when performing multiple hypotheses tests

\[ FWER = \prob{m_{01}\geq 1} \]

When true positives are expected, it is possible to miss some of them. We then necessarily need to accept false positives if we want to limit the number of these false negatives. It is important in such situation to control the false discovery rate (FDR)

- The false discovery rate (FDR) is the expected proportion of false discoveries among the discoveries

\[FDR = \esp{\frac{m_{01}}{m_{01} + m_{11}}} = \esp{\frac{m_{01}}{m_{\cdot1} }}\]

Several procedures exist for controlling either the FWER or the FDR.

The health effects of a Roundup-tolerant genetically modified maize, cultivated with or without Roundup, and Roundup alone, were studied during a 2 years study in rats.

For each sex, one control group had access to plain water and standard diet from the closest isogenic non-transgenic maize control; six groups were fed with 11, 22 and 33% of GM NK603 maize either treated or not with Roundup. The final three groups were fed with the control diet and had access to water supplemented with different concentrations of Roundup.

A sample of 200 rats including 100 males and 100 females was randomized into 20 groups of 10 rats of the same sex. Within each group, rats received the same diet. For each sex, there are therefore nine experimental groups and one control group.

The file

See the opinion of the Haut Conseil des Biotechnologies for more information about this study.

Here is a summary of the data,

```
## time regimen gender
## Min. :100.0 control :20 female:100
## 1st Qu.:590.0 NK603-11% :20 male :100
## Median :660.0 NK603-11%+R:20
## Mean :631.6 NK603-22% :20
## 3rd Qu.:720.0 NK603-22%+R:20
## Max. :720.0 NK603-33% :20
## (Other) :80
```

```
## [1] "control" "NK603-11%" "NK603-11%+R" "NK603-22%" "NK603-22%+R" "NK603-33%" "NK603-33%+R"
## [8] "RoundUp A" "RoundUp B" "RoundUp C"
```

One objective of this study is the comparison of the survival between the control group and the experimental groups.

Consider for instance the control group of females

`## [1] 540 645 720 720 720 720 720 720 720 720`

Only 2 rats of this group died before the end of the experiment. On the other hand, 7 females of the group fed with 22% of maize NK693 died during the experiment.

`## [1] 290 475 480 510 550 555 650 720 720 720`

A negative effect of the diet on the survival means that the rats of the experimental group tend to die before those of the control group. Then, we would like to test \[\begin{aligned} & H_0: \ ``\text{the NK603 22% diet has no effect on the survival of female rats }" \\ \text{versus } & H_1: \ ``\text{the NK603 22% diet leads to decreased survival time for female rats}" \end{aligned}\]

In terms of survival functions, that means that, under \(H_1\), the probability to be alive at a given time \(t\) is lower for a rat of the experimental group than for a rat of the control group. We then would like to test \[\begin{aligned} & H_0: \ ``\prob{T_{\rm test}>t} = \prob{T_{\rm control}>t}, \text{ for any } t>0" \\ \text{versus } & H_1: \ ``\prob{T_{\rm test}>t} < \prob{T_{\rm control}>t}, \text{ for any } t>0" \end{aligned}\]

Because of the (right) censoring process, we cannot just compare the mean survival times using a \(t\)-test. On the other hand, we can use the Wilcoxon-Mann-Whitney test which precisely aims to compare the ranks of the survival times in both groups.

```
##
## Wilcoxon rank sum test with continuity correction
##
## data: data.test$time and data.control$time
## W = 22, p-value = 0.01144
## alternative hypothesis: true location shift is less than 0
```

Here, the \(p\)-value should lead us to reject the null hypothesis and conclude that 22% of the GM maize in the diet has a negative effect on the survival.

Should we really accept this conclusion as it stands? No, because we don’t know the whole story… Remember that there are 9 experimental groups for each sex. Then, 18 comparisons with the control groups are performed.

```
library(ggplot2) ; theme_set(theme_bw())
pval <- stat <- gender <- regimen <- NULL
for (g in levels(data$gender)) {
data.control <- subset(data, regimen=="control" & gender==g)
for (r in levels(data$regimen)) {
if (r != "control") {
data.test <- subset(data, gender==g & regimen==r)
wt <- wilcox.test(data.test$time, data.control$time, alternative="less")
pval <- c(pval,wt$p.value)
stat <- c(stat,wt$statistic)
gender <- c(gender, g)
regimen <- c(regimen, r)
}
}
}
R <- data.frame(gender=gender, regimen=regimen, stat=stat, p.value=pval)
R <- R[order(R$p.value),]
R
```

```
## gender regimen stat p.value
## 3 female NK603-22% 22.0 0.01143779
## 4 female NK603-22%+R 25.0 0.02131745
## 9 female RoundUp C 26.5 0.02845455
## 1 female NK603-11% 33.0 0.07166095
## 8 female RoundUp B 34.0 0.08459162
## 7 female RoundUp A 34.5 0.09156610
## 6 female NK603-33%+R 36.0 0.11556850
## 5 female NK603-33% 39.0 0.17583926
## 2 female NK603-11%+R 40.0 0.18797769
## 14 male NK603-33% 40.5 0.24733242
## 13 male NK603-22%+R 42.0 0.28493944
## 11 male NK603-11%+R 50.0 0.51508635
## 15 male NK603-33%+R 51.0 0.54532608
## 18 male RoundUp C 54.5 0.64764235
## 17 male RoundUp B 58.5 0.75283120
## 12 male NK603-22% 64.0 0.86465902
## 10 male NK603-11% 74.5 0.97160483
## 16 male RoundUp A 75.5 0.97688776
```

Let us plot the ordered p-values:

```
ggplot(data=R) + geom_point(aes(x=1:18,color=regimen,y=p.value,shape=gender), size=4) +
scale_y_log10() + xlab("regimen") + ylab("p-value") + scale_x_continuous(breaks=NULL)
```

If we then decide to only report the largest observed differences, associated to the smallest \(p\)-values, how can we conclude that these differences are statistically significant?

A permutation test (also called a randomization test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance levels. Prediction intervals can also be derived.

In our example, imagine that the null hypothesis is true. We can then randomly exchange the labels (i.e. the regimen) and perform the 18 comparisons between the experimental groups and the control groups.

```
set.seed(100)
dperm.m <- subset(data, gender=="male")
n.m <- dim(dperm.m)[1]
dperm.m$regimen <- dperm.m$regimen[sample(n.m)]
dperm.f <- subset(data, gender=="female")
n.f <- dim(dperm.f)[1]
dperm.f$regimen <- dperm.f$regimen[sample(n.f)]
dperm <- rbind(dperm.m,dperm.f)
pval <- gender <- regimen <- NULL
for (g in levels(dperm$gender)) {
dperm.control <- subset(dperm, regimen=="control" & gender==g)
for (r in levels(dperm$regimen)) {
if (r != "control") {
dperm.test <- subset(dperm, gender==g & regimen==r)
wt <- wilcox.test(dperm.test$time, dperm.control$time, alternative="less")
pval <- c(pval,wt$p.value)
gender <- c(gender, g)
regimen <- c(regimen, r)
}
}
}
R.p <- data.frame(gender=gender, regimen=regimen, p.value=pval)
Ro.p <- R.p[order(R.p$p.value),]
```

The test statistics and the \(p\)-values now really behave how they are supposed to behave under the null hypothesis

```
ggplot(data=Ro.p) + geom_point(aes(x=1:18,color=regimen,y=p.value,shape=gender), size=4) +
scale_y_log10() + xlab("regimen") + ylab("p-value") + scale_x_continuous(breaks=NULL)
```

If we now repeat the same experiment using many different permutations, we will be able to estimate the \(m\) distributions of the \(m\) test statistics as well as the \(m\) distributions of the \(m\) \(p\)-values under the null hypothesis.

```
L <- 1000
dperm.m <- subset(data, gender=="male")
n.m <- dim(dperm.m)[1]
dperm.f <- subset(data, gender=="female")
n.f <- dim(dperm.f)[1]
PV <- ST <- NULL
for (l in (1:L)) {
dperm.m$regimen <- dperm.m$regimen[sample(n.m)]
dperm.f$regimen <- dperm.f$regimen[sample(n.f)]
dperm <- rbind(dperm.m,dperm.f)
pval <- stat <- NULL
for (g in levels(dperm$gender)) {
dperm.control <- subset(dperm, regimen=="control" & gender==g)
for (r in levels(dperm$regimen)) {
if (r != "control") {
dperm.test <- subset(dperm, gender==g & regimen==r)
wt <- wilcox.test(dperm.test$time, dperm.control$time, alternative="less")
pval <- c(pval,wt$p.value)
stat <- c(stat,wt$statistic)
}
}
}
PV <- rbind(PV,sort(pval))
ST <- rbind(ST,sort(stat))
}
```

We can estimate, for instance, prediction intervals of level 90% for the \(m=18\) ordered \(p\)-values

```
q <- apply(PV, MARGIN = 2, quantile, probs = c(0.05, 0.5, 0.95))
q <- as.data.frame(t(q))
names(q) <- c("low","median","up")
q$rank <- 1:dim(q)[1]
```

and plot them…

```
pl <- ggplot(data=q) +
geom_errorbar(aes(x=rank, ymin=low, ymax=up), width=0.2,size=1.5,colour="grey50") +
scale_y_log10() + xlab("regimen") + ylab("p-value") + scale_x_continuous(breaks=NULL)
pl
```

… with the original \(p\)-values

Here, all the \(p\)-values, including the smallest ones, belong to the 90% prediction intervals: all the observed \(p\)-values behave individually how they are expected to behave under the null hypothesis.

In particular, when 18 comparisons are performed, it’s not unlikely under the null hypothesis to obtain a smallest p-value less than or equal to the observed one (0.011).

The probability of such event can easily be estimated by Monte Carlo simulation. Let \(p_{(1),\ell}\) be the smallest p-value obtained from the \(\ell\)-th replicate of the Monte Carlo. Then,

\[ \prob{p_{(1)} \leq p_{(1)}^{\rm obs}} \simeq \frac{1}{L} \sum_{\ell=1}^{L} \one\left\{p_{(1),\ell} \leq p_{(1)}^{\rm obs}\right\} \]

`## [1] 0.152`

Imagine that we perform \(m\) comparisons and that all the \(m\) null hypotheses are true, i.e. \(m=m_{0\cdot}\).. If we use the same significance level \(\alpha_m\) for the \(m\) tests, how should we choose \(\alpha_m\) in order to control the family-wise error rate (FWER)?

\[ \begin{aligned} {\rm FWER} &= \prob{m_{01}\geq 1} \\ &= 1 - \prob{m_{01}= 0} \\ &= 1 - (1-\alpha_m)^m \end{aligned} \]

Then, if we set FWER\(=\alpha\), the significance level for each individual test should be \[\begin{aligned} \alpha_m &= 1 - (1-\alpha)^{\frac{1}{m}} \quad \quad (\text{Sidak correction})\\ & \simeq \frac{\alpha}{m} \quad \quad (\text{Bonferroni correction}) \end{aligned}\]

Let \(p_k\) be the \(p\)-value of the \(k\)-th test. Using the Bonferroni correction, the \(k\)-th test is significant if \[ p_k \leq \alpha_m \quad \Longleftrightarrow \quad m \, p_k \leq \alpha \] We can then either compare the original \(p\)-value \(p_k\) to the corrected significance level \(\alpha/m\), or compare the *adjusted* \(p\)-value \(p_k^{\rm (bonferroni)} =\min(1, m \, p_k)\) to the critical value \(\alpha\).

```
## gender regimen stat p.value pv.bonf
## 3 female NK603-22% 22.0 0.01143779 0.2058803
## 4 female NK603-22%+R 25.0 0.02131745 0.3837141
## 9 female RoundUp C 26.5 0.02845455 0.5121818
## 1 female NK603-11% 33.0 0.07166095 1.0000000
## 8 female RoundUp B 34.0 0.08459162 1.0000000
## 7 female RoundUp A 34.5 0.09156610 1.0000000
## 6 female NK603-33%+R 36.0 0.11556850 1.0000000
## 5 female NK603-33% 39.0 0.17583926 1.0000000
## 2 female NK603-11%+R 40.0 0.18797769 1.0000000
## 14 male NK603-33% 40.5 0.24733242 1.0000000
## 13 male NK603-22%+R 42.0 0.28493944 1.0000000
## 11 male NK603-11%+R 50.0 0.51508635 1.0000000
## 15 male NK603-33%+R 51.0 0.54532608 1.0000000
## 18 male RoundUp C 54.5 0.64764235 1.0000000
## 17 male RoundUp B 58.5 0.75283120 1.0000000
## 12 male NK603-22% 64.0 0.86465902 1.0000000
## 10 male NK603-11% 74.5 0.97160483 1.0000000
## 16 male RoundUp A 75.5 0.97688776 1.0000000
```

Using the Bonferroni correction, none of the 18 comparisons is significant.

**Remark:** the function `p.adjust`

proposes several adjustements of the \(p\)-values for multiple comparisons, including the Bonferroni adjustment:

```
## [1] 0.2058803 0.3837141 0.5121818 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [10] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
```

The Bonferroni correction is appropriate when a single false positive in a set of tests would be a problem. It is mainly useful when there are a fairly small number of multiple comparisons and very few of them might be significant. The main drawback of the Bonferroni correction is its lack of power: it may lead to a very high rate of false negatives.

Assume, for instance that all the female rats of all the experimental groups die before the end of the experiment. Each of the \(m=18\) original tests is significant, but the Bonferroni correction would lead to considering none of these tests as significant:

```
x <- subset(data, regimen=="control" & gender=="female")$time
y <- seq(610,700,length=10)
c(wilcox.test(x,y,"greater")$p.value, wilcox.test(x,y,"greater")$p.value*18)
```

`## [1] 0.004444026 0.079992464`

(Example from Handbook of Biological Statistics)

Garc?a-Arenzana et al. (2014) tested associations of 25 dietary variables with mammographic density, an important risk factor for breast cancer, in Spanish women. They found the following results

```
## dietary p.value
## 1 Total calories 0.001
## 2 Olive oil 0.008
## 3 Whole milk 0.039
## 4 White meat 0.041
## 5 Proteins 0.042
## 6 Nuts 0.061
## 7 Cereals and pasta 0.074
## 8 White fish 0.205
## 9 Butter 0.212
## 10 Vegetables 0.216
## 11 Skimmed milk 0.222
## 12 Red meat 0.251
## 13 Fruit 0.269
## 14 Eggs 0.275
## 15 Blue fish 0.340
## 16 Legumes 0.341
## 17 Carbohydrates 0.384
## 18 Potatoes 0.569
## 19 Bread 0.594
## 20 Fats 0.696
## 21 Sweets 0.762
## 22 Dairy products 0.940
## 23 Semi-skimmed milk 0.942
## 24 Total meat 0.975
## 25 Processed meat 0.986
```

We can see that five of the variables show a significant \(p\)-value (<0.05). However, because Garc?a-Arenzana et al. (2014) tested 25 dietary variables, we would expect one or two variables to show a significant result purely by chance, even if diet had no real effect on mammographic density.

Applying the Bonferroni correction, we divide \(\alpha=0.05\) by the number of tests (\(m=25\)) to get the Bonferroni critical value, so a test would have to have \(p<0.002\) to be significant. Under that criterion, only the test for total calories is significant.

An alternative approach is to control the false discovery rate, i.e the expected proportion of ``discoveries" (significant results) that are actually false positives. FDR control offers a way to increase power while maintaining some principled bound on error.

Imagine for instance that we compare expression levels for 20,000 genes between liver tumors and normal liver cells. We are going to do additional experiments on any genes that show a significant difference between the normal and tumor cells. Then, because we don’t want to miss genes of interest, we are willing to accept up to 25% of the genes with significant results being false positives. We’ll find out they’re false positives when we do the followup experiments. In this case, we would set the false discovery rate to 25%.

The Benjamini-Hochberg (BH) procedure controls the FDR… and it is simple to use!

Indeed, for a given \(\alpha\) and a given sequence of **ordered** \(p\)-values \(P_{(1)}\), \(P_{(2)}\), , \(P_{(m)}\), it consists in computing the \(m\) *adjusted* \(p\)-values defined as

\[ P_{(i)}^{\rm BH} = \min\left( P_{(i)}\frac{m}{i} \ , \ P_{(i+1)}^{\rm BH} \right)\]

```
p.bh <- data$p.value
m <- length(p.bh)
for (i in ((m-1):1))
p.bh[i] <- min(data$p.value[i]*m/i , p.bh[i+1])
data$p.bh <- p.bh
data
```

```
## dietary p.value p.bh
## 1 Total calories 0.001 0.0250000
## 2 Olive oil 0.008 0.1000000
## 3 Whole milk 0.039 0.2100000
## 4 White meat 0.041 0.2100000
## 5 Proteins 0.042 0.2100000
## 6 Nuts 0.061 0.2541667
## 7 Cereals and pasta 0.074 0.2642857
## 8 White fish 0.205 0.4910714
## 9 Butter 0.212 0.4910714
## 10 Vegetables 0.216 0.4910714
## 11 Skimmed milk 0.222 0.4910714
## 12 Red meat 0.251 0.4910714
## 13 Fruit 0.269 0.4910714
## 14 Eggs 0.275 0.4910714
## 15 Blue fish 0.340 0.5328125
## 16 Legumes 0.341 0.5328125
## 17 Carbohydrates 0.384 0.5647059
## 18 Potatoes 0.569 0.7815789
## 19 Bread 0.594 0.7815789
## 20 Fats 0.696 0.8700000
## 21 Sweets 0.762 0.9071429
## 22 Dairy products 0.940 0.9860000
## 23 Semi-skimmed milk 0.942 0.9860000
## 24 Total meat 0.975 0.9860000
## 25 Processed meat 0.986 0.9860000
```

Notice that we could equivalently use the function `p.adjust`

:

```
## [1] 0.0250000 0.1000000 0.2100000 0.2100000 0.2100000 0.2541667 0.2642857 0.4910714 0.4910714
## [10] 0.4910714 0.4910714 0.4910714 0.4910714 0.4910714 0.5328125 0.5328125 0.5647059 0.7815789
## [19] 0.7815789 0.8700000 0.9071429 0.9860000 0.9860000 0.9860000 0.9860000
```

Then, the discoveries, i.e. the significant tests, are those with an ajusted p-value less than \(\alpha\).

It can be shown that this procedure guarantees that for *independent tests*, and for *any alternative hypothesis*,

\[\begin{aligned} {\rm FDR} &= \esp{\frac{m_{01}}{m_{01} + m_{11}}} \\ &\leq \frac{m_{0\cdot}}{m} \alpha \\ &\leq \alpha \end{aligned}\] where \(m_{0\cdot}\) is the (unknown) total number of true null hypotheses, and where the first inequality is an equality with continuous \(p\)-value distributions.

In our example, the first five tests would be significant with \(\alpha=0.25\), which means that we expect no more than 25% of these 5 tests to be false discoveries.

**Remark 1:** The BH procedure is equivalent to consider as significant the non adjusted \(p\)-values smaller than a threshold \(P_{\rm BH}\) defined as

\[ P_{\rm BH} = \max_i \left\{ P_{(i)}: \ \ P_{(i)} \leq \alpha \frac{i}{m} \right\} \] In other words, the largest \(p\)-value that has \(P_{(i)}<(i/m)\alpha\) is significant, and all of the \(P\)-values smaller than it are also significant, even the ones that aren’t less than their Benjamini-Hochber