the biological factors, and if necessary technical factors, that explain the relationships
between the experiments, plus a random-noise term. The p-values (derived from t-statistics) that quantify the evidence for rejecting a null effect of each factor on a given gene are then computed from the ratio between the fold-change and its standard error. A p-value (from an F-statistic) can also be computed to assess whether any of the factors is associated with a non-zero effect. The linear model framework makes it possible to account for block designs, such as time series where measurements at different time-points are made on the same biological replicate.
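To make the framework concrete, here is a minimal sketch in Python of a per-gene linear model fit with ordinary t- and F-statistics. It covers fixed effects only (block designs would require modelling the correlation between repeated measurements), and the matrices Y and X and the function name are illustrative assumptions, not any particular package's API.

```python
# Sketch: per-gene linear modelling; Y, X and all names are illustrative.
import numpy as np
from scipy import stats

def fit_gene_models(Y, X):
    """Y: genes x arrays log-expression matrix; X: arrays x factors design
    matrix, whose first column is assumed to be an intercept."""
    n, k = X.shape                        # n hybridizations, k factors
    B = Y @ np.linalg.pinv(X).T           # per-gene coefficients (log fold-changes)
    resid = Y - B @ X.T
    df = n - k                            # residual degrees of freedom
    s2 = (resid ** 2).sum(axis=1) / df    # per-gene residual variance
    # Standard error of each coefficient: sqrt(s2 * diag((X'X)^-1)).
    se = np.sqrt(s2[:, None] * np.diag(np.linalg.inv(X.T @ X))[None, :])
    t = B / se                            # fold-change over its standard error
    p_t = 2 * stats.t.sf(np.abs(t), df)   # two-sided p-value per factor
    # F-statistic: do any non-intercept factors have a non-zero effect?
    rss_full = (resid ** 2).sum(axis=1)
    rss_null = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    F = ((rss_null - rss_full) / (k - 1)) / (rss_full / df)
    p_F = stats.f.sf(F, k - 1, df)
    return B, p_t, p_F
```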
Despite this pre-existing and well-established framework, the availability of microarray data led to a number of methodological developments to account for the so-called “small n, large p” context, where n refers to the number of hybridizations and p to the number of genes. Two principal issues, in particular, need to be tackled.
The first is that when p-values are computed by treating the genes independently, some information on the variance between biological replicates may be lost if genes tend to exhibit similar behaviour. The loss is greater when the number of replicates used to estimate the variance is small. Several approaches have been proposed to borrow variance information from the whole data set when assessing differential expression for each individual gene. The most popular is the empirical Bayes method proposed by
Smyth (2004), which leads to so-called moderated p-values. It is based on the assumption that the gene-wise variances follow an inverse-gamma distribution, which may not always hold in practice (see, for instance, Bourgon et al., 2010). Another approach, which leads to differential expression statistics that do not rely on this assumption, has been proposed and can achieve better gene ranking (Opgen-Rhein and Strimmer, 2007), but assessing statistical significance becomes non-trivial as the method does not directly provide p-values. In practice, these methods tend to give more weight to the apparent fold-change and less to the apparent standard deviation in the differential expression statistic of each gene, thereby providing an intermediate between fold-change analysis and traditional ANOVA.
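As an illustration of this variance-borrowing idea, the sketch below shrinks each gene's variance toward a pooled value before forming a moderated t-statistic for a simple two-group comparison. The prior degrees of freedom d0 and the use of the median variance as the prior are simplifying assumptions made here; Smyth (2004) instead estimates both from the data by empirical Bayes.

```python
# Sketch of variance moderation for a two-group comparison; d0 and the
# median-variance prior are assumptions for illustration only.
import numpy as np
from scipy import stats

def moderated_t(group1, group2, d0=4.0):
    """group1, group2: genes x replicates arrays of log-expression."""
    n1, n2 = group1.shape[1], group2.shape[1]
    df = n1 + n2 - 2
    logfc = group1.mean(axis=1) - group2.mean(axis=1)
    # Pooled per-gene variance from both groups.
    ss = ((group1 - group1.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) \
       + ((group2 - group2.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    s2 = ss / df
    s0_sq = np.median(s2)                        # crude pooled prior variance
    s2_mod = (d0 * s0_sq + df * s2) / (d0 + df)  # shrunken variance
    t_mod = logfc / np.sqrt(s2_mod * (1 / n1 + 1 / n2))
    p = 2 * stats.t.sf(np.abs(t_mod), df + d0)   # extra d0 degrees of freedom
    return t_mod, p
```

With d0 = 0 this reduces to the ordinary t-statistic, while letting d0 grow pushes the statistic toward a pure fold-change ranking, which is exactly the intermediate behaviour described above.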
The second issue of the “small n, large p” context is the control of the number of false positives arising from multiple testing. If a nominal p-value cut-off appropriate for testing a single hypothesis, such as 0.05, is used, a large number of genes will be called positive even when none of the genes is differentially expressed (around 5% of the genes). Approaches to control this number are referred to in the literature as family-wise error rate control methods, the best known and simplest being the Bonferroni correction, where the p-value cut-off scales inversely with the number of hypotheses tested (genes). The methods that seek to control the number of false discoveries irrespective of the number of true discoveries are, however, typically much more conservative than is usually wanted in exploratory data analysis: in this context, making 5 false predictions is usually acceptable if they come alongside 95 true predictions, since the false discovery rate (FDR) remains limited (5% in this example). A number of methods have been developed to control the FDR by examining the distribution of p-values, and the expected FDR associated with a particular p-value cut-off is often referred to as the q-value. The simplest, conservative, and still very popular method was proposed by Benjamini and Hochberg (1995). More
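Both corrections discussed above can be sketched in a few lines; the function names here are illustrative. Bonferroni divides the cut-off by the number of tests, while the Benjamini-Hochberg procedure converts sorted p-values into q-values.

```python
# Sketch: Bonferroni cut-off and Benjamini-Hochberg q-values for a
# vector of per-gene p-values; names are illustrative.
import numpy as np

def bonferroni_cutoff(alpha, n_tests):
    # Family-wise error rate control: cut-off scales inversely with n_tests.
    return alpha / n_tests

def bh_qvalues(pvals):
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i
    # Enforce monotonicity from the largest p-value downwards.
    q = np.minimum.accumulate(ranked[::-1])[::-1]
    q_full = np.empty(m)
    q_full[order] = np.clip(q, 0.0, 1.0)
    return q_full
```

Calling all genes with bh_qvalues(p) <= 0.05 positive then corresponds to an expected FDR of 5%, as in the 95-true/5-false example above.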