that contribute to the successful mining of transcriptomic data. We refer to thoughtful
reviews for the reader interested in more principled network modelling, including
Boolean, differential equation and Bayesian networks (Hecker et al., 2009;
Lee and Tzou, 2009).
The most direct approach to putting expression data into the context of gene
regulatory networks is to analyze gene sets built on prior knowledge of the biological
networks. The simplest implementation of this idea is to compare a list of genes
established from the analysis of an expression data set to pre-established gene sets
representative of regulons, functional categories or metabolic pathways such as the
functional classification of B. subtilis genes provided by SubtiWiki (http://subtiwiki.uni-goettingen.de/,
Mäder et al., 2012). The statistical significance of the overlap is assessed with
the Fisher exact test or, equivalently, the hypergeometric
model. More sophisticated approaches have been proposed that bypass the need to
select a cut-off value defining the list of differentially expressed genes (Nam and
Kim, 2008; Ackermann and Strimmer, 2009). These approaches can be useful when
too few genes can be identified with a reasonable FDR control, because the data
collected for a gene set are pooled to provide greater statistical power. Gene set
approaches can also be relevant when the number of differentially expressed genes is
large (depending on the null hypothesis selected, see the classification introduced in
Tian et al., 2005), if the focus of the analysis is the comparison of the profiles of
differential expression between gene sets rather than the detection of differential
expression.
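The overlap test described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `enrichment_p_value` and its parameter names are assumptions introduced here, and the numbers in the usage line are invented. It computes the upper tail of the hypergeometric distribution, which corresponds to the one-sided Fisher exact test for over-representation.

```python
from math import comb


def enrichment_p_value(n_genome, n_set, n_list, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_genome  : total number of genes considered (the universe)
    n_set     : size of the predefined gene set (e.g. a regulon)
    n_list    : size of the gene list derived from the expression data
    n_overlap : number of genes shared by the list and the set
    """
    # The overlap cannot exceed the smaller of the two groups.
    upper = min(n_set, n_list)
    # Number of ways to draw n_list genes from the whole genome.
    total = comb(n_genome, n_list)
    # Number of draws sharing at least n_overlap genes with the set.
    tail = sum(
        comb(n_set, k) * comb(n_genome - n_set, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 4000 genes, a 40-gene regulon, a list of
# 100 differentially expressed genes, 8 of which belong to the regulon.
p = enrichment_p_value(4000, 40, 100, 8)
```

When many gene sets are tested against the same list, the resulting p-values would normally need a multiple-testing correction, such as an FDR-controlling procedure.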
Clustering is the approach by which one seeks to identify groups among elements
that, in the context of array experiments, can either be the genes or the experiments
(i.e. conditions or individual replicates). As mentioned earlier, the clustering of the
experiments is virtually always useful as a data quality check. However, the clustering
of the genes is generally more relevant from a systems biology perspective (Grant
et al., 2007). The most popular algorithms for clustering rely on the explicit or
implicit definition of a distance between elements (most often Euclidean or based
on correlation coefficients) and can either optimize the clustering with respect to
some criterion for a given number of classes (for instance, k-means) or build a
hierarchical tree that simultaneously defines clusters for any number of classes
(hierarchical clustering). The very popular heatmap representation is obtained by colour
coding the expression matrix after reordering rows and/or columns by hierarchical
clustering (Figure 6.5, Eisen et al., 1998). Clustering approaches based on information
theory (Slonim et al., 2005), mixture models (Ghosh and Chinnaiyan, 2002) and
graph theory (Sharan et al., 2003) are also used. Evaluation of the clustering methods
from a biology standpoint is difficult and typically relies on measuring the overlap
with predetermined gene sets as indicated above (Handl et al., 2005). The main risk
with clustering probably arises from the wide diversity of available approaches,
which makes it easy to select the particular procedure that provides a
“desired” result instead of a reliable representation of the data. It is therefore crucial
to critically assess and understand the results. It is, for example, wise to check the
robustness of the conclusions with respect to sampling or to the clustering approach.
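To make the distance-based clustering idea concrete, here is a minimal sketch of agglomerative hierarchical clustering of genes with a correlation-based distance. Everything in it is an assumption made for illustration: the function names, the choice of average linkage, the distance 1 - r, and the four toy profiles with hypothetical gene names. Cutting the tree at a given number of clusters is emulated by stopping the merging early.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_linkage_clusters(profiles, n_clusters):
    """Merge genes bottom-up until n_clusters groups remain.

    profiles : dict mapping a gene name to its list of expression values
    Distance between genes is 1 - Pearson correlation; distance between
    clusters is the mean of all pairwise gene distances (average linkage).
    """
    names = sorted(profiles)
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[frozenset((a, b))]
                         for a in clusters[i] for b in clusters[j]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        # Merge the two closest clusters and drop the absorbed one.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Toy data with hypothetical gene names: two induced, two repressed profiles.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}
groups = average_linkage_clusters(toy, 2)
```

On this toy input the induced genes and the repressed genes end up in separate groups. Rerunning the same function on a subset of the conditions, or with a different distance, is one simple way to probe the robustness mentioned above.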