are perfectly sorted, the algorithm terminates; otherwise the same procedure is carried
out for each of the new nodes.
The C4.5 algorithm has the advantage of producing a single classification for
each data item, and because of its statistical basis, it is relatively robust to noisy data.
It tends to produce short trees, with high information gain near the root, a generally
desirable characteristic. Decision trees also carry out a form of feature selection,
since only the most informative variables are included in the tree. From a practical
point of view, the algorithm is easy to use once the requisite data has been prepared,
and it produces results that are easy to understand. Other widely used decision tree
algorithms include the Chi-squared Automatic Interaction Detector (Kass, 1980)
and Multivariate Adaptive Regression Splines (Friedman, 1991).
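To make the idea of information gain concrete, the following minimal Python sketch computes the entropy-based gain for each candidate variable in a small, invented dataset. The variable names (gram_stain, motile) and the class labels are hypothetical. The sketch uses plain information gain on categorical variables only; C4.5 itself uses a normalised variant (the gain ratio) and also handles continuous variables, so this illustrates the principle and is not an implementation of C4.5.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four isolates, two candidate variables, two classes.
rows = [
    {"gram_stain": "negative", "motile": "yes"},
    {"gram_stain": "negative", "motile": "no"},
    {"gram_stain": "positive", "motile": "yes"},
    {"gram_stain": "positive", "motile": "no"},
]
labels = ["A", "A", "B", "B"]

gains = {f: information_gain(rows, labels, f) for f in ("gram_stain", "motile")}
# The variable with the highest gain would be placed at the root of the tree.
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

In this toy dataset gram_stain separates the two classes perfectly while motile carries no information about the class, so gram_stain would be chosen for the root node and motile would never enter the tree. This is the feature-selection effect described above.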
Decision trees are useful for data where the input variables are either continuous
or categorical, and the outputs are categorical. They have the advantage of being able
to classify data into any one of multiple categories, but they do require a relatively
large amount of data. Decision trees can be turned into sets of rules, which can then
be incorporated into computer programs, allowing the automated application of a
trained decision tree to new data as it is generated.
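As an illustration of how a tree can be turned into rules, the Python sketch below flattens a small tree into a list of if-then rules and applies them to a new data item. The nested-dictionary representation, the variable names (gram_stain, oxidase) and the class labels (group_1 to group_3) are assumptions made for this example; they do not correspond to the output format of C4.5 or of any other particular package.

```python
def tree_to_rules(tree, conditions=()):
    """Flatten a nested-dict tree into a list of (conditions, label) rules.

    Each root-to-leaf path becomes one rule; a leaf is any non-dict value.
    """
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    rules = []
    feature = tree["feature"]
    for value, subtree in tree["branches"].items():
        rules.extend(tree_to_rules(subtree, conditions + ((feature, value),)))
    return rules


def apply_rules(rules, item):
    """Return the label of the first rule whose conditions all match the item."""
    for conditions, label in rules:
        if all(item.get(feature) == value for feature, value in conditions):
            return label
    return None  # no rule matched


# Hypothetical trained tree with two internal nodes and three leaves.
tree = {
    "feature": "gram_stain",
    "branches": {
        "positive": "group_1",
        "negative": {
            "feature": "oxidase",
            "branches": {"positive": "group_2", "negative": "group_3"},
        },
    },
}

rules = tree_to_rules(tree)
for conditions, label in rules:
    tests = " AND ".join(f"{f} = {v}" for f, v in conditions)
    print(f"IF {tests} THEN {label}")

new_item = {"gram_stain": "negative", "oxidase": "negative"}
predicted = apply_rules(rules, new_item)
```

Because each rule is simply one path from the root to a leaf, the rule set classifies any item exactly as the tree would, but it can be embedded in other software without the tree-building code.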
Decision trees have been widely used in microbiology research, in areas such as
microbial identification (Rattray et al., 1999; Ferdinand et al., 2004; Dieckmann and
Malorny, 2011), determination of the phylogenetic group of Escherichia coli
(Clermont et al., 2000), protein functional annotation (Azé et al., 2007), classification
of regulatory phenotype (Bachmann et al., 2009), environmental monitoring and
tracking the source of medically important microbes (Lyautey et al., 2007, 2010;
Ballesté et al., 2010) and understanding transcriptional control (Singh et al.,
2005; Nannapaneni et al., 2012).
Software Availability
C4.5: http://www.rulequest.com/Personal/
C5.0: http://rulequest.com/download.html (source code in C; will need to be compiled)
Simple Decision Tree: http://sourceforge.net/projects/decisiontree/
5.4 Clustering
Decision trees take a mass of data and try to sort it into discrete, meaningful categories.
A similar approach is cluster analysis, which also attempts to group data into
discrete categories, but without the aid of training data, and without necessarily
specifying what these categories (clusters) mean.
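The contrast with decision trees can be seen in the following minimal Python sketch, which groups unlabelled points using single-linkage agglomerative clustering. This is only one simple clustering method, chosen here for brevity and not named in the text above; the data points and the choice of two clusters are invented for illustration.

```python
from itertools import combinations
from math import dist


def single_linkage(points, k):
    """Repeatedly merge the two closest clusters until only k remain.

    The distance between two clusters is taken as the distance between
    their closest pair of members (single linkage).
    """
    clusters = [[p] for p in points]  # start with every point on its own
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical measurements for six items; no class labels are supplied.
points = [
    (1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
    (5.0, 5.2), (5.1, 4.9), (4.8, 5.0),
]

clusters = single_linkage(points, k=2)
for number, members in enumerate(clusters, start=1):
    print(f"cluster {number}: {sorted(members)}")
```

The procedure recovers the two groups of points purely from the distances between them. It attaches no meaning to either group; interpreting what a cluster represents is left to the investigator.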
The aim of clustering is to find subgroups within datasets that correspond to
meaningful clusters in vivo. Cluster analysis is widely used in all fields of biology.
Depending upon the dataset, clusters may have different interpretations. In a yeast
two-hybrid dataset (Fields and Song, 1989), a cluster may represent a protein complex
(Krogan et al., 2006), while in a more generalised interactome a cluster may