Data mining for microbiologists - Methods in Microbiology


are perfectly sorted, the algorithm terminates; otherwise the same procedure is carried out for each of the new nodes.

The C4.5 algorithm has the advantage of producing a single classification for each data item, and because of its statistical basis, it is relatively robust to noisy data. It tends to produce short trees, with high information gain near the root, a generally desirable characteristic. Decision trees also carry out a form of feature selection, since only the most informative variables are included in the tree. From a practical point of view, the algorithm is easy to use once the requisite data has been prepared, and it produces results that are easy to understand. Other widely used decision tree algorithms include the Chi-squared Automatic Interaction Detector (Kass, 1980) and Multivariate Adaptive Regression Splines (Friedman, 1991).
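The idea of placing high information gain near the root can be illustrated with a minimal sketch. It computes the plain information gain of each candidate attribute and picks the best one for the root split. Note that C4.5 itself normalises this quantity into a gain ratio, which is not shown here. The attribute names, class labels and toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on a categorical attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the subsets created by the split.
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: two categorical observations for four isolates.
rows = [
    {"gram": "positive", "motile": "yes"},
    {"gram": "positive", "motile": "no"},
    {"gram": "negative", "motile": "yes"},
    {"gram": "negative", "motile": "no"},
]
labels = ["A", "A", "B", "B"]

# The attribute with the highest gain would be used at the root of the tree.
best = max(["gram", "motile"], key=lambda a: information_gain(rows, labels, a))
```

In this toy case the hypothetical `gram` attribute separates the two classes perfectly, so it is selected ahead of `motile`, which carries no information about the class.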

Decision trees are useful for data where the input variables are either continuous or categorical, and the outputs are categorical. They have the advantage of being able to classify data into any one of multiple categories, but they do require a relatively large amount of data. Decision trees can be turned into sets of rules, which can then be incorporated into computer programmes, allowing the automated application of a trained decision tree to new data as it is generated.
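As a rough sketch of how a trained tree might be turned into a set of rules, the code below walks a tree stored as nested dictionaries and emits one rule per leaf, then applies those rules to a new data item. The tree representation, the attribute names and the class labels are all assumptions made for this example; real decision tree software uses its own formats.

```python
def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) rule for every leaf of a nested-dict tree."""
    if not isinstance(node, dict):  # a leaf simply holds the class label
        return [(list(conditions), node)]
    rules = []
    for value, child in node["branches"].items():
        test = (node["attribute"], value)
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules


def classify(rules, item):
    """Return the label of the first rule whose conditions all match the item."""
    for conditions, label in rules:
        if all(item.get(attr) == value for attr, value in conditions):
            return label
    return None  # no rule matched


# Hypothetical trained tree with placeholder attributes and class labels.
tree = {
    "attribute": "gram",
    "branches": {
        "positive": {
            "attribute": "catalase",
            "branches": {"positive": "class_A", "negative": "class_B"},
        },
        "negative": "class_C",
    },
}

rules = tree_to_rules(tree)
new_item = {"gram": "positive", "catalase": "negative"}
predicted = classify(rules, new_item)
```

Each rule is simply the list of tests on the path from the root to a leaf, so the rule set can be stored and applied without keeping the tree structure itself.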

Decision trees have been widely used in microbiology research, in areas such as microbial identification (Rattray et al., 1999; Ferdinand et al., 2004; Dieckmann and Malorny, 2011), determination of the phylogenetic group of Escherichia coli (Clermont et al., 2000), protein functional annotation (Azé et al., 2007), classification of regulatory phenotype (Bachmann et al., 2009), environmental monitoring and tracking the source of medically important microbes (Lyautey et al., 2007, 2010; Ballesté et al., 2010) and understanding transcriptional control (Singh et al., 2005; Nannapaneni et al., 2012).

Software Availability

C4.5: http://www.rulequest.com/Personal/

C5.0: http://rulequest.com/download.html (source code in C; will need to be compiled).

Simple Decision Tree: http://sourceforge.net/projects/decisiontree/

5.4 Clustering

Decision trees take a mass of data and try to sort it into discrete, meaningful categories. A similar approach is cluster analysis, which also attempts to group data into discrete categories, but without the aid of training data, and without necessarily specifying what these categories (clusters) mean.
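To make the contrast with supervised classification concrete, here is a minimal sketch of one common clustering method, k-means. The text above does not name a specific algorithm, so the choice of k-means, the number of clusters `k`, the simplistic initialisation and the toy measurements are all assumptions made for illustration. Note that no class labels are supplied: the groups emerge from the data alone.

```python
import math


def k_means(points, k, iterations=100):
    """Group points into k clusters; returns (centroids, assignments)."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic start: first k points
    assignments = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # nothing changed, so stop
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments


# Hypothetical two-dimensional measurements forming two loose groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.9, 1.1), (4.8, 5.1)]
centroids, assignments = k_means(points, k=2)
```

The cluster numbers returned in `assignments` are arbitrary identifiers. Deciding what each cluster means biologically is left to the analyst.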

The aim of clustering is to find subgroups within datasets that correspond to meaningful clusters in vivo. Cluster analysis is widely used in all fields of biology. Depending upon the dataset, clusters may have different interpretations. In a yeast two-hybrid dataset (Fields and Song, 1989) a cluster may represent a protein complex (Krogan et al., 2006), while in a more generalised interactome a cluster may
