Data mining for microbiologists - Methods in Microbiology

trees, which were then pruned to avoid overfitting. ILP uses the language of logic programs to describe examples and theories, and is as powerful and flexible as general-purpose programming languages such as Java. The ILP approach used was aimed at identifying frequent patterns in the data and using these patterns to identify clusters. The clusters were then converted into rules of the form:

IF A THEN B
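As a rough illustration of what a rule of this form looks like once it is applied to data, the sketch below represents a rule as a set of conditions (A) and a predicted class (B), then measures how many examples it covers and how often it is right. This is a propositional simplification, not ILP, and it is not the method used in the studies described here; the feature names, class labels and toy examples are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """IF every condition in `conditions` holds THEN predict `conclusion`."""
    conditions: frozenset
    conclusion: str

    def fires(self, features):
        # The rule applies when all of its conditions are among the features.
        return self.conditions <= features


def coverage_and_confidence(rule, examples):
    """Return (number of examples covered, fraction of those predicted correctly).

    `examples` is a list of (feature_set, true_label) pairs.
    """
    covered = [label for feats, label in examples if rule.fires(feats)]
    if not covered:
        return 0, 0.0
    correct = sum(1 for label in covered if label == rule.conclusion)
    return len(covered), correct / len(covered)


# Hypothetical toy data: each ORF is a set of features plus a functional class.
examples = [
    (frozenset({"has_homolog", "membrane_signal"}), "transport"),
    (frozenset({"has_homolog", "membrane_signal", "short"}), "transport"),
    (frozenset({"membrane_signal"}), "cell_wall"),
    (frozenset({"has_homolog", "membrane_signal"}), "cell_wall"),
]

rule = Rule(frozenset({"has_homolog", "membrane_signal"}), "transport")
coverage, confidence = coverage_and_confidence(rule, examples)
print(f"covers {coverage} ORFs, confidence {confidence:.2f}")
```

Here the rule fires on three of the four ORFs and is correct for two of them. Accuracy figures quoted for rule sets are typically computed in this spirit, on held-out data.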

The functions of 65% of the ORFs in M. tuberculosis and 24% of those in E. coli were predicted with 60-80% accuracy using this approach. The same group used mutant phenotype growth data to predict the functional class of ORFs in S. cerevisiae, again using a modified version of C4.5 to produce classification rules (Clare and King, 2002).

A similar approach used two different decision tree algorithms to learn rules for annotating proteins from Lactobacillus sakei with terms from a controlled vocabulary originally developed for B. subtilis (Moszer et al., 2002). The rules were learned from a training set of data for L. bulgaricus (Azé et al., 2007). This approach achieved a precision of 80.5% and a recall of 52.7%.
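For readers less familiar with these two measures, the sketch below shows how precision and recall are commonly computed for annotation predictions: precision is the fraction of predicted annotations that are correct, and recall is the fraction of true annotations that were recovered. The protein identifiers and terms are invented for illustration and do not come from the study.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for sets of (protein, term) annotations."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical predictions and reference annotations.
predicted = {("p1", "t1"), ("p2", "t2"), ("p3", "t1"), ("p4", "t3")}
actual = {
    ("p1", "t1"), ("p2", "t2"), ("p3", "t2"),
    ("p4", "t3"), ("p5", "t1"), ("p6", "t3"),
}

precision, recall = precision_recall(predicted, actual)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

A high precision paired with a lower recall, as reported above, means that the rules are usually right when they make a prediction but leave a substantial share of true annotations unassigned.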

Interactomes have been widely used for the prediction of protein function. The approaches taken usually involve constructing a network, clustering it, and then using cluster membership to predict protein function. There are many algorithms, ranging from simply taking the most frequent annotation amongst the neighbours of a protein (Schwikowski et al., 2000), to statistically based probabilistic methods (Letovsky, 2003; Joshi et al., 2005; Kao and Huang, 2010) and graph-theoretic methods (Nabieva et al., 2005). Machine learning techniques applied to interactomes include the calculation of Bayesian likelihood scores using homology information from other genomes (Date and Stoeckert, 2006), Markovian random field theory (Letovsky, 2003; Deng et al., 2004a,b) and Bayesian approaches (Jansen et al., 2003; Troyanskaya et al., 2003; Nariai et al., 2007).
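The simplest of these ideas, taking the most frequent annotation amongst a protein's neighbours, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reproduction of any published implementation: the interaction map, the protein names, the annotation terms and the alphabetical tie-breaking rule are all hypothetical.

```python
from collections import Counter


def majority_vote(protein, interactions, annotations):
    """Predict a function as the most frequent annotation among direct neighbours.

    interactions: dict mapping a protein to the set of proteins it interacts with.
    annotations:  dict mapping an annotated protein to its set of function terms.
    Returns None when no neighbour carries an annotation.
    """
    neighbours = interactions.get(protein, set())
    counts = Counter(
        term
        for neighbour in neighbours
        for term in annotations.get(neighbour, ())
    )
    if not counts:
        return None
    best = max(counts.values())
    # Assumption: ties are broken alphabetically so the result is deterministic.
    return min(term for term, count in counts.items() if count == best)


# Hypothetical toy network: YFG1 is un-annotated and has four annotated partners.
interactions = {"YFG1": {"A", "B", "C", "D"}}
annotations = {
    "A": {"transport"},
    "B": {"transport", "signalling"},
    "C": {"signalling"},
    "D": {"transport"},
}

prediction = majority_vote("YFG1", interactions, annotations)
print(prediction)
```

In this toy network "transport" appears on three neighbours and "signalling" on two, so "transport" is returned.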

One of the most effective approaches, however, turns out to be one of the simplest: propagation of functional labels to an un-annotated protein via the neighbour with the highest weight, considering either level 1 neighbours, level 2 neighbours, or both (Chua et al., 2006, 2007).
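A minimal sketch of label propagation via the highest-weight neighbour is given below. The text does not specify how weights are computed, so the sketch assumes that edge weights lie between 0 and 1 and, purely for illustration, scores a level 2 neighbour by the product of the two edge weights on the path to it. That scoring is an assumption of this example, not the weighting scheme of Chua et al. The function covers the "level 1 only" and "both levels" cases; all protein names and labels are hypothetical.

```python
def best_neighbour_label(protein, weights, annotations, use_level2=True):
    """Return the label of the highest-weight annotated neighbour, or None.

    weights:     dict mapping a protein to {neighbour: edge weight in [0, 1]}.
    annotations: dict mapping an annotated protein to its function label.
    """
    candidates = {}
    for n1, w1 in weights.get(protein, {}).items():
        # Level 1: direct interaction partners keep their edge weight.
        candidates[n1] = max(candidates.get(n1, 0.0), w1)
        if use_level2:
            for n2, w2 in weights.get(n1, {}).items():
                if n2 == protein:
                    continue
                # ASSUMPTION: a level 2 neighbour is scored by the product
                # of the two edge weights along the path (illustrative only).
                candidates[n2] = max(candidates.get(n2, 0.0), w1 * w2)

    annotated = {n: w for n, w in candidates.items() if n in annotations}
    if not annotated:
        return None
    # Ties on weight are broken by protein name to keep the result deterministic.
    best = max(annotated, key=lambda n: (annotated[n], n))
    return annotations[best]


# Hypothetical weighted network; Q is the un-annotated query protein.
weights = {
    "Q": {"A": 0.4, "B": 0.9},
    "A": {"Q": 0.4, "C": 0.5},
    "B": {"Q": 0.9, "D": 0.8},
    "C": {"A": 0.5},
    "D": {"B": 0.8},
}
annotations = {"A": "kinase", "C": "kinase", "D": "transporter"}

level1_only = best_neighbour_label("Q", weights, annotations, use_level2=False)
both_levels = best_neighbour_label("Q", weights, annotations, use_level2=True)
print(level1_only, both_levels)
```

In the toy network the strongest direct partner of Q (B) is itself un-annotated, so restricting the search to level 1 falls back on the weaker partner A, whereas including level 2 neighbours reaches D through B. This illustrates why indirect neighbours can be informative when direct partners lack annotation.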

An approach that combines interactome analysis with clustering, classification tree construction and rule inference was described by Brun and colleagues (Brun et al., 2004). These authors achieved an accuracy of between 58% and 64%, depending upon the subset of data considered.

Data mining has also been applied successfully to the prediction of protein function in metagenomic datasets. Such datasets are challenging because they generally come from a wide variety of organisms with different characteristics, such as GC content and codon usage bias. The majority of the organisms identified in most metagenomic studies are almost completely un-annotated, since more than 99% of prokaryotes in the environment cannot be cultured in the laboratory (Schloss and Handelsman, 2005). BLAST searches alone can therefore only provide limited information about protein function.
