trees, which were then pruned to avoid overfitting. ILP uses the language of logic
programs to describe examples and theories, and is as powerful and flexible as
general-purpose programming languages such as Java. The ILP approach used
was aimed at identifying frequent patterns in the data and using these patterns to
identify clusters. The clusters were then converted into rules of the form:
IF A THEN B
The functions of 65% of the ORFs in M. tuberculosis and 24% of those in E. coli were
predicted with 60-80% accuracy using this approach. The same group used mutant
phenotype growth data to predict the functional class of ORFs in S. cerevisiae, again
using a modified version of C4.5 to produce classification rules (Clare and King,
2002).
A similar approach used two different decision tree algorithms to learn rules for
the annotation of proteins from Lactobacillus sakei with terms from a controlled
vocabulary developed originally for B. subtilis (Moszer et al., 2002). The rules were
learned from a training set of data for L. bulgaricus (Azé et al., 2007). This approach
achieved a precision of 80.5%, with a recall of 52.7%.
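To make the two figures concrete: precision is the fraction of predicted annotations that are correct, and recall is the fraction of true annotations that were recovered. The sketch below computes both over sets of (protein, term) pairs. The function name and the toy data are hypothetical and are not taken from the study.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of (protein, term) pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against empty inputs rather than dividing by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical toy data for illustration only.
predicted = {("p1", "transport"), ("p2", "metabolism"), ("p3", "transport")}
actual = {
    ("p1", "transport"),
    ("p2", "metabolism"),
    ("p4", "regulation"),
    ("p5", "transport"),
}
precision, recall = precision_recall(predicted, actual)
```

A rule set with high precision but lower recall, as reported here, makes few wrong calls but leaves many proteins without a predicted annotation.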
Interactomes have been widely used for the prediction of protein function. The
approaches taken usually involve constructing a network, clustering it, and then
using cluster membership to predict protein function. There are many algorithms,
ranging from simply taking the most-frequent annotation amongst the neighbours
of a protein (Schwikowski et al., 2000), to statistically based probabilistic methods
(Letovsky, 2003; Joshi et al., 2005; Kao and Huang, 2010), and graph-theoretic
methods (Nabieva et al., 2005). Machine learning techniques applied to interactomes
include the calculation of Bayesian likelihood scores using homology information
from other genomes (Date and Stoeckert, 2006), Markov random field theory
(Letovsky, 2003; Deng et al., 2004a,b) and Bayesian approaches (Jansen et al.,
2003; Troyanskaya et al., 2003; Nariai et al., 2007).
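The simplest of these methods, taking the most frequent annotation amongst a protein's neighbours, can be sketched in a few lines. This is a minimal illustration, not the published implementation: the function name, the edge-list representation, the alphabetical tie-break, and the toy data are all assumptions.

```python
from collections import Counter


def majority_vote(protein, edges, annotations):
    """Predict a function for `protein` from its direct neighbours.

    edges: iterable of (protein_a, protein_b) undirected interactions.
    annotations: dict mapping protein -> list of known function terms.
    Returns the most frequent term, or None if no neighbour is annotated.
    """
    neighbours = set()
    for a, b in edges:
        if a == protein:
            neighbours.add(b)
        elif b == protein:
            neighbours.add(a)
    neighbours.discard(protein)  # ignore self-interactions

    counts = Counter(
        term for n in neighbours for term in annotations.get(n, ())
    )
    if not counts:
        return None
    best = max(counts.values())
    # Assumed tie-break: alphabetical order, to keep the result deterministic.
    return min(term for term, c in counts.items() if c == best)


# Hypothetical toy interactome.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
annotations = {"P2": ["kinase"], "P3": ["kinase", "transport"], "P4": ["kinase"]}
prediction = majority_vote("P1", edges, annotations)
```

The probabilistic and graph-theoretic methods cited above replace the raw count with a statistical score or a network-flow criterion, but they keep the same idea of inferring function from network context.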
One of the most effective approaches, however, turns out to be one of the
simplest: propagation of functional labels to an un-annotated protein via the
neighbour with the highest weight, drawn from level 1 neighbours, level 2
neighbours, or both (Chua et al., 2006, 2007).
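The idea of propagating labels from the highest-weighted level 1 or level 2 neighbour can be sketched as follows. This is an illustration under stated assumptions rather than the method of Chua et al.: edge weights are taken as given, a level 2 neighbour is scored by the product of the two edge weights on the path to it, and all names and data are hypothetical.

```python
def build_graph(weighted_edges):
    """Build an undirected adjacency map: protein -> {neighbour: weight}."""
    graph = {}
    for a, b, w in weighted_edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def candidate_weights(graph, protein, levels=(1, 2)):
    """Score level 1 and/or level 2 neighbours of `protein`."""
    level1 = graph.get(protein, {})
    weights = {}
    if 1 in levels:
        weights.update(level1)
    if 2 in levels:
        for mid, w1 in level1.items():
            for far, w2 in graph[mid].items():
                if far == protein or far in level1:
                    continue  # keep only proteins at distance exactly 2
                score = w1 * w2  # assumed path weight, not from the source
                weights[far] = max(score, weights.get(far, 0.0))
    return weights


def propagate_labels(graph, annotations, protein, levels=(1, 2)):
    """Copy the labels of the highest-weighted annotated neighbour."""
    weights = candidate_weights(graph, protein, levels)
    annotated = {p: w for p, w in weights.items() if annotations.get(p)}
    if not annotated:
        return None
    best = max(annotated, key=annotated.get)
    return set(annotations[best])


# Hypothetical toy network: Q is the un-annotated query protein.
graph = build_graph(
    [("Q", "A", 0.4), ("Q", "B", 0.9), ("B", "C", 0.8), ("A", "D", 0.5)]
)
annotations = {"A": {"transport"}, "C": {"DNA repair"}, "D": {"signalling"}}
from_level1 = propagate_labels(graph, annotations, "Q", levels=(1,))
from_level2 = propagate_labels(graph, annotations, "Q", levels=(2,))
from_both = propagate_labels(graph, annotations, "Q", levels=(1, 2))
```

In this toy network the strongest direct neighbour (B) is itself un-annotated, so including level 2 neighbours lets a label reach Q through B that level 1 alone would miss.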
An approach that combines interactome analysis with clustering, classification
tree construction and rule inference was described by Brun and colleagues
(Brun et al., 2004). These authors achieved an accuracy of between 58% and
64%, depending upon the subset of data considered.
Data mining has also been applied successfully to the prediction of protein
function in metagenomic datasets. Such datasets are challenging because they generally
come from a wide variety of organisms with different characteristics, such as GC
content and codon usage bias. The majority of the organisms identified in most
metagenomic studies are almost completely un-annotated, since more than 99% of
prokaryotes in the environment cannot be cultured in the laboratory (Schloss and
Handelsman, 2005). BLAST searches alone can therefore only provide limited
information about protein function.