trees, which were then pruned to avoid overfitting. ILP uses the language of logic
programs to describe examples and theories, and is as powerful and flexible as
general-purpose programming languages such as Java. The ILP approach used
was aimed at identifying frequent patterns in the data and using these patterns to
identify clusters. The clusters were then converted into rules of the form:
IF A THEN B
The functions of 65% of the ORFs in M. tuberculosis and 24% of those in E. coli
were predicted with 60-80% accuracy using this approach. The same group used
mutant phenotype growth data to predict the functional class of ORFs in
S. cerevisiae, again using a modified version of C4.5 to produce classification
rules (Clare and King, 2002).
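The conversion of a decision tree into IF A THEN B rules can be illustrated with a minimal sketch. It is not the modified C4.5 used by the cited authors: the toy tree, its attribute names and its class labels are all hypothetical, and the only idea shown is that each root-to-leaf path becomes one rule.

```python
# Minimal sketch: each root-to-leaf path of a decision tree becomes one rule.
# The tree, attribute names and class labels below are hypothetical examples.

def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) pair per root-to-leaf path."""
    if "label" in node:  # leaf node: emit the accumulated conditions
        return [(list(conditions), node["label"])]
    attr = node["attribute"]
    rules = tree_to_rules(node["yes"], conditions + ((attr, True),))
    rules += tree_to_rules(node["no"], conditions + ((attr, False),))
    return rules


def format_rule(conditions, label):
    """Render a rule in the IF A THEN B form."""
    tests = " AND ".join(
        attr if value else f"NOT {attr}" for attr, value in conditions
    )
    return f"IF {tests or 'TRUE'} THEN {label}"


toy_tree = {
    "attribute": "has_homolog_in_class_X",
    "yes": {"label": "class_X"},
    "no": {
        "attribute": "slow_growth_on_medium_M",
        "yes": {"label": "class_Y"},
        "no": {"label": "unknown"},
    },
}

rules = tree_to_rules(toy_tree)
for conditions, label in rules:
    print(format_rule(conditions, label))
```

Pruning, which the text mentions as a guard against overfitting, would shorten or merge some of these paths before the rules are written out; it is omitted here.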
A similar approach used two different decision tree algorithms to learn rules for
the annotation of proteins from Lactobacillus sakei with terms from a controlled
vocabulary developed originally for B. subtilis (Moszer et al., 2002). The rules
were learned from a training set of data for L. bulgaricus (Azé et al., 2007).
This approach achieved a precision of 80.5%, with a recall of 52.7%.
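Precision is the fraction of predicted annotations that are correct, and recall is the fraction of true annotations that were predicted. The sketch below shows one common way to compute both when each protein can carry several terms. The text does not say how the cited study aggregated its counts, so pooling the counts over all proteins (micro-averaging) is an assumption, and the protein and term identifiers are made up.

```python
# Minimal sketch of precision and recall for multi-term annotation.
# Assumption: counts are pooled over all proteins (micro-averaging).

def precision_recall(predicted, actual):
    """predicted, actual: dicts mapping protein id -> set of annotation terms."""
    tp = fp = fn = 0
    for protein in set(predicted) | set(actual):
        p = predicted.get(protein, set())
        a = actual.get(protein, set())
        tp += len(p & a)  # terms predicted and correct
        fp += len(p - a)  # terms predicted but wrong
        fn += len(a - p)  # true terms that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical data: one correct prediction, one wrong, three terms missed.
predicted = {"p1": {"t1"}, "p2": {"t2"}, "p3": set()}
actual = {"p1": {"t1"}, "p2": {"t3"}, "p3": {"t4"}, "p4": {"t1"}}
precision, recall = precision_recall(predicted, actual)
print(precision, recall)  # 0.5 0.25
```

A high precision paired with a lower recall, as reported above, is what a rule set produces when it commits to few annotations but is usually right about them.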
Interactomes have been widely used for the prediction of protein function. The
approaches taken usually involve constructing a network, clustering it, and then
using cluster membership to predict protein function. There are many algorithms,
ranging from simply taking the most-frequent annotation amongst the neighbours
of a protein (Schwikowski et al., 2000), to statistically based probabilistic
methods (Letovsky, 2003; Joshi et al., 2005; Kao and Huang, 2010), and
graph-theoretic methods (Nabieva et al., 2005). Machine learning techniques
applied to interactomes include the calculation of Bayesian likelihood scores
using homology information from other genomes (Date and Stoeckert, 2006), Markov
random field theory (Letovsky, 2003; Deng et al., 2004a,b) and Bayesian
approaches (Jansen et al., 2003; Troyanskaya et al., 2003; Nariai et al., 2007).
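The simplest of these schemes, taking the most frequent annotation amongst the neighbours of a protein, can be sketched in a few lines. The adjacency dictionary, protein identifiers and annotation terms below are hypothetical, and the `top_k` parameter is an illustrative addition rather than part of any cited method.

```python
from collections import Counter

# Minimal sketch of neighbour majority voting on an interaction network.
# All identifiers and terms are hypothetical.

def majority_vote(protein, interactions, annotations, top_k=1):
    """Return the top_k most frequent terms among the protein's neighbours."""
    counts = Counter()
    for neighbour in interactions.get(protein, ()):
        # Each neighbour contributes one vote per term it carries.
        counts.update(annotations.get(neighbour, ()))
    return [term for term, _ in counts.most_common(top_k)]


interactions = {"P1": ["P2", "P3", "P4"]}
annotations = {
    "P2": {"transport"},
    "P3": {"transport", "signalling"},
    "P4": {"metabolism"},
}
print(majority_vote("P1", interactions, annotations))  # ['transport']
```

Every neighbour counts equally here, which is the main limitation that the probabilistic and weighted methods try to address.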
One of the most effective approaches, however, turns out to be one of the
simplest: propagating functional labels to an un-annotated protein from its
highest-weighted neighbour, drawn from level 1 neighbours (direct interaction
partners), level 2 neighbours (partners of partners), or both (Chua et al.,
2006, 2007).
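Label propagation from the highest-weighted neighbour can be sketched as follows. This is not the weighting scheme of the cited authors. Three things are assumed purely for illustration: edge weights are supplied as input, a level 2 neighbour is scored by the product of the two edge weights on its best path, and proteins that are already level 1 neighbours are not counted again at level 2. All identifiers are hypothetical.

```python
# Minimal sketch of label transfer from the highest-weighted annotated neighbour.
# Assumptions: edge weights are given; a level 2 neighbour is scored by the
# product of the weights along its best two-step path.

def neighbour_weights(protein, weights, use_level1=True, use_level2=True):
    """weights: dict mapping protein -> {neighbour: edge weight}."""
    scores = {}
    level1 = weights.get(protein, {})
    if use_level1:
        scores.update(level1)
    if use_level2:
        for mid, w1 in level1.items():
            for far, w2 in weights.get(mid, {}).items():
                if far == protein or far in level1:
                    continue  # keep only proteins exactly two steps away
                scores[far] = max(scores.get(far, 0.0), w1 * w2)
    return scores


def propagate_labels(protein, weights, annotations, **levels):
    """Copy the labels of the highest-weighted annotated neighbour."""
    scores = neighbour_weights(protein, weights, **levels)
    annotated = {n: s for n, s in scores.items() if annotations.get(n)}
    if not annotated:
        return set()
    best = max(annotated, key=annotated.get)
    return set(annotations[best])


weights = {
    "Q": {"A": 0.4, "B": 0.9},
    "A": {"Q": 0.4, "C": 0.5},
    "B": {"Q": 0.9, "D": 0.8},
    "C": {"A": 0.5},
    "D": {"B": 0.8},
}
annotations = {"A": {"kinase"}, "D": {"transporter"}}

print(propagate_labels("Q", weights, annotations, use_level2=False))  # {'kinase'}
print(propagate_labels("Q", weights, annotations))  # {'transporter'}
```

In this toy network the strongest direct partner of Q is un-annotated, so including level 2 neighbours changes the prediction: the annotated protein reached through that strong partner outscores the weakly linked direct neighbour.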
An approach that combines interactome analysis with clustering, classification
tree construction and rule inference was described by Brun and colleagues
(Brun et al., 2004). These authors achieved an accuracy of between 58% and
64%, depending upon the subset of data considered.
Data mining has also been applied successfully to the prediction of protein
function in metagenomic datasets. Such datasets are challenging because they
generally come from a wide variety of organisms with different characteristics,
such as GC content and codon usage bias. The majority of the organisms
identified in most metagenomic studies are almost completely un-annotated, since
more than 99% of prokaryotes in the environment cannot be cultured in the
laboratory (Schloss and Handelsman, 2005). BLAST searches alone can therefore
only provide limited information about protein function.