the rapid annotation of this genome did not affect the clinical treatment of the
outbreak, it was a convincing demonstration of the potential application of microbial
data mining via the wisdom of crowds.
Software Availability
Taverna: http://www.taverna.org.uk/ .
Microbase: http://www.microbasecloud.com/ .
7 CASE STUDY: DATA MINING FOR PROTEIN FUNCTION PREDICTION
One of the major issues challenging bioinformatics in general, and microbial genetics
in particular, is the prediction of protein function from DNA sequence. The number of
fully sequenced microbial genomes is growing exponentially, and the advent of Next
Generation Sequencing technologies means that the rate at which new genomes are
acquired will continue to increase. However, the percentage of reliably annotated
proteins falls as the number of sequenced genomes rises, since experimentally producing
such annotations takes far more time than generating sequence.
The workhorse of protein functional prediction is still BLAST. New sequences are
compared with sequences already on record, and if two sequences are similar enough,
annotations may be transferred from one to the other. Although the basic assumption
(that similar sequences are likely to produce proteins with similar function) is
broadly supportable, it does not always hold. Further, over time, this practice leads
to a phenomenon colloquially known as database rot. Sequence A might be very similar
to sequence B, which was annotated on the basis of its similarity to sequence C, which
was itself annotated by similarity to another sequence, and so on. Although sequence A
might be very similar to sequence B, it may be quite different from sequence Z, the
original, experimentally verified protein. The annotation of sequence A may therefore
be far from accurate.
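The chain of transfers described above can be made concrete with a small sketch. It is
illustrative only: the four toy sequences, the Annotation record, and the helper names
are assumptions introduced here, and the identity measure is a simple per-position
comparison of pre-aligned, equal-length strings rather than a BLAST score.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Annotation:
    function: str
    # ID of the sequence this annotation was copied from;
    # None means the function was verified experimentally.
    source: Optional[str]


def transfer_chain(seq_id: str, annotations: Dict[str, Annotation]) -> List[str]:
    """Follow 'copied from' links back to the experimentally verified sequence."""
    chain = [seq_id]
    seen = {seq_id}
    while annotations[chain[-1]].source is not None:
        nxt = annotations[chain[-1]].source
        if nxt in seen:
            raise ValueError("annotation provenance contains a cycle")
        chain.append(nxt)
        seen.add(nxt)
    return chain


def identity(a: str, b: str) -> float:
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical toy data: each sequence differs from its neighbour at two positions.
sequences = {
    "Z": "MKTAYIAKQR",  # experimentally verified
    "C": "MKTAYIAKGG",
    "B": "MKTAYILLGG",
    "A": "MKTAVVLLGG",
}
annotations = {
    "Z": Annotation("kinase", None),
    "C": Annotation("kinase", "Z"),
    "B": Annotation("kinase", "C"),
    "A": Annotation("kinase", "B"),
}

chain = transfer_chain("A", annotations)
print(" -> ".join(chain))
print("identity to nearest neighbour:", identity(sequences["A"], sequences["B"]))
print("identity to verified source:  ", identity(sequences["A"], sequences[chain[-1]]))
```

In this toy data every individual transfer looks safe (80% identity between neighbours),
yet sequence A shares only 40% of its positions with the verified sequence Z. Recording
the provenance of each annotation, as the source field does here, is one way to make
that kind of drift visible.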
A large number of approaches have been taken to protein functional assignment; a good
review is provided by Sleator (2012). In this review we are concerned only with those
methods that can broadly be described as data mining (as opposed to, for example,
deductions from measures of evolutionary relatedness). As with many bioinformatics
tasks, multiple algorithms are frequently combined in order to address this problem.
Several research groups have adopted the approach of calculating distances between
proteins, using those distances to build classification or decision trees, and then
converting the trees into classification rules.
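The distance-to-tree-to-rules pipeline can be sketched as follows. This is a minimal
illustration under stated assumptions, not the method of any particular group: the
features are hypothetical distances to two made-up reference proteins, the class labels
are invented, and the tree is grown with a simple Gini-impurity criterion rather than
C4.5 or C5.0.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between adjacent observed values
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Internal nodes are (feature, threshold, left, right); leaves are class labels."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    )


def tree_to_rules(tree, names, conditions=()):
    """Turn each root-to-leaf path into a (list_of_conditions, class_label) rule."""
    if not isinstance(tree, tuple):
        return [(list(conditions), tree)]
    f, t, left, right = tree
    return (
        tree_to_rules(left, names, conditions + (f"{names[f]} <= {t:g}",))
        + tree_to_rules(right, names, conditions + (f"{names[f]} > {t:g}",))
    )


# Hypothetical data: distance of each protein to two reference proteins.
names = ["dist_to_kinase_ref", "dist_to_transporter_ref"]
rows = [(0.1, 0.9), (0.2, 0.8), (0.8, 0.1), (0.9, 0.2), (0.8, 0.9), (0.9, 0.8)]
labels = ["kinase", "kinase", "transporter", "transporter", "other", "other"]

tree = build_tree(rows, labels)
rules = tree_to_rules(tree, names)
for conds, label in rules:
    print("IF " + " AND ".join(conds) + " THEN " + label)
```

Each printed rule corresponds to one path from the root of the tree to a leaf, which is
the sense in which a tree can be "turned into" classification rules. Production systems
typically prune and simplify such rules afterwards; that step is omitted here.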
An early example of this approach was the learning of classification rules to infer
protein function in Mycobacterium tuberculosis and E. coli (King et al., 2000).
The process started with building a decision tree, using C4.5 and C5.0 (see above).
Inductive Logic Programming (ILP) was then used to derive rules from the decision
...