Data mining for microbiologists - Methods in Microbiology


the rapid annotation of this genome did not affect the clinical treatment of the outbreak, it was a convincing demonstration of the potential application of microbial data mining via the wisdom of crowds.

Software Availability

Taverna: http://www.taverna.org.uk/

Microbase: http://www.microbasecloud.com/

7 CASE STUDY: DATA MINING FOR PROTEIN FUNCTION PREDICTION

One of the major issues challenging bioinformatics in general, and microbial genetics in particular, is the prediction of protein function from DNA sequence. The number of fully sequenced microbial genomes is growing exponentially, and the advent of Next Generation Sequencing technologies means that the rate at which new genomes are acquired will continue to increase. However, the percentage of reliably annotated proteins falls as the number of sequenced genomes grows, since experimentally producing such annotations takes far more time than generating sequence.

The workhorse of protein functional prediction is still BLAST. New sequences are compared with sequences already on record, and if two sequences are similar enough, annotations may be transferred from one to the other. Although the basic assumption, that similar sequences are likely to produce proteins with similar function, is broadly supportable, it does not always hold. Further, over time, this practice leads to a phenomenon colloquially known as database rot. Sequence A might be very similar to sequence B, which was annotated on the basis of its similarity to sequence C, which was annotated in the same way, and so on. Although sequence A might be very similar to sequence B, it may be quite different from sequence Z, the original, experimentally verified protein. The annotation of sequence A may therefore be far from accurate.
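The chain effect behind database rot can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the four toy strings are not real proteins, `percent_identity` is a naive ungapped comparison rather than a BLAST alignment, and the 70% transfer threshold and the `annotated_from` provenance table are hypothetical.

```python
# Toy illustration of "database rot": each link in the annotation chain
# looks safe, but the end of the chain has drifted far from the
# experimentally verified sequence.

def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two equal-length strings.

    This is a naive ungapped comparison, NOT a BLAST alignment.
    """
    if len(a) != len(b):
        raise ValueError("this toy measure needs equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def provenance_chain(name: str, annotated_from: dict) -> list:
    """Follow 'annotated from' links back to the verified root sequence."""
    chain = [name]
    while annotated_from.get(chain[-1]) is not None:
        source = annotated_from[chain[-1]]
        if source in chain:
            raise ValueError("cycle in annotation provenance")
        chain.append(source)
    return chain


# Hypothetical toy sequences; Z is the experimentally verified one.
seqs = {
    "Z": "AAAAAAAAAA",
    "C": "AAAAAAAACC",
    "B": "AAAAAACCCC",
    "A": "AAAACCCCCC",
}
# Hypothetical provenance: None marks experimental verification.
annotated_from = {"A": "B", "B": "C", "C": "Z", "Z": None}

TRANSFER_THRESHOLD = 70.0  # assumed cut-off for transferring an annotation

chain = provenance_chain("A", annotated_from)
for query, source in zip(chain, chain[1:]):
    identity = percent_identity(seqs[query], seqs[source])
    print(f"{query} vs {source}: {identity:.0f}% identity")

root_identity = percent_identity(seqs[chain[0]], seqs[chain[-1]])
print(f"{chain[0]} vs verified {chain[-1]}: {root_identity:.0f}% identity")
print("direct transfer justified:", root_identity >= TRANSFER_THRESHOLD)
```

Every neighbouring pair in the chain passes the assumed threshold, yet sequence A compared directly with the verified sequence Z does not, which is the situation the paragraph above describes.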

A large number of approaches have been taken to protein functional assignment; a good review is provided by Sleator (2012). In this review we are concerned only with those methods that can broadly be described as data mining (as opposed to, for example, deductions from measures of evolutionary relatedness). As with many bioinformatics tasks, multiple algorithms are frequently combined in order to address this problem.

Several research groups have adopted the approach of calculating distances between proteins, using these distances to build classification or decision trees, and then turning the trees into classification rules.
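The last step, turning a tree into rules, can be sketched as follows. This is a minimal illustration and not the procedure used by any of the cited groups: the tree is written by hand rather than learned, and the feature names (`dist_to_kinase_ref`, `dist_to_transporter_ref`), thresholds and class labels are hypothetical. Each root-to-leaf path simply becomes one IF-THEN rule.

```python
# A leaf is a class label (str); an internal node is a tuple
# (feature_name, threshold, subtree_if_less_or_equal, subtree_if_greater).

def tree_to_rules(tree, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if isinstance(tree, str):  # reached a leaf
        return [(list(conditions), tree)]
    feature, threshold, below, above = tree
    rules = tree_to_rules(below, conditions + (f"{feature} <= {threshold}",))
    rules += tree_to_rules(above, conditions + (f"{feature} > {threshold}",))
    return rules


def classify(tree, sample):
    """Walk the tree for one sample given as a feature -> value dict."""
    while not isinstance(tree, str):
        feature, threshold, below, above = tree
        tree = below if sample[feature] <= threshold else above
    return tree


# Hypothetical hand-built tree over distances to two reference proteins.
toy_tree = (
    "dist_to_kinase_ref", 0.3,
    "kinase",
    ("dist_to_transporter_ref", 0.5, "transporter", "unknown"),
)

rules = tree_to_rules(toy_tree)
for conds, label in rules:
    print("IF " + " AND ".join(conds) + f" THEN {label}")

# Classify a made-up protein that is far from the kinase reference
# but close to the transporter reference.
sample = {"dist_to_kinase_ref": 0.8, "dist_to_transporter_ref": 0.2}
print(classify(toy_tree, sample))
```

In practice the tree would be learned from labelled distance data, and the extracted rules are usually pruned or simplified afterwards; the sketch only shows the path-to-rule conversion.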

An early example of this approach was the learning of classification rules to infer protein function in Mycobacterium tuberculosis and E. coli (King et al., 2000). The process started with building a decision tree, using C4.5 and C5.0 (see above). Inductive Logic Programming (ILP) was then used to derive rules from the decision

...
