Data classification is the problem of assigning objects to one of a set of mutually
exclusive classes according to statistical properties derived from a training set
of examples of the same nature as those objects. The problem can be easily
formalised as follows. Assume that the data we want to classify are
represented by a set of n-dimensional vectors $x \in X = \mathbb{R}^n$ and that each
such vector can be assigned to exactly one of $m$ possible classes $c \in C =
\{1, \ldots, m\}$. Given a set of pre-compiled examples $E = \{(x_1, c_1), \ldots, (x_k, c_k)\}$,
where $(x_i, c_i) \in X \times C$ and $|E| < |X|$, the objective is to learn from $E$ a mapping
$f : X \to C$ that assigns every $x \in X$ to its correct class $c \in C$. In the biological
context, each entry of the vector $x \in X$ usually represents a single feature
(observation) of the object we want to classify (i.e., $x$ is not the object itself), and
the number of classes is typically limited to two or three. Moreover, machine learning
methods generally do not provide a rigid classification of an object; instead, they
return the probability that the object belongs to each of the possible classes
(a hard classification can be obtained by choosing the class with the highest probability).
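The soft-then-hard classification scheme described above can be sketched in code. The following is a minimal illustration, not one of the methods discussed in the text: it uses a simple nearest-centroid model (the centroid-based scoring and the softmax over negative distances are assumptions made for the sketch) to produce per-class probabilities from a set of examples $E$, and then classifies by taking the class with the highest probability.

```python
from collections import defaultdict
from math import exp, dist

def train_centroids(examples):
    """Learn one centroid per class from the example set E = {(x_i, c_i)}."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for x, c in examples:
        sums[c] = list(x) if sums[c] is None else [a + b for a, b in zip(sums[c], x)]
        counts[c] += 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def class_probabilities(x, centroids):
    """Soft output: softmax over negative distances to each class centroid,
    so the values are positive and sum to one over the classes."""
    scores = {c: exp(-dist(x, mu)) for c, mu in centroids.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def classify(x, centroids):
    """Hard classification: choose the class with the highest probability."""
    probs = class_probabilities(x, centroids)
    return max(probs, key=probs.get)
```

Any probabilistic classifier (NN, SVM with probability calibration, HMM posterior decoding) can replace the centroid model; only the final argmax step is what turns the probabilities into a class label.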
In bioinformatics, the most widely used machine learning methods for data clas-
sification are neural networks (NN), support vector machines (SVM) and hidden
Markov models (HMM). We do not discuss here the features and limitations
of these methods (for an extensive introduction, see [5]), but we briefly outline the
problem of correctly evaluating the performance of predictors of protein structural
characteristics.
A reliable approach for assessing the performance of data classification is a nec-
essary precondition for every machine learning-based method. Cross-validation
is the standard technique used to statistically evaluate how accurate a predictive
model is. It involves partitioning the example set into several disjoint sets. In one
round of cross-validation, one set is chosen as the test set and the others are used
as the training set. The method is trained on the training set, and the statistical
evaluation of its performance is computed from the prediction results obtained
on the test set. To reduce variability, multiple cross-validation rounds are performed
by interchanging training and test sets, and the results are averaged over the
number of rounds.
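The rounds described above can be sketched as a k-fold procedure. This is an illustrative skeleton, assuming the caller supplies a training function and an evaluation function (the names `train_fn` and `eval_fn` are placeholders, not part of any specific library):

```python
def k_fold_cross_validation(examples, k, train_fn, eval_fn):
    """Partition the example set into k disjoint folds; in each round, one
    fold serves as the test set and the remaining k-1 folds form the
    training set. Returns the performance averaged over the k rounds."""
    folds = [examples[i::k] for i in range(k)]  # disjoint partition of E
    scores = []
    for i in range(k):
        test_set = folds[i]
        training_set = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train_fn(training_set)          # train on k-1 folds
        scores.append(eval_fn(model, test_set)) # evaluate on the held-out fold
    return sum(scores) / k
```

In practice the examples should be shuffled (or split with the homology constraints discussed below) before folding, so that each fold is representative of the whole set.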
A proper evaluation (or cross-validation) of prediction methods needs to meet
one fundamental requirement: the test set must not contain examples too
similar to those in the training set. When testing prediction methods for
protein features (such as secondary structure or inter-residue contacts), this require-
ment translates into compiling the test and training sets from proteins that share no
significant pairwise sequence identity (typically <25%). If homologous sequences
are included in both the training and test sets, the average prediction accuracy does
not provide a reliable estimate of the performance and, in particular, does not re-
flect the performance of the method on sequences that are not homologous to those
in the training set.
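The redundancy-reduction step implied by the <25% identity requirement can be sketched as a greedy filter. This is a simplified illustration: it computes percent identity between pre-aligned, equal-length sequences, whereas a real pipeline would first align each pair (e.g., with BLAST or a Needleman-Wunsch implementation) and often uses dedicated clustering tools instead of this naive loop.

```python
def sequence_identity(a, b):
    """Percent identity between two aligned sequences of equal length."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

def redundancy_reduce(sequences, threshold=25.0):
    """Greedily keep only sequences sharing at most `threshold` percent
    pairwise identity with every previously kept sequence, so that a
    train/test split of the result contains no homologous pairs."""
    kept = []
    for s in sequences:
        if all(sequence_identity(s, t) <= threshold for t in kept):
            kept.append(s)
    return kept
```

Splitting the filtered set into training and test folds then guarantees, by construction, that no test protein has a close homologue in the training set.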