Data classification is the problem of assigning objects to one of a set of mutually
exclusive classes according to statistical properties derived from a training set
of examples of the same nature as those objects. The problem can be easily
formalised as follows. Assume that the data we want to classify are
represented by a set of n-dimensional vectors $x \in X = \mathbb{R}^n$ and that each
such vector can be assigned to exactly one of $m$ possible classes $c \in C =
\{1, \ldots, m\}$. Given a set of pre-compiled examples $E = \{(x_1, c_1), \ldots, (x_k, c_k)\}$,
where $(x_i, c_i) \in X \times C$ and $|E| < |X|$, the objective is to learn from $E$ a mapping
$f : X \to C$ that assigns every $x \in X$ to its correct class $c \in C$. In the biological
context, each entry of the vector $x \in X$ usually represents a single feature
(observation) of the object we want to classify (i.e., $x$ is not the object itself), and
the number of classes is typically limited to two or three. Moreover, machine learning
methods generally do not provide a rigid classification of an object; instead, they
return the probability that the object belongs to each of the possible classes
(a hard classification can be obtained by choosing the class with the highest probability).
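The soft-then-hard classification scheme described above can be sketched in code. The following is a minimal illustration, not one of the methods discussed in the text: it uses a simple nearest-centroid model (the centroid-based scoring and the softmax over negative distances are assumptions made for the sketch) to produce per-class probabilities from a set of examples $E$, and then classifies by taking the class with the highest probability.

```python
from collections import defaultdict
from math import exp, dist

def train_centroids(examples):
    """Learn one centroid per class from the example set E = {(x_i, c_i)}."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for x, c in examples:
        sums[c] = list(x) if sums[c] is None else [a + b for a, b in zip(sums[c], x)]
        counts[c] += 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def class_probabilities(x, centroids):
    """Soft output: softmax over negative distances to each class centroid,
    so the values are positive and sum to one over the classes."""
    scores = {c: exp(-dist(x, mu)) for c, mu in centroids.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def classify(x, centroids):
    """Hard classification: choose the class with the highest probability."""
    probs = class_probabilities(x, centroids)
    return max(probs, key=probs.get)
```

Any probabilistic classifier (NN, SVM with probability calibration, HMM posterior decoding) can replace the centroid model; only the final argmax step is what turns the probabilities into a class label.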
In bioinformatics, the most widely used machine learning methods for data clas-
sification are neural networks (NN), support vector machines (SVM) and hidden
Markov models (HMM). We do not discuss here the features and limitations
of these methods (for an extensive introduction, see [5]), but we briefly outline the
problem of correctly evaluating the performance of predictors of protein structural
characteristics.
A reliable approach for assessing the performance of data classification is a nec-
essary precondition for every machine learning-based method. Cross-validation
is the standard technique used to statistically evaluate how accurate a predictive
model is. It involves partitioning the example set into several disjoint sets. In one
round of cross-validation, one set is chosen as the test set and the others are used
as the training set. The method is trained on the training set, and the statistical
evaluation of its performance is computed from the prediction results obtained
on the test set. To reduce variability, multiple cross-validation rounds are performed
by interchanging training and test sets, and the results are averaged over the
number of rounds.
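The rounds described above can be sketched as a k-fold procedure. This is an illustrative skeleton, assuming the caller supplies a training function and an evaluation function (the names `train_fn` and `eval_fn` are placeholders, not part of any specific library):

```python
def k_fold_cross_validation(examples, k, train_fn, eval_fn):
    """Partition the example set into k disjoint folds; in each round, one
    fold serves as the test set and the remaining k-1 folds form the
    training set. Returns the performance averaged over the k rounds."""
    folds = [examples[i::k] for i in range(k)]  # disjoint partition of E
    scores = []
    for i in range(k):
        test_set = folds[i]
        training_set = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train_fn(training_set)          # train on k-1 folds
        scores.append(eval_fn(model, test_set)) # evaluate on the held-out fold
    return sum(scores) / k
```

In practice the examples should be shuffled (or split with the homology constraints discussed below) before folding, so that each fold is representative of the whole set.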
A proper evaluation (or cross-validation) of prediction methods needs to meet
one fundamental requirement: the test set must not contain examples too
similar to those in the training set. When testing prediction methods for
protein features (such as secondary structure or inter-residue contacts), this require-
ment translates into compiling the test and training sets from proteins that share no
significant pairwise sequence identity (typically <25%). If homologous sequences
are included in both the training and test sets, the average prediction accuracy does
not provide a reliable estimate of the performance and, in particular, does not re-
flect the performance of the method on sequences that are not homologous to those
in the training set.
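The redundancy-reduction step implied by the <25% identity requirement can be sketched as a greedy filter. This is a simplified illustration: it computes percent identity between pre-aligned, equal-length sequences, whereas a real pipeline would first align each pair (e.g., with BLAST or a Needleman-Wunsch implementation) and often uses dedicated clustering tools instead of this naive loop.

```python
def sequence_identity(a, b):
    """Percent identity between two aligned sequences of equal length."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

def redundancy_reduce(sequences, threshold=25.0):
    """Greedily keep only sequences sharing at most `threshold` percent
    pairwise identity with every previously kept sequence, so that a
    train/test split of the result contains no homologous pairs."""
    kept = []
    for s in sequences:
        if all(sequence_identity(s, t) <= threshold for t in kept):
            kept.append(s)
    return kept
```

Splitting the filtered set into training and test folds then guarantees, by construction, that no test protein has a close homologue in the training set.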