1.3.1 Generative and Discriminative Learning for Alignment-Free
Homology Detection and Fold Recognition
Alignment-free methods represent a protein (sequence) as a feature vector so that
homologs can be identified by directly comparing the resultant feature vectors.
To compare feature vectors, alignment-free methods usually employ a machine
learning method to classify feature vectors into classes. Two types of machine
learning methods are employed: generative learning and discriminative learning
[22]. Generative learning methods [23, 24] use a probabilistic model to define the
probability of occurrence of a feature vector and automatically find patterns in
the data. The main issue is that generative learning methods may not be sensitive
enough to discriminate distantly related proteins. They also need a large amount
of non-redundant data to estimate the parameters of a generative model, and such
data are not available for some protein families.
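As a minimal sketch of the generative idea, assume each protein family is modeled by a smoothed multinomial distribution over the 20 standard amino acids, and a query is assigned to the family under which it is most probable. This toy model, the function names, and the example sequences are illustrative assumptions, not the methods of [23, 24].

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fit_family_model(sequences, pseudocount=1.0):
    """Estimate smoothed residue probabilities from one family's sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def log_likelihood(seq, model):
    """Log-probability of a sequence under an independent-residue model."""
    return sum(math.log(model[ch]) for ch in seq if ch in AMINO_ACIDS)

def classify(seq, models):
    """Return the family whose model gives the query the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(seq, models[name]))

# Made-up toy sequences, for illustration only.
models = {
    "family_a": fit_family_model(["AAAALLLLAAKK", "ALALALAAKL"]),
    "family_b": fit_family_model(["GGSGGPGSGG", "GPGSGGSPGG"]),
}
print(classify("ALLAAKLA", models))
```

The pseudocount stands in for the data requirement noted above: with few non-redundant training sequences, the estimated probabilities are dominated by smoothing rather than by the family itself.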
Discriminative learning partially overcomes these issues by directly training a
machine learning model to differentiate homologous proteins (i.e., positive
examples) from non-homologous proteins (i.e., negative examples). Existing
discriminative learning methods differ mainly in their feature representation and
extraction schemes and in the machine learning models they employ. Several popular
supervised machine learning methods have been explored, including k-nearest
neighbor [25], decision trees (random forests) [26, 27], neural networks [28, 29],
and Support Vector Machines (i.e., kernel-based methods). Support Vector Machines
(SVMs) have been reported to outperform the others when tested on a few publicly
available datasets [30]. Besides SVMs, probabilistic graphical models, such as
Conditional Random Fields (CRFs), are also used for fold recognition [31].
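The discriminative setup described above can be sketched with the simplest of the listed classifiers, k-nearest neighbor. The sketch assumes an amino acid composition feature vector and Euclidean distance; both choices, the function names, and the toy sequences are assumptions made for illustration and do not reproduce any cited method.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """20-dimensional vector of amino acid frequencies in the sequence."""
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[aa] / total for aa in AMINO_ACIDS]

def knn_predict(query, training, k=3):
    """Majority label among the k training sequences nearest to the query."""
    q = composition(query)
    nearest = sorted(
        training, key=lambda item: math.dist(q, composition(item[0]))
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up labeled examples: 1 = homologous (positive), 0 = non-homologous (negative).
training = [
    ("AAAALLLLAAKK", 1), ("ALALALAAKL", 1), ("LLKAALAAL", 1),
    ("GGSGGPGSGG", 0), ("GPGSGGSPGG", 0), ("SGGPGGSG", 0),
]
print(knn_predict("ALLAAKLA", training))
```

Unlike a generative model, this classifier never estimates how likely a feature vector is; it only uses the labeled positives and negatives to decide which side of the boundary a query falls on.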
1.3.2 Kernel-Based Learning Methods for Alignment-Free
Homology Detection
A few kernel-based methods have been developed for alignment-free homology
detection and fold recognition. Their performance depends critically on the protein
features used to model a protein sequence (or profile) and on the kernel functions
employed. Gaussian kernel functions are widely used and yield good performance.
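For reference, a Gaussian kernel on two feature vectors can be sketched as follows. The text does not give a specific parameterization, so the form exp(-gamma * ||x - y||^2) and the `gamma` parameter are assumptions; `gaussian_kernel` and `gram_matrix` are hypothetical helper names.

```python
import math

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def gram_matrix(vectors, gamma=1.0):
    """Pairwise kernel values for a list of feature vectors."""
    return [[gaussian_kernel(u, v, gamma) for v in vectors] for u in vectors]

# Toy feature vectors, for illustration only.
features = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.0, 0.1, 0.9]]
for row in gram_matrix(features, gamma=2.0):
    print([round(value, 3) for value in row])
```

The kernel equals 1 for identical vectors and decays toward 0 as they move apart, with `gamma` controlling how quickly similarity falls off. A matrix like this can be handed to an SVM implementation that accepts precomputed kernels.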
Ding and Dubchak [32] developed a multi-class Support Vector Machine (SVM)
method for fold recognition that achieves an accuracy of 56% on a dataset of 27
protein folds. The feature vector employed by this method mainly encodes the
amino acid composition of a protein sequence. SVM-Fisher [17] represents a
protein by a vector of Fisher scores extracted from a profile Hidden Markov Model
(HMM) and employs an SVM to classify profile HMMs based on their feature
vectors. Shen and Chou [25] developed an ensemble classifier, PFP-Pred, which uses
a sequence feature called amphiphilic pseudo amino acid and also considers
sequence-order information. PFP-Pred improves homology detection and fold