1.3.1 Generative and Discriminative Learning for Alignment-Free
Homology Detection and Fold Recognition
Alignment-free methods represent a protein (sequence) as a feature vector so that
homologs can be identified by directly comparing the resultant feature vectors.
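As an illustration of comparing feature vectors directly, the following minimal sketch maps each sequence to a 20-dimensional amino acid composition vector and compares vectors by cosine similarity. The choice of composition as the feature, the cosine measure, the function names, and the toy sequences are all assumptions made for this example; they are not taken from any of the cited methods.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_vector(sequence):
    """Return the 20-dimensional amino acid composition of a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy sequences invented for illustration only (not real proteins).
query = composition_vector("MKVLAAGIVGLLLAQ")
candidate = composition_vector("MKILAAGLVGLALAQ")
unrelated = composition_vector("DDEEKKRRDDEEKKRR")

print("query vs candidate:", cosine_similarity(query, candidate))
print("query vs unrelated:", cosine_similarity(query, unrelated))
```

In this toy setting the candidate with a similar residue composition scores higher than the unrelated sequence. Real methods use much richer features than composition alone.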
To compare feature vectors, alignment-free methods usually employ a machine
learning method to classify feature vectors into classes. Two types of machine
learning methods are employed: generative learning and discriminative learning
[22]. Generative learning methods [23, 24] use a probabilistic model to define the
probability of occurrence of a feature vector and automatically find patterns in the data.
The main issue is that generative learning methods may not be sensitive enough to
discriminate distantly related proteins. They also need large amounts of non-redundant data to
estimate the parameters of a generative model, and such data are not available for some protein
families.
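To make the generative idea concrete, here is a deliberately simple sketch: a per-family model of independent residue frequencies, scored against a uniform background by log-odds. This is far simpler than the probabilistic models used in practice. The independence assumption, the uniform background, the pseudocount value, the function names, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fit_residue_model(sequences, pseudocount=1.0):
    """Estimate P(residue) for a family, with additive smoothing.

    The pseudocount keeps probabilities non-zero when training data are
    scarce, which is the situation described in the text above.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log_likelihood(sequence, model):
    """Log-probability of a sequence under an independent-residue model."""
    return sum(math.log(model[ch]) for ch in sequence.upper() if ch in model)


def log_odds(sequence, family_model, background_model):
    """Positive values favour the family model over the background."""
    return (log_likelihood(sequence, family_model)
            - log_likelihood(sequence, background_model))


# Toy family invented for illustration only (not real proteins).
family = ["MKVLAAGIVGLLLAQ", "MKILAAGLVGLALAQ", "MRVLAAGLIGLLLAE"]
family_model = fit_residue_model(family)
background_model = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

member_score = log_odds("MKVLGAGLVALLLAQ", family_model, background_model)
outsider_score = log_odds("DDEEKKRRDDEEKKRR", family_model, background_model)
print("family-like sequence:", member_score)
print("unrelated sequence:  ", outsider_score)
```

With only three training sequences the estimated frequencies are dominated by the pseudocount for rare residues, a small-scale reminder of why parameter estimation needs sufficient non-redundant data.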
Discriminative learning partially overcomes these issues by directly training a
machine learning model to differentiate homologous proteins (i.e., positive examples)
from non-homologous proteins (i.e., negative examples). Existing discriminative
learning methods differ mainly in their feature representation and extraction
schemes and in the machine learning models they employ. Several popular supervised
machine learning methods have been explored, including k-nearest neighbor [25],
decision trees (random forests) [26, 27], neural networks [28, 29], and Support
Vector Machines (i.e., kernel-based methods). On a few publicly available datasets,
Support Vector Machines (SVMs) have been reported to outperform the
others [30]. Besides SVMs, probabilistic graphical models such as Conditional
Random Fields (CRFs) are also used for fold recognition [31].
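The discriminative setup, training on labeled positive and negative examples, can be sketched with the simplest of the classifiers listed above, k-nearest neighbor. This is a minimal illustration and not a reimplementation of any cited method. The composition features, the Euclidean distance, the value k = 3, the function names, and the toy sequences are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """20-dimensional amino acid composition (assumed feature choice)."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS) or 1
    return [counts[aa] / total for aa in AMINO_ACIDS]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(query, training, k=3):
    """Majority vote among the k nearest training vectors.

    training is a list of (feature_vector, label) pairs, where label 1
    marks a homolog (positive) and 0 a non-homolog (negative).
    """
    neighbors = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy labeled data invented for illustration only (not real proteins).
positives = ["MKVLAAGIVGLLLAQ", "MKILAAGLVGLALAQ", "MRVLAAGLIGLLLAE"]
negatives = ["DDEEKKRRDDEEKKRR", "EEDDRRKKEEDDKKRR", "KKRRDDEEKRDEKRDE"]
training = ([(composition_vector(s), 1) for s in positives]
            + [(composition_vector(s), 0) for s in negatives])

print(knn_predict(composition_vector("MKVLGAGLVALLLAQ"), training))
print(knn_predict(composition_vector("DEKRDEKRDEKRDEKR"), training))
```

Unlike the generative approach, nothing here models how feature vectors are distributed within a family; the classifier only uses the labeled examples to separate the two classes.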
1.3.2 Kernel-Based Learning Methods for Alignment-Free
Homology Detection
A few kernel-based methods have been developed for alignment-free homology
detection and fold recognition. Their performance critically depends on the protein
features employed to model a protein sequence (or profile) and the employed kernel
functions. Gaussian kernel functions are widely used and yield good performance.
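For reference, the Gaussian kernel in its standard form is K(x, y) = exp(-gamma * ||x - y||^2). The sketch below computes it for a few feature vectors and builds the kernel (Gram) matrix that a kernel-based learner would consume. The value of gamma, the function names, and the three-dimensional toy vectors are assumptions chosen for illustration.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(vectors, gamma=1.0):
    """Pairwise kernel values for a list of feature vectors."""
    return [[gaussian_kernel(u, v, gamma) for v in vectors] for u in vectors]


# Toy feature vectors invented for illustration; the first two are close
# to each other and the third is far from both.
features = [
    [0.20, 0.10, 0.70],
    [0.25, 0.10, 0.65],
    [0.80, 0.10, 0.10],
]

K = gram_matrix(features, gamma=5.0)
for row in K:
    print([round(value, 3) for value in row])
```

The kernel value is 1 for identical vectors and decays toward 0 as vectors move apart; gamma controls how quickly, and in practice it is tuned on held-out data.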
Ding and Dubchak [32] developed a multi-class Support Vector Machine (SVM)
method for fold recognition that achieves an accuracy of 56% on a dataset of 27
protein folds. The feature vector employed by this method mainly encodes the
amino acid composition of a protein sequence. SVM-Fisher [17] represents a
protein by a vector of Fisher scores extracted from a profile Hidden Markov Model
(HMM) and employs an SVM to classify profile HMMs based upon their feature
vectors. Shen and Chou [25] developed an ensemble classifier, PFP-Pred, which uses
a sequence feature called amphiphilic pseudo amino acid composition and also considers
sequence-order information. PFP-Pred improves homology detection and fold