Introduction - Protein Homology Detection Through Alignment of Markov Random Fields


1.3.1 Generative and Discriminative Learning for Alignment-Free Homology Detection and Fold Recognition

Alignment-free methods represent a protein (sequence) as a feature vector so that
homologs can be identified by directly comparing the resultant feature vectors.
To compare feature vectors, alignment-free methods usually employ a machine
learning method to classify feature vectors into classes. Two types of machine
learning methods are employed: generative learning and discriminative learning
[22]. Generative learning methods [23, 24] use a probabilistic model to define the
probability of occurrence of a feature vector and automatically find patterns in
the data. The main issue is that generative learning methods may not be sensitive
enough to discriminate distantly related proteins. They also need a large amount
of non-redundant data to estimate the parameters of a generative model, and such
data are not available for some protein families.
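
As a minimal sketch of the alignment-free idea described above, the following
Python code maps each sequence to a 20-dimensional amino acid composition vector
and compares two proteins by the cosine similarity of their vectors. The choice
of composition as the feature, the function names, and the toy sequences are
illustrative assumptions and are not taken from any of the cited methods.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the fraction of each standard amino acid in seq (illustrative feature)."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy sequences: seq_a and seq_b are near-identical, seq_c is unrelated.
seq_a = "MKVLAAGIVGLLLA"
seq_b = "MKVLSAGIVGLILA"
seq_c = "DDEEDDKKRRDDEE"

print(cosine_similarity(composition(seq_a), composition(seq_b)))
print(cosine_similarity(composition(seq_a), composition(seq_c)))
```

No alignment is computed at any point: the comparison uses only the fixed-length
vectors, which is also why such a simple feature discards sequence-order
information.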

Discriminative learning partially overcomes these issues by directly training a
machine learning model to differentiate homologous proteins (i.e., positive
examples) from non-homologous proteins (i.e., negative examples). Existing
discriminative learning methods differ mainly in their feature representation and
extraction schemes and in the machine learning models they employ. Several popular
supervised machine learning methods have been explored, including k-nearest
neighbor [25], decision trees (random forests) [26, 27], neural networks [28, 29],
and Support Vector Machines (SVM, i.e., kernel-based methods). When tested on a few
publicly available datasets, SVM has been reported to outperform the others [30].
Besides SVM, probabilistic graphical models such as Conditional Random Fields
(CRF) are also used for fold recognition [31].
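
The discriminative setting can be sketched with the simplest of the classifiers
listed above, k-nearest neighbor. The code below is only an illustration under
stated assumptions: the composition feature, the Euclidean distance, the value
k = 3, and the toy positive and negative sequences are all hypothetical choices
and do not reproduce any cited method.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Amino acid composition vector (assumed feature, for illustration only)."""
    counts = [seq.upper().count(a) for a in AMINO_ACIDS]
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training vectors.

    train is a list of (feature_vector, label) pairs with label +1 (homologous)
    or -1 (non-homologous). k should be odd so that the vote cannot tie.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes > 0 else -1


# Hypothetical training data: positive and negative examples for one family.
positives = ["MKVLAAGIVGLLLA", "MKILAAGLVGLLIA", "MRVLAAGIVALLLA"]
negatives = ["DDEEDDKKRRDDEE", "EEDDKKRRDDEEDD", "KKDDEERRDDEEKK"]
train = [(composition(s), 1) for s in positives]
train += [(composition(s), -1) for s in negatives]

print(knn_predict(train, composition("MKVLSAGIVGLILA")))
print(knn_predict(train, composition("DEEDKKRRDEEDKR")))
```

Unlike a generative model, nothing here estimates how feature vectors are
distributed; the labeled negative examples are used directly to place the
decision boundary.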

1.3.2 Kernel-Based Learning Methods for Alignment-Free Homology Detection

A few kernel-based methods have been developed for alignment-free homology
detection and fold recognition. Their performance critically depends on the
protein features employed to model a protein sequence (or profile) and on the
kernel functions employed. Gaussian kernel functions are widely used and yield
good performance.
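
For reference, the Gaussian (RBF) kernel is commonly written as
K(x, y) = exp(-||x - y||^2 / (2 sigma^2)). The short Python sketch below
evaluates it on feature vectors and builds the kernel (Gram) matrix that a
kernel-based classifier would consume. The bandwidth sigma, the function names,
and the toy vectors are assumptions made for illustration; the text does not
state which parameterization the cited methods use.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); sigma is an assumed bandwidth."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(vectors, sigma=1.0):
    """Pairwise kernel values for a list of equal-length feature vectors."""
    return [[gaussian_kernel(u, v, sigma) for v in vectors] for u in vectors]


# Hypothetical 3-dimensional feature vectors.
features = [[0.2, 0.5, 0.3], [0.25, 0.45, 0.3], [0.9, 0.05, 0.05]]
for row in gram_matrix(features, sigma=0.5):
    print([round(value, 3) for value in row])
```

The kernel equals 1 for identical vectors and decays toward 0 as they move
apart, so sigma controls how quickly two proteins stop being treated as
similar.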

Ding and Dubchak [32] developed a multi-class Support Vector Machine (SVM)
method for fold recognition that achieves an accuracy of 56% on a dataset of 27
protein folds. The feature vector employed by this method mainly encodes the
amino acid composition of a protein sequence. SVM-Fisher [17] represents a
protein by a vector of Fisher scores extracted from a profile Hidden Markov Model
(HMM) and employs an SVM to classify profile HMMs based on their feature
vectors. Shen and Chou [25] developed an ensemble classifier, PFP-Pred, which uses
a sequence feature called amphiphilic pseudo amino acid composition and also
considers sequence-order information. PFP-Pred improves homology detection and fold
