Combining Biochemical Features and Evolutionary Information for Predicting DNA-Binding Residues in Protein Sequences - Advances in Computational Science and Engineering

Machine learning has recently been applied to sequence-based prediction of DNA-binding residues. The problem can be specified as follows: given the amino acid sequence of a protein that is supposed to interact with DNA, the task is to predict which amino acid residues may be located at the interaction interface. Both the structure of the protein and the sequence of the target DNA are assumed to be unknown. Although some experimental observations have been made for DNA-binding residues in protein structures, the molecular recognition mechanism is still poorly understood [3]. The hope is that machine learning methods can model the complex patterns hidden in the available structural data, and that the resulting classifiers can reliably identify DNA-binding residues in protein sequences. Given how little is known about the recognition mechanism, predicting DNA-binding residues from amino acid properties and local sequence patterns is a challenging task.
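
The problem statement above treats each residue as one instance to be classified from its local sequence context. A minimal Python sketch of that framing follows. The function name `residue_windows`, the window half-width, and the padding symbol `X` are assumptions made for illustration and are not taken from the studies cited here.

```python
from typing import List, Tuple


def residue_windows(sequence: str, half_width: int = 5,
                    pad: str = "X") -> List[Tuple[int, str]]:
    """Return one (position, window) pair per residue.

    Each window is centred on the residue at `position` and holds
    `half_width` neighbours on either side. Positions that fall off
    the ends of the sequence are filled with the `pad` symbol
    (an assumption of this sketch).
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [(i, padded[i:i + size]) for i in range(len(sequence))]


# Toy sequence, chosen only to show the shape of the output.
windows = residue_windows("MKRLA", half_width=2)
for position, window in windows:
    print(position, window)
```

Each window would then be paired with a binding or non-binding label derived from a known protein-DNA complex structure, giving one training instance per residue.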

Several studies have reported sequence-based prediction of DNA-binding residues. Ahmad et al. [4] analyzed the structural data of representative protein-DNA complexes and used the amino acid sequences in these structures to train artificial neural networks (ANNs) for DNA-binding site prediction. Yan et al. [5] constructed Naïve Bayes classifiers using the amino acid identities of DNA-binding sites and their sequence neighbors (context information). However, the prediction accuracy was not high, probably because amino acid sequences were used directly for classifier construction in these studies.

Classifier performance has been shown to improve when evolutionary information is used for input encoding. Ahmad and Sarai [6] developed an ANN-based method that uses evolutionary information in the form of position-specific scoring matrices (PSSMs). The average of sensitivity and specificity increased by up to 8.7% with PSSMs compared with ANN predictors that used sequence information only [4]. More recently, PSSMs were also used to train support vector machines (SVMs) and logistic regression models for accurate prediction of DNA-binding residues [7,8]. For a given protein sequence, the PSSM can be derived from the result of a PSI-BLAST search against a large sequence database. The scores in the PSSM indicate how well each amino acid position of the query sequence is conserved among its homologues. Since functional sites, including DNA-binding residues, tend to be conserved among homologous proteins, PSSM scores can provide relevant information for classifier construction. However, the PSSM is designed for PSI-BLAST searches, and it may not contain all the evolutionary information needed for modeling DNA-binding sites.
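
To make the use of PSSM scores for input encoding concrete, the following Python sketch builds one feature vector per residue by concatenating the PSSM rows in a window around it. It is an illustration under stated assumptions, not the encoding used in [6-8]: the function name `pssm_window_features`, the window half-width, and the zero-padding at the sequence ends are all hypothetical choices, and the toy matrix has three columns instead of the usual twenty.

```python
from typing import List, Sequence


def pssm_window_features(pssm: Sequence[Sequence[float]],
                         half_width: int = 1) -> List[List[float]]:
    """Concatenate PSSM rows in a window around each position.

    `pssm` holds one row of scores per sequence position. Window
    positions beyond either end of the sequence contribute a row of
    zeros (an assumption of this sketch).
    """
    if not pssm:
        return []
    zero_row = [0.0] * len(pssm[0])
    features = []
    for i in range(len(pssm)):
        vector: List[float] = []
        for j in range(i - half_width, i + half_width + 1):
            row = pssm[j] if 0 <= j < len(pssm) else zero_row
            vector.extend(float(score) for score in row)
        features.append(vector)
    return features


# Toy 3-position, 3-column matrix with made-up scores; a real PSSM
# from PSI-BLAST has 20 columns, one per amino acid type.
toy_pssm = [[1, -2, 0], [3, 0, -1], [-1, 4, 2]]
features = pssm_window_features(toy_pssm, half_width=1)
print(features[0])
```

With a real 20-column PSSM and a window of w residues, each vector would have 20 × w entries.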

In our previous studies [9,10], ANN and SVM classifiers were constructed using relevant biochemical features, including the hydrophobicity index, side chain pKa value, and molecular mass of an amino acid. These features were used to represent biological knowledge that might not be learned from the training data of DNA-binding residues. Classifier performance was significantly improved by the use of biochemical features for input encoding, and the SVM classifier outperformed the ANN predictor. However, it is still unknown whether classifier performance can be further improved by combining the biochemical features with evolutionary information.
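
The biochemical encoding described above can be sketched as a lookup that maps each residue in a window to three numbers. The Python example below is only an illustration. The table `TOY_FEATURES` covers five residues with approximate textbook values (Kyte-Doolittle hydropathy is assumed as the hydrophobicity index), which are not necessarily the values used in [9,10]. The `0.0` entries for residues without an ionizable side chain, and for unknown symbols, are placeholders of this sketch.

```python
from typing import Dict, List, Tuple

# (hydrophobicity, side chain pKa, molecular mass in Da).
# Approximate, illustrative values for five residues only; 0.0 marks
# a side chain with no ionizable group (placeholder assumption).
TOY_FEATURES: Dict[str, Tuple[float, float, float]] = {
    "A": (1.8, 0.0, 89.09),
    "R": (-4.5, 12.5, 174.20),
    "D": (-3.5, 3.9, 133.10),
    "K": (-3.9, 10.5, 146.19),
    "L": (3.8, 0.0, 131.17),
}


def encode_window(window: str,
                  table: Dict[str, Tuple[float, ...]],
                  pad_value: float = 0.0) -> List[float]:
    """Concatenate the feature tuple of every residue in `window`.

    Symbols missing from `table` (for example a padding symbol) are
    encoded as `pad_value` repeated once per feature.
    """
    n_features = len(next(iter(table.values())))
    vector: List[float] = []
    for residue in window:
        vector.extend(table.get(residue, (pad_value,) * n_features))
    return vector


vector = encode_window("KRA", TOY_FEATURES)
print(vector)
```

Features on such different scales would normally be rescaled before training an ANN or SVM; that step is omitted here.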

The main objective of this study is to improve classifier performance by combining different types of biological knowledge, including new descriptors of evolutionary information, PSSM and biochemical features. This approach gives rise to a large number
