Machine learning has recently been applied to sequence-based prediction of
DNA-binding residues. The problem can be specified as follows: given the amino acid
sequence of a protein that is supposed to interact with DNA, the task is to predict which
amino acid residues may be located at the interaction interface. Both the structure of
the protein and the sequence of the target DNA are assumed to be unknown. Although
some experimental observations have been made for DNA-binding residues in protein
structures, the molecular recognition mechanism is still poorly understood [3]. The
hope is that machine learning methods can model the complex patterns hidden in the
available structural data, and that the resulting classifiers can reliably identify
DNA-binding residues in protein sequences. Predicting DNA-binding residues from
amino acid properties and local sequence patterns nevertheless remains a
challenging task.
Several studies have addressed sequence-based prediction of DNA-binding
residues. Ahmad et al. [4] analyzed the structural data of representative protein-DNA
complexes, and used the amino acid sequences in these structures to train artificial
neural networks (ANNs) for DNA-binding site prediction. Yan et al. [5] constructed
Naïve Bayes classifiers using the amino acid identities of DNA-binding sites and their
sequence neighbors (context information). However, the prediction accuracy was not
high, probably because these studies used the amino acid sequences directly for
classifier construction.
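The sequence-only encoding described above can be illustrated with a short sketch. It is not the implementation used in [4] or [5]: the window half-width, the padding character, and the function names window_contexts and one_hot are assumptions chosen for illustration. Each residue is represented by itself plus its sequence neighbors, and each position in that context becomes a 20-element 0/1 vector.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def window_contexts(sequence, half_width=2, pad="-"):
    """Return one fixed-length context string per residue.

    half_width and pad are illustrative choices, not values from the source.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


def one_hot(context):
    """Encode a context as a flat 0/1 vector, 20 entries per position.

    Padding characters match no residue and so encode as all zeros.
    """
    vector = []
    for residue in context:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


contexts = window_contexts("MKRLA", half_width=1)
encoded = [one_hot(c) for c in contexts]
```

Each encoded vector would then be paired with a binding or non-binding label for the central residue before training a classifier.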
Classifier performance has been shown to improve when evolutionary information
is used for input encoding. Ahmad and Sarai [6] developed an ANN-based method
that uses evolutionary information in the form of position-specific scoring matrices
(PSSMs). The average of sensitivity and specificity increased by up to 8.7% with
PSSMs, compared with ANN predictors using sequence information only [4]. More
recently, PSSMs were also used to train support vector machines (SVMs) and logistic
regression models for accurate prediction of DNA-binding residues [7,8]. For a given
protein sequence, the PSSM can be derived from the result of a PSI-BLAST search
against a large sequence database. The scores in the PSSM indicate how well each
amino acid position of the query sequence is conserved among its homologues. Since
functional sites, including DNA-binding residues, tend to be conserved among
homologous proteins, PSSM scores can provide relevant information for classifier
construction. However, the PSSM was designed for PSI-BLAST searches, and it may
not contain all the evolutionary information needed for modeling DNA-binding sites.
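As a rough illustration of how PSSM scores might be turned into classifier inputs, the sketch below builds a feature vector for one residue from the PSSM rows in a window around it. It assumes the PSSM has already been parsed into a list of rows with 20 scores each; the logistic scaling, the zero padding at the sequence ends, and the names scale and pssm_window_features are assumptions, not details taken from [6-8].

```python
import math


def scale(score):
    """Squash a raw PSSM score into (0, 1) with a logistic function.

    This scaling is an assumed preprocessing step, not one stated in the source.
    """
    return 1.0 / (1.0 + math.exp(-score))


def pssm_window_features(pssm, center, half_width=1):
    """Concatenate scaled PSSM rows around one residue.

    Positions that fall outside the sequence are padded with zeros.
    """
    n_cols = len(pssm[0])
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(pssm):
            features.extend(scale(s) for s in pssm[pos])
        else:
            features.extend([0.0] * n_cols)
    return features


# Toy matrix (3 residues x 20 columns), made up purely for illustration.
toy_pssm = [[0] * 20, [2] * 20, [-1] * 20]
features = pssm_window_features(toy_pssm, center=0, half_width=1)
```

With a half-width of 1 the vector has 3 x 20 = 60 entries; a wider window would capture more sequence context at the cost of more input dimensions.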
In our previous studies [9,10], ANN and SVM classifiers were constructed using
relevant biochemical features, including the hydrophobicity index, side chain pKa
value, and molecular mass of each amino acid. These features were used to represent
biological knowledge that might not be learned from the training data of DNA-
binding residues. Classifier performance improved significantly when biochemical
features were used for input encoding, and the SVM classifier outperformed the ANN
predictor. However, it is still unknown whether classifier performance can be further
improved by combining the biochemical features with evolutionary information.
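A minimal sketch of biochemical feature encoding follows. The values are approximate textbook figures for a handful of residues, and they vary between sources; the source does not say which hydrophobicity index was used in [9,10], so the Kyte-Doolittle hydropathy scale is assumed here. The placeholder NEUTRAL_PKA for residues without an ionizable side chain and the function name encode_biochemical are also hypothetical.

```python
# (hydropathy, side chain pKa, molecular mass in Da) per residue.
# Approximate textbook values; hydropathy follows the Kyte-Doolittle scale,
# which is an assumption, since the source does not name the index used.
PROPERTIES = {
    "A": (1.8, None, 89.09),
    "G": (-0.4, None, 75.07),
    "L": (3.8, None, 131.17),
    "D": (-3.5, 3.65, 133.10),
    "E": (-3.5, 4.25, 147.13),
    "K": (-3.9, 10.53, 146.19),
    "R": (-4.5, 12.48, 174.20),
}

# Hypothetical placeholder for residues with no ionizable side chain.
NEUTRAL_PKA = 7.0


def encode_biochemical(sequence):
    """Map each residue to [hydropathy, pKa, mass]."""
    rows = []
    for residue in sequence:
        hydropathy, pka, mass = PROPERTIES[residue]
        rows.append([hydropathy, NEUTRAL_PKA if pka is None else pka, mass])
    return rows


rows = encode_biochemical("KA")
```

In practice the three features have very different ranges, so they would normally be rescaled to a common range before being fed to an ANN or SVM.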
The main objective of this study is to improve classifier performance by combining
different types of biological knowledge, including new descriptors of evolutionary
information, PSSM and biochemical features. This approach gives rise to a large number