Combining Biochemical Features and Evolutionary
Information for Predicting DNA-Binding Residues
in Protein Sequences
Liangjiang Wang
Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA
liangjw@clemson.edu
Abstract. This paper describes a new machine learning approach for predicting
DNA-binding residues from protein sequence data. Several biologically relevant
features, including biochemical properties of amino acid residues and
evolutionary information of protein sequences, were selected for input encoding.
The evolutionary information was represented as position-specific scoring
matrices (PSSMs) and several new descriptors developed in this study. The
sequence-derived features were then used to train random forests (RFs), which
can handle a large number of input variables and avoid model overfitting. The
use of evolutionary information together with biochemical features was found
to significantly improve classifier performance. The RF classifier was further
evaluated using a separate test dataset. The results suggest that the RF-based
approach predicts DNA-binding residues more accurately than the methods of
previous studies.
Keywords: DNA-binding site prediction, feature extraction, evolutionary
information, random forests, machine learning.
1 Introduction
Protein-DNA interactions are essential for many biological processes. For instance,
transcription factors activate or repress downstream gene expression by binding to
specific DNA motifs in promoters [1]. Protein-DNA interactions also play important
roles in DNA replication, repair and modification. To understand the molecular
mechanism of protein-DNA interactions, it is important to identify the DNA-binding
residues in DNA-binding proteins. The identification can be straightforward if the
structure of a protein-DNA complex is already known. However, it is rather
expensive and time-consuming to solve the structure of a protein-DNA complex.
Currently, only a few hundred protein-DNA complexes have structural data
available in the Protein Data Bank [2]. With the rapid accumulation of sequence
data from many genomes, computational methods are needed for predicting
DNA-binding residues from protein sequence information. The prediction results
may be used for gene functional annotation, protein-DNA docking and experimental
studies such as site-directed mutagenesis.
 