Combining Biochemical Features and Evolutionary Information for Predicting DNA-Binding Residues in Protein Sequences - Advances in Computational Science and Engineering


Combining Biochemical Features and Evolutionary Information for Predicting DNA-Binding Residues in Protein Sequences

Liangjiang Wang

Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA

liangjw@clemson.edu

Abstract. This paper describes a new machine learning approach for prediction of DNA-binding residues from protein sequence data. Several biologically relevant features, including biochemical properties of amino acid residues and evolutionary information of protein sequences, were selected for input encoding. The evolutionary information was represented as position-specific scoring matrices (PSSMs) and several new descriptors developed in this study. The sequence-derived features were then used to train random forests (RFs), which could handle a large number of input variables and avoid model overfitting. The use of evolutionary information together with biochemical features was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset. The results suggest that the RF-based approach gives rise to more accurate prediction of DNA-binding residues than previous studies.

Keywords: DNA-binding site prediction, feature extraction, evolutionary information, random forests, machine learning.
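
The input encoding summarized in the abstract can be illustrated with a minimal sketch. It is not the encoding used in this study: the two biochemical descriptors (side-chain charge and Kyte-Doolittle hydropathy), the sliding-window scheme, the window size, and the function name `encode_windows` are all assumptions chosen for illustration. Each residue is described by its biochemical values plus its PSSM row, and the vectors of neighbouring residues are concatenated into one input vector per position.

```python
from typing import Dict, List, Sequence

# Illustrative descriptors only (assumed, not taken from the paper):
# net side-chain charge at neutral pH and Kyte-Doolittle hydropathy.
CHARGE: Dict[str, float] = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}
HYDROPATHY: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_windows(
    sequence: str,
    pssm: Sequence[Sequence[float]],
    window: int = 5,  # hypothetical window size; must be odd
) -> List[List[float]]:
    """Return one feature vector per residue, built from a centred window.

    `pssm` holds one row of scores per residue (20 columns for a real PSSM).
    Window positions that fall outside the sequence are zero-padded.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    if window % 2 == 0:
        raise ValueError("window size must be odd")

    # Per-residue descriptor: [charge, hydropathy, PSSM row...]
    per_residue = [
        [CHARGE.get(aa, 0.0), HYDROPATHY.get(aa, 0.0), *row]
        for aa, row in zip(sequence, pssm)
    ]
    width = 2 + (len(pssm[0]) if pssm else 0)
    padding = [0.0] * width
    half = window // 2

    vectors: List[List[float]] = []
    for i in range(len(sequence)):
        vec: List[float] = []
        for j in range(i - half, i + half + 1):
            vec.extend(per_residue[j] if 0 <= j < len(sequence) else padding)
        vectors.append(vec)
    return vectors
```

Vectors of this kind, paired with binding or non-binding labels for each central residue, could then be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`. The classifier settings used in the study are not given in this section.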

1 Introduction

Protein-DNA interactions are essential for many biological processes. For instance, transcription factors activate or repress downstream gene expression by binding to specific DNA motifs in promoters [1]. Protein-DNA interactions also play important roles in DNA replication, repair and modification. To understand the molecular mechanism of protein-DNA interactions, it is important to identify the DNA-binding residues in DNA-binding proteins. The identification can be straightforward if the structure of a protein-DNA complex is already known. However, solving the structure of a protein-DNA complex is expensive and time-consuming. Currently, only a few hundred protein-DNA complexes have structural data available in the Protein Data Bank [2]. With the rapid accumulation of sequence data from many genomes, computational methods are needed for predicting DNA-binding residues from protein sequence information. The prediction results may be used for gene functional annotation, protein-DNA docking and experimental studies such as site-directed mutagenesis.
