Combining Biochemical Features and Evolutionary Information for Predicting DNA-Binding Residues in Protein Sequences - Advances in Computational Science and Engineering

of input variables. For DNA-binding site prediction, a data instance normally includes multiple neighboring residues for providing context information, and each residue is encoded using many feature values (e.g., 20 PSSM scores for each residue). Considering the relatively small dataset currently available for modeling DNA-binding sites, too many input variables may result in model overfitting. To avoid this potential pitfall, classifiers have been constructed using the random forest (RF) learning algorithm, which has the capability to handle a large number of input variables and avoid model overfitting [11]. The results obtained in this study suggest that DNA-binding site prediction can be significantly improved by combining relevant biochemical features with several descriptors of evolutionary information for input encoding.
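
To illustrate how quickly the number of input variables grows under this kind of encoding, the following is a minimal sketch of a sliding-window encoder that concatenates the 20 PSSM scores of each residue in the window. The window size of 11, the zero padding at the sequence ends, and the names `window_features` and `toy_pssm` are assumptions made for illustration only; they are not taken from the study.

```python
from typing import List, Sequence

N_PSSM = 20  # 20 PSSM scores per residue, as stated in the text


def window_features(pssm: Sequence[Sequence[float]],
                    center: int,
                    window: int) -> List[float]:
    """Concatenate the PSSM rows of all residues in a window centred on `center`.

    Positions that fall outside the sequence are zero padded
    (an assumption; other padding schemes are possible).
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    features: List[float] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
        else:
            features.extend([0.0] * N_PSSM)
    return features


# Toy PSSM for a 5-residue sequence (made-up values, not real scores).
toy_pssm = [[float(i + 1)] * N_PSSM for i in range(5)]

# Hypothetical window of 11 residues centred on the first residue.
instance = window_features(toy_pssm, center=0, window=11)
print(len(instance))  # 11 residues x 20 scores = 220 input variables
```

With only PSSM scores, a window of 11 residues already yields 220 input variables per data instance; adding further descriptors for each residue increases this number proportionally.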

2 Methods

2.1 Data Preparation

Two amino acid sequence datasets, PDNA-62 and PDC25t, were derived from structural data of protein-DNA complexes available at the Protein Data Bank (http://www.rcsb.org/pdb/). The PDNA-62 dataset was used for training classifiers in this work as in the previous studies [6-10]. PDNA-62 was derived from 62 structures of representative protein-DNA complexes, and the amino acid sequences in this dataset shared less than 25% identity. The PDC25t dataset was derived from the protein-DNA complexes that were not included in PDNA-62. The sequences in PDC25t had less than 25% identity among them as well as with the sequences in PDNA-62. In this study, PDC25t was used as a separate test dataset for classifier performance evaluation and comparison.

DNA-binding residues in protein-DNA complexes were identified using atom distance or solvent accessible surface area (ASA). For the atom distance-based method, an amino acid residue was designated as a binding site if any side chain or backbone atom of the residue fell within a cutoff distance of 3.5 Å from any atom of the DNA molecule in the complex, and all the other residues were regarded as non-binding sites. This definition of DNA-binding residues was used in previous studies by us [9,10] as well as others [4,6-8]. It is noteworthy that both PDNA-62 and PDC25t are imbalanced datasets, with ~15% of residues labeled as DNA-binding and ~85% labeled as non-binding.
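
The atom distance-based definition can be sketched as follows. This is an illustrative example, not the code used in the study: the coordinates are made up, the residue identifiers and the function name `label_by_distance` are hypothetical, and treating a distance exactly equal to the cutoff as binding is an assumption.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]

CUTOFF = 3.5  # cutoff distance in angstroms, as given in the text


def label_by_distance(residue_atoms: Dict[str, List[Coord]],
                      dna_atoms: List[Coord],
                      cutoff: float = CUTOFF) -> Dict[str, int]:
    """Label a residue 1 (binding) if any of its atoms lies within
    `cutoff` of any DNA atom, and 0 (non-binding) otherwise."""
    labels: Dict[str, int] = {}
    for res_id, atoms in residue_atoms.items():
        is_binding = any(
            math.dist(atom, dna_atom) <= cutoff
            for atom in atoms
            for dna_atom in dna_atoms
        )
        labels[res_id] = 1 if is_binding else 0
    return labels


# Toy coordinates (hypothetical, not taken from any PDB entry).
dna_atoms = [(0.0, 0.0, 0.0)]
residue_atoms = {
    "A1": [(3.0, 0.0, 0.0), (10.0, 0.0, 0.0)],  # one atom at 3.0 A from DNA
    "A2": [(4.0, 0.0, 0.0)],                    # closest atom at 4.0 A
    "A3": [(2.0, 2.0, 2.0)],                    # about 3.46 A from DNA
}

labels = label_by_distance(residue_atoms, dna_atoms)
binding_fraction = sum(labels.values()) / len(labels)
print(labels, binding_fraction)
```

Computing the fraction of positive labels in this way is also a quick check on class balance; for the datasets described above, it would be expected to come out near 15%.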

In some previous studies [5,12], DNA-binding residues were also labeled by using the change of ASA during protein-DNA complex formation. A residue was assumed to be a binding site if its ASA decreased by at least one square angstrom (∆ASA ≥ 1 Å²) after the protein bound to DNA. In this study, two ASA values were computed for each amino acid residue using GETAREA [13]: one in the protein-DNA complex and one in the unbound protein (structural data with the DNA coordinates removed). The residue's ∆ASA was then calculated by subtracting its ASA in the protein-DNA complex from that in the unbound protein. The above two definitions of DNA-binding residues were shown to give rise to slightly different datasets [9], and thus could affect classifier evaluation and comparison.
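
The ∆ASA-based definition amounts to a per-residue subtraction followed by a threshold test, as in the sketch below. The ASA values are made-up numbers and the function name `label_by_delta_asa` is hypothetical; reading ASA values from GETAREA output is not shown, since its output format is not described here.

```python
from typing import Dict

DELTA_ASA_THRESHOLD = 1.0  # square angstroms, as given in the text


def label_by_delta_asa(asa_unbound: Dict[str, float],
                       asa_complex: Dict[str, float],
                       threshold: float = DELTA_ASA_THRESHOLD) -> Dict[str, int]:
    """Label a residue 1 (binding) if its ASA in the unbound protein exceeds
    its ASA in the protein-DNA complex by at least `threshold`, else 0."""
    labels: Dict[str, int] = {}
    for res_id, unbound in asa_unbound.items():
        delta_asa = unbound - asa_complex[res_id]  # ASA lost on DNA binding
        labels[res_id] = 1 if delta_asa >= threshold else 0
    return labels


# Toy ASA values in square angstroms (hypothetical).
asa_unbound = {"A1": 45.0, "A2": 30.0, "A3": 12.5, "A4": 11.0}
asa_complex = {"A1": 20.0, "A2": 30.0, "A3": 12.0, "A4": 10.0}

asa_labels = label_by_delta_asa(asa_unbound, asa_complex)
print(asa_labels)
```

Applying this labeling and the distance-based labeling to the same complex and comparing the results residue by residue is one simple way to see where the two definitions disagree.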
