of input variables. For DNA-binding site prediction, a data instance normally includes
multiple neighboring residues for providing context information, and each residue is
encoded using many feature values (e.g., 20 PSSM scores for each residue). Considering
the relatively small dataset currently available for modeling DNA-binding sites, too
many input variables may result in model overfitting. To avoid this potential pitfall,
classifiers have been constructed using the random forest (RF) learning algorithm,
which has the capability to handle a large number of input variables and avoid model
overfitting [11]. The results obtained in this study suggest that DNA-binding site
prediction can be significantly improved by combining relevant biochemical features with
several descriptors of evolutionary information for input encoding.
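
The context-window encoding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `window_features`, the window size, and the zero padding at the sequence ends are all assumptions made for the example. It only shows how the number of input variables grows with the window size (window × features per residue).

```python
from typing import List, Sequence


def window_features(
    per_residue: Sequence[Sequence[float]], window: int = 5
) -> List[List[float]]:
    """Build one data instance per residue from a sliding window.

    per_residue: one feature vector per residue (e.g., 20 PSSM scores).
    window: odd number of residues per instance (hypothetical default).
    Positions beyond the sequence ends are zero-padded (an assumption).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n_feat = len(per_residue[0]) if per_residue else 0
    pad = [0.0] * n_feat
    instances = []
    for i in range(len(per_residue)):
        row: List[float] = []
        # Concatenate the features of the target residue and its neighbors.
        for j in range(i - half, i + half + 1):
            row.extend(per_residue[j] if 0 <= j < len(per_residue) else pad)
        instances.append(row)
    return instances
```

With 20 scores per residue, a window of 11 residues would already give 220 input variables per instance, which illustrates why overfitting is a concern on a small dataset.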
2 Methods
2.1 Data Preparation
Two amino acid sequence datasets, PDNA-62 and PDC25t, were derived from structural
data of protein-DNA complexes available at the Protein Data Bank.
The PDNA-62 dataset was used for training classifiers in
this work as in the previous studies [6-10]. PDNA-62 was derived from 62 structures
of representative protein-DNA complexes, and the amino acid sequences in this dataset
shared less than 25% identity. The PDC25t dataset was derived from the protein-DNA
complexes that were not included in PDNA-62. The sequences in PDC25t had
less than 25% identity with one another as well as with the sequences in PDNA-62. In this
study, PDC25t was used as a separate test dataset for classifier performance evaluation
and comparison.
DNA-binding residues in protein-DNA complexes were identified using atom distance
or solvent accessible surface area (ASA). For the atom distance-based method,
an amino acid residue was designated as a binding site if any side chain or backbone
atom of the residue fell within a cutoff distance of 3.5 Å from any atom of the DNA
molecule in the complex, and all the other residues were regarded as non-binding
sites. This definition of DNA-binding residues was used in previous studies by us
[9,10] as well as others [4,6-8]. It is noteworthy that both PDNA-62 and PDC25t are
imbalanced datasets, with ~15% of residues labeled as DNA-binding and ~85% labeled as
non-binding.
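
The distance-based labeling rule can be sketched as below. This is an illustrative sketch under stated assumptions: atoms are represented as plain (x, y, z) tuples in angstroms, the function name `label_by_distance` is hypothetical, and the cutoff is treated as inclusive. Parsing coordinates from a structure file is left out.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def label_by_distance(
    residue_atoms: Sequence[Sequence[Coord]],
    dna_atoms: Sequence[Coord],
    cutoff: float = 3.5,
) -> List[int]:
    """Label each residue 1 (binding) or 0 (non-binding).

    residue_atoms: for each residue, the coordinates of its side chain
        and backbone atoms.
    dna_atoms: coordinates of all DNA atoms in the complex.
    cutoff: distance threshold in angstroms (3.5 in the text); treating
        it as inclusive is an assumption of this sketch.
    """
    labels = []
    for atoms in residue_atoms:
        # A residue is binding if any of its atoms is close to any DNA atom.
        is_binding = any(
            math.dist(a, d) <= cutoff for a in atoms for d in dna_atoms
        )
        labels.append(1 if is_binding else 0)
    return labels


def binding_fraction(labels: Sequence[int]) -> float:
    """Fraction of residues labeled as binding, to check class imbalance."""
    return sum(labels) / len(labels) if labels else 0.0
```

The brute-force pairwise loop is adequate for a single complex; a spatial index would be a reasonable substitute for large structures.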
In some previous studies [5,12], DNA-binding residues were also labeled by using
the change of ASA during protein-DNA complex formation. A residue was assumed
to be a binding site if the residue's ASA decreased by at least one square angstrom
(∆ASA ≥ 1 Å²) after the protein bound to DNA. In this study, two ASA values were
computed for each amino acid residue using GETAREA [13]: one in the protein-DNA complex
and one in the unbound protein (structural data with the DNA coordinates removed).
The residue's ∆ASA was then calculated by subtracting its ASA in the protein-DNA
complex from that in the unbound protein. The above two definitions of DNA-binding
residues were shown
to give rise to slightly different datasets [9], and thus could affect classifier evaluation
and comparison.
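
The ∆ASA-based labeling rule can be sketched as follows. This sketch assumes the per-residue ASA values (in Å²) have already been computed by an external tool and are supplied as two aligned lists; the function names are hypothetical, and nothing here calls GETAREA itself. The helper `disagreements` simply lists the residues on which two labelings differ.

```python
from typing import List, Sequence


def delta_asa(
    asa_unbound: Sequence[float], asa_complex: Sequence[float]
) -> List[float]:
    """Per-residue ASA change: unbound protein minus protein-DNA complex."""
    if len(asa_unbound) != len(asa_complex):
        raise ValueError("ASA lists must cover the same residues")
    return [u - c for u, c in zip(asa_unbound, asa_complex)]


def label_by_delta_asa(
    asa_unbound: Sequence[float],
    asa_complex: Sequence[float],
    threshold: float = 1.0,
) -> List[int]:
    """Label a residue 1 (binding) if its delta ASA is >= threshold (in A^2)."""
    return [
        1 if d >= threshold else 0
        for d in delta_asa(asa_unbound, asa_complex)
    ]


def disagreements(labels_a: Sequence[int], labels_b: Sequence[int]) -> List[int]:
    """Indices of residues labeled differently by two definitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must cover the same residues")
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
```

Comparing the output of `label_by_delta_asa` with distance-based labels for the same residues gives a quick view of how far the two definitions diverge on a given complex.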