2.2 Training Strategy
Classifiers were trained using residue-wise data instances derived from the sequence
dataset (PDNA-62). Each data instance consisted of eleven residues with the target
residue positioned in the middle of the subsequence. From a protein sequence with n
amino acid residues, a total of (n - 10) data instances were extracted. A data instance
was labeled as 1 (positive) if the target residue was DNA-binding or 0 (negative) if
the target residue was non-binding. The context information provided by the five
neighboring residues on each side of the target residue was previously shown to be
optimal for sequence-based prediction of DNA-binding residues [9,10].
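The window extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name extract_instances, the toy sequence and the binding labels are all hypothetical, and the only facts taken from the text are the window length of eleven and the 0/1 labeling of the central residue.

```python
from typing import List, Tuple

WINDOW = 11          # window length stated in the text
FLANK = WINDOW // 2  # five neighboring residues on each side of the target


def extract_instances(sequence: str, binding: List[int]) -> List[Tuple[str, int]]:
    """Return (subsequence, label) pairs, one per residue that has a full window.

    `binding` holds one 0/1 flag per residue (1 = DNA-binding, 0 = non-binding).
    """
    if len(sequence) != len(binding):
        raise ValueError("sequence and binding labels must have the same length")
    instances = []
    # Only residues with five neighbors on both sides can be window centers.
    for center in range(FLANK, len(sequence) - FLANK):
        window = sequence[center - FLANK:center + FLANK + 1]
        instances.append((window, binding[center]))
    return instances


# Toy input: a made-up 15-residue sequence with made-up binding labels.
seq = "MKRLSTAQWERTYPG"
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
instances = extract_instances(seq, labels)
```

For this 15-residue toy sequence the function yields 15 - 10 = 5 instances, matching the (n - 10) count given in the text. The first five and last five residues of a sequence never appear as targets because they lack a full set of neighbors on one side.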
To generate the input vector, each residue was represented with three biochemical
features and several descriptors of evolutionary information (see below). The three
biochemical features, including the side chain pKa value (feature K), hydrophobicity
index (feature H) and molecular mass (feature M) of an amino acid, were previously
demonstrated to be relevant for predicting DNA-binding residues [9,10].
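As a rough sketch of how the biochemical part of the input vector could be assembled, the following Python snippet concatenates the (K, H, M) values of every residue in a window. The lookup table TOY_FEATURES contains deliberately synthetic placeholder numbers for three residue types only; they are not the pKa, hydrophobicity or mass values used in the study, and the function name encode_window is likewise an assumption made for illustration.

```python
from typing import Dict, List, Tuple

# Placeholder (K, H, M) triples. These numbers are synthetic and only
# illustrate the layout; a real table would cover all 20 amino acids
# with published pKa, hydrophobicity and molecular mass values.
TOY_FEATURES: Dict[str, Tuple[float, float, float]] = {
    "A": (0.1, 0.2, 0.3),
    "C": (0.4, 0.5, 0.6),
    "D": (0.7, 0.8, 0.9),
}


def encode_window(window: str, table: Dict[str, Tuple[float, float, float]]) -> List[float]:
    """Concatenate the per-residue feature triples in sequence order."""
    vector: List[float] = []
    for residue in window:
        vector.extend(table[residue])  # raises KeyError for unknown residues
    return vector


# An eleven-residue toy window built from the three residues in the table.
vector = encode_window("ACDACDACDAC", TOY_FEATURES)
```

With three features per residue, an eleven-residue window gives 33 biochemical inputs. Under the same layout, each evolutionary descriptor would add further values per residue to the vector.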
2.3 Evolutionary Information
Considering the great complexity of protein-DNA interactions, the labeled datasets
derived from the available structures are rather small in size. On the other hand, there
are abundant unlabeled sequence data in public databases such as UniProt [14]. The
unlabeled data contain evolutionary information about the conservation of each
sequence position, and DNA-binding residues tend to be conserved among homologous
proteins [15].
For a given protein sequence p, its homologues in a reference database can be
retrieved and aligned to p using PSI-BLAST [16]. The sequence alignment is then used
to compute evolutionary conservation scores for each residue in p. In this study, the
protein sequence dataset UniProtKB ( was used as the
reference database, and PSI-BLAST was run for three iterations with the E-value
threshold set to 1e-5. The following descriptors of evolutionary information have
been investigated in this study for predicting DNA-binding residues:
(1) BLAST-based conservation score (feature B): Let H_p = {h_1, h_2, …, h_n} be the set
of n hits (n > 0) in the PSI-BLAST search for a query sequence p. Each hit is a
pair-wise sequence alignment, in which PSI-BLAST indicates whether two
aligned residues are identical or show similarity based on the BLOSUM62 scoring
matrix [16]. The B score for the residue a_i at position i in p is computed as
where f(a_i, h_j) is set to 1 if a_i is aligned to an identical or similar residue in h_j, or 0
otherwise, and c is a pseudo-count (set to 10 in this work). The term (c/n) is used
to scale the feature value, and it becomes smaller when n gets larger. If p has no