Combining Biochemical Features and Evolutionary Information for Predicting DNA-Binding Residues in Protein Sequences - Advances in Computational Science and Engineering

2.2 Training Strategy

Classifiers were trained using residue-wise data instances derived from the sequence
dataset (PDNA-62). Each data instance consisted of eleven residues with the target
residue positioned in the middle of the subsequence. From a protein sequence with n
amino acid residues, a total of (n - 10) data instances were extracted. A data instance
was labeled as 1 (positive) if the target residue was DNA-binding or 0 (negative) if
the target residue was non-binding. The context information provided by the five
neighboring residues on each side of the target residue was previously shown to be
optimal for sequence-based prediction of DNA-binding residues [9,10].
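The windowing scheme described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `extract_instances`, the toy sequence, and the toy labels are all hypothetical, and the per-residue binding labels are assumed to be available as a list of 0/1 values.

```python
from typing import List, Tuple

WINDOW = 11          # residues per data instance (from the text)
FLANK = WINDOW // 2  # five neighboring residues on each side of the target


def extract_instances(sequence: str, labels: List[int]) -> List[Tuple[str, int]]:
    """Return (subsequence, label) pairs for every residue with a full window."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    instances = []
    # Only positions with five residues on both sides qualify as targets,
    # which yields n - 10 instances for a sequence of length n >= 11.
    for center in range(FLANK, len(sequence) - FLANK):
        window = sequence[center - FLANK:center + FLANK + 1]
        instances.append((window, labels[center]))
    return instances


# Hypothetical toy input: 14 residues, 1 = DNA-binding, 0 = non-binding.
toy_sequence = "MKRLSTAVGHEDQW"
toy_labels = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
toy_instances = extract_instances(toy_sequence, toy_labels)
```

With this toy input, the 14-residue sequence gives 14 - 10 = 4 instances. The first and last five residues serve only as context and never as targets.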

To generate the input vector, each residue was represented with three biochemical
features and several descriptors of evolutionary information (see below). The three
biochemical features, including the side chain pKa value (feature K), hydrophobicity
index (feature H) and molecular mass (feature M) of an amino acid, were previously
demonstrated to be relevant for predicting DNA-binding residues [9,10].
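One plausible way to assemble the input vector is to concatenate the per-residue feature values across the eleven-residue window. The sketch below assumes exactly that layout. The function `encode_window` and the lookup table are hypothetical, and the numbers in the table are placeholders chosen only to show the shape of the result; they are not actual pKa, hydrophobicity, or mass values.

```python
from typing import Dict, List, Sequence


def encode_window(window: str, table: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the feature tuple of each residue, left to right."""
    vector: List[float] = []
    for residue in window:
        vector.extend(table[residue])  # KeyError if a residue is not in the table
    return vector


# Placeholder numbers only: NOT real K, H, or M feature values.
toy_table = {"A": (0.1, 0.2, 0.3), "K": (0.4, 0.5, 0.6)}
toy_window = "AKAKAKAKAKA"  # 11 residues
toy_vector = encode_window(toy_window, toy_table)
```

With three features per residue, an eleven-residue window yields a vector of 33 values. Each evolutionary descriptor added per residue would extend it by another eleven.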

2.3 Evolutionary Information

Considering the great complexity of protein-DNA interactions, the labeled datasets
derived from the available structures are rather small in size. On the other hand, there
are abundant unlabeled sequence data in public databases such as UniProt [14]. The
unlabeled data contain evolutionary information about the conservation of each
sequence position, and DNA-binding residues tend to be conserved among homologous
proteins [15].

For a given protein sequence p, its homologues in a reference database can be
retrieved and aligned to p using PSI-BLAST [16]. The sequence alignment is then used
to compute evolutionary conservation scores for each residue in p. In this study, the
protein sequence dataset UniProtKB (http://www.pir.uniprot.org/) was used as the
reference database, and PSI-BLAST was run for three iterations with the E-value
threshold set to 1e-5. The following descriptors of evolutionary information have
been investigated in this study for predicting DNA-binding residues:

(1) BLAST-based conservation score (feature B): Let H_p = {h_1, h_2, …, h_n} be the set
of n hits (n > 0) in the PSI-BLAST search for a query sequence p. Each hit is a
pair-wise sequence alignment, in which PSI-BLAST indicates whether two
aligned residues are identical or show similarity based on the BLOSUM62 scoring
matrix [16]. The B score for the residue a_i at position i in p is computed as
follows:

\[
B_{a_i} = \frac{\sum_{h_j \in H_p} f(a_i, h_j)}{n + \dfrac{c}{n}} \tag{1}
\]

where f(a_i, h_j) is set to 1 if a_i is aligned to an identical or similar residue in h_j, or 0
otherwise, and c is a pseudo-count (set to 10 in this work). The term (c/n) is used
to scale the feature value, and it becomes smaller when n gets larger. If p has no
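
As a rough illustration of Equation (1), the sketch below computes B scores under the assumption that the denominator is n + c/n, which is how the scaling term (c/n) is read here. The function `b_scores` and its boolean-matrix input are hypothetical conveniences: parsing real PSI-BLAST output is not shown, and the case n = 0 is left to the caller because it is not specified in this part of the text.

```python
from typing import List, Sequence


def b_scores(query_len: int, hits: Sequence[Sequence[bool]], c: float = 10.0) -> List[float]:
    """Compute a B score for each query position.

    hits[j][i] is True when query residue a_i is aligned to an identical
    or similar residue in hit h_j, i.e. when f(a_i, h_j) = 1.
    """
    n = len(hits)
    if n == 0:
        # Assumption: only n > 0 is handled in this sketch.
        raise ValueError("this sketch only handles n > 0")
    denominator = n + c / n  # assumed form of the denominator in Equation (1)
    return [
        sum(1 for hit in hits if hit[i]) / denominator
        for i in range(query_len)
    ]


# Hypothetical toy alignment: 2 hits over a 3-residue query.
toy_hits = [
    [True, False, True],
    [True, True, False],
]
toy_b = b_scores(3, toy_hits)
```

With n = 2 and c = 10 the denominator is 2 + 5 = 7, so a position matched in both hits scores 2/7 rather than 1. As n grows, c/n shrinks and a fully conserved position approaches a score of 1, so the pseudo-count mainly damps scores that rest on only a few hits.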
