hit in the database ( n = 0), the feature value is set to 0. The B score was used to
construct artificial neural network classifiers in our previous study [9].
(2) Mean and standard deviation of biochemical feature values: For each residue $a_i$ in the sequence $p$, the mean ($\overline{X}_p(a_i)$) and standard deviation ($\sigma(X_p(a_i))$) of a biochemical feature $X$, $X \in \{H, K, M\}$, are calculated as follows:

$$\overline{X}_p(a_i) = \frac{1}{n_p} \sum_{h_j \in H_p} \chi(a_i, h_j) \qquad (2)$$

$$\sigma(X_p(a_i)) = \sqrt{\frac{1}{n_p - 1} \sum_{h_j \in H_p} \left( \chi(a_i, h_j) - \overline{X}_p(a_i) \right)^2} \qquad (3)$$

where $\chi(a_i, h_j)$ is the value of feature $X$ for the amino acid residue in $h_j$, which
is aligned to a i at position i in p . The mean of feature X , also referred to as H m , K m
or M m in this paper, captures the biochemical properties of an amino acid position
in the sequence alignment. It has been shown that basic and polar amino acids are
overrepresented while acidic and hydrophobic amino acids are underrepresented
in the population of DNA-binding sites [4,9]. The standard deviation of feature X ,
also called H d , K d or M d , reveals how well the biochemical properties of an amino
acid position are conserved in the aligned homologous sequences.
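As a sketch of these two descriptors, the mean and sample standard deviation of a feature over the residues aligned to one position can be computed as follows (the feature table and aligned residues below are made-up values for illustration; the actual H, K and M scales are defined earlier in the paper):

```python
import statistics

# Hypothetical feature values chi for a few residues (illustration only;
# not the paper's H, K or M scales).
chi = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "K": -3.9}

# Residues aligned to position i of sequence p in n_p = 5 homologs.
aligned = ["R", "K", "R", "N", "K"]

values = [chi[r] for r in aligned]
X_mean = sum(values) / len(values)   # Eq. (2): mean over aligned homologs
X_sd = statistics.stdev(values)      # Eq. (3): sample std, divisor n_p - 1

print(round(X_mean, 2), round(X_sd, 3))
```

A strongly negative mean here would indicate a position dominated by basic residues on this hypothetical scale, while a small standard deviation indicates the property is conserved across the homologs.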
(3) Position-specific scoring matrix (PSSM): The PSSM scores are generated by
PSI-BLAST [14], and there are 20 values for each sequence position. The evolu-
tionary information captured by PSSMs was previously shown to improve the
performance of artificial neural network and support vector machine classifiers
for predicting DNA-binding residues [6,7]. Nevertheless, the PSSM is designed primarily for BLAST searches, and it may not contain all the evolutionary information needed for predicting DNA-binding residues, especially with regard to the relevant biochemical properties discussed above.
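To make the size of this descriptor concrete, the following sketch assembles the PSSM inputs for an 11-residue window centred on a target position. The `pssm` variable is assumed to already hold one row of 20 scores per sequence position (parsed from PSI-BLAST output beforehand), and the zero-padding at the sequence ends is an assumption, not the paper's stated convention:

```python
# Sketch: build the PSSM descriptor for a window centred at position i.
# pssm is assumed to be a list of rows, each with 20 scores per position.

def pssm_window(pssm, i, window=11):
    half = window // 2
    features = []
    for j in range(i - half, i + half + 1):
        if 0 <= j < len(pssm):
            features.extend(pssm[j])
        else:
            features.extend([0.0] * 20)  # pad positions past either end

    return features

# Toy PSSM with 30 positions of dummy scores.
pssm = [[0.0] * 20 for _ in range(30)]
vec = pssm_window(pssm, i=15)
print(len(vec))  # 11 positions x 20 scores = 220 inputs
```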
2.4 Random Forests
One potential problem with the use of evolutionary information is that the number of
input variables becomes very large. In particular, the PSSM descriptor has 20 values
for each sequence position. For a data instance with eleven residues, PSSM alone
gives rise to 220 inputs. Considering the relatively small size of the training dataset,
too many inputs can result in model overfitting. This problem may be solved using
random forests [11].
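A minimal sketch of such a classifier, using scikit-learn's `RandomForestClassifier` on random stand-in data (the data, feature count, and hyperparameters here are illustrative, not the paper's settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 220))    # e.g. 220 PSSM inputs per data instance
y = rng.integers(0, 2, size=200)   # binding (1) vs non-binding (0) labels

# Each tree is grown on a bootstrap sample; max_features="sqrt" selects
# m ~ sqrt(n) of the n input variables at random at each node split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```

The per-node random feature selection (`max_features`) is what lets the ensemble tolerate a large number of inputs without the overfitting a single deep tree would exhibit.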
Random forests (RFs) use a combination of independent decision trees to improve
classification accuracy. Specifically, each decision tree in a forest is constructed using
a bootstrap sample from the training data. During tree construction, m variables out of
all the n input variables ( m << n ) are randomly selected at each node, and the node is split using the selected m variables. Because of the random feature selection, RFs
can handle a large number of input variables and avoid overfitting. For classifying a