hit in the database ( n = 0), the feature value is set to 0. The B score was used to
construct artificial neural network classifiers in our previous study [9].
(2) Mean and standard deviation of biochemical feature values: For each residue $a_i$ in the sequence $p$, the mean ($\overline{X}_p(a_i)$) and standard deviation ($\sigma(X_p(a_i))$) of a biochemical feature $X$, $X \in \{H, K, M\}$, are calculated as follows:

$$\overline{X}_p(a_i) = \frac{1}{n_p} \sum_{h_j \in H_p} \chi(a_i, h_j) \qquad (2)$$

$$\sigma(X_p(a_i)) = \sqrt{\frac{1}{n_p - 1} \sum_{h_j \in H_p} \left( \chi(a_i, h_j) - \overline{X}_p(a_i) \right)^2} \qquad (3)$$

where $\chi(a_i, h_j)$ is the value of feature $X$ for the amino acid residue in $h_j$, which
is aligned to a i at position i in p . The mean of feature X , also referred to as H m , K m
or M m in this paper, captures the biochemical properties of an amino acid position
in the sequence alignment. It has been shown that basic and polar amino acids are
overrepresented while acidic and hydrophobic amino acids are underrepresented
in the population of DNA-binding sites [4,9]. The standard deviation of feature X ,
also called H d , K d or M d , reveals how well the biochemical properties of an amino
acid position are conserved in the aligned homologous sequences.
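As a sketch of these two descriptors, the mean and sample standard deviation of a feature over the residues aligned to one position can be computed as follows (the feature table and aligned residues below are made-up values for illustration; the actual H, K and M scales are defined earlier in the paper):

```python
import statistics

# Hypothetical feature values chi for a few residues (illustration only;
# not the paper's H, K or M scales).
chi = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "K": -3.9}

# Residues aligned to position i of sequence p in n_p = 5 homologs.
aligned = ["R", "K", "R", "N", "K"]

values = [chi[r] for r in aligned]
X_mean = sum(values) / len(values)   # Eq. (2): mean over aligned homologs
X_sd = statistics.stdev(values)      # Eq. (3): sample std, divisor n_p - 1

print(round(X_mean, 2), round(X_sd, 3))
```

A strongly negative mean here would indicate a position dominated by basic residues on this hypothetical scale, while a small standard deviation indicates the property is conserved across the homologs.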
(3) Position-specific scoring matrix (PSSM): The PSSM scores are generated by
PSI-BLAST [14], and there are 20 values for each sequence position. The evolu-
tionary information captured by PSSMs was previously shown to improve the
performance of artificial neural network and support vector machine classifiers
for predicting DNA-binding residues [6,7]. Nevertheless, the PSSM is designed primarily for BLAST searches, and it may not contain all the evolutionary information needed for predicting DNA-binding residues, especially with regard to the relevant biochemical properties discussed above.
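To make the size of this descriptor concrete, the following sketch assembles the PSSM inputs for an 11-residue window centred on a target position. The `pssm` variable is assumed to already hold one row of 20 scores per sequence position (parsed from PSI-BLAST output beforehand), and the zero-padding at the sequence ends is an assumption, not the paper's stated convention:

```python
# Sketch: build the PSSM descriptor for a window centred at position i.
# pssm is assumed to be a list of rows, each with 20 scores per position.

def pssm_window(pssm, i, window=11):
    half = window // 2
    features = []
    for j in range(i - half, i + half + 1):
        if 0 <= j < len(pssm):
            features.extend(pssm[j])
        else:
            features.extend([0.0] * 20)  # pad positions past either end

    return features

# Toy PSSM with 30 positions of dummy scores.
pssm = [[0.0] * 20 for _ in range(30)]
vec = pssm_window(pssm, i=15)
print(len(vec))  # 11 positions x 20 scores = 220 inputs
```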
2.4 Random Forests
One potential problem with the use of evolutionary information is that the number of
input variables becomes very large. In particular, the PSSM descriptor has 20 values
for each sequence position. For a data instance with eleven residues, PSSM alone
gives rise to 220 inputs. Considering the relatively small size of the training dataset,
too many inputs can result in model overfitting. This problem may be solved using
random forests [11].
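A minimal sketch of such a classifier, using scikit-learn's `RandomForestClassifier` on random stand-in data (the data, feature count, and hyperparameters here are illustrative, not the paper's settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 220))    # e.g. 220 PSSM inputs per data instance
y = rng.integers(0, 2, size=200)   # binding (1) vs non-binding (0) labels

# Each tree is grown on a bootstrap sample; max_features="sqrt" selects
# m ~ sqrt(n) of the n input variables at random at each node split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```

The per-node random feature selection (`max_features`) is what lets the ensemble tolerate a large number of inputs without the overfitting a single deep tree would exhibit.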
Random forests (RFs) use a combination of independent decision trees to improve
classification accuracy. Specifically, each decision tree in a forest is constructed using
a bootstrap sample from the training data. During tree construction, m variables out of
all the n input variables ( m << n ) are randomly selected at each node, and the node is split using the selected m variables. Because of the random feature selection, RFs
can handle a large number of input variables and avoid overfitting. For classifying a