tradeoffs of sensitivity and specificity for the RF classifier. In this study, the prediction
strength was used for the initial selection of the best classifier, which was then evaluated
using the other performance measures.
The results suggest that, with the three biochemical features, the RF classifier is
slightly more accurate than the ANN and SVM predictors [9,10]. By using the same
dataset (PDNA-62), the ANN and SVM predictors achieved prediction strengths of
67.42% and 69.94%, respectively. The ROC AUC and MCC of the ANN and SVM
predictors are also slightly lower than those of the RF classifier. However, RFs may
have major advantages in handling a large number of input variables and avoiding
model overfitting when various descriptors of evolutionary information are used for
classifier construction.
3.2 Effect of Evolutionary Information on Classifier Performance
RF classifiers were constructed using different types of evolutionary information,
including the BLAST-based conservation score, position-specific scoring matrices
(PSSMs), and the means and standard deviations of biochemical feature values. For the
first set of experiments, the PDNA-62 dataset with the atom distance-based definition of
DNA-binding residues was used to construct RF classifiers. The conservation score (B)
was previously used to train ANN classifiers for DNA-binding site prediction [9]. As
shown in Table 2, the prediction strength (73.23%), MCC (0.34) and ROC AUC (0.81)
improve only slightly when the B score is added to the three biochemical features (H, K,
and M), suggesting that the conservation score does not capture most of the evolutionary
information for sequence-based prediction of DNA-binding residues.
Consistent with previous studies [6-8], the use of PSSM scores for input encoding
was found to significantly improve the classifier performance. As shown in Table 2,
the RF classifier achieved 76.82% prediction strength with 79.26% sensitivity and
74.38% specificity. It also had higher MCC (0.40) and ROC AUC (0.85) than the RF
classifier constructed using the three biochemical features alone (MCC = 0.32 and
AUC = 0.78). The results were obtained using 1000 decision trees in the forest and
the training parameter m set to 15 (⌊√253⌋). Because each residue was encoded with
20 PSSM scores and 3 biochemical features, the input vector contained 253 values for
a data instance with eleven residues.
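The input dimensionality and the value of m quoted above follow from simple arithmetic. The Python sketch below reproduces that arithmetic and shows one possible way to assemble a 253-value input vector. It assumes the eleven-residue window is centered on the target residue and that positions beyond the sequence termini are zero-padded. Neither detail is stated in the text, and the name `encode_window` is hypothetical.

```python
import math

WINDOW = 11        # residues per data instance (from the text)
N_PSSM = 20        # PSSM scores per residue (from the text)
N_BIOCHEM = 3      # biochemical features per residue: H, K, M
PER_RESIDUE = N_PSSM + N_BIOCHEM


def encode_window(residue_features, center, window=WINDOW):
    """Concatenate per-residue feature rows for a window centered on `center`.

    Positions outside the sequence are zero-padded. This is an assumption:
    the source does not say how sequence termini are handled.
    """
    half = window // 2
    width = len(residue_features[0])
    vector = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(residue_features):
            vector.extend(residue_features[pos])
        else:
            vector.extend([0.0] * width)
    return vector


n_inputs = WINDOW * PER_RESIDUE   # 11 * 23 = 253
m = math.isqrt(n_inputs)          # floor(sqrt(253)) = 15

# Toy sequence of 30 residues with dummy feature values (not real data).
toy = [[float(i)] * PER_RESIDUE for i in range(30)]
vec = encode_window(toy, center=2)
print(n_inputs, m, len(vec))
```

Taking the floor of the square root of the number of inputs is a common default for the number of variables tried at each split in classification forests, which is consistent with the value of 15 reported here.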
Table 2. Effect of evolutionary information on the performance of random forest classifiers
constructed using the PDNA-62 dataset with the atom distance-based definition of
DNA-binding residues

Evolutionary information   Accuracy (%)   Sensitivity (%)   Specificity (%)   Strength (%)   MCC    ROC AUC
None                       70.23          73.46             69.68             71.57          0.32   0.78
B                          72.74          73.92             72.54             73.23          0.34   0.81
PSSM                       75.09          79.26             74.38             76.82          0.40   0.85
Hm, Hd, Km, Kd             74.78          77.70             74.29             75.99          0.39   0.84
PSSM, Hm, Hd, Km, Kd       78.20          78.06             78.22             78.14          0.43   0.86
 