The Receiver Operating Characteristic (ROC) curve is probably the most robust
approach for classifier evaluation and comparison [18]. The ROC curve is drawn by
plotting the true positive rate (i.e., sensitivity) against the false positive rate, which
equals 1 - specificity. In this work, the ROC curve was generated by varying
the output threshold of a classifier and plotting the true positive rate against the false
positive rate for each threshold value. The area under the ROC curve (AUC) can be
used as a reliable measure of classifier performance [19]. Since the ROC plot is a unit
square, the maximum value of AUC is 1, which is achieved by a perfect classifier.
Weak classifiers and random guessing have AUC values close to 0.5.
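
As an illustration of the threshold sweep described above, the following minimal Python sketch computes ROC points and a trapezoidal AUC from classifier output scores. The function names `roc_points` and `auc_trapezoid` and the toy scores are hypothetical and are not taken from this work; labels are assumed to be 1 for positive (binding) and 0 for negative (non-binding) instances.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering a threshold over the scores.

    labels are assumed to be 1 (positive) or 0 (negative).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Sort by descending score; each distinct score acts as one threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit all instances tied at this threshold before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))  # (1 - specificity, sensitivity)
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve given as (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (hypothetical scores): positives are ranked above all negatives.
perfect = auc_trapezoid(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
# Identical scores carry no ranking information, as in random guessing.
uninformative = auc_trapezoid(roc_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

In this sketch, a classifier that ranks every positive above every negative reaches the maximum AUC of 1, while scores that carry no ranking information give 0.5, consistent with the limits stated above.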
3 Results
3.1 Prediction of DNA-Binding Residues Using Random Forests
Random forests (RFs) were first trained with three biochemical features that were
previously used to construct ANN and SVM predictors [9,10]. The biochemical
features, including the hydrophobicity index (feature H), side chain pKa value (K) and
molecular mass (M) of an amino acid, were shown to provide relevant information for
predicting DNA-binding residues [9]. The input vector contained 33 feature values
(11 residues x 3 features) because each data instance was a subsequence of eleven
consecutive residues with the target residue in the middle position. The context
information provided by the ten neighboring residues was found to be optimal for
DNA-binding site prediction [9,10]. Table 1 shows the results obtained using the
PDNA-62 dataset with the atom distance-based definition of DNA-binding residues
(see Methods).
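
The construction of the 33-value input vector can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `encode_windows`, the zero padding at the sequence termini, and the numbers in `PLACEHOLDER_FEATURES` are all assumptions made for the example. The text does not say how terminal residues were handled, and the placeholder numbers are not real H, K, or M values.

```python
WINDOW = 11          # eleven consecutive residues per data instance
HALF = WINDOW // 2   # five neighbors on each side of the target residue


def encode_windows(sequence, feature_table, pad=(0.0, 0.0, 0.0)):
    """Return one 33-value vector (11 residues x 3 features) per residue.

    feature_table maps an amino acid letter to an (H, K, M) tuple.
    Positions beyond the sequence ends are filled with `pad`
    (an assumption; the source does not describe terminus handling).
    """
    vectors = []
    for center in range(len(sequence)):
        vec = []
        for pos in range(center - HALF, center + HALF + 1):
            if 0 <= pos < len(sequence):
                vec.extend(feature_table[sequence[pos]])
            else:
                vec.extend(pad)
        vectors.append(vec)
    return vectors


# Placeholder values only, chosen to make the layout easy to inspect.
PLACEHOLDER_FEATURES = {
    "A": (1.0, 2.0, 3.0),
    "K": (4.0, 5.0, 6.0),
    "R": (7.0, 8.0, 9.0),
}

example_vectors = encode_windows("AKRKA", PLACEHOLDER_FEATURES)
```

Each vector places the target residue's three features at positions 15 to 17 (zero-based), with the five preceding and five following residues on either side.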
Table 1. Performance of different classifiers constructed using biochemical features

Classifier  Accuracy (%)  Sensitivity (%)  Specificity (%)  Strength (%)  MCC   ROC AUC
RF          70.23         73.46            69.68            71.57         0.32  0.78
SVM         70.31         69.40            70.47            69.94         0.29  0.75
ANN         64.38         71.33            63.51            67.42         0.27  0.73
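
The Strength and MCC columns of Table 1 can be related to sensitivity and specificity with a short calculation. The sketch below is a consistency check under stated assumptions, not the authors' procedure: prediction strength is taken as the mean of sensitivity and specificity, and MCC is rebuilt from sensitivity, specificity, and the fraction of DNA-binding residues, which the text gives only approximately as 15%. The function names are hypothetical.

```python
import math


def prediction_strength(sensitivity, specificity):
    """Average of sensitivity and specificity (same units as the inputs)."""
    return (sensitivity + specificity) / 2.0


def mcc_from_rates(sensitivity, specificity, positive_fraction):
    """Matthews correlation coefficient from rates given as fractions in [0, 1].

    The confusion matrix is expressed as fractions of the whole dataset,
    which leaves MCC unchanged because it is scale-invariant.
    """
    tp = positive_fraction * sensitivity
    fn = positive_fraction * (1.0 - sensitivity)
    tn = (1.0 - positive_fraction) * specificity
    fp = (1.0 - positive_fraction) * (1.0 - specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


# RF row of Table 1; 0.15 is the approximate fraction of binding residues.
rf_strength = prediction_strength(73.46, 69.68)
rf_mcc = mcc_from_rates(0.7346, 0.6968, 0.15)
```

With the RF row, this reproduces the tabulated strength of 71.57% and gives an MCC of about 0.32. Because the 15% class fraction is itself rounded, the MCC agreement should be read as approximate.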
The RF classifier constructed using the biochemical features achieved 70.23%
overall accuracy with 73.46% sensitivity and 69.68% specificity in fivefold
cross-validation experiments. Since the dataset was imbalanced with only 15% of the amino
acid residues as DNA-binding sites, the performance of the RF classifier was also
measured by the average of sensitivity and specificity (prediction strength = 71.57%),
Matthews correlation coefficient (MCC = 0.32), and the area under the receiver
operating characteristic curve (ROC AUC = 0.78). Different training parameters were tested
for constructing the RF classifier, and the above performance measures were obtained
using 1000 decision trees in the forest with m = 5. This classifier had the highest level
of prediction strength. It should be noted that the MCC value was obtained using the
given sensitivity and specificity. Higher MCC values might be obtained by using other
 