data instance, an RF classifier combines the votes cast by the decision trees and gives
the most popular class as the output of the ensemble. It has been shown that RFs
outperform AdaBoost ensembles on noisy datasets, and can work well on data with many
weak inputs [11]. These characteristics of RFs are appealing since the DNA-binding
data appear to be noisy and may contain many weak sequence-derived features.
The software available at http://www.stat.berkeley.edu/~breiman/RandomForests/
was used to construct the RF classifiers in this study with the default parameter
settings for training. In particular, the number of variables selected to split each
node (m) was set to the floor of the square root of the total number of input variables
(the default setting). Other values of m were also tested, but did not result in
significant improvement of classifier performance.
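The two ingredients described above, the default choice of m and the majority vote over trees, can be illustrated with a minimal Python sketch. It is not the software used in this study; the function names `default_m` and `majority_vote`, the feature count of 20, and the example votes are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def default_m(n_features: int) -> int:
    """Floor of the square root of the number of input variables."""
    return math.isqrt(n_features)


def majority_vote(tree_votes):
    """Return the class label that received the most votes from the trees."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical example: 20 input variables, five trees voting on one instance
# (1 = binding, 0 = non-binding).
print(default_m(20))                   # 4
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```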
2.5 Classifier Evaluation
A fivefold cross-validation approach was used to provide the initial estimation of
classifier performance on the PDNA-62 dataset. The trained classifier was further
evaluated using the PDC25t dataset. The following performance measures were used
in this work:
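\[
\text{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} \tag{4}
\]

\[
\text{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{5}
\]

\[
\text{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \tag{6}
\]

\[
\text{Strength} = \frac{\text{Sensitivity} + \text{Specificity}}{2} \tag{7}
\]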
where TP is the number of true positives (binding residues with positive predictions);
TN is the number of true negatives (non-binding residues with negative predictions);
FP is the number of false positives (non-binding residues predicted as binding sites);
and FN is the number of false negatives (binding residues predicted as non-binding
sites). Since the datasets used in this study are imbalanced, the overall accuracy alone
could be misleading. Thus, both sensitivity and specificity are also computed from the
prediction results. Furthermore, the average of sensitivity and specificity, referred to
as strength in this paper, may provide a fair measure of classifier performance, as shown
in previous studies [4,9].
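As a minimal sketch of measures (4) to (7), the following Python function computes them from the four confusion-matrix counts. The function name `evaluate` and the example counts are hypothetical and are not results from this study; the sketch also assumes that both classes are present, so that no denominator is zero.

```python
def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, sensitivity, specificity and strength (Eqs. 4-7)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # fraction of binding residues recovered
    specificity = tn / (tn + fp)   # fraction of non-binding residues recovered
    strength = (sensitivity + specificity) / 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "strength": strength,
    }


# Hypothetical imbalanced example: 100 binding and 900 non-binding residues.
scores = evaluate(tp=10, tn=890, fp=10, fn=90)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy is 0.900 although only 10% of the binding residues are detected (sensitivity 0.100), while the strength of about 0.544 reflects the poor performance on the minority class.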
The Matthews correlation coefficient (MCC) is commonly used as a measure of the
quality of binary classifications [17]. It measures the correlation between the
predictions and the actual class labels. However, for imbalanced datasets, different
tradeoffs between sensitivity and specificity may give rise to different MCC values for
a classifier. MCC is defined as:
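\[
\text{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}} \tag{8}
\]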
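A short Python sketch of Eq. (8) follows. The function name `mcc` and the example counts are hypothetical; returning 0 when the denominator vanishes is a common convention assumed here, not something stated in the text.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts (Eq. 8)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: MCC is undefined here, so report 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical imbalanced example: 100 binding and 900 non-binding residues.
print(round(mcc(tp=10, tn=890, fp=10, fn=90), 4))  # 0.1905
```

For these made-up counts the numerator is 8000 and the denominator is 42000, giving an MCC of about 0.19 for a classifier whose accuracy is 0.90.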
 