Hellinger distance is a distance metric between probability distributions used
by Cieslak and Chawla [29] to create HDDTs. It was chosen as a splitting cri-
terion for the binary class imbalance problem because of its property of skew
insensitivity. Hellinger distance is defined as a splitting criterion as [29]:
d_H(X_+, X_-) = \sqrt{ \sum_{j=1}^{p} \left( \sqrt{\frac{|X_{+j}|}{|X_+|}} - \sqrt{\frac{|X_{-j}|}{|X_-|}} \right)^2 }        (3.5)
where X_+ is the set of all positive examples, X_- is the set of all negative
examples, X_{+j} (X_{-j}) is the set of positive (negative) examples taking the jth
value of the feature under consideration, and p is the number of distinct values
of that feature.
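To make Eq. (3.5) concrete, the following Python sketch evaluates the Hellinger splitting criterion for a single discrete feature. The function name and array-based interface are illustrative choices of ours, not part of the HDDT implementation in [29]:

```python
import numpy as np

def hellinger_split_value(feature, labels):
    """Evaluate Eq. (3.5) for one discrete feature.

    feature: 1-D array of discrete feature values, one per training example.
    labels:  1-D array of binary class labels (1 = positive, 0 = negative).
    """
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    pos = feature[labels == 1]  # X_+ : all positive examples
    neg = feature[labels == 0]  # X_- : all negative examples
    total = 0.0
    for v in np.unique(feature):  # iterate over the p distinct feature values
        frac_pos = np.mean(pos == v) if pos.size else 0.0  # |X_{+j}| / |X_+|
        frac_neg = np.mean(neg == v) if neg.size else 0.0  # |X_{-j}| / |X_-|
        total += (np.sqrt(frac_pos) - np.sqrt(frac_neg)) ** 2
    return np.sqrt(total)
```

In an HDDT, the tree builder would compute this value for every candidate feature and split on the one with the largest Hellinger distance.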
3.4 EVALUATION METRICS
One common method for determining the performance of a classifier is through
the use of a confusion matrix (Fig. 3.1). In a confusion matrix, TN is the number
of negative instances correctly classified (True Negatives), FP is the number of
negative instances incorrectly classified as positive (False Positives), FN is the
number of positive instances incorrectly classified as negative (False Negatives),
and TP is the number of positive instances correctly classified as positive (True
Positives).
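As an illustration, the following minimal Python sketch tallies these four counts from arrays of true and predicted labels; the helper name and the 0/1 label encoding are our own conventions:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Tally the four cells of the binary confusion matrix in Figure 3.1.

    y_true, y_pred: 1-D arrays of binary labels (1 = positive, 0 = negative).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    return tn, fp, fn, tp
```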
From the confusion matrix, many standard evaluation metrics can be defined.
Traditionally, the most commonly used metric is accuracy (or its complement, the
error rate):
\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}        (3.6)
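Using the counts from the sketch above, Eq. (3.6) is a one-line computation. The 990-negative / 10-positive test set in the usage example is an arbitrary illustration, chosen to foreshadow the imbalance problem discussed next:

```python
def accuracy(tn, fp, fn, tp):
    """Eq. (3.6): fraction of all instances classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

# A classifier that always predicts "negative" on a hypothetical
# 990-negative / 10-positive test set still scores 99% accuracy:
print(accuracy(tn=990, fp=0, fn=10, tp=0))  # 0.99
```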
As mentioned previously, however, accuracy is inappropriate when data is
imbalanced. This is seen in our previous example, where the majority class may
dominate the dataset: a trivial classifier that always predicts the majority class
achieves high accuracy while misclassifying every minority-class instance.
                    Predicted negative    Predicted positive
Actual negative             TN                    FP
Actual positive             FN                    TP

Figure 3.1 Confusion matrix.