Figure 2.2 Impact of within-class imbalance on rare cases. [Plot: error rate (0.00 to 0.30) versus relative degree of rarity (16:1 to 1:1) for the parity and voting datasets.]
directly from a predefined concept. Figure 2.2 shows the results generated from
the raw data from an early study on rare cases [7].
Figure 2.2 shows the error rate for the cases, or subconcepts, within the parity
and voting datasets, based on how rare each case is relative to the most general
case in the classification concept associated with the dataset. For example, a
relative degree of rarity of 16:1 means that the rare case is 16 times as rare as
the most common case, while a value of 1:1 corresponds to the most common
case. For both datasets in Figure 2.2, the rare cases (i.e., those with a higher
relative degree of rarity) have a much higher error rate than the common cases;
in this particular set of experiments, the more common cases are learned
perfectly and have no errors. The concepts associated with the two datasets can
be learned perfectly (i.e., there is no noise); the errors were introduced by
limiting the size of the training set.
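To make the ratio concrete, the following is a minimal sketch of how a relative degree of rarity and a per-case error rate could be computed. It assumes that every example carries a known case label and that rarity is estimated from example counts; in practice the true cases are rarely known. The function names and the toy data are hypothetical and are not taken from the study cited above.

```python
from collections import Counter
from fractions import Fraction


def relative_rarity(case_labels):
    """Map each case to (count of the most common case) / (its own count)."""
    counts = Counter(case_labels)
    most_common = max(counts.values())
    return {case: Fraction(most_common, n) for case, n in counts.items()}


def error_rate_by_case(case_labels, correct):
    """Fraction of misclassified examples within each case."""
    totals, errors = Counter(), Counter()
    for case, ok in zip(case_labels, correct):
        totals[case] += 1
        if not ok:
            errors[case] += 1
    return {case: Fraction(errors[case], totals[case]) for case in totals}


# Hypothetical toy data: case "A" is 16 times as frequent as case "C".
cases = ["A"] * 16 + ["B"] * 4 + ["C"] * 1
# Hypothetical classifier outcomes (True = classified correctly).
correct = [True] * 16 + [True, True, True, False] + [False]

rarity = relative_rarity(cases)
errors = error_rate_by_case(cases, correct)
for case in sorted(rarity):
    print(f"{case}: rarity {rarity[case]}:1, error rate {float(errors[case]):.2f}")
```

In this toy data the most common case has a rarity of 1:1 and the rarest a rarity of 16:1, mirroring the axis of Figure 2.2; the error values are invented purely for illustration.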
Rare cases are difficult to analyze because one does not know the true concept
and hence cannot identify the rare cases. This inability to identify rare cases
makes it harder to develop strategies for dealing with them. Rare cases will,
however, manifest themselves in the learned concept, which is an approximation of
the true concept. Many classifiers, such as decision tree and rule-based learners,
form disjunctive concepts, and for these learners, the rare cases will form small
disjuncts, that is, the disjuncts in the learned classifier that cover few training
examples [8]. The relationship between the rare and the common cases in the true
(but generally unknown) concept, and the disjuncts in the induced classifier, is
depicted in Figure 2.3.
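The notion of disjunct size can be sketched in a few lines. The example below assumes a learned concept represented as an ordered list of named predicates, each standing for one disjunct, and counts the training examples each one covers. The representation, the one-dimensional data, and the size cutoff are all hypothetical choices made for illustration; the source does not prescribe a specific threshold for what counts as "small".

```python
def disjunct_sizes(disjuncts, examples):
    """Count the training examples covered by each disjunct (first match wins)."""
    sizes = {name: 0 for name, _ in disjuncts}
    for x in examples:
        for name, covers in disjuncts:
            if covers(x):
                sizes[name] += 1
                break
    return sizes


# Hypothetical learned concept: positive if (0 <= x < 10) or (20 <= x < 21).
disjuncts = [
    ("large", lambda x: 0 <= x < 10),
    ("small", lambda x: 20 <= x < 21),
]

# Hypothetical positive training examples on a single numeric feature.
training_positives = [0.5, 1.2, 2.0, 3.3, 4.1, 5.8, 6.6, 7.4, 8.9, 9.5, 20.4, 20.7]

sizes = disjunct_sizes(disjuncts, training_positives)

# Hypothetical cutoff: disjuncts covering fewer examples than this are "small".
threshold = 5
small = [name for name, n in sizes.items() if n < threshold]

print(sizes)
print(small)
```

Here the disjunct covering the narrow interval is flagged as small because only two training examples fall inside it, which is how a rare case in the true concept would typically surface in the learned classifier.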
Figure 2.3 shows a concept made up of two positively labeled cases, one a rare
case and the other a common case, along with the small and large disjuncts
that the classifier forms to cover them. Any examples located within the solid
boundaries corresponding to these two cases should be labeled as positive and