of the virtual instances created by SMOTE become support vectors. SMOTE therefore spends much of its time creating virtual instances that will never be used in the model. VIRTUAL, on the other hand, already has a short training time and uses this time to create more informative virtual instances. In Table 6.4, the numbers in parentheses give the ranks of the g-means prediction performance of the four approaches. The values in bold correspond to a win, and VIRTUAL wins on nearly all datasets. The Wilcoxon signed-rank test (two-tailed) between VIRTUAL and its nearest competitor, SMOTE, reveals that the zero-median hypothesis can be rejected at the 1% significance level (p = 4.82 × 10⁻⁴), implying that VIRTUAL performs statistically better than SMOTE on these 14 datasets. These results demonstrate the importance of creating synthetic samples from the informative examples rather than from all the examples.
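The claim rests on a paired, two-tailed Wilcoxon signed-rank test over the per-dataset g-means scores. The following is a minimal sketch of how such a paired comparison can be run; the scores below are placeholders for illustration, not the values from Table 6.4.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder g-means scores on 14 datasets (illustrative only; the
# actual values appear in Table 6.4).
gmeans_virtual = np.array([0.91, 0.88, 0.95, 0.83, 0.90, 0.87, 0.92,
                           0.85, 0.89, 0.94, 0.86, 0.93, 0.84, 0.90])
gmeans_smote = np.array([0.88, 0.85, 0.93, 0.80, 0.87, 0.86, 0.89,
                         0.83, 0.86, 0.92, 0.84, 0.90, 0.82, 0.88])

# Two-tailed test of the null hypothesis that the median of the paired
# differences is zero.
stat, p_value = wilcoxon(gmeans_virtual, gmeans_smote,
                         alternative="two-sided")
print(f"W = {stat}, p = {p_value:.2e}")  # reject the null if p < 0.01
```

Rejecting the zero-median null at the 1% level, as reported above, means the per-dataset differences systematically favor one method rather than arising by chance.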
6.5 DIFFICULTIES WITH EXTREME CLASS IMBALANCE
Practical applications rarely provide us with data that have equal numbers of training instances of all the classes. Moreover, in many applications, the imbalance in the distribution of naturally occurring instances is extreme. For example, when
labeling web pages to identify specific content of interest, uninteresting pages
may outnumber interesting ones by a million to one or worse (consider identi-
fying web pages containing hate speech, in order to keep advertisers off them,
cf. [33]).
The previous sections have detailed techniques that have been developed to cope with moderate class imbalance. However, as class imbalance tends toward the extreme, AL strategies can fail completely, and this failure is not simply due to the challenge of learning models from skewed class distributions, which has received a good deal of study and has been addressed throughout this book. The lack of labeled data compounds the problem: the techniques cannot concentrate on the minority instances because they do not know which instances to focus on.
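To make the comparison concrete, here is a minimal sketch of pool-based uncertainty sampling with a logistic regression classifier, in the spirit of the strategy evaluated in Figure 6.10. The synthetic dataset, per-class seeding, and query budget are illustrative assumptions, not the experimental setup behind the figure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X, y, seed_idx, budget):
    """Greedily query the pool instance closest to the decision boundary."""
    labeled = list(seed_idx)
    model = LogisticRegression(max_iter=1000)
    for _ in range(budget):
        model.fit(X[labeled], y[labeled])
        proba = model.predict_proba(X)[:, 1]
        scores = -np.abs(proba - 0.5)  # higher score = more uncertain
        scores[labeled] = -np.inf      # never re-query a labeled instance
        labeled.append(int(np.argmax(scores)))
    return model, labeled

# Synthetic pool with an induced skew of roughly 1000:1 (illustrative).
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.999], random_state=0)
# Seed with one instance of each class so the first model can be fit.
seed = [int(np.flatnonzero(y == 0)[0]), int(np.flatnonzero(y == 1)[0])]
model, labeled = uncertainty_sampling(X, y, seed, budget=100)
print(f"{int(y[labeled].sum())} minority instances among {len(labeled)} queries")
```

Note that under extreme skew a randomly drawn seed may contain no minority instances at all, so the model never sees the class it most needs to find; the explicit per-class seeding above sidesteps this for illustration, but it is exactly the failure mode described here.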
Figure 6.10 compares the AUC of logistic regression text classifiers induced by
labeled instances selected with uncertainty sampling and with RS. The learning
[Figure 6.10: five panels, one per skew level; each plots AUC (0.4–1.0) against the number of labeled examples (0–10,000) for random and uncertainty sampling.]

Figure 6.10 Comparison of random sampling and uncertainty sampling on the same dataset with induced skews ranging from 1:1 to 10,000:1.