CLASS IMBALANCE AND ACTIVE LEARNING - Imbalanced Learning: Foundations, Algorithms, and Applications


Figure 6.7 Comparison of oversampling the minority class with SMOTE and VIRTUAL. (a) Oversampling with SMOTE and (b) oversampling with VIRTUAL. Legend: majority-class distribution, minority-class distribution, virtual instances created by SMOTE, virtual instances created by VIRTUAL; the margin is marked in both panels.

number of virtual instances created for a real positive support vector in each iteration, |SV(+)| is the number of positive support vectors, and C is the cost of finding k nearest neighbors. The computational complexity of SMOTE is O(X_R · v · C), where X_R is the number of positive training instances. C depends on the approach used to find the k nearest neighbors. The naive implementation searches all N training instances for the nearest neighbors, and thus C = kN. With a more advanced data structure such as a kd-tree, C = k log N. Since |SV(+)| is typically much smaller than X_R, VIRTUAL incurs lower computational overhead than SMOTE. Also, because fewer virtual instances are created, the learner is less burdened with VIRTUAL. We demonstrate with empirical results that the virtual instances created with VIRTUAL are more informative and that the prediction performance is also improved.
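The cost comparison above can be made concrete with a small sketch. It assumes that the cost of VIRTUAL has the same form as that of SMOTE, with |SV(+)| in place of X_R, which is what the comparison in the text implies. The function names, the dataset sizes, and the base-2 logarithm are hypothetical choices made only for illustration.

```python
import math


def knn_cost(k, n, use_kd_tree=False):
    """Cost C of finding k nearest neighbors among n training instances.

    Naive search: C = k * N. kd-tree: C = k * log N.
    The base of the logarithm is an assumption; base 2 is used here.
    """
    return k * math.log2(n) if use_kd_tree else k * n


def oversampling_cost(num_seeds, v, c):
    """Cost of creating v virtual instances for each of num_seeds seed instances."""
    return num_seeds * v * c


# Hypothetical sizes, not taken from the experiments in this chapter.
N = 10_000    # total number of training instances
X_R = 500     # positive training instances (seeds used by SMOTE)
SV_POS = 40   # positive support vectors, |SV(+)| (seeds used by VIRTUAL)
k, v = 5, 1

for use_kd_tree in (False, True):
    c = knn_cost(k, N, use_kd_tree)
    smote = oversampling_cost(X_R, v, c)
    virtual = oversampling_cost(SV_POS, v, c)
    label = "kd-tree" if use_kd_tree else "naive"
    print(f"{label}: SMOTE={smote:.0f} VIRTUAL={virtual:.0f} ratio={smote / virtual:.1f}")
```

Under this cost model, v and C cancel in the ratio, so the relative saving is X_R / |SV(+)| regardless of how the neighbors are found.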

6.4.3 Experiments

We conduct a series of experiments on Reuters-21578 and four UCI datasets to demonstrate the efficacy of VIRTUAL. The characteristics of the datasets are detailed in [22]. We compare VIRTUAL with two systems, AL and SMOTE. AL adopts the traditional AL strategy without preprocessing or creating any virtual instances during learning. SMOTE, on the other hand, preprocesses the data by creating virtual instances before training and uses RS in learning. The experiments bring out the advantages of adaptive virtual sample creation in VIRTUAL.
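To illustrate what creating virtual instances before training involves, the following is a minimal sketch of the standard SMOTE interpolation step: each positive instance is paired with one of its k nearest positive neighbors, and a new point is drawn on the line segment between them. It is not the implementation used in these experiments. The function names and the toy data are hypothetical, and the neighbor search is the naive one.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_oversample(positives, k=5, v=1, seed=0):
    """Create v virtual instances for every positive instance.

    Each virtual instance lies at a random position on the segment between
    a positive instance and one of its k nearest positive neighbors.
    """
    rng = random.Random(seed)
    virtual = []
    for i, x in enumerate(positives):
        others = [p for j, p in enumerate(positives) if j != i]
        # Naive neighbor search: sort all other positives by distance.
        neighbors = sorted(others, key=lambda p: squared_distance(x, p))[:k]
        if not neighbors:
            continue  # a single positive instance has no neighbor to pair with
        for _ in range(v):
            neighbor = rng.choice(neighbors)
            gap = rng.random()  # interpolation factor in [0, 1)
            virtual.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)))
    return virtual


# Toy minority class: the four corners of the unit square.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
virtual = smote_like_oversample(positives, k=2, v=3)
print(len(virtual), "virtual instances created")
```

Because every positive instance acts as a seed here, the number of virtual instances grows with the size of the minority class. VIRTUAL, by contrast, creates them only for positive support vectors, and does so during learning.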

Figures 6.8 and 6.9 provide details on the behavior of the three algorithms, SMOTE, AL, and VIRTUAL. For the Reuters datasets (Fig. 6.9), note that in all 10 categories, VIRTUAL outperforms AL in the g-means metric after saturation. The difference in performance is most pronounced in the more imbalanced categories, for example, corn, interest, and ship. In the less imbalanced datasets such as acq and earn, the difference in g-means between the two methods is less noticeable. The g-means of SMOTE converges much more slowly than that of both AL and VIRTUAL.
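For readers who want to reproduce the metric, the sketch below computes g-means under its usual definition, the geometric mean of sensitivity (true positive rate) and specificity (true negative rate). The function name and the toy labels are hypothetical and are not drawn from the datasets above.

```python
import math


def g_means(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against division by zero when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy imbalanced labels: 2 positives and 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
mixed = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(g_means(y_true, always_negative))
print(g_means(y_true, mixed))
```

A classifier that always predicts the majority class reaches 80% accuracy on these toy labels but has a g-means of zero, which is why the metric suits comparisons on imbalanced categories.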
