that is, the ones using synthetic minority class examples (SMOTE/Learn++) and the others using previous minority class examples (UB/SERA/MuSeRA/REA). The second category can be further divided into algorithms that use all previous minority class examples and those that use only part of them. The key idea of using partial previous minority class examples to compensate for the imbalanced class ratio is to find a measurement of the similarity between each previous minority class example and the current minority class set; this splits the algorithms in this subcategory into the ones using the Mahalanobis distance and the others using the k-nearest-neighbor rule (both measurements are sketched below). This completes the taxonomy of algorithms.
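To make the two similarity measurements concrete, the following is a minimal sketch (assuming NumPy); the function and variable names are illustrative, not taken from the original algorithms. Smaller Mahalanobis distances, and larger k-nearest-neighbor fractions, indicate previous minority examples more similar to the current minority class set.

import numpy as np

def mahalanobis_similarity(prev_minority, cur_minority):
    """Mahalanobis distance from each previous minority example to the
    current minority set; smaller distance means more similar."""
    mu = cur_minority.mean(axis=0)
    # Pseudo-inverse guards against a singular covariance matrix.
    cov_inv = np.linalg.pinv(np.cov(cur_minority, rowvar=False))
    diff = prev_minority - mu
    sq = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))

def knn_similarity(prev_minority, cur_chunk, cur_labels, k=5, minority=1):
    """Fraction of each previous example's k nearest neighbors in the
    current chunk that belong to the minority class; larger means more
    similar."""
    sims = []
    for x in prev_minority:
        d = np.linalg.norm(cur_chunk - x, axis=1)
        nearest = np.argsort(d)[:k]
        sims.append(np.mean(cur_labels[nearest] == minority))
    return np.array(sims)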
The algorithms are introduced through their pseudo-codes, followed by theoretical study and simulations. The results suggest that REA provides the most competitive performance among the algorithms compared. Nevertheless, considering the efficiency required in practice by stream-data applications, SERA may offer the best trade-off between performance and algorithmic complexity.
There are many interesting directions in which the study of learning from nonstationary streams with imbalanced class distributions can be pursued. First, the REA/MuSeRA algorithms need an efficient and concrete mechanism for removing hypotheses with obsolete knowledge on the fly, to account for limited resource availability as well as concept drift. For instance, one could explore integrating the hypothesis-pruning method used by Learn++ into REA/MuSeRA; a sketch of one possible pruning mechanism follows.
Second, the issue of compensating for the imbalanced class ratio can be approached from the opposite direction: one can remove less important majority class examples instead of explicitly augmenting the minority class data. The effect would be the same, with an obvious benefit: no synthetic or previous data needs to be accommodated in the data chunk, so the integrity of the target concept cannot be impaired. The random under-sampling method employed by UB can be regarded as a preliminary effort in this direction, as the following sketch illustrates.
Finally, there seems to be no record of the cost-sensitive learning framework being used to address this problem. One could directly assign different misclassification costs to minority and majority class examples during training, in the hope of better learning performance. Beyond this naive implementation, a smarter approach is to assign different misclassification costs to minority class examples, majority class examples, and previous minority class examples. The misclassification costs for previous minority class examples can even be set nonuniformly, according to how similar they are to the minority class set in the training data chunk under consideration.
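A hedged sketch of this nonuniform cost assignment follows: current minority examples receive the highest cost, majority examples a baseline cost, and previous minority examples a cost scaled by their similarity to the current minority set (e.g., the knn_similarity values from the earlier sketch, assumed to lie in [0, 1]). The cost values, the choice of classifier, and the use of scikit-learn's sample_weight are illustrative assumptions, not part of the original chapter.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cost_sensitive_fit(X_cur, y_cur, X_prev_min, similarity,
                       c_min=10.0, c_maj=1.0, minority=1):
    """Train one hypothesis with per-example misclassification costs."""
    weights_cur = np.where(y_cur == minority, c_min, c_maj)
    # Previous minority examples: cost grows with similarity in [0, 1].
    weights_prev = c_maj + (c_min - c_maj) * similarity
    X = np.vstack([X_cur, X_prev_min])
    y = np.concatenate([y_cur, np.full(len(X_prev_min), minority)])
    w = np.concatenate([weights_cur, weights_prev])
    return DecisionTreeClassifier().fit(X, y, sample_weight=w)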
7.6 ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation (NSF) under Grant ECCS 1053717 and the Army Research Office (ARO) under Grant W911NF-12-1-0378.