that is both balanced and extremely informative in terms of model training. Early
stopping criteria for halting the selection process of AL can further improve the
generalization of induced models. That is, a model trained on a small but infor-
mative subsample often offers performance far exceeding what can be achieved
by training on a large dataset drawn from the natural, skewed base rate. The
ability of AL to produce small, balanced training sets from large but imbalanced
problems is enhanced further by VIRTUAL, introduced in Section 6.4, in which
artificially generated instances supplement the pool of examples available to the
active selection mechanism.
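The selection process described above can be illustrated as pool-based uncertainty sampling with a simple early-stopping rule. The sketch below is a minimal illustration under stated assumptions, not the chapter's actual method: the synthetic two-Gaussian pool, the imbalance ratio, the logistic model, the margin threshold of 0.4, and the query budget of 40 are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced pool: 5% positives, 95% negatives (assumed data).
n_pos, n_neg = 50, 950
X = np.vstack([rng.normal(1.0, 1.0, (n_pos, 2)),
               rng.normal(-1.0, 1.0, (n_neg, 2))])
y = np.array([1] * n_pos + [0] * n_neg)

def fit_logreg(X, y, lr=0.1, steps=200):
    """Fit a logistic regression (with bias term) by gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))

# Seed the labeled set with one example of each class.
labeled = [int(np.where(y == 1)[0][0]), int(np.where(y == 0)[0][0])]
unlabeled = [i for i in range(len(y)) if i not in labeled]

for _ in range(40):                         # assumed query budget
    w = fit_logreg(X[labeled], y[labeled])
    p = predict_proba(w, X[unlabeled])
    margins = np.abs(p - 0.5)               # distance from decision boundary
    # Early stopping: halt the selection process once even the most
    # uncertain unlabeled instance is classified with high confidence.
    if margins.min() > 0.4:
        break
    pick = unlabeled[int(np.argmin(margins))]
    labeled.append(pick)
    unlabeled.remove(pick)

pool_rate = y.mean()          # base rate of positives in the pool (0.05)
train_rate = y[labeled].mean()  # positive rate in the acquired training set
```

Because queries concentrate near the decision boundary, where both classes appear, the acquired training set tends to be more balanced than the skewed pool, and the stopping rule halts acquisition once further queries would add little information.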
Additionally, it is noted throughout that the abilities of an AL system tend to
degrade as the imbalance of the underlying distribution increases. At moderate
imbalances, the quality of the resulting training set may still be sufficient to
produce usable statistical models, but at more substantial class imbalances it
may be difficult for a system based on AL to produce accurate models. Through-
out Sections 6.2-6.4, we illustrate a variety of AL techniques specially adapted
for imbalanced settings, techniques that may be considered by practitioners
facing difficult problems. In Sections 6.5 and 6.6, we note that as a problem's class
imbalance tends toward the extreme, the selective abilities of an AL heuristic
may fail completely. We present several alternative approaches for data acqui-
sition in Section 6.8, mechanisms that may alleviate the difficulties AL faces in
problematic domains. Among these alternatives are guided learning and ACS in
Section 6.8.1, and using associations between specific feature values and certain
classes in Section 6.8.2.
Class imbalance presents a challenge to statistical models and machine learning
systems in general. Because the abilities of these models are so tightly coupled
with the data used for training, it is crucial to consider the selection process that
generates this data. This chapter discusses precisely this problem. It is clear
that when building models for challenging imbalanced domains, AL is an aspect
of the approach that should not be ignored.