that is both balanced and extremely informative in terms of model training. Early
stopping criteria for halting the selection process of AL can further improve the
generalization of induced models. That is, a model trained on a small but infor-
mative subsample often offers performance far exceeding what can be achieved
by training on a large dataset drawn from the natural, skewed base rate. The
ability of AL to produce small, balanced training sets from large but imbalanced
problems is enhanced further by VIRTUAL, introduced in Section 6.4, in which
artificially generated instances supplement the pool of examples available to the
active selection mechanism.
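The selection process described above can be illustrated as pool-based uncertainty sampling with a simple early-stopping rule. The sketch below is a minimal illustration under stated assumptions, not the chapter's actual method: the synthetic two-Gaussian pool, the imbalance ratio, the logistic model, the margin threshold of 0.4, and the query budget of 40 are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced pool: 5% positives, 95% negatives (assumed data).
n_pos, n_neg = 50, 950
X = np.vstack([rng.normal(1.0, 1.0, (n_pos, 2)),
               rng.normal(-1.0, 1.0, (n_neg, 2))])
y = np.array([1] * n_pos + [0] * n_neg)

def fit_logreg(X, y, lr=0.1, steps=200):
    """Fit a logistic regression (with bias term) by gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))

# Seed the labeled set with one example of each class.
labeled = [int(np.where(y == 1)[0][0]), int(np.where(y == 0)[0][0])]
unlabeled = [i for i in range(len(y)) if i not in labeled]

for _ in range(40):                         # assumed query budget
    w = fit_logreg(X[labeled], y[labeled])
    p = predict_proba(w, X[unlabeled])
    margins = np.abs(p - 0.5)               # distance from decision boundary
    # Early stopping: halt the selection process once even the most
    # uncertain unlabeled instance is classified with high confidence.
    if margins.min() > 0.4:
        break
    pick = unlabeled[int(np.argmin(margins))]
    labeled.append(pick)
    unlabeled.remove(pick)

pool_rate = y.mean()          # base rate of positives in the pool (0.05)
train_rate = y[labeled].mean()  # positive rate in the acquired training set
```

Because queries concentrate near the decision boundary, where both classes appear, the acquired training set tends to be more balanced than the skewed pool, and the stopping rule halts acquisition once further queries would add little information.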
Additionally, it is noted throughout that the abilities of an AL system tend to
degrade as the imbalance of the underlying distribution increases. At moderate
imbalances, the quality of the resulting training set may still be sufficient to
produce usable statistical models, but at more substantial class imbalances it
may be difficult for a system based on AL to produce accurate models. Through-
out Sections 6.2-6.4, we illustrate a variety of AL techniques specially adapted
for imbalanced settings, techniques that may be considered by practitioners
facing difficult problems. In Sections 6.5 and 6.6, we note that as a problem's class
imbalance tends toward the extreme, the selective abilities of an AL heuristic
may fail completely. We present several alternative approaches for data acqui-
sition in Section 6.8, mechanisms that may alleviate the difficulties AL faces in
problematic domains. Among these alternatives are guided learning and ACS in
Section 6.8.1, and using associations between specific feature values and certain
classes in Section 6.8.2.
Class imbalance presents a challenge to statistical models and machine learning
systems in general. Because the abilities of these models are so tightly coupled
with the data used for training, it is crucial to consider the selection process that
generates this data. This chapter discusses precisely this problem. It is clear
that when building models for challenging imbalanced domains, AL is an aspect
of the approach that should not be ignored.