CLASS IMBALANCE AND ACTIVE LEARNING
JOSH ATTENBERG
Etsy, Brooklyn, NY, USA, and NYU Stern School of Business, New York, NY, USA
ŞEYDA ERTEKIN
MIT Sloan School of Management, Massachusetts Institute of Technology, Cambridge,
MA, USA
Abstract: The performance of a predictive model is tightly coupled with the
data used during training. While training on more examples will often result in a better-informed, more accurate model, limits on computer memory and the real-world costs associated with gathering labeled examples often constrain the amount of data that can be used for training. When the number of training examples is limited, it becomes important to consider carefully which examples are selected. In active learning (AL), the model itself plays a hands-on role
in the selection of examples for labeling from a large pool of unlabeled examples.
These examples are used for model training. Numerous studies have demonstrated,
both empirically and theoretically, the benefits of AL: Given a fixed budget, a
training system that interactively involves the current model in selecting the training
examples can often result in far greater accuracy than a system that simply
selects random training examples. Imbalanced settings provide special opportunities
and challenges for AL. For example, while AL can be used to build models that
counteract the harmful effects of learning under class imbalance, extreme class
imbalance can cause an AL strategy to “fail,” preventing the selection scheme from
choosing any useful examples for labeling. This chapter focuses on the interaction
between AL and class imbalance, discussing (i) AL techniques designed specifically
for dealing with imbalanced settings, (ii) strategies that leverage AL to overcome
the deleterious effects of class imbalance, (iii) how extreme class imbalance can
prevent AL systems from selecting useful examples, and (iv) alternatives to AL in these cases.
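The selection loop sketched in the abstract is straightforward to make concrete. Below is a minimal sketch of pool-based active learning with uncertainty sampling, assuming a scikit-learn logistic regression model and a synthetic, imbalanced unlabeled pool; the dataset, seed size, and labeling budget are illustrative choices, not the chapter's experimental setup.

# A minimal sketch of pool-based active learning with uncertainty sampling.
# The dataset, model, seed size, and budget are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic, imbalanced unlabeled pool: roughly 5% positive examples.
X_pool, y_pool = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# Seed the labeled set with a few examples of each class. (In practice the
# labels come from a human oracle; here we simply reveal y_pool on request.)
pos, neg = np.where(y_pool == 1)[0], np.where(y_pool == 0)[0]
labeled = list(rng.choice(pos, 5, replace=False)) + list(rng.choice(neg, 15, replace=False))
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

budget = 200  # fixed labeling budget
model = LogisticRegression(max_iter=1000)

for _ in range(budget):
    model.fit(X_pool[labeled], y_pool[labeled])
    # Uncertainty sampling: query the unlabeled example whose predicted
    # positive-class probability is closest to 0.5.
    proba = model.predict_proba(X_pool[unlabeled])[:, 1]
    query = unlabeled[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(query)
    unlabeled.remove(query)

print("labeled examples:", len(labeled),
      "positives found:", int(y_pool[labeled].sum()))

Under a fixed budget, comparing the resulting model against one trained on the same number of randomly selected labels illustrates the kind of gain the abstract describes; under far more extreme imbalance, the same uncertainty criterion may never locate any minority-class examples, which is the failure mode the chapter goes on to discuss.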