skewed datasets and their impact on the selections performed by an AL strategy.
Here, we discuss the impact significant class imbalance has on AL and illustrate
alternatives to traditional AL that may be considered when dealing with the most
difficult, highly skewed problems.
6.2 ACTIVE LEARNING FOR IMBALANCED PROBLEMS
This section provides background on the AL problem in the context of building
cost-effective classification models. We then discuss the challenges that AL
heuristics encounter in settings with significant class imbalance, followed by
strategies specialized in overcoming the difficulties imposed by this setting.
6.2.1 Background on Active Learning
AL is a specialized set of machine learning techniques developed for reducing the
annotation costs associated with gathering the training data required for building
predictive statistical models. In many applications, unlabeled data are cheap
relative to the cost of acquiring a ground-truth value of the target variable
for those data. For instance, the textual
content of a particular web page may be crawled readily, or the actions of a
user in a social network may be collected trivially by mining the web logs in
that network. However, knowing with some degree of certainty the topical
categorization of a particular web page, or identifying any malicious activity
of a user in a social network, is likely to require costly editorial review. These costs
restrict the number of examples that may be labeled, typically to a small fraction
of the overall population. Because the number of ground-truth labels is
constrained in practice, and because the performance of a predictive model
depends tightly on the examples in its training set, the benefits of selecting
those examples carefully are apparent. This importance is
further evidenced by the vast research literature on the topic.
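To make the budget constraint concrete, the following is a minimal sketch of
selecting examples for labeling one at a time under a fixed labeling budget. It
is an illustration under stated assumptions, not a method taken from this
chapter: the selection rule is plain uncertainty sampling, the model is a toy
one-dimensional nearest-centroid classifier, and the names fit_centroids,
uncertainty, select_and_label, and oracle are hypothetical. The oracle function
stands in for the costly editorial review described above.

```python
import random


def fit_centroids(labeled):
    """Return the mean feature value per class for (x, y) pairs, y in {0, 1}."""
    means = {}
    for cls in (0, 1):
        xs = [x for x, y in labeled if y == cls]
        means[cls] = sum(xs) / len(xs)  # assumes each class has >= 1 example
    return means


def uncertainty(x, means):
    """Score is highest when x is about equally far from both class centroids."""
    return -abs(abs(x - means[0]) - abs(x - means[1]))


def select_and_label(pool, oracle, seed_labeled, budget):
    """Spend at most `budget` label requests on the most uncertain examples."""
    labeled = list(seed_labeled)  # copy so the caller's lists are untouched
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        means = fit_centroids(labeled)  # re-train on all labels so far
        pick = max(pool, key=lambda x: uncertainty(x, means))
        pool.remove(pick)
        labeled.append((pick, oracle(pick)))  # one unit of budget spent
    return labeled, fit_centroids(labeled)


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [rng.random() for _ in range(200)]

    def oracle(x):
        # Hypothetical annotator: the true class boundary sits at 0.7.
        return int(x > 0.7)

    seed = [(0.0, 0), (1.0, 1)]  # one labeled example per class to start
    labeled, means = select_and_label(pool, oracle, seed, budget=10)
    print(len(labeled), means)
```

The seed set must contain at least one example of each class, since the toy
model cannot compute a centroid for a class it has never seen. Under heavy
class imbalance, even obtaining such a seed can be costly.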
While an in-depth literature review is beyond the scope of this chapter, for
context we provide a brief overview of some of the more broadly cited approaches
in AL. For a more thorough treatment of the history and details of AL, we direct
the reader to the excellent survey by Settles [1]. AL tends to focus on two
scenarios: (i) stream-based selection and (ii) pool-based selection. In
stream-based selection, unlabeled examples are presented one at a time to a
predictive model, which feeds predicted target values to a consuming process
and subsequently applies an AL heuristic to decide whether some budget should
be expended gathering this example's class label for subsequent re-training.
Pool-based AL, on the other hand, is typically an offline, iterative process.
Here, a large set of unlabeled examples is presented to an AL system. During
each epoch of this process, the AL system chooses one or more unlabeled
examples for labeling and subsequent model training. This proceeds until
the budget is exhausted or some stopping criterion is met. At this time, if the