CLASS IMBALANCE AND ACTIVE LEARNING - Imbalanced Learning: Foundations, Algorithms, and Applications

skewed datasets and their impact on the selections performed by an AL strategy.

Here, we discuss the impact significant class imbalance has on AL and illustrate

alternatives to traditional AL that may be considered when dealing with the most

difficult, highly skewed problems.

6.2 ACTIVE LEARNING FOR IMBALANCED PROBLEMS

The intent of this section is to provide the reader with some background on the

AL problem in the context of building cost-effective classification models. We

then discuss challenges encountered by AL heuristics in settings with significant

class imbalance. Finally, we discuss strategies specialized in overcoming the

difficulties imposed by this setting.

6.2.1 Background on Active Learning

AL is a specialized set of machine learning techniques developed for reducing the

annotation costs associated with gathering the training data required for building

predictive statistical models. In many applications, unlabeled data is relatively

cheap compared with the cost of acquiring a ground-truth value of the target

variable for that data. For instance, the textual

content of a particular web page may be crawled readily, or the actions of a

user in a social network may be collected trivially by mining the web logs in

that network. However, knowing with some degree of certainty the topical

categorization of a particular web page, or identifying any malicious activity of a

user in a social network is likely to require costly editorial review. These costs

restrict the number of examples that may be labeled, typically to a small fraction

of the overall population. Because practical constraints limit the number of

ground-truth labels available, and because the performance of a predictive model

depends tightly on the examples in its training set, the benefits of selecting

those examples carefully are apparent. This importance is

further evidenced by the vast research literature on the topic.

While an in-depth literature review is beyond the scope of this chapter, for

context we provide a brief overview of some of the more broadly cited approaches

in AL. For a more thorough treatment of the history and details of AL, we direct

the reader to the excellent survey by Settles [1]. AL tends to focus on two

scenarios. (i) In stream-based selection, unlabeled examples are presented one

at a time to a predictive model, which feeds predicted target values to a

consuming process and subsequently applies an AL heuristic to decide whether some

budget should be expended gathering this example's class label for subsequent

re-training. (ii) Pool-based AL, on the other hand, is typically an offline, iterative

process. Here, a large set of unlabeled examples is presented to an AL system.

During each epoch of this process, the AL system chooses one or more

unlabeled examples for labeling and subsequent model training. This proceeds until

the budget is exhausted or some stopping criterion is met. At this time, if the
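
The stream-based and pool-based scenarios described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method of any particular paper. The toy nearest-centroid model on one-dimensional features, the margin-based uncertainty heuristic, and the names `oracle`, `threshold`, `pool_based_al`, and `stream_based_al` are all hypothetical choices made only for this example.

```python
def fit_centroids(labeled):
    """Fit a toy nearest-centroid model: the mean feature value per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def margin(model, x):
    """Gap between the two smallest centroid distances (small = uncertain)."""
    d = sorted(abs(x - c) for c in model.values())
    return d[1] - d[0] if len(d) > 1 else 0.0


def pool_based_al(pool, oracle, seed_labeled, budget, batch_size=1):
    """Offline, iterative selection from a fixed pool of unlabeled examples."""
    labeled = list(seed_labeled)
    unlabeled = list(pool)
    model = fit_centroids(labeled)
    spent = 0
    while spent < budget and unlabeled:  # stop when the budget is exhausted
        # Assumed heuristic: query the examples the model is least sure about.
        unlabeled.sort(key=lambda x: margin(model, x))
        k = min(batch_size, budget - spent, len(unlabeled))
        batch, unlabeled = unlabeled[:k], unlabeled[k:]
        labeled.extend((x, oracle(x)) for x in batch)  # pay for k labels
        spent += k
        model = fit_centroids(labeled)  # retrain after each epoch
    return model, labeled


def stream_based_al(stream, oracle, seed_labeled, budget, threshold):
    """Examples arrive one at a time; predict first, then decide whether to query."""
    labeled = list(seed_labeled)
    model = fit_centroids(labeled)
    predictions = []
    for x in stream:
        # The prediction is emitted to the consuming process regardless.
        predictions.append(min(model, key=lambda y: abs(x - model[y])))
        if budget > 0 and margin(model, x) < threshold:
            labeled.append((x, oracle(x)))  # spend one unit of budget
            budget -= 1
            model = fit_centroids(labeled)
    return model, predictions, labeled
```

In both functions `oracle` stands in for the costly editorial review that supplies a ground-truth label. The only difference between the two sketches is when the selection decision is made: once per arriving example in the stream setting, and once per epoch over the whole remaining pool in the pool setting.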
