6.1 INTRODUCTION
The rich history of predictive modeling has culminated in a diverse set of
techniques capable of making accurate predictions on many real-world problems.
Many of these techniques demand training, whereby a set of instances with
ground-truth labels (values of a dependent variable) is observed by a model-building
process that attempts to capture, at least in part, the relationship between
the features of the instances and their labels. The resultant model can be applied
to instances for which the label is not known, in order to estimate or predict the labels.
These predictions depend not only on the functional structure of the model itself,
but also on the particular data with which the model was trained. The accuracy of
the predicted labels depends heavily on the model's ability to capture an unbiased
and sufficient understanding of the characteristics of different classes; in problems
where the prevalence of classes is imbalanced, it is necessary to prevent the
resultant model from being skewed toward the majority class and to ensure that
the model is capable of reflecting the true nature of the minority class.
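As a purely illustrative sketch of this skew, the short Python snippet below trains a plain logistic regression on a synthetic dataset with a 95/5 class split; the dataset, model, and split are assumptions made for the example, not choices prescribed by this chapter. On such data, overall accuracy is typically high even when recall on the minority class is poor:

# A minimal, hypothetical illustration: a classifier trained on an
# imbalanced synthetic dataset can report strong overall accuracy
# while capturing little of the minority class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic two-class problem; class 1 is the 5% minority.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

print("overall accuracy:", accuracy_score(y_te, pred))  # typically high
print("minority recall: ", recall_score(y_te, pred))    # typically much lower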
Another consequence of class imbalance is observed in domains where the
ground-truth labels in the dataset are not available beforehand and need to be
gathered on demand at some cost. The costs associated with collecting labels
may stem from human labor or from costly incentives, interventions, or
experiments. In these settings, simply labeling all available instances may not be
practicable because of budgetary constraints or simply a strong desire to be
cost-efficient. For example, with a labeling budget of 100 and a minority prevalence
of 1%, labeling instances uniformly at random would yield only about one minority
example in expectation. As in predictive modeling with imbalanced classes, the goal here
is to ensure that the budget is not predominantly expended on acquiring labels
for majority class instances, and that the set of instances selected for
labeling contains a comparable number of minority class instances as well.
In the context of learning from imbalanced datasets, the role of active
learning (AL) can be viewed from two different perspectives. The first perspective
considers the case where the labels for all the examples in a reasonably large,
imbalanced dataset are readily available. The role of AL in this case is to reduce,
and potentially eliminate, any adverse effects that the class imbalance can have
on the model's generalization performance. The other perspective addresses the
setting where we have prior knowledge that the dataset is imbalanced, and we
would like to employ AL to select informative examples both from the majority
and minority classes for labeling, subject to the constraints of a given budget. The
first perspective focuses on AL's ability to address class imbalance, whereas the
second perspective is concerned with the impact of class imbalance on the
sampling performance of the active learner. The intent of this chapter is to present a
comprehensive analysis of the interplay between AL and class imbalance. In particular,
we first present techniques for dealing with the class imbalance problem using
AL and discuss how AL can alleviate the issues that stem from class imbalance.
We show that AL, even without any adjustments that target class imbalance, is
in most cases an effective strategy for obtaining a balanced view of the dataset.
It is also possible to further improve the effectiveness of AL by tuning its
sampling strategy in a class-specific way. Additionally, we will focus on
dealing with highly imbalanced datasets.
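To make the sampling side of this interplay concrete, the sketch below implements plain pool-based uncertainty sampling under a fixed labeling budget; the synthetic pool, the seed set, and the budget of 200 are illustrative assumptions, not the chapter's method. Because the instances a discriminative model is least certain about tend to lie near the class boundary, the fraction of minority labels among the queried instances typically exceeds the minority prevalence in the pool:

# A hypothetical sketch of pool-based uncertainty sampling under a budget.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)

# Seed the learner with a few labeled examples from each class.
labeled = list(rng.choice(np.where(y == 0)[0], size=10, replace=False))
labeled += list(rng.choice(np.where(y == 1)[0], size=10, replace=False))
pool = [i for i in range(len(X)) if i not in set(labeled)]

budget = 200  # total number of labels we can afford to purchase
model = LogisticRegression(max_iter=1000)
while len(labeled) < budget:
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])[:, 1]
    # Query the pool instance whose posterior is closest to 0.5,
    # i.e., the one the current model is least certain about.
    chosen = pool.pop(int(np.argmin(np.abs(proba - 0.5))))
    labeled.append(chosen)  # spend one unit of budget on this label

print("minority fraction in pool:   ", float(y.mean()))
print("minority fraction of queries:", float(y[labeled].mean()))

In practice one would batch queries and refit less often, but even this naive loop illustrates the "balanced view" effect discussed above: the queried set is considerably less skewed than the pool it was drawn from.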