6.1 INTRODUCTION
The rich history of predictive modeling has culminated in a diverse set of
techniques capable of making accurate predictions on many real-world problems.
Many of these techniques demand training, whereby a set of instances with
ground-truth labels (values of a dependent variable) is observed by a model-building
process that attempts to capture, at least in part, the relationship between
the features of the instances and their labels. The resultant model can be applied
to instances for which the label is not known, in order to estimate or predict the labels.
These predictions depend not only on the functional structure of the model itself,
but also on the particular data with which the model was trained. The accuracy of
the predicted labels depends heavily on the model's ability to capture an unbiased
and sufficient understanding of the characteristics of different classes; in problems
where the prevalence of classes is imbalanced, it is necessary to prevent the
resultant model from being skewed toward the majority class and to ensure that
the model is capable of reflecting the true nature of the minority class.
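As a purely illustrative sketch of this skew, the short Python snippet below trains a plain logistic regression on a synthetic dataset with a 95/5 class split; the dataset, model, and split are assumptions made for the example, not choices prescribed by this chapter. On such data, overall accuracy is typically high even when recall on the minority class is poor:

# A minimal, hypothetical illustration: a classifier trained on an
# imbalanced synthetic dataset can report strong overall accuracy
# while capturing little of the minority class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic two-class problem; class 1 is the 5% minority.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

print("overall accuracy:", accuracy_score(y_te, pred))  # typically high
print("minority recall: ", recall_score(y_te, pred))    # typically much lower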
Another consequence of class imbalance is observed in domains where the
ground-truth labels in the dataset are not available beforehand and need to be
gathered on demand at some cost. The costs associated with collecting labels
may stem from human labor or from costly incentives, interventions, or
experiments. In these settings, simply labeling all available instances may not be
practicable because of budgetary constraints or simply a strong desire to be
cost-efficient. For example, with a labeling budget of 100 and a minority prevalence
of 1%, labeling instances uniformly at random would yield only about one minority
example in expectation. As in predictive modeling with imbalanced classes, the goal here
is to ensure that the budget is not predominantly expended on acquiring labels
for majority class instances, and that the set of instances selected for
labeling contains a comparable number of minority class instances as well.
In the context of learning from imbalanced datasets, the role of active
learning (AL) can be viewed from two different perspectives. The first perspective
considers the case where the labels for all the examples in a reasonably large,
imbalanced dataset are readily available. The role of AL in this case is to reduce,
and potentially eliminate, any adverse effects that the class imbalance can have
on the model's generalization performance. The other perspective addresses the
setting where we have prior knowledge that the dataset is imbalanced, and we
would like to employ AL to select informative examples both from the majority
and minority classes for labeling, subject to the constraints of a given budget. The
first perspective focuses on AL's ability to address class imbalance, whereas the
second perspective is concerned with the impact of class imbalance on the
sampling performance of the active learner. The intent of this chapter is to present a
comprehensive analysis of the interplay between AL and class imbalance. In particular,
we first present techniques for dealing with the class imbalance problem using
AL and discuss how AL can alleviate the issues that stem from class imbalance.
We show that AL, even without any adjustments that target class imbalance, is
in most cases an effective strategy for obtaining a balanced view of the dataset.
It is also possible to further improve the effectiveness of AL by tuning its
sampling strategy in a class-specific way. Additionally, we will focus on
dealing with highly imbalanced datasets.
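To make the sampling side of this interplay concrete, the sketch below implements plain pool-based uncertainty sampling under a fixed labeling budget; the synthetic pool, the seed set, and the budget of 200 are illustrative assumptions, not the chapter's method. Because the instances a discriminative model is least certain about tend to lie near the class boundary, the fraction of minority labels among the queried instances typically exceeds the minority prevalence in the pool:

# A hypothetical sketch of pool-based uncertainty sampling under a budget.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)

# Seed the learner with a few labeled examples from each class.
labeled = list(rng.choice(np.where(y == 0)[0], size=10, replace=False))
labeled += list(rng.choice(np.where(y == 1)[0], size=10, replace=False))
pool = [i for i in range(len(X)) if i not in set(labeled)]

budget = 200  # total number of labels we can afford to purchase
model = LogisticRegression(max_iter=1000)
while len(labeled) < budget:
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])[:, 1]
    # Query the pool instance whose posterior is closest to 0.5,
    # i.e., the one the current model is least certain about.
    chosen = pool.pop(int(np.argmin(np.abs(proba - 0.5))))
    labeled.append(chosen)  # spend one unit of budget on this label

print("minority fraction in pool:   ", float(y.mean()))
print("minority fraction of queries:", float(y[labeled].mean()))

In practice one would batch queries and refit less often, but even this naive loop illustrates the "balanced view" effect discussed above: the queried set is considerably less skewed than the pool it was drawn from.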