2.1 INTRODUCTION
Many of the machine-learning and data-mining problems that we study, whether
they are in business, science, medicine, or engineering, involve some form of
data imbalance. The imbalance is often an integral part of the problem and in
virtually every case the less frequently occurring entity is the one that we are
most interested in. For example, those working on fraud detection will focus on
identifying the fraudulent transactions rather than on the more common legiti-
mate transactions [1], a telecommunications engineer will be far more interested
in identifying the equipment about to fail than the equipment that will remain
operational [2], and an industrial engineer will be more likely to focus on weld
flaws than on welds that are completed satisfactorily [3].
In all these situations, it is far more important to accurately predict or identify
the rarer case than the more common case, and this is reflected in the costs
associated with errors in the predictions and classifications. For example, if we
predict that telecommunication equipment is going to fail and it does not, we
may incur some modest inconvenience and cost if the equipment is swapped out
unnecessarily, but if we predict that equipment is not going to fail and it does,
then we incur a much more significant cost when service is disrupted. In the case
of medical diagnosis, the costs are even clearer: while a false-positive diagnosis
may lead to a more expensive follow-up test and patient anxiety, a false-negative
diagnosis could result in death if a treatable condition is not identified.
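The cost asymmetry described above can be made concrete with a small sketch. The cost figures below are invented for illustration (the chapter does not specify numeric values); the point is only that a false negative can outweigh many false positives:

```python
# Illustrative (hypothetical) costs for the equipment-failure example:
# a false positive means an unnecessary swap-out; a false negative
# means a service disruption. The dollar amounts are assumptions.
COST_FP = 50      # modest cost of swapping out healthy equipment
COST_FN = 5000    # much larger cost of an unpredicted failure

def expected_cost(fp, fn, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total cost of a classifier's errors under an asymmetric cost matrix."""
    return fp * cost_fp + fn * cost_fn

# A model with many false alarms can still be far cheaper than one
# that misses even a few real failures:
print(expected_cost(fp=40, fn=1))    # 40*50 + 1*5000 = 7000
print(expected_cost(fp=2, fn=10))    # 2*50 + 10*5000 = 50100
```

This is why overall accuracy is a poor objective in such settings: the second classifier makes fewer total errors (12 versus 41) yet incurs a far higher cost.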
This chapter covers the foundations of imbalanced learning.
It begins by providing important background information and terminology and
then describes the fundamental issues associated with learning from imbalanced
data. This description provides the foundation for understanding the imbalanced
learning problem. This chapter then categorizes the methods for handling class
imbalance and maps each method to the fundamental issue it addresses.
This mapping is important because many research papers on imbalanced learning
fail to provide a comprehensive description of how or why these methods work,
or what underlying issue(s) they address. This chapter provides a good overview
of the imbalanced learning problem and describes some of the key work in the
area, but it is not intended to provide either a detailed description of the methods
used for dealing with imbalanced data or a comprehensive literature survey.
Details on many of the methods are provided in subsequent chapters of this
book.
2.2 BACKGROUND
A full appreciation of the issues associated with imbalanced data requires some
important background knowledge. In this section, we look at what it means for
a dataset to be imbalanced, what impact class imbalance has on learning, the
role of between-class and within-class imbalances, and how imbalance applies to
unsupervised learning tasks.
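As a preview of what it means for a dataset to be imbalanced, one common summary statistic is the imbalance ratio: the size of the majority class divided by the size of the minority class. A minimal sketch (the toy label counts are assumptions, not data from the chapter):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the majority-class count to the minority-class count.

    A balanced two-class dataset has ratio 1.0; larger values
    indicate stronger between-class imbalance.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy fraud-detection labels: 1 = fraudulent (rare), 0 = legitimate
labels = [0] * 990 + [1] * 10
print(imbalance_ratio(labels))  # 99.0
```

Note that this single number captures only between-class imbalance; within-class imbalance (rare subconcepts inside one class) is discussed separately below.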