2.1 INTRODUCTION
Many of the machine-learning and data-mining problems that we study, whether
they are in business, science, medicine, or engineering, involve some form of
data imbalance. The imbalance is often an integral part of the problem and in
virtually every case the less frequently occurring entity is the one that we are
most interested in. For example, those working on fraud detection will focus on
identifying the fraudulent transactions rather than on the more common legiti-
mate transactions [1], a telecommunications engineer will be far more interested
in identifying the equipment about to fail than the equipment that will remain
operational [2], and an industrial engineer will be more likely to focus on weld
flaws than on welds that are completed satisfactorily [3].
In all these situations, it is far more important to accurately predict or identify
the rarer case than the more common case, and this is reflected in the costs
associated with errors in the predictions and classifications. For example, if we
predict that telecommunication equipment is going to fail and it does not, we
may incur some modest inconvenience and cost if the equipment is swapped out
unnecessarily, but if we predict that equipment is not going to fail and it does,
then we incur a much more significant cost when service is disrupted. In the case
of medical diagnosis, the costs are even clearer: while a false-positive diagnosis
may lead to a more expensive follow-up test and patient anxiety, a false-negative
diagnosis could result in death if a treatable condition is not identified.
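The cost asymmetry described above can be made concrete with a small sketch. The cost figures below are invented for illustration (the chapter does not specify numeric values); the point is only that a false negative can outweigh many false positives:

```python
# Illustrative (hypothetical) costs for the equipment-failure example:
# a false positive means an unnecessary swap-out; a false negative
# means a service disruption. The dollar amounts are assumptions.
COST_FP = 50      # modest cost of swapping out healthy equipment
COST_FN = 5000    # much larger cost of an unpredicted failure

def expected_cost(fp, fn, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total cost of a classifier's errors under an asymmetric cost matrix."""
    return fp * cost_fp + fn * cost_fn

# A model with many false alarms can still be far cheaper than one
# that misses even a few real failures:
print(expected_cost(fp=40, fn=1))    # 40*50 + 1*5000 = 7000
print(expected_cost(fp=2, fn=10))    # 2*50 + 10*5000 = 50100
```

This is why overall accuracy is a poor objective in such settings: the second classifier makes fewer total errors (12 versus 41) yet incurs a far higher cost.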
This chapter covers the foundations of imbalanced learning.
It begins by providing important background information and terminology and
then describes the fundamental issues associated with learning from imbalanced
data. This description provides the foundation for understanding the imbalanced
learning problem. This chapter then categorizes the methods for handling class
imbalance and maps each method to the fundamental issue it addresses.
This mapping is important because many research papers on imbalanced learning
fail to provide a comprehensive description of how or why these methods work,
or what underlying issue(s) they address. This chapter provides a good overview
of the imbalanced learning problem and describes some of the key work in the
area, but it is not intended to provide either a detailed description of the methods
used for dealing with imbalanced data or a comprehensive literature survey.
Details on many of the methods are provided in subsequent chapters of this
book.
2.2 BACKGROUND
A full appreciation of the issues associated with imbalanced data requires some
important background knowledge. In this section, we look at what it means for
a dataset to be imbalanced, what impact class imbalance has on learning, the
role of between-class and within-class imbalances, and how imbalance applies to
unsupervised learning tasks.
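As a preview of what it means for a dataset to be imbalanced, one common summary statistic is the imbalance ratio: the size of the majority class divided by the size of the minority class. A minimal sketch (the toy label counts are assumptions, not data from the chapter):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the majority-class count to the minority-class count.

    A balanced two-class dataset has ratio 1.0; larger values
    indicate stronger between-class imbalance.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy fraud-detection labels: 1 = fraudulent (rare), 0 = legitimate
labels = [0] * 990 + [1] * 10
print(imbalance_ratio(labels))  # 99.0
```

Note that this single number captures only between-class imbalance; within-class imbalance (rare subconcepts inside one class) is discussed separately below.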