of two or all of them. Imbalanced learning arises in regression, classification,
and clustering alike. In this chapter, we provide a brief introduction to the
problem formulation, research methods, and challenges and opportunities in this
field. The chapter is based on a recent comprehensive survey and critical review
of imbalanced learning presented in [1]; interested readers may refer to that
survey paper for further details regarding imbalanced learning.
Imbalanced learning not only presents significant new challenges to the data
research community but also raises many critical questions in real-world data-
intensive applications, ranging from civilian applications such as financial and
biomedical data analysis to security- and defense-related applications such as
surveillance and military data analysis [1]. This growing interest in imbalanced
learning is reflected in the significant recent increase in the number of
publications in this field, as well as in the organization of dedicated
workshops, conferences, symposia, and special issues [2, 3, 4].
To start with a simple example of imbalanced learning, let us consider a pop-
ular case study in biomedical data analysis [1]. Consider the “Mammography
Data Set,” a collection of images acquired from a series of mammography exam-
inations performed on a set of distinct patients [5-7]. For such a dataset, the
natural classes that arise are “Positive” or “Negative” for an image representa-
tive of a “cancerous” or “healthy” patient, respectively. From experience, one
would expect the number of noncancerous patients to exceed greatly the number
of cancerous patients; indeed, this dataset contains 10,923 “Negative” (majority
class) and 260 “Positive” (minority class) samples. Ideally, we would like a
classifier that provides a balanced degree of predictive accuracy for both the
minority and majority classes on the dataset. With many standard learning
algorithms, however, classifiers tend to exhibit a severely imbalanced degree of
accuracy, with the majority class reaching close to 100% accuracy while the
minority class reaches only 0 to 10%; see, for instance, [5, 7]. Suppose a
classifier achieves 5% accuracy on the minority class of the mammography
dataset. Arithmetically, this means that 247 of the 260 minority samples are
misclassified as majority samples (i.e., 247 cancerous patients are diagnosed
as noncancerous).
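The arithmetic behind this figure can be sketched as follows, using the class counts given above; the 5% minority-class accuracy is the hypothetical figure from the discussion, not a measured result:

```python
# Class counts from the mammography dataset described in the text.
n_negative = 10_923   # majority class ("healthy")
n_positive = 260      # minority class ("cancerous")

# Hypothetical minority-class accuracy of 5%, as in the example.
minority_accuracy = 0.05

# Positives correctly identified vs. missed (false negatives):
true_positives = round(n_positive * minority_accuracy)   # 13
false_negatives = n_positive - true_positives            # 247

print(f"{false_negatives} cancerous patients labeled noncancerous")
```

This is why a seemingly small per-class accuracy gap translates into hundreds of dangerous misdiagnoses once class counts are taken into account.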
In medicine, the ramifications of such errors can be overwhelmingly costly, far
more so than classifying a noncancerous patient as cancerous [8]. Furthermore,
this suggests that the conventional evaluation practice of using a single
assessment criterion, such as overall accuracy or error rate, does not provide
adequate information in the case of imbalanced learning. In
an extreme case, if a given dataset includes 1% of minority class examples and
99% of majority class examples, a naive approach that classifies every example
as a majority class example achieves an accuracy of 99%. Taken at face value,
99% accuracy across the entire dataset appears superb; yet this figure fails to
reflect the fact that none of the minority examples is identified, even though
in many situations those minority examples are of far greater interest. This
clearly demonstrates the need to revisit the assessment metrics for imbalanced
learning, which is discussed in Chapter 8.
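A minimal sketch makes the point concrete. Assuming the 1%/99% split from the example and a "classifier" that labels everything as the majority class, overall accuracy and minority-class recall diverge completely (the counts below are illustrative, not from any real dataset):

```python
# Illustrative counts: 1% minority, 99% majority out of 10,000 examples.
n_majority, n_minority = 9_900, 100

# A naive classifier predicts "majority" for every example, so every
# majority example is correct and every minority example is missed.
correct = n_majority
total = n_majority + n_minority

overall_accuracy = correct / total   # 0.99 -- looks superb
minority_recall = 0 / n_minority     # 0.0  -- no minority example found

print(overall_accuracy, minority_recall)
```

The two numbers summarize the same predictions, yet only the second reveals that the classifier is useless for the class we actually care about.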