of two or all of them. Imbalanced learning arises in regression, classification,
and clustering alike. In this chapter, we provide a brief introduction to the
problem formulation, research methods, and challenges and opportunities in this
field. The chapter is based on a recent comprehensive survey and critical review
of imbalanced learning presented in [1]; interested readers may refer to that
survey paper for further details regarding imbalanced learning.
Imbalanced learning not only presents significant new challenges to the data
research community but also raises many critical questions in real-world data-
intensive applications, ranging from civilian applications such as financial and
biomedical data analysis to security- and defense-related applications such as
surveillance and military data analysis [1]. This growing interest in imbalanced
learning is reflected in the significant recent increase in the number of
publications in this field, as well as in the organization of dedicated
workshops, conferences, symposia, and special issues [2, 3, 4].
To start with a simple example of imbalanced learning, let us consider a pop-
ular case study in biomedical data analysis [1]. Consider the “Mammography
Data Set,” a collection of images acquired from a series of mammography exam-
inations performed on a set of distinct patients [5-7]. For such a dataset, the
natural classes that arise are “Positive” or “Negative” for an image representa-
tive of a “cancerous” or “healthy” patient, respectively. From experience, one
would expect the number of noncancerous patients to exceed greatly the number
of cancerous patients; indeed, this dataset contains 10,923 “Negative” (majority
class) and 260 “Positive” (minority class) samples. Ideally, we would like a
classifier that provides a balanced degree of predictive accuracy for both the
minority and majority classes on the dataset. With many standard learning
algorithms, however, classifiers tend to exhibit a severely imbalanced degree of
accuracy, with the majority class reaching close to 100% accuracy while the
minority class reaches only 0 to 10%; see, for instance, [5, 7]. Suppose a
classifier achieves 5% accuracy on the minority class of the mammography
dataset. Arithmetically, this means that 247 of the 260 minority samples are
misclassified as majority samples (i.e., 247 cancerous patients are diagnosed
as noncancerous).
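The arithmetic behind this figure can be sketched as follows, using the class counts given above; the 5% minority-class accuracy is the hypothetical figure from the discussion, not a measured result:

```python
# Class counts from the mammography dataset described in the text.
n_negative = 10_923   # majority class ("healthy")
n_positive = 260      # minority class ("cancerous")

# Hypothetical minority-class accuracy of 5%, as in the example.
minority_accuracy = 0.05

# Positives correctly identified vs. missed (false negatives):
true_positives = round(n_positive * minority_accuracy)   # 13
false_negatives = n_positive - true_positives            # 247

print(f"{false_negatives} cancerous patients labeled noncancerous")
```

This is why a seemingly small per-class accuracy gap translates into hundreds of dangerous misdiagnoses once class counts are taken into account.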
In medicine, the ramifications of such errors can be overwhelmingly costly, far
more so than classifying a noncancerous patient as cancerous [8]. Furthermore,
this suggests that the conventional evaluation practice of using a single
assessment criterion, such as overall accuracy or error rate, does not provide
adequate information in the case of imbalanced learning. In
an extreme case, if a given dataset includes 1% of minority class examples and
99% of majority class examples, a naive approach that classifies every example
as a majority class example achieves an accuracy of 99%. Taken at face value,
99% accuracy across the entire dataset appears superb; yet this figure fails to
reflect the fact that none of the minority examples is identified, even though
in many situations those minority examples are of far greater interest. This
clearly demonstrates the need to revisit the assessment metrics for imbalanced
learning, which is discussed in Chapter 8.
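A minimal sketch makes the point concrete. Assuming the 1%/99% split from the example and a "classifier" that labels everything as the majority class, overall accuracy and minority-class recall diverge completely (the counts below are illustrative, not from any real dataset):

```python
# Illustrative counts: 1% minority, 99% majority out of 10,000 examples.
n_majority, n_minority = 9_900, 100

# A naive classifier predicts "majority" for every example, so every
# majority example is correct and every minority example is missed.
correct = n_majority
total = n_majority + n_minority

overall_accuracy = correct / total   # 0.99 -- looks superb
minority_recall = 0 / n_minority     # 0.0  -- no minority example found

print(overall_accuracy, minority_recall)
```

The two numbers summarize the same predictions, yet only the second reveals that the classifier is useless for the class we actually care about.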