IMBALANCED DATASETS: FROM SAMPLING TO CLASSIFIERS
T. RYAN HOENS AND NITESH V. CHAWLA
Department of Computer Science and Engineering, The University of Notre Dame, Notre
Dame, IN, USA
Abstract: Classification is one of the most fundamental tasks in the machine
learning and data-mining communities. One of the most common challenges faced
when trying to perform classification is the class imbalance problem. A dataset is
considered imbalanced if the class of interest (positive or minority class) is relatively
rare as compared to the other classes (negative or majority classes). As a result, the
classifier can be heavily biased toward the majority class. A number of sampling
approaches, ranging from under-sampling to over-sampling, have been developed
to solve the problem of class imbalance. One challenge with sampling strategies
is deciding how much to sample, which depends on the sampling strategy
that is deployed. While a wrapper approach may be used to search for the
sampling strategy and amounts, it can quickly become computationally prohibitive.
To that end, recent research has also focused on developing novel classification
algorithms that are class imbalance (skew) insensitive. In this chapter, we provide
an overview of the sampling strategies as well as classification algorithms developed
for countering class imbalance. In addition, we consider the issues of correctly
evaluating the performance of a classifier on imbalanced datasets and present a
discussion on various metrics.
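
As a rough illustration of two ideas named in the abstract, the sketch below shows random under-sampling and random over-sampling on a toy dataset, and then shows why plain accuracy can mislead when classes are skewed. It is a minimal sketch under stated assumptions, not the chapter's own method: the function names (random_undersample, random_oversample, accuracy, recall), the 95-to-5 toy data, and the choice to balance the classes exactly are all hypothetical choices made for this example.

```python
import random
from collections import Counter


def _split_indices(y, minority_label):
    """Return (minority indices, indices of every other class)."""
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    return minority, majority


def random_undersample(X, y, minority_label, seed=0):
    """Randomly drop majority examples until both groups have equal size.

    Assumes the minority group is no larger than the majority group.
    """
    rng = random.Random(seed)
    minority, majority = _split_indices(y, minority_label)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority examples (with replacement) up to the majority size."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y, minority_label)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    kept = minority + majority + extra
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true positive-class examples that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Hypothetical toy data: 5 positive (minority) and 95 negative (majority) examples.
X = [[float(i)] for i in range(100)]
y = [1 if i < 5 else 0 for i in range(100)]

X_under, y_under = random_undersample(X, y, minority_label=1)
X_over, y_over = random_oversample(X, y, minority_label=1)
print(Counter(y_under))  # 5 examples of each class
print(Counter(y_over))   # 95 examples of each class

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * len(y)
print(accuracy(y, y_pred))            # 0.95
print(recall(y, y_pred, positive=1))  # 0.0
```

The always-negative predictor scores 95% accuracy while finding none of the positive examples, which is the kind of bias toward the majority class that motivates both resampling and skew-aware evaluation metrics. Balancing to an exact 1:1 ratio is only one possible sampling amount; as the abstract notes, choosing that amount is itself a challenge.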
3.1 INTRODUCTION
A common problem faced in data mining is dealing with class imbalance. A dataset is
said to be imbalanced if one class (called the majority, or negative, class) vastly