IMBALANCED DATASETS: FROM SAMPLING TO CLASSIFIERS - Imbalanced Learning: Foundations, Algorithms, and Applications


IMBALANCED DATASETS: FROM SAMPLING TO CLASSIFIERS

T. RYAN HOENS AND NITESH V. CHAWLA

Department of Computer Science and Engineering, The University of Notre Dame, Notre Dame, IN, USA

Abstract: Classification is one of the most fundamental tasks in the machine learning and data-mining communities. One of the most common challenges faced when trying to perform classification is the class imbalance problem. A dataset is considered imbalanced if the class of interest (positive or minority class) is relatively rare as compared to the other classes (negative or majority classes). As a result, the classifier can be heavily biased toward the majority class. A number of sampling approaches, ranging from under-sampling to over-sampling, have been developed to solve the problem of class imbalance. One challenge with sampling strategies is deciding how much to sample, which is obviously conditioned on the sampling strategy that is deployed. While a wrapper approach may be used to discover the sampling strategy and amounts, it can quickly become computationally prohibitive. To that end, recent research has also focused on developing novel classification algorithms that are class imbalance (skew) insensitive. In this chapter, we provide an overview of the sampling strategies as well as classification algorithms developed for countering class imbalance. In addition, we consider the issues of correctly evaluating the performance of a classifier on imbalanced datasets and present a discussion on various metrics.
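As a rough illustration of two points raised in the abstract (bias toward the majority class and the role of sampling), the following Python sketch may help. It is not taken from the chapter: the toy labels, the helper names `random_undersample`, `accuracy`, and `recall`, and the choice of plain random under-sampling are all assumptions made for this example. It shows that a predictor which always outputs the majority class can score high accuracy while finding no minority examples, and that random under-sampling is one simple way to equalize class counts.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted indices that keep an equal-sized random subset of every class.

    The subset size is the size of the rarest class, so the minority class
    is kept whole and larger classes are randomly thinned (hypothetical helper).
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positive-class examples that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    n_pos = sum(t == positive for t in y_true)
    return tp / n_pos if n_pos else 0.0


# Assumed toy data: 990 negatives (label 0) and 10 positives (label 1).
y = [0] * 990 + [1] * 10

# A degenerate "classifier" that always predicts the majority class.
always_negative = [0] * len(y)
acc = accuracy(y, always_negative)
rec = recall(y, always_negative)

# Random under-sampling: keep all positives and an equal number of negatives.
balanced_idx = random_undersample(y)
balanced_counts = Counter(y[i] for i in balanced_idx)

print(f"accuracy={acc:.2f}, minority recall={rec:.2f}")
print("class counts after under-sampling:", dict(balanced_counts))
```

Under-sampling here discards most of the majority examples, which is one reason the question of how much to sample matters; over-sampling the minority class would be the complementary choice.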

3.1 INTRODUCTION

A common problem faced in data mining is dealing with class imbalance. A dataset is said to be imbalanced if one class (called the majority, or negative, class) vastly
