A classifier trained on mere copies of existing minority instances might fit the training data too closely and, as a result, not generalize well to unseen instances.
In order to overcome this issue, Chawla et al. developed a method of cre-
ating synthetic instances instead of merely copying existing instances in the
dataset. This technique is known as the synthetic minority over-sampling technique (SMOTE) [6]. As mentioned, in SMOTE, the training set is altered by adding synthetically generated minority class instances, causing the class distribution to become more balanced. We say that the created instances are synthetic, as they are, in general, new minority instances interpolated from existing minority class instances rather than exact copies of them.
To create the new synthetic minority class instances, SMOTE first selects
a minority class instance a at random and finds its k nearest minority class
neighbors. The synthetic instance is then created by choosing one of the k nearest
neighbors b at random and connecting a and b to form a line segment in the
feature space. The synthetic instances are generated as a convex combination of
the two chosen instances a and b .
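As a rough sketch of this step (assuming NumPy and a purely numeric feature space; the function name and parameters are illustrative rather than taken from [6]), one synthetic instance can be generated as follows:

```python
import numpy as np

def smote_sample(X_min, k=5, rng=None):
    """Generate one synthetic instance from the minority-class matrix X_min."""
    rng = np.random.default_rng() if rng is None else rng
    # Select a minority instance a at random.
    a = X_min[rng.integers(len(X_min))]
    # Find its k nearest minority-class neighbors (position 0 is a itself).
    dists = np.linalg.norm(X_min - a, axis=1)
    neighbor_idx = np.argsort(dists)[1:k + 1]
    # Choose one neighbor b at random and take a convex combination,
    # i.e., a random point on the line segment between a and b.
    b = X_min[rng.choice(neighbor_idx)]
    t = rng.random()
    return a + t * (b - a)
```

Repeating this procedure until the desired number of minority instances has been added yields the more balanced training set described above.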
Given SMOTE's effectiveness as an over-sampling method, it has been
extended multiple times [7-9]. In Borderline-SMOTE, for instance, only
borderline instances are considered to be SMOTEd, where borderline instances
are defined as instances that are misclassified by a nearest neighbor classifier.
Safe-Level-SMOTE, on the other hand, defines a “safe-level” for each instance,
and the instances that are deemed “safe” are considered to be SMOTEd.
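Under the characterization of borderline instances quoted above, the selection step might be sketched as follows (assuming scikit-learn; the helper name is hypothetical, and the published Borderline-SMOTE criterion is based on the composition of each instance's neighborhood rather than on an explicit classifier):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def borderline_candidates(X, y, minority_label, k=5):
    """Return minority instances misclassified by a k-NN classifier,
    i.e., the 'borderline' instances eligible to be SMOTEd."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    misclassified = knn.predict(X) != y
    return X[(y == minority_label) & misclassified]
```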
In addition to SMOTE, Jo and Japkowicz [10] defined an over-sampling
method based on clustering. That is, instead of randomly choosing the instances to over-sample, they first cluster all of the minority class instances using k-means clustering. They then over-sample each cluster so that every cluster contains the same number of instances and the overall dataset is balanced. The purpose of this
method is to identify the disparate regions in the feature space where minority
class instances are found and to ensure that each region is equally represented
with minority class instances.
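A simplified sketch of this idea is given below, assuming scikit-learn's k-means implementation; it equalizes the minority clusters by random duplication, whereas the method of Jo and Japkowicz additionally grows the clusters until the overall dataset is balanced (the function name and cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_oversample(X_min, n_clusters=3, seed=0):
    """Cluster the minority instances, then over-sample every cluster
    (by random duplication) up to the size of the largest cluster."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_min)
    target = np.bincount(labels, minlength=n_clusters).max()
    parts = []
    for c in range(n_clusters):
        members = X_min[labels == c]
        # Sample with replacement so the cluster reaches the target size.
        parts.append(members[rng.integers(len(members), size=target)])
    return np.vstack(parts)
```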
In addition to cluster-based over-sampling, Japkowicz et al. [11] also devel-
oped a method called focused resampling. In focused resampling, only minority
class instances that occur on the boundary between minority and majority class
instances are over-sampled. In this way, redundant instances are reduced, and
better performance can be achieved.
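One plausible proxy for such boundary instances, sketched under the assumption of a numeric feature space (the helper name and the neighborhood criterion are illustrative, not taken from [11]), is to flag minority instances that have at least one majority-class point among their nearest neighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def boundary_minority(X, y, minority_label, k=5):
    """Return minority instances whose k nearest neighbors include
    at least one majority-class instance."""
    minority = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(minority)
    # Column 0 is the query point itself; inspect the remaining neighbors.
    near_majority = (y[idx[:, 1:]] != minority_label).any(axis=1)
    return minority[near_majority]
```

Only the instances returned by such a filter would then be over-sampled, leaving the interior of the minority region untouched.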
3.2.3 Hybrid Techniques
In addition to merely over-sampling or under-sampling the dataset, techniques have been developed that perform a combination of both. By combining over-sampling and under-sampling, the dataset can be balanced without either losing too much information (i.e., under-sampling too many majority class instances) or suffering from overfitting (i.e., over-sampling too heavily).
Two examples of hybrid techniques that have been developed include
SMOTE+Tomek and SMOTE+ENN [12], wherein SMOTE is used to over-sample the minority class, after which Tomek links or the edited nearest neighbor (ENN) rule, respectively, are applied to remove noisy and borderline instances from the resulting dataset.
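As a usage sketch, assuming the third-party imbalanced-learn package (which ships implementations of both hybrids), the two techniques can be applied and compared as follows:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

# A small synthetic two-class problem with a 9:1 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
print("original:", Counter(y))
for sampler in (SMOTETomek(random_state=0), SMOTEENN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```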