A classifier trained on mere copies of existing minority instances might fit the training data too closely and, as a result, not generalize well to unseen instances.
In order to overcome this issue, Chawla et al. developed a method of cre-
ating synthetic instances instead of merely copying existing instances in the
dataset. This technique is known as the synthetic minority over-sampling technique (SMOTE) [6]. As mentioned, in SMOTE, the training set is altered by adding synthetically generated minority class instances, causing the class distribution to become more balanced. We say that the created instances are synthetic, as they are, in general, new minority instances interpolated from existing minority class instances rather than exact copies of them.
To create the new synthetic minority class instances, SMOTE first selects
a minority class instance a at random and finds its k nearest minority class
neighbors. The synthetic instance is then created by choosing one of the k nearest
neighbors b at random and connecting a and b to form a line segment in the
feature space. The synthetic instances are generated as a convex combination of
the two chosen instances a and b .
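As a rough sketch of this step (assuming NumPy and a purely numeric feature space; the function name and parameters are illustrative rather than taken from [6]), one synthetic instance can be generated as follows:

```python
import numpy as np

def smote_sample(X_min, k=5, rng=None):
    """Generate one synthetic instance from the minority-class matrix X_min."""
    rng = np.random.default_rng() if rng is None else rng
    # Select a minority instance a at random.
    a = X_min[rng.integers(len(X_min))]
    # Find its k nearest minority-class neighbors (position 0 is a itself).
    dists = np.linalg.norm(X_min - a, axis=1)
    neighbor_idx = np.argsort(dists)[1:k + 1]
    # Choose one neighbor b at random and take a convex combination,
    # i.e., a random point on the line segment between a and b.
    b = X_min[rng.choice(neighbor_idx)]
    t = rng.random()
    return a + t * (b - a)
```

Repeating this procedure until the desired number of minority instances has been added yields the more balanced training set described above.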
Given SMOTE's effectiveness as an over-sampling method, it has been
extended multiple times [7-9]. In Borderline-SMOTE, for instance, only
borderline instances are considered to be SMOTEd, where borderline instances
are defined as instances that are misclassified by a nearest neighbor classifier.
Safe-Level-SMOTE, on the other hand, defines a “safe-level” for each instance,
and the instances that are deemed “safe” are considered to be SMOTEd.
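Under the characterization of borderline instances quoted above, the selection step might be sketched as follows (assuming scikit-learn; the helper name is hypothetical, and the published Borderline-SMOTE criterion is based on the composition of each instance's neighborhood rather than on an explicit classifier):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def borderline_candidates(X, y, minority_label, k=5):
    """Return minority instances misclassified by a k-NN classifier,
    i.e., the 'borderline' instances eligible to be SMOTEd."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    misclassified = knn.predict(X) != y
    return X[(y == minority_label) & misclassified]
```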
In addition to SMOTE, Jo and Japkowicz [10] defined an over-sampling
method based on clustering. That is, instead of randomly choosing the instances to over-sample, they first cluster all of the minority class instances using k-means clustering. They then over-sample each cluster so that every cluster contains the same number of instances and the overall dataset is balanced. The purpose of this
method is to identify the disparate regions in the feature space where minority
class instances are found and to ensure that each region is equally represented
with minority class instances.
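A simplified sketch of this idea is given below, assuming scikit-learn's k-means implementation; it equalizes the minority clusters by random duplication, whereas the method of Jo and Japkowicz additionally grows the clusters until the overall dataset is balanced (the function name and cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_oversample(X_min, n_clusters=3, seed=0):
    """Cluster the minority instances, then over-sample every cluster
    (by random duplication) up to the size of the largest cluster."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_min)
    target = np.bincount(labels, minlength=n_clusters).max()
    parts = []
    for c in range(n_clusters):
        members = X_min[labels == c]
        # Sample with replacement so the cluster reaches the target size.
        parts.append(members[rng.integers(len(members), size=target)])
    return np.vstack(parts)
```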
In addition to cluster-based over-sampling, Japkowicz et al. [11] also devel-
oped a method called focused resampling. In focused resampling, only minority
class instances that occur on the boundary between minority and majority class
instances are over-sampled. In this way, redundant instances are reduced, and
better performance can be achieved.
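One plausible proxy for such boundary instances, sketched under the assumption of a numeric feature space (the helper name and the neighborhood criterion are illustrative, not taken from [11]), is to flag minority instances that have at least one majority-class point among their nearest neighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def boundary_minority(X, y, minority_label, k=5):
    """Return minority instances whose k nearest neighbors include
    at least one majority-class instance."""
    minority = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(minority)
    # Column 0 is the query point itself; inspect the remaining neighbors.
    near_majority = (y[idx[:, 1:]] != minority_label).any(axis=1)
    return minority[near_majority]
```

Only the instances returned by such a filter would then be over-sampled, leaving the interior of the minority region untouched.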
3.2.3 Hybrid Techniques
In addition to merely over-sampling or under-sampling the dataset, techniques have been developed that perform a combination of both. By combining over-sampling and under-sampling, the dataset can be balanced without either losing too much information (i.e., under-sampling too many majority class instances) or suffering from overfitting (i.e., over-sampling too heavily).
Two examples of hybrid techniques that have been developed include
SMOTE+Tomek and SMOTE+ENN [12], wherein SMOTE is used to over-sample the minority class, after which Tomek links or the edited nearest neighbor (ENN) rule, respectively, are applied to remove noisy and borderline instances from the resulting dataset.
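As a usage sketch, assuming the third-party imbalanced-learn package (which ships implementations of both hybrids), the two techniques can be applied and compared as follows:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

# A small synthetic two-class problem with a 9:1 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
print("original:", Counter(y))
for sampler in (SMOTETomek(random_state=0), SMOTEENN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```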