class. Similarly, noisy instances are majority class instances that are the
product of randomness in the dataset, rather than a true representation of
the underlying concept. Removing borderline instances is valuable because small
perturbations to a borderline instance's features may cause the decision boundary to
shift incorrectly.
One of the earliest approaches is due to Kubat and Matwin [2]. In their
approach, they combine Tomek Links [3] and a modified version of the con-
densed nearest neighbor (CNN) rule [4] to create a directed under-sampling
method. The choice to combine Tomek Links and CNN is natural, as Tomek
Links can be said to remove borderline and noisy instances, while CNN removes
redundant instances. To see how this works in practice, let us consider how
Tomek Links and CNN are defined.
Given two instances a and b, let δ(a, b) denote the distance (e.g., Euclidean)
between a and b. A pair of instances a and b, which belong to different classes,
is said to be a Tomek Link if there is no instance c such that δ(a, c) < δ(a, b) or
δ(b, c) < δ(a, b). In words, instances a and b define a Tomek Link if: (i) instance
a's nearest neighbor is b, (ii) instance b's nearest neighbor is a, and (iii) instances
a and b belong to different classes. From this definition, we see that instances
that are in Tomek Links are either boundary instances or noisy instances, since
only boundary instances and noisy instances have a nearest neighbor from the
opposite class.
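To make the definition concrete, the following sketch finds all Tomek Links in a small dataset using brute-force pairwise Euclidean distances. The function name and the implementation details are our own illustration, not part of the original method.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j) that form Tomek Links.

    A pair forms a Tomek Link when the two instances belong to different
    classes and each instance is the other's nearest neighbor.
    """
    # Brute-force pairwise Euclidean distances; the diagonal is set to
    # infinity so an instance is never its own nearest neighbor.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    links = []
    for i, j in enumerate(nearest):
        # Mutual nearest neighbors from different classes form a link;
        # the i < j check avoids reporting the same pair twice.
        if nearest[j] == i and y[i] != y[j] and i < j:
            links.append((i, int(j)))
    return links
```

In a directed under-sampling setting such as Kubat and Matwin's, one would then drop only the majority class member of each link, leaving the minority class intact.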
By considering the entire dataset, one can remove Tomek Links by searching
through each pair of instances and removing those that fit the criteria. When using
CNN, on the other hand, one builds up the dataset from a small seed group of
instances. To do this, let D be the original training set of instances, and let C be
a set of instances containing all of the minority class examples and one randomly
selected majority class instance. CNN then classifies each instance in D using its
nearest neighbor in C and adds the instance to C if it is misclassified. Thus,
redundant instances are effectively eliminated from the resulting dataset C, as any
instance correctly classified by its nearest neighbor in C is not added.
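A minimal sketch of this condensing step is given below, assuming 1-NN classification against the growing set C; the function and parameter names are illustrative only.

```python
import numpy as np

def condense_cnn(X, y, minority_label, seed=0):
    """One-pass CNN condensing (a sketch of the procedure in the text).

    C starts with every minority instance plus one randomly chosen
    majority instance; any instance of D misclassified by its nearest
    neighbor in C is added to C.
    """
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    C = list(minority) + [rng.choice(majority)]

    for i in range(len(X)):
        if i in C:
            continue
        # 1-NN classification of instance i against the current C.
        d = np.linalg.norm(X[C] - X[i], axis=1)
        nearest = C[int(d.argmin())]
        if y[nearest] != y[i]:      # misclassified, so keep the instance
            C.append(i)
    return X[C], y[C]
```

Kubat and Matwin's directed under-sampling chains the two steps, relying on Tomek Links to drop borderline and noisy majority instances and on CNN to drop redundant ones.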
As an alternative to Tomek Links and CNN, another under-sampling method is
the neighborhood cleaning rule (NCR), due to Laurikkala [5].
NCR uses Wilson's edited nearest neighbor rule (ENN) to select majority class
instances to remove from the dataset. In NCR, for each instance a in the dataset,
its three nearest neighbors are computed. If a is a majority class instance and is
misclassified by its three nearest neighbors, then a is removed from the dataset.
Alternatively, if a is a minority class instance and is misclassified by its three
nearest neighbors, then the majority class instances among a 's neighbors are
removed.
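The rule can be sketched as follows, assuming a simple 3-NN majority vote over Euclidean distances; the function name and the brute-force distance handling are our own assumptions rather than part of the published method.

```python
import numpy as np

def neighborhood_cleaning(X, y, minority_label, k=3):
    """Neighborhood cleaning rule (a sketch of the description above).

    A majority instance misclassified by its k nearest neighbors is
    removed; when a minority instance is misclassified, the majority
    instances among its neighbors are removed instead.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    to_remove = set()

    for i in range(len(X)):
        neighbors = np.argsort(dists[i])[:k]
        votes_minority = np.sum(y[neighbors] == minority_label)
        predicted_minority = votes_minority > k / 2
        if predicted_minority == (y[i] == minority_label):
            continue  # correctly classified, nothing to clean
        if y[i] != minority_label:
            to_remove.add(i)  # misclassified majority instance is removed
        else:
            # Misclassified minority instance: drop its majority neighbors.
            to_remove.update(int(n) for n in neighbors if y[n] != minority_label)

    keep = [i for i in range(len(X)) if i not in to_remove]
    return X[keep], y[keep]
```

Note that, as in ENN, only instances that disagree with their neighborhood are touched; the minority class itself is never reduced.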
3.2.2 Over-Sampling Techniques
While random under-sampling suffers from the loss of potentially useful
information, random over-sampling suffers from the problem of overfitting.
Specifically, by randomly replicating instances in the dataset, the learned model