class. Similarly, noisy instances are majority class instances that are the
product of randomness in the dataset, rather than a true representation of
the underlying concept. Removing borderline instances is valuable because small
perturbations to a borderline instance's features may cause the decision boundary to
shift incorrectly.
One of the earliest approaches is due to Kubat and Matwin [2]. In their
approach, they combine Tomek Links [3] and a modified version of the con-
densed nearest neighbor (CNN) rule [4] to create a directed under-sampling
method. The choice to combine Tomek Links and CNN is natural, as Tomek
Links can be said to remove borderline and noisy instances, while CNN removes
redundant instances. To see how this works in practice, let us consider how
Tomek Links and CNN are defined.
Given two instances a and b, let δ(a, b) denote the distance (e.g., Euclidean)
between a and b. A pair of instances a and b, which belong to different classes,
is said to be a Tomek Link if there is no instance c such that δ(a, c) < δ(a, b) or
δ(b, c) < δ(a, b). In words, instances a and b define a Tomek Link if: (i) instance
a's nearest neighbor is b, (ii) instance b's nearest neighbor is a, and (iii) instances
a and b belong to different classes. From this definition, we see that instances
that are in Tomek Links are either boundary instances or noisy instances, since
only boundary instances and noisy instances have a nearest neighbor from the
opposite class.
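To make the definition concrete, the following sketch finds all Tomek Links in a small dataset using brute-force pairwise Euclidean distances. The function name and the implementation details are our own illustration, not part of the original method.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j) that form Tomek Links.

    A pair forms a Tomek Link when the two instances belong to different
    classes and each instance is the other's nearest neighbor.
    """
    # Brute-force pairwise Euclidean distances; the diagonal is set to
    # infinity so an instance is never its own nearest neighbor.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    links = []
    for i, j in enumerate(nearest):
        # Mutual nearest neighbors from different classes form a link;
        # the i < j check avoids reporting the same pair twice.
        if nearest[j] == i and y[i] != y[j] and i < j:
            links.append((i, int(j)))
    return links
```

In a directed under-sampling setting such as Kubat and Matwin's, one would then drop only the majority class member of each link, leaving the minority class intact.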
By considering the entire dataset, one can remove Tomek Links by searching
through each pair of instances and removing those that fit the criteria. When using
CNN, on the other hand, one builds up the dataset from a small seed group of
instances. To do this, let D be the original training set of instances, and let C be
a set of instances containing all of the minority class examples and one randomly
selected majority class instance. CNN then classifies each instance in D using its
nearest neighbor in C and adds the instance to C if it is misclassified. Thus,
redundant instances are effectively eliminated from the resulting dataset C, as any
instance correctly classified by its nearest neighbor in C is not added.
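A minimal sketch of this condensing step is given below, assuming 1-NN classification against the growing set C; the function and parameter names are illustrative only.

```python
import numpy as np

def condense_cnn(X, y, minority_label, seed=0):
    """One-pass CNN condensing (a sketch of the procedure in the text).

    C starts with every minority instance plus one randomly chosen
    majority instance; any instance of D misclassified by its nearest
    neighbor in C is added to C.
    """
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    C = list(minority) + [rng.choice(majority)]

    for i in range(len(X)):
        if i in C:
            continue
        # 1-NN classification of instance i against the current C.
        d = np.linalg.norm(X[C] - X[i], axis=1)
        nearest = C[int(d.argmin())]
        if y[nearest] != y[i]:      # misclassified, so keep the instance
            C.append(i)
    return X[C], y[C]
```

Kubat and Matwin's directed under-sampling chains the two steps, relying on Tomek Links to drop borderline and noisy majority instances and on CNN to drop redundant ones.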
As an alternative to Tomek Links and CNN, another under-sampling method is
the neighborhood cleaning rule (NCR), due to Laurikkala [5].
NCR uses Wilson's edited nearest neighbor rule (ENN) to select majority class
instances to remove from the dataset. In NCR, for each instance a in the dataset,
its three nearest neighbors are computed. If a is a majority class instance and is
misclassified by its three nearest neighbors, then a is removed from the dataset.
Alternatively, if a is a minority class instance and is misclassified by its three
nearest neighbors, then the majority class instances among a 's neighbors are
removed.
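The rule can be sketched as follows, assuming a simple 3-NN majority vote over Euclidean distances; the function name and the brute-force distance handling are our own assumptions rather than part of the published method.

```python
import numpy as np

def neighborhood_cleaning(X, y, minority_label, k=3):
    """Neighborhood cleaning rule (a sketch of the description above).

    A majority instance misclassified by its k nearest neighbors is
    removed; when a minority instance is misclassified, the majority
    instances among its neighbors are removed instead.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    to_remove = set()

    for i in range(len(X)):
        neighbors = np.argsort(dists[i])[:k]
        votes_minority = np.sum(y[neighbors] == minority_label)
        predicted_minority = votes_minority > k / 2
        if predicted_minority == (y[i] == minority_label):
            continue  # correctly classified, nothing to clean
        if y[i] != minority_label:
            to_remove.add(i)  # misclassified majority instance is removed
        else:
            # Misclassified minority instance: drop its majority neighbors.
            to_remove.update(int(n) for n in neighbors if y[n] != minority_label)

    keep = [i for i in range(len(X)) if i not in to_remove]
    return X[keep], y[keep]
```

Note that, as in ENN, only instances that disagree with their neighborhood are touched; the minority class itself is never reduced.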
3.2.2 Over-Sampling Techniques
While random under-sampling suffers from the loss of potentially useful
information, random over-sampling suffers from the problem of overfitting.
Specifically, by randomly replicating instances in the dataset, the learned model