A weakness of clustering-based outlier detection is that its effectiveness depends highly on the clustering method used, and such methods may not be optimized for outlier detection. Moreover, clustering methods are often costly on large data sets, which can become a bottleneck.
12.6 Classification-Based Approaches
Outlier detection can be treated as a classification problem if a training data set with class
labels is available. The general idea of classification-based outlier detection methods is
to train a classification model that can distinguish normal data from outliers.
Consider a training set that contains samples labeled as “normal” and others labeled
as “outlier.” A classifier can then be constructed based on the training set. Any classi-
fication method can be used (Chapters 8 and 9). This kind of brute-force approach,
however, does not work well for outlier detection because the training set is typically
heavily biased. That is, the number of normal samples likely far exceeds the number of
outlier samples. Because of this imbalance, the outlier samples may be too few to build
an accurate classifier. Consider intrusion detection
in a system, for example. Because most system accesses are normal, it is easy to obtain
a good representation of the normal events. However, it is infeasible to enumerate all
potential intrusions, as new and unexpected attempts occur from time to time. Hence,
we are left with an insufficient representation of the outlier (or intrusion) samples.
To overcome this challenge, classification-based outlier detection methods often use a
one-class model. That is, a classifier is built to describe only the normal class. Any samples
that do not belong to the normal class are regarded as outliers.
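The one-class idea can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the method used in the text: it learns a simple spherical boundary (centroid plus radius) around the normal training samples as a stand-in for a boundary learned by a classifier such as an SVM, and all data values are made up for illustration.

```python
# One-class outlier detection sketch: model ONLY the normal class,
# then flag anything that falls outside its learned boundary.
# A centroid + radius boundary stands in for an SVM-learned boundary.

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fit_one_class(normal_points):
    """Learn a spherical decision boundary from normal samples only."""
    n = len(normal_points)
    dim = len(normal_points[0])
    centroid = tuple(sum(p[i] for p in normal_points) / n for i in range(dim))
    # Radius = largest distance from the centroid to any normal sample,
    # so every normal training point lies inside the boundary.
    radius = max(dist(p, centroid) for p in normal_points)
    return centroid, radius

def is_outlier(point, model):
    """A point outside the normal-class boundary is declared an outlier."""
    centroid, radius = model
    return dist(point, centroid) > radius

# Hypothetical "normal" training samples; labeled outliers are not needed.
normal = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2)]
model = fit_one_class(normal)

print(is_outlier((1.0, 1.1), model))   # inside the boundary -> False
print(is_outlier((5.0, 5.0), model))   # far outside -> True
```

Note that the model is fit without any outlier samples at all, which is exactly what sidesteps the class-imbalance problem described above.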
Example 12.19 Outlier detection using a one-class model. Consider the training set shown in
Figure 12.13, where white points are samples labeled as “normal” and black points
are samples labeled as “outlier.” To build a model for outlier detection, we can learn
the decision boundary of the normal class using classification methods such as SVM
(Chapter 9), as illustrated. Given a new object, if the object is within the decision bound-
ary of the normal class, it is treated as a normal case. If the object is outside the decision
boundary, it is declared an outlier.
An advantage of using only the model of the normal class to detect outliers is that
the model can detect new outliers that do not resemble any outlier object in the
training set: such new outliers are detected as long as they fall outside the decision
boundary of the normal class.
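This advantage can be demonstrated concretely. In the sketch below (a toy illustration with made-up 1-D data, using a mean-plus-k-standard-deviations interval as a simple stand-in for a learned decision boundary), the labeled training outliers all lie on the high side, yet a new low-valued outlier is still caught, because only the normal region is modeled.

```python
# Only the normal class defines the boundary, so a NEW kind of outlier
# (unlike any outlier seen in training) is still detected.

normal = [4.8, 5.1, 5.0, 4.9, 5.2]   # normal 1-D samples (illustrative)
train_outliers = [9.0, 9.5]          # known outliers, all on the HIGH side;
                                     # the one-class model never uses them

mean = sum(normal) / len(normal)
std = (sum((x - mean) ** 2 for x in normal) / len(normal)) ** 0.5

def outside_normal(x, k=3.0):
    """Declare x an outlier if it lies outside mean +/- k*std."""
    return abs(x - mean) > k * std

# A new LOW-valued outlier, resembling no training outlier, is caught:
print(outside_normal(0.5))   # True
# A normal-looking value is not:
print(outside_normal(5.0))   # False
```

By contrast, a binary classifier trained on the labeled samples above might learn only "large values are outliers" and miss the low-valued case entirely.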
The idea of using the decision boundary of the normal class can be extended to
handle situations where the normal objects may belong to multiple classes such as in
fuzzy clustering (Chapter 11). For example, AllElectronics accepts returned items. Cus-
tomers can return items for a number of reasons (corresponding to class categories)
such as “product design defects” and “product damaged during shipment.” Each such