A weakness of clustering-based outlier detection is that its effectiveness depends highly on the clustering method used, and such methods may not be optimized for outlier detection. Moreover, clustering methods are often costly on large data sets, which can become a bottleneck.
12.6 Classification-Based Approaches
Outlier detection can be treated as a classification problem if a training data set with class
labels is available. The general idea of classification-based outlier detection methods is
to train a classification model that can distinguish normal data from outliers.
Consider a training set that contains samples labeled as “normal” and others labeled
as “outlier.” A classifier can then be constructed based on the training set. Any classi-
fication method can be used (Chapters 8 and 9). This kind of brute-force approach,
however, does not work well for outlier detection because the training set is typically
heavily biased. That is, the number of normal samples likely far exceeds the number of
outlier samples. Because of this imbalance, the outlier samples may be too few to build
an accurate classifier. Consider intrusion detection
in a system, for example. Because most system accesses are normal, it is easy to obtain
a good representation of the normal events. However, it is infeasible to enumerate all
potential intrusions, as new and unexpected attempts occur from time to time. Hence,
we are left with an insufficient representation of the outlier (or intrusion) samples.
To overcome this challenge, classification-based outlier detection methods often use a
one-class model. That is, a classifier is built to describe only the normal class. Any samples
that do not belong to the normal class are regarded as outliers.
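The one-class idea can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the method used in the text: it learns a simple spherical boundary (centroid plus radius) around the normal training samples as a stand-in for a boundary learned by a classifier such as an SVM, and all data values are made up for illustration.

```python
# One-class outlier detection sketch: model ONLY the normal class,
# then flag anything that falls outside its learned boundary.
# A centroid + radius boundary stands in for an SVM-learned boundary.

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fit_one_class(normal_points):
    """Learn a spherical decision boundary from normal samples only."""
    n = len(normal_points)
    dim = len(normal_points[0])
    centroid = tuple(sum(p[i] for p in normal_points) / n for i in range(dim))
    # Radius = largest distance from the centroid to any normal sample,
    # so every normal training point lies inside the boundary.
    radius = max(dist(p, centroid) for p in normal_points)
    return centroid, radius

def is_outlier(point, model):
    """A point outside the normal-class boundary is declared an outlier."""
    centroid, radius = model
    return dist(point, centroid) > radius

# Hypothetical "normal" training samples; labeled outliers are not needed.
normal = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2)]
model = fit_one_class(normal)

print(is_outlier((1.0, 1.1), model))   # inside the boundary -> False
print(is_outlier((5.0, 5.0), model))   # far outside -> True
```

Note that the model is fit without any outlier samples at all, which is exactly what sidesteps the class-imbalance problem described above.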
Example 12.19 Outlier detection using a one-class model. Consider the training set shown in
Figure 12.13, where white points are samples labeled as “normal” and black points
are samples labeled as “outlier.” To build a model for outlier detection, we can learn
the decision boundary of the normal class using classification methods such as SVM
(Chapter 9), as illustrated. Given a new object, if the object is within the decision bound-
ary of the normal class, it is treated as a normal case. If the object is outside the decision
boundary, it is declared an outlier.
An advantage of using only the model of the normal class to detect outliers is that
the model can detect new outliers that do not resemble any outlier object in the
training set: such new outliers are detected as long as they fall outside the decision
boundary of the normal class.
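This advantage can be demonstrated concretely. In the sketch below (a toy illustration with made-up 1-D data, using a mean-plus-k-standard-deviations interval as a simple stand-in for a learned decision boundary), the labeled training outliers all lie on the high side, yet a new low-valued outlier is still caught, because only the normal region is modeled.

```python
# Only the normal class defines the boundary, so a NEW kind of outlier
# (unlike any outlier seen in training) is still detected.

normal = [4.8, 5.1, 5.0, 4.9, 5.2]   # normal 1-D samples (illustrative)
train_outliers = [9.0, 9.5]          # known outliers, all on the HIGH side;
                                     # the one-class model never uses them

mean = sum(normal) / len(normal)
std = (sum((x - mean) ** 2 for x in normal) / len(normal)) ** 0.5

def outside_normal(x, k=3.0):
    """Declare x an outlier if it lies outside mean +/- k*std."""
    return abs(x - mean) > k * std

# A new LOW-valued outlier, resembling no training outlier, is caught:
print(outside_normal(0.5))   # True
# A normal-looking value is not:
print(outside_normal(5.0))   # False
```

By contrast, a binary classifier trained on the labeled samples above might learn only "large values are outliers" and miss the low-valued case entirely.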
The idea of using the decision boundary of the normal class can be extended to
handle situations where the normal objects may belong to multiple classes such as in
fuzzy clustering (Chapter 11). For example, AllElectronics accepts returned items. Cus-
tomers can return items for a number of reasons (corresponding to class categories)
such as “product design defects” and “product damaged during shipment.” Each such