Outlier Detection - Data Mining: Concepts and Techniques


methods may have a high false positive rate—they may mislabel many normal objects

as outliers (intrusions or viruses in these applications), and let many actual outliers go

undetected. Due to the high similarity between intrusions and viruses (i.e., they have to

attack key resources in the target systems), modeling outliers using supervised methods

may be far more effective.

Many clustering methods can be adapted to act as unsupervised outlier detection

methods. The central idea is to find clusters first, and then the data objects not belonging

to any cluster are detected as outliers. However, such methods suffer from two issues.

First, a data object not belonging to any cluster may be noise instead of an outlier. Second,

it is often costly to find clusters first and then find outliers. It is usually assumed

that there are far fewer outliers than normal objects. Having to process a large population

of nontarget data entries (i.e., the normal objects) before one can touch the real

meat (i.e., the outliers) can be unappealing. The latest unsupervised outlier detection

methods develop various smart ideas to tackle outliers directly without explicitly and

completely finding clusters. You will learn more about these techniques in Sections 12.4

and 12.5 on proximity-based and clustering-based methods, respectively.
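
The cluster-first idea described above can be illustrated with a minimal sketch. It is not one of the methods from Sections 12.4 or 12.5. It assumes a simple grouping rule (objects within distance `eps` of each other fall into the same group) and treats groups smaller than `min_cluster_size` as "not belonging to any cluster". The function name `cluster_then_flag` and both parameters are hypothetical choices made for this illustration.

```python
from math import dist


def cluster_then_flag(points, eps, min_cluster_size):
    """Group points that are linked by distances of at most eps (connected
    components), then return the indices of points whose group has fewer
    than min_cluster_size members."""
    n = len(points)
    labels = [None] * n      # group id assigned to each point
    groups = []              # each group is a list of point indices
    for start in range(n):
        if labels[start] is not None:
            continue
        group_id = len(groups)
        labels[start] = group_id
        members, stack = [start], [start]
        while stack:
            i = stack.pop()
            # Every unassigned point is compared against point i, so the
            # grouping step alone costs O(n^2) distance computations.
            for j in range(n):
                if labels[j] is None and dist(points[i], points[j]) <= eps:
                    labels[j] = group_id
                    members.append(j)
                    stack.append(j)
        groups.append(members)
    return [i for group in groups if len(group) < min_cluster_size for i in group]


points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),   # first dense group
    (5.0, 5.0), (5.1, 4.9), (5.2, 5.1),   # second dense group
    (9.0, 0.5),                           # isolated object
]
outliers = cluster_then_flag(points, eps=0.5, min_cluster_size=3)
print(outliers)  # indices of objects that belong to no sufficiently large group
```

Both issues raised in the text are visible here: the isolated object is flagged without any evidence that it is an outlier rather than noise, and all normal objects must be grouped before a single outlier is reported.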

Semi-Supervised Methods

In many applications, although obtaining some labeled examples is feasible, the number

of such labeled examples is often small. We may encounter cases where only a small set

of the normal and/or outlier objects are labeled, but most of the data are unlabeled.

Semi-supervised outlier detection methods were developed to tackle such scenarios.

Semi-supervised outlier detection methods can be regarded as applications of semi-supervised

learning methods (Section 9.7.2). For example, when some labeled normal

objects are available, we can use them, together with unlabeled objects that are close by,

to train a model for normal objects. The model of normal objects then can be used to

detect outliers—those objects not fitting the model of normal objects are classified as

outliers.
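
The labeled-normal case can be sketched as follows. This is only one possible reading of the paragraph above: it assumes that "close by" means within a fixed distance `radius`, that the model of normal objects is simply the set of accepted objects, and that an object does not fit the model when no accepted object lies within `radius` of it. The names `expand_normal_set` and `is_outlier` are hypothetical.

```python
from math import dist


def expand_normal_set(labeled_normal, unlabeled, radius):
    """Start from the labeled normal objects and repeatedly absorb unlabeled
    objects that lie within `radius` of any object already accepted."""
    normal = list(labeled_normal)
    remaining = list(unlabeled)
    changed = True
    while changed:
        changed = False
        still_remaining = []
        for p in remaining:
            if any(dist(p, q) <= radius for q in normal):
                normal.append(p)
                changed = True
            else:
                still_remaining.append(p)
        remaining = still_remaining
    return normal


def is_outlier(obj, normal_model, radius):
    """An object does not fit the model if no accepted normal object is near it."""
    return all(dist(obj, q) > radius for q in normal_model)


labeled_normal = [(1.0, 1.0)]
unlabeled = [(1.2, 1.1), (1.5, 1.3), (8.0, 8.0)]
radius = 0.5

model = expand_normal_set(labeled_normal, unlabeled, radius)
flagged = [p for p in unlabeled if is_outlier(p, model, radius)]
print(model)
print(flagged)
```

Note that `(1.5, 1.3)` is farther than `radius` from the single labeled object and is accepted only because `(1.2, 1.1)` was absorbed first. This is the sense in which unlabeled objects help to train the model of normal objects.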

If only some labeled outliers are available, semi-supervised outlier detection is trickier.

A small number of labeled outliers are unlikely to represent all the possible outliers.

Therefore, building a model for outliers based on only a few labeled outliers is unlikely

to be effective. To improve the quality of outlier detection, we can get help from models

for normal objects learned from unsupervised methods.
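
The following sketch illustrates why a model of normal objects helps when only a few labeled outliers exist. The source does not prescribe how the two sources of evidence are combined. This example assumes a simple rule: an object is flagged if it lies near a labeled outlier or far from a normal model fitted to the unlabeled data. The normal model used here (coordinate-wise median plus a cutoff of `scale` times the median distance to it) and the names `fit_normal_model`, `flag`, `scale`, and `match_radius` are all assumptions made for illustration.

```python
from math import dist
from statistics import median


def fit_normal_model(unlabeled, scale=3.0):
    """Unsupervised model of normal objects: a robust center and a distance
    cutoff, assuming most unlabeled objects are normal."""
    center = tuple(median(coords) for coords in zip(*unlabeled))
    typical = median(dist(p, center) for p in unlabeled)
    return center, scale * typical


def flag(obj, labeled_outliers, normal_model, match_radius):
    """Flag obj if it resembles a labeled outlier or does not fit the normal model."""
    center, cutoff = normal_model
    near_known_outlier = any(dist(obj, o) <= match_radius for o in labeled_outliers)
    far_from_normal = dist(obj, center) > cutoff
    return near_known_outlier or far_from_normal


unlabeled = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.8), (6.0, 6.0)]
labeled_outliers = [(6.0, 6.2)]   # the only kind of outlier seen so far
normal_model = fit_normal_model(unlabeled)

print(flag((6.0, 6.0), labeled_outliers, normal_model, match_radius=0.5))    # resembles the labeled outlier
print(flag((-3.0, -3.0), labeled_outliers, normal_model, match_radius=0.5))  # an unseen kind of outlier
print(flag((1.0, 1.0), labeled_outliers, normal_model, match_radius=0.5))    # fits the normal model
```

The object at `(-3.0, -3.0)` is unlike the single labeled outlier, so a model built from labeled outliers alone would miss it. It is caught only because it does not fit the normal model.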

For additional information on semi-supervised methods, interested readers are

referred to the bibliographic notes at the end of this chapter (Section 12.11).

12.2.2 Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods

As discussed in Section 12.1, outlier detection methods make assumptions about outliers

versus the rest of the data. According to the assumptions made, we can categorize outlier

detection methods into three types: statistical methods, proximity-based methods, and

clustering-based methods.
