methods may have a high false positive rate—they may mislabel many normal objects
as outliers (intrusions or viruses in these applications), and let many actual outliers go
undetected. Because intrusions and viruses are highly similar (i.e., they have to
attack key resources in the target systems), modeling outliers using supervised methods
may be far more effective.
Many clustering methods can be adapted to act as unsupervised outlier detection
methods. The central idea is to find clusters first, and then the data objects not
belonging to any cluster are detected as outliers. However, such methods suffer from
two issues. First, a data object not belonging to any cluster may be noise instead of
an outlier. Second, it is often costly to find clusters first and then find outliers.
It is usually assumed that there are far fewer outliers than normal objects. Having to
process a large population of nontarget data entries (i.e., the normal objects) before
one can touch the real meat (i.e., the outliers) can be unappealing. The latest
unsupervised outlier detection methods develop various smart ideas to tackle outliers
directly without explicitly and completely finding clusters. You will learn more about
these techniques in Sections 12.4 and 12.5 on proximity-based and clustering-based
methods, respectively.
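The cluster-first idea described above can be sketched in a few lines of Python. This is a minimal illustration rather than a method taken from this chapter: it assumes a simplified density-based notion of a cluster (in the spirit of DBSCAN), and the function name `cluster_then_flag`, the parameters `eps` and `min_pts`, and the sample points are all hypothetical choices made for the example.

```python
from math import dist

def cluster_then_flag(points, eps=1.5, min_pts=3):
    """Return indices of points that belong to no density-based cluster.

    Assumption: a "cluster" is any group of points reachable from a core
    point, i.e., a point with at least min_pts neighbors within eps.
    """
    n = len(points)
    # Neighborhood of each point (including itself) within radius eps.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points have enough neighbors to seed a cluster.
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    # A point is clustered if it is a core point or lies within eps of one.
    clustered = set()
    for i in core:
        clustered.update(neighbors[i])
    # Everything left over belongs to no cluster and is flagged.
    return [i for i in range(n) if i not in clustered]

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 5)]
outliers = cluster_then_flag(points)
print(outliers)
```

Every pairwise distance is computed before a single outlier is reported, which illustrates the cost issue raised above. The sketch also cannot tell whether a flagged object is a genuine outlier or merely noise.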
Semi-Supervised Methods
In many applications, although obtaining some labeled examples is feasible, the number
of such labeled examples is often small. We may encounter cases where only a small set
of the normal and/or outlier objects are labeled, but most of the data are unlabeled.
Semi-supervised outlier detection methods were developed to tackle such scenarios.
Semi-supervised outlier detection methods can be regarded as applications of
semi-supervised learning methods (Section 9.7.2). For example, when some labeled normal
objects are available, we can use them, together with unlabeled objects that are close by,
to train a model for normal objects. The model of normal objects then can be used to
detect outliers—those objects not fitting the model of normal objects are classified as
outliers.
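The labeled-normal case above might be sketched as follows. This is only one possible reading, under stated assumptions: "close by" is taken to mean within a fixed Euclidean distance, the model of normal objects is simply the set of labeled and absorbed points, and the names `detect_with_labeled_normals` and `radius` are hypothetical.

```python
from math import dist

def detect_with_labeled_normals(labeled_normal, unlabeled, radius=1.5):
    """Grow a set of normal objects from a few labels; return (normal, outliers).

    Assumption: an unlabeled object fits the model of normal objects if it
    lies within `radius` of any object already considered normal.
    """
    normal = list(labeled_normal)
    remaining = list(unlabeled)
    grew = True
    # Repeat until a full pass absorbs no further unlabeled objects.
    while grew:
        grew = False
        still_unexplained = []
        for p in remaining:
            if any(dist(p, q) <= radius for q in normal):
                normal.append(p)   # close to a normal object: absorb it
                grew = True
            else:
                still_unexplained.append(p)
        remaining = still_unexplained
    # Objects that never fit the model of normal objects are the outliers.
    return normal, remaining

normal, outliers = detect_with_labeled_normals(
    labeled_normal=[(0, 0)],
    unlabeled=[(1, 0), (2, 0), (3, 0), (9, 9)],
)
print(outliers)
```

A single labeled normal object is enough here to absorb the chain of nearby unlabeled objects. The same chaining could also absorb a true outlier that happens to sit near the normal region, so the choice of `radius` matters.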
If only some labeled outliers are available, semi-supervised outlier detection is
trickier. A small number of labeled outliers are unlikely to represent all the possible outliers.
Therefore, building a model for outliers based on only a few labeled outliers is unlikely
to be effective. To improve the quality of outlier detection, we can get help from models
for normal objects learned from unsupervised methods.
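One way to combine a few labeled outliers with an unsupervised model of normal objects is sketched below. The specific choices are assumptions made for illustration and are not prescribed by the text: the normal model is the median and the median absolute deviation (MAD) of one-dimensional values, and the labeled outliers serve only to calibrate the cutoff. The names `robust_scores`, `flag_outliers`, and `labeled_outlier_idx` are hypothetical.

```python
from statistics import median

def robust_scores(values):
    """Score each value by its distance from the median, in MAD units."""
    med = median(values)
    # Fall back to 1.0 if the MAD is zero, to avoid dividing by zero.
    mad = median(abs(v - med) for v in values) or 1.0
    return [abs(v - med) / mad for v in values]

def flag_outliers(values, labeled_outlier_idx):
    """Flag every value at least as extreme as the mildest labeled outlier."""
    scores = robust_scores(values)  # unsupervised model of normal objects
    # Assumption: the least extreme labeled outlier sets the threshold.
    threshold = min(scores[i] for i in labeled_outlier_idx)
    return [i for i, s in enumerate(scores) if s >= threshold]

values = [10, 11, 9, 10, 12, 10, 30, 45, 11]
flagged = flag_outliers(values, labeled_outlier_idx=[6])
print(flagged)
```

Only the value at index 6 is labeled, yet the unlabeled value at index 7 is flagged as well because it deviates even more from the normal model. The labeled outliers never have to describe every kind of outlier; they only tell the normal model where to draw the line.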
For additional information on semi-supervised methods, interested readers are
referred to the bibliographic notes at the end of this chapter (Section 12.11).
12.2.2 Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods
As discussed in Section 12.1, outlier detection methods make assumptions about outliers
versus the rest of the data. According to the assumptions made, we can categorize outlier
detection methods into three types: statistical methods, proximity-based methods, and
clustering-based methods.
 