dependency on the application type makes it impossible to develop a universally
applicable outlier detection method. Instead, individual outlier detection methods
that are dedicated to specific applications must be developed.
Handling noise in outlier detection. As mentioned earlier, outliers are different from
noise. It is also well known that the quality of real data sets tends to be poor. Noise
often unavoidably exists in data collected in many applications. Noise may be present
as deviations in attribute values or even as missing values. Low data quality and
the presence of noise pose a major challenge for outlier detection. They can distort
the data, blurring the distinction between normal objects and outliers. Moreover,
noise and missing data may “hide” outliers and reduce the effectiveness of outlier
detection: an outlier may appear “disguised” as a noise point, and an outlier
detection method may mistakenly identify a noise point as an outlier.
Understandability. In some application scenarios, a user may want to not only
detect outliers, but also understand why the detected objects are outliers. To meet
the understandability requirement, an outlier detection method has to provide some
justification of the detection. For example, a statistical method can be used to justify
the degree to which an object may be an outlier based on the likelihood that the
object was generated by the same mechanism that generated the majority of the data.
The smaller the likelihood, the less plausible it is that the object was generated by
the same mechanism, and the more likely the object is an outlier.
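The likelihood-based justification described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the generating mechanism is a single univariate normal distribution whose mean and standard deviation are estimated from the sample itself, suspected outliers included. The function names and the data values are hypothetical and are not taken from the text.

```python
import math
import statistics


def gaussian_log_likelihood(x, mean, stdev):
    """Log-density of x under a normal distribution N(mean, stdev^2)."""
    z = (x - mean) / stdev
    return -0.5 * z * z - math.log(stdev * math.sqrt(2.0 * math.pi))


def rank_by_likelihood(values):
    """Return (log_likelihood, value) pairs, least likely object first.

    Assumption: the majority of the data follows one normal distribution.
    The parameters are estimated from all values, so an extreme object
    also inflates the estimated spread; this is a simplification.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    scored = [(gaussian_log_likelihood(v, mean, stdev), v) for v in values]
    return sorted(scored)


# Hypothetical sample: six similar readings and one deviating reading.
values = [24.0, 25.1, 24.7, 25.3, 24.9, 25.0, 31.5]
ranking = rank_by_likelihood(values)

for log_lik, value in ranking:
    print(f"value={value:5.1f}  log-likelihood={log_lik:8.3f}")
```

The object listed first has the smallest likelihood under the fitted model, and that likelihood is the justification a user can be shown for treating it as an outlier candidate. A practical method would typically estimate the parameters more robustly, so that the outliers themselves do not distort the model.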
The rest of this chapter discusses approaches to outlier detection.
12.2 Outlier Detection Methods
There are many outlier detection methods in the literature and in practice. Here, we
present two orthogonal ways to categorize outlier detection methods. First, we categorize
outlier detection methods according to whether the sample of data for analysis is
given with domain expert-provided labels that can be used to build an outlier detection
model. Second, we divide methods into groups according to their assumptions regarding
normal objects versus outliers.
12.2.1 Supervised, Semi-Supervised, and Unsupervised Methods
If expert-labeled examples of normal and/or outlier objects can be obtained, they can be
used to build outlier detection models. The methods used can be divided into supervised
methods, semi-supervised methods, and unsupervised methods.
Supervised Methods
Supervised methods model data normality and abnormality. Domain experts examine
and label a sample of the underlying data. Outlier detection can then be modeled as
 