Outlier Detection - Data Mining: Concepts and Techniques

dependency on the application type makes it impossible to develop a universally

applicable outlier detection method. Instead, individual outlier detection methods

that are dedicated to specific applications must be developed.

Handling noise in outlier detection. As mentioned earlier, outliers are different from

noise. It is also well known that the quality of real data sets tends to be poor. Noise

often unavoidably exists in data collected in many applications. Noise may be present

as deviations in attribute values or even as missing values. Low data quality and

the presence of noise pose a major challenge to outlier detection. They can distort

the data, blurring the distinction between normal objects and outliers. Moreover,

noise and missing data may “hide” outliers and reduce the effectiveness of outlier

detection: an outlier may appear “disguised” as a noise point, and an outlier

detection method may mistakenly identify a noise point as an outlier.

Understandability. In some application scenarios, a user may want to not only

detect outliers, but also understand why the detected objects are outliers. To meet

the understandability requirement, an outlier detection method has to provide some

justification of the detection. For example, a statistical method can be used to justify

the degree to which an object may be an outlier based on the likelihood that the

object was generated by the same mechanism that generated the majority of the data.

The smaller this likelihood, the less plausible it is that the object came from that

mechanism, and the more likely the object is an outlier.
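
As a minimal sketch of such a likelihood-based justification, suppose the generating mechanism is modeled as a single normal distribution fitted to one numeric attribute. That modeling choice, the function names gaussian_log_likelihoods and least_likely, and the temperature readings are all assumptions made for illustration and do not come from the text. Each object is scored by its log density under the fitted model, and the lowest scores are reported together with the objects, so that the score itself serves as the justification.

```python
import math
import statistics


def gaussian_log_likelihoods(values):
    """Log density of each value under a normal distribution fitted to all values.

    Assumption for illustration: the "mechanism" behind the data is one
    univariate Gaussian, estimated from the sample itself.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population (maximum likelihood) estimate
    if sigma == 0:
        raise ValueError("all values are identical; there is no spread to model")
    log_norm = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return [log_norm - (x - mu) ** 2 / (2 * sigma ** 2) for x in values]


def least_likely(values, k=1):
    """Return the k (value, log_likelihood) pairs with the smallest log likelihood."""
    scored = list(zip(values, gaussian_log_likelihoods(values)))
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical sample: seven similar readings and one that deviates strongly.
temperatures = [24.0, 24.1, 23.9, 24.3, 23.8, 24.2, 24.0, 28.9]

for value, loglik in least_likely(temperatures, k=2):
    print(f"value {value}: log-likelihood {loglik:.2f}")
```

Note that the mean and standard deviation here are estimated from all objects, including any outliers, so extreme values inflate the fitted spread and can partly mask themselves. A more careful treatment would use robust estimates or a model suited to the application.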

The rest of this chapter discusses approaches to outlier detection.

12.2 Outlier Detection Methods

There are many outlier detection methods in the literature and in practice. Here, we

present two orthogonal ways to categorize outlier detection methods. First, we categorize

outlier detection methods according to whether the sample of data for analysis is

given with domain expert-provided labels that can be used to build an outlier detection

model. Second, we divide methods into groups according to their assumptions regarding

normal objects versus outliers.

12.2.1 Supervised, Semi-Supervised, and Unsupervised Methods

If expert-labeled examples of normal and/or outlier objects can be obtained, they can be

used to build outlier detection models. The methods used can be divided into supervised

methods, semi-supervised methods, and unsupervised methods.

Supervised Methods

Supervised methods model data normality and abnormality. Domain experts examine

and label a sample of the underlying data. Outlier detection can then be modeled as
