Outlier Detection - Data Mining: Concepts and Techniques - page 564

Databases Reference

In-Depth Information

set) to mine all outliers by scanning the data set three times. First, a sample, S , is created

of the given data set, D , using sampling by replacement. Each object in S is considered

the centroid of a partition. The objects in D are assigned to the partitions based on

distance. The preceding steps are completed in one scan of D . Candidate outliers are

identified in a second scan of D . After a third scan, all DB

.

r ,

/

-outliers have been found.

12.4.3 Density-Based Outlier Detection

Distance-based outliers, such as DB

-outliers, are just one type of outlier. Specifi-

cally, distance-based outlier detection takes a global view of the data set. Such outliers

can be regarded as “global outliers” for two reasons:

.

r ,

/

A DB

.

r ,

/

-outlier, for example, is far (as quantified by parameter r ) from at least

.

1/100% of the objects in the data set. In other words, an outlier as such is

remote from the majority of the data.

To detect distance-based outliers, we need two global parameters, r and

, which are

applied to every outlier object.

Many real-world data sets demonstrate a more complex structure, where objects

may be considered outliers with respect to their local neighborhoods, rather than with

respect to the global data distribution. Let's look at an example.

Example 12.14 Local proximity-based outliers. Consider the data points in Figure 12.8. There are two

clusters: C 1 is dense, and C 2 is sparse. Object o 3 can be detected as a distance-based

outlier because it is far from the majority of the data set.

Now, let's consider objects o 1 and o 2 . Are they outliers? On the one hand, the distance

from o 1 and o 2 to the objects in the dense cluster, C 1 , is smaller than the average dis-

tance between an object in cluster C 2 and its nearest neighbor. Thus, o 1 and o 2 are not

distance-based outliers. In fact, if we were to categorize o 1 and o 2 as DB

.

r ,

/

-outliers,

we would have to classify all the objects in clusters C 2 as DB

-outliers.

On the other hand, o 1 and o 2 can be identified as outliers when they are considered

locally with respect to cluster C 1 because o 1 and o 2 deviate significantly from the objects

in C 1 . Moreover, o 1 and o 2 are also far from the objects in C 2 .

.

r ,

/

C 1

o 1

o 4

o 2

C 2

o 3

Figure 12.8 Global outliers and local outliers.

Next Page

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home