Databases Reference
In-Depth Information
set) to mine all outliers by scanning the data set three times. First, a sample, S , is created
of the given data set, D , using sampling by replacement. Each object in S is considered
the centroid of a partition. The objects in D are assigned to the partitions based on
distance. The preceding steps are completed in one scan of D . Candidate outliers are
identified in a second scan of D . After a third scan, all DB
.
r ,
/
-outliers have been found.
12.4.3 Density-Based Outlier Detection
Distance-based outliers, such as DB
-outliers, are just one type of outlier. Specifi-
cally, distance-based outlier detection takes a global view of the data set. Such outliers
can be regarded as “global outliers” for two reasons:
.
r ,
/
A DB
.
r ,
/
-outlier, for example, is far (as quantified by parameter r ) from at least
.
1/100% of the objects in the data set. In other words, an outlier as such is
remote from the majority of the data.
To detect distance-based outliers, we need two global parameters, r and
, which are
applied to every outlier object.
Many real-world data sets demonstrate a more complex structure, where objects
may be considered outliers with respect to their local neighborhoods, rather than with
respect to the global data distribution. Let's look at an example.
Example 12.14 Local proximity-based outliers. Consider the data points in Figure 12.8. There are two
clusters: C 1 is dense, and C 2 is sparse. Object o 3 can be detected as a distance-based
outlier because it is far from the majority of the data set.
Now, let's consider objects o 1 and o 2 . Are they outliers? On the one hand, the distance
from o 1 and o 2 to the objects in the dense cluster, C 1 , is smaller than the average dis-
tance between an object in cluster C 2 and its nearest neighbor. Thus, o 1 and o 2 are not
distance-based outliers. In fact, if we were to categorize o 1 and o 2 as DB
.
r ,
/
-outliers,
we would have to classify all the objects in clusters C 2 as DB
-outliers.
On the other hand, o 1 and o 2 can be identified as outliers when they are considered
locally with respect to cluster C 1 because o 1 and o 2 deviate significantly from the objects
in C 1 . Moreover, o 1 and o 2 are also far from the objects in C 2 .
.
r ,
/
C 1
o 1
o 4
o 2
C 2
o 3
Figure 12.8 Global outliers and local outliers.
 
Search WWH ::




Custom Search