Databases Reference
In-Depth Information
12.4.1
Distance-Based Outlier Detection and a Nested
Loop Method
A representative method of proximity-based outlier detection uses the concept of
distance-based outliers
. For a set,
D
, of data objects to be analyzed, a user can spec-
ify a distance threshold,
r
, to define a reasonable neighborhood of an object. For each
object,
o
, we can examine the number of other objects in the
r
-neighborhood of
o
. If
most of the objects in
D
are far from
o
, that is, not in the
r
-neighborhood of
o
, then
o
can be regarded as an outlier.
Formally, let
r
.
r
0
/
be a
distancethreshold
and
.
0
< 1
/
be a fraction
threshold. An object,
o
, is a
DB
.
r
,
/
-outlier
if
kf
o
0
j
dist
o
,
o
0
.
/
r
gk
,
(12.10)
k
D
k
where
dist
is a distance measure.
Equivalently, we can determine whether an object,
o
, is a
DB
.,/
-outlier by checking
the distance between
o
and its
k
-nearest neighbor,
o
k
, where
k
Ddk
D
ke. Object
o
is an
outlier if
dist
.
r
,
/
r
, because in such a case, there are fewer than
k
objects except for
o
that are in the
r
-neighborhood of
o
.
“HowcanwecomputeDB
.
o
,
o
k
/>
-outliers?”
A straightforward approach is to use nested
loops to check the
r
-neighborhood for every object, as shown in Figure 12.6. For any
object,
o
i
.
.
r
,
/
, we calculate the distance between
o
i
and the other object, and
count the number of other objects in the
r
-neighborhood of
o
i
. Once we find
1
i
n
/
n
other
Algorithm:
Distance-based outlier detection.
Input:
a set of objects
D
Df
o
1
,
:::
,
o
n
g, threshold
r
.
r
>
0
/
and
.
0
< 1
/
;
Output:
DB
.
r
,
/
outliers in
D
.
Method:
for
i
D 1 to
n
do
count
0
for
j
D 1 to
n
do
if
i
6D
j
and
dist
o
i
,
o
j
/
r
then
count
count
C1
if
count
n
then
exit f
o
i
cannot be a
DB
.
r
,/ outlierg
endif
endif
endfor
print
o
i
f
o
i
is a
DB
.
r
,/ outlier according to (Eq. 12.10)g
endfor
;
.
Figure 12.6
Nested loop algorithm for
DB
.
r
,
/
-outlier detection.