Databases Reference
In-Depth Information
proximity measures and instead adopt new heuristics to detect outliers, which do not
deteriorate in high-dimensional data.
Let's examine angle-basedoutlierdetection (ABOD) as an example.
Example 12.25 Angle-based outliers. Figure 12.15 contains a set of points forming a cluster, with the
exception of c , which is an outlier. For each point o , we examine the angle\ xoy for every
pair of points x , y such that x 6D o , y 6D o . The figure shows angle\ dae as an example.
Note that for a point in the center of a cluster (e.g., a ), the angles formed as such
differ widely. For a point that is at the border of a cluster (e.g., b ), the angle variation is
smaller. For a point that is an outlier (e.g., c ), the angle variable is substantially smaller.
This observation suggests that we can use the variance of angles for a point to determine
whether a point is an outlier.
We can combine angles and distance to model outliers. Mathematically, for each
point o , we use the distance-weighted angle variance as the outlier score. That is, given a
set of points, D , for a point, o 2 D , we define the angle-based outlier factor (ABOF) as
/D VAR x , y 2 D , x 6D o , y 6D o h ! ox , ! oy i
dist
ABOF
.
o
2 ,
(12.23)
2 dist
.
o , x
/
.
o , y
/
where h,i is the scalar product operator, and dist
is a norm distance.
Clearly, the farther away a point is from clusters and the smaller the variance of the
angles of a point, the smaller the ABOF. The ABOD computes the ABOF for each point,
and outputs a list of the points in the data set in ABOF-ascending order.
Computing the exact ABOF for every point in a database is costly, requiring a time
complexity of O
.
,
/
n 3
, where n is the number of points in the database. Obviously, this
exact algorithm does not scale up for large data sets. Approximation methods have been
developed to speed up the computation. The angle-based outlier detection idea has been
generalized to handle arbitrary data types. For additional details, see the bibliographic
notes (Section 12.11).
Developing native models for high-dimensional outliers can lead to effective meth-
ods. However, finding good heuristics for detecting high-dimensional outliers is dif-
ficult. Efficiency and scalability on large and high-dimensional data sets are major
challenges.
.
/
c
d
a
b
e
Figure 12.15 Angle-based outliers.
 
Search WWH ::




Custom Search