Outlier Detection - Data Mining: Concepts and Techniques - page 580

Databases Reference

In-Depth Information

proximity measures and instead adopt new heuristics to detect outliers, which do not

deteriorate in high-dimensional data.

Let's examine angle-basedoutlierdetection (ABOD) as an example.

Example 12.25 Angle-based outliers. Figure 12.15 contains a set of points forming a cluster, with the

exception of c , which is an outlier. For each point o , we examine the angle\ xoy for every

pair of points x , y such that x 6D o , y 6D o . The figure shows angle\ dae as an example.

Note that for a point in the center of a cluster (e.g., a ), the angles formed as such

differ widely. For a point that is at the border of a cluster (e.g., b ), the angle variation is

smaller. For a point that is an outlier (e.g., c ), the angle variable is substantially smaller.

This observation suggests that we can use the variance of angles for a point to determine

whether a point is an outlier.

We can combine angles and distance to model outliers. Mathematically, for each

point o , we use the distance-weighted angle variance as the outlier score. That is, given a

set of points, D , for a point, o 2 D , we define the angle-based outlier factor (ABOF) as

/D VAR x , y 2 D , x 6D o , y 6D o h ! ox , ! oy i

dist

ABOF

.

o

2 ,

(12.23)

2 dist

.

o , x

/

.

o , y

/

where h,i is the scalar product operator, and dist

is a norm distance.

Clearly, the farther away a point is from clusters and the smaller the variance of the

angles of a point, the smaller the ABOF. The ABOD computes the ABOF for each point,

and outputs a list of the points in the data set in ABOF-ascending order.

Computing the exact ABOF for every point in a database is costly, requiring a time

complexity of O

.

,

/

n 3

, where n is the number of points in the database. Obviously, this

exact algorithm does not scale up for large data sets. Approximation methods have been

developed to speed up the computation. The angle-based outlier detection idea has been

generalized to handle arbitrary data types. For additional details, see the bibliographic

notes (Section 12.11).

Developing native models for high-dimensional outliers can lead to effective meth-

ods. However, finding good heuristics for detecting high-dimensional outliers is dif-

ficult. Efficiency and scalability on large and high-dimensional data sets are major

challenges.

.

/

c

d

a

b

e

Figure 12.15 Angle-based outliers.

Next Page

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home