“How can we detect outliers in subspaces?” We use a grid-based subspace outlier detection method to illustrate. The major ideas are as follows. We consider projections of the data onto various subspaces. If, in a subspace, we find an area whose density is much lower than average, then the area may contain outliers. To find such projections, we first discretize the data into a grid in an equal-depth way. That is, each dimension is partitioned into ϕ equal-depth ranges, where each range contains a fraction, f, of the objects (f = 1/ϕ). Equal-depth partitioning is used because data along different dimensions may have different localities. An equal-width partitioning of the space may not be able to reflect such differences in locality.
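As a concrete illustration, the following Python sketch performs the equal-depth discretization using per-dimension quantiles. The function name equal_depth_grid and the use of NumPy are our own choices for illustration, not part of the method's specification.

```python
import numpy as np

def equal_depth_grid(X, phi):
    """Discretize each dimension of X into phi equal-depth (quantile) ranges.

    Returns an integer array of the same shape as X in which each entry is the
    index (0 .. phi-1) of the range that the value falls into along its
    dimension, so each range holds roughly a fraction f = 1/phi of the objects.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    cells = np.empty((n, d), dtype=int)
    for j in range(d):
        # Quantile-based cut points give equal-depth rather than equal-width ranges.
        edges = np.quantile(X[:, j], np.linspace(0.0, 1.0, phi + 1))
        # Assign each value to a range; clip keeps the maximum value in the last range.
        cells[:, j] = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, phi - 1)
    return cells
```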
Next, we search for regions defined by ranges in subspaces that are significantly sparse. To quantify what we mean by “significantly sparse,” let's consider a k-dimensional cube formed by k ranges on k dimensions. Suppose the data set contains n objects. If the objects are independently distributed, the expected number of objects falling into a k-dimensional region is (1/ϕ)^k n = f^k n. The standard deviation of the number of points in a k-dimensional region is √(f^k (1 − f^k) n). Suppose a specific k-dimensional cube C has n(C) objects. We can define the sparsity coefficient of C as

S(C) = (n(C) − f^k n) / √(f^k (1 − f^k) n).    (12.22)
If S(C) < 0, then C contains fewer objects than expected. The smaller the value of S(C) (i.e., the more negative), the sparser C is and the more likely the objects in C are outliers in the subspace.
By assuming S(C) follows a normal distribution, we can use normal distribution tables to determine the probabilistic significance level for an object that deviates dramatically from the average, under an a priori assumption that the data follow a uniform distribution. In general, the assumption of uniform distribution does not hold. However, the sparsity coefficient still provides an intuitive measure of the “outlier-ness” of a region.
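As a minimal sketch of Eq. (12.22), the helper below (our own naming, not code from the text) computes the sparsity coefficient from the cube count n(C), the data set size n, the per-range fraction f, and the cube dimensionality k.

```python
import numpy as np

def sparsity_coefficient(n_C, n, f, k):
    """Sparsity coefficient S(C) of Eq. (12.22) for a k-dimensional cube C."""
    expected = (f ** k) * n                        # expected count under independence
    std = np.sqrt((f ** k) * (1.0 - f ** k) * n)   # standard deviation of the count
    return (n_C - expected) / std
```

For example, with n = 10,000 objects and ϕ = 10 ranges per dimension (f = 0.1), a 3-dimensional cube is expected to hold f^3 n = 10 objects; a cube holding only one object has S(C) ≈ (1 − 10)/√(0.001 · 0.999 · 10,000) ≈ −2.85, marking its objects as candidate outliers.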
To find cubes with significantly small sparsity coefficient values, a brute-force approach is to search every cube in every possible subspace. The cost of doing so, however, is exponential in the dimensionality. An evolutionary search can be conducted instead, which improves efficiency at the expense of accuracy. For details, please refer to the bibliographic notes (Section 12.11). The objects contained in cubes with very small sparsity coefficient values are output as outliers.
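To make the cost concrete, here is a brute-force sketch that reuses the two helpers above to scan every k-dimensional subspace; the function name and the significance threshold are illustrative assumptions. It only visits occupied cubes, since empty cubes contain no objects to report as outliers.

```python
from itertools import combinations
from collections import Counter

def brute_force_sparse_cubes(cells, f, k, threshold=-3.0):
    """Report (subspace, cube, S(C)) triples for occupied k-dimensional cubes
    whose sparsity coefficient falls below `threshold`.

    `cells` is the discretized grid from equal_depth_grid(). The number of
    k-dimensional subspaces grows combinatorially with the dimensionality,
    which is why an evolutionary search is preferred in practice.
    """
    n, d = cells.shape
    sparse = []
    for dims in combinations(range(d), k):                  # every k-dimensional subspace
        counts = Counter(map(tuple, cells[:, list(dims)]))  # occupied cubes and their counts
        for cube, n_C in counts.items():
            s = sparsity_coefficient(n_C, n, f, k)
            if s < threshold:
                sparse.append((dims, cube, s))
    return sparse
```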
In summary, searching for outliers in subspaces is advantageous in that the outliers
found tend to be better understood, owing to the context provided by the subspaces.
Challenges include making the search efficient and scalable.
12.8.3 Modeling High-Dimensional Outliers
An alternative approach to outlier detection in high-dimensional data is to develop new models for high-dimensional outliers directly. Such models typically avoid
 