All objects are ranked in weight-descending order. The top-l objects in weight are output
as outliers, where l is another user-specified parameter.
Computing the k-nearest neighbors for every object is costly and does not scale up
when the dimensionality is high and the database is large. To address the scalability issue,
HilOut employs space-filling curves to achieve an approximation algorithm, which is
scalable in both running time and space with respect to database size and dimensionality.
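The weight-based ranking described above can be sketched with an exact, quadratic-time computation; this is precisely the cost that HilOut avoids by approximating the neighbor search along Hilbert space-filling curves. Here, an object's weight is taken as the sum of distances to its k nearest neighbors; the function name and test data are illustrative.

```python
import numpy as np

def knn_weight_outliers(X, k=5, l=3):
    """Rank objects by "weight" (here, the sum of distances to the
    k nearest neighbors) and report the l heaviest as outliers.
    This is the exact O(n^2) computation that HilOut approximates
    with Hilbert space-filling curves to achieve scalability."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # exclude self-distance
    weights = np.sort(dist, axis=1)[:, :k].sum(axis=1)
    # Weight-descending order; the top-l indices are reported as outliers.
    return np.argsort(weights)[::-1][:l]

# A tight cluster plus one distant point: the distant point (index 4)
# accumulates the largest k-NN distances and is ranked first.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [5.0, 5.0]])
print(knn_weight_outliers(X, k=2, l=1))  # [4]
```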
While some methods like HilOut detect outliers in the full space despite the high
dimensionality, other methods reduce the high-dimensional outlier detection problem
to a lower-dimensional one by dimensionality reduction (Chapter 3). The idea
is to reduce the high-dimensional space to a lower-dimensional space where normal
instances can still be distinguished from outliers. If such a lower-dimensional space can
be found, then conventional outlier detection methods can be applied.
To reduce dimensionality, general feature selection and extraction methods may be
used or extended for outlier detection. For example, principal components analysis
(PCA) can be used to extract a lower-dimensional space. Heuristically, the principal
components with low variance are preferred because, on such dimensions, normal
objects are likely close to each other and outliers often deviate from the majority.
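The low-variance heuristic can be sketched as follows; the choice of m retained components and the standardized-deviation scoring rule are illustrative assumptions, not part of PCA itself.

```python
import numpy as np

def low_variance_pca_scores(X, m=1):
    """Score each object by its standardized deviation along the m
    principal components of LOWEST variance, following the heuristic
    that normal objects stay close together on such dimensions while
    outliers deviate from the majority."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions ordered by descending
    # singular value, so the last m rows span the low-variance subspace.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[-m:].T            # coordinates in that subspace
    std = proj.std(axis=0) + 1e-12   # avoid division by zero
    return np.abs(proj / std).max(axis=1)

# Four points on the line y = x and one off it: the off-line point
# deviates most along the low-variance (perpendicular) direction.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
              [3.0, 3.0], [2.0, 0.0]])
scores = low_variance_pca_scores(X, m=1)
print(scores.argmax())  # 4
```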
By extending conventional outlier detection methods, we can reuse much of the
experience gained from research in the field. These new methods, however, are limited.
First, they cannot detect outliers with respect to subspaces and thus have limited
interpretability. Second, dimensionality reduction is feasible only if there exists a lower-dimensional
space where normal objects and outliers are well separated. This assumption may not
hold true.
12.8.2 Finding Outliers in Subspaces
Another approach for outlier detection in high-dimensional data is to search for outliers
in various subspaces. A unique advantage is that, if an object is found to be an outlier
in a subspace of much lower dimensionality, the subspace provides critical information
for interpreting why and to what extent the object is an outlier. This insight is highly
valuable in applications with high-dimensional data due to the overwhelming number
of dimensions.
Example 12.24 Outliers in subspaces. As a customer-relationship manager at AllElectronics, you are
interested in finding outlier customers. AllElectronics maintains an extensive customer
information database, which contains many attributes and the transaction history of
customers. The database is high dimensional.
Suppose you find that a customer, Alice, is an outlier in a lower-dimensional subspace
that contains the dimensions average transaction amount and purchase frequency,
such that her average transaction amount is substantially larger than the majority of
the customers, and her purchase frequency is dramatically lower. The subspace itself
speaks for why and to what extent Alice is an outlier. Using this information, you
strategically decide to approach Alice by suggesting options that could improve her
purchase frequency at AllElectronics.
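The kind of subspace evidence cited in this example can be sketched numerically; the customer data, the column meanings, and the z-score rule below are all illustrative assumptions.

```python
import numpy as np

def subspace_zscores(X, dims):
    """Per-attribute z-scores of every object restricted to a chosen
    subspace (given as column indices). Large-magnitude z-scores in a
    low-dimensional subspace are directly interpretable: the deviating
    attributes say why, and the magnitudes say to what extent, an
    object is an outlier."""
    sub = X[:, dims]
    return (sub - sub.mean(axis=0)) / sub.std(axis=0)

# Hypothetical columns: [average transaction amount, purchase frequency].
# The last row plays the role of Alice: a much larger amount and a much
# lower frequency than the other customers.
X = np.array([[50.0, 10.0], [55.0, 12.0], [48.0, 11.0],
              [52.0, 9.0], [300.0, 1.0]])
z = subspace_zscores(X, dims=[0, 1])
# Alice's z-scores: strongly positive on amount, strongly negative on
# frequency -- the same evidence described in the example.
print(z[-1])
```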