All objects are ranked in weight-descending order. The top-l objects in weight are output
as outliers, where l is another user-specified parameter.
Computing the k-nearest neighbors for every object is costly and does not scale up
when the dimensionality is high and the database is large. To address the scalability issue,
HilOut employs space-filling curves to achieve an approximation algorithm, which is
scalable in both running time and space with respect to database size and dimensionality.
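The weight-based ranking described above can be sketched with an exact, quadratic-time computation; this is precisely the cost that HilOut avoids by approximating the neighbor search along Hilbert space-filling curves. Here, an object's weight is taken as the sum of distances to its k nearest neighbors; the function name and test data are illustrative.

```python
import numpy as np

def knn_weight_outliers(X, k=5, l=3):
    """Rank objects by "weight" (here, the sum of distances to the
    k nearest neighbors) and report the l heaviest as outliers.
    This is the exact O(n^2) computation that HilOut approximates
    with Hilbert space-filling curves to achieve scalability."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # exclude self-distance
    weights = np.sort(dist, axis=1)[:, :k].sum(axis=1)
    # Weight-descending order; the top-l indices are reported as outliers.
    return np.argsort(weights)[::-1][:l]

# A tight cluster plus one distant point: the distant point (index 4)
# accumulates the largest k-NN distances and is ranked first.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [5.0, 5.0]])
print(knn_weight_outliers(X, k=2, l=1))  # [4]
```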
While some methods like HilOut detect outliers in the full space despite the high
dimensionality, other methods reduce the high-dimensional outlier detection problem
to a lower-dimensional one by dimensionality reduction (Chapter 3). The idea
is to reduce the high-dimensional space to a lower-dimensional space where normal
instances can still be distinguished from outliers. If such a lower-dimensional space can
be found, then conventional outlier detection methods can be applied.
To reduce dimensionality, general feature selection and extraction methods may be
used or extended for outlier detection. For example, principal components analysis
(PCA) can be used to extract a lower-dimensional space. Heuristically, the principal
components with low variance are preferred because, on such dimensions, normal
objects are likely close to each other and outliers often deviate from the majority.
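The low-variance heuristic can be sketched as follows; the choice of m retained components and the standardized-deviation scoring rule are illustrative assumptions, not part of PCA itself.

```python
import numpy as np

def low_variance_pca_scores(X, m=1):
    """Score each object by its standardized deviation along the m
    principal components of LOWEST variance, following the heuristic
    that normal objects stay close together on such dimensions while
    outliers deviate from the majority."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions ordered by descending
    # singular value, so the last m rows span the low-variance subspace.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[-m:].T            # coordinates in that subspace
    std = proj.std(axis=0) + 1e-12   # avoid division by zero
    return np.abs(proj / std).max(axis=1)

# Four points on the line y = x and one off it: the off-line point
# deviates most along the low-variance (perpendicular) direction.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
              [3.0, 3.0], [2.0, 0.0]])
scores = low_variance_pca_scores(X, m=1)
print(scores.argmax())  # 4
```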
By extending conventional outlier detection methods, we can reuse much of the
experience gained from research in the field. These new methods, however, are limited.
First, they cannot detect outliers with respect to subspaces and thus have limited
interpretability. Second, dimensionality reduction is feasible only if there exists a lower-dimensional
space where normal objects and outliers are well separated. This assumption may not
hold true.
12.8.2 Finding Outliers in Subspaces
Another approach for outlier detection in high-dimensional data is to search for outliers
in various subspaces. A unique advantage is that, if an object is found to be an outlier
in a subspace of much lower dimensionality, the subspace provides critical information
for interpreting why and to what extent the object is an outlier. This insight is highly
valuable in applications with high-dimensional data due to the overwhelming number
of dimensions.
Example 12.24 Outliers in subspaces. As a customer-relationship manager at AllElectronics, you are
interested in finding outlier customers. AllElectronics maintains an extensive customer
information database, which contains many attributes and the transaction history of
customers. The database is high dimensional.
Suppose you find that a customer, Alice, is an outlier in a lower-dimensional subspace
that contains the dimensions average transaction amount and purchase frequency,
such that her average transaction amount is substantially larger than the majority of
the customers, and her purchase frequency is dramatically lower. The subspace itself
speaks for why and to what extent Alice is an outlier. Using this information, you
strategically decide to approach Alice by suggesting options that could improve her
purchase frequency at AllElectronics.
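The kind of subspace evidence cited in this example can be sketched numerically; the customer data, the column meanings, and the z-score rule below are all illustrative assumptions.

```python
import numpy as np

def subspace_zscores(X, dims):
    """Per-attribute z-scores of every object restricted to a chosen
    subspace (given as column indices). Large-magnitude z-scores in a
    low-dimensional subspace are directly interpretable: the deviating
    attributes say why, and the magnitudes say to what extent, an
    object is an outlier."""
    sub = X[:, dims]
    return (sub - sub.mean(axis=0)) / sub.std(axis=0)

# Hypothetical columns: [average transaction amount, purchase frequency].
# The last row plays the role of Alice: a much larger amount and a much
# lower frequency than the other customers.
X = np.array([[50.0, 10.0], [55.0, 12.0], [48.0, 11.0],
              [52.0, 9.0], [300.0, 1.0]])
z = subspace_zscores(X, dims=[0, 1])
# Alice's z-scores: strongly positive on amount, strongly negative on
# frequency -- the same evidence described in the example.
print(z[-1])
```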