“How can we detect outliers in subspaces?” We use a grid-based subspace outlier detection method to illustrate. The major ideas are as follows. We consider projections of the data onto various subspaces. If, in a subspace, we find an area whose density is much lower than average, then the area may contain outliers. To find such projections, we first discretize the data into a grid in an equal-depth way. That is, each dimension is partitioned into ϕ equal-depth ranges, where each range contains a fraction, f, of the objects (f = 1/ϕ). Equal-depth partitioning is used because data along different dimensions may have different localities. An equal-width partitioning of the space may not be able to reflect such differences in locality.
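As a concrete illustration, the following Python sketch performs the equal-depth discretization using per-dimension quantiles. The function name equal_depth_grid and the use of NumPy are our own choices for illustration, not part of the method's specification.

```python
import numpy as np

def equal_depth_grid(X, phi):
    """Discretize each dimension of X into phi equal-depth (quantile) ranges.

    Returns an integer array of the same shape as X in which each entry is the
    index (0 .. phi-1) of the range that the value falls into along its
    dimension, so each range holds roughly a fraction f = 1/phi of the objects.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    cells = np.empty((n, d), dtype=int)
    for j in range(d):
        # Quantile-based cut points give equal-depth rather than equal-width ranges.
        edges = np.quantile(X[:, j], np.linspace(0.0, 1.0, phi + 1))
        # Assign each value to a range; clip keeps the maximum value in the last range.
        cells[:, j] = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, phi - 1)
    return cells
```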
Next, we search for regions defined by ranges in subspaces that are significantly sparse. To quantify what we mean by “significantly sparse,” let's consider a k-dimensional cube formed by k ranges on k dimensions. Suppose the data set contains n objects. If the objects are independently distributed, the expected number of objects falling into a k-dimensional region is (1/ϕ)^k n = f^k n. The standard deviation of the number of points in a k-dimensional region is √(f^k (1 − f^k) n). Suppose a specific k-dimensional cube C has n(C) objects. We can define the sparsity coefficient of C as

S(C) = (n(C) − f^k n) / √(f^k (1 − f^k) n).    (12.22)
If S(C) < 0, then C contains fewer objects than expected. The smaller the value of S(C) (i.e., the more negative), the sparser C is and the more likely the objects in C are outliers in the subspace.
By assuming S(C) follows a normal distribution, we can use normal distribution tables to determine the probabilistic significance level for an object that deviates dramatically from the average, under an a priori assumption that the data follow a uniform distribution. In general, the assumption of uniform distribution does not hold. However, the sparsity coefficient still provides an intuitive measure of the “outlier-ness” of a region.
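As a minimal sketch of Eq. (12.22), the helper below (our own naming, not code from the text) computes the sparsity coefficient from the cube count n(C), the data set size n, the per-range fraction f, and the cube dimensionality k.

```python
import numpy as np

def sparsity_coefficient(n_C, n, f, k):
    """Sparsity coefficient S(C) of Eq. (12.22) for a k-dimensional cube C."""
    expected = (f ** k) * n                        # expected count under independence
    std = np.sqrt((f ** k) * (1.0 - f ** k) * n)   # standard deviation of the count
    return (n_C - expected) / std
```

For example, with n = 10,000 objects and ϕ = 10 ranges per dimension (f = 0.1), a 3-dimensional cube is expected to hold f^3 n = 10 objects; a cube holding only one object has S(C) ≈ (1 − 10)/√(0.001 · 0.999 · 10,000) ≈ −2.85, marking its objects as candidate outliers.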
To find cubes with significantly small sparsity coefficient values, a brute-force approach is to search every cube in every possible subspace. The cost of doing so, however, is exponential in the dimensionality. An evolutionary search can be conducted instead, which improves efficiency at the expense of accuracy. For details, please refer to the bibliographic notes (Section 12.11). The objects contained in cubes with very small sparsity coefficient values are output as outliers.
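To make the cost concrete, here is a brute-force sketch that reuses the two helpers above to scan every k-dimensional subspace; the function name and the significance threshold are illustrative assumptions. It only visits occupied cubes, since empty cubes contain no objects to report as outliers.

```python
from itertools import combinations
from collections import Counter

def brute_force_sparse_cubes(cells, f, k, threshold=-3.0):
    """Report (subspace, cube, S(C)) triples for occupied k-dimensional cubes
    whose sparsity coefficient falls below `threshold`.

    `cells` is the discretized grid from equal_depth_grid(). The number of
    k-dimensional subspaces grows combinatorially with the dimensionality,
    which is why an evolutionary search is preferred in practice.
    """
    n, d = cells.shape
    sparse = []
    for dims in combinations(range(d), k):                  # every k-dimensional subspace
        counts = Counter(map(tuple, cells[:, list(dims)]))  # occupied cubes and their counts
        for cube, n_C in counts.items():
            s = sparsity_coefficient(n_C, n, f, k)
            if s < threshold:
                sparse.append((dims, cube, s))
    return sparse
```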
In summary, searching for outliers in subspaces is advantageous in that the outliers
found tend to be better understood, owing to the context provided by the subspaces.
Challenges include making the search efficient and scalable.
12.8.3 Modeling High-Dimensional Outliers
An alternative approach to outlier detection in high-dimensional data is to develop new models for high-dimensional outliers directly. Such models typically avoid
 