Databases Reference
In-Depth Information
As with contextual outlier detection, collective outlier detection methods can also be
divided into two categories. The first category consists of methods that reduce the prob-
lem to conventional outlier detection. Its strategy is to identify structureunits , treat each
structure unit (e.g., a subsequence, a time-series segment, a local area, or a subgraph)
as a data object, and extract features. The problem of collective outlier detection is thus
transformed into outlier detection on the set of “structured objects” constructed as such
using the extracted features. A structure unit, which represents a group of objects in the
original data set, is a collective outlier if the structure unit deviates significantly from the
expected trend in the space of the extracted features.
Example 12.23 Collective outlier detection on graph data. Let's see how we can detect collective out-
liers in AllElectronics' online social network of customers. Suppose we treat the social
network as an unlabeled graph. We then treat each possible subgraph of the network as
a structure unit. For each subgraph, S , let j S j be the number of vertices in S , and freq
.
S
/
be the frequency of S in the network. That is, freq
is the number of different subgraphs
in the network that are isomorphic to S . We can use these two features to detect outlier
subgraphs. An outliersubgraph is a collective outlier that contains multiple vertices.
In general, a small subgraph (e.g., a single vertex or a pair of vertices connected by
an edge) is expected to be frequent, and a large subgraph is expected to be infrequent.
Using the preceding simple method, we can detect small subgraphs that are of very low
frequency or large subgraphs that are surprisingly frequent. These are outlier structures
in the social network.
.
S
/
Predefining the structure units for collective outlier detection can be difficult or
impossible. Consequently, the second category of methods models the expected behav-
ior of structure units directly. For example, to detect collective outliers in temporal
sequences, one method is to learn a Markov model from the sequences. A subsequence
can then be declared as a collective outlier if it significantly deviates from the model.
In summary, collective outlier detection is subtle due to the challenge of explor-
ing the structures in data. The exploration typically uses heuristics, and thus may be
application-dependent. The computational cost is often high due to the sophisticated
mining process. While highly useful in practice, collective outlier detection remains a
challenging direction that calls for further research and development.
12.8 Outlier Detection in High-Dimensional Data
In some applications, we may need to detect outliers in high-dimensional data. The
dimensionality curse poses huge challenges for effective outlier detection. As the dimen-
sionality increases, the distance between objects may be heavily dominated by noise.
That is, the distance and similarity between two points in a high-dimensional space
may not reflect the real relationship between the points. Consequently, conventional
outlier detection methods, which mainly use proximity or density to identify outliers,
deteriorate as dimensionality increases.
 
Search WWH ::




Custom Search