Outlier Detection - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

As with contextual outlier detection, collective outlier detection methods can also be

divided into two categories. The first category consists of methods that reduce the prob-

lem to conventional outlier detection. Its strategy is to identify structureunits , treat each

structure unit (e.g., a subsequence, a time-series segment, a local area, or a subgraph)

as a data object, and extract features. The problem of collective outlier detection is thus

transformed into outlier detection on the set of “structured objects” constructed as such

using the extracted features. A structure unit, which represents a group of objects in the

original data set, is a collective outlier if the structure unit deviates significantly from the

expected trend in the space of the extracted features.

Example 12.23 Collective outlier detection on graph data. Let's see how we can detect collective out-

liers in AllElectronics' online social network of customers. Suppose we treat the social

network as an unlabeled graph. We then treat each possible subgraph of the network as

a structure unit. For each subgraph, S , let j S j be the number of vertices in S , and freq

.

S

/

be the frequency of S in the network. That is, freq

is the number of different subgraphs

in the network that are isomorphic to S . We can use these two features to detect outlier

subgraphs. An outliersubgraph is a collective outlier that contains multiple vertices.

In general, a small subgraph (e.g., a single vertex or a pair of vertices connected by

an edge) is expected to be frequent, and a large subgraph is expected to be infrequent.

Using the preceding simple method, we can detect small subgraphs that are of very low

frequency or large subgraphs that are surprisingly frequent. These are outlier structures

in the social network.

.

S

/

Predefining the structure units for collective outlier detection can be difficult or

impossible. Consequently, the second category of methods models the expected behav-

ior of structure units directly. For example, to detect collective outliers in temporal

sequences, one method is to learn a Markov model from the sequences. A subsequence

can then be declared as a collective outlier if it significantly deviates from the model.

In summary, collective outlier detection is subtle due to the challenge of explor-

ing the structures in data. The exploration typically uses heuristics, and thus may be

application-dependent. The computational cost is often high due to the sophisticated

mining process. While highly useful in practice, collective outlier detection remains a

challenging direction that calls for further research and development.

12.8 Outlier Detection in High-Dimensional Data

In some applications, we may need to detect outliers in high-dimensional data. The

dimensionality curse poses huge challenges for effective outlier detection. As the dimen-

sionality increases, the distance between objects may be heavily dominated by noise.

That is, the distance and similarity between two points in a high-dimensional space

may not reflect the real relationship between the points. Consequently, conventional

outlier detection methods, which mainly use proximity or density to identify outliers,

deteriorate as dimensionality increases.

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home