These subspaces are hence expected to prove interesting for various density-based
clustering algorithms. While monotonicity of this quality criterion is not proven in
general, we do know that the core point property is anti-monotonic:
S ⊆ T : o is not a core point in S ⇒ o is not a core point in T.    (16.6)
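The anti-monotonicity of the core point property can be illustrated with a minimal sketch, assuming a DBSCAN-style core point definition (at least `min_pts` objects in the ε-neighborhood) and Euclidean distances; the function name and parameter values are illustrative, not from the original algorithms:

```python
import numpy as np

def is_core_point(data, idx, dims, eps=0.5, min_pts=5):
    """DBSCAN-style core point test for data[idx] in the subspace
    spanned by the attribute indices in `dims`."""
    proj = data[:, dims]                            # project onto the subspace
    dist = np.linalg.norm(proj - proj[idx], axis=1)
    return np.sum(dist <= eps) >= min_pts           # neighborhood incl. the point itself

# Adding attributes can only increase Euclidean distances, so neighborhoods
# shrink: a point that is no core point in S stays no core point in any
# superspace T of S -- this is what licenses Apriori-style candidate pruning.
rng = np.random.default_rng(0)
data = rng.random((200, 4))
S, T = [0, 1], [0, 1, 2]                            # S is a subspace of T
for i in range(len(data)):
    if not is_core_point(data, i, S):
        assert not is_core_point(data, i, T)        # never violated
```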
A similar approach, SURFING (SUbspaces Relevant For clusterING) [15], also following a density-based notion of 'interestingness', assesses the variance of k-nearest neighbor distances. A d-dimensional subspace can be rated as 'interesting', 'neutral', or 'irrelevant' in comparison to its (d − 1)-dimensional subspaces, but this rating is not monotone. Accordingly, the subspace search of SURFING follows the Apriori idea of early pruning of candidates only heuristically but does not formally implement the strictly anti-monotonic candidate elimination.
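The intuition behind rating subspaces by the variance of k-nearest neighbor distances can be sketched as follows; this is a deliberate simplification of the idea, not the exact SURFING measure, and the scoring function is an illustrative stand-in:

```python
import numpy as np

def knn_dist_score(data, dims, k=5):
    """Score a subspace by the variation of k-nearest-neighbor distances
    (simplified): uniform data yields similar k-NN distances everywhere,
    while clustered data mixes dense and sparse regions, so the
    distances vary strongly relative to their mean."""
    proj = data[:, dims]
    # full pairwise distance matrix in the subspace projection
    d = np.linalg.norm(proj[:, None, :] - proj[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, k]          # distance to the k-th neighbor
    return np.mean(np.abs(knn - knn.mean())) / knn.mean()

rng = np.random.default_rng(1)
uniform = rng.random((300, 2))                           # no structure
clustered = np.vstack([rng.normal(0.5, 0.01, (150, 2)),  # dense cluster
                       rng.random((150, 2))])            # plus noise
# the clustered sample receives a clearly higher score
assert knn_dist_score(clustered, [0, 1]) > knn_dist_score(uniform, [0, 1])
```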
CMI (Cumulative Mutual Information) [66] is a measure to assess the correlation among the attributes of some subspace; it is used to identify subspaces that are interesting w.r.t. a high contrast, that is, subspaces likely to contain distinct clusters. The authors assume monotonicity of this contrast criterion to facilitate a candidate-elimination-based search starting with two-dimensional subspaces. As a priority search, generating candidates from the top m subspaces only, their algorithm is more efficient than the Apriori search at the expense of completeness of the results. Finally, a subspace contained in another subspace reported as a result is dropped from the resulting set if the higher-dimensional subspace also has higher contrast.
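The shape of such a top-m priority search can be sketched generically; the `contrast` argument below is a stand-in for the CMI measure, and all names and parameter values are illustrative, not the paper's:

```python
from itertools import combinations

def priority_subspace_search(dims, contrast, m=2, max_dim=3):
    """Sketch of a top-m priority search: start from all two-dimensional
    subspaces, then repeatedly extend only the m highest-contrast
    candidates by one attribute.  Faster than full Apriori enumeration,
    but not complete."""
    candidates = [frozenset(c) for c in combinations(dims, 2)]
    results = {}
    while candidates:
        top = sorted(candidates, key=contrast, reverse=True)[:m]
        results.update((s, contrast(s)) for s in top)
        candidates = list({s | {d} for s in top for d in dims
                           if d not in s and len(s) < max_dim} - results.keys())
    # drop a subspace if a higher-dimensional result reaches its contrast
    return [s for s, sc in results.items()
            if not any(s < t and results[t] >= sc for t in results)]

# toy contrast function: just the sum of the attribute indices
found = priority_subspace_search(range(4), contrast=sum, m=2, max_dim=3)
```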
Local Subspace Search Subspace search has also been incorporated locally into clustering algorithms. DiSH [1], similar to its predecessor HiSC, follows a pattern of cluster search that is different from the Apriori-based subspace clustering idea discussed so far. Appropriate subspaces for distance computations are learned locally for each point; then the locally adapted (subspace-) distances and the dimensionality of the assigned subspace are used as a combined distance measure in a global clustering schema similar to OPTICS [8] to find hierarchies of subspaces.
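A simplified sketch of such a combined distance measure follows; the exact definitions of the preference vectors and distance components in DiSH differ, so everything here is illustrative:

```python
import math

def subspace_distance(p, q, pref_p, pref_q):
    """Simplified combined distance in the spirit of DiSH: the primary
    component is the number of dimensions NOT shared by the two points'
    subspace preference vectors, the secondary component is the Euclidean
    distance within the shared subspace.  Pairs compare lexicographically,
    which fits an OPTICS-like reachability ordering."""
    shared = [a and b for a, b in zip(pref_p, pref_q)]
    d1 = len(p) - sum(shared)                 # dimensions in disagreement
    d2 = math.sqrt(sum((a - b) ** 2           # distance in shared subspace
                       for a, b, s in zip(p, q, shared) if s))
    return (d1, d2)

# q is relevant for p only in attribute 0: one dimension contradicts,
# and the distance within the shared subspace is 3.0
print(subspace_distance([0.0, 0.0], [3.0, 9.0], [True, True], [True, False]))
# → (1, 3.0)
```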
For learning the most appropriate subspace for each data point, both HiSC and DiSH assign a 'subspace preference vector' to each object, based on the variance of the neighborhood in each attribute. As such, the clustering procedure does not make use of an efficient frequent pattern search algorithm. However, while HiSC uses the full-dimensional neighborhood and studies the variances of the neighborhoods in attribute-wise projections, DiSH starts with attribute-wise neighborhoods and combines those neighborhoods in a bottom-up procedure. Here, an Apriori-like search strategy is one of the suggested alternatives, employing the monotonicity of
neighborhoods in projections of the data. If S is a subspace of T, then the cardinality of the ε-neighborhood of some object o in T is bound to be at most the cardinality of the ε-neighborhood of the same object o in S:

S ⊆ T ⇒ |N_ε^T(o)| ≤ |N_ε^S(o)|    (16.7)
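This cardinality bound, and the bottom-up combination of attribute-wise neighborhoods, can be demonstrated in a small sketch. Using the maximum norm is a simplifying choice made here (not necessarily the book's setting): under it, the neighborhood in a combined subspace is exactly the intersection of the attribute-wise neighborhoods.

```python
import numpy as np

def eps_neighborhood(data, idx, dims, eps=0.1):
    """ε-neighborhood of data[idx] in the subspace `dims`, using the
    maximum norm so that combining attributes amounts to intersecting
    the attribute-wise neighborhoods."""
    diff = np.abs(data[:, dims] - data[idx, dims])
    return set(np.where(diff.max(axis=1) <= eps)[0])

rng = np.random.default_rng(2)
data = rng.random((100, 3))
n0 = eps_neighborhood(data, 0, [0])
n1 = eps_neighborhood(data, 0, [1])
n01 = eps_neighborhood(data, 0, [0, 1])
# bottom-up combination: intersecting the 1-d neighborhoods gives the 2-d one
assert n01 == n0 & n1
# the cardinality bound of (16.7): neighborhoods only shrink in superspaces
assert len(n01) <= len(n0) and len(n01) <= len(n1)
```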