Novel Trends in Clustering - Evolving Application Domains of Data Warehousing and Mining

Database Reference

In-Depth Information

are located in the neighbourhood, and an incor-

rect subspace dimension may be determined. In

contrast, if ε is chosen too small, no core objects

may be found. A hierarchical extension of 4C

called HiCO which is based on the hierarchical

density-based clustering notion of OPTICS (cf.

the background section) and PCA has also been

proposed (Achtert et al. 2006). A further exten-

sion, DiSH (Achtert et al. 2007) considers more

general, network-based relationships between

correlation clusters.

In (Tung et al. 2006) the authors propose

CURLER, a method for finding and visualiz-

ing clusters even with a nonlinear correlation.

CURLER uses the concept of micro-clusters that

are generated using a variant of EM clustering. The

micro-clusters are merged to discover correlation

clusters. The merging of micro-clusters is based

on the concept of co-sharing, i.e. the co-sharing

level of two micro-clusters corresponds to the total

amount of points which are associated to both

clusters simultaneously. From the set of all micro-

clusters, an arbitrary starting object is selected,

and, like in OPTICS or Single Link, in each step

that cluster is selected as the next micro-cluster

in the micro-cluster order which has the highest

co-sharing level to the starting object or any of

the previously placed micro-clusters. CURLER

also defines a visualization technique similar to

the reachability-plot of OPTICS. CURLER im-

proves over ORCLUS and 4C as the correlations

underlying the clusters are not necessarily linear.

Thus, also clusters like the example in Figure 1(c)

can be detected. Furthermore, as a fuzzy approach,

CURLER assumes each data object to belong

to all clusters simultaneously, but with different

probabilities for each cluster assigned. By merg-

ing several clusters according to their co-sharing

level, the algorithm on the one hand becomes less

sensitive to the predefined number K of clusters,

thus also overcoming a severe limitation of any

K-means related approach. On the other hand, the

user cannot directly derive a model describing the

correlations, since the original K models are no

longer present in the resulting clustering. There-

fore, one can say that the algorithm can handle

arbitrary non-linear correlations but is not able to

identify the type of correlation.

The most important problem of all approaches

previously presented in this section is the miss-

ing robustness with respect to noise objects and

correlation clusters which are close to each other.

All the applied techniques, iterative methods,

density-based methods and micro-clusters are to

some degree sensitive to the situation when the

correlation cannot be determined by looking at the

full-dimensional local neighbourhood of an object.

The algorithm CASH (Clustering inArbitrary Sub-

spaces based on the Hough transform) presented

in (Achtert et al. 2008) tackles this problem by

a sophisticated parameter space transformation:

Instead of clustering the points directly, the idea is

first to replace each point by the (infinite) set of all

possible, λ-dimensional planes in which the point

is contained. For low (two or three) dimensional

applications from image processing domains, this

idea is already well-known, and denoted by the

term Hough transform. In these low dimensional

domains, the infinite set of all lines or planes can

easily be discretely represented by a finite selec-

tion of some possible sample planes (for example

sampled by discrete angles). Matching planes can

easily be determined by counting of accumula-

tors, where each accumulator corresponds to one

possible plane. In higher dimensional spaces, a

sufficient discretization of all possible planes is

not possible due to the curse of dimensionality.

But the set of all possible planes passing through a

point can also be represented by a function. If the

planes are represented by Euclidean coordinates,

these functions are linear functions. If the planes

are represented by spherical coordinates (which

is advantageous because orthographic planes can-

not be represented in Euclidean coordinates) then

the functions are trigonometric (combinations of

sine and cosine functions). Points sharing a com-

mon plane are determined by those trigonometric

functions which have a common intersection at an

Evolving Application Domains of Data Warehousing and Mining

Search WWH ::

Custom Search

Home