similarity, normalization with the inverse degree
matrix is required. The algorithm proposed in (Shi
& Malik 2000) approximates the optimal nCut by
decomposing the normalized Laplacian D^(-1) L.
It can be proven that minMaxCut can be approximated
by the same generalized eigenproblem.
Ng et al. (2002) propose symmetric normaliza-
tion of the Laplacian, i.e. D^(-1/2) L D^(-1/2). This paper
also provides derivations of spectral clustering
from the perspectives of perturbation theory and
random walks on graphs. Figure 7 provides an
example of non-Gaussian vector data clustered
with this algorithm. It becomes evident that the
transformation of the clustering problem to the
graph-cut perspective allows detecting arbitrarily
shaped clusters.
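This idea can be sketched in a few lines of NumPy. The snippet below is an illustrative two-cluster simplification, not the exact algorithm of Ng et al. (which runs K-means on several eigenvectors): it builds a Gaussian similarity graph (the bandwidth sigma is a free choice), forms the symmetric normalized Laplacian L_sym = I - D^(-1/2) W D^(-1/2), and bisects the data by the sign of its second-smallest eigenvector (equivalently, the second-largest eigenvector of D^(-1/2) W D^(-1/2)):

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Two-way spectral cut of vector data X (n points x d dims)."""
    # Gaussian similarity matrix; self-similarity zeroed as in Ng et al.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # symmetric normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # eigh sorts eigenvalues ascending; the second-smallest eigenvector
    # of L_sym carries the cluster indicator, its sign bisects the data
    _, vecs = np.linalg.eigh(L_sym)
    return (vecs[:, 1] > 0).astype(int)
```

For two well-separated groups the sign pattern of this eigenvector is approximately piecewise constant, so thresholding at zero recovers the partition without any distributional assumption on the data.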
Dhillon (2001) and Zha et al. (2001) propose al-
gorithms for simultaneously clustering documents
and words of a word-document co-occurrence
matrix based on the spectral clustering idea. Docu-
ments and words are arranged in a bipartite graph
and objective functions similar to nCut for bipartite
graph partitioning are introduced. A more general
framework for spectral clustering of multi-type
relational data is introduced in (Long et al. 2006).
This method allows simultaneously clustering
objects of multiple types which are related to each
other, for example web pages, queries and web
users. Technically this is achieved by collective
factorization of related matrices.
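The bipartite construction can also be sketched compactly. Following the normalization used by Dhillon (2001), a word-document matrix A is rescaled to A_n = D1^(-1/2) A D2^(-1/2) with row- and column-degree matrices D1 and D2, and the second singular vector pair of A_n indicates a joint partition of documents and words. The snippet below is a simplified two-cluster illustration (the published algorithm runs K-means over several singular vectors):

```python
import numpy as np

def cocluster_bipartition(A):
    """Two-way co-clustering of the rows and columns of A (rows x cols)."""
    d1 = A.sum(axis=1)          # row (document) degrees
    d2 = A.sum(axis=0)          # column (word) degrees
    # normalized matrix An = D1^{-1/2} A D2^{-1/2}
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    # second left/right singular vectors jointly indicate the partition
    U, _, Vt = np.linalg.svd(An)
    rows = (U[:, 1] > 0).astype(int)
    cols = (Vt[1] > 0).astype(int)
    return rows, cols
```

Because the left and right singular vectors come as a sign-consistent pair, a document cluster and its characteristic word cluster receive the same label, which is exactly the simultaneous clustering of both object types described above.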
Unlike many partitioning clustering algorithms,
spectral clustering requires no assumptions
on the data distribution, and the algorithms can be
easily implemented using standard linear algebra
packages. However, the result strongly depends
on the construction of the similarity matrix and
on a suitable choice of the number of clusters K.
Bach and Jordan (2003) propose a metric-learning
technique to construct the similarity matrix together
with a novel algorithm approximating nCut by
weighted K-means in eigenvector space. Zelnik-
Manor and Perona (2004) propose guidelines for
parameter settings. Another limitation of spectral
methods is that they require decomposing an n × n
matrix for a data set of n points and are therefore
not suitable for very large data sets. Fowlkes et
al. (2004) propose a sampling-based method to
approximate spectral clustering of large data sets.
As an alternative, Dhillon et al. (2004) propose a
weighted kernel K-means algorithm to minimize
nCut without matrix decomposition.
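The equivalence exploited by Dhillon et al. (2004) can be illustrated as follows: with the kernel matrix K = σ D^(-1) + D^(-1) W D^(-1) and the node degrees as point weights, weighted kernel K-means monotonically decreases an objective equivalent to nCut, so no eigendecomposition is needed. The snippet below is an illustrative sketch under these assumptions (a fixed initial labeling is passed in; the σ-shift merely keeps the kernel positive definite):

```python
import numpy as np

def ncut_kernel(W, sigma=1.0):
    """Kernel and weights under which weighted kernel K-means matches nCut."""
    d = W.sum(axis=1)
    D_inv = np.diag(1.0 / d)
    return sigma * D_inv + D_inv @ W @ D_inv, d

def weighted_kernel_kmeans(K, w, labels, n_iter=20):
    """Batch weighted kernel K-means starting from an initial labeling."""
    n = len(w)
    for _ in range(n_iter):
        new = np.empty(n, dtype=int)
        for i in range(n):
            best, best_d = -1, np.inf
            for c in np.unique(labels):
                m = labels == c
                s = w[m].sum()
                # weighted kernel distance from point i to the centroid of c
                dist = (K[i, i]
                        - 2.0 * (w[m] * K[i, m]).sum() / s
                        + (np.outer(w[m], w[m]) * K[np.ix_(m, m)]).sum() / s ** 2)
                if dist < best_d:
                    best, best_d = c, dist
            new[i] = best
        if np.array_equal(new, labels):
            break                 # converged: no reassignments
        labels = new
    return labels
```

Each iteration costs only kernel-matrix lookups, which is why this route scales to data sets where decomposing the full n × n matrix is infeasible.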
PARAMETER-FREE CLUSTERING
Most approaches to clustering introduced so far
suffer from a common problem: To obtain a good
result, the user needs to select suitable values for
parameters such as the number of clusters K in K-
means and related approaches, density thresholds
Figure 7. Spectral clustering following Ng et al. (2002): (a) data set with two non-linear correlation
clusters; (b) color-coded visualization of the normalized Laplacian; (c) the second-largest eigenvector
provides cluster indicators which can be trivially separated by bisecting K-means