Advanced Cluster Analysis - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

A cut is sparsest if its sparsity is not greater than the sparsity of any other cut. There may

be more than one sparsest cut.

In Example 11.21 and Figure 11.14, C 2 is a sparsest cut. Using sparsity as the objective

function, a sparsest cut tries to minimize the number of edges crossing the partitions and

balance the partitions in size.

Consider a clustering on a graph G D.

that partitions the graph into k clusters.

The modularity of a clustering assesses the quality of the clustering and is defined as

V , E

/

l i

j E j

2 !

d i

2j E j

k X

Q D

,

(11.39)

i D1

where l i is the number of edges between vertices in the i th cluster, and d i is the sum of

the degrees of the vertices in the i th cluster. The modularity of a clustering of a graph is

the difference between the fraction of all edges that fall into individual clusters and the

fraction that would do so if the graph vertices were randomly connected. The optimal

clustering of graphs maximizes the modularity.

Theoretically, many graph clustering problems can be regarded as finding good cuts,

such as the sparsest cuts, on the graph. In practice, however, a number of challenges

exist:

High computational cost: Many graph cut problems are computationally expen-

sive. The sparsest cut problem, for example, is NP-hard. Therefore, finding the

optimal solutions on large graphs is often impossible. A good trade-off between

efficiency/scalability and quality has to be achieved.

Sophisticated graphs: Graphs can be more sophisticated than the ones described

here, involving weights and/or cycles.

High dimensionality: A graph can have many vertices. In a similarity matrix, a vertex

is represented as a vector (a row in the matrix) with a dimensionality that is the

number of vertices in the graph. Therefore, graph clustering methods must handle

high dimensionality.

Sparsity: A large graph is often sparse, meaning each vertex on average connects to

only a small number of other vertices. A similarity matrix from a large sparse graph

can also be sparse.

There are two kinds of methods for clustering graph data, which address these

challenges. One uses clustering methods for high-dimensional data, while the other is

designed specifically for clustering graphs.

The first group of methods is based on generic clustering methods for high-

dimensional data. They extract a similarity matrix from a graph using a similarity

measure such as those discussed in Section 11.3.2. A generic clustering method can

then be applied on the similarity matrix to discover clusters. Clustering methods for

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home