Advanced Cluster Analysis - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

high-dimensional data are typically employed. For example, in many scenarios, once

a similarity matrix is obtained, spectral clustering methods (Section 11.2.4) can be

applied. Spectral clustering can approximate optimal graph cut solutions. For additional

information, please refer to the bibliographic notes (Section 11.7).

The second group of methods is specific to graphs. They search the graph to find

well-connected components as clusters. Let's look at a method called SCAN (Structural

Clustering Algorithm for Networks) as an example.

Given an undirected graph, G D.

V , E

, for a vertex, u 2 V , the neighborhood of

u is

/2 E g[f u g. Using the idea of structural-context similarity, SCAN

measures the similarity between two vertices, u , v 2 V , by the normalized common

neighborhood size, that is,

/Df v j.

u , v

/D j0.

/\0.

(11.40)

u , v

p j0.

/jj0.

The larger the value computed, the more similar the two vertices. SCAN uses a similarity

threshold

to define the cluster membership. For a vertex, u 2 V , the

-neighborhood

of u is defined as N

-neighborhood of u contains all

neighbors of u with a structural-context similarity to u that is at least

" .

/Df v 20.

/j.

u , v

/"g. The

In SCAN, a core vertex is a vertex inside of a cluster. That is, u 2 V is a core ver-

tex

is a popularity threshold. SCAN grows clusters from core

vertices. If a vertex v is in the

if j N " .

, where

-neighborhood of a core u , then v is assigned to the

same cluster as u . This process of growing clusters continues until no cluster can be

further grown. The process is similar to the density-based clustering method, DBSCAN

(Chapter 10).

Formally, a vertex v can be directly reached from a core u if v 2 N

" .

. Transitively, a

vertex v can be reached from a core u if there exist vertices w 1 ,

:::

, w n such that w 1 can

be reached from u , w i can be reached from w i 1 for 1

i n , and v can be reached

from w n . Moreover, two vertices, u , v 2 V , which may or may not be cores, are said to

be connected if there exists a core w such that both u and v can be reached from w . All

vertices in a cluster are connected. A cluster is a maximum set of vertices such that every

pair in the set is connected.

Some vertices may not belong to any cluster. Such a vertex u is a hub if the neighbor-

hood

of u contains vertices from more than one cluster. If a vertex does not belong

to any cluster, and is not a hub, it is an outlier .

The SCAN algorithm is shown in Figure 11.15. The search framework closely resem-

bles the cluster-finding process in DBSCAN. SCAN finds a cut of the graph, where

each cluster is a set of vertices that are connected based on the transitive similarity in

a structural context.

An advantage of SCAN is that its time complexity is linear with respect to the number

of edges. In very large and sparse graphs, the number of edges is in the same scale of the

number of vertices. Therefore, SCAN is expected to have good scalability on clustering

large graphs.

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home