Databases Reference
In-Depth Information
high-dimensional data are typically employed. For example, in many scenarios, once
a similarity matrix is obtained, spectral clustering methods (Section 11.2.4) can be
applied. Spectral clustering can approximate optimal graph cut solutions. For additional
information, please refer to the bibliographic notes (Section 11.7).
The second group of methods is specific to graphs. They search the graph to find
well-connected components as clusters. Let's look at a method called SCAN (Structural
Clustering Algorithm for Networks) as an example.
Given an undirected graph, G D.
V , E
/
, for a vertex, u 2 V , the neighborhood of
u is
/2 E g[f u g. Using the idea of structural-context similarity, SCAN
measures the similarity between two vertices, u , v 2 V , by the normalized common
neighborhood size, that is,
0.
u
/Df v j.
u , v
/D j0.
u
/\0.
v
/j
.
(11.40)
.
u , v
p j0.
u
/jj0.
v
/j
The larger the value computed, the more similar the two vertices. SCAN uses a similarity
threshold
"
to define the cluster membership. For a vertex, u 2 V , the
"
-neighborhood
of u is defined as N
-neighborhood of u contains all
neighbors of u with a structural-context similarity to u that is at least
" .
u
/Df v 20.
u
/j.
u , v
/"g. The
"
.
In SCAN, a core vertex is a vertex inside of a cluster. That is, u 2 V is a core ver-
"
tex
is a popularity threshold. SCAN grows clusters from core
vertices. If a vertex v is in the
if j N " .
u
/j
, where
-neighborhood of a core u , then v is assigned to the
same cluster as u . This process of growing clusters continues until no cluster can be
further grown. The process is similar to the density-based clustering method, DBSCAN
(Chapter 10).
Formally, a vertex v can be directly reached from a core u if v 2 N
"
" .
u
/
. Transitively, a
vertex v can be reached from a core u if there exist vertices w 1 ,
:::
, w n such that w 1 can
be reached from u , w i can be reached from w i 1 for 1
i n , and v can be reached
from w n . Moreover, two vertices, u , v 2 V , which may or may not be cores, are said to
be connected if there exists a core w such that both u and v can be reached from w . All
vertices in a cluster are connected. A cluster is a maximum set of vertices such that every
pair in the set is connected.
Some vertices may not belong to any cluster. Such a vertex u is a hub if the neighbor-
hood
<
of u contains vertices from more than one cluster. If a vertex does not belong
to any cluster, and is not a hub, it is an outlier .
The SCAN algorithm is shown in Figure 11.15. The search framework closely resem-
bles the cluster-finding process in DBSCAN. SCAN finds a cut of the graph, where
each cluster is a set of vertices that are connected based on the transitive similarity in
a structural context.
An advantage of SCAN is that its time complexity is linear with respect to the number
of edges. In very large and sparse graphs, the number of edges is in the same scale of the
number of vertices. Therefore, SCAN is expected to have good scalability on clustering
large graphs.
0.
u
/
 
Search WWH ::




Custom Search