We formulate the desired balancing properties by assigning each object
(customer, document, Web session) a weight and then softly constraining the
sum of weights in each cluster. For sample-balanced clustering, we assign each
sample x_j the same weight w_j = 1/n. To obtain value-balancing properties,
a sample x_j's weight is set proportional to the sum of its feature values,
w_j ∝ Σ_i x_{i,j}. Note that the sum of weights for all samples is 1.
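As a minimal sketch of the two weighting schemes, assuming samples are the rows of a data matrix X (the function names are ours, not the book's):

```python
import numpy as np

def sample_balanced_weights(n):
    """Sample balancing: every one of the n samples gets the same
    weight w_j = 1/n, so the weights trivially sum to one."""
    return np.full(n, 1.0 / n)

def value_balanced_weights(X):
    """Value balancing: weight each sample (row of X) by the sum of
    its feature values, normalized so all weights sum to one."""
    totals = X.sum(axis=1).astype(float)  # one total per sample
    return totals / totals.sum()
```

Either way, the weights form a probability distribution over the samples, which is what the balancing constraint below relies on.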
3.3.2 Vertex-Weighted Graph Partitioning
We map the problem of clustering to partitioning a vertex-weighted graph G
into k unconnected components of approximately equal size (as defined
by the balancing constraint) by removing a minimal number of edges. The
objects to be clustered are viewed as a set of vertices V = {x_1, ..., x_n}.
Two vertices x_a and x_b are connected with an undirected edge (a, b) ∈ E
of positive weight given by the similarity s(x_a, x_b). This defines the
graph G = (V, E).
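The construction above can be sketched as follows, assuming a user-supplied pairwise similarity function s (the helper name is ours):

```python
import numpy as np

def similarity_graph(X, s):
    """Build the weighted adjacency matrix of G = (V, E): vertices are
    the samples, and edge (a, b) carries weight s(x_a, x_b) whenever
    that similarity is positive. The graph is undirected, so the
    matrix is symmetric with a zero diagonal (no self-loops)."""
    n = len(X)
    W = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            w = s(X[a], X[b])
            if w > 0:  # only positive-weight edges are kept
                W[a, b] = W[b, a] = w
    return W
```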
An edge separator ∆E is a set of edges whose removal splits the graph G into
k pairwise unconnected components (subgraphs) {G_1, ..., G_k}. All subgraphs
G_ℓ = (V_ℓ, E_ℓ) have pairwise disjoint sets of vertices and edges. The edge
separator for a particular partitioning includes all the edges that are not part
of any subgraph (∆E = E \ (E_1 ∪ E_2 ∪ ... ∪ E_k)). The clustering task is
thus to find an edge separator with a minimum sum of edge weights, which
partitions the graph into k disjoint pieces. The following equation formalizes
this minimum-cut objective:

    min_{∆E} Σ_{(a,b) ∈ ∆E} s(x_a, x_b).    (3.4)
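For a candidate partitioning given as a cluster label per vertex, the minimum-cut objective can be evaluated directly; this small sketch (our own helper, not from the book) sums the weights of the separator edges:

```python
import numpy as np

def cut_weight(W, labels):
    """Sum of edge weights in the edge separator: all edges whose
    endpoints fall into different clusters. W is the symmetric
    adjacency matrix, labels the per-vertex cluster assignment."""
    labels = np.asarray(labels)
    cross = labels[:, None] != labels[None, :]  # True for cross-cluster pairs
    return W[cross].sum() / 2.0                 # each undirected edge counted twice
```

The partitioner's job is to make this quantity as small as possible subject to the balance constraint discussed next.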
Without loss of generality, we can assume that the vertex weights w_j are
normalized to sum to one: Σ_{j=1}^{n} w_j = 1. While striving for the
minimum-cut objective, the balancing constraint

    λ = k · max_{ℓ ∈ {1,...,k}} Σ_{x_j ∈ V_ℓ} w_j ≤ t    (3.4)

has to be fulfilled. The left-hand side of the inequality is called the
imbalance (the ratio of the biggest cluster, in terms of cumulative normalized
vertex weight, to the desired equal cluster size 1/k) and has a lower bound
of one. The balancing threshold t enforces perfectly balanced clusters for
t = 1. In practice, t is often chosen to be slightly greater than 1 (e.g., we
use t = 1.05 for all our experiments, which allows at most 5% imbalance).
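The imbalance and the soft constraint check can be computed in a few lines; this is a sketch under the same setup (normalized vertex weights, k clusters), with helper names of our own choosing:

```python
import numpy as np

def imbalance(weights, labels, k):
    """Left-hand side of Eq. (3.4): k times the largest cluster's
    share of the total normalized vertex weight. Equals 1 only for
    a perfectly balanced partitioning, and is >= 1 otherwise."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to one
    labels = np.asarray(labels)
    mass = np.array([w[labels == l].sum() for l in range(k)])
    return k * mass.max()

def satisfies_balance(weights, labels, k, t=1.05):
    """Soft balancing constraint: imbalance <= threshold t."""
    return imbalance(weights, labels, k) <= t
```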
Thus, in graph partitioning one essentially has to solve a constrained
optimization problem. Finding such an optimal partitioning is an NP-hard
problem [3.15]. However, there are fast heuristic algorithms for this widely
studied problem. We experimented with the Kernighan-Lin (KL) algorithm,
recursive spectral bisection, and multilevel k-way partitioning (Metis).
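To give a feel for the heuristics involved, here is a much-simplified sketch of a single Kernighan-Lin improvement pass for a 2-way partition. It greedily swaps the unlocked cross-partition vertex pair with the highest cut-reduction gain, locks the swapped vertices, and finally keeps only the prefix of swaps with the best cumulative gain. This is our own illustration of the KL idea (no gain buckets, no multi-pass loop, no multilevel coarsening as in Metis), assuming a symmetric W with zero diagonal:

```python
import numpy as np

def kl_pass(W, labels):
    """One simplified Kernighan-Lin pass on a 2-way partition
    (labels are 0/1). Returns possibly improved labels."""
    labels = np.asarray(labels)
    n = len(labels)
    locked = np.zeros(n, dtype=bool)
    cur = labels.copy()
    swaps, gains = [], []
    for _ in range(n // 2):
        same = cur[:, None] == cur[None, :]
        # D[v] = external cost minus internal cost of vertex v
        D = (W * ~same).sum(axis=1) - (W * same).sum(axis=1)
        best = None
        for a in np.where(~locked & (cur == 0))[0]:
            for b in np.where(~locked & (cur == 1))[0]:
                g = D[a] + D[b] - 2.0 * W[a, b]  # gain of swapping a and b
                if best is None or g > best[0]:
                    best = (g, a, b)
        if best is None:
            break
        g, a, b = best
        cur[a], cur[b] = 1, 0     # perform the swap tentatively
        locked[a] = locked[b] = True
        swaps.append((a, b))
        gains.append(g)
    # roll back to the prefix of swaps with the best cumulative gain
    out = labels.copy()
    if gains:
        cum = np.cumsum(gains)
        if cum.max() > 0:
            for a, b in swaps[: int(np.argmax(cum)) + 1]:
                out[a], out[b] = 1, 0
    return out
```

In practice one would iterate such passes until no positive-gain prefix remains, and recurse for k > 2; the multilevel schemes mentioned above add graph coarsening on top to scale to large n.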