Database Reference
In-Depth Information
We formulate the desired balancing properties by assigning each object
(customer, document, Web session) a weight and then softly constrain the
sum of weights in each cluster. For sample-balanced clustering, we assign each
sample
x
j
thesameweightw
j
=1/n. To obtain value-balancing properties,
asample
x
j
's weight is set to w
j
=
v
i=1
x
i,j
. Note that the sum of weights
for all samples is 1.
3.3.2 Vertex-Weighted Graph Partitioning
We map the problem of clustering to partitioning a vertex-weighted graph
G
into k unconnected components of approximately equal size (as defined
by the balancing constraint) by removing a minimal number of edges. The
objects to be clustered are viewed as a set of vertices
V
=
{
x
1
,...,
x
n
}
.Two
vertices
x
a
and
x
b
are connected with an undirected edge (a, b)
∈E
of positive
weight given by the similarity s(
x
a
,
x
b
). This defines the graph
G
=(
V
,
E
).
An edge separator ∆
E
is a set of edges whose removal splits the graph
G
into
k pairwise unconnected components (subgraphs)
{G
1
,...,
G
k
}
. All subgraphs
G
=(
E
) have pairwise disjoint sets of vertices and edges. The edge
separator for a particular partitioning includes all the edges that are not part
of any subgraph or ∆
V
,
∪E
k
)). The clustering task is
thus to find an edge separator with a minimum sum of edge weights, which
partitions the graph into k disjoint pieces. The following equation formalizes
this minimum-cut objective:
E
=(
E\
(
E
1
∪E
2
∪
...
min
∆E
s(
x
a
,
x
b
).
(3.3)
(a,b)∈∆E
Without loss of generality, we can assume that the vertex weights w
j
are
normalized to sum to one:
j=1
w
j
= 1. While striving for the minimum-cut
objective, the balancing constraint
k
max
∈{1,...,k}
w
j
≤
t
(3.4)
λ
j
=
has to be fulfilled. The left-hand side of the inequality is called the imbalance
(the ratio of the biggest cluster in terms of cumulative normalized edge weight
to the desired equal cluster size, n/k) and has a lower bound of one. The
balancing threshold t enforces perfectly balanced clusters for t = 1. In practice
t is often chosen to be slightly greater than 1 (e.g., we use t =1.05 for all our
experiments, which allows at most 5% of imbalance).
Thus, in graph partitioning one has to essentially solve a constrained
optimization problem. Finding such an optimal partitioning is an NP-hard
problem [3.15]. However, there are fast, heuristic algorithms for this widely
studied problem. We experimented with the Kernighan-Lin (KL) algorithm,
recursive spectral bisection, and multilevel k-way partitioning (Metis).
Search WWH ::
Custom Search