Clustering and Visualization of Retail Market Baskets - Advanced Techniques in Knowledge Discovery and Data Mining

Database Reference

In-Depth Information

We formulate the desired balancing properties by assigning each object

(customer, document, Web session) a weight and then softly constrain the

sum of weights in each cluster. For sample-balanced clustering, we assign each

sample x j thesameweightw j =1/n. To obtain value-balancing properties,

asample x j 's weight is set to w j = v i=1

x i,j . Note that the sum of weights

for all samples is 1.

3.3.2 Vertex-Weighted Graph Partitioning

We map the problem of clustering to partitioning a vertex-weighted graph

into k unconnected components of approximately equal size (as defined

by the balancing constraint) by removing a minimal number of edges. The

objects to be clustered are viewed as a set of vertices

{

x 1 ,..., x n }

.Two

vertices x a and x b are connected with an undirected edge (a, b)

∈E

of positive

weight given by the similarity s( x a , x b ). This defines the graph

An edge separator ∆

is a set of edges whose removal splits the graph

into

k pairwise unconnected components (subgraphs)

{G 1 ,...,

G k }

. All subgraphs

G =(

E ) have pairwise disjoint sets of vertices and edges. The edge

separator for a particular partitioning includes all the edges that are not part

of any subgraph or ∆

V ,

∪E k )). The clustering task is

thus to find an edge separator with a minimum sum of edge weights, which

partitions the graph into k disjoint pieces. The following equation formalizes

this minimum-cut objective:

(

E 1 ∪E 2 ∪

...

min

∆E

s( x a , x b ).

(3.3)

(a,b)∈∆E

Without loss of generality, we can assume that the vertex weights w j

are

normalized to sum to one: j=1

w j = 1. While striving for the minimum-cut

objective, the balancing constraint

max

∈{1,...,k}

w j ≤

(3.4)

λ j =

has to be fulfilled. The left-hand side of the inequality is called the imbalance

(the ratio of the biggest cluster in terms of cumulative normalized edge weight

to the desired equal cluster size, n/k) and has a lower bound of one. The

balancing threshold t enforces perfectly balanced clusters for t = 1. In practice

t is often chosen to be slightly greater than 1 (e.g., we use t =1.05 for all our

experiments, which allows at most 5% of imbalance).

Thus, in graph partitioning one has to essentially solve a constrained

optimization problem. Finding such an optimal partitioning is an NP-hard

problem [3.15]. However, there are fast, heuristic algorithms for this widely

studied problem. We experimented with the Kernighan-Lin (KL) algorithm,

recursive spectral bisection, and multilevel k-way partitioning (Metis).

Advanced Techniques in Knowledge Discovery and Data Mining

Search WWH ::

Custom Search

Home