Clustering and Visualization of Retail Market Baskets - Advanced Techniques in Knowledge Discovery and Data Mining

Database Reference

In-Depth Information

3.6.3 FASTOPOSSUM

Because

aims to achieve balanced clusters, random sampling is

effective for obtaining adequate examples of each cluster. If the clusters are

perfectly balanced, the distribution of the number of samples from a specific

cluster in a subsample of size n taken from the entire population is bino-

mial with mean n/k and variance n(k

Opossum

1)/k 2 . For a finite population that is

balanced to begin with, the variance will be even less, because we are doing

sampling without replacement. Thus if one cluster gets more than the ex-

pected number of points at some time in the sampling process, the fraction

of the unsampled points that belong to this cluster becomes less than the

expected number.

If we require at least r repr esentativ es from a cluster, then the number of

samplesisgivenby n /k

−

z α n (k

1)+r,wherez α =1.96 or 2.81 for 97.5%

and 99.5% confidence levels respectively. This is O(rk). For example, if we

have 10 clusters and need to ensure at least 20 representatives from a given

cluster with probability 0.995, about 400 samples are adequate. Note that

this number is independent of n if n is adequately large (at least 400 in this

case), so even for more than one million customers, only 400 representatives

are required. Additional analysis is provided in [3.25].

This suggests a simple and effective way to scale

≥

−

to a very large

number of objects n, using the following four-step process called

Opossum

FastOpos-

sum

:

1. Pick a boot-sample of size n so that the corresponding r value is adequate

to define each cluster.

2. Apply

to the boot-sample to get k initial clusters.

3. Find the centroid for each of the k clusters.

4. Assign each of the remaining n

Opossum

−

n points to the cluster with the nearest

centroid.

Using n = √ n reduces the complexity of

to O(kn). Note that

the algorithm may not result in balanced clusters. We can enforce balancing

by allocating the remaining points to the k clusters in groups, each time solv-

ing a stable-marriage problem [3.29], but this will increase the computation

time.

Figure 3.5 illustrates the behavior of

FastOpossum

for the drugstore cus-

tomer data set from Section 3.5.1. The remaining edge-weight fraction indi-

cates how much of the cumulative edge weight remains after the edge separa-

FastOpossum

tor has been removed: [ k =1 λ a = λ b =,b>a s( x a , x b )]/[ a=1 b=a+1

s( x a , x b )].

The better the partitioning, the smaller the edge-separator and thus the

larger the remaining edge weight fraction. Surprisingly the speedup does not

result in a significantly decreased quality in terms of remaining edge weight

(Fig. 3.5(a)). However, the balancing property is progressively relaxed as the

boot sample becomes smaller in comparison to the full data set (Fig. 3.5(b)).

Using n = 100 initial points reduces the original computation time to less

Advanced Techniques in Knowledge Discovery and Data Mining

Search WWH ::

Custom Search

Home