Database Reference
In-Depth Information
3.6.3 FASTOPOSSUM
Because
aims to achieve balanced clusters, random sampling is
effective for obtaining adequate examples of each cluster. If the clusters are
perfectly balanced, the distribution of the number of samples from a specific
cluster in a subsample of size n taken from the entire population is bino-
mial with mean n/k and variance n(k
Opossum
1)/k 2 . For a finite population that is
balanced to begin with, the variance will be even less, because we are doing
sampling without replacement. Thus if one cluster gets more than the ex-
pected number of points at some time in the sampling process, the fraction
of the unsampled points that belong to this cluster becomes less than the
expected number.
If we require at least r repr esentativ es from a cluster, then the number of
samplesisgivenby n /k
z α n (k
1)+r,wherez α =1.96 or 2.81 for 97.5%
and 99.5% confidence levels respectively. This is O(rk). For example, if we
have 10 clusters and need to ensure at least 20 representatives from a given
cluster with probability 0.995, about 400 samples are adequate. Note that
this number is independent of n if n is adequately large (at least 400 in this
case), so even for more than one million customers, only 400 representatives
are required. Additional analysis is provided in [3.25].
This suggests a simple and effective way to scale
to a very large
number of objects n, using the following four-step process called
Opossum
FastOpos-
sum
:
1. Pick a boot-sample of size n so that the corresponding r value is adequate
to define each cluster.
2. Apply
to the boot-sample to get k initial clusters.
3. Find the centroid for each of the k clusters.
4. Assign each of the remaining n
Opossum
n points to the cluster with the nearest
centroid.
Using n = n reduces the complexity of
to O(kn). Note that
the algorithm may not result in balanced clusters. We can enforce balancing
by allocating the remaining points to the k clusters in groups, each time solv-
ing a stable-marriage problem [3.29], but this will increase the computation
time.
Figure 3.5 illustrates the behavior of
FastOpossum
for the drugstore cus-
tomer data set from Section 3.5.1. The remaining edge-weight fraction indi-
cates how much of the cumulative edge weight remains after the edge separa-
FastOpossum
tor has been removed: [ k =1 λ a = λ b =,b>a s( x a , x b )]/[ a=1 b=a+1
s( x a , x b )].
The better the partitioning, the smaller the edge-separator and thus the
larger the remaining edge weight fraction. Surprisingly the speedup does not
result in a significantly decreased quality in terms of remaining edge weight
(Fig. 3.5(a)). However, the balancing property is progressively relaxed as the
boot sample becomes smaller in comparison to the full data set (Fig. 3.5(b)).
Using n = 100 initial points reduces the original computation time to less
 
Search WWH ::




Custom Search