methods) has a complexity of at least O(n²), as does Opossum. There are four
main ways of reducing this computation. We mention them briefly and then
explore the first option in a bit more detail.
1. Sampling: Sample the data, cluster the sample points, and then use a
quick heuristic to allocate the nonsampled points to the initial clusters.
This approach yields a faster algorithm at the possible cost of some loss
in quality and is employed, for example, in the Buckshot algorithm for
the Scatter/Gather approach to iterative clustering for interactive
browsing [3.23]. If the sample is of size O(√n) and the "nearest cluster
center" rule is used to allocate the remaining points, one obtains an
O(kn) algorithm. Also related are randomized approaches that can
partition a set of points into two clusters of comparable size in
sublinear time, producing a (1+ε) solution with high probability [3.24].
We show later that because Opossum is based on balanced clusters,
sampling is a good choice: one can ensure with high probability that each
cluster is represented in the sample without needing a large sample size
[3.25] (a small illustrative sketch follows this list).
2. Sequential Building: Construct a "core" clustering using a small number
of elements, and then sequentially scan the data to allocate the remaining
inputs, creating new clusters (and optionally adjusting existing centers)
as needed. Such an approach is seen in CLARANS and BIRCH [3.26].
This style compromises balancing to some extent, and the threshold that
determines when a new cluster is formed must be tuned empirically to
bring the number of clusters into the desired range. A version of this
approach for graph partitioning using a corrupted clique model was
proposed in [3.27] and applied to clustering gene expressions; it can be
readily used for Opossum as well. Sequential building is especially
popular for out-of-core methods, the idea being to scan the database once
to form a summarized model (for instance, the size, sum, and sum-squared
values of each cluster [3.28]) in main memory. Subsequent refinement based
on the summarized information is then restricted to main-memory operations,
without resorting to further disk scans (a summary-based sketch appears at
the end of this section).
3. Compare with representatives rather than with all points, as in k-means.
The results, however, become sensitive to the initial selection of repre-
sentatives.
4. Apply prior domain knowledge to “presegment” the data, e.g., using in-
dices or other “partitionings” of the input space. As mentioned earlier,
this becomes increasingly problematic as the dimensionality of the input
space increases to the hundreds or beyond, where suitable segments may
be difficult to estimate, predetermine, or populate.
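
To make the sampling idea in item 1 concrete, here is a minimal
Buckshot-style sketch in Python. The function name buckshot_cluster, the
√n sample size, and the pluggable cluster_sample routine are illustrative
assumptions rather than the published algorithm; the sketch assumes the
expensive clustering is affordable on the sample and that all k clusters
appear in it, which, as noted above, balanced clusters make likely.

    import numpy as np

    def buckshot_cluster(X, k, cluster_sample, seed=None):
        # Cluster an O(sqrt(n)) sample with an expensive routine, then
        # allocate every point to the nearest sample-cluster centroid.
        # The allocation step costs O(kn) distance evaluations.
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        m = max(k, int(np.sqrt(n)))        # sample size O(sqrt(n))
        idx = rng.choice(n, size=m, replace=False)
        sample = X[idx]
        # Expensive step, but applied only to m points; e.g. an O(m^2)
        # agglomerative method, which on m = O(sqrt(n)) points is O(n).
        labels = cluster_sample(sample, k)
        # Centroid of each sample cluster; assumes every label 0..k-1
        # occurs in the sample (likely when clusters are balanced).
        centers = np.vstack([sample[labels == c].mean(axis=0)
                             for c in range(k)])
        # Nearest-center allocation of all n points.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

Any clustering routine can be plugged in as cluster_sample; in the original
Buckshot, a group-average agglomerative method plays this role.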
All these approaches are somewhat orthogonal to the main clustering routine
in that they can be applied in conjunction with most core clustering routines
(including Opossum) to save computation at the cost of some loss in quality.
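
As a companion to item 2, the following sketch illustrates the single-scan,
summary-based style: each cluster is kept only as its size, sum, and
sum-squared values, as in [3.28]. The class and function names and the
distance threshold are illustrative assumptions; this is not BIRCH's
CF-tree, only the underlying bookkeeping.

    import numpy as np

    class ClusterSummary:
        # Sufficient statistics of one cluster -- size, linear sum, and
        # sum of squares -- from which centroids (and variances) can be
        # derived in main memory without revisiting the data on disk.
        def __init__(self, x):
            self.n = 1
            self.s = x.copy()
            self.ss = x * x

        def centroid(self):
            return self.s / self.n

        def absorb(self, x):
            self.n += 1
            self.s += x
            self.ss += x * x

    def sequential_build(points, threshold):
        # Single scan over the data: absorb each point into the nearest
        # summary, or open a new cluster when the nearest centroid lies
        # farther away than `threshold` (which, as noted in item 2, must
        # be tuned to steer the number of clusters to the desired range).
        clusters = []
        for x in points:
            x = np.asarray(x, dtype=float)
            if clusters:
                dists = [np.linalg.norm(x - c.centroid()) for c in clusters]
                j = int(np.argmin(dists))
                if dists[j] <= threshold:
                    clusters[j].absorb(x)
                    continue
            clusters.append(ClusterSummary(x))
        return clusters

After the scan, refinement (merging summaries, recomputing assignments
against the centroids) can proceed entirely in main memory, which is the
point of the summarized model.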