methods) has a complexity of at least O(n²), as does Opossum. There are four
main ways of reducing this computation. We mention them briefly and then
explore the first option in a bit more detail.
1. Sampling: Sample the data, cluster the sample points, and then use a
quick heuristic to allocate the nonsampled points to the initial clusters.
This approach yields a faster algorithm at the possible cost of some loss
in quality and is employed, for example, in the Buckshot algorithm for
the Scatter/Gather approach to iterative clustering for interactive
browsing [3.23]. If the sample is of size O(√n) and the "nearest cluster
center" rule is used to allocate the remaining points, one obtains an
O(kn) algorithm. Also related are randomized approaches that can
partition a set of points into two clusters of comparable size in
sublinear time, producing a (1+ε) solution with high probability [3.24].
We show later that because Opossum is based on balanced clusters,
sampling is a good choice: one can ensure with high probability that each
cluster is represented in the sample without needing a large sample size
[3.25] (a small illustrative sketch follows this list).
2. Sequential Building: Construct a "core" clustering using a small number
of elements, and then sequentially scan the data to allocate the remaining
inputs, creating new clusters (and optionally adjusting existing centers)
as needed. Such an approach is seen in CLARANS and BIRCH [3.26].
This style compromises balancing to some extent, and the threshold that
determines when a new cluster is formed must be tuned empirically to
bring the number of clusters into the desired range. A version of this
approach for graph partitioning using a corrupted clique model was
proposed in [3.27] and applied to clustering gene expressions; it can be
readily used for Opossum as well. Sequential building is especially
popular for out-of-core methods, the idea being to scan the database once
to form a summarized model (for instance, the size, sum, and sum-squared
values of each cluster [3.28]) in main memory. Subsequent refinement based
on the summarized information is then restricted to main-memory operations,
without resorting to further disk scans (a summary-based sketch appears at
the end of this section).
3. Compare with representatives rather than with all points, as in k-means.
The results, however, become sensitive to the initial selection of repre-
sentatives.
4. Apply prior domain knowledge to “presegment” the data, e.g., using in-
dices or other “partitionings” of the input space. As mentioned earlier,
this becomes increasingly problematic as the dimensionality of the input
space increases to the hundreds or beyond, where suitable segments may
be difficult to estimate, predetermine, or populate.
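
To make the sampling idea in item 1 concrete, here is a minimal
Buckshot-style sketch in Python. The function name buckshot_cluster, the
√n sample size, and the pluggable cluster_sample routine are illustrative
assumptions rather than the published algorithm; the sketch assumes the
expensive clustering is affordable on the sample and that all k clusters
appear in it, which, as noted above, balanced clusters make likely.

    import numpy as np

    def buckshot_cluster(X, k, cluster_sample, seed=None):
        # Cluster an O(sqrt(n)) sample with an expensive routine, then
        # allocate every point to the nearest sample-cluster centroid.
        # The allocation step costs O(kn) distance evaluations.
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        m = max(k, int(np.sqrt(n)))        # sample size O(sqrt(n))
        idx = rng.choice(n, size=m, replace=False)
        sample = X[idx]
        # Expensive step, but applied only to m points; e.g. an O(m^2)
        # agglomerative method, which on m = O(sqrt(n)) points is O(n).
        labels = cluster_sample(sample, k)
        # Centroid of each sample cluster; assumes every label 0..k-1
        # occurs in the sample (likely when clusters are balanced).
        centers = np.vstack([sample[labels == c].mean(axis=0)
                             for c in range(k)])
        # Nearest-center allocation of all n points.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

Any clustering routine can be plugged in as cluster_sample; in the original
Buckshot, a group-average agglomerative method plays this role.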
All these approaches are somewhat orthogonal to the main clustering routine
in that they can be applied in conjunction with most core clustering routines
(including Opossum) to save computation at the cost of some loss in quality.
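
As a companion to item 2, the following sketch illustrates the single-scan,
summary-based style: each cluster is kept only as its size, sum, and
sum-squared values, as in [3.28]. The class and function names and the
distance threshold are illustrative assumptions; this is not BIRCH's
CF-tree, only the underlying bookkeeping.

    import numpy as np

    class ClusterSummary:
        # Sufficient statistics of one cluster -- size, linear sum, and
        # sum of squares -- from which centroids (and variances) can be
        # derived in main memory without revisiting the data on disk.
        def __init__(self, x):
            self.n = 1
            self.s = x.copy()
            self.ss = x * x

        def centroid(self):
            return self.s / self.n

        def absorb(self, x):
            self.n += 1
            self.s += x
            self.ss += x * x

    def sequential_build(points, threshold):
        # Single scan over the data: absorb each point into the nearest
        # summary, or open a new cluster when the nearest centroid lies
        # farther away than `threshold` (which, as noted in item 2, must
        # be tuned to steer the number of clusters to the desired range).
        clusters = []
        for x in points:
            x = np.asarray(x, dtype=float)
            if clusters:
                dists = [np.linalg.norm(x - c.centroid()) for c in clusters]
                j = int(np.argmin(dists))
                if dists[j] <= threshold:
                    clusters[j].absorb(x)
                    continue
            clusters.append(ClusterSummary(x))
        return clusters

After the scan, refinement (merging summaries, recomputing assignments
against the centroids) can proceed entirely in main memory, which is the
point of the summarized model.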