In summary, the challenges mainly arise from two aspects:³ (i) large numbers of data samples, n; and (ii) each sample having a large number of attributes or features (dimensions, d), which cannot be readily simplified without much information loss.
The first aspect can be dealt with by subsampling the data, exploiting summary statistics, aggregating or “rolling up” to consider data at a coarser resolution, or using approximating heuristics that reduce computation time at the cost of some loss in quality. See Chapter 8 [3.8] for several examples of such approaches. This chapter provides a way of addressing the second aspect by describing an alternative way of clustering and visualization when, even after feature reduction, one is left with hundreds of dimensions per object (and further reduction will significantly degrade the results), and moreover, simplifying data modeling assumptions are also not valid. Because clustering basically involves grouping objects based on their interrelationships or similarities, one can alternatively work in similarity space instead of the original feature space. The key insight in this work is that if one can find a similarity measure (derived from the object features) that is appropriate for the problem domain, then a single number can capture the essential “closeness” of a given pair of objects, and any further analysis can be based only on these numbers. This can be of great benefit when the data are very high-dimensional and simplifications such as further reducing dimensionality through projections or assuming conditional independence of features are not appropriate. Indeed, several researchers have recently proposed similarity-based clustering techniques for data mining applications [3.7], [3.9], [3.10], and this is emerging as an active area of research.
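For concreteness, the following is a minimal sketch of this idea (not code from the chapter): assuming cosine similarity is an appropriate measure for the domain, an n × d feature matrix is reduced to an n × n similarity matrix, and every subsequent step can operate on those pairwise numbers alone. The data sizes and the choice of cosine similarity are illustrative assumptions.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities for an (n, d) feature matrix X.

    Entry S[i, j] is the single number that summarizes the closeness of
    objects i and j; later analysis needs only S, not the d features.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0             # guard against all-zero rows
    Xn = X / norms
    return Xn @ Xn.T                      # S[i, j] lies in [-1, 1]

# Illustrative sizes: 1000 objects, each with 500 features.
rng = np.random.default_rng(0)
X = rng.random((1000, 500))
S = cosine_similarity_matrix(X)           # all further analysis can use S alone
```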
The similarity space also lends itself to a simple technique to visualize the clustering results. A major contribution of this chapter is to demonstrate that this technique has increased power when the clustering method used contains ordering information (e.g., top-down). Popular clustering methods in feature space are either nonhierarchical (as in k-means) or bottom-up (agglomerative clustering). However, if one transforms the clustering problem into a related problem of partitioning a similarity graph, several powerful partitioning methods with ordering properties can be applied. Moreover, the overall framework is quite generally applicable if one can determine the appropriate similarity measure for a given situation.
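To make the graph view concrete, the sketch below treats a pairwise similarity matrix as a weighted graph and partitions it top-down by recursive bisection. The chapter's own method is a multilevel graph-partitioning algorithm [3.11]; plain spectral bisection and the toy data here are stand-in assumptions, meant only to illustrate how a sequence of top-down splits carries ordering information that bottom-up methods lack.

```python
import numpy as np

def spectral_bisect(S, idx):
    """Split the objects in idx into two halves using the Fiedler vector
    (second-smallest eigenvector of the graph Laplacian) of the subgraph."""
    W = S[np.ix_(idx, idx)]
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)           # eigenvectors, ascending eigenvalues
    order = np.argsort(vecs[:, 1])        # order objects along the Fiedler vector
    half = len(idx) // 2
    return idx[order[:half]], idx[order[half:]]

def recursive_partition(S, k):
    """Top-down partitioning into k groups; the sequence of splits itself
    yields a coarse-to-fine ordering of the objects."""
    parts = [np.arange(S.shape[0])]
    while len(parts) < k:
        biggest = max(range(len(parts)), key=lambda i: len(parts[i]))
        left, right = spectral_bisect(S, parts.pop(biggest))
        parts.extend([left, right])
    return parts

# Toy similarity matrix built from random features (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((200, 50))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T

labels = np.empty(S.shape[0], dtype=int)
for c, members in enumerate(recursive_partition(S, k=4)):
    labels[members] = c                   # cluster label per object
```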
We begin by considering domain-specific transformations into similarity space in Section 3.2. Section 3.3 describes a specific clustering technique for transaction data (Opossum), based on a multilevel graph-partitioning algorithm [3.11]. In Section 3.4, we describe a simple but effective visualization technique applicable to similarity spaces (Clusion). Clustering and visualization results are presented in Section 3.5. In Section 3.6, we consider system
³ A third issue, of how to deal with seasonality and other temporal variations in the data, is also critical in some applications. This aspect is not within the scope of this chapter, but see [3.4] for a solution for retail data.