Capability of clustering high-dimensionality data: A data set can contain numerous
dimensions or attributes. When clustering documents, for example, each keyword
can be regarded as a dimension, and there are often thousands of keywords. Most
clustering algorithms are good at handling low-dimensional data such as data sets
involving only two or three dimensions. Finding clusters of data objects in a
high-dimensional space is challenging, especially considering that such data can be
very sparse and highly skewed.
Constraint-based clustering: Real-world applications may need to perform
clustering under various kinds of constraints. Suppose that your job is to choose the
locations for a given number of new automatic teller machines (ATMs) in a city. To
decide upon this, you may cluster households while considering constraints such as
the city's rivers and highway networks and the types and number of customers per
cluster. A challenging task is to find data groups with good clustering behavior that
satisfy specified constraints.
Interpretability and usability: Users want clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied in with
specific semantic interpretations and applications. It is important to study how an
application goal may influence the selection of clustering features and clustering
methods.
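The high-dimensionality requirement above can be made concrete with a small sketch. It treats each keyword as one dimension, as the document example describes, and measures how many entries of each document vector are zero. The three documents, the names `docs`, `to_vector`, and `sparsity`, and the whitespace tokenization are all illustrative assumptions, not part of the original text.

```python
from collections import Counter

# Hypothetical toy corpus; a real one would have thousands of keywords.
docs = {
    "d1": "election vote senate policy vote",
    "d2": "football goal league match",
    "d3": "basketball league playoff score",
}

# Each distinct keyword becomes one dimension of the vector space.
vocabulary = sorted({word for text in docs.values() for word in text.split()})

def to_vector(text):
    """Map a document to keyword counts, one entry per vocabulary word."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

def sparsity(vector):
    """Fraction of dimensions in which the document has a zero count."""
    return sum(1 for x in vector if x == 0) / len(vector)

vectors = {name: to_vector(text) for name, text in docs.items()}

for name, vec in vectors.items():
    print(name, "dimensions:", len(vec), "sparsity:", round(sparsity(vec), 2))
```

Even in this tiny corpus, most entries of every vector are zero, because each document uses only a few of the available keywords. As the vocabulary grows, the number of dimensions grows with it while each document still touches only a small share of them.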
The following are orthogonal aspects with which clustering methods can be compared:
The partitioning criteria: In some methods, all the objects are partitioned so that
no hierarchy exists among the clusters. That is, all the clusters are at the same level
conceptually. Such a method is useful, for example, for partitioning customers into
groups so that each group has its own manager. Alternatively, other methods
partition data objects hierarchically, where clusters can be formed at different semantic
levels. For example, in text mining, we may want to organize a corpus of documents
into multiple general topics, such as “politics” and “sports,” each of which may have
subtopics. For instance, “football,” “basketball,” “baseball,” and “hockey” can exist as
subtopics of “sports.” The latter four subtopics are at a lower level in the hierarchy
than “sports.”
Separation of clusters: Some methods partition data objects into mutually exclusive
clusters. When clustering customers into groups so that each group is taken care of by
one manager, each customer may belong to only one group. In some other situations,
the clusters may not be exclusive, that is, a data object may belong to more than one
cluster. For example, when clustering documents into topics, a document may be
related to multiple topics. Thus, the topics as clusters may not be exclusive.
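The difference between exclusive and non-exclusive clusters can be sketched as a check on cluster membership. The sketch below assumes memberships are stored as a mapping from each object to the set of clusters it belongs to; the helper `is_exclusive` and the sample customers, managers, and documents are hypothetical names chosen for illustration.

```python
def is_exclusive(membership):
    """Return True if every object belongs to exactly one cluster."""
    return all(len(clusters) == 1 for clusters in membership.values())

# Customers grouped by manager: each customer has exactly one group.
customers = {
    "alice": {"manager_1"},
    "bob": {"manager_2"},
    "carol": {"manager_1"},
}

# Documents grouped by topic: a document may relate to several topics.
documents = {
    "doc_a": {"politics"},
    "doc_b": {"politics", "sports"},
    "doc_c": {"sports"},
}

print("customers exclusive:", is_exclusive(customers))
print("documents exclusive:", is_exclusive(documents))
```

Storing a set per object, rather than a single label, lets the same structure describe both situations, so a method's output can be tested for exclusivity instead of assuming it.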
Similarity measure: Some methods determine the similarity between two objects
by the distance between them. Such a distance can be defined on Euclidean space,
 