Cluster Analysis: Basic Concepts and Methods - Data Mining: Concepts and Techniques


Capability of clustering high-dimensionality data: A data set can contain numerous dimensions or attributes. When clustering documents, for example, each keyword can be regarded as a dimension, and there are often thousands of keywords. Most clustering algorithms are good at handling low-dimensional data such as data sets involving only two or three dimensions. Finding clusters of data objects in a high-dimensional space is challenging, especially considering that such data can be very sparse and highly skewed.
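
To make the sparsity point concrete, the sketch below treats each distinct keyword as a dimension and measures how many vector entries are zero. The three-sentence corpus, the whitespace tokenizer, and the function names are illustrative assumptions, not part of the original text.

```python
# Toy corpus; the documents and the whitespace tokenizer are assumptions.
docs = [
    "stocks fell as markets closed",
    "the team won the football match",
    "parliament passed the budget bill",
]

def build_vocabulary(documents):
    """Return a sorted list of distinct keywords; each one is a dimension."""
    return sorted({word for doc in documents for word in doc.split()})

def to_vector(doc, vocabulary):
    """Represent a document as keyword counts over the vocabulary."""
    words = doc.split()
    return [words.count(term) for term in vocabulary]

def sparsity(vectors):
    """Fraction of entries that are zero across all vectors."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total

vocabulary = build_vocabulary(docs)
vectors = [to_vector(d, vocabulary) for d in docs]
print(len(vocabulary), "dimensions;",
      f"{sparsity(vectors):.0%} of entries are zero")
```

Even with three short documents, most entries are zero, because each document uses only a small part of the vocabulary. With thousands of keywords the effect is far stronger.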

Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic teller machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks and the types and number of customers per cluster. A challenging task is to find data groups with good clustering behavior that satisfy specified constraints.
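
As a rough illustration of one such constraint, the sketch below assigns households to candidate ATM sites while capping the number of households per site. It is a simple greedy heuristic chosen for brevity, not an established constraint-based clustering algorithm, and its result depends on the order of the input. The coordinates, the `capacity` parameter, and the function name are hypothetical.

```python
import math

def assign_with_capacity(points, centers, capacity):
    """Greedily assign each point to its nearest center that still has room."""
    if capacity * len(centers) < len(points):
        raise ValueError("not enough total capacity for all points")
    load = [0] * len(centers)
    assignment = []
    for p in points:
        # Rank centers by Euclidean distance to this point.
        ranked = sorted(range(len(centers)),
                        key=lambda c: math.dist(p, centers[c]))
        # Take the closest center that has not reached its cap.
        chosen = next(c for c in ranked if load[c] < capacity)
        load[chosen] += 1
        assignment.append(chosen)
    return assignment

# Hypothetical household locations and candidate ATM sites.
households = [(0, 0), (1, 0), (0, 1), (9, 9)]
atm_sites = [(0, 0), (10, 10)]

unconstrained = assign_with_capacity(households, atm_sites, capacity=4)
constrained = assign_with_capacity(households, atm_sites, capacity=2)
```

With a cap of two households per site, one household that is closer to the first site is pushed to the second, which shows how a constraint can trade cluster compactness for feasibility.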

Interpretability and usability: Users want clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied in with specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and clustering methods.

The following are orthogonal aspects with which clustering methods can be compared:

The partitioning criteria: In some methods, all the objects are partitioned so that no hierarchy exists among the clusters. That is, all the clusters are at the same level conceptually. Such a method is useful, for example, for partitioning customers into groups so that each group has its own manager. Alternatively, other methods partition data objects hierarchically, where clusters can be formed at different semantic levels. For example, in text mining, we may want to organize a corpus of documents into multiple general topics, such as “politics” and “sports,” each of which may have subtopics. For instance, “football,” “basketball,” “baseball,” and “hockey” can exist as subtopics of “sports.” The latter four subtopics are at a lower level in the hierarchy than “sports.”
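
The difference between the two partitioning criteria can be sketched as a difference in data structure: a flat mapping versus a nested one. The customer and manager names are hypothetical, and nested dictionaries are just one convenient representation.

```python
# Flat partition: every cluster sits at the same conceptual level.
flat_partition = {
    "manager_a": ["customer_1", "customer_2"],
    "manager_b": ["customer_3"],
}

# Hierarchical partition: clusters nest, so topics live at different levels.
topic_hierarchy = {
    "politics": {},
    "sports": {"football": {}, "basketball": {}, "baseball": {}, "hockey": {}},
}

def levels(tree, depth=1):
    """Map each cluster name in a nested dict to its depth in the hierarchy."""
    result = {}
    for name, children in tree.items():
        result[name] = depth
        result.update(levels(children, depth + 1))
    return result

topic_levels = levels(topic_hierarchy)
print(topic_levels)
```

Here "sports" and "politics" sit at level 1, while the four sports subtopics sit at level 2.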

Separation of clusters: Some methods partition data objects into mutually exclusive clusters. When clustering customers into groups so that each group is taken care of by one manager, each customer may belong to only one group. In some other situations, the clusters may not be exclusive, that is, a data object may belong to more than one cluster. For example, when clustering documents into topics, a document may be related to multiple topics. Thus, the topics as clusters may not be exclusive.
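
A minimal sketch of the distinction, assuming clusters are stored as a mapping from cluster name to member objects: a clustering is exclusive when no object appears under more than one name. The customer and document identifiers are hypothetical.

```python
def is_exclusive(clusters):
    """Return True if no object belongs to more than one cluster."""
    seen = set()
    for members in clusters.values():
        # set() ignores accidental repeats inside a single cluster.
        for obj in set(members):
            if obj in seen:
                return False
            seen.add(obj)
    return True

# Each customer is handled by exactly one manager.
customers_by_manager = {
    "manager_a": ["customer_1", "customer_2"],
    "manager_b": ["customer_3"],
}

# A document may relate to several topics at once.
documents_by_topic = {
    "politics": ["doc_1", "doc_2"],
    "sports": ["doc_2", "doc_3"],
}

print(is_exclusive(customers_by_manager), is_exclusive(documents_by_topic))
```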

Similarity measure: Some methods determine the similarity between two objects by the distance between them. Such a distance can be defined on Euclidean space,
