Cluster Analysis: Basic Concepts and Methods - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

The following are typical requirements of clustering in data mining.

Scalability : Many clustering algorithms work well on small data sets containing fewer

than several hundred data objects; however, a large database may contain millions or

even billions of objects, particularly in Web search scenarios. Clustering on only a

sample of a given large data set may lead to biased results. Therefore, highly scalable

clustering algorithms are needed.

Ability to deal with different types of attributes : Many algorithms are designed to

cluster numeric (interval-based) data. However, applications may require clustering

other data types, such as binary, nominal (categorical), and ordinal data, or mixtures

of these data types. Recently, more and more applications need clustering techniques

for complex data types such as graphs, sequences, images, and documents.

Discovery of clusters with arbitrary shape : Many clustering algorithms determine

clusters based on Euclidean or Manhattan distance measures (Chapter 2). Algorithms

based on such distance measures tend to find spherical clusters with similar size and

density. However, a cluster could be of any shape. Consider sensors, for example,

which are often deployed for environment surveillance. Cluster analysis on sensor

readings can detect interesting phenomena. We may want to use clustering to find

the frontier of a running forest fire, which is often not spherical. It is important to

develop algorithms that can detect clusters of arbitrary shape.

Requirements for domain knowledge to determine input parameters : Many clus-

tering algorithms require users to provide domain knowledge in the form of input

parameters such as the desired number of clusters. Consequently, the clustering

results may be sensitive to such parameters. Parameters are often hard to determine,

especially for high-dimensionality data sets and where users have yet to grasp a deep

understanding of their data. Requiring the specification of domain knowledge not

only burdens users, but also makes the quality of clustering difficult to control.

Ability to deal with noisy data : Most real-world data sets contain outliers and/or

missing, unknown, or erroneous data. Sensor readings, for example, are often

noisy—some readings may be inaccurate due to the sensing mechanisms, and some

readings may be erroneous due to interferences from surrounding transient objects.

Clustering algorithms can be sensitive to such noise and may produce poor-quality

clusters. Therefore, we need clustering methods that are robust to noise.

Incremental clustering and insensitivity to input order : In many applications,

incremental updates (representing newer data) may arrive at any time. Some clus-

tering algorithms cannot incorporate incremental updates into existing clustering

structures and, instead, have to recompute a new clustering from scratch. Cluster-

ing algorithms may also be sensitive to the input data order. That is, given a set

of data objects, clustering algorithms may return dramatically different clusterings

depending on the order in which the objects are presented. Incremental clustering

algorithms and algorithms that are insensitive to the input order are needed.

Search WWH ::

Custom Search

Home