Databases Reference
In-Depth Information
The following are typical requirements of clustering in data mining.
Scalability : Many clustering algorithms work well on small data sets containing fewer
than several hundred data objects; however, a large database may contain millions or
even billions of objects, particularly in Web search scenarios. Clustering on only a
sample of a given large data set may lead to biased results. Therefore, highly scalable
clustering algorithms are needed.
Ability to deal with different types of attributes : Many algorithms are designed to
cluster numeric (interval-based) data. However, applications may require clustering
other data types, such as binary, nominal (categorical), and ordinal data, or mixtures
of these data types. Recently, more and more applications need clustering techniques
for complex data types such as graphs, sequences, images, and documents.
Discovery of clusters with arbitrary shape : Many clustering algorithms determine
clusters based on Euclidean or Manhattan distance measures (Chapter 2). Algorithms
based on such distance measures tend to find spherical clusters with similar size and
density. However, a cluster could be of any shape. Consider sensors, for example,
which are often deployed for environment surveillance. Cluster analysis on sensor
readings can detect interesting phenomena. We may want to use clustering to find
the frontier of a running forest fire, which is often not spherical. It is important to
develop algorithms that can detect clusters of arbitrary shape.
Requirements for domain knowledge to determine input parameters : Many clus-
tering algorithms require users to provide domain knowledge in the form of input
parameters such as the desired number of clusters. Consequently, the clustering
results may be sensitive to such parameters. Parameters are often hard to determine,
especially for high-dimensionality data sets and where users have yet to grasp a deep
understanding of their data. Requiring the specification of domain knowledge not
only burdens users, but also makes the quality of clustering difficult to control.
Ability to deal with noisy data : Most real-world data sets contain outliers and/or
missing, unknown, or erroneous data. Sensor readings, for example, are often
noisy—some readings may be inaccurate due to the sensing mechanisms, and some
readings may be erroneous due to interferences from surrounding transient objects.
Clustering algorithms can be sensitive to such noise and may produce poor-quality
clusters. Therefore, we need clustering methods that are robust to noise.
Incremental clustering and insensitivity to input order : In many applications,
incremental updates (representing newer data) may arrive at any time. Some clus-
tering algorithms cannot incorporate incremental updates into existing clustering
structures and, instead, have to recompute a new clustering from scratch. Cluster-
ing algorithms may also be sensitive to the input data order. That is, given a set
of data objects, clustering algorithms may return dramatically different clusterings
depending on the order in which the objects are presented. Incremental clustering
algorithms and algorithms that are insensitive to the input order are needed.
 
Search WWH ::




Custom Search