Database Reference
In-Depth Information
7
Clustering
Clustering is the process of examining a collection of “points,” and grouping the points into
“clusters” according to some distance measure. The goal is that points in the same cluster
have a small distance from one another, while points in different clusters are at a large dis-
tance from one another. A suggestion of what clusters might look like was seen in Fig. 1.1 .
However, there the intent was that there were three clusters around three different road inter-
sections, but two of the clusters blended into one another because they were not sufficiently
separated.
Our goal in this chapter is to offer methods for discovering clusters in data. We are par-
ticularly interested in situations where the data is very large, and/or where the space either
is high-dimensional, or the space is not Euclidean at all. We shall therefore discuss several
algorithms that assume the data does not fit in main memory. However, we begin with the
basics: the two general approaches to clustering and the methods for dealing with clusters in
a non-Euclidean space.
7.1 Introduction to Clustering Techniques
We begin by reviewing the notions of distance measures and spaces. The two major ap-
proaches to clustering - hierarchical and point-assignment - are defined. We then turn to
a discussion of the “curse of dimensionality,” which makes clustering in high-dimensional
spaces difficult, but also, as we shall see, enables some simplifications if used correctly in a
clustering algorithm.
7.1.1
Points, Spaces, and Distances
A dataset suitable for clustering is a collection of points , which are objects belonging to
some space . In its most general sense, a space is just a universal set of points, from which
the points in the dataset are drawn. However, we should be mindful of the common case of
a Euclidean space (see Section 3.5.2 ), which has a number of important properties useful for
Search WWH ::




Custom Search