Clustering - Mining of Massive Datasets

7 Clustering

Clustering is the process of examining a collection of “points” and grouping the points into “clusters” according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another. Fig. 1.1 suggested what clusters might look like. There, the intent was to show three clusters around three different road intersections, but two of the clusters blended into one another because they were not sufficiently separated.
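To make the goal concrete, here is a minimal sketch that compares the average distance within a cluster to the average distance between two clusters. It assumes the points live in a two-dimensional Euclidean space; the sample points and the function names (`euclidean`, `mean_intra_distance`, `mean_inter_distance`) are invented for illustration and do not come from the text.

```python
import math
from itertools import combinations, product


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_intra_distance(cluster):
    """Average distance over all unordered pairs of points in one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(cluster_1, cluster_2):
    """Average distance over all pairs with one point from each cluster."""
    pairs = list(product(cluster_1, cluster_2))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


# Hypothetical data: two tight groups placed far apart.
cluster_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
cluster_b = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]

intra_a = mean_intra_distance(cluster_a)
intra_b = mean_intra_distance(cluster_b)
inter_ab = mean_inter_distance(cluster_a, cluster_b)

print(f"mean distance within A:      {intra_a:.3f}")
print(f"mean distance within B:      {intra_b:.3f}")
print(f"mean distance between A, B:  {inter_ab:.3f}")
```

For well-separated groups like these, the within-cluster averages come out much smaller than the between-cluster average. When two clusters blend into one another, as described above, the two quantities move closer together.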

Our goal in this chapter is to offer methods for discovering clusters in data. We are particularly interested in situations where the data is very large, where the space is high-dimensional or not Euclidean at all, or both. We shall therefore discuss several algorithms that assume the data does not fit in main memory. However, we begin with the basics: the two general approaches to clustering and the methods for dealing with clusters in a non-Euclidean space.

7.1 Introduction to Clustering Techniques

We begin by reviewing the notions of distance measures and spaces, and then define the two major approaches to clustering: hierarchical and point-assignment. We then turn to a discussion of the “curse of dimensionality,” which makes clustering in high-dimensional spaces difficult, but also, as we shall see, enables some simplifications if used correctly in a clustering algorithm.

7.1.1 Points, Spaces, and Distances

A dataset suitable for clustering is a collection of points, which are objects belonging to some space. In its most general sense, a space is just a universal set of points, from which the points in the dataset are drawn. However, we should be mindful of the common case of a Euclidean space (see Section 3.5.2), which has a number of important properties useful for
