Clustering - Mining of Massive Datasets

Databases Reference

In-Depth Information

Chapter 7

Clustering

Clustering is the process of examining a collection of “points,” and grouping

the points into “clusters” according to some distance measure. The goal is that

points in the same cluster have a small distance from one another, while points

in different clusters are at a large distance from one another. A suggestion of

what clusters might look like was seen in Fig. 1.1. However, there the intent

was that there were three clusters around three different road intersections, but

two of the clusters blended into one another because they were not su ciently

separated.

Our goal in this chapter is to offer methods for discovering clusters in data.

We are particularly interested in situations where the data is very large, and/or

where the space either is high-dimensional, or the space is not Euclidean at

all. We shall therefore discuss several algorithms that assume the data does

not fit in main memory. However, we begin with the basics: the two general

approaches to clustering and the methods for dealing with clusters in a non-

Euclidean space.

7.1

Introduction to Clustering Techniques

We begin by reviewing the notions of distance measures and spaces. The two

major approaches to clustering - hierarchical and agglomerative - are defined.

We then turn to a discussion of the “curse of dimensionality,” which makes

clustering in high-dimensional spaces di cult, but also, as we shall see, enables

some simplifications if used correctly in a clustering algorithm.

7.1.1 Points, Spaces, and Distances

A dataset suitable for clustering is a collection of points, which are objects

belonging to some space. In its most general sense, a space is just a universal

set of points, from which the points in the dataset are drawn. However, we

should be mindful of the common case of a Euclidean space (see Section 3.5.2),

221

Search WWH ::

Custom Search

Home