filtering steps. Other specific pre-processing steps are also occasionally performed depending on the document type: for example, headers and email signatures are removed from newsgroup articles, and HTML tags are removed from webpages, as in the sketch below.
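As a rough illustration, such type-specific cleanup might look like the following sketch (the helper names and regular expressions are ours and purely illustrative; a production pipeline would use a real HTML parser):

```python
import re

def strip_html(text: str) -> str:
    # Crude tag removal for webpage documents.
    return re.sub(r"<[^>]+>", " ", text)

def strip_newsgroup_noise(text: str) -> str:
    # Drop common header lines and everything after the conventional
    # "-- " signature delimiter in a newsgroup article.
    body = []
    for line in text.splitlines():
        if line == "-- ":
            break
        if re.match(r"^(From|Subject|Date|Organization):", line):
            continue
        body.append(line)
    return "\n".join(body)
```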
7.3.2 Distance Measures
High-dimensional spaces like text have good directional properties, which has made directional distance measures like cosine distance (1 - cosine similarity) between the vector representations of text data a popular measure of distance in the information retrieval community (1). Other distance measures, e.g., probabilistic document overlap (26), have also been used successfully for text clustering. Some practitioners use SquaredEuclidean distance for text clustering, after all data instance vectors have been normalized to have unit length according to the $L_2$ norm. This normalization makes the SquaredEuclidean distance between two instances proportional to the cosine distance between them, as illustrated by the following relation:
$$\text{SquaredEuclideanDist}(x_1, x_2) = \|x_1 - x_2\|^2 = \|x_1\|^2 + \|x_2\|^2 - 2\,x_1^\top x_2 = 2(1 - x_1^\top x_2) = 2\,\text{CosineDist}(x_1, x_2),$$
since $\|x_i\| = 1\ \forall i$. This prior normalization of the instances is crucial for subsequent clustering algorithms to group text documents based on their content words and produce good-quality clusters; without it, clustering text using SquaredEuclidean distance can result in poor quality (25).
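This relation is easy to check numerically; the following sketch (variable names are illustrative) verifies it for two random vectors normalized to unit $L_2$ norm:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.random(5), rng.random(5)

# Normalize both vectors to unit L2 norm.
x1 /= np.linalg.norm(x1)
x2 /= np.linalg.norm(x2)

sq_euclidean = np.sum((x1 - x2) ** 2)
cosine_dist = 1.0 - np.dot(x1, x2)  # for unit vectors, cosine similarity is the dot product

# Squared Euclidean distance equals twice the cosine distance.
assert np.isclose(sq_euclidean, 2.0 * cosine_dist)
```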
Spherical KMeans (SP-KMeans) is a version of KMeans (outlined in the next section) that uses cosine distance as its underlying distance metric. In the SP-KMeans algorithm, standard Euclidean KMeans is applied to data vectors $\{x_i\}_{i=1}^{N}$ that have been normalized to have unit $L_2$ norm, so that the data instances lie on a unit sphere (21). Note that there is one additional step in SP-KMeans: in the cluster re-estimation step, the centroid vectors $\{\mu_j\}_{j=1}^{k}$ are also constrained to lie on the unit sphere. This is the main difference between SP-KMeans and Euclidean KMeans on $L_2$-normalized document vectors. The SP-KMeans clustering problem can be equivalently formulated as that of maximizing the objective function:
$$J_{\text{sp-kmeans}} = \sum_{j=1}^{k} \sum_{x_i \in X_j} x_i^\top \mu_j, \qquad (7.1)$$
where the centroid $\mu_j$ of the $j$th cluster is the mean of all the instances in that cluster, normalized to have unit $L_2$ norm. The SP-KMeans algorithm gives a local maximum of this objective function.
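To make the procedure concrete, here is a minimal NumPy sketch of the SP-KMeans iteration (the function and variable names are ours, not a reference implementation from the chapter; it omits convergence checks and empty-cluster handling):

```python
import numpy as np

def sp_kmeans(X, k, n_iter=50, seed=0):
    """Spherical KMeans on the rows of X, which must be unit L2-normalized."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct (already unit-norm) data points.
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: since rows and centroids are all unit vectors,
        # maximizing cosine similarity reduces to maximizing a dot product.
        labels = np.argmax(X @ mu.T, axis=1)
        # Re-estimation step: take each cluster mean, then project it back
        # onto the unit sphere -- the extra step that distinguishes SP-KMeans.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                mean = members.mean(axis=0)
                mu[j] = mean / np.linalg.norm(mean)
    return labels, mu
```

Neither step can decrease the objective in (7.1), which is why the procedure converges to a local maximum as stated above.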
In all the algorithms in this chapter that use SquaredEuclidean distance, the data have been pre-normalized to have unit $L_2$ norm. In practice, KMeans and SP-KMeans clustering of text vectors is performed efficiently by using sparse representations of the document vectors.
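As one concrete illustration of such a setup (the choice of scikit-learn is our assumption; the chapter does not prescribe a library), a TF-IDF vectorizer produces exactly this kind of sparse, unit-$L_2$-normalized representation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "sparse vectors keep text clustering fast",
]

# norm="l2" (the default) normalizes each row to unit L2 norm, so the
# resulting sparse matrix can be fed directly to KMeans or SP-KMeans.
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform(docs)  # scipy.sparse matrix, mostly zeros
```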