filtering steps. Other specific pre-processing steps are also occasionally performed depending on the document type: for example, headers and email signatures are removed from newsgroup articles, and HTML tags are removed from webpages, as in the sketch below.
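As a rough illustration, such type-specific cleanup might look like the following sketch (the helper names and regular expressions are ours and purely illustrative; a production pipeline would use a real HTML parser):

```python
import re

def strip_html(text: str) -> str:
    # Crude tag removal for webpage documents.
    return re.sub(r"<[^>]+>", " ", text)

def strip_newsgroup_noise(text: str) -> str:
    # Drop common header lines and everything after the conventional
    # "-- " signature delimiter in a newsgroup article.
    body = []
    for line in text.splitlines():
        if line == "-- ":
            break
        if re.match(r"^(From|Subject|Date|Organization):", line):
            continue
        body.append(line)
    return "\n".join(body)
```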
7.3.2 Distance Measures
High-dimensional spaces like text have good directional properties, which has made directional distance measures like cosine distance (1 - cosine similarity) between the vector representations of text data a popular measure of distance in the information retrieval community (1). Other distance measures, e.g., probabilistic document overlap (26), have also been used successfully for text clustering. Some practitioners use SquaredEuclidean distance for text clustering, after all data instance vectors have been normalized to have unit length according to the $L_2$ norm. This normalization makes the SquaredEuclidean distance between two instances proportional to the cosine distance between them, as illustrated by the following relation:
$$\text{SquaredEuclideanDist}(x_1, x_2) = \|x_1 - x_2\|^2 = \|x_1\|^2 + \|x_2\|^2 - 2\,x_1^\top x_2 = 2(1 - x_1^\top x_2) = 2\,\text{CosineDist}(x_1, x_2),$$
since $\|x_i\| = 1\ \forall i$. This prior normalization of the instances is crucial for subsequent clustering algorithms to group text documents based on their content words and produce good-quality clusters; without it, clustering text using SquaredEuclidean distance can result in poor quality (25).
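This relation is easy to check numerically; the following sketch (variable names are illustrative) verifies it for two random vectors normalized to unit $L_2$ norm:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.random(5), rng.random(5)

# Normalize both vectors to unit L2 norm.
x1 /= np.linalg.norm(x1)
x2 /= np.linalg.norm(x2)

sq_euclidean = np.sum((x1 - x2) ** 2)
cosine_dist = 1.0 - np.dot(x1, x2)  # for unit vectors, cosine similarity is the dot product

# Squared Euclidean distance equals twice the cosine distance.
assert np.isclose(sq_euclidean, 2.0 * cosine_dist)
```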
Spherical KMeans (SP-KMeans) is a version of KMeans (outlined in the next section) that uses cosine distance as its underlying distance metric. In the SP-KMeans algorithm, standard Euclidean KMeans is applied to data vectors $\{x_i\}_{i=1}^{N}$ that have been normalized to have unit $L_2$ norm, so that the data instances lie on a unit sphere (21). Note that there is one additional step in SP-KMeans: in the cluster re-estimation step, the centroid vectors $\{\mu_j\}_{j=1}^{k}$ are also constrained to lie on the unit sphere. This is the main difference between SP-KMeans and Euclidean KMeans on $L_2$-normalized document vectors. The SP-KMeans clustering problem can be equivalently formulated as that of maximizing the objective function:
$$J_{\text{sp-kmeans}} = \sum_{j=1}^{k} \sum_{x_i \in X_j} x_i^\top \mu_j, \qquad (7.1)$$
where the centroid $\mu_j$ of the $j$th cluster is the mean of all the instances in that cluster, normalized to have unit $L_2$ norm. The SP-KMeans algorithm gives a local maximum of this objective function.
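To make the procedure concrete, here is a minimal NumPy sketch of the SP-KMeans iteration (the function and variable names are ours, not a reference implementation from the chapter; it omits convergence checks and empty-cluster handling):

```python
import numpy as np

def sp_kmeans(X, k, n_iter=50, seed=0):
    """Spherical KMeans on the rows of X, which must be unit L2-normalized."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct (already unit-norm) data points.
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: since rows and centroids are all unit vectors,
        # maximizing cosine similarity reduces to maximizing a dot product.
        labels = np.argmax(X @ mu.T, axis=1)
        # Re-estimation step: take each cluster mean, then project it back
        # onto the unit sphere -- the extra step that distinguishes SP-KMeans.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                mean = members.mean(axis=0)
                mu[j] = mean / np.linalg.norm(mean)
    return labels, mu
```

Neither step can decrease the objective in (7.1), which is why the procedure converges to a local maximum as stated above.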
In all the algorithms in this chapter that use SquaredEuclidean distance, the data have been pre-normalized to have unit $L_2$ norm. In practice, KMeans and SP-KMeans clustering of text vectors is performed efficiently by using sparse representations of the document vectors.
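As one concrete illustration of such a setup (the choice of scikit-learn is our assumption; the chapter does not prescribe a library), a TF-IDF vectorizer produces exactly this kind of sparse, unit-$L_2$-normalized representation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "sparse vectors keep text clustering fast",
]

# norm="l2" (the default) normalizes each row to unit L2 norm, so the
# resulting sparse matrix can be fed directly to KMeans or SP-KMeans.
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform(docs)  # scipy.sparse matrix, mostly zeros
```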