Fig. 3.1. The relationship-based clustering framework.
issues and briefly discuss several strategies to scale for large data sets. Section 3.7 summarizes related work in clustering, graph partitioning, and visualization.
3.2 Domain-Specific Features and Similarity Space
Notation. Let $n$ be the number of objects/samples/points (e.g., customers, documents, Web sessions) in the data and $d$ the number of features (e.g., products, words, Web pages) for each sample $\mathbf{x}_j$, with $j \in \{1,\dots,n\}$. Let $k$ be the desired number of clusters. The input data can be represented by a $d \times n$ data matrix $\mathbf{X}$, with the $j$th column vector representing the sample $\mathbf{x}_j$; $\mathbf{x}_j^{T}$ denotes the transpose of $\mathbf{x}_j$.
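As a concrete illustration of this notation, here is a minimal NumPy sketch (the array values are made up for illustration): the $n$ samples are stored as the columns of a $d \times n$ matrix.

```python
import numpy as np

d, n = 3, 4                              # number of features, number of samples

# A d x n data matrix X: the jth COLUMN is the sample x_j.
X = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [2.0, 2.0, 0.0, 1.0]])

x_2 = X[:, [1]]                          # the sample x_j for j = 2 (0-based column 1)
x_2_T = x_2.T                            # its transpose: a 1 x d row vector

print(X.shape, x_2.shape, x_2_T.shape)   # → (3, 4) (3, 1) (1, 3)
```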
Hard clustering assigns a label $\lambda_j \in \{1,\dots,k\}$ to each $d$-dimensional sample $\mathbf{x}_j$ such that similar samples get the same label. In general the labels are treated as nominals with no inherent order, though in some cases, such as 1-dimensional SOMs or top-down recursive bisection approaches, including our proposed method, the labeling contains extra ordering information. Let $\mathcal{C}_\ell$ denote the set of all objects in the $\ell$th cluster ($\ell \in \{1,\dots,k\}$), with $\mathbf{x}_j \in \mathcal{C}_\ell \Leftrightarrow \lambda_j = \ell$ and $n_\ell = |\mathcal{C}_\ell|$.

Figure 3.1 gives an overview of our relationship-based clustering process from a set of raw object descriptions $\mathcal{X}$ (residing in input space $\mathcal{I}$) via the vector space description $\mathbf{X}$ (in feature space $\mathcal{F}$) and relationship description $\mathbf{S}$ (in similarity space $\mathcal{S}$) to the cluster labels $\lambda$ (in output space $\mathcal{O}$):

$$(\mathcal{X} \in \mathcal{I}^n) \xrightarrow{\Upsilon} (\mathbf{X} \in \mathcal{F}^n \subset \mathbb{R}^{d \times n}) \xrightarrow{\Psi} (\mathbf{S} \in \mathcal{S}^{n \times n} = [0,1]^{n \times n} \subset \mathbb{R}^{n \times n}) \xrightarrow{\Phi} (\lambda \in \mathcal{O}^n = \{1,\dots,k\}^n).$$

For example, in Web page clustering,
$\mathcal{X}$ is a collection of $n$ Web pages $x_j$. Extracting features using $\Upsilon$ yields $\mathbf{X}$, the term frequencies of stemmed words, normalized such that $\|\mathbf{x}_j\|_2 = 1$ for all documents $\mathbf{x}_j$ with $j \in \{1,\dots,n\}$. Similarities are computed using, e.g., cosine-based similarity $\Psi$, yielding the $n \times n$ similarity matrix $\mathbf{S}$. Finally, the cluster label vector $\lambda$ is computed using a clustering function $\Phi$, such as graph partitioning. In short,
the basic process can be denoted as $\mathcal{X} \xrightarrow{\Upsilon} \mathbf{X} \xrightarrow{\Psi} \mathbf{S} \xrightarrow{\Phi} \lambda$.
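The three-step process can be sketched end to end in a few lines of NumPy. This is a toy illustration, not the chapter's implementation: the corpus, the similarity threshold, and the helper `phi` are made up, and a naive single-link merge on a thresholded similarity graph stands in for the actual graph partitioner $\Phi$.

```python
import numpy as np

# Toy corpus: the raw object descriptions (the set "script X" in the text).
docs = ["web data mining", "mining web data web",
        "graph partition", "partition graph graph"]

# Upsilon: map raw documents to term-frequency vectors, then L2-normalize
# each column so that ||x_j||_2 = 1, giving the d x n feature matrix X.
vocab = sorted({w for doc in docs for w in doc.split()})
X = np.array([[doc.split().count(w) for doc in docs] for w in vocab],
             dtype=float)
X /= np.linalg.norm(X, axis=0)            # columns are the unit-length samples x_j

# Psi: cosine similarity. Since the columns are unit-length and nonnegative,
# S = X^T X is the n x n similarity matrix with entries in [0, 1].
S = X.T @ X

# Phi: a clustering of S yields the label vector lambda. This toy stand-in
# merges objects whose pairwise similarity exceeds a threshold.
def phi(S, threshold=0.5):
    n = S.shape[0]
    labels = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if S[i, j] > threshold:       # join the two components
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    # relabel as 1..k to match the chapter's convention
    remap = {c: i + 1 for i, c in enumerate(dict.fromkeys(labels))}
    return [remap[l] for l in labels]

lam = phi(S)
print(lam)  # → [1, 1, 2, 2]
```

The two word-overlapping document pairs land in the same clusters, so $k = 2$ and $n_1 = n_2 = 2$ here.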
Similarity Measures. In this chapter, we work in similarity space rather than the original vector space in which the feature vectors reside. This approach is particularly useful when there are many fewer objects than the