Fig. 3.1. The relationship-based clustering framework.
issues and briefly discuss several strategies to scale to large data sets. Section 3.7 summarizes related work in clustering, graph partitioning, and visualization.
3.2 Domain-Specific Features and Similarity Space
Notation. Let n be the number of objects/samples/points (e.g., customers, documents, Web sessions) in the data and d the number of features (e.g., products, words, Web pages) for each sample x_j with j ∈ {1, ..., n}. Let k be the desired number of clusters. The input data can be represented by a d × n data matrix X with the jth column vector representing the sample x_j; x_j′ denotes the transpose of x_j. Hard clustering assigns a label λ_j ∈ {1, ..., k} to each d-dimensional sample x_j such that similar samples get the same label. In general the labels are treated as nominals with no inherent order, though in some cases, such as 1-dimensional SOMs or any top-down recursive bisection approach (including our proposed method), the labeling contains extra ordering information. Let C_ℓ denote the set of all objects in the ℓth cluster (ℓ ∈ {1, ..., k}), with x_j ∈ C_ℓ ⇔ λ_j = ℓ and n_ℓ = |C_ℓ|.

Figure 3.1 gives an overview of our relationship-based clustering process from a set of raw object descriptions X (residing in input space I) via the vector space description X (in feature space F) and relationship description S (in similarity space S) to the cluster labels λ (in output space O):

(X ∈ I^n) →Υ (X ∈ F^n = R^(d×n)) →Ψ (S ∈ S^(n×n) = [0, 1]^(n×n) ⊂ R^(n×n)) →Φ (λ ∈ O^n = {1, ..., k}^n).

For example, in Web page clustering, X is a collection of n Web pages x_j with j ∈ {1, ..., n}. Extracting features using Υ yields X, the term frequencies of stemmed words, normalized such that ‖x_j‖_2 = 1 for all documents x_j. Similarities are computed using, e.g., the cosine-based similarity Ψ, yielding the n × n similarity matrix S. Finally, the cluster label vector λ is computed using a clustering function Φ, such as graph partitioning. In short, the basic process can be denoted as X →Υ X →Ψ S →Φ λ.
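The X →Υ X →Ψ S →Φ λ pipeline can be sketched in a few lines of Python. The toy corpus, the similarity threshold, and the use of connected components as the clustering function Φ are illustrative assumptions only; the chapter's actual Φ is a graph-partitioning algorithm, and Υ would include stemming and stop-word handling in practice.

```python
from collections import Counter
import math

# Toy corpus standing in for the raw object descriptions X (input space I).
docs = [
    "web mining of web pages",
    "clustering web pages",
    "market basket customers buy products",
    "customers buy retail products",
]

def upsilon(doc):
    """Feature extraction Y: term-frequency vector, L2-normalized so ||x||_2 = 1."""
    counts = Counter(doc.split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def psi(x, y):
    """Cosine-based similarity P: dot product of unit-length vectors, in [0, 1]
    for non-negative term frequencies."""
    return sum(w * y.get(t, 0.0) for t, w in x.items())

X = [upsilon(d) for d in docs]                # feature space F
n = len(X)
S = [[psi(xi, xj) for xj in X] for xi in X]   # n x n similarity matrix

def phi(S, threshold=0.1):
    """Toy clustering F: connected components of the thresholded similarity
    graph -- a simple stand-in for graph partitioning."""
    n = len(S)
    labels = [0] * n  # 0 means "not yet assigned"
    k = 0
    for seed in range(n):
        if labels[seed]:
            continue
        k += 1
        stack = [seed]
        while stack:
            i = stack.pop()
            if labels[i]:
                continue
            labels[i] = k
            stack.extend(j for j in range(n)
                         if not labels[j] and S[i][j] > threshold)
    return labels

lam = phi(S)  # cluster label vector (lambda), one label in {1, ..., k} per object

# Cluster sets C_l and sizes n_l recovered from the label vector:
clusters = {l: [j for j, lab in enumerate(lam) if lab == l] for l in set(lam)}
```

On this corpus the two Web-page documents and the two retail documents share no terms across groups, so the thresholded similarity graph splits into two components and `lam` comes out as `[1, 1, 2, 2]`.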
Similarity Measures. In this chapter, we work in similarity space rather than the original vector space in which the feature vectors reside. This approach is particularly useful when there are many fewer objects than the