Database Reference
In-Depth Information
8.4 Clustering Tree
Clustering groups data instances into subsets in such a manner that similar
instances are grouped together, while different instances belong to different
groups. The instances are thereby organized into an ecient representation
that characterizes the population being sampled. Formally, the clustering
structure is represented as a set of subsets C = C 1 ,...,C k of S , such that:
S = i =1
= j . Consequently, any instance in S
belongs to exactly one and only one subset.
Clustering of objects is as ancient as the human need for describing
the salient characteristics of men and objects and identifying them with a
type. Therefore, it embraces various scientific disciplines: from mathematics
and statistics to biology and genetics, each of which uses different terms
to describe the topologies formed using this analysis. From biological
“taxonomies”, to medical “syndromes” and genetic “genotypes” to manu-
facturing “group technology” — the problem is identical: forming categories
of entities and assigning individuals to the proper groups within it. A clus-
tering tree is a decision tree where the leaves do not contain classes but each
node of a tree corresponds to a concept or a cluster, and the tree as a whole
thus represents a kind of taxonomy or a hierarchy. To induce clustering
trees, Blockeel et al . (1998) introduce the top-down induction of clustering
trees which adapt the basic top-down induction of decision trees method
towards clustering. To this end, a distance measure is used to compute the
distance between two examples. Specifically, in order to compute the dis-
tance between two sets of instances, they employ a function that induces a
prototype of a set instances. Thus, the distance between two set of instances
is calculated as the distance between their prototypes. Given a distance
measure for clusters and the view that each node of a tree corresponds to
a cluster, the decision tree algorithm selects in each node the test that will
maximize the distance between the resulting clusters in its subnodes. Each
node in the tree corresponds to a cluster of examples, and the hierarchical
structure of the tree shows the way that clusters are split into subclusters.
The test in a node can be seen as a discriminant description of the two
clusters in which the current cluster of examples is divided. One cluster is
described by saying that the test succeeds, the other by saying that it fails.
C i and C i
C j =
for i
8.4.1
Distance Measures
Since clustering is the grouping of similar instances/objects, some sort of
measure that can determine whether two objects are similar or dissimilar
Search WWH ::




Custom Search