Beyond Classification Tasks - Data Mining with Decision Trees: Theory and Applications

Database Reference

In-Depth Information

8.4 Clustering Tree

Clustering groups data instances into subsets in such a manner that similar

instances are grouped together, while different instances belong to different

groups. The instances are thereby organized into an ecient representation

that characterizes the population being sampled. Formally, the clustering

structure is represented as a set of subsets C = C 1 ,...,C k of S , such that:

S = i =1

= j . Consequently, any instance in S

belongs to exactly one and only one subset.

Clustering of objects is as ancient as the human need for describing

the salient characteristics of men and objects and identifying them with a

type. Therefore, it embraces various scientific disciplines: from mathematics

and statistics to biology and genetics, each of which uses different terms

to describe the topologies formed using this analysis. From biological

“taxonomies”, to medical “syndromes” and genetic “genotypes” to manu-

facturing “group technology” — the problem is identical: forming categories

of entities and assigning individuals to the proper groups within it. A clus-

tering tree is a decision tree where the leaves do not contain classes but each

node of a tree corresponds to a concept or a cluster, and the tree as a whole

thus represents a kind of taxonomy or a hierarchy. To induce clustering

trees, Blockeel et al . (1998) introduce the top-down induction of clustering

trees which adapt the basic top-down induction of decision trees method

towards clustering. To this end, a distance measure is used to compute the

distance between two examples. Specifically, in order to compute the dis-

tance between two sets of instances, they employ a function that induces a

prototype of a set instances. Thus, the distance between two set of instances

is calculated as the distance between their prototypes. Given a distance

measure for clusters and the view that each node of a tree corresponds to

a cluster, the decision tree algorithm selects in each node the test that will

maximize the distance between the resulting clusters in its subnodes. Each

node in the tree corresponds to a cluster of examples, and the hierarchical

structure of the tree shows the way that clusters are split into subclusters.

The test in a node can be seen as a discriminant description of the two

clusters in which the current cluster of examples is divided. One cluster is

described by saying that the test succeeds, the other by saying that it fails.

C i and C i ∩

C j =

∅

for i

8.4.1

Distance Measures

Since clustering is the grouping of similar instances/objects, some sort of

measure that can determine whether two objects are similar or dissimilar

Data Mining with Decision Trees: Theory and Applications

Search WWH ::

Custom Search

Home