Classification Tree Generation Constrained with Variable Weights - Foundations on Natural and Artificial Computation

Information Technology Reference

In-Depth Information

The number of such classification trees grows very fast with the number of

objects in the classification, making it crucial to have an ecient method for

eliminating uninteresting options.

In this article we propose a novel methodology for constraining the number of

classification trees through allowing different features to be weighted with some

slack with respect to an average. Our methodology, which can be flexibly applied

to multiple uses both for linguistics and for biology, drastically diminishes the

number of alternatives to be considered, down to an acceptable number, by

adopting a constraint programming approach. It may handle arbitrary sets of

features, and uses an implicit distance measure while allowing for an arbitrary

flexibility in the relative weights of the features. The purpose of this approach

is to work dually, by finding either: a) the trees that are possible given some

bounds on the weights, or b) the bounds on the weights given constraints on

possible trees. This is achieved either by specifying allowed trees or simply the

constraints that stipulate the tree topology and discovering whether there are

sets of weights that make the distances compatible with such topologies.

Section 2 briefly describes the state of the art. Section 3 describes the mo-

tivation for this paper, the main features of Constraint Programming and, in

subsection 3, our phylogenetic tree model; section 4 shows the results obtained

for a biological and a linguistic example, and section 5 discusses our results.

2 Background, State-of-the-Art

Phylogenetic trees aim to represent the evolutionary relationships between enti-

ties, either qualitatively, by grouping together those that are more closely related,

or quantitatively, with each branch length indicating the distance to a common

ancestor, and the path through the branches of the tree the distance between

any two entities.

There are essentially two different ways of calculating phylogenetic trees. Dis-

tance methods rely on distance values between all pairs of entities to organize

in the tree [3]. The tree is then calculated such that the sum of the branches

joining any pair equals their corresponding distance. Commonly used algorithms

to build phylogenetic trees based on distance measures between the nodes (al-

most always the leaves of the tree) are the Unweighted Pair Group Method with

Arithmetic Mean (UPGMA) and the neighbor-joining method [8].

Distance methods tend to be the first choice, as sequence comparisons allow

for distances to be pre-computed. However, for some applications, it is necessary

to use sets of features from which a distance measure cannot be immediately

inferred.

3 Our Proposed Methodology

We use a declarative, CLP, algorithm for building the trees by specifying the

constraints that stem from the evolutionary relationships that the tree must

represent. Each inner node bisects the set of entities that descend from the

Search WWH ::

Custom Search

Home