Classification Tree Generation Constrained with Variable Weights - Foundations on Natural and Artificial Computation

Information Technology Reference

In-Depth Information

Classification Tree Generation Constrained with

Variable Weights

Pedro Barahona 1 , Gemma Bel-Enguix 2 , Veronica Dahl 2 , 3 ,

M. Dolores Jimenez-Lopez 2 , and Ludwig Krippahl 1

1 Departamento de Informatica, Universidade Nova de Lisboa

{ pb,ludi } @di.fct.unl.pt

2 Research Group on Mathematical Linguistics, Universitat Rovira i Virgili

{ gemma.bel,mariadolores.jimenez } @urv.cat

3 Department of Computer Science, Simon Fraser University

veronica@cs.sfu.ca

Abstract. Trees are a useful framework for classifying entities whose at-

tributes are, at least partially, related through a common ancestry, such

as species of organisms, family members or languages. In some com-

mon applications, such as phylogenetic trees based on DNA sequences,

relatedness can be inferred from the statistical analysis of unweighted at-

tributes. In this paper we present a Constraint Programming approach

that can enforce consistency between bounds on the relative weight of

each trait and tree topologies, so that the user can best determine which

sets of traits to use and how the entities are likely to be related.

1

Introduction

Classification trees are useful methods of research in many areas, including biol-

ogy and linguistics, even though the characteristics of the problems may differ.

In biology, for example, distances between genes are clearly identifiable, and they

very strongly correlate with phylogeny, since most mutations are evolutionary

neutral [5]. This allows phylogenetic trees to be built using hierarchical cluster-

ing methods with fixed distances [3]. However in linguistics, distance measures

based on arbitrary sets of features may not correlate with phylogeny because

many differences are not random.

Even in biology, in some cases it is not possible to measure the genetic infor-

mation, such as when studying fossils, and one must rely on anatomical features

which may not correlate with phylogeny as well as the mostly random inherited

mutations in ADN. For these cases, the assumption that all observable features

have the same importance in determining phylogeny is likely not to hold. Dis-

tance between features can therefore vary according to the relevance that is

assigned to each feature, leading to different classification trees being generated

by hierarchical clustering.

This paper has been supported by the project HP2008-0029.

Search WWH ::

Custom Search

Home