Information Technology Reference
In-Depth Information
Classification Tree Generation Constrained with
Variable Weights
Pedro Barahona 1 , Gemma Bel-Enguix 2 , Veronica Dahl 2 , 3 ,
M. Dolores Jimenez-Lopez 2 , and Ludwig Krippahl 1
1 Departamento de Informatica, Universidade Nova de Lisboa
{ pb,ludi } @di.fct.unl.pt
2 Research Group on Mathematical Linguistics, Universitat Rovira i Virgili
{ gemma.bel,mariadolores.jimenez } @urv.cat
3 Department of Computer Science, Simon Fraser University
veronica@cs.sfu.ca
Abstract. Trees are a useful framework for classifying entities whose at-
tributes are, at least partially, related through a common ancestry, such
as species of organisms, family members or languages. In some com-
mon applications, such as phylogenetic trees based on DNA sequences,
relatedness can be inferred from the statistical analysis of unweighted at-
tributes. In this paper we present a Constraint Programming approach
that can enforce consistency between bounds on the relative weight of
each trait and tree topologies, so that the user can best determine which
sets of traits to use and how the entities are likely to be related.
1
Introduction
Classification trees are useful methods of research in many areas, including biol-
ogy and linguistics, even though the characteristics of the problems may differ.
In biology, for example, distances between genes are clearly identifiable, and they
very strongly correlate with phylogeny, since most mutations are evolutionary
neutral [5]. This allows phylogenetic trees to be built using hierarchical cluster-
ing methods with fixed distances [3]. However in linguistics, distance measures
based on arbitrary sets of features may not correlate with phylogeny because
many differences are not random.
Even in biology, in some cases it is not possible to measure the genetic infor-
mation, such as when studying fossils, and one must rely on anatomical features
which may not correlate with phylogeny as well as the mostly random inherited
mutations in ADN. For these cases, the assumption that all observable features
have the same importance in determining phylogeny is likely not to hold. Dis-
tance between features can therefore vary according to the relevance that is
assigned to each feature, leading to different classification trees being generated
by hierarchical clustering.
This paper has been supported by the project HP2008-0029.
 
Search WWH ::




Custom Search