Information Technology Reference
In-Depth Information
ancestor represented in that node. From that node there are two branches, each
one the origin of a sub-tree leading to a set of leaf entities. Let A and B be those
sets. Evolutionary relations require that all pairs of entities in A have a closer
kinship than any pair formed by one entity from A and one entity from B .This
constraint in the evolutionary relations of these groups is what the branching
represents. More formally:
a i ,a j
A,
b k
B, dist ( a i ,a j ) <dist ( a i ,b k )
(1)
This results in a top-down computation of the phylogenetic tree, requiring, at
each inner node, the calculation of a bisection of the set of entities in that sub-
tree. Since there are 2 n bisections of any set of n elements, this is potentially
an exponential problem unless there are some ways of reducing the number of
possibilities. This approach has some similarity to top down induction of decision
trees [7] or clustering trees [1], but our focus is not so much on the generation
of the tree itself, or the eciency of that step, but rather on the flexibility
of the distance measures. Thus, for this purpose, we can consider the actual
implementation details of the tree computation may be adapted according to
the size and complexity of the problems to solve.
The distance between two entities is a linear combination of the modulus of the
difference of the feature values. However, the weights are not instantiated, but
have an initially specified domain that can be constrained by the tree topology.
Considering S to be the set of all entities to organize in the phylogenetic tree,
the distance between any to entities is defined as
a
∀s i ,s j
S, dist ( s i ,s j )=
w k ×|
f ( s i ,k )
f ( s j ,k )
|
(2)
k =1
where f ( s i ,k ) is the attribute of index k for the entity of index i (respectively
for f ( s j ,k )) and w k is the weight of attribute k . The attribute weights are also
constrained to be greater or equal to zero and to add up to a constant value,
which we can assume here to be the unit value (although the actual value is
arbitrary). Thus:
w k
W, w k
0
(3)
a
w k =1
(4)
k =1
Moreover, if all features are normalised so that the maximum difference between
any two values of the same feature is 1, then the maximum distance between two
leaves is 1 (assuming the sum of the weights to be 1, the value that we have set).
For our proof-of-concept exercises, we consider the additional constraint that
the weight of any attribute must fall within a preset interval, that the user
can define. If the lower bound is zero, then that attribute can be ignored. The
particular case where the weights are restricted to the 1 /N , with N being the
number of weights used, results in the computation of an unweighted tree, if one
such tree is possible.
 
Search WWH ::




Custom Search