Under the assumption that a phylogenetic tree describes evolution via a
continuous-time Markov process, the main problem of phylogenetic tree
reconstruction is to find a tree when only the character sequences, such as DNA
or protein sequences, are given. In the lingo of trees and Markov processes, one
assumes that the only sequence data observed is at the tips, i.e., at the leaves,
while other information
on the phylogenetic tree, the particulars of the substitution events, and edge lengths
are missing. In a distance-based approach, distances between sequences (i.e., what
should be sums of edge lengths on the path in the phylogenetic tree between the nodes
for the sequences) are derived directly from the sequences by using the evolutionary
model to compute the most likely distance between each pair of sequences. This is
formalized in mathematical and statistical terms by so-called maximum likelihood
estimates. 1 Computing these estimates will not be our focus; using them will be.
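As a minimal illustration of how a pairwise distance can be derived from sequences under an evolutionary model, the sketch below assumes the Jukes-Cantor (JC69) model, which the text above does not single out. Under that assumption the maximum likelihood distance has the closed form d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of differing sites. The function names jc69_distance and distance_matrix are illustrative and not taken from any library.

```python
import math


def jc69_distance(seq_a: str, seq_b: str) -> float:
    """Maximum likelihood distance between two aligned DNA sequences,
    assuming the Jukes-Cantor (JC69) model (an assumption of this sketch)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where both sequences show an unambiguous nucleotide.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    # p is the observed proportion of differing sites.
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p == 0:
        return 0.0
    if p >= 0.75:
        # Saturation: the JC69 estimate is undefined (infinite) here.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs):
    """Symmetric matrix of all pairwise JC69 distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = jc69_distance(seqs[i], seqs[j])
    return d


example = distance_matrix(["ACGT", "ACGA", "ACGT"])
```

Other substitution models lead to different estimators, often without a closed form; the point here is only that each entry of the matrix depends on a single pair of sequences.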
Pairwise distances, along with a distance-based method for reconstructing
phylogenetic trees from the set of all pairwise distances, can be used to
reconstruct a particular phylogenetic tree that relates the sequences. Usually
the set of all pairwise distances computed from an alignment, collectively often
called a distance matrix (or an example of a dissimilarity map), does not give a
tree metric, which is a distance matrix that can be realized as path lengths in a
phylogenetic tree. Thus a distance-based method tries to find a tree metric
closest to the given set of pairwise distances computed from the alignment
under some criterion.
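One standard way to test whether a given distance matrix is a tree metric is the four-point condition: for every choice of four taxa i, j, k, l (not necessarily distinct), the maximum of the three sums d(i,j) + d(k,l), d(i,k) + d(j,l), and d(i,l) + d(j,k) must be attained at least twice. The sketch below checks this condition for a small matrix, assuming the input is symmetric with a zero diagonal. The function name satisfies_four_point, the tolerance, and the example edge lengths are illustrative choices, not taken from the text.

```python
from itertools import product


def satisfies_four_point(d, tol=1e-9):
    """Check the four-point condition on a symmetric matrix with zero diagonal.

    Indices may repeat, so the triangle inequality and nonnegativity are
    covered as special cases. Runs in O(n^4) time, fine for small examples.
    """
    n = len(d)
    for i, j, k, l in product(range(n), repeat=4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        # The two largest sums must coincide (up to the tolerance).
        if sums[2] - sums[1] > tol:
            return False
    return True


# Path-length distances on the quartet tree ((A,B),(C,D)) with pendant
# edge lengths 1, 2, 3, 4 and an internal edge of length 1.5.
tree_d = [
    [0.0, 3.0, 5.5, 6.5],
    [3.0, 0.0, 6.5, 7.5],
    [5.5, 6.5, 0.0, 7.0],
    [6.5, 7.5, 7.0, 0.0],
]

# The same matrix with noise added to the A-D entry.
noisy_d = [row[:] for row in tree_d]
noisy_d[0][3] = noisy_d[3][0] = 7.0

print(satisfies_four_point(tree_d))   # True
print(satisfies_four_point(noisy_d))  # False
```

The second matrix shows why estimated distances usually fail to be a tree metric: a small error in a single entry is enough to break the condition.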
To date, of all the tree reconstruction methods, distance-based methods for
phylogeny reconstruction have been seen as the best hope for accurately building
phylogenies on very large sets of taxa, such as the data sets for the tree of
life of the insect order Hymenoptera [2, 9]. More precisely, distance-based
methods have been shown to be statistically consistent in all settings (including
those affected by the long branch attraction problem; see [1] for details), in
contrast with other methods, such as parsimony methods, e.g., [10-12].
Distance-based methods also have a huge speed advantage over parsimony and
likelihood methods, and hence enable the reconstruction of trees on
greater numbers of taxa. However, a distance-based method is not a perfect method
for reconstructing a phylogenetic tree from a given sequence data set. This is because,
in computing pairwise distances, one ignores both the interior nodes and the overall
tree topology and so there is a concern that one loses some information from the
input data sets. Therefore, it is important to understand how a distance-based
method works and how robust it is to noisy data sets. (It is noteworthy, however,
that from an information-theoretic point of view, a recent article [13] argues
that at least some distance-based methods can be proved to preserve more
information than may be obtainable from maximum likelihood methods for tree
reconstruction, contrary to commonly expressed concerns in the mathematical and
biological literature.)
A distance-based method is related to geometry and combinatorics. In fact, one
can describe the space of phylogenetic trees, i.e., the set of all tree metrics
inside the set of all distance matrices, as points in a high-dimensional space
which form a union of so-called polyhedral cones. In Section 10.3, we will take
an elementary approach to
Footnote 1: Separate from the ML approach to tree reconstruction.
 