Biomedical Engineering Reference
In-Depth Information
grows exponentially with the number of sequences n; for example, for n = 50 organisms there already exist 2.84 × 10^76 alternative topologies, a number almost as large as the number of atoms in the Universe (≈ 10^80). Thus, given some biologically meaningful optimality criterion for evaluating all alternative configurations (topologies) in the search for the best tree, one can quickly suspect that the problem might be NP-hard. In fact, NP-hardness has already been demonstrated for the general version of the perfect phylogeny problem [2] and for maximum parsimony (MP) [3]. The maximum likelihood (ML) criterion [4] is also believed to be NP-hard, although this has not yet been demonstrated, owing to the substantially higher mathematical complexity of the model, which is not discrete.
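The growth rate quoted above can be checked directly with the standard double-factorial counts of binary tree topologies, which for n = 50 land in the 10^74 to 10^76 range discussed in the text. A minimal sketch (the function names are ours, not from this chapter):

```python
from math import prod

def rooted_topologies(n):
    # (2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3):
    # number of distinct rooted binary tree topologies on n labelled leaves
    return prod(range(1, 2 * n - 2, 2))

def unrooted_topologies(n):
    # (2n - 5)!!: number of distinct unrooted binary tree topologies (n >= 3)
    return prod(range(1, 2 * n - 4, 2))

# For n = 50 both counts are astronomically large, dwarfing any exhaustive
# search; compare the roughly 10^80 atoms in the Universe.
print(f"{rooted_topologies(50):.2e}", f"{unrooted_topologies(50):.2e}")
```

Each additional taxon multiplies the count by the number of branches into which it can be inserted, which is where the double factorial comes from.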
Another important aspect of the design of heuristic tree searches is the very high degree of accuracy (closeness to the score of the optimal or best-known solution) required to obtain reasonable biological results. Although an accuracy of 90% is considered a "good" value for heuristics designed to solve other NP-hard optimization problems, for example, the traveling salesman problem, recent results suggest that phylogenetic analyses require an accuracy ≥ 99.99%, particularly for large trees [5]. This observation makes the whole field more difficult and challenging.
When comparing the various optimality criteria which have been devised for
phylogenetic trees, one can observe a trade-off between speed and quality. This
means that a phylogenetic analysis conducted with an elaborate model such as ML requires significantly more time but yields more accurate trees than, for example, neighbor joining (NJ) [6] or MP [7, 8]. However, because of this higher
accuracy, it is desirable to infer large and complex trees with ML.
Within this context, it is important to emphasize that the design of ML programs
is primarily an algorithmic discipline, because of the gigantic number of alternative tree topologies and the high computational cost of the likelihood function
(Section 14.2). Thus, progress in the field has been attained mainly by algorithmic
improvements rather than by brute force allocation of all available computational
resources. As an example, consider the performance of parallel fastDNAml [9] (state-
of-the-art parallel ML program in 2001) and RAxML-III [10] (randomized accelerated
ML, one of the fastest sequential ML programs in 2004) on a 1000-organism alignment: for this large alignment, parallel fastDNAml consumed approximately 9000
accumulated CPU hours on a Linux PC cluster in contrast to less than 20 hours
required by RAxML-III on a single processor. In addition, the likelihood of the tree
computed by RAxML-III was significantly better than the likelihood score obtained
by parallel fastDNAml.
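To give a feel for why the likelihood function dominates the running time, the following sketch evaluates the per-site likelihood of a tiny four-taxon tree under the Jukes-Cantor model using Felsenstein's pruning algorithm. This is a generic textbook illustration, not RAxML-III's actual implementation; the branch lengths and helper names are made up. Note that this whole procedure must be repeated for every alignment site and for every candidate topology.

```python
import math

def jc_p(t):
    # Jukes-Cantor transition probability matrix for branch length t:
    # P(same) = 1/4 + 3/4 e^(-4t/3), P(different) = 1/4 - 1/4 e^(-4t/3)
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def leaf_vector(base):
    # conditional likelihood vector of an observed tip: 1 at the observed base
    v = [0.0] * 4
    v["ACGT".index(base)] = 1.0
    return v

def prune(children):
    # Felsenstein pruning: combine the children's conditional likelihood
    # vectors (each paired with its branch length) into the parent's vector
    parent = [1.0] * 4
    for vec, t in children:
        P = jc_p(t)
        for i in range(4):
            parent[i] *= sum(P[i][j] * vec[j] for j in range(4))
    return parent

# one alignment site with pattern A, A, G, G on the tree ((A,B),(C,D));
# branch lengths 0.1 and 0.05 are arbitrary illustrative values
left = prune([(leaf_vector("A"), 0.1), (leaf_vector("A"), 0.1)])
right = prune([(leaf_vector("G"), 0.1), (leaf_vector("G"), 0.1)])
root = prune([(left, 0.05), (right, 0.05)])
site_likelihood = sum(0.25 * root[i] for i in range(4))  # uniform base frequencies
```

Even this toy version performs a 4 × 4 matrix-vector product per branch per site; real analyses multiply that cost by thousands of sites, thousands of branches, and a vast number of candidate topologies, which is exactly why algorithmic shortcuts matter.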
Therefore, a reasonable approach to design ML phylogeny programs for high-
performance computing consists in adopting an iterative development cycle as
outlined in Figure 14.2. Consequently, this chapter covers one typical iteration of
this development process by example of algorithmic and technical improvements
implemented in RAxML-III.
The remainder of this chapter is organized as follows. Section 14.2 provides a brief
introduction to the ML method and describes the abstract computational problems,