Biomedical Engineering Reference
In-Depth Information
grows exponentially with the number of sequences n; for example, for n = 50 organisms there already exist 2.84 × 10^76 alternative topologies, a number almost as large as the number of atoms in the Universe (≈ 10^80). Thus, given some biologically meaningful optimality criterion for evaluating all alternative configurations (topologies) in the search for the best tree, one can quickly suspect that the problem might be NP-hard. In fact, NP-hardness has already been demonstrated for the general version of the perfect phylogeny problem [2] and for maximum parsimony (MP) [3]. The maximum likelihood (ML) criterion [4] is also believed to be NP-hard, although this has not yet been demonstrated, owing to the substantially higher mathematical complexity of the model, which is not discrete.
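The growth rate quoted above can be checked directly with the standard double-factorial counts of binary tree topologies, which for n = 50 land in the 10^74 to 10^76 range discussed in the text. A minimal sketch (the function names are ours, not from this chapter):

```python
from math import prod

def rooted_topologies(n):
    # (2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3):
    # number of distinct rooted binary tree topologies on n labelled leaves
    return prod(range(1, 2 * n - 2, 2))

def unrooted_topologies(n):
    # (2n - 5)!!: number of distinct unrooted binary tree topologies (n >= 3)
    return prod(range(1, 2 * n - 4, 2))

# For n = 50 both counts are astronomically large, dwarfing any exhaustive
# search; compare the roughly 10^80 atoms in the Universe.
print(f"{rooted_topologies(50):.2e}", f"{unrooted_topologies(50):.2e}")
```

Each additional taxon multiplies the count by the number of branches into which it can be inserted, which is where the double factorial comes from.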
Another important aspect of the design of heuristic tree searches is the very high degree of accuracy (closeness to the score of the optimal or best-known solution) required to obtain reasonable biological results. Although an accuracy of 90% is considered a "good" value for heuristics designed to solve other NP-hard optimization problems, for example, the traveling salesman problem, recent results suggest that phylogenetic analyses require an accuracy ≥ 99.99%, particularly for large trees [5]. This observation makes the whole field more difficult and challenging.
When comparing the various optimality criteria which have been devised for
phylogenetic trees, one can observe a trade-off between speed and quality. This
means that a phylogenetic analysis conducted with an elaborate model such as ML requires significantly more time but yields more accurate trees than, for example, neighbor joining (NJ) [6] or MP [7, 8]. However, because of this higher
accuracy, it is desirable to infer large and complex trees with ML.
Within this context, it is important to emphasize that the design of ML programs
is primarily an algorithmic discipline, because of the gigantic number of alternative tree topologies and the high computational cost of the likelihood function
(Section 14.2). Thus, progress in the field has been attained mainly by algorithmic
improvements rather than by brute force allocation of all available computational
resources. As an example, consider the performance of parallel fastDNAml [9] (state-
of-the-art parallel ML program in 2001) and RAxML-III [10] (randomized accelerated
ML, one of the fastest sequential ML programs in 2004) on a 1000-organism alignment: for this large alignment, parallel fastDNAml consumed approximately 9000
accumulated CPU hours on a Linux PC cluster in contrast to less than 20 hours
required by RAxML-III on a single processor. In addition, the likelihood of the tree
computed by RAxML-III was significantly better than the likelihood score obtained
by parallel fastDNAml.
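To give a feel for why the likelihood function dominates the running time, the following sketch evaluates the per-site likelihood of a tiny four-taxon tree under the Jukes-Cantor model using Felsenstein's pruning algorithm. This is a generic textbook illustration, not RAxML-III's actual implementation; the branch lengths and helper names are made up. Note that this whole procedure must be repeated for every alignment site and for every candidate topology.

```python
import math

def jc_p(t):
    # Jukes-Cantor transition probability matrix for branch length t:
    # P(same) = 1/4 + 3/4 e^(-4t/3), P(different) = 1/4 - 1/4 e^(-4t/3)
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def leaf_vector(base):
    # conditional likelihood vector of an observed tip: 1 at the observed base
    v = [0.0] * 4
    v["ACGT".index(base)] = 1.0
    return v

def prune(children):
    # Felsenstein pruning: combine the children's conditional likelihood
    # vectors (each paired with its branch length) into the parent's vector
    parent = [1.0] * 4
    for vec, t in children:
        P = jc_p(t)
        for i in range(4):
            parent[i] *= sum(P[i][j] * vec[j] for j in range(4))
    return parent

# one alignment site with pattern A, A, G, G on the tree ((A,B),(C,D));
# branch lengths 0.1 and 0.05 are arbitrary illustrative values
left = prune([(leaf_vector("A"), 0.1), (leaf_vector("A"), 0.1)])
right = prune([(leaf_vector("G"), 0.1), (leaf_vector("G"), 0.1)])
root = prune([(left, 0.05), (right, 0.05)])
site_likelihood = sum(0.25 * root[i] for i in range(4))  # uniform base frequencies
```

Even this toy version performs a 4 × 4 matrix-vector product per branch per site; real analyses multiply that cost by thousands of sites, thousands of branches, and a vast number of candidate topologies, which is exactly why algorithmic shortcuts matter.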
Therefore, a reasonable approach to design ML phylogeny programs for high-
performance computing consists in adopting an iterative development cycle as
outlined in Figure 14.2. Consequently, this chapter covers one typical iteration of
this development process by example of algorithmic and technical improvements
implemented in RAxML-III.
The remainder of this chapter is organized as follows. Section 14.2 provides a brief
introduction to the ML method and describes the abstract computational problems,