High-Performance Phylogeny Reconstruction Under Maximum Parsimony - Parallel Computing for Bioinformatics and Computational Biology

Biomedical Engineering Reference

In-Depth Information

sequences. MP attempts to minimize this score following the philosophy of Occam's

razor — the simplest explanation of the data is preferred. Under MP, a total cost is

assigned to each tree, and the optimal tree (i.e., the most parsimonious tree) is defined

as the one with the smallest total cost. In this approach, a unit cost is given for each

nucleotide substitution. A central step in the procedure is allocating sequences to the

internal nodes in the tree. For any set of sequence allocations, the total cost of the tree

is the sum of the costs of the various edges, where the cost of joining two internal

nodes, or an internal node and a leaf, is the number of substitutions needed to move

from the sequence at one to the sequence at the other (i.e., the Hamming distance).

The most popular software packages that implement MP heuristics are PAUP* [3],

Phylip [4], and TNT [6].

Likelihood Methods Likelihood methods require the specification of an evolu-

tionary model. For a given phylogenetic arrangement, the question is: what is the

likelihood that evolution under the specified parameters will produce the observed

nucleotide sequences? For a given evolutionary hypothesis, the likelihood of the

observed change is computed for each position, and then the product of the likelihoods

is expressed as a distance or branch length between the sequences. The parameters are

then varied, and the combination with the highest likelihood is accepted. This proce-

dure is then repeated for another arrangement, the two topologies compared, and the

one with the highest likelihood selected. The selective process is continued until an

arrangement is found with the combined ML of both an evolutionary hypothesis and

a topology. Finding the best tree under ML is the most computationally demanding

of the methods discussed here.

Like MP, ML is an optimization problem. ML seeks the tree and associated

model parameter values that maximize the probability of producing the given set of

sequences. ML thus depends explicitly on an assumed model of evolution. For exam-

ple, the ML problem under the Jukes-Cantor model needs to estimate one parameter

(the substitution probability) for each edge of the tree, while under theGeneralMarkov

model 12 parameters must be estimated on each edge. Scoring a fixed tree cannot

be done in polynomial time for ML [15], whereas it is easily accomplished in lin-

ear time for MP [16]. Various software packages provide heuristics for ML, include

PAUP* [3], Phylip [4], fastDNAml [17], and PHYML [18].

Bayesian methods deserve a special mention among likelihood-based approaches;

they compute the posterior probability that the observed data would have been

produced by various trees (in contrast to a true ML method, which computes the

probability that a fixed tree would produce various kinds of data at its leaves). Their

implementation with Markov chain Monte-Carlo (MCMC) algorithms often run sig-

nificantly faster than pure ML methods. Moreover, the moves through state space can

be designed to enhance convergence rates and speedup execution. Mr-Bayes [5] is the

most popular software package for reconstructing trees based on Bayesian analysis.

16.1.3 Large-Scale Phylogenies

The remainder of this paper considers high-performance approaches to inferring trees

under MP. Our reasons for focusing on MP are twofold. First, MP remains the major

Parallel Computing for Bioinformatics and Computational Biology

Search WWH ::

Custom Search

Home