Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATe - Multiple Sequence Alignment Methods

Biology Reference

In-Depth Information

“phylogenetic placement” [ 31 - 33 ]. Multiple sequence alignment

estimation for highly fragmentary data can also be addressed

through these phylogenetic placement methods, but has not been

sufficiently studied in this context.

3 Algorithm

SAT ´ uses iteration to produce improved alignments and trees,

so that each iteration uses the results from the previous iteration

to start its analysis, and then reestimates a multiple sequence align-

ment and tree for the dataset. Empirically, our studies have shown

that the iterations quickly converge to good alignments and trees,

with the biggest improvement occurring in the first iteration.

Therefore, even a single iteration can result in an improved tree

and alignment, and several iterations provide increased accuracy.

The studies in [ 3 , 14 ] showed that the most accurate alignment

methods (such as MAFFT, in its most accurate setting) could not

be run on datasets with thousands of sequences, and that all meth-

ods have reduced accuracy for large enough rates of evolution and

numbers of sequences. SAT´ overcomes this barrier using divide-

and-conquer. It divides an input sequence dataset into subsets that

are small enough that highly accurate alignment methods can be

run on them, thus producing “subset alignments”. SAT´ then

merges the subset alignments together to produce an alignment

of the full dataset, on which a tree can then be estimated using

maximum likelihood methods. By repeating this process several

times, the alignments and trees become increasingly accurate.

By design, SAT ´ has several algorithmic parameters that deter-

mine how it runs. These have default values, but can also be reset by

the user. Understanding the algorithmic parameters is helpful to

obtaining improved accuracy for dataset analyses. Here we describe

the algorithmic structure, and point out the algorithmic parameters

that the user can set.

The input to SAT´ is a set of unaligned sequences. However,

the user can also provide an initial alignment and/or tree, which

can then be used by SAT´ to begin the iterative process. If none are

provided, then SAT ´ will estimate its own alignment and tree for

the input sequences.

The main analysis then proceeds by repeating the following

steps in an iterative fashion. The tree T from the previous iteration

is used to guide the divide-and-conquer strategy for the current

iteration. The tree itself is estimated using maximum likelihood

(either using RAxML or FastTree) on the alignment on the

sequences, and so has branch “lengths” (indicating substitution

parameters for the Markov model of evolution). A branch e is

selected in the tree T (either the “centroid” branch, which divides

the tree into two subtrees with roughly equal numbers of taxa, or

Multiple Sequence Alignment Methods

Search WWH ::

Custom Search

Home