Biology Reference
In-Depth Information
“phylogenetic placement” [ 31 - 33 ]. Multiple sequence alignment
estimation for highly fragmentary data can also be addressed
through these phylogenetic placement methods, but has not been
sufficiently studied in this context.
3 Algorithm
SAT ´ uses iteration to produce improved alignments and trees,
so that each iteration uses the results from the previous iteration
to start its analysis, and then reestimates a multiple sequence align-
ment and tree for the dataset. Empirically, our studies have shown
that the iterations quickly converge to good alignments and trees,
with the biggest improvement occurring in the first iteration.
Therefore, even a single iteration can result in an improved tree
and alignment, and several iterations provide increased accuracy.
The studies in [ 3 , 14 ] showed that the most accurate alignment
methods (such as MAFFT, in its most accurate setting) could not
be run on datasets with thousands of sequences, and that all meth-
ods have reduced accuracy for large enough rates of evolution and
numbers of sequences. SAT´ overcomes this barrier using divide-
and-conquer. It divides an input sequence dataset into subsets that
are small enough that highly accurate alignment methods can be
run on them, thus producing “subset alignments”. SAT´ then
merges the subset alignments together to produce an alignment
of the full dataset, on which a tree can then be estimated using
maximum likelihood methods. By repeating this process several
times, the alignments and trees become increasingly accurate.
By design, SAT ´ has several algorithmic parameters that deter-
mine how it runs. These have default values, but can also be reset by
the user. Understanding the algorithmic parameters is helpful to
obtaining improved accuracy for dataset analyses. Here we describe
the algorithmic structure, and point out the algorithmic parameters
that the user can set.
The input to SAT´ is a set of unaligned sequences. However,
the user can also provide an initial alignment and/or tree, which
can then be used by SAT´ to begin the iterative process. If none are
provided, then SAT ´ will estimate its own alignment and tree for
the input sequences.
The main analysis then proceeds by repeating the following
steps in an iterative fashion. The tree T from the previous iteration
is used to guide the divide-and-conquer strategy for the current
iteration. The tree itself is estimated using maximum likelihood
(either using RAxML or FastTree) on the alignment on the
sequences, and so has branch “lengths” (indicating substitution
parameters for the Markov model of evolution). A branch e is
selected in the tree T (either the “centroid” branch, which divides
the tree into two subtrees with roughly equal numbers of taxa, or
Search WWH ::




Custom Search