Chapter 15
Large-Scale Multiple Sequence Alignment
and Tree Estimation Using SATé
Kevin Liu and Tandy Warnow
Abstract
SATé is a method for estimating multiple sequence alignments and trees that has been shown to produce
highly accurate results for datasets with large numbers of sequences. Running SATé using its default settings
is very simple, but improved accuracy can be obtained by modifying its algorithmic parameters. We provide
a detailed introduction to the algorithmic approach used by SATé, and instructions for running a SATé
analysis using the GUI under default settings. We also provide a discussion of how to modify these settings
to obtain improved results, and how to use SATé in a phylogenetic analysis pipeline.
Key words: Multiple sequence alignment, Maximum likelihood, Phylogenetics, SATé, Species tree
estimation, Gene tree estimation, Phylogenomics
1 Introduction
A typical phylogenetic study estimates a multiple sequence alignment
(MSA) from biomolecular sequence data, and then infers a
phylogeny using the MSA [1]. While much has been established
about the relative performance of phylogeny estimation methods
and the importance of picking a highly accurate estimation method,
only in recent years has there been substantial study of the impact of
the alignment method on the final phylogenetic estimation. It is
now understood that the accuracy of the inferred phylogeny
depends on the accuracy of the multiple sequence alignments
estimated in the preceding phase [2-9], and that inaccurate multiple
sequence alignments tend to produce inaccurate trees. While
datasets with low enough rates of evolution can be aligned well using
existing fast alignment methods (such as ClustalW [10], Muscle
[11, 12], and MAFFT [13]), alignments of datasets that evolve
more quickly are substantially harder to estimate, and standard
methods typically produce poor alignments on these datasets
[3, 4, 14]. Furthermore, many of the highly accurate alignment
methods cannot be run on datasets with many sequences, due to