much more accurate alignments and trees than other methods.
SATé has been used to analyze protein as well as nucleotide datasets
for many different types of organisms (birds, plants, bacteria, etc.).
Many of these analyses have been on small datasets, with fewer
than 100 sequences; however, SATé has also been used to analyze
almost 28,000 rRNA sequences, spanning the domains of Archaea,
Bacteria, and Eukaryota [24].
Although the first publication [23] of SATé has been cited over
100 times, the current implementation in the public distribution
(available from the University of Kansas Web site at
http://phylo.bio.ku.edu/software/sate/sate.html) is based on the second
publication [24]. The focus of this chapter, therefore, is on the new
implementation of SATé. We limit our discussion to the GUI
usage, but readers interested in command-line usage can obtain
additional information from the tutorial available online from
the Kansas Web site (see Note 1), or from the SATé user group
(see Note 2). SATé is under active development, with extensions for
handling fragmentary data (as created by next-generation sequencing
technologies), improved analysis of protein sequences, etc.,
and users may wish to contact the UT-Austin SATé group for
information about these plans, or to suggest new developments
(see Note 3). Finally, phylogenetic estimation is a large and complex
research discipline, and we direct the interested reader to [29] for a
more in-depth discussion.
2 SATé Design Goals and Limitations
SATé was designed to enable fast and accurate estimation of alignments
and trees for nucleotide datasets with hundreds to thousands
of sequences [23, 24]. Its design, which is based on divide-and-conquer,
improves accuracy on those datasets for which the best
alignment methods cannot run due to computational requirements
(either memory or time). SATé is therefore not designed to
improve accuracy on those datasets that are small enough to be
handled well by standard methods. In addition, although SATé is
designed for large datasets, the largest dataset analyzed by
SATé so far is the 16S.B.ALL dataset, with 27,643 rRNA sequences and
6,857 sites [23], and we do not know how well it will scale to very
large datasets with many tens of thousands of sequences.
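As a rough illustration of the size ranges discussed above, the following sketch counts the sequences in FASTA-formatted input and attaches an advisory label. It is not part of SATé: the function names and the two cut-offs are assumptions made for this example. In particular, treating 100 sequences as the boundary for "small" is our reading of the text, and 27,643 is simply the size of the largest dataset reported above, not a limit enforced by the software.

```python
from typing import Iterable

# Assumed cut-offs for illustration only; SATé does not enforce either one.
SMALL_DATASET = 100        # assumption: below this, standard methods may suffice
LARGEST_ANALYZED = 27_643  # size of the 16S.B.ALL dataset mentioned in the text


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def describe_dataset_size(num_sequences: int) -> str:
    """Return a rough, advisory label for a dataset with this many sequences."""
    if num_sequences < SMALL_DATASET:
        return "small: standard methods may already handle this well"
    if num_sequences <= LARGEST_ANALYZED:
        return "within the size range reported in the text"
    return "larger than any reported analysis: scaling behavior is unknown"


if __name__ == "__main__":
    # Tiny in-memory example; a real file could be passed as open(path).
    example = [">seq1", "ACGT", ">seq2", "AC-T", ">seq3", "ACGA"]
    n = count_fasta_records(example)
    print(n, describe_dataset_size(n))
```

A check like this only looks at the number of sequences; it says nothing about sequence length, missing data, or rearrangements, which also affect whether a dataset suits the method.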
Some datasets fall clearly outside of the design goals of SATé.
SATé is not designed for alignment estimation of datasets whose sequences
are extremely long (hundreds of thousands of nucleotides) or
evolve with rearrangements rather than just indels (insertions and
deletions) and substitutions; thus, whole genome alignment [30] is
not part of SATé's capabilities. SATé has also not been designed for
datasets with substantial missing data or fragmentary data from
short-read sequencing projects. Phylogeny estimation for highly
fragmentary data can be obtained through methods based on