RAxML on the final alignment returned by SATé. For very large
datasets, the final RAxML analysis could take a long time, and so an
alternative is to run SATé using FastTree and without any final
RAxML run, save the resultant alignment and tree, and then run
RAxML on the final alignment. We recommend using MAFFT for
aligning subsets, using a maximum alignment subset size of 200,
and the centroid edge decomposition. We recommend using Opal
to merge subset alignments instead of Muscle, unless the dataset is
so large (in number of sequences and/or sequence length) that the
memory requirements of Opal exceed what is available on your
machine. Opal should never be used as the subset aligner on
extremely large datasets (its memory requirements will slow down
the analysis dramatically). Prank is too slow to use on even
moderately large datasets, and therefore should not be used as the
subset aligner. Using ClustalW as the subset aligner will not cause
running time issues, but there is little evidence that ClustalW is
likely to produce more accurate alignments than MAFFT; therefore,
it is not recommended as a subset aligner.
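The recommendations above can be summarized as a small lookup. The
following is a minimal Python sketch and is not part of SATé itself:
the function name, its parameter, and every dictionary key are
hypothetical labels chosen for illustration, and they do not
correspond to actual SATé option names. Whether Opal fits in memory
is left to the caller, since the text gives no numeric threshold.

```python
# Subset aligners the text advises against, with the stated reason.
DISCOURAGED_SUBSET_ALIGNERS = {
    "Opal": "memory requirements slow the analysis on extremely large datasets",
    "Prank": "too slow on even moderately large datasets",
    "ClustalW": "little evidence of better accuracy than MAFFT",
}


def recommended_settings(opal_fits_in_memory: bool) -> dict:
    """Summarize the settings suggested in the text for very large datasets.

    All keys are hypothetical labels, not real SATé option names.
    """
    return {
        "subset_aligner": "MAFFT",
        "max_subset_size": 200,
        "decomposition": "centroid",
        # Opal is preferred for merging unless it exceeds available memory.
        "merger": "Opal" if opal_fits_in_memory else "Muscle",
        # The alternative described for very large datasets: FastTree inside
        # the run, with RAxML applied separately to the saved alignment.
        "tree_estimator": "FastTree",
        "final_raxml_within_run": False,
    }


if __name__ == "__main__":
    for key, value in recommended_settings(opal_fits_in_memory=False).items():
        print(f"{key}: {value}")
```

Only the merger depends on the input here; the other entries are
fixed because the text recommends them unconditionally.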
For very large datasets, providing an initial alignment
(and possibly an initial tree) to SATé can speed up and potentially
improve the analysis. If you run SATé without providing an initial
alignment and/or tree, the initial alignment will be estimated using
MAFFT, which is run in its less accurate setting (in extreme cases,
this will be MAFFT-PartTree [53]) on very large datasets. However,
faster and potentially more accurate estimations of initial
alignments might be achievable using other methods, such as
Clustal-Omega [54] for amino-acid sequences or MAFFT-profile
[55] for nucleotide sequences. Once the initial alignment is
provided, SATé will use FastTree to estimate the initial tree on the
alignment. Because SATé is quite robust to its initial tree [23, 24],
the initial alignment need not be particularly accurate. The analysis
of very large datasets presents both memory and running time
challenges; see Notes 5-7 for advice on how to handle problems that
may arise.
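The choice of method for the initial alignment can be written as a
simple mapping from sequence type. This is a minimal Python sketch
under stated assumptions: the function name and the datatype labels
("protein", "dna", "rna") are hypothetical, and the function only
returns the name of a suggested method; it does not run any aligner.

```python
def suggest_initial_aligner(datatype: str) -> str:
    """Return the method the text suggests for an initial alignment.

    The datatype labels are hypothetical; only the two suggestions
    (Clustal-Omega for amino acids, MAFFT-profile for nucleotides)
    come from the text.
    """
    suggestions = {
        "protein": "Clustal-Omega",
        "dna": "MAFFT-profile",
        "rna": "MAFFT-profile",
    }
    key = datatype.strip().lower()
    if key not in suggestions:
        raise ValueError(f"unknown datatype: {datatype!r}")
    return suggestions[key]


if __name__ == "__main__":
    print(suggest_initial_aligner("protein"))
    print(suggest_initial_aligner("DNA"))
```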
Small datasets. Using SATé to estimate trees and alignments on very
small datasets (with fewer than 200 sequences) may not result in
improved accuracy, since these datasets can be analyzed well using
methods such as MAFFT; however, datasets of this size have been
analyzed using SATé (see, for example, [50, 51, 56-58]). The main
recommendation we make for the analysis of small datasets is to
use 50% as the maximum subproblem size, rather than a smaller
percentage, and to otherwise use the standard defaults. In addition,
for small enough datasets, phylogeny estimation methods that are
generally too computationally intensive to use on even moderately
large datasets (such as MrBayes [59]) can be used to estimate a tree
on the resultant SATé alignment.
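To make the 50% recommendation concrete, the sketch below converts
it into a number of sequences for a given small dataset. This is an
illustration only: the function and constant names are hypothetical,
and rounding down to a whole number of sequences is an assumption,
since the text does not say how the percentage is rounded.

```python
SMALL_DATASET_LIMIT = 200  # "very small" threshold quoted in the text


def small_dataset_subproblem_size(num_sequences: int, percent: int = 50) -> int:
    """Return the maximum subproblem size, in sequences, for a small dataset.

    Rounds down (an assumption) and never returns less than 1.
    """
    if num_sequences <= 0:
        raise ValueError("num_sequences must be positive")
    if num_sequences >= SMALL_DATASET_LIMIT:
        raise ValueError("dataset is not small by the text's threshold")
    return max(1, num_sequences * percent // 100)


if __name__ == "__main__":
    # A dataset of 120 sequences gives subproblems of at most 60 sequences.
    print(small_dataset_subproblem_size(120))
```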