Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATé - Multiple Sequence Alignment Methods


RAxML on the final alignment returned by SATé. For very large datasets, the final RAxML analysis could take a long time, so an alternative is to run SATé using FastTree and without any final RAxML run, save the resultant alignment and tree, and then run RAxML on the final alignment separately. We recommend using MAFFT for aligning subsets, with a maximum alignment subset size of 200 and the centroid edge decomposition. We recommend using Opal rather than Muscle to merge subset alignments, unless the dataset is so large (in number of sequences and/or sequence length) that the memory requirements for Opal exceed what is available on your machine. Opal should never be used as the subset alignment technique on extremely large datasets, because its memory requirements will slow down the analysis dramatically. Prank is too slow to use on even moderately large datasets and therefore should not be used as the subset aligner. Using ClustalW as the subset aligner will not cause running time issues, but there is little evidence that ClustalW is likely to produce more accurate alignments than MAFFT; therefore, it is not recommended as a subset aligner.
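
The recommendations above for very large datasets can be collected into a small sketch. This is only an illustration under stated assumptions: the class `AnalysisSettings`, its field names, and the two helper functions are hypothetical and are not SATé options or part of its interface, and the memory figures passed in are values you would have to estimate yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisSettings:
    """Hypothetical record of the choices recommended in the text.

    Field names are illustrative only; they are not SATé option names.
    """
    subset_aligner: str = "MAFFT"      # recommended subset aligner
    merger: str = "Opal"               # preferred over Muscle when memory allows
    tree_estimator: str = "FastTree"   # used inside the iterations
    decomposition: str = "centroid"    # centroid edge decomposition
    max_subset_size: int = 200         # maximum alignment subset size
    final_raxml_run: bool = False      # run RAxML separately on the saved alignment


def choose_merger(opal_memory_needed_gb: float, memory_available_gb: float) -> str:
    """Prefer Opal; fall back to Muscle only when Opal would not fit in memory."""
    if opal_memory_needed_gb <= memory_available_gb:
        return "Opal"
    return "Muscle"


def settings_for_very_large_dataset(opal_memory_needed_gb: float,
                                    memory_available_gb: float) -> AnalysisSettings:
    """Build the recommended settings, adjusting only the merger for memory."""
    merger = choose_merger(opal_memory_needed_gb, memory_available_gb)
    return AnalysisSettings(merger=merger)


if __name__ == "__main__":
    # Assumed figures: Opal is estimated to need 64 GB, the machine has 16 GB.
    print(settings_for_very_large_dataset(64.0, 16.0))
```

Only the merger changes with the memory check; the subset aligner stays MAFFT in both cases, matching the advice against Opal, Prank, and ClustalW as subset aligners.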

For very large datasets, providing an initial alignment (and possibly an initial tree) to SATé can speed up and potentially improve the analysis. If you run SATé without providing an initial alignment and/or tree, the initial alignment will be estimated using MAFFT, which is run in a less accurate setting on very large datasets (in extreme cases, this will be MAFFT-PartTree [53]). However, faster and potentially more accurate estimates of the initial alignment might be achievable using other methods, such as Clustal-Omega [54] for amino-acid sequences or MAFFT-profile [55] for nucleotide sequences. Once the initial alignment is provided, SATé will use FastTree to estimate the initial tree on the alignment. Because SATé is quite robust to its initial tree [23, 24], the initial alignment need not be particularly accurate. The analysis of very large datasets presents both memory and running time challenges; see Notes 5-7 for advice on how to handle problems that may arise.
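
The starting-point logic described in this paragraph can be summarized as a short decision sketch. The function names and the `"protein"`/`"dna"` labels are hypothetical, and treating the alignment and the tree as two independent choices is a simplifying assumption; the sketch describes the workflow in words and does not call SATé or any aligner.

```python
from typing import Tuple


def starting_point(has_initial_alignment: bool, has_initial_tree: bool) -> Tuple[str, str]:
    """Describe where the initial alignment and initial tree come from."""
    if has_initial_alignment:
        alignment_source = "user-supplied alignment"
    else:
        # On very large datasets MAFFT runs in a less accurate setting
        # (MAFFT-PartTree in extreme cases).
        alignment_source = "MAFFT (less accurate setting on very large datasets)"

    if has_initial_tree:
        tree_source = "user-supplied tree"
    else:
        tree_source = "FastTree on the initial alignment"

    return alignment_source, tree_source


def suggested_external_aligner(datatype: str) -> str:
    """Return the alternative initial aligner the text suggests for a data type."""
    suggestions = {
        "protein": "Clustal-Omega",  # amino-acid sequences
        "dna": "MAFFT-profile",      # nucleotide sequences
    }
    if datatype not in suggestions:
        raise ValueError("datatype must be 'protein' or 'dna'")
    return suggestions[datatype]


if __name__ == "__main__":
    print(starting_point(has_initial_alignment=True, has_initial_tree=False))
    print(suggested_external_aligner("protein"))
```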

Small datasets. Using SATé to estimate trees and alignments on very small datasets (with fewer than 200 sequences) may not result in improved accuracy, since these datasets can be analyzed well using methods such as MAFFT; however, datasets of this size have been analyzed using SATé (see, for example, [50, 51, 56-58]). The main recommendation we make for the analysis of small datasets is to use 50% as the maximum subproblem size, rather than a smaller percentage, and to otherwise use the standard defaults. In addition, for small enough datasets, phylogeny estimation methods that are generally too computationally intensive to use on even moderately large datasets (such as MrBayes [59]) can be used to estimate a tree on the resultant SATé alignment.
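
As a worked illustration of the 50% rule, the sketch below converts the recommendation into a sequence count. The function name and parameters are hypothetical, rounding up is an assumption, and applying the cap of 200 to every dataset that is not small is a simplification: the text only states 200 for very large datasets and 50% for datasets with fewer than 200 sequences.

```python
import math


def max_subproblem_size(num_sequences: int,
                        small_threshold: int = 200,
                        small_fraction: float = 0.5,
                        large_cap: int = 200) -> int:
    """Return an illustrative maximum subproblem size, in sequences.

    Small datasets (fewer than `small_threshold` sequences) use 50% of the
    dataset; other datasets use the fixed cap (a simplifying assumption).
    """
    if num_sequences < 1:
        raise ValueError("num_sequences must be positive")
    if num_sequences < small_threshold:
        # Rounding up is an assumption made for this sketch.
        return max(1, math.ceil(num_sequences * small_fraction))
    return large_cap


if __name__ == "__main__":
    print(max_subproblem_size(120))    # small dataset: half of 120
    print(max_subproblem_size(50000))  # very large dataset: fixed cap
```

For a dataset of 120 sequences this gives a maximum subproblem size of 60 sequences, whereas a dataset of 50,000 sequences is capped at 200.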
