Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATé - Multiple Sequence Alignment Methods

“Tree Estimator” method. Only RAxML and FastTree are enabled for estimating trees from alignments, and FastTree is the default. Both are heuristics for maximum likelihood, which is a computationally hard problem. FastTree is much faster than RAxML, and generally produces trees of very similar accuracy [34]. Furthermore, in our unpublished studies, using FastTree instead of RAxML within SATé produces alignments of comparable accuracy, with only a small decrease in tree accuracy. Because of this great speed advantage, we recommend FastTree. If FastTree is used, a final RAxML run can be applied to the output alignment to obtain a RAxML tree (and thus potentially improved accuracy).
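
The optional final RAxML run described above could be launched from the command line outside SATé. The following is a minimal sketch that only assembles such a command and does not execute it. It assumes a standalone RAxML binary named raxmlHPC is installed; the file name sate_output.aln, the run name final_tree, the seed, and the helper raxml_command are hypothetical and chosen for illustration. GTRGAMMA is picked here because the text notes that the Gamma setting is the usual choice for phylogenetic analyses.

```python
import shlex


def raxml_command(alignment_path, run_name, model="GTRGAMMA",
                  seed=12345, binary="raxmlHPC"):
    """Assemble (but do not run) a RAxML tree-search command.

    -s  input alignment file
    -n  name appended to RAxML's output files
    -m  substitution model
    -p  random seed for the starting parsimony tree
    """
    return [binary, "-s", alignment_path, "-n", run_name,
            "-m", model, "-p", str(seed)]


# Hypothetical file and run names, for illustration only.
cmd = raxml_command("sate_output.aln", "final_tree")
print(shlex.join(cmd))
```

The list returned by raxml_command can be passed to subprocess.run once the binary name and paths have been checked against the local installation.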

Substitution model. This refers to the statistical model [29] used by the maximum likelihood method (RAxML or FastTree) to estimate trees from alignments. The choice of model depends on whether your data are nucleotide or amino-acid sequences, and also on whether you are using RAxML or FastTree as the tree estimator, since the two support somewhat different models. For nucleotide data, the default with RAxML is GTRCAT, while the default with FastTree is GTR + G20. GTR is the General Time Reversible model, the most general substitution model available within SATé. G20 and CAT refer to how rate variation across sites is modeled: G20 is the Gamma distribution approximated by 20 rate categories, while CAT [35] is a heuristic approximation to the Gamma rate-variation model. Alternative settings for RAxML include GTRGAMMA (GTR + Gamma) and GTRGAMMAI (GTR + Gamma + Invariable). Alternative settings for FastTree include JC (the Jukes-Cantor model) [36] instead of GTR, but this simplified model is not recommended except in the unusual case where the data fit the Jukes-Cantor model best (unlikely for most data). Note that the Gamma setting is the usual choice in phylogenetic analyses; the CAT setting improves speed at a potential loss of phylogenetic accuracy. For amino-acid datasets, the choice of substitution model is more complicated; see the section below on Amino-Acid Datasets for more information.
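
The nucleotide defaults and alternatives listed above can be summarized as a small lookup table. This is only an illustrative sketch: the dictionary NUCLEOTIDE_MODELS and the function default_nucleotide_model are hypothetical names, and the model strings are the labels used in the text, not verified SATé configuration values.

```python
# Nucleotide model choices as described in the text, keyed by tree estimator.
# The strings are labels taken from the prose, not SATé configuration keys.
NUCLEOTIDE_MODELS = {
    "RAxML": {
        "default": "GTRCAT",
        "alternatives": ["GTRGAMMA", "GTRGAMMAI"],
    },
    "FastTree": {
        "default": "GTR+G20",
        "alternatives": ["JC"],  # not recommended for most data
    },
}


def default_nucleotide_model(estimator):
    """Return the default nucleotide model label for a tree estimator."""
    try:
        return NUCLEOTIDE_MODELS[estimator]["default"]
    except KeyError:
        raise ValueError(f"unsupported tree estimator: {estimator!r}") from None


for name in NUCLEOTIDE_MODELS:
    print(name, "->", default_nucleotide_model(name))
```

Amino-acid models are deliberately left out, since the text treats them separately.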

Maximum subproblem size. This is the maximum allowed size of the subsets of sequences, and so determines how many times the decomposition strategy is applied. The default depends on the dataset size (and will be set by SATé after you input your data). However, the main consideration in setting the maximum subproblem size is the method used to align subsets. When MAFFT is the aligner, keeping the maximum subproblem size at or below 200 allows the most accurate version of MAFFT (L-INS-i) to be used to align the subsets, which gives the best accuracy. If you wish to use Prank instead of MAFFT to align subsets, the maximum subproblem size should be reduced substantially, because Prank is
