Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATe - Multiple Sequence Alignment Methods


6 Advanced Topics

Amino-acid datasets. The analysis of amino-acid datasets presents some additional challenges and opportunities. Compared to nucleotide sequences, the selection of the substitution model is more complicated, since the models are not “nested”. The best model for your data should be selected using a statistical test [37, 38]; however, the JTT [39] and WAG [40] models are often used for amino-acid datasets and are reasonable defaults. The models available for amino-acid analyses are displayed within SATé after you check the box indicating that your data are proteins, and depend upon the ML method you have selected (RAxML or FastTree). RAxML supports many more models than FastTree, and so may be preferable. The other amino-acid models available in SATé when used with RAxML are DAYHOFF [41], DCMUT [42], MTREV [43], RTREV [44], CPREV [45], VT [46], BLOSUM62 [47], MTMAM [48], and LG [49], each in combination with a rates-across-sites model. To use empirical base frequencies with these amino-acid models, add an “F” suffix to the name of the model; see the RAxML documentation for details (available from http://sco.h-its.org/exelixis/oldPage/RAxML-Manual.7.0.4.pdf). SATé has been used to analyze protein datasets [50, 51], but we have not studied SATé as a protein aligner nearly as thoroughly as we have studied it as a nucleotide sequence aligner; therefore, the default settings for the algorithmic parameters may not be well optimized. Finally, amino-acid alignment estimation in particular can be enhanced with structural (secondary or tertiary) information about the proteins, information that the alignment methods used by SATé (MAFFT, ClustalW, Prank, and Opal) do not use. Therefore, accuracy could potentially be improved by using a different set of protein alignment methods, including methods such as SATCHMO-JS [52] that employ Hidden Markov Models to take advantage of particular properties of protein alignments.
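The model-naming convention described above (a substitution matrix combined with a rates-across-sites model, plus an optional “F” suffix for empirical frequencies) can be illustrated with a short sketch. It assumes the naming pattern given in the RAxML manual, PROT + rate model + matrix name + optional F. The function name raxml_protein_model and the restriction to the CAT and GAMMA rate models are choices made for this illustration only. SATé itself presents the models in a menu and may label them differently.

```python
# Substitution matrices named in the text (JTT and WAG are the suggested defaults).
MATRICES = {
    "JTT", "WAG", "DAYHOFF", "DCMUT", "MTREV", "RTREV",
    "CPREV", "VT", "BLOSUM62", "MTMAM", "LG",
}

# Rates-across-sites prefixes; assumed from the RAxML manual's
# PROT<rates><matrix>[F] pattern (other prefixes exist but are omitted here).
RATE_MODELS = {"CAT", "GAMMA"}


def raxml_protein_model(matrix, rates="GAMMA", empirical_freqs=False):
    """Build a RAxML-style protein model name (hypothetical helper)."""
    matrix = matrix.upper()
    rates = rates.upper()
    if matrix not in MATRICES:
        raise ValueError("unknown substitution matrix: " + matrix)
    if rates not in RATE_MODELS:
        raise ValueError("unknown rates-across-sites model: " + rates)
    name = "PROT" + rates + matrix
    # The "F" suffix requests empirical frequencies instead of the
    # frequencies that are fixed by the substitution matrix.
    if empirical_freqs:
        name += "F"
    return name


if __name__ == "__main__":
    print(raxml_protein_model("WAG"))                        # default choice
    print(raxml_protein_model("JTT", empirical_freqs=True))  # with "F" suffix
```

A string of this form is what RAxML accepts for its -m option. Whether a particular matrix is available depends on the RAxML version installed, so the list above should be checked against the manual for your version.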

Large datasets. We now present guidelines for the analysis of datasets with 1,000 or more sequences. However, because SATé has not been tested on datasets with more than 28,000 sequences, our recommendations for very large datasets should be taken as our best guess, at this time, for how to handle such datasets. We strongly recommend the use of FastTree rather than RAxML for ML tree estimation in each iteration: FastTree is much faster than RAxML, and our preliminary studies (unpublished) suggest that using FastTree instead of RAxML produces alignments of the same quality in a fraction of the time. However, switching to FastTree can reduce the tree accuracy slightly, and so the user may wish to use
