6 Advanced Topics
Amino-acid datasets. The analysis of amino-acid datasets presents
some additional challenges and opportunities. Compared to nucleotide
sequences, the selection of the substitution model is more
complicated, since the models are not “nested”. The best model for
your data needs to be selected using a statistical test [37, 38];
however, the JTT [39] and WAG [40] models are often used for
amino-acid datasets and are reasonable defaults. The models available
for amino-acid analyses are displayed within SATé after
you check the box indicating that your data are proteins, and
depend upon the ML method you have selected (RAxML or
FastTree). RAxML supports many more models than FastTree, and
so may be preferable. The other amino-acid models available in
SATé when used with RAxML are DAYHOFF [41], DCMUT
[42], MTREV [43], RTREV [44], CPREV [45], VT [46],
BLOSUM62 [47], MTMAM [48], and LG [49], each in combination
with a rates-across-sites model. To set the base frequencies for
these amino-acid models to the empirical frequencies, add an
“F” suffix to the name of the model; see the RAxML documentation
for details (available from http://sco.h-its.org/exelixis/oldPage/RAxML-Manual.7.0.4.pdf). SATé has been used to analyze
protein datasets [50, 51], but we have not studied SATé as a protein
aligner nearly as thoroughly as we have studied it as a nucleotide
sequence aligner; therefore, the default settings for the algorithmic
parameters may not be well optimized for proteins. Finally, amino-acid
alignment estimation in particular can be enhanced with structural
(secondary or tertiary) information about the proteins, information
that the alignment methods (MAFFT, ClustalW, Prank, and Opal)
used by SATé do not use. Therefore, accuracy could potentially be
improved by using a different
set of protein alignment methods, including methods such as
SATCHMO-JS [52] that employ hidden Markov models to take
advantage of particular properties of protein alignments.
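As a small illustration of the “F” suffix convention described above, the following Python sketch assembles a RAxML-style protein model name from a matrix name. It assumes the usual RAxML naming pattern of a rates-across-sites prefix (here PROTGAMMA or PROTCAT) followed by the matrix name; the function name and its arguments are hypothetical and are not part of SATé or RAxML. Check the RAxML manual for the exact names accepted by your version.

```python
# Substitution matrices named in the text for use with RAxML.
MATRICES = {
    "JTT", "WAG", "DAYHOFF", "DCMUT", "MTREV", "RTREV",
    "CPREV", "VT", "BLOSUM62", "MTMAM", "LG",
}

# Assumed rates-across-sites prefixes, following RAxML's naming pattern.
RATE_PREFIXES = {"PROTGAMMA", "PROTCAT"}


def protein_model_name(matrix, rate_prefix="PROTGAMMA", empirical_freqs=False):
    """Build a model name such as 'PROTGAMMAWAGF' (hypothetical helper).

    matrix          -- substitution matrix, e.g. "WAG" or "JTT"
    rate_prefix     -- rates-across-sites prefix (assumed: PROTGAMMA, PROTCAT)
    empirical_freqs -- if True, append "F" to request empirical frequencies
    """
    matrix = matrix.upper()
    rate_prefix = rate_prefix.upper()
    if matrix not in MATRICES:
        raise ValueError(f"unknown matrix: {matrix}")
    if rate_prefix not in RATE_PREFIXES:
        raise ValueError(f"unknown rate prefix: {rate_prefix}")
    suffix = "F" if empirical_freqs else ""
    return rate_prefix + matrix + suffix


if __name__ == "__main__":
    # Default WAG model, then the same model with empirical frequencies.
    print(protein_model_name("WAG"))
    print(protein_model_name("WAG", empirical_freqs=True))
```

The helper only builds the name; whether a given combination is accepted depends on the ML method and version in use.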
Large datasets. We now present guidelines for the analysis of datasets
with 1,000 or more sequences. However, because SATé has not
been tested on datasets with more than 28,000 sequences, our
recommendations for very large datasets should be taken as our
best guess, at this time, for how to handle such datasets. We
strongly recommend the use of FastTree rather than RAxML for
ML tree estimation in each iteration: FastTree is much faster than
RAxML, and our preliminary studies (unpublished) suggest that
using FastTree instead of RAxML produces alignments of the same
quality in a fraction of the time. However, switching to FastTree can
reduce the tree accuracy slightly, and so the user may wish to use