to (log N)^2 randomly chosen seed sequences only. The fast
pairwise distance calculation routines, based on a ktuple alignment
algorithm, have been retained from the previous Clustal programs.
The pairwise distances are then clustered, using a bisecting k-means
algorithm [ 6 ]. Groups of sequences are bisected until a certain
threshold for the cluster size is reached. In the current version
this threshold is hard-wired to 100. Guide-tree construction within
the clusters and amongst the clusters makes use of the tree building
routines in Muscle [ 7 ]. This dendrogram is referred to as a guide-
tree to emphasize that it is only used to guide the progressive
alignment—it is not a reliable guide to the phylogeny of the
sequences. Guide-tree construction is skipped if only two
sequences are to be aligned or if an externally constructed
guide-tree is supplied.
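The bisecting step described above can be illustrated with a minimal sketch. It assumes each sequence is already represented by a numeric vector (for example, its distances to the seed sequences) and uses plain Lloyd-style 2-means for each split. The function names, the `max_size` parameter, and the fallback for degenerate splits are illustrative assumptions, not Clustal Omega's actual implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def two_means(points, rng, iters=20):
    """Split points into two non-empty groups with plain 2-means."""
    centers = rng.sample(points, 2)
    left, right = [], []
    for _ in range(iters):
        left, right = [], []
        for p in points:
            nearer_left = sq_dist(p, centers[0]) <= sq_dist(p, centers[1])
            (left if nearer_left else right).append(p)
        if not left or not right:
            # Degenerate split (e.g. duplicate points): fall back to halving.
            half = len(points) // 2
            return points[:half], points[half:]
        new_centers = [centroid(left), centroid(right)]
        if new_centers == centers:
            break
        centers = new_centers
    return left, right

def bisecting_kmeans(points, max_size=100, seed=0):
    """Bisect clusters until none holds more than max_size points (max_size >= 1)."""
    rng = random.Random(seed)
    pending, done = [list(points)], []
    while pending:
        cluster = pending.pop()
        if len(cluster) <= max_size:
            done.append(cluster)
        else:
            pending.extend(two_means(cluster, rng))
    return done
```

With `max_size=100`, mirroring the hard-wired threshold mentioned above, every returned cluster is small enough that a guide-tree can be built inside it cheaply.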
In the profile-profile alignment phase, sequences are aligned in
larger and larger groups, according to the branching order in the
guide-tree. At each stage of this final step, two alignments are
aligned. Initially these are single sequences, but they grow with
the addition of new sequences as one traverses the guide-tree.
The alignment of residues and the positioning of gaps during
each profile-profile alignment are fixed and cannot be undone at a
later profile-profile alignment higher up in the tree. The main
algorithmic change over ClustalW2 is a new profile-profile engine,
based on the HHalign software [ 5 ]. HHalign is entirely based on
Hidden-Markov Models (HMMs). Sequences and intermediary
profiles are converted into HMMs, which are aligned in turn. It is
also possible to input an HMM in addition to the unaligned
sequences, and to use this external HMM during the profile-profile
alignment stage. This is referred to as External Profile Alignment
(EPA). There are two HMM alignment algorithms: the accurate
and memory-hungry Maximum Accuracy (MAC) algorithm and
the faster, less accurate and more memory-efficient Viterbi
algorithm. The MAC algorithm is the default, and Viterbi is activated
automatically only if the system resources are exhausted.
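The order in which groups are merged along the guide-tree, with earlier alignments left untouched, can be sketched as a post-order traversal. This is only an illustration: the guide-tree is assumed to be a nested tuple of sequence names, and `merge_profiles` is a hypothetical placeholder that pads rows with trailing gaps instead of running a real HMM-based profile-profile aligner such as HHalign.

```python
def merge_profiles(left, right):
    """Placeholder for the profile-profile aligner.

    Each profile is a list of (name, aligned_row) pairs. This stand-in only
    right-pads rows with gaps to a common width; it never alters columns
    that already exist, which mimics the rule that earlier alignment
    decisions are fixed.
    """
    width = max(len(left[0][1]), len(right[0][1]))
    return [(name, row.ljust(width, "-")) for name, row in left + right]

def progressive_align(tree, seqs, merge=merge_profiles):
    """Align sequences in the branching order of a nested-tuple guide-tree."""
    if isinstance(tree, str):
        # Leaf: a single sequence is a one-row profile.
        return [(tree, seqs[tree])]
    left, right = tree
    return merge(progressive_align(left, seqs, merge),
                 progressive_align(right, seqs, merge))

seqs = {"a": "ACGT", "b": "ACG", "c": "AC"}
guide_tree = (("a", "b"), "c")  # a and b are merged first, then c is added
alignment = progressive_align(guide_tree, seqs)
```

Passing a different `merge` function is the point where a real aligner would plug in; the traversal itself does not depend on how two profiles are aligned.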
Sequence input to Clustal Omega is handled by the Squid
routines (http://selab.janelia.org/software.html), and permissible
input formats are a2m (fasta/vienna), clustal, msf, phylip, selex and
stockholm. Output can be in the same formats.
The maximum number of sequences and lengths that can be
aligned will depend on the machine being used. The number of
sequences primarily affects the distance matrix calculation. Storing
an mBed matrix for N = 10,000 sequences takes up approximately
14 MB of memory. A full distance matrix would take up almost
400 MB. Both alternatives are clearly feasible on a modern desktop
computer. For N = 100,000 the mBed matrix will take up
220 MB, while the full distance matrix will require about 40 GB,
which may require a higher-end machine. The length of the
individual input sequences also contributes to the memory
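The matrix sizes quoted above can be approximated with a short calculation. The sketch rests on three assumptions that are not stated in the text but come close to its figures: distances are stored as 8-byte values, the logarithm is base 2, and the full matrix stores only one triangle (N(N-1)/2 entries).

```python
import math

BYTES_PER_DISTANCE = 8  # assumption: double-precision values

def mbed_bytes(n):
    """Approximate size of an N x (log2 N)^2 mBed distance matrix."""
    seeds = math.log2(n) ** 2  # assumption: base-2 logarithm
    return n * seeds * BYTES_PER_DISTANCE

def full_matrix_bytes(n):
    """Approximate size of one triangle of the full N x N distance matrix."""
    return n * (n - 1) / 2 * BYTES_PER_DISTANCE

for n in (10_000, 100_000):
    print(f"N = {n:>7,}: "
          f"mBed ~ {mbed_bytes(n) / 1e6:,.0f} MB, "
          f"full ~ {full_matrix_bytes(n) / 1e9:,.1f} GB")
```

Under these assumptions the mBed matrix grows roughly as N (log N)^2 while the full matrix grows as N^2, which is why the gap widens from tens of megabytes versus hundreds at N = 10,000 to hundreds of megabytes versus tens of gigabytes at N = 100,000.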