Clustal Omega, Accurate Alignment of Very Large Numbers of Sequences - Multiple Sequence Alignment Methods

to (log N)^2 randomly chosen seed sequences only. The fast pairwise distance calculation routines, based on a k-tuple alignment algorithm, have been retained from the previous Clustal programs. The pairwise distances are then clustered using a bisecting k-means algorithm [6]. Groups of sequences are bisected until a threshold for the cluster size is reached; in the current version this threshold is hard-wired to 100. Guide-tree construction within and among the clusters uses the tree-building routines in Muscle [7]. The resulting dendrogram is referred to as a guide-tree to emphasize that it is only used to guide the progressive alignment; it is not a reliable guide to the phylogeny of the sequences. Guide-tree construction is skipped if only two sequences are to be aligned or if an externally constructed guide-tree is supplied.
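
The bisection step can be illustrated with a small sketch. It assumes each sequence has already been embedded as a vector of distances to the seed sequences, and it uses a plain Lloyd-style 2-means split. The function names, the Euclidean metric, and the fallback for degenerate splits are illustrative assumptions, not Clustal Omega's actual implementation.

```python
import random


def _two_means(points, idx, rng, iters=20):
    """Split the index list `idx` into two groups with plain 2-means."""
    a, b = rng.sample(idx, 2)
    centers = [list(points[a]), list(points[b])]
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        for i in idx:
            # Squared Euclidean distance to each of the two centres.
            d = [sum((x - c) ** 2 for x, c in zip(points[i], ctr))
                 for ctr in centers]
            groups[0 if d[0] <= d[1] else 1].append(i)
        if not groups[0] or not groups[1]:
            break
        new_centers = [
            [sum(col) / len(g) for col in zip(*(points[i] for i in g))]
            for g in groups
        ]
        if new_centers == centers:
            break
        centers = new_centers
    if not groups[0] or not groups[1]:
        # Degenerate case (e.g. identical points): fall back to an even split.
        half = len(idx) // 2
        groups = [idx[:half], idx[half:]]
    return groups


def bisecting_kmeans(points, max_size=100, seed=0):
    """Bisect clusters until none holds more than `max_size` members."""
    rng = random.Random(seed)
    pending = [list(range(len(points)))]
    done = []
    while pending:
        cluster = pending.pop()
        if len(cluster) <= max_size:
            done.append(cluster)
        else:
            pending.extend(_two_means(points, cluster, rng))
    return done


# Hypothetical embedded vectors for 250 sequences (two coordinates each).
points = [(float(i % 7), float(i // 7)) for i in range(250)]
clusters = bisecting_kmeans(points, max_size=100)
print([len(c) for c in clusters])
```

Every returned cluster holds at most `max_size` members, which mirrors the hard-wired threshold of 100 described above; a guide-tree would then be built within each cluster and among the clusters.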

In the profile-profile alignment phase, sequences are aligned in larger and larger groups, according to the branching order in the guide-tree. At each stage of this final step, two alignments are aligned. Initially these are single sequences, but they grow with the addition of new sequences as one traverses the guide-tree. The alignment of residues and the positioning of gaps during each profile-profile alignment are fixed and cannot be undone at a later profile-profile alignment higher up in the tree. The main algorithmic change over ClustalW2 is a new profile-profile engine, based on the HHalign software [5]. HHalign is entirely based on hidden Markov models (HMMs). Sequences and intermediate profiles are converted into HMMs, which are aligned in turn. It is also possible to input an HMM in addition to the unaligned sequences, and to use this external HMM during the profile-profile alignment stage. This is referred to as External Profile Alignment (EPA). There are two HMM alignment algorithms: the accurate but memory-hungry Maximum Accuracy (MAC) algorithm and the faster, less accurate, and more memory-efficient Viterbi algorithm. The MAC algorithm is the default, and Viterbi is activated automatically only if the system resources are exhausted.
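
The order in which profiles are merged follows from a post-order walk of the guide-tree. The sketch below shows only that ordering: the tree is a hypothetical nested tuple with sequence names as leaves, and the "profile" is just a list of member names standing in for a real alignment. No HMM alignment is performed.

```python
def progressive_order(tree, merges=None):
    """Return (members, merges) for a guide-tree given as nested 2-tuples.

    `merges` records each pairwise profile-profile step in the order it
    would be carried out; leaves are plain strings (sequence names).
    """
    if merges is None:
        merges = []
    if isinstance(tree, str):
        # A leaf is a single-sequence "profile".
        return [tree], merges
    left, right = tree
    left_profile, _ = progressive_order(left, merges)
    right_profile, _ = progressive_order(right, merges)
    # Both children are finished, so their profiles can now be aligned.
    merges.append((tuple(left_profile), tuple(right_profile)))
    return left_profile + right_profile, merges


# Hypothetical guide-tree over five sequences.
guide_tree = (("A", "B"), ("C", ("D", "E")))
members, merges = progressive_order(guide_tree)
for left, right in merges:
    print(left, "+", right)
```

Each recorded step pairs two groups that were completed earlier and are never revisited, which is the sense in which gaps placed at a lower node stay fixed higher up in the tree.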

Sequence input to Clustal Omega is handled by the Squid routines (http://selab.janelia.org/software.html), and permissible input formats are a2m (fasta/vienna), clustal, msf, phylip, selex, and stockholm. Output can be in the same formats.
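
As a rough illustration of how these options fit together, the sketch below assembles a `clustalo` command line without running it. The helper `clustalo_command` is hypothetical, and the flag names (`-i`, `-o`, `--outfmt`, `--hmm-in`, `--guidetree-in`) are taken from the program's help text as best recalled; they should be checked against `clustalo --help` for the installed version.

```python
# Output format names as listed in the text above.
ALLOWED_FORMATS = {
    "a2m", "fasta", "vienna", "clustal", "msf", "phylip", "selex", "stockholm",
}


def clustalo_command(infile, outfile, outfmt="clustal",
                     hmm_in=None, guidetree_in=None):
    """Build an argument list for a clustalo run (assumed flag names)."""
    if outfmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output format: {outfmt}")
    cmd = ["clustalo", "-i", infile, "-o", outfile, f"--outfmt={outfmt}"]
    if hmm_in is not None:
        # External Profile Alignment: supply an HMM alongside the sequences.
        cmd.append(f"--hmm-in={hmm_in}")
    if guidetree_in is not None:
        # An externally built guide-tree skips guide-tree construction.
        cmd.append(f"--guidetree-in={guidetree_in}")
    return cmd


# Hypothetical file names, for illustration only.
print(" ".join(clustalo_command("seqs.fa", "aln.sto", outfmt="stockholm",
                                hmm_in="family.hmm")))
```

The resulting list can be passed to `subprocess.run` once the flags have been confirmed locally.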

The maximum number of sequences and lengths that can be aligned will depend on the machine being used. The number of sequences primarily affects the distance matrix calculation. Storing an mBed matrix for N = 10,000 sequences takes up approximately 14 MB of memory. A full distance matrix would take up almost 400 MB. Both alternatives are clearly feasible on a modern desktop computer. For N = 100,000 the mBed matrix will take up 220 MB, while the full distance matrix will require about 40 GB, which may require a higher-end machine. The length of the individual input sequences also contributes to the memory
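
The distance-matrix memory figures quoted for N = 10,000 and N = 100,000 can be reproduced with a short calculation. It rests on three assumptions that the text does not state explicitly: each distance is stored as an 8-byte double, the logarithm in (log N)^2 is base 2, and only one triangle of the symmetric full matrix is kept.

```python
import math

BYTES_PER_DISTANCE = 8  # assumption: one double-precision value per entry


def mbed_bytes(n):
    """Approximate size of an N x (log2 N)^2 mBed matrix, in bytes."""
    seeds = math.log2(n) ** 2  # assumption: base-2 logarithm
    return n * seeds * BYTES_PER_DISTANCE


def full_matrix_bytes(n):
    """Size of one triangle of a full N x N distance matrix, in bytes."""
    return n * (n - 1) // 2 * BYTES_PER_DISTANCE  # assumption: symmetric storage


for n in (10_000, 100_000):
    print(f"N = {n:>7,}: mBed ~ {mbed_bytes(n) / 1e6:.0f} MB, "
          f"full ~ {full_matrix_bytes(n) / 1e6:,.0f} MB")
```

Under these assumptions the mBed matrix grows as N (log N)^2 while the full matrix grows as N^2, which is why the gap widens from roughly 30-fold at N = 10,000 to nearly 200-fold at N = 100,000.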
