GramAlign: Fast alignment driven by grammar-based phylogeny - Multiple Sequence Alignment Methods

Biology Reference

In-Depth Information

use edit distances, breakpoints, rearrangements, recombination,

comparative mapping, and gene order have been introduced

[ 21 - 30 ]. While these methods do not require multiple alignments

they still tend to be computationally expensive.

The relative complexity measure utilized in GRAMALIGN is an

alignment free distance measure which is also computationally

efficient. The use of the relative complexity measure for computing

phylogenetic distance was demonstrated via successful construction

of phylogenies for eutharians using complete mitochondrial gen-

omes [ 31 ]. This measure was further validated [ 32 ] by using it for

phylogenetic analysis of medically relevant fungal species using the

cytochrome b gene, the 18S rDNA gene and the ITS region. The

topology constructed using this measure was robust to random

removal of significant portions (up to 40 %) of the genes. We

have made use of the computational efficiency of this distance

metric in both GRAMALIGN and to develop a clustering algorithm

that was validated by clustering large 16S data sets [ 33 , 34 ]. GRA-

MALIGN itself has been validated and incorporated into the primer

design pipeline UniPrime2 [ 35 ]. Finally, this measure was used to

cluster protein families into functional subtypes [ 36 ]. In all these

applications the distance measure is used to obtain the phylogenetic

topology and thus what has been important has been the ordering

of the distances between sequences.

The relative complexity measure of distance is based on an approxi-

mation to Kolmogorov complexity [ 37 ] by Lempel-Ziv complexity

[ 38 ]. The Lempel-Ziv (LZ) complexity of a sequence is the num-

ber of distinct phrases in the left-to-right parsing of a sequence.

A distinct phrase is one which has not occurred in the history of a

sequence. Consider the sequence Q

2.2 Relative

Complexity Measure

gagacagt . Initially the his-

tory of the sequence is the empty string. As g is not in the empty

string g becomes the first distinct phrase. The history of the

sequence now consists of the single letter g .As a is not in

the history, a becomes the second unique phrase. The history of

the sequence is now ga . The next two letters are g and a . As the

sequence ga exists in the history we keep building the phrase until

we get to gac which is not in the history of the sequence. Thus, gac

is the third unique phrase. The history is now the sequence gagac .

The next unique phrase in the sequence is agt , ag being available in

the history. Thus the sequence gagacagt can be represented by four

unique phrases and therefore has an LZ complexity of four. We can

represent the parsing by

¼

agt

and the LZ complexity by c ( Q ). We can see that sequences that have

more diversity will result in higher LZ complexity and sequences

that are more uniform will result in smaller LZ complexity. If we

concatenated this string with another similar to it, the increase in

Q

¼

g

a

gac

Multiple Sequence Alignment Methods

Search WWH ::

Custom Search

Home