GramAlign: Fast alignment driven by grammar-based phylogeny - Multiple Sequence Alignment Methods


by scanning the only observations available (i.e., s_m and s_n). The grammar approximation improves as the observed sequences get longer, so the distance calculations depend on sequence length and become more accurate for longer sequences. In practice, this calculation works well for DNA/RNA sequences, even short ones, because the approximated grammar of a DNA/RNA sequence can only contain rules involving words built from the alphabet {“A,” “C,” “G,” “T/U”}. This small alphabet allows a reasonable grammar to be generated rapidly, since the number of possible letter combinations is relatively small.
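The idea of building a dictionary of words by scanning a sequence can be pictured with a small sketch. The code below uses an LZ78-style incremental parse purely as an illustrative stand-in; it is an assumption, not the grammar construction behind Eq. 1, and the function name `lz78_phrases` is hypothetical.

```python
def lz78_phrases(seq):
    """Split seq into phrases, each being the shortest prefix of the
    remaining text that has not been emitted as a phrase before
    (LZ78-style parse, used here only as an illustration)."""
    seen = set()
    phrases = []
    current = ""
    for ch in seq:
        current += ch
        if current not in seen:
            # First time this word appears: record it as a new rule.
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Trailing text that only repeats an already-seen word.
        phrases.append(current)
    return phrases


example = lz78_phrases("AACGACGT")
print(example)
```

With only four letters, short words are quickly exhausted and the parse moves on to longer, more informative words after scanning relatively little sequence.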

From a grammar perspective, amino acid sequences are generally much more difficult to process correctly using Eq. 1. The reason is that the alphabet contains 23 letters, and the elements are not all equally different from one another. Because the alphabet is relatively large, much longer sequences are necessary to generate a reasonable grammar approximation. Thus, the accuracy of distances calculated for sets of short amino acid sequences is diminished. Consider the substitution scores of “L” and “M” as taken from the GONNET250 and BLOSUM62 substitution matrices in Fig. 2. Notice in (a) and (c) that “L” receives a relatively high positive value when aligned with any of {“I,” “L,” “M,” “V”}. In (b) and (d), “M” receives a relatively high positive value when aligned with any of the same set. Additionally, both “L” and “M” generally receive strongly negative values when compared to letters other than {“I,” “L,” “M,” “V”}. Taking this type of scoring into account, the elements “L” and “M” could be considered the same letter in a grammatical sense.
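The alphabet-size argument made earlier in this paragraph can be made concrete with a little arithmetic. The sketch below counts how many distinct words of a given length are possible over a 4-letter and a 23-letter alphabet, under the simplifying assumption that every string of letters is a possible word; the helper name `possible_words` is hypothetical.

```python
def possible_words(alphabet_size, word_length):
    """Number of distinct strings of word_length letters over an
    alphabet with alphabet_size letters."""
    return alphabet_size ** word_length


# Compare the nucleotide alphabet (4 letters) with the
# amino acid alphabet described in the text (23 letters).
for k in (1, 2, 3, 4):
    print(k, possible_words(4, k), possible_words(23, k))
```

For words of length 3, the nucleotide alphabet allows 64 words while the amino acid alphabet allows 12,167, which suggests why far more sequence must be observed before repeated words become common.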

Thus, GRAMALIGN offers the option to use a “Merged Amino Acid Alphabet” when calculating the distance matrix. The merged alphabet contains 11 elements corresponding to the 23 amino acid letters grouped into the sets {“A,” “S,” “T,” “X”}, {“B,” “D,” “N”}, {“C”}, {“E,” “K,” “Q,” “R,” “Z”}, {“F”}, {“G”}, {“H”}, {“I,” “L,” “M,” “V”}, {“P”}, {“W”}, and {“Y”}. These groupings were determined by considering all 23 rows of the BLOSUM45, BLOSUM62, BLOSUM80, and GONNET250 substitution matrices, and only grouping elements that had a strong similarity across the entire row in all four matrices. The merged alphabet has the benefit of containing fewer elements, allowing for more accurate distance estimates based upon shorter observed sequences. In practice, average alignment scores increase when aligning the same data sets using the merged alphabet within the distance calculation, as compared to using the actual alphabet.
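The grouping listed above can be sketched as a simple lookup table. The 11 groups are taken from the text; using the first letter of each group as its representative symbol, and the names `MERGED_GROUPS`, `MERGE_MAP`, and `merge_sequence`, are assumptions made for illustration and not necessarily how GRAMALIGN encodes the merged alphabet internally.

```python
# The 11 merged groups as listed in the text.
MERGED_GROUPS = [
    "ASTX", "BDN", "C", "EKQRZ", "F", "G", "H", "ILMV", "P", "W", "Y",
]

# Map every amino acid letter to one symbol per group.
# Assumption: the first letter of each group stands for the group.
MERGE_MAP = {
    letter: group[0] for group in MERGED_GROUPS for letter in group
}


def merge_sequence(seq):
    """Rewrite an amino acid sequence over the merged alphabet.
    Characters outside the 23 letters (e.g. gaps) pass through unchanged."""
    return "".join(MERGE_MAP.get(ch, ch) for ch in seq.upper())


print(len(MERGE_MAP), len(set(MERGE_MAP.values())))
print(merge_sequence("MLKV"))
```

In this sketch the merged sequence would be used only when computing distances; the alignment itself would still operate on the original letters.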

2.5 Progressive Alignment

Once the distances have been calculated, a minimal spanning tree based on these distances is used to determine the order in which sequences should be pairwise aligned. At the core of most progressive MSA algorithms is some method for performing pairwise
