by scanning the only observations available (i.e., s_m and s_n).
The grammar approximation improves as the length of the
observed sequences increases. The distance calculations are
therefore a function of sequence length and become more accurate
as the sequences grow longer. In practice, this calculation
works well for DNA/RNA sequences, even of shorter lengths,
because the approximated grammar of a DNA/RNA sequence
can only contain rules involving words composed of combinations
of elements from the alphabet {“A,” “C,” “G,” “T/U”}. This small
alphabet allows a reasonable grammar to be generated rapidly,
because there are relatively few possible combinations of letters.
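As a rough illustration of how a grammar can be built by scanning a sequence, the sketch below performs a simple LZ78-style parse and returns the phrases it finds. This parse is an assumed stand-in for the grammar approximation; it is not the calculation from Eq. 1, and the function name lz78_phrases is hypothetical.

```python
def lz78_phrases(seq):
    """Return the phrases from a simple LZ78-style parse of seq.

    Assumption: this stands in for the grammar approximation discussed
    above; it is not the source's Eq. 1.
    """
    seen = set()      # phrases already added to the dictionary
    phrases = []      # phrases in the order they were discovered
    current = ""
    for ch in seq:
        current += ch
        if current not in seen:
            # First time this word is observed: record it as a new rule.
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Leftover suffix that matched an existing phrase.
        phrases.append(current)
    return phrases


print(lz78_phrases("AACGACG"))  # ['A', 'AC', 'G', 'ACG']
```

With only four letters, short words repeat quickly, so even a short sequence yields longer phrases after a few steps.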
From a grammar perspective, amino acid sequences are generally
much more difficult to process correctly using Eq. 1. The alphabet
contains 23 letters, and the elements are not all equally different
from one another. Because of the relatively large alphabet size, much
longer sequences are necessary to generate a reasonable grammar
approximation, so the accuracy of distances calculated for sets of
short amino acid sequences is diminished. Consider the substitution
scores of “L” and “M” as taken from the GONNET250 and BLOSUM62
substitution matrices in Fig. 2. In (a) and (c), “L” receives a
relatively high positive value when aligned with any of {“I,” “L,”
“M,” “V”}. In (b) and (d), “M” receives a relatively high positive
value when aligned with any of the same set. Additionally, both “L”
and “M” generally receive strongly negative values when compared to
letters outside {“I,” “L,” “M,” “V”}. Taking this type of scoring into
account, the elements “L” and “M” could be considered the same letter
in a grammatical sense.
Thus, GRAMALIGN offers the option to use a “Merged Amino
Acid Alphabet” when calculating the distance matrix. The merged
alphabet contains 11 elements corresponding to the 23 amino acid
letters grouped into the sets {“A,” “S,” “T,” “X”}, {“B,” “D,” “N”},
{“C”}, {“E,” “K,” “Q,” “R,” “Z”}, {“F”}, {“G”}, {“H”}, {“I,” “L,”
“M,” “V”}, {“P”}, {“W”}, and {“Y”}. These groupings were determined
by considering all 23 rows of the BLOSUM45, BLOSUM62, BLOSUM80,
and GONNET250 substitution matrices, and grouping only elements
that had a strong similarity across the entire row in all four
matrices. The merged alphabet has the benefit of containing fewer
elements, allowing for more accurate distance estimates based upon
shorter observed sequences. In practice, average alignment scores
increase when the same data sets are aligned using the merged
alphabet within the distance calculation, as compared to using the
actual alphabet.
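The merged alphabet can be sketched as a simple letter mapping. The groups below are copied from the text; representing each group by its first letter, the names MERGED_GROUPS, MERGE_MAP, merge_sequence, and possible_words, and the handling of unknown letters are all assumptions made for illustration, not GRAMALIGN's actual implementation.

```python
# Groups copied from the text: 11 sets covering the 23 amino acid letters.
MERGED_GROUPS = [
    "ASTX", "BDN", "C", "EKQRZ", "F", "G", "H", "ILMV", "P", "W", "Y",
]

# Hypothetical choice: each group is represented by its first letter.
MERGE_MAP = {letter: group[0] for group in MERGED_GROUPS for letter in group}


def merge_sequence(seq):
    """Rewrite an amino acid sequence over the merged alphabet.

    Letters outside the 23-letter alphabet are left unchanged
    (an assumption; the source does not say how they are handled).
    """
    return "".join(MERGE_MAP.get(ch, ch) for ch in seq.upper())


def possible_words(alphabet_size, max_len):
    """Count the distinct words of length 1..max_len over an alphabet."""
    return sum(alphabet_size ** k for k in range(1, max_len + 1))


print(merge_sequence("MLIV"))  # IIII
for size in (4, 11, 23):
    # Words of up to three letters: 84, 1463, and 12719 respectively.
    print(size, possible_words(size, 3))
```

The word counts suggest why fewer elements help: with words of up to three letters, the merged alphabet admits 1463 possibilities against 12719 for the full 23-letter alphabet, so repeated words appear in much shorter sequences.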
2.5 Progressive Alignment
Once the distances have been calculated, a minimal spanning tree
based on these distances is used to determine the order in which
sequences should be pairwise aligned. At the core of most progressive
MSA algorithms is some method for performing pairwise