GramAlign: Fast alignment driven by grammar-based phylogeny - Multiple Sequence Alignment Methods

Biology Reference

In-Depth Information

complexity will be quite small. For example, consider the sequence

R

¼

gagacat and examine the parsing of

the concatenated

sequence QR :

gagacat

The concatenated sequence QR while being almost twice as long as

the sequence Q has an LZ complexity of five, which is only one

more than the LZ complexity of Q . If, on the other hand, we

concatenate a sequence with a very different composition to Q ,

we will get a greater increase in LZ complexity. Consider the

sequence tcgctta and the parsing of the concatenation

QT

QR

¼

g

a

gac

agt

¼

g

a

gac

agt

tc

gc

tta

with an LZ complexity of seven. In this case the LZ complexity has

grown in proportion to the length of the sequence.

We have discussed the analysis of the sequence as simply a

parsing. We can also view this same analysis as a way of discovering

grammar rules which can be used to generate a sequence. Thus the

entry gac in our parsing can be viewed as a rule for generating the

letter following ga . From this perspective, the elements in our

parsing can be viewed as a dictionary of grammar rules which can

be used to generate sequences belonging to the same grammatical

family as the sequence being analyzed. Thus, any increase in com-

plexity when analyzing a concatenation of two sequences is a result

of the need to include additional grammar rules for generating the

latter sequence. The more different two sequences are the more

the need for additional grammar rules. As this is a more general

framework we will use this in our subsequent discussion. We intro-

duce some notation in the following to formalize this approach.

The distance calculation begins by generating a dictionary of gram-

mar rules for each sequence. Let the notation G m define the gram-

mar dictionary for the m th sequence. Further, while the dictionary is

under construction, the sequence will be scanned one residue at a

time from left-to-right. We add a superscript on the dictionary

notation, G m k , to indicate the current dictionary up to the k th

residue with sequence m . Let f be the current substring “fragment”

of the sequence that has not yet been added to the dictionary.

A fragment is composed via concatenated symbols taken one at a

time from the sequence being scanned. After each iteration, the

sequence is checked for a matching fragment. If it is not found,

the fragment is unique and added to the dictionary before it is reset

to the empty string. As in the case of the dictionary, we use the

superscript f k to indicate the fragment location within the sequence.

Finally, we use the term “visible sequence” to mean the left-most

windowed portion of a sequence that has already been processed.

So, at the k th iteration, the visible sequence would include only the

first k residues of a scanned sequence.

2.3 Grammar-Based

Distance Estimation

Multiple Sequence Alignment Methods

Search WWH ::

Custom Search

Home