Biology Reference
In-Depth Information
complexity will be quite small. For example, consider the sequence
R
¼
gagacat and examine the parsing of
the concatenated
sequence QR :
gagacat
The concatenated sequence QR while being almost twice as long as
the sequence Q has an LZ complexity of five, which is only one
more than the LZ complexity of Q . If, on the other hand, we
concatenate a sequence with a very different composition to Q ,
we will get a greater increase in LZ complexity. Consider the
sequence tcgctta and the parsing of the concatenation
QT
QR
¼
g
a
gac
agt
¼
g
a
gac
agt
tc
gc
tta
with an LZ complexity of seven. In this case the LZ complexity has
grown in proportion to the length of the sequence.
We have discussed the analysis of the sequence as simply a
parsing. We can also view this same analysis as a way of discovering
grammar rules which can be used to generate a sequence. Thus the
entry gac in our parsing can be viewed as a rule for generating the
letter following ga . From this perspective, the elements in our
parsing can be viewed as a dictionary of grammar rules which can
be used to generate sequences belonging to the same grammatical
family as the sequence being analyzed. Thus, any increase in com-
plexity when analyzing a concatenation of two sequences is a result
of the need to include additional grammar rules for generating the
latter sequence. The more different two sequences are the more
the need for additional grammar rules. As this is a more general
framework we will use this in our subsequent discussion. We intro-
duce some notation in the following to formalize this approach.
The distance calculation begins by generating a dictionary of gram-
mar rules for each sequence. Let the notation G m define the gram-
mar dictionary for the m th sequence. Further, while the dictionary is
under construction, the sequence will be scanned one residue at a
time from left-to-right. We add a superscript on the dictionary
notation, G m k , to indicate the current dictionary up to the k th
residue with sequence m . Let f be the current substring “fragment”
of the sequence that has not yet been added to the dictionary.
A fragment is composed via concatenated symbols taken one at a
time from the sequence being scanned. After each iteration, the
sequence is checked for a matching fragment. If it is not found,
the fragment is unique and added to the dictionary before it is reset
to the empty string. As in the case of the dictionary, we use the
superscript f k to indicate the fragment location within the sequence.
Finally, we use the term “visible sequence” to mean the left-most
windowed portion of a sequence that has already been processed.
So, at the k th iteration, the visible sequence would include only the
first k residues of a scanned sequence.
2.3 Grammar-Based
Distance Estimation
Search WWH ::




Custom Search