Biology Reference
In-Depth Information
Initially, each dictionary G m 1
¼;
is empty, a fragment
f 1
s m (1) is set to the first residue of the corresponding sequence,
and only the first element s m (1) is visible to the algorithm. At
the k th iteration of the procedure, the k th residue is appended
to the k
¼
1 fragment and the visible sequence is checked. If
f k
1) then f k
2
s m (1,
, k
is considered a new rule, and so
...
added to the dictionary G m ¼
G k 1
m
f k
[f
g
, and the fragment is
reset, f k
. However, if f k
1), then the current
dictionary contains enough rules to produce the current fragment,
i.e., G m ¼
¼;
s m (1,
, k
...
G k m . In either case, the iteration completes by append-
ing the k th residue to the visible sequence. This procedure con-
tinues until the visible sequence is equal to the entire sequence, at
which time the size of the dictionary | G m | is recorded for the
next step. As was described in [ 31 ], calculating the distance
between sequences requires only the number of entries in the
dictionary.
For the next step in generating the distance matrix, each
sequence is compared with all other sequences. In particular, con-
sider the process of comparing sequences m and n . Initially, the
dictionary G m , n 1
¼
G m is set to that of sequence m , a fragment
f 1
s n (1) is set to the first residue of the n th sequence, and the
visible sequence is all of s m . The algorithm operates as described
previously, resulting in a new dictionary size | G m , n |. When com-
plete, more grammatically-similar sequences will have a new dictio-
nary size with fewer entries as compared to sequences that are less
grammatically-similar. Therefore, the size of the new dictionary
| G m , n | will be close to the size of the original dictionary | G m | for
grammatically similar sequences.
The distance between the sequences is then estimated using
combinations of the old and new dictionary sizes. Five different
distance measures were suggested in [ 31 ]. GRAMALIGN uses the
distance measure
¼
G m;n
G m þ
G n;m
G n
d m;n ¼
G m;n þ G n;m
2
;
(1)
, N } are indices of two sequences being
compared. This particular metric accounts for differences in
sequence lengths and normalizes accordingly. Thus, the final dis-
tance matrix D is composed of grammar-based distance entries
given by Eq. 1 . Smaller entries in D indicate a stronger similarity,
at least in terms of the LZ-based grammar estimate.
where m , n
{1,
...
The distance between sequences m and n as determined by Eq. 1 is
based on how many additional rules need to be added to each
grammar in order to generate both s m and s n . Because the real
grammars
2.4 Sequence
Alphabets
are unknown, G m and G n
are
approximated
Search WWH ::




Custom Search