GramAlign: Fast alignment driven by grammar-based phylogeny - Multiple Sequence Alignment Methods

Biology Reference

In-Depth Information

Initially, each dictionary G m 1

¼;

is empty, a fragment

f 1

s m (1) is set to the first residue of the corresponding sequence,

and only the first element s m (1) is visible to the algorithm. At

the k th iteration of the procedure, the k th residue is appended

to the k

1 fragment and the visible sequence is checked. If

f k

1) then f k

s m (1,

, k

is considered a new rule, and so

...

added to the dictionary G m ¼

G k 1

f k

, and the fragment is

reset, f k

. However, if f k

1), then the current

dictionary contains enough rules to produce the current fragment,

i.e., G m ¼

¼;

s m (1,

, k

∈

...

G k m . In either case, the iteration completes by append-

ing the k th residue to the visible sequence. This procedure con-

tinues until the visible sequence is equal to the entire sequence, at

which time the size of the dictionary | G m | is recorded for the

next step. As was described in [ 31 ], calculating the distance

between sequences requires only the number of entries in the

dictionary.

For the next step in generating the distance matrix, each

sequence is compared with all other sequences. In particular, con-

sider the process of comparing sequences m and n . Initially, the

dictionary G m , n 1

G m is set to that of sequence m , a fragment

f 1

s n (1) is set to the first residue of the n th sequence, and the

visible sequence is all of s m . The algorithm operates as described

previously, resulting in a new dictionary size | G m , n |. When com-

plete, more grammatically-similar sequences will have a new dictio-

nary size with fewer entries as compared to sequences that are less

grammatically-similar. Therefore, the size of the new dictionary

| G m , n | will be close to the size of the original dictionary | G m | for

grammatically similar sequences.

The distance between the sequences is then estimated using

combinations of the old and new dictionary sizes. Five different

distance measures were suggested in [ 31 ]. GRAMALIGN uses the

distance measure

G m;n

G m þ

G n;m

G n

d m;n ¼

G m;n þ G n;m

;

(1)

, N } are indices of two sequences being

compared. This particular metric accounts for differences in

sequence lengths and normalizes accordingly. Thus, the final dis-

tance matrix D is composed of grammar-based distance entries

given by Eq. 1 . Smaller entries in D indicate a stronger similarity,

at least in terms of the LZ-based grammar estimate.

where m , n

{1,

∈

...

The distance between sequences m and n as determined by Eq. 1 is

based on how many additional rules need to be added to each

grammar in order to generate both s m and s n . Because the real

grammars

2.4 Sequence

Alphabets

are unknown, G m and G n

are

approximated

Multiple Sequence Alignment Methods

Search WWH ::

Custom Search

Home