Dynamic Programming - Multiple Sequence Alignment Methods

Biology Reference

In-Depth Information

for both sequences in the pair. The matching subsequences are

determined and located in the alignment grid. Finding matching

pairs can be implemented efficiently using indexing techniques like

hash-tables, so that it can be completed in linear time. Subse-

quences of selected sizes are likely to occur in several alignments

other than the optimal ones. Therefore a filtering phase to find the

significant hotspots is followed. It is assumed that if selected sub-

string matchings are placed on the same diagonal, and they are close

to each other this cluster is likely to be on an optimal path. Since the

cells c ( i , j ) located on the same diagonal line have the same i

j

value, they can easily be grouped together and the clusters within a

close neighborhood are selected as the seeds. FASTA picks ten top

scoring clusters. The best scoring cluster is selected as the center of

an alignment band, and banded dynamic programming is employed

to connect the seeds.

Using short subsequence matches as in the case of FASTA

program is sensitive, but this approach usually generates high num-

ber of regions which are not a part of an optimal path. BLAST

programs, on the other hand, use longer fragments (default 3

residues for proteins, 11 nucleotides for DNA/RNA). Statistical

significance of each hit is determined by a statistical model using the

random occurrence of a seed candidate, where the random occur-

rence is estimated from the background frequencies of the residues.

The statistical modelling provides confidence scores for each align-

ment found, that is useful since the alignments are approximate

solutions. The seeds are extended in each direction to find local

alignments. A simple bounded dynamic program is recruited for

the gapped extension. The algorithm searches for the optimal paths

not dropping below a predetermined threshold score at some

iteration.

Seeding operation can be improved to an adaptive procedure

by using variable length subsequences. The hash-table data struc-

tures indexing the fixed-length subsequences as in the case of

FASTA and BLAST class of programs are replaced by suffix-based

structures such as suffix array and FM-index in variable length

seeding [ 24 ]. The seed length is extended iteratively until a certain

number of sequences below a threshold score are left. The dynamic

programming procedure followed after the seeding phase is similar

to BLAST programs.

4 Dynamic Programming and Multiple Sequence Alignment

Pairwise sequence alignment techniques can be generalized to

multiple sequence alignment in a straightforward way due to the

multiple alignment scoring scheme called the sum-of-pairs scoring.

In sum-of-pairs, every pair of residues aligned is evaluated. Accord-

ing to this, the scores every residues pair (or gaps) in an alignment

Multiple Sequence Alignment Methods

Search WWH ::

Custom Search

Home