Biology Reference
In-Depth Information
for both sequences in the pair. The matching subsequences are
determined and located in the alignment grid. Finding matching
pairs can be implemented efficiently using indexing techniques like
hash-tables, so that it can be completed in linear time. Subse-
quences of selected sizes are likely to occur in several alignments
other than the optimal ones. Therefore a filtering phase to find the
significant hotspots is followed. It is assumed that if selected sub-
string matchings are placed on the same diagonal, and they are close
to each other this cluster is likely to be on an optimal path. Since the
cells c ( i , j ) located on the same diagonal line have the same i
j
value, they can easily be grouped together and the clusters within a
close neighborhood are selected as the seeds. FASTA picks ten top
scoring clusters. The best scoring cluster is selected as the center of
an alignment band, and banded dynamic programming is employed
to connect the seeds.
Using short subsequence matches as in the case of FASTA
program is sensitive, but this approach usually generates high num-
ber of regions which are not a part of an optimal path. BLAST
programs, on the other hand, use longer fragments (default 3
residues for proteins, 11 nucleotides for DNA/RNA). Statistical
significance of each hit is determined by a statistical model using the
random occurrence of a seed candidate, where the random occur-
rence is estimated from the background frequencies of the residues.
The statistical modelling provides confidence scores for each align-
ment found, that is useful since the alignments are approximate
solutions. The seeds are extended in each direction to find local
alignments. A simple bounded dynamic program is recruited for
the gapped extension. The algorithm searches for the optimal paths
not dropping below a predetermined threshold score at some
iteration.
Seeding operation can be improved to an adaptive procedure
by using variable length subsequences. The hash-table data struc-
tures indexing the fixed-length subsequences as in the case of
FASTA and BLAST class of programs are replaced by suffix-based
structures such as suffix array and FM-index in variable length
seeding [ 24 ]. The seed length is extended iteratively until a certain
number of sequences below a threshold score are left. The dynamic
programming procedure followed after the seeding phase is similar
to BLAST programs.
4 Dynamic Programming and Multiple Sequence Alignment
Pairwise sequence alignment techniques can be generalized to
multiple sequence alignment in a straightforward way due to the
multiple alignment scoring scheme called the sum-of-pairs scoring.
In sum-of-pairs, every pair of residues aligned is evaluated. Accord-
ing to this, the scores every residues pair (or gaps) in an alignment
Search WWH ::




Custom Search