Biomedical Engineering Reference
In-Depth Information
before attempting the very time consuming optimized search. The trade-off between
speed and sensitivity is controlled by the k -tuple parameter, which specifies the size
of the word. Increasing the k -tuple decreases the number of background hits. The
FASTA program does not investigate every word hit encountered, but instead looks
initially for segments containing several nearby hits. Using a heuristic method, these
segments are assigned scores and the score of the best segment found appears in
the output. For those alignments finally reported, a full Smith-Waterman alignment
search is performed.
BLAST [4] is another heuristic-based algorithm for sequence homology search.
As in FASTA, it finds database sequences that have k consecutive matches to the
query sequence. The value of k is 3 for protein sequence and 11 for DNA sequence.
Several variations [4, 8] of the original BLAST algorithm were developed to accom-
modate different types of sequence alignments. For example, MEGABLAST uses the
XDrop alignment algorithm [8]. It is particularly tuned for the alignment of two DNA
sequences that are highly similar. This algorithm is computationally efficient because
it considers long runs of identical adjacent nucleotides. If the two sequences differ
by 3%, the expected length of the run is 30 nucleotides. The algorithm is also com-
putationally efficient because it completely avoids the use of dynamic programming
even in a limited context. It uses a greedy algorithm instead. A greedy algorithm is
one type of a heuristic that is developed with the assumption that a global optimal
can be obtained by making a sequence of local optimal decisions, whereas dynamic
programming is a global optimization algorithm. XDrop was used to align entire
genomes and it was found [8] to be 10 times faster than BLAST for long and highly
similar sequences.
PSI-BLAST [4] executes several iterations of the BLAST algorithm. However,
the scoring matrix, which is used to score the pairwise alignment, is not fixed in
PSI-BLAST. The scoring matrix includes the weights corresponding to a match for
all types of nucleotides or amino acids. After every iteration, the top ranking target
sequences in the pairwise alignment are used to update the scoring matrix. In addition,
PSI-BLAST uses a position-specific scoring matrix where two matching residues are
assigned a score based not only on the importance of the match but also on the position
of the residue in the sequence. PSI-BLAST is more sensitive than BLAST in detecting
weak sequence similarities.
Regardless of the algorithm used, in the case of pairwise alignment, an input
sequence is aligned against a list of target sequences from a database resulting in
multiple pairwise alignments. In each pairwise alignment, one of the two sequences
is the input sequence and the other one is sequence from the database. This process
can be parallelized in two ways: (1) multiple pairwise alignments can be executed in
parallel (coarse-grain parallelism) and (2) a parallel version of the alignment algorithm
can be used to speedup each individual pairwise alignment (fine-grain parallelism).
Given a set of input sequences, ClustalW [9] implements multiple alignments
using a tree-based method. Pairwise alignments are first constructed for each pair
of sequences from the input set. These alignments are used to construct a similarity
matrix. Each entry in the matrix represents the similarity distance between any two
Search WWH ::




Custom Search