Parallel Implementations of Local Sequence Alignment: Hardware and Software - Parallel Computing for Bioinformatics and Computational Biology

Biomedical Engineering Reference

In-Depth Information

before attempting the very time consuming optimized search. The trade-off between

speed and sensitivity is controlled by the k -tuple parameter, which specifies the size

of the word. Increasing the k -tuple decreases the number of background hits. The

FASTA program does not investigate every word hit encountered, but instead looks

initially for segments containing several nearby hits. Using a heuristic method, these

segments are assigned scores and the score of the best segment found appears in

the output. For those alignments finally reported, a full Smith-Waterman alignment

search is performed.

BLAST [4] is another heuristic-based algorithm for sequence homology search.

As in FASTA, it finds database sequences that have k consecutive matches to the

query sequence. The value of k is 3 for protein sequence and 11 for DNA sequence.

Several variations [4, 8] of the original BLAST algorithm were developed to accom-

modate different types of sequence alignments. For example, MEGABLAST uses the

XDrop alignment algorithm [8]. It is particularly tuned for the alignment of two DNA

sequences that are highly similar. This algorithm is computationally efficient because

it considers long runs of identical adjacent nucleotides. If the two sequences differ

by 3%, the expected length of the run is 30 nucleotides. The algorithm is also com-

putationally efficient because it completely avoids the use of dynamic programming

even in a limited context. It uses a greedy algorithm instead. A greedy algorithm is

one type of a heuristic that is developed with the assumption that a global optimal

can be obtained by making a sequence of local optimal decisions, whereas dynamic

programming is a global optimization algorithm. XDrop was used to align entire

genomes and it was found [8] to be 10 times faster than BLAST for long and highly

similar sequences.

PSI-BLAST [4] executes several iterations of the BLAST algorithm. However,

the scoring matrix, which is used to score the pairwise alignment, is not fixed in

PSI-BLAST. The scoring matrix includes the weights corresponding to a match for

all types of nucleotides or amino acids. After every iteration, the top ranking target

sequences in the pairwise alignment are used to update the scoring matrix. In addition,

PSI-BLAST uses a position-specific scoring matrix where two matching residues are

assigned a score based not only on the importance of the match but also on the position

of the residue in the sequence. PSI-BLAST is more sensitive than BLAST in detecting

weak sequence similarities.

Regardless of the algorithm used, in the case of pairwise alignment, an input

sequence is aligned against a list of target sequences from a database resulting in

multiple pairwise alignments. In each pairwise alignment, one of the two sequences

is the input sequence and the other one is sequence from the database. This process

can be parallelized in two ways: (1) multiple pairwise alignments can be executed in

parallel (coarse-grain parallelism) and (2) a parallel version of the alignment algorithm

can be used to speedup each individual pairwise alignment (fine-grain parallelism).

Given a set of input sequences, ClustalW [9] implements multiple alignments

using a tree-based method. Pairwise alignments are first constructed for each pair

of sequences from the input set. These alignments are used to construct a similarity

matrix. Each entry in the matrix represents the similarity distance between any two

Parallel Computing for Bioinformatics and Computational Biology

Search WWH ::

Custom Search

Home