Parallel Implementations of Local Sequence Alignment: Hardware and Software - Parallel Computing for Bioinformatics and Computational Biology

Biomedical Engineering Reference

In-Depth Information

is a set of sequences of different lengths. In essence, in a typical coarse-grained

parallel implementation, one of the processors acts as a “master”, dispatching blocks

of sequences to the “slaves” which, in turn, perform the algorithm calculations. When

the slaves report results for one block, the master sends a new block. This strategy is

possible because results from the comparison between query and database sequences

are independent of the previous results deriving from the comparison of the query

with other sequences.

However, the time required in the processing of any given sequence depends not

only on the length of the sequence but also on its composition. Therefore, the use of a

dynamic load balancing strategy is necessary. The simplest way is to modify the way in

which the master processor distributes the load on demand from the slaves. Obviously,

sending one-sequence messages introduces additional expensive time overhead due

to the high number of messages interchanged. Thus, rather than distributing messages

sequence-by-sequence, better results are achieved by dispatching blocks of sequences.

10.4 FASTA

FASTA [7] finds homologous sequences using a four-step process. First, sequences

that have sub-sequences of at least k adjacent residues, which match sub-sequences

in the query sequence, are identified. The recommended value of k for a protein

sequence alignment is 2 and for a DNA sequence alignment is between 4 and 6. The

second step combines groups of these matching sub-sequences into longer matching

regions called initial regions. Each initial region consists of one or more matching

sub-sequence separated by mismatching regions of residues. Gaps are not allowed

within the mismatching regions. That is, the number of residues between two con-

secutive matching sub-sequences within the same initial region has to be the same

in the query sequence and the target sequence. These initial regions are scored and

the best 10 initial regions are selected. During the third step, dynamic program-

ming is used to combine nearby initial regions and new scores are assigned to the

combined regions. This is an example of how dynamic programming is used in a

limited context (i.e., only for nearby initial regions) within heuristics. In this third

step, mismatching regions between the initial regions may contain gaps. The scores

generated in the third step are used to rank the database sequences. During the fourth

step, the Smith-Waterman [6] algorithm is applied to the top ranking sequences from

the previous step. Specifically, this algorithm is used to align the query sequence

and the database sequences within the selected initial regions and their neighboring

residues. The fourth step in FASTA is another example of the use of dynamic pro-

gramming in a limited context within a heuristic based alignment algorithm. The

Smith-Waterman [6] algorithm is only used for top ranking sequences and only

within selected regions of these sequences. Limiting the use of dynamic program-

ming increases the computational efficiency of the alignment algorithm. However, it

also means that the generated solution is only a sub-optimal solution rather than an

optimal one.

Search WWH ::

Custom Search

Home