Computational Molecular Biology - Parallel Computing for Bioinformatics and Computational Biology

Biomedical Engineering Reference

In-Depth Information

the matches and mismatches. The 10 regions that have the highest score are selected

and the first step ends.

At the second step, new scores are associated with the best 10 regions using a

substitution matrix. The best score obtained this way is called the initial score , which

is the first similarity measure between s and t .

The initial score is used in the third step to rank the sequences in the database. Fasta

also attempts to merge the best 10 regions in such a way that gaps are restrictively

treated, that is, gaps are considered only in the junction of two regions [22]. The

best region combination is calculated using a dynamic programming algorithm and

the score for this combination is called initn . The score initn is calculated for every

sequence in the database and only sequences which have this score above a given

threshold are considered in the fourth step.

In the fourth step, the k -band variant of the SW algorithm (Section 6.3.3) is executed

for each pair of sequences to generate the alignments.

6.4.2 BLAST

BLAST was proposed by Altschul et al. [1] in 1990. It is based on an heuristic

algorithm which was designed to run fast, while still maintaining high sensibility.

The first BLAST version searched for local alignments without considering gaps.

Its motivation was to improve the performance of the Fasta algorithm by obtaining less

k -tuples with higher quality. This was achieved by integrating the use of PAM matrices

in the first step of the algorithm. In 1996 and 1997, improved gapped versions of the

original BLAST, NCBI-BLAST2, and WU-BLAST2, were proposed by Altschul

et al. [8] and States and Gish [29], respectively.

The BLAST algorithm is divided into three well-defined phases: seeding, exten-

sion, and evaluation.

6.4.2.1 Seeding As in Fasta, BLAST also compares in the first phase a query

sequence against all sequences in a database. BLAST uses the concept of words,

which is very similar to the k -tuples concept (Section 6.4.1). A word is defined to

be a finite set of letters with length w that appear in a given sequence. For instance,

the sequence TCACGA contains four words with length three: TCA, CAC, ACG,

and CGA. The BLAST algorithm assumes that significant alignments have words in

common.

In this first phase, the location of all shared w -letter words between sequences

s and t is determined by doing exact pattern matching. These locations are known

as identical words . Only regions with identical words can be used as seeds for the

alignment.

For the cases where significant alignments do not contain words in common, the

concept of neighborhood is used. The neighbor of a word includes the word itself

and every other word whose score is at least equal to T when compared through a

substitution matrix. For instance, if we consider T

11 and a substitution matrix

PAM200, RGD, and KGD are neighbors, then the score between them is 14 [17].

=

Parallel Computing for Bioinformatics and Computational Biology

Search WWH ::

Custom Search

Home