Biomedical Engineering Reference
In-Depth Information
the matches and mismatches. The 10 regions that have the highest score are selected
and the first step ends.
At the second step, new scores are associated with the best 10 regions using a
substitution matrix. The best score obtained this way is called the initial score , which
is the first similarity measure between s and t .
The initial score is used in the third step to rank the sequences in the database. Fasta
also attempts to merge the best 10 regions in such a way that gaps are restrictively
treated, that is, gaps are considered only in the junction of two regions [22]. The
best region combination is calculated using a dynamic programming algorithm and
the score for this combination is called initn . The score initn is calculated for every
sequence in the database and only sequences which have this score above a given
threshold are considered in the fourth step.
In the fourth step, the k -band variant of the SW algorithm (Section 6.3.3) is executed
for each pair of sequences to generate the alignments.
6.4.2 BLAST
BLAST was proposed by Altschul et al. [1] in 1990. It is based on an heuristic
algorithm which was designed to run fast, while still maintaining high sensibility.
The first BLAST version searched for local alignments without considering gaps.
Its motivation was to improve the performance of the Fasta algorithm by obtaining less
k -tuples with higher quality. This was achieved by integrating the use of PAM matrices
in the first step of the algorithm. In 1996 and 1997, improved gapped versions of the
original BLAST, NCBI-BLAST2, and WU-BLAST2, were proposed by Altschul
et al. [8] and States and Gish [29], respectively.
The BLAST algorithm is divided into three well-defined phases: seeding, exten-
sion, and evaluation.
6.4.2.1 Seeding As in Fasta, BLAST also compares in the first phase a query
sequence against all sequences in a database. BLAST uses the concept of words,
which is very similar to the k -tuples concept (Section 6.4.1). A word is defined to
be a finite set of letters with length w that appear in a given sequence. For instance,
the sequence TCACGA contains four words with length three: TCA, CAC, ACG,
and CGA. The BLAST algorithm assumes that significant alignments have words in
common.
In this first phase, the location of all shared w -letter words between sequences
s and t is determined by doing exact pattern matching. These locations are known
as identical words . Only regions with identical words can be used as seeds for the
alignment.
For the cases where significant alignments do not contain words in common, the
concept of neighborhood is used. The neighbor of a word includes the word itself
and every other word whose score is at least equal to T when compared through a
substitution matrix. For instance, if we consider T
11 and a substitution matrix
PAM200, RGD, and KGD are neighbors, then the score between them is 14 [17].
=
Search WWH ::




Custom Search