Biomedical Engineering Reference
In-Depth Information
6.3 GLOBAL AND LOCAL BIOLOGICAL SEQUENCE ALIGNMENT
DNA sequences and protein sequences are treated as long sequences depicted, respec-
tively, by
.
To compare two biological sequences, we need to determine whether a given pattern
(sequence 1) appears in a given text (sequence 2). Exact searching is not usually done
in this case, because it is unlikely that biological patterns match the text exactly.
Therefore, the problem of comparing two biological sequences is in fact a problem
of approximate pattern matching, where the goal is to find a given pattern in a given
text, allowing a limited number of errors in the matches [20]. In this context, a key
concept is the error model, which defines how different two strings are. Depending
on the type of errors being considered, the time needed to obtain solutions can range
from linear to exponential.
For any string, s
{
A, T, G, C
}
and
{
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
}
, we denote its length as
|
s
|
. We also denote s i , the i th
character of s , for an integer i
s j .
Using these previously defined notations, the problem of approximate pattern
matching can be defined as follows [20]:
∈{
1
···|
s
|}
. We also denote s i , ... , j =
s i s i + 1 ···
Let be a finite alphabet of size
|
|=
σ .
be a text of length n
Let T
=|
T
|
.
be a pattern of length m
Let P
=|
P
|
.
Let k
be the maximum error allowed.
Let d : ×
be a distance function .
·
Given T , P , k , and d(
) return the set of all the text positions j such that there
exists i with d(P , T i , ... , j )
k .
The distance d(x , y) between strings x and y is the minimal cost of a sequence
of operations of the form δ(z , w)
=
t , where z and w are different substrings, which
transform x into y [20].
In biological sequence comparison applications, the set of possible operations δ
is [27]:
Insertion ( δ( , a) ): insertion of the letter a .
Deletion ( δ(a , ) ): deletion of the letter a .
Substitution ( δ(a , b) ): substitution of the letter a by letter b .
Two basic notions widely used in biological sequence comparison algorithms are
similarity and alignment. The similarity is a measure how similar two sequences are.
In general, the more the sequences are similar, the less the errors are present. The
alignment is obtained by placing one sequence above the other, making clear the
correspondence between similar characters or substrings from the sequences [27]. In
an alignment, spaces are inserted in arbitrary locations along the sequences so that
they end up with the same size.
Search WWH ::




Custom Search