Computational Molecular Biology - Parallel Computing for Bioinformatics and Computational Biology

Biomedical Engineering Reference

In-Depth Information

6.3 GLOBAL AND LOCAL BIOLOGICAL SEQUENCE ALIGNMENT

DNA sequences and protein sequences are treated as long sequences depicted, respec-

tively, by

To compare two biological sequences, we need to determine whether a given pattern

(sequence 1) appears in a given text (sequence 2). Exact searching is not usually done

in this case, because it is unlikely that biological patterns match the text exactly.

Therefore, the problem of comparing two biological sequences is in fact a problem

of approximate pattern matching, where the goal is to find a given pattern in a given

text, allowing a limited number of errors in the matches [20]. In this context, a key

concept is the error model, which defines how different two strings are. Depending

on the type of errors being considered, the time needed to obtain solutions can range

from linear to exponential.

For any string, s

{

A, T, G, C

}

and

{

A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y

}

∗ , we denote its length as

∈

. We also denote s i , the i th

character of s , for an integer i

s j .

Using these previously defined notations, the problem of approximate pattern

matching can be defined as follows [20]:

∈{

···|

. We also denote s i , ... , j =

s i s i + 1 ···

Let be a finite alphabet of size

σ .

•

∗ be a text of length n

Let T

∈

•

∗ be a pattern of length m

Let P

∈

•

Let k

∈

be the maximum error allowed.

•

Let d : ∗ ×

∗ →

be a distance function .

•

Given T , P , k , and d(

) return the set of all the text positions j such that there

exists i with d(P , T i , ... , j )

•

≤

k .

The distance d(x , y) between strings x and y is the minimal cost of a sequence

of operations of the form δ(z , w)

t , where z and w are different substrings, which

transform x into y [20].

In biological sequence comparison applications, the set of possible operations δ

is [27]:

Insertion ( δ( , a) ): insertion of the letter a .

•

Deletion ( δ(a , ) ): deletion of the letter a .

•

Substitution ( δ(a , b) ): substitution of the letter a by letter b .

•

Two basic notions widely used in biological sequence comparison algorithms are

similarity and alignment. The similarity is a measure how similar two sequences are.

In general, the more the sequences are similar, the less the errors are present. The

alignment is obtained by placing one sequence above the other, making clear the

correspondence between similar characters or substrings from the sequences [27]. In

an alignment, spaces are inserted in arbitrary locations along the sequences so that

they end up with the same size.

Parallel Computing for Bioinformatics and Computational Biology

Search WWH ::

Custom Search

Home