Aligning Multiple Protein Sequences by Hybrid Clonal Selection Algorithm with Insert-Remove-Gaps and BlockShuffling Operators - Artificial Immune Systems

Information Technology Reference

In-Depth Information

When n =2a pairwise sequence alignment is found, with n

3 multiple sequence

alignment . Solving the sequence alignment problem requires a scoring function

to evaluate alignments. A simple scoring function is a distance function (another

scoring function is the similarity approach). Having a distance function d ( S i , S j )

for any aligned sequences

≥

S i and

S j , the pairwise alignment problem can be

stated as follows:

Definition 2 [Pairwise alignment problem]. Let S =

{

S 1 ,S 2 }

be a set of 2

sequences over the alphabet Σ. Compute the alignment S =

S 1 , S 2 }

{

of S over

the alphabet Σ that minimises the distance d ( S 1 , S 2 ) .

Hence, the multiple sequence alignment problem can be stated as follows:

Definition 3 [Sum-of-pairs multiple alignment problem]

Let S =

{

S 1 ,S 2 ,...S n }

be a set of n sequences over the alphabet Σ. Compute

S =

S 1 , S 2 ,..., S n }

Σ that minimises the

the alignment

{

of S over the alphabet

S i , and

S j :

sum of the distance over all pairs

= n− 1

d ( S i , S j )

min

i =1

j = i +1

The scoring functions previously defined are too simple to be used when aligning

real biological sequences. A scoring function needs to be based on the similarity

of the characters occurring in the sequences, e.g. amino-acids. For instance, for

two amino-acids, aa i and aa j , we need a measure of the probability that they

have a common ancestor, or that one aa is the result of one or several mutations

of the other. This measure can be formulated as follows:

Definition 4 [Scoring matrix]. Let M be a

scoring matrix, where is

the cardinality of the alphabet Σ, which for any two characters a and b of the

alphabet Σ has the following properties:

1. M ( a, b )= M ( b, a ) ,

∀

a, b

∈

Σ,

2. M ( a,

−

)= GEP, where GEP is a fixed gap penalty,

3. M (

−

)=0 .

In general a gap of lenght h has a penalty score of h × GEP, where GEP < 0is

the fixed gap (extension) penalty. This is called the linear gap penalty function .

From a biological point of view a more appropriate penalty score is the a ne gap

penalty function ,(AGPS):givenanalignedsequence S i , the first gap receives a

gap opening penalty , GOP < GEP < 0 , which is stronger than penalty for gap

extending spaces. Hence, a gap of lenght h has a cost of GOP +( h

1) GEP. The

most common scoring matrices are the PAM and BLOSUM series. These scoring

matrices have been developed based on observed mutations in the nature. In

order to minimise redundant information, based on the relatedness of the given

sequences, each sequence usually receives a weight proportional to the amount

of independent information it contains. This kind of information can be derived

from a phylogenetic tree for the sequences.

−

Artificial Immune Systems

Search WWH ::

Custom Search

Home