Information Technology Reference
In-Depth Information
When n =2a pairwise sequence alignment is found, with n
3 multiple sequence
alignment . Solving the sequence alignment problem requires a scoring function
to evaluate alignments. A simple scoring function is a distance function (another
scoring function is the similarity approach). Having a distance function d ( S i , S j )
for any aligned sequences
S i and
S j , the pairwise alignment problem can be
stated as follows:
Definition 2 [Pairwise alignment problem]. Let S =
{
S 1 ,S 2 }
be a set of 2
sequences over the alphabet Σ. Compute the alignment S =
S 1 , S 2 }
{
of S over
the alphabet Σ that minimises the distance d ( S 1 , S 2 ) .
Hence, the multiple sequence alignment problem can be stated as follows:
Definition 3 [Sum-of-pairs multiple alignment problem]
Let S =
{
S 1 ,S 2 ,...S n }
be a set of n sequences over the alphabet Σ. Compute
S =
S 1 , S 2 ,..., S n }
Σ that minimises the
the alignment
{
of S over the alphabet
S i , and
S j :
sum of the distance over all pairs
= n− 1
d ( S i , S j )
n
min
S
i =1
j = i +1
The scoring functions previously defined are too simple to be used when aligning
real biological sequences. A scoring function needs to be based on the similarity
of the characters occurring in the sequences, e.g. amino-acids. For instance, for
two amino-acids, aa i and aa j , we need a measure of the probability that they
have a common ancestor, or that one aa is the result of one or several mutations
of the other. This measure can be formulated as follows:
Definition 4 [Scoring matrix]. Let M be a
scoring matrix, where is
the cardinality of the alphabet Σ, which for any two characters a and b of the
alphabet Σ has the following properties:
×
1. M ( a, b )= M ( b, a ) ,
a, b
Σ,
2. M ( a,
)= GEP, where GEP is a fixed gap penalty,
3. M (
,
)=0 .
In general a gap of lenght h has a penalty score of h × GEP, where GEP < 0is
the fixed gap (extension) penalty. This is called the linear gap penalty function .
From a biological point of view a more appropriate penalty score is the a ne gap
penalty function ,(AGPS):givenanalignedsequence S i , the first gap receives a
gap opening penalty , GOP < GEP < 0 , which is stronger than penalty for gap
extending spaces. Hence, a gap of lenght h has a cost of GOP +( h
1) GEP. The
most common scoring matrices are the PAM and BLOSUM series. These scoring
matrices have been developed based on observed mutations in the nature. In
order to minimise redundant information, based on the relatedness of the given
sequences, each sequence usually receives a weight proportional to the amount
of independent information it contains. This kind of information can be derived
from a phylogenetic tree for the sequences.
Search WWH ::




Custom Search