respectively; and λ is the scaling factor. In Table 2, we show the
entries of the BLOSUM62 scoring matrix.
2.4 Minimum Entropy

Given a column of an MSA, it might be reasonable to argue that the
relative proportions of the symbols in this column should be related
to the score of the column. The SP scoring scheme fails to cope
with this approach. For instance, the following two columns have
the same A:C ratio, yet their SP scores can be quite different.

\[
S_i = (A, C, C, C, C)^{T}, \qquad
S_j = (A, A, C, C, C, C, C, C, C, C)^{T}
\]

\[
SP(S_i) = 4\,S(A,C) + \binom{4}{2}\,S(C,C) = 4\,S(A,C) + 6\,S(C,C)
\]

\[
SP(S_j) = S(A,A) + 16\,S(A,C) + \binom{8}{2}\,S(C,C) = S(A,A) + 16\,S(A,C) + 28\,S(C,C).
\]
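The pair counts in the two SP expressions can be checked with a short sketch. It assumes the two columns are one A with four C's and two A's with eight C's (the 1:4 ratio implied by the coefficients), and it uses a hypothetical `toy_score` function in place of a real substitution matrix such as BLOSUM62; the names `pair_counts` and `sp_score` are illustrative only.

```python
from collections import Counter
from itertools import combinations


def pair_counts(column):
    """Count each unordered symbol pair in a column, e.g. 'AC' or 'CC'."""
    return Counter("".join(sorted(pair)) for pair in combinations(column, 2))


def sp_score(column, pair_score):
    """Sum-of-pairs score: add pair_score(x, y) over all unordered pairs."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


def toy_score(x, y):
    # Hypothetical scores for illustration only: +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1


# Assumed example columns with the same A:C ratio (1:4).
s_i = "ACCCC"
s_j = "AACCCCCCCC"

print(pair_counts(s_i))          # 4 A-C pairs, 6 C-C pairs
print(pair_counts(s_j))          # 1 A-A pair, 16 A-C pairs, 28 C-C pairs
print(sp_score(s_i, toy_score))  # 4*(-1) + 6*(+1)
print(sp_score(s_j, toy_score))  # 1*(+1) + 16*(-1) + 28*(+1)
```

With these toy scores the two columns score 2 and 13, even though their A:C ratios are identical.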
This difference implies that the SP scoring scheme does not scale
consistently with the number of sequences. An alternative approach
uses the Shannon entropy of each column as the column's score.
Shannon's entropy [ 7 ] measures the information content of a
sequence of symbols.
Given a column $S_i'$ in an MSA, the score for this column is
directly related to Shannon's entropy but formally defined in a
slightly different way as follows:

\[
S(S_i') = -\sum_{a} c_{ia} \log_2 p_{ia},
\]

where
• $c_{ia}$: number of times symbol $a$ occurs in column $i$
• $p_{ia}$: probability of symbol $a$ in column $i$
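The entropy score defined above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the column is given as a string without gap characters, it estimates $p_{ia}$ as the observed frequency $c_{ia}/n$ in a column of $n$ symbols, and the function name `entropy_score` is hypothetical.

```python
from collections import Counter
from math import log2


def entropy_score(column):
    """Entropy score of one column: -sum_a c_a * log2(p_a), with p_a = c_a / n."""
    counts = Counter(column)
    n = len(column)
    # c * log2(n / c) equals -c * log2(c / n); written this way to avoid -0.0.
    return sum(c * log2(n / c) for c in counts.values())


print(entropy_score("CCCC"))        # a single repeated symbol
print(entropy_score("ACGT"))        # four symbols, each occurring once
print(entropy_score("ACCCC"))       # 1 A and 4 C
print(entropy_score("AACCCCCCCC"))  # 2 A and 8 C, same A:C ratio
```

Because the counts $c_{ia}$ appear as weights, doubling a column while keeping its symbol proportions doubles this score, so the last value printed is twice the one before it.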
There are two extreme cases in this scoring scheme. When all
symbols in the column are the same, then the entropy score is 0. On
the other hand, the entropy score is maximum when all symbols in
the column are equally distributed. A good alignment is the one