\[
E(X) = -\sum_{j=1}^{m} \frac{n_j}{n} \log \frac{n_j}{n}.
\]
The $k$-co-entropy $\mathit{coE}_k(G)$ is a measure of the distance from the maximum value of the empirical $k$-entropy of a genome with the same dictionary $D_k(G)$:
\[
\mathit{coE}_k(G) = \lg(|D_k(G)|) - E_k(G). \tag{2.7}
\]
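As an illustration of (2.7), the following minimal Python sketch computes an empirical $k$-entropy and the corresponding co-entropy. It assumes that $E_k(G)$ is the Shannon entropy of the relative frequencies of the overlapping $k$-factors of $G$, that lg is the base-2 logarithm, and that the genome has at least $k$ symbols; the function names are hypothetical and are not taken from the text.

```python
from collections import Counter
from math import log2


def kmer_counts(genome: str, k: int) -> Counter:
    """Count every overlapping k-factor (k-mer) occurring in the genome."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def k_entropy(genome: str, k: int) -> float:
    """Shannon entropy (base 2) of the empirical k-mer distribution (assumed E_k)."""
    counts = kmer_counts(genome, k)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def k_co_entropy(genome: str, k: int) -> float:
    """Distance of the k-entropy from its maximum lg|D_k(G)|, as in (2.7).

    Assumes len(genome) >= k, so that the dictionary is non-empty.
    """
    dictionary_size = len(kmer_counts(genome, k))
    return log2(dictionary_size) - k_entropy(genome, k)
```

Under these assumptions the co-entropy is zero exactly when all the $k$-factors of the dictionary occur equally often, and it grows as the distribution becomes more skewed.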
On the basis of genomic $k$-entropy, many other genomic concepts can be developed that could be highly relevant to the analysis of genomes, such as mutual information or entropic divergence.
Another related index for a genome $G$ is given by its $k$-lexicality, that is, the ratio
\[
\frac{|D_k(G)|}{|T_k(G)|} \tag{2.8}
\]
which expresses the proportion of distinct $k$-factors of $G$ with respect to all the $k$-factors present in $G$. It is clear that the $k$-lexicality increases with the word length $k$, and does not exhibit any regularity with the genome length. Of course, the inverse of this ratio provides an average repeatability of the $k$-factors of $G$.
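The ratio (2.8) can be sketched in a few lines of Python. This is only an illustration: it assumes that $T_k(G)$ collects all occurrences of $k$-factors in $G$ counted with multiplicity, so that $|T_k(G)| = |G| - k + 1$, and the function name `k_lexicality` is hypothetical.

```python
def k_lexicality(genome: str, k: int) -> float:
    """Ratio |D_k(G)| / |T_k(G)| of distinct k-factors to all k-factor occurrences.

    Assumption: |T_k(G)| is the number of overlapping k-mer positions in G.
    """
    total = len(genome) - k + 1  # number of k-factor occurrences
    if total <= 0:
        raise ValueError("genome must contain at least k symbols")
    distinct = len({genome[i:i + k] for i in range(total)})
    return distinct / total


def average_repeatability(genome: str, k: int) -> float:
    """Inverse of the k-lexicality: mean number of occurrences per distinct k-factor."""
    return 1.0 / k_lexicality(genome, k)
```

For the toy string `ACGTACGT`, this sketch gives a 2-lexicality of 4/7 and a 4-lexicality of 4/5, in line with the remark that the index increases with the word length.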
When $D_k(G) = \Gamma^k$ we say that a genome $G$ is $k$-complete, meaning that all the possible genomic $k$-long strings occur (at least once) in $G$. If $G$ is not $k$-complete, a non-empty set $F_k(G)$ of “non-appearing”, say forbidden $k$-words (also called “nullomers” [34]), is given by the difference of sets:
\[
F_k(G) = \Gamma^k \setminus D_k(G). \tag{2.9}
\]
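The set difference (2.9) translates directly into code. The sketch below assumes the alphabet $\Gamma = \{A, C, G, T\}$ and enumerates $\Gamma^k$ explicitly, which is only practical for small $k$; the names `dictionary`, `forbidden_words` and `is_k_complete` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed genomic alphabet (Gamma)


def dictionary(genome: str, k: int) -> set:
    """D_k(G): the set of distinct k-factors occurring in the genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def forbidden_words(genome: str, k: int) -> set:
    """F_k(G) = Gamma^k minus D_k(G): the k-words that never occur in the genome."""
    all_words = {"".join(p) for p in product(ALPHABET, repeat=k)}
    return all_words - dictionary(genome, k)


def is_k_complete(genome: str, k: int) -> bool:
    """A genome is k-complete when no k-word over the alphabet is forbidden."""
    return not forbidden_words(genome, k)
```

For genomes of realistic size, materializing all $4^k$ words quickly becomes expensive, so comparing $|D_k(G)|$ with $4^k$ may be preferable when only completeness is needed.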
Of course, genomic $k$-completeness is related to the genome length. In fact, it is easy to find a length at which $k$-complete genomes certainly exist: we can construct such a genome by concatenating, in any order, all the $4^k$ $k$-mers. Therefore, we have $(4^k)!$ $k$-complete genomes of length $k \cdot 4^k$. The search for the minimum length providing genomic $k$-completeness, and for algorithms constructing such minimal genomes, is a non-trivial theoretical investigation of some possible interest.
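The concatenation argument can be checked on small cases. The following sketch, which assumes the four-letter alphabet `ACGT` and uses hypothetical function names, builds one such genome (taking the $k$-mers in lexicographic order) and verifies its length and its $k$-completeness.

```python
from itertools import product
from math import factorial


def concatenated_complete_genome(k: int, alphabet: str = "ACGT") -> str:
    """Concatenate all k-mers over the alphabet, here in lexicographic order."""
    return "".join("".join(p) for p in product(alphabet, repeat=k))


def is_k_complete(genome: str, k: int, alphabet: str = "ACGT") -> bool:
    """True when every k-word over the alphabet occurs at least once in the genome."""
    seen = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    return all("".join(p) in seen for p in product(alphabet, repeat=k))


def number_of_orderings(k: int, alphabet_size: int = 4) -> int:
    """Number of ways to order the k-mers before concatenating: (4^k)!."""
    return factorial(alphabet_size ** k)
```

Each ordering yields a different string, because the $i$-th block of $k$ symbols is exactly the $i$-th $k$-mer of the ordering. The genomes built this way are not claimed to be the shortest $k$-complete ones.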
For each $G$, we can define its minimal forbidden length, denoted by $\mathit{MF}(G)$, as the minimum $k$ such that $G$ is not $k$-complete.
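A direct way to compute the minimal forbidden length is to increase $k$ until the dictionary becomes smaller than $4^k$. The sketch below assumes the alphabet `ACGT` and ignores $k$-factors containing any other symbol (such as `N`); the function name is hypothetical.

```python
def minimal_forbidden_length(genome: str, alphabet: str = "ACGT") -> int:
    """Smallest k such that some k-word over the alphabet does not occur in the genome."""
    letters = set(alphabet)
    k = 1
    while True:
        # Distinct k-factors made only of alphabet symbols.
        words = {
            genome[i:i + k]
            for i in range(len(genome) - k + 1)
            if set(genome[i:i + k]) <= letters
        }
        if len(words) < len(letters) ** k:
            return k  # the genome is not k-complete
        k += 1
```

The loop always terminates, since a genome of length $|G|$ has at most $|G| - k + 1$ factors of length $k$, and this bound falls below $4^k$ for $k$ large enough.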
The cardinality of $F_k(G)$, for $k$ greater than $\mathit{MF}(G)$ and within a small range over $\mathit{MF}(G)$, seems to be a very specific feature of each genome. It is indeed remarkable that in all genomes we considered $\mathit{MF}(G)$ is at least 6 and below 12 (see Table 2.15), and it does not appear directly related to the biological complexity of the corresponding organisms. Another interesting character of genomes is the factor length selectivity $\mathit{LS}(G)$, which expresses the gap between the length of factors which in principle could all be accommodated in a genome $G$, and the length of those which are actually present in $G$ (according to the value of its minimal forbidden length):
\[
\mathit{LS}(G) = \lg_4 |G| - (\mathit{MF}(G) - 1). \tag{2.10}
\]
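Formula (2.10) can be evaluated as follows. This is a sketch under the assumptions that $\lg_4$ is the base-4 logarithm, that $|G|$ is the number of symbols of the genome, and that the alphabet is `ACGT`; both function names are hypothetical.

```python
from math import log2


def minimal_forbidden_length(genome: str, alphabet: str = "ACGT") -> int:
    """Smallest k such that the genome is not k-complete over the alphabet."""
    letters = set(alphabet)
    k = 1
    while True:
        words = {
            genome[i:i + k]
            for i in range(len(genome) - k + 1)
            if set(genome[i:i + k]) <= letters
        }
        if len(words) < len(letters) ** k:
            return k
        k += 1


def factor_length_selectivity(genome: str) -> float:
    """LS(G) = log_4 |G| - (MF(G) - 1), following (2.10)."""
    if not genome:
        raise ValueError("genome must be non-empty")
    log4_length = log2(len(genome)) / 2  # log_4 x = log_2 x / 2
    return log4_length - (minimal_forbidden_length(genome) - 1)
```

Here $\mathit{MF}(G) - 1$ is the largest $k$ for which the genome is still $k$-complete, so the value measures how far this length falls below $\lg_4 |G|$.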
 