$$E(X) = -\sum_{j=1}^{m} \frac{n_j}{n} \log \frac{n_j}{n}.$$
The $k$-co-entropy $\mathit{coE}_k(G)$ is a measure of the distance from the maximum value of the empirical $k$-entropy of a genome with the same dictionary $D_k(G)$:

$$\mathit{coE}_k(G) = \lg(|D_k(G)|) - E_k(G). \tag{2.7}$$
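
As a rough illustration of (2.7), the following Python sketch computes the empirical k-entropy and the k-co-entropy of a string. It assumes that E_k(G) is the entropy of the relative frequencies of the overlapping k-factors of G, and that lg denotes the base-2 logarithm; neither assumption is stated in this passage, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def k_factor_counts(genome: str, k: int) -> Counter:
    """Count every overlapping k-long substring (k-factor) of genome."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def k_entropy(genome: str, k: int) -> float:
    """Empirical entropy of the k-factor frequencies (base 2 is an assumption)."""
    counts = k_factor_counts(genome, k)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def k_co_entropy(genome: str, k: int) -> float:
    """Equation (2.7): lg|D_k(G)| - E_k(G)."""
    counts = k_factor_counts(genome, k)
    if not counts:
        raise ValueError("genome is shorter than k")
    return log2(len(counts)) - k_entropy(genome, k)
```

Under these assumptions the co-entropy is zero when all the k-factors in the dictionary occur equally often, and it grows as the frequencies become more uneven.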
On the basis of genomic $k$-entropy, many other genomic concepts can be developed which could have a great relevance in the analysis of genomes, such as mutual information or entropic divergence.
Another related index for a genome $G$ is given by its $k$-lexicality, that is, the ratio:

$$|D_k(G)| / |T_k(G)| \tag{2.8}$$

which expresses the percentage of distinct $k$-factors of $G$ with respect to all the $k$-factors present in $G$. It is clear that the $k$-lexicality increases with the word length $k$, and does not exhibit any regularity with the genome length. Of course, the inverse of this ratio provides an average repeatability of the $k$-factors of $G$.
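
The ratio (2.8) can be sketched in a few lines of Python. This is only an illustration: it assumes that T_k(G) collects all overlapping occurrences of k-factors, so that |T_k(G)| = |G| - k + 1, and the function name is hypothetical.

```python
def k_lexicality(genome: str, k: int) -> float:
    """Ratio |D_k(G)| / |T_k(G)| of distinct k-factors to k-factor occurrences."""
    n_occurrences = len(genome) - k + 1  # |T_k(G)|, assuming overlapping occurrences
    if n_occurrences <= 0:
        raise ValueError("genome is shorter than k")
    distinct = {genome[i:i + k] for i in range(n_occurrences)}  # D_k(G)
    return len(distinct) / n_occurrences


# "ACACAC" has 5 occurrences of 2-factors but only 2 distinct ones (AC, CA).
lexicality = k_lexicality("ACACAC", 2)
repeatability = 1 / lexicality  # average number of occurrences per distinct 2-factor
```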
When $\Gamma^k = D_k(G)$ we say that a genome $G$ is $k$-complete, meaning that all the possible genomic $k$-long strings occur (at least once) in $G$. If $G$ is not $k$-complete, a non-empty set $F_k(G)$ of "non-appearing", say forbidden $k$-words (also called "nullomers" [34]), is given by the difference of sets:

$$F_k(G) = \Gamma^k \setminus D_k(G). \tag{2.9}$$
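
The set difference (2.9) translates directly into code. The sketch below assumes the genomic alphabet is {A, C, G, T} and enumerates all 4^k words explicitly, which is only practical for small k; the names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed genomic alphabet


def dictionary(genome: str, k: int) -> set:
    """D_k(G): the set of distinct k-factors occurring in genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def forbidden_words(genome: str, k: int) -> set:
    """F_k(G): all k-words over the alphabet that do not occur in genome."""
    all_words = {"".join(p) for p in product(ALPHABET, repeat=k)}
    return all_words - dictionary(genome, k)


def is_k_complete(genome: str, k: int) -> bool:
    """A genome is k-complete exactly when it has no forbidden k-words."""
    return not forbidden_words(genome, k)
```

For example, `forbidden_words("ACGT", 2)` contains every 2-word except AC, CG and GT.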
Of course, genomic $k$-completeness is related to the genome length. In fact, it is easy to find a genome length at which $k$-complete genomes certainly exist: we can construct such a genome by concatenating, in any order, all the $4^k$ $k$-mers. Therefore, we have $(4^k)!$ $k$-complete genomes of length $k4^k$. The search for the minimum length providing genomic $k$-completeness, and for algorithms constructing such minimal genomes, is a non-trivial theoretical investigation of some possible interest.
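
The concatenation argument can be checked on small cases. The following sketch builds one of the orderings (the lexicographic one, chosen arbitrarily) and leaves the reader to confirm the length and completeness; the alphabet and the function name are assumptions made for illustration.

```python
from itertools import product


def concatenated_genome(k: int, alphabet: str = "ACGT") -> str:
    """Concatenate all k-mers over alphabet, here in lexicographic order."""
    return "".join("".join(p) for p in product(alphabet, repeat=k))


genome = concatenated_genome(2)
distinct_2_factors = {genome[i:i + 2] for i in range(len(genome) - 1)}

print(len(genome))              # k * 4**k = 32
print(len(distinct_2_factors))  # 4**k = 16, so the genome is 2-complete
```

Any other ordering of the k-mers gives a different string with the same two properties, since each k-mer still appears as one of the concatenated blocks.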
For each $G$, we can define its minimal forbidden length, denoted by $\mathit{MF}(G)$, as the minimum $k$ such that $G$ is not $k$-complete.
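
A direct way to find the minimal forbidden length is to increase k until the dictionary has fewer than 4^k words. The sketch below does this by brute force; it assumes the genome contains only the four letters of the alphabet, and the function name is hypothetical.

```python
def minimal_forbidden_length(genome: str, alphabet_size: int = 4) -> int:
    """Smallest k for which genome is not k-complete.

    Assumes genome uses only alphabet_size distinct symbols, so that
    k-completeness is equivalent to having alphabet_size**k distinct k-factors.
    """
    k = 1
    while True:
        n_occurrences = max(len(genome) - k + 1, 0)
        distinct = {genome[i:i + k] for i in range(n_occurrences)}
        if len(distinct) < alphabet_size ** k:
            return k
        k += 1
```

The loop always terminates, because a genome of length n has at most n - k + 1 distinct k-factors, which eventually falls below 4^k.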
The cardinality of $F_k(G)$, for $k$ greater than $\mathit{MF}(G)$ and within a small range over $\mathit{MF}(G)$, seems to be a very specific feature of each genome. It is indeed remarkable that in all genomes we considered $\mathit{MF}(G)$ is at least 6 and below 12 (see Table 2.15), and it does not appear directly related to the biological complexity of corresponding organisms. Another interesting character of genomes is the factor length selectivity $\mathit{LS}(G)$, which expresses the gap between the length of factors which in principle could be all accommodated in a genome $G$, and the length of those which are actually present in $G$ (according to the value of its minimal forbidden length):

$$\mathit{LS}(G) = \lg_4 |G| - (\mathit{MF}(G) - 1) \tag{2.10}$$
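
As a small worked example of (2.10), the sketch below evaluates the factor length selectivity from a genome length and a minimal forbidden length. The numbers used are hypothetical and are not taken from Table 2.15; the function name is also an assumption.

```python
from math import log2


def factor_length_selectivity(genome_length: int, mf: int) -> float:
    """Equation (2.10): lg_4|G| - (MF(G) - 1)."""
    log4_length = log2(genome_length) / 2  # log base 4 via log base 2
    return log4_length - (mf - 1)


# Hypothetical genome of length 4**10 whose minimal forbidden length is 7:
# factors up to length 10 could in principle all fit, but only those up to
# length 6 are all present, giving a selectivity of 4.
print(factor_length_selectivity(4 ** 10, 7))  # 4.0
```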