cases, there are few elements with maximal multiplicity; indeed, Zipf curves initially
slope down steeply.
Selectivity, Lexicality, and Forbidden Words
We call $k$-lexical fraction of a genome $G$ the value $|D_k(G)|/4^k$, which is the percentage of different $k$-mers occurring in $G$ with respect to all the possible ones. Of course, $|T_k(G)|/4^k$ is an upper bound for $|D_k(G)|/4^k$. A better evaluation for such an upper bound is given by the value $1/(1 + 4^k/|G|)$, which approximates $|D_k(G)|/4^k$ for a random sequence over $\Gamma$ having length $|G|$. In fact, let us assume that $G$ is random; then, if $q$ is the fraction of $k$-mers occurring at least once in $G$, the fraction of $k$-mers occurring at least twice in $G$ is $q^2$, and in general the fraction of $k$-mers occurring at least $i$ times is $q^i$. Therefore, assuming $q < 1$, for a very long genome $G$, its length can be approximated in the following way [25]:

$$|G| = 4^k (q + q^2 + \cdots + q^i + \cdots) = \frac{4^k q}{1 - q}.$$
Therefore,

$$|G|(1 - q) = 4^k q,$$

that is:

$$|G| = q\,(|G| + 4^k),$$

which implies:

$$1/q = (|G| + 4^k)/|G|,$$
or equivalently, the fraction of $k$-mers occurring in a random genome of length $|G|$ (of length significantly shorter than $4^k$) is given by:

$$1/q = (1 + 4^k/|G|), \quad \text{i.e.} \quad q = 1/(1 + 4^k/|G|). \qquad (2.5)$$
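The derivation leading to (2.5) can be checked numerically. The following is a minimal sketch, not part of the original text: the function names and the toy values of the genome length and of $k$ are illustrative assumptions. It evaluates the estimate $q = 1/(1 + 4^k/|G|)$ and verifies that both the closed form $4^k q/(1 - q)$ and a long partial sum of the series $4^k(q + q^2 + \cdots)$ return the length $|G|$.

```python
def random_lexical_fraction(genome_length: int, k: int, alphabet_size: int = 4) -> float:
    """Estimate (2.5): q = 1 / (1 + 4^k / |G|) for a random sequence."""
    if genome_length <= 0 or k <= 0:
        raise ValueError("genome_length and k must be positive")
    return 1.0 / (1.0 + alphabet_size ** k / genome_length)


def length_closed_form(q: float, k: int, alphabet_size: int = 4) -> float:
    """Closed form of the geometric series: 4^k * q / (1 - q), valid for q < 1."""
    return alphabet_size ** k * q / (1.0 - q)


def length_partial_sum(q: float, k: int, terms: int, alphabet_size: int = 4) -> float:
    """Truncated series 4^k * (q + q^2 + ... + q^terms)."""
    return alphabet_size ** k * sum(q ** i for i in range(1, terms + 1))


if __name__ == "__main__":
    genome_length, k = 1000, 4  # toy values, not taken from the text
    q = random_lexical_fraction(genome_length, k)
    print("estimated lexical fraction q:", q)
    print("length from closed form:", length_closed_form(q, k))
    print("length from 200-term partial sum:", length_partial_sum(q, k, terms=200))
```

Both reconstructed lengths should agree with the chosen `genome_length` up to floating-point error, which mirrors the algebraic steps from the series to (2.5).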
The computations of $|D_k(G)|/4^k$ for the genomes of Table 2.11 are in all cases significantly below the estimate (2.5). For example, for H. sapiens chr. 19, $1/(1 + 4^{12}/|G|)$ is equal to 0.791, while $|D_{12}|/4^{12}$ is equal to 0.639. We define for a genome $G$ its $k$-dictionary selectivity $DS_k(G)$ as the following difference:

$$DS_k(G) = 1/(1 + 4^k/|G|) - |D_k(G)|/4^k. \qquad (2.6)$$

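As an illustration of definition (2.6), the sketch below computes the $k$-mer multiset, the $k$-lexical fraction, and the $k$-dictionary selectivity of a short string. It is a hedged example, not the author's implementation: the function names and the toy sequence are assumptions, $T_k(G)$ is taken to be the multiset of all overlapping $k$-mers of $G$, and $D_k(G)$ the set of distinct ones.

```python
from collections import Counter


def kmer_multiset(genome: str, k: int) -> Counter:
    """Multiset T_k(G): every overlapping k-mer of the genome with its multiplicity."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def lexical_fraction(genome: str, k: int, alphabet_size: int = 4) -> float:
    """k-lexical fraction |D_k(G)| / 4^k, where D_k(G) is the set of distinct k-mers."""
    return len(kmer_multiset(genome, k)) / alphabet_size ** k


def dictionary_selectivity(genome: str, k: int, alphabet_size: int = 4) -> float:
    """DS_k(G) = 1 / (1 + 4^k / |G|) - |D_k(G)| / 4^k, as in (2.6)."""
    expected = 1.0 / (1.0 + alphabet_size ** k / len(genome))
    return expected - lexical_fraction(genome, k, alphabet_size)


if __name__ == "__main__":
    toy_genome = "ACGTACGTAC"  # hypothetical sequence, for illustration only
    print(kmer_multiset(toy_genome, 2))
    print(lexical_fraction(toy_genome, 2))
    print(dictionary_selectivity(toy_genome, 2))
```

With the rounded values quoted in the text for H. sapiens chr. 19 and $k = 12$, the same difference is 0.791 − 0.639 = 0.152.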
Dictionary selectivity very often proves more indicative than the $k$-empirical entropy $E_k(G)$ of $G$, which can be defined as $E(T_k(G))$ by applying to $T_k(G)$ the following general definition of entropy $E(X)$ of a multiset $X$ of size $n$ with $m$ elements of multiplicities $n_1, n_2, \ldots, n_m$: