cases, there are few elements with maximal multiplicity; indeed, Zipf curves initially
slope down steeply.
Selectivity, Lexicality, and Forbidden Words
We call the $k$-lexical fraction of a genome $G$ the value $|D_k(G)|/4^k$, which is the percentage of different $k$-mers occurring in $G$ with respect to all the possible ones. Of course, $|T_k(G)|/4^k$ is an upper bound for $|D_k(G)|/4^k$. A better evaluation of this upper bound is given by the value $1/(1 + 4^k/|G|)$, which approximates $|D_k(G)|/4^k$ for a random sequence over $\Gamma$ having length $|G|$. In fact, let us assume that $G$ is random. If $q$ is the fraction of $k$-mers occurring at least once in $G$, then the fraction of $k$-mers occurring at least twice in $G$ is $q^2$, and in general the fraction of $k$-mers occurring at least $i$ times is $q^i$. Summing, over all $i$, the number of $k$-mers occurring at least $i$ times counts every $k$-mer occurrence exactly once. Therefore, assuming $q < 1$, for a very long genome $G$ its length can be approximated in the following way [25]:

$$|G| = 4^k (q + q^2 + \cdots + q^i + \cdots) = 4^k \frac{q}{1 - q}.$$
Therefore,

$$|G|\,(1 - q) = 4^k q$$

that is:

$$|G| = q\,(|G| + 4^k)$$

which implies:

$$1/q = (|G| + 4^k)/|G|$$

or equivalently, the fraction of $k$-mers occurring in a random genome of length $|G|$ (of length appreciably shorter than $4^k$) is:

$$q = 1/(1 + 4^k/|G|). \tag{2.5}$$
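The algebra leading to Eq. (2.5) can be checked numerically. The following is a minimal Python sketch, not part of the original text: the function names and the genome length used in the demo are hypothetical, chosen only for illustration. It evaluates the estimate, inverts it back to a length, and shows that the partial sums of the geometric series approach q/(1 - q).

```python
def random_lexical_fraction(k: int, genome_length: int) -> float:
    """Estimate q = 1 / (1 + 4**k / |G|), as in Eq. (2.5)."""
    if k < 1 or genome_length < 1:
        raise ValueError("k and genome_length must be positive")
    return 1.0 / (1.0 + 4**k / genome_length)


def length_from_fraction(k: int, q: float) -> float:
    """Invert the estimate: |G| = 4**k * q / (1 - q), valid for 0 <= q < 1."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must lie in [0, 1)")
    return 4**k * q / (1.0 - q)


def geometric_partial_sum(q: float, terms: int) -> float:
    """Partial sum q + q**2 + ... + q**terms, which tends to q / (1 - q)."""
    return sum(q**i for i in range(1, terms + 1))


if __name__ == "__main__":
    # Hypothetical genome length, not a value taken from the text.
    k, n = 12, 50_000_000
    q = random_lexical_fraction(k, n)
    print(f"estimated fraction q = {q:.3f}")
    print(f"length recovered from q = {length_from_fraction(k, q):.0f}")
```

Plugging the estimated q back into 4^k q/(1 - q) returns the starting length up to floating-point rounding, which is what the chain of equalities above asserts.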
The computed values of $|D_k(G)|/4^k$ for the genomes of Table 2.11 are in all cases appreciably below this estimate. For example, for H. sapiens chr. 19, $1/(1 + 4^{12}/|G|)$ is equal to 0.791, while $|D_{12}|/4^{12}$ is equal to 0.639. We define for a genome $G$ its $k$-dictionary selectivity $DS_k(G)$ as the following difference:

$$DS_k(G) = 1/(1 + 4^k/|G|) - |D_k(G)|/4^k. \tag{2.6}$$
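As an illustration of Eq. (2.6), the sketch below computes the lexical fraction and the dictionary selectivity of a short toy sequence. It is a minimal Python example under stated assumptions, not the authors' implementation: the function names are hypothetical, k-mers are extracted with a sliding window of step 1, windows containing symbols other than A, C, G, T are skipped, and |G| is taken to be the full string length.

```python
from collections import Counter

ALPHABET = "ACGT"


def kmer_table(genome: str, k: int) -> Counter:
    """Multiset T_k(G): every k-mer occurrence found by a sliding window."""
    genome = genome.upper()
    table = Counter()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        # Assumption: windows with symbols outside ACGT (e.g. N) are skipped.
        if all(base in ALPHABET for base in kmer):
            table[kmer] += 1
    return table


def lexical_fraction(genome: str, k: int) -> float:
    """|D_k(G)| / 4**k, where D_k(G) is the set of distinct k-mers of G."""
    return len(kmer_table(genome, k)) / 4**k


def random_lexical_fraction(k: int, genome_length: int) -> float:
    """Random-sequence estimate 1 / (1 + 4**k / |G|)."""
    if genome_length < 1:
        raise ValueError("genome_length must be positive")
    return 1.0 / (1.0 + 4**k / genome_length)


def dictionary_selectivity(genome: str, k: int) -> float:
    """DS_k(G): random-sequence estimate minus the observed lexical fraction."""
    return random_lexical_fraction(k, len(genome)) - lexical_fraction(genome, k)


if __name__ == "__main__":
    toy = "ACGTACGT"  # toy sequence, for illustration only
    print(lexical_fraction(toy, 2), dictionary_selectivity(toy, 2))
```

For a real genome, the same functions would be applied to the full sequence string. Storing every k-mer in a `Counter` is only practical for moderate k, so a bit-packed or disk-based table would be a reasonable replacement at larger scales.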
Dictionary selectivity very often proves more indicative than the $k$-empirical entropy $E_k(G)$, which can be defined as $E(T_k(G))$ by applying to $T_k(G)$ the following general definition of entropy $E(X)$ of a multiset $X$ of size $n$ with $m$ elements of multiplicities $n_1, n_2, \ldots, n_m$:
 