where $\lfloor x \rfloor$ is the floor (the greatest integer not exceeding the real value) of $x$. The value of $LS(G)$ is around 5 in all the unicellular and primitive multicellular organisms, and is around 10 in the two human chromosomes we analyzed. A clear understanding of this behavior requires investigation in more general terms; however, $LS(G)$ is surely related to an evolutionary selective action over the strings constituting genomic dictionaries.
Hapax and Repeat Analysis
Two important types of factors of genomes are hapaxes and repeats. A hapax of a genome $G$ is a factor $\alpha$ of $G$ such that $\alpha(G) = 1$. The term hapax (from a Greek word meaning once) came from the analysis of literary texts (for stylistic analysis and text authorship attribution); however, it is now also used in informational text analysis [64].
A repeat of $G$ is a factor $\alpha$ of $G$ such that $\alpha(G) > 1$. Of course, the set $H(G)$ of hapaxes of $G$ and the set $R(G)$ of repeats of $G$ constitute a bipartition of $D(G)$ (at least one element of $\Gamma$ is a repeat and $G$ is a hapax, therefore $H(G)$ and $R(G)$ are non-empty, disjoint sets whose union is $D(G)$). We set:
$$H_k(G) = \Gamma^k \cap H(G) \qquad (2.11)$$
$$R_k(G) = \Gamma^k \cap R(G) \qquad (2.12)$$
where $\cap$ is the set-theoretic intersection. Therefore, given a genome $G$ of length $n$, for any $k \le n$ we can read it according to the bipartition of its $k$-genomic dictionary into $H_k(G)$ and $R_k(G)$.
A more refined measure for the average $k$-factor repeatability in $G$ may now be given as:
$$AR_k(G) = \frac{|T_k(G) \setminus H_k(G)|}{|R_k(G)|} \qquad (2.13)$$
where $k$-hapaxes have been excluded from both the $k$-genomic multiset and the $k$-genomic dictionary (the symbol $\setminus$ represents the set-theoretic difference). The index $AR_k(G)$ counts the proper (average) repeatability of $k$-repeats in genome $G$.
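The definitions of $H_k(G)$, $R_k(G)$, and $AR_k(G)$ can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the text: the genome is a plain string, the multiplicity of a $k$-factor counts overlapping occurrences found by a sliding window, and $AR_k$ is reported as 0.0 when there are no $k$-repeats (the formula itself is undefined in that case). The function names are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def k_factor_counts(genome: str, k: int) -> Counter:
    """Multiplicity of every k-factor (assumption: overlapping occurrences count)."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def hapaxes_and_repeats(genome: str, k: int) -> tuple[set[str], set[str]]:
    """Split the k-genomic dictionary into k-hapaxes and k-repeats."""
    counts = k_factor_counts(genome, k)
    hapaxes = {word for word, c in counts.items() if c == 1}
    repeats = {word for word, c in counts.items() if c > 1}
    return hapaxes, repeats


def average_repeatability(genome: str, k: int) -> float:
    """AR_k: occurrences of k-repeats divided by the number of distinct k-repeats."""
    counts = k_factor_counts(genome, k)
    repeat_multiplicities = [c for c in counts.values() if c > 1]
    if not repeat_multiplicities:
        # Assumption: return 0.0 when no k-repeat exists (ratio undefined).
        return 0.0
    return sum(repeat_multiplicities) / len(repeat_multiplicities)


if __name__ == "__main__":
    toy = "ACGTACGTTA"  # toy string, not real genomic data
    h2, r2 = hapaxes_and_repeats(toy, 2)
    print(sorted(h2), sorted(r2), average_repeatability(toy, 2))
```

On the toy string, the only 2-hapax is TT, while AC, CG, GT, and TA each occur twice, so the eight repeat occurrences spread over four distinct repeats give an average repeatability of 2.0.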
The concepts of hapax and repeat provide a great number of related notions. In the following section we will discuss experimental data, reported in tables, diagrams, and figures, which include the measure of the ratio between $|H_k(G)|$ and $|R_k(G)|$ as a function of $k$ (that is, how the number of hapax words of a given length increases or decreases with respect to the number of repeats of that length). An important phenomenon guided us in the choice of the string lengths for the computed dictionaries. In fact, we observed a sort of transition phase effect in the passage from $D_{12}(G)$ to $D_{18}(G)$, in almost all genomes of Table 2.11, where a clear inversion appears in the ratio hapax-cardinality/repeat-cardinality.
Let us mention briefly other relevant indexes, related to hapax and repeat concepts, that will be reconsidered in the following (for definitions see Table 2.13): minimal hapax length, denoted by MH, maximal repeat length MR, repeat