where $\lfloor x \rfloor$ is the floor of $x$ (the greatest integer not exceeding the real value $x$). The value of $LS(G)$ is around 5 in all the unicellular and primitive multicellular organisms, and is around 10 in the two human chromosomes we analyzed. A clear understanding of this behavior requires investigation in more general terms; however, $LS(G)$ is surely related to an evolutionary selection acting on the strings constituting genomic dictionaries.
Hapax and Repeat Analysis
Two important types of factors of genomes are hapaxes and repeats. A hapax of a genome $G$ is a factor $\alpha$ of $G$ such that $\alpha(G) = 1$. The term hapax (from a Greek word meaning "once") comes from the analysis of literary texts (for stylistic analysis and text authorship attribution); however, it is now also used in informational text analysis [64].
A repeat of $G$ is a factor $\alpha$ of $G$ such that $\alpha(G) > 1$. Of course, the set $H(G)$ of hapaxes of $G$ and the set $R(G)$ of repeats of $G$ constitute a bipartition of $D(G)$: at least one element of $\Gamma$ is a repeat and $G$ is a hapax, therefore $H(G)$ and $R(G)$ are non-empty, disjoint sets whose union is $D(G)$. We set:
$$H_k(G) = \Gamma^k \cap H(G) \tag{2.11}$$
$$R_k(G) = \Gamma^k \cap R(G) \tag{2.12}$$
where $\cap$ is the set-theoretic intersection. Therefore, given a genome $G$ of length $n$, for any $k \le n$ we can read it according to the bipartition of its $k$-genomic dictionary into $H_k(G)$ and $R_k(G)$.
A more refined measure of the average repeatability of $k$-factors in $G$ may now be given as:
$$AR_k(G) = \frac{|T_k(G) \setminus H_k(G)|}{|R_k(G)|} \tag{2.13}$$
where $k$-hapaxes have been excluded from both the $k$-genomic multiset and the $k$-genomic dictionary (the symbol $\setminus$ represents the set-theoretic difference). Index $AR_k(G)$ counts the proper (average) repeatability of $k$-repeats in genome $G$.
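As a rough illustration of definitions (2.11) to (2.13), the following Python sketch computes the $k$-hapaxes, the $k$-repeats, and the index $AR_k$ for a genome given as a plain string. The function names (`k_multiset`, `hapaxes_and_repeats`, `average_repeatability`) and the toy genome are hypothetical and chosen only for this example. The sketch assumes that occurrences of a factor may overlap and that the cardinality of the multiset difference in (2.13) is the total number of occurrences of the $k$-repeats.

```python
from collections import Counter


def k_multiset(genome: str, k: int) -> Counter:
    """Multiplicity of every k-factor of the genome (assumed reading of T_k(G)).

    Every starting position contributes one occurrence, so overlapping
    occurrences are counted (an assumption of this sketch).
    """
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def hapaxes_and_repeats(genome: str, k: int):
    """Return the pair (H_k(G), R_k(G)) as Python sets."""
    counts = k_multiset(genome, k)
    hapaxes = {word for word, mult in counts.items() if mult == 1}
    repeats = {word for word, mult in counts.items() if mult > 1}
    return hapaxes, repeats


def average_repeatability(genome: str, k: int):
    """AR_k(G): occurrences of k-repeats divided by the number of distinct k-repeats.

    Returns None when the genome has no k-repeats, since the ratio is
    then undefined.
    """
    counts = k_multiset(genome, k)
    repeat_multiplicities = [mult for mult in counts.values() if mult > 1]
    if not repeat_multiplicities:
        return None
    return sum(repeat_multiplicities) / len(repeat_multiplicities)


if __name__ == "__main__":
    toy_genome = "ACGTACGA"  # hypothetical toy string, not real data
    h2, r2 = hapaxes_and_repeats(toy_genome, 2)
    print(sorted(h2), sorted(r2), average_repeatability(toy_genome, 2))
```

For the toy string, the 2-factors AC and CG each occur twice while GT, TA, and GA occur once, so the two sets partition the 2-dictionary and the index equals (2 + 2) / 2 = 2. For real genomes a plain `Counter` over all positions may become memory-hungry at larger $k$; this sketch makes no attempt to address that.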
The concepts of hapax and repeat provide a great number of related notions. In the following section we will discuss experimental data, reported in tables, diagrams, and figures, which include the measure of the ratio between $|H_k(G)|$ and $|R_k(G)|$ as a function of $k$ (that is, how the number of hapax words of a given length increases or decreases with respect to the number of repeats of that length). An important phenomenon guided us in the choice of the string lengths for the computed dictionaries. In fact, we observed a sort of phase-transition effect in the passage from $D_{12}(G)$ to $D_{18}(G)$, in almost all genomes of Table 2.11, where a clear inversion appears in the ratio hapax-cardinality/repeat-cardinality.
Let us mention briefly other relevant indexes, related to hapax and repeat concepts, that will be reconsidered in the following (for definitions see Table 2.13): minimal hapax length, denoted by MH, maximal repeat length MR, repeat