Information Technology Reference
In-Depth Information
a sort of transition phase from scarce to abundant hapax/repeat distribution. This
phenomenon would surely deserve a more detailed and generalized analysis.
Any substring of a repeat word is still a repeat, with its own multiplicity along
the genome, and inside the repeat word itself. A further index is thus defined over
genomes G, called MR(G) (maximal repeat length), as the maximal length of words γ
such that γ(G) > 1. An algorithmic way to find it (for our genomes) starts
from the repeats in D_18 (which are fewer than the hapaxes) and checks how much
they may be elongated on the genome by keeping their status of repeat words. Data
related to the MR index computed over our genomes are reported in Table 2.18,
where the only MR-long repeat of each genome exhibits a non-trivial structure (that
is, different from polymers of a single nucleotide or similar patterns), and complex
repeats are obtained for many lengths.
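The elongation idea can be illustrated with a minimal Python sketch. It is not the procedure used for the data in Table 2.18: for simplicity it recounts all k-long words at each length instead of extending individual repeats in place, and it keeps every word in memory, which would be impractical for real genomes. The function names `repeats_of_length` and `maximal_repeat_length` and the parameter `start_k` are hypothetical. The sketch relies on the fact stated above that any substring of a repeat is still a repeat, so repeats cannot reappear once a length without repeats is reached.

```python
from collections import Counter
from typing import Optional


def repeats_of_length(genome: str, k: int) -> set:
    """Return the set of k-long words occurring more than once in genome.

    Overlapping occurrences are counted as distinct occurrences.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {word for word, c in counts.items() if c > 1}


def maximal_repeat_length(genome: str, start_k: int = 1) -> Optional[int]:
    """Grow k from start_k while some k-long word is still a repeat.

    Returns None when no repeat of length start_k exists, which only
    tells us that the maximal repeat length is below start_k.
    """
    if not repeats_of_length(genome, start_k):
        return None
    k = start_k
    # Keep elongating as long as at least one (k+1)-long repeat survives.
    while repeats_of_length(genome, k + 1):
        k += 1
    return k


if __name__ == "__main__":
    toy = "ACGTACGTTT"  # toy sequence, not real genomic data
    print(maximal_repeat_length(toy))
```

Starting from a larger `start_k` (such as 18 in the text) only skips the shorter lengths; the result is the same whenever a repeat of that length exists.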
Word repeatability is crucial to understanding the information
content of texts. A genome analysis in terms of (shortest) hapaxes and (maximal)
repeats, providing their relative distribution within the genome, highlights the
associative nature of DNA as a container of information. The localization and frequency of
specific DNA fragments are indeed crucial to understanding the information organization
of genomes. Hapaxes, occurring once in the genome, by their nature act as
addresses for the specific retrieval of functional elements, characterized by redundancy
and repeatability. On the other hand, an important characterization of repeats
may be given by means of their internal structure, that is, by the non-maximal repeats
which compose them. These represent a second level of repeatability, possibly
exhibiting various and rich genomic structural properties of functional sequences
(such as the presence of power strings).
Indexes, dictionaries, and tables given in this section identify a kernel of about
20 basic concepts, and many other notions may be derived from them. Namely, for
any numerical index I_k with parameter k, the distribution k ↦ I_k(G) can be defined,
and its classical statistical parameters (mean, standard deviation, median, mode,
etc.) may be derived as further indexes (the same possibility holds for the
multiplicity-comultiplicity factor distribution). Moreover, extending Shannon's notion of typical
sequence in information theory, for any index I, a minimal I-typical sequence, for
a given genome G, is a portion of G such that the index I, restricted to this portion,
assumes (approximates) the same value that I assumes over the whole G. The
length and number of these sequences are other genomic indexes. Assessing the power of
particular indexes in characterizing properties that are relevant in specific contexts
is a line of research requiring computational experiments, mathematical analyses,
and biological interpretations and comparisons.
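As an illustration of how a parametric index yields a distribution and further derived indexes, here is a minimal Python sketch. The example index `hapax_count` (the number of k-long words occurring exactly once) is chosen only for illustration, and the helper names `index_distribution` and `summarize` are hypothetical. Any function of a genome and a parameter k could be used in its place.

```python
import statistics
from collections import Counter
from typing import Callable, Dict, Iterable


def hapax_count(genome: str, k: int) -> int:
    """Example index I_k: number of k-long words occurring exactly once."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return sum(1 for c in counts.values() if c == 1)


def index_distribution(genome: str,
                       index: Callable[[str, int], float],
                       ks: Iterable[int]) -> Dict[int, float]:
    """Tabulate the distribution k -> I_k(G) over the chosen values of k."""
    return {k: index(genome, k) for k in ks}


def summarize(distribution: Dict[int, float]) -> dict:
    """Derive classical statistical parameters from the tabulated values."""
    values = list(distribution.values())
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),   # population standard deviation
        "median": statistics.median(values),
        "modes": statistics.multimode(values),
    }


if __name__ == "__main__":
    toy = "ACGTACGTTT"  # toy sequence, not real genomic data
    dist = index_distribution(toy, hapax_count, range(1, 5))
    print(dist)
    print(summarize(dist))
```

Whether the population or the sample standard deviation is the appropriate choice depends on how the range of k is interpreted; the sketch uses the population form as an assumption.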
The bipartition of a genomic dictionary into hapaxes and repeats emphasizes the
roots of precise string categories that are related to the functional organization of
genomes. The set of 18-repeats in our genomes has a size a couple of orders of magnitude
smaller than the whole genome, and it seems to have a role of “lexical” coding.
The other elements, which form a notably larger set, seem to have a role of addressing,
delimiting, and coordinating, just like position-identification tags. While the lexical nature of
repeated elements points out their semantic value, the “relative localization” nature
of the others gives importance to their unrepeatability along the genomic sequence.
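The bipartition itself is straightforward to compute on small inputs. The following Python sketch splits the dictionary of k-long words of a sequence into hapaxes and repeats; the function name `partition_dictionary` is hypothetical, and the approach (holding every word in a counter) is meant for toy sequences, not for whole genomes with k = 18.

```python
from collections import Counter
from typing import Set, Tuple


def partition_dictionary(genome: str, k: int) -> Tuple[Set[str], Set[str]]:
    """Split the k-long words of genome into (hapaxes, repeats).

    A hapax occurs exactly once; a repeat occurs more than once.
    Overlapping occurrences are counted as distinct occurrences.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    hapaxes = {word for word, c in counts.items() if c == 1}
    repeats = {word for word, c in counts.items() if c > 1}
    return hapaxes, repeats


if __name__ == "__main__":
    toy = "ACGTACGTTT"  # toy sequence, not real genomic data
    hapaxes, repeats = partition_dictionary(toy, 4)
    print(len(hapaxes), len(repeats))
```

Comparing the sizes of the two sets against the sequence length gives a rough view of the size relations discussed above.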
 