Strings and Genomes - Infobiotics: Information in Biotic Systems

a sort of transition phase from scarce to abundant hapax/repeat distribution. This phenomenon would surely deserve a more detailed and generalized analysis.

Any substring of a repeat word is still a repeat, with its own multiplicity along the genome and inside the repeat word itself. A further index is thus defined over genomes $G$, called $MR(G)$ (maximal repeat length), as the maximal length of words $\gamma$ such that $\gamma(G) > 1$, that is, of words that are repeats of $G$. An algorithmic way to find it (for our genomes) starts from the repeats in $D_{18}$ (which are fewer than the hapaxes) and checks how far they can be elongated along the genome while keeping their status of repeat words. Data on the MR index computed over our genomes are reported in Table 2.18, where the only MR-long repeat of each genome exhibits a non-trivial structure (that is, it differs from polymers of a single nucleotide or similar patterns), and complex repeats are obtained for many lengths.
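
The elongation procedure described above can be sketched as follows. This is only an illustrative reading of it, not the authors' implementation: the function names, the fallback to shorter seeds when no seed-length repeat exists, and the choice to count overlapping occurrences are all assumptions made here.

```python
from collections import defaultdict


def repeat_positions(genome, k, candidates):
    """Start positions (taken from candidates) whose k-long word occurs
    at least twice among those positions."""
    groups = defaultdict(list)
    for i in candidates:
        if i + k <= len(genome):
            groups[genome[i:i + k]].append(i)
    return [i for positions in groups.values() if len(positions) > 1
            for i in positions]


def maximal_repeat_length(genome, seed_k=18):
    """Length of the longest word occurring more than once in genome.

    Starts from the repeats of length seed_k and elongates them one
    symbol at a time. Every (k+1)-long repeat starts where a k-long
    repeat starts, so only surviving positions need to be re-examined.
    Overlapping occurrences are counted (an assumption of this sketch).
    """
    k = seed_k
    candidates = repeat_positions(genome, k, range(len(genome)))
    # Assumed fallback: shrink the seed if no repeat of that length exists.
    while not candidates and k > 1:
        k -= 1
        candidates = repeat_positions(genome, k, range(len(genome)))
    if not candidates:
        return 0  # every symbol occurs at most once
    while True:
        longer = repeat_positions(genome, k + 1, candidates)
        if not longer:
            return k
        candidates, k = longer, k + 1
```

On the toy string `ACGTACGTTTGCA` the only 4-long repeat is `ACGT` and no 5-long word is repeated, so `maximal_repeat_length("ACGTACGTTTGCA", seed_k=2)` evaluates to 4.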

Word repeatability is crucial to understanding the information content of texts. A genome analysis in terms of (shortest) hapaxes and (maximal) repeats, providing their relative distribution within the genome, highlights the associative nature of DNA as a container of information. The localization and frequency of specific DNA fragments are indeed crucial to understanding the information organization of genomes. Hapaxes, which occur only once in the genome, by their nature act as addresses for the specific retrieval of functional elements, which are characterized by redundancy and repeatability. On the other hand, an important characterization of repeats may be given by means of their internal structure, that is, by the non-maximal repeats which compose them. These represent a second-level repeatability, possibly exhibiting various and rich genomic structural properties of functional sequences (such as the presence of power strings).
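
To make the idea of second-level repeatability concrete, here is a small sketch that inspects the internal structure of a single repeat word. It rests on two assumptions not stated in the text: "internal repeats" are taken to be fixed-length substrings occurring more than once inside the word, and a "power string" is read as a word of the form $u^n$ with $n \geq 2$. Both function names are hypothetical.

```python
from collections import Counter


def internal_repeats(word, k):
    """k-long substrings occurring more than once inside word,
    mapped to their multiplicity within the word."""
    counts = Counter(word[i:i + k] for i in range(len(word) - k + 1))
    return {w: c for w, c in counts.items() if c > 1}


def power_decomposition(word):
    """Return (root, exponent) such that word == root * exponent,
    with the shortest possible root.

    Uses the standard rotation trick: the first position after 0 at
    which word reappears in word + word is its smallest period that
    divides len(word).
    """
    if not word:
        return "", 0
    period = (word + word).find(word, 1)
    return word[:period], len(word) // period
```

Under these assumptions, `power_decomposition("ATATAT")` gives `("AT", 3)`, while a word with no such structure, like `ACGT`, comes back with exponent 1.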

Indexes, dictionaries, and tables given in this section identify a kernel of about 20 basic concepts, and many other notions may be derived from them. Namely, for any numerical index $I_k$ with parameter $k$, the distribution $k \mapsto I_k(G)$ can be defined, and its classical statistical parameters (mean, standard deviation, median, mode, etc.) may be derived as further indexes (the same possibility holds for the multiplicity-comultiplicity factor distribution). Moreover, extending Shannon's notion of typical sequence in information theory, for any index $I$, a minimal $I$-typical sequence for a given genome $G$ is a portion of $G$ such that the index $I$, restricted to this portion, assumes (approximates) the same value which $I$ assumes over the whole $G$. The length and number of these sequences are other genomic indexes. Assessing the power of some indexes in characterizing properties that are relevant in specific contexts is a kind of research requiring computational experiments, mathematical analyses, and biological interpretations and comparisons.
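
As a rough illustration of how the distribution of an index over its parameter $k$ yields further indexes, the sketch below tabulates a sample index for a range of $k$ and summarizes the resulting values. The index chosen here (number of distinct repeated $k$-long words) is only an example, and treating the values as a plain sample with population standard deviation is one possible reading of "statistical parameters of the distribution", assumed for this sketch.

```python
import statistics
from collections import Counter


def repeat_count(genome, k):
    """Example index: number of distinct k-long words occurring
    more than once in genome (chosen here for illustration only)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return sum(1 for c in counts.values() if c > 1)


def index_distribution(genome, index, ks):
    """Tabulate the value of index(genome, k) for every k in ks."""
    return {k: index(genome, k) for k in ks}


def summarize(distribution):
    """Classical statistical parameters of the tabulated values."""
    values = list(distribution.values())
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }
```

For example, `index_distribution("ACGTACGTTTGCA", repeat_count, range(1, 6))` gives `{1: 4, 2: 4, 3: 2, 4: 1, 5: 0}`, whose median is 2 and whose mode is 4.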

The bipartition of a genomic dictionary into hapax and repeat words emphasizes the roots of precise string categories which are related to the functional organization of genomes. The set of 18-repeats in our genomes has a size which is a couple of orders of magnitude smaller than the whole genome, and it seems to have a role of “lexical” coding. Other elements, with a notably larger size, seem to have a role of addressing, delimiting, and coordinating, just like position-identification tags. While the lexical nature of repeated elements points out their semantic value, the “relative localization” nature of the others gives importance to their unrepeatability along the genomic sequence.
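
The bipartition of a dictionary of $k$-long words into hapaxes and repeats can be sketched in a few lines. The function name and the return shape are assumptions made for illustration; for the genomes discussed here one would take $k = 18$, and a real genome would call for a more memory-conscious data structure than an in-memory counter.

```python
from collections import Counter


def bipartition(genome, k):
    """Split the k-long words of genome into hapaxes (occurring once)
    and repeats (occurring more than once, with their multiplicity)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    hapaxes = {w for w, c in counts.items() if c == 1}
    repeats = {w: c for w, c in counts.items() if c > 1}
    return hapaxes, repeats
```

On the toy string `ACGTACGTTTGCA` with `k=4`, the only repeat is `ACGT` (multiplicity 2) and the remaining eight distinct words are hapaxes.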
