Strings and Genomes - Infobiotics: Information in Biotic Systems

$$E_k(G) = -\sum_{j \ge 1} \frac{n_j}{n} \log \frac{n_j}{n}.$$

The k-co-entropy $\mathrm{coE}_k(G)$ is a measure of the distance from the maximum value of the empirical k-entropy of a genome with the same dictionary $D_k(G)$:

$$\mathrm{coE}_k(G) = \log(|D_k(G)|) - E_k(G). \tag{2.7}$$
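As a rough illustration of the two quantities, the sketch below computes an empirical k-entropy and the corresponding co-entropy for a short string. It assumes that the entropy is the Shannon entropy of the relative frequencies of the overlapping k-factors, that logarithms are taken in base 2, and that the maximum entropy for a given dictionary is the logarithm of its size. The function names `k_factors`, `k_entropy` and `k_co_entropy` are hypothetical and not taken from the text.

```python
from collections import Counter
from math import log2


def k_factors(genome: str, k: int) -> list:
    """Return all overlapping k-long substrings of genome, repeats included."""
    return [genome[i:i + k] for i in range(len(genome) - k + 1)]


def k_entropy(genome: str, k: int) -> float:
    """Shannon entropy (base 2, an assumption) of the k-factor frequencies."""
    counts = Counter(k_factors(genome, k))
    n = sum(counts.values())  # total number of k-factor occurrences
    return -sum((c / n) * log2(c / n) for c in counts.values())


def k_co_entropy(genome: str, k: int) -> float:
    """Gap between the maximum entropy log2|D_k(G)| and the empirical entropy."""
    dictionary_size = len(set(k_factors(genome, k)))
    return log2(dictionary_size) - k_entropy(genome, k)


if __name__ == "__main__":
    for g in ("ACGT", "AAAC"):
        print(g, k_entropy(g, 1), k_co_entropy(g, 1))
```

Under these assumptions the co-entropy is zero exactly when all the k-factors in the dictionary occur equally often, and it grows as the frequencies become more uneven.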

On the basis of genomic k-entropy, many other genomic concepts can be developed that could be highly relevant to the analysis of genomes, such as mutual information or entropic divergence.

Another related index for a genome $G$ is given by its k-lexicality, that is, the ratio

$$|D_k(G)| / |T_k(G)| \tag{2.8}$$

which expresses the percentage of distinct k-factors of $G$ with respect to all the k-factors present in $G$. It is clear that the k-lexicality increases with the word length k, and does not exhibit any regularity with the genome length. Of course, the inverse of this ratio provides an average repeatability of the k-factors of $G$.
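A minimal sketch of the k-lexicality and of its inverse follows. It assumes that $T_k(G)$ collects every occurrence of a k-factor in $G$ (so that $|T_k(G)| = |G| - k + 1$) and that $D_k(G)$ is the set of distinct k-factors; the function names are hypothetical.

```python
def k_lexicality(genome: str, k: int) -> float:
    """Ratio of distinct k-factors to all k-factor occurrences in genome."""
    factors = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    if not factors:
        raise ValueError("k must be between 1 and the genome length")
    return len(set(factors)) / len(factors)


def average_repeatability(genome: str, k: int) -> float:
    """Inverse of the k-lexicality: mean number of occurrences per distinct k-factor."""
    return 1.0 / k_lexicality(genome, k)


if __name__ == "__main__":
    g = "ACACAC"
    # The 2-factors are AC, CA, AC, CA, AC: 2 distinct out of 5 occurrences.
    print(k_lexicality(g, 2), average_repeatability(g, 2))
```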

When $D_k(G) = \Gamma^k$, we say that a genome $G$ is k-complete, meaning that all the possible genomic k-long strings occur (at least once) in $G$. If $G$ is not k-complete, a non-empty set $F_k(G)$ of "non-appearing", say forbidden k-words (also called "nullomers" [34]), is given by the difference of sets:

$$F_k(G) = \Gamma^k \setminus D_k(G). \tag{2.9}$$
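The set difference in (2.9) can be sketched directly for small k. The example assumes the genomic alphabet is {A, C, G, T} and enumerates all $4^k$ words explicitly, which is only practical for small k; the names `dictionary`, `forbidden_words` and `is_k_complete` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed genomic alphabet


def dictionary(genome: str, k: int) -> set:
    """Set of distinct k-factors occurring in genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def forbidden_words(genome: str, k: int) -> set:
    """All k-words over the alphabet that never occur in genome."""
    all_words = {"".join(w) for w in product(ALPHABET, repeat=k)}
    return all_words - dictionary(genome, k)


def is_k_complete(genome: str, k: int) -> bool:
    """True when every possible k-word occurs at least once in genome."""
    return not forbidden_words(genome, k)


if __name__ == "__main__":
    print(sorted(forbidden_words("AACG", 1)))
    print(is_k_complete("ACGT", 1), is_k_complete("ACGT", 2))
```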

Of course, genomic k-completeness is related to the genome length. In fact, it is easy to find a length at which k-complete genomes certainly exist: we can construct one by concatenating, in any order, all the $4^k$ k-mers. This construction yields $(4^k)!$ k-complete genomes of length $k \cdot 4^k$. The search for the minimum length providing genomic k-completeness, and for algorithms constructing such minimal genomes, is a non-trivial theoretical investigation of some possible interest.
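The concatenation argument can be checked on small cases with the sketch below. For comparison it also builds a linearized de Bruijn sequence, a standard combinatorial construction that is not taken from the source: any k-complete string needs at least $4^k$ distinct windows, hence at least $4^k + k - 1$ symbols, and a de Bruijn sequence reaches that bound when biological constraints are ignored. The alphabet {A, C, G, T} and all function names are assumptions made for illustration.

```python
import random
from itertools import product

ALPHABET = "ACGT"  # assumed genomic alphabet


def is_k_complete(genome: str, k: int) -> bool:
    """True when all 4**k possible k-words occur in genome."""
    seen = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    return len(seen) == len(ALPHABET) ** k


def concatenated_complete_genome(k: int, seed=None) -> str:
    """Concatenate all k-mers; any order works, a seed picks a shuffled one."""
    words = ["".join(w) for w in product(ALPHABET, repeat=k)]
    if seed is not None:
        random.Random(seed).shuffle(words)
    return "".join(words)  # length k * 4**k


def de_bruijn_genome(k: int) -> str:
    """Linearized de Bruijn sequence of order k (Lyndon-word construction)."""
    size = len(ALPHABET)
    a = [0] * (size * k)
    cyclic = []

    def db(t: int, p: int) -> None:
        if t > k:
            if k % p == 0:
                cyclic.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, size):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    text = "".join(ALPHABET[i] for i in cyclic)
    # Append the first k-1 symbols so the wrap-around windows also appear.
    return text + text[:k - 1]


if __name__ == "__main__":
    for k in (1, 2, 3):
        g = concatenated_complete_genome(k, seed=0)
        d = de_bruijn_genome(k)
        print(k, len(g), is_k_complete(g, k), len(d), is_k_complete(d, k))
```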

For each $G$, we can define its minimal forbidden length, denoted by $\mathrm{MF}(G)$, as the minimum k such that $G$ is not k-complete.

The cardinality of $F_k(G)$, for k greater than $\mathrm{MF}(G)$ and within a small range, seems to be a very specific feature of each genome. It is indeed remarkable that in all the genomes we considered $\mathrm{MF}(G)$ is at least 6 and below 12 (see Table 2.15), and it does not appear to be directly related to the biological complexity of the corresponding organisms. Another interesting characteristic of genomes is the factor length selectivity $\mathrm{LS}(G)$, which expresses the gap between the length of the factors that could in principle all be accommodated in a genome $G$ and the length of those that are actually present in $G$ (according to the value of its minimal forbidden length):

$$\mathrm{LS}(G) = \lg_4(|G|) - (\mathrm{MF}(G) - 1) \tag{2.10}$$
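A small sketch of both indices follows, assuming that the selectivity is read as $\mathrm{LS}(G) = \lg_4(|G|) - (\mathrm{MF}(G) - 1)$, that the alphabet is {A, C, G, T}, and that the input contains only those four symbols. The function names are hypothetical. The search for $\mathrm{MF}(G)$ simply tries k = 1, 2, ... and stops at the first k for which fewer than $4^k$ distinct k-factors occur.

```python
from math import log

ALPHABET_SIZE = 4  # assumed alphabet {A, C, G, T}


def minimal_forbidden_length(genome: str) -> int:
    """Smallest k for which some k-word over the alphabet is missing from genome."""
    k = 1
    while True:
        distinct = {genome[i:i + k] for i in range(len(genome) - k + 1)}
        if len(distinct) < ALPHABET_SIZE ** k:
            return k
        k += 1


def length_selectivity(genome: str) -> float:
    """log_4 of the genome length minus the longest fully represented factor length."""
    return log(len(genome), ALPHABET_SIZE) - (minimal_forbidden_length(genome) - 1)


if __name__ == "__main__":
    for g in ("ACGT", "AACG", "ACGTACGTACGTACGT"):
        print(g, minimal_forbidden_length(g), length_selectivity(g))
```

The loop always terminates, because the number of k-factors in a genome of length n is at most n - k + 1, which eventually falls below $4^k$.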
