Strings and Genomes - Infobiotics: Information in Biotic Systems

Dictionary Based Indexes

We denote by $D(G)$ the set of all factors of a genome $G$, while we call the k-genomic dictionary of $G$ (for some $k \le |G|$), denoted by $D_k(G)$, the set of all the k-long substrings of genome $G$. Starting from the dictionary $D_k(G)$, the k-genomic table $T_k(G)$, which mathematically corresponds to a multiset, is defined by equipping the words of $D_k(G)$ with their multiplicities, that is, the number of their respective occurrences in $G$. Let $\alpha(G)$ denote the multiplicity of $\alpha$ in a genome $G$, and let $\mathrm{pos}_G(\alpha)$ give the set of positions of $\alpha$ in $G$ (that is, the positions where the first symbol of $\alpha$ is placed). Of course, it holds that $\alpha(G) = |\mathrm{pos}_G(\alpha)|$. The table $T_k(G)$ may be represented by a list of associations of strings to their corresponding multiplicities: $\alpha \to \alpha(G)$, with $\alpha \in D_k(G)$.

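The sum of all the multiplicities of elements in $D_k(G)$ is called the size of $T_k(G)$, denoted by $|T_k(G)|$. Since exactly one k-factor starts at each of the first $|G| - k + 1$ positions of $G$, it is easy to realize that:

$$|T_k(G)| = |G| - k + 1$$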
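The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `k_table` and `k_positions`, the toy string, and the use of 0-based positions are assumptions made here, not part of the original text.

```python
from collections import Counter, defaultdict


def k_table(genome: str, k: int) -> Counter:
    """Return the k-genomic table: each k-long substring mapped to its multiplicity."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def k_positions(genome: str, k: int) -> dict:
    """Return, for every k-factor, the list of 0-based positions of its first symbol."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return dict(positions)


genome = "ACGTACGA"                  # toy string, not a real genome
k = 3
table = k_table(genome, k)           # plays the role of T_k(G)
dictionary = set(table)              # plays the role of D_k(G)
positions = k_positions(genome, k)   # plays the role of pos_G

# The multiplicity of a word equals the number of positions where it occurs.
assert all(table[w] == len(positions[w]) for w in dictionary)
# The size of the table equals |G| - k + 1.
assert sum(table.values()) == len(genome) - k + 1
```

A `Counter` is used because it behaves like a multiset: the keys form the dictionary and the values are the multiplicities.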

In general, the multiset $T_k(G)$ associated with the genome $G$ does not uniquely identify $G$. In fact, let us assume that $G$ has the following string structure:

G = −−− γ1 −−−− xxxxx −−−− γ2 −−−− γ1 −−−− yyyyy −−−− γ2 −−−

Now, if we exchange the two fragments included between γ1 and γ2, and if the lengths of γ1 and γ2 are equal to or greater than $k$, the resulting genome $G'$ is such that $T_k(G) = T_k(G')$, because the k-factors of these two genomes are the same. In fact, the k-strings which occur internally in γ1 −−−− xxxxx −−−− γ2 and in γ1 −−−− yyyyy −−−− γ2 do not depend on the positions of these strings, while those which lie partially inside and partially outside the (left and right) borders depend on the (k − 1)-contexts, that is, the strings of length $k - 1$ which they have on the right and on the left. But these contexts are in this case exactly the same, because $|\gamma_1| \ge k$ and $|\gamma_2| \ge k$.

We say that two genomes $G_1$ and $G_2$ are multiset k-equivalent when $T_k(G_1) = T_k(G_2)$.
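The exchange argument can be checked on a small example. The sketch below is a hypothetical illustration: the helper names `k_table` and `build` and all the toy fragments are assumptions chosen here. It builds two different strings that differ only by swapping the fragments enclosed between two repeated flanks, and compares their k-genomic tables.

```python
from collections import Counter


def k_table(genome: str, k: int) -> Counter:
    """Return the k-genomic table: each k-long substring mapped to its multiplicity."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def build(prefix, g1, first, g2, middle, second, suffix):
    """Assemble prefix + g1 + first + g2 + middle + g1 + second + g2 + suffix."""
    return prefix + g1 + first + g2 + middle + g1 + second + g2 + suffix


g1, g2 = "ACG", "TTA"        # repeated flanks (toy values), both of length 3
x, y = "CCCCC", "GGAGG"      # the two fragments to be exchanged (toy values)

original = build("TG", g1, x, g2, "CAT", y, "GC")
swapped = build("TG", g1, y, g2, "CAT", x, "GC")

# The two strings differ, yet with k = 3 (flanks at least k long)
# they have the same table, so they are multiset 3-equivalent.
assert original != swapped
assert k_table(original, 3) == k_table(swapped, 3)

# For this toy input the tables differ once k exceeds the flank
# length enough for a k-factor to see across a whole flank.
assert k_table(original, 5) != k_table(swapped, 5)
```

The last assertion concerns this particular input only; it illustrates why the length condition on the flanks matters, and is not a general statement about every pair of genomes.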

Given a dictionary $D$ of a genome $G$, the Multiplicity-Comultiplicity distribution MC, relative to $D$ and $G$, may be defined by means of a graphical profile in which the abscissa gives the multiplicities in increasing order (0, 1, 2, ...) and the ordinate gives the number of words of $D$ having a given multiplicity of occurrence in $G$.

All the typical parameters of distributions (mean value, standard deviation, median, mode, ...) also take specific values for the distribution MC.
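As a minimal sketch of the distribution MC, assume the dictionary is the set of k-long substrings of a toy string. The names `k_table` and `mc_distribution` are hypothetical, and the summary statistics come from Python's standard `statistics` module.

```python
from collections import Counter
import statistics


def k_table(genome: str, k: int) -> Counter:
    """Return the k-genomic table: each k-long substring mapped to its multiplicity."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def mc_distribution(table: Counter) -> dict:
    """Map each multiplicity to the number of words having that multiplicity."""
    comultiplicity = Counter(table.values())
    # Sort by multiplicity, matching the increasing order on the abscissa.
    return dict(sorted(comultiplicity.items()))


genome = "ACGTACGA"            # toy string, not a real genome
table = k_table(genome, 2)     # AC:2, CG:2, GT:1, TA:1, GA:1
mc = mc_distribution(table)    # multiplicity -> number of words

# Typical parameters, computed over the multiplicities of the words.
multiplicities = list(table.values())
mean_value = statistics.mean(multiplicities)
std_dev = statistics.pstdev(multiplicities)
median_value = statistics.median(multiplicities)
mode_value = statistics.mode(multiplicities)
```

Because every word here is taken from the string itself, no word has multiplicity 0; that column would only be populated for a dictionary containing words absent from the genome.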

The same information as a multiplicity-comultiplicity distribution may be expressed as a rank-multiplicity Zipf map (usually employed to study word frequencies in natural languages). Zipf's distributions have in the abscissa the words in decreasing order of frequency (in alphabetical order when they have the same frequency), the position in this order being called the rank, and in the ordinate the value of frequency associated with a rank. Zipf's curves prove to be appreciably different for different genomes, but in all
