Information Technology Reference
In-Depth Information
Dictionary Based Indexes
We denote by
D
the set of all factors of a genome
G
, while we call
k
-
genomic
dictionary of G
(for some
k
(
G
)
≤|
G
|
), denoted by
D
k
(
G
)
, the set of all the
k
-long sub-
strings of genome
G
. Starting from dictionary
D
k
(
)
,the
k
-
genomic table
T
k
(
)
,
which mathematically corresponds to a
multiset
, is defined by equipping the words
of
D
k
(
G
G
G
)
with their
multiplicities
, that is, the number of their respective occurrences
in
G
.Let
α
(
G
)
denote the multiplicity of
α
in a genome
G
,and
pos
G
(
α
)
give the set
of positions of
α
in
G
(that is, the positions where the first symbol of
α
is placed).
Of course, it holds
may be represented by a
list of associations of strings to their corresponding multiplicities:
α
(
G
)=
|
pos
G
(
α
)
|
. The table
T
k
(
G
)
α
→
α
(
G
)
, with
α
∈
D
k
(
G
)
.
The sum of all the multiplicities of elements in
D
k
(
G
)
is called
size
of
T
k
(
G
)
, denoted by
|
T
k
(
G
)
|
. It is easy to realize that:
|
T
k
(
G
)
|
=
|
G
|−
k
+
1
.
In general, the multiset
T
k
(
associated to the genome
G
does not univocally indi-
viduate
G
. In fact, let us assume that
G
has the following string structure:
G
)
G
=
−−−
γ
1
−−−−
xxxxx
−−−−
γ
2
−−−−
γ
1
−−−−
yyyyy
−−−−
γ
2
−−−
Now, if we exchange the two fragments included between
γ
2
and if their lengths
are equal to, or longer than
k
, the resulting genome
G
is such that
T
k
(
γ
1
and
G
)
,
because the
k
-factors of these two genomes are the same. In fact, the
k
strings which
occur internally in
G
)=
T
k
(
−−−−
γ
2
do not depend on the positions of these strings, while those which are partially inside
and partially outside to the (left and right) borders depend on the
k
γ
1
−−−−
xxxxx
−−−−−
γ
2
and in
γ
1
−−−−
yyyyy
−
1-
contexts
,that
is, the strings of length
k
1 which they have on the right and on the left. But, these
contexts in this case are exactly the same, because
−
k
.
We say that two genomes
G
1
and
G
2
are multiset
k
-equivalent when
T
k
(
|
γ
1
|≥
k
and
|
γ
2
|≥
G
1
)=
T
k
(
G
2
)
.
Given a dictionary
D
of a genome
G
,the
Multiplicity-Comultiplicity
distribu-
tion
MC
, relative to
D
and
G
, may be defined by means of a graphical profile, where
in the abscissa the multiplicities are given, in increasing order (0, 1, 2, ...), and in
the ordinate the number of words of
D
having a given multiplicity of occurrence in
G
is indicated.
All the typical parameters of distributions (mean value, standard deviation, me-
dian, mode,...) also determinespecific values of distribution
MC
.
The same information of a multiplicity-comultiplicity distribution may be ex-
pressed as a
rank-multiplicity
Zipf map (usually employed to study word frequen-
cies in natural languages). Zipf's distributions have in the abscissa the words in
decreasing order of frequency (in alphabetical order when they have the same fre-
quency), say this order rank, and in the ordinate the value of frequency associated to
a rank. Zipf's curves prove to be sensibly different for different genomes, but in all