Information Technology Reference
In-Depth Information
Dictionary Based Indexes
We denote by D
the set of all factors of a genome G , while we call k - genomic
dictionary of G (for some k
(
G
)
≤|
G
|
), denoted by D k (
G
)
, the set of all the k -long sub-
strings of genome G . Starting from dictionary D k (
)
,the k - genomic table T k (
)
,
which mathematically corresponds to a multiset , is defined by equipping the words
of D k (
G
G
G
)
with their multiplicities , that is, the number of their respective occurrences
in G .Let
α (
G
)
denote the multiplicity of
α
in a genome G ,and pos G ( α )
give the set
of positions of
α
in G (that is, the positions where the first symbol of
α
is placed).
Of course, it holds
may be represented by a
list of associations of strings to their corresponding multiplicities:
α (
G
)= |
pos G ( α ) |
. The table T k (
G
)
α α (
G
)
, with
α
D k (
G
) .
The sum of all the multiplicities of elements in D k (
G
)
is called size of
T k (
G
)
, denoted by
|
T k (
G
) |
. It is easy to realize that:
|
T k (
G
) | = |
G
|−
k
+
1
.
In general, the multiset T k (
associated to the genome G does not univocally indi-
viduate G . In fact, let us assume that G has the following string structure:
G
)
G
= −−− γ 1 −−−−
xxxxx
−−−− γ 2 −−−− γ 1 −−−−
yyyyy
−−−− γ 2 −−−
Now, if we exchange the two fragments included between
γ 2 and if their lengths
are equal to, or longer than k , the resulting genome G is such that T k (
γ 1 and
G )
,
because the k -factors of these two genomes are the same. In fact, the k strings which
occur internally in
G
)=
T k (
−−−− γ 2
do not depend on the positions of these strings, while those which are partially inside
and partially outside to the (left and right) borders depend on the k
γ 1 −−−−
xxxxx
−−−−− γ 2 and in
γ 1 −−−−
yyyyy
1- contexts ,that
is, the strings of length k
1 which they have on the right and on the left. But, these
contexts in this case are exactly the same, because
k .
We say that two genomes G 1 and G 2 are multiset k -equivalent when T k (
| γ 1 |≥
k and
| γ 2 |≥
G 1 )=
T k (
G 2 ) .
Given a dictionary D of a genome G ,the Multiplicity-Comultiplicity distribu-
tion MC , relative to D and G , may be defined by means of a graphical profile, where
in the abscissa the multiplicities are given, in increasing order (0, 1, 2, ...), and in
the ordinate the number of words of D having a given multiplicity of occurrence in
G is indicated.
All the typical parameters of distributions (mean value, standard deviation, me-
dian, mode,...) also determinespecific values of distribution MC .
The same information of a multiplicity-comultiplicity distribution may be ex-
pressed as a rank-multiplicity Zipf map (usually employed to study word frequen-
cies in natural languages). Zipf's distributions have in the abscissa the words in
decreasing order of frequency (in alphabetical order when they have the same fre-
quency), say this order rank, and in the ordinate the value of frequency associated to
a rank. Zipf's curves prove to be sensibly different for different genomes, but in all
Search WWH ::




Custom Search