Information Technology Reference
In-Depth Information
Table 2.17 Indexes related to D_18 dictionaries

Genomic sequences              |D_18|  L_18       |H_18|     |R_18|  RD_18   HR_18  AR_18
Nanoarchaeum equitans         489,465  0.99      488,802        663  0.001  737.25   3.11
Mycoplasma genitalium         569,202  0.98      563,045      6,157   0.01   91.44   2.76
Mycoplasma mycoides           987,645  0.81      913,599     74,046   0.07   12.33  4.025
Haemophilus influenzae      1,795,492  0.98    1,775,531     19,964   0.01   88.93   2.64
Escherichia coli            4,557,590  0.98    4,518,585     39,005  0.008  115.84   3.10
Pseudomonas aeruginosa      6,183,215  0.98    6,117,968     65,247   0.01   93.76   2.24
Saccharomyces cerevisiae   11,499,795  0.95   11,307,098    192,697   0.01   58.67   3.96
Sorangium cellulosum       12,640,960  0.96   12,340,846    300,114   0.02   41.12   2.30
Homo sapiens chr19         41,529,106  0.75   39,256,297  2,272,809   0.05   17.27   6.91
C. elegans                 89,444,661  0.89   85,157,627  4,287,034   0.04   19.86   3.52
D. melanogaster           116,446,627  0.90  112,977,046  3,469,581   0.02   32.56   4.45
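
The RD_18 and HR_18 columns are consistent with the ratios |R_18|/|D_18| and |H_18|/|R_18| (for Nanoarchaeum equitans, 663/489,465 is about 0.001 and 488,802/663 is about 737). The following is a minimal Python sketch of how such indexes could be computed for a sequence held in memory. It assumes that D_k is the set of distinct k-words, H_k the words occurring exactly once, and R_k the words occurring at least twice. The function name dictionary_indexes is hypothetical and is not taken from the text.

```python
from collections import Counter


def dictionary_indexes(genome: str, k: int) -> dict:
    """Count distinct k-words, hapaxes and repeats of a sequence.

    Assumed definitions (inferred from the table, not stated in the text):
      D = distinct k-words, H = words seen exactly once,
      R = words seen at least twice, RD = |R|/|D|, HR = |H|/|R|.
    """
    # Multiplicity of every k-word, scanning all overlapping windows.
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    hapaxes = sum(1 for c in counts.values() if c == 1)
    repeats = len(counts) - hapaxes
    return {
        "D": len(counts),
        "H": hapaxes,
        "R": repeats,
        "RD": repeats / len(counts) if counts else float("nan"),
        "HR": hapaxes / repeats if repeats else float("inf"),
    }
```

For example, dictionary_indexes("ACGTACGA", 3) finds five distinct 3-words, of which only ACG is a repeat. A Counter over all windows is memory-hungry for genomes of the sizes listed above, so this is meant only to make the definitions concrete.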
Table 2.18 MR index, positions of the only twice repeating word of length MR, and relative
distance between the two occurrences (with respect to the genome lengths)

Genomic sequences             MR   MD_MR/|G|
Nanoarchaeum equitans        139      96.95%
Mycoplasma genitalium        243       0.15%
Mycoplasma mycoides       10,963      0.019%
Haemophilus influenzae     5,563       8.05%
Escherichia coli           2,815       0.89%
Pseudomonas aeruginosa     5,304      12.37%
Saccharomyces cerevisiae   8,375       0.07%
Sorangium cellulosum       2,720      27.68%
Homo sapiens chr19         2,247       0.02%
C. elegans                38,987       0.10%
D. melanogaster           30,892       0.02%
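
As an illustration of the quantities in Table 2.18, the sketch below searches for the longest repeated word of a sequence and reports the distance between its first two occurrences relative to the sequence length. It is a simple binary search over the word length and is not the procedure used in the text. The names find_repeat and maximal_repeat are hypothetical, and taking the distance between start positions is an assumption about how MD_MR is defined.

```python
def find_repeat(genome: str, k: int):
    """Return start positions (i, j) of the first k-word seen twice, else None."""
    seen = {}
    for i in range(len(genome) - k + 1):
        word = genome[i:i + k]
        if word in seen:
            return seen[word], i
        seen[word] = i
    return None


def maximal_repeat(genome: str):
    """Return (MR, (i, j), relative distance) for the longest repeated word.

    Relies on monotonicity: if some word of length m repeats, its prefix of
    length m - 1 repeats too, so binary search over the length is valid.
    Overlapping occurrences are counted as repeats.
    """
    lo, hi = 0, len(genome) - 1  # a repeat of length lo is known to exist
    best = None
    while lo < hi:
        mid = (lo + hi + 1) // 2
        hit = find_repeat(genome, mid)
        if hit:
            lo, best = mid, hit
        else:
            hi = mid - 1
    if best is None:
        return 0, None, None
    i, j = best
    # Assumed definition: distance between the two start positions over |G|.
    return lo, (i, j), (j - i) / len(genome)
```

For instance, maximal_repeat("ACGTACGA") reports a longest repeat of length 3 (ACG, at positions 0 and 4). On real genomes a suffix array or suffix tree would be the usual choice, since this version stores every k-word of each tested length.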
The phenomenon regarding the statistical distribution of hapaxes may be observed
when passing from 12- to 18-genomic dictionaries (see Tables 2.14, 2.16, and 2.17).
For all the genomes, as the value of k grows, the number of hapaxes increases, even
relative to the number of repeats (roughly speaking, “most of the 12-words are
repeats while most of the 18-words are hapaxes”). Indeed, by computing HR_k, we see
that, in general, repeatability almost disappears for k = 18, with respect to the
number of hapaxes.
More interestingly, the (relative) amount of hapaxes increases by some orders of
magnitude with k passing from 12 to 18. Based on this observation coming from
computational analyses, one could suppose that, by increasing the word size,
genomic dictionaries composed only of hapaxes may be computed. This intuition
has been invalidated (see Table 2.18). In fact, repeats having lengths of several
thousands have been found within each of our genomes, and 12
=
18 represents
 