Information Technology Reference
In-Depth Information
35
30
IBM
IBM_Opt
IBM2_Opt
IBM2
Pr e f ix Span
SPA M
25
20
15
10
5
0
Support
Fig. 8.10. Performances for a dataset with 130,000 rows and 20 distinct items
25
20
IBM2_Opt
IBM2
PrefixSpan
SPAM
15
10
5
0
Support
Fig. 8.11. Performances for a dataset with 100,000 rows and 35 distinct items
For more than 35 distinct items, PrefixSPAN becomes the faster, but this is true
only for this large database size. Indeed, an interesting result has been obtained on a
smaller database which has a larger alphabet (of 75 items).
This test (see Fig. 8.12) used the public dataset chess.dat , provided notably in the
UCI KDD archive. The dataset is composed of 75 distinct items with 3,196 sequences
of size between 36 to 37 items. The generation time of the structure is insignificant.
Although the number of distinct items is greater than 35 (75 here), IBM and IBM
outperform PrefixSPAN (see Fig. 8.12). The main reason is the size of sequences,
which is larger (from 36 to 37) compared to the synthetic datasets. Thus height of the
tree representing the projected database in PrefixSPAN is greater than for the syn-
thetic data. Therefore, traversing this projected data in PrefixSPAN is slower than
scanning our IBM data structure The memory space required by IBM, IBM and Pre-
fixSPAN are respectively equal to 2.3 MB, 0.29 MB and 9 MB. Nevertheless, the SV
is only composed of 91 items. Concerning the storage cost, we have measured the
memory resource consumption for each algorithm for each tested dataset.
Search WWH ::




Custom Search