Information Technology Reference
In-Depth Information
Finally, this experiment shows that IBM_OPT (respectively IBM2_OPT ) outperform
IBM (respectively IBM2 ). The performance gain (materialized by the gap between the
corresponding curves) remains constant. Indeed, this difference corresponds to the
saved time of the data structure construction and it is independent of the support
threshold. For the dataset composed of 300,000 sequences, we observe the same
behavior than in the previous experiment (Fig. 8.7). The size of SV is about 219 with
50,000 distinct sequences in the dataset and the structure is generated in 7.5 seconds.
For a dataset composed of 600,000 sequences (Fig. 8.8), SPAM algorithm produced a
memory overflow and failed. The size of SV is composed of 265 items with 90,000
distinct sequences in the dataset and the time to generate the structure is about 22.6
seconds. We observe that PrefixSPAN shows better performances than IBM.
IBM_OPT and IBM2_OPT still outperform PrefixSPAN . IBM2 remains better than
IBM and than PrefixSPAN, especially for a support bellow 0.3.
For a dataset composed of 1,000,000 sequences, the size of SV is about 370 with
160,000 distinct sequences in the dataset and the time to generate the structure is
about 80 seconds. PrefixSPAN consolidates its performance compared to IBM .
IBM_OPT always outperforms PrefixSPAN . In general, IBM shows better
performances than PrefixSPAN , unless the support is high. Conversely, IBM_OPT
outperforms PrefixSPAN when the support is greater than 0.1. In order to measure the
impact of the alphabet size on the performances, we have generated datasets using a
larger alphabet (i.e. with more than 20 distinct items). Fig.. 8.10 and 8.11 show the
results for 20 and 35 distinct items for dataset composed of respectively 130,000 and
100,000 sequences. For 20 distinct items, SV is composed of 192 items with 26,000
distinct sequences in the dataset. For the dataset composed of 35 distinct items, SV is
composed of 214 items with 32,000 distinct sequences. We observe that PrefixSPAN
outperforms IBM_OPT and IBM, but IBM and IBM_OPT win PrefixSPAN . Until 35
distinct items, IBM_OPT wins PrefixSPAN , with a support lower than 0.2. But,
PrefixSPAN outperforms IBM due to the generation of the structure.
1,000,000
700
IB M2 _ Op t
IB M_ Op t
Pr e f i x Span
IB M2
IB M
600
500
400
300
200
100
0
Support
Fig. 8.9. Performances with 1,000,000 rows
Search WWH ::




Custom Search