We generate a probability for each event in a source sequence using a uniform distribution over (0, 1], thus obtaining a collection of p-sequences.
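This conversion is easy to reproduce. The sketch below (the helper name `to_p_sequence` is ours, not from the paper) turns a deterministic source sequence into a p-sequence by drawing one probability per event uniformly from (0, 1]:

```python
import random

def to_p_sequence(events, rng=random.Random(0)):
    """Attach a probability drawn uniformly from (0, 1] to each event,
    turning a deterministic source sequence into a p-sequence."""
    # rng.random() samples from [0, 1); 1 - u shifts this to (0, 1],
    # matching the open-closed interval used for event probabilities.
    return [(e, 1.0 - rng.random()) for e in events]

# Example: one source sequence from a synthetic dataset.
print(to_p_sequence(["a", "b", "a", "c"]))
```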
The real dataset Gazelle has 29,369 sequences and 35,722 events. For synthetic datasets, we follow the naming convention of [17]: a dataset named CiDjK means that the average number of events per source is i and the number of sources is j (in thousands). The alphabet size is 2K, and all other parameters are set to their defaults. For example, the dataset C10D20K has on average 10 events per source and 20K sources. We consider the following three parameters in our experiments: the number of sources D, the average number of events per source C, and the support threshold θ. We test our algorithms for increasing D, increasing C, and decreasing θ values, keeping the other two parameters fixed in each case.
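For concreteness, a tiny hypothetical helper that decodes this naming convention might look as follows:

```python
import re

def parse_dataset_name(name):
    """Decode a synthetic-dataset name such as 'C10D20K' into its parameters
    (hypothetical helper mirroring the naming convention of [17])."""
    m = re.fullmatch(r"C(\d+)D(\d+)K", name)
    avg_events, sources_in_k = int(m.group(1)), int(m.group(2))
    return {"avg_events_per_source": avg_events,
            "num_sources": sources_in_k * 1000}

print(parse_dataset_name("C10D20K"))
# {'avg_events_per_source': 10, 'num_sources': 20000}
```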
Scalability Testing. In the first set of experiments, we fix C = 10 and D = 10K, and test the scalability of the algorithms for decreasing θ values. We report our results for the synthetic dataset (C10D10K) in Fig. 1(L), and for the real dataset (Gazelle) in Fig. 1(R). It can be seen that PGA performs better than both candidate generate-and-test approaches on the real as well as the synthetic dataset. The performance difference is more pronounced for harder instances (at low θ values).
In another set of experiments, we test the scalability of the algorithms for increasing values of D by fixing C = 10 and θ = 1% (Fig. 2(L)), and for increasing values of C by fixing D = 10K and θ = 25% (Fig. 2(R)). It can be seen that PGA performs better than the other two algorithms for increasing D (Fig. 2(L)). However, we do not see improvements for increasing C (Fig. 2(R)); we are currently investigating the reasons for this behaviour. Note that DFS+P processes only one S- or I-extension of s at a time, whereas PGA processes all extensions of s simultaneously; the sketch below makes this difference concrete.
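The following minimal sketch is deliberately simplified: patterns are sequences of single items (so only S-extensions arise), and support is plain deterministic counting rather than the expected support the algorithms actually compute (see the DP sketch below). It shows the structural point: one scan of the prefix's projected database scores every extension at once, PGA-style, where a DFS+P-style search would evaluate one candidate extension at a time.

```python
from collections import defaultdict

def pga_grow(prefix, projected, min_sup, out):
    """Pattern-growth step: score ALL extensions of `prefix` in one pass
    over its projected database, then recurse only on the frequent ones."""
    # One scan collects the support of every candidate extension simultaneously,
    # whereas a DFS+P-style search would pick one candidate, compute its
    # support, recurse, and only then move on to the next candidate.
    ext_support = defaultdict(int)
    for seq, start in projected:
        for item in set(seq[start:]):      # each item counted once per source
            ext_support[item] += 1
    for item, sup in ext_support.items():
        if sup >= min_sup:
            pattern = prefix + [item]
            out.append((pattern, sup))
            # Project each source on the suffix after the first match of `item`.
            new_proj = []
            for seq, start in projected:
                try:
                    new_proj.append((seq, seq.index(item, start) + 1))
                except ValueError:
                    pass
            pga_grow(pattern, new_proj, min_sup, out)

db = [list("abcb"), list("acb"), list("abb")]
results = []
pga_grow([], [(seq, 0) for seq in db], min_sup=2, out=results)
print(results)
```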
We also kept statistics on the number of DP computations for each algorithm (Table 3). The datasets and support thresholds are the same as in Figs. 1 and 2. We observe that PGA consistently performs the fewest DP computations, as in PGA the B_{i,s} arrays are computed only for the frequent sequences. As noted in [13], candidate generation approaches suffer from an exponential number of candidates at low θ values; this cost is even higher in the probabilistic case because of the B_{i,s} arrays. Further, note that BFS, in contrast with PGA, does not need to keep the B_{i,s} arrays, and PGA therefore has additional memory needs.
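This excerpt does not reproduce the recurrence behind the B_{i,s} arrays, so the following is only a sketch of a standard formulation, under the assumption that each event in a p-sequence occurs independently with its stated probability (not necessarily the paper's exact semantics): B_{i,s}[j] gives the probability that s is a subsequence of the first j events of D_i, and the expected support of s is the sum of the final entries over all sources.

```python
def source_support_prob(pseq, s):
    """Pr[s occurs as a subsequence of p-sequence `pseq`], assuming each
    event (e_j, p_j) is present independently with probability p_j
    (a modelling assumption for this sketch). B[t] holds
    Pr[s[:t] is a subsequence of the first j events] as j advances,
    i.e. one entry of the B_{i,s} array per pattern prefix."""
    k = len(s)
    B = [1.0] + [0.0] * k              # j = 0: only the empty prefix matches
    for e, p in pseq:
        # Traverse t = k..1 so each event extends a given match at most once.
        for t in range(k, 0, -1):
            if e == s[t - 1]:
                # Event present (prob p): s[:t] matches if s[:t-1] matched before;
                # event absent (prob 1-p): the previous probability carries over.
                B[t] = p * B[t - 1] + (1.0 - p) * B[t]
    return B[k]

# Expected support of s = sum of source support probabilities.
db = [[("a", 0.9), ("b", 0.5)], [("a", 0.4), ("a", 0.8), ("b", 1.0)]]
print(round(sum(source_support_prob(pseq, ["a", "b"]) for pseq in db), 4))
# 0.45 + 0.88 = 1.33
```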
5 Conclusions and Future Work
We have considered the problem of finding all frequent sequences by pattern growth in SLU databases. We have evaluated PGA against the candidate generate-and-test approaches, and we observe that PGA performs better than the candidate generation approaches in general. The speedup in running time can be seen for the real dataset, in particular at low θ values, and also for the synthetic datasets when the source sequences are not very long or when θ is low. The statistics on the number of DP computations likewise show that PGA consistently performs the fewest DP computations. We conclude
 