We generate a probability for each event in a source sequence using a uniform distribution over (0, 1], thus obtaining a collection of p-sequences.
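This conversion is easy to reproduce. The sketch below (the helper name `to_p_sequence` is ours, not from the paper) turns a deterministic source sequence into a p-sequence by drawing one probability per event uniformly from (0, 1]:

```python
import random

def to_p_sequence(events, rng=random.Random(0)):
    """Attach a probability drawn uniformly from (0, 1] to each event,
    turning a deterministic source sequence into a p-sequence."""
    # rng.random() samples from [0, 1); 1 - u shifts this to (0, 1],
    # matching the open-closed interval used for event probabilities.
    return [(e, 1.0 - rng.random()) for e in events]

# Example: one source sequence from a synthetic dataset.
print(to_p_sequence(["a", "b", "a", "c"]))
```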
The real dataset Gazelle has 29,369 sequences and 35,722 events. For synthetic datasets, we follow the naming convention of [17]: a dataset named CiDjK means that the average number of events per source is i and the number of sources is j (in thousands). The alphabet size is 2K, and all other parameters are set to their defaults. For example, the dataset C10D20K has on average 10 events per source and 20K sources. We consider the following three parameters in our experiments: the number of sources D, the average number of events per source C, and the support threshold θ. We test our algorithms for increasing D, increasing C, and decreasing θ values, keeping the other two parameters fixed in each case.
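For concreteness, a tiny hypothetical helper that decodes this naming convention might look as follows:

```python
import re

def parse_dataset_name(name):
    """Decode a synthetic-dataset name such as 'C10D20K' into its parameters
    (hypothetical helper mirroring the naming convention of [17])."""
    m = re.fullmatch(r"C(\d+)D(\d+)K", name)
    avg_events, sources_in_k = int(m.group(1)), int(m.group(2))
    return {"avg_events_per_source": avg_events,
            "num_sources": sources_in_k * 1000}

print(parse_dataset_name("C10D20K"))
# {'avg_events_per_source': 10, 'num_sources': 20000}
```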
Scalability Testing. In the first set of experiments, we fix C = 10 and D = 10K, and test the scalability of the algorithms for decreasing θ values. We report our results for the synthetic dataset (C10D10K) in Fig. 1(L), and for the real dataset (Gazelle) in Fig. 1(R). It can be seen that PGA performs better than both candidate generate-and-test approaches on the real as well as the synthetic dataset. The performance difference is more pronounced for harder instances (at low θ values).
In another set of experiments, we test the scalability of the algorithms for increasing values of D by fixing C = 10 and θ = 1% (Fig. 2(L)), and for increasing values of C by fixing D = 10K and θ = 25% (Fig. 2(R)). It can be seen that PGA performs better than the other two algorithms for increasing D (Fig. 2(L)). However, we do not see improvements for increasing C (Fig. 2(R)); we are currently investigating the reasons for this behaviour. Note that DFS+P processes only one S- or I-extension of s at a time, whereas PGA processes all extensions of s simultaneously; the sketch below makes this difference concrete.
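The following minimal sketch is deliberately simplified: patterns are sequences of single items (so only S-extensions arise), and support is plain deterministic counting rather than the expected support the algorithms actually compute (see the DP sketch below). It shows the structural point: one scan of the prefix's projected database scores every extension at once, PGA-style, where a DFS+P-style search would evaluate one candidate extension at a time.

```python
from collections import defaultdict

def pga_grow(prefix, projected, min_sup, out):
    """Pattern-growth step: score ALL extensions of `prefix` in one pass
    over its projected database, then recurse only on the frequent ones."""
    # One scan collects the support of every candidate extension simultaneously,
    # whereas a DFS+P-style search would pick one candidate, compute its
    # support, recurse, and only then move on to the next candidate.
    ext_support = defaultdict(int)
    for seq, start in projected:
        for item in set(seq[start:]):      # each item counted once per source
            ext_support[item] += 1
    for item, sup in ext_support.items():
        if sup >= min_sup:
            pattern = prefix + [item]
            out.append((pattern, sup))
            # Project each source on the suffix after the first match of `item`.
            new_proj = []
            for seq, start in projected:
                try:
                    new_proj.append((seq, seq.index(item, start) + 1))
                except ValueError:
                    pass
            pga_grow(pattern, new_proj, min_sup, out)

db = [list("abcb"), list("acb"), list("abb")]
results = []
pga_grow([], [(seq, 0) for seq in db], min_sup=2, out=results)
print(results)
```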
We also kept statistics on the number of DP computations for each algorithm (Table 3). The datasets and support thresholds are the same as in Figs. 1 and 2. We observe that PGA consistently performs the fewest DP computations, as in PGA the B_{i,s} arrays are computed only for the frequent sequences. As noted in [13], candidate generation approaches suffer from an exponential number of candidates at low θ values; this cost is even higher in the probabilistic case because of the B_{i,s} arrays. Further, note that BFS, in contrast with PGA, does not need to keep the B_{i,s} arrays, and PGA therefore has additional memory needs.
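This excerpt does not reproduce the recurrence behind the B_{i,s} arrays, so the following is only a sketch of a standard formulation, under the assumption that each event in a p-sequence occurs independently with its stated probability (not necessarily the paper's exact semantics): B_{i,s}[j] gives the probability that s is a subsequence of the first j events of D_i, and the expected support of s is the sum of the final entries over all sources.

```python
def source_support_prob(pseq, s):
    """Pr[s occurs as a subsequence of p-sequence `pseq`], assuming each
    event (e_j, p_j) is present independently with probability p_j
    (a modelling assumption for this sketch). B[t] holds
    Pr[s[:t] is a subsequence of the first j events] as j advances,
    i.e. one entry of the B_{i,s} array per pattern prefix."""
    k = len(s)
    B = [1.0] + [0.0] * k              # j = 0: only the empty prefix matches
    for e, p in pseq:
        # Traverse t = k..1 so each event extends a given match at most once.
        for t in range(k, 0, -1):
            if e == s[t - 1]:
                # Event present (prob p): s[:t] matches if s[:t-1] matched before;
                # event absent (prob 1-p): the previous probability carries over.
                B[t] = p * B[t - 1] + (1.0 - p) * B[t]
    return B[k]

# Expected support of s = sum of source support probabilities.
db = [[("a", 0.9), ("b", 0.5)], [("a", 0.4), ("a", 0.8), ("b", 1.0)]]
print(round(sum(source_support_prob(pseq, ["a", "b"]) for pseq in db), 4))
# 0.45 + 0.88 = 1.33
```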
5 Conclusions and Future Work
We have considered the problem of finding all frequent sequences by pattern growth in SLU databases. We have evaluated PGA against the candidate generate-and-test approaches, and we observe that PGA performs better than the candidate generation approaches in general. The speedup in running time can be seen for the real dataset, in particular at low θ values, and also for the synthetic datasets when the source sequences are not very long or when θ is low. The statistics on the number of DP computations likewise show that PGA consistently performs the fewest DP computations. We conclude
 