From Sequence Mining to Multidimensional Sequence Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

IBM. If so, the corresponding frequency of this sequence stored in the NB table, is

added to the frequency of the candidate. After the comparison with each line until the

last one, the support of C is computed.

Fig. 8.5 shows the Gen-frequent-sequences algorithm that determines all frequent

candidates of size k . It takes as input a threshold t . All frequent candidates of size

k are put in the set Lk . The function scans each generated candidate Ck of size k

(line 02) and checks if it is included in the distinct sequences of size greater or equal

than k in IBM (line 03). Index [ k ] points to the first sequence of size k and max

stands for the sequence of greatest size encoded in IBM . Then if a candidate is a sub-

sequence of a given sequence encoded in IBM (line 05), his support is incremented by

its frequency ( NB [ k ]) in the database. At the end of this process, if the support of a

given candidate of size k is greater then the threshold t , the candidate is placed into

the set Lk .

Gen-frequent-sequences (Threshold t )

01 Lk =

-- Set of frequent sequences of size k

02 For all sequences s

∅

Ck -- Candidates of size k

03 For all lines l in IBM from line Index [ k ] to max

04

∈

If s

⊂

l

s .count = s .count + NB [ k ] -- frequence of s

05 If s .count ≥ t

Lk = Lk

∪

s

Fig. 8.5. Generation of Frequent Sequences of Size k

Suppose the example of Fig. 8.1 and a generated candidate C=HSH of size S=3.

Then the algorithm will access the cell number 3 of the Index which pin points to the

line 3 of the IBM table, where the first sequence of size 3 starts. This sequence does

not contain C, but those in line 4 to 6 contain C. So the frequency of C is computed as

30+20+15=65. The support of C is equal to 65/ (100+60+40+30+20+15)=0.245. If the

support threshold is equal to 0.4, C candidate will not be retained as frequent pattern.

8.4 Implementation and Performance Study

The experiments aimed to validate our approach and to compare it to other methods.

This comparison focuses on processing performances, storage costs, and scalability.

The tests were performed on a 2.5 Ghz Pentium IV with 1 GB of memory running

Microsoft Windows XP Professional. They aimed at showing: (i) our method effec-

tiveness by applying it to a real dataset, and (ii) its scalability while increasing the

data size.

Indeed, IBM has been performed on real data related to daily activity programs of

the population of a north French urban area. In this application, the number of items is

about 10. The database contains about 10,800 sequences among which 3,429 distinct

sequences. The sequence size varies between 2 and 34 with an average size 6. The

application aims at discovering frequent activity patterns in order to derive some

population profiles. Interesting and previously unknown patterns have been produced

which allow the decision makers better understanding of the daily activity and

Mining Complex Data

Search WWH ::

Custom Search

Home