Information Technology Reference
In-Depth Information
IBM. If so, the corresponding frequency of this sequence stored in the NB table, is
added to the frequency of the candidate. After the comparison with each line until the
last one, the support of C is computed.
Fig. 8.5 shows the Gen-frequent-sequences algorithm that determines all frequent
candidates of size k . It takes as input a threshold t . All frequent candidates of size
k are put in the set Lk . The function scans each generated candidate Ck of size k
(line 02) and checks if it is included in the distinct sequences of size greater or equal
than k in IBM (line 03). Index [ k ] points to the first sequence of size k and max
stands for the sequence of greatest size encoded in IBM . Then if a candidate is a sub-
sequence of a given sequence encoded in IBM (line 05), his support is incremented by
its frequency ( NB [ k ]) in the database. At the end of this process, if the support of a
given candidate of size k is greater then the threshold t , the candidate is placed into
the set Lk .
Gen-frequent-sequences (Threshold t )
01 Lk =
-- Set of frequent sequences of size k
02 For all sequences s
Ck -- Candidates of size k
03 For all lines l in IBM from line Index [ k ] to max
04
If s
l
s .count = s .count + NB [ k ] -- frequence of s
05 If s .count ≥ t
Lk = Lk
s
Fig. 8.5. Generation of Frequent Sequences of Size k
Suppose the example of Fig. 8.1 and a generated candidate C=HSH of size S=3.
Then the algorithm will access the cell number 3 of the Index which pin points to the
line 3 of the IBM table, where the first sequence of size 3 starts. This sequence does
not contain C, but those in line 4 to 6 contain C. So the frequency of C is computed as
30+20+15=65. The support of C is equal to 65/ (100+60+40+30+20+15)=0.245. If the
support threshold is equal to 0.4, C candidate will not be retained as frequent pattern.
8.4 Implementation and Performance Study
The experiments aimed to validate our approach and to compare it to other methods.
This comparison focuses on processing performances, storage costs, and scalability.
The tests were performed on a 2.5 Ghz Pentium IV with 1 GB of memory running
Microsoft Windows XP Professional. They aimed at showing: (i) our method effec-
tiveness by applying it to a real dataset, and (ii) its scalability while increasing the
data size.
Indeed, IBM has been performed on real data related to daily activity programs of
the population of a north French urban area. In this application, the number of items is
about 10. The database contains about 10,800 sequences among which 3,429 distinct
sequences. The sequence size varies between 2 and 34 with an average size 6. The
application aims at discovering frequent activity patterns in order to derive some
population profiles. Interesting and previously unknown patterns have been produced
which allow the decision makers better understanding of the daily activity and
Search WWH ::




Custom Search