Information Technology Reference
In-Depth Information
SV
1
Index
0 013
0 013
H W S M H
H W S M H
1 1 0 0 1
1 1 0 0 1
15
15
1 0 1 0 1
1 0 1 0 1
10
10
1 0 1 1 1
1 0 1 1 1
12
12
NB
IBM
2
H W S M W H
1 1 0 0 0 1
1 0 1 0 0 1
1 0 1 1 0 1
1 1 0 1 1 1
15
00 1 34
10
12
1
Fig. 8.4. IBM data structure generation process
8.3.3 Candidate Generation and Support Counting
The second phase is based on the APRIORI principle, with the difference that it
operates in a main memory data structure instead of scanning the database. Using the
Index table in IBM, the process starts by the sequences candidates of size 1. Their
support is counted (as explained later). They are retained as frequent if the support
fulfils the support threshold. Then, bigger size candidates are generated from these
frequent items using the fusion process (joining phase) as in the GSP algorithm [20].
The frequencies of the candidates are counted again, and the process is repeated.
The fusion process consists to merge two candidates having a common contiguous
subsequence of size n-2 in one sequence s of size n. For example, consider the two
candidates c = M MH and c' = MH M of size 3. MH is a common contiguous subse-
quence of c and c', and of size 2. Therefore, the candidate s = M MH M is generated
from c and c'.
More formally, given two sequences c=c1c2…cn-1 and c'=c'1c'2…c'n-1 of size
n-1, a sequence of size n s=s1s2…sn may be generated from c and c' as follows:
(i) if c = c1 and c' = c'1, then s = c1c'1.
(ii) if n > 2 and
[2..n-1] ci = c'i-1, then s= c1c2…cn-1c'n-1
As for the support counting of the generated candidates, it is facilitated by the data
structure. For a given candidate C of size S, the algorithm (see Fig. 8.5) first looks in
the cell number S of the Index where the first sequence of size S is encoded. Then,
this line is accessed. For each line starting from this line to the last line of the Bitmap
table, the algorithm determines using the SV vector if C is contained in each line of
i
Search WWH ::




Custom Search