From Sequence Mining to Multidimensional Sequence Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

example, the cell number 5 (with value 6) corresponds to the line number 6 of the first

sequence of size 5 encoded in the Bitmap. The table NB stores the frequency of each

distinct sequence in the database. Thus the sequence HSMH occurs 20 times in the

database.

In the first stage of the algorithm, INDEX, SV , NB and the Bitmap are built on the

fly during one pass. At each insertion of a sequence, the Bitmap matrix may become

larger, and a set of shifting operations are applied to the bit values stored in this table.

H = Home

W = Work

S = School

M = Market

R = Restaurant

SV

H W S M H R W H

Index

Bitmap

NB

6

1 1 0 0 0 1 1 1

15

20

30

40

60

100

1 0 1 1 0 0 0 1

5

1 0 1 0 0 0 0 1

1 0 1 0 0 1 0 0

3

2

0 1 0 0 0 0 0 0

1 0 0 0 0 0 0 0

1

Fig. 8.1. The data structure

IBM ( sequence database DB , threshold t )

01 SV =

-- SV empty at the beginning

02 For each sequence s in DB

03 Merge-sequence-vector ( s ) -- merge s with SV

04 Update NB

05 If s

φ

∉

IBM

06

Encode and Insert s in the IBM

07

Update Index

Shift the IBM table according to the new SV

09 k = 1

10 Gen-frequent-sequences ( t ) -- check the frequent of size 1

11 While there exist a frequent sequence of size k

12 k = k +1

13 Generate Ck

14 Gen-frequent-sequences ( t )

Fig. 8.2. IBM algorithm

Fig. 8.2 shows the general IBM algorithm that takes as parameters: the sequence

database DB and a threshold t . This value ( t ) stands for the minimum subsequences

frequency taken into account for the generation of the candidates. Then for each se-

quence s that it reads from the database during the scan, SV (line 03) is generated

using a merging process (as detailed in Fig. 8.3). Three situations may hold. The first

Mining Complex Data

Search WWH ::

Custom Search

Home