Sequential Pattern Mining - Frequent Pattern Mining - page 265

Database Reference

In-Depth Information

Fig. 11.4 Vertical format of

the sequence database and

fragments of the SPAM

mining process. (Adapted

from [ 6 ])

(a) Data sorted by CID and TID.

Customer ID (CID)

(b) Bitmap representat ion of the dataset in (a)

CID

TID

Itemset

{a, b, c}

{b, c, d}

{b, c, d}

{b}

{a, b, c}

{a, b}

TID

{ a } b }

{ c }

{ d }

1

1

1

1

1

1

0

1

1

3

1

3

0

1

1

1

1

6

1

6

0

1

1

1

2

2

-

-

0

0

0

0

2

4

2

2

0

1

0

0

3

5

2

2

1

1

1

0

3

7

{b, c, d}

-

-

0

0

0

0

-

-

0

0

0

0

3

5

1

1

0

0

3

7

0

1

1

1

-

-

0

0

0

0

-

-

0

0

0

0

( a) S-S tep proce ssi ng

( { a }) s

(b) I- step processing

{ d }

( { a } )

{ b }

( { a }, { b } )

( { a }, { b } )

( { a }, { b , d } )

1

0

1

0

0

1

0

0

1

1

1

1

1

1

0

1

1

1

1

1

1

0

1

0

0

0

0

0

0

S-step

0

1

result

0

0

0

result

0

1

0

→

1

&1

→

0

0

&0

→

0

process

1

0

0

0

0

0

0

1

0

0

0

0

0

1

0

1

0

0

0

0

0

1

1

1

1

1

1

0

1

0

0

0

0

0

0

1

0

0

0

0

0

on s . Otherwise, it stops DFS on s by the Apriori property. The method of candidate

pruning is based on Apriori (downward closure) property and is conducted at each S-

step and I-step , which guarantees that all nodes corresponding to frequent sequences

are visited.

To allow for efficient counting, SPAM uses a vertical bitmap representation of

the data. A vertical bitmap is created for each item in the dataset, and each bitmap

has a bit corresponding to each transaction in the dataset. Each bitmap partition of a

sequence to be extended in the S-Step is first transformed to a transformed bitmap ,

such that all the bits less than or equal to the index of the first bit with value one

(denoted by k ) are set to zero, and all bits after k are set to one. Then, the resulting

bitmap can be obtained by the ANDing operation of the transformed bitmap and the

bitmap of the appended item. In the I-step, ANDing is performed directly without

transformation of the sequence. Now support-counting becomes a simple count of

how many bitmap partitions, not containing all zeros. Figure 11.4 shows the sequence

database, its bitmap representation, and an example of the mining process.

3.2.3

LAPIN-SPAM

Based on SPAM, Yang and Kitsuregawa [ 14 ] proposed a new algorithm called

LAst Position INduction Sequential PAttern Mining (abbreviated as LAPIN-SPAM),

which can efficiently get all the frequent sequential patterns from a large database.

The main difference between them is the method for candidate sequence count-

ing and verification. While SPAM does many ANDing operations for candidate

testing, LAPIN-SPAM can easily implement this process based on the following

fact that if an item's last position is smaller than or equal to the current prefix

position, the item can not appear behind the current prefix in the same sequence.

In order to exploit this fact for candidate pruning, LAPIN-SPAM constructs an

ITEM_IS_EXIST_TABLE, which is created while scanning the database for the

Next Page

Frequent Pattern Mining

Search WWH ::

Custom Search

Home