Database Reference
In-Depth Information
Fig. 11.4 Vertical format of
the sequence database and
fragments of the SPAM
mining process. (Adapted
from [ 6 ])
(a) Data sorted by CID and TID.
Customer ID (CID)
(b) Bitmap representat ion of the dataset in (a)
CID
TID
Itemset
{a, b, c}
{b, c, d}
{b, c, d}
{b}
{a, b, c}
{a, b}
TID
{ a } b }
{ c }
{ d }
1
1
1
1
1
1
0
1
1
3
1
3
0
1
1
1
1
6
1
6
0
1
1
1
2
2
-
-
0
0
0
0
2
4
2
2
0
1
0
0
3
5
2
2
1
1
1
0
3
7
{b, c, d}
-
-
0
0
0
0
-
-
0
0
0
0
3
5
1
1
0
0
3
7
0
1
1
1
-
-
0
0
0
0
-
-
0
0
0
0
( a) S-S tep proce ssi ng
( { a }) s
(b) I- step processing
{ d }
( { a } )
{ b }
( { a }, { b } )
( { a }, { b } )
( { a }, { b , d } )
1
0
1
0
0
1
0
0
1
1
1
1
1
1
0
1
1
1
1
1
1
0
1
0
0
0
0
0
0
S-step
0
1
result
0
0
0
result
0
1
0
1
&1
0
0
&0
0
process
1
0
0
0
0
0
0
1
0
0
0
0
0
1
0
1
0
0
0
0
0
1
1
1
1
1
1
0
1
0
0
0
0
0
0
1
0
0
0
0
0
on s . Otherwise, it stops DFS on s by the Apriori property. The method of candidate
pruning is based on Apriori (downward closure) property and is conducted at each S-
step and I-step , which guarantees that all nodes corresponding to frequent sequences
are visited.
To allow for efficient counting, SPAM uses a vertical bitmap representation of
the data. A vertical bitmap is created for each item in the dataset, and each bitmap
has a bit corresponding to each transaction in the dataset. Each bitmap partition of a
sequence to be extended in the S-Step is first transformed to a transformed bitmap ,
such that all the bits less than or equal to the index of the first bit with value one
(denoted by k ) are set to zero, and all bits after k are set to one. Then, the resulting
bitmap can be obtained by the ANDing operation of the transformed bitmap and the
bitmap of the appended item. In the I-step, ANDing is performed directly without
transformation of the sequence. Now support-counting becomes a simple count of
how many bitmap partitions, not containing all zeros. Figure 11.4 shows the sequence
database, its bitmap representation, and an example of the mining process.
3.2.3
LAPIN-SPAM
Based on SPAM, Yang and Kitsuregawa [ 14 ] proposed a new algorithm called
LAst Position INduction Sequential PAttern Mining (abbreviated as LAPIN-SPAM),
which can efficiently get all the frequent sequential patterns from a large database.
The main difference between them is the method for candidate sequence count-
ing and verification. While SPAM does many ANDing operations for candidate
testing, LAPIN-SPAM can easily implement this process based on the following
fact that if an item's last position is smaller than or equal to the current prefix
position, the item can not appear behind the current prefix in the same sequence.
In order to exploit this fact for candidate pruning, LAPIN-SPAM constructs an
ITEM_IS_EXIST_TABLE, which is created while scanning the database for the
Search WWH ::




Custom Search