Database Reference
In-Depth Information
Table 11.2 A horizontal
formatting sequential
database. (Adapted from [ 8 ])
Customer Id
Customer sequence
1
(30) (90)
2
(10 20)(30)(40 60 70)
3
(30 50 70)
4
(30) (40 70) (90)
5
(90)
mining maximal sequential patterns, while AprioriAll was designed for mining all se-
quential patterns. Here, we mainly introduce AprioriAll algorithm which discoveries
all sequential patterns in this section.
This problem of sequential pattern mining is solved using the following phases:
1. Sort Phase. This phase converts the original transaction database into the hori-
zontal formatting sequential database (Table 11.2 ) by sorting the original database
with customer_id as the major key and transaction_time as the minor key.
2. Litemset Phase. In the paper [ 8 ], the length of a sequence is the number of
itemsets in the sequence. The support for an itemset i is defined as the number
of customers who bought the items in i in a single transaction. Thus the itemset i
and the 1-sequence
have the same support. An itemset with minimum support
is called a large itemset or litemset . Note each itemset in a frequent sequence must
have minimum support. Hence, any frequent sequence must be a list of litemsets.
This phase finds the set of all litemsets L . It is straightforward to adapt any of
the algorithms in [ 19 ] to find litemsets. The main difference is that the support
count should be incremented only once per customer even if the customer buys
the same set of items in two different transactions. With the minimum support
set to 2, in the example database given in Table 11.2 , the litemsets are (30), (40),
(70), (40 70) and (90). Then the set of litemsets is mapped to a set of contiguous
integers for the litemset equality comparison. A possible mapping for this set
is (in the form of “litemset: mapped integer”): (30):1, (40):2, (70):3, (40 70):4,
(90):5.
3. Transformation Phase. Each transaction is replaced by the set of all litemsets
contained in that transaction. Transactions that do not contain any litemsets are not
retained and a customer sequence that does not contain any litemsets is dropped.
The customer sequences that are dropped, however, still contribute to the count
of total number of customers. The transformation of the database is shown in
Fig. 11.1 .
4. Sequence Phase. AprioriAll algorithm makes multiple passes over the data. In
each pass, it starts with a seed set of frequent sequences, and then uses the seed
set for generating new potentially frequent sequences (through Apriori-generate
function), called candidate sequences. The algorithm finds the support for these
candidate sequences during the pass over the data. At the end of each pass, it
determines which of the candidate sequences are actually frequent. These frequent
candidates become the seed for the next pass. In the first pass, all frequent 1-
sequences, obtained in the litemset phase, form the seed set. This algorithm
terminates when either no candidates are generated or no candidates meet the
minimum support.
i
 
Search WWH ::




Custom Search