Sequential Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

Table 11.2 A horizontal

formatting sequential

database. (Adapted from [ 8 ])

Customer Id

Customer sequence

1

(30) (90)

2

(10 20)(30)(40 60 70)

3

(30 50 70)

4

(30) (40 70) (90)

5

(90)

mining maximal sequential patterns, while AprioriAll was designed for mining all se-

quential patterns. Here, we mainly introduce AprioriAll algorithm which discoveries

all sequential patterns in this section.

This problem of sequential pattern mining is solved using the following phases:

1. Sort Phase. This phase converts the original transaction database into the hori-

zontal formatting sequential database (Table 11.2 ) by sorting the original database

with customer_id as the major key and transaction_time as the minor key.

2. Litemset Phase. In the paper [ 8 ], the length of a sequence is the number of

itemsets in the sequence. The support for an itemset i is defined as the number

of customers who bought the items in i in a single transaction. Thus the itemset i

and the 1-sequence

have the same support. An itemset with minimum support

is called a large itemset or litemset . Note each itemset in a frequent sequence must

have minimum support. Hence, any frequent sequence must be a list of litemsets.

This phase finds the set of all litemsets L . It is straightforward to adapt any of

the algorithms in [ 19 ] to find litemsets. The main difference is that the support

count should be incremented only once per customer even if the customer buys

the same set of items in two different transactions. With the minimum support

set to 2, in the example database given in Table 11.2 , the litemsets are (30), (40),

(70), (40 70) and (90). Then the set of litemsets is mapped to a set of contiguous

integers for the litemset equality comparison. A possible mapping for this set

is (in the form of “litemset: mapped integer”): (30):1, (40):2, (70):3, (40 70):4,

(90):5.

3. Transformation Phase. Each transaction is replaced by the set of all litemsets

contained in that transaction. Transactions that do not contain any litemsets are not

retained and a customer sequence that does not contain any litemsets is dropped.

The customer sequences that are dropped, however, still contribute to the count

of total number of customers. The transformation of the database is shown in

Fig. 11.1 .

4. Sequence Phase. AprioriAll algorithm makes multiple passes over the data. In

each pass, it starts with a seed set of frequent sequences, and then uses the seed

set for generating new potentially frequent sequences (through Apriori-generate

function), called candidate sequences. The algorithm finds the support for these

candidate sequences during the pass over the data. At the end of each pass, it

determines which of the candidate sequences are actually frequent. These frequent

candidates become the seed for the next pass. In the first pass, all frequent 1-

sequences, obtained in the litemset phase, form the seed set. This algorithm

terminates when either no candidates are generated or no candidates meet the

minimum support.

i

Frequent Pattern Mining

Search WWH ::

Custom Search

Home