Sequential Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

sequences for the purpose of retrieval efficiency. PSP organizes the candidate se-

quences in a prefix-tree structure according to their common elements. Any branch

of the tree, from the root to a leaf stands for a candidate sequence, and the terminal

node of any branch provides the support of the corresponding sequence. Adding to

the support value of candidate sequence is performed by navigating to each leaf in

the tree and then incrementing the value. As in this prefix-tree structure, initial subse-

quences common to several candidate sequences are stored only once, this structure

requires less memory than the hash-tree used in the GSP approach, which fully stores

all candidate sequences in the leaves.

The aforementioned Apriori-based algorithms depend largely on the Apriori prop-

erty and do not exploit additional strategies to narrow the search space. During each

iteration of the algorithms, they have to maintain the support count for each subse-

quence being mined, which makes them computationally expensive. To alleviate the

problem, some Apriori-based approaches [ 11 - 14 ] utilize the vertical representation

of the sequence database and employ simple temporal joins of the id-lists to calcu-

late support for each sequence. In the remainder of this section, we introduce several

vertical data format algorithms in detail.

3.2

Vertical Data Format Algorithms

3.2.1

SPADE

SPADE [ 11 ] maps a sequence database into the vertical data format which takes

each item as the center of observation and takes its associated sequence and event

identifiers as data sets. To find sequence of length-2 items, it just needs to join two

single items if they are frequent and they share the same sequence identifier and their

event identifiers (which are essentially relative timestamps) follow the sequential

ordering. Similarly, SPADE can grow the sequence from length two to length three,

and so on. SPADE relies on a lattice of frequent sequences generated by applying the

Lattice theory [ 20 ] on frequent sequences and their subsequences. It also decomposes

the original search space (lattice) into smaller pieces (sub-lattices) called equivalence

classes , which can be loaded and processed independently in main memory. Two

sequences are considered to be in the same class if they share a common k -length

prefix. Each sub-lattice can be traversed via either breadth-first or depth-first search

to enumerate the frequent sequences, whose counts are then calculated via simple

temporal joins (or intersections) on id-lists . SPADE usually requires three database

scans, or only a single scan with some preprocessed data. Some fragments of the

SPADE mining process are illustrated using the following example [ 3 ].

Example

2

(SPADE)

Given

the

sequence

database D in

Table

11.1

and

=

min _ suppor t

2, SPADE first scans D , transforms the database into the vertical

format by introducing EID (event_ID) which is a (local) timestamp for each event.

Each single item is associated with a set of SID (sequence_id) and EID (event_id)

Frequent Pattern Mining

Search WWH ::

Custom Search

Home