Database Reference
In-Depth Information
sequences for the purpose of retrieval efficiency. PSP organizes the candidate se-
quences in a prefix-tree structure according to their common elements. Any branch
of the tree, from the root to a leaf stands for a candidate sequence, and the terminal
node of any branch provides the support of the corresponding sequence. Adding to
the support value of candidate sequence is performed by navigating to each leaf in
the tree and then incrementing the value. As in this prefix-tree structure, initial subse-
quences common to several candidate sequences are stored only once, this structure
requires less memory than the hash-tree used in the GSP approach, which fully stores
all candidate sequences in the leaves.
The aforementioned Apriori-based algorithms depend largely on the Apriori prop-
erty and do not exploit additional strategies to narrow the search space. During each
iteration of the algorithms, they have to maintain the support count for each subse-
quence being mined, which makes them computationally expensive. To alleviate the
problem, some Apriori-based approaches [ 11 - 14 ] utilize the vertical representation
of the sequence database and employ simple temporal joins of the id-lists to calcu-
late support for each sequence. In the remainder of this section, we introduce several
vertical data format algorithms in detail.
3.2
Vertical Data Format Algorithms
3.2.1
SPADE
SPADE [ 11 ] maps a sequence database into the vertical data format which takes
each item as the center of observation and takes its associated sequence and event
identifiers as data sets. To find sequence of length-2 items, it just needs to join two
single items if they are frequent and they share the same sequence identifier and their
event identifiers (which are essentially relative timestamps) follow the sequential
ordering. Similarly, SPADE can grow the sequence from length two to length three,
and so on. SPADE relies on a lattice of frequent sequences generated by applying the
Lattice theory [ 20 ] on frequent sequences and their subsequences. It also decomposes
the original search space (lattice) into smaller pieces (sub-lattices) called equivalence
classes , which can be loaded and processed independently in main memory. Two
sequences are considered to be in the same class if they share a common k -length
prefix. Each sub-lattice can be traversed via either breadth-first or depth-first search
to enumerate the frequent sequences, whose counts are then calculated via simple
temporal joins (or intersections) on id-lists . SPADE usually requires three database
scans, or only a single scan with some preprocessed data. Some fragments of the
SPADE mining process are illustrated using the following example [ 3 ].
Example
2
(SPADE)
Given
the
sequence
database D in
Table
11.1
and
=
min _ suppor t
2, SPADE first scans D , transforms the database into the vertical
format by introducing EID (event_ID) which is a (local) timestamp for each event.
Each single item is associated with a set of SID (sequence_id) and EID (event_id)
Search WWH ::




Custom Search