Sequential Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

growth and projected database construction compared with the original PrefixSpan

algorithm. For both algorithms, different dimensional scopes of each element are

considered as the key factor for algorithm design. For more details, you could refer

to [ 31 ].

5.3

Incremental Methods

Many real life sequence databases grow incrementally. It is undesirable to mine

sequential patterns from scratch each time when a small set of sequences grow,

or when some new sequences are added into the database. Incremental algorithm

should be developed for sequential pattern mining so that mining can be adapted to

incremental database updates.

Parthasarathy et al. [ 32 ] developed an incremental mining algorithm ISM based

on the SPADE algorithm. Their goal is to minimize the I/O and computation re-

quirements for handling incremental updates. To achieve this goal, ISM algorithm

uses an efficient memory management scheme that indexes into the database effi-

ciently, and creates an Increment Sequence Lattice (ISL), which consists of all the

frequent sequences (FS) and all sequences in the negative border (NB) in the original

database. FS denotes the set of all frequent sequences in the updated database, and

the negative border (NB) is the collection of all sequences that are not frequent but

both of whose generating subsequences are frequent. The support of each member

is kept in the lattice, too. The algorithm consists of two phases. Phase 1 is for updat-

ing the supports of elements in NB and FS, and aims at pruning the sequences that

become infrequent from the set of frequent sequences after the update. One scan of

the database is enough to update the lattice as well as the negative border. Phase 2

is for adding to NB and FS beyond what was done in Phase 1. For the details of this

algorithm, you could refer to [ 32 ].

Masseglia et al. [ 33 ] developed another incremental mining algorithm ISE for

computing the frequent sequences in the updated database when new transactions and

new customers are added to the original database using candidate generation-and-test

approach. ISE minimizes computational costs by re-using the minimal information

from the old frequent sequences, i. e. the support of frequent sequences. To find

all new frequent sequences, three kinds of frequent sequences are considered. First,

sequences embedded in the original database could become frequent since they have

sufficient support with the incremental database, i. e. sequences similar to sequences

embedded in the original database appear in the increment. Next, new frequent

sequences not appearing in the original database may emerge in the incremental

database. Finally, sequences of the original database might become frequent when

items from the original database are added. To discover frequent sequences, the ISE

algorithm executes iteratively. During the first pass on the incremental database (db),

the ISE algorithm counts the support of individual items and finds the set of items

occurring at least once in db. Considering the set of items embedded in the original

database (DB), it determines which items of db are frequent in U (DB

∪

db). This

Frequent Pattern Mining

Search WWH ::

Custom Search

Home