Database Reference
In-Depth Information
growth and projected database construction compared with the original PrefixSpan
algorithm. For both algorithms, different dimensional scopes of each element are
considered as the key factor for algorithm design. For more details, you could refer
to [ 31 ].
5.3
Incremental Methods
Many real life sequence databases grow incrementally. It is undesirable to mine
sequential patterns from scratch each time when a small set of sequences grow,
or when some new sequences are added into the database. Incremental algorithm
should be developed for sequential pattern mining so that mining can be adapted to
incremental database updates.
Parthasarathy et al. [ 32 ] developed an incremental mining algorithm ISM based
on the SPADE algorithm. Their goal is to minimize the I/O and computation re-
quirements for handling incremental updates. To achieve this goal, ISM algorithm
uses an efficient memory management scheme that indexes into the database effi-
ciently, and creates an Increment Sequence Lattice (ISL), which consists of all the
frequent sequences (FS) and all sequences in the negative border (NB) in the original
database. FS denotes the set of all frequent sequences in the updated database, and
the negative border (NB) is the collection of all sequences that are not frequent but
both of whose generating subsequences are frequent. The support of each member
is kept in the lattice, too. The algorithm consists of two phases. Phase 1 is for updat-
ing the supports of elements in NB and FS, and aims at pruning the sequences that
become infrequent from the set of frequent sequences after the update. One scan of
the database is enough to update the lattice as well as the negative border. Phase 2
is for adding to NB and FS beyond what was done in Phase 1. For the details of this
algorithm, you could refer to [ 32 ].
Masseglia et al. [ 33 ] developed another incremental mining algorithm ISE for
computing the frequent sequences in the updated database when new transactions and
new customers are added to the original database using candidate generation-and-test
approach. ISE minimizes computational costs by re-using the minimal information
from the old frequent sequences, i. e. the support of frequent sequences. To find
all new frequent sequences, three kinds of frequent sequences are considered. First,
sequences embedded in the original database could become frequent since they have
sufficient support with the incremental database, i. e. sequences similar to sequences
embedded in the original database appear in the increment. Next, new frequent
sequences not appearing in the original database may emerge in the incremental
database. Finally, sequences of the original database might become frequent when
items from the original database are added. To discover frequent sequences, the ISE
algorithm executes iteratively. During the first pass on the incremental database (db),
the ISE algorithm counts the support of individual items and finds the set of items
occurring at least once in db. Considering the set of items embedded in the original
database (DB), it determines which items of db are frequent in U (DB
db). This
Search WWH ::




Custom Search