Database Reference
In-Depth Information
When generating maximal patterns, the depth-first strategy has clear advantages
in terms of pruning as well. We refer the reader to a detailed description of the
DepthProject algorithm, described later in this chapter. This description describes
how several specialized pruning techniques are enabled by the depth-first strategy for
maximal pattern mining. The TreeProjection algorithm has also been generalized to
sequential pattern mining [ 31 ]. There are many different types of data structures that
may be used in projection-style algorithms. The choice of data structure is sensitive
to the data set. Two common choices that are used with TreeProjection family of
algorithms are as follows:
1. Arrays: In this case, the projected database is maintained as 2-dimensional array.
One of the dimensions of the array is equal to the number of relevant transactions
and the other dimension is equal to the number of relevant items in the projected
database. Both dimensions of the projected database reduce from top level to
lower levels of the enumeration tree with successive projection.
2. BitStrings: In this case, the projected database is maintained as a 0-1 bit string
whose width is fixed to the total number of frequent 1-items, but the number
of projected transactions reduces with successive projection. Such an approach
loses the power of item-wise projection, but this is balanced by the fact that the
bit-strings can be used more efficiently for counting operations.
Assume that each transaction T contains n bits, and can therefore be expressed
in the form of
bytes. Each byte of the transaction contains the information
about the presence or absence of eight items, and the integer value of the corre-
sponding bitstring can take on any value from 0 to 2 8
n/ 8
255. Correspondingly,
for each byte of the (projected) transaction at a node, 256 counters are maintained
and a value of 1 is added to the counter corresponding to the integer value of that
transaction byte. This process is repeated for each transaction in the projected
database at node P . Therefore, at the end of this process, one has 256
1
=
counts for the d different items. At this point, a postprocessing phase is initi-
ated in which the support of an item is determined by adding the counts of the
256 / 2
d/ 8
128 counters which take on the value of 1 for that bit. Thus, the second
phase requires 128
=
d operations only, and is independent of database size. The
first phase, (which is the bottleneck) is the improvement over the naive counting
method because it performs only one operation for each byte in the transaction,
which contains eight items. Thus, the method would be a factor of eight faster
than the naive counting technique, which would need to scan the entire bitstring.
Projection is also very efficient in the bitstring representation with simple AND
operations.
The major problem with fixed width bitstrings is that they are not efficient repre-
sentations at lower levels of the enumeration tree at which only a small number of
items are relevant, and therefore most entries in these bitstrings are 0. One approach
to speed this up is to perform the item-wise projection only at selected nodes in
the tree, when the reduction in the number of items from the last ancestor at which
the item-wise projection was performed is at particular multiplicative factor. At this
point, a shorter bit string is used for representation for the descendants at that node,
Search WWH ::




Custom Search