Frequent Pattern Mining Algorithms: A Survey - Frequent Pattern Mining

Database Reference

In-Depth Information

When generating maximal patterns, the depth-first strategy has clear advantages

in terms of pruning as well. We refer the reader to a detailed description of the

DepthProject algorithm, described later in this chapter. This description describes

how several specialized pruning techniques are enabled by the depth-first strategy for

maximal pattern mining. The TreeProjection algorithm has also been generalized to

sequential pattern mining [ 31 ]. There are many different types of data structures that

may be used in projection-style algorithms. The choice of data structure is sensitive

to the data set. Two common choices that are used with TreeProjection family of

algorithms are as follows:

1. Arrays: In this case, the projected database is maintained as 2-dimensional array.

One of the dimensions of the array is equal to the number of relevant transactions

and the other dimension is equal to the number of relevant items in the projected

database. Both dimensions of the projected database reduce from top level to

lower levels of the enumeration tree with successive projection.

2. BitStrings: In this case, the projected database is maintained as a 0-1 bit string

whose width is fixed to the total number of frequent 1-items, but the number

of projected transactions reduces with successive projection. Such an approach

loses the power of item-wise projection, but this is balanced by the fact that the

bit-strings can be used more efficiently for counting operations.

Assume that each transaction T contains n bits, and can therefore be expressed

in the form of

bytes. Each byte of the transaction contains the information

about the presence or absence of eight items, and the integer value of the corre-

sponding bitstring can take on any value from 0 to 2 8

n/ 8

255. Correspondingly,

for each byte of the (projected) transaction at a node, 256 counters are maintained

and a value of 1 is added to the counter corresponding to the integer value of that

transaction byte. This process is repeated for each transaction in the projected

database at node P . Therefore, at the end of this process, one has 256

−

1

=

counts for the d different items. At this point, a postprocessing phase is initi-

ated in which the support of an item is determined by adding the counts of the

256 / 2

∗

d/ 8

128 counters which take on the value of 1 for that bit. Thus, the second

phase requires 128

=

d operations only, and is independent of database size. The

first phase, (which is the bottleneck) is the improvement over the naive counting

method because it performs only one operation for each byte in the transaction,

which contains eight items. Thus, the method would be a factor of eight faster

than the naive counting technique, which would need to scan the entire bitstring.

Projection is also very efficient in the bitstring representation with simple AND

operations.

∗

The major problem with fixed width bitstrings is that they are not efficient repre-

sentations at lower levels of the enumeration tree at which only a small number of

items are relevant, and therefore most entries in these bitstrings are 0. One approach

to speed this up is to perform the item-wise projection only at selected nodes in

the tree, when the reduction in the number of items from the last ancestor at which

the item-wise projection was performed is at particular multiplicative factor. At this

point, a shorter bit string is used for representation for the descendants at that node,

Frequent Pattern Mining

Search WWH ::

Custom Search

Home