A Compressed Vertical Binary Algorithm for Mining Frequent Patterns - Data Mining: Foundations and Practice

Databases Reference

In-Depth Information

only auxiliary bit-vectors are stored; the bit values of the items considered in

the itemset need to be checked.

Burdick et al. proposed a depth-first search algorithm using vertical binary

representation named MAFIA in 2001 [7]. They use bit-vectors, and it is

an e cient algorithm; but only for finding maximal frequent itemsets, and

especially in dense databases. Mining only maximal frequent itemsets has the

following deficiency: From them we know that all its subsets are frequent, but

we do not know the exact value of the supports of these subsets. Therefore,

we can not obtain all the possible association rules from them.

Shenoy et al. proposed another breath-first search algorithm, called

VIPER, using vertical representations. Although VIPER used a compressed

binary representation on disk, when these compressed vectors are processed

in memory they are converted right away into tid-lists, not considering

advantages of boolean operations over binary formats [20].

Many other researchers have proposed other vertical binary algorithms,

although the above mentioned are to the best of our knowledge the most

representatives.

The method we present in this chapter, CBMine, obtains all frequent item-

sets faster than these well-known a priori and vertical binary implementations,

outperforming them considerably, especially for sparse databases.

4 CBMine Algorithm

A new method applied to a priori-like algorithms, named CBMine (Com-

pressed Binary Mine), is analyzed in this section.

4.1 Storing the Transactions

Let us call the itemset that is obtained by removing infrequent items from

a transaction the filtered transaction. The size of the filtered transactions

is declared to be “substantially smaller than the size of database”. Besides,

all frequent itemsets can be determined even if only filtered transactions are

available.

The set of filtered transactions can be represented as an m x n matrix

where m is the number of transactions and n is the number of frequent items

(see Fig. 5 for an 8x5 matrix). We can denote the presence or absence of an

item in each transaction by a binary value (1 if it is present, else 0).

This representation has been considered as a logical view of the data.

Nevertheless, some researchers have employed it for counting the support for

an item and for generating the set of 1-frequent itemsets [15].

To reduce I/O cost and speed up the algorithm, the filtered transactions

could be stored in main memory instead of on disk. Although this is a reason-

able solution, any data structure could require a considerable - and probably

a prohibitive - memory space for large databases.

Search WWH ::

Custom Search

Home