Databases Reference
In-Depth Information
only auxiliary bit-vectors are stored; the bit values of the items considered in
the itemset need to be checked.
Burdick et al. proposed a depth-first search algorithm using vertical binary
representation named MAFIA in 2001 [7]. They use bit-vectors, and it is
an e cient algorithm; but only for finding maximal frequent itemsets, and
especially in dense databases. Mining only maximal frequent itemsets has the
following deficiency: From them we know that all its subsets are frequent, but
we do not know the exact value of the supports of these subsets. Therefore,
we can not obtain all the possible association rules from them.
Shenoy et al. proposed another breath-first search algorithm, called
VIPER, using vertical representations. Although VIPER used a compressed
binary representation on disk, when these compressed vectors are processed
in memory they are converted right away into tid-lists, not considering
advantages of boolean operations over binary formats [20].
Many other researchers have proposed other vertical binary algorithms,
although the above mentioned are to the best of our knowledge the most
representatives.
The method we present in this chapter, CBMine, obtains all frequent item-
sets faster than these well-known a priori and vertical binary implementations,
outperforming them considerably, especially for sparse databases.
4 CBMine Algorithm
A new method applied to a priori-like algorithms, named CBMine (Com-
pressed Binary Mine), is analyzed in this section.
4.1 Storing the Transactions
Let us call the itemset that is obtained by removing infrequent items from
a transaction the filtered transaction. The size of the filtered transactions
is declared to be “substantially smaller than the size of database”. Besides,
all frequent itemsets can be determined even if only filtered transactions are
available.
The set of filtered transactions can be represented as an m x n matrix
where m is the number of transactions and n is the number of frequent items
(see Fig. 5 for an 8x5 matrix). We can denote the presence or absence of an
item in each transaction by a binary value (1 if it is present, else 0).
This representation has been considered as a logical view of the data.
Nevertheless, some researchers have employed it for counting the support for
an item and for generating the set of 1-frequent itemsets [15].
To reduce I/O cost and speed up the algorithm, the filtered transactions
could be stored in main memory instead of on disk. Although this is a reason-
able solution, any data structure could require a considerable - and probably
a prohibitive - memory space for large databases.
Search WWH ::




Custom Search