Database Reference
In-Depth Information
support values are lost by maximal pattern mining. In other words, the set of maximal
patterns cannot be used to derive the support values of missing subsets. However, the
support values of closed frequent itemsets can be used to derive the support values
of missing subsets. Many interesting methods [ 58 , 67 , 74 ] have been designed for
identifying frequent closed patterns. The general principle of determining frequent
closed patterns has been generalized to that of determining δ -freesets [ 18 ]. This
issue is closely related to that of mining all non-derivable frequent itemsets [ 20 ]. A
survey on this topic may be found in [ 21 ]. These different forms of compression are
discussed in Chaps. 2 and 5.
Finally, a formal way of viewing compression is from the perspective of
information-theoretic models. Information-theoretic models are designed for com-
pressing different kinds of data, and can therefore be used to compress itemsets as
well. This basic principle has been used for methods such as Krimp [ 66 ]. The prob-
lem of determining compressed representations of frequent itemsets is discussed in
Chap. 8. This chapter focusses mostly on the information-theoretic issues of frequent
itemset compression.
3
Scalability Issues in Frequent Pattern Mining
In the modern era, the ability to collect large amounts of data has increased signifi-
cantly because of advances in hardware and software platforms. The amount of data
is often so large that specialized methods are required for the mining process. The
streaming and big-data architectures are slightly different and pose different chal-
lenges for the mining process. The following discussion will address each of these
challenges.
3.1
Frequent Pattern Mining in Data Streams
In recent years, data stream have become very popular because of the advances in
hardware and software technology that can collect and transmit data continuously
over time. In such cases, the major constraint on data mining algorithms is to execute
the algorithms in a single pass. This can be significantly challenging because frequent
and sequential pattern mining methods are generally designed as level-wise methods.
There are two variants of frequent pattern mining for data streams:
￿
Frequent Items or Heavy Hitters: In this case, frequent 1-itemsets need to be
determined from a data stream in a single pass. Such an approach is generally
needed when the total number of distinct items is too large to be held in main
memory. Typically, sketch-based methods are used in order to create a compress
data structure in order to maintain approximate counts of the items [ 23 , 27 ].
￿
Frequent itemsets: In this case, it is not assumed that the number of distinct items
are too large. Therefore, the main challenge in this case is computational, because
Search WWH ::




Custom Search