An Introduction to Frequent Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

support values are lost by maximal pattern mining. In other words, the set of maximal

patterns cannot be used to derive the support values of missing subsets. However, the

support values of closed frequent itemsets can be used to derive the support values

of missing subsets. Many interesting methods [ 58 , 67 , 74 ] have been designed for

identifying frequent closed patterns. The general principle of determining frequent

closed patterns has been generalized to that of determining δ -freesets [ 18 ]. This

issue is closely related to that of mining all non-derivable frequent itemsets [ 20 ]. A

survey on this topic may be found in [ 21 ]. These different forms of compression are

discussed in Chaps. 2 and 5.

Finally, a formal way of viewing compression is from the perspective of

information-theoretic models. Information-theoretic models are designed for com-

pressing different kinds of data, and can therefore be used to compress itemsets as

well. This basic principle has been used for methods such as Krimp [ 66 ]. The prob-

lem of determining compressed representations of frequent itemsets is discussed in

Chap. 8. This chapter focusses mostly on the information-theoretic issues of frequent

itemset compression.

3

Scalability Issues in Frequent Pattern Mining

In the modern era, the ability to collect large amounts of data has increased signifi-

cantly because of advances in hardware and software platforms. The amount of data

is often so large that specialized methods are required for the mining process. The

streaming and big-data architectures are slightly different and pose different chal-

lenges for the mining process. The following discussion will address each of these

challenges.

3.1

Frequent Pattern Mining in Data Streams

In recent years, data stream have become very popular because of the advances in

hardware and software technology that can collect and transmit data continuously

over time. In such cases, the major constraint on data mining algorithms is to execute

the algorithms in a single pass. This can be significantly challenging because frequent

and sequential pattern mining methods are generally designed as level-wise methods.

There are two variants of frequent pattern mining for data streams:

Frequent Items or Heavy Hitters: In this case, frequent 1-itemsets need to be

determined from a data stream in a single pass. Such an approach is generally

needed when the total number of distinct items is too large to be held in main

memory. Typically, sketch-based methods are used in order to create a compress

data structure in order to maintain approximate counts of the items [ 23 , 27 ].

Frequent itemsets: In this case, it is not assumed that the number of distinct items

are too large. Therefore, the main challenge in this case is computational, because

Search WWH ::

Custom Search

Home