Database Reference
In-Depth Information
today, the so-called “Big Data”. Web log data from social media sites such as Twitter
produce over 100 TB of raw data daily [ 32 ]. Giants such as Walmart register billions
of yearly transactions [ 5 ]. Today's high-throughput gene sequencing platforms are
capable of generating terabytes of data in a single experiment [ 16 ]. Tools are needed
that can effectively mine frequent patterns from these massive data in a timely manner.
Some of today's frequent pattern mining source data may not fit on a single
machine's hard drive, let alone in its volatile memory. The exponential nature of the
solution search space compounds this problem. Scalable parallel algorithms hold
the key to addressing pattern mining in the context of Big Data. In this chapter, we
review recent advances in solving the frequent pattern mining problem in parallel.
We start by presenting an overview of the frequent pattern mining problem and
its specializations in Sect. 2 . In Sect. 3 , we examine advantages of and challenges
encountered when parallelizing algorithms, given today's distributed and shared
memory systems, centering our discussion in the frequent pattern mining context.
We survey existing serial and parallel pattern mining methods in Sects. 4 - 6 . Finally,
Sect. 7 draws some conclusions about the state-of-the-art and further opportunities
in the field.
2
Frequent Pattern Mining: Overview
Since the well-known itemset model was introduced by Agrawal and Srikant [ 1 ]
in 1994, numerous papers have been published proposing efficient solutions to the
problem of discovering frequent patterns in databases. Most follow two well known
paradigms, which we briefly describe in this section, after first introducing notation
and concepts used throughout the paper.
2.1
Preliminaries
Let I
={
i 1 , i 2 , ... , i n }
be a set of items. An itemset C is a subset of I . We denote
by
|
C
|
its length or size , i.e. the number of items in C . Given a list of transactions
T
denotes the total number
of transactions. Transactions are generally identified by a transaction id ( tid ). The
support of C is the proportion of transactions in
, where each transaction T
T
is an itemset,
| T |
T
that contain C , i. e., φ ( C )
=
|{
T
|
T
T
, C
T
}|
/
| T |
. The support count ,or frequency of C is the number of
transactions in
that contain C . An itemset is said to be a frequent itemset if it has
a support greater than some user defined minimum support threshold, σ .
The itemset model was extended to handle sequences by Srikant and Agrawal [ 54 ].
A sequence is defined as an ordered list of itemsets, s
T
=
C 1 , C 2 , ... , C l
, where
D
| D |
C j
sequences, in which each
sequence may be associated with a customer id and elements in the sequence may
have an assigned timestamp. Without loss of generality, we assume a lexicographic
I ,1
j
l . A sequence database
is a list of
Search WWH ::




Custom Search