Big Data Frequent Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

today, the so-called “Big Data”. Web log data from social media sites such as Twitter

produce over 100 TB of raw data daily [ 32 ]. Giants such as Walmart register billions

of yearly transactions [ 5 ]. Today's high-throughput gene sequencing platforms are

capable of generating terabytes of data in a single experiment [ 16 ]. Tools are needed

that can effectively mine frequent patterns from these massive data in a timely manner.

Some of today's frequent pattern mining source data may not fit on a single

machine's hard drive, let alone in its volatile memory. The exponential nature of the

solution search space compounds this problem. Scalable parallel algorithms hold

the key to addressing pattern mining in the context of Big Data. In this chapter, we

review recent advances in solving the frequent pattern mining problem in parallel.

We start by presenting an overview of the frequent pattern mining problem and

its specializations in Sect. 2 . In Sect. 3 , we examine advantages of and challenges

encountered when parallelizing algorithms, given today's distributed and shared

memory systems, centering our discussion in the frequent pattern mining context.

We survey existing serial and parallel pattern mining methods in Sects. 4 - 6 . Finally,

Sect. 7 draws some conclusions about the state-of-the-art and further opportunities

in the field.

Frequent Pattern Mining: Overview

Since the well-known itemset model was introduced by Agrawal and Srikant [ 1 ]

in 1994, numerous papers have been published proposing efficient solutions to the

problem of discovering frequent patterns in databases. Most follow two well known

paradigms, which we briefly describe in this section, after first introducing notation

and concepts used throughout the paper.

2.1

Preliminaries

Let I

i 1 , i 2 , ... , i n }

be a set of items. An itemset C is a subset of I . We denote

its length or size , i.e. the number of items in C . Given a list of transactions

denotes the total number

of transactions. Transactions are generally identified by a transaction id ( tid ). The

support of C is the proportion of transactions in

, where each transaction T

∈ T

is an itemset,

| T |

that contain C , i. e., φ ( C )

∈ T

, C

⊆

| T |

. The support count ,or frequency of C is the number of

transactions in

that contain C . An itemset is said to be a frequent itemset if it has

a support greater than some user defined minimum support threshold, σ .

The itemset model was extended to handle sequences by Srikant and Agrawal [ 54 ].

A sequence is defined as an ordered list of itemsets, s

C 1 , C 2 , ... , C l

, where

⊆

≤

| D |

C j

sequences, in which each

sequence may be associated with a customer id and elements in the sequence may

have an assigned timestamp. Without loss of generality, we assume a lexicographic

I ,1

l . A sequence database

is a list of

Frequent Pattern Mining

Search WWH ::

Custom Search

Home