4.4 Further Considerations
As is the case with most things in parallel computing, there is no silver bullet. Han
et al. [19] noted that, as the ratio of process count to number of candidate itemsets
increases, the number of candidate itemsets assigned to each process in the IDD
algorithm is reduced, shrinking the size of each process's hash tree and the amount
of computation work per transaction. For a sufficiently high number of processes or a
sufficiently low number of candidate itemsets, this decrease in work per transaction
means that, on distributed systems, communicating transactions between processes
becomes the limiting component of the IDD algorithm. To address this, the
authors introduce the Hybrid Distribution algorithm, which uses both the CD and
IDD algorithms. To do this, the processes are split into g equally sized groups. Each
process group is responsible for computing the support for all candidate itemsets with
respect to the |T|/g transactions assigned to it. This can be viewed conceptually as
executing the CD algorithm on g pseudo-processes. Within each group, the local
support for all candidate itemsets is computed by executing the IDD algorithm on
the p/g processes in the group. Each process in a group computes the support
of the candidate itemsets assigned to it with respect to the |T|/g transactions assigned to its
process group. The choice of g can be determined automatically at runtime. When the
number of candidate itemsets is not sufficiently high to ensure that each process in the
system is assigned enough candidate itemsets to offset the cost of communicating the
transactions within each process group, then g is set to p, meaning the CD algorithm
is executed with each process as its own group. Otherwise, g is chosen as the smallest
value that still assigns each process an adequate number of candidate itemsets, so that
computation outweighs communication.
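To make the runtime choice of g concrete, the following Python sketch shows one plausible selection rule under the constraints described above. It is an illustration only: the function and parameter names (choose_num_groups, min_cands_per_proc) are assumptions, not part of the published algorithm.

```python
def choose_num_groups(p, num_candidates, min_cands_per_proc):
    """Pick the number of process groups g for a Hybrid-Distribution-style run.

    min_cands_per_proc is an assumed tuning threshold: the smallest per-process
    share of candidate itemsets for which hash tree computation outweighs the
    cost of shifting transactions within a group.
    """
    # Too few candidates to keep even one process busy: fall back to CD, i.e.
    # every process is its own group and holds all candidate itemsets.
    if num_candidates < min_cands_per_proc:
        return p
    # Within a group of p // g processes, each process holds roughly
    # num_candidates * g / p candidates.  Take the smallest g (fewest
    # replicated copies of the candidate set) that meets the threshold.
    for g in range(1, p + 1):
        if p % g == 0 and num_candidates * g // p >= min_cands_per_proc:
            return g
    return p


def group_of(rank, p, g):
    """Map a process rank to its group index under a block layout."""
    return rank // (p // g)
```

For example, with p = 8 processes, 100 candidate itemsets, and a threshold of 30 candidates per process, this rule yields g = 4: four groups of two processes each, with each process responsible for 50 of its group's candidates.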
Shintani and Kitsuregawa [52] noted that the HPA algorithm suffers from a drawback
similar to that of the IDD algorithm: if the ratio of process count to number of
candidate itemsets becomes sufficiently high, the system can become underutilized.
For this reason, they also introduced the HPA with Extremely Large Itemset
Duplication (HPA-ELD) algorithm. In the case of HPA, the underutilization stems
from the possibility that the number of candidate itemsets is small enough that the
memory available to each process is not fully used. To
address this, HPA-ELD duplicates the highest frequency itemsets on all processes
until the system memory is full. Then, the support for these most frequent candidate
itemsets is computed locally, just like in the CD algorithm, while the support for the
rest of the candidate itemsets is computed in the same way as in HPA.
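The candidate placement step of HPA-ELD can be sketched in a few lines. The Python below is a simplified illustration, not Shintani and Kitsuregawa's implementation; the names assign_candidates_eld and extra_capacity, and the assumption that candidate frequencies are estimated up front, are ours.

```python
def assign_candidates_eld(candidates, num_procs, extra_capacity):
    """Simplified sketch of candidate placement under an HPA-ELD-style scheme.

    candidates: list of (itemset, estimated_frequency) pairs, where each
    itemset is hashable (e.g. a frozenset of items).
    extra_capacity: assumed number of extra itemsets each process can hold
    in memory beyond its HPA share.
    """
    by_freq = sorted(candidates, key=lambda c: c[1], reverse=True)
    # The most frequent candidates are replicated on every process and counted
    # locally, as in the CD algorithm.
    replicated = [itemset for itemset, _ in by_freq[:extra_capacity]]
    # The remaining candidates are hash-partitioned as in plain HPA, so each
    # one has a single owning process.
    partitioned = {rank: [] for rank in range(num_procs)}
    for itemset, _ in by_freq[extra_capacity:]:
        owner = hash(itemset) % num_procs
        partitioned[owner].append(itemset)
    return replicated, partitioned
```

Because the replicated itemsets are counted locally on every process, the transactions that contain them generate no inter-process traffic, while the hash-partitioned remainder is handled exactly as in plain HPA.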
5 Frequent Sequence Mining
Many of the solutions to FSM problems follow algorithms developed for FIM.
Furthermore, parallel FSM (PFSM) algorithms often directly extend serial ones and
their parallelization strategies are heavily inspired by previously developed PFIM
algorithms. We will briefly discuss some of the more prominent serial approaches to
solving the FSM problem, and focus the remainder of the section on the challenges