Big Data Frequent Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

sending half of the nodes on the current level of the projection tree it is expanding,

along with their projected databases, if it has just started expanding the level. If it is

in the middle of processing the level, it responds with a wait message, including an

estimated completion time for processing the current level. Otherwise, if it is nearing

the end of processing the level, it can simply refuse to share its load. The requesting

process may choose another random process to request work from if receiving a

neutral or negative message from a donor process. In their experiments, the authors

show that DTPF achieves comparable or significantly lower run-times than STPF,

demonstrating the utility of dynamically re-distributing work in parallel FSM.

In the context of mining closed patterns from symbol sequences, Cong et al. [ 10 ]

proposed using pseudo-projections of the dataset

in Par-CSP, one for each found

1-length frequent sequences. The projection for the frequent 1-sequence

D

d

is made

up of sequences in

containing d , minus any symbols before d in those sequences.

Assuming each process has access to the input data file, a pseudo-projection is

simply a set of file pointers noting the beginning of each sub-sequence included in

the projection. Each process can then be assigned a sub-set of the 1-length frequent

sequences and their associated pseudo-projections, which they mine using the BIDE

[ 56 ] serial algorithm. Pseudo-projections are communicated to all processes via an

all-to-all broadcast. Dynamic load balancing is then achieved through a master-

worker scheme [ 17 ], where each process contacts the master process for another

1-length frequent sequence to process when they finish their current work. In an

effort to ensure tasks in the work queue have similar sizes, the authors use a selective

sampling scheme similar to that in Par-ASP [ 9 ] to sub-divide large tasks.

A key task in bioinformatics is the identification of frequently recurring patterns,

called motifs , in long genomic sequences. Sequences are defined over a small alpha-

bet, e.g.

D

for DNA sequences. Patterns are constrained by minimum

and maximum length parameters, and are required to be highly similar, rather than

identical, to count towards a frequently occurring pattern. Sahli et al. developed

ACME [ 50 ], a parallel combinatorial motif miner that decomposes the problem into

many fine-grained sub-tasks, dynamically schedules them to be executed by all avail-

able processes, and uses a cache aware search technique to perform each sub-task.

They construct a full-text suffix-tree index of the sequence at each worker process

in linear time and space, using Ukonnen's algorithm [ 55 ], and annotate each node

in the tree with the number of leaves reachable through the node. Motif mining is

then reduced to finding tree nodes that represent candidate patterns and summing up

their annotated leaf count. The search space, represented as a trie data structure, is

partitioned into prefix-coherent sub-tries by a master process s.t. the average number

of sub-tries per worker process is at least 16. The authors show experimentally that

this leads to a near-optimal workload balance in both distributed message passing

and shared memory environments. A worker then requests a new prefix (sub-task)

to search for when it finishes processing the current assignment.

A ={

A , C , G , T

}

Frequent Pattern Mining

Search WWH ::

Custom Search

Home