Database Reference
In-Depth Information
sending half of the nodes on the current level of the projection tree it is expanding,
along with their projected databases, if it has just started expanding the level. If it is
in the middle of processing the level, it responds with a wait message, including an
estimated completion time for processing the current level. Otherwise, if it is nearing
the end of processing the level, it can simply refuse to share its load. The requesting
process may choose another random process to request work from if receiving a
neutral or negative message from a donor process. In their experiments, the authors
show that DTPF achieves comparable or significantly lower run-times than STPF,
demonstrating the utility of dynamically re-distributing work in parallel FSM.
In the context of mining closed patterns from symbol sequences, Cong et al. [ 10 ]
proposed using pseudo-projections of the dataset
in Par-CSP, one for each found
1-length frequent sequences. The projection for the frequent 1-sequence
D
d
is made
up of sequences in
containing d , minus any symbols before d in those sequences.
Assuming each process has access to the input data file, a pseudo-projection is
simply a set of file pointers noting the beginning of each sub-sequence included in
the projection. Each process can then be assigned a sub-set of the 1-length frequent
sequences and their associated pseudo-projections, which they mine using the BIDE
[ 56 ] serial algorithm. Pseudo-projections are communicated to all processes via an
all-to-all broadcast. Dynamic load balancing is then achieved through a master-
worker scheme [ 17 ], where each process contacts the master process for another
1-length frequent sequence to process when they finish their current work. In an
effort to ensure tasks in the work queue have similar sizes, the authors use a selective
sampling scheme similar to that in Par-ASP [ 9 ] to sub-divide large tasks.
A key task in bioinformatics is the identification of frequently recurring patterns,
called motifs , in long genomic sequences. Sequences are defined over a small alpha-
bet, e.g.
D
for DNA sequences. Patterns are constrained by minimum
and maximum length parameters, and are required to be highly similar, rather than
identical, to count towards a frequently occurring pattern. Sahli et al. developed
ACME [ 50 ], a parallel combinatorial motif miner that decomposes the problem into
many fine-grained sub-tasks, dynamically schedules them to be executed by all avail-
able processes, and uses a cache aware search technique to perform each sub-task.
They construct a full-text suffix-tree index of the sequence at each worker process
in linear time and space, using Ukonnen's algorithm [ 55 ], and annotate each node
in the tree with the number of leaves reachable through the node. Motif mining is
then reduced to finding tree nodes that represent candidate patterns and summing up
their annotated leaf count. The search space, represented as a trie data structure, is
partitioned into prefix-coherent sub-tries by a master process s.t. the average number
of sub-tries per worker process is at least 16. The authors show experimentally that
this leads to a near-optimal workload balance in both distributed message passing
and shared memory environments. A worker then requests a new prefix (sub-task)
to search for when it finishes processing the current assignment.
A ={
A , C , G , T
}
Search WWH ::




Custom Search