4.4 Further Considerations
As is the case with most things in parallel computing, there is no silver bullet. Han
et al. [19] noted that, as the ratio of process count to number of candidate itemsets
increases, the number of candidate itemsets assigned to each process in the IDD
algorithm is reduced, shrinking the size of each process's hash tree and the amount
of computation work per transaction. For a sufficiently high number of processes or a
sufficiently low number of candidate itemsets, this decrease in work per transaction
means that, on distributed systems, communicating transactions between processes
becomes the limiting component of the IDD algorithm. To address this, the
authors introduce the Hybrid Distribution algorithm, which uses both the CD and
IDD algorithms. To do this, the processes are split into g equally sized groups. Each
process group is responsible for computing the support for all candidate itemsets with
respect to the |T|/g transactions assigned to it. This can be viewed conceptually as
executing the CD algorithm on g pseudo-processes. Within each group, the local
support for all candidate itemsets is computed by executing the IDD algorithm on
the p/g processes in the group. Each process in a group computes the support
of the candidate itemsets assigned to it with respect to the |T|/g transactions assigned to its
process group. The choice of g can be determined automatically at runtime. When the
number of candidate itemsets is not sufficiently high to ensure that each process in the
system is assigned enough candidate itemsets to offset the cost of communicating the
transactions within each process group, then g is set to p, meaning the CD algorithm
is executed with each process as its own group. Otherwise, g is chosen as the smallest
value that still assigns each process an adequate number of candidate itemsets, so that
computation outweighs communication.
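To make the runtime choice of g concrete, the following Python sketch shows one plausible selection rule under the constraints described above. It is an illustration only: the function and parameter names (choose_num_groups, min_cands_per_proc) are assumptions, not part of the published algorithm.

```python
def choose_num_groups(p, num_candidates, min_cands_per_proc):
    """Pick the number of process groups g for a Hybrid-Distribution-style run.

    min_cands_per_proc is an assumed tuning threshold: the smallest per-process
    share of candidate itemsets for which hash tree computation outweighs the
    cost of shifting transactions within a group.
    """
    # Too few candidates to keep even one process busy: fall back to CD, i.e.
    # every process is its own group and holds all candidate itemsets.
    if num_candidates < min_cands_per_proc:
        return p
    # Within a group of p // g processes, each process holds roughly
    # num_candidates * g / p candidates.  Take the smallest g (fewest
    # replicated copies of the candidate set) that meets the threshold.
    for g in range(1, p + 1):
        if p % g == 0 and num_candidates * g // p >= min_cands_per_proc:
            return g
    return p


def group_of(rank, p, g):
    """Map a process rank to its group index under a block layout."""
    return rank // (p // g)
```

For example, with p = 8 processes, 100 candidate itemsets, and a threshold of 30 candidates per process, this rule yields g = 4: four groups of two processes each, with each process responsible for 50 of its group's candidates.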
Shintani and Kitsuregawa [52] noted that the HPA algorithm suffers from a drawback
similar to that of the IDD algorithm: if the ratio of process count to number of
candidate itemsets becomes sufficiently high, the system can become underutilized.
For this reason, they also introduced the HPA with Extremely Large Itemset
Duplication (HPA-ELD) algorithm. In the case of HPA, the underutilization stems
from the possibility that the number of candidate itemsets is small enough that the
memory available to each process is not fully used. To
address this, HPA-ELD duplicates the highest frequency itemsets on all processes
until the system memory is full. Then, the support for these most frequent candidate
itemsets is computed locally, just like in the CD algorithm, while the support for the
rest of the candidate itemsets is computed in the same way as in HPA.
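The candidate placement step of HPA-ELD can be sketched in a few lines. The Python below is a simplified illustration, not Shintani and Kitsuregawa's implementation; the names assign_candidates_eld and extra_capacity, and the assumption that candidate frequencies are estimated up front, are ours.

```python
def assign_candidates_eld(candidates, num_procs, extra_capacity):
    """Simplified sketch of candidate placement under an HPA-ELD-style scheme.

    candidates: list of (itemset, estimated_frequency) pairs, where each
    itemset is hashable (e.g. a frozenset of items).
    extra_capacity: assumed number of extra itemsets each process can hold
    in memory beyond its HPA share.
    """
    by_freq = sorted(candidates, key=lambda c: c[1], reverse=True)
    # The most frequent candidates are replicated on every process and counted
    # locally, as in the CD algorithm.
    replicated = [itemset for itemset, _ in by_freq[:extra_capacity]]
    # The remaining candidates are hash-partitioned as in plain HPA, so each
    # one has a single owning process.
    partitioned = {rank: [] for rank in range(num_procs)}
    for itemset, _ in by_freq[extra_capacity:]:
        owner = hash(itemset) % num_procs
        partitioned[owner].append(itemset)
    return replicated, partitioned
```

Because the replicated itemsets are counted locally on every process, the transactions that contain them generate no inter-process traffic, while the hash-partitioned remainder is handled exactly as in plain HPA.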
5 Frequent Sequence Mining
Many of the solutions to FSM problems follow algorithms developed for FIM.
Furthermore, parallel FSM (PFSM) algorithms often directly extend serial ones and
their parallelization strategies are heavily inspired by previously developed PFIM
algorithms. We will briefly discuss some of the more prominent serial approaches to
solving the FSM problem, and focus the remainder of the section on the challenges