Figure 2: Execution time (left) and speedup ratio (right)
itemsets; thus it is the most time-consuming phase. Figure 3 (right) shows the speedup
ratio for each pass. The later the pass, the smaller the candidate itemsets become, and
the non-negligible parallelization overhead becomes dominant, especially in passes
later than five. Depending on the size of the candidate itemsets, we could change the
degree of parallelization; that is, we should reduce the number of nodes in later
passes, as sketched below. Such extensions will need further investigation.
Figure 3: Pass analysis (minimum support 5%). Contribution of each pass to execution
time (left) and speedup ratio of each pass (right)
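The following is a minimal sketch, in plain Python, of the extension suggested above: shrink the degree of parallelization when the candidate itemset table becomes small, so that the parallelization overhead of the later passes no longer dominates. The threshold and the candidate counts used in the example are illustrative assumptions, not measurements from the paper.

# Pick the number of nodes for a pass from the candidate itemset count.
# The per-node threshold is an illustrative assumption.
def nodes_for_pass(num_candidates: int,
                   max_nodes: int,
                   min_candidates_per_node: int = 10_000) -> int:
    if num_candidates <= 0:
        return 1
    wanted = num_candidates // min_candidates_per_node
    return max(1, min(max_nodes, wanted))

# Early passes with many candidates use all nodes; later passes fall back
# to a few nodes, avoiding the overhead seen in Figure 3 (right).
for pass_no, candidates in enumerate([500_000, 200_000, 60_000, 9_000, 800], start=1):
    print(pass_no, nodes_for_pass(candidates, max_nodes=16))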
5.4 Execution Behaviour
The original SETM algorithm assumes execution using sort-merge join 4). Although
the authors showed that sort-merge join performs better than nested-loop join with
indexes, the sort process is hard to parallelize. Inside the database server on our
system, relational joins are executed as hash joins, and tables are partitioned over
the nodes by hashing. As a result, parallelization efficiency is much improved. This
approach is very effective for large-scale data mining.
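The sketch below illustrates, in plain Python rather than in the DB Kernel itself, why this strategy parallelizes well: both relations are partitioned over the nodes by hashing the join key, and each node then performs a purely local build/probe hash join, so no global sort and no cross-node merge is needed. The table and column names (a SETM-style self-join on the transaction id) are illustrative assumptions.

from collections import defaultdict

def hash_partition(rows, key, num_nodes):
    # Assign each row to a node by hashing its join-key value.
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts

def local_hash_join(left, right, key):
    # Classic build/probe hash join executed independently on one node.
    build = defaultdict(list)
    for row in left:
        build[row[key]].append(row)
    return [(l, r) for r in right for l in build.get(r[key], [])]

def parallel_hash_join(left, right, key, num_nodes):
    left_parts = hash_partition(left, key, num_nodes)
    right_parts = hash_partition(right, key, num_nodes)
    result = []
    for node in range(num_nodes):      # each iteration models one node's local work
        result.extend(local_hash_join(left_parts[node], right_parts[node], key))
    return result

# SETM-style self-join of transaction data on the transaction id.
sales = [{"tid": 1, "item": "a"}, {"tid": 1, "item": "b"}, {"tid": 2, "item": "a"}]
print(parallel_hash_join(sales, sales, "tid", num_nodes=4))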
The DB Kernel allows the user to freely customize the execution plan of any query.
We have designed the execution plan to accommodate hash joins while suppressing
communication among the nodes, in order to achieve a better speedup ratio.
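The kind of plan decision involved can be hinted at with the hypothetical sketch below; the DB Kernel's actual plan interface is not described in the text, so the Scan type, its fields, and the table names are purely illustrative. The point is that when both join inputs are already hash-partitioned on the join key, the hash join runs node-locally and no tuples have to be exchanged over the network.

from dataclasses import dataclass

@dataclass
class Scan:
    table: str
    partition_key: str   # column the table is hash-partitioned on across nodes

def join_needs_redistribution(left: Scan, right: Scan, join_key: str) -> bool:
    # True if at least one input must be re-hashed across the network.
    return left.partition_key != join_key or right.partition_key != join_key

# Both relations partitioned on the transaction id: local join, no communication.
print(join_needs_redistribution(Scan("SALES", "tid"), Scan("R1", "tid"), "tid"))   # False
# Mismatched partitioning would force inter-node redistribution of one input.
print(join_needs_redistribution(Scan("SALES", "item"), Scan("R1", "tid"), "tid"))  # True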