Parallel Execution of SQL Based Association Rule Mining - Nontraditional Database Systems

Databases Reference

In-Depth Information

Inside execution plan for this query, join to generate candidate itemsets is

executed at each node independently. And then after counting itemsets locally,

nodes exchange the local counts by applying hash function on the itemsets in

order to determine overall support count of each itemset. Another data exchange

occurs when large itemsets are distributed to every nodes. Finally each node

independently execute join again to create the transaction data.

Execution plan for SQL statement of pass k which is shown in Figure 1 is here:

Pass k:

1.

Build hash table from R_k-1.Probe the hash table using table R_k-1 itself,

insert tuples that satisfy select conditions into table RTMP_k.

2.

Count support in the local node.

3.

In order to count global support, distribute the count result to processing

nodes by applying hash function on items.

4.

Count global support for each item.

5.

Broadcast items that satisfy the minimum support to every nodes and insert

to table C_k.

6.

Build hash table from C_k.Probe hash table using transaction data, insert

tuples that satisfy select conditions into table R_k.

The detail of the execution plan is shown in Figure 4. Figure 5 give graphical

description of the execution plan.

We have also examined some other alternatives such as instead of distributing

the result of support counting in each node to obtain global item support we

distribute the tuples each pass by applying hash function on items before support

counting, therefore all support counting can be performed in one step. But

preliminary test for this plan has shown poor performance. In order to suppress

the communication cost this plan requires most of items remained at sender node.

Our preliminary test showed that more tuples are redistributed so this plan is

impractical unless the query processor is equipped with mechanism to detect the

distribution of items in each node.

This kind of query utilyze the so called self join extensively. The self join is a

join between the same table. Proper optimization of self join will definitely make

the execution faster.

DBKernel provides facility to trace the usage of resource during execution as

shown in figure 6. Second pass occupies interval from 3s until 23.5s.

Search WWH ::

Custom Search

Home