Parallel Execution of SQL Based Association Rule Mining - Nontraditional Database Systems

The modified query is then called Set-oriented Apriori. However, we use the
simpler modified SETM in this evaluation because we found that the performance
does not differ much for our dataset.

5 Performance Evaluation on PC Cluster

At present, parallel SQL runs on expensive massively parallel machines, but this
will not remain the case. In the future it will run on inexpensive PC cluster or
workstation (WS) cluster systems. We therefore believe that an SQL implementation
based on sophisticated optimization is a reasonable approach.

5.1 Parallel Execution Environment

The experiment is conducted on a PC cluster developed at the Institute of
Industrial Science, The University of Tokyo. This pilot system, named NEDO-100,
consists of one hundred commodity PCs connected by an ATM network. We have also
developed the DBKernel database server for query processing on this system. Each
PC has an Intel Pentium Pro 200 MHz CPU, a 4.3 GB SCSI hard disk and 64 MB of RAM.
A performance evaluation using the TPC-D benchmark on the 100-node cluster is
reported in 13). The results showed that the system achieves significantly higher
performance than current commercially available high-end systems, especially for
join-intensive queries such as query 9.

5.2 Dataset

We use synthetic transaction data for the experiment, generated with the program
described in the Apriori algorithm paper 2). The parameters are: number of
transactions 200000, average transaction length 10, and number of items 2000. The
transaction data is partitioned uniformly by transaction ID across the local hard
disks of the processing nodes.
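
The partitioning described above can be sketched as follows. This is only an illustration, not the DBKernel implementation: the source says the data is partitioned uniformly by transaction ID but does not state the exact scheme, so the modulo assignment, the function names, and the tiny sample data are all assumptions. The sketch also shows how per-node item counts from a first pass could be combined into global counts.

```python
from collections import Counter
from itertools import chain


def partition_by_tid(transactions, num_nodes):
    """Assign each (tid, items) pair to a node by tid modulo num_nodes.

    The modulo scheme is an assumption; the source only says the data is
    partitioned uniformly according to transaction ID.
    """
    partitions = [[] for _ in range(num_nodes)]
    for tid, items in transactions:
        partitions[tid % num_nodes].append((tid, items))
    return partitions


def local_item_counts(partition):
    """Count, for one node's partition, how many transactions contain each item."""
    return Counter(chain.from_iterable(set(items) for _, items in partition))


def global_item_counts(partitions):
    """Sum the per-node counts, as a coordinator might after the first pass."""
    total = Counter()
    for part in partitions:
        total.update(local_item_counts(part))
    return total


# Hypothetical toy data: 20 transactions with two items each.
transactions = [(tid, [tid % 5, (tid * 3) % 7 + 10]) for tid in range(1, 21)]
partitions = partition_by_tid(transactions, 4)
counts = global_item_counts(partitions)
```

Because every transaction lands on exactly one node, summing the local counts gives the same result as counting over the unpartitioned data.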

5.3 Results

The execution times for several minimum support values are shown in Figure 2
(left). The result is surprisingly good, even compared with a directly coded
Apriori-based C program running on a single processing node. On average,
parallelizing SQL-based mining across about 4 processing nodes achieves the same
execution time as that program. The speedup ratio shown in Figure 2 (right) is
also reasonably good, although the speedup appears to saturate as the number of
processing nodes increases. As the size of the dataset assigned to each node gets
smaller, the processing overhead and the synchronization cost, which depends on
the number of nodes, cancel the gain.
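
The saturation argument can be illustrated with a toy cost model. This is a sketch under stated assumptions, not a fit to the measurements in Figure 2: the form of the model and the constants `work`, `overhead` and `sync_per_node` are hypothetical and chosen only to show the effect.

```python
def execution_time(nodes, work=100.0, overhead=2.0, sync_per_node=0.1):
    """Toy model: divisible work, a fixed overhead, and a synchronization
    cost that grows with the number of nodes. All constants are hypothetical."""
    return work / nodes + overhead + sync_per_node * nodes


def speedup(nodes, **params):
    """Speedup relative to a single node under the same toy model."""
    return execution_time(1, **params) / execution_time(nodes, **params)


def efficiency(nodes, **params):
    """Speedup divided by node count; 1.0 would be ideal linear scaling."""
    return speedup(nodes, **params) / nodes


for n in (1, 2, 4, 8, 16):
    print(n, round(speedup(n), 2), round(efficiency(n), 2))
```

In this model the divisible term shrinks as nodes are added while the other two terms do not, so efficiency falls with node count, which matches the qualitative behaviour described above.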

Figure 3 (left) shows the percentage of time spent in each pass when the minimum
support is 0.5%. Eight passes are necessary to process the entire transaction
database. It is well known that in most cases the second pass generates a huge
number of candidate
