The modified query is then called Set-oriented Apriori. However, we use the
simpler modified SETM in this evaluation, since we found that the performance
does not differ much for our dataset.
5 Performance Evaluation on PC Cluster
At present, parallel SQL runs on expensive massively parallel machines, but this
will not remain the case. In the future it will run on inexpensive PC cluster or
workstation (WS) cluster systems. Thus we believe that an SQL implementation based
on sophisticated optimization would be one of the reasonable approaches.
5.1 Parallel Execution Environment
The experiment is conducted on a PC cluster developed at the Institute of Industrial
Science, The University of Tokyo. This pilot system consists of one hundred
commodity PCs connected by an ATM network named NEDO-100. We have also
developed the DBKernel database server for query processing on this system. Each PC
has an Intel Pentium Pro 200 MHz CPU, a 4.3 GB SCSI hard disk and 64 MB of RAM.
A performance evaluation using the TPC-D benchmark on the 100-node cluster is
reported in 13). The results showed that it can achieve significantly higher
performance than current commercially available high-end systems, especially for
join-intensive queries such as query 9.
5.2 Dataset
We use synthetic transaction data generated with the program described in the Apriori
algorithm paper 2) for the experiment. The parameters used are: number of transactions
200000, average transaction length 10 and number of items 2000. The transaction
data is partitioned uniformly by transaction ID across the local hard disks of the
processing nodes.
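The partitioning step can be illustrated with a minimal Python sketch. The text does not say whether the uniform split by transaction ID is done by range or by modulo, so the sketch assumes a simple modulo assignment; the function name `partition_by_tid` and the (tid, items) tuple layout are hypothetical and are not taken from the DBKernel implementation.

```python
from collections import defaultdict


def partition_by_tid(transactions, num_nodes):
    """Split (tid, items) pairs across nodes by tid modulo num_nodes.

    Assumption: a modulo scheme; the paper only states that the data is
    partitioned uniformly according to transaction ID.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    partitions = defaultdict(list)
    for tid, items in transactions:
        partitions[tid % num_nodes].append((tid, items))
    # Return one list per node, including nodes that received nothing.
    return [partitions[node] for node in range(num_nodes)]


if __name__ == "__main__":
    # Tiny illustrative input, not the synthetic dataset used in the paper.
    demo = [(tid, [tid % 7, tid % 11]) for tid in range(10)]
    for node, part in enumerate(partition_by_tid(demo, 4)):
        print(node, [tid for tid, _ in part])
```

With consecutive transaction IDs, the partition sizes differ by at most one, so each node scans roughly the same number of transactions in every pass.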
5.3 Results
The execution times for several minimum support values are shown in figure 2(left).
The result compares surprisingly well even with a directly coded Apriori-based C
program on a single processing node. On average, we can achieve the same level of
execution time by parallelizing SQL-based mining with around 4 processing nodes.
The speedup ratio shown in figure 2(right) is also reasonably good, although the
speedup seems to saturate as the number of processing nodes increases. As
the size of the dataset assigned to each node gets smaller, the processing overhead
and the synchronization cost, which depends on the number of nodes, cancel the gain.
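As a rough illustration of how a speedup ratio of this kind is computed, the following Python sketch divides the execution time of the smallest configuration by the time of each larger one and also reports parallel efficiency. The function name `speedup_table` is hypothetical, and the timings in the usage example are placeholders chosen for illustration, not measurements from this evaluation.

```python
def speedup_table(times, base_nodes=None):
    """Map node count -> (speedup, efficiency) relative to a base run.

    times: dict of {number_of_nodes: execution_time}.
    base_nodes: reference configuration; defaults to the smallest node count.
    """
    if base_nodes is None:
        base_nodes = min(times)
    base_time = times[base_nodes]
    table = {}
    for nodes in sorted(times):
        speedup = base_time / times[nodes]
        ideal = nodes / base_nodes  # linear speedup for this node count
        table[nodes] = (speedup, speedup / ideal)
    return table


if __name__ == "__main__":
    # Placeholder timings in seconds, NOT results from the paper.
    example_times = {1: 100.0, 2: 52.0, 4: 28.0, 8: 17.0}
    for nodes, (s, eff) in speedup_table(example_times).items():
        print(f"{nodes} nodes: speedup {s:.2f}, efficiency {eff:.2f}")
```

Efficiency falling below 1 as nodes are added corresponds to the saturation described above: the per-node share of the data shrinks while overhead and synchronization cost do not.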
Figure 3(left) shows the percentage of time spent in each pass when the minimum
support is 0.5%. Eight passes are necessary to process the entire transaction
database. It is well known that in most cases the second pass generates a huge
number of candidate