Parallel Execution of SQL Based Association Rule Mining - Nontraditional Database Systems


Association rule mining is known to be a CPU-intensive application. This has driven much of the early research in data mining toward new, efficient mining methods such as Apriori 2) and its improvements 9) 3). Some algorithms are already available as commercial packages. Most of them assume that the data is stored in a flat file system. In most cases, however, the data is managed by an RDBMS, so one has to export the data from the database and perform the mining with specialized software outside the database. Some software packages also provide access to the database through a cursor interface 7).

However, an RDBMS has sophisticated query processing capability through the standard language SQL. There have therefore been recent efforts to perform data mining using relational database systems, which offers advantages such as seamless integration with existing systems and high portability. The methods examined range from using SQL directly to extensions such as user-defined functions (UDF) 11). Other efforts couple the RDBMS more tightly with the association rule mining system. For example, DMQL 5) and M-SQL 8) proposed extensions to standard SQL to handle mining operators.

The pure SQL-92 approach is interesting because SQL-92 is a standard supported by most database systems, which means it offers the highest level of portability and flexibility. Unfortunately, the SQL approach is reported to have a performance drawback.
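
To make the idea of mining with plain SQL concrete, the following is a minimal sketch of counting frequent item pairs with a self-join and GROUP BY. It is not the modified SETM query discussed in this paper. The table `sales(tid, item)`, the toy data, and the function name `frequent_pairs` are all assumptions made for illustration, and SQLite stands in for a real RDBMS.

```python
import sqlite3

# Hypothetical toy data: (transaction id, item) pairs.
# Assumes each (tid, item) combination appears at most once.
ROWS = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"),
    (4, "bread"), (4, "milk"),
]


def frequent_pairs(rows, min_support):
    """Return (item1, item2, count) for pairs found in at least min_support transactions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # The self-join pairs up items bought in the same transaction;
    # a.item < b.item keeps each unordered pair exactly once.
    query = """
        SELECT a.item, b.item, COUNT(*) AS cnt
        FROM sales a, sales b
        WHERE a.tid = b.tid AND a.item < b.item
        GROUP BY a.item, b.item
        HAVING COUNT(*) >= ?
        ORDER BY a.item, b.item
    """
    result = conn.execute(query, (min_support,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for item1, item2, count in frequent_pairs(ROWS, 2):
        print(item1, item2, count)
```

The query uses only joins, grouping, and aggregation, which is why this style of mining ports easily across database systems. The join also hints at the performance concern: the intermediate result grows with the number of item combinations per transaction.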

We proposed large-scale PC clusters as a cost-effective platform for data-intensive applications such as data mining using a parallel RDBMS, which offers the advantages of integration without sacrificing performance 13).

There is a tradeoff between performance and portability. The performance is not necessarily high, but seamless integration with an existing RDBMS would be considerably advantageous. Since relational databases are already very popular, the feasibility of association rule mining can be explored using standard SQL queries instead of purchasing expensive mining software. In addition, parallel relational databases are now also widely accepted. We showed that parallelizing the SQL execution of a modified SETM query on a PC cluster can offer the same performance as Apriori-based native programs with 4 nodes. Since most organizations have many PCs that are not fully utilized, we are able to exploit such resources to enhance the performance significantly.

On the other hand, most major commercial database systems have recently added capabilities to support parallelization, although no report is available on how parallelization affects the performance of the complex queries required by association rule mining. This motivated us to examine how efficiently SQL-based association rule mining can be parallelized and sped up using a commercial parallel database system (IBM DB2 UDB EEE). We propose two techniques to enhance the association rule mining query based on SETM [3]. We have also compared the performance with a commercial mining tool (IBM Intelligent Miner). Our performance evaluation shows that we can achieve performance comparable to the commercial mining tool using only 4 nodes.

This paper is composed of six sections. In the second section we will briefly explain
