against the running time these programs require to discover the top 1000 significant impact rules in Figures 4 and 5. The lines with square dots show the change in CPU time for the algorithm with neither of the two efficiency-improving techniques. The lines with round dots show the results for the algorithm with difference-set statistics derivation only, while the lines with triangular dots denote the trends for the algorithm with the circular intersection approach only. The results for the algorithm with both techniques introduced are plotted using the lines with diamond dots.
Almost every database shows a considerable reduction in running time after the introduction of these two efficiency-improving approaches. The difference in efficiency increases with the maximum number of conditions allowed on the rule antecedent. When there is no limit on the maximum number of conditions on the rule antecedent, the CPU time the OPUS IR Filter algorithm with both efficiency-improving techniques spends searching for the top 1000 significant impact rules in ipums.la.97 is less than one sixth of that required by OPUS IR Filter without the techniques. However, the necessary running time is also influenced by other factors, including the size of the database, the proportion of trivial rules among the top 1000 impact rules, and the proportion of significant rules.
After examining the effects of these two efficiency-improving techniques independently, we conclude that the difference-set statistics derivation technique works better on some databases, such as census income, while the circular intersection approach has a greater effect on databases including ipums.la.98. However, the differences in effect are associated with several subtle factors, including the order in which the available conditions are ranked as input to the algorithm, and the order in which different parent rules are compared with the current rule being assessed.
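The difference-set statistics derivation can be illustrated with a minimal sketch. This is not the authors' implementation; it only assumes the property the paper relies on: when a condition is added to a rule's antecedent, the child coverset is a subset of the parent coverset, so the count, sum, and sum of squares over the difference set (parent rows lost by the child) follow from subtracting the child's aggregates from the parent's, with no further pass over the data. The `Stats` class and field names here are hypothetical.

```python
# Hedged sketch (hypothetical names): deriving difference-set statistics
# arithmetically from parent and child rule aggregates, so that no
# additional reference to the data is needed.
from dataclasses import dataclass


@dataclass
class Stats:
    count: int        # number of covered records
    total: float      # sum of the target attribute over covered records
    total_sq: float   # sum of squared target values over covered records

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def difference_stats(parent: Stats, child: Stats) -> Stats:
    """Statistics of coverset(parent) minus coverset(child).

    Valid because the child coverset is a subset of the parent's, so
    every aggregate is additive and can be obtained by subtraction.
    """
    return Stats(
        count=parent.count - child.count,
        total=parent.total - child.total,
        total_sq=parent.total_sq - child.total_sq,
    )


# Toy check: parent covers 10 records summing to 50; the child keeps 6 of
# them summing to 36, so the difference set has 4 records summing to 14.
parent = Stats(count=10, total=50.0, total_sq=300.0)
child = Stats(count=6, total=36.0, total_sq=230.0)
diff = difference_stats(parent, child)
print(diff.count, diff.total, diff.mean)  # 4 14.0 3.5
```

Because the sum of squares is also additive, the variance of the difference set is derivable the same way, which is what makes significance testing on difference sets possible without extra data access.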
7 Conclusions
The large number of resulting rules has long been a handicap for exploratory rule discovery. Many techniques have been proposed to reduce the set of resulting rules to a manageable size. Removing statistically insignificant rules is one popular such technique, and it leads to a considerable decrease in the number of resulting exploratory rules. However, performing statistical tests to assess the significance of a rule requires considerable data access and computation. In this paper we proposed two techniques that improve the efficiency of rule discovery: deriving difference-set statistics without additional references to the data, and reducing the redundancy of intersection operations. We implemented the techniques in k-most-interesting impact rule discovery, which is suitable for distributional-consequent exploratory rule discovery in very large, dense databases. Experimental results show a substantial improvement in efficiency after applying these techniques.