against the running time these programs require to discover the top 1000 significant impact rules in Figures 4 and 5. The lines with square dots show the change in CPU time for the algorithm with neither of the two efficiency-improving techniques. The lines with round dots show the results for the algorithm with difference-set statistics derivation only, while the lines with triangular dots denote the trends for the algorithm with the circular intersection approach only. The results for the algorithm with both techniques introduced are plotted using the lines with diamond dots.
Almost every database shows a considerable reduction in running time after the introduction of these two efficiency-improving approaches. The difference in efficiency increases with the maximum number of conditions allowed on the rule antecedent. When there is no limit on the maximum number of conditions on the rule antecedent, the CPU time the OPUS IR Filter algorithm with both efficiency-improving techniques spends searching for the top 1000 significant impact rules in ipums.la.97 is less than one sixth of that required by OPUS IR Filter without the techniques. However, the necessary running time is also influenced by other factors, including the size of the database, the proportion of trivial rules among the top 1000 impact rules, and the proportion of significant rules.
After examining the effects of these two efficiency-improving techniques independently, we conclude that the difference-set statistics derivation technique works better on some databases, such as census income, while the circular intersection approach has a greater effect on databases including ipums.la.98. However, the differences in effect are associated with several subtle factors, including the order in which the available conditions are ranked as input to the algorithm, and the order in which different parent rules are compared with the current rule being assessed.
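The difference-set statistics derivation can be illustrated with a minimal sketch. This is not the authors' implementation; it only assumes the property the paper relies on: when a condition is added to a rule's antecedent, the child coverset is a subset of the parent coverset, so the count, sum, and sum of squares over the difference set (parent rows lost by the child) follow from subtracting the child's aggregates from the parent's, with no further pass over the data. The `Stats` class and field names here are hypothetical.

```python
# Hedged sketch (hypothetical names): deriving difference-set statistics
# arithmetically from parent and child rule aggregates, so that no
# additional reference to the data is needed.
from dataclasses import dataclass


@dataclass
class Stats:
    count: int        # number of covered records
    total: float      # sum of the target attribute over covered records
    total_sq: float   # sum of squared target values over covered records

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def difference_stats(parent: Stats, child: Stats) -> Stats:
    """Statistics of coverset(parent) minus coverset(child).

    Valid because the child coverset is a subset of the parent's, so
    every aggregate is additive and can be obtained by subtraction.
    """
    return Stats(
        count=parent.count - child.count,
        total=parent.total - child.total,
        total_sq=parent.total_sq - child.total_sq,
    )


# Toy check: parent covers 10 records summing to 50; the child keeps 6 of
# them summing to 36, so the difference set has 4 records summing to 14.
parent = Stats(count=10, total=50.0, total_sq=300.0)
child = Stats(count=6, total=36.0, total_sq=230.0)
diff = difference_stats(parent, child)
print(diff.count, diff.total, diff.mean)  # 4 14.0 3.5
```

Because the sum of squares is also additive, the variance of the difference set is derivable the same way, which is what makes significance testing on difference sets possible without extra data access.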
7 Conclusions
The large number of resulting rules has long been a handicap for exploratory rule discovery. Many techniques have been proposed to reduce the set of resulting rules to a manageable size. Removing statistically insignificant rules is one popular such technique, and it leads to a considerable decrease in the number of resulting exploratory rules. However, performing statistical tests to assess the significance of a rule requires considerable data access and computation. In this paper we proposed two techniques that improve the efficiency of rule discovery: deriving difference-set statistics without additional references to the data, and reducing the redundancy of intersection operations. We implemented the techniques in k-most-interesting impact rule discovery, which is suitable for distributional-consequent exploratory rule discovery in very large, dense databases. Experimental results show a substantial improvement in efficiency after applying these techniques.