As a result, gathering data that mirrors our world has become fairly easy
and rather inexpensive. On the one hand, the resulting data collections certainly
contain valuable and detailed information; on the other hand, analyzing such
massive datasets has turned out to be much harder than expected. In brief, sizes
ranging from tens of megabytes up to several terabytes rule out simply applying
common analysis methods. Consequently, during the last ten years specialized
techniques have been developed that can be subsumed under the term data
mining. The main goal behind these methods is to allow the efficient analysis
of even very large datasets. With its origins in machine learning, statistics, and
databases, data mining has developed into a thriving and very active research
field since the early nineties.
Since its introduction in [2], the task of association rule mining has received
a great deal of attention. Today, the generation of association rules is one of
the most popular data mining methods. The idea of mining association rules
originates from the analysis of market-basket data, where rules like “A customer
who buys products x_1, x_2, ..., x_n will also buy product y with probability c%”
are generated.
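To make the rule format concrete, the following minimal Python sketch computes the probability c for one rule on a small, made-up set of baskets. It assumes the commonly used definitions: support is the fraction of transactions that contain an itemset, and confidence is the fraction of transactions containing the rule body that also contain the head. The function names and the toy data are illustrative assumptions and are not taken from the paper.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(body: Iterable[str], head: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Estimate of P(head | body): among baskets containing the body,
    the share that also contains the head."""
    body_set = frozenset(body)
    both = body_set | frozenset(head)
    n_body = sum(1 for t in transactions if body_set <= t)
    if n_body == 0:
        raise ValueError("rule body never occurs in the data")
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_body


# Hypothetical market-basket data (assumption, for illustration only).
baskets = [frozenset(b) for b in (
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
)]

# Rule: "a customer who buys bread and butter will also buy milk".
c = confidence({"bread", "butter"}, {"milk"}, baskets)
s = support({"bread", "butter", "milk"}, baskets)
print(f"confidence = {c:.2f}, support = {s:.2f}")  # confidence = 0.67, support = 0.40
```

Here three baskets contain both bread and butter, and two of those also contain milk, so the rule holds with c of roughly 67%.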
Their direct applicability to business problems, together with their inherent
understandability (even for non-experts in data mining), has made association
rules such a popular mining method. Moreover, it has become clear that association
rules are not restricted to dependency analysis in the context of retail
applications but can be applied successfully to a wide range of business problems.
In this paper we deal with association rules in the context of a complex,
interactive, and iterative knowledge discovery process. In Section 2 we formally
introduce association rules and give a first example. Then, in Section 3, we turn
our attention to the process of knowledge discovery in databases (KDD) and
describe its basics. At the end of that section we explain the implications
for association rule mining. Concerning human involvement and interactivity, we
come to the conclusion that the situation today is still not satisfactory, but
there are several main starting points for coping with this problem:
First of all, there is the algorithmic complexity. In brief, the number of rules
grows exponentially with the number of items. Fortunately, today's algorithms
are able to efficiently prune this immense search space based on minimal
thresholds for quality measures on the rules. We deal with the details of rule
generation in Section 4.
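The pruning idea can be illustrated with a level-wise search in the style of Apriori. This is a sketch under stated assumptions and not necessarily the algorithm presented in Section 4: it uses a minimum support threshold as the quality measure and relies on the fact that an itemset can only be frequent if all of its subsets are frequent. With n items there are 2^n - 1 non-empty itemsets, so discarding infrequent candidates early is what keeps the search tractable. The names `frequent_itemsets` and `min_support`, as well as the toy data, are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def frequent_itemsets(transactions: List[FrozenSet[str]],
                      min_support: float) -> Dict[FrozenSet[str], float]:
    """Level-wise search: only itemsets whose subsets are all frequent
    are ever counted against the data."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]  # 1-item candidates
    frequent: Dict[FrozenSet[str], float] = {}

    while level:
        # Count each candidate and keep those that reach the threshold.
        kept = {}
        for cand in level:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup >= min_support:
                kept[cand] = sup
        frequent.update(kept)

        # Join step: merge kept k-itemsets into (k+1)-item candidates.
        size = len(next(iter(kept))) + 1 if kept else 0
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}

        # Prune step: drop candidates that have an infrequent k-subset.
        level = [c for c in candidates
                 if all(frozenset(s) in kept
                        for s in combinations(c, size - 1))]
    return frequent


# Hypothetical market-basket data (assumption, for illustration only).
baskets = [frozenset(b) for b in (
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
)]

result = frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{sup:.2f}")
```

In this toy run, "eggs" fails the threshold at the first level, so no larger itemset containing it is ever generated or counted.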
Second, the data to be mined is typically stored in a relational database
management system. Therefore, efficient and elegant integration with modern
database systems is one of the key factors in practical mining applications. The
reason is that simple solutions like flat file extraction of the data quickly reach
their limits in the context of massive datasets and repeated algorithm runs. A
solution to this problem is given in Section 5.
Third, interesting rules must be picked from the set of generated rules. This
might be quite costly because the generated rule sets are normally quite large
(more than 100,000 rules are not uncommon), whereas the percentage
of useful rules is typically very small. In Section 6 we enhance