be generated is typically restricted by minimal thresholds for the rule quality
measures support and confidence, minsupp and minconf respectively.
This restriction allows [3] to split the problem into two separate parts: an
itemset X is called frequent if supp(X) ≥ minsupp. Once
F = {X ⊆ I | supp(X) ≥ minsupp},
the set of all frequent itemsets together with the corresponding support values
is known, deriving the desired association rules is straightforward: for every
X ∈ F one has to check the confidence of all rules
X \ Y → Y
with ∅ ≠ Y ⊊ X.
The so-called downward closure property of itemset support ensures that
we can actually compute all necessary confidence values, cf. [3]. This property
states that all subsets of a frequent itemset must also be frequent. So if X is
frequent, we also know the support of all X \ Y with ∅ ≠ Y ⊊ X.
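
To make this rule-derivation step concrete, the following sketch (in Python;
the function and variable names are ours, it is not the code of [3]) assumes
the frequent itemsets are already available as a dictionary frequent that maps
each itemset, stored as a frozenset, to its support value:

from itertools import combinations

def generate_rules(frequent, minconf):
    # frequent: dict mapping frozenset itemsets to their support values.
    # By downward closure, supp(X \ Y) is available in `frequent` for
    # every non-empty proper subset Y of a frequent itemset X.
    rules = []
    for X, supp_X in frequent.items():
        if len(X) < 2:
            continue  # antecedent and consequent must both be non-empty
        for r in range(1, len(X)):  # enumerate consequents Y with ∅ ≠ Y ⊊ X
            for Y in map(frozenset, combinations(X, r)):
                conf = supp_X / frequent[X - Y]  # conf(X \ Y → Y)
                if conf >= minconf:
                    rules.append((X - Y, Y, conf))
    return rules

Because every X \ Y is itself frequent, the lookup frequent[X - Y] is
guaranteed to succeed; this is exactly what the downward closure property
buys us.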
What remains to be done before rule generation is to determine F, the set of
all frequent itemsets. Unfortunately, for obvious reasons we are not able to
look at all subsets of I: a linearly growing n = |I| still implies an
exponentially growing number of subsets to be taken into consideration.
4.2 The Generation of Frequent Itemsets
In the beginning of the mining run each itemset X ⊆ I is potentially frequent.
In other words, the initial search space consists of the power set of I without
the empty set, P(I) \ {∅}. Therefore, even for rather small |I| the search
space easily exceeds all limits. For the set of items I = {a, b, c, d, e} this
search space is shown in Figure 2.
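
As a quick illustration of this growth, the following snippet (Python; the
variable names are ours) enumerates that search space for I = {a, b, c, d, e}
and confirms it already contains 2^5 − 1 = 31 itemsets:

from itertools import chain, combinations

items = ['a', 'b', 'c', 'd', 'e']

# all non-empty subsets of I, i.e. the search space P(I) \ {∅}
search_space = list(chain.from_iterable(
    combinations(items, r) for r in range(1, len(items) + 1)))

print(len(search_space))  # 31 == 2**5 - 1; doubles with each added item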
In order to avoid traversing the whole search space, modern association min-
ing algorithms employ a candidate generation and test approach. The idea is to
generate a manageable set of potentially frequent itemsets, the so-called
candidates. Then the support values of these candidates are determined based
on the database D. Candidate generation always takes all information on the
frequency and infrequency of already investigated candidates into account. In
brief, the common strategy is as follows: from the downward closure property
of itemset support we know that all subsets of a frequent itemset must also be
frequent. This allows us to immediately prune from the search space all
candidates that have at least one infrequent subset, since they cannot be
frequent themselves. After candidate generation the remaining candidates are
counted against the database and the algorithm proceeds to the next iteration.
The whole process stops as soon as there are no more potentially frequent
itemsets that have not yet been considered as candidates.
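
A minimal sketch of this level-wise candidate generation and test strategy, in
Python and under the assumption that the database D is given as a list of
transactions, each a set of items (the helper names are hypothetical, not the
code of [3]):

from itertools import combinations

def apriori(transactions, minsupp):
    # transactions: list of sets of items; minsupp: relative threshold.
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # start with the frequent 1-itemsets
    frequent = {}
    for item in {i for t in transactions for i in t}:
        s = support(frozenset([item]))
        if s >= minsupp:
            frequent[frozenset([item])] = s

    current, k = set(frequent), 2
    while current:
        # generate: join frequent (k-1)-itemsets into k-candidates ...
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # ... and prune every candidate with an infrequent (k-1)-subset,
        # as licensed by the downward closure property
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # test: count the surviving candidates against the database
        current = set()
        for c in candidates:
            s = support(c)
            if s >= minsupp:
                frequent[c] = s
                current.add(c)
        k += 1
    return frequent

Combined with the generate_rules sketch above, calling apriori(D, minsupp)
and then generate_rules(frequent, minconf) realises the two-part split of the
mining problem described above.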
In Figure 2 the thick border separates the frequent itemsets in the upper
part from the infrequent itemsets in the lower part for a hypothetical support
threshold minsupp. The existence of such a border is guaranteed by the downward
closure property of itemset support. Clearly, the proposed stepwise traversal of
the search space should start at the top with all or part of the frequent 1-itemsets