Data Mining of Association Rules and the Process of Knowledge Discovery in Databases - Advances in Data Mining

transaction. It is also important to note that due to the redundancy in the table

we must beware of duplicate attributes.

Each collected transaction is immediately stored in a binary cache on the

file system. For that purpose the strings representing the attribute values are

mapped to integers. The attribute values are enriched with the corresponding

attribute name because otherwise they could easily be ambiguous; think, e.g.,

of the color “red”, which may occur as a value of several attributes. So we finally obtain a dictionary that maps integers one to one

onto strings and a compact binary file containing the actual transactions.

Algorithms that make several passes over the database, e.g. Apriori, greatly

benefit from caching the transactions.
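The caching scheme described above can be illustrated with a minimal sketch. It assumes each transaction arrives as a list of (attribute, value) pairs; the function names, the "attribute=value" item format, and the length-prefixed binary layout are hypothetical choices made for illustration, not the format used by the authors. Duplicate attributes within a transaction are dropped by collecting the integer ids in a set.

```python
import struct


def encode_transactions(transactions):
    """Map 'attribute=value' strings to integers.

    Returns (id_to_item, encoded), where id_to_item is the one-to-one
    dictionary from integers to strings and encoded is a list of sorted
    integer lists, one per transaction.
    """
    item_to_id = {}
    encoded = []
    for transaction in transactions:
        ids = set()  # a set silently drops duplicate attribute/value pairs
        for attribute, value in transaction:
            # Prefix the value with its attribute name to avoid ambiguities.
            item = f"{attribute}={value}"
            ids.add(item_to_id.setdefault(item, len(item_to_id)))
        encoded.append(sorted(ids))
    id_to_item = {i: s for s, i in item_to_id.items()}
    return id_to_item, encoded


def write_cache(path, encoded):
    """Write each transaction as a 32-bit count followed by its item ids."""
    with open(path, "wb") as f:
        for ids in encoded:
            f.write(struct.pack(f"<I{len(ids)}I", len(ids), *ids))


def read_cache(path):
    """Yield the transactions back from the binary cache, one per pass step."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (n,) = struct.unpack("<I", header)
            yield list(struct.unpack(f"<{n}I", f.read(4 * n)))
```

A multi-pass algorithm such as Apriori could then call `read_cache` once per pass instead of querying the database again.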

6 Further Integration with the KDD Process

In the previous section we treated the integration of the association mining

algorithms concerning data access. In other words we showed the integration

aspects from the input point of view. Now we turn to the output side. We

cover how to appropriately arrange and store the association mining results in

the context of a surrounding KDD process, cf. [17].

6.1 Basic Idea

In Section 4 we learned that even the most efficient association mining algorithms

still have run times of at least several minutes up to hours depending on the

mining data. If we remember the KDD process exemplarily described in Section 3

it becomes clear that even an interruption of the analyst's work of a single minute

is already problematic.

Some authors tackle this problem by pushing constraints on the result set into

the mining algorithm [22,24,25,29]. The performance does improve, but the

run times are still far from allowing true interactivity. Furthermore, the constrained

result set will probably answer fewer of the analyst's questions and will therefore

provoke additional mining runs.

The solution that we propose is to do exactly the opposite: we broaden the

result set, cf. [17]. Instead of restricting it, we suggest adding everything that might

make sense to the mining data and then letting the mining algorithm generate all

rules based on relatively low thresholds for the rule quality measures. Of course,

under these conditions rule generation will take its time. But what we must not

forget is that even when constraining the result set as much as possible we still will

not gain true interactivity. So why not deliberately accept one interruption and

then proceed with interactively investigating a comprehensive rule set without

any, or at least with very few, further interruptions?
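The idea of mining once with low thresholds and then answering later, stricter questions from the stored result can be sketched as follows. This is only an illustration under stated assumptions: the `Rule` class, the `query_rules` function, and the use of support and confidence as the rule quality measures are hypothetical, and a plain in-memory list stands in for whatever storage is actually used; it is not the design described in [17].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One association rule body -> head with two quality measures."""
    body: frozenset
    head: frozenset
    support: float
    confidence: float


def query_rules(rules, min_support=0.0, min_confidence=0.0, must_contain=()):
    """Answer a mining query by filtering rules that were generated earlier.

    No pass over the transaction data is needed: every rule satisfying the
    stricter thresholds is already present in the broadly mined rule set.
    """
    wanted = frozenset(must_contain)
    return [
        r for r in rules
        if r.support >= min_support
        and r.confidence >= min_confidence
        and wanted <= (r.body | r.head)
    ]
```

Such filtering only returns complete answers for thresholds at or above those used in the initial mining run, which is why that run is deliberately made with low thresholds.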

What is presumed in such a scenario is a cache to efficiently store the generated

rules. Once the cache is filled by running the mining algorithm, answering

mining queries means retrieving the appropriate rules from the cache instead of

mining them from the data. Accessing a properly implemented cache only takes

seconds as shown in [17]. Of course the number of generated and stored rules will
