transaction. It is also important to note that due to the redundancy in the table
we must beware of duplicate attributes.
Each collected transaction is immediately stored in a binary cache on the
file system. For that purpose the strings representing the attribute values are
mapped to integers. Each attribute value is prefixed with the corresponding
attribute name, because otherwise values could easily become ambiguous; consider,
e.g., the value “red”, which may occur under more than one attribute. We finally
obtain a dictionary that maps integers one-to-one onto strings and a compact
binary file containing the actual transactions.
Algorithms that make several passes over the database, e.g. Apriori, greatly
benefit from caching the transactions.
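The encoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`encode_transactions`, `pack`, `unpack`), the `attribute=value` key format, and the length-prefixed little-endian layout are all assumptions made for illustration. Duplicate attributes within a transaction are dropped by collecting the integer ids in a set.

```python
import struct
from typing import Dict, List, Sequence, Tuple

Row = Sequence[Tuple[str, str]]  # (attribute name, attribute value) pairs


def encode_transactions(rows: Sequence[Row]) -> Tuple[List[str], List[List[int]]]:
    """Map 'attribute=value' strings one-to-one onto integers.

    Returns the dictionary (list index = integer id) and the encoded
    transactions. The key format is an assumption of this sketch.
    """
    item_to_id: Dict[str, int] = {}
    id_to_item: List[str] = []
    encoded: List[List[int]] = []
    for row in rows:
        ids = set()  # a set removes duplicate attributes within a row
        for attr, value in row:
            key = f"{attr}={value}"  # prefix the value with its attribute name
            if key not in item_to_id:
                item_to_id[key] = len(id_to_item)
                id_to_item.append(key)
            ids.add(item_to_id[key])
        encoded.append(sorted(ids))
    return id_to_item, encoded


def pack(encoded: Sequence[Sequence[int]]) -> bytes:
    """Serialize transactions as: item count, then the items (uint32 each)."""
    out = bytearray()
    for ids in encoded:
        out += struct.pack("<I", len(ids))
        out += struct.pack(f"<{len(ids)}I", *ids)
    return bytes(out)


def unpack(blob: bytes) -> List[List[int]]:
    """Read back the transactions written by pack()."""
    result: List[List[int]] = []
    offset = 0
    while offset < len(blob):
        (count,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        result.append(list(struct.unpack_from(f"<{count}I", blob, offset)))
        offset += 4 * count
    return result
```

A multi-pass algorithm such as Apriori would then read the packed bytes from the file system on each pass instead of querying the database again.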
6 Further Integration with the KDD Process
In the previous section we treated the integration of the association mining
algorithms concerning data access. In other words we showed the integration
aspects from the input point of view. Now we turn to the output side. We
cover how to appropriately arrange and store the association mining results in
the context of an embracing KDD process, cf. [17].
6.1 Basic Idea
In Section 4 we learned that even the most efficient association mining algorithms
still have run times of at least several minutes up to hours, depending on the
mining data. If we recall the KDD process described by way of example in Section 3,
it becomes clear that even interruptions of the analyst's work of a single minute
are already problematic.
Some authors tackle this problem by pushing constraints on the result set into
the mining algorithm [22,24,25,29]. The performance does improve, but the
run times are still far from allowing true interactivity. Furthermore, the constrained
result set will probably answer fewer of the analyst's questions and will therefore
provoke further mining runs.
The solution that we propose is to do exactly the opposite: we broaden the
result set, cf. [17]. Instead of restricting, we suggest adding everything that might
make sense to the mining data and then letting the mining algorithm generate all
rules based on relatively low thresholds for the rule quality measures. Of course
under these conditions rule generation will take its time. But we must not
forget that even when constraining the result set as much as possible we still will
not gain true interactivity. So why not deliberately accept one interruption and
then proceed with interactively investigating a comprehensive rule set with
no, or at least very few, further interruptions?
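The trade-off argued for above can be sketched as follows: rules are generated once at low thresholds and kept, and every later, stricter question is answered by filtering the stored rules rather than by mining again. This is only an illustrative in-memory sketch; the names `Rule`, `RuleStore`, and `query`, and the choice of support and confidence as quality measures, are assumptions and do not describe the design in [17].

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Rule:
    """An association rule body -> head with two quality measures."""
    body: FrozenSet[str]
    head: FrozenSet[str]
    support: float
    confidence: float


class RuleStore:
    """Hypothetical store, filled once by a (slow) low-threshold mining run."""

    def __init__(self, rules: Iterable[Rule]) -> None:
        self._rules: List[Rule] = list(rules)

    def query(
        self,
        min_support: float = 0.0,
        min_confidence: float = 0.0,
        must_contain: Iterable[str] = (),
    ) -> List[Rule]:
        """Answer a mining query by filtering stored rules, without mining."""
        wanted = frozenset(must_contain)
        return [
            rule
            for rule in self._rules
            if rule.support >= min_support
            and rule.confidence >= min_confidence
            and wanted <= (rule.body | rule.head)
        ]


# Rules as a low-threshold run might produce them (values are made up).
store = RuleStore(
    [
        Rule(frozenset({"color=red"}), frozenset({"type=wine"}), 0.02, 0.55),
        Rule(frozenset({"type=wine"}), frozenset({"region=south"}), 0.10, 0.80),
        Rule(frozenset({"color=red"}), frozenset({"region=south"}), 0.01, 0.30),
    ]
)

# Tightening the thresholds afterwards is a filter, not a new mining run.
strong = store.query(min_support=0.02, min_confidence=0.5)
about_red = store.query(must_contain={"color=red"})
```

A linear scan is enough to show the idea; a store holding a very large rule set would presumably need indexing to stay fast.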
What such a scenario presupposes is a cache that efficiently stores the generated
rules. Once the cache has been filled by running the mining algorithm, answering
mining queries means retrieving the appropriate rules from the cache instead of
mining them from the data. Accessing a properly implemented cache only takes
seconds, as shown in [17]. Of course the number of generated and stored rules will