6.2 Generalization
While the above challenges concern specialization for structured data types and other
data primitives, another challenge concerns the opposite direction: generalization. One
of the fundamental problems in data mining is that new models, algorithms, and
implementations are needed for every combination of task and data type. Although
the literature flourishes, this makes the results very hard to use for non-experts.
In this chapter we have shown that patterns can actually be useful: for summarization
and characterization, as well as for other tasks. One of the upcoming challenges
will be to generalize compression-based data mining. Can patterns be defined in a
very generic way, so that mining them and using them for modeling remains possible?
For that, progress with regard to both mining and modeling needs to be made.
Both are currently strongly tailored toward specific data and pattern types.
One approach may be to represent everything, both data and patterns, as queries.
With such a uniform treatment, recently proposed by Siebes [ 55 ], the ideal of
exploratory data mining might become reachable. Note that the high-level goal of
generalizing data mining and machine learning is also pursued by De Raedt et al.
[ 51 , 21 ], yet with a different focus: their aim is to develop declarative modeling
languages for data mining, which can use existing solver technology to mine
solutions.
6.3 Task- and/or User-specific Usefulness
Although it obtains very good results in practice, MDL is not a magic wand. In existing
approaches, the results depend primarily on the data and pattern languages. In
other situations it may be beneficial to take specific tasks and/or users into account.
In other words, one may want to keep the purpose of the patterns in mind.
As an example, the code table classifier described in the previous section works
well in practice, yet it is possibly sub-optimal: it models the distribution of each
class, not the differences between the classes. Although classification
is hardly typical for exploratory data mining, similar arguments exist for other data
mining tasks.
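
To make this limitation concrete, the following is a minimal Python sketch of
compression-based classification. It is not the code table classifier itself: as a
simplifying assumption, it replaces code tables over patterns with per-item Shannon
code lengths (Laplace-smoothed), and the function names and toy data are
hypothetical. Each class model is fitted on its own class only, and an unseen
transaction receives the label of the model that encodes it in the fewest bits.

```python
import math
from collections import Counter


def fit_code_lengths(transactions, alphabet):
    """Fit one class model: a Shannon code length (in bits) per item.

    Assumption: single items stand in for the patterns of a real code table.
    Laplace smoothing gives unseen items a finite code length.
    """
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values()) + len(alphabet)
    return {item: -math.log2((counts[item] + 1) / total) for item in alphabet}


def encoded_length(transaction, code_lengths):
    """Number of bits needed to encode a transaction with one class model."""
    return sum(code_lengths[item] for item in transaction)


def classify(transaction, models):
    """Assign the label whose model compresses the transaction best."""
    return min(models, key=lambda label: encoded_length(transaction, models[label]))


# Hypothetical toy data: two classes over the items x, y, z, w.
train = {
    "a": [{"x", "y"}, {"x", "y", "z"}, {"x"}],
    "b": [{"z", "w"}, {"w"}, {"z", "w", "y"}],
}
alphabet = {"x", "y", "z", "w"}

# Each model sees only the transactions of its own class.
models = {label: fit_code_lengths(ts, alphabet) for label, ts in train.items()}

print(classify({"x", "y"}, models))
print(classify({"w", "z"}, models))
```

Note that no model is ever told what distinguishes class a from class b: the
decision emerges only from comparing encoded lengths afterwards, which is the
reason such a classifier may be sub-optimal for the classification task itself.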
In this chapter we ignore any background knowledge the user may have. If one
is interested in the optimal model given certain background knowledge, this entails
finding MDL-optimal models given prior distributions—which reduces to the MML
[ 69 ] principle. The optimal prior can be identified using the Maximum Entropy
principle [ 25 ]. 3
De Bie [ 13 ] argues that the goal of the data miner in data exploration is to model
the user's belief-state, so that we can algorithmically discover those results that will
be most informative to the user. At the core, this reduces to compression—with the
twist that the decision whether to include a pattern is made by the user.
3 See Chap. 5 for a more complete discussion on MaxEnt.