Mining and Using Sets of Patterns through Compression - Frequent Pattern Mining


6.2 Generalization

While the above challenges concern specialization for structured data types and other data primitives, another challenge concerns the opposite direction: generalization. One of the fundamental problems in data mining is that new models, algorithms, and implementations are needed for every combination of task and data type. Although the literature flourishes, this fragmentation makes the results very hard for non-experts to use.

In this chapter we have shown that patterns can actually be useful: for summarization and characterization, as well as for other tasks. One of the upcoming challenges will be to generalize compression-based data mining. Can patterns be defined in a very generic way, so that mining them and using them for modeling remains possible? For that, progress with regard to both mining and modeling needs to be made. Both are currently strongly tailored toward specific data and pattern types.

One approach may be to represent everything, both data and patterns, as queries. With such a uniform treatment, recently proposed by Siebes [55], the ideal of exploratory data mining might become reachable. Note that the high-level goal of generalizing data mining and machine learning is also pursued by De Raedt et al. [51, 21], yet with a different focus: their aim is to develop declarative modeling languages for data mining, which can use existing solver technology to mine solutions.

6.3 Task- and/or User-specific Usefulness

Although it obtains very good results in practice, MDL is not a magic wand. In existing approaches, the results depend primarily on the data and the pattern languages. In other situations it may be beneficial to take specific tasks and/or users into account. In other words, one may want to keep the purpose of the patterns in mind.

As an example, the code table classifier described in the previous section works well in practice, yet it is possibly sub-optimal: it works by modeling the class distributions, not by modeling the differences between them. Although classification is hardly typical for exploratory data mining, similar arguments exist for other data mining tasks.
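The point about class distributions can be illustrated with a minimal sketch, assuming a much simplified code table classifier: one code table is fitted per class on that class's transactions only, and a new transaction is assigned to the class whose code table encodes it in the fewest bits. All names below (`cover`, `fit_code_table`, `classify`), the toy data, the fixed pattern lists, the greedy cover order, and the add-one smoothing are assumptions made for illustration; this is not the implementation discussed in the chapter.

```python
import math

def cover(transaction, itemsets):
    """Greedily cover a transaction with the itemsets, in the given order."""
    remaining = set(transaction)
    used = []
    for s in itemsets:
        if s and s <= remaining:
            used.append(s)
            remaining -= s
    # Simplification: items outside the alphabet stay uncovered and cost nothing.
    return used

def fit_code_table(transactions, patterns, alphabet):
    """Return a list of (itemset, code length in bits) for ONE class."""
    multi = sorted({frozenset(p) for p in patterns if len(p) > 1},
                   key=lambda s: (-len(s), sorted(s)))
    # Singletons come last so that every transaction over the alphabet is coverable.
    itemsets = multi + [frozenset([a]) for a in sorted(alphabet)]
    usage = {s: 1 for s in itemsets}  # add-one smoothing (assumption)
    for t in transactions:
        for s in cover(t, itemsets):
            usage[s] += 1
    total = sum(usage.values())
    # Shannon-optimal code length: frequently used itemsets get short codes.
    return [(s, -math.log2(usage[s] / total)) for s in itemsets]

def encoded_length(transaction, code_table):
    """Number of bits needed to encode a transaction with a code table."""
    lengths = dict(code_table)
    itemsets = [s for s, _ in code_table]
    return sum(lengths[s] for s in cover(transaction, itemsets))

def classify(transaction, code_tables):
    """Pick the class whose code table compresses the transaction best."""
    return min(code_tables,
               key=lambda c: encoded_length(transaction, code_tables[c]))

# Hypothetical toy data: two classes over the alphabet {a, b, c, d}.
alphabet = {"a", "b", "c", "d"}
train = {
    "x": [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}],
    "y": [{"c", "d"}, {"a", "c", "d"}, {"c", "d"}],
}
patterns = {"x": [{"a", "b"}], "y": [{"c", "d"}]}

# Each code table sees only its own class; nothing models class differences.
code_tables = {c: fit_code_table(train[c], patterns[c], alphabet) for c in train}

print(classify({"a", "b"}, code_tables))
print(classify({"c", "d"}, code_tables))
```

Note that `fit_code_table` never looks at the other class, which is exactly why such a classifier may be sub-optimal: a discriminative alternative would have to select patterns by how well they separate the classes.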

In this chapter we have ignored any background knowledge the user may have. If one is interested in the optimal model given certain background knowledge, this entails finding MDL-optimal models given prior distributions, which reduces to the MML principle [69]. The optimal prior can be identified using the Maximum Entropy principle [25].³
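The role of the prior can be made explicit with a short derivation. The notation is an assumption introduced here for illustration, not taken from the chapter: D is the data, M a model from a model class 𝓜, L(·) a code length in bits, and P(M) a prior over models that encodes the background knowledge.

```latex
\begin{align}
M^{*} &= \arg\min_{M \in \mathcal{M}} \; L(M) + L(D \mid M), \\
L(M)  &= -\log P(M), \qquad L(D \mid M) = -\log P(D \mid M), \\
M^{*} &= \arg\max_{M \in \mathcal{M}} \; P(M)\, P(D \mid M).
\end{align}
```

Under these assumptions, minimizing the total two-part code length amounts to maximizing the product of prior and likelihood, so changing the prior P(M) changes which model is considered optimal.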

De Bie [13] argues that the goal of the data miner in data exploration is to model the user's belief state, so that we can algorithmically discover those results that will be most informative to the user. At its core, this reduces to compression, with the twist that the decision whether to include a pattern is made by the user.

³ See Chap. 5 for a more complete discussion of MaxEnt.
