Mining and Using Sets of Patterns through Compression - Frequent Pattern Mining

Database Reference

In-Depth Information

6

Challenges Ahead

Above we showed that compression provides a powerful approach to both mining

and using patterns in a range of data mining tasks. Here we briefly identify and

discuss a number of open research problems.

6.1

Toward Mining Structured Data

When compared to other data types, compressing itemset data is relatively simple.

The most important reason is that the data is unordered over both rows and columns,

and hence tuples can be considered as sets of items, and the data as a bag of tuples.

For 'spatial' binary data, where the order of rows and columns does matter, many

tasks already become more difficult. A good example is the extension of tiling, called

geometric tiling [ 19 ], which aims at finding a hierarchy of (noisy) tiles that describe

the data well. Finding optimal sub-tiles is more difficult than mining itemsets, as we

now also have to consider every subset of rows. STIJL efficiently finds the MDL-

optimal sub-tile in order to greedily find good tilings [ 63 ].

Another possible structural constraint is time: sequences and streams are both

series of data points, where sequences consist of events while data streams usually

consists of complete tuples, e.g., itemsets. Initial attempts to characterize sequence

data with patterns using compression include [ 64 ] and [ 34 ]. Lam et al. [ 35 ] mine

sequential patterns from streams, whereas the goal of Van Leeuwen and Siebes [ 36 ]

is to detect changes in data streams. All these are limited though. For example, none

are suited for the high velocity of big data streams, as well as suboptimal for data

consisting of shifting mixtures of distributions. Other open issues include allowing

overlap between patterns, as well as allowing multiple events per time-stamp.

Adding even more structure, we have trees, graphs, as well as multi-relational

data. In this area even fewer results have been published, though arguably these data

types are most abundant. For graphs, SLASHBURN [ 26 ] uses compression to separate

communities and hubs. For multi-relational data, two variants of KRIMP have been

proposed [ 32 , 33 ], yet their modeling power is limited by their restrictive pattern

languages—nor are direct candidate mining strategies available.

Further, so far no pattern set mining approaches have been proposed for continuous

data. Moreover, all data is assumed to be 'certain'. However, in bioinformatics, for

example, many data is probabilistic in nature, e.g., representing the uncertainty

of protein-protein interactions. Bonchi et al. [ 7 ] proposed an approach to model

uncertain data by itemsets, yet they do so with 'certain' itemsets, i.e., without explicit

probabilities. Mining pattern sets from numerical and uncertain data, as well as using

them in compression-based models, are important future challenges.

Search WWH ::

Custom Search

Home