Database Reference
In-Depth Information
6
Challenges Ahead
Above we showed that compression provides a powerful approach to both mining
and using patterns in a range of data mining tasks. Here we briefly identify and
discuss a number of open research problems.
6.1
Toward Mining Structured Data
When compared to other data types, compressing itemset data is relatively simple.
The most important reason is that the data is unordered over both rows and columns,
and hence tuples can be considered as sets of items, and the data as a bag of tuples.
For 'spatial' binary data, where the order of rows and columns does matter, many
tasks already become more difficult. A good example is the extension of tiling, called
geometric tiling [ 19 ], which aims at finding a hierarchy of (noisy) tiles that describe
the data well. Finding optimal sub-tiles is more difficult than mining itemsets, as we
now also have to consider every subset of rows. STIJL efficiently finds the MDL-
optimal sub-tile in order to greedily find good tilings [ 63 ].
Another possible structural constraint is time: sequences and streams are both
series of data points, where sequences consist of events while data streams usually
consists of complete tuples, e.g., itemsets. Initial attempts to characterize sequence
data with patterns using compression include [ 64 ] and [ 34 ]. Lam et al. [ 35 ] mine
sequential patterns from streams, whereas the goal of Van Leeuwen and Siebes [ 36 ]
is to detect changes in data streams. All these are limited though. For example, none
are suited for the high velocity of big data streams, as well as suboptimal for data
consisting of shifting mixtures of distributions. Other open issues include allowing
overlap between patterns, as well as allowing multiple events per time-stamp.
Adding even more structure, we have trees, graphs, as well as multi-relational
data. In this area even fewer results have been published, though arguably these data
types are most abundant. For graphs, SLASHBURN [ 26 ] uses compression to separate
communities and hubs. For multi-relational data, two variants of KRIMP have been
proposed [ 32 , 33 ], yet their modeling power is limited by their restrictive pattern
languages—nor are direct candidate mining strategies available.
Further, so far no pattern set mining approaches have been proposed for continuous
data. Moreover, all data is assumed to be 'certain'. However, in bioinformatics, for
example, many data is probabilistic in nature, e.g., representing the uncertainty
of protein-protein interactions. Bonchi et al. [ 7 ] proposed an approach to model
uncertain data by itemsets, yet they do so with 'certain' itemsets, i.e., without explicit
probabilities. Mining pattern sets from numerical and uncertain data, as well as using
them in compression-based models, are important future challenges.
Search WWH ::




Custom Search