From Sequence Mining to Multidimensional Sequence Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

It has been applied to discover all frequent activity sequences in a time use survey

database within an urban area. The experiments have shown that IBM and its variants

provide better performances than existing algorithms in most cases: a better response

time is reached, while a very low memory is required. Experimental results have also

shown that IBM2 and IBM2_OPT outperforms IBM and IBM_OPT , which in turns

outperforms SPAM and PrefixSPAN for large and very large databases and for a

limited number of distinct items.

Notice that the proposed data structure for IBM and IBM2 algorithms, and espe-

cially the SV vector, could be used for other purposes as similarity search between

sequences and sequence clustering. Another perspective is to apply it to different

application contexts, as the analysis of query plans and genomic sequences. We have

already experimented many various datasets: activity sequences, chess play se-

quences, and the web pages sequences msnbc.dat from the UCI KDD archive. In the

context of the activity-mobility survey, we will explore the mining of spatial

sequences, such as trajectories [7]. This is still a challenging research issue.

This chapter also describes a sequence centered approach for mining multidimen-

sional sequential rules . It characterizes the main sequences and performs in three

steps. First, it mines sequential patterns and sets them as class models . In the second

step, a similarity algorithm is used to gather each model with similar sequences of the

database to form the classes. Then in the third step, a characterization algorithm is

used to extract attribute values characterizing each class. Compared to existing works,

our approach: (i) allows extracting rules of the form (a 1 , a 2 , a i ,…, a n ) Î <s 1 , s 2 , s i ,…,

s n > [ R ] where attribute values (a 1 , a 2 , a i …, a n ) are characteristics of objects whom

sequences are similar to <s 1 , s 2 , s i ,…, s n > with a significance R ; ii) considers similar

sequences rather than their exact values, producing more relevant knowledge for a

better decision-making. In perspective, this approach will be extended to multidimen-

sional spatial and temporal sequences in order to characterize the trajectories of mov-

ing objects. At this end, similarity search of trajectories may use the approach of [23]

which is also based on LCSS.

References

1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proc. of the

20th Int. Conf. Very Large Data Bases (VLDB), Santiago, Chile (September 1994)

2. Agrawal, R., Srikant, R.: Mining sequential patterns. In: Proc. of the 11th Int'l Conference

on Data Engineering, Taipei, Taiwan (March 1995)

3. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic Local Alignment

Search Tool. J. Mol. Biol. 215, 403-410 (1990)

4. Beyer, K., Ramakrishnan, R.: Bottom-up computation of sparse and iceberg cubes. In:

Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD 1999), Philadel-

phia, PA, pp. 359-370 (June 1999)

5. Cheng, H., Yan, X., Han, J.: IncSpan: Incremental mining of sequential patterns in large

database. In: Proc. 2004 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases

(KDD 2004), Seattle, WA, pp. 527-532 (August 2004)

6. Freschi, V., Bogliolo, A.: Longest Common Subsequence between Run-Length-Encoded

Strings: a New Algorithm with Improved Parallelism. Elsevier Information Processing

Letters 90(4), 167-173 (2004)

Mining Complex Data

Search WWH ::

Custom Search

Home