From Sequence Mining to Multidimensional Sequence Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

a real distinction between multidimensional values and sequential items . Moreover,

mining multidimensional sequential patterns do not allow extracting characteristic

rules. Finally, they do not use specific methods to select the sequences while some

multidimensional patterns are likely to be shared by groups of similar sequences.

8.2.2 Concepts and Definitions

In the proposed approach, we consider a database composed of sequences s and

attributes A i (with value a i ) describing the sequences (table 8.1).

Definition 1: Let I be a set of items. A sequence s = <s 1 , s 2 , …, s i ,…, s n > is defined as

an ordered list of elements s i

I. s i is also called a sequence element.

More general definition was initially proposed in [2], where each sequence element

is rather an item-set. However, we argue that basic sequences composed of single

items are sufficient for many applications, and adopt this definition.

Definition 2: A sequences database S is composed of a set of tuples ( sid , s ) where sid

denotes the identity of the sequence s .

Definition 3: A sequence s = <s 1 , s 2 , …, s m > is called a subsequence of another

sequence t = <t 1 , t 2 , …, t n >, denoted s

∈

t if and only if there exist integers 1 ≤ j1 < j2

< ... <jn ≤ m such that s 1 = t j1 , s 2 = t j2 , ..., s n = t jn .

In other words, a sequence s is includ ed in a sequence t if the ordered list of ele-

ments of s is included in the ordered list of elements of t.

Definition 4: The support of a sequence α in a sequence database S is the number of

tuples in the database containing α i.e., support(α) | {<sid, s> | (<sid, s>

⊆

s)}.

Definition 5: Given a positive integer min_support , a sequence α is called a sequen-

tial pattern in sequence database S if support(α) ≥ min_support.

Another concept is the similarity of sequences in order to compare them. Sequence

similarity is a well known problem in the fields of bio-informatics. It aims at deter-

mining if a DNA sequence is similar or not to another. The most popular algorithms

are BLAST [3], FASTA [15] and LCSS ( Longest Common Subsequence ) [6]. We use

this last method in our definition of sequence similarity. It measures the minimum

number of insertions and deletions to transform s 1 into s 2 . This method is widely im-

plemented and used for this purpose.

∈

∧

(α

⊆

Definition 6: Given two sequences s = <s 1 , s 2 , ..., s i ,…, s m > and t = <t 1 , t 2 , …, t j ,

…,t n > such that (i

[1,n]). Let lcs (s,t) the size of the longest common subse-

quence. The dissimilarity distance between s and t is defined as: d(s, t) = n + m -

2*lcs (s, t) .

Definition 7: Given a positive integer DT called dissimilarity threshold , sequence s is

said similar to a sequence t if their dissimilarity distance is lower than DT.

Definition 8: A multidimensional sequences database is of schema ( RID , S,

A 1 ,…,A m ), where RID is a primary key, A 1 ,…,A m are dimensions, and S the domain

of sequences. A multidimensional sequence is defined as ( rid, s, a 1 , …, a m ) where a i

∈

[1,m], j

∈

Ai, for 1 ≤ i ≤ m and s a sequence (see table 1).

Mining Complex Data

Search WWH ::

Custom Search

Home