Information Technology Reference
In-Depth Information
a real distinction between multidimensional values and sequential items . Moreover,
mining multidimensional sequential patterns do not allow extracting characteristic
rules. Finally, they do not use specific methods to select the sequences while some
multidimensional patterns are likely to be shared by groups of similar sequences.
8.2.2 Concepts and Definitions
In the proposed approach, we consider a database composed of sequences s and
attributes A i (with value a i ) describing the sequences (table 8.1).
Definition 1: Let I be a set of items. A sequence s = <s 1 , s 2 , …, s i ,…, s n > is defined as
an ordered list of elements s i
I. s i is also called a sequence element.
More general definition was initially proposed in [2], where each sequence element
is rather an item-set. However, we argue that basic sequences composed of single
items are sufficient for many applications, and adopt this definition.
Definition 2: A sequences database S is composed of a set of tuples ( sid , s ) where sid
denotes the identity of the sequence s .
Definition 3: A sequence s = <s 1 , s 2 , …, s m > is called a subsequence of another
sequence t = <t 1 , t 2 , …, t n >, denoted s
t if and only if there exist integers 1 ≤ j1 < j2
< ... <jn ≤ m such that s 1 = t j1 , s 2 = t j2 , ..., s n = t jn .
In other words, a sequence s is includ ed in a sequence t if the ordered list of ele-
ments of s is included in the ordered list of elements of t.
Definition 4: The support of a sequence α in a sequence database S is the number of
tuples in the database containing α i.e., support(α) | {<sid, s> | (<sid, s>
s)}.
Definition 5: Given a positive integer min_support , a sequence α is called a sequen-
tial pattern in sequence database S if support(α) ≥ min_support.
Another concept is the similarity of sequences in order to compare them. Sequence
similarity is a well known problem in the fields of bio-informatics. It aims at deter-
mining if a DNA sequence is similar or not to another. The most popular algorithms
are BLAST [3], FASTA [15] and LCSS ( Longest Common Subsequence ) [6]. We use
this last method in our definition of sequence similarity. It measures the minimum
number of insertions and deletions to transform s 1 into s 2 . This method is widely im-
plemented and used for this purpose.
S)
Definition 6: Given two sequences s = <s 1 , s 2 , ..., s i ,…, s m > and t = <t 1 , t 2 , …, t j ,
…,t n > such that (i
[1,n]). Let lcs (s,t) the size of the longest common subse-
quence. The dissimilarity distance between s and t is defined as: d(s, t) = n + m -
2*lcs (s, t) .
Definition 7: Given a positive integer DT called dissimilarity threshold , sequence s is
said similar to a sequence t if their dissimilarity distance is lower than DT.
Definition 8: A multidimensional sequences database is of schema ( RID , S,
A 1 ,…,A m ), where RID is a primary key, A 1 ,…,A m are dimensions, and S the domain
of sequences. A multidimensional sequence is defined as ( rid, s, a 1 , …, a m ) where a i
[1,m], j
Ai, for 1 ≤ i ≤ m and s a sequence (see table 1).
Search WWH ::




Custom Search