Information Technology Reference
In-Depth Information
a real distinction between
multidimensional values
and
sequential items
. Moreover,
mining
multidimensional sequential patterns
do not allow extracting characteristic
rules. Finally, they do not use specific methods to select the sequences while some
multidimensional patterns are likely to be shared by groups of similar sequences.
8.2.2 Concepts and Definitions
In the proposed approach, we consider a database composed of sequences
s
and
attributes A
i
(with value a
i
) describing the sequences (table 8.1).
Definition 1:
Let I be a set of items. A sequence s = <s
1
, s
2
, …, s
i
,…, s
n
> is defined as
an ordered list of elements s
i
I. s
i
is also called a sequence element.
More general definition was initially proposed in [2], where each sequence element
is rather an item-set. However, we argue that basic sequences composed of single
items are sufficient for many applications, and adopt this definition.
Definition 2:
A
sequences database
S is composed of a set of tuples (
sid
,
s
) where
sid
denotes the identity of the sequence
s
.
Definition 3:
A sequence s = <s
1
, s
2
, …, s
m
> is called a subsequence of another
sequence t = <t
1
, t
2
, …, t
n
>, denoted s
∈
t if and only if there exist integers 1 ≤ j1 < j2
< ... <jn ≤ m such that s
1
= t
j1
, s
2
= t
j2
, ..., s
n
= t
jn
.
In other words, a sequence s is
includ
ed in a sequence t if the ordered list of ele-
ments of s is included in the ordered list of elements of t.
Definition 4:
The support of a sequence α in a sequence database S is the number of
tuples in the database containing α i.e., support(α) | {<sid, s> | (<sid, s>
⊆
s)}.
Definition 5:
Given a positive integer
min_support
, a sequence α is called a
sequen-
tial pattern
in sequence database S if support(α) ≥ min_support.
Another concept is the similarity of sequences in order to compare them. Sequence
similarity is a well known problem in the fields of bio-informatics. It aims at deter-
mining if a DNA sequence is similar or not to another. The most popular algorithms
are BLAST [3], FASTA [15] and LCSS (
Longest Common Subsequence
) [6]. We use
this last method in our definition of sequence similarity. It measures the minimum
number of insertions and deletions to transform s
1
into s
2
. This method is widely im-
plemented and used for this purpose.
∈
S)
∧
(α
⊆
Definition 6:
Given two sequences s = <s
1
, s
2
, ..., s
i
,…, s
m
> and t = <t
1
, t
2
, …, t
j
,
…,t
n
> such that (i
[1,n]). Let lcs (s,t) the size of the longest common subse-
quence. The
dissimilarity distance
between s and t is defined as: d(s, t) = n + m -
2*lcs (s, t)
.
Definition 7:
Given a positive integer DT called
dissimilarity threshold
, sequence s is
said similar to a sequence t if their dissimilarity distance is lower than DT.
Definition 8:
A
multidimensional sequences database
is of schema (
RID
,
S,
A
1
,…,A
m
), where
RID
is a primary key, A
1
,…,A
m
are dimensions, and
S
the domain
of sequences. A
multidimensional sequence
is defined as (
rid,
s,
a
1
, …, a
m
) where a
i
∈
[1,m], j
∈
∈
Ai, for 1 ≤ i ≤ m and
s
a sequence (see table 1).