Database Reference
In-Depth Information
− c
k
)
1] +
c
k
∗ A
[
k −
A
[
k,
]=(1
∗ A
[
k, −
1
,−
1]
.
(4)
Lemma 1.
Given a p-sequence
D
i
and a sequence
s
, by applying Eq. 4 repeat-
edly, we correctly compute
Pr[
s D
i
]
.
Fas t
L
1
Computation.
It was shown by [12] that it was possible to compute all
frequent 1-sequences in a single pass over the database. The procedure for this is
as follows: Initialize two arrays
F
and
G
,eachofsize
q
=
|I|
, to zero and consider
each source
i
in turn. If
D
i
,for
k
=1
,...,r
take the pair
(
e
k
,c
k
) and iterate through each
x ∈ e
k
, setting
F
[
x
]:=(
F
[
x
]
=
(
e
1
,c
1
)
,...,
(
e
r
,c
r
)
− c
k
)) +
c
k
.
Once finished with source
i
,if
F
[
x
] is non-zero, update
G
[
x
]:=
G
[
x
]+
F
[
x
]and
reset
F
[
x
] to zero (for each source
i
use a list structure to keep track of all the
non-zero entries in
F
). Finally, for any item
x ∈I
∗
(1
x,D
p
).
,
G
[
x
]=
ES
(
PrefixSpan.
PrefixSpan is based on the idea of pattern-growth, and works as
follows: First, all frequent 1-sequences are discovered. It is argued that any of the
frequent 2-sequences must begin with a frequent 1-sequence and therefore, the
complete set of sequential patterns can be partitioned into as many subsets as
the number of frequent 1-sequences where each 1-sequence is taken as a prefix.
A
projected database
is a
smaller
databased based on some prefix (sequence).
For example, in the sample database of Table 1(R), a (
d
)-projected database is
{
. The sub-
set of sequential patterns is mined by constructing the set of projected databases
based on frequent 1-sequences and mining each recursively. For example, if (
e
)
is a frequent 1-sequence in the above (
d
)-projected database, a (
d
)(
e
)-projected
database looks like
(
a, d, e
:0
.
2)(
b, c, e
:0
.
6)
,
(
a, d, e
:0
.
3)
,
(
,e
:0
.
5)(
b, c, e
:0
.
4)
}
. This recursive mining process con-
tinues until no more sequential patterns could be found. For details see [13].
It was noted in [12] that it is not correct to simply perform the fast
L
1
computation on a projected database. For example, if an (
a
)-projected database
contained two p-sequences (
b
:0
.
5)(
b
:0
.
5)(
a
:0
.
5) and (
b
:0
.
5)(
a
:0
.
5)(
b
:0
.
5),
then when considering whether (
a
)(
b
) is frequent, it is not correct to compute
the expected support of (
b
) in the projected database (for example, both p-
sequences above would give the same contribution - 0.75 - to the support of
(
b
) in the projected database, but clearly their support for (
a
)(
b
) is different).
In this work, we show how these sub-routines could be put together to find all
frequent sequential patterns using PGA.
{
(
b, c, e
:0
.
6)
,
,
}
3.1 Pattern-Growth Step
Pre-conditions
1.
s
is a previously discovered frequent sequence.
2. The list of sources
i
,wherePr[
s D
i
]
>
0 is available.
3. The
B
i,s
arrays for all such sources
i
are also available.