Mining Sequential Patterns from Probabilistic Databases by Pattern-Growth - Advances in Databases

Database Reference

In-Depth Information

− c k )

1] + c k ∗ A [ k −

A [ k, ]=(1

∗ A [ k, −

1 ,−

1] .

(4)

Lemma 1. Given a p-sequence D i

and a sequence s , by applying Eq. 4 repeat-

edly, we correctly compute Pr[ s D i

] .

Fas t L 1 Computation. It was shown by [12] that it was possible to compute all

frequent 1-sequences in a single pass over the database. The procedure for this is

as follows: Initialize two arrays F and G ,eachofsize q =

|I|

, to zero and consider

each source i in turn. If D i

,for k =1 ,...,r take the pair

( e k ,c k ) and iterate through each x ∈ e k , setting F [ x ]:=( F [ x ]

( e 1 ,c 1 ) ,..., ( e r ,c r )

− c k )) + c k .

Once finished with source i ,if F [ x ] is non-zero, update G [ x ]:= G [ x ]+ F [ x ]and

reset F [ x ] to zero (for each source i use a list structure to keep track of all the

non-zero entries in F ). Finally, for any item x ∈I

∗

x,D p ).

, G [ x ]= ES (

PrefixSpan. PrefixSpan is based on the idea of pattern-growth, and works as

follows: First, all frequent 1-sequences are discovered. It is argued that any of the

frequent 2-sequences must begin with a frequent 1-sequence and therefore, the

complete set of sequential patterns can be partitioned into as many subsets as

the number of frequent 1-sequences where each 1-sequence is taken as a prefix.

A projected database is a smaller databased based on some prefix (sequence).

For example, in the sample database of Table 1(R), a ( d )-projected database is

{

. The sub-

set of sequential patterns is mined by constructing the set of projected databases

based on frequent 1-sequences and mining each recursively. For example, if ( e )

is a frequent 1-sequence in the above ( d )-projected database, a ( d )( e )-projected

database looks like

( a, d, e :0 . 2)( b, c, e :0 . 6)

( a, d, e :0 . 3)

( ,e :0 . 5)( b, c, e :0 . 4)

}

. This recursive mining process con-

tinues until no more sequential patterns could be found. For details see [13].

It was noted in [12] that it is not correct to simply perform the fast L 1

computation on a projected database. For example, if an ( a )-projected database

contained two p-sequences ( b :0 . 5)( b :0 . 5)( a :0 . 5) and ( b :0 . 5)( a :0 . 5)( b :0 . 5),

then when considering whether ( a )( b ) is frequent, it is not correct to compute

the expected support of ( b ) in the projected database (for example, both p-

sequences above would give the same contribution - 0.75 - to the support of

( b ) in the projected database, but clearly their support for ( a )( b ) is different).

In this work, we show how these sub-routines could be put together to find all

frequent sequential patterns using PGA.

{

( b, c, e :0 . 6)

}

3.1 Pattern-Growth Step

Pre-conditions

1. s is a previously discovered frequent sequence.

2. The list of sources i ,wherePr[ s D i

] > 0 is available.

3. The B i,s arrays for all such sources i are also available.

Advances in Databases

Search WWH ::

Custom Search

Home