Database Reference
In-Depth Information
1. The ones having itemset
a
as a prefix w.r.t.
R
, i.e., those containing item
a
; and
2. The ones having itemset
f
as a prefix w.r.t.
R
, i.e., those containing item
f
but
no
a
.
They form two projected databases [
18
] which can be mined with the constraint
C
pushed in. We examine the first one only since the second is similar.
Since
a
is a frequent itemset satisfying the constraint, the frequent itemsets having
a
as a proper prefix can be found in
a-projected database
(the subset of transactions
containing
a
). It contains two transactions:
bcdf
and
cdef
. Since items
b
and
e
is infrequent within this projected database, neither
ab
nor
ae
can be frequent. So,
they are pruned. The frequent items in the
a
-projected database is
f
,
d
,
c
, listed in
the order
. Since
ac
does not satisfy the constraint, there is no need to create an
ac
-projected database.
To check what can be mined in the
a
-projected database with
af
and
ad
, as prefix,
respectively, we need to construct the two projected databases and mine them. This
process is similar to the mining of
a
-projected databases.
The
af -projected database
contains two frequent items
d
and
c
, and only
af d
satisfy the constraint. Moreover, since
af d c
does not satisfies the constraint, the
process in this branch is complete. Since
af c
violates the constraint, there is no
need to construct
af c
-projected database. The
ad-projected database
contains one
frequent item
c
,but
adc
does not satisfy the constraint. Therefore, the set of frequent
itemsets satisfying the constraint and having
a
as prefix contains
a
,
af
,
af d
, and
ad
.
In summary, the complete set of frequent itemsets satisfying the constraint con-
tains 6 itemsets:
a
,
f
,
af
,
ad
,
af d
,
fg
. The method with ordered itemsets and
frequent pattern growth generates and tests only a small set of itemsets.
This example shows that by proper ordering of itemsets, frequent pattern growth
method may push some tough constraints (called
convertible constraints
) deeper
than the Apriori methods. A systematic classification of such constraints and a study
of how to push them into the mining process are in [
27
,
29
].
R
4
PrefixSpan: Mining Sequential Patterns by Pattern Growth
Sequential pattern mining, which discovers frequent subsequences as patterns in
a sequence database, is an important data mining problem with broad applications,
including the analyses of customer purchase behavior, Web access patterns, scientific
experiments, disease treatments, natural disasters, DNA sequences, and so on.
A
sequence database S
is a set of tuples
, where
sid
is a
sequence_id
and
s
is a sequence (i.e., an ordered list of itemsets). A tuple
sid
,
s
sid
,
s
is said to
contain
a
sequence
α
,if
α
is a subsequence of
s
, i.e.,
α
s
. The support of a sequence
α
in
a sequence database
S
is the number of tuples in the database containing
α
.Given
a positive integer
ξ
as the
support threshold
, a sequence
α
is called a
sequential
pattern
in sequence database
S
if the sequence is contained by at least
ξ
tuples in