Database Reference
In-Depth Information
ABCD
ABCD
ABC
ABD
ACD
BCD
ABC
ACD
BCD
AB
AC
AC
BC
BC
AD
BD
CD
CD
A
B
C
C
D
{}
Fig. 16.3 Pruned search space during iterative database scans of Apriori (example): itemset { C }
has been found infrequent in the first scan, therefore, itemsets { A , C } , { B , C } , { C , D } do not need
to be considered in the second scan, itemsets { A , B , C } , { A , C , D } , { B , C , D } do not need to be
considered in the third scan, etc. In this example, Apriori stops scanning the database after round
three, as there is no candidate of length 4 remaining
More precisely, the pruning criterion used in the Apriori algorithm is based on the
equivalent anti-monotonic property, describing the opposite direction of deduction:
S
is not frequent
⇒∀ T S
:
T
cannot be frequent either.
(16.2)
In the iterative procedure of repeated scans of the database for frequent itemsets,
this anti-monotonic property allows to ignore candidates that cannot be frequent
and, eventually, this pruning allows stopping at a certain size of itemsets, when
no candidates of typically moderate size remain to generate larger itemsets (see
Fig. 16.3 ).
An extension of the Apriori idea for very large itemsets has been termed 'colossal
patterns' [ 82 ]. The observation is that if one is interested in finding very large frequent
itemsets, then Apriori needs to generate many smaller frequent itemsets that are not
relevant for the result. This effect can be used positively, in that if large patterns also
have a large number of subsets, several of these subsets can be combined in order to
obtain larger candidates directly. In this sense, the idea is to avoid the full search, and
instead use some results at the bottom of the search space as a shortcut to particularly
promising candidates higher up. This approach thus trades some of the accuracy of
full search for a much more efficient frequent pattern mining algorithm. As we will
see below, both the Apriori algorithm, as well as that of colossal patterns have been
employed towards mining subspace clusters.
2.1
Generalized Monotonicity
In data clustering, we typically do not consider binary transaction data, or discrete
data in general, but instead most often study continuous real-valued vector data, typ-
ically assuming a Euclidean vector space. In this space, attributes may be noisy, or
Search WWH ::




Custom Search