Database Reference
In-Depth Information
[ 84 ]. Here, we will go more into detail of how the ideas of frequent pattern mining
have been transferred and translated to the clustering domain, and how exactly they
have found use in various clustering algorithms. To this end, in Sect. 2 we will first
discuss the generalization of the reasoning about frequent patterns for the application
to the clustering task. We will then, in Sect. 3 , detail example algorithms for both
subspace clustering and subspace search, discussing the use of ideas proposed in
frequent pattern mining in these algorithms. We conclude the chapter in Sect. 4 .
2
Generalizing Pattern Mining for Clustering
For a reader of a chapter in this topic about frequent pattern mining, we assume famil-
iarity with frequent pattern mining as discussed also in fundamental other chapters
in this topic. In particular, we assume basic knowledge of the Apriori algorithm [ 6 ].
Nevertheless, for the sake of completeness, let us briefly recapitulate the algorithmic
ingredients of Apriori that are essential to our discussion.
Considering the example of market basket analysis, we are interested in find-
ing items that are sold together (i.e., itemsets). Naïvely, the search for all frequent
itemsets is exponential in the number of available items: we would simply cal-
culate the frequency of all k -itemsets in the database over m items, resulting in
k = 1 k =
2 m
1 tests.
For identification of frequent patterns in a transaction database (i.e., a binary
database, where each row does or does not contain a certain item), the idea of
Apriori is a level-wise search for itemsets of incremental length, given a frequency
threshold. Starting with all frequent itemsets of length 1 (i.e., counting all transactions
containing a certain item, irrespective of other items possibly also contained in the
transaction), the list of potential candidates for frequent itemsets of length 2 can be
restricted based on the following observation: An itemset of length 2 can only be
frequent if both contained items (i.e., itemsets of length 1) are frequent as well. If
neither diapers nor beer is a frequent item in the transaction database, the transaction
containing both diapers and beer cannot be frequent either. This holds for itemsets
of all lengths n , that can only be frequent if all contained itemsets of length n
1
are frequent as well. For example, an itemset may contain items A , B , C , etc. If a
1-itemset containing A is not frequent (i.e., we find such an itemset less often than a
given threshold), all 2-itemsets containing A (e.g.,
) cannot be
frequent either (otherwise itemsets containing A would have been frequent as well)
and need not be tested for exceeding the threshold. Likewise, if the itemset
{ A , B }
,
{ A , C }
,
{ A , D }
{
A , B
}
is not frequent, then all 3-itemsets containing
{
A , B
}
(e.g.,
{
A , B , C
}
,
{
A , B , D
}
,
{
) cannot be frequent either, etc. Theoretically, the search space remains
exponential, yet practically the search is usually substantially accelerated.
This observation is a principle of monotonicity and is the most important ingre-
dient for a heuristic speed-up of the mining for frequent patterns. More concisely,
we can express this monotonicity over sets as follows:
T
A , B , E
}
is frequent
⇒∀ S T
:
S
is frequent.
(16.1)
Search WWH ::




Custom Search