Frequent Pattern Mining Algorithms for Data Clustering - Frequent Pattern Mining

Database Reference

In-Depth Information

[ 84 ]. Here, we will go more into detail of how the ideas of frequent pattern mining

have been transferred and translated to the clustering domain, and how exactly they

have found use in various clustering algorithms. To this end, in Sect. 2 we will first

discuss the generalization of the reasoning about frequent patterns for the application

to the clustering task. We will then, in Sect. 3 , detail example algorithms for both

subspace clustering and subspace search, discussing the use of ideas proposed in

frequent pattern mining in these algorithms. We conclude the chapter in Sect. 4 .

Generalizing Pattern Mining for Clustering

For a reader of a chapter in this topic about frequent pattern mining, we assume famil-

iarity with frequent pattern mining as discussed also in fundamental other chapters

in this topic. In particular, we assume basic knowledge of the Apriori algorithm [ 6 ].

Nevertheless, for the sake of completeness, let us briefly recapitulate the algorithmic

ingredients of Apriori that are essential to our discussion.

Considering the example of market basket analysis, we are interested in find-

ing items that are sold together (i.e., itemsets). Naïvely, the search for all frequent

itemsets is exponential in the number of available items: we would simply cal-

culate the frequency of all k -itemsets in the database over m items, resulting in

k = 1 k =

2 m

1 tests.

For identification of frequent patterns in a transaction database (i.e., a binary

database, where each row does or does not contain a certain item), the idea of

Apriori is a level-wise search for itemsets of incremental length, given a frequency

threshold. Starting with all frequent itemsets of length 1 (i.e., counting all transactions

containing a certain item, irrespective of other items possibly also contained in the

transaction), the list of potential candidates for frequent itemsets of length 2 can be

restricted based on the following observation: An itemset of length 2 can only be

frequent if both contained items (i.e., itemsets of length 1) are frequent as well. If

neither diapers nor beer is a frequent item in the transaction database, the transaction

containing both diapers and beer cannot be frequent either. This holds for itemsets

of all lengths n , that can only be frequent if all contained itemsets of length n −

−

are frequent as well. For example, an itemset may contain items A , B , C , etc. If a

1-itemset containing A is not frequent (i.e., we find such an itemset less often than a

given threshold), all 2-itemsets containing A (e.g.,

) cannot be

frequent either (otherwise itemsets containing A would have been frequent as well)

and need not be tested for exceeding the threshold. Likewise, if the itemset

{ A , B }

{ A , C }

{ A , D }

{

A , B

}

is not frequent, then all 3-itemsets containing

{

A , B

}

(e.g.,

{

A , B , C

}

{

A , B , D

}

{

) cannot be frequent either, etc. Theoretically, the search space remains

exponential, yet practically the search is usually substantially accelerated.

This observation is a principle of monotonicity and is the most important ingre-

dient for a heuristic speed-up of the mining for frequent patterns. More concisely,

we can express this monotonicity over sets as follows:

A , B , E

}

is frequent

⇒∀ S ⊆ T

is frequent.

(16.1)

Frequent Pattern Mining

Search WWH ::

Custom Search

Home