Constraint-Based Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

Frequent itemset mining (see chapter ... ), for example, is an instance of this

generic setting, with the following choices:

the database has transactions that are subsets of a given set of items

I

;

2 I ;

the pattern language is the set of all subsets of

I

:

L π =

the minimum support constraint ϕ minsup ( π ,

D

) is true if and only if the number of

transactions of

D

that contain π is large enough, in other words, it is true if and

only if:

|

cover ( π )

|=|{

d

∈ D |

π

⊆

d

}| ≥

θ ,

where θ is a user-defined threshold.

By modifying the pattern language, the data language and the constraints, different

data mining problems can be formalized. The main aim of constraint-based pattern

mining is to build generic languages in which programmers can express pattern

mining problems in terms of constraints, and to develop systems that can process

statements in these languages.

2.1

Constraints

Constraints can be categorized along several dimensions, for instance:

1. which information is used when evaluating the constraint? Possibilities include

that the constraint only evaluates the syntax of the pattern, that the constraint

requires a database of transactions, or that the constraint requires a database with

labeled transactions.

2. which properties do the constraints have? The most well-known property is that

of (anti-)monotonicity, but other properties have been identified as well.

In terms of constraint-based pattern mining, combining constraints from different

categories of the former dimension is typically easy, whereas this proves challenging

for the latter dimension. Hence, existing work has focused on the latter dimension

and we will elaborate on these constraint categories below.

Anti-monotonicity Most pattern mining algorithms assume the existence of a cov-

erage relation between patterns and transactions in the data. In the case of frequent

itemset mining, for example, an itemset π covers a transaction d

d ;

hence, the subset relation is used as coverage relation. In graph mining, the subgraph

isomorphism relation may be used; in sequence mining, the subsequence relation.

A second important relation is the generality relation . A generality relation is

essentially a partial order on the set of patterns in

∈ D

iff π

⊆

L π . We will denote this relationship

with the symbol

: if pattern π 1 is more general than pattern π 2 , we will write

π 1

π 2 .

A generality relation

is compatible with a coverage relation if it satisfies the

following property for all possible transactions d :if π 1

π 2 and π 2 covers example

d , then π 1 covers example d .

A good generality relation is usually not difficult to choose. If the coverage relation

is transitive, one can always use the coverage relation as generality relation as well.

Frequent Pattern Mining

Search WWH ::

Custom Search

Home