Database Reference
In-Depth Information
Frequent itemset mining (see chapter ... ), for example, is an instance of this
generic setting, with the following choices:
￿
the database has transactions that are subsets of a given set of items
I
;
2 I ;
￿
the pattern language is the set of all subsets of
I
:
L π =
￿
the minimum support constraint ϕ minsup ( π ,
D
) is true if and only if the number of
transactions of
D
that contain π is large enough, in other words, it is true if and
only if:
|
cover ( π )
|=|{
d
D |
π
d
}| ≥
θ ,
where θ is a user-defined threshold.
By modifying the pattern language, the data language and the constraints, different
data mining problems can be formalized. The main aim of constraint-based pattern
mining is to build generic languages in which programmers can express pattern
mining problems in terms of constraints, and to develop systems that can process
statements in these languages.
2.1
Constraints
Constraints can be categorized along several dimensions, for instance:
1. which information is used when evaluating the constraint? Possibilities include
that the constraint only evaluates the syntax of the pattern, that the constraint
requires a database of transactions, or that the constraint requires a database with
labeled transactions.
2. which properties do the constraints have? The most well-known property is that
of (anti-)monotonicity, but other properties have been identified as well.
In terms of constraint-based pattern mining, combining constraints from different
categories of the former dimension is typically easy, whereas this proves challenging
for the latter dimension. Hence, existing work has focused on the latter dimension
and we will elaborate on these constraint categories below.
Anti-monotonicity Most pattern mining algorithms assume the existence of a cov-
erage relation between patterns and transactions in the data. In the case of frequent
itemset mining, for example, an itemset π covers a transaction d
d ;
hence, the subset relation is used as coverage relation. In graph mining, the subgraph
isomorphism relation may be used; in sequence mining, the subsequence relation.
A second important relation is the generality relation . A generality relation is
essentially a partial order on the set of patterns in
D
iff π
L π . We will denote this relationship
with the symbol
: if pattern π 1 is more general than pattern π 2 , we will write
π 1
π 2 .
A generality relation
is compatible with a coverage relation if it satisfies the
following property for all possible transactions d :if π 1
π 2 and π 2 covers example
d , then π 1 covers example d .
A good generality relation is usually not difficult to choose. If the coverage relation
is transitive, one can always use the coverage relation as generality relation as well.
Search WWH ::




Custom Search