and has been leveraged by [44] to use one type of mining technique to address different tasks: classification, subgroup discovery, and conceptual clustering.
Eliminating Minimum Support
The settings above essentially apply statistical measures in addition to minimum support, so minimum support remains a parameter that needs to be set. Several approaches have successfully eliminated this parameter.
The main observation is that thresholds on quality measures can be translated
into support thresholds; hence, if a support threshold is not given, it is possible to
automatically determine an additional support threshold for use in a pattern mining
algorithm.
Returning to the accuracy measure, we can set a minimum threshold on it: acc(r) ≥ θ_acc. This can be transformed into p + (N − n) ≥ θ_acc · |D|, and further into p ≥ θ_acc · |D| − N + n ≥ θ_acc · |D| − N (since n ≥ 0). So we derive support constraints based on the threshold on the quality measure itself [31].
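As a worked example, the derived support constraint can be computed directly. The function below is a hypothetical helper, not from the cited work; it merely restates the inequality p ≥ θ_acc · |D| − N with |D| = P + N.

```python
# Hypothetical helper: translate an accuracy threshold into a
# minimum-support constraint on p, the number of covered positives.

def min_positive_support(theta_acc: float, P: int, N: int) -> float:
    """Smallest p any rule can have while still reaching acc(r) >= theta_acc.

    From p + (N - n) >= theta_acc * |D| and n >= 0 it follows that
    p >= theta_acc * |D| - N, where |D| = P + N.
    """
    D = P + N
    return theta_acc * D - N

# Example: 60 positives, 40 negatives, required accuracy 0.7:
# p must be at least 0.7 * 100 - 40 = 30, so any pattern covering fewer
# than 30 positive transactions can be pruned without evaluating accuracy.
bound = min_positive_support(0.7, P=60, N=40)
```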
For measures that are convex, which includes the ones mentioned above but also many others, a similar argument is possible: convex functions take their maxima at extreme points, i.e. points with p = 0 or n = 0. Thus, based on a threshold on the minimal acceptable values for a statistical scoring function, thresholds on a pattern's p and n can be derived and enforced during mining. This makes it effective to use the quality measure to prune during rule mining [6, 7, 11, 12, 15, 19, 29, 31, 35, 39, 43-45].
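The pruning argument can be sketched in code. Accuracy, being linear and hence convex, serves as the example measure here; the function names, the dataset sizes, and the choice of the extreme points (p, 0) and (0, n) as the optimistic estimate are assumptions for this illustration, not a specific method from the cited works.

```python
# Sketch of convexity-based pruning. Notation as in the text:
# a rule covers p positives and n negatives; P and N are the totals.

def acc(p, n, P, N):
    # Accuracy of a rule covering (p, n): linear, hence convex.
    return (p + (N - n)) / (P + N)

def optimistic_estimate(q, p, n, P, N):
    """Upper bound of q over all specialisations of a rule covering (p, n).

    A specialisation covers p' <= p positives and n' <= n negatives, and a
    convex q over this region is maximised at an extreme point; for measures
    like accuracy, the points (p, 0) and (0, n) dominate the others.
    """
    return max(q(p, 0, P, N), q(0, n, P, N))

def can_prune(q, p, n, P, N, threshold):
    # If even the optimistic estimate misses the threshold, no
    # specialisation of this rule can qualify: discard the branch.
    return optimistic_estimate(q, p, n, P, N) < threshold

# Example: 60 positives, 40 negatives; a rule covering p=20, n=10 has
# optimistic accuracy (20 + 40) / 100 = 0.6, so with a threshold of 0.7
# its entire branch of the search space can be discarded.
prunable = can_prune(acc, 20, 10, 60, 40, 0.7)
```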
Thus far we have discussed approaches that use thresholds and exhaustively mine all patterns that satisfy the thresholds. An even easier and often more effective approach is to perform top-k mining instead. In top-k mining, one is interested in finding only those patterns which have the k highest scores; the only parameter that needs to be specified is k. This has been leveraged by [7, 11, 12, 15, 35, 42, 45]. The nature of this mining process means that the threshold(s) increase during mining, pruning more and more candidate patterns as the search progresses. To achieve a quick increase of the threshold, it can be useful to perform a best-first search during which it is always the rule with the highest upper bound that is specialized.
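A toy sketch of this scheme, assuming an illustrative dataset, the score p − n, and p as its upper bound (specialisations can only lose coverage); none of these choices come from the cited works.

```python
import heapq

def topk_rules(transactions, labels, k):
    """Find the k itemset rules with the highest score p - n.

    Best-first: the frontier is ordered by upper bound, so the rule with
    the highest upper bound is always specialised next. The threshold
    (the k-th best score found so far) rises during the search and
    prunes ever more candidates, as described in the text above.
    """
    items = sorted({i for t in transactions for i in t})

    def counts(pattern):
        p = n = 0
        for t, y in zip(transactions, labels):
            if pattern <= t:
                p, n = (p + 1, n) if y else (p, n + 1)
        return p, n

    best = []      # min-heap holding the k best (score, pattern) pairs
    frontier = []  # max-heap on the upper bound (negated for heapq)
    for it in items:
        p, n = counts(frozenset([it]))
        heapq.heappush(frontier, (-p, (it,), frozenset([it]), p, n))

    while frontier:
        neg_ub, _, pat, p, n = heapq.heappop(frontier)
        threshold = best[0][0] if len(best) == k else float("-inf")
        if -neg_ub <= threshold:
            break  # no remaining candidate can still enter the top k
        score = p - n
        if len(best) < k:
            heapq.heappush(best, (score, tuple(sorted(pat))))
        elif score > best[0][0]:
            heapq.heapreplace(best, (score, tuple(sorted(pat))))
        for it in items:
            if it > max(pat):  # canonical extensions avoid duplicates
                child = pat | {it}
                cp, cn = counts(child)
                if cp > 0:
                    heapq.heappush(
                        frontier, (-cp, tuple(sorted(child)), child, cp, cn))
    return sorted(best, reverse=True)

# Four transactions (sets of items) with binary class labels:
transactions = [{"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}]
labels = [1, 1, 0, 0]
top2 = topk_rules(transactions, labels, k=2)
```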
2.3 Numerical Target Values
As opposed to the setting discussed in the preceding sections, in which each transaction is labeled with one out of a set of discrete labels, a numerical variable of interest can have potentially infinitely many values. As a result, each transaction in the data may have a different target value, and learning a predictor for particular values, or partitioning the data into subsets consisting of transactions with the same target value, are strategies that are unlikely to be successful. Nevertheless, there exist a number of techniques for discretizing numerical values, in which case the problem can be reduced to the classification setting.
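One common discretization scheme is equal-frequency binning; the function below and the three-bin choice are assumptions for illustration, not a technique prescribed by the text.

```python
# Sketch: reduce a numerical target to the discrete-label setting via
# equal-frequency discretization (function name and bin count are
# illustrative assumptions).

def equal_frequency_bins(values, num_bins):
    """Label each transaction with one of num_bins class labels such
    that each class contains (roughly) equally many transactions.
    Note: ties between equal values may land in different bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * num_bins // len(values)
    return labels

# Six numerical target values discretized into three classes:
targets = [3.2, 0.5, 7.1, 2.8, 9.9, 4.4]
labels = equal_frequency_bins(targets, 3)  # -> [1, 0, 2, 0, 2, 1]
```

After this reduction, any of the discrete-label techniques from the preceding sections can be applied to the binned labels.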