Aumann and Lindell [2] and Huang and Webb [8] both proposed ideas for
filtering insignificant distributional-consequent exploratory rules. In this paper,
we use the definition proposed by the latter.
Definition 1 (significant impact rule). An impact rule A → target is
significant if the distribution of its target is significantly improved in
comparison with the target distribution of each of its direct parents. The
measure for the target distribution can be the mean, the variance, etc.

significant(A → target) = ∀x ∈ A, dist(coverset(A)) ≫ dist(coverset(A − x) − coverset(A)) ¹

An impact rule is insignificant if it is not significant.
Definitions of insignificant propositional exploratory rules are provided by Liu
et al. [10] and Bay and Pazzani [4].
In this paper, the mean of the target attribute over coverset(A) is used as the
interestingness measure to be compared for the impact rule. A statistical test is
performed to decide whether the target means of two samples are significantly
different from each other.
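The significance check of Definition 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: records are modelled as (condition set, target value) pairs, "improved" is taken to mean a higher target mean, and a one-sided two-sample z-test under a normal approximation stands in for whichever statistical test is actually used. The names coverset, significantly_higher, is_significant and the alpha parameter are hypothetical.

```python
from statistics import NormalDist, mean, variance


def coverset(antecedent, records):
    """Records whose condition set contains every condition of the antecedent."""
    return [r for r in records if antecedent <= r[0]]


def significantly_higher(sample_a, sample_b, alpha=0.05):
    """One-sided z-test (assumed stand-in): is mean(sample_a) > mean(sample_b)?"""
    if len(sample_a) < 2 or len(sample_b) < 2:
        return False  # too few records to estimate a variance
    se = (variance(sample_a) / len(sample_a)
          + variance(sample_b) / len(sample_b)) ** 0.5
    if se == 0:
        return mean(sample_a) > mean(sample_b)
    z = (mean(sample_a) - mean(sample_b)) / se
    return 1.0 - NormalDist().cdf(z) < alpha


def is_significant(antecedent, records, alpha=0.05):
    """Compare the rule with every direct parent A - x, as in Definition 1."""
    covered_targets = [r[1] for r in coverset(antecedent, records)]
    for x in antecedent:
        parent = antecedent - {x}
        # Targets of records covered by the parent but not by the rule itself,
        # i.e. coverset(A - x) - coverset(A).
        rest = [r[1] for r in coverset(parent, records)
                if not antecedent <= r[0]]
        if not significantly_higher(covered_targets, rest, alpha):
            return False
    return True
```

The rule is rejected as soon as one direct parent fails the test, which mirrors the universal quantifier over x in the definition.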
4 K-Most-Interesting Impact Rule Discovery and Notations
The impact rule discovery algorithm we adopt is based on the OPUS [14]
algorithm, which enables the successful discovery of the top k impact rules that
satisfy a certain set of constraints.
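To make the "top k rules under constraints" idea concrete, here is a minimal brute-force sketch in Python. It is not OPUS, which prunes the search space rather than enumerating it; it simply scores every candidate antecedent and keeps the k best. The function top_k_rules and its parameters score, max_size and satisfies are hypothetical names chosen for illustration.

```python
import heapq
from itertools import combinations


def top_k_rules(conditions, score, k, max_size=3, satisfies=lambda a: True):
    """Exhaustively score antecedents of up to max_size conditions; keep the k best."""
    candidates = (
        frozenset(combo)
        for size in range(1, max_size + 1)
        for combo in combinations(sorted(conditions), size)
    )
    # Drop antecedents that violate the user-supplied constraints.
    admissible = (a for a in candidates if satisfies(a))
    # nlargest returns the k highest-scoring antecedents, best first.
    return heapq.nlargest(k, admissible, key=score)
```

Exhaustive enumeration grows exponentially with the number of conditions, which is the cost a pruning search such as OPUS is meant to avoid.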
We characterize the terminology of k-most-interesting impact rule discovery
to be used in this paper as follows:
1. An impact rule takes the form A → target, where the target is described by
the following measures: coverage, mean, variance, maximum, minimum, sum and
impact.
2. Impact is an interestingness measure suggested by Webb [13] ²:
impact(A → target) = (mean(A → target) − targ) × coverage(A).
3. A k-most-interesting impact rule discovery task is a 7-tuple:
KMIIRD(C, T, D, M, λ, I, k).
C: a nonempty set of Boolean conditions, which are the set of available
conditions for impact rule antecedents.
T: a nonempty set of the variables in whose distribution we are interested.
D: a nonempty set of records, which is called the database. A record is a
pair <c, v>, where c ⊆ C and v is a set of values for T.
¹ The token "≫" is used to denote significantly improved, and dist(R) is used to
represent the distribution of the target variable over the set of records R.
² In this formula, mean(A → target) denotes the mean of the targets covered by A,
and coverage(A) is the number of records covered by A.
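The record structure and the impact measure above can be illustrated with a short Python sketch. It rests on several assumptions: there is a single numeric target variable, so each record <c, v> is a (condition set, value) pair; and targ is taken to be the mean of the target over the whole database, which the text here does not state explicitly. The function names are hypothetical.

```python
from statistics import mean


def coverset(antecedent, database):
    """Records <c, v> whose condition set c contains the whole antecedent."""
    return [(c, v) for c, v in database if antecedent <= c]


def coverage(antecedent, database):
    """Number of records covered by the antecedent."""
    return len(coverset(antecedent, database))


def impact(antecedent, database):
    """(mean(A -> target) - targ) * coverage(A); targ assumed to be the overall mean."""
    covered = [v for _, v in coverset(antecedent, database)]
    if not covered:
        return 0.0  # an antecedent covering no records has zero coverage
    overall_mean = mean([v for _, v in database])
    return (mean(covered) - overall_mean) * len(covered)
```

Because the difference in means is multiplied by coverage, a rule with a modest gain over many records can score as high as a rule with a large gain over few records.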
 