Efficiently Identifying Exploratory Rules’ Significance - Data Mining: Theory, Methodology, Techniques, and Applications

Aumann and Lindell [2] and Huang and Webb [8] both proposed ideas for filtering insignificant distributional-consequent exploratory rules. In this paper, we use the definition proposed by the latter.

Definition 1 (significant impact rule). An impact rule A → target is significant if the distribution of its target is significantly improved in comparison with the target distribution of any of its direct parents. The measure for the target distribution can be the mean, the variance, etc.

significant(A → target) = ∀x ∈ A, dist(coverset(A)) ≫ dist(coverset(A − x) − coverset(A)) ¹

An impact rule is insignificant if it is not significant.

Definitions of insignificant propositional exploratory rules are provided by Liu et al. [10] and Bay and Pazzani [4].

In this paper, the mean of the target attribute over coverset(A) is used as the interestingness measure to be compared for the impact rule. A statistical test is performed to decide whether the target means of two samples are significantly different from each other.
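The significance check in Definition 1 can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the authors' method: a record is modelled as a pair (condition set, single numeric target value), "improved" is taken to mean a higher target mean, and the statistical test is a one-sided two-sample z-test under a normal approximation, chosen only because it needs nothing beyond the standard library. The function names and the `alpha` parameter are hypothetical.

```python
from statistics import NormalDist, mean, variance


def mean_significantly_greater(sample_a, sample_b, alpha=0.05):
    """One-sided two-sample z-test (normal approximation, an assumption):
    is the mean of sample_a significantly greater than that of sample_b?"""
    if len(sample_a) < 2 or len(sample_b) < 2:
        return False  # too few records to estimate a variance
    std_err = (variance(sample_a) / len(sample_a)
               + variance(sample_b) / len(sample_b)) ** 0.5
    if std_err == 0:
        return mean(sample_a) > mean(sample_b)
    z = (mean(sample_a) - mean(sample_b)) / std_err
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value < alpha


def coverset(records, conditions):
    """Records whose condition set contains every condition in `conditions`."""
    return [(c, v) for c, v in records if conditions <= c]


def is_significant(records, antecedent, alpha=0.05):
    """Definition 1 with the mean as the distribution measure: the rule must
    beat coverset(A - x) - coverset(A) for every condition x in A."""
    covered_targets = [v for _, v in coverset(records, antecedent)]
    for x in antecedent:
        parent = coverset(records, antecedent - {x})
        # Records covered by the direct parent but not by the rule itself.
        rest_targets = [v for c, v in parent if not antecedent <= c]
        if not mean_significantly_greater(covered_targets, rest_targets, alpha):
            return False
    return True
```

In this sketch a rule is reported as insignificant when a comparison sample has fewer than two records, which is a conservative design choice and not something the text prescribes.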

K-Most-Interesting Impact Rule Discovery and Notations

The impact rule discovery algorithm we adopt is based on the OPUS [14] algorithm, which enables the successful discovery of the top k impact rules that satisfy a certain set of constraints.

We characterize the terminology of k-most-interesting impact rule discovery to be used in this paper as follows:

1. An impact rule is in the form of A → target, where the target is described by the following measures: coverage, mean, variance, maximum, minimum, sum and impact.

2. Impact is an interestingness measure suggested by Webb [13] ²:

impact(A → target) = (mean(A → target) − targ) × coverage(A).

3. A k-most-interesting impact rule discovery task is a 7-tuple: KMIIRD(C, T, D, M, λ, I, k).

C: is a nonempty set of Boolean conditions, which are the set of available conditions for impact rule antecedents.

T: is a nonempty set of the variables in whose distribution we are interested.

D: is a nonempty set of records, which is called the database. A record is a pair <c, v>, c ⊆ C and v is a set of values for T.
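As a rough illustration of the notation above, the following Python sketch models a record as a pair <c, v> and computes the descriptive measures of a rule's target over the records it covers. It assumes a single numeric target variable, so v is one number rather than a set of values, and it uses the population variance; both choices, like the names `coverset` and `describe_target`, are assumptions made for the example.

```python
from statistics import mean, pvariance


def coverset(records, antecedent):
    """Records <c, v> whose condition set c contains the whole antecedent."""
    return [(c, v) for c, v in records if antecedent <= c]


def describe_target(records, antecedent):
    """Measures describing the target of the rule `antecedent -> target`."""
    values = [v for _, v in coverset(records, antecedent)]
    if not values:
        return None  # the rule covers no records
    return {
        "coverage": len(values),
        "mean": mean(values),
        "variance": pvariance(values),  # population variance (an assumption)
        "maximum": max(values),
        "minimum": min(values),
        "sum": sum(values),
    }
```

Representing each condition set as a `frozenset` lets the cover test be written as a subset comparison, which mirrors the requirement c ⊆ C for records.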

¹ The token "≫" is used to denote significantly improved, and dist(R) is used to represent the distribution of the target variable over the set of records R.

² In this formula, mean(A → target) denotes the mean of the targets covered by A, and coverage(A) is the number of records covered by A.
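The impact measure and the idea of keeping the k most interesting rules can be sketched as follows. This is not the OPUS search: it enumerates small antecedents by brute force and ranks them with `heapq.nlargest`, purely for illustration. It also assumes that targ denotes the mean of the target over all records in the database, and that each record is a pair (condition set, numeric target); the `max_size` parameter is hypothetical.

```python
import heapq
from itertools import combinations
from statistics import mean


def impact(records, antecedent):
    """(mean of covered targets - overall target mean) * coverage."""
    values = [v for c, v in records if antecedent <= c]
    if not values:
        return 0.0  # a rule covering nothing has no impact
    overall_mean = mean(v for _, v in records)  # assumed meaning of `targ`
    return (mean(values) - overall_mean) * len(values)


def top_k_by_impact(records, conditions, k, max_size=2):
    """Brute-force stand-in for the search: rank every antecedent of up to
    `max_size` conditions by impact and keep the k best."""
    candidates = (
        frozenset(combo)
        for size in range(1, max_size + 1)
        for combo in combinations(sorted(conditions), size)
    )
    return heapq.nlargest(k, candidates, key=lambda a: impact(records, a))
```

Because impact multiplies the gain in mean by coverage, a general rule with a modest gain can outrank a specific rule with a larger gain but few covered records.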
