Mining Efficiently Significant Classification Association Rules - Data Mining: Foundations and Practice

Databases Reference

In-Depth Information

Proof. Let each element v

∈

[ N ] with a uniformed probability 1 /k to be the

part of the element of

.Let H denote the number of different elements in

⊂

[ N ]thathavebeenhitby

after repeating the same procedure k times.

The probability p that

hits each set Q of size at most k on each of its

elements could be bounded by

p = E ( H ) /E ( k )

{ i =1 [( k

1 /k ) k− 1

−

i +1) /k ]

(1 /k )

−

}

= i =1 {

i +1) /k 2 ]

1 /k ) k− 1

[( k

−

}

> i =1 {

i +1) /k 2 ]

1 /k ) k

[( k

−

}

> i =1 {

[( k

−

i +1) /k 2 ]

(1 / 4)

}

( )

( k/ 2) /k 2 ]

=(1 / 4) × [(1 + k ) / (2 k )]

> (1 / 4)

=(1 / 4)

[(1 + k )

[ k/ (2 k )]

> 1 / 8,

1 /k ) k

where inequality ( ) follows from the fact that the sequence (1

−

monotonously increasing.

4 Proposed Rule Mining Approach

4.1 The Strategy of the Deterministic Approach

To identify the significant CARs for class A (as c i ∈

, we provide a de-

terministic approach that employs a single application of a “strong” (2 n ,k, 0)-

selector. This approach ensures that every potential significant rule for A

will be hit at least once. To apply a family

C )in

of subsets of [2 n ] means first

to arrange the sets of

into a sequence F 1 ,F 2 ,...,F |F| .Inthe i th step,

only the CARs in

with labels in F i will be involved in the procedure

SIGNIFICANCE-TEST, while other CARs can be ignored. Thus, we have

an O ( k 2 log 2 n )-complexity to hit each of the k potential significant rules in-

dependently at least once, due to the property of the “strong” selector. If the

current test for F i contains only one potential significant rule R j and R j is

also a significant rule for A , then we call the function LOG-TEST, which is

based on a binary search and finally find R j . With a “smaller” list of rules (all

significant rules and some insignificant rules), we then compute the weighting

score of each rule for class A , and finally catch the “best

k ” (top- k score)

rules for A .

4.2 The Strategy of the Randomised Approach

In this section, we use the randomised k -selector to substitute the “strong”

selector in Sect. 4.1. This randomised approach ensures that every potential

significant rule for class A will be hit at least once with a high probability.

Data Mining: Foundations and Practice

Search WWH ::

Custom Search

Home