this purpose requires excessive data access and computation. However, with the
status of the current rule and all its parent rules known, we can derive
the statistics of the difference sets needed for the significance test without
additional access to the database. The following lemma validates this statement.
Lemma 1. Suppose we are searching for impact rules from a database $D$. If $A \subset B$ and $\mathit{coverset}(A) - \mathit{coverset}(B) = R$, where $A$ and $B$ are both conjunctions of conditions and $R$ is a set of records from $D$, and if the mean and variance of the target attribute over $\mathit{coverset}(A)$ and $\mathit{coverset}(B)$ are known, as well as the cardinality of both record sets, then the mean and variance of the target attribute over the set $R$ can be derived without additional data access.
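Proof. Since $A \subset B$, every record that satisfies $B$ also satisfies $A$, so $\mathit{coverset}(B) \subseteq \mathit{coverset}(A)$. Together with $\mathit{coverset}(A) - \mathit{coverset}(B) = R$, it follows that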
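$$|R| = \mathit{coverage}(A) - \mathit{coverage}(B) \tag{1}$$

$$\mathit{mean}(R) = \frac{\mathit{coverage}(A) \times \mathit{mean}(A \to \mathit{target}) - \mathit{coverage}(B) \times \mathit{mean}(B \to \mathit{target})}{|R|} \tag{2}$$

$$\mathit{variance}(A \to \mathit{target}) = \frac{\sum_{x \in \mathit{coverset}(A)} \bigl(\mathit{target}(x) - \mathit{mean}(A \to \mathit{target})\bigr)^2}{\mathit{coverage}(A) - 1} \tag{3}$$

$$\mathit{variance}(B \to \mathit{target}) = \frac{\sum_{x \in \mathit{coverset}(B)} \bigl(\mathit{target}(x) - \mathit{mean}(B \to \mathit{target})\bigr)^2}{\mathit{coverage}(B) - 1} \tag{4}$$

$$\sum_{x \in \mathit{coverset}(A)} \mathit{target}(x) = \mathit{mean}(A \to \mathit{target}) \times \mathit{coverage}(A) \tag{5}$$

$$\sum_{x \in \mathit{coverset}(B)} \mathit{target}(x) = \mathit{mean}(B \to \mathit{target}) \times \mathit{coverage}(B) \tag{6}$$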
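From (3), (4), (5) and (6) we can derive the following equations:

$$\begin{aligned}
\sum_{x \in R} \mathit{target}(x)^2 &= \sum_{x \in \mathit{coverset}(A)} \mathit{target}(x)^2 - \sum_{x \in \mathit{coverset}(B)} \mathit{target}(x)^2 \\
&= \mathit{variance}(A \to \mathit{target}) \times (\mathit{coverage}(A) - 1) + \mathit{mean}(A \to \mathit{target})^2 \times \mathit{coverage}(A) \\
&\quad - \mathit{variance}(B \to \mathit{target}) \times (\mathit{coverage}(B) - 1) - \mathit{mean}(B \to \mathit{target})^2 \times \mathit{coverage}(B)
\end{aligned} \tag{7}$$

$$\sum_{x \in R} \mathit{target}(x) = \sum_{x \in \mathit{coverset}(A)} \mathit{target}(x) - \sum_{x \in \mathit{coverset}(B)} \mathit{target}(x) \tag{8}$$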
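The second line of (7) rests on the standard expansion of a sum of squared deviations. As a sketch, written out for $A$ only (the case for $B$ is identical) and assuming the sample variance with denominator $\mathit{coverage}(A) - 1$ as in (3):

$$\begin{aligned}
\sum_{x \in \mathit{coverset}(A)} \bigl(\mathit{target}(x) - \mathit{mean}(A \to \mathit{target})\bigr)^2
&= \sum_{x \in \mathit{coverset}(A)} \mathit{target}(x)^2 - 2\,\mathit{mean}(A \to \mathit{target}) \sum_{x \in \mathit{coverset}(A)} \mathit{target}(x) + \mathit{coverage}(A) \times \mathit{mean}(A \to \mathit{target})^2 \\
&= \sum_{x \in \mathit{coverset}(A)} \mathit{target}(x)^2 - \mathit{coverage}(A) \times \mathit{mean}(A \to \mathit{target})^2 ,
\end{aligned}$$

where the last step substitutes (5). Combining this with (3) gives

$$\sum_{x \in \mathit{coverset}(A)} \mathit{target}(x)^2 = \mathit{variance}(A \to \mathit{target}) \times (\mathit{coverage}(A) - 1) + \mathit{mean}(A \to \mathit{target})^2 \times \mathit{coverage}(A),$$

and subtracting the corresponding expression for $B$ yields the right-hand side of (7).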
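Thus,

$$\begin{aligned}
\mathit{variance}(R) &= \frac{\sum_{x \in R} \bigl(\mathit{target}(x) - \mathit{mean}(R)\bigr)^2}{|R| - 1} \\
&= \frac{\sum_{x \in R} \mathit{target}(x)^2}{|R| - 1} - \frac{2\,\mathit{mean}(R) \sum_{x \in R} \mathit{target}(x)}{|R| - 1} + \frac{|R|\,\mathit{mean}(R)^2}{|R| - 1}
\end{aligned}$$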
Since all the quantities on the right-hand side of this equation are already known, we
can derive all the statistics needed for the significance test without accessing
the records in R. The lemma is proved.
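
To make the lemma concrete, the following is a minimal Python sketch of the derivation, not the authors' implementation. The names `RuleStats`, `summarize` and `difference_stats` are hypothetical and introduced only for illustration. It assumes the sample variance (denominator n − 1) used in equations (3) and (4), and that coverset(B) is contained in coverset(A).

```python
import math
from dataclasses import dataclass
from statistics import mean, variance


@dataclass(frozen=True)
class RuleStats:
    """Summary statistics of the target attribute over a cover set (hypothetical type)."""
    coverage: int    # number of records covered
    mean: float      # mean of the target attribute
    variance: float  # sample variance (n - 1 denominator), as in equations (3) and (4)


def summarize(values):
    """Compute the summary directly from raw target values (needs at least 2 values)."""
    return RuleStats(len(values), mean(values), variance(values))


def difference_stats(a: RuleStats, b: RuleStats) -> RuleStats:
    """Derive the statistics of R = coverset(A) - coverset(B) from summaries only."""
    n_r = a.coverage - b.coverage                       # equation (1)
    if n_r < 2:
        raise ValueError("sample variance needs at least two records in R")
    sum_r = a.coverage * a.mean - b.coverage * b.mean   # equations (5), (6), (8)
    mean_r = sum_r / n_r                                # equation (2)
    # Sum of squares over each cover set, recovered from variance and mean (equation (7)).
    sumsq_a = a.variance * (a.coverage - 1) + a.mean ** 2 * a.coverage
    sumsq_b = b.variance * (b.coverage - 1) + b.mean ** 2 * b.coverage
    sumsq_r = sumsq_a - sumsq_b
    # Final expansion of variance(R) from the proof.
    var_r = (sumsq_r - 2 * mean_r * sum_r + n_r * mean_r ** 2) / (n_r - 1)
    return RuleStats(n_r, mean_r, var_r)


# Illustrative data: the records covered by B are a subset of those covered by A.
cover_a = [3.0, 5.0, 8.0, 10.0, 4.0, 6.0]
cover_b = [3.0, 5.0, 8.0]
records_r = [10.0, 4.0, 6.0]  # coverset(A) - coverset(B)

derived = difference_stats(summarize(cover_a), summarize(cover_b))
direct = summarize(records_r)

# The derived values agree with a direct pass over R up to rounding error.
assert derived.coverage == direct.coverage
assert math.isclose(derived.mean, direct.mean)
assert math.isclose(derived.variance, direct.variance)
```

Only `difference_stats` reflects the lemma; the direct pass over `records_r` is there solely to check the result. Note that recovering a variance from sums of squares can lose precision through cancellation when the mean is large relative to the spread of the data, so a production implementation may need a more careful formulation.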