Let $\mathrm{obs}(a_{ks}, a_{cj})$ represent the actual observed frequency of $(a_{ks}, a_{cj})$ in $S$. The expression

$$
D = \sum_{j=1}^{q} \frac{\bigl(\mathrm{obs}(a_{ks}, a_{cj}) - \mathrm{exp}(a_{ks}, a_{cj})\bigr)^{2}}{\mathrm{exp}(a_{ks}, a_{cj})}, \qquad (4.51)
$$
summing over the outcomes of $C$ in the contingency table, possesses an asymptotic chi-squared property with $q-1$ degrees of freedom. $D$ can then be used in a criterion for testing the statistical dependency between $a_{ks}$ and $C$ at a presumed significance level, as described below. For this purpose, we define a mapping
$$
h_{k}(a_{ks}, C) =
\begin{cases}
1, & \text{if } D > \chi^{2}_{(q-1)}; \\
0, & \text{otherwise,}
\end{cases} \qquad (4.52)
$$
where $\chi^{2}_{(q-1)}$ is the tabulated chi-squared value. The subset of selected events of $X_k$ that has statistical interdependency with $C$ is defined as

$$
E_{k} = \bigl\{\, a_{ks} \;\big|\; h_{k}(a_{ks}, C) = 1 \,\bigr\}. \qquad (4.53)
$$
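The screening step of Eqs. (4.51)–(4.53) can be sketched as a short routine. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, expected frequencies are taken as the usual row-total-times-column-total over sample size, and the caller supplies the tabulated chi-squared value $\chi^{2}_{(q-1)}$ for the chosen significance level.

```python
from collections import Counter

def covered_events(xs, cs, chi2_crit):
    """Select the events a_ks of X_k that pass the chi-squared
    criterion of Eqs. (4.51)-(4.53).  `xs` and `cs` are parallel
    lists of observed values of X_k and C; `chi2_crit` is the
    tabulated chi-squared value at q-1 degrees of freedom, where
    q is the number of outcomes of C.  Returns the set E_k."""
    n = len(xs)
    obs = Counter(zip(xs, cs))        # obs(a_ks, a_cj), joint counts
    x_marg = Counter(xs)              # marginal counts of X_k
    c_marg = Counter(cs)              # marginal counts of C
    e_k = set()
    for a_ks in x_marg:
        # D sums over the q outcomes a_cj of C (Eq. 4.51),
        # with exp(a_ks, a_cj) the expected frequency under independence.
        d = 0.0
        for a_cj in c_marg:
            exp = x_marg[a_ks] * c_marg[a_cj] / n
            d += (obs[(a_ks, a_cj)] - exp) ** 2 / exp
        if d > chi2_crit:             # h_k(a_ks, C) = 1 (Eq. 4.52)
            e_k.add(a_ks)             # include a_ks in E_k (Eq. 4.53)
    return e_k
```

For a binary $C$ (so $q-1 = 1$ degree of freedom at the 0.05 level, critical value 3.841), a perfectly associated sample selects every event, while an independent one selects none.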
We call $E_k$ the covered event subset of $X_k$ with respect to $C$. Likewise, the covered event subset $E_c$ of $C$ with respect to $X_k$ can be defined. After finding the covered event subsets $E_c$ and $E_k$ for a variable pair $(C, X_k)$, information measures can be used to detect the statistical pattern of these subsets. An interdependence redundancy measure between $X_k$ and $C_k$ can be defined as
$$
R(X_{k}, C_{k}) = \frac{I(X_{k}, C_{k})}{H(X_{k}, C_{k})}, \qquad (4.54)
$$
where $I(X_{k}, C_{k})$ is the expected mutual information and $H(X_{k}, C_{k})$ is the Shannon entropy, defined respectively on $X_k$ and $C_k$:
$$
I(X_{k}, C_{k}) = \sum_{a_{ks} \in E_{k}} \sum_{a_{cu} \in E_{c}} P(a_{cu}, a_{ks}) \log \frac{P(a_{cu}, a_{ks})}{P(a_{cu})\, P(a_{ks})} \qquad (4.55)
$$
and
$$
H(X_{k}, C_{k}) = - \sum_{a_{ks} \in E_{k}} \sum_{a_{cu} \in E_{c}} P(a_{cu}, a_{ks}) \log P(a_{cu}, a_{ks}). \qquad (4.56)
$$
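Eqs. (4.54)–(4.56) can be sketched as follows. This is an illustrative estimator, not the authors' code: probabilities are estimated by relative frequency (an assumption, since the text does not fix the estimator), zero joint cells are skipped under the convention $0 \log 0 = 0$, and the function name is hypothetical.

```python
from collections import Counter
from math import log

def redundancy(xs, cs, e_k, e_c):
    """Interdependence redundancy R(X_k, C_k) = I / H computed over
    the covered event subsets E_k and E_c (Eqs. 4.54-4.56).  `xs` and
    `cs` are parallel lists of observed values of X_k and C."""
    n = len(xs)
    joint = Counter(zip(xs, cs))      # joint counts of (a_ks, a_cu)
    px = Counter(xs)                  # marginal counts of X_k
    pc = Counter(cs)                  # marginal counts of C
    i_val = 0.0                       # expected mutual information, Eq. (4.55)
    h_val = 0.0                       # joint Shannon entropy, Eq. (4.56)
    for a_ks in e_k:
        for a_cu in e_c:
            p = joint[(a_ks, a_cu)] / n
            if p == 0.0:
                continue              # 0 log 0 is taken as 0
            i_val += p * log(p / ((px[a_ks] / n) * (pc[a_cu] / n)))
            h_val -= p * log(p)
    return i_val / h_val              # Eq. (4.54)
```

Because $R$ normalizes $I$ by the joint entropy, a perfectly dependent pair yields $R = 1$ and an independent pair yields $R = 0$.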
The interdependence redundancy measure has a chi-squared distribution:
$$
R(X_{k}, C_{k}) = \frac{I(X_{k}, C_{k})}{H(X_{k}, C_{k})} \approx \frac{\chi^{2}_{df}}{2\,|S|\,H(X_{k}, C_{k})}. \qquad (4.57)
$$
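The relation in Eq. (4.57) yields a simple significance check: the pair is judged interdependent when the observed $R$ exceeds the threshold $\chi^{2}_{df} / (2|S|\,H)$. A minimal sketch, assuming the caller supplies the tabulated chi-squared critical value (e.g. from a table or `scipy.stats.chi2.ppf`); the function name is hypothetical.

```python
def is_interdependent(r, chi2_crit, n, h):
    """Decide interdependence via Eq. (4.57): since 2|S| I(X_k, C_k)
    is asymptotically chi-squared, compare the observed redundancy
    `r` = R(X_k, C_k) against chi2_crit / (2 n h), where `n` = |S|
    is the sample size and `h` = H(X_k, C_k) the joint entropy."""
    return r > chi2_crit / (2 * n * h)
```

For example, with $|S| = 20$, $H \approx 0.6931$ (one bit in nats), and the 0.05-level critical value 3.841 at one degree of freedom, the threshold is about 0.139, so $R = 1$ passes and $R = 0.05$ does not.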