Data Preparation Basic Models - Data Preprocessing in Data Mining


Automatic approaches to data integration can be found in the literature, ranging from techniques that find and match the schemas of the data [7, 8] to automatic procedures that reconcile different schemas [6].

3.2.1 Finding Redundant Attributes

Redundancy is a problem that should be avoided as much as possible. It usually increases the size of the data set, which in turn increases the modeling time of DM algorithms, and it may also induce overfitting in the resulting model. An attribute is redundant when it can be derived from another attribute or from a set of them. Inconsistencies in dimension or attribute names can cause redundancies as well.

Redundancies in attributes can be detected using correlation analysis. Such analysis measures how strongly one attribute implies the other. When the data is nominal, and the set of values is therefore finite, the $\chi^2$ (chi-squared) test is commonly applied. For numeric attributes, the correlation coefficient and the covariance are typically used.
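For the numeric case, the following is a minimal sketch of how the covariance and the correlation coefficient could be computed. It assumes the sample covariance (dividing by n - 1) and the Pearson correlation coefficient; the function names and the temperature data are hypothetical and chosen only for illustration.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equally long numeric sequences (assumes n - 1 divisor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson_correlation(x, y):
    """Pearson correlation coefficient: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical redundant pair: the same temperature stored in two units.
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

# A coefficient close to +1 or -1 suggests one attribute can be derived
# (linearly) from the other, so one of them is a candidate for removal.
print(pearson_correlation(celsius, fahrenheit))
```

Because the second attribute here is an exact linear function of the first, the coefficient comes out at essentially 1, which is the kind of signal that would flag the pair as redundant.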

3.2.1.1 Correlation Test

Suppose that two nominal attributes, $A$ and $B$, contain $c$ and $r$ distinct values each, namely $a_1, \ldots, a_c$ and $b_1, \ldots, b_r$. We can check the correlation between them using the $\chi^2$ test. In order to do so, a contingency table is created with the joint events $(A_i, B_j)$, in which attribute $A$ takes the value $a_i$ and attribute $B$ takes the value $b_j$. Every possible joint event $(A_i, B_j)$ has its own entry in the table. The $\chi^2$ value (or Pearson $\chi^2$ statistic) is computed as:

$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \qquad (3.1)$$

where $o_{ij}$ is the observed frequency of the joint event $(A_i, B_j)$, and $e_{ij}$ is the expected frequency of $(A_i, B_j)$, computed as:

$$e_{ij} = \frac{\mathrm{count}(a_i) \times \mathrm{count}(b_j)}{m} \qquad (3.2)$$

where $m$ is the number of instances in the data set, $\mathrm{count}(a_i)$ is the number of instances with the value $a_i$ for attribute $A$, and $\mathrm{count}(b_j)$ is the number of instances having the value $b_j$ for attribute $B$.
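The computation in Eqs. (3.1) and (3.2) can be sketched as follows: count how often each value and each joint event occurs, derive every expected frequency as the product of the two value counts divided by m, and sum the squared deviations scaled by the expected frequency. This is an illustrative sketch only; the function name `chi_squared` and the toy attribute values are assumptions, not part of the original text.

```python
from collections import Counter


def chi_squared(a_values, b_values):
    """Pearson chi-squared statistic for two nominal attributes of equal length."""
    m = len(a_values)                            # number of instances
    count_a = Counter(a_values)                  # count(a_i) for each value of A
    count_b = Counter(b_values)                  # count(b_j) for each value of B
    observed = Counter(zip(a_values, b_values))  # o_ij for each joint event

    statistic = 0.0
    for a_i in count_a:
        for b_j in count_b:
            e_ij = count_a[a_i] * count_b[b_j] / m  # expected frequency, Eq. (3.2)
            o_ij = observed[(a_i, b_j)]             # 0 if the pair never occurs
            statistic += (o_ij - e_ij) ** 2 / e_ij  # one term of Eq. (3.1)
    return statistic


# Hypothetical data set with m = 8 instances and two values per attribute.
attr_a = ["x", "x", "x", "x", "y", "y", "y", "y"]
attr_b = ["p", "p", "p", "q", "p", "q", "q", "q"]

print(chi_squared(attr_a, attr_b))
```

In this toy data every expected frequency is 4 × 4 / 8 = 2, while the observed frequencies are 3, 1, 1 and 3, so each of the four cells contributes 1/2 to the sum. A larger statistic indicates a stronger departure from independence between the two attributes.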

The $\chi^2$ test checks the hypothesis that $A$ and $B$ are independent, with $(r - 1) \times (c - 1)$ degrees of freedom. The $\chi^2$ statistic obtained in Eq. (3.1) is compared against any
