Automatic approaches used to integrate the data can be found in the literature,
from techniques that match and find the schemas of the data [7, 8], to automatic
procedures that reconcile different schemas [6].
3.2.1 Finding Redundant Attributes
Redundancy is a problem that should be avoided as much as possible. It usually
increases the data set size, which in turn increases the modeling time of DM
algorithms, and it may also induce overfitting in the obtained model. An attribute
is redundant when it can be derived from another attribute or set of attributes.
Inconsistencies in dimension or attribute names can cause redundancies as well.
Redundancies in attributes can be detected using correlation analysis. By means
of such analysis we can measure how strongly one attribute implies the other.
When the data is nominal and the set of values is thus finite, the $\chi^2$
(chi-squared) test is commonly applied. For numeric attributes, the correlation
coefficient and the covariance are typically used.
3.2.1.1 $\chi^2$ Correlation Test
Suppose that two nominal attributes, $A$ and $B$, contain $c$ and $r$ distinct
values each, namely $a_1, \ldots, a_c$ and $b_1, \ldots, b_r$. We can check the
correlation between them using the $\chi^2$ test. In order to do so, a contingency
table is created with the joint events $(A_i, B_j)$, in which attribute $A$ takes
the value $a_i$ and attribute $B$ takes the value $b_j$. Every possible joint
event $(A_i, B_j)$ has its own entry in the table. The $\chi^2$ value (or Pearson
$\chi^2$ statistic) is computed as:

$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \tag{3.1}$$

where $o_{ij}$ is the observed frequency of the joint event $(A_i, B_j)$, and
$e_{ij}$ is the expected frequency of $(A_i, B_j)$, computed as:

$$e_{ij} = \frac{\operatorname{count}(A = a_i) \times \operatorname{count}(B = b_j)}{m}, \tag{3.2}$$

where $m$ is the number of instances in the data set, $\operatorname{count}(A = a_i)$
is the number of instances with the value $a_i$ for attribute $A$, and
$\operatorname{count}(B = b_j)$ is the number of instances having the value $b_j$
for attribute $B$.
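The computation in Eqs. (3.1) and (3.2) can be sketched in a few lines of Python. This is only an illustrative sketch that uses the standard library; the function name chi_squared and the toy attribute values are assumptions made for the example and do not come from the text.

```python
from collections import Counter


def chi_squared(a_values, b_values):
    """Pearson chi-squared statistic for two nominal attributes.

    a_values and b_values hold the values of attributes A and B,
    one entry per instance (hypothetical helper, for illustration only).
    """
    if len(a_values) != len(b_values):
        raise ValueError("both attributes need one value per instance")

    m = len(a_values)                             # number of instances
    count_a = Counter(a_values)                   # count(A = a_i)
    count_b = Counter(b_values)                   # count(B = b_j)
    observed = Counter(zip(a_values, b_values))   # o_ij per joint event

    chi2 = 0.0
    for a_i in count_a:
        for b_j in count_b:
            e_ij = count_a[a_i] * count_b[b_j] / m   # Eq. (3.2)
            o_ij = observed[(a_i, b_j)]              # 0 if the pair never occurs
            chi2 += (o_ij - e_ij) ** 2 / e_ij        # one term of Eq. (3.1)
    return chi2


# Toy data (assumed): B is fully determined by A, so the statistic is large.
dependent = chi_squared(["x", "x", "y", "y"], ["p", "p", "q", "q"])
# Toy data (assumed): every joint event occurs exactly as often as expected.
independent = chi_squared(["x", "x", "y", "y"], ["p", "q", "p", "q"])
print(dependent, independent)  # prints: 4.0 0.0
```

Iterating over every pair of distinct values, instead of only over the observed pairs, matters here: joint events that never occur still contribute a term with $o_{ij} = 0$ to the sum.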
The $\chi^2$ test checks the hypothesis that $A$ and $B$ are independent, with
$(r - 1) \times (c - 1)$ degrees of freedom. The $\chi^2$ statistic obtained in
Eq. (3.1) is compared against any
 
 