negative. When the two variables are independent, it holds that
$E(A \cdot B) = E(A) \cdot E(B)$, and thus the covariance satisfies
$$\mathrm{Cov}(A, B) = E(A \cdot B) - \bar{A}\,\bar{B} = E(A) \cdot E(B) - \bar{A}\,\bar{B} = 0. \tag{3.6}$$
The reader must be cautious: having $\mathrm{Cov}(A, B) = 0$ does not imply that the two
attributes are independent, since some random variables have a covariance of 0 and are
still dependent. Additional assumptions (for example, that the data follow a multivariate
normal distribution) are necessary to conclude from a covariance of 0 that the two
attributes are independent.
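A small worked example may make this warning concrete. The sketch below uses a hypothetical pair of attributes (not taken from the text): A takes the values -1, 0 and 1 with equal probability, and B = A². B is completely determined by A, yet the covariance is exactly 0.

```python
from fractions import Fraction

def expectation(values, probs):
    """Expected value of a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))

# Hypothetical attributes: A is uniform on {-1, 0, 1} and B = A**2.
a_values = [-1, 0, 1]
probs = [Fraction(1, 3)] * 3
b_values = [a * a for a in a_values]

e_a = expectation(a_values, probs)
e_b = expectation(b_values, probs)
e_ab = expectation([a * b for a, b in zip(a_values, b_values)], probs)

# Cov(A, B) = E(A*B) - E(A)*E(B)
cov = e_ab - e_a * e_b
print(cov)  # 0

# Independence would require P(A=0, B=0) == P(A=0) * P(B=0).
p_joint = Fraction(1, 3)                      # B = 0 exactly when A = 0
p_product = Fraction(1, 3) * Fraction(1, 3)   # P(A=0) * P(B=0)
print(p_joint == p_product)  # False: A and B are dependent
```

Exact fractions are used instead of floats so that the zero covariance is not blurred by rounding.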
3.2.2 Detecting Tuple Duplication and Inconsistency
Once the tuples have been obtained, it is worth checking that none of them is
duplicated. One source of duplication is the use of denormalized tables, which are
sometimes used to speed up processes involving join operations.
Duplicate tuples are troublesome: they not only waste space and computing time for
the DM algorithm, but can also be a source of inconsistency. Due to errors in the
entry process, differences in some attribute values (for example, the identifier
value) may produce repeated instances that are otherwise identical but are treated
as different. Such instances are harder to detect than exact duplicates, which can be
found by simply scanning the data set.
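As a minimal sketch of this idea, the following code groups instances that agree on every attribute except the identifier. The function name `find_duplicates`, the attribute names and the sample rows are all hypothetical, and the approach assumes it is already known which attribute is the unreliable identifier.

```python
from collections import defaultdict

def find_duplicates(rows, ignore=("id",)):
    """Group row indices that agree on every attribute not listed in `ignore`."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        # Build a hashable key from the attributes that should match.
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]

# Hypothetical data: rows 0 and 2 differ only in their identifier.
rows = [
    {"id": 1, "name": "Ann", "city": "Leeds"},
    {"id": 2, "name": "Bob", "city": "York"},
    {"id": 7, "name": "Ann", "city": "Leeds"},
]
duplicate_groups = find_duplicates(rows)
print(duplicate_groups)  # [[0, 2]]
```

A plain scan for exact duplicates corresponds to calling the function with `ignore=()`, which would report nothing for this data.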
Note that the duplication is sometimes subtle. For example, if the information
comes from different sources, the systems of measurement may differ as well, so
that some instances are actually the same but are not identified as such: their
values may be expressed in the metric system in one source and in the imperial
system in another, resulting in a not-so-obvious duplication. The instances may also
be inconsistent if attribute values fall outside the established range (usually indicated
in the metadata associated with the data set), but this condition is easy to check.
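The two checks in this paragraph can be sketched as follows, assuming a hypothetical height attribute recorded in centimetres in one source and in inches in another. The helper names, the tolerance and the valid range of 50 to 250 cm are illustrative assumptions, not values from the text.

```python
INCH_TO_CM = 2.54  # exact by definition of the international inch

def to_cm(value, unit):
    """Convert a length to centimetres; only 'cm' and 'in' are handled here."""
    if unit == "cm":
        return float(value)
    if unit == "in":
        return value * INCH_TO_CM
    raise ValueError(f"unknown unit: {unit}")

def same_measurement(a, b, tol=0.05):
    """True if two (value, unit) pairs describe the same length within `tol` cm."""
    return abs(to_cm(*a) - to_cm(*b)) <= tol

def in_range(value_cm, low=50.0, high=250.0):
    """Range check against bounds that would normally come from the metadata."""
    return low <= value_cm <= high

# The same height reported by two sources in different unit systems.
print(same_measurement((177.8, "cm"), (70, "in")))  # True
# A likely entry error: the value falls outside the assumed range.
print(in_range(to_cm(1778, "cm")))                  # False
```

Converting every source to a common unit before comparing instances is the design choice here; the tolerance absorbs rounding in the original records.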
One of the most common sources of mismatches in the instances is the nominal
attributes [ 9 ]. Analyzing the similarity between nominal attributes is not trivial, as
distance functions cannot be applied in a straightforward way and several alternatives
exist. Several character-based distance measures for nominal values can be found
in the literature. These can be helpful to determine whether two nominal values
are similar (even with entry errors) or different [ 9 ]:
The edit distance [ 23 ] between two strings $\sigma_1$ and $\sigma_2$ is the minimum number
of string operations (or edit operations) needed to convert one string into the other.
Three types of edit operations are usually considered: inserting a character, replacing
a character or deleting a character. The number of operations can be computed using
dynamic programming. Modern versions of this distance measure establish
different costs for each edit operation, depending on the characters involved [ 31 ].
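The dynamic programming computation of the edit distance can be sketched as below. This is a minimal version that assumes a unit cost for each of the three edit operations; the character-dependent costs of [ 31 ] are not modelled, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions and replacements turning s1 into s2."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to the empty string: i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # replace c1 by c2, or keep it if equal
            ))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("Madrid", "Madird"))   # 2
```

Only two rows of the dynamic programming table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths. A small distance relative to the string lengths, as in the second call, suggests an entry error rather than a genuinely different nominal value.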
 
 