Data Preparation Basic Models - Data Preprocessing in Data Mining


negative. When the two variables are independent, $E(A \cdot B) = E(A) \cdot E(B)$ holds, and thus the covariance verifies

$$\mathrm{Cov}(A, B) = E(A \cdot B) - \bar{A}\,\bar{B} = E(A) \cdot E(B) - \bar{A}\,\bar{B} = 0. \tag{3.6}$$

The reader must be cautious: $\mathrm{Cov}(A, B) = 0$ does not imply that the two attributes are independent, since some random variables have a covariance of 0 and are still dependent. Additional assumptions (for example, that the data follow a multivariate normal distribution) are necessary to conclude from a covariance of 0 that the two attributes are independent.
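A small worked case may make this warning concrete. The sketch below uses a toy distribution that is not taken from the text: A is assumed to be uniform on {-1, 0, 1} and B is defined as A squared, so B depends entirely on A. Exact fractions are used so that the covariance is not blurred by rounding.

```python
from fractions import Fraction

# Hypothetical toy distribution (an assumption for illustration):
# A is uniform on {-1, 0, 1} and B = A^2, so B is fully determined by A.
values_a = [-1, 0, 1]
prob = Fraction(1, 3)

def expectation(f):
    """Expected value of f(A) under the uniform distribution above."""
    return sum(prob * f(a) for a in values_a)

mean_a = expectation(lambda a: a)           # E(A)
mean_b = expectation(lambda a: a * a)       # E(B), with B = A^2
mean_ab = expectation(lambda a: a * a * a)  # E(A * B) = E(A^3)

# Covariance as in Eq. (3.6): E(A * B) minus the product of the means.
cov_ab = mean_ab - mean_a * mean_b

# Independence would require P(A=0, B=0) == P(A=0) * P(B=0).
p_joint = prob           # A = 0 forces B = 0
p_product = prob * prob  # P(A=0) * P(B=0), since B = 0 only when A = 0

print("Cov(A, B) =", cov_ab)
print("independent?", p_joint == p_product)
```

The covariance is exactly 0, yet the joint probability differs from the product of the marginals, so the two variables are dependent.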

3.2.2 Detecting Tuple Duplication and Inconsistency

Once the tuples have been obtained, it is worth checking that none of them is duplicated. One source of duplication is the use of denormalized tables, which are sometimes used to speed up processes involving join operations.

Duplicate tuples can be troublesome: they waste space and computing time for the DM algorithm, and they can also be a source of inconsistency. Due to errors in the entry process, differences in some attribute values (for example, the identifier value) may produce repeated instances that are otherwise identical but are considered different. Such instances are harder to detect than exact copies, which can be found by simply scanning the data set for duplicate instances.
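The difference between exact and near-exact duplicates can be sketched as follows. This is a minimal illustration rather than a prescribed method: the helper `find_duplicates`, the attribute names, and the sample rows are all hypothetical, and it assumes attribute values are hashable.

```python
from collections import defaultdict

def find_duplicates(rows, ignore=("id",)):
    """Group the indices of rows that agree on every attribute not in `ignore`."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        # Build an order-independent key from the attributes being compared.
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]

# Hypothetical data: rows 0 and 2 describe the same entity under different ids.
rows = [
    {"id": 1, "name": "Ann", "city": "Jaen"},
    {"id": 2, "name": "Bob", "city": "Granada"},
    {"id": 3, "name": "Ann", "city": "Jaen"},
]

exact = find_duplicates(rows, ignore=())  # plain scan, identifier included
relaxed = find_duplicates(rows)           # identifier left out of the comparison

print(exact)
print(relaxed)
```

The plain scan finds nothing because the identifiers differ, while leaving the identifier out of the comparison exposes the repeated instance.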

Please note that duplication is sometimes subtle. For example, if the information comes from different sources, the systems of measurement may differ as well, so that some instances are actually the same but are not identified as such. Their values may be expressed in the metric system in one source and in the imperial system in another, resulting in a not-so-obvious duplication. The instances may also be inconsistent if attribute values fall outside the established range (usually indicated in the metadata associated with the data set), but this condition is easy to check.
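As a rough sketch of both checks, the snippet below converts lengths to a common unit before comparing them and then tests a value against a range. The unit labels, the tolerance, and the range limits are illustrative assumptions; in practice the range would come from the data set's metadata.

```python
INCH_TO_CM = 2.54  # exact by definition of the international inch

def to_cm(value, unit):
    """Convert a length to centimetres; only 'cm' and 'in' are handled here."""
    if unit == "cm":
        return value
    if unit == "in":
        return value * INCH_TO_CM
    raise ValueError(f"unknown unit: {unit!r}")

def same_measurement(a, b, tol=0.5):
    """True if two (value, unit) pairs agree within `tol` centimetres."""
    return abs(to_cm(*a) - to_cm(*b)) <= tol

def in_range(value_cm, low=50.0, high=250.0):
    """Range check; the limits are hypothetical stand-ins for metadata values."""
    return low <= value_cm <= high

# The same height recorded by two sources in different systems.
print(same_measurement((180.0, "cm"), (70.9, "in")))
# A value outside the assumed valid range, e.g. a misplaced decimal point.
print(in_range(to_cm(1800.0, "cm")))
```

A tolerance is needed because values converted between systems rarely match exactly after rounding at the source.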

Nominal attributes are one of the most common sources of mismatches between instances [ 9 ]. Analyzing the similarity between nominal attributes is not trivial, as distance functions cannot be applied in a straightforward way and several alternatives exist. Several character-based distance measures for nominal values can be found in the literature. These can help to determine whether two nominal values are similar (even with entry errors) or different [ 9 ]:

• The edit distance [ 23 ] between two strings $\sigma_1$ and $\sigma_2$ is the minimum number of string operations (or edit operations) needed to convert one string into the other. Three types of edit operations are usually considered: inserting a character, replacing a character, or deleting a character. The number of operations can be computed using dynamic programming. Modern versions of this distance measure assign different costs to each edit operation, depending on the characters involved [ 31 ].
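The dynamic programming computation of the edit distance can be sketched as below. This is one common formulation with a unit cost for every operation, not necessarily the exact variant of [ 23 ]; the function name and the sample strings are illustrative.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions and replacements turning s1 into s2."""
    # previous[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2; initially that prefix is empty.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into the empty string: i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # replace c1 by c2, free if equal
            ))
        previous = current
    return previous[-1]

# A nominal value with an entry error is one edit away from the correct one.
print(edit_distance("Granada", "Grananda"))
print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths. Character-dependent costs, as in the modern versions mentioned above, could be introduced by replacing the constant 1 and the `(c1 != c2)` term with cost lookups.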
