The edit distance does not work well when string values have been truncated. The
affine gap distance adds two edit operations, opening a gap and extending a gap,
with the opening operation costing more than the extension. With these operations,
truncated values can be matched with their original full-string counterparts.
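As a concrete illustration, the affine gap idea can be sketched with a Gotoh-style dynamic program. The cost values (`mismatch`, `gap_open`, `gap_extend`) below are illustrative choices, not prescribed by the text; the point is that a gap of length k costs gap_open + (k − 1)·gap_extend, so one long truncation gap is much cheaper than k independent unit-cost deletions:

```python
import math

def affine_gap_distance(s1, s2, mismatch=1.0, gap_open=1.0, gap_extend=0.1):
    """Gotoh-style affine gap edit distance (illustrative costs)."""
    n, m = len(s1), len(s2)
    INF = math.inf
    # M: last operation is a (mis)match; X: gap in s2; Y: gap in s1.
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):           # s1 aligned against an opening gap
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):           # s2 aligned against an opening gap
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = cost + min(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
            # Opening a gap is costly; extending an existing one is cheap.
            X[i][j] = min(M[i-1][j] + gap_open, X[i-1][j] + gap_extend,
                          Y[i-1][j] + gap_open)
            Y[i][j] = min(M[i][j-1] + gap_open, Y[i][j-1] + gap_extend,
                          X[i][j-1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])
```

With these costs, matching the truncated "John S" against "John Smith" costs only one gap opening plus three extensions (1.3), whereas the plain edit distance would charge four full deletions.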
Jaro introduced in [17] a string comparison algorithm aimed mainly at comparing
first and last names in census data. It uses the string lengths |σ1| and |σ2|, the
number c of characters common to both strings, and the number t of transpositions,
obtained by comparing the ordering of the common characters in the two strings
and counting each mismatch in that order as a transposition. The Jaro value is
computed as

    Jaro(σ1, σ2) = (1/3) (c/|σ1| + c/|σ2| + (c − t/2)/c).        (3.7)
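Eq. (3.7) can be sketched directly in Python; the matching window of max(|σ1|, |σ2|)/2 − 1 positions used below follows Jaro's usual definition of which characters count as common:

```python
def jaro(s1, s2):
    """Jaro similarity following Eq. (3.7); a plain sketch, not optimized."""
    if not s1 or not s2:
        return 0.0
    # Characters count as common if equal and within half the longer length.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    c = 0  # number of common characters
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    # t counts order mismatches between the common characters of each string.
    common1 = [ch for ch, m in zip(s1, match1) if m]
    common2 = [ch for ch, m in zip(s2, match2) if m]
    t = sum(a != b for a, b in zip(common1, common2))
    return (c / len(s1) + c / len(s2) + (c - t / 2) / c) / 3
```

For the classic census example "MARTHA" vs. "MARHTA", all six characters are common and two are out of order, giving (1 + 1 + 5/6)/3 ≈ 0.944.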
Originally formulated for speech recognition [22], q-grams are substrings of
length q that are commonly shared between strings σ1 and σ2. From a string, the
q-grams are obtained by sliding a window of size q over the string, padding the
beginning and the end, if needed, with a special character that is shared by both
strings and does not appear in the alphabet of the original encoding. By means of
hashing algorithms, matching between q-grams can be sped up [12].
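The padded q-gram extraction described above can be sketched as follows; the pad character `#` and the Jaccard-style overlap score are illustrative choices, assuming `#` does not occur in the original alphabet:

```python
from collections import Counter

def qgrams(s, q=2, pad="#"):
    """All q-grams of s, padded with q-1 copies of a character
    assumed not to appear in the original alphabet."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def qgram_similarity(s1, s2, q=2):
    """Jaccard overlap of the two q-gram multisets (one of several
    possible ways to turn shared q-grams into a score)."""
    g1, g2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    shared = sum((g1 & g2).values())   # multiset intersection
    total = sum((g1 | g2).values())    # multiset union
    return shared / total if total else 1.0
```

For example, the 2-grams of "abc" are #a, ab, bc, c#; "hello" and "help" share the 2-grams #h, he, el out of eight distinct ones, scoring 3/8.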
Although these measures work well for typographical errors, they fail with rearrange-
ments of the string (for example, placing the last name before the first name in some
of the strings). Several options are available to overcome this problem. Token-based
similarity metrics are devised to compensate for it; measures based on atomic
strings [26] or the WHIRL distance [4] are examples of these. Alternatives based
on phonetic similarities can also be found. They try to match two strings considering
how they are pronounced instead of how they are encoded. Metaphone [27] and
ONCA [11] are examples of phonetic-based string matching algorithms. Note that
these algorithms are tied to the particular language used to read the strings and are
also very dependent on dialects.
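A minimal token-based sketch (plain Jaccard similarity over whitespace tokens, not the atomic-strings measure [26] or the WHIRL distance [4] themselves) shows how rearranging first and last names stops mattering:

```python
def token_jaccard(s1, s2):
    """Order-insensitive similarity: Jaccard over lowercased
    whitespace-separated tokens."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    return len(t1 & t2) / len(t1 | t2)
```

Here "John Smith" and "Smith John" score 1.0, whereas a character-level edit distance would treat the swap as a long chain of edits.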
Detecting similarities in numeric data is harder. Some authors encode the
numbers as strings or use range comparisons (numbers within a range threshold are
considered equivalent). However, these approaches are quite naive. More sophis-
ticated techniques have been proposed, such as considering the distribution of the
numeric data [21] or extending the cosine similarity metric used in WHIRL to
numeric data [1]. Nevertheless, discrepancies in numeric data are often detected
in the data cleaning step, as will be shown in the next section, thanks to outlier or
noise detection algorithms.
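The naive range comparison mentioned above amounts to a one-line test; the 5% relative threshold below is an arbitrary illustrative choice, not a value taken from the text:

```python
def numbers_match(a, b, rel_tol=0.05):
    """Naive range comparison: two numbers are considered equivalent
    when they differ by at most rel_tol relative to the larger one.
    The 5% default is an illustrative, domain-dependent choice."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-12)
```

Under this rule 100 and 103 match while 100 and 120 do not, which is exactly the kind of coarse, threshold-sensitive behavior that motivates the distribution-based alternatives.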
Having presented some measures to detect duplicity in each attribute of an
instance, the next step is to describe procedures that establish whether two instances
are duplicates. There are different approaches to this task, which can be summarized
as follows:
Probabilistic approaches. Fellegi and Sunter [10] formulated the duplicate instance
detection problem as a Bayesian inference problem, and thus the Fellegi-Sunter
model is the most widely used in probabilistic approaches. If the density func-
 
 