The edit distance does not work well when string values have been truncated. The
affine gap distance adds two edit operations, opening a gap and extending a gap,
with the opening operation costing more than the extension. With these operations,
truncated values can be matched with their original full-string counterparts.
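As a concrete illustration, the affine gap idea can be sketched with a Gotoh-style dynamic program. The cost values (`mismatch`, `gap_open`, `gap_extend`) below are illustrative choices, not prescribed by the text; the point is that a gap of length k costs gap_open + (k − 1)·gap_extend, so one long truncation gap is much cheaper than k independent unit-cost deletions:

```python
import math

def affine_gap_distance(s1, s2, mismatch=1.0, gap_open=1.0, gap_extend=0.1):
    """Gotoh-style affine gap edit distance (illustrative costs)."""
    n, m = len(s1), len(s2)
    INF = math.inf
    # M: last operation is a (mis)match; X: gap in s2; Y: gap in s1.
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):           # s1 aligned against an opening gap
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):           # s2 aligned against an opening gap
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = cost + min(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
            # Opening a gap is costly; extending an existing one is cheap.
            X[i][j] = min(M[i-1][j] + gap_open, X[i-1][j] + gap_extend,
                          Y[i-1][j] + gap_open)
            Y[i][j] = min(M[i][j-1] + gap_open, Y[i][j-1] + gap_extend,
                          X[i][j-1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])
```

With these costs, matching the truncated "John S" against "John Smith" costs only one gap opening plus three extensions (1.3), whereas the plain edit distance would charge four full deletions.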
Jaro introduced in [17] a string comparison algorithm aimed mainly at comparing
first and last names in census data. It uses the string lengths |σ1| and |σ2|, the
number c of characters common to both strings, and the number t of transpositions,
obtained by comparing the ordering of the common characters in the two strings
and counting each mismatch in that order as a transposition. The Jaro value is
computed as

    Jaro(σ1, σ2) = (1/3) (c/|σ1| + c/|σ2| + (c − t/2)/c).        (3.7)
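Eq. (3.7) can be sketched directly in Python; the matching window of max(|σ1|, |σ2|)/2 − 1 positions used below follows Jaro's usual definition of which characters count as common:

```python
def jaro(s1, s2):
    """Jaro similarity following Eq. (3.7); a plain sketch, not optimized."""
    if not s1 or not s2:
        return 0.0
    # Characters count as common if equal and within half the longer length.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    c = 0  # number of common characters
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    # t counts order mismatches between the common characters of each string.
    common1 = [ch for ch, m in zip(s1, match1) if m]
    common2 = [ch for ch, m in zip(s2, match2) if m]
    t = sum(a != b for a, b in zip(common1, common2))
    return (c / len(s1) + c / len(s2) + (c - t / 2) / c) / 3
```

For the classic census example "MARTHA" vs. "MARHTA", all six characters are common and two are out of order, giving (1 + 1 + 5/6)/3 ≈ 0.944.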
Originally formulated for speech recognition [22], q-grams are substrings of
length q that are commonly shared between strings σ1 and σ2. From a string, the
q-grams are obtained by sliding a window of size q over the string, padding the
beginning and the end, if needed, with a special character that is shared by both
strings and does not appear in the alphabet of the original encoding. By means of
hashing algorithms, matching between q-grams can be sped up [12].
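The padded q-gram extraction described above can be sketched as follows; the pad character `#` and the Jaccard-style overlap score are illustrative choices, assuming `#` does not occur in the original alphabet:

```python
from collections import Counter

def qgrams(s, q=2, pad="#"):
    """All q-grams of s, padded with q-1 copies of a character
    assumed not to appear in the original alphabet."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def qgram_similarity(s1, s2, q=2):
    """Jaccard overlap of the two q-gram multisets (one of several
    possible ways to turn shared q-grams into a score)."""
    g1, g2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    shared = sum((g1 & g2).values())   # multiset intersection
    total = sum((g1 | g2).values())    # multiset union
    return shared / total if total else 1.0
```

For example, the 2-grams of "abc" are #a, ab, bc, c#; "hello" and "help" share the 2-grams #h, he, el out of eight distinct ones, scoring 3/8.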
Although these measures work well for typographical errors, they fail with rearrange-
ments of the string (for example, placing the last name before the first name in some
of the strings). Several options are available to overcome this problem. Token-based
similarity metrics are devised to compensate for it; measures based on atomic
strings [26] or the WHIRL distance [4] are examples of these. Alternatives based
on phonetic similarities can also be found. They try to match two strings considering
how they are pronounced instead of how they are encoded. Metaphone [27] and
ONCA [11] are examples of phonetic-based string matching algorithms. Note that
these algorithms are tied to the particular language used to read the strings and are
also very dependent on dialects.
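A minimal token-based sketch (plain Jaccard similarity over whitespace tokens, not the atomic-strings measure [26] or the WHIRL distance [4] themselves) shows how rearranging first and last names stops mattering:

```python
def token_jaccard(s1, s2):
    """Order-insensitive similarity: Jaccard over lowercased
    whitespace-separated tokens."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    return len(t1 & t2) / len(t1 | t2)
```

Here "John Smith" and "Smith John" score 1.0, whereas a character-level edit distance would treat the swap as a long chain of edits.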
Detecting similarities in numeric data is harder. Some authors encode the
numbers as strings or use range comparisons (numbers within a range threshold are
considered equivalent). However, these approaches are quite naive. More sophis-
ticated techniques have been proposed, such as considering the distribution of the
numeric data [21] or extending the cosine similarity metric used in WHIRL to
numeric data [1]. Nevertheless, discrepancies in numeric data are often detected
in the data cleaning step, as will be shown in the next section, thanks to outlier or
noise detection algorithms.
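The naive range comparison mentioned above amounts to a one-line test; the 5% relative threshold below is an arbitrary illustrative choice, not a value taken from the text:

```python
def numbers_match(a, b, rel_tol=0.05):
    """Naive range comparison: two numbers are considered equivalent
    when they differ by at most rel_tol relative to the larger one.
    The 5% default is an illustrative, domain-dependent choice."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-12)
```

Under this rule 100 and 103 match while 100 and 120 do not, which is exactly the kind of coarse, threshold-sensitive behavior that motivates the distribution-based alternatives.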
Having presented some measures to detect duplicity in each attribute of an
instance, the next step is to describe procedures that establish whether two instances
are duplicates. There are different approaches to this task, which can be summarized
as follows:
Probabilistic approaches. Fellegi and Sunter [10] formulated the duplicate instance
detection problem as a Bayesian inference problem, and thus the Fellegi-Sunter
model is the most widely used in probabilistic approaches. If the density func-
 
 