these cases. More complex multi-lingual cases, such as John being used interchangeably with Sean (Irish) or Jean (French), are often simply ignored.
One of the best name matching algorithms, in terms of performance and robustness, identified by both Cohen and Christen in their separate studies is the Jaro-Winkler algorithm [48]. This is an extension of the algorithm Jaro proposed in [38]. The Jaro-Winkler algorithm starts with the computation of the Jaro measure, then adjusts the value to reward agreement in the prefix and to reduce the disagreement penalty for characters that look similar, such as "1" and "l".
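The two-stage computation described above can be sketched as follows. This is a minimal illustration of the Jaro measure with Winkler's prefix boost, using the commonly cited scaling factor p = 0.1 and a prefix length capped at four; it omits the similar-character adjustment mentioned above.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    # Count characters of s1 that match a character of s2
    # within the sliding window.
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count half-transpositions: matched characters that appear
    # in a different order in the two strings.
    k = t = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro measure boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

For example, "MARTHA" and "MARHTA" share all six characters with one transposition and a three-character common prefix, giving a Jaro-Winkler value of roughly 0.961.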
Field comparison functions for numeric values are not as advanced as those for
string functions [30]. Numeric fields can be treated as strings and compared using
string distance functions. Alternatively, the percentage difference between the fields can be used as a normalised difference measure [22].
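A simple way to turn the percentage difference into a normalised similarity is sketched below; the exact normalisation used in [22] may differ, and the choice of denominator here (the larger absolute value) is one common convention.

```python
def numeric_similarity(a: float, b: float) -> float:
    """Normalised similarity in [0, 1] based on percentage difference."""
    if a == b:
        return 1.0
    denom = max(abs(a), abs(b))
    if denom == 0:
        return 0.0
    # Percentage difference relative to the larger magnitude,
    # clamped so the result never drops below zero.
    return max(0.0, 1.0 - abs(a - b) / denom)
```

Under this convention, comparing 100 with 90 yields a similarity of 0.9, while identical values yield 1.0.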
4.1.4
Classification
Once field comparison is complete, each pair of records has to be classified as ei-
ther a match or a non-match. The first approaches to classification came from the
statistics community and relied on probability theory to estimate the probability of a record pair being a match. Fellegi and Sunter [32] made two main contributions to entity recognition: the calculation of field weights based on the information quality of the field, and the definition of thresholds to classify record pairs into three classes.
Before records can be classified, potentially identical records need to be com-
pared based on their fields, however not all fields contribute equally to the final de-
cision of whether a pair is a match or not. For example, a match on identical names
is usually quite significant in identifying matching records, however a match on the
date of birth or sex of a person can be less significant. In order to quantify the importance of a field, each field can be weighted according to its importance, with more discriminating fields receiving a higher weight. To determine the field weights, Fellegi and Sunter proposed the use of two probabilities, m and u, that determine the agreement and disagreement weights of the individual fields.
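In the Fellegi-Sunter model, m is the probability that a field agrees given that the pair is a true match, and u is the probability that it agrees given a non-match; the standard agreement and disagreement weights are the base-2 log-likelihood ratios. A minimal sketch, where the example m and u values for a surname field are purely illustrative:

```python
import math

def field_weights(m: float, u: float) -> tuple[float, float]:
    """Fellegi-Sunter agreement/disagreement weights for one field.

    m: P(field agrees | records are a true match)
    u: P(field agrees | records are a non-match)
    """
    agree = math.log2(m / u)              # positive when m > u
    disagree = math.log2((1 - m) / (1 - u))  # negative when m > u
    return agree, disagree

# Illustrative values: surnames agree for 95% of true matches
# but only 1% of non-matching pairs.
agree_w, disagree_w = field_weights(m=0.95, u=0.01)
```

With these illustrative values the agreement weight is about +6.57 and the disagreement weight about -4.31, reflecting that agreement on a discriminating field is strong evidence for a match.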
Once the comparison of each of the attributes is complete, the total agreement
weight can be calculated to determine the value of the weight vector. To classify
the records into the three different sets, two cutoff thresholds must be defined. The
upper threshold defines all the pairs that are matches, the lower threshold defines
pairs that are not matches, and the records that fall in between are possible matches
that could be manually reviewed if necessary. In practice, the two thresholds can be
determined empirically based on the specific data set.
The probabilistic model of Fellegi and Sunter was subsequently revised and im-
proved by other researchers [38, 57]. Subsequent approaches used rule bases written
with the help of domain experts to classify records [34]. Elmagarmid et al. [30] provide a comprehensive overview of the individual classification algorithms that fall
into the above broad classes.
Availability of training data and advances in machine learning brought about the
use of machine learning techniques to tackle the problem. The current state of the