does not miss any possible matches. If the number of blocks is too small, for example when choosing gender as the blocking key, the number of records in each block will be too large, resulting in many unnecessary comparisons. If, on the other hand, the blocks are too small, for example when selecting a passport number as the key, then potential errors in the data can cause true duplicate matches to be missed.
One disadvantage with this blocking procedure is that typing errors in the key
will result in potentially matching records being missed, since these records are
separated into different blocks. The number of missing values in a field should also be taken into consideration. If a field has many missing values, then the records with those missing values will not form part of any block, reducing the likelihood that duplicates are matched.
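To make the mechanics concrete, the following is a minimal sketch (the dict-based record representation and the field names are illustrative assumptions, not taken from the text): records are grouped by the value of a chosen blocking key, candidate pairs are formed only within each block, and records whose blocking value is missing never enter any block, illustrating the problem described above.

```python
from collections import defaultdict
from itertools import combinations

def build_blocks(records, blocking_key):
    """Group records by the value of `blocking_key`.

    Records with a missing blocking value are left out, so they never
    reach the comparison step at all.
    """
    blocks = defaultdict(list)
    for record in records:
        value = record.get(blocking_key)
        if value:
            blocks[value].append(record)
    return blocks

def candidate_pairs(blocks):
    """Yield record pairs for comparison; pairs are formed only within a block."""
    for block in blocks.values():
        yield from combinations(block, 2)

# Example: blocking on the first letter of the surname (hypothetical data).
records = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smyth"},
    {"id": 3, "surname": "Jones"},
    {"id": 4, "surname": None},  # missing value: falls into no block
]
keyed = [dict(r, key=(r["surname"] or "")[:1].upper() or None) for r in records]
blocks = build_blocks(keyed, "key")
print(list(candidate_pairs(blocks)))  # only Smith/Smyth are compared
```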
To mitigate the effect of typing errors, phonetic encodings or string functions
can be applied to the chosen keys [20]. A substring function that extracts the first character of the name field, for example, places all names starting with the same letter in the same block. Phonetic encodings convert a string of characters into a code representing the pronunciation of the word. Such encodings are by definition language dependent, so the encoding must be chosen to suit the language of the data. The oldest and best known English-based phonetic encoding is Soundex [42], which converts a string into its first character followed by a sequence of digits derived from an encoding table. Phonex [41] and Phonix [33] are two variations on Soundex that attempt to improve the encoding scheme by applying more transformations to the words.
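The sketch below is a simplified Soundex encoder (it omits the special handling of "h" and "w" between letters with the same code, so it approximates rather than exactly reproduces the canonical definition); names with similar pronunciation map to the same code and therefore to the same block.

```python
# Soundex digit for each consonant; vowels, h, w and y carry no digit.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    **dict.fromkeys("l", "4"),
    **dict.fromkeys("mn", "5"),
    **dict.fromkeys("r", "6"),
}

def soundex(name: str) -> str:
    """Return a simplified 4-character Soundex code for `name`."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0].upper()
    # Encode every letter; letters without a digit map to "".
    codes = [SOUNDEX_CODES.get(ch, "") for ch in letters]
    # Keep a digit only if it differs from the previous letter's code, so
    # runs of the same digit collapse while vowels separate repeated digits.
    kept, prev = [], None
    for code in codes:
        if code and code != prev:
            kept.append(code)
        prev = code
    # The first letter is kept as a letter, so drop its own digit.
    if kept and kept[0] == SOUNDEX_CODES.get(letters[0], ""):
        kept = kept[1:]
    return (first + "".join(kept) + "000")[:4]

# e.g. soundex("Robert") == soundex("Rupert") == "R163"
```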
In order to evaluate blocking algorithms, the measures of pair completeness and reduction ratio [8] are typically used. Pair completeness measures the number of true matching pairs identified by the algorithm compared with the true number of duplicates that exist in the whole dataset. The reduction ratio measures the reduction in the number of comparisons achieved by using the blocking algorithm.
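As an illustrative formulation (the symbols below are not defined in the text and are introduced here as assumptions): if $s_M$ denotes the number of true matching pairs that survive blocking, $n_M$ the total number of true matching pairs in the dataset, $s_C$ the number of candidate pairs generated by blocking, and $n_C$ the total number of possible record pairs without blocking, then

$$\mathrm{PC} = \frac{s_M}{n_M}, \qquad \mathrm{RR} = 1 - \frac{s_C}{n_C}.$$

A higher pair completeness means fewer true duplicates are lost by blocking, and a higher reduction ratio means more comparisons are avoided.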
4.1.3 Field Comparison
After the blocks of records have been identified, the record pairs need to be com-
pared to determine the similarity between pairs. Depending on the classification
algorithm used to classify the records, the output of each field comparison can be
binary or a continuous measure of distance, typically between 0 and 1.
The comparison functions depend on the type of data contained in the fields. Most of the data involved in an entity resolution process is string data, so string distance algorithms are often used to measure the similarity of fields [44].
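As one concrete possibility (a sketch, not a method prescribed by the text), a field comparison can be derived from edit distance and normalized into the $[0, 1]$ range so that it can feed either a threshold-based or a probabilistic classifier.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]

def string_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# e.g. string_similarity("jon", "john") == 0.75
```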
String matching functions in the context of name matching for entity resolution have been studied by Christen and by Cohen et al. [20, 25]. Entity resolution processes involving person entities typically contain personal name fields that have to be compared. Personal names can have characteristics that differ from general text, such as multiple spellings of the same name, initial and middle-name abbreviations, and shortened names. Variation in name spelling can be considered a special case of misspelling; however, names sometimes change completely when they are shortened. Generic string comparison algorithms typically do not cater for the worst of these variations.