Fig. 7 Blocking Result
the name. This approach, however, makes the number of comparisons too high,
since there are only 24 possible buckets, one for each letter of the alphabet. The
gain in accuracy is marginal, while the number of extra comparisons is very
large.
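To make the cost concrete, the following is a minimal sketch, assuming the coarse blocking key discussed here is the first letter of the name. The function names block_by_first_letter and candidate_pairs and the sample names are illustrative and do not come from the original system. Every pair inside a bucket must be compared, so a bucket of n records costs n(n-1)/2 comparisons.

```python
from collections import defaultdict


def block_by_first_letter(names):
    """Group names into buckets keyed by their upper-cased first letter."""
    blocks = defaultdict(list)
    for name in names:
        if name:  # skip empty strings, which have no first letter
            blocks[name[0].upper()].append(name)
    return dict(blocks)


def candidate_pairs(blocks):
    """Count the pairwise comparisons needed inside all buckets."""
    return sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())


# Hypothetical sample data, for illustration only.
sample_names = ["Adams", "Allen", "Baker", "Brown", "Brooks"]
sample_blocks = block_by_first_letter(sample_names)
sample_cost = candidate_pairs(sample_blocks)
```

With so few buckets, each bucket holds a large share of the records, and the quadratic term dominates as the data set grows.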
5.6 Weight Generation
After the potentially matching names were grouped together in the blocking stage,
the weights for each pair of records were calculated to form a weight vector. The
weights were calculated based on the datatype and the content of each field. The
measures that were used include the following:
Jaro-Winkler: The fuzzy string comparison function of Jaro and Winkler
[56] returns a value between 0 and 1 depending on the similarity between the strings.
Exact String Match: Exact string comparison. If the strings are not exactly the
same, 0 is returned; otherwise 1 is returned.
Max String Difference: This defines the maximum number of characters that can
differ between the strings. If the number of differing characters is smaller than the
threshold, 1 is returned.
Minimum Set Membership: If at least X members of the two sets are the same, where
X is the set threshold, a match value of 1 is returned.
Flight Boolean Match: If the number of flights > 0 in both fields, or the number
of flights = 0 in both fields, then the function returns a match value of 0.
Numeric Percentage Difference: If the difference between the two numbers is less
than the percentage threshold then 1 is returned, 0 otherwise.
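Several of the comparison measures listed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study. It assumes the standard Jaro-Winkler definition with a prefix scale of 0.1 and a prefix length of at most 4, counts differing characters position by position plus any length difference, and measures the numeric difference relative to the larger of the two values. All function names, field names, and thresholds are hypothetical.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0..1."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1, hit2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unused equal character within the match window.
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, h in zip(s1, hit1) if h]
    seq2 = [c for c, h in zip(s2, hit2) if h]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


def exact_match(a, b):
    """1 if the two values are identical, 0 otherwise."""
    return 1 if a == b else 0


def max_string_difference(a, b, threshold):
    """1 if fewer than `threshold` characters differ (assumed positional)."""
    diffs = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return 1 if diffs < threshold else 0


def min_set_membership(a, b, x):
    """1 if the two sets share at least `x` members, 0 otherwise."""
    return 1 if len(set(a) & set(b)) >= x else 0


def numeric_percentage_difference(a, b, pct):
    """1 if the difference, relative to the larger value, is below `pct` percent."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1  # both values are zero, so there is no difference
    return 1 if abs(a - b) / largest * 100 < pct else 0


def weight_vector(rec_a, rec_b, comparators):
    """Apply one comparator per field and collect the results in order."""
    return [compare(rec_a[field], rec_b[field]) for field, compare in comparators]


# Hypothetical records and field configuration, for illustration only.
rec_a = {"name": "MARTHA", "doc_id": "X123", "routes": {"LHR", "JFK", "DXB"}, "spend": 100}
rec_b = {"name": "MARHTA", "doc_id": "X123", "routes": {"LHR", "JFK", "SIN"}, "spend": 104}
comparators = [
    ("name", jaro_winkler),
    ("doc_id", exact_match),
    ("routes", lambda a, b: min_set_membership(a, b, 2)),
    ("spend", lambda a, b: numeric_percentage_difference(a, b, 5)),
]
vector = weight_vector(rec_a, rec_b, comparators)
```

Pairing each field with its own comparator keeps the choice of measure tied to the field datatype, and the resulting list has one weight per configured field.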