on behalf of the person travelling, giving their own contact details instead of the
travelling passenger's contact details. The main purpose of using this information is
to uniquely identify the passengers, rather than direct marketing and contact;
therefore, as long as the information is consistent, it can still be used. Should
the approach be extended to direct marketing, stricter rules should be applied
to further clean the data.
5.5 Blocking
The data set used for entity resolution consists of over 9 million name records. If
one were to blindly compare all the 9 million records against each other in a cross
product, the process would involve over 7 × 9^12 comparisons, most of which would be
unnecessary. The only two fields that could effectively be used for blocking are
the name and surname, since all other fields contained many missing elements that
made them unsuitable as blocking keys (see Table 4).
Field blocking with encodings and sorted neighbourhood blocking were tested
in the blocking stage. Three phonetic encodings were tested: Soundex, Phonex and
Phonix. For these encodings, the name and surname were independently encoded
phonetically, then concatenated. The best phonetic encoding according to
our tests was Soundex, as it had both the highest pair completeness and
the highest reduction ratio (see Figure 7).
The sorted neighbourhood approach was explored with two different window
sizes. The accuracy was only slightly better than that of the Soundex encoding of name
and surname combined; however, the number of pairs compared was significantly
greater. The sorted neighbourhood approach is more efficient when several keys are
used to define multiple blocks, which are then combined together [34]. Since in this
scenario the number of possible keys for blocking is limited, the sorted neighbour-
hood approach could not be applied to its full potential.
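
The sliding-window idea behind sorted neighbourhood blocking can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the sample records, the sorting key (lower-cased surname followed by name) and the function name sorted_neighbourhood_pairs are assumptions made for the example.

```python
def sorted_neighbourhood_pairs(records, key, window):
    """Return candidate pairs (i, j), i < j, of record indices.

    Records are sorted by `key`; each record is paired with the
    `window - 1` records that follow it in the sorted order.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical (name, surname) records, for illustration only.
records = [
    ("Maria", "Borg"),
    ("John", "Smith"),
    ("Marija", "Borg"),
    ("Jon", "Smyth"),
    ("Anna", "Zammit"),
]


def sort_key(record):
    # Assumed sorting key: surname followed by name, lower-cased.
    name, surname = record
    return (surname + name).lower()


pairs_w2 = sorted_neighbourhood_pairs(records, sort_key, window=2)
pairs_w3 = sorted_neighbourhood_pairs(records, sort_key, window=3)
print(sorted(pairs_w2))
print(len(pairs_w3))
```

With a single sorting key, widening the window is the only way to catch more matches, and each extra position adds roughly one more comparison per record, which is consistent with the larger number of pairs reported above.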
Further experiments were carried out to improve the efficiency of the blocking
procedure for this particular data set. The best result was achieved by using a Soundex
encoding of the surname concatenated with the first two characters of the name. This
approach reached a pair completeness of 99%, resulting in fewer than 100 actual
records missed.
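
The two Soundex-based blocking keys described above can be sketched as follows. The code assumes the common American Soundex rules (four-character codes, with H and W not separating equal codes); the exact variant used in the experiments is not stated in the text. The function names, the upper-casing of the name prefix and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Letter-to-digit table of American Soundex (assumed variant).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex code of `word` ('' if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit:
            if digit != prev:
                code += digit
            prev = digit
        elif ch not in "HW":
            prev = ""  # vowels (and Y) separate equal codes; H and W do not
    return (code + "000")[:4]


def key_concatenated(name, surname):
    # Name and surname encoded independently, then concatenated.
    return soundex(name) + soundex(surname)


def key_surname_prefix(name, surname):
    # Soundex of the surname plus the first two characters of the name.
    return soundex(surname) + name[:2].upper()


def candidate_pairs(records, key_fn):
    """Group record indices by blocking key and pair them within each block."""
    blocks = defaultdict(list)
    for idx, (name, surname) in enumerate(records):
        blocks[key_fn(name, surname)].append(idx)
    return {p for ids in blocks.values() for p in combinations(ids, 2)}


# Hypothetical (name, surname) records, for illustration only.
records = [("Jon", "Smith"), ("John", "Smyth"),
           ("Jonathan", "Smith"), ("Maria", "Borg")]
pairs_concat = candidate_pairs(records, key_concatenated)
pairs_prefix = candidate_pairs(records, key_surname_prefix)
print(sorted(pairs_concat))
print(sorted(pairs_prefix))
```

In this toy example the prefix-based key keeps "Jon" and "Jonathan" in the same block, whereas encoding the whole name separates them; this is one plausible reason why a short name prefix tolerates abbreviated names better.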
Figure 7 compares the efficiency of the different blocking types. The first three
bars are for the concatenated name and surname with Soundex, Phonex and Phonix
encodings. The next two are for the Soundex encoded surname and the first or sec-
ond character of the name. The last measurement is for the sorted neighbourhood
approach with a window size of 10. For all the measures, the reduction ratio was
over 0.999, i.e. more than 99.9% of all possible comparisons were avoided.
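
For reference, the two measures used in this comparison can be computed as sketched below. The definitions are the ones commonly used in the record linkage literature and are assumed here, since the text does not spell them out: the reduction ratio is the fraction of all possible record pairs that blocking avoids, and pair completeness is the fraction of true matching pairs that survive blocking. The function names and the toy figures are hypothetical.

```python
def total_pairs(n_records):
    """Number of distinct unordered record pairs among n_records records."""
    return n_records * (n_records - 1) // 2


def reduction_ratio(n_candidates, n_records):
    """Fraction of all possible pairs that blocking removes."""
    return 1 - n_candidates / total_pairs(n_records)


def pair_completeness(candidates, true_matches):
    """Fraction of true matching pairs that are kept as candidates."""
    return len(candidates & true_matches) / len(true_matches)


# Toy figures, for illustration only: 5 records, 3 candidate pairs,
# 3 true matching pairs of which 2 survive blocking.
candidates = {(0, 1), (0, 2), (1, 2)}
true_matches = {(0, 1), (1, 2), (2, 3)}

rr = reduction_ratio(len(candidates), n_records=5)
pc = pair_completeness(candidates, true_matches)
print(f"reduction ratio = {rr:.3f}, pair completeness = {pc:.3f}")
```

The two measures pull in opposite directions: a stricter blocking key raises the reduction ratio but risks lowering pair completeness, which is why both are reported together.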
The records that are missed with this blocking approach are mainly due to pas-
sengers changing their surname after marriage. Using the name and surname keys
alone makes this case very difficult to identify automatically. Any key that uses part
of the surname is prone to this problem; however, names are also prone to
abbreviations, so the most accurate blocking key in this case uses the first letter of