Actor Identification in Implicit Relational Data Sources - Mining and Analyzing Social Networks

on behalf of the person travelling, giving his contact details instead of the travel-

ling passenger's contact details. The main purpose of using this information is to

uniquely identify the passengers, rather than direct marketing and contact, therefore

as long as the information is consistent, this information can still be used. Should

the approach be extended to direct marketing, then stricter rules should be applied

to further clean the data.

5.5 Blocking

The data set used for entity resolution consists of over 9 million name records. If

one were to blindly compare all the 9 million records against each other in a cross

product, the process will involve over 7

9 12 comparisons, most of which will be

unnecessary. The only two fields that could effectively be used for blocking are

the name and surname, since all other fields contained many missing elements that

made them unsuitable as blocking keys (see Table 4).
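
To give a sense of the scale involved, the arithmetic behind a naive comparison can be sketched as follows. The figure of exactly 9,000,000 records is an assumption used for illustration (the data set is described only as containing over 9 million name records), and the function names are hypothetical.

```python
def cross_product_comparisons(n: int) -> int:
    """Every record against every record, counting both orderings and self-pairs."""
    return n * n


def unique_pair_comparisons(n: int) -> int:
    """Each unordered pair of distinct records compared exactly once."""
    return n * (n - 1) // 2


n = 9_000_000  # assumed round figure, not the exact size of the data set

print(f"{cross_product_comparisons(n):.2e}")  # 8.10e+13
print(f"{unique_pair_comparisons(n):.2e}")    # 4.05e+13
```

Either way the count is in the tens of trillions, which is why a blocking step is needed before any detailed record comparison.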

Field blocking with encodings and sorted neighbourhood blocking were tested

in the blocking stage. Three phonetic encodings were tested: Soundex, Phonex and

Phonix. For these encodings, the name and surname were independently encoded

phonetically, then concatenated together. The best phonetic encoding according to

our tests was the Soundex encoding as it has both the highest pair completeness, and

the highest reduction ratio (see Figure 7).
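
As an illustration of field blocking with a phonetic encoding, the sketch below encodes name and surname independently and concatenates the two codes into a blocking key. It uses a simplified version of the common American Soundex rules, which is not necessarily the exact variant used in the experiments; the record fields `name` and `surname` and all function names are assumptions made for this example.

```python
from collections import defaultdict

# Soundex digit groups; vowels, Y, H and W carry no digit.
_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_CODES = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def soundex(word: str) -> str:
    """Return a four-character Soundex code (simplified American variant)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def blocking_key(name: str, surname: str) -> str:
    """Encode each field on its own, then concatenate the two codes."""
    return soundex(name) + soundex(surname)


def build_blocks(records):
    """Group record indices by blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec["name"], rec["surname"])].append(idx)
    return blocks


# Made-up records, for illustration only.
records = [
    {"name": "Robert", "surname": "Smith"},
    {"name": "Rupert", "surname": "Smyth"},
    {"name": "Maria", "surname": "Borg"},
]
blocks = build_blocks(records)
print(dict(blocks))  # {'R163S530': [0, 1], 'M600B620': [2]}
```

Only records that share a key are compared in later stages, so the first two records would be compared with each other but never with the third.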

The sorted neighbourhood approach was explored with two different window

sizes. The accuracy was only slightly better than the Soundex encoding of name

and surname combined; however, the number of pairs compared was significantly

greater. The sorted neighbourhood approach is more efficient when several keys are

used to define multiple blocks, which are then combined together [34]. Since in this

scenario the number of possible keys for blocking is limited, the sorted neighbour-

hood approach could not be applied to its full potential.
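
A single pass of the sorted neighbourhood method can be sketched as below: records are sorted on a key and each record is paired with the records that follow it inside a sliding window. The sort key (surname followed by name), the sample records, and the function name are assumptions made for this illustration, not details taken from the experiments.

```python
def sorted_neighbourhood_pairs(records, sort_key, window):
    """Return candidate pairs (as index tuples) from one sorted-neighbourhood pass."""
    if window < 2:
        raise ValueError("window must be at least 2")
    # Indices of the records in sorted-key order.
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Pair each record with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Made-up (name, surname) records, for illustration only.
records = [
    ("Maria", "Borg"),
    ("Marija", "Borg"),
    ("John", "Camilleri"),
    ("Jon", "Camilleri"),
    ("Anna", "Vella"),
]
pairs = sorted_neighbourhood_pairs(
    records, sort_key=lambda r: (r[1] + r[0]).lower(), window=3
)
print(sorted(pairs))
```

With several usable keys, one pass per key could be run and the resulting sets of pairs combined with a set union. With only name and surname available, there are few distinct passes to combine.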

Further experiments were held to improve the efficiency of the blocking proce-

dure for this particular data set. The best result was achieved by using a Soundex

encoding of the surname concatenated with the first two characters of the name. This

approach reached a pair completeness of 99%, resulting in fewer than 100 actual

records missed.
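
The key described above can be sketched as follows, again using a simplified American Soundex that may differ from the variant used in the experiments. The function name `surname_prefix_key` and the `prefix_len` parameter are hypothetical.

```python
_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_CODES = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def soundex(word: str) -> str:
    """Four-character Soundex code (simplified American variant)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]


def surname_prefix_key(name: str, surname: str, prefix_len: int = 2) -> str:
    """Soundex of the surname followed by the first characters of the name."""
    return soundex(surname) + name.strip().upper()[:prefix_len]


print(surname_prefix_key("Robert", "Smith"))  # S530RO
print(surname_prefix_key("Rob", "Smyth"))     # S530RO
print(surname_prefix_key("Bob", "Smith"))     # S530BO
```

Because only a short prefix of the name is used, a shortened form such as "Rob" stays in the same block as "Robert", whereas a full Soundex of the name would separate them (R100 versus R163). A nickname with a different initial, such as "Bob", still lands in a different block.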

Figure 7 compares the efficiency of the different blocking types. The first three

bars are for the concatenated name and surname with Soundex, Phonex and Phonix

encodings. The next two are for the Soundex encoded surname and the first or sec-

ond character of the name. The last measurement is for the sorted neighbourhood

approach with a window size of 10. For all the measures the reduction ratio was

over 0.999 of the total number of possible comparisons.
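
The two measures used in this comparison can be computed as in the sketch below. The definitions follow the usual ones from the record linkage literature and are assumed here, since they are not restated in this section; the function names and the toy data are hypothetical.

```python
def reduction_ratio(num_candidate_pairs: int, num_records: int) -> float:
    """Fraction of all possible record pairs that blocking avoids comparing."""
    total_pairs = num_records * (num_records - 1) // 2
    return 1.0 - num_candidate_pairs / total_pairs


def pair_completeness(candidate_pairs: set, true_match_pairs: set) -> float:
    """Fraction of the true matching pairs that survive blocking."""
    if not true_match_pairs:
        raise ValueError("at least one true match is needed")
    return len(candidate_pairs & true_match_pairs) / len(true_match_pairs)


# Toy example: 5 records, 3 candidate pairs, 3 known true matches.
candidates = {(0, 1), (2, 3), (0, 4)}
true_matches = {(0, 1), (2, 3), (1, 4)}

print(reduction_ratio(len(candidates), num_records=5))  # 1 - 3/10
print(pair_completeness(candidates, true_matches))      # 2 of 3 matches kept
```

The two measures pull in opposite directions: a coarser key keeps more true matches but produces more candidate pairs. Assuming roughly 9 million records, even a reduction ratio of 0.999 would leave on the order of 4 × 10^10 candidate pairs, so small differences in the ratio still matter.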

The records that are missed with this blocking approach are mainly due to pas-

sengers changing their surname after marriage. Using the name and surname keys

alone makes this case very difficult to identify automatically. Using any part of the

surname is always prone to this problem. Names, however, are also prone to abbreviations;

therefore the most accurate blocking key in this case is using the first letter of
