Actor Identification in Implicit Relational Data Sources - Mining and Analyzing Social Networks

they travel frequently. In order to determine the network of all the airline passen-

gers, the remaining 65% of the passengers have to be uniquely identified.

5.3 Entity Resolution Process

For our case study we use an attribute-based approach to entity recognition. The

overall aim of this research is to eventually infer large networks from data that has

no explicit network links or where links can be ambiguous. In this scenario, node

identification is the first stage towards identifying the network structure. Current

relational network approaches require the network to be already available to

disambiguate between nodes. Once the network is identified, the network information

can be fed back to a second entity resolution pass to improve the accuracy of entity

recognition and, in turn, the accuracy of the generated network. In future work

we plan to embed this stage in the network inference process and study in depth the

interplay between the relationship inference and relational entity resolution.

Each of the four stages described in section 4.1 (data standardisation, blocking,

field comparison and classification) involves design decisions that influence the

efficiency and the outcome of the entity resolution process. In this section we look

at the design decisions we considered during this process, and the results of the most

efficient and effective solution used for identifying passengers.
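
The four stages can be illustrated with a minimal sketch. It uses only the Python standard library rather than Febrl's own API, and the record fields, the blocking key (first letter of the name), the similarity measure and the 0.85 threshold are all assumptions chosen for illustration, not the design decisions made in this case study.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical passenger records; the field names are illustrative only.
records = [
    {"id": 1, "name": "John  Smith", "email": "J.Smith@Example.com"},
    {"id": 2, "name": "john smith", "email": "j.smith@example.com"},
    {"id": 3, "name": "Jane Smyth", "email": "jane@example.org"},
]

def standardise(rec):
    """Stage 1: lower-case text fields and collapse repeated whitespace."""
    return {k: " ".join(v.lower().split()) if isinstance(v, str) else v
            for k, v in rec.items()}

def block(recs):
    """Stage 2: group records by a cheap key so only plausible pairs are compared."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[rec["name"][:1]].append(rec)
    return blocks

def compare(a, b):
    """Stage 3: compute a similarity score in [0, 1] for each compared field."""
    return {f: SequenceMatcher(None, a[f], b[f]).ratio()
            for f in ("name", "email")}

def classify(scores, threshold=0.85):
    """Stage 4: declare a match when the mean field similarity reaches the threshold."""
    return sum(scores.values()) / len(scores) >= threshold

def resolve(recs):
    """Run all four stages and return the pairs of record ids judged to match."""
    clean = [standardise(r) for r in recs]
    matches = []
    for group in block(clean).values():
        for a, b in combinations(group, 2):
            if classify(compare(a, b)):
                matches.append((a["id"], b["id"]))
    return matches

matches = resolve(records)
```

Each function corresponds to one stage, so a different blocking key or classifier can be swapped in without touching the others, which is the kind of design choice discussed in the remainder of this section.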

To facilitate the development of this procedure the Febrl framework [22] and

toolkit were used. Febrl packages all the stages of the ER process in an easily ex-

tendable and customisable open source toolkit, written in Python. Febrl was

originally developed as a research platform to assist with medical record linkage;

however, its generic framework made it straightforward to adapt for identifying

duplicate airline passengers here.

5.4 Data Standardisation and Cleansing

All the data elements extracted from the booking were sanitised to ensure processing

consistency. The four main data elements that can be used to uniquely identify

a passenger are the contact details available in the booking: the frequent flyer

number, email addresses, phone numbers and a single mail address. The email

addresses and the mail address are linked not with the individual passengers but

with the whole booking; therefore all records in the same booking

had the same mail and email addresses.
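
As a rough illustration of this kind of sanitisation, the sketch below normalises the contact fields and copies the booking-level addresses onto each passenger record. The booking layout, the key names and the specific cleaning rules are hypothetical; the source does not describe the exact rules that were applied.

```python
import re

def clean_email(value):
    """Trim and lower-case an email address; return None if it has no '@'."""
    value = value.strip().lower()
    return value if "@" in value else None

def clean_phone(value):
    """Keep digits only, so differently punctuated numbers compare equal."""
    return re.sub(r"\D", "", value)

def clean_ff_number(value):
    """Remove whitespace and upper-case a frequent flyer number."""
    return re.sub(r"\s+", "", value).upper()

def flatten_booking(booking):
    """Return one record per passenger, each carrying the booking-level
    email and mail addresses (hypothetical booking structure)."""
    emails = [e for e in map(clean_email, booking["emails"]) if e]
    mail_address = " ".join(booking["mail_address"].lower().split())
    rows = []
    for passenger in booking["passengers"]:
        rows.append({
            "name": " ".join(passenger["name"].lower().split()),
            "phone": clean_phone(passenger.get("phone", "")),
            "ff_number": clean_ff_number(passenger.get("ff_number", "")),
            "emails": emails,
            "mail_address": mail_address,
        })
    return rows

# Made-up example booking with two passengers sharing one contact address.
booking = {
    "emails": ["  A.Jones@Example.COM ", "not-an-email"],
    "mail_address": "1  High Street",
    "passengers": [
        {"name": "A  Jones", "phone": "+44 20 7946 0000", "ff_number": "ab 123456"},
        {"name": "B Jones"},
    ],
}
rows = flatten_booking(booking)
```

Because the email and mail addresses are attached to the booking, both passengers end up with identical values for those fields, which is why they cannot separate travellers within a single booking.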

Apart from the personal contact details of the passengers, additional information

on the passenger's route travelled was added. Frequent travellers tend to travel on

the same routes multiple times, so this information can be used to improve

the identification of the same passenger. The route information can be represented

in different ways, for instance by flight type, flight distance, or

the point of origin of the journey.
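
One possible encoding of these three representations is sketched below. The leg structure, the reading of "flight type" as domestic versus international, and the distance band thresholds are assumptions made for the example; the airport codes and distances are placeholders, not real data.

```python
def route_features(legs):
    """Derive comparison-ready route attributes from a list of flight legs.

    Each leg is a dict with hypothetical keys: origin, destination,
    origin_country, destination_country and distance_km.
    """
    total_km = sum(leg["distance_km"] for leg in legs)
    international = any(
        leg["origin_country"] != leg["destination_country"] for leg in legs
    )
    if total_km < 1500:          # illustrative thresholds only
        band = "short"
    elif total_km < 4000:
        band = "medium"
    else:
        band = "long"
    return {
        "origin": legs[0]["origin"],  # point of origin of the journey
        "flight_type": "international" if international else "domestic",
        "distance_band": band,
    }

# Placeholder journey with two legs.
legs = [
    {"origin": "AAA", "destination": "BBB",
     "origin_country": "X", "destination_country": "X", "distance_km": 600},
    {"origin": "BBB", "destination": "CCC",
     "origin_country": "X", "destination_country": "Y", "distance_km": 2100},
]
features = route_features(legs)
```

Coarse categorical values such as these can be compared with the same field comparison functions used for the contact details, at the cost of losing some detail about the exact route.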

As with any other real data source, the information contained in the booking can

be incorrect or misleading. For example, a booking can be made by a second person
