they travel frequently. In order to determine the network of all the airline passen-
gers, the remaining 65% of the passengers have to be uniquely identified.
5.3 Entity Resolution Process
For our case study we use an attribute-based approach to entity recognition. The
overall aim of this research is to eventually infer large networks from data that has
no explicit network links or where links can be ambiguous. In this scenario, node
identification is the first stage towards identifying the network structure. Current
relational network approaches require the network to be already available to
disambiguate between nodes. Once the network is identified, the network information
can be fed back into a second entity resolution pass to improve the accuracy of entity
recognition and, in turn, the accuracy of the generated network. In future work
we plan to embed this stage in the network inference process and study in depth the
interplay between the relationship inference and relational entity resolution.
Each of the four stages described in section 4.1 (data standardisation, blocking,
field comparison and classification) involves design decisions that influence the
efficiency and the outcome of the entity resolution process. In this section we look
at the design decisions we considered during this process, and the results of the most
efficient and effective solution used for identifying passengers.
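
The four stages named above can be illustrated with a minimal sketch in plain Python. This is not Febrl's API and not the configuration used in the case study: the field names, the three-letter blocking key, the use of `difflib.SequenceMatcher` for field comparison, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def standardise(record):
    """Stage 1: lower-case every field and collapse surrounding whitespace."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}


def block(records, key="surname"):
    """Stage 2: group record indices by a cheap key (first 3 letters, assumed)."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[key][:3]].append(idx)
    return blocks


def compare(a, b, fields):
    """Stage 3: one similarity score in [0, 1] per compared field."""
    return [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]


def classify(scores, threshold=0.85):
    """Stage 4: call the pair a match if the mean score reaches the threshold."""
    return sum(scores) / len(scores) >= threshold


def resolve(raw_records, fields=("surname", "first_name", "email"), threshold=0.85):
    """Run all four stages and return index pairs judged to be the same entity."""
    records = [standardise(r) for r in raw_records]
    matches = []
    for indices in block(records).values():
        # Only records that share a block are compared with each other.
        for i, j in combinations(indices, 2):
            if classify(compare(records[i], records[j], fields), threshold):
                matches.append((i, j))
    return matches


# Hypothetical passenger records, invented for this example.
passengers = [
    {"surname": "Smith", "first_name": "John", "email": "J.Smith@example.com"},
    {"surname": "smith ", "first_name": "Jon", "email": "j.smith@example.com"},
    {"surname": "Jones", "first_name": "Mary", "email": "mary@example.org"},
]
print(resolve(passengers))  # [(0, 1)]
```

Each function corresponds to one design decision: a coarser blocking key reduces the number of comparisons but risks separating true duplicates, while the threshold trades false matches against missed ones.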
To facilitate the development of this procedure, the Febrl framework [22] and
toolkit were used. Febrl packages all the stages of the ER process in an easily
extensible and customisable open source toolkit, written in Python. Febrl was
originally developed as a research platform to assist with medical record linkage;
however, its generic framework made it straightforward to adapt and use it to identify
duplicate airline passengers here.
5.4 Data Standardisation and Cleansing
All the data elements extracted from the booking were sanitised to ensure processing
consistency. The four main data elements that can be used to uniquely identify
a passenger are the contact details available in the booking: the frequent flyer
number, email addresses, phone numbers and a single mail address. The email addresses
and the mail address are linked not with the individual passengers but with the whole
booking; therefore, all records in the same booking had the same mail and email addresses.
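
As a rough illustration of this sanitising step, the sketch below normalises the four contact elements and copies the booking-level email and mail addresses onto every passenger record. The booking layout, the field names, the cleaning rules, and the choice to treat phone numbers as per-passenger are assumptions made for the example; the actual rules used in the case study are not given in the text.

```python
import re


def clean_email(value):
    """Lower-case and trim an email; return '' if it is not plausibly valid."""
    value = value.strip().lower()
    return value if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) else ""


def clean_phone(value):
    """Keep digits only, so that formatting differences do not block a match."""
    return re.sub(r"\D", "", value)


def clean_ff_number(value):
    """Upper-case a frequent flyer number and drop separators."""
    return re.sub(r"[^A-Z0-9]", "", value.upper())


def standardise_booking(booking):
    """Return one cleaned record per passenger in the booking."""
    # Email and mail addresses belong to the booking, so every passenger
    # record in the same booking receives the same values.
    shared = {
        "emails": sorted({e for e in map(clean_email, booking["emails"]) if e}),
        "mail_address": " ".join(booking["mail_address"].lower().split()),
    }
    records = []
    for p in booking["passengers"]:
        records.append({
            "name": " ".join(p["name"].lower().split()),
            "ff_number": clean_ff_number(p.get("ff_number", "")),
            "phones": sorted({clean_phone(x) for x in p.get("phones", [])} - {""}),
            **shared,
        })
    return records


# Hypothetical booking, invented for this example.
booking = {
    "emails": [" J.Smith@Example.COM ", "not-an-email"],
    "mail_address": "1  Example   Street, Sampletown",
    "passengers": [
        {"name": "John  SMITH", "ff_number": "ab 123-456", "phones": ["+44 (0)20 7946-0000"]},
        {"name": "Mary Smith"},
    ],
}
for record in standardise_booking(booking):
    print(record)
```

Because the shared fields are identical within a booking, they cannot separate passengers who travel together, although they can still help link the same passenger across different bookings.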
Apart from the personal contact details of the passengers, additional information
on the route each passenger travelled was added. Frequent travellers tend to travel on
the same routes multiple times, so this information can be used to improve
the identification of the same passenger. The route information can be represented
in different ways, for instance by flight type, flight distance, or
the origin of the journey.
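
One possible way to turn an itinerary into comparable route fields is sketched below. The text does not define "flight type" or how distance was encoded, so the domestic/international split, the distance bands and their thresholds, and the leg layout are all hypothetical.

```python
def distance_band(distance_km, short_max=1500, long_min=4000):
    """Bucket a distance into a coarse band (thresholds are illustrative only)."""
    if distance_km < short_max:
        return "short"
    if distance_km < long_min:
        return "medium"
    return "long"


def route_features(legs):
    """Summarise an itinerary (an ordered list of legs) as three route fields."""
    total_km = sum(leg["distance_km"] for leg in legs)
    return {
        # Origin of the journey: departure point of the first leg.
        "origin": legs[0]["from"].upper(),
        # Flight type: assumed here to mean domestic versus international.
        "flight_type": "international"
        if any(leg["international"] for leg in legs)
        else "domestic",
        # Flight distance: reduced to a band so near-identical trips compare equal.
        "distance_band": distance_band(total_km),
    }


# Hypothetical two-leg itinerary with placeholder airport codes.
itinerary = [
    {"from": "aaa", "to": "BBB", "distance_km": 900, "international": False},
    {"from": "BBB", "to": "CCC", "distance_km": 5200, "international": True},
]
print(route_features(itinerary))
```

Coarse categories such as these are shared by many passengers, so they would be expected to act as supporting evidence alongside the contact details rather than as identifiers on their own.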
As with any other real data source, the information contained in the booking can
be incorrect or misleading. For example, a booking can be made by a second person