4 Entity Resolution Approaches
In the entity resolution literature, perhaps due to the original work from the
statistics community [58], the predominant approach is to undertake a pairwise
comparison between records. This involves comparing the data attributes to
determine the similarity between pairs of records in order to classify each pair
as a match or a non-match. This approach is attribute based and considers each
pair independently. Transitive closure is often calculated on the resulting pairs
to merge records that refer to the same entity.
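The pairwise approach followed by transitive closure can be sketched as below.
This is a minimal illustration rather than the method of any cited work: the
function names, the use of Python's difflib for string similarity, the averaging
of field scores and the 0.85 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def match_pairs(records, threshold=0.85):
    """Compare every pair of records independently and keep likely matches.

    `records` maps a record id to a dict of string attributes. The mean
    similarity over shared attributes and the threshold are illustrative
    assumptions, not values taken from the text.
    """
    pairs = []
    for (id_a, rec_a), (id_b, rec_b) in combinations(records.items(), 2):
        scores = [
            SequenceMatcher(None, rec_a[field], rec_b[field]).ratio()
            for field in rec_a
            if field in rec_b
        ]
        if scores and sum(scores) / len(scores) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


def transitive_closure(record_ids, matched_pairs):
    """Merge matched pairs into clusters using a union-find structure."""
    parent = {r: r for r in record_ids}

    def find(r):
        # Follow parent links to the cluster root, shortening the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())
```

Note that the closure merges records a and c whenever a matches b and b matches
c, even if a and c were never classified as a match themselves, which is the
reason a single false match can join two otherwise separate clusters.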
Since the problem has been tackled by different research communities, it has
been formulated in a variety of ways [10], some of which exploit the nature of the
data to infer more information than attributes alone can contain. Relational
information, such as child-parent relationships and co-authorship links between
paper authors, can be used to create a graph of common neighbours, which provides
additional information that makes the entity resolution process more accurate.
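As a rough illustration of how common neighbours can add evidence, the sketch
below builds a co-authorship graph and scores two author references by the
overlap of their neighbour sets. The function names and the choice of the Jaccard
coefficient as the overlap measure are assumptions made for this example; the
approaches cited above may weight relational evidence differently.

```python
def build_coauthor_graph(papers):
    """Map each author reference to the set of references it co-occurs with.

    `papers` is a list of author lists, one list per paper.
    """
    graph = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set()).update(
                other for other in authors if other != author
            )
    return graph


def neighbour_similarity(graph, a, b):
    """Jaccard overlap of the neighbour sets of two author references."""
    neighbours_a = graph.get(a, set())
    neighbours_b = graph.get(b, set())
    union = neighbours_a | neighbours_b
    if not union:
        return 0.0
    return len(neighbours_a & neighbours_b) / len(union)


# Hypothetical data: two spellings of an author sharing the same co-authors.
papers = [
    ["J. Smith", "A. Jones", "B. Lee"],
    ["John Smith", "A. Jones", "B. Lee"],
    ["C. Wu", "D. Kim"],
]
graph = build_coauthor_graph(papers)
print(neighbour_similarity(graph, "J. Smith", "John Smith"))
print(neighbour_similarity(graph, "J. Smith", "C. Wu"))
```

In practice such a relational score would be combined with the attribute
similarity rather than used on its own.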
4.1 Attribute Based Entity Recognition
A typical attribute based entity resolution solution is divided into five stages
[22]. Before attempting to identify entities, the data available has to be cleaned
and consistently divided into separate fields that are used in the following
stages of entity resolution. After the cleaning stage, the data is typically
divided into blocks to reduce the number of comparisons between potential
duplicates. Next, field comparisons measure the similarity between pairs of
records to enable a classification of which records are identical and which are
not. Finally, the output of the classification is evaluated to measure the quality
of the whole process. The following sections will describe each of these stages in
more detail.
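The effect of the blocking stage on the number of comparisons can be sketched as
follows. This is an illustrative example only: the function names, the record
layout and the blocking key (first letter of the surname) are assumptions, not
details taken from [22].

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key_func):
    """Group record ids by a blocking key computed from each record."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_func(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Yield only the pairs of records that share a block."""
    for ids in blocks.values():
        yield from combinations(ids, 2)


# Hypothetical records; the blocking key is the first letter of the surname.
records = {
    "r1": {"surname": "Smith"},
    "r2": {"surname": "Smyth"},
    "r3": {"surname": "Jones"},
    "r4": {"surname": "Johnson"},
}
blocks = block_records(records, lambda rec: rec["surname"][0].lower())
pairs = list(candidate_pairs(blocks))
all_pairs = list(combinations(records, 2))
print(len(pairs), "candidate pairs instead of", len(all_pairs))
```

Only the candidate pairs are passed on to the field comparison and
classification stages. The trade-off is that two records for the same entity
that fall into different blocks are never compared.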
4.1.1 Data Cleansing and Data Standardisation
As the title of Hernandez's paper states, “Real-world data is dirty” [34]. The entire
process of entity recognition is itself often a preprocessing stage before data mining
or analysis, which explains why entity resolution is sometimes referred to as data
cleansing in the database community. Despite entity resolution being a cleansing
stage for data analysis, the raw input data itself needs to be standardised into a
single, well-defined common format prior to the other stages in the entity
resolution process.
Data cleansing is a well understood problem in the database and data warehousing
communities [49]. Leading database providers have commercialised research and
are now providing tools specifically designed to assist in data cleansing as part of the
ETL (extraction, transformation and loading) process to populate data warehouses.
At this stage of the entity resolution process the main concerns are ill-formatted
data, different encodings of the same data, and data residing in incorrect fields.
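As a small illustration of standardising different encodings of the same value,
the sketch below maps several spellings of a telephone number onto one canonical
string. The function name, the sample numbers, the default country code and the
prefix rules (a leading "00" as international prefix, a single leading "0" as
national trunk prefix) are assumptions made for the example; real numbering
plans vary by country, and a production system would normally rely on a
dedicated library.

```python
import re


def standardise_phone(raw, default_country_code="44"):
    """Return a canonical '+<country><number>' string for a phone number.

    The prefix handling is a simplifying assumption and does not cover
    forms such as '+44 (0)20 ...'.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if raw.strip().startswith("+"):
        return "+" + digits
    if digits.startswith("00"):
        # International call prefix written as 00.
        return "+" + digits[2:]
    if digits.startswith("0"):
        # National trunk prefix: replace it with the country code.
        return "+" + default_country_code + digits[1:]
    return "+" + default_country_code + digits


# Hypothetical encodings of the same number.
samples = ["+44 20 7946 0018", "(020) 7946-0018", "0044 20 7946 0018"]
print({standardise_phone(s) for s in samples})
```

Once every value is in the same format, a simple equality or string similarity
comparison in the later stages becomes meaningful.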
Dates, addresses and phone numbers are typical examples of fields that require
standardisation for entity recognition. For example, in Table 1 the three phone
numbers are encoded differently. In order to make comparisons accurate, this data must be
 