Actor Identification in Implicit Relational Data Sources - Mining and Analyzing Social Networks


4 Entity Resolution Approaches

In the entity resolution literature, perhaps due to the original work from the statistics community [58], the predominant approach is to undertake a pairwise comparison between records. This involves comparing the data attributes to determine the similarity between pairs of records in order to classify each pair as a match or a non-match. This approach is attribute-based and considers each pair independently. Transitive closure is often calculated on the resulting pairs to merge records that point to the same entity.
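The pairwise approach followed by transitive closure can be sketched as below. This is a minimal illustration and not the method of any cited work: the similarity measure (Python's `difflib.SequenceMatcher` averaged over shared fields), the threshold of 0.85, and the function names `similarity`, `transitive_closure` and `resolve` are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Mean string similarity over the fields two records share.

    Illustrative measure only; real systems use field-specific comparators.
    """
    fields = sorted(a.keys() & b.keys())
    if not fields:
        return 0.0
    scores = [
        SequenceMatcher(None, str(a[f]).lower(), str(b[f]).lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def transitive_closure(n, matched_pairs):
    """Merge n record indices into clusters using union-find."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in matched_pairs:
        parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def resolve(records, threshold=0.85):
    """Classify every pair independently, then merge the matched pairs."""
    matches = [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if similarity(records[i], records[j]) >= threshold
    ]
    return transitive_closure(len(records), matches)
```

Note that the closure step can merge two records that were never directly classified as a match, provided a chain of matched pairs links them.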

Since the problem has been tackled by different research communities, it has been formulated in a variety of ways [10], some of which exploit the nature of the data to infer more information than attributes alone can contain. Relational information, such as child-parent relationships and co-authorship links between paper authors, can be used to create a graph of common neighbours, which provides additional information to make the entity resolution process more accurate.
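As a rough illustration of how common neighbours might be used, the sketch below builds an undirected co-authorship graph and scores two author references by the overlap of their neighbour sets. The Jaccard coefficient is an assumption chosen for simplicity, not a measure prescribed by the text, and the names `build_neighbours` and `neighbour_overlap` are hypothetical.

```python
from collections import defaultdict


def build_neighbours(links):
    """Build an undirected adjacency map from (reference, reference) links."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return dict(neighbours)


def neighbour_overlap(neighbours, x, y):
    """Jaccard overlap of the neighbour sets of x and y (0.0 if both empty)."""
    nx = neighbours.get(x, set()) - {y}
    ny = neighbours.get(y, set()) - {x}
    union = nx | ny
    return len(nx & ny) / len(union) if union else 0.0


# Hypothetical co-authorship links between author references.
links = [
    ("J. Smith", "A. Lee"),
    ("J. Smith", "B. Khan"),
    ("John Smith", "A. Lee"),
    ("John Smith", "B. Khan"),
    ("J. Smyth", "C. Ortiz"),
]
graph = build_neighbours(links)
```

In this invented data, "J. Smith" and "John Smith" share all of their co-authors, which supports treating them as the same entity, whereas "J. Smyth" shares none despite the similar name.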

4.1 Attribute Based Entity Recognition

A typical attribute-based entity resolution solution is divided into five stages [22]. Before attempting to identify entities, the available data has to be cleaned and consistently divided into separate fields that are used in the following stages of entity resolution. After the cleaning stage, the data is typically divided into blocks to reduce the number of comparisons between potential duplicates. Next, field comparisons measure the similarity between pairs of records to enable a classification of which records are identical and which are not. Finally, the output of the classification is evaluated to measure the quality of the whole process. The following sections describe each of these stages in more detail.
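The effect of the blocking stage on the number of comparisons can be sketched as follows. The blocking key used here (the first letter of a surname) and the function name `candidate_pairs` are assumptions for illustration only; practical systems choose keys suited to their data.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key):
    """Yield index pairs of records that fall in the same block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    for members in blocks.values():
        # Only records sharing a blocking key are ever compared.
        yield from combinations(members, 2)


# Hypothetical records; the key is the first letter of the surname.
records = [
    {"surname": "Smith"},
    {"surname": "Smyth"},
    {"surname": "Jones"},
    {"surname": "Smith"},
]
pairs = list(candidate_pairs(records, key=lambda r: r["surname"][0].upper()))
all_pairs = list(combinations(range(len(records)), 2))
```

With these four records, blocking leaves three candidate pairs instead of the six produced by comparing every record with every other. The trade-off is that true duplicates placed in different blocks are never compared.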

4.1.1 Data Cleansing and Data Standardisation

As the title of Hernandez's paper states, “Real-world data is dirty” [34]. The entire process of entity recognition is itself often a preprocessing stage before data mining or analysis, which explains why entity resolution is sometimes referred to as data cleansing in the database community. Although entity resolution is a cleansing stage for data analysis, the raw input data itself needs to be standardised into a single, well-defined common format prior to the other stages in the entity resolution process.

Data cleansing is a well-understood problem in the database and data warehousing communities [49]. Leading database providers have commercialised research and now provide tools specifically designed to assist in data cleansing as part of the ETL (extraction, transformation and loading) process used to populate data warehouses. At this stage of the entity resolution process the main concerns are ill-formatted data, different encodings of the same data, and data residing in incorrect fields.
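Telephone numbers give a simple illustration of different encodings of the same data. The sketch below maps several written forms to one canonical string. It is a simplified assumption-laden example: the default country code "44", the function name `normalise_phone` and the sample numbers are all hypothetical, and forms such as "+44 (0)20 ..." are deliberately not handled.

```python
import re


def normalise_phone(raw, default_country="44"):
    """Return a phone number as '+<country><national number>' digits only.

    Assumes numbers without an international prefix belong to
    default_country and may carry a single leading trunk '0'.
    """
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits
    if digits.startswith("00"):
        # International access prefix written as 00 instead of +.
        return "+" + digits[2:]
    if digits.startswith("0"):
        # National format: drop the trunk prefix, add the country code.
        return "+" + default_country + digits[1:]
    return "+" + default_country + digits


# Three hypothetical encodings of the same number.
samples = ["+44 20 7946 0958", "0044-20-7946-0958", "(020) 7946 0958"]
normalised = {normalise_phone(s) for s in samples}
```

After normalisation the three strings collapse to a single value, so a later field comparison can treat them as equal rather than as three distinct numbers.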

Dates, addresses and phone numbers are typical examples of fields that require standardisation for entity recognition. For example, in Table 1 the three phone numbers are encoded differently. In order to make comparison accurate this data must be
