Actor Identification in Implicit Relational Data Sources - Mining and Analyzing Social Networks

art in classification [21] uses Support Vector Machines (SVMs) to train models and classify records when training examples are available. SVMs have been successfully applied to several classification domains, such as handwriting recognition, facial expression classification and text categorisation [18]. SVMs were originally designed for binary classification problems, which makes them a prime candidate for entity resolution tasks, where the goal is to divide record pairs into two sets: matches and non-matches.

4.2 Relational Entity Resolution

Relational entity resolution approaches require that the data already has an inherent relational structure. These approaches exploit this structure to add more information to the entity resolution process and so improve classification accuracy. In the research surveyed here, relational information always improves entity resolution accuracy when compared to attribute-only techniques.

The simpler relational entity resolution techniques treat relational information as just another attribute compared between record pairs. These approaches are based on the attribute-based resolution process, but some of the attributes contain relational information. Relationship information is added to the comparison vector, and if two records share the same relationship then that attribute is scored as a match. Ananthakrishna et al. [3] describe a database-centric approach that exploits data hierarchies in the database as additional relational information. This information is also used to reduce the number of comparisons during the entity resolution process.
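As an illustration of treating a relationship as just another attribute, the following minimal Python sketch builds a comparison vector for a pair of author records. The record fields (`name`, `affiliation`, `coauthors`), the sample records and the use of `difflib` for string similarity are assumptions made for this example and are not taken from the surveyed work.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def comparison_vector(rec_a: Dict, rec_b: Dict) -> List[float]:
    """Compare two records attribute by attribute.

    The last component is the relational attribute: 1.0 if the two
    records share at least one co-author, 0.0 otherwise.
    """
    name_sim = string_similarity(rec_a["name"], rec_b["name"])
    affiliation_sim = string_similarity(rec_a["affiliation"], rec_b["affiliation"])
    shares_coauthor = 1.0 if rec_a["coauthors"] & rec_b["coauthors"] else 0.0
    return [name_sim, affiliation_sim, shares_coauthor]


# Invented sample records, for illustration only.
rec_a = {"name": "J. Smith", "affiliation": "MIT", "coauthors": {"A. Jones", "B. Lee"}}
rec_b = {"name": "John Smith", "affiliation": "MIT", "coauthors": {"B. Lee"}}
rec_c = {"name": "J. Smith", "affiliation": "MIT", "coauthors": {"C. Wu"}}

vector_ab = comparison_vector(rec_a, rec_b)  # relational component is 1.0
vector_ac = comparison_vector(rec_a, rec_c)  # relational component is 0.0
```

The resulting vector can be passed to any pairwise classifier in the same way as a purely attribute-based vector; the relational component simply adds one more dimension.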

Bhattacharya and Getoor [12] describe a more complete relational model with their collective entity resolution approach. They define the entity resolution problem as a clustering problem in which each cluster represents a unique entity. Clusters are merged based on their similarity, which is calculated with a similarity measure that combines relational similarities and attribute similarities. The authors have shown that this approach improves both on attribute-based entity resolution and on techniques that treat relationships as attributes.
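The clustering view can be sketched as a greedy agglomerative procedure. The code below is a simplified illustration and not the authors' algorithm: the weighted sum controlled by `alpha`, the Jaccard overlap of cluster neighbourhoods, the `name_similarity` rule, the merge threshold and the toy references are all assumptions chosen for this example.

```python
from itertools import combinations

# Toy references: id -> (name, paper id). All data here is invented.
REFERENCES = {
    "r1": ("A. Jones", "p1"),
    "r2": ("J. Smith", "p1"),
    "r3": ("A. Jones", "p2"),
    "r4": ("John Smith", "p2"),
    "r5": ("James Smith", "p3"),
    "r6": ("B. Lee", "p3"),
}


def name_similarity(a, b):
    """Hypothetical attribute similarity for 'First Last' names."""
    first_a, last_a = a.replace(".", "").split()
    first_b, last_b = b.replace(".", "").split()
    if last_a != last_b:
        return 0.0
    if first_a == first_b:
        return 1.0
    # An initial is compatible with a first name starting with that letter.
    if min(len(first_a), len(first_b)) == 1 and first_a[0] == first_b[0]:
        return 0.6
    return 0.0


def attribute_similarity(c1, c2):
    """Best name similarity over all cross-cluster reference pairs."""
    return max(name_similarity(REFERENCES[a][0], REFERENCES[b][0])
               for a in c1 for b in c2)


def neighbours(cluster, clusters):
    """Other clusters holding a reference that shares a paper with `cluster`."""
    papers = {REFERENCES[r][1] for r in cluster}
    return {c for c in clusters
            if c != cluster and any(REFERENCES[r][1] in papers for r in c)}


def relational_similarity(c1, c2, clusters):
    """Jaccard overlap of the two clusters' neighbourhoods."""
    n1, n2 = neighbours(c1, clusters), neighbours(c2, clusters)
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def resolve(alpha=0.5, threshold=0.45):
    """Repeatedly merge the most similar pair of clusters above the threshold."""
    clusters = {frozenset([r]) for r in REFERENCES}
    while True:
        best_sim, best_pair = 0.0, None
        # Sorting makes the iteration order, and so tie-breaking, deterministic.
        for c1, c2 in combinations(sorted(clusters, key=sorted), 2):
            sim = ((1 - alpha) * attribute_similarity(c1, c2)
                   + alpha * relational_similarity(c1, c2, clusters))
            if sim > best_sim:
                best_sim, best_pair = sim, (c1, c2)
        if best_pair is None or best_sim < threshold:
            return clusters
        c1, c2 = best_pair
        clusters = (clusters - {c1, c2}) | {c1 | c2}


collective = resolve(alpha=0.5)      # attribute and relational evidence
attribute_only = resolve(alpha=0.0)  # attribute evidence only
```

On this toy data, `resolve(alpha=0.5)` groups "J. Smith" with "John Smith" only, because both co-occur with the already merged "A. Jones" cluster. With `alpha=0.0` the ambiguous initial links all three Smith references into one cluster, which illustrates why combining the two kinds of similarity can help.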

4.3 Evaluation

Traditionally, the information retrieval measures of accuracy, precision, recall and the combined F-measure have been used to evaluate the quality of an entity resolution process [4]. Christen and Goiser provide a comprehensive overview of the main quality measures used in entity resolution [23].

In entity resolution, the ratio between the number of matches and the number of non-matches in a data set is commonly very unbalanced. True negatives typically make up the vast majority of the results, so if one were to blindly classify all record pairs as non-matches, a high accuracy score could still be achieved. For this reason, in the case of unbalanced data sets, any measure that includes the number of true negatives should be avoided [54].
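The effect of class imbalance on accuracy can be seen with a small calculation. The sketch below assumes a hypothetical data set of 1,000 records, giving 499,500 candidate pairs, of which 500 are true matches; these counts and the `evaluate` helper are invented for illustration.

```python
def evaluate(tp, fp, fn, tn):
    """Compute pairwise quality measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Classifier that labels every pair a non-match: no true or false positives.
all_non_matches = evaluate(tp=0, fp=0, fn=500, tn=499_000)

# Classifier that finds 400 of the 500 matches with 100 false positives.
reasonable = evaluate(tp=400, fp=100, fn=100, tn=498_900)
```

The trivial classifier reaches an accuracy above 0.99 while its recall and F-measure are both zero. Precision, recall and F-measure do not include the true negative count, so they are not inflated by the large number of non-matching pairs.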
