art in classification [21] uses Support Vector Machines (SVMs) to train models and
classify records when training examples are available. SVMs have been successfully
applied to several classification domains, such as handwriting recognition, facial
expression classification and text categorisation [18]. SVMs were originally
designed for binary classification problems, which makes them a prime candidate
for entity resolution tasks, where the goal is to divide record pairs into two sets:
matches and non-matches.
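To make the binary formulation concrete, the following is a minimal sketch of a linear
SVM trained on comparison vectors of record pairs. It is an illustration under stated
assumptions, not the method of [21]: the two features (name and address similarity),
the training data, and the hyperparameters are all hypothetical, and plain sub-gradient
descent on the hinge loss stands in for the solver of a dedicated SVM library.

```python
def train_linear_svm(vectors, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Fit a linear SVM (weights w, bias b) by sub-gradient descent on the hinge loss."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularisation step: shrink the weights slightly.
            w = [wi * (1.0 - learning_rate * reg) for wi in w]
            if margin < 1.0:
                # Pair is misclassified or inside the margin: move the boundary.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def classify(w, b, x):
    """Return +1 (match) or -1 (non-match) for a comparison vector x."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical comparison vectors: (name similarity, address similarity).
training_vectors = [(0.9, 0.8), (1.0, 0.9), (0.8, 1.0),   # labelled matches
                    (0.1, 0.2), (0.2, 0.0), (0.3, 0.1)]   # labelled non-matches
training_labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(training_vectors, training_labels)
```

Calling `classify(w, b, (0.95, 0.9))` then assigns an unseen record pair to one of the
two classes. A production system would more likely rely on an established SVM
implementation and a richer set of similarity features.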
4.2 Relational Entity Resolution
Relational entity resolution approaches require that the data already have an inherent
relational structure. They exploit this structure to add information to the entity
resolution process and so improve classification accuracy. In the research surveyed
here, relational information always improves entity resolution accuracy compared
with attribute-only techniques.
The simpler relational entity resolution techniques treat relational information as
just another attribute of a record pair. These approaches are based on the
attribute-based resolution process, but some of the attributes contain relational
information. Relationship information is added to the comparison vector, and if two
records share the same relationship, that attribute is scored as a match.
Ananthakrishna et al. [3] describe a database-centric approach that exploits data
hierarchies in the database as additional relational information. This information is
also used to reduce the number of comparisons during the entity resolution process.
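The idea of a relational attribute in the comparison vector can be sketched as follows.
This is a minimal illustration, not the approach of [3]: the record fields (`name`,
`city`, `employer_id`) are hypothetical, and the standard-library `difflib` ratio is
used only as a convenient stand-in for a proper string similarity measure.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity in [0, 1] between two strings (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def comparison_vector(rec_a, rec_b):
    """Build a comparison vector with two ordinary attributes and one relational one."""
    name_sim = string_similarity(rec_a["name"], rec_b["name"])
    city_sim = string_similarity(rec_a["city"], rec_b["city"])
    # Relational attribute: 1.0 if both records are linked to the same related
    # entity (here a hypothetical employer), otherwise 0.0.
    same_employer = 1.0 if rec_a["employer_id"] == rec_b["employer_id"] else 0.0
    return [name_sim, city_sim, same_employer]


# Hypothetical records for illustration only.
rec_1 = {"name": "Mary Jones", "city": "Dublin", "employer_id": "E17"}
rec_2 = {"name": "M. Jones", "city": "Dublin", "employer_id": "E17"}
rec_3 = {"name": "Mary Jones", "city": "Dublin", "employer_id": "E42"}

vector_shared = comparison_vector(rec_1, rec_2)    # relational attribute is 1.0
vector_distinct = comparison_vector(rec_1, rec_3)  # relational attribute is 0.0
```

The resulting vectors can be passed to the same classifier used for attribute-only
resolution, which is what makes this family of techniques simple to adopt.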
Bhattacharya and Getoor [12] describe a more complete relational model with their
collective entity resolution approach. They define entity resolution as a clustering
problem in which each cluster represents a unique entity. Clusters are merged based
on their similarity, which is calculated with a measure that combines relational and
attribute similarities. The authors have shown that this approach improves on both
attribute-based entity resolution and techniques that treat relationships as
attributes.
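The clustering view can be illustrated with a small greedy merging loop. This is a
sketch under several assumptions and does not reproduce the measures of [12]: the
weighted sum with `alpha`, the merge `threshold`, the toy name similarity, and the
Jaccard overlap of neighbouring clusters are all illustrative choices, as are the
reference identifiers and names.

```python
from itertools import combinations


def attribute_similarity(name_a, name_b):
    """Toy measure: 1.0 if identical, 0.5 if same surname and first initial, else 0.0."""
    if name_a == name_b:
        return 1.0
    parts_a, parts_b = name_a.split(), name_b.split()
    if parts_a[-1] == parts_b[-1] and parts_a[0][0] == parts_b[0][0]:
        return 0.5
    return 0.0


def jaccard(set_a, set_b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def collective_resolution(names, links, alpha=0.5, threshold=0.5):
    """Greedily merge clusters while the best combined similarity reaches the threshold."""
    cluster_of = {ref: ref for ref in names}   # every reference starts as its own cluster
    members = {ref: {ref} for ref in names}
    linked = {ref: set() for ref in names}
    for a, b in links:
        linked[a].add(b)
        linked[b].add(a)

    def neighbourhood(cluster):
        # Clusters currently containing the references linked to this cluster's members.
        return {cluster_of[n] for ref in members[cluster] for n in linked[ref]}

    def similarity(c1, c2):
        attr = max(attribute_similarity(names[a], names[b])
                   for a in members[c1] for b in members[c2])
        rel = jaccard(neighbourhood(c1), neighbourhood(c2))
        return (1 - alpha) * attr + alpha * rel

    while True:
        best = max(((similarity(c1, c2), c1, c2)
                    for c1, c2 in combinations(sorted(members), 2)), default=None)
        if best is None or best[0] < threshold:
            break
        _, keep, drop = best
        members[keep] |= members.pop(drop)
        for ref in members[keep]:
            cluster_of[ref] = keep
    return sorted(sorted(group) for group in members.values())


# Hypothetical author references; links are co-authorships.
names = {"r1": "J. Smith", "r2": "A. Jones", "r3": "John Smith",
         "r4": "A. Jones", "r5": "Jane Smith", "r6": "B. Lee"}
links = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
clusters = collective_resolution(names, links)
# clusters == [["r1", "r3"], ["r2", "r4"], ["r5"], ["r6"]]
```

In this example the attribute similarity alone cannot tell whether "J. Smith" is
"John Smith" or "Jane Smith". Once the two "A. Jones" references are merged, the shared
co-author cluster raises the combined similarity of "J. Smith" and "John Smith" above
the threshold, which is the collective effect the approach relies on.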
4.3 Evaluation
Traditionally, the information retrieval measures of accuracy, precision, recall and a
combined F-measure score have been used to evaluate the quality of an entity
resolution process [4]. Christen and Goiser provide a comprehensive overview of the
main quality measures used in entity resolution [23].
In entity resolution, the number of non-matches in a data set commonly far exceeds
the number of matches. True negatives typically make up the vast majority of the
results, so blindly classifying all record pairs as non-matches can still achieve a
high accuracy score. For this reason, with unbalanced data sets, any measure that
counts true negatives (such as accuracy) should be avoided [54].
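The effect of class imbalance on these measures can be checked with a short
calculation. The counts below are hypothetical: 10,000 candidate record pairs, of
which 100 are true matches, and a classifier that labels every pair as a non-match.

```python
def evaluate(tp, fp, fn, tn):
    """Compute the standard measures from the counts of a confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Every pair classified as a non-match: no true or false positives.
all_negative = evaluate(tp=0, fp=0, fn=100, tn=9900)
# accuracy is 0.99, while precision, recall and f_measure are all 0.0
```

Accuracy rewards the 9,900 true negatives and so reports 99% for a classifier that
finds no matches at all, whereas precision, recall and the F-measure ignore true
negatives and expose the failure.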
 