Table 3 Effect of entity resolution on the social network

Measure               Under-matched  Correct  Over-matched
network density       low            medium   high
number of components  high           medium   low
network distance      high           medium   low
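
The trends in the table can be reproduced on a toy network. The sketch below is only an illustration under stated assumptions: the mention-level ties, the three resolution mappings, and the helper names (build_network, measures) are all hypothetical, and network distance is taken here to mean the average shortest-path length over connected pairs of nodes, which is one of several possible conventions.

```python
from collections import deque

# Hypothetical ties observed between raw mentions (e.g. two name records
# appearing together). "a1"/"a2" and "h1"/"h2" are duplicate mentions.
MENTION_EDGES = [
    ("a1", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
    ("e", "f"), ("f", "a2"), ("h1", "g"), ("h2", "i"),
]

# Hypothetical resolutions: mention -> entity id. Mentions that are not
# listed are kept as their own entity.
RESOLUTIONS = {
    "under-matched": {},  # no duplicates merged
    "correct": {"a1": "A", "a2": "A", "h1": "H", "h2": "H"},
    # wrongly merges the distinct entities A, d and H into one node
    "over-matched": {"a1": "X", "a2": "X", "d": "X", "h1": "X", "h2": "X"},
}


def build_network(edges, mapping):
    """Collapse mentions into entities; drop self-loops and duplicate ties."""
    adj = {}
    for u, v in edges:
        u, v = mapping.get(u, u), mapping.get(v, v)
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def measures(adj):
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    seen, components, total, ordered_pairs = set(), 0, 0, 0
    for node in adj:
        dist = bfs_distances(adj, node)
        if node not in seen:  # first node seen from a new component
            components += 1
            seen.update(dist)
        total += sum(dist.values())
        ordered_pairs += len(dist) - 1
    # Each connected pair is counted twice in both sums, so the ratio holds.
    distance = total / ordered_pairs if ordered_pairs else 0.0
    return {"density": density, "components": components, "distance": distance}


results = {
    name: measures(build_network(MENTION_EDGES, mapping))
    for name, mapping in RESOLUTIONS.items()
}
for name, r in results.items():
    print(f"{name:14s} density={r['density']:.3f} "
          f"components={r['components']} distance={r['distance']:.3f}")
```

On this particular example the under-matched, correct and over-matched networks have three, two and one components respectively, density rises and average distance falls in the same order. The ordering of the distance measure depends on the convention chosen for disconnected pairs, so it should not be expected on every network.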
3 Entity Resolution
The problem of identifying multiple records that refer to the same entity was
recognised more than six decades ago. The first definition of the problem came
from H. L. Dunn [28], who used the term record linkage to describe it. Later,
geneticist Howard Newcombe proposed some key approaches, including matching
methods, that are still in use in today's systems [45]. The seminal paper by
Fellegi and Sunter [32] from the statistics community formally defined record
linkage, building on prior work by Newcombe. Although the problem is well
understood and has received considerable attention within the research and
development community, it is still considered one of data mining's grand
challenges [47].
In computer science the same problem spans many different research communities,
often under different names. In the database and KDD communities the problem is
often called the merge/purge, data cleansing or duplicate elimination [34]
problem. In this context, the aim is to identify which tuples, within the same
table or across different tables, correspond to the same real-world object.
Computer scientists and AI practitioners refer to the problem as entity
resolution. In computer vision the term correspondence problem [50] is used to
describe the identification of features belonging to the same object in two
different images. The problem has also been an open topic in Natural Language
Processing, under the term coreference resolution. In NLP, coreference
resolution is part of information extraction, where different names in
free-form text that refer to the same entity need to be identified as such.
The Message Understanding Conferences (MUCs) sponsored by DARPA aided the
definition and evaluation of coreference resolution by introducing coreference
tasks in the yearly challenge after its 6th conference [36].
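
The database formulation above, deciding which tuples in a table correspond to the same real-world object, can be sketched in a few lines. This is a minimal illustration, not any of the cited methods: the records, the token-based Jaccard similarity, the 0.6 threshold and the function names are all assumptions chosen for the example.

```python
import re
from itertools import combinations

# Hypothetical table of (name, city) tuples; rows 0-1 and 2-3 are duplicates.
RECORDS = [
    ("John A. Smith", "Boston"),
    ("Smith, John", "Boston"),
    ("Jane Smith", "Boston"),
    ("JANE SMITH", "boston"),
    ("Maria Garcia", "Austin"),
]


def tokens(record):
    """Lower-cased alphanumeric tokens of all fields, ignoring order."""
    return set(re.findall(r"[a-z0-9]+", " ".join(record).lower()))


def jaccard(a, b):
    """Share of tokens the two records have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve(records, threshold=0.6):
    """Group row indices whose pairwise similarity reaches the threshold."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    toks = [tokens(r) for r in records]
    for i, j in combinations(range(len(records)), 2):
        if jaccard(toks[i], toks[j]) >= threshold:
            parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


clusters = resolve(RECORDS)
print(clusters)
```

Comparing every pair of rows is quadratic in the table size, which is acceptable for a sketch but is the reason practical systems restrict comparisons to candidate pairs. Raising or lowering the assumed threshold moves the result towards under-matching or over-matching.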
Entity resolution has been applied and documented in several domains. The first
applications were on medical data [45], and since then more than a thousand
references to articles on the subject have been published in the medical
literature [24]. Significant studies on US census data have been conducted by
Winkler [58], and the approach has been applied by the national statistics
bodies of other countries [53]. Entity resolution can also be used to identify
fraud. For example, matching employment records against disability claim
records can uncover cases of disability compensation fraud [37]. Other examples
include deduplicating lists of potential customer names for direct marketing
[34] and deduplicating search results in meta-search engines [13].
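
As a rough illustration of the fraud example, the sketch below links two hypothetical tables on a normalised name and date of birth, then flags claims whose period overlaps a period of employment for the same person. The field names, the sample rows, the linking key and the overlap rule are all assumptions made for this example and are not taken from the cited study.

```python
import re
from datetime import date


def link_key(name, dob):
    """Order- and case-insensitive name tokens plus date of birth."""
    return (tuple(sorted(re.findall(r"[a-z]+", name.lower()))), dob)


def overlaps(a, b):
    """True when the two closed date intervals share at least one day."""
    return a["start"] <= b["end"] and b["start"] <= a["end"]


# Hypothetical employment records.
employment = [
    {"id": "E1", "name": "Smith, John", "dob": date(1970, 1, 2),
     "start": date(2009, 3, 1), "end": date(2010, 6, 30)},
    {"id": "E2", "name": "Maria Garcia", "dob": date(1982, 5, 9),
     "start": date(2005, 1, 1), "end": date(2007, 12, 31)},
]

# Hypothetical disability claims.
claims = [
    {"id": "C1", "name": "John Smith", "dob": date(1970, 1, 2),
     "start": date(2010, 1, 1), "end": date(2010, 12, 31)},
    {"id": "C2", "name": "GARCIA, MARIA", "dob": date(1982, 5, 9),
     "start": date(2009, 1, 1), "end": date(2009, 12, 31)},
]

# Index employment records by linking key so each claim is looked up once.
by_key = {}
for job in employment:
    by_key.setdefault(link_key(job["name"], job["dob"]), []).append(job)

# A claim is flagged when the same person was employed during the claim.
flagged = [
    (claim["id"], job["id"])
    for claim in claims
    for job in by_key.get(link_key(claim["name"], claim["dob"]), [])
    if overlaps(claim, job)
]
print(flagged)
```

Both claims link to an employment record despite the differently formatted names, but only the first is flagged, because the second person's employment ended before the claim began. A flagged pair is a candidate for review rather than evidence of fraud, since an exact key of this kind can both miss true matches and link different people.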
 