these cases. More complex multi-lingual cases, such as John being used interchangeably with Sean (Irish) or Jean (French), are often simply ignored.
One of the best name matching algorithms, in terms of performance and robustness, identified by both Cohen and Christen in their separate studies is the Jaro-Winkler algorithm [48]. This is an extension of the algorithm Jaro proposed in [38]. The Jaro-Winkler algorithm starts with the computation of the Jaro measure, then adjusts the value to reward agreement in the prefix and to reduce the disagreement penalty for characters that look similar, such as "1" and "l".
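The two-stage computation described above can be sketched as follows. This is a minimal illustration of the Jaro measure with Winkler's prefix boost, using the commonly cited scaling factor p = 0.1 and a prefix length capped at four; it omits the similar-character adjustment mentioned above.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    # Count characters of s1 that match a character of s2
    # within the sliding window.
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count half-transpositions: matched characters that appear
    # in a different order in the two strings.
    k = t = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro measure boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

For example, "MARTHA" and "MARHTA" share all six characters with one transposition and a three-character common prefix, giving a Jaro-Winkler value of roughly 0.961.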
Field comparison functions for numeric values are not as advanced as those for
string functions [30]. Numeric fields can be treated as strings and compared using
string distance functions. Alternatively, the percentage difference between the fields can be used as a normalised difference measure [22].
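A simple way to turn the percentage difference into a normalised similarity is sketched below; the exact normalisation used in [22] may differ, and the choice of denominator here (the larger absolute value) is one common convention.

```python
def numeric_similarity(a: float, b: float) -> float:
    """Normalised similarity in [0, 1] based on percentage difference."""
    if a == b:
        return 1.0
    denom = max(abs(a), abs(b))
    if denom == 0:
        return 0.0
    # Percentage difference relative to the larger magnitude,
    # clamped so the result never drops below zero.
    return max(0.0, 1.0 - abs(a - b) / denom)
```

Under this convention, comparing 100 with 90 yields a similarity of 0.9, while identical values yield 1.0.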
4.1.4
Classification
Once field comparison is complete, each pair of records has to be classified as ei-
ther a match or a non-match. The first approaches to classification came from the
statistics community and relied on probability theory to estimate the probability of a record pair being a match. Fellegi and Sunter [32] made two main contributions to entity recognition: the calculation of field weights based on the information quality of the field, and the definition of thresholds to classify record pairs into three classes.
Before records can be classified, potentially identical records need to be com-
pared based on their fields, however not all fields contribute equally to the final de-
cision of whether a pair is a match or not. For example, a match on identical names
is usually quite significant in identifying matching records, however a match on the
date of birth or sex of a person can be less significant. In order to quantify the importance of a field, each field can be weighted according to its importance, with more discriminating fields receiving a higher weight. To determine the field weights, Fellegi and Sunter proposed the use of two probabilities, m and u, that determine the agreement and disagreement weights of the individual fields.
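In the Fellegi-Sunter model, m is the probability that a field agrees given that the pair is a true match, and u is the probability that it agrees given a non-match; the standard agreement and disagreement weights are the base-2 log-likelihood ratios. A minimal sketch, where the example m and u values for a surname field are purely illustrative:

```python
import math

def field_weights(m: float, u: float) -> tuple[float, float]:
    """Fellegi-Sunter agreement/disagreement weights for one field.

    m: P(field agrees | records are a true match)
    u: P(field agrees | records are a non-match)
    """
    agree = math.log2(m / u)              # positive when m > u
    disagree = math.log2((1 - m) / (1 - u))  # negative when m > u
    return agree, disagree

# Illustrative values: surnames agree for 95% of true matches
# but only 1% of non-matching pairs.
agree_w, disagree_w = field_weights(m=0.95, u=0.01)
```

With these illustrative values the agreement weight is about +6.57 and the disagreement weight about -4.31, reflecting that agreement on a discriminating field is strong evidence for a match.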
Once the comparison of each of the attributes is complete, the total agreement
weight can be calculated to determine the value of the weight vector. To classify
the records into the three different sets, two cutoff thresholds must be defined. The
upper threshold defines all the pairs that are matches, the lower threshold defines
pairs that are not matches, and the records that fall in between are possible matches
that could be manually reviewed if necessary. In practice, the two thresholds can be
determined empirically based on the specific data set.
The probabilistic model of Fellegi and Sunter was subsequently revised and im-
proved by other researchers [38, 57]. Subsequent approaches used rule bases written
with the help of domain experts to classify records [34]. Elmagarmid et al. [30] provide a comprehensive overview of the individual classification algorithms that fall
into the above broad classes.
Availability of training data and advances in machine learning brought about the
use of machine learning techniques to tackle the problem. The current state of the