does not miss any possible matches. If the number of blocks is too small, for example when choosing gender as the blocking key, the number of records in each block will be too large, resulting in many unnecessary comparisons. If, on the other hand, the blocks are too small, for example when selecting a passport number as the key, then potential errors in the data can cause true duplicate matches to be missed.
One disadvantage with this blocking procedure is that typing errors in the key
will result in potentially matching records being missed, since these records are
separated into different blocks. The number of missing values in a field should also be taken into consideration. If a field has many missing values, then the records with those missing values will not form part of any block, reducing the likelihood that duplicates are matched.
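To make the mechanics concrete, the following is a minimal sketch (the dict-based record representation and the field names are illustrative assumptions, not taken from the text): records are grouped by the value of a chosen blocking key, candidate pairs are formed only within each block, and records whose blocking value is missing never enter any block, illustrating the problem described above.

```python
from collections import defaultdict
from itertools import combinations

def build_blocks(records, blocking_key):
    """Group records by the value of `blocking_key`.

    Records with a missing blocking value are left out, so they never
    reach the comparison step at all.
    """
    blocks = defaultdict(list)
    for record in records:
        value = record.get(blocking_key)
        if value:
            blocks[value].append(record)
    return blocks

def candidate_pairs(blocks):
    """Yield record pairs for comparison; pairs are formed only within a block."""
    for block in blocks.values():
        yield from combinations(block, 2)

# Example: blocking on the first letter of the surname (hypothetical data).
records = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smyth"},
    {"id": 3, "surname": "Jones"},
    {"id": 4, "surname": None},  # missing value: falls into no block
]
keyed = [dict(r, key=(r["surname"] or "")[:1].upper() or None) for r in records]
blocks = build_blocks(keyed, "key")
print(list(candidate_pairs(blocks)))  # only Smith/Smyth are compared
```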
To mitigate the effect of typing errors, phonetic encodings or string functions
can be applied to the chosen keys [20]. A substring function that extracts the first character of the name field, for example, places all names starting with the same letter in the same block. Phonetic encodings convert a string of characters into a code representing the pronunciation of the word. Such encodings are by definition language dependent, so the encoding must be chosen to suit the language of the data. The oldest and best known English-based phonetic encoding is Soundex [42], which converts a string into its first character followed by a sequence of digits derived from an encoding table. Phonex [41] and Phonix [33] are two variations on Soundex that attempt to improve the encoding scheme by applying more transformations to the words.
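The sketch below is a simplified Soundex encoder (it omits the special handling of "h" and "w" between letters with the same code, so it approximates rather than exactly reproduces the canonical definition); names with similar pronunciation map to the same code and therefore to the same block.

```python
# Soundex digit for each consonant; vowels, h, w and y carry no digit.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    **dict.fromkeys("l", "4"),
    **dict.fromkeys("mn", "5"),
    **dict.fromkeys("r", "6"),
}

def soundex(name: str) -> str:
    """Return a simplified 4-character Soundex code for `name`."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0].upper()
    # Encode every letter; letters without a digit map to "".
    codes = [SOUNDEX_CODES.get(ch, "") for ch in letters]
    # Keep a digit only if it differs from the previous letter's code, so
    # runs of the same digit collapse while vowels separate repeated digits.
    kept, prev = [], None
    for code in codes:
        if code and code != prev:
            kept.append(code)
        prev = code
    # The first letter is kept as a letter, so drop its own digit.
    if kept and kept[0] == SOUNDEX_CODES.get(letters[0], ""):
        kept = kept[1:]
    return (first + "".join(kept) + "000")[:4]

# e.g. soundex("Robert") == soundex("Rupert") == "R163"
```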
In order to evaluate blocking algorithms, the measures of pair completeness and reduction ratio [8] are typically used. Pair completeness measures the number of true matching pairs identified by the algorithm compared with the true number of duplicates that exist in the whole dataset. The reduction ratio measures the reduction in the number of comparisons achieved by using the blocking algorithm.
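As an illustrative formulation (the symbols below are not defined in the text and are introduced here as assumptions): if $s_M$ denotes the number of true matching pairs that survive blocking, $n_M$ the total number of true matching pairs in the dataset, $s_C$ the number of candidate pairs generated by blocking, and $n_C$ the total number of possible record pairs without blocking, then

$$\mathrm{PC} = \frac{s_M}{n_M}, \qquad \mathrm{RR} = 1 - \frac{s_C}{n_C}.$$

A higher pair completeness means fewer true duplicates are lost by blocking, and a higher reduction ratio means more comparisons are avoided.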
4.1.3 Field Comparison
After the blocks of records have been identified, the record pairs need to be com-
pared to determine the similarity between pairs. Depending on the classification
algorithm used to classify the records, the output of each field comparison can be
binary or a continuous measure of distance, typically between 0 and 1.
The comparison functions depend on the type of data contained in the fields. Most of the data involved in an entity resolution process is string data, so string distance algorithms are often used to measure the similarity of fields [44].
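As one concrete possibility (a sketch, not a method prescribed by the text), a field comparison can be derived from edit distance and normalized into the $[0, 1]$ range so that it can feed either a threshold-based or a probabilistic classifier.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]

def string_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# e.g. string_similarity("jon", "john") == 0.75
```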
String matching functions in the context of name matching for entity resolution have been studied by Christen and by Cohen et al. [20, 25]. Entity resolution processes involving person entities typically contain personal name fields that have to be compared. Personal names can have characteristics that differ from general text, such as multiple spellings of the same name, initial and middle-name abbreviations, and shortened names. Variation in name spelling can be considered a special case of misspelling; however, names sometimes change completely when they are shortened. Generic string comparison algorithms typically do not cater for the worst of these variations.