EIGHT YEARS USING GRIDS FOR LIFE SCIENCES - Collaborative Computational Technologies for Biomedical Research


15.4.2.5 Patient Data Linkage

Data linkage is more than a simple string comparison. The two main problems are ambiguity in a patient's identifying information (homonyms, same address, equivalent birthdates) and errors in names. Three levels of errors appear:

• Typographical errors (despite known spelling)

• Cognitive errors (comprehension problem)

• Phonetic errors (similar spelling)

The errors and variations mainly arise from the typing of handwritten data, keyboard neighbors (k-i, e-r, etc.), data input during a telephone conversation, and software or database limitations on input fields (length limits) that force the use of abbreviations or initials. Several matching techniques aim to measure similarity between strings. Two different approaches can be adopted:

• Pattern matching for flexible matching between two strings

• A combination of phonetic encoding and exact matching
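The two approaches can be illustrated with a minimal Python sketch. It is not the implementation used in the project. The pattern-matching side is a plain Jaro-Winkler similarity, and the phonetic side uses American Soundex as a stand-in encoding, an assumption made only for illustration. The function names and the 0.1 prefix scale are likewise illustrative choices.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]: 1.0 for identical strings, 0.0 for no match."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


# American Soundex letter groups (stand-in for a French phonetic encoder).
_SOUNDEX_CODES = {}
for _digit, _letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                         ("4", "L"), ("5", "MN"), ("6", "R")):
    for _letter in _letters:
        _SOUNDEX_CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Four-character Soundex code; non A-Z characters are ignored."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def phonetic_match(a: str, b: str) -> float:
    """Phonetic encoding followed by exact matching: 1.0 or 0.0."""
    code_a, code_b = soundex(a), soundex(b)
    return 1.0 if code_a and code_a == code_b else 0.0
```

For instance, `jaro_winkler("MARTHA", "MARHTA")` evaluates to about 0.961, while `phonetic_match("DUPONT", "DUPOND")` returns 1.0 because both names encode to the same Soundex key. The first approach yields a graded score that tolerates typographical errors; the second yields an all-or-nothing answer that tolerates spelling variants with the same pronunciation.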

The similarity measurement is generally normalized: two strings that are equivalent have a score of 1, and two strings that are totally different have a score of 0.

The efficiency of the solution affects the percentage of automatic matching. This ratio must be as high as possible while keeping the level of false positives low. For this linkage process, a combination of the Jaro-Winkler [21] and Phonex [22] (French) algorithms is used. Different weights are attributed according to the relevance and accuracy of the information in the data set.

For each field, four different criteria define how to interpret matching scores according to field types:

• Accuracy, which defines the relevance of information

• Blocking, in case of false matching (under threshold), where the correspondence would be automatically rejected

• Weight (similar), which represents a factor attributed in case of similarity

(over threshold)

• Weight (different), which, in case of false matching, is a divisor applied to the global similarity

The distinction between the weights for similar and different matches is necessary. For example, the probability that a single patient has different last names across distributed databases is small, so a mismatch on the last name considerably reduces the chance of a match. However, two entries with the same address do not necessarily refer to the same patient. Table 15.1 summarizes the proposed criteria adjustments for automatic record linkage. A weight factor is attributed to each field and is submitted as input to the linkage algorithm.
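One possible reading of how the four criteria combine into a global score is sketched below in Python. This is an interpretation under stated assumptions, not the linkage algorithm itself: the `threshold` field, the use of accuracy as a multiplier on the similar weight, the way the divisor is applied, and every numeric value are hypothetical and do not come from Table 15.1.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class FieldCriteria:
    accuracy: float          # relevance of the field (assumed multiplier, 0..1)
    blocking: bool           # under threshold => reject the pair outright
    weight_similar: float    # factor applied when the score is over threshold
    weight_different: float  # divisor (>= 1) applied to the global similarity
    threshold: float = 0.85  # hypothetical cut-off between similar/different


def link_score(scores: Mapping[str, float],
               criteria: Mapping[str, FieldCriteria]) -> float:
    """Combine normalized per-field scores into one global score in [0, 1]."""
    weighted_sum = 0.0
    total_weight = 0.0
    divisor = 1.0
    for field, crit in criteria.items():
        score = scores[field]
        weight = crit.accuracy * crit.weight_similar
        total_weight += weight
        if score >= crit.threshold:
            weighted_sum += weight * score      # similar: weighted contribution
        elif crit.blocking:
            return 0.0                          # blocking field: automatic reject
        else:
            divisor *= crit.weight_different    # different: penalize globally
    if total_weight == 0.0:
        return 0.0
    return (weighted_sum / total_weight) / divisor


# Hypothetical settings: a last-name mismatch is heavily penalized, an
# address match counts for little, and a birthdate mismatch blocks the pair.
EXAMPLE_CRITERIA = {
    "last_name": FieldCriteria(1.0, False, 3.0, 4.0),
    "address": FieldCriteria(0.5, False, 1.0, 1.0),
    "birthdate": FieldCriteria(1.0, True, 2.0, 1.0),
}
```

With these illustrative settings, a pair that differs only on the last name scores far lower than a pair that differs only on the address, which mirrors the asymmetry described above between the similar and different weights.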
