opportunities to disagree as well. A customer might give their home phone to A and their
cell phone to B. Or they might move, and tell B but not A (because they no longer had need
for a relationship with A). Area codes of phones sometimes change.
The strategy for identifying records involved scoring the differences in three fields:
name, address, and phone. To create a score describing the likelihood that two records,
one from A and the other from B, described the same person, 100 points was assigned to
each of the three fields, so records with exact matches in all three fields got a score of
300. However, there were deductions for mismatches in each of the three fields. As a first
approximation, edit-distance (Section 3.5.5) was used, but the penalty grew quadratically
with the distance. Then, certain publicly available tables were used to reduce the penalty in
appropriate situations. For example, “Bill” and “William” were treated as if they differed
in only one letter, even though their edit-distance is 5.
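The scoring scheme just described can be sketched roughly as follows. This is a minimal illustration, not the scoring actually used: the text gives no penalty constants, so `penalty_per_unit` is an assumed value, and the `NICKNAMES` table is a hypothetical one-entry stand-in for the publicly available tables. Edit distance is taken to be the insert/delete-only form, len(a) + len(b) - 2·LCS(a, b), which gives 5 for "Bill" and "William" as stated above.

```python
# Hypothetical stand-in for the "publicly available tables" of name variants.
NICKNAMES = {frozenset(("bill", "william"))}


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Insert/delete-only edit distance between a and b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def field_score(a: str, b: str, penalty_per_unit: int = 4) -> int:
    """Score one field out of 100; the deduction grows with distance squared.

    penalty_per_unit is an assumed constant, not a value from the text.
    """
    a, b = a.strip().lower(), b.strip().lower()
    if frozenset((a, b)) in NICKNAMES:
        d = 1  # known variants are treated as differing in one letter
    else:
        d = edit_distance(a, b)
    return max(0, 100 - penalty_per_unit * d * d)


def record_score(rec_a: dict, rec_b: dict) -> int:
    """Sum the three field scores; exact matches in all fields give 300."""
    return sum(field_score(rec_a[f], rec_b[f])
               for f in ("name", "address", "phone"))
```

With these assumed constants, "Bill" versus "William" loses only 4 points instead of the 100 that a squared distance of 25 would cost.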
However, it is not feasible to score all one trillion pairs of records. Thus, a simple LSH
was used to focus on likely candidates. Three “hash functions” were used. The first sent
records to the same bucket only if they had identical names; the second did the same but
for identical addresses, and the third did the same for phone numbers. In practice, there was
no hashing; rather the records were sorted by name, so records with identical names would
appear consecutively and get scored for overall similarity of the name, address, and phone.
Then the records were sorted by address, and those with the same address were scored.
Finally, the records were sorted a third time by phone, and records with identical phones
were scored.
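The three sorting passes can be sketched as below. This is an illustration under assumptions: records are represented as dictionaries with `name`, `address`, and `phone` keys, and the function name `candidate_pairs` is made up for this example. Each pass sorts the combined records on one field, then pairs every A record with every B record in the same run of identical values.

```python
from itertools import groupby

FIELDS = ("name", "address", "phone")


def candidate_pairs(records_a, records_b):
    """Return index pairs (i, j) where records_a[i] and records_b[j]
    agree exactly on at least one of the three fields."""
    # Tag each record with its source so A-B pairs can be separated later.
    tagged = [("A", i, r) for i, r in enumerate(records_a)]
    tagged += [("B", j, r) for j, r in enumerate(records_b)]

    pairs = set()
    for field in FIELDS:
        def key(item, field=field):
            return item[2][field]

        # Sorting puts records with identical field values next to each other.
        for _, run in groupby(sorted(tagged, key=key), key=key):
            run = list(run)
            a_ids = [i for src, i, _ in run if src == "A"]
            b_ids = [j for src, j, _ in run if src == "B"]
            pairs.update((i, j) for i in a_ids for j in b_ids)
    return pairs
```

Collecting pairs in a set means a pair that agrees on two or three fields is scored once rather than once per pass; that deduplication is a choice made here, not something the text specifies.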
This approach missed any record pair that truly represented the same person but matched
exactly in none of the three fields. Since the goal was to prove in a court of law that the
persons were the same, it is unlikely that such a pair would have been accepted by a judge
as sufficiently similar anyway.
When Are Record Matches Good Enough?
While every case will be different, it may be of interest to know how the experiment of Section 3.8.3 turned out on
the data of Section 3.8.2. For scores down to 185, the value of x was very close to 10; i.e., these scores indicated that
the likelihood of the records representing the same person was essentially 1. Note that a score of 185 in this example
represents a situation where one field is the same (as would have to be the case, or the records would never even be
scored), one field is completely different, and the third field has a small discrepancy. Moreover, for scores as low as
115, the value of x was noticeably less than 45, meaning that some of these pairs did represent the same person. Note
that a score of 115 represents a case where one field is the same, but there is only a slight similarity in the other two
fields.
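The box reads x against two reference values: near 10, essentially every pair is a true match, and at 45 none would be. One way to turn an observed x into an estimated fraction of true matches is sketched below. It assumes that x is an average that mixes linearly between those two values; that assumption, and the function name `estimated_true_fraction`, are illustrative and not taken from the text above.

```python
def estimated_true_fraction(x: float,
                            x_true: float = 10.0,
                            x_random: float = 45.0) -> float:
    """Estimate the fraction of true matches from an observed average x.

    Assumes x = f * x_true + (1 - f) * x_random and solves for f,
    clamping the result to the range [0, 1].
    """
    f = (x_random - x) / (x_random - x_true)
    return min(1.0, max(0.0, f))
```

Under this assumption, a value of x halfway between 10 and 45 would suggest that about half of the pairs at that score are true matches.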