Finding Similar Items - Mining of Massive Datasets


opportunities to disagree as well. A customer might give their home phone to A and their

cell phone to B. Or they might move, and tell B but not A (because they no longer had need

for a relationship with A). Area codes of phones sometimes change.

The strategy for identifying records involved scoring the differences in three fields:

name, address, and phone. To create a score describing the likelihood that two records,

one from A and the other from B, described the same person, 100 points were assigned to

each of the three fields, so records with exact matches in all three fields got a score of

300. However, there were deductions for mismatches in each of the three fields. As a first

approximation, edit-distance (Section 3.5.5) was used, but the penalty grew quadratically

with the distance. Then, certain publicly available tables were used to reduce the penalty in

appropriate situations. For example, “Bill” and “William” were treated as if they differed

in only one letter, even though their edit-distance is 5.
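
The scoring scheme above can be sketched as follows. This is a minimal illustration, not the scoring actually used: the text does not give the penalty coefficient, so `penalty_per_unit` is a hypothetical parameter, and `NICKNAMES` is a hypothetical stand-in for the publicly available tables. The edit distance counts insertions and deletions only, which is consistent with the stated distance of 5 between "Bill" and "William".

```python
# Hypothetical nickname table standing in for the "publicly available tables";
# any pair listed here is treated as if it were at edit distance 1.
NICKNAMES = {frozenset({"bill", "william"})}

FIELDS = ("name", "address", "phone")


def edit_distance(s: str, t: str) -> int:
    """Insert/delete edit distance: len(s) + len(t) - 2 * LCS(s, t)."""
    prev = [0] * (len(t) + 1)  # LCS lengths for the previous row
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(s) + len(t) - 2 * prev[-1]


def field_score(a: str, b: str, penalty_per_unit: float = 2.0) -> float:
    """Start at 100 points and deduct a penalty quadratic in the distance.

    penalty_per_unit is an assumed value chosen only for illustration.
    """
    a, b = a.strip().lower(), b.strip().lower()
    if frozenset({a, b}) in NICKNAMES:
        d = 1  # nickname pairs count as a one-letter difference
    else:
        d = edit_distance(a, b)
    return max(0.0, 100.0 - penalty_per_unit * d * d)


def record_score(rec_a: dict, rec_b: dict) -> float:
    """Sum the three field scores; exact matches everywhere give 300."""
    return sum(field_score(rec_a[f], rec_b[f]) for f in FIELDS)
```

With these assumed settings, two records that agree exactly on name, address, and phone score 300, while a "Bill"/"William" name mismatch costs only the one-letter penalty instead of the penalty for distance 5.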

However, it is not feasible to score all one trillion pairs of records. Thus, a simple LSH

was used to focus on likely candidates. Three “hash functions” were used. The first sent

records to the same bucket only if they had identical names; the second did the same but

for identical addresses, and the third did the same for phone numbers. In practice, there was

no hashing; rather the records were sorted by name, so records with identical names would

appear consecutively and get scored for overall similarity of the name, address, and phone.

Then the records were sorted by address, and those with the same address were scored. Finally,

the records were sorted a third time by phone, and records with identical phones were

scored.
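
The sort-based candidate generation described above might look like the sketch below. It assumes each record is a dictionary with `name`, `address`, and `phone` keys and that the two sources are given as separate lists; the function name `candidate_pairs` and this record layout are illustrative choices, not details from the text.

```python
from itertools import groupby

FIELDS = ("name", "address", "phone")


def candidate_pairs(records_a, records_b):
    """Return index pairs (i, j) such that records_a[i] and records_b[j]
    agree exactly on at least one of the three fields."""
    # Tag every record with its source and its position in that source.
    tagged = [("A", i, r) for i, r in enumerate(records_a)]
    tagged += [("B", j, r) for j, r in enumerate(records_b)]

    pairs = set()
    for field in FIELDS:
        # One sort per field: identical values end up consecutive.
        ordered = sorted(tagged, key=lambda t: t[2][field])
        for _, run in groupby(ordered, key=lambda t: t[2][field]):
            run = list(run)
            a_ids = [i for src, i, _ in run if src == "A"]
            b_ids = [j for src, j, _ in run if src == "B"]
            # Only cross-source pairs are candidates for scoring.
            pairs.update((i, j) for i in a_ids for j in b_ids)
    return pairs
```

Each pair returned would then be handed to the full name, address, and phone scoring. Collecting the pairs in a set means a pair that agrees on more than one field is scored only once, which is a design choice of this sketch.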

This approach missed any record pair that truly represented the same person but in which

none of the three fields matched exactly. Since the goal was to prove in a court of law that

the persons were the same, it is unlikely that such a pair would have been accepted by a

judge as sufficiently similar anyway.

When Are Record Matches Good Enough?

While every case will be different, it may be of interest to know how the experiment of Section 3.8.3 turned out on

the data of Section 3.8.2. For scores down to 185, the value of x was very close to 10; i.e., these scores indicated that

the likelihood of the records representing the same person was essentially 1. Note that a score of 185 in this example

represents a situation where one field is the same (as would have to be the case, or the records would never even be

scored), one field is completely different, and the third field has a small discrepancy. Moreover, for scores as low as

115, the value of x was noticeably less than 45, meaning that some of these pairs did represent the same person. Note

that a score of 115 represents a case where one field is the same, but there is only a slight similarity in the other two

fields.
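
One way to read the two reference values above is as calibration points: x near 10 means essentially every pair at that score is a true match, and x near 45 means essentially none is. Under the assumption that x is an average that mixes linearly between those two extremes, the fraction of true matches at a given score can be estimated as in this sketch. The function name and the clamping to the range 0 to 1 are illustrative choices.

```python
def fraction_true_matches(x: float,
                          true_mean: float = 10.0,
                          random_mean: float = 45.0) -> float:
    """Estimate the fraction f of true matches among pairs with a given score.

    Assumes x = f * true_mean + (1 - f) * random_mean, so
    f = (random_mean - x) / (random_mean - true_mean).
    The defaults 10 and 45 are the two reference values quoted in the text.
    """
    f = (random_mean - x) / (random_mean - true_mean)
    # Clamp, since a measured x can fall slightly outside the two extremes.
    return min(1.0, max(0.0, f))
```

For example, a hypothetical measurement of x = 38 at some score would suggest that about a fifth of the pairs with that score are true matches.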
