(2) Matching Fingerprints: It is possible to represent fingerprints as sets. However, we
shall explore a different family of locality-sensitive hash functions from the one we
get by minhashing.
(3) Matching Newspaper Articles: Here, we consider a different notion of shingling that
focuses attention on the core article in an on-line newspaper's Web page, ignoring all
the extraneous material such as ads and newspaper-specific material.
3.8.1 Entity Resolution
It is common to have several data sets available, and to know that they refer to some of
the same entities. For example, several different bibliographic sources provide information
about many of the same topics or papers. In the general case, we have records describing
entities of some type, such as people or topics. The records may all have the same format,
or they may have different formats, with different kinds of information.
There are many reasons why information about an entity may vary, even if the field in
question is supposed to be the same. For example, names may be expressed differently in
different records because of misspellings, absence of a middle initial, use of a nickname,
and many other reasons. Thus, “Bob S. Jomes” and “Robert Jones Jr.” may or may
not be the same person. If records come from different sources, the fields may differ as
well. One source's records may have an “age” field, while another does not. The second
source might have a “date of birth” field, or it may have no information at all about when a
person was born.
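To make the idea of "nearly the same name" concrete, here is a minimal sketch that scores two name strings by normalized edit distance. It is an illustration only, not a method stated in the text above: the helper names (normalize, edit_distance, name_similarity) and the choice of Levenshtein distance are assumptions made for this example.

```python
def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    kept = (ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join("".join(kept).split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(name_similarity("Bob S. Jomes", "Robert Jones Jr."))
```

A score like this only measures spelling closeness. It cannot tell that "Bob" is a nickname for "Robert", which is one reason a decision about whether two records match usually draws on more than one field.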
3.8.2 An Entity-Resolution Example
We shall examine a real example of how LSH was used to deal with an entity-resolution
problem. Company A was engaged by Company B to solicit customers for B. Company B
would pay A a yearly fee, as long as the customer maintained their subscription. The two
companies later quarreled over how many customers A had provided to B. Each had about
1,000,000 records, some of which described the same people; those were the customers A
had provided to B. The records had different data fields, but unfortunately none of those
fields was “this is a customer that A had provided to B.” Thus, the problem was to match
records from the two sets to see if a pair represented the same person.
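With about 1,000,000 records on each side, comparing every record of A against every record of B would take on the order of 10^12 comparisons. The sketch below shows one common way to cut that down: group records into buckets by some key and compare only records that share a bucket. It is a hedged illustration, not a description of what was actually done in this case; the function candidate_pairs, the field names "name" and "phone", and the sample records are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Set, Tuple

Record = Dict[str, str]
KeyFunc = Callable[[Record], str]


def candidate_pairs(records_a: Sequence[Record],
                    records_b: Sequence[Record],
                    keys: Sequence[KeyFunc]) -> Set[Tuple[int, int]]:
    """Return index pairs (i, j) whose records agree on at least one key."""
    pairs: Set[Tuple[int, int]] = set()
    for key in keys:
        # Bucket the records of A by this key, skipping empty values.
        buckets: Dict[str, List[int]] = defaultdict(list)
        for i, rec in enumerate(records_a):
            k = key(rec)
            if k:
                buckets[k].append(i)
        # Each record of B is paired only with A-records in its bucket.
        for j, rec in enumerate(records_b):
            for i in buckets.get(key(rec), ()):
                pairs.add((i, j))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample data, invented for this example.
    a = [{"name": "Robert Jones", "phone": "555-0100"},
         {"name": "Ann Lee", "phone": "555-0199"}]
    b = [{"name": "Bob Jones", "phone": "555-0100"},
         {"name": "Ann Lee", "phone": "555-0000"},
         {"name": "Carl Roe", "phone": ""}]
    keys = [lambda r: r.get("name", "").lower(),
            lambda r: r.get("phone", "")]
    print(sorted(candidate_pairs(a, b, keys)))
```

Using several keys means a pair is kept if it agrees on any one of them, so a misspelled name can still be caught through a matching phone number. The surviving pairs are only candidates and would still need a closer comparison before being declared the same person.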
Each record had fields for the name, address, and phone number of the person. However,
the values in these fields could differ for many reasons. Not only were there the misspellings
and other naming differences mentioned in Section 3.8.1, but there were other