Information Technology Reference
In-Depth Information
deduplicating a single data set the maximum number of comparisons is | A ( | A |− 1 2 .
For any commercial scale data set, the maximum number of comparisons rises
rapidly. For example, if one were to link two data sets of 10,000 records each, the
total number of comparisons is 100,000,000. The record comparison operation is the
most expensive operation of the entire entity resolution process, therefore reducing
the number of comparisons will improve the scaleability of the process.
Typically, the number of true matches from the total number of possible matches
is usually only a very small fraction of the cross product of the two data sets. So if
the two 10,000 record data sets were to overlap completely one to one, the number
of matches is still only 10% of the total number of matches, and depending on
the data sets this can be much less. A heuristic process called blocking [8] can be
applied to reduce the number of comparisons, this partitions the dataset into different
blocks that are likely to contain duplicate records. The records within these blocks
are compared together, however records in one block are not compared with others
from a different block. Figure 6 shows how segmenting a 1000 record dataset into
5 blocks of 200 records each, reduces the number of number of comparisons from
one million to 200,000. Blocking methods have also been applied in many domains,
in for example, the detection of repeated lines of program code in a large software
system, so called “Clone Detection” [7].
Fig. 6 Blocking of 1000 records (adapted from [53])
In order to partition the data a key is built from one or more fields of the data set
in use. A typical example of a key, in data containing addresses, is the postcode. In
this case only records belonging to the same post code are compared. As a result, the
number of blocks is equal to the number of unique postcodes. A key of the length
of a line of code is used in [7] to block a million line code base.
The choice of the key used for blocking is the most important decision in the
blocking process [32]. The key must be carefully selected to provide a good re-
duction in the number of comparisons, while at the same time ensuring the process
 
Search WWH ::




Custom Search