Database Reference
In-Depth Information
3.5.3.1 How Is Our Study Related?
There are three differences between our study and [116, 118, 117]. First, the uncer-
tain data models adopted in the above work are different from the uncertain stream
model discussed in this topic, due to different application requirements. Second,
the monitored queries are different: [116, 118, 117] deal with general SQL queries,
Min/Max queries, and probabilistic queries, respectively, but our study focuses on
top- k queries on uncertain streams specifically. Last, while the above work only pro-
vides approximate answers, our study can provide a spectrum of methods including
an exact algorithm, a random method and their space efficient versions.
3.6 Probabilistic Linkage Queries
The problem of ranking queries on probabilistic linkages that will be discussed in
Chapter 7 is mainly related to the existing work on record linkages and probabilistic
graphical models.
3.6.1 Record Linkage
Computing record linkages has been studied extensively. Please refer to [37] as a
nice tutorial. Generally, linkage methods can be partitioned into two categories. The
deterministic record linkage methods [119] link two records if their values in certain
matching attributes such as “name”, “address” and “social insurance number” are
exactly identical. The deterministic record linkage methods are not very effective in
real-life applications due to data incompleteness and inconsistency.
Probabilistic record linkage methods [38] estimate the likelihood of two records
being a match based on some similarity measures in the matching attributes. The
similarity measures used in probabilistic record linkage methods fall into three
classes [37].
The first class is based on the Fellegi-Sunter theory [120]. The general idea is
to model the values of the records in matching attributes as comparison vectors,
and estimate the probability of two records being matched or unmatched given
their comparison vectors [121, 122, 123].
The second class is “edit based” measures such as the Levenshtein distance [124]
and the edit distance [125].
The third class is “term based” measures, where terms can be defined as words
in matching attributes or Q-grams [126]. Such similarity measures include the
fuzzy matching similarity [127] and the hybrids similarity measure developed
in [128].
Search WWH ::




Custom Search