Related Work - Ranking Queries on Uncertain Data

Database Reference

In-Depth Information

3.5.3.1 How Is Our Study Related?

There are three differences between our study and [116, 118, 117]. First, the uncer-

tain data models adopted in the above work are different from the uncertain stream

model discussed in this topic, due to different application requirements. Second,

the monitored queries are different: [116, 118, 117] deal with general SQL queries,

Min/Max queries, and probabilistic queries, respectively, but our study focuses on

top- k queries on uncertain streams specifically. Last, while the above work only pro-

vides approximate answers, our study can provide a spectrum of methods including

an exact algorithm, a random method and their space efficient versions.

3.6 Probabilistic Linkage Queries

The problem of ranking queries on probabilistic linkages that will be discussed in

Chapter 7 is mainly related to the existing work on record linkages and probabilistic

graphical models.

3.6.1 Record Linkage

Computing record linkages has been studied extensively. Please refer to [37] as a

nice tutorial. Generally, linkage methods can be partitioned into two categories. The

deterministic record linkage methods [119] link two records if their values in certain

matching attributes such as “name”, “address” and “social insurance number” are

exactly identical. The deterministic record linkage methods are not very effective in

real-life applications due to data incompleteness and inconsistency.

Probabilistic record linkage methods [38] estimate the likelihood of two records

being a match based on some similarity measures in the matching attributes. The

similarity measures used in probabilistic record linkage methods fall into three

classes [37].

•

The first class is based on the Fellegi-Sunter theory [120]. The general idea is

to model the values of the records in matching attributes as comparison vectors,

and estimate the probability of two records being matched or unmatched given

their comparison vectors [121, 122, 123].

•

The second class is “edit based” measures such as the Levenshtein distance [124]

and the edit distance [125].

•

The third class is “term based” measures, where terms can be defined as words

in matching attributes or Q-grams [126]. Such similarity measures include the

fuzzy matching similarity [127] and the hybrids similarity measure developed

in [128].

Search WWH ::

Custom Search

Home